15: Missing Data

Content for Wednesday, May 20, 2026

Before class

📖 Reading:

Important: Assignment 6 is due today

Assignment 6: Data Types & Wrangling — due Sunday, May 17 at 11:59 PM.

Important: Final Project Draft is due today

Submit your working code and preliminary visualizations before class.

During class

We’ll cover:

  • Why missing data matters in psychology — attrition, non-response, skip patterns
  • Explicit vs. implicit missing values
  • Counting and exploring missingness patterns
  • drop_na() — complete case analysis (listwise deletion)
  • replace_na() — filling in known values
  • fill() — carrying values forward/backward
  • complete() — making implicit missing values explicit
  • When not to fill — don’t make up data!
  • Brief mention: multiple imputation exists (beyond this course)
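
The difference between explicit and implicit missing values is easiest to see on a small example. The sketch below uses a hypothetical toy data frame, scores (its column names and values are made up for illustration): participant 1 has an explicit NA at wave 2, while participant 2 has no wave 2 row at all. It assumes condition is constant within a participant, which is the kind of known value that is safe to carry forward with fill().

```r
library(dplyr)
library(tidyr)

# Hypothetical longitudinal data (illustration only)
scores <- tibble(
  id        = c(1, 1, 2),
  wave      = c(1, 2, 1),
  condition = c("control", NA, "treatment"),
  score     = c(10, NA, 7)
)

# complete() adds the missing id/wave combination (id 2, wave 2) as a row of NAs
scores_full <- scores |>
  complete(id, wave)

# fill() carries condition down within each participant;
# score is left alone because we do not know what it would have been
scores_filled <- scores_full |>
  group_by(id) |>
  fill(condition, .direction = "down") |>
  ungroup()

scores_filled
```

After complete(), the data has four rows instead of three, and the two missing scores stay NA rather than being invented.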

After class

Practice:

  1. Count the number of NAs in each column of a dataset using summarize(across(everything(), ~sum(is.na(.x))))
  2. Use drop_na() on a specific column vs. the whole dataset — what’s the difference in rows lost?
  3. Try replace_na() to fill missing values with a sensible default (e.g., 0 for “no response”)
  4. Use complete() to make implicit missing values explicit in a longitudinal dataset
  5. Think critically: when is it okay to drop missing data? When is it dangerous?
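
As a warm-up for items 2 and 3, here is a minimal sketch on a hypothetical toy data frame, survey (all names and values are invented for illustration). It assumes that a missing n_prior_sessions genuinely means "none", which is the only situation in which replacing NA with 0 is defensible.

```r
library(dplyr)
library(tidyr)

# Hypothetical survey data (illustration only)
survey <- tibble(
  id               = 1:5,
  age              = c(21, NA, 34, 29, 40),
  anxiety          = c(3, 4, NA, 2, 5),
  n_prior_sessions = c(NA, 2, 0, 1, 1)
)

# Listwise deletion: keep only rows with no NA in any column
complete_cases <- survey |> drop_na()

# Drop rows only where anxiety is missing
anxiety_observed <- survey |> drop_na(anxiety)

nrow(complete_cases)    # 2
nrow(anxiety_observed)  # 4

# Assumption: NA in n_prior_sessions means "no prior sessions"
survey_filled <- survey |>
  mutate(n_prior_sessions = replace_na(n_prior_sessions, 0))
```

Dropping on the whole dataset loses three of five rows here, while dropping on anxiety alone loses one. Try the same comparison on your own data before deciding which to use.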

Note: Missing data is information

Missing values aren’t just annoying — they can tell you something. Before dropping or filling NAs, ask:

  • Why is this value missing? (Didn’t answer? Wasn’t asked? Data entry error?)
  • Is the missingness random? If participants who dropped out were systematically different, dropping them biases your results
  • How much is missing? A few values vs. 40% of a column require different strategies

Document your decisions about missing data — your future self (and reviewers) will thank you.
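
One rough way to probe the "is the missingness random?" question is to compare people with and without a missing value on something you did observe. The sketch below uses a hypothetical data frame, followup, with invented values. It is a descriptive check only, not a formal test of the missingness mechanism.

```r
library(dplyr)

# Hypothetical two-wave study (illustration only)
followup <- tibble(
  condition      = c("control", "control", "control",
                     "treatment", "treatment", "treatment"),
  baseline_score = c(12, 15, 11, 20, 22, 13),
  followup_score = c(10, 14, 9, NA, NA, 12)
)

# Flag who is missing at follow-up, then compare baseline scores
attrition_check <- followup |>
  mutate(dropped_out = is.na(followup_score)) |>
  group_by(dropped_out) |>
  summarize(
    n             = n(),
    mean_baseline = mean(baseline_score)
  )

attrition_check
```

In this toy data the two participants who dropped out had much higher baseline scores (mean 21) than the four who stayed (mean 12.75), so dropping them would change what the remaining sample looks like.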

Counting missing values

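# Load the tidyverse (dplyr for summarize/across, tidyr for pivot_longer)
library(tidyverse)

# Quick summary of missingness across all columns.
# `data` stands in for your own data frame.
data |>
  # Count the NAs in every column (one row, one column per variable)
  summarize(
    across(everything(), ~ sum(is.na(.x)))
  ) |>
  # Reshape to one row per variable
  pivot_longer(everything(),
    names_to = "variable",
    values_to = "n_missing"
  ) |>
  # Show the variables with the most missing values first
  arrange(desc(n_missing))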
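
Raw counts are hard to compare across datasets of different sizes, so it can help to look at proportions as well. This is a small variation on the same idea, run on a hypothetical toy data frame (toy, with invented values): mean(is.na(.x)) gives the share of missing values in a column.

```r
library(dplyr)
library(tidyr)

# Hypothetical data (illustration only)
toy <- tibble(
  age     = c(21, NA, 34, 29, 40),
  anxiety = c(3, NA, NA, 2, 5)
)

# Proportion of missing values per column, largest first
prop_missing <- toy |>
  summarize(across(everything(), ~ mean(is.na(.x)))) |>
  pivot_longer(everything(),
    names_to  = "variable",
    values_to = "prop_missing"
  ) |>
  arrange(desc(prop_missing))

prop_missing
```

Here anxiety is missing in 40% of rows and age in 20%.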