10: EDA — Variation

Content for Wednesday, April 29, 2026

Before class

📖 Reading:

R4DS Ch 10: Exploratory data analysis (sections 10.1–10.4)

Final Project Proposal is due today

Submit your dataset selection, research questions, and initial plan before class.

During class

We’ll cover:

What is EDA? Exploratory vs. confirmatory thinking
The EDA mindset — curiosity, not confirmation
Exploring distributions: histograms, density plots, bar charts
Choosing binwidth and when it matters
Spotting outliers visually and programmatically
What to do with outliers (investigate, don’t delete)
A first look at missing data
Putting it all together: a systematic EDA workflow
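
To make the binwidth point concrete, one option is to draw the same variable twice with different binwidths. This is a minimal sketch, assuming the bfi data from the psych package used elsewhere on this page. The binwidths of 1 and 10 years and the names p_narrow and p_wide are arbitrary choices for illustration, not recommended defaults.

```r
library(tidyverse)
library(psych)

data(bfi)
bfi <- as_tibble(bfi)

# Narrow bins (1 year each): shows fine detail, but can look noisy
p_narrow <- ggplot(bfi, aes(x = age)) +
  geom_histogram(binwidth = 1)

# Wide bins (10 years each): smoother, but can hide spikes and gaps
p_wide <- ggplot(bfi, aes(x = age)) +
  geom_histogram(binwidth = 10)

# Print both and compare what each one reveals or hides
p_narrow
p_wide
```

If the two plots tell the same story, the binwidth probably doesn't matter much for that variable. If they disagree, it is worth trying a few values in between.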

After class

✅ Practice:

Using the bfi dataset from the psych package (loaded in class today):

Explore the distribution of age. What shape is it? Is it skewed?
Check for outliers in age using both a histogram and a boxplot. Are any values suspicious?
Pick 3 personality items (e.g., E1, N2, A4) and plot their distributions. What do you notice?
Count how many missing values each variable has. Which variables have the most?
Write down 2–3 questions that your exploration makes you want to answer.
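
As a possible starting point for the item plots and the missing-value count, reshaping to long format with pivot_longer() handles both. This is a sketch rather than a model answer. The object name missing_counts and the column names variable, n_missing, item, and response are made up for illustration, and E1, N2, and A4 are just the example items suggested above.

```r
library(tidyverse)
library(psych)

data(bfi)
bfi <- as_tibble(bfi)

# Count NAs in every column, then reshape to one row per variable
missing_counts <- bfi |>
  summarize(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(
    cols = everything(),
    names_to = "variable",
    values_to = "n_missing"
  ) |>
  arrange(desc(n_missing))

missing_counts

# Plot three items side by side; responses are discrete ratings,
# so a bar chart per item is used instead of a histogram
bfi |>
  select(E1, N2, A4) |>
  pivot_longer(
    cols = everything(),
    names_to = "item",
    values_to = "response",
    values_drop_na = TRUE   # drop missing responses before plotting
  ) |>
  ggplot(aes(x = factor(response))) +
  geom_bar() +
  facet_wrap(~ item)
```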

The EDA loop

EDA isn’t a linear checklist — it’s a cycle:

Ask a question
Visualize the data to answer it
Notice something new
Ask another question
Repeat

There’s no “finished.” The goal is understanding, not a final answer.

Loading the dataset

library(tidyverse)      # Provides as_tibble() and the dplyr verbs used below
library(psych)
data(bfi)
bfi <- as_tibble(bfi)   # Convert to tibble for tidyverse compatibility

Outlier investigation template

# Quick outlier check for any numeric variable
# Change variable_name to reuse both steps on another column
variable_name <- "age"

bfi |>
  summarize(
    min = min(.data[[variable_name]], na.rm = TRUE),
    max = max(.data[[variable_name]], na.rm = TRUE),
    mean = mean(.data[[variable_name]], na.rm = TRUE),
    sd = sd(.data[[variable_name]], na.rm = TRUE)
  )

# Flag values beyond 3 SD from the mean (same variable as above)
bfi |>
  filter(
    abs(.data[[variable_name]] - mean(.data[[variable_name]], na.rm = TRUE)) >
      3 * sd(.data[[variable_name]], na.rm = TRUE)
  )
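
The 3 SD rule depends on the mean and SD, which extreme values can themselves pull around. A boxplot flags points using quartiles instead: by default, geom_boxplot() draws as individual points anything more than 1.5 × IQR beyond the box. The sketch below applies roughly the same rule in code, assuming bfi is loaded as above. The names q1, q3, lower_fence, upper_fence, and age_flagged are illustrative, and quantile() may compute quartiles slightly differently from the boxplot's hinges.

```r
library(tidyverse)
library(psych)

data(bfi)
bfi <- as_tibble(bfi)

# Quartiles and interquartile range of age
q1 <- quantile(bfi$age, 0.25, na.rm = TRUE)
q3 <- quantile(bfi$age, 0.75, na.rm = TRUE)
iqr <- q3 - q1

# Fences: values outside these are what a boxplot would draw as points
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Rows to investigate (not to delete automatically)
age_flagged <- bfi |>
  filter(age < lower_fence | age > upper_fence) |>
  select(age, gender, education) |>
  arrange(age)

age_flagged
```

Comparing the rows flagged by each rule is a useful check: values caught by both deserve a closer look first.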