10: EDA — Variation
Content for Wednesday, April 29, 2026
Before class
📖 Reading:
- R4DS Ch 10: Exploratory data analysis (sections 10.1–10.4)
Warning: Final Project Proposal is due today
Submit your dataset selection, research questions, and initial plan before class.
During class
We’ll cover:
- What is EDA? Exploratory vs. confirmatory thinking
- The EDA mindset — curiosity, not confirmation
- Exploring distributions: histograms, density plots, bar charts
- Choosing binwidth and when it matters
- Spotting outliers visually and programmatically
- What to do with outliers (investigate, don’t delete)
- A first look at missing data
- Putting it all together: a systematic EDA workflow
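As a rough illustration of why binwidth matters, the sketch below draws the same variable at three binwidths. It assumes the bfi data from the psych package used elsewhere on this page. The binwidth values (1, 5, and 15 years) are arbitrary choices for comparison, not recommendations.

```r
library(tidyverse)
library(psych)

data(bfi)
bfi <- as_tibble(bfi)

# Illustrative binwidths only: narrow, moderate, and wide
binwidths <- c(1, 5, 15)

# Build one histogram of age per binwidth
plots <- map(binwidths, \(bw) {
  ggplot(bfi, aes(x = age)) +
    geom_histogram(binwidth = bw, na.rm = TRUE) +
    labs(title = paste("binwidth =", bw), x = "Age (years)", y = "Count")
})

# Print each plot in turn and compare what the shape seems to be
plots[[1]]
plots[[2]]
plots[[3]]
```

A narrow binwidth tends to show noise and a wide one tends to hide features, so it is worth trying several before describing the shape of a distribution.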
After class
✅ Practice:
Using the bfi dataset from the psych package (loaded in class today):
- Explore the distribution of age. What shape is it? Is it skewed?
- Check for outliers in age using both a histogram and a boxplot. Are any values suspicious?
- Pick 3 personality items (e.g., E1, N2, A4) and plot their distributions. What do you notice?
- Count how many missing values each variable has. Which variables have the most?
- Write down 2–3 questions that your exploration makes you want to answer
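One possible starting point for the missing-values item is sketched below. It assumes the tidyverse is installed and that bfi loads from psych as in class. The names missing_counts and n_missing are made up for this example.

```r
library(tidyverse)
library(psych)

data(bfi)
bfi <- as_tibble(bfi)

# Count NA values in every column, then reshape to one row per variable
missing_counts <- bfi |>
  summarize(across(everything(), \(x) sum(is.na(x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  arrange(desc(n_missing))

# Variables with the most missing values appear first
missing_counts
```

This is only one way to do it. A base R alternative is colSums(is.na(bfi)), which returns the same counts as a named vector.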
Note: The EDA loop
EDA isn’t a linear checklist — it’s a cycle:
- Ask a question
- Visualize the data to answer it
- Notice something new
- Ask another question
- Repeat
There’s no “finished.” The goal is understanding, not a final answer.
Loading the dataset
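library(tidyverse) # Provides as_tibble(), summarize(), filter(), and ggplot2
library(psych)
# If data(bfi) fails, your psych version may keep the dataset in the
# psychTools package instead: data(bfi, package = "psychTools")
data(bfi)
bfi <- as_tibble(bfi) # Convert to tibble for tidyverse compatibility
Outlier investigation template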
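# Quick outlier check for any numeric variable
# Change variable_name to reuse both steps on another column
variable_name <- "age"
# .data[[variable_name]] looks up the column named by the string
bfi |>
  summarize(
    min = min(.data[[variable_name]], na.rm = TRUE),
    max = max(.data[[variable_name]], na.rm = TRUE),
    mean = mean(.data[[variable_name]], na.rm = TRUE),
    sd = sd(.data[[variable_name]], na.rm = TRUE)
  )
# Flag values beyond 3 SD from the mean (rows to investigate, not delete)
bfi |>
  filter(
    abs(.data[[variable_name]] - mean(.data[[variable_name]], na.rm = TRUE)) >
      3 * sd(.data[[variable_name]], na.rm = TRUE)
  )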
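The 3 SD rule above is one convention. A boxplot flags points with a different one: by default, geom_boxplot() draws points beyond 1.5 times the interquartile range from the box as outliers. The sketch below applies roughly that rule programmatically, so the two methods can be compared. The names fences and flagged are made up for this example, and quantile() may compute the quartiles slightly differently from the boxplot's hinges.

```r
library(tidyverse)
library(psych)

data(bfi)
bfi <- as_tibble(bfi)

variable_name <- "age"

# Compute the quartiles and the 1.5 * IQR fences
fences <- bfi |>
  summarize(
    q1 = quantile(.data[[variable_name]], 0.25, na.rm = TRUE),
    q3 = quantile(.data[[variable_name]], 0.75, na.rm = TRUE)
  ) |>
  mutate(
    iqr = q3 - q1,
    lower = q1 - 1.5 * iqr,
    upper = q3 + 1.5 * iqr
  )

# Keep only rows that fall outside the fences (rows with NA are dropped)
flagged <- bfi |>
  filter(
    .data[[variable_name]] < fences$lower |
      .data[[variable_name]] > fences$upper
  )

fences
flagged
```

The two rules will not always flag the same rows, especially for skewed variables. Either way, a flagged value is a prompt to investigate, not a reason to delete.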