11: EDA — Covariation
Content for Monday, May 4, 2026
Before class
📖 Reading:
- R4DS Ch 10: Exploratory data analysis (sections 10.5–10.6)
Important: Assignment 4 is due today
Assignment 4: Visualization Deep Dive — due Sunday, May 3 at 11:59 PM.
During class
We’ll cover:
- Exploring covariation — relationships between variables
- Categorical + continuous: boxplots, violin plots, geom_freqpoly() with color
- Categorical + categorical: geom_count(), geom_tile() with counts
- Continuous + continuous: scatterplots and trend lines
- Dealing with overplotting: alpha, jitter, binning
- Patterns, models, and the correlation vs. causation reminder
- Hands-on: Explore a real psychology dataset
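The three pairings listed above can be sketched in a few lines of ggplot2. This is a minimal illustration, not the in-class example: it assumes the ggplot2 package is installed and uses its built-in mpg dataset as a stand-in for the class data. The choice of variables (class, drv, hwy, displ) is purely for demonstration.

```r
library(ggplot2)

# Categorical + continuous: boxplots of highway mpg for each vehicle class
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Categorical + continuous: overlaid frequency polygons, one color per drive type
p_freq <- ggplot(mpg, aes(x = hwy, color = drv)) +
  geom_freqpoly(binwidth = 2)

# Categorical + categorical: point size shows how many cars share each combination
p_count <- ggplot(mpg, aes(x = class, y = drv)) +
  geom_count()

# Categorical + categorical: tabulate the counts first, then map them to tile fill
class_drv <- as.data.frame(table(class = mpg$class, drv = mpg$drv))
p_tile <- ggplot(class_drv, aes(x = class, y = drv, fill = Freq)) +
  geom_tile()

# Continuous + continuous: scatterplot with a straight-line trend
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
```

Printing any of these objects (for example, typing p_box at the console) draws the plot. Swapping geom_boxplot() for geom_violin() gives the violin-plot variant of the first example.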
Tip: Assignment 5 is assigned today
Assignment 5: Exploratory Data Analysis — due Sunday, May 10 at 11:59 PM.
After class
✅ Practice:
Using the bfi dataset from the psych package:
- Pick two personality items (e.g., E1 and A1). Make a scatterplot. Is there a relationship?
- Create boxplots of age grouped by education. What patterns do you see?
- Use geom_count() to visualize the relationship between education and gender
- Try adding alpha = 0.3 to a crowded scatterplot. Does it help?
- Calculate the correlation between two continuous variables using cor(x, y, use = "complete.obs")
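The last practice item depends on the use = "complete.obs" argument, which matters whenever a dataset has missing responses. Here is a small base R sketch of what it does. The vectors x and y are made up for illustration and are not taken from bfi.

```r
# Two short made-up vectors, each with one missing value
x <- c(1, 2, 3, 4, NA, 6)
y <- c(2, 4, 6, NA, 10, 12)

# By default, a single missing value makes the whole result NA
r_default <- cor(x, y)

# use = "complete.obs" drops every pair where either value is missing
# and computes the correlation from the remaining pairs
r_complete <- cor(x, y, use = "complete.obs")

# Number of pairs that were actually used
n_used <- sum(complete.cases(x, y))
```

With the bfi data loaded (for example via data(bfi, package = "psych"), assuming the psych package is installed), the same call takes two columns, such as bfi$E1 and bfi$A1, in place of x and y.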
Note: Correlation vs. causation
Finding a relationship between two variables does not mean one causes the other. Always ask:
- Could there be a confounding variable?
- Does the direction of causation make sense?
- Is this a meaningful effect, or a tiny one that reaches statistical significance only because the sample is large?
In psychology, this distinction is critical — observational data can suggest hypotheses but rarely prove causation.
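A short simulation can make two of these questions concrete. This is a sketch in base R using simulated data only. The variable names, sample sizes, and effect sizes are arbitrary assumptions chosen for illustration.

```r
set.seed(42)

# Confounding: z drives both x and y, and x has no direct effect on y
n <- 5000
z <- rnorm(n)
x <- z + rnorm(n)
y <- z + rnorm(n)

# The raw correlation is clearly positive even though x does not cause y
r_raw <- cor(x, y)

# Remove the part of x and of y that z explains, then correlate what is left;
# the result should be close to zero
r_adjusted <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Large sample: a very weak relationship can still give a very small p-value
n_big <- 100000
a <- rnorm(n_big)
b <- 0.02 * a + rnorm(n_big)
big_test <- cor.test(a, b)

big_test$estimate  # the correlation itself is tiny
big_test$p.value   # yet the test flags it as statistically significant
```

In real observational data the confounder is often unmeasured, so the adjustment step shown here is not always available. That is part of why such data rarely settle causal questions.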
Overplotting solutions
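# Too many points? Try these (replace data, x, and y with your own data frame and columns):
library(ggplot2)
# 1. Transparency: overlapping points appear darker
ggplot(data, aes(x, y)) + geom_point(alpha = 0.3)
# 2. Jitter: adds a small amount of random noise to each point's position
ggplot(data, aes(x, y)) + geom_jitter(width = 0.2, height = 0.2)
# 3. 2D binning (heatmap-style): fill color shows the count in each rectangular bin
ggplot(data, aes(x, y)) + geom_bin2d()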
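To see why these options matter for survey-style data, here is a runnable sketch with simulated responses on a 1 to 6 scale. The data frame d and the columns item_a and item_b are hypothetical stand-ins for two questionnaire items. The code assumes ggplot2 is installed.

```r
library(ggplot2)

set.seed(1)

# Simulated responses on a 1-6 scale (made up, standing in for two survey items)
n <- 2000
item_a <- sample(1:6, n, replace = TRUE)
item_b <- pmin(6, pmax(1, item_a + sample(-2:2, n, replace = TRUE)))
d <- data.frame(item_a, item_b)

# All 2000 rows fall on at most 36 distinct grid positions,
# so a plain scatterplot hides how many responses sit at each point
n_positions <- nrow(unique(d))

p_plain  <- ggplot(d, aes(item_a, item_b)) + geom_point()

# 1. Transparency: darker points mean more overlapping responses
p_alpha  <- ggplot(d, aes(item_a, item_b)) + geom_point(alpha = 0.05)

# 2. Jitter: spread the points around each grid position
p_jitter <- ggplot(d, aes(item_a, item_b)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.3)

# 3. 2D binning: fill color shows the count in each bin
p_bin    <- ggplot(d, aes(item_a, item_b)) + geom_bin2d()
```

With discrete responses like these, the alpha value usually needs to be much lower than 0.3 before differences in density become visible. Jitter widths should stay small enough that neighboring response values do not blur together.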