11: EDA — Covariation

Content for Monday, May 4, 2026

Before class

📖 Reading:

Important: Assignment 4 is due today

Assignment 4: Visualization Deep Dive — due Sunday, May 3 at 11:59 PM.

During class

We’ll cover:

  • Exploring covariation — relationships between variables
  • Categorical + continuous: boxplots, violin plots, geom_freqpoly() with color
  • Categorical + categorical: geom_count(), geom_tile() with counts
  • Continuous + continuous: scatterplots and trend lines
  • Dealing with overplotting: alpha, jitter, binning
  • Patterns, models, and the correlation vs. causation reminder
  • Hands-on: Explore a real psychology dataset
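
The three variable pairings listed above can be sketched as follows. This is a minimal illustration, not the in-class code: it assumes the `diamonds` and `mpg` datasets that ship with ggplot2, and the object names (`p_box`, `cut_color_counts`, and so on) are made up for this example.

```r
library(ggplot2)

# Categorical + continuous: distribution of price within each cut
p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()

# Same pairing as overlaid frequency polygons, one colored line per cut.
# after_stat(density) rescales each line so groups of different sizes
# can be compared on the same axis.
p_freq <- ggplot(diamonds, aes(x = price, y = after_stat(density), color = cut)) +
  geom_freqpoly(binwidth = 500)

# Categorical + categorical: point size shows how many rows fall in each combination
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

# Same pairing as a tile heatmap: count the rows per combination first,
# then map the count (column Freq) to fill
cut_color_counts <- as.data.frame(table(cut = diamonds$cut, color = diamonds$color))
p_tile <- ggplot(cut_color_counts, aes(x = cut, y = color, fill = Freq)) +
  geom_tile()

# Continuous + continuous: scatterplot with a linear trend line
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
```

Printing any of these objects (for example `p_box`) draws the plot. The trend line uses `method = "lm"` only to keep the example simple; the default smoother is also a reasonable choice for exploration.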

Tip: Assignment 5 is assigned today

Assignment 5: Exploratory Data Analysis — due Sunday, May 10 at 11:59 PM.

Slides

View slides in new tab | Download PDF

After class

Practice:

Using the bfi dataset from the psych package:

  1. Pick two personality items (e.g., E1 and A1). Make a scatterplot — is there a relationship?
  2. Create boxplots of age grouped by education. Because education is stored as a numeric code, convert it to a factor first. What patterns do you see?
  3. Use geom_count() to visualize the relationship between education and gender.
  4. Try adding alpha = 0.3 to a crowded scatterplot. Does it help?
  5. Calculate the correlation between two continuous variables using cor(x, y, use = "complete.obs").
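
One possible starting point for these exercises is sketched below. It assumes the psych package is installed and that your version still ships `bfi` (some versions keep it in the companion psychTools package instead). The helper columns `education_f` and `gender_f` and the plot object names are invented for this sketch.

```r
library(ggplot2)

# Load the dataset without attaching the whole psych package
data(bfi, package = "psych")

# education and gender are stored as integer codes; treat them as categories
bfi$education_f <- factor(bfi$education)
bfi$gender_f <- factor(bfi$gender)

# Education has missing values; drop those rows for the grouped plots
bfi_edu <- subset(bfi, !is.na(education))

# Exercises 1 and 4: item responses take only a few distinct values,
# so points stack on top of each other. Jitter plus transparency
# makes the density of each response combination visible.
p_items <- ggplot(bfi, aes(x = E1, y = A1)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.3)

# Exercise 2: one boxplot of age per education level
p_age <- ggplot(bfi_edu, aes(x = education_f, y = age)) +
  geom_boxplot()

# Exercise 3: point size shows the count for each education/gender pair
p_edu_gender <- ggplot(bfi_edu, aes(x = education_f, y = gender_f)) +
  geom_count()

# Exercise 5: Pearson correlation using only rows where both items are present
r_e1_a1 <- cor(bfi$E1, bfi$A1, use = "complete.obs")
```

Printing `p_items` may warn about removed rows; those are respondents with a missing answer on E1 or A1. Check `?bfi` for what each code means before interpreting the plots.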

Note: Correlation vs. causation

Finding a relationship between two variables does not mean one causes the other. Always ask:

  • Could there be a confounding variable?
  • Does the direction of causation make sense?
  • Is this a meaningful effect or just a large sample?

In psychology, this distinction is critical — observational data can suggest hypotheses but rarely prove causation.
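
Two of the questions above can be illustrated with simulated data. This is a sketch with made-up numbers, not an analysis of real data: the variables `z`, `x`, `y`, `a`, and `b` are hypothetical, and the effect sizes were chosen only to make the point visible.

```r
set.seed(1)

# Confounding: z drives both x and y; x has no direct effect on y
n <- 1000
z <- rnorm(n)        # hypothetical confounder
x <- z + rnorm(n)    # depends on z only
y <- z + rnorm(n)    # depends on z only

# The raw correlation is clearly positive even though x does not cause y
r_raw <- cor(x, y)

# Remove the part of x and y explained by z, then correlate what is left.
# With the confounder accounted for, the relationship largely disappears.
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Large sample, tiny effect: b depends only very weakly on a
n_big <- 100000
a <- rnorm(n_big)
b <- 0.02 * a + rnorm(n_big)

# cor.test() reports both the estimated correlation and its p-value.
# A very small p-value here does not make the effect meaningful.
test_big <- cor.test(a, b)
r_big <- unname(test_big$estimate)
p_big <- test_big$p.value
```

Comparing `r_raw` with `r_partial` shows how a third variable can create an association on its own. Comparing `r_big` with `p_big` shows why statistical significance in a large sample should be read alongside the size of the effect.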

Overplotting solutions

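# Too many points? Try one of these.
# `data`, `x`, and `y` are placeholders for your own data frame and columns.
library(ggplot2)

# 1. Transparency: overlapping points build up into darker regions
ggplot(data, aes(x, y)) + geom_point(alpha = 0.3)

# 2. Jitter: adds a small amount of random noise to each point's position
ggplot(data, aes(x, y)) + geom_jitter(width = 0.2, height = 0.2)

# 3. 2D binning: counts points in rectangular bins (heatmap-style)
ggplot(data, aes(x, y)) + geom_bin2d()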