4: Data Transformation II

Content for Wednesday, April 8, 2026

Before class

📖 Readings:

R4DS Ch 3: Data transformation (section 3.5)
R4DS Ch 4: Workflow: code style

Assignment 2 is assigned today

Assignment 2: Data Transformation — due Sunday, April 12 at 11:59 PM.

During class

We’ll cover:

group_by() — define groups for operations
summarize() — calculate summary statistics
Combining group_by() + summarize() for grouped statistics
count() — a quick shortcut for counting
Code style best practices
Building complete analysis pipelines
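
As a preview of the count() shortcut listed above, here is a minimal sketch. The `responses` tibble and its values are made up for illustration. It shows that count() gives the same counts as group_by() followed by summarize(n = n()).

```r
library(dplyr)

# Hypothetical toy data: one row per participant
responses <- tibble(
  condition = c("control", "control", "treatment", "treatment", "treatment")
)

# Long form: define the groups, then count the rows in each with n()
by_hand <- responses |>
  group_by(condition) |>
  summarize(n = n())

# Shortcut: count() does the grouping and the counting in one step
shortcut <- responses |>
  count(condition)
```

Both results have one row per condition and a column `n` holding the number of rows in that group.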

After class

✅ Practice:

Using the flights dataset:

Calculate the average departure delay by carrier
Find which carrier has the most flights
Calculate the percentage of flights that were delayed (arr_delay > 0) by month
Find the 5 destinations with the highest average arrival delay
Create a complete pipeline that filters, groups, summarizes, and arranges
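
If you are unsure what a "complete pipeline" looks like, the sketch below shows the general shape on a different dataset (the built-in `mtcars`), so the flights exercises stay unsolved. The column choices and the 100 hp cutoff are arbitrary illustrations. One trick worth noticing: the mean of a logical condition gives the proportion of rows where it is TRUE.

```r
library(dplyr)

pipeline_result <- mtcars |>
  filter(hp > 100) |>                     # keep cars with more than 100 horsepower
  group_by(cyl) |>                        # one group per cylinder count
  summarize(
    mean_mpg = mean(mpg, na.rm = TRUE),   # average fuel economy per group
    prop_manual = mean(am == 1),          # proportion of manual cars per group
    n = n()                               # number of cars per group
  ) |>
  arrange(desc(mean_mpg))                 # highest average mpg first
```

Multiply a proportion like `prop_manual` by 100 if you want a percentage.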

The power of group_by()

group_by() doesn’t change how your data looks — it changes how other verbs work on it. Think of it as setting up the “behind the scenes” grouping structure.

# Group rows by condition, then compute one mean per group
data |>
  group_by(condition) |>
  summarize(mean_score = mean(score, na.rm = TRUE))
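
To see that group_by() only adds grouping information, you can compare a tibble before and after grouping. This is a small sketch; `toy_data` and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for `data`
toy_data <- tibble(
  condition = c("a", "a", "b", "b"),
  score = c(1, 3, 5, 7)
)

grouped <- toy_data |> group_by(condition)

# Same rows and columns as before; only grouping metadata was added
dim(grouped)          # same dimensions as dim(toy_data)
group_vars(grouped)   # "condition"
n_groups(grouped)     # 2

# The grouping changes what later verbs do
overall <- toy_data |> summarize(mean_score = mean(score))     # one row for all data
per_group <- grouped |> summarize(mean_score = mean(score))    # one row per condition
```

Here `overall` has a single row, while `per_group` has one row for each value of `condition`.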

Psychology application

# Calculate descriptive statistics by condition
experiment_data |>
  group_by(condition) |>
  summarize(
    M = mean(score, na.rm = TRUE),
    SD = sd(score, na.rm = TRUE),
    n = n()
  )
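
To try this pattern without a real dataset, here is a runnable sketch with a made-up `experiment_data` tibble (the values, and the extra `n_valid` column, are assumptions added for illustration). It also shows a detail that is easy to miss: n() counts every row in the group, including rows where `score` is missing, so it can be larger than the number of scores that went into M and SD.

```r
library(dplyr)

# Hypothetical data: two conditions, one missing score in the control group
experiment_data <- tibble(
  condition = c("control", "control", "control",
                "treatment", "treatment", "treatment"),
  score = c(10, 12, NA, 15, 17, 19)
)

descriptives <- experiment_data |>
  group_by(condition) |>
  summarize(
    M = mean(score, na.rm = TRUE),
    SD = sd(score, na.rm = TRUE),
    n = n(),                        # all rows in the group, including the NA score
    n_valid = sum(!is.na(score))    # only rows with a non-missing score
  )
```

If you report n alongside M and SD, consider whether the count of non-missing scores is the number you actually want.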