4: Data Transformation II

Content for Wednesday, April 8, 2026

Before class

📖 Readings:

Tip: Assignment 2 is assigned today

Assignment 2: Data Transformation — due Sunday, April 12 at 11:59 PM.

During class

We’ll cover:

  • group_by() — define groups for operations
  • summarize() — calculate summary statistics
  • Combining group_by() + summarize() for grouped statistics
  • count() — a quick shortcut for counting
  • Code style best practices
  • Building complete analysis pipelines
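To make the count() shortcut concrete, here is a minimal sketch. The data frame toy_flights and its values are made up for illustration and are not the course dataset. It compares the long form, group_by() followed by summarize(n = n()), with the one-step count().

```r
library(dplyr)

# Hypothetical toy data standing in for a real dataset
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "UA", "DL"),
  dep_delay = c(5, NA, 20, -3, 10, 0)
)

# Long form: define groups, then count the rows in each group
by_hand <- toy_flights |>
  group_by(carrier) |>
  summarize(n = n())

# Shortcut: count() does the grouping and the counting in one step
shortcut <- toy_flights |>
  count(carrier)

# sort = TRUE puts the largest group first
most_flights <- toy_flights |>
  count(carrier, sort = TRUE)
```

Both forms return one row per carrier with the same counts, so count() is mostly a readability choice when the only summary you need is the number of rows.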

Slides

View slides in new tab | Download PDF


After class

Practice:

Using the flights dataset:

  1. Calculate the average departure delay by carrier
  2. Find which carrier has the most flights
  3. Calculate the percentage of flights that were delayed (arr_delay > 0) by month
  4. Find the top 5 destinations with the highest average arrival delay
  5. Create a complete pipeline that filters, groups, summarizes, and arranges
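
One pattern that may help with the percentage exercise: the mean of a logical vector is the proportion of TRUE values. The sketch below shows it on a small made-up data frame (toy_delays is hypothetical, not the flights dataset), inside a pipeline that filters, groups, summarizes, and arranges. Dropping missing arr_delay values first is one possible choice, not the only one.

```r
library(dplyr)

# Hypothetical toy data; the real exercise uses the flights dataset
toy_delays <- data.frame(
  month     = c(1, 1, 1, 1, 2, 2, 2, 2),
  arr_delay = c(10, -5, 0, 30, NA, 12, -2, -8)
)

delayed_by_month <- toy_delays |>
  filter(!is.na(arr_delay)) |>                  # drop rows with no recorded arrival delay
  group_by(month) |>                            # one group per month
  summarize(
    n_flights   = n(),                          # rows remaining in each month
    pct_delayed = 100 * mean(arr_delay > 0)     # share of TRUE values, as a percentage
  ) |>
  arrange(desc(pct_delayed))                    # highest percentage first
```

Because the filter runs before the grouping, n_flights counts only flights with a recorded arrival delay.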

Note: The power of group_by()

group_by() doesn’t change how your data looks. It changes how later verbs, such as summarize(), work on it: they run once per group instead of once over the whole table. Think of it as setting up the “behind the scenes” grouping structure.

# Returns one row per condition, holding that group's mean score
data |>
  group_by(condition) |>
  summarize(mean_score = mean(score))
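
To see that group_by() only adds grouping information, the following sketch compares a grouped and an ungrouped version of the same table. The data frame scores and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data
scores <- data.frame(
  condition = c("control", "control", "treatment", "treatment"),
  score     = c(4, 6, 7, 9)
)

grouped <- scores |> group_by(condition)

# Same rows and columns as before; only grouping metadata was added
same_rows <- nrow(grouped) == nrow(scores)
group_vars(grouped)   # "condition"

# Ungrouped: summarize() collapses the whole table to one row
overall <- scores |> summarize(mean_score = mean(score))

# Grouped: summarize() returns one row per condition
per_group <- grouped |> summarize(mean_score = mean(score))
```

The same summarize() call gives a single overall mean on the ungrouped table and one mean per condition on the grouped one.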

Psychology application

# Calculate descriptive statistics by condition
experiment_data |>
  group_by(condition) |>
  summarize(
    M = mean(score, na.rm = TRUE),
    SD = sd(score, na.rm = TRUE),
    n = n()
  )
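
A runnable variant of the pipeline above, assuming a small simulated experiment_data (the values are invented). It also adds a hypothetical n_valid column, because n() counts every row in the group, including rows where score is missing, while M and SD use only the non-missing scores.

```r
library(dplyr)

# Simulated data for illustration only; one control score is missing
experiment_data <- data.frame(
  condition = c("control", "control", "control",
                "treatment", "treatment", "treatment"),
  score     = c(10, 12, NA, 15, 17, 19)
)

descriptives <- experiment_data |>
  group_by(condition) |>
  summarize(
    M       = mean(score, na.rm = TRUE),
    SD      = sd(score, na.rm = TRUE),
    n       = n(),                  # rows in the group, missing scores included
    n_valid = sum(!is.na(score))    # scores that actually entered M and SD
  )
```

When the two counts differ, n_valid is usually the one to pair with M and SD in a write-up.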