Assignment 2: Data Transformation
Due by 11:59 PM on Sunday, April 12, 2026
Assigned: Wednesday, April 8 (Session 4)
Due: Sunday, April 12 at 11:59 PM
Submit: R script (.R file) on Canvas
See the step-by-step guides: Setting Up an R Project | Using R Scripts
Overview
This assignment practices data transformation using dplyr verbs and the pipe operator. You’ll work with the nycflights13 dataset to answer questions about flight delays.
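As a quick reminder of the pattern, here is a minimal sketch on a made-up tibble. The `toy` data and its values are hypothetical and are not taken from nycflights13. Each dplyr verb takes a data frame and returns a data frame, which is what lets the pipe chain them together.

```r
library(dplyr)

# Hypothetical toy data; not part of nycflights13
toy <- tibble(
  carrier   = c("AA", "AA", "BB", "BB"),
  dep_delay = c(10, 70, -5, 30)
)

# Read the pipe as "then": take toy, then keep delayed rows, then sort
delayed <- toy |>
  filter(dep_delay > 0) |>
  arrange(desc(dep_delay))

delayed
```

The sketch uses the native pipe `|>`, which needs R 4.1 or later. If your course materials use `%>%`, it behaves the same way in this example.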
Setup
```r
# Assignment 2: Data Transformation
# Your Name
# Date

library(tidyverse)
library(nycflights13)

# If you haven't installed nycflights13:
# install.packages("nycflights13")
```
Part 1: Filtering and Arranging (25 points)
Task 1.1
Find all flights that departed from JFK in June and were delayed by more than 1 hour. How many flights meet these criteria?
Task 1.2
Of those flights, which one had the longest departure delay? Use arrange() to find out. Report the carrier, destination, and delay time.
Task 1.3
Find all flights to Los Angeles (LAX) or San Francisco (SFO) that departed on time or early. How many were there?
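If you want to check your understanding of `filter()` and `arrange()` before running them on the full dataset, the sketch below may help. It uses a hypothetical five-row tibble that borrows a few column names from `flights`. The values and conditions are invented for illustration and are not the answers to the tasks above.

```r
library(dplyr)

# Hypothetical toy data that borrows column names from flights
toy <- tibble(
  origin    = c("JFK", "JFK", "LGA", "EWR", "JFK"),
  dest      = c("BOS", "ORD", "BOS", "MIA", "MIA"),
  month     = c(1, 1, 2, 1, 2),
  dep_delay = c(5, -3, 45, NA, 0)
)

# Comma-separated conditions in filter() are combined with AND
jan_jfk <- toy |>
  filter(origin == "JFK", month == 1)

# %in% matches any value in a set; rows where the condition is NA are dropped
early_or_on_time <- toy |>
  filter(dest %in% c("BOS", "MIA"), dep_delay <= 0)

# nrow() counts the rows that survived the filter
nrow(jan_jfk)
nrow(early_or_on_time)

# arrange(desc(...)) puts the largest value first; NA rows sort last
toy |> arrange(desc(dep_delay))
```

Note that `filter()` silently drops rows where a condition evaluates to `NA`, so flights with a missing delay never appear in your counts.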
Part 2: Selecting and Mutating (35 points)
Task 2.1
Create a new dataset that contains only:
- carrier
- origin
- dest
- dep_delay
- arr_delay
Call this dataset flights_subset.
Task 2.2
Using flights_subset, create two new variables:
- delay_diff: the difference between arrival delay and departure delay
- dep_delay_hrs: the departure delay in hours (not minutes)
Task 2.3
What does a positive delay_diff mean? What does a negative value mean? Answer in a comment, then find a flight that “made up time” (arrived less delayed than it departed).
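The general shape of a `select()` followed by a `mutate()` can be sketched as follows. The tibble and the new column names (`air_time_hrs`, `late`) are hypothetical and chosen so that they do not overlap with the variables you are asked to create.

```r
library(dplyr)

# Hypothetical toy data; values are invented
toy <- tibble(
  flight    = c(1, 2, 3),
  dep_delay = c(30, 60, -6),
  arr_delay = c(20, 75, -6),
  air_time  = c(120, 90, 200)
)

toy_subset <- toy |>
  # select() keeps only the named columns, in the order given
  select(flight, dep_delay, air_time) |>
  # mutate() adds columns computed from existing ones
  mutate(
    air_time_hrs = air_time / 60,  # minutes converted to hours
    late         = dep_delay > 0   # TRUE if the flight left late
  )

toy_subset
```

Because the result is assigned to `toy_subset`, the original `toy` tibble is left unchanged.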
Part 3: Grouped Summaries (30 points)
Task 3.1
Calculate the average departure delay for each carrier. Which carrier has the worst average delay?
Task 3.2
Calculate the average departure delay for each origin airport, by month. Which month is the worst for delays at each airport?
Task 3.3
Create a summary table showing, for each carrier:
- Average departure delay
- Average arrival delay
- Number of flights
- Number of destinations served (n_distinct())
Arrange by number of flights (descending).
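A grouped summary with several statistics might look like the sketch below. It groups a hypothetical tibble by `origin` rather than by carrier, and the summary column names are made up for illustration. One detail is worth noticing: `mean()` returns `NA` if any value in the group is missing unless you pass `na.rm = TRUE`, and the `flights` data does contain missing delays.

```r
library(dplyr)

# Hypothetical toy data with one missing delay
toy <- tibble(
  origin    = c("JFK", "JFK", "JFK", "LGA", "LGA"),
  dest      = c("BOS", "BOS", "ORD", "MIA", "ORD"),
  dep_delay = c(10, NA, 20, 40, 0)
)

by_origin <- toy |>
  group_by(origin) |>
  summarize(
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),  # ignore missing delays
    n_flights      = n(),                            # rows per group
    n_dest         = n_distinct(dest)                # unique destinations per group
  ) |>
  arrange(desc(n_flights))

by_origin
```

Here `n()` counts every row in the group, including the one with a missing delay, so the flight count and the number of values behind the mean can differ.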
Putting It Together
Write a single piped sequence that:
- Filters to United Airlines (UA) flights
- Removes rows with missing arrival delay
- Groups by destination
- Calculates mean arrival delay and number of flights
- Filters to destinations with at least 100 flights
- Arranges by mean delay (worst first)
What are the top 3 worst destinations for United delays?
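Two steps in this sequence tend to cause trouble: removing missing values before summarizing, and filtering on a column that only exists after `summarize()`. The sketch below shows both on a hypothetical seven-row tibble with a made-up carrier code and a threshold of 2 rather than 100. Treat it as an illustration of the pattern under those assumptions, not as the expected answer.

```r
library(dplyr)

# Hypothetical toy data; carrier codes and delays are invented
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "AA", "AA", "AA", "BB"),
  dest      = c("BOS", "BOS", "ORD", "ORD", "MIA", "MIA", "ORD"),
  arr_delay = c(12, NA, 30, 10, 50, 40, 5)
)

result <- toy |>
  filter(carrier == "AA") |>
  filter(!is.na(arr_delay)) |>   # drop rows with a missing arrival delay
  group_by(dest) |>
  summarize(
    mean_arr_delay = mean(arr_delay),
    n              = n()
  ) |>
  filter(n >= 2) |>              # this filter acts on the summary column n
  arrange(desc(mean_arr_delay))

result
```

The second `filter()` works on the summarized table, which has one row per destination, so it can refer to `n` even though `n` is not a column in the original data.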
Grading Rubric
| Component | Points |
|---|---|
| Part 1: Filtering & Arranging | 25 |
| Part 2: Selecting & Mutating | 35 |
| Part 3: Grouped Summaries | 30 |
| Code runs without errors | 10 |
| Total | 100 |
Submission
Submit your .R file on Canvas.