Assignment 2: Data Transformation

Due by 11:59 PM on Sunday, April 12, 2026

Note: Assignment Details

Assigned: Wednesday, April 8 (Session 4)
Due: Sunday, April 12 at 11:59 PM
Submit: R script (.R file) on Canvas

Tip: Getting started

See the step-by-step guides: Setting Up an R Project | Using R Scripts

Overview

This assignment practices data transformation using dplyr verbs and the pipe operator. You’ll work with the nycflights13 dataset to answer questions about flight delays.
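As a minimal sketch of what the pipe does, the example below runs the same two dplyr verbs twice: once as nested function calls and once as a piped sequence. The table `toy_flights` is made up for illustration and is not part of nycflights13. The native pipe `|>` is an assumption here (it needs R 4.1 or later); the magrittr pipe `%>%` loaded by the tidyverse would behave the same way in this example.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_flights <- tibble(
  carrier   = c("AA", "BB", "AA", "CC"),
  dep_delay = c(5, 90, -3, 20)
)

# Nested calls: read from the inside out
nested <- arrange(filter(toy_flights, dep_delay > 0), desc(dep_delay))

# Piped sequence: read top to bottom; each step receives the previous result
piped <- toy_flights |>
  filter(dep_delay > 0) |>
  arrange(desc(dep_delay))

# Both forms should produce the same table
identical(nested, piped)
```

The piped form is usually easier to read once a sequence has more than two steps, which is why the tasks below ask for it.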

Setup

# Assignment 2: Data Transformation
# Your Name
# Date

# If you haven't installed nycflights13, run this once first:
# install.packages("nycflights13")

library(tidyverse)
library(nycflights13)

Part 1: Filtering and Arranging (25 points)

Task 1.1

Find all flights that departed from JFK in June with a departure delay of more than 1 hour. How many flights meet these criteria?

Task 1.2

Of those flights, which one had the longest departure delay? Use arrange() to find out. Report the carrier, destination, and delay time.

Task 1.3

Find all flights to Los Angeles (LAX) or San Francisco (SFO) that departed on time or early. How many were there?
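The filtering patterns used in Part 1 can be sketched on a small made-up table. The airport codes and values in `toy_flights` are hypothetical and are not taken from nycflights13. The sketch shows three things: combining conditions in `filter()`, matching several values with `%in%`, and sorting with `arrange(desc())`.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_flights <- tibble(
  origin    = c("AAA", "AAA", "BBB", "AAA", "BBB"),
  dest      = c("XXX", "YYY", "XXX", "ZZZ", "YYY"),
  month     = c(1, 1, 2, 1, 1),
  dep_delay = c(75, -2, 0, 130, NA)
)

# Conditions separated by commas are combined with AND
long_delays <- toy_flights |>
  filter(origin == "AAA", month == 1, dep_delay > 60) |>
  arrange(desc(dep_delay))   # largest delay first

# %in% keeps rows whose value matches any element of the vector;
# a delay of zero or less means on time or early
on_time <- toy_flights |>
  filter(dest %in% c("XXX", "YYY"), dep_delay <= 0)

nrow(long_delays)
nrow(on_time)
```

Note that `filter()` drops rows where a condition evaluates to `NA`, so the row with a missing `dep_delay` is not counted in either result.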

Part 2: Selecting and Mutating (35 points)

Task 2.1

Create a new dataset that contains only:

  • carrier
  • origin
  • dest
  • dep_delay
  • arr_delay

Call this dataset flights_subset.

Task 2.2

Using flights_subset, create two new variables:

  • delay_diff: the difference between arrival delay and departure delay
  • dep_delay_hrs: the departure delay in hours (not minutes)
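A minimal sketch of the select-then-mutate pattern, again on made-up data: `toy_flights`, `toy_subset`, and the new column names are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_flights <- tibble(
  carrier   = c("AA", "BB"),
  dep_delay = c(30, 120),
  arr_delay = c(10, 150),
  distance  = c(500, 1200)
)

toy_subset <- toy_flights |>
  # Keep only the named columns (distance is dropped)
  select(carrier, dep_delay, arr_delay) |>
  # Add new columns computed from existing ones
  mutate(
    delay_gap       = arr_delay - dep_delay,  # minutes
    dep_delay_hours = dep_delay / 60          # minutes converted to hours
  )

toy_subset
```

`mutate()` adds the new columns to the right of the existing ones and leaves the original columns unchanged.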

Task 2.3

What does a positive delay_diff mean? What does a negative value mean? Answer in a comment, then find a flight that “made up time” (arrived less delayed than it departed).

Part 3: Grouped Summaries (30 points)

Task 3.1

Calculate the average departure delay for each carrier. Which carrier has the worst average delay?

Task 3.2

Calculate the average departure delay for each origin airport, by month. Which month is the worst for delays at each airport?
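One way to approach a "summary by two variables, then the top row per group" question is sketched below on made-up data. The table `toy_flights` is hypothetical, and using `slice_max()` for the second step is an assumption about approach; arranging the summary and reading it by eye would work as well.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_flights <- tibble(
  origin    = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  month     = c(1, 1, 2, 1, 2, 2),
  dep_delay = c(10, NA, 40, 30, 5, 15)
)

# One row per origin-month combination;
# na.rm = TRUE ignores missing delays when averaging
monthly <- toy_flights |>
  group_by(origin, month) |>
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop")

# Within each origin, keep the month with the highest average delay
worst_month <- monthly |>
  group_by(origin) |>
  slice_max(avg_dep_delay, n = 1) |>
  ungroup()

worst_month
```

Without `na.rm = TRUE`, any group containing a missing delay would have an average of `NA`.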

Task 3.3

Create a summary table showing, for each carrier:

  • Average departure delay
  • Average arrival delay
  • Number of flights
  • Number of destinations served (n_distinct())

Arrange by number of flights (descending).
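The counting helpers used in a summary table like this can be sketched on made-up data. `toy_flights` and the output column names are hypothetical. `n()` counts rows in each group and `n_distinct()` counts unique values.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_flights <- tibble(
  carrier   = c("AA", "AA", "AA", "BB", "BB"),
  dest      = c("XXX", "YYY", "XXX", "XXX", "XXX"),
  dep_delay = c(10, NA, 20, 5, 15)
)

carrier_summary <- toy_flights |>
  group_by(carrier) |>
  summarise(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),  # skip missing delays
    n_flights     = n(),                            # rows per carrier
    n_dest        = n_distinct(dest)                # unique destinations
  ) |>
  arrange(desc(n_flights))

carrier_summary
```

Here `n()` counts every row, including the one with a missing delay, while the mean is computed only from the non-missing values.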

Putting It Together

Write a single piped sequence that:

  1. Filters to United Airlines (UA) flights
  2. Removes rows with missing arrival delay
  3. Groups by destination
  4. Calculates mean arrival delay and number of flights
  5. Filters to destinations with at least 100 flights
  6. Arranges by mean delay (worst first)

What are the top 3 worst destinations for United delays?
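The order of the steps above matters, which the sketch below tries to show on made-up data. The table `toy_flights`, the carrier code, and the threshold of 2 flights are hypothetical, chosen so the toy table is small enough to check by hand.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy_flights <- tibble(
  carrier   = c("AA", "AA", "AA", "AA", "BB", "AA", "AA", "AA"),
  dest      = c("XXX", "XXX", "YYY", "YYY", "XXX", "ZZZ", "WWW", "WWW"),
  arr_delay = c(10, 30, 40, NA, 5, 100, 50, 70)
)

dest_summary <- toy_flights |>
  filter(carrier == "AA") |>
  filter(!is.na(arr_delay)) |>       # drop missing values before counting
  group_by(dest) |>
  summarise(mean_arr_delay = mean(arr_delay), n = n()) |>
  filter(n >= 2) |>                  # this filter acts on the summary, not the raw rows
  arrange(desc(mean_arr_delay))

dest_summary
```

Because missing arrival delays are removed before `n()` is computed, destination YYY is left with a single flight and falls below the threshold. Destination ZZZ has the largest delay but only one flight, so it is excluded as well.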

Grading Rubric

Component                        Points
Part 1: Filtering & Arranging    25
Part 2: Selecting & Mutating     35
Part 3: Grouped Summaries        30
Code runs without errors         10
Total                            100

Submission

Submit your .R file on Canvas.