Assignment 7: Joins & Missing Data

Due by 11:59 PM on Sunday, May 24, 2026

Note: Assignment Details

Assigned: Wednesday, May 13 (Session 14)
Due: Sunday, May 24 at 11:59 PM
Submit: Quarto document (.qmd) AND rendered HTML on Canvas

Overview

This assignment gives you practice joining multiple datasets and handling missing data. Both are essential skills for real research data, where information often comes from several sources.

Setup

# Assignment 7: Joins & Missing Data
# Your Name
# Date

library(tidyverse)

The Datasets

A research study has data in three separate files:

# Participant demographics
demographics <- tibble(
  participant_id = c(1, 2, 3, 4, 5, 6, 7, 8),
  age = c(22, 35, 28, 41, 19, 33, 27, 45),
  gender = c("F", "M", "F", "F", "M", "M", "F", "M"),
  condition = c("A", "B", "A", "B", "A", "B", "A", "B")
)

# Survey responses (some participants didn't complete)
survey_responses <- tibble(
  participant_id = c(1, 2, 3, 4, 6, 7, 9, 10),  # Note: 5, 8 missing; 9, 10 extra
  anxiety = c(45, 32, 38, 41, 28, 52, 35, 40),
  depression = c(12, 8, NA, 15, 6, 18, NA, 11),
  stress = c(28, 22, 25, NA, 18, 32, 24, NA)
)

# Behavioral task data
task_data <- tibble(
  participant_id = c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6),
  trial = rep(1:2, 5),
  reaction_time = c(450, 420, 380, 390, 410, 405, 520, 480, 370, 365),
  accuracy = c(1, 1, 1, 0, 1, 1, 0, 1, 1, 1)
)

Part 1: Understanding Your Data (15 points)

Task 1.1

Before joining, explore each dataset. How many participants are in each? Are there any participants who appear in one dataset but not others?

Task 1.2

Use anti_join() to find:

  • Participants in demographics but not in survey_responses
  • Participants in survey_responses but not in demographics

What might explain these discrepancies in a real study?
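As a rough illustration of how anti_join() behaves, here is a minimal sketch on two made-up tables. The names roster, scores, and id are hypothetical and are not part of the assignment data; the point is only that the order of the two arguments decides which side's unmatched rows you get back.

```r
library(dplyr)  # dplyr is loaded as part of the tidyverse

# Hypothetical toy tables (NOT the assignment data)
roster <- tibble(id = c(1, 2, 3, 4), group = c("A", "B", "A", "B"))
scores <- tibble(id = c(2, 3, 5), score = c(10, 12, 9))

# Rows of roster whose id never appears in scores (ids 1 and 4)
in_roster_only <- anti_join(roster, scores, by = "id")

# Swapping the arguments answers the opposite question (id 5)
in_scores_only <- anti_join(scores, roster, by = "id")
```

anti_join() returns only columns from its first argument, so the result looks like a filtered copy of that table.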

Part 2: Joining Data (30 points)

Task 2.1

Create a dataset that contains demographics and survey responses for all participants who have both. Use inner_join(). How many participants are in this dataset?

Task 2.2

Create a dataset that keeps all participants from demographics and adds survey responses where available. Use left_join(). How many participants have missing survey data?
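The difference between the two joins in Tasks 2.1 and 2.2 may be easier to see on a small made-up example. This is a sketch under assumed names (roster, scores, id are hypothetical), not a solution for the assignment data.

```r
library(dplyr)

# Hypothetical toy tables (NOT the assignment data)
roster <- tibble(id = c(1, 2, 3, 4), group = c("A", "B", "A", "B"))
scores <- tibble(id = c(2, 3, 5), score = c(10, 12, 9))

# inner_join(): keeps only ids present in BOTH tables (ids 2 and 3)
both <- inner_join(roster, scores, by = "id")

# left_join(): keeps every row of roster; unmatched rows get NA in score
all_roster <- left_join(roster, scores, by = "id")

# Count roster rows that found no match in scores
n_unmatched <- sum(is.na(all_roster$score))
```

In this toy example no score was missing to begin with, so counting NA values in score equals the number of unmatched rows. That shortcut only holds when the joined column has no NA values of its own.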

Task 2.3

The task_data has multiple rows per participant (one per trial). First, summarize it to get mean reaction time and mean accuracy per participant. Then join this with your demographics + survey dataset.
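The summarize-then-join pattern described here can be sketched on made-up data as follows. The tables trials and roster and their column names are hypothetical; the idea is to collapse the trial-level table to one row per person before joining, so the join does not duplicate rows.

```r
library(dplyr)

# Hypothetical trial-level data: two rows per id (NOT the assignment data)
trials <- tibble(
  id = c(1, 1, 2, 2, 3, 3),
  rt = c(500, 460, 400, 420, 390, 410),
  correct = c(1, 0, 1, 1, 0, 1)
)
roster <- tibble(id = c(1, 2, 3, 4), group = c("A", "B", "A", "B"))

# Step 1: collapse to one row per id
per_person <- trials |>
  group_by(id) |>
  summarize(
    mean_rt = mean(rt),
    mean_correct = mean(correct),
    .groups = "drop"
  )

# Step 2: join the one-row-per-id summary onto the person-level table
combined <- left_join(roster, per_person, by = "id")
```

Because per_person has exactly one row per id, combined has the same number of rows as roster; id 4 has no trials, so its summary columns are NA.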

Part 3: Missing Data Analysis (30 points)

Using your joined dataset from Task 2.2:

Task 3.1

Create a summary showing the number and percentage of missing values for each variable.
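One possible shape for such a summary, sketched on a made-up tibble (toy, x, and y are hypothetical names): count the NA values in every column, then reshape so each variable gets its own row.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (NOT the assignment data)
toy <- tibble(
  id = 1:5,
  x = c(3, NA, 5, NA, 1),
  y = c("a", "b", NA, "d", "e")
)

missing_summary <- toy |>
  # One row holding the NA count for each column
  summarize(across(everything(), ~ sum(is.na(.x)))) |>
  # One row per variable instead of one column per variable
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  mutate(pct_missing = 100 * n_missing / nrow(toy))
```

Here x is missing in 2 of 5 rows (40%) and y in 1 of 5 (20%).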

Task 3.2

Is missingness related to condition? Calculate the percentage of missing anxiety scores in each condition. Do the same for depression and stress.
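A group-wise missingness rate can be computed by taking the mean of is.na() within each group, since TRUE counts as 1 and FALSE as 0. The sketch below uses a made-up tibble with hypothetical columns group and x.

```r
library(dplyr)

# Hypothetical toy data (NOT the assignment data)
toy <- tibble(
  group = c("A", "A", "A", "B", "B"),
  x = c(1, NA, 3, NA, NA)
)

miss_by_group <- toy |>
  group_by(group) |>
  summarize(
    n = n(),                                # rows in the group
    pct_missing_x = 100 * mean(is.na(x)),   # share of NA values, in percent
    .groups = "drop"
  )
```

With groups this small, a single missing value shifts the percentage a lot, so it helps to report n next to each percentage.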

Task 3.3

Create a visualization showing the pattern of missingness across variables. You might use:

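# One approach: a missing-data heatmap
# Step 1: turn every cell into TRUE (missing) or FALSE (present)
missing_flags <- your_data |>
  mutate(across(everything(), is.na))
# Step 2: reshape missing_flags to long format and plot it (your code here)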
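For orientation, a tile plot of missingness might be built along these lines. This is a sketch on made-up data: toy, id, x, and y are hypothetical names, and geom_tile() is just one reasonable choice of geometry.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data (NOT the assignment data)
toy <- tibble(id = 1:4, x = c(1, NA, 3, NA), y = c(NA, 2, 3, 4))

long_flags <- toy |>
  # Flag missing cells, leaving the id column untouched
  mutate(across(-id, is.na)) |>
  # One row per (id, variable) pair
  pivot_longer(-id, names_to = "variable", values_to = "is_missing")

# One tile per cell, filled by whether the value is missing
p <- ggplot(long_flags, aes(x = variable, y = factor(id), fill = is_missing)) +
  geom_tile(color = "white") +
  labs(x = "Variable", y = "Row id", fill = "Missing?")
```

Excluding the identifier from across() keeps it available as the y axis after reshaping.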

Task 3.4

Create two versions of a summary:

  1. Complete cases only: Mean anxiety by condition, excluding any participant who has a missing value on any variable
  2. Available data: Mean anxiety by condition, using na.rm = TRUE

Do the results differ? What are the implications?
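The two approaches can give different answers even when the variable being averaged has no missing values itself. The following sketch on made-up data (toy, group, x, and y are hypothetical names) contrasts tidyr's drop_na() with na.rm = TRUE.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: x is fully observed, y has two NA values
toy <- tibble(
  group = c("A", "A", "A", "B", "B", "B"),
  x = c(10, 20, 30, 40, 50, 60),
  y = c(1, NA, 3, 4, 5, NA)
)

# Complete cases: drop every row with an NA in ANY column first
complete_means <- toy |>
  drop_na() |>
  group_by(group) |>
  summarize(mean_x = mean(x), .groups = "drop")

# Available data: keep all rows, ignore NA only in the column being averaged
available_means <- toy |>
  group_by(group) |>
  summarize(mean_x = mean(x, na.rm = TRUE), .groups = "drop")
```

In this toy example group B averages 45 under complete cases but 50 with all available data, because a row with a missing y is dropped even though its x was observed.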

Part 4: Reflection (15 points)

Answer in comments:

  1. What are the risks of analyzing only complete cases?
  2. When might it be appropriate to exclude participants with missing data?
  3. How would you report the missing data in a research paper?

Grading Rubric

| Component                     | Points |
|-------------------------------|--------|
| Part 1: Understanding data    | 15     |
| Part 2: Joining data          | 30     |
| Part 3: Missing data analysis | 30     |
| Part 4: Reflection            | 15     |
| Code runs without errors      | 10     |
| Total                         | 100    |

Submission

Submit your Quarto document (.qmd) AND the rendered HTML on Canvas by the due date.

Note: PSY 510 (Graduate Students)

Students enrolled in PSY 510 must complete the following extension in addition to all tasks above.

Graduate Extension: Formal Missing Data Report

Reviewers regularly ask authors to characterize and justify how they handled missing data. This extension gives you a reusable script and language for doing that.

Task G.1

Install and use the naniar package to create a missingness matrix — a visualization showing which participants have missing values across which variables.

# install.packages("naniar")
library(naniar)
vis_miss(your_joined_data)

Task G.2

Classify the likely missingness mechanism for the survey data in this assignment. Is it MCAR (missing completely at random), MAR (missing at random — related to other observed variables), or MNAR (missing not at random — related to the missing value itself)? Justify your classification using your results from Part 3 — point to specific numbers.

Task G.3

Write a missing data paragraph as it would appear in a manuscript’s Method section (3–5 sentences). It should state: (a) how much data is missing overall, (b) which variables are most affected, (c) whether missingness appears random or systematic, and (d) how you handled it in your analysis. Write this as something you could drop into a real paper.

Submission: Add your naniar code and the missing data paragraph to your .qmd file under a clearly marked ## Graduate Extension section.