Assignment 6: Data Types & Wrangling
Due by 11:59 PM on Sunday, May 17, 2026
Assigned: Monday, May 11 (Session 13)
Due: Sunday, May 17 at 11:59 PM
Submit: Quarto document (.qmd) AND rendered HTML on Canvas
See the guides: Setting Up an R Project | Using Quarto Documents
Overview
This assignment gives you practice working with different data types (logical vectors, numbers, strings, and factors) and ends with a short text analysis of open-ended responses. These skills are essential for cleaning real-world psychology data.
Setup
# Assignment 6: Data Types & Wrangling
# Your Name
# Date
library(tidyverse)
library(tidytext)
# If you haven't installed tidytext:
# install.packages("tidytext")
Part 1: Logical Vectors & Recoding (25 points)
Here’s a simulated survey dataset:
set.seed(410) # For reproducibility
survey <- tibble(
  id = 1:100,
  age = round(rnorm(100, mean = 35, sd = 12)),
  q1 = sample(1:7, 100, replace = TRUE),
  q2 = sample(1:7, 100, replace = TRUE),
  q3_reverse = sample(1:7, 100, replace = TRUE), # Needs reverse coding
  q4 = sample(1:7, 100, replace = TRUE),
  q5_reverse = sample(1:7, 100, replace = TRUE), # Needs reverse coding
  attention_check = sample(c(4, 1:3, 5:7), 100, replace = TRUE,
                           prob = c(0.85, rep(0.025, 6)))
)
Task 1.1
Create a logical variable passed_attention that is TRUE if attention_check == 4 and FALSE otherwise. How many participants passed?
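As a rough illustration of the idea (on a made-up tibble called `toy`, not the assignment data), a comparison such as `==` returns a logical vector, and because `TRUE` counts as 1, `sum()` gives the number of rows that meet the condition. The column name `check_item` and the correct answer of 2 are assumptions for this sketch only.

```r
library(dplyr)

# Hypothetical toy data: `check_item` and the "correct" answer 2 are made up
toy <- tibble(
  id = 1:5,
  check_item = c(2, 2, 5, 2, 1)
)

toy <- toy |>
  mutate(answered_correctly = check_item == 2) # TRUE/FALSE for each row

# TRUE is treated as 1 and FALSE as 0, so sum() counts the TRUE values
n_correct <- sum(toy$answered_correctly)
n_correct
```

Here three of the five toy rows have the value 2, so `n_correct` is 3.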
Task 1.2
Reverse-code q3_reverse and q5_reverse. On a 1-7 scale, reverse coding means: new_value = 8 - old_value. Create new variables q3 and q5 with the corrected values.
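To see the arithmetic on its own, here is a small sketch with a made-up vector of 1-7 responses (not the survey data). The constant 8 comes from adding the scale minimum and maximum (1 + 7), so a 1 becomes a 7, a 7 becomes a 1, and the midpoint 4 stays at 4.

```r
# Hypothetical responses on a 1-7 scale (not the assignment data)
old_value <- c(1, 2, 4, 6, 7)

# Reverse coding: subtract each response from (scale max + scale min) = 8
new_value <- 8 - old_value
new_value
# 7 6 4 2 1
```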
Task 1.3
Create a scale_mean variable that is the mean of q1, q2, q3, q4, and q5 for each participant.
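One common pitfall here: inside a plain `mutate()`, `mean(some_column)` averages down the whole column rather than across items within a participant. The sketch below shows one possible way to get a per-row mean, using `rowwise()` and `c_across()` on a made-up tibble. The column names `item_a`, `item_b`, and `item_c` are hypothetical, and other approaches (such as adding the items and dividing by their count) would work just as well.

```r
library(dplyr)

# Hypothetical toy data with three items per participant
toy <- tibble(
  id = 1:3,
  item_a = c(1, 4, 7),
  item_b = c(3, 4, 5),
  item_c = c(2, 4, 6)
)

toy <- toy |>
  rowwise() |> # treat each row as its own group
  mutate(item_mean = mean(c_across(c(item_a, item_b, item_c)))) |>
  ungroup()    # drop the row-wise grouping afterwards

toy$item_mean
# 2 4 6
```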
Task 1.4
Create an age_group variable using case_when():
- “young” if age < 30
- “middle” if age >= 30 and age < 50
- “older” if age >= 50
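If `case_when()` is new to you, the following sketch shows its general shape on a made-up `score` variable (the cutoffs 40 and 60 and the labels are assumptions for illustration). Conditions are checked in order, the first one that is `TRUE` wins, and any row that matches no condition gets `NA`.

```r
library(dplyr)

# Hypothetical toy data, unrelated to the survey
toy <- tibble(score = c(12, 40, 59, 60, 88))

toy <- toy |>
  mutate(
    band = case_when(
      score < 40 ~ "low",
      score >= 40 & score < 60 ~ "medium",
      score >= 60 ~ "high"
    )
  )

toy$band
# "low" "medium" "medium" "high" "high"
```

Note how the boundary values 40 and 60 fall into the band whose condition uses `>=`.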
Part 2: Strings (20 points)
Here’s some messy demographic data:
messy_demo <- tibble(
  id = 1:10,
  gender = c("Male", "female", "FEMALE", "M", "F", "male", "Female", "m", "f", "Male"),
  education = c("high school", "High School", "college", "College", "graduate",
                "COLLEGE", "Graduate", "high school", "Graduate", "college")
)
Task 2.1
Clean the gender variable so all values are standardized to “Male” or “Female”.
Hint: Use str_to_lower() first, then case_when() or if_else().
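The hint can be sketched on a different, made-up variable. Lowercasing first means you only have to list each spelling once, and `%in%` lets a single `case_when()` branch cover several variants. The `consent` column and its values are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical messy yes/no responses (not the assignment data)
toy <- tibble(consent = c("Yes", "yes", "Y", "NO", "n"))

toy <- toy |>
  mutate(
    consent_lower = str_to_lower(consent), # "yes", "yes", "y", "no", "n"
    consent_clean = case_when(
      consent_lower %in% c("yes", "y") ~ "Yes",
      consent_lower %in% c("no", "n") ~ "No"
    )
  )

toy$consent_clean
# "Yes" "Yes" "Yes" "No" "No"
```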
Task 2.2
Clean the education variable similarly. Standardize to “High School”, “College”, or “Graduate”.
Part 3: Factors (20 points)
Task 3.1
Convert your cleaned education variable to a factor with levels in logical order: “High School”, “College”, “Graduate”.
# Hint: factor(x, levels = c(...))
Task 3.2
Create a bar chart of education levels. What happens if you don't set the factor levels? Show both versions.
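For background, here is a minimal sketch of how level order is decided, using a made-up `sizes` vector. Without a `levels` argument, `factor()` sorts the unique values (alphabetically for text like this), and ggplot2 draws discrete axes in level order.

```r
# Hypothetical values with a natural order that is not alphabetical
sizes <- c("small", "large", "medium", "small", "large")

# Default: levels are the sorted unique values
default_levels <- levels(factor(sizes))
default_levels
# "large" "medium" "small"

# Explicit: levels appear in the order you list them
explicit_levels <- levels(factor(sizes, levels = c("small", "medium", "large")))
explicit_levels
# "small" "medium" "large"
```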
Task 3.3
Using the survey data, create a factor version of age_group with levels in order: “young”, “middle”, “older”. Then create a boxplot of scale_mean by age_group. The x-axis should be in the correct order.
Part 4: Putting It Together (10 points)
Create a complete data cleaning pipeline for the survey data that:
- Filters to participants who passed the attention check
- Reverse-codes the necessary items
- Creates the scale mean
- Creates the age group factor (in correct order)
- Selects only id, age, age_group, and scale_mean
Use a single piped sequence.
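The general shape of such a pipeline, sketched on a made-up tibble: each step takes the result of the previous one, so nothing needs to be saved in between. The columns `completed` and `minutes` are hypothetical and stand in for whatever your own steps use.

```r
library(dplyr)

# Hypothetical toy data, unrelated to the survey
toy <- tibble(
  id = 1:4,
  completed = c(TRUE, FALSE, TRUE, TRUE),
  minutes = c(30, 45, 90, 120)
)

toy_clean <- toy |>
  filter(completed) |>            # keep rows where the logical column is TRUE
  mutate(hours = minutes / 60) |> # derive a new variable
  select(id, hours)               # keep only the columns needed downstream

toy_clean
```

The result has three rows (ids 1, 3, and 4) and two columns.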
Part 5: Text Analysis (15 points)
Open-ended survey responses often contain valuable data. Here are responses from the same study:
open_responses <- tibble(
  id = 1:15,
  response = c(
    "I felt very stressed and anxious throughout the study",
    "The tasks were challenging but I found them interesting",
    "I was confused by some of the instructions at first",
    "Participating made me feel nervous but also curious",
    "The study was well organized and the instructions were clear",
    "I felt overwhelmed at times but managed to stay focused",
    "Some questions were confusing and hard to understand",
    "I enjoyed the experience and learned something new",
    "The tasks were repetitive and I lost focus near the end",
    "I felt calm and confident during the experiment",
    "It was stressful but I liked being part of research",
    "The instructions were clear and easy to follow",
    "I felt anxious about my performance on the tasks",
    "Very interesting study about human behavior and emotion",
    "I found the experience valuable and thought-provoking"
  )
)
Task 5.1
Use unnest_tokens() from the tidytext package to tokenize the responses into individual words.
# Hint:
open_responses |>
  unnest_tokens(word, response)
Task 5.2
Remove common stop words (e.g., “the”, “and”, “I”) using anti_join(stop_words). How many unique words remain?
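To see what `anti_join()` does mechanically, here is a sketch on two made-up sentences. It uses a tiny hand-written stop list, `mini_stop`, so the result can be checked by hand. That list is an assumption for illustration only, and the task itself uses tidytext's built-in `stop_words`. Note that `unnest_tokens()` lowercases words by default, which is why "The" matches "the".

```r
library(dplyr)
library(tidytext)

# Hypothetical toy responses (not the assignment data)
toy <- tibble(
  id = 1:2,
  text = c("The room was quiet", "The task was long and boring")
)

tokens <- toy |>
  unnest_tokens(word, text) # one lowercase word per row

# Hypothetical mini stop list, used only to keep this example small
mini_stop <- tibble(word = c("the", "was", "and"))

kept <- tokens |>
  anti_join(mini_stop, by = "word") # drop rows whose word is in the stop list

n_distinct(kept$word)
# 5
```

The five remaining words are "room", "quiet", "task", "long", and "boring".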
Task 5.3
Count word frequencies and create a bar chart of the 10 most common words. What themes emerge from the responses?
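One possible way to go from a column of words to a "top n" bar chart is sketched below on a made-up `word` column. `count()` with `sort = TRUE` tallies the words, `slice_max()` keeps the most frequent ones, and `reorder()` sorts the bars by frequency. Be aware that `slice_max()` keeps ties by default, so it can return more rows than requested.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tokenized words (not the assignment data)
toy_words <- tibble(
  word = c("calm", "calm", "calm", "tired", "tired", "bored")
)

top_words <- toy_words |>
  count(word, sort = TRUE) |> # columns: word, n (most frequent first)
  slice_max(n, n = 2)         # keep the 2 highest counts

# Horizontal bars, ordered by frequency
p <- ggplot(top_words, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL)

top_words$word
# "calm" "tired"
```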
Grading Rubric
| Component | Points |
|---|---|
| Part 1: Logical vectors & recoding | 25 |
| Part 2: Strings | 20 |
| Part 3: Factors | 20 |
| Part 4: Complete pipeline | 10 |
| Part 5: Text analysis | 15 |
| Code runs without errors | 10 |
| Total | 100 |