Assignment 6: Data Types & Wrangling

Due by 11:59 PM on Sunday, May 17, 2026

Assignment Details

Assigned: Monday, May 11 (Session 13)
Due: Sunday, May 17 at 11:59 PM
Submit: Quarto document (.qmd) and rendered HTML on Canvas

Getting started

See the guides: Setting Up an R Project | Using Quarto Documents

Overview

This assignment gives you practice working with different data types: logical vectors, numbers, strings, and factors. It ends with a short text analysis of open-ended responses. These skills are essential for cleaning real-world psychology data.

Setup

# Assignment 6: Data Types & Wrangling
# Your Name
# Date

# If you haven't installed tidytext, run this once in the console:
# install.packages("tidytext")

library(tidyverse)
library(tidytext)

Part 1: Logical Vectors & Recoding (25 points)

Here’s a simulated survey dataset:

set.seed(410)  # For reproducibility

survey <- tibble(
  id = 1:100,
  age = round(rnorm(100, mean = 35, sd = 12)),
  q1 = sample(1:7, 100, replace = TRUE),
  q2 = sample(1:7, 100, replace = TRUE),
  q3_reverse = sample(1:7, 100, replace = TRUE),  # Needs reverse coding
  q4 = sample(1:7, 100, replace = TRUE),
  q5_reverse = sample(1:7, 100, replace = TRUE),  # Needs reverse coding
  attention_check = sample(c(4, 1:3, 5:7), 100, replace = TRUE,
                           prob = c(0.85, rep(0.025, 6)))
)

Task 1.1

Create a logical variable passed_attention that is TRUE if attention_check == 4 and FALSE otherwise. How many participants passed?
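As a minimal sketch of the general idea, the example below builds a logical column from a comparison and counts the TRUE values. It uses hypothetical toy data (the names toy, check_item, and passed are illustrative assumptions, not part of the assignment's survey tibble).

```r
library(dplyr)

# Hypothetical toy data, not the assignment's survey tibble
toy <- tibble(
  id = 1:5,
  check_item = c(4, 2, 4, 4, 7)
)

# A comparison such as == returns TRUE or FALSE for every row
toy <- toy |>
  mutate(passed = check_item == 4)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
n_passed <- sum(toy$passed)
n_passed
```

The same count could also be obtained with count(toy, passed), which reports the TRUE and FALSE totals side by side.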

Task 1.2

Reverse-code q3_reverse and q5_reverse. On a 1-7 scale, reverse coding means: new_value = 8 - old_value. Create new variables q3 and q5 with the corrected values.
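The arithmetic can be checked on a small vector before applying it to a whole column. The sketch below uses hypothetical values and names (old, new, scale_min, scale_max); it only illustrates that the constant 8 is the scale maximum plus the scale minimum.

```r
# Hypothetical responses on a 1-7 scale
old <- c(1, 4, 7, 2)

scale_min <- 1
scale_max <- 7

# Reverse coding: (max + min) - old, which is 8 - old on a 1-7 scale
new <- (scale_max + scale_min) - old
new
```

Note that the scale midpoint (4) maps to itself, which is a quick sanity check that the reversal is correct.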

Task 1.3

Create a scale_mean variable that is the mean of q1, q2, q3, q4, and q5 for each participant.
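A common pitfall is that mean() inside mutate() averages over all rows at once rather than within each participant. The sketch below contrasts the two on hypothetical toy items (item_a, item_b, and item_c are illustrative names); rowMeans() is one of several ways to get a per-row mean.

```r
library(dplyr)

# Hypothetical toy items (names are illustrative only)
toy <- tibble(
  item_a = c(1, 4),
  item_b = c(3, 4),
  item_c = c(5, 7)
)

toy <- toy |>
  mutate(
    # rowMeans() averages across the columns within each row
    row_mean = rowMeans(cbind(item_a, item_b, item_c)),
    # mean() of the combined columns collapses everything into one number,
    # which is then repeated on every row
    grand_mean = mean(c(item_a, item_b, item_c))
  )

toy
```

Here row_mean differs by row, while grand_mean is the same value everywhere, which is usually not what a scale score should be.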

Task 1.4

Create an age_group variable using case_when():

“young” if age < 30
“middle” if age >= 30 and age < 50
“older” if age >= 50
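As an illustration of how case_when() evaluates its conditions, the sketch below bins a different, hypothetical variable (hours, with made-up cut points and labels). Conditions are checked in order and the first one that is TRUE wins, so later branches can rely on earlier ones having failed.

```r
library(dplyr)

# Hypothetical toy data; the variable and cut points are illustrative only
toy <- tibble(hours = c(5, 7, 9.5, NA))

toy <- toy |>
  mutate(
    sleep_band = case_when(
      hours < 6  ~ "short",
      hours < 9  ~ "typical",  # only reached when hours >= 6
      hours >= 9 ~ "long"
    )
  )

toy
```

A missing value matches none of the conditions, so it stays NA in the new column.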

Part 2: Strings (20 points)

Here’s some messy demographic data:

messy_demo <- tibble(
  id = 1:10,
  gender = c("Male", "female", "FEMALE", "M", "F", "male", "Female", "m", "f", "Male"),
  education = c("high school", "High School", "college", "College", "graduate",
                "COLLEGE", "Graduate", "high school", "Graduate", "college")
)

Task 2.1

Clean the gender variable so all values are standardized to “Male” or “Female”.

Hint: Use str_to_lower() first, then case_when() or if_else().
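The hint's two-step pattern can be sketched on an unrelated, hypothetical yes/no variable (the names toy, answer, answer_lower, and answer_clean are assumptions for illustration). Lowercasing first means each category needs only a short list of spellings.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data, not the assignment's messy_demo tibble
toy <- tibble(answer = c("Yes", "y", "NO", "n", "YES"))

toy <- toy |>
  mutate(
    # Step 1: remove case differences
    answer_lower = str_to_lower(answer),
    # Step 2: map every remaining spelling to one standard label
    answer_clean = case_when(
      answer_lower %in% c("yes", "y") ~ "Yes",
      answer_lower %in% c("no", "n")  ~ "No"
    )
  )

toy
```

Any spelling not listed in a branch becomes NA, which makes unexpected values easy to spot with count(toy, answer_clean).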

Task 2.2

Clean the education variable similarly. Standardize to “High School”, “College”, or “Graduate”.

Part 3: Factors (20 points)

Task 3.1

Convert your cleaned education variable to a factor with levels in logical order: “High School”, “College”, “Graduate”.

# Hint: factor(x, levels = c(...))
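To see what the levels argument changes, the sketch below compares the default level order with an explicit one on a hypothetical character vector (sizes and its values are illustrative only). Without levels, factor() sorts the unique values alphabetically.

```r
# Hypothetical values, unrelated to the assignment data
sizes <- c("medium", "small", "large", "small")

# Default: levels are the sorted unique values
sizes_default <- factor(sizes)
levels(sizes_default)

# Explicit: levels follow the order you supply
sizes_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_ordered)
```

Plots and tables use the level order, so the explicit version is what controls axis order in ggplot2.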

Task 3.2

Create a bar chart of education levels. What happens if you don’t set the factor levels? Show both versions.

Task 3.3

Using the survey data, create a factor version of age_group with levels in order: “young”, “middle”, “older”. Then create a boxplot of scale_mean by age_group. The x-axis should be in the correct order.

Part 4: Putting It Together (10 points)

Create a complete data cleaning pipeline for the survey data that:

Filters to participants who passed the attention check
Reverse-codes the necessary items
Creates the scale mean
Creates the age group factor (in correct order)
Selects only id, age, age_group, and scale_mean

Use a single piped sequence.
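The shape of a single piped sequence can be sketched on hypothetical toy data with a 1-5 scale (toy, valid, score_a, score_b_rev, and total are illustrative names, and the steps are deliberately simpler than the ones listed above). Each step receives the output of the previous one, so no intermediate objects are needed.

```r
library(dplyr)

# Hypothetical toy data on a 1-5 scale
toy <- tibble(
  id = 1:4,
  valid = c(TRUE, FALSE, TRUE, TRUE),
  score_a = c(2, 5, 4, 1),
  score_b_rev = c(5, 3, 1, 2)
)

clean <- toy |>
  filter(valid) |>                      # keep rows flagged as valid
  mutate(
    score_b = 6 - score_b_rev,          # reverse-code on a 1-5 scale
    total = score_a + score_b           # combine the corrected items
  ) |>
  select(id, total)                     # keep only the final columns

clean
```

Order matters: the reverse-coded column must be created before any step that uses it.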

Part 5: Text Analysis (15 points)

Open-ended survey responses often contain valuable data. Here are responses from the same study:

open_responses <- tibble(
  id = 1:15,
  response = c(
    "I felt very stressed and anxious throughout the study",
    "The tasks were challenging but I found them interesting",
    "I was confused by some of the instructions at first",
    "Participating made me feel nervous but also curious",
    "The study was well organized and the instructions were clear",
    "I felt overwhelmed at times but managed to stay focused",
    "Some questions were confusing and hard to understand",
    "I enjoyed the experience and learned something new",
    "The tasks were repetitive and I lost focus near the end",
    "I felt calm and confident during the experiment",
    "It was stressful but I liked being part of research",
    "The instructions were clear and easy to follow",
    "I felt anxious about my performance on the tasks",
    "Very interesting study about human behavior and emotion",
    "I found the experience valuable and thought-provoking"
  )
)

Task 5.1

Use unnest_tokens() from the tidytext package to tokenize the responses into individual words.

# Hint:
open_responses |>
  unnest_tokens(word, response)

Task 5.2

Remove common stop words (e.g., “the”, “and”, “i”; unnest_tokens() lowercases words by default) using anti_join(stop_words). How many unique words remain?
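To show what anti_join() does here, the sketch below uses a hand-made token table and a tiny, hypothetical stop list (tokens and my_stops are assumptions for illustration; tidytext's stop_words is a much larger table that also has a word column). An anti-join keeps only the rows of the first table that have no match in the second.

```r
library(dplyr)

# Hypothetical tokens, as if already produced by unnest_tokens()
tokens <- tibble(
  word = c("the", "room", "was", "quiet", "and", "the", "room", "felt", "warm")
)

# Hypothetical mini stop list with the same column name
my_stops <- tibble(word = c("the", "was", "and"))

# Keep only the words that do not appear in the stop list
kept <- tokens |>
  anti_join(my_stops, by = "word")

# Number of distinct words left after filtering
n_unique <- n_distinct(kept$word)
n_unique
```

Repeated words such as "room" remain as separate rows, which is why n_distinct() and nrow() can give different answers.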

Task 5.3

Count word frequencies and create a bar chart of the 10 most common words. What themes emerge from the responses?
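One possible shape for this step, sketched on hypothetical toy words (words, counts, top, and p are illustrative names, and the top 2 are taken instead of the top 10): count(), then slice_max(), then a horizontal bar chart ordered by frequency.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tokens, unrelated to the assignment responses
words <- tibble(
  word = c("room", "quiet", "room", "felt", "warm", "room", "quiet")
)

# One row per word with its frequency in column n, most frequent first
counts <- words |>
  count(word, sort = TRUE)

# Keep the most frequent words; with_ties = FALSE returns exactly n rows
top <- counts |>
  slice_max(n, n = 2, with_ties = FALSE)

# reorder() sorts the bars by frequency instead of alphabetically
p <- ggplot(top, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL)
```

By default slice_max() keeps ties, so it can return more rows than requested; whether to keep or drop them is a judgment call worth stating in your write-up. Typing p on its own line displays the chart.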

Grading Rubric

Component	Points
Part 1: Logical vectors & recoding	25
Part 2: Strings	20
Part 3: Factors	20
Part 4: Complete pipeline	10
Part 5: Text analysis	15
Code runs without errors	10
Total	100

Submission

Submit on Canvas:

Your .qmd file with all code and narrative
Your rendered HTML file

Before submitting, check that:

All data cleaning operations work correctly
Factors are in the correct order
The pipeline produces a clean final dataset
The text analysis section is complete and includes a visualization