Assignment 6: Data Types & Wrangling

Due by 11:59 PM on Sunday, May 17, 2026

Note: Assignment Details

Assigned: Monday, May 11 (Session 13)
Due: Sunday, May 17 at 11:59 PM
Submit: Quarto document (.qmd) AND rendered HTML on Canvas

Overview

This assignment gives you practice with four kinds of data: logical vectors, numbers, strings, and factors. You will recode, clean, and summarize them, which are the steps that real-world psychology data usually needs before analysis.

Setup

# Assignment 6: Data Types & Wrangling
# Your Name
# Date

library(tidyverse)
library(tidytext)

# If you haven't installed tidytext:
# install.packages("tidytext")

Part 1: Logical Vectors & Recoding (25 points)

Here’s a simulated survey dataset:

set.seed(410)  # For reproducibility

survey <- tibble(
  id = 1:100,
  age = round(rnorm(100, mean = 35, sd = 12)),
  q1 = sample(1:7, 100, replace = TRUE),
  q2 = sample(1:7, 100, replace = TRUE),
  q3_reverse = sample(1:7, 100, replace = TRUE),  # Needs reverse coding
  q4 = sample(1:7, 100, replace = TRUE),
  q5_reverse = sample(1:7, 100, replace = TRUE),  # Needs reverse coding
  attention_check = sample(c(4, 1:3, 5:7), 100, replace = TRUE,
                           prob = c(0.85, rep(0.025, 6)))
)

Task 1.1

Create a logical variable passed_attention that is TRUE if attention_check == 4 and FALSE otherwise. How many participants passed?
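A comparison such as `x == 4` already returns a logical vector, and `sum()` counts its TRUE values. The sketch below shows this on a hypothetical toy tibble (the names `toy`, `check`, and `passed` are illustrative and are not the assignment's data):

```r
library(dplyr)

# Hypothetical toy data, not the assignment's survey tibble
toy <- tibble(
  id = 1:5,
  check = c(4, 4, 2, 4, 7)
)

toy <- toy |>
  mutate(passed = check == 4)  # the comparison yields TRUE/FALSE directly

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of passes
n_passed <- sum(toy$passed)
n_passed
```

Because the comparison is itself logical, wrapping it in `if_else(..., TRUE, FALSE)` would be redundant.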

Task 1.2

Reverse-code q3_reverse and q5_reverse. On a 1-7 scale, reverse coding means: new_value = 8 - old_value. Create new variables q3 and q5 with the corrected values.
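The constant 8 comes from adding the scale's maximum and minimum (7 + 1). A minimal sketch on a hypothetical reverse-keyed item (`item_r` is an illustrative name) makes the mapping visible:

```r
library(dplyr)

# Hypothetical reverse-keyed item on a 1-7 scale
toy <- tibble(item_r = c(1, 4, 7, 2))

toy <- toy |>
  mutate(item = 8 - item_r)  # (scale max + scale min) - old value

toy$item
```

The endpoints swap (1 becomes 7, 7 becomes 1) and the midpoint 4 stays at 4, which is a quick sanity check for any reverse-coded item.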

Task 1.3

Create a scale_mean variable that is the mean of q1, q2, q3, q4, and q5 for each participant.
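Calling `mean()` on several columns inside `mutate()` does not produce a per-row average, because `mean()` summarizes a single vector. One possible approach, sketched here on hypothetical columns `x1` to `x3`, is `rowwise()` with `c_across()`:

```r
library(dplyr)

# Hypothetical toy items, not the assignment's columns
toy <- tibble(
  id = 1:3,
  x1 = c(1, 4, 7),
  x2 = c(3, 4, 5),
  x3 = c(5, 4, 3)
)

toy <- toy |>
  rowwise() |>                                  # treat each row as its own group
  mutate(row_mean = mean(c_across(x1:x3))) |>   # average x1, x2, x3 within the row
  ungroup()                                     # return to ordinary column-wise behavior

toy$row_mean
```

Adding the columns and dividing by their count gives the same result; `rowwise()` is just one option and tends to read better when there are many items.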

Task 1.4

Create an age_group variable using case_when():

  • “young” if age < 30
  • “middle” if age >= 30 and age < 50
  • “older” if age >= 50
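`case_when()` checks its conditions in order and uses the first one that is TRUE, so later branches can rely on earlier ones having failed. A small sketch on a hypothetical `score` column with made-up cut points:

```r
library(dplyr)

# Hypothetical scores and cut points, chosen only for illustration
toy <- tibble(score = c(12, 50, 49, 80, 100))

toy <- toy |>
  mutate(
    band = case_when(
      score < 50  ~ "low",
      score < 80  ~ "medium",  # reached only when score >= 50
      score >= 80 ~ "high"
    )
  )

toy$band
```

Values that match no condition become NA, so it is worth checking the boundary cases (here 50 and 80) after recoding.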

Part 2: Strings (20 points)

Here’s some messy demographic data:

messy_demo <- tibble(
  id = 1:10,
  gender = c("Male", "female", "FEMALE", "M", "F", "male", "Female", "m", "f", "Male"),
  education = c("high school", "High School", "college", "College", "graduate",
                "COLLEGE", "Graduate", "high school", "Graduate", "college")
)

Task 2.1

Clean the gender variable so all values are standardized to “Male” or “Female”.

Hint: Use str_to_lower() first, then case_when() or if_else().
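The idea behind the hint is that lowercasing first reduces the number of variants to match. A sketch of that pattern on a hypothetical yes/no column (the `answer` values are invented for illustration):

```r
library(dplyr)
library(stringr)

# Hypothetical messy yes/no answers
toy <- tibble(answer = c("Yes", "y", "NO", "n", "yes", "No"))

toy <- toy |>
  mutate(
    answer_lower = str_to_lower(answer),            # collapse case differences first
    answer_clean = case_when(
      answer_lower %in% c("yes", "y") ~ "Yes",      # full word or abbreviation
      answer_lower %in% c("no", "n")  ~ "No"
    )
  )

toy$answer_clean
```

Anything not listed in either branch would come out as NA, which makes unexpected spellings easy to spot with `count()`.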

Task 2.2

Clean the education variable similarly. Standardize to “High School”, “College”, or “Graduate”.
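When the variants differ only in capitalization, stringr's `str_to_title()` may be enough on its own. A brief sketch on a hypothetical vector of city names:

```r
library(stringr)

# Hypothetical values that differ only in capitalization
cities <- c("new york", "NEW YORK", "New york", "los angeles")

cities_clean <- str_to_title(cities)  # capitalize the first letter of each word
cities_clean
```

This only fixes case; abbreviations or misspellings would still need explicit recoding.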

Part 3: Factors (20 points)

Task 3.1

Convert your cleaned education variable to a factor with levels in logical order: “High School”, “College”, “Graduate”.

# Hint: factor(x, levels = c(...))
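To see what the `levels` argument changes, the following sketch compares the default level order with an explicit one on a hypothetical `sizes` vector:

```r
# Hypothetical character vector, used only to illustrate level ordering
sizes <- c("medium", "small", "large", "small")

# Without levels =, factor() sorts the unique values
sizes_default <- factor(sizes)
levels(sizes_default)

# With levels =, the order is whatever you declare
sizes_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_ordered)
```

Values that do not appear in `levels` are silently converted to NA, so the spelling in `levels` has to match the data exactly.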

Task 3.2

Create a bar chart of education levels. What happens if you don’t set the factor levels? Show both versions.
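A rough sketch of the comparison, using a hypothetical `size` variable rather than the assignment data: ggplot2 places a character column's categories in alphabetical order on a discrete axis, while a factor column follows its level order.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with a character column and a factor copy of it
toy <- tibble(size = c("medium", "small", "large", "small", "medium", "small")) |>
  mutate(size_f = factor(size, levels = c("small", "medium", "large")))

# Character column: bars appear alphabetically (large, medium, small)
p_default <- ggplot(toy, aes(x = size)) +
  geom_bar()

# Factor column: bars follow the declared level order (small, medium, large)
p_ordered <- ggplot(toy, aes(x = size_f)) +
  geom_bar()
```

Printing `p_default` and `p_ordered` one after the other shows the two orderings.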

Task 3.3

Using the survey data, create a factor version of age_group with levels in order: “young”, “middle”, “older”. Then create a boxplot of scale_mean by age_group. The x-axis should be in the correct order.

Part 4: Putting It Together (10 points)

Create a complete data cleaning pipeline for the survey data that:

  1. Filters to participants who passed the attention check
  2. Reverse-codes the necessary items
  3. Creates the scale mean
  4. Creates the age group factor (in correct order)
  5. Selects only id, age, age_group, and scale_mean

Use a single piped sequence.
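The shape of such a pipeline (filter, then mutate, then select) can be sketched on a much smaller hypothetical dataset. All column names and the cut point below are invented for illustration:

```r
library(dplyr)

# Hypothetical toy data: one attention check, one regular and one reverse-keyed item
toy <- tibble(
  id = 1:4,
  check = c(4, 2, 4, 4),
  item1 = c(2, 5, 6, 1),
  item2_r = c(6, 3, 2, 7)
)

clean <- toy |>
  filter(check == 4) |>                       # keep rows that pass the check
  mutate(
    item2 = 8 - item2_r,                      # reverse-code on a 1-7 scale
    total = (item1 + item2) / 2,              # per-row mean of the two items
    level = factor(
      if_else(total < 4, "low", "high"),      # illustrative cut point
      levels = c("low", "high")
    )
  ) |>
  select(id, total, level)                    # keep only the analysis columns

clean
```

Later arguments in one `mutate()` call can use columns created earlier in the same call, which is what lets `total` refer to `item2` here.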

Part 5: Text Analysis (15 points)

Open-ended survey responses often contain valuable data. Here are responses from the same study:

open_responses <- tibble(
  id = 1:15,
  response = c(
    "I felt very stressed and anxious throughout the study",
    "The tasks were challenging but I found them interesting",
    "I was confused by some of the instructions at first",
    "Participating made me feel nervous but also curious",
    "The study was well organized and the instructions were clear",
    "I felt overwhelmed at times but managed to stay focused",
    "Some questions were confusing and hard to understand",
    "I enjoyed the experience and learned something new",
    "The tasks were repetitive and I lost focus near the end",
    "I felt calm and confident during the experiment",
    "It was stressful but I liked being part of research",
    "The instructions were clear and easy to follow",
    "I felt anxious about my performance on the tasks",
    "Very interesting study about human behavior and emotion",
    "I found the experience valuable and thought-provoking"
  )
)

Task 5.1

Use unnest_tokens() from the tidytext package to tokenize the responses into individual words.

# Hint:
open_responses |>
  unnest_tokens(word, response)
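By default `unnest_tokens()` also lowercases the text and strips punctuation, and it keeps the other columns so each word stays linked to its respondent. A tiny sketch on hypothetical text:

```r
library(dplyr)
library(tidytext)

# Hypothetical two-row example
toy <- tibble(id = 1:2, text = c("Hello, world!", "Hello again"))

tokens <- toy |>
  unnest_tokens(word, text)  # one row per word; output column first, input column second

tokens
```

The result has one row per word, with `id` repeated for every word from the same response.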

Task 5.2

Remove common stop words (e.g., “the”, “and”, “I”) using anti_join(stop_words). How many unique words remain?
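`anti_join()` keeps the rows of the first table that have no match in the second, so joining against tidytext's `stop_words` table on its `word` column drops the stop words. A sketch on a hypothetical sentence:

```r
library(dplyr)
library(tidytext)

# Hypothetical sentence, tokenized into one word per row
toy <- tibble(id = 1, text = "The experiment was stressful and the task was long") |>
  unnest_tokens(word, text)

kept <- toy |>
  anti_join(stop_words, by = "word")  # drop rows whose word appears in stop_words

# Number of distinct words left after stop-word removal
n_unique <- n_distinct(kept$word)
kept$word
```

Spelling out `by = "word"` is optional, but it avoids the message dplyr prints when it has to guess the join column.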

Task 5.3

Count word frequencies and create a bar chart of the 10 most common words. What themes emerge from the responses?
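One possible way to go from a column of words to a frequency chart is `count()` followed by `slice_head()` and `geom_col()`. The words below are hypothetical, and the top 2 stand in for the top 10:

```r
library(dplyr)
library(ggplot2)

# Hypothetical tokenized words
words <- tibble(word = c("calm", "tired", "calm", "bored", "calm", "tired"))

top_words <- words |>
  count(word, sort = TRUE) |>   # one row per word, most frequent first
  slice_head(n = 2)             # keep the first 2 rows (use n = 10 for a top-10 list)

# reorder() sorts the bars by frequency instead of alphabetically
p <- ggplot(top_words, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = "Word")
```

With ties at the cutoff, `slice_max(n, n = 10)` keeps all tied words, while `slice_head()` keeps exactly the requested number of rows.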

Grading Rubric

Component                             Points
Part 1: Logical vectors & recoding        25
Part 2: Strings                           20
Part 3: Factors                           20
Part 4: Complete pipeline                 10
Part 5: Text analysis                     15
Code runs without errors                  10
Total                                    100

Submission