13: Strings, Factors & Text

Content for Monday, May 11, 2026

Before class

📖 Readings:

Important: Assignment 5 is due today

Assignment 5: Exploratory Data Analysis — due Sunday, May 10 at 11:59 PM.

During class

We’ll cover:

  • Strings — creating, combining, and basic cleaning
  • str_to_lower(), str_to_upper(), str_trim() for cleanup
  • str_detect() and str_replace() for simple pattern matching
  • Factors — categorical data with a fixed set of levels
  • Why factor order matters for plots and tables
  • Reordering factors: fct_relevel(), fct_reorder(), fct_infreq()
  • Recoding factors: fct_recode(), fct_collapse()
  • Practical application: cleaning demographic variables
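
A minimal sketch of the string functions listed above, assuming the stringr package is installed. The `responses` vector is made up for illustration; it stands in for a free-response demographic column.

```r
library(stringr)

# Hypothetical free-response answers with stray whitespace and mixed case
responses <- c("  Female", "male ", "FEMALE", " Non-binary  ")

# Trim whitespace from both ends, then convert to lowercase
cleaned <- str_to_lower(str_trim(responses))
cleaned
# "female" "male" "female" "non-binary"

# str_detect() returns TRUE or FALSE for each element
str_detect(cleaned, "male")
# TRUE TRUE TRUE FALSE

# Anchor the pattern to match the whole string only
str_detect(cleaned, "^male$")
# FALSE TRUE FALSE FALSE

# str_replace() replaces the first match in each element
str_replace(cleaned, "-", " ")
# "female" "male" "female" "non binary"
```

The pattern "male" also matches inside "female", which is why the third call anchors it with `^` and `$`.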

Tip: Assignment 6 is assigned today

Assignment 6: Data Types & Wrangling — due Sunday, May 17 at 11:59 PM.

After class

Practice:

  1. Clean up a messy text column using str_trim() and str_to_lower() — try it on free-response demographic data
  2. Use str_detect() to filter rows where a column contains a specific word
  3. Convert a character column to a factor with factor(). What happens to the levels?
  4. Reorder bars in a bar chart using fct_infreq() — most common category first
  5. Use fct_recode() to combine similar categories (e.g., “Male”, “male”, and “M”)
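
As a starting point for practice items 3 and 4, here is a small sketch of what happens to the levels, assuming the forcats package is installed. The `condition` vector is a made-up example, not course data.

```r
library(forcats)

# Hypothetical condition labels, as they might appear in a character column
condition <- c("Low", "High", "Control", "High", "Low", "High")

# factor() sorts the unique values alphabetically to create the levels
f <- factor(condition)
levels(f)
# "Control" "High" "Low"

# fct_infreq() reorders the levels from most to least common
levels(fct_infreq(f))
# "High" "Low" "Control"
```

Only the level order changes; the values stored in the vector stay the same.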

Note: Why factors matter for plots

By default, R sorts the levels of a categorical variable alphabetically, so ggplot2 draws the bars in that order. That is rarely the order you want in a plot. Factors let you control it:

library(tidyverse)  # loads ggplot2, dplyr, and forcats

# `data` is a data frame with a categorical column `condition`

# Alphabetical (default): not great
ggplot(data, aes(x = condition)) + geom_bar()

# Ordered by frequency: much better
ggplot(data, aes(x = fct_infreq(condition))) + geom_bar()

# Custom order: you decide
data |>
  mutate(condition = fct_relevel(condition, "Control", "Low", "High")) |>
  ggplot(aes(x = condition)) + geom_bar()
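
The examples above cover fct_infreq() and fct_relevel(). The sketch below shows the third reordering function from the list, fct_reorder(), which orders levels by a summary of another variable. The `scores` data frame and its values are hypothetical, and the sketch assumes dplyr, forcats, and ggplot2 are installed.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical data: two scores per condition
scores <- data.frame(
  condition = c("Control", "Control", "Low", "Low", "High", "High"),
  score     = c(30, 34, 10, 12, 20, 22)
)

# Order the condition levels by their median score, lowest first
scores <- scores |>
  mutate(condition = fct_reorder(condition, score, .fun = median))

levels(scores$condition)
# "Low" "High" "Control"

# Boxes now appear in order of median score; print `p` to draw the plot
p <- ggplot(scores, aes(x = condition, y = score)) + geom_boxplot()
```

Setting `.desc = TRUE` in fct_reorder() reverses the order so the highest median comes first.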

Psychology application: cleaning demographics

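# Common cleanup pipeline for survey demographics
# (assumes the tidyverse is loaded and `survey` has `gender` and `education` columns)
survey |>
  mutate(
    # Standardize whitespace and case before recoding
    gender = str_to_lower(str_trim(gender)),
    # fct_recode() takes "new" = "old" pairs; unlisted values are left unchanged
    gender = fct_recode(as_factor(gender),
      "Man" = "male",
      "Man" = "m",
      "Woman" = "female",
      "Woman" = "f"
    ),
    # Put the education levels in their natural order
    education = fct_relevel(education,
      "High school", "Some college", "Bachelor's", "Graduate"
    )
  )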
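
The same recoding could also be written with fct_collapse(), which groups several old levels under one new name. This is a sketch on a made-up data frame, `survey_demo`, because the real `survey` data is not shown here; it assumes dplyr, forcats, and stringr are installed.

```r
library(dplyr)
library(forcats)
library(stringr)

# Hypothetical responses standing in for the real survey data
survey_demo <- data.frame(
  gender = c(" Male", "f", "FEMALE ", "m", "female")
)

survey_demo <- survey_demo |>
  mutate(
    gender = str_to_lower(str_trim(gender)),
    # Each argument is one new level and a vector of old levels to merge
    gender = fct_collapse(gender,
      "Man"   = c("male", "m"),
      "Woman" = c("female", "f")
    )
  )

# Count each level to confirm that no stray spellings remain
count(survey_demo, gender)
```

With this made-up data the count should show two rows, "Man" with 2 responses and "Woman" with 3. Running count() after any recoding step is a quick way to spot values the recode missed.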