Text Mining Module 2: A Conceptual Overview
“How are our students actually doing?”
We will once again rely on tidyverse and tidytext for this analysis.
tidyverse: The standard collection of packages for data cleaning.
tidytext: The package we use to handle text like a spreadsheet, with one word per row. The result is long, but it makes this analysis much easier!
# A tibble: 6 × 8
text created_at author_id id conversation_id source
<chr> <dttm> <dbl> <dbl> <dbl> <chr>
1 "@catturd2 Hmmmm… 2021-01-02 00:49:28 1.61e 9 1.35e18 1.35e18 Twitt…
2 "@homebrew1500 I… 2021-01-02 00:40:05 1.25e18 1.35e18 1.35e18 Twitt…
3 "@ClayTravis Dum… 2021-01-02 00:32:46 8.88e17 1.35e18 1.35e18 Twitt…
4 "@KarenGunby @ch… 2021-01-02 00:24:01 1.25e18 1.35e18 1.35e18 Twitt…
5 "@keith3048 I kn… 2021-01-02 00:23:42 1.25e 9 1.35e18 1.35e18 Twitt…
6 "Probably common… 2021-01-02 00:18:38 1.28e18 1.35e18 1.35e18 Twitt…
# ℹ 2 more variables: possibly_sensitive <lgl>, in_reply_to_user_id <dbl>
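Before any sentiment lexicon is applied, a table like this is usually reshaped to one word per row and stripped of stop words. The following is a minimal sketch of those two steps, not the exact code behind these slides. The `feedback` tibble and its two example sentences are hypothetical stand-ins for the real data; `unnest_tokens()`, `anti_join()`, and the `stop_words` table come from tidytext and dplyr.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the real data: one row per response
feedback <- tibble(
  id   = 1:2,
  text = c("The lectures were outstanding",
           "The final exam was a catastrophic mess")
)

# One word per row; unnest_tokens() lowercases and strips punctuation by default
tidy_feedback <- feedback %>%
  unnest_tokens(word, text)

# Drop common filler words using the stop_words table shipped with tidytext
clean_feedback <- tidy_feedback %>%
  anti_join(stop_words, by = "word")
```

The result keeps the `id` column, so each remaining word can still be traced back to the response it came from.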
The unnest_tokens() function splits your long sentences into individual rows, one word per row.
We use anti_join(stop_words) to automatically strip out stop words (common filler words such as “the” and “and”) so we can focus on the “meat” of the feedback.
Bing: The Simple Binary
Source: Created by Bing Liu and collaborators
Format: Categorical data that labels words as either Positive or Negative
Best for: General “thumbs up/down” vibe checks of a course
AFINN: The Weighted Score
Format: Assigns a score from -5 (Very Negative) to +5 (Very Positive)
Example: “Outstanding” = +5; “Okay” = +1; “Catastrophic” = -5
Best for: Measuring the intensity of sentiment, provided it lies on a single spectrum
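To make the idea of a weighted score concrete, here is a small sketch that adds up word scores per response. It assumes the lexicon described here is AFINN, since the -5 to +5 scale matches. The `toy_scores` table is hypothetical and holds only the three example values listed above; in practice `get_sentiments("afinn")` from tidytext returns the full list with the same `word` and `value` columns, after a one-time download through the textdata package. The `feedback_words` table is also made up for illustration.

```r
library(dplyr)

# Toy lexicon built from the three example scores above (not the full AFINN list)
toy_scores <- tibble(
  word  = c("outstanding", "okay", "catastrophic"),
  value = c(5, 1, -5)
)

# Hypothetical tokenized feedback: one word per row, tagged by response
feedback_words <- tibble(
  response = c(1, 1, 2, 2, 2),
  word     = c("outstanding", "lectures", "okay", "catastrophic", "exam")
)

# Keep only the words that have a score, then total the scores per response
response_scores <- feedback_words %>%
  inner_join(toy_scores, by = "word") %>%
  group_by(response) %>%
  summarise(score = sum(value))
```

Summing is only one possible choice. A mean per response would keep long answers from dominating short ones.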
NRC: Sentiment Beyond Happy and Sad
Format: Tags words with 8 basic emotions (Joy, Anger, Fear, etc.) and 2 sentiments
Example: The word “final” might be tagged with Anticipation or Fear
Best for: Deep dives into specific student sentiment or engagement
anti_join removes stop words during cleaning, but it still leaves tokens that are neither “real” words nor present in any sentiment lexicon (e.g., X/Twitter handles).
We can use inner_join to ask R to take student feedback and only keep the words that appear in a given sentiment dictionary.
This creates a smaller, even less noisy dataset, but it has to be tailored to each lexicon.
The example below uses NRC:
# A tibble: 10 × 2
sentiment n
<chr> <int>
1 anger 8000
2 anticipation 9756
3 disgust 5913
4 fear 8173
5 joy 7894
6 negative 16783
7 positive 53188
8 sadness 6743
9 surprise 4814
10 trust 17595
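A count table of this shape can be produced with an `inner_join()` followed by `count()`. The sketch below shows the pattern on toy data and is not the code that generated the numbers above. Both `toy_nrc` and `clean_words` are hypothetical, and the word tags are illustrative rather than copied from the real lexicon. With the real lexicon, `get_sentiments("nrc")` from tidytext would take the place of `toy_nrc`, after a one-time download through the textdata package.

```r
library(dplyr)

# Toy stand-in for an NRC-style lexicon: one word can carry several tags
toy_nrc <- tibble(
  word      = c("final", "final", "love", "love", "confusing"),
  sentiment = c("anticipation", "fear", "joy", "positive", "negative")
)

# Hypothetical cleaned tokens; "profsmith42" stands for a handle left over after tokenizing
clean_words <- tibble(
  word = c("final", "love", "love", "confusing", "profsmith42", "syllabus")
)

sentiment_counts <- clean_words %>%
  count(word, name = "uses") %>%        # collapse repeated words first
  inner_join(toy_nrc, by = "word") %>%  # words missing from the lexicon drop out
  count(sentiment, wt = uses)           # total word uses per sentiment tag
```

Counting words before the join keeps each token row unique, so a word with several tags simply contributes its count to each of them. Because one word can land in more than one category, the category totals can add up to more than the number of matched words.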
Use ggplot2 to bring these numbers to life
wordcloud
Despite these benefits, dictionaries can’t “read”:
Where could you see dictionary-based sentiment analysis coming in handy in your work?
What are some ways that your college or university may read too much into sentiment analysis for their own evaluation and policy development?

This work was supported by the National Science Foundation grants DRL-2025090 and DRL-2321128 (ECR:BCSER). Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.