Working with Sentiment Lexicons

Welcome to the Text Mining Code Along for Module 2

The Text Mining course is designed for those seeking an introductory understanding of quantifying the text in documents to better understand their properties.
The following Code Along is a companion to the Explore stages of the Module 2 Case Study.

Figure 2.2 Steps of Data-Intensive Research Workflow

[@krumm2018]

Module Objectives

By the end of this module, we will:

Visualize the date range of our ccss-tweets dataset
Summarize and compare public sentiment using summary statistics
Visualize sentiment through a pie chart

Context of the Problem

Rosenberg, J. M., Borchers, C., Dyer, E. B., Anderson, D., & Fischer, C. (2021). Understanding Public Sentiment About Educational Reforms: The Next Generation Science Standards on Twitter. AERA Open, 7. https://doi.org/10.1177/23328584211024261

Research Questions:

What is the public sentiment expressed toward the NGSS?
How does sentiment for NGSS compare to sentiment for CCSS?

Article Abstract:

While the Next Generation Science Standards (NGSS) are a long-standing and widespread standards-based educational reform effort, they have received less public attention, and no studies have explored the sentiment of the views of multiple stakeholders toward them.

To establish how public sentiment about this reform might be similar to or different from past efforts, we applied a suite of data science techniques to posts about the standards on Twitter from 2010-2020 (N = 571,378) from 87,719 users.

Applying data science techniques to identify teachers and to estimate tweet sentiment, we found that the public sentiment towards the NGSS is overwhelmingly positive—33 times more so than for the CCSS.

Mixed effects models indicated that sentiment became more positive over time and that teachers, in particular, showed a more positive sentiment towards the NGSS. We discuss implications for educational reform efforts and the use of data science methods for understanding their implementation.

Load the tidyverse, tidytext, and textdata packages using library()

library(tidyverse)
library(tidytext)
library(textdata)

Time Series

Quick Viz
👉 Your Turn ⤵
Answer

Let’s take a very quick look at the number of daily tweets over the first 5 months of 2020:

daily_tweets <- ss_tweets |>
  mutate(tweet_date = as.Date(created_at)) |>
  group_by(tweet_date) |>
  summarise(count = n())
# Plot a line chart of the number of tweets over time
ggplot(daily_tweets, aes(x = tweet_date, y = count)) +
  geom_line(color = "#CC0000") +
  labs(
    title = "Number of Tweets Over Time",
    x = "2020",
    y = "Tweet Count")

Now recycle and modify the previous code to plot each standard separately, so we can compare the number of tweets over time for the Next Generation Science Standards and the Common Core State Standards:

daily_tweets <- ss_tweets |>
  mutate(tweet_date = as.Date(created_at)) |>
  group_by(standards, tweet_date) |> 
  summarise(count = n()) |>
  ungroup()

ggplot(daily_tweets, aes(x = tweet_date, y = count, color = standards, group = standards)) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  # Manually define colors for each standard
  scale_color_manual(values = c("steelblue", "#CC0000")) +
  labs(
    title = "Number of Tweets Over Time",
    x = "2020",
    y = "Tweet Count",
    color = "Standard"
  ) + 
  theme(legend.position = "top")

Sentiment Summaries

Since our primary goal is to compare public sentiment around the NGSS and CCSS state standards, in this section we put together some basic numerical summaries using our different lexicons to see whether tweets are generally more positive or negative for each standard, and how the two standards differ. To do this, we revisit the following dplyr functions:

count() lets you quickly count the unique values of one or more variables
group_by() takes a data frame and one or more variables to group by
summarise() creates a numerical summary of data using functions like mean() and median()
mutate() adds new variables and preserves existing ones

And introduce one new function:

pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer().

Sentiment Counts

bing
👉 Your Turn ⤵
Answer

Let’s start with bing, our simplest sentiment lexicon, and use the count function to count how many times “positive” and “negative” occur in the sentiment column of our sentiment_bing data frame:

summary_bing <- count(sentiment_bing, sentiment, sort = TRUE)
summary_bing

# A tibble: 2 × 2
  sentiment     n
  <chr>     <int>
1 negative  20186
2 positive  16042

Since our main goal is to compare positive and negative sentiment between CCSS and NGSS, let’s use the group_by function again to get sentiment summaries for NGSS and CCSS separately:

summary_bing <- sentiment_bing |> 
  group_by(standards) |> 
  count(sentiment) 

summary_bing

# A tibble: 4 × 3
# Groups:   standards [2]
  standards sentiment     n
  <chr>     <chr>     <int>
1 ccss      negative  18391
2 ccss      positive  10290
3 ngss      negative   1795
4 ngss      positive   5752

Compare Sentiment Counts

First, let’s untidy our data a little by using the pivot_wider function from the tidyr package to transform our sentiment column into separate negative and positive columns that contain the n counts for each:

# A tibble: 2 × 3
# Groups:   standards [2]
  standards negative positive
  <chr>        <int>    <int>
1 ccss         18391    10290
2 ngss          1795     5752
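
The code for this step is not shown above. A minimal sketch that reproduces the table, assuming the grouped counts from the previous step are available; here they are rebuilt by hand from the printed output so the example runs on its own, and the names `bing_counts` and `bing_wide` are hypothetical:

```r
library(dplyr)
library(tidyr)

# Stand-in for the grouped summary_bing counts, copied from the output printed earlier
bing_counts <- tribble(
  ~standards, ~sentiment, ~n,
  "ccss",     "negative", 18391L,
  "ccss",     "positive", 10290L,
  "ngss",     "negative",  1795L,
  "ngss",     "positive",  5752L
) |>
  group_by(standards)

# Move each value of sentiment into its own column, filled with the n counts
bing_wide <- bing_counts |>
  pivot_wider(names_from = sentiment, values_from = n)

bing_wide
```

Because `standards` is the only column left over after `sentiment` and `n` are used up, pivot_wider keeps one row per standard.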

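Finally, we’ll use the mutate function to create three new variables: sentiment (the positive count minus the negative count), ratio (the positive count divided by the negative count), and lexicon, so we have a single sentiment score, a relative measure, and the lexicon from which they were derived: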

# A tibble: 2 × 6
# Groups:   standards [2]
  lexicon standards negative positive sentiment ratio
  <chr>   <chr>        <int>    <int>     <int> <dbl>
1 bing    ccss         18391    10290     -8101 0.560
2 bing    ngss          1795     5752      3957 3.20
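
The code for this step is also not shown. The printed values are consistent with sentiment = positive - negative and ratio = positive / negative, so one possible sketch, assuming a wide table like the one from the previous step (rebuilt here by hand; the names `bing_wide` and `bing_scores` are hypothetical, and the stand-in is ungrouped, unlike the printed table):

```r
library(dplyr)

# Stand-in for the wide bing counts, copied from the output printed earlier
bing_wide <- tribble(
  ~standards, ~negative, ~positive,
  "ccss",        18391L,    10290L,
  "ngss",         1795L,     5752L
)

bing_scores <- bing_wide |>
  mutate(
    sentiment = positive - negative,  # net count of positive over negative words
    ratio     = positive / negative,  # positive words per negative word
    lexicon   = "bing"                # record which lexicon produced the scores
  ) |>
  relocate(lexicon)                   # move lexicon to the first column

bing_scores
```

A ratio below 1 (CCSS) means negative words outnumber positive ones, while a ratio above 1 (NGSS) means the opposite.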

Compute Sentiment Scores

Score
👉 Your Turn ⤵
Answer

To calculate a summary score for AFINN, we first group our data by standards again and then use the summarise function to create a new sentiment variable by summing all the positive and negative scores in the value column:

summary_afinn <- sentiment_afinn |> 
  group_by(standards) |> 
  summarise(sentiment = sum(value)) |> 
  mutate(lexicon = "AFINN") |>
  relocate(lexicon)

summary_afinn

# A tibble: 2 × 3
  lexicon standards sentiment
  <chr>   <chr>         <dbl>
1 AFINN   ccss         -14998
2 AFINN   ngss          11709

Again, CCSS is overall negative while NGSS is overall positive!

Calculate a single sentiment score for NGSS and CCSS using the NRC lexicon.

summary_nrc <- sentiment_nrc |> 
  filter(sentiment %in% c("positive", "negative")) |>
  group_by(standards) |> 
  count(sentiment, sort = TRUE) |> 
  mutate(method = "nrc")  |>
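  # pivot_wider() supersedes spread(); names_sort = TRUE keeps the columns in alphabetical order
  pivot_wider(names_from = sentiment, values_from = n, names_sort = TRUE) |>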
  mutate(sentiment = positive/negative)

summary_nrc

# A tibble: 2 × 5
# Groups:   standards [2]
  standards method negative positive sentiment
  <chr>     <chr>     <int>    <int>     <dbl>
1 ccss      nrc       16715    52867      3.16
2 ngss      nrc        2161    13847      6.41

Visualizing Sentiment

AFINN
👉 Your Turn ⤵
Answer

Let’s create a simple pie chart that we can use to visually communicate the proportion of positive and negative tweets:
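
The chart code for this step is not shown here. A minimal sketch for the NGSS tweets (an assumption, since the exercise that follows asks for CCSS), assuming a data frame with one row per tweet and `standards` and `sentiment` columns. The toy data and the names `afinn_sentiment_demo`, `ngss_counts`, and `ngss_pie` are hypothetical stand-ins, not values from the case study:

```r
library(dplyr)
library(ggplot2)

# Toy stand-in with one row per tweet (made-up values for illustration only)
afinn_sentiment_demo <- tibble(
  standards = c("ngss", "ngss", "ngss", "ngss", "ccss", "ccss"),
  sentiment = c("positive", "positive", "positive", "negative", "negative", "positive")
)

# Count positive and negative tweets for NGSS only
ngss_counts <- afinn_sentiment_demo |>
  filter(standards == "ngss") |>
  count(sentiment)

# A pie chart is a single stacked bar drawn in polar coordinates
ngss_pie <- ngss_counts |>
  ggplot(aes(x = "", y = n, fill = sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "Next Generation Science Standards",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()

ngss_pie
```

Mapping the counts to `y` and then wrapping that axis with `coord_polar(theta = "y")` is what turns the stacked bar into pie slices sized by count.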

Replicate this process to create a similar pie chart for the CCSS tweets.

afinn_counts <- afinn_sentiment |>
  group_by(standards) |> 
  count(sentiment) |>
  filter(standards == "ccss")

afinn_counts |>
  ggplot(aes(x = "", y = n, fill = sentiment)) +
  geom_bar(width = .6, stat = "identity") +
  labs(title = "Common Core State Standards",
       subtitle = "Proportion of Positive & Negative Tweets") +
  coord_polar(theta = "y") +
  theme_void()