Latent Dirichlet Allocation

Welcome to the Text Mining Code Along for Module 3

The Text Mining course is designed for those seeking an introductory understanding of quantifying the text in documents to better understand their properties.
The following Code Along is a companion to the Module 3 case study’s Model stage.

Module Objectives

This Code Along dips our toes into modeling text as data. In very simple terms, modeling involves developing a mathematical summary of a dataset, which can help us further explore trends and patterns in our data. By the end of this module we will learn how to:

Fit a Topic Model with LDA. We will learn to use the topicmodels package and its LDA() function for unsupervised classification of our forum discussions to find natural groupings of words, or topics.
Choose K. We will take an introductory look at choosing K, the number of topics we ask LDA to find in the text.

Context of the Problem

Our world is rich with data sources, and technology makes data more accessible than ever before! To help ensure students are future ready to use data for making informed decisions, many countries around the world have increased the emphasis on statistics and data analysis in school curriculum–from elementary/primary grades through college. This course allows you to learn, along with colleagues from other schools, an investigation cycle to teach statistics and to help students explore data to make evidence-based claims. To learn more about engaging learners in making inferences and claims supported by data and how to emphasize inferential reasoning in teaching statistics through posing different types of investigative questions, enroll in our Teaching Statistics through Inferential Reasoning MOOC-Ed.

Akoglu, K., Lee, H. & Kellogg, S. (2019). Participating in a MOOC and Professional Learning Team: How a Blended Approach to Professional Development Makes a Difference. Journal of Technology and Teacher Education, 27(2), 129-163.

From Abstract: Massive Open Online Courses for Educators (MOOC-Eds) provide opportunities for using research-based learning and teaching practices, along with new technological tools and facilitation approaches for delivering quality online professional development. The Teaching Statistics Through Data Investigations MOOC-Ed was built for preparing teachers in pedagogy for teaching statistics, and it has been offered to participants from around the world. During 2016-2017, professional learning teams (PLTs) were formed from a subset of MOOC-Ed participants. These teams met several times to share and discuss their learning and experiences. This study focused on examining the ways that a blended approach to professional development may result in similar or different patterns of engagement to those who only participate in a large-scale online course. Results show the benefits of a blended learning environment for retention, engagement with course materials, and connectedness within the online community of learners in an online professional development on teaching statistics. The findings suggest the use of self-forming autonomous PLTs for supporting a deeper and more comprehensive experience with self-directed online professional developments such as MOOCs. Other online professional development courses, such as MOOCs, may benefit from purposely suggesting and advertising, and perhaps facilitating, the formation of small face-to-face or virtual PLTs who commit to engage in learning together.

tidytext: Tidies text data by tokenizing and allows us to cast our document term matrix, an essential input for LDA.
topicmodels: Implements Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM) for extracting topics from text data.
ldatuning: Helps determine the optimal number of topics for LDA models using various evaluation metrics.

Load those packages into your IDE.
You may have trouble loading in one or more of these packages. Why might that be? What would you need to do first?

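# ldatuning is not typically pre-installed; install it once if it is missing
if (!requireNamespace("ldatuning", quietly = TRUE)) install.packages("ldatuning")
library(dplyr)       # count(), arrange(), and as_tibble() used below
library(tidytext)
library(topicmodels)
library(ldatuning)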

Read in Your Data

👉 Your Turn ⤵
Answer

Read in “forums_tidy.csv” located in your data folder as object forums_tidy.

forums_tidy <- read.csv("data/forums_tidy.csv", header = TRUE)

Create a Document Term Matrix

DTMs
👉 Your Turn ⤵
Answer

The Latent Dirichlet Allocation (LDA) algorithm (via LDA() in R) expects a document-term matrix (DTM) as its data input.
To create our DTM, we’ll need to first count() how many times each word occurs in each document, or post_id in our case, like so:

forums_tidy |>
  count(post_id, word) |>
  arrange(desc(n))

Use cast_dtm() on the word counts to create a DTM of forums_tidy, saving it as a new object, forums_dtm.

forums_dtm <- forums_tidy |>
  count(post_id, word) |>     # cast_dtm() needs the per-post word counts in n
  cast_dtm(post_id, word, n)
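
To make the shape of a DTM concrete, here is a minimal base-R sketch on a made-up three-post example. The toy data frame and its values are hypothetical and are not taken from the course data. It uses table() and xtabs() only to illustrate the layout (one row per document, one column per term); cast_dtm() itself returns a sparse DocumentTermMatrix object rather than a plain table.

```r
# Hypothetical tokenized posts: one row per word occurrence
toy <- data.frame(
  post_id = c(1, 1, 1, 2, 2, 3),
  word    = c("data", "data", "students", "students", "task", "data")
)

# Count how often each word occurs in each post (the role count() plays above)
toy_counts <- as.data.frame(
  table(post_id = toy$post_id, word = toy$word),
  responseName = "n"
)

# Spread the counts into a document-by-term layout
toy_dtm <- xtabs(n ~ post_id + word, data = toy_counts)
toy_dtm
```

Each cell holds the number of times a word appears in a post, so most cells in a real DTM are zero, which is why the real object is stored sparsely.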

LDA Recap

LDA Assumes:

Every document contains a mixture of topics.
Every topic contains a mixture of words.

Now we must choose a number of topics that might exist across these discussion forums. Remember: We have to identify this number ourselves!

Before running our first topic model using the LDA() function, let’s quickly recap from our readings some basic principles behind Latent Dirichlet Allocation and why LDA is often preferred over other automatic classification or clustering approaches.

Unlike simple forms of cluster analysis such as k-means clustering, LDA is a “mixture” model, which in our context means that:

Every document contains a mixture of topics. Unlike algorithms like k-means, LDA treats each document as a mixture of topics, which allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups. In practice, this means that a discussion forum post could have an estimated topic proportion of 70% for Topic 1 (i.e., be mostly about Topic 1), but also be partly about Topic 2.
Every topic contains a mixture of words. For example, if we specified in our LDA model just 2 topics for our discussion posts, we might find that one topic seems to be about pedagogy while another is about learning. The most common words in the pedagogy topic might be “teacher”, “strategies”, and “instruction”, while the learning topic may be made up of words like “understanding” and “students”. However, words can be shared between topics and words like “statistics” or “assessment” might appear in both equally.
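
The two mixture assumptions can also be written down numerically. The sketch below uses made-up probabilities: the beta and theta values and the six-word vocabulary are purely illustrative and are not estimates from the forum data. It shows how a post that is 70% pedagogy and 30% learning gets its expected word distribution as a weighted blend of the two topics.

```r
# Hypothetical per-topic word probabilities (each row sums to 1).
# "statistics" is equally likely under both topics.
beta <- rbind(
  pedagogy = c(teacher = 0.30, strategies = 0.25, instruction = 0.20,
               understanding = 0.05, students = 0.05, statistics = 0.15),
  learning = c(teacher = 0.05, strategies = 0.05, instruction = 0.05,
               understanding = 0.35, students = 0.35, statistics = 0.15)
)

# Hypothetical topic proportions for a single post (sums to 1)
theta <- c(pedagogy = 0.7, learning = 0.3)

# Expected word distribution for the post: topic proportions times topic-word probabilities
word_probs <- drop(theta %*% beta)
round(word_probs, 3)
```

Fitting an LDA model runs this logic in reverse: it starts from the observed word counts and estimates the topic proportions for each document and the word probabilities for each topic.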

Fitting a Topic Model

Since it looks like there are about 20 distinct discussion forums, we’ll use that as our value for the k = argument of LDA(). This is a number that we can adjust as we go, if it seems like it’s not capturing a full range of themes or provides redundant themes.

Creating the LDA

Now it’s time to create the LDA from the DTM using our (somewhat) arbitrary k of 20.
NOTE: This is computationally intensive, so don’t panic if R seems to take a long time to create the new forums_lda object (or even freeze!).

forums_lda <- LDA(forums_dtm,
                  k = 20,                       # number of topics to fit
                  method = "VEM",
                  control = list(seed = 588))   # seed makes the fit reproducible

Viewing LDA Output

terms()
👉 Your Turn ⤵
Answer

You can view any number of words within a topic by using the terms() function. We’ll also send this to the as_tibble() function to make the output a little easier to read:

terms(forums_lda, 5) |>
  as_tibble()

# A tibble: 5 × 20
  `Topic 1`  `Topic 2` `Topic 3`   `Topic 4` `Topic 5`  `Topic 6` `Topic 7`
  <chr>      <chr>     <chr>       <chr>     <chr>      <chr>     <chr>    
1 div        students  students    td        stats      amp       data     
2 hypothesis task      statistical 0         ap         resource  students 
3 class      data      statistics  style     feel       link      questions
4 chance     span      discussion  span      class      sharing   question 
5 http       tasks     age         width     statistics http      collect  
# ℹ 13 more variables: `Topic 8` <chr>, `Topic 9` <chr>, `Topic 10` <chr>,
#   `Topic 11` <chr>, `Topic 12` <chr>, `Topic 13` <chr>, `Topic 14` <chr>,
#   `Topic 15` <chr>, `Topic 16` <chr>, `Topic 17` <chr>, `Topic 18` <chr>,
#   `Topic 19` <chr>, `Topic 20` <chr>

Adjust the word number argument in terms() to make the word list 10 per topic instead.
Note the differences in the output. How does adjusting this “window” change your interpretation of the themes?

terms(forums_lda, 10) |>
  as_tibble()

# A tibble: 10 × 20
   `Topic 1`  `Topic 2` `Topic 3`   `Topic 4` `Topic 5`  `Topic 6` `Topic 7` 
   <chr>      <chr>     <chr>       <chr>     <chr>      <chr>     <chr>     
 1 div        students  students    td        stats      amp       data      
 2 hypothesis task      statistical 0         ap         resource  students  
 3 class      data      statistics  style     feel       link      questions 
 4 chance     span      discussion  span      class      sharing   question  
 5 http       tasks     age         width     statistics http      collect   
 6 statistics activity  agree       border    school     9         sets      
 7 null       question  results     color     teach      shoes     set       
 8 difference coke      activity    top       teaching   pdfs      analyze   
 9 section    pepsi     reading     align     resources  site      analysis  
10 href       line      approach    rgb       confident  resources collection
# ℹ 13 more variables: `Topic 8` <chr>, `Topic 9` <chr>, `Topic 10` <chr>,
#   `Topic 11` <chr>, `Topic 12` <chr>, `Topic 13` <chr>, `Topic 14` <chr>,
#   `Topic 15` <chr>, `Topic 16` <chr>, `Topic 17` <chr>, `Topic 18` <chr>,
#   `Topic 19` <chr>, `Topic 20` <chr>

Finding K

k_metrics()
View Output

k_metrics <- FindTopicsNumber(
  forums_dtm,
  topics = seq(10, 75, by = 5),
  metrics = "Griffiths2004",
  method = "Gibbs",
  control = list(),
  mc.cores = NA,
  return_models = FALSE,
  verbose = FALSE,
  libpath = NULL
)

This is just an initial approach. After comparing these results with a second approach that uses the stm package, 14 looks to be a better K for this LDA.

FindTopicsNumber_plot(k_metrics)
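
Alongside the plot, the candidate K can be read straight from the metrics table. The sketch below assumes the returned data frame has a topics column plus one column per requested metric; k_metrics_demo and its values are made up for illustration and are not results from the forum data. Griffiths2004 is a metric to maximize, so the sketch picks the row with the largest value.

```r
# Hypothetical stand-in for the data frame returned by FindTopicsNumber()
k_metrics_demo <- data.frame(
  topics        = c(10, 15, 20, 25, 30),
  Griffiths2004 = c(-52100, -51400, -51250, -51300, -51600)  # made-up values
)

# Griffiths2004 is "higher is better", so take the K with the largest value
best_row <- k_metrics_demo[which.max(k_metrics_demo$Griffiths2004), ]
best_row$topics
```

A single metric is only a starting point; the chosen K should still be checked against how interpretable the resulting topics are.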

Refitting the LDA

👉 Your Turn ⤵
Answer

Rerun LDA() with our new K of 14, saving as a forums_lda_new object.
Use terms() to compare the outputs.

# Create the refitted LDA
forums_lda_new <- LDA(forums_dtm,
                      k = 14,
                      method = "VEM",
                      control = list(seed = 588))
# Look at your new LDA results
terms(forums_lda_new, 5) |> # this example assumes 5 words per topic
  as_tibble()
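
One rough way to compare two fits is to count how many top terms each pair of topics shares. The sketch below uses two small hypothetical matrices shaped like the output of terms() (one column per topic, one row per rank). The words are made up for illustration; with the real models you would pass in terms(forums_lda, 5) and terms(forums_lda_new, 5) instead.

```r
# Hypothetical top-term matrices: one column per topic, one row per rank
terms_old <- cbind(`Topic 1` = c("data", "students", "questions"),
                   `Topic 2` = c("stats", "ap", "class"),
                   `Topic 3` = c("data", "collect", "analyze"))
terms_new <- cbind(`Topic 1` = c("data", "students", "collect"),
                   `Topic 2` = c("stats", "class", "teach"))

# Count shared top terms for every (old topic, new topic) pair;
# rows are old topics, columns are new topics
overlap <- sapply(colnames(terms_new), function(new_topic) {
  sapply(colnames(terms_old), function(old_topic) {
    length(intersect(terms_old[, old_topic], terms_new[, new_topic]))
  })
})
overlap
```

Several old topics sharing terms with the same new topic suggests that the smaller K merged themes that the larger K had split apart.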

❓Discussion

What differences did you notice between the two LDA outputs?
What do you think would happen if you asked for a K much lower than 14?