Topic Modeling

Lab 4: Conceptual Overview

Agenda

Part 1: Research Overview

  • What is Topic Modeling?

  • What research questions can topic modeling answer?

  • What are limitations & ethical considerations?

Part 2: R Code-Along

  • Document Term Matrix

  • LDA (Latent Dirichlet Allocation)

  • Finding K

A Quick Refresher

Sentiment Analysis

Figure source: Silge & Robinson, 2017

Part 1: Research Overview

Applying Topic Modeling in STEM Education Research

What is Topic Modeling?

“Topic modeling is a field of natural language processing that aims to extract themes by text mining a set of documents.” (Blei, 2012; Vijayan, 2021)

Figure source: Naskar, n.d.

Research Questions

Literature review (e.g., Chen et al., 2020)

- Which research topics was the Computers & Education community interested in?

- How did these research topics evolve over time?

Assessment (e.g., Ming & Ming, 2015)

- Do the concepts discussed by students, as inferred by pLSA (probabilistic latent semantic analysis), predict their course outcomes?

- How does the accuracy of these predictions change over time as more student work is analyzed?

Course/project evaluation (e.g., Akoglu et al., 2019)

- What are the similarities and differences in how PLT (professional learning team) members and non-PLT online participants engage and meet course goals in a MOOC-Ed designed for educators?

Take a look at the dataset located here and consider the following:

- In what format is this dataset stored?

- What are some things you notice about this dataset?

- What questions do you have about this dataset?

- Do you have a similar dataset of your own?

- What research questions do you want to address with your dataset?

What are limitations & ethical considerations?

  • Nuances of language or context may be lost
  • Limited to source material selected by researcher(s)
  • Performs poorly on small corpora
  • “Computing probabilities allows a ‘generative’ process by which a collection of new ‘synthetic documents’ can be generated that would closely reflect the statistical characteristics of the original collection” (Wikipedia).
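
The generative process mentioned in the last bullet can be made concrete with a small base R sketch. Everything here is a made-up illustration, not output from a fitted model: the vocabulary, the two topics ("math" and "writing"), their word probabilities, and the helper names rdirichlet1 and generate_doc are all assumptions.

```r
set.seed(42)  # for reproducibility

# Hypothetical vocabulary and two hypothetical topics (each row sums to 1)
vocab <- c("fraction", "decimal", "ratio", "essay", "draft", "revise")
beta <- rbind(
  math    = c(0.40, 0.30, 0.25, 0.02, 0.02, 0.01),
  writing = c(0.02, 0.02, 0.01, 0.35, 0.30, 0.30)
)
colnames(beta) <- vocab

# Draw one sample from a Dirichlet(alpha) by normalizing gamma variates
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

# Generate one synthetic document with n_words words
generate_doc <- function(n_words, alpha = c(0.5, 0.5)) {
  theta <- rdirichlet1(alpha)  # this document's topic mixture
  # Pick a topic for every word position according to theta
  z <- sample(nrow(beta), n_words, replace = TRUE, prob = theta)
  # Pick each word from the word distribution of its topic
  words <- vapply(z, function(k) sample(vocab, 1, prob = beta[k, ]), character(1))
  list(theta = theta, words = words)
}

doc <- generate_doc(10)
print(round(doc$theta, 2))
print(doc$words)
```

A synthetic document produced this way mirrors the word statistics of the topics but has no word order or meaning, which is one reason the nuances of language can be lost.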

Part 2: R Code-Along

Document Term Matrix, LDA, and Finding K

[Text Mining_Topic Modeling]

Document Term Matrix

Figure source: SPE3DLab, n.d.
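
A document term matrix has one row per document, one column per term, and word counts in the cells. The following is a minimal base R sketch on three invented documents; the text and the names d1 to d3 are assumptions for illustration. The code-along itself may build the matrix with a text mining package instead.

```r
# Three toy "documents" (hypothetical text)
docs <- c(
  d1 = "students discuss fractions and fractions practice",
  d2 = "teachers discuss assessment",
  d3 = "students practice assessment questions"
)

# Tokenize: lowercase and split on whitespace
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: sorted unique terms across all documents
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document
counts <- sapply(tokens, function(tok) {
  tabulate(match(tok, vocab), nbins = length(vocab))
})

# Transpose so that rows are documents and columns are terms
dtm <- t(counts)
colnames(dtm) <- vocab

print(dtm)
```

Most cells are zero even in this tiny example, which is why real document term matrices are usually stored in a sparse format.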

LDA

Figure source: Ma, 2019
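
As a rough sketch of how fitting LDA and comparing values of K might look in R, the example below assumes the dplyr, tidytext, and topicmodels packages are installed. The four-document corpus is invented, and K = 2 and the seed are arbitrary choices for illustration, not the settings used in the code-along.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical toy corpus: two math-flavored and two writing-flavored posts
docs <- tibble(
  doc_id = c("d1", "d2", "d3", "d4"),
  text = c(
    "fractions decimals ratios fractions practice",
    "ratios decimals fractions problems",
    "essay draft revise essay feedback",
    "draft revise feedback essay writing"
  )
)

# Tokenize, count words per document, and cast to a document term matrix
dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)

# Fit LDA with K = 2 topics; the seed makes the run reproducible
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))

# Per-topic word probabilities (beta): top 3 terms for each topic
top_terms <- tidy(lda_fit, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup()
print(top_terms)

# Per-document topic proportions (gamma)
doc_topics <- tidy(lda_fit, matrix = "gamma")
print(doc_topics)

# One rough way to compare candidate K values: perplexity (lower is better)
for (k in 2:3) {
  fit_k <- LDA(dtm, k = k, control = list(seed = 1234))
  cat("k =", k, "perplexity =", perplexity(fit_k, newdata = dtm), "\n")
}
```

Perplexity computed on the training documents tends to favor larger K, so in practice it is usually evaluated on held-out documents and weighed against how interpretable the resulting topics are.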

Thank you!

Dr. Shiyan Jiang

sjiang24@ncsu.edu