Text Mining Module 3: A Conceptual Overview
Learning analytics often deals with “unlabeled” data, where the outcome of a model isn’t pre-defined. Unsupervised machine learning is:
Unlabeled. You give the model the data and let it find its own logic.
About discovery. Finding hidden patterns or clusters you didn’t know existed.
Low effort upfront; high effort during interpretation.
Common Tools: Topic Modeling & K-Means Clustering.
Megan R. Brett’s “Topic Modeling: A Basic Introduction” describes topic modeling as a method for finding and tracing clusters of words (topics) in large bodies of text.
Instead of searching for one word within a body of text or index, the algorithm finds groups of words that frequently appear together.
Why use it? It allows us to process thousands of student responses that would be impossible to read manually.
Latent Dirichlet Allocation (LDA) is the most common technique for topic modeling.
LDA operates on two key assumptions:
Every document is a mixture of topics.
Every topic is a mixture of words.
How can educational professionals use this?
1. Text Preprocessing: tidytext: Tidies and tokenizes text so it is ready to be made into a document-term matrix.
2. Topic Modeling Algorithms: topicmodels implements Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM) for extracting topics from text data.
3. Model Selection & Optimization: ldatuning helps determine the optimal number of topics for LDA models using various evaluation metrics.
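If these packages are not already installed, a one-time install might look like this (a minimal sketch; the CRAN package names are the ones listed above):

```r
# One-time installation of the three packages described above
install.packages(c("tidytext", "topicmodels", "ldatuning"))
```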
To perform LDA in R, we must move from raw text to a Document-Term Matrix (DTM). This involves:
Tokenizing the raw text into individual words.
Removing stop words (common words like “the” and “of” that carry little meaning).
Counting how often each term appears in each document.
Casting those counts into a DTM, as sketched below.
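A minimal sketch of that pipeline with tidytext, assuming a hypothetical data frame of student responses (the columns doc_id and text are invented for illustration):

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per student response
responses <- tibble::tibble(
  doc_id = c(1, 2),
  text   = c("Group work helped me learn the material",
             "The weekly quizzes felt rushed and stressful")
)

dtm <- responses %>%
  unnest_tokens(word, text) %>%            # tokenize into single words
  anti_join(stop_words, by = "word") %>%   # remove common stop words
  count(doc_id, word) %>%                  # term counts per document
  cast_dtm(doc_id, word, n)                # tidy counts -> DTM
```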
[Figure: the most probable terms within each of two example topics, from Text Mining with R (Silge & Robinson, 2017)]
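With a DTM in hand, a model like the one behind the figure can be fit with topicmodels and its per-topic word probabilities (beta) tidied for inspection. A sketch continuing from the dtm built above, with K = 2 chosen purely for illustration:

```r
library(topicmodels)
library(tidytext)
library(dplyr)

# Fit a two-topic LDA model (seed fixed for reproducibility)
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# Extract per-topic word probabilities and keep the most probable terms
top_terms <- tidy(lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%   # ten highest-probability terms per topic
  ungroup() %>%
  arrange(topic, -beta)

top_terms
```

Term lists like these are exactly the kind of “groups” the figure displays.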
The computer does not “understand” the topics; it only sees mathematical correlations.
It is the researcher’s responsibility to interpret each group of terms as a coherent theme. What would you call groups “1” and “2” in the figure above?
The researcher must also define the number of topics to extract (K). Too few produces broad generalizations; too many produces redundant clusters. Picking an optimal K can be assisted with ldatuning, as sketched below.
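A sketch of how ldatuning scores candidate values of K against the dtm from above (the metric names are the strings the package exposes):

```r
library(ldatuning)

# Score a range of candidate K values on several published metrics
result <- FindTopicsNumber(
  dtm,
  topics  = seq(2, 20, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 1234)
)

# Plot the metrics; look for where minimization curves bottom out
# and maximization curves level off
FindTopicsNumber_plot(result)
```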
Benefits:
Offers an ostensibly neutral approach to theming texts.
Scalable to massive datasets.
Limitations:
Context can be lost due to the “bag of words” approach to analysis. Text is tokenized into a list of single words, and the LDA process ignores word order in favor of frequency.
Garbage In, Garbage Out: Poorly cleaned data (like encoding errors or human-made typos) can result in nonsensical topics.
Think of a dataset in your current role (e.g., student evaluations). What “latent” topics might you expect?
Where do you see limitations in this approach? What could you pair LDA with to create a more thorough analysis?

This work was supported by the National Science Foundation grants DRL-2025090 and DRL-2321128 (ECR:BCSER). Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.