Text Mining

Text Mining Module 1: A Conceptual Overview

Welcome to Text Mining

Guiding questions:

Why text mining? What is it exactly?
What are common techniques used for mining text?
What are some of the technical and ethical considerations?
How is text mining used in STEM education research?

A Brief Why and What

Data Sources, Research Opportunities, & Simple Definitions

Why Should I Care?

There has been an unprecedented increase in text-based data¹ generated by educational processes and digital learning systems, resulting in…

New sources of data

Discussion Forums
Online Assignments
Instant Messaging Tools
Social Media
Other sources?

New opportunities for research

Massive
Always On
Non-reactive
Social relations
Other opportunities?

What is Text Mining?

According to some students, text mining is…

A process for gaining insight into large amounts of text
An exploration of text in search of patterns
Reading tea leaves
Like computer-assisted reading
Magic!

Text mining is not…

A substitute for traditional qualitative analysis
As “automated” as one would like it to be
Magic!

A Central Question of Text Mining

“A central question in text mining and natural language processing is how to quantify what a document is about.”

~ Text Mining with R: A Tidy Approach (Silge and Robinson 2017)

Techniques to Quantify Text

Frequency, Dictionaries, Topics, & Networks

Basic Text Analysis

“… it’s a mistake to imagine that text mining is now in a sort of crude infancy, whose real possibilities will only be revealed after NLP matures… Wordcounts are amazing!”¹

Text Preprocessing
Word Counts
Term Frequencies
TF-IDF

Yep, a wordcloud, the pie chart of text mining.
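
The basic techniques listed above (preprocessing, word counts, term frequencies, and TF-IDF) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the three posts, the tiny stop word list, and the function names are all invented for the example, and the natural-log IDF is just one common convention.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: three short forum-style posts (invented for illustration).
docs = {
    "post_1": "The lab was fun and the lab report was easy",
    "post_2": "The quiz was hard",
    "post_3": "The lab quiz was fun",
}

# A tiny illustrative stop word list (real lists are much longer).
STOP_WORDS = {"the", "was", "and", "a", "of"}

def tokenize(text):
    """Preprocess: lowercase, split into word tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

# Word counts per document.
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

def tf(word, name):
    """Term frequency: share of a document's tokens that are `word`."""
    total = sum(counts[name].values())
    return counts[name][word] / total

def idf(word):
    """Inverse document frequency: ln(total docs / docs containing `word`)."""
    n_containing = sum(1 for c in counts.values() if word in c)
    if n_containing == 0:
        return 0.0  # assumption: words absent from the corpus get no weight
    return math.log(len(counts) / n_containing)

def tf_idf(word, name):
    """TF-IDF: frequent in this document, rare across the corpus."""
    return tf(word, name) * idf(word)

for word in ("lab", "report"):
    print(word, counts["post_1"][word], round(tf_idf(word, "post_1"), 3))
```

In this toy corpus, "lab" is the most frequent word in post_1, but "report" receives the higher TF-IDF score because it appears in no other post. That is the sense in which TF-IDF goes beyond raw counts to suggest what a document is distinctively about.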

Dictionary-Based Methods

Dictionary-based text analysis¹ uses predefined lists of words, or lexicons, to assign a particular meaning, value, or category to each word in your data:

Sentiment Lexicons
Custom Dictionaries
Stop Words
Linguistic Inquiry and Word Count (LIWC)

Bing Sentiment Lexicon Example:

word	sentiment
brighten	positive
idiocies	negative
proper	positive
horrendously	negative
poorest	negative
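
To make the lookup idea concrete, here is a minimal Python sketch of dictionary-based scoring. It uses only the five lexicon entries shown above; the example sentence and the function names are hypothetical, and a real analysis would load the full lexicon rather than this small sample.

```python
import re
from collections import Counter

# The five entries from the Bing lexicon sample shown above.
# The full lexicon is far larger; this subset is only for illustration.
BING_SAMPLE = {
    "brighten": "positive",
    "idiocies": "negative",
    "proper": "positive",
    "horrendously": "negative",
    "poorest": "negative",
}

def sentiment_counts(text, lexicon=BING_SAMPLE):
    """Count tokens per sentiment label; words not in the lexicon are ignored."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(lexicon[t] for t in tokens if t in lexicon)

def net_sentiment(text, lexicon=BING_SAMPLE):
    """Positive matches minus negative matches."""
    c = sentiment_counts(text, lexicon)
    return c["positive"] - c["negative"]

# Hypothetical sentence, invented for this example.
example = "A proper explanation would brighten this horrendously long unit."
print(sentiment_counts(example))
print(net_sentiment(example))
```

Because the method only matches individual words, it cannot see negation or sarcasm ("not proper" would still count as positive). This is one reason dictionary methods are not a substitute for reading the text.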

Topic Modeling

With a bit of tongue-in-cheek¹, Meeks and Weingart describe topic modeling as:

“leveraging occult statistical methods like ‘dirichlet priors’ and ‘bayesian models’… to provide seductive but obscure results in the form of easily interpreted (and manipulated) ‘topics.’”

Supervised Machine Learning

Supervised machine learning using text involves building a statistical model to predict some output from input that includes language.¹ Outputs may be:

numeric or continuous, such as predicting the year of a United States Supreme Court opinion from text of that opinion.
discrete quantities or class labels, such as predicting whether a GitHub issue is about documentation.
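
As a rough illustration of the class-label case, the sketch below trains a small multinomial Naive Bayes classifier to guess whether an issue is about documentation. Naive Bayes is chosen here only because it fits in a few lines; the source does not prescribe a particular model. The six training issues, their labels, and all function names are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labeled examples, invented for illustration only.
TRAIN = [
    ("README is missing install instructions", "documentation"),
    ("Typo in the API reference docs", "documentation"),
    ("Update the tutorial docs for the new release", "documentation"),
    ("App crashes when saving a file", "other"),
    ("Error on login with empty password", "other"),
    ("Crash when opening a large file", "other"),
]

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(examples):
    """Collect per-class word counts, class frequencies, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for c in word_counts.values() for w in c}
    return word_counts, class_counts, vocab

def predict(text, model):
    """Return the class with the highest log posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for w in tokenize(text):
            if w in vocab:  # skip words never seen in training
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict("Typo in the install docs", model))
print(predict("Crash when saving a large file", model))
```

With only six training examples this is a toy; a real study would need far more labeled text and a held-out test set to estimate how well the predictions generalize.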

Text as Networks

Text data can also be quantified and represented as relational data in the form of networks, as in the case of:

Text Networks where individual words are the nodes
Epistemic Network Analysis where codes assigned to text by humans or automated methods are the nodes

And the edges between them describe the regularity with which they co-occur in documents.

Network illustrating principals’ epistemic frames of characteristics used for the selection of Advanced Teachers. go.ncsu.edu/atr-study
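
The co-occurrence idea behind a text network can be sketched as follows: words are nodes, and an edge's weight is the number of documents in which both words appear. The mini-corpus and the function name are hypothetical, and this is a simplified sketch rather than the procedure used in the study pictured above.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus, invented for illustration.
docs = [
    "students discuss lab data",
    "students share lab notes",
    "teachers discuss data",
]

def cooccurrence_edges(documents):
    """For each unordered word pair, count the documents containing both words."""
    edges = Counter()
    for text in documents:
        # Sort the unique words so each pair always appears in the same order.
        words = sorted(set(re.findall(r"[a-z']+", text.lower())))
        edges.update(combinations(words, 2))
    return edges

edges = cooccurrence_edges(docs)

# Print the weighted edge list, heaviest edges first.
for (a, b), weight in edges.most_common():
    print(a, b, weight)
```

The same counting would apply if the nodes were codes assigned to each document rather than raw words; only the input to the pairing step would change.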

Key Considerations

Logistical, Technical, Ethical, & Legal

Text Mining Considerations

Despite the potential advantages of text-based data captured by educational technologies, text mining (TM) poses a number of challenges¹ for STEM Ed researchers.

Logistical & Technical

Unstructured
Inaccessible
Non-Representative
Incomplete

Ethical & Legal

Bias (algorithmic, positivity)
Sensitive
Terms of Use
FERPA

Research: Academic Performance

TM has largely been used to evaluate academic performance in different contexts, especially to assess essays and online assignments.

writing style
use of argumentation
plagiarism detection
peer interaction

Research: Student Feedback

To help improve performance, TM is used to provide student feedback, often based on both their interactions and activities.

Intelligent Tutoring Systems
Question-answering applications
Assist teacher feedback
Support formative feedback

Research: Learner Engagement

TM has been applied to support student engagement and collaboration, especially in distance learning courses.

Automated writing scaffolds
Analysis of interactional resources
Student sentiment extraction
Dropout prevention

Other Applications

Other applications of text mining in STEM Ed Research include:

Automatic text summarization
Analytics & visualization tools
Curriculum adaptation
Recommendation systems

Discussion

Consider a small network you are a part of (~ 5-10 individuals), or may be interested in studying, and think about the following questions:

Who are the actors in this network?
What types of ties might connect these actors?
What individual attributes might be important to capture for SNA?
Which actors may be more central (e.g. more ties) in this network?

Acknowledgements

This work was supported by the National Science Foundation grants DRL-2025090 and DRL-2321128 (ECR:BCSER). Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Krumm, Andrew, Barbara Means, and Marie Bienkowski. 2018. Learning Analytics Goes to School. Routledge. https://doi.org/10.4324/9781315650722.

Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media, Inc. https://www.tidytextmining.com.