Text Mining Basics

Lab 1: Conceptual Overview

Agenda

Part 1: Research Overview

Research question
Word count
Term frequency
Inverse document frequency
TF-IDF

Part 2: R Code-Along

Tokenization
Stemming
Stopword
Filter

Part 1: Research Overview

Turn texts into numbers

Research Questions

Walkthrough example
Discuss

What aspects of online professional development offerings do teachers find most valuable?

Take a look at the dataset located here and consider the following:

- What format is this data set stored as?

- What are some things you notice about this dataset?

- What questions do you have about this dataset?

- What similar dataset do you have?

- What research questions do you want to address with your dataset?

Word Count

Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good

Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/

Term frequency

The number we fill the matrix with are simply the raw count of the tokens in each document. This is called the term frequency (TF) approach.

Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/

IDF, TF-IDF

IDF is a measure of how important a term is. TF-IDF is intended to measure how important a word is to a document in a collection (or corpus) of documents.

Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/

Part 2: R Code Along

Tokenization, Stemming, Stopword, and Filter

[Text Mining_Basics]

Tokenization, Stemming, Stopword, and Filter

These are some of the methods of processing the data in text mining:

unnest_tokens()
wordStem() (lab 3)
anti_join(dataframe, stop_words)
filter()

Thank you!

Dr. Shiyan Jiang

sjiang24@ncsu.edu