Text Mining Basics

Lab 1: Conceptual Overview

Agenda

Part 1: Research Overview

  • Research question

  • Word count

  • Term frequency

  • Inverse document frequency

  • TF-IDF

Part 2: R Code-Along

  • Tokenization

  • Stemming

  • Stopword removal

  • Filtering

Part 1: Research Overview

Turning text into numbers

Research Questions

What aspects of online professional development offerings do teachers find most valuable?

Take a look at the dataset located here and consider the following:

- In what format is this dataset stored?

- What are some things you notice about this dataset?

- What questions do you have about this dataset?

- What similar datasets do you have?

- What research questions do you want to address with your dataset?

Word Count

  • Review 1: This movie is very scary and long

  • Review 2: This movie is not scary and is slow

  • Review 3: This movie is spooky and good

Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/

Term frequency

The numbers we fill the matrix with are simply the raw counts of the tokens in each document. This is called the term frequency (TF) approach.

Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
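
As a rough illustration of the matrix described above, the following base R sketch builds a document-term matrix of raw counts from the three example reviews. The variable names (reviews, tokens, vocab, tf) and the simple split-on-whitespace tokenizer are assumptions made for this example, not part of the lab materials.

```r
# The three example reviews from the Word Count slide
reviews <- c(
  "This movie is very scary and long",
  "This movie is not scary and is slow",
  "This movie is spooky and good"
)

# Tokenize: lowercase each review and split it on whitespace
# (an assumed, deliberately simple tokenizer)
tokens <- strsplit(tolower(reviews), "\\s+")

# Vocabulary: every distinct word, in order of first appearance
vocab <- unique(unlist(tokens))

# Count how often each vocabulary word occurs in one tokenized review
count_terms <- function(doc) {
  tabulate(match(doc, vocab), nbins = length(vocab))
}

# Document-term matrix: one row per review, one column per word,
# each cell holding the raw count (term frequency) of that word
tf <- t(vapply(tokens, count_terms, integer(length(vocab))))
dimnames(tf) <- list(paste("Review", seq_along(reviews)), vocab)

print(tf)
```

Each row of tf is one review and each column is one word, so the word "is" shows up with a count of 2 in the row for Review 2.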

IDF, TF-IDF

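Inverse document frequency (IDF) measures how informative a term is across the corpus: a term that appears in many documents receives a low weight, and a term that appears in few documents receives a high weight. TF-IDF combines TF and IDF to measure how important a word is to a document in a collection (or corpus) of documents.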

Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
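
To make the weighting concrete, here is a minimal base R sketch that computes IDF and TF-IDF for the three example reviews. It assumes one common form of the formula, idf = log(N / df) with the natural logarithm, and uses raw counts as TF; other variants (a different log base, smoothing, or length-normalized TF) give different numbers. All variable names are illustrative.

```r
# The three example reviews from the Word Count slide
reviews <- c(
  "This movie is very scary and long",
  "This movie is not scary and is slow",
  "This movie is spooky and good"
)

# Raw-count document-term matrix (TF), built with a simple tokenizer
tokens <- strsplit(tolower(reviews), "\\s+")
vocab <- unique(unlist(tokens))
count_terms <- function(doc) {
  tabulate(match(doc, vocab), nbins = length(vocab))
}
tf <- t(vapply(tokens, count_terms, integer(length(vocab))))
dimnames(tf) <- list(paste("Review", seq_along(reviews)), vocab)

# Document frequency: in how many reviews does each word appear?
doc_freq <- colSums(tf > 0)

# IDF (assumed form): log of total documents over document frequency
idf <- log(nrow(tf) / doc_freq)

# TF-IDF: scale each word's column of counts by that word's IDF
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tfidf, 2))
```

Under this definition, words that occur in every review ("this", "movie", "is", "and") get an IDF of zero and drop out, while words unique to one review ("very", "spooky") get the highest weight.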

Part 2: R Code-Along

Tokenization, Stemming, Stopwords, and Filtering

These are some of the functions used to preprocess text data in text mining:

  • Tokenization: unnest_tokens()
  • Stemming: wordStem() (lab 3)
  • Stopword removal: anti_join(dataframe, stop_words)
  • Filtering: filter()
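
The four functions listed above can be chained into one pipeline. The sketch below is illustrative only: it assumes the dplyr, tidytext, and SnowballC packages are installed, reuses the three movie reviews from Part 1 instead of the lab dataset, and the column names and the word dropped in the filter() step are hypothetical choices.

```r
library(dplyr)      # %>%, tibble(), anti_join(), mutate(), filter(), count()
library(tidytext)   # unnest_tokens() and the stop_words data frame
library(SnowballC)  # wordStem()

# Example data (assumed): one row per review
reviews <- tibble(
  review = 1:3,
  text = c(
    "This movie is very scary and long",
    "This movie is not scary and is slow",
    "This movie is spooky and good"
  )
)

tidy_reviews <- reviews %>%
  # Tokenization: one lowercased word per row, in a column named "word"
  unnest_tokens(word, text) %>%
  # Stopword removal: drop rows whose word appears in stop_words
  anti_join(stop_words, by = "word") %>%
  # Filtering: drop a word judged uninformative here (hypothetical choice)
  filter(word != "movie") %>%
  # Stemming: reduce each remaining word to its Porter stem
  mutate(stem = wordStem(word))

# Count how often each stem occurs in each review
stem_counts <- tidy_reviews %>%
  count(review, stem, sort = TRUE)

print(stem_counts)
```

The order of the steps is a design choice: removing stopwords before stemming keeps the match against stop_words on whole words, since stemmed forms may no longer appear in that list.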

Thank you!

Dr. Shiyan Jiang

sjiang24@ncsu.edu