Lab 1: Conceptual Overview
Part 1: Research Overview
Research question
Word count
Term frequency
Inverse document frequency
TF-IDF
Part 2: R Code-Along
Tokenization
Stemming
Stopword
Filter
Turn texts into numbers
What aspects of online professional development offerings do teachers find most valuable?
Take a look at the dataset located here and consider the following:
- In what format is this dataset stored?
- What are some things you notice about this dataset?
- What questions do you have about this dataset?
- Do you have a similar dataset of your own?
- What research questions do you want to address with your dataset?
Review 1: This movie is very scary and long
Review 2: This movie is not scary and is slow
Review 3: This movie is spooky and good
Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
The numbers we fill the matrix with are simply the raw counts of the tokens in each document. This is called the term frequency (TF) approach.
Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
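The raw-count matrix described above can be sketched in base R for the three example reviews. This is only an illustration: the variable names (reviews, tokens, vocab, tf_counts) and the choice to lowercase and split on whitespace are assumptions, not part of the lab materials.

```r
# The three example reviews from the slides.
reviews <- c(
  "This movie is very scary and long",
  "This movie is not scary and is slow",
  "This movie is spooky and good"
)

# Lowercase each review and split it into word tokens on whitespace.
tokens <- strsplit(tolower(reviews), "\\s+")

# Vocabulary: every distinct token, in order of first appearance.
vocab <- unique(unlist(tokens))

# Count how often each vocabulary term occurs in each review.
# Result: one row per review, one column per term.
tf_counts <- t(vapply(tokens, function(doc) {
  vapply(vocab, function(term) sum(doc == term), integer(1))
}, integer(length(vocab))))
rownames(tf_counts) <- paste("Review", seq_along(reviews))
colnames(tf_counts) <- vocab

print(tf_counts)
```

Each row is one review turned into a vector of numbers. Note that "is" is counted twice in Review 2 because it appears twice in that sentence.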
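IDF is a measure of how informative a term is across the whole collection: a term that appears in many documents gets a low weight, and a term that appears in only a few gets a high weight. TF-IDF combines TF and IDF to measure how important a word is to a document in a collection (or corpus) of documents.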
Figure source: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
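To make the TF-IDF idea concrete, here is a minimal base R sketch on the same three reviews. It assumes one common definition: TF is a term's count divided by the number of tokens in the document, and IDF is the base-10 logarithm of the number of documents divided by the number of documents containing the term. Other variants exist (different log bases, smoothing), and the variable names are illustrative only.

```r
reviews <- c(
  "This movie is very scary and long",
  "This movie is not scary and is slow",
  "This movie is spooky and good"
)

# Tokenize and build the vocabulary.
tokens <- strsplit(tolower(reviews), "\\s+")
vocab <- unique(unlist(tokens))
n_docs <- length(tokens)

# Raw counts: one row per review, one column per term.
counts <- t(vapply(tokens, function(doc) {
  vapply(vocab, function(term) sum(doc == term), integer(1))
}, integer(length(vocab))))
rownames(counts) <- paste("Review", seq_along(reviews))
colnames(counts) <- vocab

# TF (assumed definition): share of a review's tokens taken up by each term.
tf <- counts / rowSums(counts)

# IDF (assumed definition): log10(number of documents / documents containing the term).
doc_freq <- colSums(counts > 0)
idf <- log10(n_docs / doc_freq)

# TF-IDF: multiply each term's column by that term's IDF weight.
tf_idf <- sweep(tf, 2, idf, `*`)

print(round(tf_idf, 3))
```

Under this definition, words that occur in every review ("this", "movie", "is", "and") get a TF-IDF of zero, while words that occur in only one review ("very", "long", "spooky", "good") get the largest weights.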
Tokenization, Stemming, Stopword, and Filter
[Text Mining_Basics]
Some common methods for processing text data in text mining are tokenization, stemming, stopword removal, and filtering.
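Three of these steps can be sketched in base R on one of the example reviews. The stopword list and the length threshold below are made up for illustration. Stemming is left out because base R has no built-in stemmer; in practice it would come from an add-on package.

```r
text <- "This movie is not scary and is slow"

# Tokenization: lowercase, then split on anything that is not a letter.
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]

# Stopword removal: a tiny hand-picked list (an assumption, not a standard lexicon).
stopwords <- c("this", "is", "and")
content_tokens <- tokens[!(tokens %in% stopwords)]

# Filtering: keep tokens with at least four characters (an arbitrary threshold).
filtered <- content_tokens[nchar(content_tokens) >= 4]

print(filtered)
```

The length filter also drops "not", which reverses the meaning of the review. This is one reason to choose filtering rules with the research question in mind.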
Dr. Shiyan Jiang