Prepare and Wrangle

Text Mining Module 1: A Code-a-long

Welcome to the LAW Code-a-long for Module 1

  • The Text Mining course is designed for those seeking an introductory understanding of quantifying the text in documents to better understand their properties

  • The following code-a-long is a companion to the Module 1 case study’s Prepare and Wrangle stages

Figure 2.2 Steps of Data-Intensive Research Workflow

[@krumm2018]

Module Objectives

This code-a-long is about getting our text “tidy.” By the end of this module we will:

  • Understand the context and data sources you’re working with so you can formulate useful and answerable questions
  • Practice reading, manipulating, cleaning, transforming, and merging text data

Context of the Problem

  • We are working with open-ended survey data from an evaluation of online professional development offered by the North Carolina Department of Public Instruction

  • This PD is part of the state’s Race to the Top (RttT) efforts

  • 2012 Friday Institute Annual Report

Research Questions:

  • What aspects of online professional development offerings do teachers find most valuable?

  • How might resources differ in the value they afford teachers?

Load Libraries

  • Load the tidyverse and tidytext packages using library()
library(tidyverse)
library(tidytext)

Wrangling Text Data

  • Data wrangling is essential because it involves the initial steps of going from raw data to a dataset that can be explored and modeled [@krumm2018]. Text mining is no exception. Today we will:
  1. Read raw data. Before working with data, we need to “read” it into R. It also helps to inspect your data and we’ll review different functions for doing so
  2. Reduce the data. We focus on tools from the dplyr package to view, rename, select, slice, and filter our data in preparation for analysis
  3. Tidy text. We’ll learn how to use the tidytext package to both “tidy” and tokenize our text in order to create a data frame to use for analysis

Reading Data

This section introduces the following functions for reading data into R and inspecting the contents of a data frame:

  • dplyr::read_csv() Reading .csv files into R.
  • base::print() View your data frame in the Console Pane
  • utils::view() View your data frame in the Source Pane
  • tibble::glimpse() Like print, but transposed so you can see all columns
  • utils::head() View the first 6 rows of your data.
  • utils::tail() View last 6 rows of your data.
  • dplyr::write_csv() writing .csv files to directory.

Reading Data, Cont.

  • To get started, we need to import, or “read”, our data into R

  • The RttT Online PD survey data is stored in a CSV file named opd_survey.csv which is located in the data folder of this project

  • Use read_csv() to save opd_survey.csv as a new object opd_survey

  • This file can be found in the local data folder

  • If you have questions about the above function, don’t be afraid to look it up in console using ?read_csv()

opd_survey <- read_csv("data/opd_survey.csv")
  • There is also an Import Dataset feature under the Environment tab in the upper right pane of RStudio

Viewing Data

  • Once your data is in R, there are many different ways you can view it. Use the below code chunk to try each way out:
# A tibble: 57,054 × 19
   RecordedDate          ResponseId Role  Q14   Q16...5 Resource Resource_8_TEXT
   <chr>                 <chr>      <chr> <chr> <chr>   <chr>    <chr>          
 1 "Recorded Date"       "Response… "Wha… "Ple… "Which… "Please… "Please indica…
 2 "{\"ImportId\":\"rec… "{\"Impor… "{\"… "{\"… "{\"Im… "{\"Imp… "{\"ImportId\"…
 3 "3/14/12 12:41"       "R_6fKCyE… "Cen… "K-1… "Eleme… "Summer…  <NA>          
 4 "3/14/12 13:31"       "R_09rHle… "Cen… "Ele… "Eleme… "Online…  <NA>          
 5 "3/14/12 14:51"       "R_1BlKt2… "Sch… "Hig… "Not A… "Online…  <NA>          
 6 "3/14/12 15:02"       "R_bPGUVT… "Sch… "Ele… "Guida… "Calend…  <NA>          
 7 "3/14/12 17:24"       "R_egJEHM… "Tea… "Ele… "Engli… "Live W…  <NA>          
 8 "3/15/12 9:18"        "R_4PEbT4… "Tea… "Ele… "Infor… "Online…  <NA>          
 9 "3/15/12 10:54"       "R_eqB1bM… "Sch… "Mid… "Guida… "Live W…  <NA>          
10 "3/15/12 14:20"       "R_ewwUwi… "Cen… "Pre… "Other… "Wiki"    <NA>          
# ℹ 57,044 more rows
# ℹ 12 more variables: Resource_9_TEXT <chr>, Resource_10_TEXT...9 <chr>,
#   Topic <chr>, Resource_10_TEXT...11 <chr>, Q16...12 <chr>, Q16_9_TEXT <chr>,
#   Q19 <chr>, Q20 <chr>, Q21 <chr>, Q26 <chr>, Q37 <chr>, Q8 <chr>
Rows: 57,054
Columns: 19
$ RecordedDate          <chr> "Recorded Date", "{\"ImportId\":\"recordedDate\"…
$ ResponseId            <chr> "Response ID", "{\"ImportId\":\"_recordId\"}", "…
$ Role                  <chr> "What is your role within your school district o…
$ Q14                   <chr> "Please select the school level(s) you work with…
$ Q16...5               <chr> "Which content area(s) do you specialize in?  (S…
$ Resource              <chr> "Please indicate the online professional develop…
$ Resource_8_TEXT       <chr> "Please indicate the online professional develop…
$ Resource_9_TEXT       <chr> "Please indicate the online professional develop…
$ Resource_10_TEXT...9  <chr> "Please indicate the online professional develop…
$ Topic                 <chr> "What was the primary focus of the webinar you a…
$ Resource_10_TEXT...11 <chr> "What was the primary focus of the webinar you a…
$ Q16...12              <chr> "Which primary content area(s) did the webinar a…
$ Q16_9_TEXT            <chr> "Which primary content area(s) did the webinar a…
$ Q19                   <chr> "Please specify the online learning module you a…
$ Q20                   <chr> "How are you using this resource?", "{\"ImportId…
$ Q21                   <chr> "What was the most beneficial/valuable aspect of…
$ Q26                   <chr> "What recommendations do you have for improving …
$ Q37                   <chr> "What recommendations do you have for making thi…
$ Q8                    <chr> "Which of the following best describe(s) how you…
# A tibble: 6 × 19
  RecordedDate           ResponseId Role  Q14   Q16...5 Resource Resource_8_TEXT
  <chr>                  <chr>      <chr> <chr> <chr>   <chr>    <chr>          
1 "Recorded Date"        "Response… "Wha… "Ple… "Which… "Please… "Please indica…
2 "{\"ImportId\":\"reco… "{\"Impor… "{\"… "{\"… "{\"Im… "{\"Imp… "{\"ImportId\"…
3 "3/14/12 12:41"        "R_6fKCyE… "Cen… "K-1… "Eleme… "Summer…  <NA>          
4 "3/14/12 13:31"        "R_09rHle… "Cen… "Ele… "Eleme… "Online…  <NA>          
5 "3/14/12 14:51"        "R_1BlKt2… "Sch… "Hig… "Not A… "Online…  <NA>          
6 "3/14/12 15:02"        "R_bPGUVT… "Sch… "Ele… "Guida… "Calend…  <NA>          
# ℹ 12 more variables: Resource_9_TEXT <chr>, Resource_10_TEXT...9 <chr>,
#   Topic <chr>, Resource_10_TEXT...11 <chr>, Q16...12 <chr>, Q16_9_TEXT <chr>,
#   Q19 <chr>, Q20 <chr>, Q21 <chr>, Q26 <chr>, Q37 <chr>, Q8 <chr>
# A tibble: 6 × 19
  RecordedDate ResponseId        Role    Q14    Q16...5 Resource Resource_8_TEXT
  <chr>        <chr>             <chr>   <chr>  <chr>   <chr>    <chr>          
1 7/2/13 10:20 R_0cggNPIobej2kHX Teacher Middl… World … Online … <NA>           
2 7/2/13 12:32 R_bpZ2jPQV1BOta2p <NA>    <NA>   <NA>    <NA>     <NA>           
3 7/2/13 12:32 R_4SbNuxFI6qv8pQF Teacher Eleme… Mathem… Online … <NA>           
4 7/2/13 12:32 R_1TT9rRNolK2xSDP <NA>    <NA>   <NA>    <NA>     <NA>           
5 7/2/13 12:32 R_8raUHcIydALnRhH <NA>    <NA>   <NA>    <NA>     <NA>           
6 7/2/13 12:32 R_2bs3lzLBdGWjkOx <NA>    <NA>   <NA>    <NA>     <NA>           
# ℹ 12 more variables: Resource_9_TEXT <chr>, Resource_10_TEXT...9 <chr>,
#   Topic <chr>, Resource_10_TEXT...11 <chr>, Q16...12 <chr>, Q16_9_TEXT <chr>,
#   Q19 <chr>, Q20 <chr>, Q21 <chr>, Q26 <chr>, Q37 <chr>, Q8 <chr>
 [1] "RecordedDate"          "ResponseId"            "Role"                 
 [4] "Q14"                   "Q16...5"               "Resource"             
 [7] "Resource_8_TEXT"       "Resource_9_TEXT"       "Resource_10_TEXT...9" 
[10] "Topic"                 "Resource_10_TEXT...11" "Q16...12"             
[13] "Q16_9_TEXT"            "Q19"                   "Q20"                  
[16] "Q21"                   "Q26"                   "Q37"                  
[19] "Q8"                   

Reducing Data

dplyr functions

  • select() picks variables based on their names.
  • slice() lets you select, remove, and duplicate rows.
  • rename() changes the names of individual variables using new_name = old_name syntax
  • filter() picks cases, or rows, based on their values in a specified column

tidyr functions

  • drop_na() removes rows with missing values (“na”, “N/A”, etc.)

Subset Columns

  • To help address our research question, let’s first reduce our dataset to only the columns needed
  • select() is a dplyr function that can choose specific variables, or columns, of data
  • select() the columns Role, Resource, and Q21

  • Assign these selected columns to a new data frame named opd_selected by using the <- assignment operator

#select relevant columns
opd_selected <- select(opd_survey, Role, Resource, Q21)
#check data
head(opd_selected)
# A tibble: 6 × 3
  Role                                                            Resource Q21  
  <chr>                                                           <chr>    <chr>
1 "What is your role within your school district or organization… "Please… "Wha…
2 "{\"ImportId\":\"QID2\"}"                                       "{\"Imp… "{\"…
3 "Central Office Staff (e.g. Superintendents, Tech Director, Cu… "Summer…  <NA>
4 "Central Office Staff (e.g. Superintendents, Tech Director, Cu… "Online… "Glo…
5 "School Support Staff (e.g. Counselors, Technology Facilitator… "Online…  <NA>
6 "School Support Staff (e.g. Counselors, Technology Facilitator… "Calend… "com…

Rename Columns

  • rename() allows us to change the names of existing columns, which can make them more useful or informative in our data workflow
  • Take our opd_selected data frame and use rename(), along with the = assignment operator, to change the column name from Q21 to text and save it as opd_renamed.
#rename columns
opd_renamed <- rename(opd_selected, text = Q21)
#check data
opd_renamed
# A tibble: 57,054 × 3
   Role                                                           Resource text 
   <chr>                                                          <chr>    <chr>
 1 "What is your role within your school district or organizatio… "Please… "Wha…
 2 "{\"ImportId\":\"QID2\"}"                                      "{\"Imp… "{\"…
 3 "Central Office Staff (e.g. Superintendents, Tech Director, C… "Summer…  <NA>
 4 "Central Office Staff (e.g. Superintendents, Tech Director, C… "Online… "Glo…
 5 "School Support Staff (e.g. Counselors, Technology Facilitato… "Online…  <NA>
 6 "School Support Staff (e.g. Counselors, Technology Facilitato… "Calend… "com…
 7 "Teacher"                                                      "Live W… "lev…
 8 "Teacher"                                                      "Online… "Non…
 9 "School Support Staff (e.g. Counselors, Technology Facilitato… "Live W… "awa…
10 "Central Office Staff (e.g. Superintendents, Tech Director, C… "Wiki"   "Inf…
# ℹ 57,044 more rows

Additional Wrangling

  • The following code shows how to make some slight adjustments to account for Qualtrics artifacts and missing data

  • These are shown in greater detail in the Module 1 Case Study

#use slice() to remove the top two rows of data
#these are extra headers created by Qualtrics that we don't need
opd_sliced <- slice(opd_renamed, -1, -2) # the - operator tells the slice function to NOT keep rows 1 and 2

opd_sliced
# A tibble: 57,052 × 3
   Role                                                           Resource text 
   <chr>                                                          <chr>    <chr>
 1 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Summer …  <NA>
 2 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Online … "Glo…
 3 School Support Staff (e.g. Counselors, Technology Facilitator… Online …  <NA>
 4 School Support Staff (e.g. Counselors, Technology Facilitator… Calendar "com…
 5 Teacher                                                        Live We… "lev…
 6 Teacher                                                        Online … "Non…
 7 School Support Staff (e.g. Counselors, Technology Facilitator… Live We… "awa…
 8 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Wiki     "Inf…
 9 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Online … "Gav…
10 Teacher                                                        Online … "In …
# ℹ 57,042 more rows
#use drop_na() to remove missing data indicated by "NA"

opd_complete <- drop_na(opd_sliced)

opd_complete
# A tibble: 27,283 × 3
   Role                                                           Resource text 
   <chr>                                                          <chr>    <chr>
 1 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Online … "Glo…
 2 School Support Staff (e.g. Counselors, Technology Facilitator… Calendar "com…
 3 Teacher                                                        Live We… "lev…
 4 Teacher                                                        Online … "Non…
 5 School Support Staff (e.g. Counselors, Technology Facilitator… Live We… "awa…
 6 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Wiki     "Inf…
 7 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Online … "Gav…
 8 Teacher                                                        Online … "In …
 9 Teacher                                                        Online … "Und…
10 Teacher                                                        Online … "ove…
# ℹ 27,273 more rows
#use filter() to identify only responses from participants who indicated their role as teacher

opd_teacher <- filter(opd_complete, Role == "Teacher")

opd_teacher
# A tibble: 23,374 × 3
   Role    Resource                            text                             
   <chr>   <chr>                               <chr>                            
 1 Teacher Live Webinar                        "levels ofquestioning and revise…
 2 Teacher Online Learning Module              "None, really."                  
 3 Teacher Online Learning Module              "In any of the modules when a te…
 4 Teacher Online Learning Module              "Understanding the change"       
 5 Teacher Online Learning Module              "overview of reasons for change" 
 6 Teacher Online Learning Module              "online--allowed me to do it on …
 7 Teacher Summer Institute/RESA Presentations "exposure to new information and…
 8 Teacher Summer Institute/RESA Presentations "Knowing what needs to be done." 
 9 Teacher Other, please specify               "How to organize lessons"        
10 Teacher Document, please specify            "Helping me understand what each…
# ℹ 23,364 more rows

Tidying Text

For this part of our workflow we focus on the following functions from the tidytext and dplyr packages respectively:

  • unnest_tokens() splits a column into tokens
  • anti_join() returns all rows from x without a match in y.

Tidy data has a specific structure:

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table

Tidy Text, Cont.

In this section, our goal is to transform our opd_teacher text data from this:

Tidy Text, Cont.

to this:

Tokenizing Text

  • In order to tidy our text, we need to break the text into individual tokens (a process called tokenization) and transform it to a tidy data structure

  • To do this, we use tidytext’s incredibly powerful unnest_tokens() function

  • Use unnest_tokens() to tokenize opd_teacher into new object opd_tidy

  • Within the function, follow the table opd_teacher with the output and input arguments as word and text, respectively

Remove Stop Words

One final step in tidying our text is to remove words that don’t add much value to our analysis (at least when using this approach) such as “and”, “the”, “of”, “to” etc. The tidytext package contains a stop_words dataset derived from three different lexicons that we’ll use to remove rows that match words in this dataset.

Let’s take a look at the first 20 stop words so we know what we’re getting rid of from our opd_tidy dataset.

# A tibble: 20 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
11 afterwards  SMART  
12 again       SMART  
13 against     SMART  
14 ain't       SMART  
15 all         SMART  
16 allow       SMART  
17 allows      SMART  
18 almost      SMART  
19 alone       SMART  
20 along       SMART  

Now run the following code to view all stop words in each lexicon:

In order to remove these stop words, we will use function called anti_join() that looks for matching values in a specific column from two datasets and returns rows from the original dataset that have no matches. For a good overview of the different dplyr joins see here: https://medium.com/the-codehub/beginners-guide-to-using-joins-in-r-682fc9b1f119

Let’s remove rows from our opd_tidy data frame that contain matches in the word column with those in the stop_words dataset and save it as opd_clean since we were done cleaning our data at this point.

# A tibble: 78,675 × 3
   Role    Resource               word         
   <chr>   <chr>                  <chr>        
 1 Teacher Live Webinar           levels       
 2 Teacher Live Webinar           ofquestioning
 3 Teacher Live Webinar           revised      
 4 Teacher Live Webinar           blooms       
 5 Teacher Online Learning Module modules      
 6 Teacher Online Learning Module teacher      
 7 Teacher Online Learning Module shown        
 8 Teacher Online Learning Module action       
 9 Teacher Online Learning Module classroom    
10 Teacher Online Learning Module modeling     
# ℹ 78,665 more rows

👉 Your Turn ⤵

In the code chunk below, add the anti_join() function to our chain of data wrangling functions as our final text preprocessing step. Don’t forget to include in the function the stop words you are trying to remove.

Also, add a short comment to each line of code explaining each data wrangling step. The first had been done for you.

# A tibble: 78,675 × 3
   Role    Resource               word         
   <chr>   <chr>                  <chr>        
 1 Teacher Live Webinar           levels       
 2 Teacher Live Webinar           ofquestioning
 3 Teacher Live Webinar           revised      
 4 Teacher Live Webinar           blooms       
 5 Teacher Online Learning Module modules      
 6 Teacher Online Learning Module teacher      
 7 Teacher Online Learning Module shown        
 8 Teacher Online Learning Module action       
 9 Teacher Online Learning Module classroom    
10 Teacher Online Learning Module modeling     
# ℹ 78,665 more rows

If you did this correctly, you should now see a wrangled_teacher data frame in the Environment pane that has 78,675 observations and 3 variables.

👉 Your Turn ⤵

Finally, let’s wrangle a data set that includes tokens not just for teachers, but for all educators who particiated in the online professional development offerings.

Complete and run the following code to wrangle and tidy your opd_survey data for all survey respondents.

# A tibble: 92,722 × 3
   Role                                                           Resource word 
   <chr>                                                          <chr>    <chr>
 1 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Online … glob…
 2 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Online … view 
 3 School Support Staff (e.g. Counselors, Technology Facilitator… Calendar comm…
 4 Teacher                                                        Live We… leve…
 5 Teacher                                                        Live We… ofqu…
 6 Teacher                                                        Live We… revi…
 7 Teacher                                                        Live We… bloo…
 8 School Support Staff (e.g. Counselors, Technology Facilitator… Live We… awar…
 9 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Wiki     info…
10 Central Office Staff (e.g. Superintendents, Tech Director, Cu… Wiki     rele…
# ℹ 92,712 more rows

❓Questions

  1. How many observations are in the data frame with all educators?
    • YOUR RESPONSE HERE
  2. Why do you think the console provided the message “Joining, by = ‘word’”?
    • YOUR RESPONSE HERE

Congratulations, we’re now ready to begin exploring our wrangled data!

What’s Next?