LAW Module 1: A Code-A-long
Learning Analytics Workflow (LAW) is designed for those seeking an introductory understanding of learning analytics using basic R programming skills, particularly in the context of STEM education research.
The following code-a-long is aimed at preparing you for the first section of the case study.
By the end of this module:
Data:
Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & Education, 54(2), 588-599.
Research Questions:
Which LMS tracking data variables correlate significantly with student achievement?
How accurately can measures of student online activity in an online course site predict student achievement in the course under study?
install.packages() downloads a package from a repository, typically the Comprehensive R Archive Network (CRAN) unless you specify another one
Use install.packages() to install the tidyverse package
Put the package name in quotation marks, since install.packages() expects it as a character string
library() loads an installed package into the current R session
Use library() to load the tidyverse package
Because the package is already installed, library() accepts its name without quotation marks
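The two steps above can be sketched as follows. The `requireNamespace()` guard and the `repos` URL (CRAN's cloud mirror) are assumptions added so the snippet can be re-run safely; they are not part of the original slides.

```r
# Install tidyverse only if it is missing. The name is quoted because
# install.packages() takes a character string.
# Assumption: the CRAN cloud mirror; drop `repos` to use your default.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}

# Load the package for this session. The bare name works here;
# library("tidyverse") would work as well.
library(tidyverse)
```

Installing is needed once per machine, while `library()` must be called in every new R session.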
readr
file Path to the file
col_names If TRUE, uses the first row of the data as the column names; if FALSE, creates placeholder column names
na Text values in the data to treat as missing (NA). Can be a character vector like c("", "na")
skip Number of lines to skip before reading the data
col_types Coerces columns to specific data types. If NULL, the function guesses the type of each column, which is convenient but not robust

Use read_csv() to read in sci-online.classes.csv
This file is located in your data folder
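To see how these arguments interact, here is a minimal sketch that writes a small stand-in CSV to a temporary file and reads it back. The file contents and column names (`student_id`, `course`, `grade`) are hypothetical and are not taken from the course data.

```r
library(readr)

# Hypothetical stand-in file so the sketch runs without the course data
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "exported from a hypothetical LMS",  # junk first line, skipped below
  "student_id,course,grade",
  "1,BioA,85",
  "2,PhysA,na",
  "3,BioA,"
), tmp)

demo <- read_csv(
  file = tmp,
  col_names = TRUE,        # first row read (after skipping) holds the names
  na = c("", "na"),        # treat empty strings and "na" as missing
  skip = 1,                # skip the junk line at the top
  col_types = cols(        # declare types instead of letting readr guess
    student_id = col_integer(),
    course = col_character(),
    grade = col_double()
  )
)
demo
```

For the course file, assuming it sits in the data folder as described, the simplest call would be `read_csv("data/sci-online.classes.csv")`, with the arguments above added only when the defaults guess wrong.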
The Excel file format is proprietary to Microsoft, but it is ubiquitous
In R, the readxl package reads Excel files
Load the readxl package
Read in csss_tweets.xlsx with the read_excel() function into a new object csss_tweets
Inspect the object with head()
# Load the readxl package
library(readxl)
# Use read_excel() to read in data/csss_tweets.xlsx
csss_tweets <- read_excel("data/csss_tweets.xlsx")
# Use head() to inspect the first five rows
head(csss_tweets, n = 5)
# A tibble: 5 × 91
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 1331246991762976769 136572200862… 2021-02-27 17:54:35 InnerSchol… "@We… Twitt…
2 1331246991762976769 136572187371… 2021-02-27 17:54:03 InnerSchol… "@Bo… Twitt…
3 1331246991762976769 136572178780… 2021-02-27 17:53:42 InnerSchol… "@Co… Twitt…
4 1331246991762976769 136572174606… 2021-02-27 17:53:32 InnerSchol… "@Co… Twitt…
5 1331246991762976769 136572164488… 2021-02-27 17:53:08 InnerSchol… "Ano… Twitt…
# ℹ 85 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
# reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
# is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>,
# quote_count <lgl>, reply_count <lgl>, hashtags <lgl>, symbols <lgl>,
# urls_url <lgl>, urls_t.co <lgl>, urls_expanded_url <lgl>, media_url <lgl>,
# media_t.co <lgl>, media_expanded_url <lgl>, media_type <lgl>,
# ext_media_url <lgl>, ext_media_t.co <lgl>, ext_media_expanded_url <lgl>, …
.dta files are the native data format of Stata, a statistical software package
In R, they are read with the haven package
Install and load haven
Read in GPA3.dta with the read_dta() function into a new object gpa_dt
Inspect the object with head()
# Load the haven package
library(haven)
# Use read_dta() to read in data/GPA3.dta
gpa_dt <- read_dta("data/GPA3.dta")
# Inspect the first three rows
head(gpa_dt, n = 3)
# A tibble: 3 × 23
term sat tothrs cumgpa season frstsem crsgpa verbmath trmgpa hssize hsrank
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 920 31 2.25 0 0 2.65 0.484 1.5 10 4
2 2 920 43 2.04 1 0 2.51 0.484 2.25 10 4
3 1 780 28 2.03 0 0 2.87 0.814 2.20 123 102
# ℹ 12 more variables: id <dbl>, spring <dbl>, female <dbl>, black <dbl>,
# white <dbl>, ctrmgpa <dbl>, ctothrs <dbl>, ccrsgpa <dbl>, ccrspop <dbl>,
# cseason <dbl>, hsperc <dbl>, football <dbl>
Let’s create some mock data: One data frame for students, and one for scores.
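The code that builds the mock data is not shown in this section. A minimal reconstruction, inferred from the join outputs in this section and assuming plain `data.frame()` objects, might look like this:

```r
library(dplyr)

# Students 1-4, reconstructed from the join outputs (assumed, not original code)
students <- data.frame(
  student_id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Charlie", "David"),
  major = c("Math", "Physics", "Biology", "Computer Science")
)

# Scores for students 1, 2, 3 and 5: student 4 has no score,
# and student 5 has a score but no row in `students`
scores <- data.frame(
  student_id = c(1, 2, 3, 5),
  score = c(85, 90, 75, 80)
)
```

The deliberate mismatch (student 4 without a score, student 5 without a student record) is what makes the join types behave differently.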
# Left join: Returns all rows from the left table, and the matched rows from the right table
# If there is no match, the result is NA
left_join_result <- left_join(students, scores, by = "student_id")
left_join_result
student_id name major score
1 1 Alice Math 85
2 2 Bob Physics 90
3 3 Charlie Biology 75
4 4 David Computer Science NA
# Right join: Returns all rows from the right table, and the matched rows from the left table.
# If there is no match, the result is NA.
right_join_result <- right_join(students, scores, by = "student_id")
right_join_result
student_id name major score
1 1 Alice Math 85
2 2 Bob Physics 90
3 3 Charlie Biology 75
4 5 <NA> <NA> 80
# Full join: Returns all rows from both tables, matching them where possible.
# Where a row has no match in the other table, the missing values are NA.
full_join_result <- full_join(students, scores, by = "student_id")
full_join_result
student_id name major score
1 1 Alice Math 85
2 2 Bob Physics 90
3 3 Charlie Biology 75
4 4 David Computer Science NA
5 5 <NA> <NA> 80
❓ Why might you choose to use an inner join instead of a left join when analyzing student data alongside their scores and grades?
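As a starting point for the question, here is a minimal sketch of an inner join on the same kind of mock data. The data frames are redefined so the snippet runs on its own, and they are a reconstruction rather than the original code. The `anti_join()` call is an extra that is not covered in this section; it lists the rows an inner join would drop.

```r
library(dplyr)

# Mock data (assumed reconstruction, repeated so this block is self-contained)
students <- data.frame(
  student_id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Charlie", "David"),
  major = c("Math", "Physics", "Biology", "Computer Science")
)
scores <- data.frame(
  student_id = c(1, 2, 3, 5),
  score = c(85, 90, 75, 80)
)

# Inner join: keeps only rows whose student_id appears in BOTH tables,
# so the result has no NA values introduced by the join
inner_join_result <- inner_join(students, scores, by = "student_id")
inner_join_result

# Students who would be dropped by the inner join (no matching score)
unmatched_students <- anti_join(students, scores, by = "student_id")
unmatched_students
```

Here the inner join keeps students 1 to 3 only, which suits analyses that need complete records, while the left join would retain David with a missing score.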