Prepare and Wrangle

LAW Module 1: A Code-A-long

Welcome to the LAW Code-a-long for Module 1

  • Learning Analytics Workflow (LAW) is designed for those seeking an introductory understanding of learning analytics using basic R programming skills, particularly in the context of STEM education research.

  • The following code-a-long is aimed at preparing you for the first section of the case study.


Module Objectives

By the end of this module:

  • Know how to read in the data:
    • Learners will be able to identify and describe different types of learning environments, explaining their unique features and applications in educational research.
  • Characteristics of Data:
    • Learners will gain proficiency in recognizing and categorizing various data formats commonly used in educational research by the end of this section.

Context of the Problem

Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & education54(2), 588-599.

  • Explores specific online activities of students indicating their academic success.
  • Focused on “early warning systems” in higher education.
  • Research was extracted from course-based instructor tracking logs and the BB Vista production server.
  • Data Sources
  • A self-report survey assessing three aspects of students’ motivation
  • Log-trace data, such as data output from the learning management system (LMS)
  • Academic achievement data
  • Discussion board data

Research Questions:

  • Which LMS tracking data variables correlate significantly with student achievement?

  • How accurately can measures of student online activity in an online course site predict student achievement in the course under study?

Loading and Installing packages

Load Pacakges

  • First time using a package
  • Do this ONLY ONCE in the “console”
# Install Packages
install.packages("tidyverse")

Load Pacakges

  • First time using a package
  • Do this ONLY ONCE in the “console”
# Install Packages
install.packages("tidyverse")

Your turn 👉 Your Turn

# Load tidyverse package 
#
#

What is the function to load the tidyverse library?

👉 Your Turn -> Answer

# Load tidyverse package 
library(tidyverse)

If you see red it does not always mean error.

Reading in data

Function

Common {readr} functions to read in different types of data

Using the readr function

  • Use read_csv() function to read in CSV.
#use read_csv function to read in sci-online-classes.csv rename object to online_classes
online_classes <- read_csv("data/sci-online-classes.csv")

#use head() function
knitr::kable(head(online_classes, n=5), format = 'html')

Using the readr function

  • Use read_csv() function to read in CSV.
#use read_csv function to read in sci-online-classes.csv rename object to online_classes
online_classes <- read_csv("data/sci-online-classes.csv")

#use head() function
knitr::kable(head(online_classes, n=5), format = 'html')

Your turn 👉 Your Turn

In the corresponding script do the following:

  • load in the readxl package,
  • load in "data/csss_tweets.xlsx" file save to a new object csss_tweets
  • inspect the data by head() function
# 1. Install and read in the readxl Package 

# 2. use read_excel() function to read in data.
csss_tweets <- 
  
# 3. use head() function to read 
head()

👉 Your Turn -> Answer

#Install and read in the readxl Package 
library(readxl)

#use read_excel() function to read in data/csss_tweets.xlsx
csss_tweets <- read_excel("data/csss_tweets.xlsx")#<<

#use head() function to read 
head(csss_tweets, n = 5)
# A tibble: 5 × 91
  user_id             status_id     created_at          screen_name text  source
  <chr>               <chr>         <dttm>              <chr>       <chr> <chr> 
1 1331246991762976769 136572200862… 2021-02-27 17:54:35 InnerSchol… "@We… Twitt…
2 1331246991762976769 136572187371… 2021-02-27 17:54:03 InnerSchol… "@Bo… Twitt…
3 1331246991762976769 136572178780… 2021-02-27 17:53:42 InnerSchol… "@Co… Twitt…
4 1331246991762976769 136572174606… 2021-02-27 17:53:32 InnerSchol… "@Co… Twitt…
5 1331246991762976769 136572164488… 2021-02-27 17:53:08 InnerSchol… "Ano… Twitt…
# ℹ 85 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
#   reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
#   is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>,
#   quote_count <lgl>, reply_count <lgl>, hashtags <lgl>, symbols <lgl>,
#   urls_url <lgl>, urls_t.co <lgl>, urls_expanded_url <lgl>, media_url <lgl>,
#   media_t.co <lgl>, media_expanded_url <lgl>, media_type <lgl>,
#   ext_media_url <lgl>, ext_media_t.co <lgl>, ext_media_expanded_url <lgl>, …

What did you notice in the printed output?

import Excel Sheet

excel_sheets("data/csss_tweets.xlsx") 
[1] "Sheet1"
csss_tweets <- read_excel("data/csss_tweets.xlsx", sheet = "Sheet1")

#use head() function to read 
head(csss_tweets, n = 5)
# A tibble: 5 × 91
  user_id             status_id     created_at          screen_name text  source
  <chr>               <chr>         <dttm>              <chr>       <chr> <chr> 
1 1331246991762976769 136572200862… 2021-02-27 17:54:35 InnerSchol… "@We… Twitt…
2 1331246991762976769 136572187371… 2021-02-27 17:54:03 InnerSchol… "@Bo… Twitt…
3 1331246991762976769 136572178780… 2021-02-27 17:53:42 InnerSchol… "@Co… Twitt…
4 1331246991762976769 136572174606… 2021-02-27 17:53:32 InnerSchol… "@Co… Twitt…
5 1331246991762976769 136572164488… 2021-02-27 17:53:08 InnerSchol… "Ano… Twitt…
# ℹ 85 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
#   reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
#   is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>,
#   quote_count <lgl>, reply_count <lgl>, hashtags <lgl>, symbols <lgl>,
#   urls_url <lgl>, urls_t.co <lgl>, urls_expanded_url <lgl>, media_url <lgl>,
#   media_t.co <lgl>, media_expanded_url <lgl>, media_type <lgl>,
#   ext_media_url <lgl>, ext_media_t.co <lgl>, ext_media_expanded_url <lgl>, …

Hint: To learn more about functions for this package type:
?read_excel in the script.

👉 Your Turn

  • Do the following:
  • load in the haven package,
  • load in “data/GPA3.dta” file save to a new object called gpa_dt
  • inspect the data with function of your choice
  • explain what you see
# 1. Install and read in the haven function 

# 2. use read_excel() function to read in data/GPA3.dta
gpa_dt <- 
  
# 3. Inspect the data
head()

👉 Your Turn

  • Do the following:
  • load in the haven package,
  • load in “data/GPA3.dta” file save to a new object called gpa_dt
  • inspect the data with function of your choice
  • explain what you see
# 1. Install and read in the haven function 

# 2. use read_dta() function to read in data/GPA3.dta
gpa_dt <- 
  
# 3. Inspect the data
head()

👉 Your Turn -> Answer

# 1. Install and read in the haven function 
library(haven)

# 2. use read_dta() function to read in data/GPA3.dta
gpa_dt <- read_dta("data/GPA3.dta")
  
# 3. Inspect the data
head(gpa_dt, n=3)
# A tibble: 3 × 23
   term   sat tothrs cumgpa season frstsem crsgpa verbmath trmgpa hssize hsrank
  <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>    <dbl>  <dbl>  <dbl>  <dbl>
1     1   920     31   2.25      0       0   2.65    0.484   1.5      10      4
2     2   920     43   2.04      1       0   2.51    0.484   2.25     10      4
3     1   780     28   2.03      0       0   2.87    0.814   2.20    123    102
# ℹ 12 more variables: id <dbl>, spring <dbl>, female <dbl>, black <dbl>,
#   white <dbl>, ctrmgpa <dbl>, ctothrs <dbl>, ccrsgpa <dbl>, ccrspop <dbl>,
#   cseason <dbl>, hsperc <dbl>, football <dbl>

Joins Overview

Let’s create Mock Data Generation

# Create mock data
students <- data.frame(
  student_id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Charlie", "David"),
  major = c("Math", "Physics", "Biology", "Computer Science")
)
students
  student_id    name            major
1          1   Alice             Math
2          2     Bob          Physics
3          3 Charlie          Biology
4          4   David Computer Science



scores <- data.frame(
  student_id = c(1, 2, 3, 5),
  score = c(85, 90, 75, 80)
)
scores
  student_id score
1          1    85
2          2    90
3          3    75
4          5    80
# Inner join: Returns rows that have matching values in both tables
inner_join_result <- inner_join(students, scores, by = "student_id")
inner_join_result
  student_id    name   major score
1          1   Alice    Math    85
2          2     Bob Physics    90
3          3 Charlie Biology    75
# Left join: Returns all rows from the left table, and the matched rows from the right table. If there is no match, the result is NA.
left_join_result <- left_join(students, scores, by = "student_id")
left_join_result
  student_id    name            major score
1          1   Alice             Math    85
2          2     Bob          Physics    90
3          3 Charlie          Biology    75
4          4   David Computer Science    NA
# Right join: Returns all rows from the right table, and the matched rows from the left table. If there is no match, the result is NA.
right_join_result <- right_join(students, scores, by = "student_id")
right_join_result
  student_id    name   major score
1          1   Alice    Math    85
2          2     Bob Physics    90
3          3 Charlie Biology    75
4          5    <NA>    <NA>    80



❓ Why might you choose to use an inner join instead of a left join when analyzing student data alongside their scores and grades?

Full join

# Full join: Returns all rows when there is a match in one of the tables. If there is no match, the result is NA for the missing values.
full_join_result <- full_join(students, scores, by = "student_id")
full_join_result
  student_id    name            major score
1          1   Alice             Math    85
2          2     Bob          Physics    90
3          3 Charlie          Biology    75
4          4   David Computer Science    NA
5          5    <NA>             <NA>    80

What’s Next?