Prepare and Wrangle

LAW Module 1: A Code-A-long

Welcome to the LAW Code-a-long for Module 1

  • Learning Analytics Workflow (LAW) is designed for those seeking an introductory understanding of learning analytics using basic R programming skills, particularly in the context of STEM education research.

  • The following code-a-long is aimed at preparing you for the first section of the case study.


Module Objectives

By the end of this module:

  • Know how to read in the data:
    • Learners will be able to identify and describe different types of learning environments, explaining their unique features and applications in educational research.
  • Characteristics of Data:
    • Learners will gain proficiency in recognizing and categorizing various data formats commonly used in educational research by the end of this section.

Context of the Problem

Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & education54(2), 588-599.

  • Explores specific online activities of students indicating their academic success.
  • Focused on “early warning systems” in higher education.
  • Research was extracted from course-based instructor tracking logs and the BB Vista production server.
  • Data Sources
  • A self-report survey assessing three aspects of students’ motivation
  • Log-trace data, such as data output from the learning management system (LMS)
  • Academic achievement data
  • Discussion board data

Research Questions:

  • Which LMS tracking data variables correlate significantly with student achievement?

  • How accurately can measures of student online activity in an online course site predict student achievement in the course under study?

Installing and Loading Packages

  • When you start a new session, you have to first install most packages to your development environment (RStudio)
  • install.packages() downloads a package from a repository, typically the Comprehensive R Archive Network (CRAN) unless you say otherwise
  • You only have to do this once per session
  • When installing, red output text usually doesn’t necessarily mean there is a problem!
  • Use install.packages() to install the tidyverse package

  • Use quotations around the package name since it is identified as a string variable

# Install the tidyverse package
install.packages("tidyverse")
  • Once you have a package installed, you load it into your session with library()
  • Use library() to load the tidyverse package

  • Since the package has been installed as an object in RStudio, don’t use quotations around the package name when loading it

# Load the tidyverse package
library(tidyverse)

Reading in Data with readr

  • file File to path

  • col_names Uses the first row of raw data available as the column names if TRUE, creates placeholder column names if FALSE

  • na Looks for text in the data to treat as non-applicable. Can be a list like c("", "na")

  • skip Tells the function to “skip” ahead by reading at a given row of data.

  • col_types Coerces columns to specific types of data. If NULL, the function guesses the type of each column, which is useful but not robust

  • Use read_csv() to read in sci-online.classes.csv

  • This file is located in your data folder

#use read_csv function to read in sci-online-classes.csv rename object to online_classes
online_classes <- read_csv("data/sci-online-classes.csv")

#You can take a peek at the read in data with head()
knitr::kable(head(online_classes, n=5), format = 'html')

Other Data Types

  • Excel files are proprietary to Microsoft but also very ubiquitous

  • R solves this with readxl

  • Load in the readxl package

  • Read in csss_tweets.xlsx with the read_excel() function to a new object csss_tweets

  • Inspect the object with head()

#Install and read in the readxl Package 
library(readxl)

#use read_excel() function to read in data/csss_tweets.xlsx
csss_tweets <- read_excel("data/csss_tweets.xlsx")#<<

#use head() function to read 
head(csss_tweets, n = 5)
# A tibble: 5 × 91
  user_id             status_id     created_at          screen_name text  source
  <chr>               <chr>         <dttm>              <chr>       <chr> <chr> 
1 1331246991762976769 136572200862… 2021-02-27 17:54:35 InnerSchol… "@We… Twitt…
2 1331246991762976769 136572187371… 2021-02-27 17:54:03 InnerSchol… "@Bo… Twitt…
3 1331246991762976769 136572178780… 2021-02-27 17:53:42 InnerSchol… "@Co… Twitt…
4 1331246991762976769 136572174606… 2021-02-27 17:53:32 InnerSchol… "@Co… Twitt…
5 1331246991762976769 136572164488… 2021-02-27 17:53:08 InnerSchol… "Ano… Twitt…
# ℹ 85 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
#   reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
#   is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>,
#   quote_count <lgl>, reply_count <lgl>, hashtags <lgl>, symbols <lgl>,
#   urls_url <lgl>, urls_t.co <lgl>, urls_expanded_url <lgl>, media_url <lgl>,
#   media_t.co <lgl>, media_expanded_url <lgl>, media_type <lgl>,
#   ext_media_url <lgl>, ext_media_t.co <lgl>, ext_media_expanded_url <lgl>, …
  • .dta files are used by statistical software like R, Python, Stata, and SAS

  • They are accessed via haven

  • Install and load in haven

  • Read in GPA3.dta with the read_dta() function to a new object gpa_dt

  • Inspect the object with head()

# Install and read in the haven function 
library(haven)

# Use read_dta() function to read in data/GPA3.dta
gpa_dt <- read_dta("data/GPA3.dta")
  
# Inspect the data
head(gpa_dt, n=3)
# A tibble: 3 × 23
   term   sat tothrs cumgpa season frstsem crsgpa verbmath trmgpa hssize hsrank
  <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>    <dbl>  <dbl>  <dbl>  <dbl>
1     1   920     31   2.25      0       0   2.65    0.484   1.5      10      4
2     2   920     43   2.04      1       0   2.51    0.484   2.25     10      4
3     1   780     28   2.03      0       0   2.87    0.814   2.20    123    102
# ℹ 12 more variables: id <dbl>, spring <dbl>, female <dbl>, black <dbl>,
#   white <dbl>, ctrmgpa <dbl>, ctothrs <dbl>, ccrsgpa <dbl>, ccrspop <dbl>,
#   cseason <dbl>, hsperc <dbl>, football <dbl>

Joining Data

Let’s create some mock data: One data frame for students, and one for scores.

# Create mock data
students <- data.frame(
  student_id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Charlie", "David"),
  major = c("Math", "Physics", "Biology", "Computer Science")
)
students
  student_id    name            major
1          1   Alice             Math
2          2     Bob          Physics
3          3 Charlie          Biology
4          4   David Computer Science



scores <- data.frame(
  student_id = c(1, 2, 3, 5),
  score = c(85, 90, 75, 80)
)
scores
  student_id score
1          1    85
2          2    90
3          3    75
4          5    80
library(tidyverse)
# Inner join: Returns rows that have matching values in both tables
inner_join_result <- inner_join(students, scores, by = "student_id")
inner_join_result
  student_id    name   major score
1          1   Alice    Math    85
2          2     Bob Physics    90
3          3 Charlie Biology    75
# Left join: Returns all rows from the left table, and the matched rows from the right table
# If there is no match, the result is NA
left_join_result <- left_join(students, scores, by = "student_id")
left_join_result
  student_id    name            major score
1          1   Alice             Math    85
2          2     Bob          Physics    90
3          3 Charlie          Biology    75
4          4   David Computer Science    NA
# Right join: Returns all rows from the right table, and the matched rows from the left table. 
#If there is no match, the result is NA.
right_join_result <- right_join(students, scores, by = "student_id")
right_join_result
  student_id    name   major score
1          1   Alice    Math    85
2          2     Bob Physics    90
3          3 Charlie Biology    75
4          5    <NA>    <NA>    80
# Full join: Returns all rows when there is a match in one of the tables. 
# If there is no match, the result is NA for the missing values.
full_join_result <- full_join(students, scores, by = "student_id")
full_join_result
  student_id    name            major score
1          1   Alice             Math    85
2          2     Bob          Physics    90
3          3 Charlie          Biology    75
4          4   David Computer Science    NA
5          5    <NA>             <NA>    80

❓ Why might you choose to use an inner join instead of a left join when analyzing student data alongside their scores and grades?

What’s Next?