Prepare and Wrangle

Foundations Module 1: A Code-A-long

Welcome to Foundations code along for Module 1

Foundations of Learning Analytics are designed for those seeking an introductory understanding of learning analytics and either basic R programming skills or basic Python skills, particularly in the context of STEM education research. The following code along is aimed at preparing you for the first section of the case study.



Module Objectives

By the end of this module:

  • Know how to read in the data:
    • Learners will be able to identify and describe different types of learning environments, explaining their unique features and applications in educational research.
  • Characteristics of Data:
    • Learners will gain proficiency in recognizing and categorizing various data formats commonly used in educational research by the end of this section.

Context of the Problem

Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & education54(2), 588-599.

  • Explores specific online activities of students indicating their academic success.
  • Focused on “early warning systems” in higher education.
  • Research was extracted from course-based instructor tracking logs and the BB Vista production server.
  • Data Sources
  • A self-report survey assessing three aspects of students’ motivation
  • Log-trace data, such as data output from the learning management system (LMS)
  • Academic achievement data
  • Discussion board data

Research Questions:

  • Which LMS tracking data variables correlate significantly with student achievement?

  • How accurately can measures of student online activity in an online course site predict student achievement in the course under study?

Loading and Installing packages

Load Pacakges

  • First time using a package
  • Do this ONLY ONCE in the “console”
# Install Packages
install.packages("tidyverse")

Load Pacakges

  • First time using a package
  • Do this ONLY ONCE in the “console”
# Install Packages
install.packages("tidyverse")

Your turn 👉 Your Turn

# Load tidyverse package 
#
#

What is the function to load the tidyverse library?

👉 Your Turn -> Answer

# Load tidyverse package 
library(tidyverse)

If you see red it does not always mean error.

Reading in data

Function

Common {readr} functions to read in different types of data

Using the readr function

  • Use read_csv() function to read in CSV.
#use read_csv function to read in sci-online-classes.csv rename object to online_classes
online_classes <- read_csv("data/sci-online-classes.csv")

#use head() function
knitr::kable(head(online_classes, n=5), format = 'html')

Using the readr function

  • Use read_csv() function to read in CSV.
#use read_csv function to read in sci-online-classes.csv rename object to online_classes
online_classes <- read_csv("data/sci-online-classes.csv")

#use head() function
knitr::kable(head(online_classes, n=5), format = 'html')

Your turn 👉 Your Turn

In the corresponding script do the following:

  • load in the readxl package,
  • load in "data/csss_tweets.xlsx" file save to a new object csss_tweets
  • inspect the data by head() function
# 1. Install and read in the readxl Package 

# 2. use read_excel() function to read in data.
csss_tweets <- 
  
# 3. use head() function to read 
head()

👉 Your Turn -> Answer

#Install and read in the readxl Package 
library(readxl)

#use read_excel() function to read in data/csss_tweets.xlsx
csss_tweets <- read_excel("data/csss_tweets.xlsx")#<<

#use head() function to read 
head(csss_tweets, n = 5)
# A tibble: 5 × 91
  user_id             status_id     created_at          screen_name text  source
  <chr>               <chr>         <dttm>              <chr>       <chr> <chr> 
1 1331246991762976769 136572200862… 2021-02-27 17:54:35 InnerSchol… "@We… Twitt…
2 1331246991762976769 136572187371… 2021-02-27 17:54:03 InnerSchol… "@Bo… Twitt…
3 1331246991762976769 136572178780… 2021-02-27 17:53:42 InnerSchol… "@Co… Twitt…
4 1331246991762976769 136572174606… 2021-02-27 17:53:32 InnerSchol… "@Co… Twitt…
5 1331246991762976769 136572164488… 2021-02-27 17:53:08 InnerSchol… "Ano… Twitt…
# ℹ 85 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
#   reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
#   is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>,
#   quote_count <lgl>, reply_count <lgl>, hashtags <lgl>, symbols <lgl>,
#   urls_url <lgl>, urls_t.co <lgl>, urls_expanded_url <lgl>, media_url <lgl>,
#   media_t.co <lgl>, media_expanded_url <lgl>, media_type <lgl>,
#   ext_media_url <lgl>, ext_media_t.co <lgl>, ext_media_expanded_url <lgl>, …

What did you notice in the printed output?

import Excel Sheet

excel_sheets("data/csss_tweets.xlsx") 
[1] "Sheet1"
csss_tweets <- read_excel("data/csss_tweets.xlsx", sheet = "Sheet1")

#use head() function to read 
head(csss_tweets, n = 5)
# A tibble: 5 × 91
  user_id             status_id     created_at          screen_name text  source
  <chr>               <chr>         <dttm>              <chr>       <chr> <chr> 
1 1331246991762976769 136572200862… 2021-02-27 17:54:35 InnerSchol… "@We… Twitt…
2 1331246991762976769 136572187371… 2021-02-27 17:54:03 InnerSchol… "@Bo… Twitt…
3 1331246991762976769 136572178780… 2021-02-27 17:53:42 InnerSchol… "@Co… Twitt…
4 1331246991762976769 136572174606… 2021-02-27 17:53:32 InnerSchol… "@Co… Twitt…
5 1331246991762976769 136572164488… 2021-02-27 17:53:08 InnerSchol… "Ano… Twitt…
# ℹ 85 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
#   reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
#   is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>,
#   quote_count <lgl>, reply_count <lgl>, hashtags <lgl>, symbols <lgl>,
#   urls_url <lgl>, urls_t.co <lgl>, urls_expanded_url <lgl>, media_url <lgl>,
#   media_t.co <lgl>, media_expanded_url <lgl>, media_type <lgl>,
#   ext_media_url <lgl>, ext_media_t.co <lgl>, ext_media_expanded_url <lgl>, …

Hint: To learn more about functions for this package type:
?read_excel in the script.

👉 Your Turn

  • Do the following:
  • load in the haven package,
  • load in “data/GPA3.dta” file save to a new object called gpa_dt
  • inspect the data with function of your choice
  • explain what you see
# 1. Install and read in the haven function 

# 2. use read_excel() function to read in data/GPA3.dta
gpa_dt <- 
  
# 3. Inspect the data
head()

👉 Your Turn

  • Do the following:
  • load in the haven package,
  • load in “data/GPA3.dta” file save to a new object called gpa_dt
  • inspect the data with function of your choice
  • explain what you see
# 1. Install and read in the haven function 

# 2. use read_dta() function to read in data/GPA3.dta
gpa_dt <- 
  
# 3. Inspect the data
head()

👉 Your Turn -> Answer

# 1. Install and read in the haven function 
library(haven)

# 2. use read_dta() function to read in data/GPA3.dta
gpa_dt <- read_dta("data/GPA3.dta")
  
# 3. Inspect the data
head(gpa_dt, n=3)
# A tibble: 3 × 23
   term   sat tothrs cumgpa season frstsem crsgpa verbmath trmgpa hssize hsrank
  <dbl> <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>    <dbl>  <dbl>  <dbl>  <dbl>
1     1   920     31   2.25      0       0   2.65    0.484   1.5      10      4
2     2   920     43   2.04      1       0   2.51    0.484   2.25     10      4
3     1   780     28   2.03      0       0   2.87    0.814   2.20    123    102
# ℹ 12 more variables: id <dbl>, spring <dbl>, female <dbl>, black <dbl>,
#   white <dbl>, ctrmgpa <dbl>, ctothrs <dbl>, ccrsgpa <dbl>, ccrspop <dbl>,
#   cseason <dbl>, hsperc <dbl>, football <dbl>

Joins Overview

Let’s create Mock Data Generation

# Create mock data
students <- data.frame(
  student_id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Charlie", "David"),
  major = c("Math", "Physics", "Biology", "Computer Science")
)
students
  student_id    name            major
1          1   Alice             Math
2          2     Bob          Physics
3          3 Charlie          Biology
4          4   David Computer Science



scores <- data.frame(
  student_id = c(1, 2, 3, 5),
  score = c(85, 90, 75, 80)
)
scores
  student_id score
1          1    85
2          2    90
3          3    75
4          5    80
# Inner join: Returns rows that have matching values in both tables
inner_join_result <- inner_join(students, scores, by = "student_id")
inner_join_result
  student_id    name   major score
1          1   Alice    Math    85
2          2     Bob Physics    90
3          3 Charlie Biology    75
# Left join: Returns all rows from the left table, and the matched rows from the right table. If there is no match, the result is NA.
left_join_result <- left_join(students, scores, by = "student_id")
left_join_result
  student_id    name            major score
1          1   Alice             Math    85
2          2     Bob          Physics    90
3          3 Charlie          Biology    75
4          4   David Computer Science    NA
# Right join: Returns all rows from the right table, and the matched rows from the left table. If there is no match, the result is NA.
right_join_result <- right_join(students, scores, by = "student_id")
right_join_result
  student_id    name   major score
1          1   Alice    Math    85
2          2     Bob Physics    90
3          3 Charlie Biology    75
4          5    <NA>    <NA>    80



❓ Why might you choose to use an inner join instead of a left join when analyzing student data alongside their scores and grades?

Full join

# Full join: Returns all rows when there is a match in one of the tables. If there is no match, the result is NA for the missing values.
full_join_result <- full_join(students, scores, by = "student_id")
full_join_result
  student_id    name            major score
1          1   Alice             Math    85
2          2     Bob          Physics    90
3          3 Charlie          Biology    75
4          4   David Computer Science    NA
5          5    <NA>             <NA>    80

What’s next?





  • Complete the Prepare and Wrangle parts of the Case Study.
  • Complete the Badge requirement document Foundations badge - Data Sources
  • Do required readings for the next Foundations Module 2.