Foundations Module 1: A Code-A-long
Foundations of Learning Analytics
are designed for those seeking an introductory understanding of learning analytics and either basic R programming skills or basic Python skills, particularly in the context of STEM education research. The following code along is aimed at preparing you for the first section of the case study.
By the end of this module:
Data
:
Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & education, 54(2), 588-599.
Research Questions:
Which LMS tracking data variables correlate significantly with student achievement?
How accurately can measures of student online activity in an online course site predict student achievement in the course under study?
Common {readr} functions to read in different types of data
read_csv()
function to read in CSV.Your turn 👉 Your Turn ⤵
In the corresponding script do the following:
readxl
package,"data/csss_tweets.xlsx"
file save to a new object csss_tweetshead()
function#Install and read in the readxl Package
library(readxl)
#use read_excel() function to read in data/csss_tweets.xlsx
csss_tweets <- read_excel("data/csss_tweets.xlsx")#<<
#use head() function to read
head(csss_tweets, n = 5)
# A tibble: 5 × 91
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 1331246991762976769 136572200862… 2021-02-27 17:54:35 InnerSchol… "@We… Twitt…
2 1331246991762976769 136572187371… 2021-02-27 17:54:03 InnerSchol… "@Bo… Twitt…
3 1331246991762976769 136572178780… 2021-02-27 17:53:42 InnerSchol… "@Co… Twitt…
4 1331246991762976769 136572174606… 2021-02-27 17:53:32 InnerSchol… "@Co… Twitt…
5 1331246991762976769 136572164488… 2021-02-27 17:53:08 InnerSchol… "Ano… Twitt…
# ℹ 85 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
# reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
# is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>,
# quote_count <lgl>, reply_count <lgl>, hashtags <lgl>, symbols <lgl>,
# urls_url <lgl>, urls_t.co <lgl>, urls_expanded_url <lgl>, media_url <lgl>,
# media_t.co <lgl>, media_expanded_url <lgl>, media_type <lgl>,
# ext_media_url <lgl>, ext_media_t.co <lgl>, ext_media_expanded_url <lgl>, …
What did you notice in the printed output?
[1] "Sheet1"
csss_tweets <- read_excel("data/csss_tweets.xlsx", sheet = "Sheet1")
#use head() function to read
head(csss_tweets, n = 5)
# A tibble: 5 × 91
user_id status_id created_at screen_name text source
<chr> <chr> <dttm> <chr> <chr> <chr>
1 1331246991762976769 136572200862… 2021-02-27 17:54:35 InnerSchol… "@We… Twitt…
2 1331246991762976769 136572187371… 2021-02-27 17:54:03 InnerSchol… "@Bo… Twitt…
3 1331246991762976769 136572178780… 2021-02-27 17:53:42 InnerSchol… "@Co… Twitt…
4 1331246991762976769 136572174606… 2021-02-27 17:53:32 InnerSchol… "@Co… Twitt…
5 1331246991762976769 136572164488… 2021-02-27 17:53:08 InnerSchol… "Ano… Twitt…
# ℹ 85 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
# reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
# is_retweet <lgl>, favorite_count <dbl>, retweet_count <dbl>,
# quote_count <lgl>, reply_count <lgl>, hashtags <lgl>, symbols <lgl>,
# urls_url <lgl>, urls_t.co <lgl>, urls_expanded_url <lgl>, media_url <lgl>,
# media_t.co <lgl>, media_expanded_url <lgl>, media_type <lgl>,
# ext_media_url <lgl>, ext_media_t.co <lgl>, ext_media_expanded_url <lgl>, …
Hint: To learn more about functions for this package type:
?read_excel in the script.
haven
package,haven
package,👉 Your Turn ⤵ -> Answer
# 1. Install and read in the haven function
library(haven)
# 2. use read_dta() function to read in data/GPA3.dta
gpa_dt <- read_dta("data/GPA3.dta")
# 3. Inspect the data
head(gpa_dt, n=3)
# A tibble: 3 × 23
term sat tothrs cumgpa season frstsem crsgpa verbmath trmgpa hssize hsrank
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 920 31 2.25 0 0 2.65 0.484 1.5 10 4
2 2 920 43 2.04 1 0 2.51 0.484 2.25 10 4
3 1 780 28 2.03 0 0 2.87 0.814 2.20 123 102
# ℹ 12 more variables: id <dbl>, spring <dbl>, female <dbl>, black <dbl>,
# white <dbl>, ctrmgpa <dbl>, ctothrs <dbl>, ccrsgpa <dbl>, ccrspop <dbl>,
# cseason <dbl>, hsperc <dbl>, football <dbl>
Let’s create Mock Data Generation
# Left join: Returns all rows from the left table, and the matched rows from the right table. If there is no match, the result is NA.
left_join_result <- left_join(students, scores, by = "student_id")
left_join_result
student_id name major score
1 1 Alice Math 85
2 2 Bob Physics 90
3 3 Charlie Biology 75
4 4 David Computer Science NA
# Right join: Returns all rows from the right table, and the matched rows from the left table. If there is no match, the result is NA.
right_join_result <- right_join(students, scores, by = "student_id")
right_join_result
student_id name major score
1 1 Alice Math 85
2 2 Bob Physics 90
3 3 Charlie Biology 75
4 5 <NA> <NA> 80
❓ Why might you choose to use an inner join instead of a left join when analyzing student data alongside their scores and grades?
# Full join: Returns all rows when there is a match in one of the tables. If there is no match, the result is NA for the missing values.
full_join_result <- full_join(students, scores, by = "student_id")
full_join_result
student_id name major score
1 1 Alice Math 85
2 2 Bob Physics 90
3 3 Charlie Biology 75
4 4 David Computer Science NA
5 5 <NA> <NA> 80
What’s next?
Prepare
and Wrangle
parts of the Case Study.