Narrated: Foundations Case Study

Independent/Group work




July 17, 2024


We will focus on online science classes provided through a state-wide online virtual school and conduct an analysis that helps predict students’ performance in these online courses. This case study is guided by a foundational study in Learning Analytics that illustrates how analyses like these can be used to develop an early warning system for educators to identify students at risk of failing and intervene before that happens.

Over the next labs we will dive into the Learning Analytics Workflow as follows:

Figure 1. Steps of Data-Intensive Research Workflow

  1. Prepare: Prior to analysis, it’s critical to understand the context and data sources you’re working with so you can formulate useful and answerable questions. You’ll also need to become familiar with and load essential packages for analysis, and learn to load and view the data for analysis.
  2. Wrangle: Wrangling data entails the work of manipulating, cleaning, transforming, and merging data. In Part 2 we focus on importing CSV files, tidying and joining our data.
  3. Explore: In Part 3, we use basic data visualization and calculate some summary statistics to explore our data and see what insight it provides in response to our questions.
  4. Model: After identifying variables that may be related to student performance through exploratory analysis, we’ll look at correlations and create some simple models of our data using linear regression.
  5. Communicate: To wrap up our case study, we’ll develop our first “data product” and share our analyses and findings by creating our first web page using Markdown.
  6. Change Idea: Having developed a webpage using Markdown, share your findings with your colleagues. The page will include interactive plots and a detailed explanation of the analysis process, serving as a case study for other educators in your school. Present your findings at a staff meeting, advocating for a broader adoption of data-driven strategies across curriculums.

Module 1: Prepare and Wrangle


This case study is guided by a well-cited publication from two authors who have made numerous contributions to the field of Learning Analytics over the years. The article focuses on “early warning systems” in higher education, where adoption of learning management systems (LMS) like Moodle and Canvas gained a quicker foothold.

Macfadyen, L. P., & Dawson, S. (2010). Mining LMS data to develop an “early warning system” for educators: A proof of concept. Computers & Education, 54(2), 588-599.

ABOUT the study

Previous research has indicated that universities and colleges could use Learning Management System (LMS) data to create reporting tools that identify at-risk students and enable prompt pedagogical interventions. The study confirms and extends this proposition by presenting data from an international research project investigating which student online activities accurately predict academic achievement.

The data analyzed in this exploratory research was extracted from the course-based instructor tracking logs and the BB Vista production server.

Data collected on each student included ‘whole term’ counts for frequency of usage of course materials and tools supporting content delivery, engagement and discussion, assessment and administration/management. In addition, tracking data indicating total time spent on certain tool-based activities (assessments, assignments, total time online) offered a total measure of individual student time on task.

The authors used scatter plots to identify potential relationships between the variables under investigation, followed by a simple correlation analysis of each variable to further interrogate the significance of selected variables as indicators of student achievement. Finally, a multiple linear regression analysis was conducted to develop a predictive model in which a student’s final grade was the continuous dependent variable.

Introduction to the Stakeholder

Name: Alex Johnson

Role: University Science Professor

Experience: 5 years teaching, enthusiastic about integrating technology in education

Goal: Alex aims to improve student engagement and performance in her online science classes.

Teacher Persona: Alex begins by understanding the importance of data analysis in identifying students who might need extra support. The cited foundational study motivates her to explore similar analyses to develop her own early warning system.

Load Libraries

Remember libraries are also called packages. They are shareable collections of code that can contain functions, data, and/or documentation and extend the functionality of the coding language.

  • Tidyverse: a collection of R packages designed for data manipulation, visualization, and analysis.

#Load Libraries below needed for analysis
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors

Data Sources

Data Source #1: Log Data

Log-trace data is data generated from our interactions with digital technologies, such as archived data from social media postings. In education, an increasingly common source of log-trace data is that generated from interactions with LMS and other digital tools.

The data we will use has already been “wrangled” quite a bit and is a summary type of log-trace data: the number of minutes students spent on the course. While this data type is fairly straightforward, there are even more complex sources of log-trace data out there (e.g., time stamps associated with when students started and stopped accessing the course).

Variable Description:

Variable            Description
student_id          student’s ID at the institution
course_id           abbreviation for course, course number, semester
gender              male/female/NA
enrollment_reason   reason student decided to take the course
enrollment_status   approved/enrolled, dropped, withdrawn
time_spent          time spent in minutes for the entire course

Course Acronym description

  • “AnPhA” = “Anatomy”,
  • “BioA” = “Biology”,
  • “FrScA” = “Forensics”,
  • “OcnA” = “Oceanography”,
  • “PhysA” = “Physics”

Data Source #2: Academic Achievement Data

Variable Description:

Variable                Description
total_points_possible   available points for the course
total_points_earned     points the student earned for the entire course

Data Source #3: Self-Report Survey

The third data source is a self-report survey, collected before the start of the course. The survey included ten items, each corresponding to one of three motivation measures: interest, utility value, and perceived competence. These measures were chosen because they align with one way of thinking about students’ motivation: the extent to which students expect to do well (their perceived competence) and how much they value what they are learning (their interest and utility value).

Variable Description:

Variable   Description
int        student science interest
tv         tv
Q1-Q10     survey questions

  1. I think this course is an interesting subject. (Interest)
  2. What I am learning in this class is relevant to my life. (Utility value)
  3. I consider this topic to be one of my best subjects. (Perceived competence)
  4. I am not interested in this course. (Interest—reverse coded)
  5. I think I will like learning about this topic. (Interest)
  6. I think what we are studying in this course is useful for me to know. (Utility value)
  7. I don’t feel comfortable when it comes to answering questions in this area. (Perceived competence–reverse coded)
  8. I think this subject is interesting. (Interest)
  9. I find the content of this course to be personally meaningful. (Utility value)
  10. I’ve always wanted to learn more about this subject. (Interest)


Import data

We will need to load and inspect each of the dataframes that we will use for this lab. You will first read about each dataframe and then learn how to load (or read in) the dataframe into the Quarto document.

Time spent

Let’s use the read_csv() function from the {readr} package to import our log-data.csv file directly from our data folder and name this data set time_spent, to help us quickly recall what function it serves in this analysis:

Load the file log-data.csv from the data folder and save the object as time_spent.

Creating new variable

To do that, we need to create a new variable, time_spent, by naming the variable and assigning its value with the <- operator.

Press the green arrow head to run the code below:

#load log-data file from data folder
time_spent <- read_csv("module_1/data/log-data.csv")
Rows: 716 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): course_id, gender, enrollment_reason, enrollment_status
dbl (2): student_id, time_spent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 716 × 6
   student_id course_id    gender enrollment_reason enrollment_status time_spent
        <dbl> <chr>        <chr>  <chr>             <chr>                  <dbl>
 1      60186 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2087.
 2      66693 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2309.
 3      66811 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      5299.
 4      66862 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1747.
 5      67508 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      2668.
 6      70532 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      2938.
 7      77010 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      1533.
 8      85249 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1210.
 9      85411 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled       473.
10      85583 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      5532.
# ℹ 706 more rows
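
To try read_csv() and the <- operator without touching the course files, the sketch below reads a tiny CSV supplied as literal text. The two rows and the object names csv_text and toy_log are made up for illustration; wrapping the text in I() assumes readr 2.0 or later, which uses it to mark a string as data rather than a file path.

```r
library(readr)

# Hypothetical two-row CSV written as literal text (not one of the course files)
csv_text <- "student_id,course_id,time_spent
60186,AnPhA-S116-01,2087
66693,AnPhA-S116-01,2309"

# I() marks the string as literal data instead of a path to a file;
# <- saves the resulting tibble under the name toy_log
toy_log <- read_csv(I(csv_text), show_col_types = FALSE)

# Typing the object name prints it
toy_log
```

As with the real file, read_csv() guesses a class for each column, so student_id and time_spent come back as numbers and course_id as text.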


Load the file gradebook-summary.csv from the data folder and save the object as gradebook.

❗️In R, everything is an object. An object can be a simple value (like a number or a string), a complex structure (like a data frame or a list), or even a function or a model. For example, when you load a CSV file into R and store it in a variable, that variable is an object that contains your dataset.

A dataset typically refers to a collection of data, often stored in a tabular format with rows and columns.

👉 Your Turn

You need to:

  1. First, use the correct function to read in the .csv file and load the gradebook-summary.csv file.
  2. Second, add a function to the code to inspect the data (your choice).
  3. Third, press the green arrow head to run the code.
#load grade book data from data folder
gradebook <- read_csv("module_1/data/gradebook-summary.csv",show_col_types = FALSE)

Attitude survey

Load the file survey.csv from data folder.

👉 Your Turn

You need to:

  1. First, use the correct function to read in the .csv file and load the survey.csv file.
  2. Second, add a function to the code to inspect the data (your choice).
  3. Third, press the green arrow head to run the code.
#load survey data from data folder
#(add code below)
survey <- read_csv("module_1/data/survey.csv",show_col_types = FALSE)

Inspect data

There are several ways you can look at a data object in R and Posit Cloud.

Typing object name

Type the name of your object and run the code:

time_spent

# A tibble: 716 × 6
   student_id course_id    gender enrollment_reason enrollment_status time_spent
        <dbl> <chr>        <chr>  <chr>             <chr>                  <dbl>
 1      60186 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2087.
 2      66693 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2309.
 3      66811 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      5299.
 4      66862 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1747.
 5      67508 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      2668.
 6      70532 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      2938.
 7      77010 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      1533.
 8      85249 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1210.
 9      85411 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled       473.
10      85583 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      5532.
# ℹ 706 more rows

Using glimpse() function

glimpse(gradebook)

Rows: 717
Columns: 4
$ student_id            <dbl> 43146, 44638, 47448, 47979, 48797, 51943, 52326,…
$ course_id             <chr> "FrScA-S216-02", "OcnA-S116-01", "FrScA-S216-01"…
$ total_points_possible <dbl> 1217, 1676, 1232, 1833, 2225, 1222, 1775, 2225, …
$ total_points_earned   <dbl> 1150.00, 1384.23, 1116.00, 1492.73, 1994.75, 70.…

Using Global Environment

Inspecting first and last few rows

#first few rows
head(survey)
# A tibble: 6 × 26
  student_ID course_ID  subject semester section   int   val percomp    tv    q1
  <chr>      <chr>      <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
1 43146      FrScA-S21… FrScA   S216     02        4.2  3.67     4    3.86     4
2 44638      OcnA-S116… OcnA    S116     01        4    3        3    3.57     4
3 47448      FrScA-S21… FrScA   S216     01        4.2  3        3    3.71     5
4 47979      OcnA-S216… OcnA    S216     01        4    3.67     2.5  3.86     4
5 48797      PhysA-S11… PhysA   S116     01        3.8  3.67     3.5  3.71     4
6 51943      FrScA-S21… FrScA   S216     03        3.8  3.67     3.5  3.71     4
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>
#last few rows
tail(survey)
# A tibble: 6 × 26
  student_ID course_ID  subject semester section   int   val percomp    tv    q1
  <chr>      <chr>      <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
1 19         AnPhA-S21… AnPhA   S217     02        4.2  5        5    4.5      5
2 42         FrScA-S21… FrScA   S217     01        4    4        4    4        4
3 52         FrScA-S21… FrScA   S217     03        4.4  2.67     3.5  3.75     4
4 57         FrScA-S21… FrScA   S217     01        4.4  2.33     2.5  3.62     5
5 72         FrScA-S21… FrScA   S217     01        5    3        4    4.25     5
6 80         FrScA-S21… FrScA   S217     01        3.6  2.33     3    3.12     4
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>

Using sample() function

sample(survey)

# A tibble: 662 × 26
   section    q9 subject    tv    q1 date.y              course_ID        q3
   <chr>   <dbl> <chr>   <dbl> <dbl> <dttm>              <chr>         <dbl>
 1 02          3 FrScA    3.86     4 NA                  FrScA-S216-02     4
 2 01          3 OcnA     3.57     4 NA                  OcnA-S116-01      2
 3 01          3 FrScA    3.71     5 NA                  FrScA-S216-01     3
 4 01          4 OcnA     3.86     4 NA                  OcnA-S216-01      2
 5 01          3 PhysA    3.71     4 NA                  PhysA-S116-01     3
 6 03          4 FrScA    3.71     4 NA                  FrScA-S216-03     3
 7 01          4 AnPhA    4        4 NA                  AnPhA-S216-01     4
 8 01          3 PhysA    4        4 2016-01-02 00:41:00 PhysA-S116-01     3
 9 01          2 FrScA    3        5 2015-10-13 14:11:00 FrScA-S116-01     3
10 01          3 FrScA    4.14     5 NA                  FrScA-S216-01     4
# ℹ 652 more rows
# ℹ 18 more variables: student_ID <chr>, q5 <dbl>, q10 <dbl>, q4 <dbl>,
#   date <dttm>, date.x <dttm>, post_uv <dbl>, q7 <dbl>, q6 <dbl>,
#   post_int <dbl>, int <dbl>, post_percomp <dbl>, q2 <dbl>, semester <chr>,
#   val <dbl>, post_tv <dbl>, percomp <dbl>, q8 <dbl>
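
The inspection approaches above can be tried side by side on a small made-up tibble. This is only a sketch: the object toy and its values are hypothetical. It also shows that sample() applied to a data frame shuffles the order of the columns, which is why the column order changes in the output above while the row count stays the same. To draw random rows instead, dplyr offers slice_sample().

```r
library(dplyr)

# Hypothetical toy data standing in for one of the course datasets
toy <- tibble(
  student_id = c(1, 2, 3, 4, 5, 6, 7, 8),
  subject    = c("BioA", "BioA", "OcnA", "OcnA", "PhysA", "PhysA", "FrScA", "FrScA"),
  time_spent = c(120, 340, 95, 410, 220, 180, 305, 60)
)

glimpse(toy)   # one line per column, showing its class and first values
head(toy, 3)   # first three rows
tail(toy, 3)   # last three rows

# sample() on a data frame reorders the columns and keeps every row
shuffled_cols <- sample(toy)

# slice_sample() draws a random subset of rows instead
random_rows <- slice_sample(toy, n = 3)
```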

👉 Your Turn

Inspect the three datasets we loaded and answer the question:

❓ What do you notice? What do you wonder about? Did you note the number of observations and the different variable names? Finally, what about the classes of the variables, such as numeric, integer, character, or logical?


Tidy data

1. Time Spent

Use separate() function from tidyr

We will separate the course_id variable in the time_spent dataset.

The c() function in R is used to combine or concatenate its arguments into a single vector. You pass the values to combine as arguments inside the parentheses.
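
As a minimal sketch of c() on its own (the object name new_names is chosen here for illustration), the call below builds the character vector of column names that separate() will receive:

```r
# c() combines its arguments into a single vector
new_names <- c("subject", "semester", "section")

length(new_names)   # how many elements the vector holds
new_names[2]        # pick out the second element by position
```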

For example, we want to separate the course_id variable into three individual variables: subject, semester, and section.

#separate variable to individual subject, semester and section
time_spent %>%
  separate(course_id,
           c("subject", "semester", "section"))
# A tibble: 716 × 8
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 2 more variables: enrollment_status <chr>, time_spent <dbl>
# A tibble: 716 × 6
   student_id course_id    gender enrollment_reason enrollment_status time_spent
        <dbl> <chr>        <chr>  <chr>             <chr>                  <dbl>
 1      60186 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2087.
 2      66693 AnPhA-S116-… M      Course Unavailab… Approved/Enrolled      2309.
 3      66811 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      5299.
 4      66862 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1747.
 5      67508 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      2668.
 6      70532 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      2938.
 7      77010 AnPhA-S116-… F      Learning Prefere… Approved/Enrolled      1533.
 8      85249 AnPhA-S116-… F      Course Unavailab… Approved/Enrolled      1210.
 9      85411 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled       473.
10      85583 AnPhA-S116-… F      Scheduling Confl… Approved/Enrolled      5532.
# ℹ 706 more rows

Make sure to save it to the time_spent object.

Saving an object is accomplished by using an assignment operator, which looks kind of like an arrow (<-).

#separate variable to individual subject, semester and section and save as same object name
time_spent <- time_spent %>%
  separate(course_id,
           c("subject", "semester", "section"))

# A tibble: 716 × 8
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 2 more variables: enrollment_status <chr>, time_spent <dbl>
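
To see what separate() does on a small scale, here is a sketch with a made-up two-row tibble (toy and toy_separated are hypothetical names). The sep = "-" argument is written out explicitly for clarity. The lesson's code omits it and relies on the default, which splits wherever it finds a non-alphanumeric character.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with a combined course_id column
toy <- tibble(
  student_id = c(60186, 66693),
  course_id  = c("AnPhA-S116-01", "BioA-S216-02")
)

# Split course_id at each hyphen into three new columns,
# then save the result with the assignment operator
toy_separated <- toy %>%
  separate(course_id,
           into = c("subject", "semester", "section"),
           sep = "-")

toy_separated
```

Note that the original course_id column is dropped and the three new columns take its place, which matches the change from 6 to 8 columns in the output above.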

Use mutate() function from dplyr

As you can see from the dataset, the time_spent variable is recorded in minutes rather than hours.

Let’s change that.

We can use mutate() to create a new variable that converts minutes to hours.

#mutate minutes to hours on time spent and save as new variable.
time_spent <- time_spent %>% 
  mutate(time_spent_hours = time_spent / 60)

# A tibble: 716 × 9
   student_id subject semester section gender enrollment_reason                 
        <dbl> <chr>   <chr>    <chr>   <chr>  <chr>                             
 1      60186 AnPhA   S116     01      M      Course Unavailable at Local School
 2      66693 AnPhA   S116     01      M      Course Unavailable at Local School
 3      66811 AnPhA   S116     01      F      Course Unavailable at Local School
 4      66862 AnPhA   S116     01      F      Course Unavailable at Local School
 5      67508 AnPhA   S116     01      F      Scheduling Conflict               
 6      70532 AnPhA   S116     01      F      Learning Preference of the Student
 7      77010 AnPhA   S116     01      F      Learning Preference of the Student
 8      85249 AnPhA   S116     01      F      Course Unavailable at Local School
 9      85411 AnPhA   S116     01      F      Scheduling Conflict               
10      85583 AnPhA   S116     01      F      Scheduling Conflict               
# ℹ 706 more rows
# ℹ 3 more variables: enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>

In R, you can create new variables in a dataset (data frame or tibble) using the mutate() function, which allows you to add new columns to your data frame or modify existing ones.

❗️In this example, gradebook is the data frame, pass_fail is the new variable, and if_else() is a function that assigns “Pass” if the grade is greater than or equal to 50, and “Fail” otherwise.
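
The pass/fail example described above could be sketched as follows. The data here are made up: toy_gradebook is a hypothetical stand-in for gradebook (so the real object is not overwritten), and the grade column is assumed for illustration only.

```r
library(dplyr)

# Hypothetical toy gradebook; the 'grade' column is assumed for illustration
toy_gradebook <- tibble(
  student_id = c(1, 2, 3),
  grade      = c(72, 50, 38)
)

# if_else() returns "Pass" where the condition is TRUE and "Fail" otherwise;
# mutate() stores the result in the new pass_fail column
toy_gradebook <- toy_gradebook %>%
  mutate(pass_fail = if_else(grade >= 50, "Pass", "Fail"))

toy_gradebook
```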

2. Gradebook

Use separate() function from tidyr

Now, we will work on the gradebook dataset. Like the previous dataset, we will separate course_id variable again.

👉 Your Turn

You need to:

  1. First, use the pipe operator to separate course_id variable (like we just did in time_spent).
  2. Second, press the green arrow head to run the code.
#separate the course_id variable and save to 'gradebook' object
gradebook <- gradebook %>% separate(course_id,
           c("subject", "semester", "section"))

# A tibble: 717 × 6
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1      43146 FrScA   S216     02                       1217               1150 
 2      44638 OcnA    S116     01                       1676               1384.
 3      47448 FrScA   S216     01                       1232               1116 
 4      47979 OcnA    S216     01                       1833               1493.
 5      48797 PhysA   S116     01                       2225               1995.
 6      51943 FrScA   S216     03                       1222                 70 
 7      52326 AnPhA   S216     01                       1775               1519.
 8      52446 PhysA   S116     01                       2225               2198 
 9      53447 FrScA   S116     01                       1212               1173 
10      53475 FrScA   S116     02                       1212                  0 
# ℹ 707 more rows

Use the mutate() function from dplyr

As you can see in the gradebook dataframe, total_points_earned is a raw point total, which makes it hard to compare across courses with different numbers of possible points. Therefore, we want to convert it to a proportion of the points possible, expressed as a percentage.

👉 Your Turn

You need to:

  1. First, take total_points_earned, divide it by total_points_possible, and multiply by 100. Save this as proportion_earned.
  2. Second, press the green arrow head to run the code.
# Mutate to a proportion_earned: take 'total points earned', divide by 'total points possible', and multiply by 100. Save as a new variable proportion_earned.
gradebook <- gradebook %>%
  mutate(proportion_earned = (total_points_earned / total_points_possible) * 100)


#inspect data
gradebook
# A tibble: 717 × 7
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1      43146 FrScA   S216     02                       1217               1150 
 2      44638 OcnA    S116     01                       1676               1384.
 3      47448 FrScA   S216     01                       1232               1116 
 4      47979 OcnA    S216     01                       1833               1493.
 5      48797 PhysA   S116     01                       2225               1995.
 6      51943 FrScA   S216     03                       1222                 70 
 7      52326 AnPhA   S216     01                       1775               1519.
 8      52446 PhysA   S116     01                       2225               2198 
 9      53447 FrScA   S116     01                       1212               1173 
10      53475 FrScA   S116     02                       1212                  0 
# ℹ 707 more rows
# ℹ 1 more variable: proportion_earned <dbl>

3. Survey

Let’s process our data. First though, take a quick look again by typing survey into the console or using a preferred viewing method to take a look at the data.

❓ Does it appear to be the correct file? What do the variables seem to be about? What wrangling steps do we need to take? Taking a quick peek at the data helps us begin to formulate answers to these questions and is an important step in any data analysis, especially as we prepare for what we are going to do.

#inspect data to view the column names
survey
# A tibble: 662 × 26
   student_ID course_ID subject semester section   int   val percomp    tv    q1
   <chr>      <chr>     <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 43146      FrScA-S2… FrScA   S216     02        4.2  3.67     4    3.86     4
 2 44638      OcnA-S11… OcnA    S116     01        4    3        3    3.57     4
 3 47448      FrScA-S2… FrScA   S216     01        4.2  3        3    3.71     5
 4 47979      OcnA-S21… OcnA    S216     01        4    3.67     2.5  3.86     4
 5 48797      PhysA-S1… PhysA   S116     01        3.8  3.67     3.5  3.71     4
 6 51943      FrScA-S2… FrScA   S216     03        3.8  3.67     3.5  3.71     4
 7 52326      AnPhA-S2… AnPhA   S216     01        3.6  4        3    4        4
 8 52446      PhysA-S1… PhysA   S116     01        4.2  3.67     3    4        4
 9 53447      FrScA-S1… FrScA   S116     01        3.8  2        3    3        5
10 53475      FrScA-S2… FrScA   S216     01        4.8  3.33     4    4.14     5
# ℹ 652 more rows
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date.x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date.y <dttm>,
#   date <dttm>

💡 Look at the variable names.

👉 Answer below

Add one or more of the things you notice or wonder about the data here:

You may have noticed that student_ID is not formatted exactly the same as student_id in our other files. This is important because in the next section when we “join,” or merge, our data files, these variables will need to have identical names.

Use the Janitor package

Fortunately the {janitor} package has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. There is also a handy function called clean_names() in the {janitor} package for standardizing variable names.

👉 Your Turn

You need to:

  1. First, load the janitor package using the library() function.
  2. Second, clean the columns by passing the survey object to the clean_names() function and saving the result to the survey object.
  3. Third, inspect the data.
  4. Fourth, press the green arrow head to run the code.
# load janitor library to clean variable names that do not match
library(janitor)

#clean columns of the survey data and save to survey object
#(add code below - some code is given to you)
survey <- clean_names(survey)

#inspect data to check for consistency with other data
#(add code below)
survey

# A tibble: 662 × 26
   student_id course_id subject semester section   int   val percomp    tv    q1
   <chr>      <chr>     <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 43146      FrScA-S2… FrScA   S216     02        4.2  3.67     4    3.86     4
 2 44638      OcnA-S11… OcnA    S116     01        4    3        3    3.57     4
 3 47448      FrScA-S2… FrScA   S216     01        4.2  3        3    3.71     5
 4 47979      OcnA-S21… OcnA    S216     01        4    3.67     2.5  3.86     4
 5 48797      PhysA-S1… PhysA   S116     01        3.8  3.67     3.5  3.71     4
 6 51943      FrScA-S2… FrScA   S216     03        3.8  3.67     3.5  3.71     4
 7 52326      AnPhA-S2… AnPhA   S216     01        3.6  4        3    4        4
 8 52446      PhysA-S1… PhysA   S116     01        4.2  3.67     3    4        4
 9 53447      FrScA-S1… FrScA   S116     01        3.8  2        3    3        5
10 53475      FrScA-S2… FrScA   S216     01        4.8  3.33     4    4.14     5
# ℹ 652 more rows
# ℹ 16 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>
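
The renaming that clean_names() performs can be checked on a small made-up data frame before trusting it on real data. In this sketch, messy and cleaned are hypothetical objects whose column names imitate the survey file:

```r
library(janitor)

# Hypothetical data frame with inconsistently styled column names
messy <- data.frame(
  student_ID = c("43146", "44638"),
  course_ID  = c("FrScA-S216-02", "OcnA-S116-01"),
  date.x     = c("2015-09-01", "2015-09-02")
)

# clean_names() converts the names to lowercase snake_case
cleaned <- clean_names(messy)

names(messy)    # names before cleaning
names(cleaned)  # names after cleaning
```

Only the column names change. The values in each column are left as they were.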

Data merging and reshaping

Join gradebook data with time spent datasets

We think a full_join is best for our dataset, as we need all the information from all three datasets.

The full join returns all of the records from both tables, whether or not they have a match in the other table. Rows that match are combined; where no matching row exists, the missing values are filled with NA.
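
A minimal sketch of this behavior, using two made-up tibbles (grades, minutes, and their values are hypothetical): students "2" and "3" appear in both tables, while "1" and "4" each appear in only one.

```r
library(dplyr)

# Hypothetical tables that only partly overlap on student_id
grades  <- tibble(student_id = c("1", "2", "3"), points     = c(90, 75, 60))
minutes <- tibble(student_id = c("2", "3", "4"), time_spent = c(300, 120, 45))

# full_join() keeps every student from either table;
# cells with no matching row are filled with NA
combined <- full_join(grades, minutes, by = "student_id")

combined
```

The result has four rows: student "1" has NA for time_spent and student "4" has NA for points. This is also why a full join can return more rows than either input table.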

When we combine the gradebook and time_spent datasets, we need to identify the columns to match on. In this case, we will use the following variables for the match:

  • student_id

  • subject

  • semester

  • section

#use single join to join data sets by student_id, subject, semester and section.
joined_data <- full_join(gradebook, time_spent, 
                         by = c("student_id", "subject", "semester", "section"))

# A tibble: 830 × 12
   student_id subject semester section total_points_possible total_points_earned
        <dbl> <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1      43146 FrScA   S216     02                       1217               1150 
 2      44638 OcnA    S116     01                       1676               1384.
 3      47448 FrScA   S216     01                       1232               1116 
 4      47979 OcnA    S216     01                       1833               1493.
 5      48797 PhysA   S116     01                       2225               1995.
 6      51943 FrScA   S216     03                       1222                 70 
 7      52326 AnPhA   S216     01                       1775               1519.
 8      52446 PhysA   S116     01                       2225               2198 
 9      53447 FrScA   S116     01                       1212               1173 
10      53475 FrScA   S116     02                       1212                  0 
# ℹ 820 more rows
# ℹ 6 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>

As you can see, we have a new dataset, joined_data, with 12 variables. Those variables came from the gradebook and time_spent datasets.

Join survey dataset with joined dataset

Join Tables

As a reminder, there are different types of joins, but we will mainly focus on full_join for our dataset.

Similar to what we learned in the code-along, combine the joined_data dataset with the survey dataset.

👉 Your Turn

You need to:

1. Use the full_join() function to join joined_data with the survey dataset using the following variables:

  • student_id

  • subject

  • semester

  • section

2. Save to a new object called data_to_explore.

3. Inspect the data by clicking the green arrow head.

# use join to join data sets by student_id, subject, semester and section.
#(add code below - some code has been added for you)
#data_to_explore  <- full_join(survey, joined_data, 
                         #by = c("student_id", "subject", "semester", "section"))

DON’T PANIC if you are getting an error - read below!!

Datasets cannot be joined because the class (type) of “student_id” is different.

To fix this we need the same types of variables to join the datasets - we will turn a numerical variable into a character variable.
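
One way to see this kind of mismatch is to check the class of the key column in each table. The sketch below uses two tiny made-up tables (toy_joined_data and toy_survey are hypothetical stand-ins, not the real datasets) and shows that converting the numeric key to character lets the join go through.

```r
library(dplyr)

# Made-up stand-ins for the two tables: same key, different classes
toy_joined_data <- tibble(student_id = c(43146, 44638),
                          time_spent_hours = c(12.5, 30))
toy_survey <- tibble(student_id = c("43146", "44638"),
                     int = c(4.2, 4.0))

class(toy_joined_data$student_id)  # numeric
class(toy_survey$student_id)       # character

# Convert the numeric key to character so both keys share a class
toy_fixed <- toy_joined_data %>%
  mutate(student_id = as.character(student_id))

toy_result <- full_join(toy_survey, toy_fixed, by = "student_id")
toy_result
```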

👉 Answer below

❓ Check what class student_id is in joined_data compared to the survey data. What do you notice? (Hint: think about the class.)


Use as.character() function
👉 Your Turn

In joined_data, you may notice that student_id is numeric, while in survey it is character, so the two need to have the same class before we can join the data.

You need to:

  1. First, use the mutate() function and the as.character() function to change the student_id variable from numeric to character class.

  2. Save the new value to the student_id variable.

  3. Finally, press the green arrow head to run the code.

#mutate to change variable class from double or numeric to character
#(add code below - some code has been already added)
joined_data <- joined_data %>%
  mutate(student_id = as.character(student_id))
#survey <- subset(survey, select = -student_ID) #drop the column

joined_data

# A tibble: 830 × 12
   student_id subject semester section total_points_possible total_points_earned
   <chr>      <chr>   <chr>    <chr>                   <dbl>               <dbl>
 1 43146      FrScA   S216     02                       1217               1150 
 2 44638      OcnA    S116     01                       1676               1384.
 3 47448      FrScA   S216     01                       1232               1116 
 4 47979      OcnA    S216     01                       1833               1493.
 5 48797      PhysA   S116     01                       2225               1995.
 6 51943      FrScA   S216     03                       1222                 70 
 7 52326      AnPhA   S216     01                       1775               1519.
 8 52446      PhysA   S116     01                       2225               2198 
 9 53447      FrScA   S116     01                       1212               1173 
10 53475      FrScA   S116     02                       1212                  0 
# ℹ 820 more rows
# ℹ 6 more variables: proportion_earned <dbl>, gender <chr>,
#   enrollment_reason <chr>, enrollment_status <chr>, time_spent <dbl>,
#   time_spent_hours <dbl>
Now: full join

👉 Your Turn

Now that the variables are the same class, you need to:

  1. Use the full_join() function to join joined_data with the survey dataset using the following variables:

  • student_id

  • subject

  • semester

  • section

  2. Save to a new object called data_to_explore.

  3. Inspect the data by clicking the green arrow head.

#try again to join the survey and joined_data datasets
#(add code below - some code has been already added)
data_to_explore  <- full_join(survey, joined_data, 
                         by = c("student_id", "subject", "semester", "section"))
# A tibble: 943 × 34
   student_id course_id subject semester section   int   val percomp    tv    q1
   <chr>      <chr>     <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 43146      FrScA-S2… FrScA   S216     02        4.2  3.67     4    3.86     4
 2 44638      OcnA-S11… OcnA    S116     01        4    3        3    3.57     4
 3 47448      FrScA-S2… FrScA   S216     01        4.2  3        3    3.71     5
 4 47979      OcnA-S21… OcnA    S216     01        4    3.67     2.5  3.86     4
 5 48797      PhysA-S1… PhysA   S116     01        3.8  3.67     3.5  3.71     4
 6 51943      FrScA-S2… FrScA   S216     03        3.8  3.67     3.5  3.71     4
 7 52326      AnPhA-S2… AnPhA   S216     01        3.6  4        3    4        4
 8 52446      PhysA-S1… PhysA   S116     01        4.2  3.67     3    4        4
 9 53447      FrScA-S1… FrScA   S116     01        3.8  2        3    3        5
10 53475      FrScA-S2… FrScA   S216     01        4.8  3.33     4    4.14     5
# ℹ 933 more rows
# ℹ 24 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>, total_points_possible <dbl>, total_points_earned <dbl>,
#   proportion_earned <dbl>, gender <chr>, enrollment_reason <chr>,
#   enrollment_status <chr>, time_spent <dbl>, time_spent_hours <dbl>

Let’s transform the subject names to a more convenient format:

data_to_explore <- data_to_explore %>%
  mutate(subject = case_when(
    subject == "AnPhA"  ~ "Anatomy",
    subject == "BioA"   ~ "Biology",
    subject == "FrScA"  ~ "Forensics",
    subject == "OcnA"   ~ "Oceanography",
    subject == "PhysA"  ~ "Physics",
    TRUE ~ subject  #This line keeps the original value if none of the conditions above are met
  ))

data_to_explore
# A tibble: 943 × 34
   student_id course_id subject semester section   int   val percomp    tv    q1
   <chr>      <chr>     <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 43146      FrScA-S2… Forens… S216     02        4.2  3.67     4    3.86     4
 2 44638      OcnA-S11… Oceano… S116     01        4    3        3    3.57     4
 3 47448      FrScA-S2… Forens… S216     01        4.2  3        3    3.71     5
 4 47979      OcnA-S21… Oceano… S216     01        4    3.67     2.5  3.86     4
 5 48797      PhysA-S1… Physics S116     01        3.8  3.67     3.5  3.71     4
 6 51943      FrScA-S2… Forens… S216     03        3.8  3.67     3.5  3.71     4
 7 52326      AnPhA-S2… Anatomy S216     01        3.6  4        3    4        4
 8 52446      PhysA-S1… Physics S116     01        4.2  3.67     3    4        4
 9 53447      FrScA-S1… Forens… S116     01        3.8  2        3    3        5
10 53475      FrScA-S2… Forens… S216     01        4.8  3.33     4    4.14     5
# ℹ 933 more rows
# ℹ 24 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>, total_points_possible <dbl>, total_points_earned <dbl>,
#   proportion_earned <dbl>, gender <chr>, enrollment_reason <chr>,
#   enrollment_status <chr>, time_spent <dbl>, time_spent_hours <dbl>
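
The fallback line in case_when() matters whenever a value does not match any condition. As a small sketch, the vector below includes "XyzA", a made-up code that is not in the dataset, to show that unmatched values pass through unchanged.

```r
library(dplyr)

# "XyzA" is a hypothetical code with no matching condition
codes <- c("AnPhA", "BioA", "XyzA")

labels <- case_when(
  codes == "AnPhA" ~ "Anatomy",
  codes == "BioA"  ~ "Biology",
  TRUE ~ codes  # fallback: keep the original value
)
labels
```

Without the fallback line, the unmatched value would become NA instead of keeping its original code.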

Teacher Persona Alex follows the steps to load and wrangle data, reflecting on how each step can provide insights into her students’ engagement levels. She is particularly interested in understanding patterns in the time students spend on different course materials and how these patterns correlate with their performance.

Filtering and sorting data

Use filter() function from {dplyr} package

We can identify students at risk of failing the course by using the filter() function to keep only students whose grade (proportion_earned) is below 70:

#Filter students with lower grades
at_risk_students <- data_to_explore %>%
  filter(proportion_earned < 70)

#Print the at-risk students
at_risk_students
# A tibble: 155 × 34
   student_id course_id subject semester section   int   val percomp    tv    q1
   <chr>      <chr>     <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 51943      FrScA-S2… Forens… S216     03        3.8  3.67     3.5  3.71     4
 2 54346      OcnA-S11… Oceano… S116     01        4.4  2        3    3.29     4
 3 54567      OcnA-S21… Oceano… S216     02        4    3.67     3.5  3.86     4
 4 55283      FrScA-S1… Forens… S116     01        5    3.67     4    4.43     5
 5 61357      FrScA-S1… Forens… S116     02        5    4.67     4.5  4.86     5
 6 66508      AnPhA-T1… Anatomy T116     01        4    3.67     3.5  3.86     4
 7 67013      AnPhA-S2… Anatomy S216     01        4.2  4        2.5  4.14     4
 8 68768      FrScA-S1… Forens… S116     02        4.6  2.67     4    3.71     4
 9 69937      BioA-S11… Biology S116     01       NA    3.67     2   NA        4
10 71415      FrScA-S2… Forens… S216     01        4.8  3.67     4.5  4.43     5
# ℹ 145 more rows
# ℹ 24 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>, total_points_possible <dbl>, total_points_earned <dbl>,
#   proportion_earned <dbl>, gender <chr>, enrollment_reason <chr>,
#   enrollment_status <chr>, time_spent <dbl>, time_spent_hours <dbl>
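
filter() can also combine several conditions. The sketch below is only an illustration on made-up data: the toy table, its values, and the 10-hour cut-off are assumptions, not thresholds from the case study. Conditions separated by commas must all be true for a row to be kept.

```r
library(dplyr)

# Hypothetical data; column names mirror the case study but values are made up
toy <- tibble(
  student_id = c("a", "b", "c", "d"),
  proportion_earned = c(55, 92, 68, NA),
  time_spent_hours = c(4, 30, 25, 10)
)

# Comma-separated conditions are combined with AND
low_grade_low_time <- toy %>%
  filter(proportion_earned < 70, time_spent_hours < 10)
low_grade_low_time
```

Note that student "d" is dropped because a comparison with NA is not TRUE; filter() only keeps rows where every condition evaluates to TRUE.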

Use arrange() function from {dplyr} package to sort

#sort in ascending order
data_to_explore %>%
  arrange(proportion_earned)
# A tibble: 943 × 34
   student_id course_id subject semester section   int   val percomp    tv    q1
   <chr>      <chr>     <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 61357      FrScA-S1… Forens… S116     02        5    4.67     4.5  4.86     5
 2 66508      AnPhA-T1… Anatomy T116     01        4    3.67     3.5  3.86     4
 3 69937      BioA-S11… Biology S116     01       NA    3.67     2   NA        4
 4 85487      OcnA-S11… Oceano… S116     02        3.4  4.33     3    3.71     3
 5 89465      FrScA-S2… Forens… S216     01        3    3.33     2.5  3.14     3
 6 53475      <NA>      Forens… S116     02       NA   NA       NA   NA       NA
 7 85258      <NA>      Forens… S116     02       NA   NA       NA   NA       NA
 8 85659      <NA>      Forens… S116     01       NA   NA       NA   NA       NA
 9 88568      <NA>      Biology S216     01       NA   NA       NA   NA       NA
10 90995      <NA>      Biology S116     01       NA   NA       NA   NA       NA
# ℹ 933 more rows
# ℹ 24 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>, total_points_possible <dbl>, total_points_earned <dbl>,
#   proportion_earned <dbl>, gender <chr>, enrollment_reason <chr>,
#   enrollment_status <chr>, time_spent <dbl>, time_spent_hours <dbl>
#sort in descending order
data_to_explore %>%
  arrange(desc(proportion_earned))
# A tibble: 943 × 34
   student_id course_id subject semester section   int   val percomp    tv    q1
   <chr>      <chr>     <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 85650      FrScA-S1… Forens… S116     01        4.2  3        3.5  3.57     4
 2 91067      BioA-S11… Biology S116     01        4.2  3.33     4    3.86     4
 3 78153      PhysA-S2… Physics S216     01        4.2  4.67     4.5  4.29     4
 4 88261      FrScA-S1… Forens… S116     01        3.6  2.33     2.5  3        4
 5 66740      OcnA-S11… Oceano… S116     01        4    3        3    3.57     4
 6 86792      FrScA-S1… Forens… S116     01        4.6  4        4    4.29     5
 7 85522      PhysA-S1… Physics S116     01        3.4  3.67     3    3.57     4
 8 66689      FrScA-S2… Forens… S216     01        4.4  3.33     3.5  3.86     4
 9 52446      PhysA-S1… Physics S116     01        4.2  3.67     3    4        4
10 86365      FrScA-S1… Forens… S116     01        4.4  4        4    4.14     4
# ℹ 933 more rows
# ℹ 24 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>, total_points_possible <dbl>, total_points_earned <dbl>,
#   proportion_earned <dbl>, gender <chr>, enrollment_reason <chr>,
#   enrollment_status <chr>, time_spent <dbl>, time_spent_hours <dbl>
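
When sorting a column that contains missing values, it helps to know where they end up. In the small sketch below (made-up data, hypothetical values), arrange() places NA last in both ascending and descending order.

```r
library(dplyr)

# Hypothetical data with one missing grade
toy <- tibble(student_id = c("a", "b", "c"),
              proportion_earned = c(85, NA, 40))

ascending  <- toy %>% arrange(proportion_earned)
descending <- toy %>% arrange(desc(proportion_earned))

ascending$student_id
descending$student_id
```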
👉 Your Turn

Think about what other factors are important for identifying students at risk. Run your code and analyze the results:


Use write_csv() function

Now let’s write the file to our data folder using the write_csv() function so we can save it for later or download it.

# add the function to write data to file to use later
write_csv(data_to_explore, "module_1/data/data_to_explore.csv")

Check the data folder to confirm the location of your new file.
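
If you prefer to confirm the file from code, one option is to check that it exists and read it back. The sketch below writes a tiny made-up table to a temporary file (tempfile() is used here so the example does not touch your project folders; the table and its values are assumptions). It also shows that student_id would be guessed as a number on re-import unless we ask for character.

```r
library(readr)
library(tibble)

# Hypothetical two-row table standing in for data_to_explore
toy <- tibble(student_id = c("43146", "44638"),
              proportion_earned = c(94.5, 82.6))

# Write to a temporary location rather than the project data folder
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

file.exists(path)

# Read it back, keeping student_id as character
toy_back <- read_csv(path, col_types = cols(student_id = col_character()))
toy_back
```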

🛑 Stop here. Congratulations, you finished the first part of the case study.

3. EXPLORE (Module 2)

Exploratory Data Analysis

Use the skimr package

We’ve already wrangled our data, but let’s look at the data frame to make sure it is still correct. A quick way to look at the data frame is with the skimr package.

This output is best for internal use. This is because the output is rich, but not well suited to exporting to a table that you add, for instance, to a Google Docs or Microsoft Word manuscript.

Of course, these values can be entered manually into a table, but we’ll also discuss ways later on to create tables that are ready, or nearly ready, to be added directly to manuscripts.

👉 Your Turn

You need to:

  1. First, load the skimr package with the correct function.

Normally you would do this above, but we want to make sure you know which packages are used with the new functions.

#load library by adding skimr as the package name
#(add code below)
library(skimr)

👉 Your Turn
  2. Second, use the skim() function to view data_to_explore.
#skim the data by adding the skim function in front of the data
#(add code below)
skim(data_to_explore)
Data summary
Name data_to_explore
Number of rows 943
Number of columns 34
Column type frequency:
character 8
numeric 23
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
student_id 0 1.00 2 6 0 879 0
course_id 281 0.70 12 13 0 36 0
subject 0 1.00 7 12 0 5 0
semester 0 1.00 4 4 0 4 0
section 0 1.00 2 2 0 4 0
gender 227 0.76 1 1 0 2 0
enrollment_reason 227 0.76 5 34 0 5 0
enrollment_status 227 0.76 7 17 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
int 293 0.69 4.30 0.60 1.80 4.00 4.40 4.80 5.00 ▁▁▂▆▇
val 287 0.70 3.75 0.75 1.00 3.33 3.67 4.33 5.00 ▁▁▆▇▆
percomp 288 0.69 3.64 0.69 1.50 3.00 3.50 4.00 5.00 ▁▁▇▃▃
tv 292 0.69 4.07 0.59 1.00 3.71 4.12 4.46 5.00 ▁▁▂▇▇
q1 285 0.70 4.34 0.66 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q2 285 0.70 3.66 0.93 1.00 3.00 4.00 4.00 5.00 ▁▂▆▇▃
q3 286 0.70 3.31 0.85 1.00 3.00 3.00 4.00 5.00 ▁▂▇▅▂
q4 289 0.69 4.35 0.80 1.00 4.00 5.00 5.00 5.00 ▁▁▁▆▇
q5 286 0.70 4.28 0.69 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▆
q6 285 0.70 4.05 0.80 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▅
q7 286 0.70 3.96 0.85 1.00 3.00 4.00 5.00 5.00 ▁▁▅▇▆
q8 286 0.70 4.35 0.65 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q9 286 0.70 3.55 0.92 1.00 3.00 4.00 4.00 5.00 ▁▂▇▇▃
q10 285 0.70 4.17 0.87 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▇
post_int 848 0.10 3.88 0.94 1.00 3.50 4.00 4.50 5.00 ▁▁▃▇▇
post_uv 848 0.10 3.48 0.99 1.00 3.00 3.67 4.00 5.00 ▂▂▅▇▅
post_tv 848 0.10 3.71 0.90 1.00 3.29 3.86 4.29 5.00 ▁▂▃▇▆
post_percomp 848 0.10 3.47 0.88 1.00 3.00 3.50 4.00 5.00 ▁▂▂▇▂
total_points_possible 226 0.76 1619.55 387.12 1212.00 1217.00 1676.00 1791.00 2425.00 ▇▂▆▁▃
total_points_earned 226 0.76 1229.98 510.64 0.00 1002.50 1177.13 1572.45 2413.50 ▂▂▇▅▂
proportion_earned 226 0.76 76.23 25.20 0.00 72.36 85.59 92.29 100.74 ▁▁▁▃▇
time_spent 232 0.75 1828.80 1363.13 0.45 895.57 1559.97 2423.94 8870.88 ▇▅▁▁▁
time_spent_hours 232 0.75 30.48 22.72 0.01 14.93 26.00 40.40 147.85 ▇▅▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date_x 393 0.58 2015-09-02 15:40:00 2016-05-24 15:53:00 2015-10-01 15:57:30 536
date_y 848 0.10 2015-09-02 15:31:00 2016-01-22 15:43:00 2016-01-04 13:25:00 95
date 834 0.12 2017-01-23 13:14:00 2017-02-13 13:00:00 2017-01-25 18:43:00 107

In the code chunk below, use group_by() with the subject variable, then skim().

👉 Your Turn
  3. Third, use the group_by() function from dplyr with the subject variable, then the skim() function from the skimr package.
data_to_explore %>%
  group_by(subject) %>%
  skim()
Data summary
Name Piped data
Number of rows 943
Number of columns 34
Column type frequency:
character 7
numeric 23
Group variables subject

Variable type: character

skim_variable subject n_missing complete_rate min max empty n_unique whitespace
student_id Anatomy 0 1.00 2 6 0 207 0
student_id Biology 0 1.00 3 6 0 47 0
student_id Forensics 0 1.00 2 6 0 414 0
student_id Oceanography 0 1.00 2 6 0 171 0
student_id Physics 0 1.00 3 6 0 74 0
course_id Anatomy 58 0.72 13 13 0 7 0
course_id Biology 7 0.86 12 12 0 4 0
course_id Forensics 150 0.66 13 13 0 12 0
course_id Oceanography 55 0.69 12 12 0 9 0
course_id Physics 11 0.85 13 13 0 4 0
semester Anatomy 0 1.00 4 4 0 4 0
semester Biology 0 1.00 4 4 0 4 0
semester Forensics 0 1.00 4 4 0 4 0
semester Oceanography 0 1.00 4 4 0 4 0
semester Physics 0 1.00 4 4 0 4 0
section Anatomy 0 1.00 2 2 0 2 0
section Biology 0 1.00 2 2 0 1 0
section Forensics 0 1.00 2 2 0 4 0
section Oceanography 0 1.00 2 2 0 3 0
section Physics 0 1.00 2 2 0 1 0
gender Anatomy 45 0.79 1 1 0 2 0
gender Biology 4 0.92 1 1 0 2 0
gender Forensics 130 0.70 1 1 0 2 0
gender Oceanography 42 0.76 1 1 0 2 0
gender Physics 6 0.92 1 1 0 2 0
enrollment_reason Anatomy 45 0.79 5 34 0 4 0
enrollment_reason Biology 4 0.92 5 34 0 5 0
enrollment_reason Forensics 130 0.70 5 34 0 5 0
enrollment_reason Oceanography 42 0.76 5 34 0 5 0
enrollment_reason Physics 6 0.92 5 34 0 4 0
enrollment_status Anatomy 45 0.79 7 17 0 2 0
enrollment_status Biology 4 0.92 7 17 0 3 0
enrollment_status Forensics 130 0.70 7 17 0 3 0
enrollment_status Oceanography 42 0.76 7 17 0 3 0
enrollment_status Physics 6 0.92 7 17 0 2 0

Variable type: numeric

skim_variable subject n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
int Anatomy 62 0.70 4.42 0.57 1.80 4.00 4.40 5.00 5.00 ▁▁▁▅▇
int Biology 9 0.82 3.69 0.63 2.40 3.35 3.80 4.00 5.00 ▂▆▇▆▂
int Forensics 154 0.65 4.42 0.52 2.60 4.00 4.40 5.00 5.00 ▁▁▃▃▇
int Oceanography 56 0.68 4.24 0.58 2.20 4.00 4.20 4.60 5.00 ▁▁▂▇▆
int Physics 12 0.84 4.00 0.65 2.20 3.60 4.00 4.40 5.00 ▁▂▆▇▅
val Anatomy 59 0.72 4.29 0.62 1.00 4.00 4.33 4.67 5.00 ▁▁▁▅▇
val Biology 7 0.86 3.50 0.58 2.67 3.00 3.33 3.67 5.00 ▆▆▇▁▂
val Forensics 155 0.64 3.53 0.72 1.67 3.00 3.67 4.00 5.00 ▂▅▇▅▂
val Oceanography 55 0.69 3.62 0.77 1.00 3.00 3.67 4.00 5.00 ▁▁▅▇▃
val Physics 11 0.85 3.89 0.56 2.00 3.67 4.00 4.33 5.00 ▁▁▇▇▃
percomp Anatomy 61 0.71 3.80 0.67 2.00 3.50 4.00 4.50 5.00 ▂▃▇▆▇
percomp Biology 8 0.84 3.34 0.75 2.00 3.00 3.00 4.00 5.00 ▅▇▃▇▂
percomp Forensics 152 0.65 3.64 0.63 1.50 3.00 3.50 4.00 5.00 ▁▁▇▅▃
percomp Oceanography 56 0.68 3.57 0.67 2.00 3.00 3.50 4.00 5.00 ▂▇▆▅▅
percomp Physics 11 0.85 3.56 0.84 2.00 3.00 3.50 4.00 5.00 ▅▅▇▅▇
tv Anatomy 60 0.71 4.35 0.57 1.00 4.00 4.43 4.83 5.00 ▁▁▁▅▇
tv Biology 9 0.82 3.61 0.56 2.29 3.14 3.57 3.86 5.00 ▁▃▇▂▁
tv Forensics 156 0.64 4.04 0.52 2.29 3.71 4.00 4.43 5.00 ▁▂▆▇▅
tv Oceanography 55 0.69 3.97 0.62 1.71 3.71 4.00 4.38 5.00 ▁▁▂▇▅
tv Physics 12 0.84 3.94 0.56 2.14 3.57 4.00 4.29 5.00 ▁▂▃▇▂
q1 Anatomy 59 0.72 4.43 0.64 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q1 Biology 7 0.86 3.76 0.66 2.00 3.00 4.00 4.00 5.00 ▁▃▁▇▁
q1 Forensics 153 0.65 4.50 0.57 2.00 4.00 5.00 5.00 5.00 ▁▁▁▆▇
q1 Oceanography 55 0.69 4.20 0.69 2.00 4.00 4.00 5.00 5.00 ▁▂▁▇▅
q1 Physics 11 0.85 4.03 0.72 2.00 4.00 4.00 4.50 5.00 ▁▃▁▇▃
q2 Anatomy 59 0.72 4.30 0.74 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▇
q2 Biology 7 0.86 3.48 0.71 2.00 3.00 3.00 4.00 5.00 ▁▇▁▆▁
q2 Forensics 152 0.65 3.35 0.89 1.00 3.00 3.00 4.00 5.00 ▁▃▇▆▂
q2 Oceanography 56 0.68 3.46 0.93 1.00 3.00 4.00 4.00 5.00 ▁▂▆▇▂
q2 Physics 11 0.85 4.03 0.76 2.00 4.00 4.00 5.00 5.00 ▁▂▁▇▅
q3 Anatomy 60 0.71 3.53 0.87 1.00 3.00 3.00 4.00 5.00 ▁▁▇▅▃
q3 Biology 7 0.86 2.98 0.87 2.00 2.00 3.00 3.00 5.00 ▅▇▁▂▁
q3 Forensics 152 0.65 3.25 0.79 1.00 3.00 3.00 4.00 5.00 ▁▂▇▃▁
q3 Oceanography 56 0.68 3.30 0.86 2.00 3.00 3.00 4.00 5.00 ▃▇▁▅▂
q3 Physics 11 0.85 3.32 0.95 1.00 3.00 3.00 4.00 5.00 ▁▃▇▆▂
q4 Anatomy 61 0.71 4.52 0.78 1.00 4.00 5.00 5.00 5.00 ▁▁▁▃▇
q4 Biology 7 0.86 3.69 0.81 2.00 3.00 4.00 4.00 5.00 ▂▃▁▇▂
q4 Forensics 154 0.65 4.44 0.74 1.00 4.00 5.00 5.00 5.00 ▁▁▁▅▇
q4 Oceanography 56 0.68 4.29 0.75 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▇
q4 Physics 11 0.85 4.02 0.87 2.00 4.00 4.00 5.00 5.00 ▁▃▁▇▆
q5 Anatomy 59 0.72 4.36 0.69 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q5 Biology 8 0.84 3.88 0.68 2.00 4.00 4.00 4.00 5.00 ▁▃▁▇▂
q5 Forensics 153 0.65 4.38 0.62 2.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q5 Oceanography 55 0.69 4.20 0.77 1.00 4.00 4.00 5.00 5.00 ▁▁▂▇▆
q5 Physics 11 0.85 4.06 0.67 2.00 4.00 4.00 4.00 5.00 ▁▁▁▇▃
q6 Anatomy 59 0.72 4.50 0.65 1.00 4.00 5.00 5.00 5.00 ▁▁▁▆▇
q6 Biology 7 0.86 3.83 0.70 3.00 3.00 4.00 4.00 5.00 ▅▁▇▁▂
q6 Forensics 153 0.65 3.88 0.79 2.00 3.00 4.00 4.00 5.00 ▁▃▁▇▃
q6 Oceanography 55 0.69 3.84 0.84 1.00 3.00 4.00 4.00 5.00 ▁▁▅▇▃
q6 Physics 11 0.85 4.27 0.68 2.00 4.00 4.00 5.00 5.00 ▁▁▁▇▆
q7 Anatomy 60 0.71 4.08 0.85 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▆
q7 Biology 8 0.84 3.71 0.96 2.00 3.00 4.00 4.00 5.00 ▂▇▁▇▆
q7 Forensics 152 0.65 4.02 0.83 1.00 3.00 4.00 5.00 5.00 ▁▁▅▇▆
q7 Oceanography 55 0.69 3.83 0.82 2.00 3.00 4.00 4.00 5.00 ▁▆▁▇▅
q7 Physics 11 0.85 3.81 0.90 2.00 3.00 4.00 4.00 5.00 ▂▅▁▇▅
q8 Anatomy 60 0.71 4.45 0.65 1.00 4.00 5.00 5.00 5.00 ▁▁▁▇▇
q8 Biology 7 0.86 3.79 0.72 2.00 3.00 4.00 4.00 5.00 ▁▃▁▇▂
q8 Forensics 152 0.65 4.45 0.58 3.00 4.00 4.00 5.00 5.00 ▁▁▇▁▇
q8 Oceanography 55 0.69 4.33 0.60 3.00 4.00 4.00 5.00 5.00 ▁▁▇▁▆
q8 Physics 12 0.84 4.05 0.73 2.00 4.00 4.00 4.00 5.00 ▁▁▁▇▃
q9 Anatomy 59 0.72 4.07 0.81 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▆
q9 Biology 7 0.86 3.19 0.86 2.00 3.00 3.00 4.00 5.00 ▃▇▁▅▁
q9 Forensics 154 0.65 3.37 0.91 1.00 3.00 3.00 4.00 5.00 ▁▃▇▆▂
q9 Oceanography 55 0.69 3.54 0.91 1.00 3.00 4.00 4.00 5.00 ▁▂▇▇▃
q9 Physics 11 0.85 3.38 0.83 2.00 3.00 3.00 4.00 5.00 ▃▇▁▇▂
q10 Anatomy 59 0.72 4.35 0.74 1.00 4.00 4.00 5.00 5.00 ▁▁▁▇▇
q10 Biology 8 0.84 3.37 0.89 2.00 3.00 3.00 4.00 5.00 ▂▇▁▅▂
q10 Forensics 152 0.65 4.30 0.81 1.00 4.00 4.00 5.00 5.00 ▁▁▂▆▇
q10 Oceanography 55 0.69 4.13 0.93 1.00 4.00 4.00 5.00 5.00 ▁▁▃▇▇
q10 Physics 11 0.85 3.78 0.89 2.00 3.00 4.00 4.00 5.00 ▂▆▁▇▅
post_int Anatomy 209 0.00 1.00 NA 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
post_int Biology 40 0.18 3.06 0.69 1.75 2.75 3.00 3.25 4.25 ▂▃▇▂▂
post_int Forensics 392 0.10 4.00 0.93 1.50 3.75 4.00 4.88 5.00 ▁▃▁▇▇
post_int Oceanography 157 0.10 4.33 0.56 3.00 4.00 4.25 4.75 5.00 ▁▂▅▅▇
post_int Physics 50 0.32 3.75 0.88 1.50 3.50 4.00 4.25 5.00 ▁▁▂▇▂
post_uv Anatomy 209 0.00 1.00 NA 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
post_uv Biology 40 0.18 3.11 0.80 1.67 2.67 3.33 3.67 4.33 ▂▃▂▇▂
post_uv Forensics 392 0.10 3.38 1.11 1.00 2.67 3.67 4.00 5.00 ▃▃▆▇▆
post_uv Oceanography 157 0.10 3.93 0.88 1.33 3.67 4.00 4.58 5.00 ▁▁▁▇▇
post_uv Physics 50 0.32 3.57 0.66 1.67 3.33 3.67 4.00 4.67 ▁▁▃▇▂
post_tv Anatomy 209 0.00 1.00 NA 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
post_tv Biology 40 0.18 3.08 0.70 1.71 2.86 3.00 3.29 4.29 ▂▂▇▃▂
post_tv Forensics 392 0.10 3.73 0.96 1.29 3.29 4.00 4.43 5.00 ▁▃▅▆▇
post_tv Oceanography 157 0.10 4.16 0.60 3.00 3.86 4.14 4.71 4.86 ▂▁▅▅▇
post_tv Physics 50 0.32 3.67 0.74 1.57 3.43 3.86 4.04 4.71 ▂▁▃▇▅
post_percomp Anatomy 209 0.00 3.00 NA 3.00 3.00 3.00 3.00 3.00 ▁▁▇▁▁
post_percomp Biology 40 0.18 3.06 0.58 2.00 2.50 3.50 3.50 3.50 ▂▃▁▂▇
post_percomp Forensics 392 0.10 3.51 0.96 1.00 3.00 3.50 4.00 5.00 ▁▂▆▇▅
post_percomp Oceanography 157 0.10 3.69 0.75 2.00 3.50 4.00 4.00 5.00 ▃▁▆▇▃
post_percomp Physics 50 0.32 3.40 0.91 1.50 3.00 3.50 4.00 4.50 ▂▂▂▆▇
total_points_possible Anatomy 45 0.79 1776.52 12.28 1655.00 1775.00 1775.00 1775.00 1805.00 ▁▁▁▇▁
total_points_possible Biology 4 0.92 2421.00 2.02 2420.00 2420.00 2420.00 2420.00 2425.00 ▇▁▁▁▂
total_points_possible Forensics 129 0.70 1230.81 38.26 1212.00 1212.00 1217.00 1232.00 1361.00 ▇▁▁▁▁
total_points_possible Oceanography 42 0.76 1738.47 78.48 1480.00 1676.00 1676.00 1833.00 1833.00 ▁▁▇▁▇
total_points_possible Physics 6 0.92 2225.00 0.00 2225.00 2225.00 2225.00 2225.00 2225.00 ▁▁▇▁▁
total_points_earned Anatomy 45 0.79 1340.16 423.45 0.00 1269.09 1511.14 1616.37 1732.52 ▁▁▁▂▇
total_points_earned Biology 4 0.92 1546.66 813.01 0.00 1035.16 1865.13 2198.50 2413.50 ▃▁▁▃▇
total_points_earned Forensics 129 0.70 952.30 305.60 0.00 914.92 1062.75 1130.00 1319.02 ▁▁▁▅▇
total_points_earned Oceanography 42 0.76 1283.25 427.25 0.00 1216.68 1396.85 1572.50 1786.76 ▁▁▁▆▇
total_points_earned Physics 6 0.92 1898.45 469.31 110.00 1891.75 2072.00 2149.12 2216.00 ▁▁▁▂▇
proportion_earned Anatomy 45 0.79 75.44 23.84 0.00 71.57 84.90 90.96 97.61 ▁▁▁▂▇
proportion_earned Biology 4 0.92 63.89 33.58 0.00 42.78 77.07 90.85 99.73 ▃▁▁▃▇
proportion_earned Forensics 129 0.70 77.42 24.82 0.00 74.85 86.43 92.19 100.74 ▁▁▁▃▇
proportion_earned Oceanography 42 0.76 73.99 24.70 0.00 69.76 81.60 91.04 99.22 ▁▁▁▃▇
proportion_earned Physics 6 0.92 85.32 21.09 4.94 85.02 93.12 96.59 99.60 ▁▁▁▂▇
time_spent Anatomy 45 0.79 2374.39 1669.58 0.45 1209.85 2164.90 3134.97 7084.70 ▆▇▃▂▁
time_spent Biology 5 0.90 1404.57 1528.14 1.22 297.02 827.30 1955.08 6664.45 ▇▂▁▁▁
time_spent Forensics 134 0.69 1591.90 1016.76 2.42 935.03 1404.90 2130.75 6537.02 ▇▇▂▁▁
time_spent Oceanography 42 0.76 2031.44 1496.82 0.58 1133.47 1800.22 2573.45 8870.88 ▇▆▂▁▁
time_spent Physics 6 0.92 1431.76 990.40 0.70 749.32 1282.81 2049.85 5373.35 ▇▆▃▁▁
time_spent_hours Anatomy 45 0.79 39.57 27.83 0.01 20.16 36.08 52.25 118.08 ▆▇▃▂▁
time_spent_hours Biology 5 0.90 23.41 25.47 0.02 4.95 13.79 32.58 111.07 ▇▂▁▁▁
time_spent_hours Forensics 134 0.69 26.53 16.95 0.04 15.58 23.42 35.51 108.95 ▇▇▂▁▁
time_spent_hours Oceanography 42 0.76 33.86 24.95 0.01 18.89 30.00 42.89 147.85 ▇▆▂▁▁
time_spent_hours Physics 6 0.92 23.86 16.51 0.01 12.49 21.38 34.16 89.56 ▇▆▃▁▁

Variable type: POSIXct

skim_variable subject n_missing complete_rate min max median n_unique
date_x Anatomy 80 0.62 2015-09-02 15:40:00 2016-03-23 16:11:00 2015-09-27 20:10:30 129
date_x Biology 9 0.82 2015-09-08 19:52:00 2016-03-09 14:07:00 2015-09-16 14:27:00 40
date_x Forensics 215 0.51 2015-09-08 13:10:00 2016-04-27 02:12:00 2015-10-08 19:19:30 218
date_x Oceanography 75 0.57 2015-09-08 20:08:00 2016-03-03 15:57:00 2016-01-25 20:17:00 97
date_x Physics 14 0.81 2015-09-09 12:24:00 2016-05-24 15:53:00 2015-10-08 21:17:00 60
date_y Anatomy 209 0.00 2015-09-02 15:31:00 2015-09-02 15:31:00 2015-09-02 15:31:00 1
date_y Biology 40 0.18 2015-11-17 03:04:00 2016-01-21 23:38:00 2016-01-16 23:48:00 9
date_y Forensics 392 0.10 2015-09-09 15:21:00 2016-01-22 15:43:00 2016-01-04 13:13:00 43
date_y Oceanography 157 0.10 2015-09-12 15:56:00 2016-01-08 17:51:00 2015-09-18 04:08:30 18
date_y Physics 50 0.32 2015-09-14 14:45:00 2016-01-22 05:36:00 2016-01-17 08:24:30 24
date Anatomy 189 0.10 2017-01-23 14:28:00 2017-02-10 15:25:00 2017-02-01 17:09:00 21
date Biology 47 0.04 2017-02-06 20:12:00 2017-02-09 19:15:00 2017-02-08 07:43:30 2
date Forensics 372 0.14 2017-01-23 13:14:00 2017-02-13 13:00:00 2017-01-24 17:23:00 62
date Oceanography 155 0.11 2017-01-23 14:07:00 2017-02-09 18:45:00 2017-02-01 21:53:30 20
date Physics 71 0.04 2017-01-30 14:41:00 2017-02-03 15:23:00 2017-02-02 20:54:00 3
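
The grouped skim output is long. If you only need one or two numbers per subject, a more compact option is group_by() followed by summarize(). This is a sketch on made-up data (the toy table and its values are assumptions), not a replacement for the skim output above.

```r
library(dplyr)

# Hypothetical data: two subjects, one missing grade
toy <- tibble(
  subject = c("Biology", "Biology", "Physics", "Physics"),
  proportion_earned = c(60, 80, NA, 90)
)

by_subject <- toy %>%
  group_by(subject) %>%
  summarize(
    n = n(),                                          # rows per subject
    n_missing = sum(is.na(proportion_earned)),        # missing grades
    mean_earned = mean(proportion_earned, na.rm = TRUE)  # mean of non-missing grades
  )
by_subject
```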

Missing values

The summary() function provides additional information. It can be used for the entire dataset or for individual variables.

# use the summary() function to look at your data.
summary(data_to_explore)
  student_id         course_id           subject            semester        
 Length:943         Length:943         Length:943         Length:943        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
   section               int             val           percomp     
 Length:943         Min.   :1.800   Min.   :1.000   Min.   :1.500  
 Class :character   1st Qu.:4.000   1st Qu.:3.333   1st Qu.:3.000  
 Mode  :character   Median :4.400   Median :3.667   Median :3.500  
                    Mean   :4.301   Mean   :3.754   Mean   :3.636  
                    3rd Qu.:4.800   3rd Qu.:4.333   3rd Qu.:4.000  
                    Max.   :5.000   Max.   :5.000   Max.   :5.000  
                    NA's   :293     NA's   :287     NA's   :288    
       tv              q1              q2              q3       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.714   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:3.000  
 Median :4.125   Median :4.000   Median :4.000   Median :3.000  
 Mean   :4.065   Mean   :4.337   Mean   :3.661   Mean   :3.312  
 3rd Qu.:4.464   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :292     NA's   :285     NA's   :285     NA's   :286    
       q4              q5              q6              q7             q8       
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
 1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:4.000  
 Median :5.000   Median :4.000   Median :4.000   Median :4.00   Median :4.000  
 Mean   :4.346   Mean   :4.282   Mean   :4.049   Mean   :3.96   Mean   :4.346  
 3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.00   3rd Qu.:5.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
 NA's   :289     NA's   :286     NA's   :285     NA's   :286    NA's   :286    
       q9             q10            date_x                      
 Min.   :1.000   Min.   :1.000   Min.   :2015-09-02 15:40:00.00  
 1st Qu.:3.000   1st Qu.:4.000   1st Qu.:2015-09-11 16:55:15.00  
 Median :4.000   Median :4.000   Median :2015-10-01 15:57:30.00  
 Mean   :3.553   Mean   :4.173   Mean   :2015-11-17 22:56:45.16  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:2016-01-27 16:11:30.00  
 Max.   :5.000   Max.   :5.000   Max.   :2016-05-24 15:53:00.00  
 NA's   :286     NA's   :285     NA's   :393                     
    post_int        post_uv         post_tv       post_percomp  
 Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.:3.500   1st Qu.:3.000   1st Qu.:3.286   1st Qu.:3.000  
 Median :4.000   Median :3.667   Median :3.857   Median :3.500  
 Mean   :3.879   Mean   :3.481   Mean   :3.708   Mean   :3.468  
 3rd Qu.:4.500   3rd Qu.:4.000   3rd Qu.:4.286   3rd Qu.:4.000  
 Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :848     NA's   :848     NA's   :848     NA's   :848    
     date_y                            date                       
 Min.   :2015-09-02 15:31:00.00   Min.   :2017-01-23 13:14:00.00  
 1st Qu.:2015-10-16 14:27:00.00   1st Qu.:2017-01-23 19:13:00.00  
 Median :2016-01-04 13:25:00.00   Median :2017-01-25 18:43:00.00  
 Mean   :2015-12-06 18:24:55.57   Mean   :2017-01-30 08:03:39.07  
 3rd Qu.:2016-01-18 18:53:30.00   3rd Qu.:2017-02-08 13:04:00.00  
 Max.   :2016-01-22 15:43:00.00   Max.   :2017-02-13 13:00:00.00  
 NA's   :848                      NA's   :834                     
 total_points_possible total_points_earned proportion_earned    gender         
 Min.   :1212          Min.   :   0        Min.   :  0.00    Length:943        
 1st Qu.:1217          1st Qu.:1002        1st Qu.: 72.36    Class :character  
 Median :1676          Median :1177        Median : 85.59    Mode  :character  
 Mean   :1620          Mean   :1230        Mean   : 76.23                      
 3rd Qu.:1791          3rd Qu.:1572        3rd Qu.: 92.29                      
 Max.   :2425          Max.   :2414        Max.   :100.74                      
 NA's   :226           NA's   :226         NA's   :226                         
 enrollment_reason  enrollment_status    time_spent      time_spent_hours  
 Length:943         Length:943         Min.   :   0.45   Min.   :  0.0075  
 Class :character   Class :character   1st Qu.: 895.57   1st Qu.: 14.9261  
 Mode  :character   Mode  :character   Median :1559.97   Median : 25.9994  
                                       Mean   :1828.80   Mean   : 30.4801  
                                       3rd Qu.:2423.94   3rd Qu.: 40.3990  
                                       Max.   :8870.88   Max.   :147.8481  
                                       NA's   :232       NA's   :232       

If you want to look for NAs in all of your columns, you can use dplyr’s summarize() and across() together with is.na(), the function that checks for missing values:

data_to_explore %>%
  select(everything()) %>%  # replace to fit your needs
  summarize(across(everything(), ~ sum(is.na(.x))))
# A tibble: 1 × 34
  student_id course_id subject semester section   int   val percomp    tv    q1
       <int>     <int>   <int>    <int>   <int> <int> <int>   <int> <int> <int>
1          0       281       0        0       0   293   287     288   292   285
# ℹ 24 more variables: q2 <int>, q3 <int>, q4 <int>, q5 <int>, q6 <int>,
#   q7 <int>, q8 <int>, q9 <int>, q10 <int>, date_x <int>, post_int <int>,
#   post_uv <int>, post_tv <int>, post_percomp <int>, date_y <int>, date <int>,
#   total_points_possible <int>, total_points_earned <int>,
#   proportion_earned <int>, gender <int>, enrollment_reason <int>,
#   enrollment_status <int>, time_spent <int>, time_spent_hours <int>
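
To see what this kind of per-column NA count does on a small scale, here is a minimal sketch. The data frame `toy` and its values are hypothetical and stand in for data_to_explore; the base R line at the end is only a cross-check.

```r
library(dplyr)

# Hypothetical toy data with a few missing entries (not the case study data)
toy <- tibble(
  student_id = c("a", "b", "c", "d"),
  int        = c(4.2, NA, 3.8, NA),
  tv         = c(3.9, 3.6, NA, 4.0)
)

# is.na() flags each missing value as TRUE; sum() counts the TRUEs in each column
na_counts <- toy %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts

# Base R cross-check: the same counts as a named vector
colSums(is.na(toy))
```

Each column of the result holds the number of missing values in the matching column of the input, which is how the one-row tibble above should be read.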

👉 Your Turn

Most of the code is completed. You need to:

  1. add the column to the select() function
  2. add the same column inside is.na() within the sum() function
data_to_explore %>%
  select(#add column here) %>%  # Fill in the column you want to look at
  summarize(na_count = sum(is.na(#add column here)))  # count NA values in the "semester" column

Exploration with Data Visualization

Use the ggplot2 package

ggplot2 is designed to work iteratively. You start with a layer that shows the raw data. Then you add layers of annotations and statistical summaries.

Remember that ggplot2 is part of the tidyverse, so we do not need to load it again.

You can read more about ggplot2 in the book “ggplot2: Elegant Graphics for Data Analysis”. You can also find lots of inspiration in the R Graph Gallery, which includes code. Finally, you can use the ggplot2 cheat sheet to help.

“Elegant Graphics for Data Analysis” states that “every ggplot2 plot has three key components:

  • data,

  • A set of aesthetic mappings between variables in the data and visual properties, and

  • At least one layer which describes how to render each observation. Layers are usually created with a geom function.”
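
As a minimal sketch of those three components, the example below builds a bar chart from a small hypothetical data frame (`toy` is an assumption, not the case study data). Each comment marks which component the line supplies.

```r
library(ggplot2)

# Hypothetical toy data: one row per enrollment
toy <- data.frame(
  subject = c("Physics", "Anatomy", "Physics", "Forensics", "Physics")
)

p <- ggplot(data = toy,                    # 1. the data
            mapping = aes(x = subject)) +  # 2. aesthetic mapping: subject -> x position
  geom_bar()                               # 3. a layer, created by a geom function

p
```

Leaving out the geom still produces a plot object, but it draws only empty axes, because no layer tells ggplot2 how to render the observations.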

One Continuous variable

Create a basic visualization that examines a continuous variable of interest.


We will be guided by the following research question.

❓ Which online course had the largest enrollment numbers?

❓ Which variable should we be looking at?

👉 Your Turn

You need to:

  1. First, inspect data_to_explore to understand which variables we might need to explore the research question.

#inspect the data frame
#(add code below)
# A tibble: 943 × 34
   student_id course_id subject semester section   int   val percomp    tv    q1
   <chr>      <chr>     <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
 1 43146      FrScA-S2… Forens… S216     02        4.2  3.67     4    3.86     4
 2 44638      OcnA-S11… Oceano… S116     01        4    3        3    3.57     4
 3 47448      FrScA-S2… Forens… S216     01        4.2  3        3    3.71     5
 4 47979      OcnA-S21… Oceano… S216     01        4    3.67     2.5  3.86     4
 5 48797      PhysA-S1… Physics S116     01        3.8  3.67     3.5  3.71     4
 6 51943      FrScA-S2… Forens… S216     03        3.8  3.67     3.5  3.71     4
 7 52326      AnPhA-S2… Anatomy S216     01        3.6  4        3    4        4
 8 52446      PhysA-S1… Physics S116     01        4.2  3.67     3    4        4
 9 53447      FrScA-S1… Forens… S116     01        3.8  2        3    3        5
10 53475      FrScA-S2… Forens… S216     01        4.8  3.33     4    4.14     5
# ℹ 933 more rows
# ℹ 24 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>, total_points_possible <dbl>, total_points_earned <dbl>,
#   proportion_earned <dbl>, gender <chr>, enrollment_reason <chr>,
#   enrollment_status <chr>, time_spent <dbl>, time_spent_hours <dbl>

Level a. The most basic level for a plot

As a reminder, the most basic visualization that you can make with ggplot2 includes three things:

  • layer 1: data: data_to_explore.csv
  • layer 2: aes() function - one categorical variable:
    • subject mapped to x position
  • layer 3: Geom: geom_bar() function - bar graph
#layer 1: add data 
# layer 2: add aesthetics mapping
ggplot(data_to_explore, aes(x = subject)) +
#layer 3: add geom 
  geom_bar() +
  labs(title = "Number of Observations per Subject",
       x = "Subject",
       y = "Count")

Level b. Add another layer with labels

We can add the following to our ggplot visualization as layer 4

  1. title: “Number of Student Enrollments per Subject”

  2. caption: “Which online courses have had the largest enrollment numbers?”

#layer 1: add data 
# layer 2: add aesthetics mapping
ggplot(data_to_explore, aes(x = subject)) +
#layer 3: add geom 
  geom_bar() +

#layer 4: add labels
  labs(title = "Number of Student Enrollments per Subject",
       caption = "Which online courses have had the largest enrollment numbers?")

Level c: Add Scale with a different color.

We will be guided by the following research question.

❓ What can we notice about gender?

To explore this research question, we can add a scale layer:

  • layer 5: scale: fill = gender
#layer 1: add data 
# layer 2: add aesthetics mapping and #layer 5 scale
ggplot(data_to_explore, aes(x = subject, fill = gender)) +
#layer 3: add geom 
  geom_bar() +

#layer 4: add labels
  labs(title = "Gender Distribution of Students Across Subjects",
       caption = "Which subjects enroll more female students?")


We will be guided by the following research question.

❓ How many hours of TV do students watch?

  • data: data_to_explore
  • aes() function - one continuous variable:
    • tv variable mapped to x position
  • Geom: geom_histogram(). This code is already there; you just need to un-comment it.
  • Add a title: “Number of Hours Students Watch TV per Day”
  • Add a caption that poses the question “Approximately how many students watch 4+ hours of TV per day?”


👉 Your Turn

# Add data
ggplot(data_to_explore, aes(x = tv)) +
  # Add the geom
  geom_histogram(bins = 5) +
  # Add the labs
  labs(title = "Histogram of TV Watching Hours",
       x = "TV Watching Hours",
       y = "Count")

Two categorical Variables

Create a basic visualization that examines the relationship between two categorical variables.

We will be guided by the following research question.

❓ What do you wonder about the reasons for enrollment in various courses?


  • data: data_to_explore
  • use the count() function for subject and enrollment_reason, then
  • ggplot() function
  • aes() function - two categorical variables
    • subject variable mapped to x position
    • enrollment reason variable mapped to y position
  • Geom: geom_tile() function
  • Add a title “Reasons for Enrollment by Subject”
  • Add a caption: “Which subjects were the least available at local schools?”

👉 Your Turn

data_to_explore %>%
  count(subject, enrollment_reason) %>%
  ggplot(aes(x = subject, y = enrollment_reason, fill = n)) + 
    geom_tile() +
    scale_fill_gradient(low = "orange", high = "maroon") +  #Change to red color gradient
    labs(title = "Reasons for Enrollment by Subject",
         caption = "Which subjects were the least available at local schools?",
         x = "Subject",
         y = "Enrollment Reason",
         fill = "Count")
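
To make the count() step concrete, here is a minimal sketch on hypothetical data. The `toy` tibble and its enrollment reason labels are assumptions for illustration only. It shows the table that is handed to ggplot(): one row per combination that occurs, with the count in a column named n.

```r
library(dplyr)

# Hypothetical toy data (labels are illustrative, not taken from the case study)
toy <- tibble(
  subject = c("Physics", "Physics", "Anatomy", "Physics", "Anatomy"),
  enrollment_reason = c("Course Unavailable at Local School", "Scheduling Conflict",
                        "Scheduling Conflict", "Course Unavailable at Local School",
                        "Scheduling Conflict")
)

# One row per subject/reason pair that appears in the data; n is the tile fill value
tile_data <- toy %>%
  count(subject, enrollment_reason)

tile_data
```

Combinations that never occur (here, Anatomy with "Course Unavailable at Local School") get no row at all, so geom_tile() leaves that cell blank rather than drawing a tile for zero.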

Two continuous variables

Create a basic visualization that examines the relationship between two continuous variables.

Scatter plot

We will be guided by the following research question.

❓ Can we predict the grade on a course from the time spent in the course LMS?

#look at the data frame
 #(add code below)
# A tibble: 6 × 34
  student_id course_id  subject semester section   int   val percomp    tv    q1
  <chr>      <chr>      <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
1 43146      FrScA-S21… Forens… S216     02        4.2  3.67     4    3.86     4
2 44638      OcnA-S116… Oceano… S116     01        4    3        3    3.57     4
3 47448      FrScA-S21… Forens… S216     01        4.2  3        3    3.71     5
4 47979      OcnA-S216… Oceano… S216     01        4    3.67     2.5  3.86     4
5 48797      PhysA-S11… Physics S116     01        3.8  3.67     3.5  3.71     4
6 51943      FrScA-S21… Forens… S216     03        3.8  3.67     3.5  3.71     4
# ℹ 24 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>, total_points_possible <dbl>, total_points_earned <dbl>,
#   proportion_earned <dbl>, gender <chr>, enrollment_reason <chr>,
#   enrollment_status <chr>, time_spent <dbl>, time_spent_hours <dbl>

❓ Which variables should we be looking at?

👉 Answer here

Level a. The most basic level for a scatter plot


  • data: data_to_explore.csv
  • aes() function - two continuous variables
    • time spent in hours mapped to x position
    • proportion earned mapped to y position
  • Geom: geom_point() function - Scatter plot

👉 Your Turn

#(add code below)
#layer 1: add data and aesthetics mapping 
ggplot(data_to_explore,
       aes(x = time_spent_hours, 
           y = proportion_earned)) +
#layer 2: add geom function type
  geom_point()

Level b. Add another layer with labels

  • Add a title: “How Time Spent on Course LMS is Related to Points Earned in the course”
  • Add an x label: “Time Spent (Hours)”
  • Add a y label: “Proportion of Points Earned”

👉 Your Turn

# Look at the data frame
# A tibble: 6 × 34
  student_id course_id  subject semester section   int   val percomp    tv    q1
  <chr>      <chr>      <chr>   <chr>    <chr>   <dbl> <dbl>   <dbl> <dbl> <dbl>
1 43146      FrScA-S21… Forens… S216     02        4.2  3.67     4    3.86     4
2 44638      OcnA-S116… Oceano… S116     01        4    3        3    3.57     4
3 47448      FrScA-S21… Forens… S216     01        4.2  3        3    3.71     5
4 47979      OcnA-S216… Oceano… S216     01        4    3.67     2.5  3.86     4
5 48797      PhysA-S11… Physics S116     01        3.8  3.67     3.5  3.71     4
6 51943      FrScA-S21… Forens… S216     03        3.8  3.67     3.5  3.71     4
# ℹ 24 more variables: q2 <dbl>, q3 <dbl>, q4 <dbl>, q5 <dbl>, q6 <dbl>,
#   q7 <dbl>, q8 <dbl>, q9 <dbl>, q10 <dbl>, date_x <dttm>, post_int <dbl>,
#   post_uv <dbl>, post_tv <dbl>, post_percomp <dbl>, date_y <dttm>,
#   date <dttm>, total_points_possible <dbl>, total_points_earned <dbl>,
#   proportion_earned <dbl>, gender <chr>, enrollment_reason <chr>,
#   enrollment_status <chr>, time_spent <dbl>, time_spent_hours <dbl>
# Create a scatter plot
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned)) +
  geom_point() +
  labs(title = "Relationship Between Time Spent and Proportion of Points Earned",
       x = "Time Spent (Hours)",
       y = "Proportion of Points Earned")

Level c. Add Scale with a different color.

❓ Can we notice anything about enrollment status?

  • Add scale in aes: color = enrollment_status

👉 Your Turn

# Create a scatter plot with color based on enrollment_status
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
  geom_point() +
  labs(title = "How Time Spent on Course LMS is Related to Points Earned in the course?",
       x = "Time Spent (Hours)",
       y = "Proportion of Points Earned",
       color = "Enrollment Status")

Level d. Divide up graphs using facet to visualize by subject.

  • Add facet with facet_wrap() function: by subject

👉 Your Turn

# Create a scatter plot with facets for each subject
ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
  geom_point() +
  facet_wrap(~subject) +
  labs(title = "Relationship Between Time Spent and Proportion of Points Earned by Subject",
       x = "Time Spent (Hours)",
       y = "Proportion of Points Earned",
       color = "Enrollment Status")

Level e. How can we remove NAs from the plot, and what will the code look like without the comments?

  • start with the data, then
  • add enrollment status to the drop_na() function to remove NAs
  • add labels to the labs() function like above
  • facet wrap by subject
# Drop rows with missing values and create the scatter plot
data_to_explore %>%
  drop_na(time_spent_hours, proportion_earned, enrollment_status, subject) %>%
  ggplot(aes(x = time_spent_hours, 
             y = proportion_earned, 
             color = enrollment_status)) +
  geom_point() +
  labs(title = "Relationship Between Time Spent and Proportion of Points Earned by Subject",
       x = "Time Spent (Hours)",
       y = "Proportion of Points Earned",
       color = "Enrollment Status") +
  facet_wrap(~subject)
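
A minimal sketch of how drop_na() decides which rows to keep, using a hypothetical four-row data frame (`toy` and its values are assumptions, not the case study data):

```r
library(tidyr)

# Hypothetical toy data with missing values in different columns
toy <- data.frame(
  time_spent_hours  = c(10, NA, 30, 40),
  proportion_earned = c(80, 75, NA, 90),
  enrollment_status = c("Enrolled", "Dropped", "Enrolled", NA)
)

# Only the listed columns are checked, so row 4 (NA status only) is kept
kept <- drop_na(toy, time_spent_hours, proportion_earned)
nrow(kept)

# With no columns listed, every column is checked and row 4 is dropped too
kept_all <- drop_na(toy)
nrow(kept_all)
```

Listing columns explicitly keeps as many rows as possible; only the variables that the plot actually uses need to be complete.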

Teacher Persona

As Alex explores the data through visualizations and summary statistics, she begins to see trends that could indicate which students are at risk. Her observations guide her to consider changes in her teaching approach or additional support for certain students.

🛑 Stop here. Congratulations, you finished the second part of the case study.

4. Model (Module 3)

Quantify the insights using mathematical models. As highlighted in Chapter 3 of Data Science in Education Using R, the Model step of the data science process entails “using statistical models, from simple to complex, to understand trends and patterns in the data.”

The authors note that while descriptive statistics and data visualization during the Explore step can help us to identify patterns and relationships in our data, statistical models can be used to help us determine if relationships, patterns and trends are actually meaningful.

Correlation Matrix

As highlighted in Macfadyen and Dawson (2010), scatter plots are a useful initial approach for identifying potential correlational trends between variables under investigation, but to further interrogate the significance of selected variables as indicators of student achievement, a simple correlation analysis of each variable with student final grade can be conducted.

There are two efficient ways to create correlation matrices, one that is best for internal use, and one that is best for inclusion in a manuscript. The {corrr} package provides a way to create a correlation matrix in a {tidyverse}-friendly way. As with the {skimr} package, it can take as little as a line of code to create a correlation matrix. If you are not familiar with it, a correlation matrix is a table that presents how all of the variables are related to all of the other variables.
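
To illustrate what a correlation matrix contains, here is a minimal base R sketch on simulated data. The variables `hours`, `grade` and `interest` are hypothetical stand-ins, and base R’s cor() is used here instead of {corrr} only to keep the example dependency-free.

```r
set.seed(1)
n <- 50

# Simulated, hypothetical variables (not the case study data)
hours    <- rnorm(n, mean = 30, sd = 10)
grade    <- 60 + 0.5 * hours + rnorm(n, sd = 10)
interest <- rnorm(n, mean = 4, sd = 0.5)
interest[c(3, 7)] <- NA   # introduce a couple of missing values

toy <- data.frame(hours, grade, interest)

# Every variable is correlated with every other variable;
# "pairwise.complete.obs" uses all rows available for each pair
cor_mat <- cor(toy, use = "pairwise.complete.obs")
round(cor_mat, 2)
```

The matrix is symmetric with 1s on the diagonal, since each variable correlates perfectly with itself. That redundancy is why tools for printed tables typically blank the diagonal and one triangle.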

👉 Your Turn

You need to:

1. Load the corrr package using the correct function. (You may need to run install.packages() in the console if this is your first time loading the package.)

# load in corrr library
#(add code below)

Attaching package: 'corrr'
The following object is masked from 'package:skimr':


Simple Correlation

👉 Your Turn

Check whether there is a simple correlation between time spent and the proportion of points earned.

You need to:

1. use data_to_explore

2. select():
   - time_spent_hours
   - proportion_earned

3. use the correlate() function

#(add code below)
data_to_explore %>% 
  select(proportion_earned, time_spent_hours) %>%
  correlate()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
# A tibble: 2 × 3
  term              proportion_earned time_spent_hours
  <chr>                         <dbl>            <dbl>
1 proportion_earned            NA                0.438
2 time_spent_hours              0.438           NA    

For printing purposes, the fashion() function can be added to convert a correlation data frame into a matrix with the correlations cleanly formatted (leading zeros removed; spaced for signs) and the diagonal (or any NA) left blank.

#add fashion function
data_to_explore %>% 
  select(proportion_earned, time_spent_hours) %>% 
  correlate() %>% 
  rearrange() %>%
  shave() %>%
  fashion()
Correlation computed with
• Method: 'pearson'
• Missing treated using: 'pairwise.complete.obs'
               term proportion_earned time_spent_hours
1 proportion_earned                                   
2  time_spent_hours               .44                 

❓ What could we write up for a manuscript in APA format or another format? Write below

  • In the study, Pearson’s correlation coefficient was calculated to assess the relationship between the proportion of points earned and the time spent in the course (in hours). The analysis revealed a moderate positive correlation, r = .44, suggesting that students who spent more time in the course tended to earn a higher proportion of the available points. Pairwise complete observations were used to handle missing data.
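
A write-up like this usually also reports degrees of freedom, a confidence interval and a p-value. As a hedged sketch, base R’s cor.test() returns all of these; the simulated `hours` and `grade` vectors below are hypothetical and are not the case study data.

```r
set.seed(42)

# Simulated, hypothetical data
hours <- rnorm(100, mean = 30, sd = 10)
grade <- 60 + 0.5 * hours + rnorm(100, sd = 10)

res <- cor.test(hours, grade, method = "pearson")

res$estimate    # Pearson's r
res$parameter   # degrees of freedom (n - 2)
res$conf.int    # 95% confidence interval for r
res$p.value     # p-value for the null hypothesis r = 0
```

These pieces map onto a conventional report of the form r(df) = estimate, 95% CI [lower, upper], p = value.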

❓ What other variables would you like to check out? Write below

👉 Your Turn⤵

#(add code below)

APA Formatted Table

While {corrr} is a nice package to quickly create a correlation matrix, you may wish to create one that is ready to be added directly to a dissertation or journal article. {apaTables} is great for creating more formal forms of output that can be added directly to an APA-formatted manuscript; it also has functionality for regression and other types of model output. It is not as friendly to {tidyverse} functions; first, we need to select only the variables we wish to correlate.

Then, we can use that subset of the variables as the argument to the apa.cor.table() function.

Run the following code to create a subset of the larger data_to_explore data frame with the variables you wish to correlate, then create a correlation table using apa.cor.table().

  • load apaTables library

👉 Your Turn

# read in apatables library
#(add code below)

data_to_explore_subset <- data_to_explore %>% 
  select(time_spent_hours, proportion_earned, int)

apa.cor.table(data_to_explore_subset)


Means, standard deviations, and correlations with confidence intervals

  Variable             M     SD    1           2         
  1. time_spent_hours  30.48 22.72                       
  2. proportion_earned 76.23 25.20 .44**                 
                                   [.37, .50]            
  3. int               4.30  0.60  .08         .14**     
                                   [-.01, .16] [.06, .22]

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.

This may look nice, but how do you actually add it to a dissertation or an article that you might be interested in publishing?

Read the documentation for apa.cor.table() by running ?apa.cor.table() in the console. Look through the documentation and examples to understand how to output a file with the formatted correlation table, and then run the code to do that with your subset of the data_to_explore data frame.

apa.cor.table(data_to_explore_subset, filename = "cor-table.doc")

Means, standard deviations, and correlations with confidence intervals

  Variable             M     SD    1           2         
  1. time_spent_hours  30.48 22.72                       
  2. proportion_earned 76.23 25.20 .44**                 
                                   [.37, .50]            
  3. int               4.30  0.60  .08         .14**     
                                   [-.01, .16] [.06, .22]

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.

You should now see a new Word document in your project folder called cor-table.doc. Click on that and you’ll be prompted to download from your browser.

C. Predict Academic Achievement
Linear Regression

In brief, a linear regression model involves estimating the relationship between one or more independent variables and one dependent variable. Mathematically, it can be written like the following.

\[ \operatorname{dependentvar} = \beta_{0} + \beta_{1}(\operatorname{independentvar}) + \epsilon \]
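
To connect the equation to lm(), here is a minimal sketch that simulates data with a known intercept (60) and slope (0.5) and then checks that the fitted model approximately recovers them. All names and values are hypothetical.

```r
set.seed(123)

# Simulate: dependentvar = beta_0 + beta_1 * independentvar + epsilon
x <- runif(200, min = 0, max = 50)        # independent variable
y <- 60 + 0.5 * x + rnorm(200, sd = 5)    # beta_0 = 60, beta_1 = 0.5, plus noise

fit <- lm(y ~ x)

# (Intercept) estimates beta_0; the x coefficient estimates beta_1
coef(fit)
```

The estimates will not match 60 and 0.5 exactly because of the random error term, but they should land close to those values.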

Does time spent predict grade earned?

The following code estimates a model in which proportion_earned, the proportion of points students earned, is the dependent variable. It is predicted by one independent variable, time_spent_hours, the time students spent in the course LMS.

  • Add + int after time_spent_hours to include students’ self-reported interest in science as a second predictor.

👉 Your Turn

# add predictor variable for science interest, `int`
lm(proportion_earned ~ time_spent_hours , 
   data = data_to_explore)

lm(formula = proportion_earned ~ time_spent_hours, data = data_to_explore)

     (Intercept)  time_spent_hours  
         62.4306            0.4792  

With int added to the model, the intercept is estimated at 44.97, which tells us that a student whose time spent and interest are both equal to zero would be predicted to earn about 45% of the points and would likely fail the course, unsurprisingly. Note that the estimate for interest in science is 4.63, so for every one-unit increase in int, we should expect an increase of about 4.6 percentage points in the grade, holding time spent constant.

We can save the output of the function to an object—let’s say m1, standing for model 1. We can then use the summary() function built into R to view a much more feature-rich summary of the estimated model.

# save the model
m1 <- lm(proportion_earned ~ time_spent_hours + int, data = data_to_explore)

Run a summary of the model you just created, m1.

D. Assumptions

Great! Now that you have defined your linear model m1 in R, which predicts proportion_earned based on time_spent_hours and the interest variable int, let’s go through how to check the assumptions of this linear model using the various diagnostic plots and tests.

We’ll need to check:

  1. Linearity and Interaction Effects

  2. Residuals Analysis

  3. Normality of Residuals

  4. Multicollinearity

Linearity and Interaction effect

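Our model includes a second predictor, int. Note that int is students’ self-reported interest in science, not an interaction term. It is still good to first check whether the relationship between time spent and grade looks similar across levels of interest, and whether the linearity assumption holds for the predictors in relation to the dependent variable.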

ggplot(data_to_explore, aes(x=time_spent_hours, y=proportion_earned, color=int)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 345 rows containing non-finite outside the scale range
Warning: The following aesthetics were dropped during statistical transformation:
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?
Warning: Removed 345 rows containing missing values or values outside the scale range

This plot helps visualize whether the relationship between time spent and the dependent variable is roughly linear and whether it appears to differ by level of interest.

Residuals Analysis

Next, plot the residuals against the fitted values to check for independence, homoscedasticity, and any unusual patterns.

Residuals vs. Fitted Values Plot (for homoscedasticity and independence)

plot(residuals(m1) ~ fitted(m1))
abline(h = 0, col = "red")

Look for a random dispersion of points. Any pattern or funnel shape indicates issues with homoscedasticity or linearity.

Normality of Residuals

Normal Q-Q Plot (for normality of residuals)


A deviation from the straight line in the Q-Q plot indicates deviations from normality.
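
A minimal sketch of drawing a normal Q-Q plot of model residuals in base R. The model `toy_fit` is fitted to simulated, hypothetical data; for the model in this case study, the equivalent calls would presumably use residuals(m1).

```r
set.seed(1)

# Simulated, hypothetical data and model
x <- runif(100, min = 0, max = 50)
y <- 60 + 0.5 * x + rnorm(100, sd = 5)
toy_fit <- lm(y ~ x)

# Points: sample quantiles of the residuals against theoretical normal quantiles
qqnorm(residuals(toy_fit))
# Reference line through the first and third quartiles
qqline(residuals(toy_fit), col = "red")
```

Because the simulated errors are drawn from a normal distribution, the points here should hug the red line; heavy tails or skew would show up as systematic curvature at the ends.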

Shapiro-Wilk Test:


    Shapiro-Wilk normality test

data:  residuals(m1)
W = 0.88967, p-value < 2.2e-16
  • W statistic: The test statistic W is 0.88967. This value indicates the extent to which the data are normally distributed. W values close to 1 suggest that the data closely follow a normal distribution. In this case, a W value of 0.88967 suggests some deviation from normality.

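  • p-value: The p-value is less than 2.2e-16 (a very small number close to zero). In statistical testing, a p-value less than the chosen significance level (typically 0.05) leads to the rejection of the null hypothesis. Here the null hypothesis is that the residuals are normally distributed, so we conclude that the residuals of m1 deviate significantly from normality.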
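
The output above reports `data: residuals(m1)`, which suggests the test was run as shapiro.test(residuals(m1)). As a hedged illustration of how W and the p-value behave, the sketch below runs the same test on two simulated, hypothetical samples: one drawn from a normal distribution and one that is strongly right-skewed.

```r
set.seed(2)

# Hypothetical samples standing in for model residuals
normal_resid <- rnorm(200)   # drawn from a normal distribution
skewed_resid <- rexp(200)    # strongly right-skewed

sw_normal <- shapiro.test(normal_resid)
sw_skewed <- shapiro.test(skewed_resid)

# W is closer to 1 for the normal sample than for the skewed one
c(normal = unname(sw_normal$statistic), skewed = unname(sw_skewed$statistic))

# The skewed sample yields a very small p-value, so normality is rejected
sw_skewed$p.value
```

With large samples the test can flag even small departures from normality, so it is usually read alongside the Q-Q plot rather than on its own.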


Multicollinearity

#load the package car
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

The following object is masked from 'package:purrr':

time_spent_hours              int 
        1.006076         1.006076 

Interpretation of VIF Scores (these can be computed with the vif() function from the car package, e.g. vif(m1)):

  • VIF = 1: No correlation between the predictor and the other predictors.

  • 1 < VIF < 5: Generally indicates a moderate level of multicollinearity.

  • VIF from 5 to 10: May indicate a problematic amount of multicollinearity, depending on the context and sources.

  • VIF > 10: Signifies high multicollinearity that can severely distort the least squares estimates.

Both of the VIF scores are slightly above 1, which suggests that there is almost no multicollinearity among these predictors. This is a good sign, indicating that each predictor provides unique and independent information to the model, not unduly influenced by the other variables.
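
To show where a VIF comes from, here is a minimal base R sketch that computes it by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the other. The helper `vif_manual()` and the simulated predictors are hypothetical; this illustrates the definition and is not how the car package is implemented.

```r
set.seed(3)
n <- 200

# Two simulated, mildly correlated predictors (hypothetical data)
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n)

# Hypothetical helper: VIF of `target` given one other predictor
vif_manual <- function(target, other) {
  r2 <- summary(lm(target ~ other))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_manual(x1, x2)
vif_x2 <- vif_manual(x2, x1)
c(x1 = vif_x1, x2 = vif_x2)

# With exactly two predictors, both VIFs equal 1 / (1 - r^2)
1 / (1 - cor(x1, x2)^2)
```

This is why the two VIF scores in the output above are identical: with only two predictors, each VIF depends solely on their correlation with each other. A VIF of 1.006 corresponds to a correlation of roughly .08, in line with the correlation table earlier in this section.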

Our model violates the normality assumption: the residuals are not normally distributed. However, we are going to practice other functions as if it had passed all assumptions. The assumption testing is included so that you know how to do it in the future. In your own work, you would need to address issues like non-normality by exploring data transformations, adding polynomial or interaction terms to the model, or using a different type of regression model that does not assume normality of residuals.

👉 Your Turn

#run the summary

lm(formula = proportion_earned ~ time_spent_hours + int, data = data_to_explore)

    Min      1Q  Median      3Q     Max 
-66.705  -7.836   5.049  14.695  35.766 

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       44.9657     6.6488   6.763 3.54e-11 ***
time_spent_hours   0.4255     0.0410  10.378  < 2e-16 ***
int                4.6283     1.5364   3.012  0.00271 ** 
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.42 on 536 degrees of freedom
  (404 observations deleted due to missingness)
Multiple R-squared:  0.1859,    Adjusted R-squared:  0.1828 
F-statistic: 61.18 on 2 and 536 DF,  p-value: < 2.2e-16

Let’s save this as a nice APA table for possible publication

# use the {apaTables} package to create a nice regression table that could be used for later publication.
apa.reg.table(m1, filename = "lm-table.doc")

Regression results using proportion_earned as the criterion

        Predictor       b       b_95%_CI beta  beta_95%_CI sr2  sr2_95%_CI
      (Intercept) 44.97** [31.90, 58.03]                                  
 time_spent_hours  0.43**   [0.34, 0.51] 0.41 [0.33, 0.48] .16  [.11, .22]
              int  4.63**   [1.61, 7.65] 0.12 [0.04, 0.19] .01 [-.00, .03]
     r             Fit
           R2 = .186**
       95% CI[.13,.24]

Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
b represents unstandardized regression weights. beta indicates the standardized regression weights. 
sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
Square brackets are used to enclose the lower and upper limits of a confidence interval.
* indicates p < .05. ** indicates p < .01.

Teacher Persona

By creating simple models, Alex hopes to predict student outcomes more accurately. She is interested in how variables like time spent on tasks correlate with student grades and uses this information to adjust her instructional strategies.

Summarize predictors

The summarize() function from the {dplyr} package is used to create summary statistics such as the mean, standard deviation, or the minimum or maximum of a value.

At its core, think of summarize() as a function that returns a single value (whether it’s a mean, median, standard deviation—whichever!) that summarizes a single column.

In the space below, use summarize() to find the mean interest of students.

data_to_explore %>% 
  summarize(mean_interest = mean(int, na.rm = TRUE))
# A tibble: 1 × 1
1          4.30

Now let’s look at the mean of time_spent_hours, removing any NAs. Save it to a new variable called mean_time.

👉 Your Turn

#add code here
data_to_explore %>% 
  summarize(mean_time = mean(time_spent_hours, na.rm = TRUE))
# A tibble: 1 × 1
1      30.5

The mean value for interest is quite high. If we multiply the estimated relationship between interest and proportion of points earned, 4.6283, by the mean interest across all of the students, 4.30, we get the contribution of interest to the predicted final grade: 4.6283 × 4.30, or about 19.90 percentage points. For hours spent, the contribution is 0.4255 × 30.48, or about 12.97 percentage points.

If we add both 19.90 and 12.97 to the intercept, 44.97, that equals about 77.84, or roughly 78%. In other words, a student with average interest in science who spent an average amount of time in the course earned a pretty average grade.
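
The same calculation can be checked in R using the coefficients printed in the summary of m1 above (intercept 44.9657, time_spent_hours 0.4255, int 4.6283) and the two means computed with summarize(). Note that proportion_earned is on a 0 to 100 scale in this data set. The object names below are hypothetical.

```r
# Coefficients as printed in the summary of m1
b0     <- 44.9657   # intercept
b_time <- 0.4255    # time_spent_hours
b_int  <- 4.6283    # int

# Means reported by summarize()
mean_time <- 30.48
mean_int  <- 4.30

time_part <- b_time * mean_time   # contribution of average time spent
int_part  <- b_int * mean_int     # contribution of average interest

predicted <- b0 + time_part + int_part
round(predicted, 1)   # 77.8
```

So a student with average interest and average time spent is predicted to earn about 78% of the available points.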


For your final Your Turn, your goal is to distill our analysis into a FLEXBOARD “data product” designed to illustrate key findings. Feel free to use the template in the lab 4 folder.

The final step in the workflow/process is sharing the results of your analysis with a wider audience. Krumm et al. (2018) have outlined the following 3-step process for communicating findings from an analysis to education stakeholders:

  1. Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”

  2. Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.

  3. Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question and might be used to inform new analyses or a “change idea” for improving student learning.

👉 Your Turn

Create a Data Story with our current data set, or your own. Make sure to use the LA workflow as your guide and include the following:

- Develop a research question

- Add ggplot visualizations

- Add modeling visualizations

- Communicate by writing a short write-up for the intended stakeholders. Remember to write it in terms the stakeholders understand.

Teacher Persona

Finally, Alex prepares to communicate her findings. She creates a simple web page using Markdown to share her insights with colleagues. This acts as a practical example of how data can inform teaching practices.