Who Drops Online Courses? An Analysis Using the OULAD

Case Study Key

Author

LASER Institute

Published

July 19, 2024

1. PREPARE

Conceptually, we focus on prediction and how it differs from the goals of description or explanation. We have two readings in Learning Lab 1 that accompany this. The first reading, introduced below, focuses on this distinction between prediction and description or explanation. It is one of the most widely read papers in machine learning and articulates how machine learning differs from other kinds of statistical models: Breiman describes the difference in terms of data modeling (models for description and explanation) and algorithmic modeling (what we call prediction or machine learning models).

Research Question

Technically, we’ll focus on the core parts of doing a machine learning analysis in R. We’ll use the {tidymodels} set of R packages (add-ons) to do so, as in the first module. However, to help anchor our analysis and provide some direction, we’ll focus on the following research question as we explore this new approach:

How well can we predict students who are at risk of dropping a course?

Reading: Statistical modeling: The two cultures

Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199-231. https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.pdf

👉 Your Turn

You’ll be asked to reflect more deeply on this article later on (in the badge activity); for now, open the article, take a quick scan, and note below an observation or question you have about it.

  • YOUR RESPONSE HERE

Reading: Predicting students’ final grades

Estrellado, R. A., Freer, E. A., Mostipak, J., Rosenberg, J. M., & Velásquez, I. C. (2020). Data science in education using R. Routledge (c14), Predicting students’ final grades using machine learning methods with online course data. http://www.datascienceineducation.com/

Please review this chapter, focusing on the overall goals of the analysis and how the analysis is presented (emphasizing predictions, rather than the ways we typically interpret a statistical model, such as measures of statistical significance).

👉 Your Turn

1b. Load Packages

Like in the last module, please load the tidyverse package. Also, please load the tidymodels and janitor packages in the code chunk below.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.5      ✔ rsample      1.2.1 
✔ dials        1.2.1      ✔ tune         1.2.1 
✔ infer        1.0.7      ✔ workflows    1.1.4 
✔ modeldata    1.4.0      ✔ workflowsets 1.1.0 
✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
✔ recipes      1.0.10     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

2. WRANGLE

In general, data wrangling involves some combination of cleaning, reshaping, transforming, and merging data (Wickham & Grolemund, 2017). The importance of data wrangling is difficult to overstate, as it involves the initial steps of going from raw data to a dataset that can be explored and modeled (Krumm et al., 2018). In Part 2, we focus on the following wrangling processes:

  1. Importing and Inspecting Data. In this section, we will “read” in our CSV data file and take a quick look at what our file contains.

  2. Mutate Variables. We use the mutate() function to create a dichotomous variable for whether or not the student passed the course.

2a. Import and Inspect Data

For learning labs 1-3, we’ll be using a widely used dataset in the learning analytics field: the Open University Learning Analytics Dataset (OULAD). The OULAD was created by learning analytics researchers at the United Kingdom-based Open University. It includes data from post-secondary learners’ participation in one of several Massive Open Online Courses (called modules in the OULAD).

Kuzilek, J., Hlosta, M., & Zdrahal, Z. (2017). Open university learning analytics dataset. Scientific Data, 4(1), 1-8. https://www.nature.com/articles/sdata2017171

Abstract

Learning Analytics focuses on the collection and analysis of learners’ data to improve their learning experience by providing informed guidance and to optimise learning materials. To support the research in this area we have developed a dataset, containing data from courses presented at the Open University (OU). What makes the dataset unique is the fact that it contains demographic data together with aggregated clickstream data of students’ interactions in the Virtual Learning Environment (VLE). This enables the analysis of student behavior, represented by their actions. The dataset contains the information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the VLE represented by daily summaries of student clicks (10,655,280 entries). The dataset is freely available at https://analyse.kmi.open.ac.uk/open_dataset under a CC-BY 4.0 license.

👉 Your Turn

You don’t need to read the entire article yet, but please open this article, scan the sections, and write down two things you notice or wonder about the dataset.

  1. YOUR RESPONSE HERE

  2. YOUR RESPONSE HERE

Read CSV Data File

Like in the last module, read in the data using read_csv(). Note: we have done some minimal processing of these files to make getting started easier. If you’re interested in what we’ve done, check out the oulad.R file in the module-2 folder.

students <- read_csv("data/oulad-students.csv")
Rows: 32593 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): code_module, code_presentation, gender, region, highest_education, ...
dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You can see a description of the data here. The students file includes three files joined together: studentInfo, courses, and studentRegistration. Take a look at the data description to get a sense for what variables are in which data frame.
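
As a point of reference, the sketch below shows roughly how three such files could be joined. It is only an illustration (not the authors’ oulad.R script), and it assumes the raw OULAD file names (studentInfo.csv, courses.csv, and studentRegistration.csv) have been placed in the data folder.

# Illustration only (not the authors' oulad.R): joining the three raw OULAD files,
# assuming the raw CSVs from the OULAD download are in the data folder
student_info <- read_csv("data/studentInfo.csv")
courses <- read_csv("data/courses.csv")
registrations <- read_csv("data/studentRegistration.csv")

students_joined <- student_info %>%
    left_join(courses, by = c("code_module", "code_presentation")) %>% # add module-level information
    left_join(registrations, by = c("code_module", "code_presentation", "id_student")) # add registration dates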

👉 Your Turn

Inspect Data

Use the glimpse() function we used in the first module.

glimpse(students)
Rows: 32,593
Columns: 15
$ code_module                <chr> "AAA", "AAA", "AAA", "AAA", "AAA", "AAA", "…
$ code_presentation          <chr> "2013J", "2013J", "2013J", "2013J", "2013J"…
$ id_student                 <dbl> 11391, 28400, 30268, 31604, 32885, 38053, 4…
$ gender                     <chr> "M", "F", "F", "F", "F", "M", "M", "F", "F"…
$ region                     <chr> "East Anglian Region", "Scotland", "North W…
$ highest_education          <chr> "HE Qualification", "HE Qualification", "A …
$ imd_band                   <chr> "90-100%", "20-30%", "30-40%", "50-60%", "5…
$ age_band                   <chr> "55<=", "35-55", "35-55", "35-55", "0-35", …
$ num_of_prev_attempts       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ studied_credits            <dbl> 240, 60, 60, 60, 60, 60, 60, 120, 90, 60, 6…
$ disability                 <chr> "N", "N", "Y", "N", "N", "N", "N", "N", "N"…
$ final_result               <chr> "Pass", "Pass", "Withdrawn", "Pass", "Pass"…
$ module_presentation_length <dbl> 268, 268, 268, 268, 268, 268, 268, 268, 268…
$ date_registration          <dbl> -159, -53, -92, -52, -176, -110, -67, -29, …
$ date_unregistration        <dbl> NA, NA, 12, NA, NA, NA, NA, NA, NA, NA, NA,…

2b. “Mutate” Variables

We’re going to do a few more steps related to data wrangling here, noting we could also do these at later stages of our process (namely, in the feature engineering stage).

First, since we are interested in developing a model that can predict whether a student is at risk of dropping a course (so that we can intervene before that happens), we need an outcome variable that lets us know whether they passed.

To create this variable, let’s use the mutate() function to create a dichotomous variable for whether or not the student passed the course. Here’s a way we can do this, using ifelse() and as.factor(). This will be our outcome variable, or the predicted variable.

students <- students %>%
    mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a new variable named "pass", dummy coded as 1 if final_result equals "Pass" and 0 if not
    mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

Note: The mutate() function is a critical function to learn and is used to create new columns that are functions of existing variables. It can also modify columns (if the name is the same as an existing column) and delete columns (by setting their value to NULL).
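
For instance, here is a small, hypothetical illustration of all three uses (the new column name credits_per_100 is made up for this example):

# Hypothetical illustration of mutate(): create, modify, and delete columns
students %>%
    mutate(credits_per_100 = studied_credits / 100) %>% # create a new column
    mutate(gender = as.factor(gender)) %>% # modify an existing column in place
    mutate(date_unregistration = NULL) # delete a column by setting it to NULL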

Next, let’s do something similar for whether a student identifies as having a disability. In this case, there are only two values for the disability variable, so we can simply convert it directly to a factor. Look at the code you used above, modifying it for the disability variable. This will be an independent variable, or a predictor variable.

students <- students %>% 
    mutate(disability = as.factor(disability))

👉 Your Turn

In the chunk below, use the view() function to manually check and see if our new variable has indeed been added to our data frame.

view(students)

Write down a few observations after inspecting the data:

  • YOUR RESPONSE HERE
  • YOUR RESPONSE HERE
  • YOUR RESPONSE HERE

3. EXPLORE

As noted by Krumm et al. (2018), exploratory data analysis often involves some combination of data visualization and feature engineering. In Part 3, we will create a quick visualization to help us spot any potential issues with our data and engineer new predictive variables or “features” that we will use in our predictive models. Specifically, in Part 3 we will:

  1. Examine Variables by taking a quick count() of the number of students and the number of specific offerings of each course module.

  2. Engineer Predictors by creating one more predictor variable based on a measure of socioeconomic resources: the index of multiple deprivation (IMD) variable.

👉 Your Turn

3a. Examine Variables

Referring to the data description, in the chunk below, count the number of students. Also, count the number of courses (modules) and specific offerings (as modules can be offered multiple times per year). Learn more about count() here.

students %>% 
    count(id_student) # this many students
# A tibble: 28,785 × 2
   id_student     n
        <dbl> <int>
 1       3733     1
 2       6516     1
 3       8462     2
 4      11391     1
 5      23629     1
 6      23632     1
 7      23698     1
 8      23798     1
 9      24186     1
10      24213     2
# ℹ 28,775 more rows
students %>% 
    count(code_module, code_presentation) # this many offerings
# A tibble: 22 × 3
   code_module code_presentation     n
   <chr>       <chr>             <int>
 1 AAA         2013J               383
 2 AAA         2014J               365
 3 BBB         2013B              1767
 4 BBB         2013J              2237
 5 BBB         2014B              1613
 6 BBB         2014J              2292
 7 CCC         2014B              1936
 8 CCC         2014J              2498
 9 DDD         2013B              1303
10 DDD         2013J              1938
# ℹ 12 more rows
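
If you would rather see the totals directly than scan the counts above, one optional way is to use n_distinct() inside summarise():

# Optional: summarise the totals directly rather than scanning the counts above
students %>% 
    summarise(n_students = n_distinct(id_student), # unique students
              n_offerings = n_distinct(code_module, code_presentation)) # unique module-presentation offerings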

3b. Feature Engineering

As defined by Krumm, Means, and Bienkowski (2018) in Learning Analytics Goes to School:

Feature engineering is the process of creating new variables within a dataset, which goes above and beyond the work of recoding and rescaling variables.

The authors note that feature engineering draws on substantive knowledge from theory or practice, experience with a particular data system, and general experience in data-intensive research. Moreover, these features can be used not only in machine learning models, but also in visualizations and tables comprising descriptive statistics.

Though not often discussed, feature engineering is an important element of data-intensive research that can generate new insights and improve predictive models. You can read more about feature engineering here.

Student Socioeconomic Index

For our first lab, we’ll engage in a very basic feature engineering step; we’ll do much more of this in the next learning lab.

To do feature engineering, let’s create one more predictor variable based on a measure of socioeconomic resources: the index of multiple deprivation (imd_band) variable. The process we take here is to turn this character string variable into a number by creating a factor with ordered levels and then coercing it to an integer.

👉 Your Turn

Please replace the ___ values in the code below with the correct variable.

students <- students %>% 
    mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                        "10-20%",
                                        "20-30%",
                                        "30-40%",
                                        "40-50%",
                                        "50-60%",
                                        "60-70%",
                                        "70-80%",
                                        "80-90%",
                                        "90-100%"))) %>% # this creates a factor with ordered levels
    mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels

students
# A tibble: 32,593 × 16
   code_module code_presentation id_student gender region      highest_education
   <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
 1 AAA         2013J                  11391 M      East Angli… HE Qualification 
 2 AAA         2013J                  28400 F      Scotland    HE Qualification 
 3 AAA         2013J                  30268 F      North West… A Level or Equiv…
 4 AAA         2013J                  31604 F      South East… A Level or Equiv…
 5 AAA         2013J                  32885 F      West Midla… Lower Than A Lev…
 6 AAA         2013J                  38053 M      Wales       A Level or Equiv…
 7 AAA         2013J                  45462 M      Scotland    HE Qualification 
 8 AAA         2013J                  45642 F      North West… A Level or Equiv…
 9 AAA         2013J                  52130 F      East Angli… A Level or Equiv…
10 AAA         2013J                  53025 M      North Regi… Post Graduate Qu…
# ℹ 32,583 more rows
# ℹ 10 more variables: imd_band <int>, age_band <chr>,
#   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
#   final_result <chr>, module_presentation_length <dbl>,
#   date_registration <dbl>, date_unregistration <dbl>, pass <fct>
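
If you would like to double-check the recoding (and preview an issue we will run into later), counting the new values shows the integers 1 through 10 along with some missing (NA) values:

# Optional check: imd_band should now be integers 1-10, plus some missing (NA) values
students %>% 
    count(imd_band)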

We’re now ready to proceed to the five machine learning steps!

4. MODEL

In this step, we will dive into the SML modeling process in much more depth than in the last module.

  1. Split Data into a training and test set that will be used to develop a predictive model;

  2. Create a “Recipe” for our predictive model and learn how to deal with nominal data that we would like to use as predictors;

  3. Specify the model and workflow by selecting the functional form of the model that we want and using a model workflow to pair our model and recipe together;

  4. Fit Models to our training set using logistic regression;

  5. Interpret Accuracy of our model to see how well our model can “predict” our outcome of interest.

Step 1. Split data

The authors of Data Science in Education Using R (Estrellado et al., 2020) remind us that:

At its core, machine learning is the process of “showing” your statistical model only some of the data at once and training the model to predict accurately on that training dataset (this is the “learning” part of machine learning). Then, the model as developed on the training data is shown new data - data you had all along, but hid from your computer initially - and you see how well the model that you developed on the training data performs on this new testing data. Eventually, you might use the model on entirely new data.

Training and Testing Sets

It is therefore common when beginning a modeling project to separate the data set into two partitions:

  • The training set is used to estimate, develop, and compare models; to engineer features; to tune models; and so on.

  • The test set is held in reserve until the end of the project, at which point there should only be one or two models under serious consideration. It is used as an unbiased source for measuring final model performance.

There are different ways to create these partitions of the data and there is no uniform guideline for determining how much data should be set aside for testing. The proportion of data can be driven by many factors, including the size of the original pool of samples and the total number of predictors. 

After you decide how much to set aside, the most common approach for actually partitioning your data is to use a random sample. For our purposes, we’ll use random sampling to select 20% of the data for the test set and use the remainder for the training set (note that we set this proportion explicitly; the {rsample} default is a 75%/25% split).

Split Data Sets

To split our data, we will be using our first {tidymodels} function - initial_split().

The initial_split() function from the {rsample} package takes the original data and saves the information on how to make the partitions. The {rsample} package also has two aptly named functions for creating the training and testing data sets: training() and testing(), respectively.

We also specify the strata argument to ensure that the distribution of the dependent variable (pass) is similar in the training and testing sets.

Run the following code to split the data:

set.seed(20230712)

train_test_split <- initial_split(students, prop = .80, strata = "pass")

data_train <- training(train_test_split)

data_test  <- testing(train_test_split)

Note: Since random sampling uses random numbers, it is important to set the random number seed using the set.seed() function. This ensures that the random numbers can be reproduced at a later time (if needed). We pick the first date on which we may carry out this learning lab as the seed, but any number will work!

👉 Your Turn

Go ahead and type data_train and then data_test into the console to check that the training set indeed has about 80% of the observations in the full dataset. Do that in the chunk below:

data_train
# A tibble: 26,073 × 16
   code_module code_presentation id_student gender region      highest_education
   <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
 1 AAA         2013J                  30268 F      North West… A Level or Equiv…
 2 AAA         2013J                  65002 F      East Angli… A Level or Equiv…
 3 AAA         2013J                 106247 M      South Regi… HE Qualification 
 4 AAA         2013J                 129955 M      West Midla… A Level or Equiv…
 5 AAA         2013J                 134143 F      South East… A Level or Equiv…
 6 AAA         2013J                 135400 F      South East… Lower Than A Lev…
 7 AAA         2013J                 147756 M      North Regi… Lower Than A Lev…
 8 AAA         2013J                 147793 F      North Regi… Lower Than A Lev…
 9 AAA         2013J                 148993 F      North West… A Level or Equiv…
10 AAA         2013J                 155984 F      East Angli… Lower Than A Lev…
# ℹ 26,063 more rows
# ℹ 10 more variables: imd_band <int>, age_band <chr>,
#   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
#   final_result <chr>, module_presentation_length <dbl>,
#   date_registration <dbl>, date_unregistration <dbl>, pass <fct>
data_test
# A tibble: 6,520 × 16
   code_module code_presentation id_student gender region      highest_education
   <chr>       <chr>                  <dbl> <chr>  <chr>       <chr>            
 1 AAA         2013J                  32885 F      West Midla… Lower Than A Lev…
 2 AAA         2013J                  45642 F      North West… A Level or Equiv…
 3 AAA         2013J                  74372 M      East Angli… A Level or Equiv…
 4 AAA         2013J                  77367 M      East Midla… A Level or Equiv…
 5 AAA         2013J                  94961 M      South Regi… Lower Than A Lev…
 6 AAA         2013J                 110175 M      East Angli… HE Qualification 
 7 AAA         2013J                 116692 M      East Angli… A Level or Equiv…
 8 AAA         2013J                 123044 M      South Regi… A Level or Equiv…
 9 AAA         2013J                 135335 F      East Angli… Lower Than A Lev…
10 AAA         2013J                 141377 M      South West… A Level or Equiv…
# ℹ 6,510 more rows
# ℹ 10 more variables: imd_band <int>, age_band <chr>,
#   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <fct>,
#   final_result <chr>, module_presentation_length <dbl>,
#   date_registration <dbl>, date_unregistration <dbl>, pass <fct>
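
One quick way to confirm the split (an optional check) is to compare the number of rows directly; the first value should be close to 0.80 and the second close to 0.20:

# Optional check: proportions of the full data in the training and testing sets
nrow(data_train) / nrow(students)
nrow(data_test) / nrow(students)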

Step 2: Create a “Recipe”

In this section, we introduce another tidymodels package named {recipes}, which is designed to help you prepare your data before training your model. Recipes are built as a series of preprocessing steps, such as:

  • converting qualitative predictors to indicator variables (also known as dummy variables),

  • transforming data to be on a different scale (e.g., taking the logarithm of a variable),

  • transforming whole groups of predictors together,

  • extracting key features from raw variables (e.g., getting the day of the week out of a date variable), and so on.

If you are familiar with R’s formula interface, a lot of this might sound familiar and like what a formula already does. Recipes can be used to do many of the same things, but they have a much wider range of possibilities.
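
As a brief illustration only (we will not use these steps in this lab), a recipe with a couple of preprocessing steps might look like the sketch below, where step_dummy() creates indicator variables and step_log() applies a log transformation:

# Illustrative sketch only (not used in this lab): a recipe with preprocessing steps
example_rec <- recipe(pass ~ disability + studied_credits, data = data_train) %>% 
    step_dummy(disability) %>% # convert the qualitative predictor to indicator (dummy) variables
    step_log(studied_credits) # put studied_credits on a log scale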

Add a formula

To get started, let’s create a recipe for a simple logistic regression model that we can use to prepare our data before training the model.

The recipe() function, as we used it here, has two arguments:

  • A formula. Any variable on the left-hand side of the tilde (~) is considered the model outcome (pass, in our present case). On the right-hand side of the tilde are the predictors. Variables may be listed by name, or you can use the dot (.) to indicate all other variables as predictors.

  • The data. A recipe is associated with the data set used to create the model. This will typically be the training set, so data = data_train here. Naming a data set doesn’t actually change the data itself; it is only used to catalog the names of the variables and their types, like factors, integers, dates, etc.

👉 Your Turn

Let’s create a recipe where we predict pass (the outcome variable) on the basis of the disability and imd_band (predictor) variables. Add these two variables to your recipe, below.

my_rec <- recipe(pass ~ disability + imd_band, data = data_train)

my_rec
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:   1
predictor: 2

Step 3: Specify the model and workflow

With tidymodels, we start building a model by specifying the functional form of the model that we want using the {parsnip} package. Since our outcome is binary, the model type we will use is “logistic regression.” We can declare this with logistic_reg() and assign it to an object we will later use in our workflow:

Run the following code to finish specifying our model:

# specify model
my_mod <-
    logistic_reg()

That is pretty underwhelming since, on its own, it doesn’t really do much. However, now that the type of model has been specified, a method for fitting or training the model can be stated using the engine.

Start your engine

To set the engine, let’s rewrite the code above and “pipe” in the set_engine("glm") function and set_mode("classification") to set the “mode” to classification. Note that this could also be changed to regression for a continuous/numeric outcome.

👉 Your Turn

Below, specify a glm engine, and a classification mode, replacing the placeholder text below.

my_mod <-
    logistic_reg() %>% 
    set_engine("glm") %>% # generalized linear model
    set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

my_mod
Logistic Regression Model Specification (classification)

Computational engine: glm 

The engine value is often a mash-up of different packages that can be used to fit or train the model as well as the estimation method. For example, we will use "glm", a generalized linear model for binary outcomes and the default engine for logistic regression in the {parsnip} package.

Add to workflow

Now we can use the recipe created earlier across several steps as we train and test our model. To simplify this process, we can use a model workflow, which pairs a model and recipe together.

This is a straightforward approach because different recipes are often needed for different models, so when a model and recipe are bundled, it becomes easier to train and test workflows.

We’ll use the {workflows} package from tidymodels to bundle our parsnip model (my_mod) with our recipe (my_rec).

Add your model and recipe (see their names above)!

my_wf <-
    workflow() %>% # create a workflow
    add_model(my_mod) %>% # add the model we wrote above
    add_recipe(my_rec) # add our recipe we wrote above

Step 4: Fit model

Now that we have a single workflow that can be used to prepare the recipe and train the model from the resulting predictors, we could use the fit() function to fit our model to our training data (data_train). (We already set a random number seed when splitting the data, so the partition is reproducible if we run this code again.)

Instead, we’ll use the last_fit() function, which is the key here: note that it takes the train_test_split object, not just the training data.

Here, then, we fit the model using the training data set and evaluate its accuracy using the testing data set (which is not used to train the model), passing the my_wf object as the first argument and the split as the second.

final_fit <- last_fit(my_wf, train_test_split)

👉 Your Turn

Type the output of the above function (the name you assigned the output to) below; this is the final, fitted model—one that can be interpreted further in the next step!

final_fit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits               id              .metrics .notes   .predictions .workflow 
  <list>               <chr>           <list>   <list>   <list>       <list>    
1 <split [26073/6520]> train/test spl… <tibble> <tibble> <tibble>     <workflow>

You may see a message/warning above or when you examine final_fit; you can safely ignore that.

Step 5: Interpret accuracy

Importantly, we can summarize across all of these predictions. One way to do this is straightforward: check how many of the predicted values were the same as the observed values, as in the following chunk of code:

final_fit %>% 
    collect_predictions() %>% # see test set predictions
    select(.pred_class, pass) %>% # just to make the output easier to view 
    mutate(correct = .pred_class == pass) # create a new variable, correct, telling us when the model was and was not correct
# A tibble: 6,520 × 3
   .pred_class pass  correct
   <fct>       <fct> <lgl>  
 1 0           1     FALSE  
 2 0           1     FALSE  
 3 <NA>        0     NA     
 4 0           1     FALSE  
 5 0           0     TRUE   
 6 0           1     FALSE  
 7 0           1     FALSE  
 8 0           1     FALSE  
 9 0           0     TRUE   
10 0           0     TRUE   
# ℹ 6,510 more rows

You may notice that some of the rows have missing values. This is because there were some missing values in the imd_band variable, and for this machine learning algorithm (the generalized linear model), missing values result in row-wise deletion.
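
If you are curious how many test-set rows are affected, one quick optional check is to count the missing imd_band values:

# Optional: how many rows in the test set are missing imd_band?
data_test %>% 
    summarise(n_missing_imd_band = sum(is.na(imd_band)))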

When these are the same, the model predicted the outcome correctly; when they aren’t the same, the model was incorrect.
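
To roll this up into a single number, one way (roughly what the collect_metrics() short-cut below will report as accuracy) is to take the proportion of predictions that match the observed value:

# Accuracy "by hand": the proportion of test-set predictions matching the observed outcome
final_fit %>% 
    collect_predictions() %>% 
    mutate(correct = .pred_class == pass) %>% 
    summarise(accuracy = mean(correct, na.rm = TRUE)) # na.rm drops rows the model could not predict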

How accurate was our predictive model? Consider how well our model would have done by chance alone – what would the accuracy be in that case (with the model predicting pass one-half of the time)?

students %>% 
    count(pass)
# A tibble: 2 × 2
  pass      n
  <fct> <int>
1 0     20232
2 1     12361
students %>% 
    mutate(prediction = sample(c(0, 1), nrow(students), replace = TRUE)) %>% 
    mutate(correct = if_else(prediction == 1 & pass == 1 |
                                 prediction == 0 & pass == 0, 1, 0)) %>% 
    tabyl(correct)
 correct     n   percent
       0 16197 0.4969472
       1 16396 0.5030528

Curiously, randomly picking a 0 (did not pass) or a 1 (passed) will always lead to around 50% accuracy, regardless of how many observations are actually associated with a 0 or a 1. This is because each random guess matches the observed value with probability one-half, whatever that value happens to be; by contrast, always guessing the more common outcome (0, did not pass) would be correct for roughly 62% of students here.

👉 Your Turn

A short-cut to the above is to simply use the collect_metrics() function, taking final_fit as its one argument; we’ll use this short-cut from here forward, having seen how accuracy is calculated. Write that code below.
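
Here is one way to write that:

final_fit %>% 
    collect_metrics() # summarizes performance (including accuracy) on the test set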

Let’s step back a bit. How well could we do if we include more data? And how useful could such a model be in the real world? We’ll dive into these questions more over the forthcoming learning labs.

That’s it for now; the core parts of machine learning are used in the steps you took above. What we’ll do after this learning lab only adds nuance and complexity to what we’ve already done.

5. COMMUNICATE

For your SML Module 2 Badge, you will have an opportunity to create a simple “data product” designed to illustrate some insights gained from your model and ideally highlight an “action step” that can be taken to act upon your findings.

Rendered HTML files can be published online in a variety of ways, including Posit Cloud, RPubs, GitHub Pages, Quarto Pub, or other methods. The easiest way to quickly publish your file online is to publish directly from RStudio. You can do so by clicking the “Publish” button located in the Viewer Pane after you render your document.

Congratulations - you’ve completed the second supervised machine learning case study!