When we left off the first case study, we saw that our model was pretty accurate. We can and will do better in terms of accuracy. But the accuracy value we found also raises a broader question: how good was our model at making predictions?
While accuracy is easy to understand and interpret, it provides only a limited answer to that question.
In this learning lab and case study, we explore several ways to understand how good a predictive model ours is. We do this through:
Other statistics, such as sensitivity and specificity
Tables, particularly a confusion matrix
Our driving question for this case study, then, is: How good is our model at making predictions?
Like in the first learning lab, we'll first load several packages: {tidyverse}, {tidymodels}, and {janitor}. Make sure these are installed (via install.packages()). Note that if they're already installed, you don't need to install them again.
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
2. WRANGLE
Like in the code-along for the overview presentation, let's take a look at the data and do some processing of it.
👉 Your Turn ⤵
We'll load the students file together; you'll write code to read the assessments file, which is named "oulad-assessments.csv". Please assign the name assessments to the loaded assessments file.
students <- read_csv("data/oulad-students.csv")
Rows: 32593 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): code_module, code_presentation, gender, region, highest_education, ...
dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
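One possible answer for the assessments file (a minimal sketch, assuming it sits in the same data/ folder as the students file):

assessments <- read_csv("data/oulad-assessments.csv")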
Rows: 173912 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): code_module, code_presentation, assessment_type
dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
3. EXPLORE
👉 Your Turn ⤵
In the last learning lab, we used the count() function. Let's practice that again by counting the number of assessments of different types. If you need it, recall that the data dictionary is here. Note what the different types of assessments mean.
assessments %>% count(assessment_type)
# A tibble: 3 × 2
assessment_type n
<chr> <int>
1 CMA 70527
2 Exam 4959
3 TMA 98426
👉 Your Turn ⤵
We'll now use another function from the {tidyverse}, like count(). Specifically, we'll use the distinct() function. This returns the unique (or distinct) values for a specified variable. Learn more about distinct() here. Below, find the distinct assessment IDs.
assessments %>% distinct(id_assessment) # this many unique assessments
We might be interested in how many assessments there are per course. To calculate that, we need to count() several variables at once; when doing this, count() tabulates the number of unique combinations of the variables passed to it.
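For example, a minimal sketch of that idea, counting the combinations of the two course identifier columns:

assessments %>% 
  count(code_module, code_presentation) # tabulate assessment records for each course and semester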
Let's explore the dates assignments were due a bit, using the summarize() function:
assessments %>% 
  summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
            median_date = median(date, na.rm = TRUE), # find the median
            sd_date = sd(date, na.rm = TRUE), # find the sd
            min_date = min(date, na.rm = TRUE), # find the min
            max_date = max(date, na.rm = TRUE)) # find the max
What can we take from this? It looks like, on average, the average (mean and median) date assignments were due was around day 130, that is, 130 days after the start of the course. The first assignment seems to have been due 12 days into the course, and the last 261 days after.
Crucially, though, these dates vary by course. Thus, we need to first group the data by course. Let's use a different function this time, quantile(), and calculate the first quantile value. Our reasoning is that we want to be able to act to support students, and if we wait until after the average assignment is due, it might be too late. The first quantile, by contrast, comes approximately one-quarter of the way through the semester, leaving more time to intervene.
assessments %>% 
  group_by(code_module, code_presentation) %>% # first, group by course (module: course; presentation: semester)
  summarize(mean_date = mean(date, na.rm = TRUE),
            median_date = median(date, na.rm = TRUE),
            sd_date = sd(date, na.rm = TRUE),
            min_date = min(date, na.rm = TRUE),
            max_date = max(date, na.rm = TRUE),
            first_quantile = quantile(date, probs = .25, na.rm = TRUE)) # find the first (25%) quantile
`summarise()` has grouped output by 'code_module'. You can override using the
`.groups` argument.
Alright, this is a bit complicated, but we can actually work with this data. Let's use just a portion of the above code, assigning it the name code_module_dates.
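A minimal version of that code might look like this (keeping only the grouping columns and the first-quantile value, named quantile_cutoff_date here to match the filtering step used below):

code_module_dates <- assessments %>% 
  group_by(code_module, code_presentation) %>% # group by course and semester
  summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE)) # the date one-quarter of the way through the assignment due dates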
What have we created? We found the date that is one-quarter of the way through the semester (in terms of the dates assignments are due).
👉 Your Turn ⤵
We can thus use this to group and calculate students' performance on assignments through this point. To do this, we need a join function, left_join() in particular. Please use left_join() to join together assessments and code_module_dates, with assessments being the "left" data frame and code_module_dates the right. Assign the output of the join the name assessments_joined.
assessments_joined <- assessments %>% 
  left_join(code_module_dates) # join the data based on code_module and code_presentation
Joining with `by = join_by(code_module, code_presentation)`
We're almost there! The next few lines filter the assessments data so it only includes assessments before our cutoff date.
assessments_filtered <- assessments_joined %>% 
  filter(date < quantile_cutoff_date) # filter the data so only assignments before the cutoff date are included
Finally, we'll find the average score for each student.
assessments_summarized <- assessments_filtered %>% 
  mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
  group_by(id_student) %>% 
  summarize(mean_weighted_score = mean(weighted_score))
As a point of reflection, note how much work we've done before reaching the machine learning steps. Though it can be a challenge, this is typical for both machine learning and more traditional statistical models: much of the work is in the data preparation and wrangling stages.
Let's copy the code below that we used to process the students data (when processing the pass and imd_band variables).
students <- students %>% 
  mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
  mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

students <- students %>% 
  mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                "10-20%",
                                                "20-30%",
                                                "30-40%",
                                                "40-50%",
                                                "50-60%",
                                                "60-70%",
                                                "70-80%",
                                                "80-90%",
                                                "90-100%"))) %>% # this creates a factor with ordered levels
  mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
👉 Your Turn ⤵
Finally, let's join together students and assessments_summarized, assigning the joined data the name students_and_assessments.
students_and_assessments <- students %>% 
  left_join(assessments_summarized)
Big picture, we've added another measure, another feature, that we can use to make predictions: students' performance on assessments one-quarter of the way through the course.
We're now ready to proceed to our machine learning steps!
The problem we will be working on - predicting students who pass, based on data from the first one-quarter of the semester - has an analog in a recent paper by Ryan Baker and colleagues:
In Baker et al. (2020), the authors create a precision-recall graph (recall is also known as sensitivity) - one that demonstrates the trade-off between optimizing these two statistics. Review their paper - especially the results section - to see how they discuss these two statistics.
Baker, R. S., Berning, A. W., Gowda, S. M., Zhang, S., & Hawn, A. (2020). Predicting K-12 dropout. Journal of Education for Students Placed at Risk (JESPAR), 25(1), 28-54.
Please review this paper before proceeding, focusing on how they describe and interpret precision and recall.
4. MODEL
Step 1. Split data
This is identical to what we did in the first learning lab, using the students_and_assessments data. We'll also create a testing data set we'll use later.
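If you need a reminder, a sketch along the lines of the first learning lab (the proportion, seed, and object names here are assumptions):

set.seed(20230712) # hypothetical seed, for reproducibility
train_test_split <- initial_split(students_and_assessments, prop = .80) # hold out 20% of the data for testing
data_train <- training(train_test_split) # training set
data_test <- testing(train_test_split) # testing set we'll use later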
Step 2: Engineer features and write down the recipe
We'll also specify the recipe, adding the mean_weighted_score variable we calculated above as well as the variables we used in LL1 (case study and badge). Note how we dummy code two variables using step_dummy() (described further in the first learning lab).
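A sketch of such a recipe (mean_weighted_score is the new feature; the other predictors here are placeholders for whichever variables you used in LL1, and data_train is the training set from the split sketched above):

my_rec <- recipe(pass ~ mean_weighted_score + disability + gender + imd_band, data = data_train) %>% # predict pass from the new feature plus LL1 variables
  step_dummy(disability) %>% # dummy code the first nominal variable
  step_dummy(gender) # dummy code the second nominal variable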
Step 3: Specify the model and workflow

# specify model
my_mod <- logistic_reg() %>% 
  set_engine("glm") %>% # generalized linear model
  set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

# specify workflow
my_wf <- workflow() %>% # create a workflow
  add_model(my_mod) %>% # add the model we wrote above
  add_recipe(my_rec) # add our recipe we wrote above
Step 4: Fit model
Finally, we'll use the last_fit() function, but we'll add something a little different - we'll add the metric set here. To the metrics we have discussed, we'll add another common one - Cohen's Kappa, which is similar to accuracy but accounts for chance agreement.
To do so, we have to specify which metrics we want to use with the metric_set() function (see all of the available options here). Please specify six metrics in the metric set - accuracy, sensitivity, specificity, ppv (precision), npv, and kap (Cohen's Kappa). Assign the name class_metrics to the output of your use of the metric_set() function.
Then, please use last_fit(), looking to how we did this in the last learning lab for a guide on how to specify the arguments. To those arguments, add metrics = class_metrics.
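A sketch of these two steps (note that the yardstick function for Cohen's Kappa is kap(); train_test_split is carried over from the split sketched above, and final_fit is the name used for the fitted object later on):

class_metrics <- metric_set(accuracy, sensitivity, specificity, ppv, npv, kap) # the six metrics

final_fit <- last_fit(my_wf, train_test_split, metrics = class_metrics) # fit on the training data and evaluate on the testing data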
We're now ready to move on to interpreting the accuracy of our model.
Step 5: Interpret accuracy
A confusion matrix and true and false positives and negatives
Let's start with a simple confusion matrix. The confusion matrix is a 2 x 2 table with values (cells in the table) representing one of four conditions, elaborated below. You'll fill in the last two columns in a few moments.
| Statistic | How to Find | Interpretation | Value | % |
|---|---|---|---|---|
| True positive (TP) | Truth: 1, Prediction: 1 | Proportion of the population that is affected by a condition and correctly tested positive | 2443 | 19.8 |
| True negative (TN) | Truth: 0, Prediction: 0 | Proportion of the population that is not affected by a condition and correctly tested negative | 4536 | 36.8 |
| False positive (FP) | Truth: 0, Prediction: 1 | Proportion of the population that is not affected by a condition and incorrectly tested positive | 2070 | 16.8 |
| False negative (FN) | Truth: 1, Prediction: 0 | Proportion of the population that is affected by a condition and incorrectly tested negative | 3269 | 26.5 |
To create a confusion matrix, run collect_predictions(), which does what it suggests - it gathers together the model's test set predictions. Pass the last_fit object to this function below.
Take a look at the columns. You'll need to provide the predictions (created with collect_predictions()) and then pipe that to conf_mat(), to which you provide the names of a) the known values for the test set and b) the predictions. Some code to get you started is below.
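A sketch to get started (assuming the fitted object is named final_fit, as above, and that the known outcome column is pass):

final_fit %>% 
  collect_predictions() %>% # gather the test set predictions
  conf_mat(truth = pass, estimate = .pred_class) # cross-tabulate the known values and the predicted classes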
You should see a confusion matrix output. If so, nice work! Please fill in the Value and Percentage columns in the table above (with TP, TN, FP, and FN), entering the appropriate values and then converting those into a percentage based on the total number of data points in the test data set. The accuracy can be interpreted as the sum of the true positive and true negative percentages. So, what's the accuracy? Add that below as a percentage.
Accuracy: 56.6%
Other measures of predictive accuracy
Here's where things get interesting: there are other statistics that have different denominators. Recall from the overview presentation that we can slice and dice the confusion matrix to calculate statistics that give us insights into the predictive utility of the model. Based on the above Values for TP, TN, FP, and FN you interpreted a few moments ago, add the Statistic Values for sensitivity, specificity, precision, and negative predictive value below, to three decimal places (a quick arithmetic check appears after the table).
👉 Your Turn ⤵
| Statistic | Equation | Interpretation | Question Answered | Statistic Value |
|---|---|---|---|---|
| Sensitivity (AKA recall) | TP / (TP + FN) | Proportion of those who are affected by a condition and correctly tested positive | Out of all the actual positives, how many did we correctly predict? | .428 |
| Specificity | TN / (FP + TN) | Proportion of those who are not affected by a condition and correctly tested negative | Out of all the actual negatives, how many did we correctly predict? | .687 |
| Precision (AKA Positive Predictive Value) | TP / (TP + FP) | Proportion of those who tested positive who are affected by a condition | Out of all the instances we predicted as positive, how many are actually positive? | .541 |
| Negative Predictive Value | TN / (TN + FN) | Proportion of those who tested negative who are not affected by a condition; the probability that a negative test is correct | Out of all the instances we predicted as negative, how many are actually negative? | .581 |
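To check the arithmetic, here is a quick sketch in R using the counts from the confusion matrix table above:

tp <- 2443 # true positives
tn <- 4536 # true negatives
fp <- 2070 # false positives
fn <- 3269 # false negatives
tp / (tp + fn) # sensitivity (recall): about .428
tn / (fp + tn) # specificity: about .687
tp / (tp + fp) # precision (positive predictive value): about .541
tn / (tn + fn) # negative predictive value: about .581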
So, what does this hard-won output tell us? Let's interpret each statistic carefully in the table below. Please add the value and provide a substantive interpretation. One is provided to get you started.
👉 Your Turn ⤵
| Statistic | Substantive Interpretation |
|---|---|
| Accuracy | |
| Sensitivity (AKA recall) | |
| Specificity | The model correctly predicts about 2/3 of the students who do not pass (as not passing). This is pretty good, but it could be better. |
| Precision (AKA Positive Predictive Value) | |
| Negative Predictive Value | |
This process might suggest to you how a "good" model isn't necessarily one that is the most accurate, but instead is one that has good values for the statistics that matter given our particular question and context.
Recall that Baker and colleagues sought to balance precision and recall (recall is also known as sensitivity). Please briefly discuss how well our model does this: is it better suited to correctly identifying "positive" pass cases (sensitivity) or to correctly identifying "negative" cases, students who do not pass (specificity)?
5. COMMUNICATE
Quickly calculating metrics
After all of this work, we can calculate the above much more easily, given how we set up our metrics with metric_set() earlier. This is especially handy when we want to efficiently communicate the results of our analysis to our audience. Below, use collect_metrics() with the final_fit object, noting that in addition to the four metrics we calculated manually just a few moments ago, we are also provided with the accuracy and Cohen's Kappa values.
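A sketch of that call (again assuming the fitted last_fit object is named final_fit):

final_fit %>% 
  collect_metrics() # returns all six metrics we specified in class_metrics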