How Good is Our Model, Really?

Conceptual Overview

Purpose and Agenda

How do we interpret a machine learning model? What else can we say, besides how accurate a model is? This module is intended to help you answer these questions by examining output from a classification and a regression model. We use a large data set, the Open University Learning Analytics Dataset (OULAD).

What we’ll do in this presentation

  • Discussion 1
  • Introducing the OULAD
  • Key Concept #1: Accuracy
  • Key Concept #2: Feature Engineering (part A)
  • Key Concept #3: Metrics and their real-world interpretation
  • Discussion 2
  • Introduction to the other parts of this module

Two notes

  1. Sometimes, we do things that are a little bit harder in the short-term for pedagogical reasons (evaluating metrics with training data, for instance)—some of these frictions will go away when we progress to our “full” model (in the next module)
  2. Whereas the last module was focused on a big concept (the importance of splitting data into training and testing sets), this module is focused on a bunch of concepts (different fit metrics) that are best understood when they are used in a variety of specific instances (when each metric is needed, used, and interpreted)

Discussion 1

  • We are likely familiar with accuracy and maybe another measure, Cohen’s Kappa
  • But, you may have heard of other means of determining how good a model is at making predictions: confusion matrices, specificity, sensitivity, recall, AUC-ROC, and others
  • Broadly, these help us understand, in a finer-grained way than accuracy alone, for which cases and types of cases a model predicts well or poorly
  • Think broadly and not formally (yet): What makes a prediction model a good one?
  • After having worked through the first and second modules, have your thoughts on what data you might use for a machine learning study evolved? If so, in what ways? If not, please elaborate on your initial thoughts and plans.

Introducing the OULAD

OULAD

  • The Open University Learning Analytics Dataset (OULAD) is a publicly available dataset from the Open University in the UK
  • It contains data on students enrolled in online courses, including their demographics, course interactions, and final grades
  • Many students pass these courses, but not all do
  • We have data on students’ initial characteristics and their interactions in the course
  • If we could develop a good prediction model, we could provide additional supports to students–and maybe move the needle for some students who might not otherwise succeed

OULAD files

We’ll be focusing on three files:

  • studentInfo, courses, and studentRegistration

These are joined together (see oulad.R) for this module.

OULAD data

# A tibble: 3 × 15
  code_module code_presentation id_student gender region       highest_education
  <chr>       <chr>                  <dbl> <chr>  <chr>        <chr>            
1 AAA         2013J                  11391 M      East Anglia… HE Qualification 
2 AAA         2013J                  28400 F      Scotland     HE Qualification 
3 AAA         2013J                  30268 F      North Weste… A Level or Equiv…
# ℹ 9 more variables: imd_band <chr>, age_band <chr>,
#   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
#   final_result <chr>, module_presentation_length <dbl>,
#   date_registration <dbl>, date_unregistration <dbl>

Key Concept #1

Accuracy

Let’s start with accuracy and a simple confusion matrix. Given the five predictions below, what is the accuracy?

Outcome  Prediction  Correct?
1        1           Yes
0        0           Yes
0        1           No
1        0           No
1        1           Yes
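The data_for_conf_mat object used in the next few chunks is not built on this slide; here is one way it could be constructed from the table above (a minimal sketch, with factor levels chosen as an assumption so that the conf_mat() output shown later matches):

library(tidyverse)

# One possible construction of the small example data set shown above.
# Outcome and Prediction are stored as factors (levels 0 and 1) so that
# yardstick's conf_mat() can be applied to them later.
data_for_conf_mat <- tibble(
  Outcome    = factor(c(1, 0, 0, 1, 1), levels = c(0, 1)),
  Prediction = factor(c(1, 0, 1, 0, 1), levels = c(0, 1))
)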

Accuracy Calculation

Use the tabyl() function (from {janitor}) to calculate the accuracy in the code chunk below.

data_for_conf_mat %>% 
    mutate(correct = Outcome == Prediction) %>% 
    tabyl(correct)
 correct n percent
   FALSE 2     0.4
    TRUE 3     0.6

The limitation of accuracy

Model     Accuracy  Class distribution
Model A   0.8       85% Good, 15% Poor
Model B   0.7       15% Good, 85% Poor

The two models have seemingly reasonable accuracy, but are they equally good? Given the class distributions, how does each compare to simply predicting the majority class?

Beyond accuracy: The need for nuance

  • Accuracy is insufficient when:
    • Class distributions are imbalanced
    • Different types of errors have different consequences
    • You need to tune your model for specific objectives
  • Example from OULAD data:
    • What if only 20% of students pass a class?
    • A model that always predicts “not pass” would have 80% accuracy, but would be useless for identifying successful students (a quick sketch of this baseline follows this list)
  • We need metrics that tell us:
    • How well we identify positive cases (pass) and negative cases (fail)
    • How reliable our positive and negative predictions are
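A quick sketch of the “always predict not pass” baseline described above (the 20% pass rate is the hypothetical figure from this example, not an OULAD estimate):

# 1,000 simulated students, 20% of whom pass
set.seed(2025)
outcomes <- sample(c("pass", "not pass"), size = 1000, replace = TRUE, prob = c(0.2, 0.8))

# A "model" that always predicts "not pass" is correct about 80% of the time,
# yet it never identifies a single student who passes
mean(outcomes == "not pass")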

Confusion Matrix

Now, let’s create a confusion matrix based on this data - this lets us dive deeper into how good our models’ predictions are:

library(tidymodels)

data_for_conf_mat %>% 
    conf_mat(Outcome, Prediction)
          Truth
Prediction 0 1
         0 1 1
         1 1 2

Confusion Matrix Terms

Accuracy: Prop. of the sample that is true positive or true negative

True positive (TP): Prop. of the sample that is affected by a condition and correctly tested positive

True negative (TN): Prop. of the sample that is not affected by a condition and correctly tested negative

False positive (FP): Prop. of the sample that is not affected by a condition and incorrectly tested positive

False negative (FN): Prop. of the sample that is affected by a condition and incorrectly tested negative.
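To connect these terms to the conf_mat() output printed earlier (with “1”, pass, treated as the positive class), the cells line up like this:

          Truth
Prediction    0    1
         0   TN   FN
         1   FP   TP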

Confusion Matrix Visual

Metrics

AUC-ROC

  • Area Under the Curve - Receiver Operator Characteristic (AUC-ROC)
  • Tells us how the true positive rate and false positive rate change as the classification threshold varies
  • Classification threshold: the probability above which a model makes a positive prediction
  • Higher is better (a brief {yardstick} sketch follows this list)
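A minimal {yardstick} sketch (the preds object and its column names are hypothetical, not from this module’s code): roc_auc() needs the true class and the predicted probability of the positive class.

library(tidymodels)

# Hypothetical predictions: truth is the observed class and .pred_1 is the
# model's predicted probability that the class is "1"
preds <- tibble(
  truth   = factor(c(1, 0, 0, 1, 1), levels = c(1, 0)),
  .pred_1 = c(0.80, 0.20, 0.55, 0.45, 0.90)
)

preds %>% roc_auc(truth, .pred_1)                    # area under the ROC curve
preds %>% roc_curve(truth, .pred_1) %>% autoplot()   # the curve itself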

Key Concept #2

Feature Engineering (Part A)

Why?

Let’s consider a very simple data set, d, with a variable, var_a, measured at ten time points for a single student. How do we add this to our model? Focus on the time element; how could you account for it?

d <- tibble(student_id = "janyia", time_point = 1:10, var_a = c(0.01, 0.32, 0.32, 0.34, 0.04, 0.54, 0.56, 0.75, 0.63, 0.78))
d %>% head(3)
# A tibble: 3 × 3
  student_id time_point var_a
  <chr>           <int> <dbl>
1 janyia              1  0.01
2 janyia              2  0.32
3 janyia              3  0.32

Why (again)?

How about a different variable, var_b? How could we add this to a model?

d <- tibble(student_id = "janyia", time_point = 1:10, var_b = c(12, 10, 35, 3, 4, 54, 56, 75, 63, 78))
d %>% head(3)
# A tibble: 3 × 3
  student_id time_point var_b
  <chr>           <int> <dbl>
1 janyia              1    12
2 janyia              2    10
3 janyia              3    35

What are some other ideas?

Other Options

A few (other) options

  • Raw data points
  • Their mean
  • Their maximum
  • Their variability (standard deviation)
  • Their linear slope
  • Their quadratic slope

Each of these may derive from a single variable but may offer predictive utility as a distinct feature (see the sketch below).
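As a sketch of how these summaries could be computed (using the earlier version of d that contains var_a; the feature names here are just illustrative):

library(tidyverse)

# Collapse the ten time points into candidate features, one row per student;
# the linear slope comes from regressing var_a on time_point
d %>% 
  group_by(student_id) %>% 
  summarize(
    var_a_mean  = mean(var_a),
    var_a_max   = max(var_a),
    var_a_sd    = sd(var_a),
    var_a_slope = coef(lm(var_a ~ time_point))[["time_point"]]
  )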

Time Stamps

Here’s a time stamp:

[1] "2025-07-13 19:47:47 EDT"

How could this variable be used as a predictor variable?
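One sketch of how a timestamp like this could be decomposed into simpler predictors, assuming {lubridate} (loaded with the tidyverse):

library(lubridate)

# Parse the timestamp shown above, then pull out pieces a model can use
ts <- ymd_hms("2025-07-13 19:47:47", tz = "America/New_York")

wday(ts, label = TRUE)   # day of the week
hour(ts)                 # hour of the day
month(ts, label = TRUE)  # month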

How?

  • We can do all of these things manually
  • But there are also helpful {recipes} functions to do this
  • The {recipes} package makes it practical to carry out feature engineering steps not only for single variables, but for groups of variables (or all of the variables)
  • Examples, all of which start with step_ (a minimal recipe sketch follows this list):
    • step_dummy()
    • step_normalize()
    • step_impute_mean() (and the other step_impute_*() functions)
    • step_date()
    • step_holiday()
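A minimal sketch of a recipe using several of these steps; train_data and registration_date are hypothetical placeholders, not objects from this module’s code:

library(tidymodels)

# A hypothetical recipe: derive date features, impute, dummy-code, and normalize
rec <- recipe(final_result ~ ., data = train_data) %>% 
  step_date(registration_date, features = c("dow", "month")) %>%   # day of week and month from a date
  step_impute_mean(all_numeric_predictors()) %>%                   # fill missing numeric values with the mean
  step_dummy(all_nominal_predictors()) %>%                         # dummy-code categorical predictors
  step_normalize(all_numeric_predictors())                         # center and scale numeric predictors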

Key Concept #3: Metrics and their real-world interpretation

From confusion matrix to metrics

Each metric below is listed with its formula, the question it answers, and its value in this example:

Accuracy: (TP + TN) / (TP + TN + FP + FN); Overall, how often is the model correct? 80%
Sensitivity/Recall: TP / (TP + FN); When an institution actually has good graduation rates, how often does the model predict this? 80%
Specificity: TN / (TN + FP); When an institution actually has poor graduation rates, how often does the model predict this? 80%
Precision/PPV: TP / (TP + FP); When the model predicts good graduation rates, how often is it correct? 71%
NPV: TN / (TN + FN); When the model predicts poor graduation rates, how often is it correct? 87%
F1 Score: 2 × (Precision × Recall) / (Precision + Recall); What is the harmonic mean of precision and recall? 75%
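These same metrics can be pulled straight from a {yardstick} confusion matrix; a sketch using the small data_for_conf_mat example from earlier (its values will differ from those above, which come from a larger example):

library(tidymodels)

# summary() on a conf_mat object returns accuracy, sensitivity, specificity,
# precision (ppv), npv, F1 (f_meas), and more in one tibble;
# event_level = "second" treats "1" as the positive class
data_for_conf_mat %>% 
    conf_mat(Outcome, Prediction) %>% 
    summary(event_level = "second")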

Sensitivity vs. Specificity

Precision vs. Recall (Sensitivity)

F1 Score: The Harmonic Mean

What is the F1 score?

  • F1 = 2 × (Precision × Recall) / (Precision + Recall)
  • The harmonic mean balances precision and recall
  • Ranges from 0 to 1 (higher is better)
  • Only high when both precision and recall are high

Why the harmonic mean instead of the arithmetic mean?

  • The harmonic mean penalizes extreme values more severely
  • If either precision or recall is very low, F1 will be low
  • It forces you to achieve balance between the two metrics
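As a quick worked check with the values from the table a few slides back: with precision = 71% and recall = 80%, F1 = 2 × (0.71 × 0.80) / (0.71 + 0.80) = 1.136 / 1.51 ≈ 0.75, matching the 75% shown there.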

When to Use F1 Score in OULAD Context

The F1 score is ideal when:

  • You need balanced performance on precision and recall
  • Both false positives and false negatives are costly
  • You’re comparing multiple models and want a single metric
  • Class imbalance exists but you need to perform well on both classes

OULAD example:

  • Scenario: Predicting students who need academic support
  • Goal: Find struggling students (high recall) while avoiding overwhelming support services (high precision)
  • F1 helps balance: Catching most at-risk students without flooding counselors with false alarms

Limitation: the F1 score doesn’t tell you which metric is driving performance; you still need to examine precision and recall individually for a full understanding.

Metrics in Context: What Matters for OULAD?

When using the Open University Learning Analytics Dataset (OULAD) to build models that predict student outcomes at the module level (e.g., pass, fail, or withdrawal), the choice of which performance metrics to prioritize depends heavily on the specific goals of your analysis and the interventions or actions that will follow.

High Precision (Positive Predictive Value - PPV)

High precision is crucial if:

  • You aim to reliably identify students who are genuinely at high risk of failing or withdrawing, and interventions are costly or intensive. The focus is on minimizing false positives (wrongly labeling a student as at-risk when they are not).
  • You are allocating limited, high-impact support resources (e.g., one-on-one tutoring, intensive counseling) and want to ensure they reach students who truly need them.
  • The “cost” of a false positive is high (e.g., causing unnecessary anxiety for a student flagged as at-risk, or misdirecting scarce resources).

High Recall (Sensitivity)

High recall is crucial if:

  • Your primary goal is to identify as many students at risk of failing or withdrawing as possible, even if it means some students who would have ultimately passed are also flagged. The focus is on minimizing false negatives (failing to identify a student who will struggle).
  • You are implementing broad, lower-cost interventions (e.g., sending encouraging emails, pointing to general study resources, offering optional workshops) where wider reach is beneficial.
  • The “cost” of a false negative is high (e.g., missing the opportunity to support a student who subsequently fails or withdraws).

High Specificity

High specificity is crucial if:

  • You want to accurately identify students who are not at risk and are likely to pass or achieve distinction, perhaps to study their successful learning patterns or to avoid unnecessary interventions.
  • You are trying to minimize the number of students incorrectly flagged for intervention who would have succeeded on their own (reducing “false alarms” for students not needing support).
  • Resources for “non-intervention” or “positive pathway” studies are limited, and you need to be sure you are focusing on genuinely low-risk students.

In Essence for OULAD:

  • Prioritize Precision when targeted, resource-intensive interventions for at-risk students are planned.
  • Prioritize Recall when broad, less costly interventions are available, and the main aim is to catch every potentially struggling student.
  • Prioritize Specificity when the goal is to correctly identify students not needing intervention, or to study successful student cohorts with high confidence.

The precision-recall tradeoff

The ROC curve and AUC

Choosing the right classification threshold

The choice of threshold depends on your specific needs and priorities (a short code sketch follows the list below):

  • Default threshold (0.5): Balanced, but often not optimal

  • Lower threshold: Increases sensitivity, decreases specificity

    • Pro: Identifies more institutions with good graduation rates
    • Con: More false positives (incorrectly labeled as good)
  • Higher threshold: Increases specificity, decreases sensitivity

    • Pro: More confident in positive predictions (high precision)
    • Con: Misses more institutions with good graduation rates
  • Optimal threshold depends on:

    • The relative costs of false positives vs. false negatives
    • Available resources for intervention
    • Stakeholder priorities
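A sketch of applying a custom threshold by hand, reusing the hypothetical preds object from the AUC-ROC sketch earlier, rather than relying on the default 0.5 conversion:

library(tidymodels)

threshold <- 0.30  # a lower threshold favors sensitivity over specificity

preds %>% 
  mutate(.pred_class = factor(if_else(.pred_1 >= threshold, 1, 0), levels = c(1, 0))) %>% 
  conf_mat(truth, .pred_class)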

Discussion 2

  • Which metrics for supervised machine learning models (in classification “mode”) are important to interpret? Why?
  • Thinking broadly about your research interest, what would you need to consider before using a supervised machine learning model? Consider not only model metrics but also the data collection process and how the predictions may be used.

Introduction to the other parts of this module

Baker, R. S., Berning, A. W., Gowda, S. M., Zhang, S., & Hawn, A. (2020). Predicting K-12 dropout. Journal of Education for Students Placed at Risk (JESPAR), 25(1), 28-54.

Baker, R. S., Bosch, N., Hutt, S., Zambrano, A. F., & Bowers, A. J. (2024). On fixing the right problems in predictive analytics: AUC is not the problem. arXiv preprint. https://arxiv.org/pdf/2404.06989

Maestrales, S., Zhai, X., Touitou, I., Baker, Q., Schneider, B., & Krajcik, J. (2021). Using machine learning to score multi-dimensional assessments of chemistry and physics. Journal of Science Education and Technology, 30(2), 239-254.

  • Adding another data source from the OULAD, assessments data
  • Interpreting each of the metrics in greater detail
  • Using metric_set
  • Stepping back and interpreting the model as a whole, focusing on moving beyond accuracy!
  • Finding another relevant study

fin

General troubleshooting tips for R and RStudio