How Good is Our Model, Really?

Conceptual Overview

Purpose and Agenda

How do we interpret a machine learning model? What else can we say, besides how accurate a model is? This learning lab is intended to help you answer these questions by examining output from a classification and a regression model. We again use the OULAD, but add an assessment file.

What we’ll do in this presentation

  • Discussion 1
  • Key Concept: Accuracy
  • Key Concept: Feature Engineering (part A)
  • Discussion 2
  • Introduction to the other parts of this learning lab

Two notes

  1. Sometimes, we do things that are a little bit harder in the short-term for pedagogical reasons (evaluating metrics with training data, for instance)—some of these frictions will go away when we progress to our “full” model (in the next module)
  2. Whereas the last module was focused on a big concept (the importance of splitting data into training and testing sets), this module is focused on a bunch of concepts (different fit metrics) that are best understood when they are used in a variety of specific instances (when each metric is needed, used, and interpreted)

Discussion 1

  • We are likely familiar with accuracy and maybe another measure, Cohen’s Kappa
  • But, you may have heard of other means of determining how good a model is at making predictions: confusion matrices, specificity, sensitivity, recall, AUC-ROC, and others
  • Broadly, these help us understand for which cases, and which types of cases, a model predicts well or poorly, in a finer-grained way than accuracy alone
  • Think broadly and not formally (yet): What makes a prediction model a good one?
  • After having worked through the first learning lab, have your thoughts on what data you might use for a machine learning study evolved? If so, in what ways? If not, please elaborate on your initial thoughts and plans.

Key Concept #1

Accuracy

Let’s start with accuracy and a simple confusion matrix; what is the Accuracy?

Outcome  Prediction  Correct?
1        1           Yes
0        0           Yes
0        1           No
1        0           No
1        1           Yes
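
For reference, here is one way to represent the table above as a data frame named data_for_conf_mat, the object used in the code chunks below. This is a sketch for reproducibility; the outcome and prediction are stored as factors so the confusion matrix functions treat them as classes.

library(tidymodels)

# The five outcome-prediction pairs from the table above, stored as
# factors so that conf_mat() and related functions treat them as classes
data_for_conf_mat <- tibble(
    Outcome    = factor(c(1, 0, 0, 1, 1), levels = c(0, 1)),
    Prediction = factor(c(1, 0, 1, 0, 1), levels = c(0, 1))
)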

Use the tabyl() function (from {janitor}) to calculate the accuracy in the code chunk below.

library(janitor)

data_for_conf_mat %>% 
    mutate(correct = Outcome == Prediction) %>% 
    tabyl(correct)
 correct n percent
   FALSE 2     0.4
    TRUE 3     0.6

Now, let’s create a confusion matrix based on this data:

library(tidymodels)

data_for_conf_mat %>% 
    conf_mat(Outcome, Prediction)
          Truth
Prediction 0 1
         0 1 1
         1 1 2

Accuracy: Prop. of the sample that is true positive or true negative

True positive (TP): Prop. of the sample that is affected by a condition and correctly tested positive

True negative (TN): Prop. of the sample that is not affected by a condition and correctly tested negative

False positive (FP): Prop. of the sample that is not affected by a condition and incorrectly tested positive

False negative (FN): Prop. of the sample that is affected by a condition and incorrectly tested negative.
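
Putting numbers to these definitions using the confusion matrix above: TP = 2, TN = 1, FP = 1, and FN = 1, so accuracy = (TP + TN) / (TP + TN + FP + FN) = (2 + 1) / 5 = 0.6, matching the tabyl() result. The same value can be computed with the accuracy() function from {yardstick} (loaded with {tidymodels}):

# Accuracy: proportion of rows where Prediction matches Outcome
data_for_conf_mat %>% 
    accuracy(Outcome, Prediction)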

AUC-ROC

  • Area Under the Curve - Receiver Operator Characteristic (AUC-ROC)
  • Tells us how the true positive rate and false positive rate trade off as the classification threshold varies
  • Classification threshold: the predicted probability above which the model makes a positive prediction
  • Higher is better
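
To make this concrete, here is a minimal sketch of how AUC-ROC is computed with roc_auc() from {yardstick}. The .pred_1 column of predicted probabilities is made up for illustration; with a real model, these values would come from predict() with type = "prob".

# Hypothetical predicted probabilities for class "1" (for illustration only)
data_for_conf_mat %>% 
    mutate(.pred_1 = c(0.80, 0.20, 0.60, 0.45, 0.90)) %>% 
    # event_level = "second" because "1" is the second level of Outcome
    roc_auc(Outcome, .pred_1, event_level = "second")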

Key Concept #2

Feature Engineering (Part A)

Let’s consider a very simple data set, d, with a variable, var_a, measured at ten time points for a single student. How do we add this to our model? Focus on the time element; how could you account for it? (One possible approach is sketched after the output below.)

d <- tibble(student_id = "janyia", time_point = 1:10, var_a = c(0.01, 0.32, 0.32, 0.34, 0.04, 0.54, 0.56, 0.75, 0.63, 0.78))
d %>% head(3)
# A tibble: 3 × 3
  student_id time_point var_a
  <chr>           <int> <dbl>
1 janyia              1  0.01
2 janyia              2  0.32
3 janyia              3  0.32
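
One possible way to account for the time element (a sketch, not necessarily the approach used later in this lab) is to collapse the ten time points into per-student summaries, such as the mean of var_a and its least-squares slope over time:

# Collapse the time series into one row per student: the mean of var_a
# and its least-squares slope over time_point
d %>% 
    group_by(student_id) %>% 
    summarize(
        var_a_mean  = mean(var_a),
        var_a_slope = cov(var_a, time_point) / var(time_point)
    )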

How about a different variable? Now focus on var_b. How could we add this to a model?

d <- tibble(student_id = "janyia", time_point = 1:10, var_b = c(12, 10, 35, 3, 4, 54, 56, 75, 63, 78))
d %>% head(3)
# A tibble: 3 × 3
  student_id time_point var_b
  <chr>           <int> <dbl>
1 janyia              1    12
2 janyia              2    10
3 janyia              3    35
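
If var_b were, say, a count of actions at each time point (an assumption made here only for illustration), a per-student total would be one simple option:

# One option if var_b is a count: the total across all ten time points
d %>% 
    group_by(student_id) %>% 
    summarize(var_b_total = sum(var_b))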
  • We can do all of these things manually
  • But, there are also helpful {recipes} functions to do this
  • Also, the {recipes} package makes it practical to carry out feature engineering steps not only for single variables, but for groups of variables (or all of the variables)
  • Examples, all of which start with step_ (a sketch using a couple of these follows this list):
    • step_dummy()
    • step_normalize()
    • step_impute_mean() (one of several step_impute_*() functions)
    • step_date()
    • step_holiday()
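
Here is a minimal sketch of how a couple of these step_*() functions fit inside a recipe; the toy data set and variable names below are placeholders for illustration, not the OULAD variables used later in this lab.

library(tidymodels)

# A tiny placeholder data set: a binary outcome, one numeric predictor,
# and one nominal predictor (not the OULAD data)
toy <- tibble(
    outcome = factor(c("pass", "fail", "pass", "fail")),
    var_a   = c(0.10, 0.32, 0.54, 0.75),
    group   = c("a", "b", "b", "a")
)

rec <- recipe(outcome ~ ., data = toy) %>% 
    step_normalize(all_numeric_predictors()) %>%  # center and scale numeric predictors
    step_dummy(all_nominal_predictors())          # dummy-code nominal predictors

# prep() estimates the steps from the data; bake() applies them
rec %>% prep() %>% bake(new_data = NULL)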

Discussion 2

  • Which metrics for supervised machine learning models (in classification “mode”) are important to interpret? Why?
  • Thinking broadly about your research interest, what would you need to consider before using a supervised machine learning model? Consider not only model metrics but also the data collection process and how the predictions may be used.

Introduction to the other parts of this learning lab

  • Adding another data source from the OULAD, assessments data
  • Interpreting each of the metrics in greater detail
  • Using metric_set()
  • Adding still another variable
  • Stepping back and interpreting the model as a whole
  • Finding another relevant study

Readings:

Baker, R. S., Berning, A. W., Gowda, S. M., Zhang, S., & Hawn, A. (2020). Predicting K-12 dropout. Journal of Education for Students Placed at Risk (JESPAR), 25(1), 28-54.

Baker, R. S., Bosch, N., Hutt, S., Zambrano, A. F., & Bowers, A. J. (2024). On fixing the right problems in predictive analytics: AUC is not the problem. arXiv preprint. https://arxiv.org/pdf/2404.06989

Maestrales, S., Zhai, X., Touitou, I., Baker, Q., Schneider, B., & Krajcik, J. (2021). Using machine learning to score multi-dimensional assessments of chemistry and physics. Journal of Science Education and Technology, 30(2), 239-254.

fin

  • Dr. Joshua Rosenberg (jmrosenberg@utk.edu; https://joshuamrosenberg.com)

General troubleshooting tips for R and RStudio