When we left off the first case study, we saw that our model was pretty accurate. We can and will do better in terms of accuracy. But the accuracy value we found also raises a broader question: how good was our model at making predictions?
While accuracy is easy to understand and interpret, it provides only a limited answer to that question.
In this learning lab and case study, we explore several ways to understand how good a predictive model ours is. We do this through:
Other statistics, such as sensitivity and specificity
Tables, particularly a confusion matrix
Our driving question for this case study, then, is: How good is our model at making predictions?
Like in the first learning lab, we'll first load several packages: {tidyverse}, {tidymodels}, and {janitor}. Make sure these are installed (via install.packages()). Note that if they're already installed, you don't need to install them again.
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidymodels)
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
2. WRANGLE
Like in the code-along for the overview presentation, let's take a look at the data and do some processing of it.
👉 Your Turn ⤵
We'll load the students file together; you'll write code to read the assessments file, which is named "oulad-assessments.csv". Please assign the name assessments to the loaded assessments file.
students <- read_csv("data/oulad-students.csv")
Rows: 32593 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): code_module, code_presentation, gender, region, highest_education, ...
dbl (6): id_student, num_of_prev_attempts, studied_credits, module_presentat...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
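One possible answer for the assessments file (a minimal sketch, assuming it sits in the same data/ folder as the students file):

assessments <- read_csv("data/oulad-assessments.csv")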
Rows: 173912 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): code_module, code_presentation, assessment_type
dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
3. EXPLORE
👉 Your Turn ⤵
In the last learning lab, we used the count() function. Let's practice that again by counting the number of assessments of different types. If you need it, recall that the data dictionary is here. Note what the different types of assessments mean.
assessments %>% count(assessment_type)
# A tibble: 3 × 2
assessment_type n
<chr> <int>
1 CMA 70527
2 Exam 4959
3 TMA 98426
👉 Your Turn ⤵
We'll now use another function from the {tidyverse}, like count(). Specifically, we'll use the distinct() function. This returns the unique (or distinct) values for a specified variable. Learn more about distinct() here. Below, find the distinct assessment IDs.
assessments %>% distinct(id_assessment) # this many unique assessments
We might be interested in how many assessments there are per course. To calculate that, we need to count() several variables at once; when doing this, count() tabulates the number of unique combinations of the variables passed to it.
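For example, a minimal sketch of that idea, counting the combinations of the two course identifier columns:

assessments %>% 
  count(code_module, code_presentation) # tabulate assessment records for each course and semester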
Let's explore the dates assignments were due a bit, using the summarize() function:
assessments %>% 
  summarize(mean_date = mean(date, na.rm = TRUE), # find the mean date for assignments
            median_date = median(date, na.rm = TRUE), # find the median
            sd_date = sd(date, na.rm = TRUE), # find the sd
            min_date = min(date, na.rm = TRUE), # find the min
            max_date = max(date, na.rm = TRUE)) # find the max
What can we take from this? It looks like, on average, the average (mean and median) date assignments were due was around day 130, that is, 130 days after the start of the course. The first assignment seems to have been due 12 days into the course, and the last 261 days after.
Crucially, though, these dates vary by course. Thus, we need to first group the data by course. Let's use a different function this time, quantile(), and calculate the first quantile value. Our reasoning is that we want to be able to act to support students, and if we wait until after the average assignment is due, it might be too late. The first quantile, by contrast, comes approximately one-quarter of the way through the semester, leaving more time to intervene.
assessments %>% 
  group_by(code_module, code_presentation) %>% # first, group by course (module: course; presentation: semester)
  summarize(mean_date = mean(date, na.rm = TRUE),
            median_date = median(date, na.rm = TRUE),
            sd_date = sd(date, na.rm = TRUE),
            min_date = min(date, na.rm = TRUE),
            max_date = max(date, na.rm = TRUE),
            first_quantile = quantile(date, probs = .25, na.rm = TRUE)) # find the first (25%) quantile
`summarise()` has grouped output by 'code_module'. You can override using the
`.groups` argument.
Alright, this is a bit complicated, but we can actually work with this data. Let's use just a portion of the above code, assigning it the name code_module_dates.
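A minimal version of that code might look like this (keeping only the grouping columns and the first-quantile value, named quantile_cutoff_date here to match the filtering step used below):

code_module_dates <- assessments %>% 
  group_by(code_module, code_presentation) %>% # group by course and semester
  summarize(quantile_cutoff_date = quantile(date, probs = .25, na.rm = TRUE)) # the date one-quarter of the way through the assignment due dates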
What have we created? We found the date that is one-quarter of the way through the semester (in terms of the dates assignments are due).
👉 Your Turn ⤵
We can thus use this to group and calculate students' performance on assignments through this point. To do this, we need a join function, left_join() in particular. Please use left_join() to join together assessments and code_module_dates, with assessments being the "left" data frame and code_module_dates the right. Assign the output of the join the name assessments_joined.
assessments_joined <- assessments %>% 
  left_join(code_module_dates) # join the data based on code_module and code_presentation
Joining with `by = join_by(code_module, code_presentation)`
We're almost there! The next few lines filter the assessments data so it only includes assessments before our cutoff date.
assessments_filtered <- assessments_joined %>% 
  filter(date < quantile_cutoff_date) # filter the data so only assignments before the cutoff date are included
Finally, we'll find the average score for each student.
assessments_summarized <- assessments_filtered %>% 
  mutate(weighted_score = score * weight) %>% # create a new variable that accounts for the "weight" (comparable to points) given each assignment
  group_by(id_student) %>% 
  summarize(mean_weighted_score = mean(weighted_score))
As a point of reflection, note how much work we've done before reaching the machine learning steps. Though it can be a challenge, this is typical for both machine learning and more traditional statistical models: much of the work is in the data preparation and wrangling stages.
Let's copy the code below that we used to process the students data (when processing the pass and imd_band variables).
students <- students %>% 
  mutate(pass = ifelse(final_result == "Pass", 1, 0)) %>% # creates a dummy code
  mutate(pass = as.factor(pass)) # makes the variable a factor, helping later steps

students <- students %>% 
  mutate(imd_band = factor(imd_band, levels = c("0-10%",
                                                "10-20%",
                                                "20-30%",
                                                "30-40%",
                                                "40-50%",
                                                "50-60%",
                                                "60-70%",
                                                "70-80%",
                                                "80-90%",
                                                "90-100%"))) %>% # this creates a factor with ordered levels
  mutate(imd_band = as.integer(imd_band)) # this changes the levels into integers based on the order of the factor levels
👉 Your Turn ⤵
Finally, let's join together students and assessments_summarized, assigning the joined data the name students_and_assessments.
students_and_assessments <- students %>% 
  left_join(assessments_summarized)
Big picture, we've added another measure, another feature, that we can use to make predictions: students' performance on assessments one-quarter of the way through the course.
We're now ready to proceed to our machine learning steps!
The problem we will be working on - predicting students who pass, based on data from the first one-quarter of the semester - has an analog in a recent paper by Ryan Baker and colleagues:
In Baker et al. (2020), the authors create a precision-recall graph (recall is also known as sensitivity) - one that demonstrates the trade-off between optimizing these two statistics. Review their paper - especially the results section - to see how they discuss these two statistics.
Baker, R. S., Berning, A. W., Gowda, S. M., Zhang, S., & Hawn, A. (2020). Predicting K-12 dropout. Journal of Education for Students Placed at Risk (JESPAR), 25(1), 28-54.
Please review this paper before proceeding, focusing on how they describe and interpret precision and recall.
4. MODEL
Step 1. Split data
This is identical to what we did in the first learning lab, using the students_and_assessments data. We'll also create a testing data set we'll use later.
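If you need a reminder, a sketch along the lines of the first learning lab (the proportion, seed, and object names here are assumptions):

set.seed(20230712) # hypothetical seed, for reproducibility
train_test_split <- initial_split(students_and_assessments, prop = .80) # hold out 20% of the data for testing
data_train <- training(train_test_split) # training set
data_test <- testing(train_test_split) # testing set we'll use later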
Step 2: Engineer features and write down the recipe
We'll also specify the recipe, adding the mean_weighted_score variable we calculated above as well as the variables we used in LL1 (case study and badge). Note how we dummy code two variables using step_dummy() (described further in the first learning lab).
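A sketch of such a recipe (mean_weighted_score is the new feature; the other predictors here are placeholders for whichever variables you used in LL1, and data_train is the training set from the split sketched above):

my_rec <- recipe(pass ~ mean_weighted_score + disability + gender + imd_band, data = data_train) %>% # predict pass from the new feature plus LL1 variables
  step_dummy(disability) %>% # dummy code the first nominal variable
  step_dummy(gender) # dummy code the second nominal variable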
Step 3: Specify the model and workflow

# specify model
my_mod <- logistic_reg() %>% 
  set_engine("glm") %>% # generalized linear model
  set_mode("classification") # since we are predicting a dichotomous outcome, specify classification; for a number, specify regression

# specify workflow
my_wf <- workflow() %>% # create a workflow
  add_model(my_mod) %>% # add the model we wrote above
  add_recipe(my_rec) # add our recipe we wrote above
Step 4: Fit model
Finally, we'll use the last_fit() function, but we'll add something a little different - we'll add the metric set here. To the metrics we have discussed, we'll add another common one - Cohen's Kappa, which is similar to accuracy but accounts for chance agreement.
To do so, we have to specify which metrics we want to use with the metric_set() function (see all of the available options here). Please specify six metrics in the metric set - accuracy, sensitivity, specificity, ppv (precision), npv, and kap (Cohen's Kappa). Assign the name class_metrics to the output of your use of the metric_set() function.
Then, please use last_fit(), looking to how we did this in the last learning lab for a guide on how to specify the arguments. To those arguments, add metrics = class_metrics.
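A sketch of these two steps (note that the yardstick function for Cohen's Kappa is kap(); train_test_split is carried over from the split sketched above, and final_fit is the name used for the fitted object later on):

class_metrics <- metric_set(accuracy, sensitivity, specificity, ppv, npv, kap) # the six metrics

final_fit <- last_fit(my_wf, train_test_split, metrics = class_metrics) # fit on the training data and evaluate on the testing data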
We're now ready to move on to interpreting the accuracy of our model.
Step 5: Interpret accuracy
A confusion matrix and true and false positives and negatives
Let's start with a simple confusion matrix. The confusion matrix is a 2 x 2 table with values (cells in the table) representing one of four conditions, elaborated below. You'll fill in the last two columns in a few moments.
| Statistic | How to Find | Interpretation | Value | % |
|---|---|---|---|---|
| True positive (TP) | Truth: 1, Prediction: 1 | Proportion of the population that is affected by a condition and correctly tested positive | 2443 | 19.8 |
| True negative (TN) | Truth: 0, Prediction: 0 | Proportion of the population that is not affected by a condition and correctly tested negative | 4536 | 36.8 |
| False positive (FP) | Truth: 0, Prediction: 1 | Proportion of the population that is not affected by a condition and incorrectly tested positive | 2070 | 16.8 |
| False negative (FN) | Truth: 1, Prediction: 0 | Proportion of the population that is affected by a condition and incorrectly tested negative | 3269 | 26.5 |
To create a confusion matrix, run collect_predictions(), which does what it suggests - it gathers together the model's test set predictions. Pass the last_fit object to this function below.
Take a look at the columns. You'll need to provide the predictions (created with collect_predictions()) and then pipe that to conf_mat(), to which you provide the names of a) the known values for the test set and b) the predictions. Some code to get you started is below.
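A sketch to get started (assuming the fitted object is named final_fit, as above, and that the known outcome column is pass):

final_fit %>% 
  collect_predictions() %>% # gather the test set predictions
  conf_mat(truth = pass, estimate = .pred_class) # cross-tabulate the known values and the predicted classes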
You should see a confusion matrix output. If so, nice work! Please fill in the Value and Percentage columns in the table above (with TP, TN, FP, and FN), entering the appropriate values and then converting those into a percentage based on the total number of data points in the test data set. The accuracy can be interpreted as the sum of the true positive and true negative percentages. So, what's the accuracy? Add that below as a percentage.
Accuracy: 56.6%
Other measures of predictive accuracy
Here's where things get interesting: there are other statistics that have different denominators. Recall from the overview presentation that we can slice and dice the confusion matrix to calculate statistics that give us insights into the predictive utility of the model. Based on the above Values for TP, TN, FP, and FN you interpreted a few moments ago, add the Statistic Values for sensitivity, specificity, precision, and negative predictive value below, to three decimal places (a quick arithmetic check appears after the table).
👉 Your Turn ⤵
| Statistic | Equation | Interpretation | Question Answered | Statistic Value |
|---|---|---|---|---|
| Sensitivity (AKA recall) | TP / (TP + FN) | Proportion of those who are affected by a condition and correctly tested positive | Out of all the actual positives, how many did we correctly predict? | .428 |
| Specificity | TN / (FP + TN) | Proportion of those who are not affected by a condition and correctly tested negative | Out of all the actual negatives, how many did we correctly predict? | .687 |
| Precision (AKA Positive Predictive Value) | TP / (TP + FP) | Proportion of those who tested positive who are affected by a condition | Out of all the instances we predicted as positive, how many are actually positive? | .541 |
| Negative Predictive Value | TN / (TN + FN) | Proportion of those who tested negative who are not affected by a condition; the probability that a negative test is correct | Out of all the instances we predicted as negative, how many are actually negative? | .581 |
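To check the arithmetic, here is a quick sketch in R using the counts from the confusion matrix table above:

tp <- 2443 # true positives
tn <- 4536 # true negatives
fp <- 2070 # false positives
fn <- 3269 # false negatives
tp / (tp + fn) # sensitivity (recall): about .428
tn / (fp + tn) # specificity: about .687
tp / (tp + fp) # precision (positive predictive value): about .541
tn / (tn + fn) # negative predictive value: about .581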
So, what does this hard-won output tell us? Let's interpret each statistic carefully in the table below. Please add the value and provide a substantive interpretation. One is provided to get you started.
👉 Your Turn ⤵
| Statistic | Substantive Interpretation |
|---|---|
| Accuracy | |
| Sensitivity (AKA recall) | |
| Specificity | The model correctly predicts about 2/3 of the students who do not pass (as not passing). This is pretty good, but it could be better. |
| Precision (AKA Positive Predictive Value) | |
| Negative Predictive Value | |
This process might suggest to you how a "good" model isn't necessarily one that is the most accurate, but instead is one that has good values for the statistics that matter given our particular question and context.
Recall that Baker and colleagues sought to balance precision and recall (recall is also known as sensitivity). Please briefly discuss how well our model does this: is it better suited to correctly identifying "positive" pass cases (sensitivity) or to correctly identifying "negative" cases, students who do not pass (specificity)?
5. COMMUNICATE
Quickly calculating metrics
After all of this work, we can calculate the above much more easily, given how we set up our metrics with metric_set() earlier. This is especially handy when we want to efficiently communicate the results of our analysis to our audience. Below, use collect_metrics() with the final_fit object, noting that in addition to the four metrics we calculated manually just a few moments ago, we are also provided with the accuracy and Cohen's Kappa values.
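A sketch of that call (again assuming the fitted last_fit object is named final_fit):

final_fit %>% 
  collect_metrics() # returns all six metrics we specified in class_metrics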