Supervised Machine Learning

Machine learning is increasingly prevalent in our lives and in educational contexts. Its role in educational research and practice is growing, albeit with some challenges and even controversy. These modules are designed to familiarize you with supervised machine learning (SML) and its applications in STEM education research. Across the four modules, we’ll explore four key questions, one per module. By the end, you will understand the key characteristics of supervised machine learning and know how to implement supervised machine learning workflows in R and Python.

GitHub
Repository for Instructors
Posit Cloud Workspace for Learners

Module 1: Supervised Machine Learning Foundations

How is prediction different from explanation? This module provides a gentle introduction to supervised machine learning by drawing out similarities to and differences from a regression modeling approach. The case study will involve modeling the graduation rate across thousands of higher education institutions in the United States using data from the Integrated Postsecondary Education Data System (IPEDS).
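As a rough illustration of "same model, different analytic goals", the sketch below fits one simple linear regression and then uses it two ways: reading the slope (explanation) and scoring predictions on rows the model has not seen (prediction). The numbers and variable names are hypothetical and are not drawn from IPEDS; the module's own materials may take a different approach.

```python
import math

# Hypothetical data: one predictor and one outcome (not real IPEDS values).
x_fit = [10, 20, 30, 40, 50]
y_fit = [35, 45, 50, 65, 70]

def fit_line(x, y):
    """Ordinary least squares for a single predictor; returns (slope, intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(x_fit, y_fit)

# Explanation: interpret the coefficient itself.
print(f"A one-unit increase in x is associated with a {slope:.2f} change in y")

# Prediction: judge the model by its error on rows it was not fit to.
x_new = [25, 45]
y_new = [50, 66]
predictions = [intercept + slope * xi for xi in x_new]
rmse = math.sqrt(
    sum((p - yi) ** 2 for p, yi in zip(predictions, y_new)) / len(y_new)
)
print(f"RMSE on new rows: {rmse:.2f}")
```

The fitted model is identical in both uses; only the question asked of it changes.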

Conceptual Overview What is Supervised Machine Learning? (Zoom recording)
Code Along Same Model, Different Analytic Goals
Readings & Reflection Considering Models for Inference and Models for Prediction
Case Study Key Explaining or Predicting Graduation Rates Using IPEDS | Answer Key
Badge Initial Interpretations of a Model’s Predictions
Module Survey Feedback Form after Finishing Module

Module 2: Workflows With Training and Testing Data

Building on the foundations from Module 1, this module delves deeper into the workflows used in an SML approach. In particular, we’ll explore the roles of training and testing data and when to use each in an SML workflow. We’ll predict students’ withdrawal from a course using the Open University Learning Analytics Dataset (OULAD).
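A minimal sketch of the training/testing idea, using only the Python standard library: rows are shuffled reproducibly, a fraction is held out as testing data, and the rest is used for training. The records, the `train_test_split` helper, and the 80/20 proportion are assumptions for illustration, not the module's own code or the OULAD itself.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (training_rows, testing_rows)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                   # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)   # a fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records: (student_id, withdrew)
students = [(i, i % 4 == 0) for i in range(100)]

train, test = train_test_split(students, test_fraction=0.2, seed=42)
print(len(train), "training rows;", len(test), "testing rows")
```

The model would be fit on `train` only, and `test` would be set aside until the end to estimate how well the model predicts for students it has not seen.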

Conceptual Overview Using Training and Testing Data in a Workflow
Code Along How to Split Data into Training and Testing Sets
Readings & Reflection What Makes Supervised Machine Learning Distinct
Case Study Who Drops Online Courses? An Analysis Using the OULAD | Answer Key
Badge Adding Additional Predictors to Improve Accuracy
Module Survey Feedback Form after Finishing Module

Module 3: Interpreting Metrics

How is the interpretation of SML models different from more familiar models? In this module, we’ll work to understand the confusion matrix and the various metrics (e.g., precision, recall, PPV, NPV, F-score, and AUC) used to judge how well SML models make dichotomous predictions. We’ll again use the OULAD, augmenting the variables we used in Module 2, and we’ll introduce a more complex model, the random forest, as an alternative to the regression models used in previous modules.
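To make the confusion matrix concrete, here is a small sketch that counts the four cells for a dichotomous outcome and derives several of the metrics named above. The labels are invented for illustration, and the helper names are hypothetical. Note that PPV is another name for precision, and recall is also called sensitivity. AUC is omitted because it requires predicted probabilities rather than hard class labels.

```python
def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def safe_div(numerator, denominator):
    """Return NaN instead of raising when a metric is undefined."""
    return numerator / denominator if denominator else float("nan")

def classification_metrics(actual, predicted):
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    precision = safe_div(tp, tp + fp)   # also called PPV
    recall = safe_div(tp, tp + fn)      # also called sensitivity
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "npv": safe_div(tn, tn + fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# Hypothetical labels: 1 = withdrew, 0 = did not withdraw
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy example accuracy is 0.8 while precision and recall are both 0.75, which hints at why a single number rarely tells the whole story.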

Conceptual Overview How Good is Our Model, Really?
Code Along Adding Classification Metrics to a Workflow
Readings & Discussion Considering Many Metrics
Case Study Beyond Accuracy: How to Calculate Metrics | Answer Key
Badge Metrics for Continuous Outcome Variables
Module Survey Feedback Form after Finishing Module

Module 4: Improving Predictions Through Feature Engineering

How can we improve our predictions? This module introduces the concept of feature engineering to enhance model performance. We’ll explore techniques for creating new variables and refining existing ones to improve prediction accuracy. We also explore cross-validation to revise and refine our model without biasing its predictions. We’ll work with the finest-grained OULAD data—interaction data—to demonstrate key feature engineering steps.
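The two ideas in this module can be sketched together in plain Python: first collapse row-per-interaction logs into one set of features per student, then divide the students into cross-validation folds so that a model can be refined without touching the testing data. The log rows, column meanings, and helper names below are hypothetical and only loosely modeled on click-log data; they are not the OULAD schema or the module's own code.

```python
import random
from collections import defaultdict

# Hypothetical interaction log rows: (student_id, activity_type, day, clicks)
log = [
    ("s1", "forum", 1, 4),
    ("s1", "quiz", 1, 2),
    ("s1", "forum", 3, 1),
    ("s2", "quiz", 2, 5),
    ("s3", "forum", 1, 3),
    ("s3", "resource", 5, 7),
    ("s4", "quiz", 4, 1),
]

def engineer_features(rows):
    """Aggregate interaction rows into one feature dict per student."""
    total_clicks = defaultdict(int)
    active_days = defaultdict(set)
    for student, _activity, day, clicks in rows:
        total_clicks[student] += clicks
        active_days[student].add(day)
    return {
        s: {"total_clicks": total_clicks[s], "active_days": len(active_days[s])}
        for s in sorted(total_clicks)
    }

def k_fold_splits(items, k=2, seed=42):
    """Return k (training, validation) pairs; each item is validated exactly once."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((training, validation))
    return splits

features = engineer_features(log)
splits = k_fold_splits(list(features), k=2, seed=42)

for training, validation in splits:
    print("fit on", sorted(training), "and check on", sorted(validation))
```

In a fuller workflow these folds would be drawn from the training data only, a model would be fit and scored once per fold, and the scores would be averaged to compare candidate feature sets.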

Conceptual Overview Making Better Predictions with Random Forests
Code Along How to Cross-Validate and Change the Model
Readings & Discussion Diving Deep on Features and Feature Engineering
Case Study Modeling Interactions Data with Random Forest | Answer Key
Badge Feature Engineering Activity Types
Module Survey Feedback Form after Finishing Module

Microcredential

To earn the SML microcredential, you will carry out an SML analysis using data of your choosing and report the results in a Quarto document. Emphasize why you are using SML (relative to regression or another approach) and use feature engineering and cross-validation to refine your model. Lastly, interpret your model carefully: go beyond accuracy and report the specific metrics that best capture the strength of your model’s predictions, given the particular aims of your analysis.

Microcredential Supervised Machine Learning Microcredential