Supervised Machine Learning

Machine learning is increasingly prevalent in our lives, including in educational contexts. Its role in educational research and practice is growing, albeit with some challenges and even controversy. These modules are designed to familiarize you with supervised machine learning (SML) and its applications in STEM education research. Across the four modules, we’ll explore four key questions, one for each module. By the end, you will have a deep understanding of the key characteristics of supervised machine learning and how to implement supervised machine learning workflows in R and Python.

	GitHub	Repository for Instructors
	Posit Cloud	Workspace for Learners

Module 1: Supervised Machine Learning Foundations

How is prediction different from explanation? This module provides a gentle introduction to supervised machine learning by drawing out its similarities to and differences from a regression modeling approach. The case study involves modeling the graduation rate across thousands of higher education institutions in the United States using data from the Integrated Postsecondary Education Data System (IPEDS).

	Conceptual Overview	What is Supervised Machine Learning?
	Code Along	Same Model, Different Analytic Goals
	Readings & Reflection	Considering Models for Inference and Models for Prediction
	Case Study Key	Explaining or Predicting Graduation Rates Using IPEDS | Answer Key
	Badge	Initial Interpretations of a Model’s Predictions
	Module Survey	Feedback Form after Finishing Module

Module 2: Using Workflows With Training and Testing Data

Building on the foundations from Module 1, this module delves deeper into the workflows we use in an SML approach. In particular, we’ll explore the roles of training and testing data and when to use each in an SML workflow. We’ll predict students’ withdrawal from a course using the Open University Learning Analytics Dataset (OULAD).

	Conceptual Overview	Using Training and Testing Data in a Workflow
	Code Along	How to Split Data into Training and Testing Sets
	Readings & Reflection	What Makes Supervised Machine Learning Distinct
	Case Study	Who Drops Online Courses? An Analysis Using the OULAD | Answer Key
	Badge	Adding Additional Predictors to Improve Accuracy
	Module Survey	Feedback Form after Finishing Module
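
The training and testing split described in this module can be sketched in a few lines of plain Python. This is only an illustration, not the code used in the module's Code Along: the function name `split_train_test`, the 25% test fraction, the fixed seed, and the toy `students` records are all assumptions made for the example.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle rows and return (train, test) lists (hypothetical helper)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                   # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records standing in for course enrollments (made-up data).
students = [{"id": i, "withdrew": i % 4 == 0} for i in range(20)]
train, test = split_train_test(students)
print(len(train), len(test))  # 15 5
```

The model is fit only on `train`; `test` is set aside and used once, at the end, to estimate how well the model predicts data it has not seen.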

Module 3: Interpreting Prediction Metrics

How is the interpretation of SML models different from more familiar models? In this module, we’ll explore and work to understand the confusion matrix and the various metrics (e.g., precision, recall, PPV, NPV, F-score, and AUC) that are used to judge how well SML models make dichotomous predictions. We’ll again use the OULAD, augmenting the variables we used in Module 2, and we’ll introduce a more complex model, the random forest, as an alternative to the regression models used in previous modules.

	Conceptual Overview	How Good is Our Model, Really?
	Code Along	Adding Classification Metrics to a Workflow
	Readings & Discussion	Considering Many Metrics
	Case Study	Beyond Accuracy: How to Calculate Metrics | Answer Key
	Badge	Metrics for Continuous Outcome Variables
	Module Survey	Feedback Form after Finishing Module
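
To make the metrics named in this module concrete, here is a minimal sketch that tallies a confusion matrix for dichotomous predictions and derives several metrics from it. The function names and the toy `actual` and `predicted` labels are assumptions for illustration; AUC is left out because it requires predicted probabilities rather than class labels. Note that precision and PPV are the same quantity.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for 0/1 labels, where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1
        elif p == 0 and a == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics; a zero denominator yields 0.0."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),   # also called PPV
        "recall": ratio(tp, tp + fn),      # also called sensitivity
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy labels (made-up data): 1 = withdrew, 0 = did not withdraw.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
metrics = classification_metrics(*counts)
print(counts)
print(metrics)
```

In this toy example accuracy is 0.7 while recall is only 0.5: the model misses half of the students who actually withdrew, which accuracy alone would not reveal.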

Module 4: How Do We Make Our Models Better?

How can we improve our predictions? This module introduces two ways: more sophisticated models and feature engineering. We’ll explore techniques for creating new variables and refining existing ones to improve prediction accuracy. We’ll also explore cross-validation as a way to revise and refine our model without biasing its predictions. We’ll work with the finest-grained OULAD data, the interaction data, to demonstrate key feature engineering steps, and we’ll fine-tune a “boosted” decision tree model to improve our predictions.

	Conceptual Overview	Making Better Predictions with Random Forests
	Code Along	How to Cross-Validate and Change the Model
	Readings & Discussion	Diving Deep on Features and Feature Engineering
	Case Study	Modeling Interactions Data with Random Forest | Answer Key
	Badge	Feature Engineering Activity Types
	Module Survey	Feedback Form after Finishing Module
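
The idea behind cross-validation can be sketched as follows: the training data are divided into k folds, and each fold takes one turn as the held-out set while the model is fit on the rest. This is a simplified illustration under stated assumptions, not the module's own code: the helper names are hypothetical, the labels are made up, and a majority-class baseline stands in for a real model such as a random forest or boosted tree.

```python
import random


def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for i, held_out in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out


def majority_class(labels):
    """Stand-in 'model': always predict the most common training label."""
    return int(sum(labels) * 2 >= len(labels))


def cross_validated_accuracy(labels, k=5, seed=42):
    """Average held-out accuracy of the baseline across k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k, seed):
        prediction = majority_class([labels[i] for i in train_idx])
        correct = sum(labels[i] == prediction for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Toy outcome labels (made-up data): 14 non-withdrawals and 6 withdrawals.
labels = [0] * 14 + [1] * 6
mean_accuracy = cross_validated_accuracy(labels)
print(mean_accuracy)
```

Because every row is held out exactly once, the averaged score uses only the training data, so the separate testing set stays untouched while the model and its features are revised.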

Microcredential

To earn the SML microcredential, you will carry out an SML analysis using data of your choosing and report the results in a Quarto document. Emphasize why you are using SML (relative to regression or another approach), and use feature engineering and cross-validation to refine your model. Lastly, interpret your model carefully: go beyond accuracy and report the specific metrics that best convey the strength of your model’s predictions, given the particular aims of your analysis.

	Microcredential	Supervised Machine Learning Microcredential