Supervised Machine Learning
Machine learning is increasingly prevalent in our lives, including in educational contexts. Its role in educational research and practice is growing, albeit with some challenges and even controversy. These modules are designed to familiarize you with supervised machine learning (SML) and its applications in STEM education research. Across the modules, we'll explore four key questions, one corresponding to the focus of each module. By the end, you will have a deep understanding of the key characteristics of supervised machine learning and how to implement supervised machine learning workflows in R and Python.
| Resource | Description |
|:---------|:------------|
| GitHub | Repository for Instructors |
| Posit Cloud | Workspace for Learners |
Module 1: Supervised Machine Learning Foundations
How is prediction different from explanation? This module provides a gentle introduction to supervised machine learning by drawing out its similarities to and differences from a regression modeling approach. The case study will involve modeling the graduation rate across thousands of higher education institutions in the United States using data from the Integrated Postsecondary Education Data System (IPEDS).
| Resource | Title |
|:---------|:------|
| Conceptual Overview | What is Supervised Machine Learning? (Zoom recording) |
| Code Along | Same Model, Different Analytic Goals |
| Readings & Reflection | Considering Models for Inference and Models for Prediction |
| Case Study | Explaining or Predicting Graduation Rates Using IPEDS (Answer Key) |
| Badge | Initial Interpretations of a Model's Predictions |
| Module Survey | Feedback Form after Finishing Module |
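To make the "same model, different analytic goals" idea concrete before the case study, here is a minimal R sketch that fits one linear model to simulated, IPEDS-like data and then uses it in two ways: inspecting its coefficients (explanation) and generating and scoring its predictions (prediction). The data frame `ipeds_sim` and its columns (`grad_rate`, `selectivity`, `pct_pell`) are invented for illustration and are not the case study's actual variables.

```r
# A minimal sketch with simulated data; `ipeds_sim` and its columns are
# hypothetical stand-ins, not the real IPEDS variables used in the case study.
library(tidyverse)

set.seed(2025)
ipeds_sim <- tibble(
  selectivity = runif(500),
  pct_pell    = runif(500),
  grad_rate   = 40 + 30 * selectivity - 15 * pct_pell + rnorm(500, sd = 8)
)

# Goal 1: explanation -- fit a linear model and interpret its coefficients
m <- lm(grad_rate ~ selectivity + pct_pell, data = ipeds_sim)
summary(m)  # estimates, standard errors, p-values

# Goal 2: prediction -- the very same model, now judged by how well it predicts
preds <- predict(m, newdata = ipeds_sim)
sqrt(mean((ipeds_sim$grad_rate - preds)^2))  # RMSE; predicting on the data used
                                             # to fit is optimistic (see Module 2)
```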
Module 2: Workflows With Training and Testing Data
Building on the foundations from Module 1, this module delves deeper into the workflows we use when taking an SML approach. In particular, we'll explore the roles of training and testing data and when to use each in an SML workflow. We'll predict students' withdrawal from a course using the Open University Learning Analytics Dataset (OULAD).
| Resource | Title |
|:---------|:------|
| Conceptual Overview | Using Training and Testing Data in a Workflow |
| Code Along | How to Split Data into Training and Testing Sets |
| Readings & Reflection | What Makes Supervised Machine Learning Distinct |
| Case Study | Who Drops Online Courses? An Analysis Using the OULAD (Answer Key) |
| Badge | Adding Additional Predictors to Improve Accuracy |
| Module Survey | Feedback Form after Finishing Module |
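As a rough illustration of the splitting step described above, the sketch below uses the tidymodels packages (one possible toolkit; the modules also cover Python) to hold out a testing set, fit a logistic regression on the training data only, and evaluate it on the held-out rows. The simulated data frame `oulad_sim`, its columns, and the 80/20 split proportion are placeholder choices, not the actual OULAD variables or the case study's settings.

```r
# Minimal sketch with simulated, OULAD-like data; all names here are placeholders.
library(tidymodels)

set.seed(2025)
oulad_sim <- tibble(
  clicks    = rpois(400, lambda = 30),
  num_prev  = rpois(400, lambda = 1),
  withdrawn = factor(if_else(clicks + rnorm(400, sd = 10) < 25, "yes", "no"),
                     levels = c("yes", "no"))
)

# Hold out 20% of rows as a testing set, stratified on the outcome
oulad_split <- initial_split(oulad_sim, prop = 0.80, strata = withdrawn)
oulad_train <- training(oulad_split)
oulad_test  <- testing(oulad_split)

# Fit a logistic regression using only the training data
wf <- workflow() |>
  add_formula(withdrawn ~ clicks + num_prev) |>
  add_model(logistic_reg() |> set_engine("glm"))

fit_train <- fit(wf, data = oulad_train)

# Evaluate on the held-out testing data
bind_cols(oulad_test, predict(fit_train, new_data = oulad_test)) |>
  accuracy(truth = withdrawn, estimate = .pred_class)
```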
Module 3: Interpreting Metrics
How is the interpretation of SML models different from that of more familiar models? In this module, we'll work to understand the confusion matrix and the various metrics (e.g., precision, recall, PPV, NPV, F-score, and AUC) used to judge how good SML models are at making dichotomous predictions. We'll again use the OULAD, augmenting the variables we used in Module 2, and we'll introduce a more complex model, the random forest, as an alternative to the regression models used in previous modules.
| Resource | Title |
|:---------|:------|
| Conceptual Overview | How Good is Our Model, Really? |
| Code Along | Adding Classification Metrics to a Workflow |
| Readings & Discussion | Considering Many Metrics |
| Case Study | Beyond Accuracy: How to Calculate Metrics (Answer Key) |
| Badge | Metrics for Continuous Outcome Variables |
| Module Survey | Feedback Form after Finishing Module |
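The sketch below shows, with made-up predictions, how the confusion matrix and the metrics named above can be computed with yardstick (part of tidymodels). In the case study, the `truth`, `.pred_class`, and `.pred_yes` columns would come from a fitted model's predictions on the testing set rather than being simulated as they are here.

```r
# Minimal sketch: simulate a table of true classes, predicted probabilities,
# and predicted classes, then compute a confusion matrix and common metrics.
library(tidymodels)

set.seed(2025)
results <- tibble(
  truth       = factor(sample(c("yes", "no"), 200, replace = TRUE),
                       levels = c("yes", "no")),
  .pred_yes   = runif(200),
  .pred_class = factor(if_else(.pred_yes > 0.5, "yes", "no"),
                       levels = c("yes", "no"))
)

# The confusion matrix tabulates correct and incorrect predictions by class
conf_mat(results, truth = truth, estimate = .pred_class)

# Several class-based metrics computed from the same two columns
class_metrics <- metric_set(accuracy, precision, recall, ppv, npv, f_meas)
class_metrics(results, truth = truth, estimate = .pred_class)

# AUC instead uses the predicted probability of the event ("yes") class
roc_auc(results, truth = truth, .pred_yes)
```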
Module 4: Improving Predictions Through Feature Engineering
How can we improve our predictions? This module introduces feature engineering as a way to enhance model performance. We'll explore techniques for creating new variables and refining existing ones to improve prediction accuracy, and we'll use cross-validation to revise and refine our model without biasing its predictions. We'll work with the finest-grained OULAD data, interaction data, to demonstrate key feature engineering steps.
| Resource | Title |
|:---------|:------|
| Conceptual Overview | Making Better Predictions with Random Forests |
| Code Along | How to Cross-Validate and Change the Model |
| Readings & Discussion | Diving Deep on Features and Feature Engineering |
| Case Study | Modeling Interactions Data with Random Forest (Answer Key) |
| Badge | Feature Engineering Activity Types |
| Module Survey | Feedback Form after Finishing Module |
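As a hedged sketch of how feature engineering, cross-validation, and a random forest might fit together in a tidymodels workflow, the code below builds a small recipe with an engineered feature, specifies a random forest, and estimates performance with 5-fold cross-validation. The simulated data and column names are placeholders, the `ranger` engine must be installed, and in a complete workflow the cross-validation would happen within the training set rather than on the full data.

```r
# Minimal sketch with simulated, interaction-like data; names are placeholders.
library(tidymodels)

set.seed(2025)
interactions_sim <- tibble(
  clicks_forum  = rpois(400, lambda = 20),
  clicks_quiz   = rpois(400, lambda = 10),
  activity_type = factor(sample(c("forum", "quiz", "resource"), 400, replace = TRUE)),
  withdrawn     = factor(sample(c("yes", "no"), 400, replace = TRUE),
                         levels = c("yes", "no"))
)

# Feature engineering: create a new feature and encode the categorical predictor
rec <- recipe(withdrawn ~ ., data = interactions_sim) |>
  step_mutate(clicks_total = clicks_forum + clicks_quiz) |>  # engineered feature
  step_dummy(all_nominal_predictors())                       # dummy-code factors

# A random forest model via the ranger engine, in place of logistic regression
rf_mod <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(rf_mod)

# 5-fold cross-validation to estimate performance without touching a test set
folds <- vfold_cv(interactions_sim, v = 5, strata = withdrawn)
fit_resamples(wf, resamples = folds) |>
  collect_metrics()
```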
Microcredential
To earn the SML microcredential, you will carry out an SML analysis using data of your choosing and report the results in a Quarto document. Emphasize why you are using SML (relative to regression or another approach), and use feature engineering and cross-validation to refine your model. Lastly, interpret your model carefully: go beyond reporting accuracy and report the metrics that best capture the strength of your model's predictions, given the particular aims of your analysis.
| Resource | Title |
|:---------|:------|
| Microcredential | Supervised Machine Learning Microcredential |