Conceptual Overview
Machine learning is increasingly prevalent in our lives and in educational contexts. Its role in educational research and practice is growing, albeit with some challenges and even controversy. These modules are designed to familiarize you with supervised machine learning and its applications in STEM education research. Across the four modules, we’ll explore four key questions, one for each module. By the end, you will have a deep understanding of the key characteristics of supervised machine learning and how to implement supervised machine learning workflows in R and Python.
Prepare: Prior to analysis, we’ll look at the context from which our data came, formulate a basic research question, and get introduced to the {tidymodels} packages for machine learning.
Wrangle: Wrangling data entails the work of cleaning, transforming, and merging data. In Part 2 we focus on importing CSV files and modifying some of our variables.
Explore: We take a quick look at our variables of interest and do some basic “feature engineering” by creating some new variables we think will be predictive of students at risk.
Model: We dive deeper into the five steps in our supervised machine learning process, focusing on the mechanics of making predictions.
Communicate: To wrap up our case study, we’ll create our first “data product” and share our analyses and findings by creating our first web page using R Markdown.
Google’s Teachable Machine!
https://teachablemachine.withgoogle.com/
What’s going on here?
This definition leaves a lot of space for a range of approaches to ML.
Want to detect spam? Get samples of spam messages. Want to forecast stocks? Find the price history. Want to find out user preferences? Parse their activities on Facebook (no, Mark, stop collecting it, enough!) (from ML for Everyone)
In educational research:
The aim is different; the algorithms and methods of estimation are not (or the differences are in degree rather than in kind).
In a linear regression, our aim is to estimate parameters, such as \(\beta_0\) (intercept) and \(\beta_1\) (slope), and to make inferences about them that are not biased by our particular sample.
In an SML approach, we can use the same linear regression model, but with a goal other than making unbiased inferences about the \(\beta\) parameters:
In supervised ML, our goal is to minimize the difference between a known \(y\) and our predictions, \(\hat{y}\).
We can use the same model for an inferential or an SML approach:
\(y = b_0 + b_1 x_1 + \dots + e\)
If we are interested in making inferences about a particular \(b\) (e.g., \(b_1\)), we can use theory and prior research to include particular predictors. This often favors transparent models where we understand how predictors relate to \(y\).
If we are interested in making the best possible predictions, we can potentially add more predictor variables than is common when using traditional (i.e., inferential) models.
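To make the contrast concrete, here is a minimal sketch in plain Python. The data (hours studied and exam scores) and the helper names `fit_simple_regression` and `mean_squared_error` are made up for illustration; they are not part of the course materials. The same fitted line serves both purposes: an inferential reading looks at the slope, while an SML reading looks at how close \(\hat{y}\) is to \(y\) on rows the model did not see.

```python
# Toy data, invented for illustration: hours studied (x) and exam score (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [52.0, 55.0, 61.0, 64.0, 70.0, 73.0]


def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def mean_squared_error(actual, predicted):
    """Average squared difference between known y and predicted y-hat."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Fit on the first four rows and hold out the last two.
b0, b1 = fit_simple_regression(x[:4], y[:4])

# Inferential focus: interpret the slope b1 (change in y per unit of x).
# SML focus: measure prediction error on the held-out rows.
y_hat = [b0 + b1 * xi for xi in x[4:]]
mse = mean_squared_error(y[4:], y_hat)

print(f"intercept={b0:.2f}, slope={b1:.2f}, held-out MSE={mse:.2f}")
```

In an SML workflow, adding predictors or switching to a more complex model is judged by whether the held-out error goes down, not by whether each coefficient is interpretable.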
This predictive goal of SML means that we can do things differently:
The focus on prediction opens the door to highly complex models:
There’s no single “best” way; the right tools depend on your goals:
What is your ultimate aim?
Don’t look for a magic answer! No model perfectly reflects reality. Resist the urge to let the model “tell you the answer” without critical thought.
Understand your tools: Even if prediction is the goal, knowing how your chosen model works (its assumptions, strengths, weaknesses) helps you:
What It Is:
A framework where an agent learns to make decisions by interacting with an environment.
Core Components:
Key Idea:
Learning through trial and error to maximize long-term rewards.
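As a rough illustration of trial-and-error learning, the sketch below uses an epsilon-greedy strategy on a three-action problem. This is a deliberately simplified setting (there are no states, so "long-term reward" is just the cumulative reward), and the function name `run_epsilon_greedy`, the success probabilities, and the parameter values are all assumptions chosen for the example.

```python
import random


def run_epsilon_greedy(success_probs, n_steps=1000, epsilon=0.1, seed=0):
    """Agent repeatedly picks an action, observes a 0/1 reward, and updates
    its running estimate of each action's value."""
    rng = random.Random(seed)
    n_actions = len(success_probs)
    counts = [0] * n_actions        # how often each action was tried
    estimates = [0.0] * n_actions   # running mean reward per action
    total_reward = 0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment returns a reward of 1 with the action's probability.
        reward = 1 if rng.random() < success_probs[action] else 0

        # Incremental update of the running mean for the chosen action.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return estimates, counts, total_reward


# Hypothetical environment: three actions with different success rates.
estimates, counts, total_reward = run_epsilon_greedy([0.2, 0.5, 0.8])
print(estimates, counts, total_reward)
```

The agent is never told which action is best; it only sees rewards, which is what distinguishes this setting from supervised learning with a known outcome.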
Do you have coded data or data with a known outcome – let’s say about K-12 students – and, do you want to:
Supervised methods may be your best bet
Do you not yet have codes/outcomes – and do you want to:
Unsupervised methods may be helpful
Do you want to say something about one or several variables’ relations with an outcome?
Inferential statistics may be best
Some models blend between inferential and SML – we’ll talk about these as we proceed!
One general principle is to start with the simplest useful model and to build toward more complex models as helpful.
This principle applies in multiple ways:
SML Module 1: Foundations
How is prediction different from explanation? This module provides a gentle introduction to supervised machine learning by drawing out similarities to and differences from a regression modeling approach. The case study will involve modeling the graduation rate across thousands of higher education institutions in the United States using data from the Integrated Postsecondary Education Data System (IPEDS).
SML Module 2: Workflows With Training and Testing Data
Building on the foundations from Module 1, this session delves deeper into the workflows we use in an SML approach. In particular, we’ll explore the roles of training and testing data and when to use each in an SML workflow. We’ll predict students’ withdrawal from a course, again using data from the Integrated Postsecondary Education Data System (IPEDS).
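The training/testing idea can be sketched in a few lines of plain Python. The function `train_test_split` below is a hypothetical helper written for this example (not the scikit-learn or {tidymodels} function), and the student records are invented.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Invented records: (student_id, withdrew_from_course)
students = [(i, i % 4 == 0) for i in range(20)]

train, test = train_test_split(students, test_fraction=0.2, seed=42)

# The model is fit using only `train`; `test` is set aside and used once,
# at the end, to estimate how well predictions generalize to new data.
print(len(train), "training rows;", len(test), "testing rows")
```

Fixing the seed makes the split reproducible, so the same rows land in the testing set each time the analysis is rerun.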
SML Module 3: Interpreting SML Metrics
How is the interpretation of SML models different from more familiar models? In this module, we’ll explore and work to understand the confusion matrix and the various metrics (e.g., precision, recall, PPV, NPV, F-score, and AUC) that are used to interpret how well SML models make dichotomous predictions. We’ll again use the OULAD, augmenting the variables we used in Module 1, and we’ll introduce a more complex model, the random forest, as an alternative to the regression models used in previous modules.
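For orientation, here is a small sketch of how several of these metrics fall out of the four cells of a confusion matrix. The labels and predictions are invented, and the helper names are hypothetical. AUC is left out because it requires predicted probabilities rather than hard 0/1 predictions.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels coded as 1 and 0."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # also called positive predictive value (PPV)
    recall = tp / (tp + fn)      # also called sensitivity
    npv = tn / (tn + fn)         # negative predictive value
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "npv": npv,
            "accuracy": accuracy, "f1": f1}


# Invented example: 1 = withdrew, 0 = did not withdraw.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
metrics = classification_metrics(tp, fp, fn, tn)
print((tp, fp, fn, tn), metrics)
```

This sketch assumes each denominator is nonzero; a model that never predicts the positive class, for example, would need that case handled before computing precision.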
SML Module 4: Improving Predictions Through Feature Engineering
How can we improve our predictions? This module introduces the concept of feature engineering to enhance model performance. We’ll explore techniques for creating new variables and refining existing ones to improve prediction accuracy. We also explore cross-validation to revise and refine our model without biasing its predictions. We’ll work with the finest-grained OULAD data—interaction data—to demonstrate key feature engineering steps.
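The two ideas in this module can be sketched together in plain Python. The interaction log, the feature names (`total_clicks`, `active_days`), and both helper functions are assumptions made up for this example; the real OULAD columns may differ.

```python
from collections import defaultdict


def engineer_features(interaction_rows):
    """Collapse row-per-interaction data into one feature row per student."""
    total_clicks = defaultdict(int)
    active_days = defaultdict(set)
    for student_id, day, clicks in interaction_rows:
        total_clicks[student_id] += clicks
        if clicks > 0:
            active_days[student_id].add(day)
    return {
        sid: {"total_clicks": total_clicks[sid],
              "active_days": len(active_days[sid])}
        for sid in total_clicks
    }


def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; every row is validated exactly once."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size


# Invented interaction log: (student_id, day, clicks)
interactions = [("s1", 1, 5), ("s1", 2, 0), ("s1", 3, 12),
                ("s2", 1, 2), ("s2", 4, 3), ("s3", 2, 9)]
features = engineer_features(interactions)

# Cross-validation resamples the training data only, so the testing set
# stays untouched while we compare candidate features or models.
folds = list(k_fold_indices(n_rows=10, k=3))
print(features)
print([len(val) for _, val in folds])
```

In practice the rows would usually be shuffled before folding; contiguous folds are used here only to keep the sketch easy to follow.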
Please see sml-1-readings.qmd
Brooks, C., & Thompson, C. (2017). Predictive modelling in teaching and learning. Handbook of Learning Analytics, 61-68.
Jaquette, O., & Parra, E. E. (2013). Using IPEDS for panel analyses: Core concepts, data challenges, and empirical applications. In Higher Education: Handbook of Theory and Research: Volume 29 (pp. 467-533). Dordrecht: Springer Netherlands.
Zong, C., & Davis, A. (2022). Modeling university retention and graduation rates using IPEDS. Journal of College Student Retention: Research, Theory & Practice. https://journals.sagepub.com/doi/full/10.1177/15210251221074379
Please see sml-1-case-study.qmd
Please see sml-1-badge.qmd
General troubleshooting tips for R and RStudio