Conceptual Overview
Machine learning is increasingly prevalent in our lives—and in educational contexts. Its role in educational research and practice is growing, albeit with some challenges and even controversy. These modules are designed to familiarize you with supervised machine learning and its applications in STEM education research. Throughout, we'll explore four key questions, one corresponding to the focus of each of the four modules. By the end, you will have a deep understanding of the key characteristics of supervised machine learning and how to implement supervised machine learning workflows in R and Python.
Even GPT-4 builds upon supervised machine learning models–namely, deep neural networks, the likes of which can be estimated within R; the big advance has to do with the transformer architecture, which allows more complex models to be fit. It is still important (maybe necessary) to learn about the foundational techniques and methodologies underlying models like GPT-4.
This definition leaves a lot of space for a range of approaches to ML
Want to detect spam? Get samples of spam messages. Want to forecast stocks? Find the price history. Want to find out user preferences? Parse their activities on Facebook (no, Mark, stop collecting it, enough!) (from ML for Everyone)
In educational research:
The aim is different; the algorithms and methods of estimation are not (or, they differ in degree rather than in kind).
In a linear regression, our aim is to estimate parameters, such as \(\beta_0\) (intercept) and \(\beta_1\) (slope), and to make inferences about them that are not biased by our particular sample.
In an ML approach, we can use the same linear regression model, but with a goal other than making unbiased inferences about the \(\beta\) parameters:
In supervised ML, our goal is to minimize the difference between a known \(y\) and our predictions, \(\hat{y}\).
We can use the same model for an inferential or a predictive approach:
\(y = b_0 + b_1 x_1 + \ldots + b_k x_k + e\)
If we are interested in making inferences about a particular \(b\) (e.g., \(b_1\)), we can use theory and prior research to include particular predictors
If we are interested in making the best possible predictions, we can add a bunch of predictors and see how much we can minimize the prediction error (see the sketch below)
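To make the contrast concrete, here is a minimal sketch in base R using the built-in mtcars data (our own illustration, not part of the labs): the very same lm() model serves an inferential goal when we interpret its coefficients, and a predictive goal when we hold out data and score the predictions against the known \(y\).

```r
# Same linear model, two different goals (built-in mtcars data)

# Inferential use: estimate and interpret the coefficients
inferential_fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(inferential_fit)  # estimates, standard errors, p-values

# Predictive use: minimize the difference between y and y-hat on held-out data
set.seed(123)
train_rows <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

predictive_fit <- lm(mpg ~ wt + hp, data = train)
y_hat <- predict(predictive_fit, newdata = test)

# Root mean squared error: how far off are the predictions, on average?
sqrt(mean((test$mpg - y_hat)^2))
```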
This predictive goal means that we can do things differently:
Do you have coded data or data with a known outcome – let's say about K-12 students – and do you want to predict that code or outcome for new, not-yet-coded data?
Supervised methods may be your best bet
Do you not yet have codes/outcomes – and do you want to explore what codes or groupings might emerge from the data?
Unsupervised methods may be helpful
Do you want to say something about one or several variables’ relations with an outcome?
Traditional, inferential statistics (i.e., regression) may be best; a toy contrast of all three approaches appears below
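Here is a toy illustration of those three options in R, using built-in datasets (our own sketch, not drawn from the labs themselves):

```r
# Supervised: predict a known, labeled outcome (transmission type, am)
supervised_fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)
predicted <- as.integer(predict(supervised_fit, type = "response") > 0.5)
mean(predicted == mtcars$am)  # proportion classified correctly

# Unsupervised: no labels; look for structure (clusters in iris measurements)
set.seed(123)
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster)

# Inferential: interpret a predictor's relation to the outcome
inferential_fit <- lm(mpg ~ wt, data = mtcars)
summary(inferential_fit)$coefficients
```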
One general principle is to start with the simplest useful model and to build toward more complex models as they prove helpful.
This principle applies in multiple ways.
This isn't just advice for beginners or science education researchers; most spam filters use Support Vector Machines (and used Naive Bayes until recently) due to their combination of effectiveness and efficiency "in production."
[1] not always/often used, for reasons we’ll discuss later
SML Module 1: Foundations
How is prediction different from explanation? This lab provides a gentle introduction to supervised machine learning by drawing out its similarities to and differences from a regression modeling approach. The case study involves modeling the graduation rate across thousands of higher education institutions in the United States using data from the Integrated Postsecondary Education Data System (IPEDS).
SML Module 2: Workflows With Training and Testing Data
Building on the foundations from Lab 1, this session delves deeper into the workflows we use when taking an SML approach. In particular, we'll explore the roles of training and testing data and when to use each in an SML workflow. We'll predict students' withdrawal from a course using the Open University Learning Analytics Dataset (OULAD).
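Here is a hedged preview of such a workflow using the tidymodels packages, with a small simulated stand-in for the OULAD (the variable names below are hypothetical, not the dataset's actual columns):

```r
library(tidymodels)

# Simulated stand-in for the OULAD; all variable names are hypothetical
set.seed(2024)
n <- 500
students <- tibble(
  withdrew       = factor(rbinom(n, 1, 0.3), labels = c("no", "yes")),
  n_clicks       = rpois(n, 40),
  prior_attempts = rbinom(n, 3, 0.2)
)

# Step 1: split the data into training and testing sets
split <- initial_split(students, prop = 0.8, strata = withdrew)
train <- training(split)
test  <- testing(split)

# Step 2: fit a model using the training data only
fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(withdrew ~ n_clicks + prior_attempts, data = train)

# Step 3: evaluate predictions on the held-out testing data
preds <- predict(fit, new_data = test) |> bind_cols(test)
accuracy(preds, truth = withdrew, estimate = .pred_class)
```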
SML Module 3: Interpreting SML Metrics
How is the interpretation of SML models different from that of more familiar models? In this lab, we'll explore and work to understand the confusion matrix and the various metrics derived from it (e.g., precision, recall, PPV, NPV, F-score, and AUC) that are used to judge how good SML models are at making dichotomous predictions. We'll again use the OULAD, augmenting the variables we used in Lab 2, and we'll introduce a more complex model, the random forest, as an alternative to the regression models used in previous labs.
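As a preview, several of these metrics can be computed by hand from the four cells of a confusion matrix. The counts below are made up purely for illustration (AUC is omitted because it requires predicted probabilities across many thresholds, not a single matrix):

```r
# Made-up counts from a hypothetical dichotomous prediction task
tp <- 40  # predicted "withdrew" and the student truly withdrew
fp <- 10  # predicted "withdrew" but the student did not
fn <- 15  # predicted "did not withdraw" but the student withdrew
tn <- 85  # predicted "did not withdraw" and the student did not

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)                  # also called PPV
recall    <- tp / (tp + fn)                  # also called sensitivity
npv       <- tn / (tn + fn)                  # negative predictive value
f_score   <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, npv = npv, f_score = f_score), 3)
```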
SML Module 4: Improving Predictions Through Feature Engineering
How can we improve our predictions? This lab introduces the concept of feature engineering to enhance model performance. We'll explore techniques for creating new variables and refining existing ones to improve prediction accuracy, and we'll explore cross-validation as a way to revise and refine our model without biasing its predictions. We'll work with the finest-grained OULAD data—interaction data—to demonstrate key feature engineering steps.
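To give a flavor of feature engineering with interaction data, here is a hedged sketch in R: we aggregate a simulated click log (hypothetical columns, not the OULAD's actual schema) into student-level features, then set up cross-validation folds with rsample (part of tidymodels):

```r
library(dplyr)

# Simulated interaction (click) log; column names are hypothetical
set.seed(2024)
click_log <- data.frame(
  student_id = sample(1:50, 2000, replace = TRUE),
  day        = sample(1:60, 2000, replace = TRUE),
  clicks     = rpois(2000, 3)
)

# Feature engineering: roll fine-grained events up to one row per student
features <- click_log |>
  group_by(student_id) |>
  summarise(
    total_clicks          = sum(clicks),
    active_days           = n_distinct(day),
    clicks_per_active_day = total_clicks / active_days
  )

head(features)

# Cross-validation: resample the (training) data to refine a model
# without ever touching the held-out testing set
folds <- rsample::vfold_cv(features, v = 10)
folds
```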
Please see sml-1-readings.qmd
Brooks, C., & Thompson, C. (2017). Predictive modelling in teaching and learning. Handbook of Learning Analytics, 61-68.
Jaquette, O., & Parra, E. E. (2013). Using IPEDS for panel analyses: Core concepts, data challenges, and empirical applications. In Higher Education: Handbook of Theory and Research: Volume 29 (pp. 467-533). Dordrecht: Springer Netherlands.
Zong, C., & Davis, A. (2022). Modeling university retention and graduation rates using IPEDS. Journal of College Student Retention: Research, Theory & Practice. https://journals.sagepub.com/doi/full/10.1177/15210251221074379
Please see sml-1-case-study.qmd
Please see sml-1-badge.qmd