What is Supervised Machine Learning?

Conceptual Overview

Agenda

  • Purpose and Agenda
  • Discussion 1
  • Key Concept: Machine Learning
  • Discussion 2
  • Modules Overview
  • Introduction to the other parts of this module

Purpose and Agenda

Purpose (All Modules!)

Machine learning is increasingly prevalent in our lives—and in educational contexts. Its role in educational research and practice is growing, albeit with some challenges and even controversy. These modules are designed to familiarize you with supervised machine learning and its applications in STEM education research. Throughout these modules, we’ll explore four key questions, each corresponding to the focus of one of the four modules. By the end, you will have a deep understanding of the key characteristics of supervised machine learning and how to implement supervised machine learning workflows in R and Python.

Discussion 1

  • Explain why you are interested in machine learning.
  • Is there a specific use case of machine learning in which you are especially interested? Consider the data and purpose.

Key Concept #1

Machine learning

Generative AI tools (e.g., GPT-4) build upon supervised machine learning models, namely deep neural networks, the likes of which can be estimated within R; the big advance has to do with the transformer architecture, which allows for fitting far more complex models. It is still important (maybe necessary) to learn about the foundational techniques and methodologies underlying tools like GPT-4.

Defining ML

  • Artificial Intelligence (AI) (e.g., GPT-4): Simulating human intelligence through the use of computers
  • Machine learning (ML): A subset of AI focused on how computers acquire new information/knowledge

This definition leaves a lot of space for a range of approaches to ML

Supervised ML

  • Requires coded data or data with a known outcome
  • Uses coded/outcome data to train an algorithm
  • Uses that algorithm to predict the codes/outcomes for new data (data not used during the training)
  • Can take the form of a classification (predicting a dichotomous or categorical outcome) or a regression (predicting a continuous outcome)
  • Algorithms include linear and logistic regression, k-nearest neighbors, decision trees, and random forests (see the sketch below)
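
To make this concrete, here is a minimal sketch in R of the workflow just described, using simulated data (the variable names are hypothetical): we train a classification algorithm on coded data, then use it to predict codes for new, unseen data.

```r
# Simulated example: predict a dichotomous code from two (hypothetical) predictors
set.seed(123)

train_data <- data.frame(
  minutes_active = rnorm(100, mean = 60, sd = 15),
  posts          = rpois(100, lambda = 5)
)
# A known, coded outcome (e.g., whether a student passed)
train_data$passed <- rbinom(100, 1, plogis(-6 + 0.1 * train_data$minutes_active))

# Train a classification algorithm (here, logistic regression) on the coded data
fit <- glm(passed ~ minutes_active + posts, data = train_data, family = binomial)

# Predict codes for new data not used during training
new_data <- data.frame(minutes_active = c(45, 80), posts = c(3, 7))
predict(fit, newdata = new_data, type = "response")  # predicted probabilities
```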

What kind of coded data?

Want to detect spam? Get samples of spam messages. Want to forecast stocks? Find the price history. Want to find out user preferences? Parse their activities on Facebook (no, Mark, stop collecting it, enough!) (from ML for Everyone)

In educational research:

  • Assessment data
  • Data from log files (“trace data”)
  • Open-ended written responses
  • Achievement data (i.e., end-of-course grades)

How is this different from regression?

The aim is different; the algorithms and methods of estimation are not (or, the differences are in degree rather than in kind).

In a linear regression, our aim is to estimate parameters, such as \(\beta_0\) (intercept) and \(\beta_1\) (slope), and to make inferences about them that are not biased by our particular sample.

In an ML approach, we can use the same linear regression model, but with a goal other than making unbiased inferences about the \(\beta\) parameters:

In supervised ML, our goal is to minimize the difference between a known \(y\) and our predictions, \(\hat{y}\).

It’s the same!

We can use the same model for an inferential or a predictive approach:

\(y = b_0 + b_1x_1 + \ldots + b_kx_k + e\)

If we are interested in making inferences about a particular \(b\) (e.g., \(b_1\)), we can use theory and prior research to include particular predictors

If we are interested in making the best possible predictions, we can add a bunch of predictors and see how much we can minimize the prediction error
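
As a sketch of this point (using simulated data), the very same lm() fit can serve either goal; what differs is what we do with it:

```r
set.seed(42)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- 1 + 2 * d$x1 + rnorm(200)

fit <- lm(y ~ x1 + x2, data = d)

# Inferential mode: examine and interpret the estimated b parameters
summary(fit)$coefficients

# Predictive mode: compare the known y to our predictions, y-hat
y_hat <- predict(fit)
sqrt(mean((d$y - y_hat)^2))  # root mean squared error
```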

But different!

This predictive goal means that we can do things differently:

  • Multicollinearity is not an issue because we do not care to make inferences about parameters
  • Because interpreting specific parameters is less of an interest, we can use a great deal more predictors
  • We focus on how accurately a trained model can predict the values in test data (though we’ll use only one data set in this module)
  • We can make our models very complex - we can choose different models!

Okay, really complex

  • Neural/deep networks
    • e.g., GPT-3 (175B parameters), GPT-4 (reportedly more than 1T parameters)
  • And, some models can take a different form than familiar regressions:
    • k-nearest neighbors
    • Decision trees (and their extensions of bagged and random forests)
  • Last, the modeling process can look different:
    • Ensemble models that combine or improve on (“boosting”) the predictions of individual models

Unsupervised ML

  • Does not require coded data; one way to think about unsupervised ML is that its purpose is to discover codes/labels
  • Can be used in an exploratory mode (see Nelson, 2020)
  • Warning: The results of unsupervised ML cannot directly be used to provide codes/outcomes for supervised ML techniques
  • Algorithms include clustering methods (e.g., k-means) and dimension-reduction techniques (e.g., principal component analysis); see the sketch below
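
As one small illustration (simulated data; k-means is just one of many possible choices), note that no coded outcome appears anywhere in the call:

```r
set.seed(7)
# No coded outcome here -- just two observed (hypothetical) variables
d <- data.frame(
  engagement  = c(rnorm(50, 2), rnorm(50, 6)),
  achievement = c(rnorm(50, 1), rnorm(50, 5))
)

# Discover (rather than predict) groups in the data
clusters <- kmeans(d, centers = 2)
table(clusters$cluster)  # how many observations fall in each discovered group
```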

Which technique should I choose?

Do you have coded data or data with a known outcome – let’s say about K-12 students – and, do you want to:

  • Predict how other students with similar data (but without a known outcome) perform?
  • Scale coding that you have done for a sample of data to a larger sample?
  • Provide timely or instantaneous feedback, like in many learning analytics systems?

Supervised methods may be your best bet

Which technique should I choose?

Do you not yet have codes/outcomes – and do you want to?

  • Achieve a starting point for qualitative coding, perhaps in a “computational grounded theory” mode?
  • Discover groups or patterns in your data that may be of interest?
  • Reduce the number of variables in your dataset to a smaller, but perhaps nearly as explanatory/predictive, set of variables?

Unsupervised methods may be helpful

Which technique should I choose?

Do you want to say something about one or several variables’ relations with an outcome?

  • See how one or more variables relate to an outcome
  • Understand whether a key variable has a statistically significant coefficient in terms of its relation with an outcome of interest

Traditional, inferential statistics (e.g., regression) may be best

How do I select a model?

One general principle is to start with the simplest useful model and to build toward more complex models as helpful.

This principle applies in multiple ways:

  • To choose an algorithm, start with simpler models that you can efficiently use and understand
  • To carry out feature engineering, understand your predictors well by starting with a subset
  • To tune an algorithm, start with a relatively simple set of tuning parameters

This isn’t just for beginners or science education researchers; most spam filters use Support Vector Machines (and used Naive Bayes until recently) due to their combination of effectiveness and efficiency “in production.”

What’s stopping me from specifying a complex model?

  • Nothing much, apart from computing power, time, and concerns about over-fitting
  • A “check” on your work is your predictions on test set data (sketched below)
    • Train data: Coded/outcome data that you use to train (“estimate”) your model
    • Validation data [1]: Data you use to select a particular algorithm
    • Test (“hold-out”) data: Data that you do not use in any way to train your model
  • An important way to achieve good performance with test data is to balance between the inherent bias in your algorithm and the variance in the predictions of your algorithm; this is referred to as the bias-variance trade-off of all models

[1] not always/often used, for reasons we’ll discuss later
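
A minimal sketch of the train/test distinction, using simulated data and base R only: the model is estimated on the training set, and the honest check on our work is its performance on the held-out test set.

```r
set.seed(2024)
d <- data.frame(x = rnorm(500))
d$y <- 3 * d$x + rnorm(500)

# Hold out 20% of the rows as test ("hold-out") data
test_idx <- sample(nrow(d), size = 0.2 * nrow(d))
train <- d[-test_idx, ]
test  <- d[test_idx, ]

# Train on the training data only
fit <- lm(y ~ x, data = train)

# Evaluate on data the model never saw
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(train$y, predict(fit))                 # training error
rmse(test$y, predict(fit, newdata = test))  # test error -- the real check
```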

Illustrating the bias-variance tradeoff

[Figure: four panels. Top row: a strongly biased algorithm (a linear model) and a much less-biased algorithm (a GAM/spline), each fit to the same data. Bottom row: the same two algorithms fit to slightly different data. The linear model shows strong bias but low variance across samples; the GAM/spline shows low bias but very high variance.]
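
The pattern in the figure can be reproduced with a short simulation (a sketch under simple assumptions, not the original figure’s code): across slightly different samples, the linear model’s fits barely move, while a very flexible spline’s fits swing widely.

```r
set.seed(1)
x <- seq(0, 10, length.out = 100)

par(mfrow = c(1, 2))
plot(x, sin(x), type = "n", ylim = c(-2, 2), ylab = "y",
     main = "Linear model: high bias, low variance")
for (i in 1:5) {
  y <- sin(x) + rnorm(100, sd = 0.5)  # a slightly different sample each time
  abline(lm(y ~ x), col = "grey40")   # fitted lines barely move across samples
}
plot(x, sin(x), type = "n", ylim = c(-2, 2), ylab = "y",
     main = "Spline: low bias, high variance")
for (i in 1:5) {
  y <- sin(x) + rnorm(100, sd = 0.5)
  lines(smooth.spline(x, y, df = 25), col = "grey40")  # very flexible fits vary a lot
}
```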

The bias-variance tradeoff

Bias

  • Definition: Difference between our known codes/outcomes and our predicted codes/outcomes; difference between \(y\) and \(\hat{y}\)
  • How (in)correct our models’ (algorithms’) predictions are
  • Models with high bias can fail to capture important relationships—they can be under-fit to our data
  • In short, how well our model reflects the patterns in the data

Variance

  • Definition: Using a different sample of data, the difference in \(\hat{y}\) values
  • How sensitive our predictions are to the specific sample on which we trained the model
  • Models with high variance can fail to predict different data well
  • In short, how stable the predictions of our model are

Regardless of model, we often wish to balance between bias and variance—to balance between under- and over-fitting a model to our data

Discussion 2

  • When might you use SML, in general?
  • Think back to the use case you mentioned earlier. How does (or could) SML be an appropriate technique to use?

Modules Overview

SML Module 1: Foundations

How is prediction different from explanation? This lab provides a gentle introduction to supervised machine learning by drawing out similarities to and differences from a regression modeling approach. The case study will involve modeling the graduation rate across thousands of higher education institutions in the United States using data from the Integrated Postsecondary Education Data System (IPEDS).

SML Module 2: Workflows With Training and Testing Data

Building on the foundations from Lab 1, this session delves deeper into the workflows we will use when we are using a SML approach. Particularly, we’ll explore the roles of training and testing data and when to use them in a SML workflow. We’ll predict students’ withdrawal from a course using the Open University Learning Analytics Dataset (OULAD).

SML Module 3: Interpreting SML Metrics

How is the interpretation of SML models different from that of more familiar models? In this lab, we’ll explore and work to understand the confusion matrix and the various metrics derived from it (e.g., precision, recall, PPV, NPV, F-score, and AUC) that are used to interpret how good SML models are at making dichotomous predictions. We’ll again use the OULAD, augmenting the variables we used in Lab 2, and we’ll introduce a more complex model—the random forest model—as an alternative to the regression models used in previous labs.

SML Module 4: Improving Predictions Through Feature Engineering

How can we improve our predictions? This lab introduces the concept of feature engineering to enhance model performance. We’ll explore techniques for creating new variables and refining existing ones to improve prediction accuracy. We also explore cross-validation to revise and refine our model without biasing its predictions. We’ll work with the finest-grained OULAD data—interaction data—to demonstrate key feature engineering steps.

Other parts of this module

  • Uses built-in data (from the General Social Survey)
  • Compares the use of the same logistic regression model in a regression and SML “mode”
  • Introduces the modeling cycle and process

Please see sml-1-readings.qmd

Brooks, C., & Thompson, C. (2017). Predictive modelling in teaching and learning. Handbook of Learning Analytics, 61-68.

Jaquette, O., & Parra, E. E. (2013). Using IPEDS for panel analyses: Core concepts, data challenges, and empirical applications. In Higher Education: Handbook of Theory and Research: Volume 29 (pp. 467-533). Dordrecht: Springer Netherlands.

Zong, C., & Davis, A. (2022). Modeling university retention and graduation rates using IPEDS. Journal of College Student Retention: Research, Theory & Practice. https://journals.sagepub.com/doi/full/10.1177/15210251221074379

Please see sml-1-case-study.qmd

  • Building a prediction model for graduation rates using data from the National Center for Education Statistics — IPEDS, specifically
  • We’ll use not only the same data but also the same model in two modes (inferential regression modeling and predictive SML), seeing how the key is in how we use the model, not the model type
  • Work with peers to complete this, reading the text, following links to resources (and the reading), and then completing the required 👉 Your Turn ⤵ tasks
  • A key is available, but we strongly encourage you to use it only at the end to check your work, or if you are completely stuck and have tried our recommended troubleshooting steps: https://docs.google.com/document/d/14Jc-KG3m5k1BvyKWqw7KmDD21IugU5nV5edfJkZyspY/edit

Please see sml-1-badge.qmd

  • Involves applying what you have done through this point in the module to a) extending our model and b) reflecting and planning, after which you will knit and submit your work by publishing to Posit Cloud.

Troubleshooting

  • Please view the troubleshooting Google doc: https://docs.google.com/document/d/14Jc-KG3m5k1BvyKWqw7KmDD21IugU5nV5edfJkZyspY/edit
  • Change default settings: Tools -> Global Options -> Workspace -> uncheck “Restore .RData into workspace at startup”
  • Change default settings: Tools -> Global Options -> Workspace -> set “Save workspace to .RData on exit” to Never
  • Session -> Restart R and Clear Output is a good place to start

fin

  • Dr. Joshua Rosenberg (jmrosenberg@utk.edu; https://joshuamrosenberg.com)