Using Training and Testing Data in a Workflow

Conceptual Overview

Purpose and Agenda

We have some data and want to develop a prediction model; supervised machine learning is suited to this aim. In particular, in this module, we explore how we can train a computer to predict whether students will pass a course. We use a large data set, the Open University Learning Analytics Dataset (OULAD), focusing for now on student data. Our model is relatively simple: a generalized linear model.

What we’ll do in this presentation

  • Discussion 1
  • Key Concept #1: Our framework
  • Key Concept #2: Training and testing data
  • Discussion 2
  • Introduction to the other parts of this module

Discussion 1

Discuss!

  • Provide an example of supervised machine learning in the context of educational research. Discuss why this counts as machine learning.
  • How might presenting the results of a machine learning model differ from presenting those from a more traditional (“explanatory”) model?

Key Concept #1

Our framework

  • We want to make predictions about an outcome of interest based on predictor variables that we think are related to the outcome.
  • We’ll be using a widely-used data set in the learning analytics field: the Open University Learning Analytics Dataset (OULAD).
  • The OULAD was created by learning analytics researchers at the United Kingdom-based Open University.
  • It includes data from post-secondary learners’ participation in one of several massive open online courses (called modules in the OULAD).
  • Many students pass these courses, but not all do.
  • We have data on students’ initial characteristics and their interactions in the course.
  • If we could develop a good prediction model, we could provide additional supports to students, and maybe move the needle by helping some students succeed who might not otherwise.

We’ll be focusing on three files:

  • studentInfo, courses, and studentRegistration

These are joined together (see oulad.R) for this module. You’ll be doing more joining later!
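As a rough sketch, the join in oulad.R likely looks something like the following; the object names (students, courses, registrations, students_and_courses) are illustrative, and the file names match the OULAD download.

library(tidyverse)

# Read in the three OULAD files
students      <- read_csv("studentInfo.csv")
courses       <- read_csv("courses.csv")
registrations <- read_csv("studentRegistration.csv")

# Join course- and registration-level variables onto the student-level data
students_and_courses <- students %>%
  left_join(courses, by = c("code_module", "code_presentation")) %>%
  left_join(registrations,
            by = c("code_module", "code_presentation", "id_student"))

The joined data look like this: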

# A tibble: 3 × 15
  code_module code_presentation id_student gender region       highest_education
  <chr>       <chr>                  <dbl> <chr>  <chr>        <chr>            
1 AAA         2013J                  11391 M      East Anglia… HE Qualification 
2 AAA         2013J                  28400 F      Scotland     HE Qualification 
3 AAA         2013J                  30268 F      North Weste… A Level or Equiv…
# ℹ 9 more variables: imd_band <chr>, age_band <chr>,
#   num_of_prev_attempts <dbl>, studied_credits <dbl>, disability <chr>,
#   final_result <chr>, module_presentation_length <dbl>,
#   date_registration <dbl>, date_unregistration <dbl>
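One wrinkle: final_result takes four values in the OULAD (Distinction, Pass, Fail, and Withdrawn), but logistic regression needs a binary outcome. A minimal sketch of creating one, assuming we count Distinction as passing (the variable name passed is our choice):

students_and_courses <- students_and_courses %>%
  mutate(passed = if_else(final_result %in% c("Pass", "Distinction"),
                          "yes", "no"),
         passed = as.factor(passed))  # tidymodels expects a factor outcome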
Throughout this module, we follow a general data science workflow:

  1. Prepare: Prior to analysis, we’ll take a look at the context from which our data came, formulate some questions, and load R packages.
  2. Wrangle: In the wrangling section, we will learn some basic techniques for manipulating, cleaning, transforming, and merging data.
  3. Explore: The processes of wrangling and exploring data often go hand in hand.
  4. Model: In this step, we carry out the analysis: here, supervised machine learning.
  5. Communicate: Interpreting and communicating the results of our findings is the last step.
Within the Model step, the supervised machine learning process itself has five steps, which map onto the workflow above (see the sketch after this list):

  1. Split data (Prepare)
  2. Engineer features and write down the recipe (Wrangle and Explore)
  3. Specify the model and workflow (Model)
  4. Fit model (Model)
  5. Evaluate accuracy (Communicate)
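Here is a minimal tidymodels sketch of those five steps. It assumes the joined data set students_and_courses and the binary passed outcome created above; the handful of predictors is purely illustrative.

library(tidymodels)

# 1. Split data (Prepare)
set.seed(2025)
data_split <- initial_split(students_and_courses, prop = 0.80, strata = passed)
data_train <- training(data_split)
data_test  <- testing(data_split)

# 2. Engineer features and write down the recipe (Wrangle and Explore)
my_recipe <- recipe(passed ~ gender + disability + studied_credits,
                    data = data_train)

# 3. Specify the model and workflow (Model)
my_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

my_workflow <- workflow() %>%
  add_recipe(my_recipe) %>%
  add_model(my_model)

# 4. Fit model (Model)
fitted_workflow <- fit(my_workflow, data = data_train)

# 5. Evaluate accuracy (Communicate)
predict(fitted_workflow, new_data = data_test) %>%
  bind_cols(data_test) %>%
  metrics(truth = passed, estimate = .pred_class)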

This is the fundamental process we’ll follow for this and the next two modules focused on supervised ML.

Key Concept #2

Training and testing

  • Algorithms (or estimation procedures, or models) refer to the structure and process of estimating the parameters of a model.
  • This definition admits a wide range of algorithms, from simple to very complex, as we discuss in a later module.
  • For now, we focus on a familiar, easy-to-interpret algorithm: logistic regression.
  • This is a generalized linear model for a binary (“yes” or “no”) outcome.
  • It won’t be a bad model to start us off!
  • When doing supervised ML, we focus on predicting an outcome: how well we do this overall and for particular cases (more on how in the next module).
  • We do not focus on inference or explanation (i.e., an “explanatory” model): model fit, statistical significance, effect sizes, etc.
  • This is a really key difference: we use different metrics to evaluate what makes for a good model.
  • A key concept in the context of supervised machine learning is training vs. testing data:
  • Training data: Data we use to fit (or train, AKA estimate) a supervised machine learning model (AKA algorithm).
  • Testing data: Data, not used to fit the model, that we use to see how well our model predicts new observations.
  • By splitting our data into training and testing sets, we can obtain unbiased metrics for how good our model is at predicting.
  • If we used only one data set (i.e., only training data), we could fit a model that does a really good job of making predictions.
  • But this model would likely be overfit: too tailored to the specific data in our training set, rather than generalizable to new data.
  • The big-picture, very real risk of not using separate training and testing data is that we think we have a better model than we actually do.
  • We could fit a model that perfectly predicts every outcome in our training data, but when the model sees new (i.e., different) data, it performs very poorly.
  • This is essential for supervised machine learning; if you review or see a study that uses only a single data set, be skeptical of its prediction metrics!
  • It is often valuable to conduct a stratified split, based on the proportion or distribution of the dependent variable’s values in the data set (see the sketch below).
    • This ensures our training and testing data will not be imbalanced on the dependent variable.
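To see what stratification buys us, we can compare the outcome’s distribution across the training and testing sets from the split above (data_split and passed come from the earlier sketch):

training(data_split) %>%
  count(passed) %>%
  mutate(prop = n / sum(n))

testing(data_split) %>%
  count(passed) %>%
  mutate(prop = n / sum(n))

With strata = passed in initial_split(), the two prop columns should be nearly identical.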

Discussion 2

  • Why not use our training data to evaluate how good our model is?
  • What data or context are you interested in for your own use of supervised machine learning?

Introduction to the other parts of this module

Readings:

Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199-231.

Estrellado, R. A., Freer, E. A., Mostipak, J., Rosenberg, J. M., & Velásquez, I. C. (2020). Data science in education using R (Chapter 14: Predicting students’ final grades using machine learning methods with online course data). Routledge. http://www.datascienceineducation.com/

Case study:

  • Building a prediction model for students passing the class, based just on student data.
  • Work with peers to complete it: reading the text, following links to resources (and the readings), and then completing the required 👉 Your Turn ⤵ tasks.
  • A key is available, but we strongly encourage you to use it only at the end to check your work, or if you are completely stuck and have tried our recommended troubleshooting steps: https://docs.google.com/document/d/14Jc-KG3m5k1BvyKWqw7KmDD21IugU5nV5edfJkZyspY/edit
  • The concluding activity involves applying what you have done to this point in the module to a) extending our model and b) reflecting and planning, after which you will knit and submit your work by publishing to Posit Cloud.

fin

Dr. Joshua Rosenberg (jmrosenberg@utk.edu; https://joshuamrosenberg.com)

General troubleshooting tips for R and RStudio