Explaining or Predicting Graduation Rates Using IPEDS

Case Study

Author

LASER Institute

Published

July 13, 2025

1. PREPARE

The LASER workflow

Each supervised machine learning “case study” is designed to illustrate how supervised machine learning methods and techniques can be applied to address a research question of interest, create useful data products, and conduct reproducible research. Each case study is structured around a basic analytics workflow modeled after the Data-Intensive Research Workflow from Learning Analytics Goes to School (Krumm et al., 2018):

Figure 2.2 Steps of Data-Intensive Research Workflow

In the overview presentation for this learning lab, we considered five steps in our supervised machine learning process. Those steps are mirrored here in this case study, with the addition of some other components of this workflow. For example, to help prepare for analysis, we’ll first take a step back and think about how we want to use machine learning, and predicting is a key word. Many scholars have focused on predicting students who are at-risk: of dropping a course or not succeeding in it. In the ML Learning Lab 1 case study will cover the following workflow topics as we attempt to develop our own model for predicting student drop-out:

Prepare: Prior to analysis, we’ll look at the context from which our data came, formulate a basic research question, and get introduced the {tidymodels} packages for machine learning.
Wrangle: Wrangling data entails the work of cleaning, transforming, and merging data. In Part 2 we focus on importing CSV files and modifying some of our variables.
Explore: We take a quick look at our variables of interest and do some basic “feature engineering” by creating some new variables we think will be predictive of students at risk.
Model: We dive deeper into the five steps in our supervised machine learning process, focusing on the mechanics of making predictions.
Communicate: To wrap up our case study, we’ll create our first “data product” and share our analyses and findings by creating our first web page using R Markdown.

In this module, we will be using data from the IPEDS, the Integrated Postsecondary Education Data System. We use Zong and Davis’ (2022) study as an inspiration for ours. These authors used inferential models to try to understand what relates to the graduation rates of around 700 four-year universities in the United States, predicting this outcome on the basis of student background, finance, academic and social environment, and retention rate independent variables. You can find this in the lit folder (with an elaboration and discussion questions in the Readings file for this module).

Zong, C., & Davis, A. (2022). Modeling university retention and graduation rates using IPEDS. Journal of College Student Retention: Research, Theory & Practice, 15210251221074379.

Loading packages

As highlighted in Chapter 6 of Data Science in Education Using R (DSIEUR), one of the first steps of every workflow should be to set up your “Project” within RStudio. Recall that:

A Project is the home for all of the files, images, reports, and code that are used in any given project

Since we are working from an R project cloned from GitHub, a Project has already been set up for you as indicated by the .Rproj file in your main directory in the Files pane. Instead, we will focus on getting our project set up withe the requisite packages we’ll need for analysis.

Packages, sometimes called libraries, are shareable collections of R code that can contain functions, data, and/or documentation and extend the functionality of R. You can always check to see which packages have already been installed and loaded into RStudio Cloud by looking at the the Files, Plots, & Packages Pane in the lower right-hand corner.

Two packages we’ll use extensively throughout these learning labs are the {tidyverse} and {tidymodels} packages.

tidyverse 📦

One package that we’ll be using extensively throughout LASER is the {tidyverse} package. Recall from earlier tutorials that the {tidyverse} package is actually a collection of R packages designed for reading, wrangling, and exploring data and which all share an underlying design philosophy, grammar, and data structures. These shared features are sometimes “tidy data principles.”

Click the green arrow in the right corner of the “code chunk” that follows to load the {tidyverse} library.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

tidymodels

The tidymodels package is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse. It includes a core set of packages that are loaded on startup and contains tools for:

data splitting and pre-processing;
model selection, tuning, and evaluation;
feature selection and variable importance estimation;
as well as other functionality.

👉 Your Turn ⤵

In addition to the {tidyverse} package, we’ll also be using {tidymodels} and the lightweight but highly useful {janitor} package to help with some data cleaning tasks. Use the code chunk below to load both tidymodels and janitor.

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──

✔ broom        1.0.7      ✔ rsample      1.2.1 
✔ dials        1.2.1      ✔ tune         1.2.1 
✔ infer        1.0.7      ✔ workflows    1.1.4 
✔ modeldata    1.4.0      ✔ workflowsets 1.1.0 
✔ parsnip      1.2.1      ✔ yardstick    1.3.1 
✔ recipes      1.0.10

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/

library(janitor)


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

As a tip, remember to use the library() function to load these packages. After you’ve done that, click the green arrow to run the code chunk. If you see a bunch of messages (not anything labeled as an error), you are good to go! These messages mean the packages loaded correctly.

Loading data

Next, we’ll read in data. We’ll use the read_csv() function to load the data file.

For now, please read in the `ipeds-all-title-9-2022-data.csv` file. Use the read_csv() function to do this, paying attention to where those files are located relative to this case study file – in the data folder!

ipeds <- read_csv("data/ipeds-all-title-9-2022-data.csv")

Rows: 5988 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): institution name, HD2022.Postsecondary and Title IV institution in...
dbl (15): unitid, year, DRVADM2022.Percent admitted - total, DRVIC2022.Tuiti...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We’ll then use a handy function from janitor, clean_names(). It does what it seems like it should - it cleans the names of variables, making them easy to view and type. Run this next code chunk.

ipeds <- janitor::clean_names(ipeds)

👉 Your Turn ⤵

In the chunk below, examine the data set using a function or means of your choice (such as just printing the data set by typing its name or using the glimpse() function). Do this in the code chunk below! Note its dimensions — especially how many rows it has!

glimpse(ipeds)

Rows: 5,988
Columns: 24
$ unitid                                                                                  <dbl> …
$ institution_name                                                                        <chr> …
$ year                                                                                    <dbl> …
$ hd2022_postsecondary_and_title_iv_institution_indicator                                 <chr> …
$ drvadm2022_percent_admitted_total                                                       <dbl> …
$ drvic2022_tuition_and_fees_2021_22                                                      <dbl> …
$ hd2022_institution_size_category                                                        <chr> …
$ hd2022_state_abbreviation                                                               <chr> …
$ hd2022_carnegie_classification_2021_basic                                               <chr> …
$ drvef2022_total_enrollment                                                              <dbl> …
$ drvef122022_total_12_month_unduplicated_headcount                                       <dbl> …
$ drvc2022_bachelors_degree                                                               <dbl> …
$ drvc2022_masters_degree                                                                 <dbl> …
$ drvc2022_doctors_degree_research_scholarship                                            <dbl> …
$ drvgr2022_graduation_rate_total_cohort                                                  <dbl> …
$ sfa2122_percent_of_full_time_first_time_undergraduates_awarded_any_financial_aid        <dbl> …
$ xanyaidp                                                                                <chr> …
$ drvhr2022_average_salary_equated_to_9_months_of_full_time_instructional_staff_all_ranks <dbl> …
$ sfa2122_average_net_price_students_awarded_grant_or_scholarship_aid_2021_22             <dbl> …
$ xnpist2                                                                                 <chr> …
$ adm2022_sat_evidence_based_reading_and_writing_50th_percentile_score                    <dbl> …
$ xsatvr50                                                                                <chr> …
$ adm2022_act_composite_50th_percentile_score                                             <dbl> …
$ xactcm50                                                                                <chr> …

👉 Your Turn ⤵

Write down a few observations after inspecting the data - and any all observations welcome!

YOUR RESPONSE HERE
YOUR RESPONSE HERE
YOUR RESPONSE HERE

2. WRANGLE

Even though we cleaned the names to make them easier to view and type (thanks, clean_names()), they are still pretty long.

The code chunk below uses a very handy function, select(). This allows you to simultaneously choose and rename variables, returning a data frame with only the variables you have selected — named as you like. For now, we’ll just run this code. Later in your analyses, you’ll almost certainly use select() to get a more manageable dataset.

ipeds <- ipeds %>% 
    select(name = institution_name, 
           title_iv = hd2022_postsecondary_and_title_iv_institution_indicator, # is the university a title IV university?
           carnegie_class = hd2022_carnegie_classification_2021_basic, # which carnegie classification
           state = hd2022_state_abbreviation, # state
           total_enroll = drvef2022_total_enrollment, # total enrollment
           pct_admitted = drvadm2022_percent_admitted_total, # percentage of applicants admitted
           n_bach = drvc2022_bachelors_degree, # number of students receiving a bachelor's degree
           n_mast = drvc2022_masters_degree, # number receiving a master's
           n_doc = drvc2022_doctors_degree_research_scholarship, # number receive a doctoral degree
           tuition_fees = drvic2022_tuition_and_fees_2021_22, # total cost of tuition and fees
           grad_rate = drvgr2022_graduation_rate_total_cohort, # graduation rate
           percent_fin_aid = sfa2122_percent_of_full_time_first_time_undergraduates_awarded_any_financial_aid, # percent of students receive financial aid
           avg_salary = drvhr2022_average_salary_equated_to_9_months_of_full_time_instructional_staff_all_ranks) # average salary of instructional staff

A useful function for exploring data is count(); it does what it sounds like! It counts how many times values for a variable appear.

ipeds %>% 
    count(title_iv)

# A tibble: 2 × 2
  title_iv                                             n
  <chr>                                            <int>
1 Title IV NOT primarily postsecondary institution    30
2 Title IV postsecondary institution                5958

This suggests we may wish to filter the 30 non-Title IV institutions — something we’ll do shortly.

👉 Your Turn ⤵

Can you count another variable? Pick another (see the code chunk two above) and add a count below. While simple, counting up different values in our data can be very informative (and can often lead to further explorations)!

ipeds %>% 
    count(carnegie_class)

# A tibble: 34 × 2
   carnegie_class                                                              n
   <chr>                                                                   <int>
 1 Associate's Colleges: High Career & Technical-High Nontraditional          88
 2 Associate's Colleges: High Career & Technical-High Traditional            106
 3 Associate's Colleges: High Career & Technical-Mixed Traditional/Nontra…   116
 4 Associate's Colleges: High Transfer-High Nontraditional                   105
 5 Associate's Colleges: High Transfer-High Traditional                      105
 6 Associate's Colleges: High Transfer-Mixed Traditional/Nontraditional      101
 7 Associate's Colleges: Mixed Transfer/Career & Technical-High Nontradit…   114
 8 Associate's Colleges: Mixed Transfer/Career & Technical-High Tradition…   104
 9 Associate's Colleges: Mixed Transfer/Career & Technical-Mixed Traditio…    97
10 Baccalaureate Colleges: Arts & Sciences Focus                             216
# ℹ 24 more rows

Filtering

filter() is a very handy function that is part of the tidyverse; it filters to include (or exclude) observations in your data based upon logical conditions (e.g., ==, >, <=, etc.). See more here if interested.

Below, we filter the data to include only Title IV postsecondary institutions.

ipeds <- ipeds %>% 
    filter(title_iv == "Title IV postsecondary institution")

👉 Your Turn ⤵

Can you filter the data again, this time to only include institutions with a carnegie classification?

In other words, can you exclude those institutions with a value for the carnegie_class variable that is “Not applicable, not in Carnegie universe (not accredited or nondegree-granting)”)? A little hint: whereas the logical operator == is used to include only matching conditions, the logical operator != excludes matching conditions.

ipeds <- ipeds %>% 
    filter(carnegie_class != "Not applicable, not in Carnegie universe (not accredited or nondegree-granting)")

👉 Your Turn ⤵

We’re cruising! Let’s take another peak at our data - using glimpse() or another means of your choosing below.

glimpse(ipeds)

Rows: 3,818
Columns: 13
$ name            <chr> "Alabama A & M University", "University of Alabama at …
$ title_iv        <chr> "Title IV postsecondary institution", "Title IV postse…
$ carnegie_class  <chr> "Master's Colleges & Universities: Larger Programs", "…
$ state           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",…
$ total_enroll    <dbl> 6007, 21639, 647, 9237, 3828, 38644, 1777, 2894, 5109,…
$ pct_admitted    <dbl> 68, 87, NA, 78, 97, 80, NA, NA, 92, 44, 57, NA, NA, NA…
$ n_bach          <dbl> 511, 2785, 54, 1624, 480, 6740, NA, 738, 672, 5653, 26…
$ n_mast          <dbl> 249, 2512, 96, 570, 119, 2180, NA, 80, 300, 1415, 0, N…
$ n_doc           <dbl> 9, 166, 20, 41, 2, 215, NA, 0, 0, 284, 0, NA, 0, NA, N…
$ tuition_fees    <dbl> 10024, 8568, NA, 11488, 11068, 11620, 4930, NA, 8860, …
$ grad_rate       <dbl> 27, 64, 50, 63, 28, 73, 22, NA, 36, 81, 65, 26, 9, 29,…
$ percent_fin_aid <dbl> 87, 96, NA, 96, 97, 87, 87, NA, 99, 79, 100, 96, 92, 8…
$ avg_salary      <dbl> 77824, 106434, 36637, 92561, 72635, 97394, 63494, 8140…

3. EXPLORE

One key step in most analyses is to explore the data. Here, we conduct an exploratory data analysis with the IPEDS data, focusing on the key outcome of graduate rate.

Below, we use the ggplot2 package (part of the tidyverse) to visualize the spread of the values of our dependent variable, grad_rate, which represents institutions’ graduation rate. There is a lot to ggplot2, and data visualizations are not the focus of this module, but this web page has a lot of information you can use to learn more, if you are interested. ggplot2 is fantastic for creating publication-ready visualizations!

ipeds %>% 
    ggplot(aes(x = grad_rate)) +
    geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 418 rows containing non-finite outside the scale range
(`stat_bin()`).

What do you notice about this graph – and about graduation rate?

YOUR RESPONSE
YOUR RESPONSE

👉 Your Turn ⤵

Below, can you add one ggplot2 plot with a different variable/variables? Use the ggplot2 page linked above (also here) or the code above as a starting point (another histogram is fine!) for your visualization.

ipeds %>% 
    ggplot(aes(x = pct_admitted)) +
    geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 2040 rows containing non-finite outside the scale range
(`stat_bin()`).

ipeds %>% 
    ggplot(aes(x = tuition_fees, y = total_enroll, color = avg_salary)) +
    geom_point()

Warning: Removed 599 rows containing missing values or values outside the scale range
(`geom_point()`).

We’ll next do a little additional data wrangling. For now, we’ll model our dependent variable, grad_rate, as a dichotomous (i.e., yes or no; 1 or 0) dependent variable. This isn’t necessary, but it makes the contrast between the regression and supervised machine learning model a bit more vivid, and also dichotomous and categorical outcome variables are common in supervised machine learning applications, and so we’ll do this for this case study.

👉 Your Turn ⤵

Your next task is to decide what constitutes a good graduation rate. Our only suggestion - don’t pick a number too close to 0% or 100%. Otherwise, please replace XXX below with the number from 0-100 that represents the graduation rate percentage. Just add the number — don’t add the percentage symbol.

ipeds <- ipeds %>% 
    mutate(good_grad_rate = if_else(grad_rate > 62, 1, 0),
           good_grad_rate = as.factor(good_grad_rate))

Here, add a reason or two for how and why you picked the number you did:

4. MODEL

Now, we reach a fork in the road. Recall from our first reading that there are two general types of modeling approaches: unsupervised and supervised machine learning. In Part 4, we focus on supervised learning models, which are used to quantify relationships between features (e.g., motivation and performance) and a known outcome (e.g., student drop out). These models can be used for classification of binary or categorical outcomes, as we’ll illustrate in this section, or regression as we’ll demonstrate in modules 2 and 3.

Please write our preliminary, draft research questions for both a regression (RQ A) and supervised machine learning (RQ B) analysis. It may help to review the readings for this module; you can find them in the lit folder; they are listed in sml-1-readings.qmd, too. There aren’t right or wrong answers here; the point is to try to draw out what question might accompany these different analyses (or vice versa - what research questions are feasible to answer using different analyses).

👉 Your Turn ⤵

RQ A - Regression Research Question

YOUR RESEARCH QUESTION

RQ B - Supervised Machine Learning Research Question

YOUR RESEARCH QUESTION

Now, we will proceed to the analyses.

We’ll first conduct a regression analysis, like in the code-along. We use a generalized linear model due to the dependent variable being dichotomous. The code is relatively straightforward; the comments explain each step.

m1 <- glm(good_grad_rate ~ total_enroll + pct_admitted + n_bach + n_mast + n_doc + tuition_fees + percent_fin_aid + avg_salary, data = ipeds, family = "binomial") # to the left of the ~ is our dependent variable; to the right are the independent variables; family = "binomial" is to specify the correct model type for the dichotomous dependent variable

summary(m1) # summary table of model output


Call:
glm(formula = good_grad_rate ~ total_enroll + pct_admitted + 
    n_bach + n_mast + n_doc + tuition_fees + percent_fin_aid + 
    avg_salary, family = "binomial", data = ipeds)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -9.768e-01  7.271e-01  -1.343  0.17913    
total_enroll    -1.501e-04  4.829e-05  -3.109  0.00188 ** 
pct_admitted    -1.135e-02  3.892e-03  -2.916  0.00355 ** 
n_bach           9.693e-04  2.130e-04   4.551 5.34e-06 ***
n_mast          -4.401e-04  2.225e-04  -1.978  0.04790 *  
n_doc            8.313e-03  1.782e-03   4.665 3.08e-06 ***
tuition_fees     7.575e-05  5.551e-06  13.644  < 2e-16 ***
percent_fin_aid -3.301e-02  6.682e-03  -4.940 7.79e-07 ***
avg_salary       3.066e-05  4.428e-06   6.925 4.37e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2164.7  on 1615  degrees of freedom
Residual deviance: 1450.0  on 1607  degrees of freedom
  (2202 observations deleted due to missingness)
AIC: 1468

Number of Fisher Scoring iterations: 6

Then, we’ll conduct a supervised machine learning analysis (with a simple but still commonly used model - in fact, the same model we used for the regression, a generalized linear model!). Again, for now, you’ll run this code; later, you’ll work through each step in detail.

my_rec <- recipe(good_grad_rate ~ total_enroll + pct_admitted + n_bach + n_mast + n_doc + tuition_fees + percent_fin_aid + avg_salary, data = ipeds) # same as above; this sets up what predicts the outcome

# specify model
my_mod <-
    logistic_reg() %>% # specifies a logistic regression
    set_engine("glm") %>% # specifies the specific package used to estimate the logistic regression
    set_mode("classification") # specifies that our outcome is a dichotomous or categorical variable

my_wf <- workflow() %>% # this starts a workflow, which will stitch together the steps in our analysis
    add_model(my_mod) %>% # adds the model
    add_recipe(my_rec) # adds the recipe

fit_model <- fit(my_wf, data = ipeds) # fits the model

ipeds_preds <- predict(fit_model, ipeds, type = "class") %>% # makes predictions
  bind_cols(ipeds %>% select(good_grad_rate)) # binds the known outcomes

metrics(ipeds_preds, truth = good_grad_rate, estimate = .pred_class) # calculates metrics

# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.810
2 kap      binary         0.588

The key to observe at this point is what is similar and different between the two approaches (regression and supervised machine learning). Both used the same underlying statistical model, but had some stark differences. Add two or more similarities and two or more differences (no wrong answers!) below.

👉 Your Turn ⤵

Similarities:

Differences:

5. COMMUNICATE

The final step in the workflow/process is sharing the results of your analysis with wider audience. Krumm et al. (2018) have outlined the following 3-step process for communicating with education stakeholders findings from an analysis:

Select. Communicating what one has learned involves selecting among those analyses that are most important and most useful to an intended audience, as well as selecting a form for displaying that information, such as a graph or table in static or interactive form, i.e. a “data product.”
Polish. After creating initial versions of data products, research teams often spend time refining or polishing them, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.
Narrate. Writing a narrative to accompany the data products involves, at a minimum, pairing a data product with its related research question, describing how best to interpret the data product, and explaining the ways in which the data product helps answer the research question.

Now, let’s return to our research questions. What did we find? This (especially the supervised machine learning model and its output) is very likely new, and this is meant to elicit initial perceptions, and not the right answer. What did we find for each of your RQs? Add a few thoughts below for each. Focus on what you would communicate about this analysis to a general audience, again, keeping in mind this is based on your very initial interpretations.

RQ A - Regression Research Question

YOUR RESPONSE
YOUR RESPONSE

RQ B - Supervised Machine Learning Research Question

YOUR RESPONSE
YOUR RESPONSE

🧶 Knit & Check ✅

For your SML Module 1 Badge, you will further reflect on and interpret these models, and their distinctions.

Rendered HTML files can be published online through a variety of ways including Posit Cloud, RPubs , GitHub Pages, Quarto Pub, or other methods. The easiest way to quickly publish your file online is to publish directly from RStudio. You can do so by clicking the “Publish” button located in the Viewer Pane after you render your document as illustrated in the screenshot below.

Congratulations - you’ve completed this case study! Move on to the badge activity next.