After interpreting our last model, it is easy to think we could do a little better. But how? In this learning lab, we will answer this question by building a better model.
Feature engineering is a rich topic in machine learning research, including in the learning analytics and educational data mining communities.
Consider research on online learning and the work of Rodriguez et al. (2021). In these two studies, the authors used digital trace data, that is, data generated through users' interactions with digital technologies. Review this paper, specifically how the authors processed the "clickstream" data. As this paper illustrates, there is not one way to use such data.
Last, we note that there are methods intended to automate the process of feature engineering (Bosch et al., 2021), though such processes are not necessarily interpretable, and they usually require some degree of tailoring to your particular context. Review this paper on the topic.
Even after feature engineering, machine learning approaches can often (but not always) be improved by choosing a more sophisticated model type. Note how we used a regression model in the first two case studies; here, we explore a considerably more sophisticated model, a random forest. Feature engineering and choosing a more sophisticated model add some complexity to the modeling. As we have discussed, it is easy to bias our results if we repeatedly check the performance of different model fits with the same test data. Cross-validation is one commonly used solution for this problem.
Our driving question is: How much can we improve our model? Looking back to our predictive model from LL 2, we can see that our accuracy was okay. Can we improve on that? Let’s dive in!
2. WRANGLE
Step 0: Loading and setting up
First, let’s load the packages we’ll use—the familiar {tidyverse} and several others focused on modeling.
Your Turn ⤵
Please add to the chunk below code to load three packages we’ve used before – tidyverse, janitor, and tidymodels.
Please install and add two additional packages: {ranger}, for one implementation of random forest modeling, and {vip}, for variable importance metrics.
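One possible completed chunk follows; this is a sketch, since writing the chunk is your task, and the output shown below comes from running code like this:

library(tidyverse)
library(janitor)
library(tidymodels)
library(ranger) # for one implementation of random forest modeling
library(vip) # for variable importance metrics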
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vip) # a new package we're adding for variable importance measures
Attaching package: 'vip'
The following object is masked from 'package:utils':
vi
library(ranger) # this is needed for the random forest algorithm
Next, we’ll load a new file — one with interactions (or log-trace) data.
We have to do the same processing we did in the second learning lab to obtain cut-off dates. As a reminder, the purpose of this is to train the model on data from the first one-third of the class. The reasoning is that this is a good time to intervene: far enough into the class to make an appreciable impact, but not so late that there is only a limited chance of changing students' trajectory in the class.
We’ll repeat the procedure we carried out with the assessments data — calculating a cut-off for each class and then filtering the data based upon this cut-off. But, since we’ve already done this for the assessment data, to allow us to focus more on some new feature engineering steps in this learning lab, we are providing you with the already-filtered interactions data. Note: you can find the code used to carry out these steps in the Lab 3 folder in process-interactions-data.R.
Please load oulad-interactions-filtered.csv into R, assigning the resulting data frame the name interactions. Note, you may need to unzip this file!
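A minimal sketch, assuming the unzipped file sits in this module's data folder (adjust the path to wherever you saved it):

interactions <- read_csv("data/oulad-interactions-filtered.csv") # read in the filtered interactions (log-trace) data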
Rows: 5548858 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): code_module, code_presentation, activity_type
dbl (8): id_student, id_site, date, sum_click, week_from, week_to, module_pr...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In the OULAD documentation, this is called the VLE (virtual learning environment) data source. Please review the description of the variables in the studentVLE and VLE sources (which are joined together for this learning lab) here. Then, if you have not already, read in the oulad-interactions-filtered.csv file as above.
Also, we’ll load another file we created in the last module - oulad-students_and_assessments.csv. That is in this module’s data folder; please load this in with read_csv(), as below.
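For example (the "data" folder in the path is an assumption; adjust as needed):

students_and_assessments <- read_csv("data/oulad-students_and_assessments.csv") # the file we created in the last module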
Rows: 32593 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): code_module, code_presentation, gender, region, highest_education, ...
dbl (9): id_student, imd_band, num_of_prev_attempts, studied_credits, module...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Whoa, even filtered, this is a large file! Let's start to explore it a bit. Please take three steps to explore the data, as follows.
3. EXPLORE
Your Turn ⤵
First, count() the activity_type variable and sort the resulting output by frequency.
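A sketch of one way to do this:

interactions %>%
  count(activity_type, sort = TRUE) # sort = TRUE orders the output by frequency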
What does this tell you? Add one or more notes:
Your Turn ⤵
Third, please conduct one other data exploration step of your choosing. Options include creating simple graphs or calculating descriptive, summary statistics.
library(skimr)
skim(interactions)
Data summary

|                        |              |
|------------------------|--------------|
| Name                   | interactions |
| Number of rows         | 5548858      |
| Number of columns      | 11           |
| Column type frequency: |              |
| character              | 3            |
| numeric                | 8            |
| Group variables        | None         |

Variable type: character

| skim_variable     | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|-------------------|-----------|---------------|-----|-----|-------|----------|------------|
| code_module       | 0         | 1             | 3   | 3   | 0     | 7        | 0          |
| code_presentation | 0         | 1             | 5   | 5   | 0     | 4        | 0          |
| activity_type     | 0         | 1             | 3   | 14  | 0     | 19       | 0          |

Variable type: numeric

| skim_variable              | n_missing | complete_rate | mean      | sd        | p0     | p25    | p50    | p75    | p100    | hist  |
|----------------------------|-----------|---------------|-----------|-----------|--------|--------|--------|--------|---------|-------|
| id_student                 | 0         | 1.00          | 726903.72 | 571596.96 | 6516   | 511001 | 588131 | 645451 | 2698588 | ▅▇▁▁▁ |
| id_site                    | 0         | 1.00          | 734462.44 | 129889.18 | 526721 | 661808 | 730007 | 877045 | 941543  | ▇▂▇▆▇ |
| date                       | 0         | 1.00          | 37.18     | 38.87     | -25    | 8      | 27     | 55     | 172     | ▆▇▃▂▁ |
| sum_click                  | 0         | 1.00          | 3.68      | 8.45      | 1      | 1      | 2      | 4      | 6977    | ▇▁▁▁▁ |
| week_from                  | 4701710   | 0.15          | 6.08      | 5.06      | 0      | 2      | 5      | 9      | 29      | ▇▅▂▁▁ |
| week_to                    | 4701710   | 0.15          | 6.12      | 5.03      | 0      | 2      | 5      | 9      | 29      | ▇▅▂▁▁ |
| module_presentation_length | 0         | 1.00          | 256.69    | 13.28     | 234    | 241    | 262    | 269    | 269     | ▆▁▁▂▇ |
| quantile_cutoff_date       | 0         | 1.00          | 99.66     | 40.57     | 32     | 54     | 129    | 131    | 173     | ▅▂▁▇▁ |
We’re ready to proceed to engineering some features with the interactions data.
First, let’s join together the interactions and the code_module_dates data frames.
We’ll use the same code to create the code_module_dates we used in LL2.
Rows: 173912 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): code_module, code_presentation, assessment_type
dbl (7): id_assessment, id_student, date_submitted, is_banked, score, date, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'code_module'. You can override using the
`.groups` argument.
This is so we have the cut-offs (one-third through the semester), which will help us to only use data from that period in our model when we filter the data based upon those dates.
interactions_joined <- interactions %>%
  left_join(code_module_dates) # join the data based on code_module and code_presentation
Joining with `by = join_by(code_module, code_presentation,
quantile_cutoff_date)`
interactions_filtered <- interactions_joined %>%
  filter(date < quantile_cutoff_date) # filter the data so only interactions before the cut-off date are included
Then, let’s engineer several features. For the present time, we’ll focus on the sum_click variable, which tells us how many times students clicked on a resource for a given date. We’ll use the interactions_filtered data set we just created. Let’s start simple: please type the name of that data set (interactions_filtered) in the code chunk below.
How can we create a feature with sum_click? Think back to our discussion in the presentation; we have many options for working with such time series data. Perhaps the simplest is to count the clicks. Please summarize the number of clicks for each student (specific to a single course). This means you will need to group your data by id_student, code_module, and code_presentation, and then create a summary variable. Assign the resulting output the name interactions_summarized. You may find the documentation for summarize() to be helpful; that is available here. This chapter is also likely helpful (note that summarise() and summarize() are different spellings for the same function).
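Here is a sketch of one way to complete this step; the variable name sum_clicks is an assumption, though later steps in this lab refer to it:

interactions_summarized <- interactions_filtered %>%
  group_by(id_student, code_module, code_presentation) %>% # one row per student per course
  summarize(sum_clicks = sum(sum_click)) # total clicks for each student in each course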
`summarise()` has grouped output by 'id_student', 'code_module'. You can
override using the `.groups` argument.
Your Turn ⤵
How many times did students click? Let’s create a histogram to see. Please use {ggplot} and geom_histogram() to visualize the distribution of the sum_clicks variable you just created. Turn to the documentation if you need a pointer!
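A sketch of one possible answer, using the interactions_summarized data frame created above:

interactions_summarized %>%
  ggplot(aes(x = sum_clicks)) +
  geom_histogram() # defaults to 30 bins; consider setting binwidth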
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is a good start - we’ve created our first feature based upon the log data, sum_clicks! What are some other features we can add? An affordance of using the summarize() function in R is we can create multiple summary statistics at once. Let’s also calculate the standard deviation of the number of clicks as well as the mean. Please copy the code you wrote above into the code chunk below and then add these two additional summary statistics.
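One possible completion, assuming the names sd_clicks and mean_clicks (which the recipe later in this lab refers to):

interactions_summarized <- interactions_filtered %>%
  group_by(id_student, code_module, code_presentation) %>%
  summarize(sum_clicks = sum(sum_click), # total clicks
            sd_clicks = sd(sum_click), # variability in clicks
            mean_clicks = mean(sum_click)) # average number of clicks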
`summarise()` has grouped output by 'id_student', 'code_module'. You can
override using the `.groups` argument.
We’ll take one last step here – creating individual slopes of students’ clicks over time. This code is a bit more involved, and so is annotated for you below; feel free to modify and re-use this.
First, we use a custom function that can deal with cases where the fitting of the slopes fails (e.g., for students with only a single data point, or some other unanticipated issue with the model estimation).
# Define a function to fit the model and handle warnings
fit_model <- function(data) {
  tryCatch(
    {
      # Fit a linear model with sum_click as the response and date as the predictor
      model <- lm(sum_click ~ date, data = data)
      # Tidy the model output (convert to a tidy data frame)
      tidy(model)
    },
    # Handle errors: return a tibble with NA values
    error = function(e) {
      tibble(term = NA, estimate = NA, std.error = NA, statistic = NA, p.value = NA)
    },
    # Handle warnings: return a tibble with NA values
    warning = function(w) {
      tibble(term = NA, estimate = NA, std.error = NA, statistic = NA, p.value = NA)
    }
  )
}
Note: this next step may take up to a minute or so! Please feel free to relax!
# Fit the models for each group
interactions_slopes <- interactions_filtered %>%
  group_by(id_student, code_module, code_presentation) %>% # group by student and course
  nest() %>% # nest the data frame, creating a list of data frames for each group
  mutate(model = map(data, fit_model)) %>% # apply the fit_model function to each nested data frame
  unnest(model) # unnest the model output, returning to a regular data frame
We’ll do a bit of data wrangling.
# Process the output
interactions_slopes <- interactions_slopes %>%
  ungroup() %>% # remove grouping
  select(code_module, code_presentation, id_student, term, estimate) %>% # select relevant columns
  filter(!is.na(term)) %>% # filter out rows where term is NA
  spread(term, estimate) %>% # spread the terms into separate columns
  mutate_if(is.numeric, round, 4) # round numeric columns to 4 decimal places
Please rename the resulting (Intercept) and date columns, changing them to intercept and slope, respectively. Note, selecting a variable with a parenthesis in it can be tricky! Think of other ways you could do this, if you wish (hint: think of what a function in janitor can do!).
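A sketch of one way to do this; note the backticks around the parenthesized name (alternatively, janitor::clean_names() would standardize all of the names at once):

interactions_slopes <- interactions_slopes %>%
  rename(intercept = `(Intercept)`, # the intercept term from lm()
         slope = date) # the coefficient on date, i.e., the linear slope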
After running this code, we will have intercepts and linear slopes for each student's clickstream pattern over the semester.
Let's join together several files. First, let's join all our features into a single file. Please use left_join() to join interactions_summarized and interactions_slopes, assigning the resulting output the name interactions_summarized_and_slopes.
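A minimal sketch of this join:

interactions_summarized_and_slopes <- interactions_summarized %>%
  left_join(interactions_slopes) # joins by code_module, code_presentation, and id_student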
Joining with `by = join_by(code_module, code_presentation, id_student)`
Just one last step! Let’s join together all of the data we’ll use for our modeling: students_and_assessments and all_features. Use left_join() once more, assigning the resulting output the name students_assessments_and_interactions. Lots of joining! Sometimes, the hardest part of complex analyses lies in the preparation (and joining) of the data.
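For example, assuming all_features is the name given to the joined features data frame from the previous step:

students_assessments_and_interactions <- students_and_assessments %>%
  left_join(all_features) # again joins by code_module, code_presentation, and id_student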
Joining with `by = join_by(code_module, code_presentation, id_student)`
One more small step. Let's ensure that our outcome variable, pass, is a factor. It was, but when saving to and reading from a CSV, it lost its factor characteristics, becoming instead a character string. It is necessary for the outcome of a classification model (like the one we are using) to be a factor. Please use mutate() to do just that.
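A sketch:

students_assessments_and_interactions <- students_assessments_and_interactions %>%
  mutate(pass = as.factor(pass)) # classification models require a factor outcome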
We’ll follow the same steps we followed in learning labs #1 and #2, here. One difference - we’ll use students_assessments_and_interactions instead of the data frame we used in those learning labs. Please port over the code you used in those learning labs here, changing the name of the data frame to the one we are now using.
We discuss this first step minimally as we have now carried out a step very similar to this in LL1 and LL2; return to the case study for those (especially LL1) for more on data splitting.
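Your ported-over code might look roughly like the following; the seed, proportion, and strata argument are assumptions based on the earlier labs:

set.seed(20230712) # for reproducibility; any seed works
train_test_split <- initial_split(students_assessments_and_interactions, prop = .80, strata = "pass")
data_train <- training(train_test_split) # the data we will use to fit (and cross-validate) the model
data_test <- testing(train_test_split) # held out until the very last step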
There is a key difference that is next. In this step, we'll further process data_train, creating different subsets, or folds, of the data that we can use to fit our model multiple times.
vfcv <- vfold_cv(data_train, v = 4) # this differentiates this step from what we did before;
# before, we simply used data_train to fit our model
vfcv
How does this work? data_train is resampled as many times as the value of v we specify.

Your Turn ⤵
Above, we split the data into four different folds by setting v to 4; 10 is simply the default, and not always the best one! Try changing the number of folds by changing the value of v. For help, run ?vfold_cv to get a hint.
Step 2: Engineer features and write down the recipe
Here, we’ll carry out several feature engineering steps.
Read about possible steps and see more about how the following three feature engineering steps work. Like in the first learning lab, this is the step in which we set the recipe.
step_normalize(): normalizes numeric data to have a standard deviation of one and a mean of zero
step_dummy(): converts nominal data (e.g., character or factor variables) into one or more numeric binary model terms for the levels of the original data
step_impute_knn(): imputes missing values using the nearest neighbors method
We will use all three of these. Please note that there are two ways to use each of these. One of them is by specifying the specific variable you want to use for the feature engineering step. For example, below, we would normalize a hypothetical variable, one called minutes_of_videos_watched:
step_normalize(minutes_of_videos_watched)
Another way is to specify for which types of variables the feature engineering step applies. You can see all of the available types here. For example, the code below would normalize all numeric variables:
step_normalize(all_numeric_predictors())
Your Turn ⤵
Let’s turn to our recipe. Please add to the code below to add the new variables—features—we created. We’ve started you off with ___ code below (where you can add the new variables – be sure to add all of them!).
Then, please add to the following code by completing three steps, building on the code that is started below (one possible completed recipe is sketched after this list):
Please normalize all numeric predictors.
Please dummy code the following three variables: disability, gender, and code_module.
Please impute missing values using the nearest neighbors method for the mean_weighted_score, sum_clicks, sd_clicks, mean_clicks, intercept, slope, and date_registration variables.
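Here is a sketch of one possible completed recipe; the exact set of predictors in the formula is an assumption (be sure yours includes all of the features you created):

my_rec <- recipe(pass ~ disability + gender + code_module + date_registration +
                   mean_weighted_score + sum_clicks + sd_clicks + mean_clicks +
                   intercept + slope,
                 data = data_train) %>%
  step_normalize(all_numeric_predictors()) %>% # normalize all numeric predictors
  step_dummy(disability, gender, code_module) %>% # dummy code these three nominal variables
  step_impute_knn(mean_weighted_score, sum_clicks, sd_clicks, mean_clicks,
                  intercept, slope, date_registration) # impute missing values via nearest neighbors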
Next, we specify the model and workflow, this time using a different model type and engine, with classification as the mode given our dichotomous outcome. Specifically, we are:
using the rand_forest() function to set the model as a random forest
using set_engine("ranger", importance = "impurity") to set the engine as that provided for random forests through the {ranger} package; we also add the importance = "impurity" line to be able to interpret a particular variable importance metric specific to random forest models
finally, using set_mode("classification") as we are again predicting categories (here, whether or not students pass the course)
# specify model
my_mod <- rand_forest() %>% # specify the type of model we are fitting, a random forest
  set_engine("ranger", importance = "impurity") %>% # using a variable importance metric specific to this random forest engine
  set_mode("classification") # same as before

# specify workflow
my_wf <- workflow() %>% # create a workflow
  add_model(my_mod) %>% # add the model we wrote above
  add_recipe(my_rec) # add our recipe we wrote above
Step 4: Fit model
Note that here we use the vfcv data and a different function, fit_resamples(), instead of fit(). This function is required when we have multiple folds of our sample, as we created using the vfold_cv() function above. See more about this function here.
class_metrics <- metric_set(accuracy, sensitivity, specificity, ppv, npv, kap) # specify the same metrics as earlier

fitted_model_resamples <- fit_resamples(my_wf, resamples = vfcv, metrics = class_metrics) # specify the workflow, resampled data, and our metrics
Note that you have fit as many models as the value for v that you specified earlier. So, this may take some time. Take a walk, grab a snack, or make a cup of tea!
Then, we can use the same collect_metrics() function we have used to inspect the predictive strength of the model. Please do that for fitted_model_resamples.
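For example:

collect_metrics(fitted_model_resamples) # metrics averaged across the folds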
We noted earlier that we may introduce bias when we evaluate different models against the same held-out test data. The benefit of using multiple folds, and fitting multiple models, is that we can make changes to our features or model. Thus, if we decide to add a new feature, we can re-run the above steps without concern about biasing our interpretation of how accurate our model is.
That is, we can be unconcerned about inadvertently introducing bias until we reach the last step: fitting our final model. We can do that using the familiar fit() function, followed by last_fit(). It is important to not make further changes to the model after this point.
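A sketch of what that final step could look like, assuming the train_test_split object from the splitting step:

fitted_model <- fit(my_wf, data_train) # fit the model to the entire training set
final_fit <- last_fit(my_wf, train_test_split, metrics = class_metrics) # refits on the training data, then evaluates once on the test set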
We can see that final_fit is for a single fit: a random forest trained with the entire training set of data to predict the values in our (otherwise not used/"spent") testing set of data.
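To see which variables mattered most, the {vip} package we loaded earlier can plot variable importance; a sketch, assuming final_fit is the result of last_fit():

final_fit %>%
  extract_fit_parsnip() %>% # pull out the underlying ranger model fit
  vip() # plot the impurity-based importance of the top variables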
Please add two or more notes on what you notice about which variables (features) are important, focused on what you would say to someone in your audience about what the important variables in your model were.
🧶 Knit & Check ✅
Congratulations - you’ve completed this case study! Consider moving on to the badge activity next.