Conceptual Overview
Having fit and interpreted our machine learning model, how do we make our model better? That’s the focus of this learning lab. There are three core ideas, pertaining to feature engineering, resampling, and the random forest algorithm. We again use the OULAD data.
In earlier modules, we fit and interpreted our model using a single split of the data (80% training, 20% testing). What if we decided to add new features or change existing ones?
We’d need to use the same training data to tune a new model, and the same testing data to evaluate its performance. However, this could lead to fitting a model based on how well we predict the data that happened to end up in the test set.
We could be optimizing our model for our testing data; when used with new data, our model could make poor predictions.
In short, a challenge arises when we wish to use our training data more than once
Namely, if we repeatedly train an algorithm on the same data and then make changes, we may be tailoring our model to specific features of the testing data
Resampling conserves our testing data; we don’t have to spend it until we’ve finalized our model!
Resampling involves blurring the boundaries between training and testing data, but only for the training split of the data
Specifically, it involves combining these two portions of our data into one, iteratively considering some of the data to be for “training” and some for “testing”
Then, fit measures are averaged across these different samples
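To make this concrete, below is a minimal sketch of resampling with the tidymodels packages. The object names (train_data, outcome) and the logistic regression specification are placeholders for illustration, not the lab's actual OULAD code.

```r
library(tidymodels)

# `train_data` and `outcome` are placeholder names standing in for the
# lab's actual training data and outcome variable
folds <- vfold_cv(train_data, v = 10)

model_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

resample_fit <- workflow() %>%
  add_formula(outcome ~ .) %>%
  add_model(model_spec) %>%
  fit_resamples(
    resamples = folds,
    metrics   = metric_set(accuracy, roc_auc)
  )

# Returns each fit measure averaged (mean) across the ten resamples
collect_metrics(resample_fit)
```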
```
# A tibble: 2 × 3
     id  var_a var_b
  <int>  <dbl> <dbl>
1     1 0.0345 0.325
2     2 0.534  0.860
```
Using k = 10, how can we split n = 100 cases into ten distinct training and testing sets?
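One answer is sketched below using rsample's vfold_cv(), assuming a simulated data set like the one previewed above, extended to n = 100 cases (hypothetical data, not the lab's OULAD data).

```r
library(tidymodels)

# Hypothetical data: n = 100 simulated cases
set.seed(2023)
sim_data <- tibble(
  id    = 1:100,
  var_a = runif(100),
  var_b = runif(100)
)

# k = 10: each of the ten folds holds out 10 cases for assessment
# ("testing") and uses the remaining 90 for analysis ("training")
folds <- vfold_cv(sim_data, v = 10)
folds
```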
First resampling

```
if Predictor B >= 0.197 then
|  if Predictor A >= 0.13 then Class = 1
|  else Class = 2
else Class = 2
```
Second resampling

```
if Predictor B >= 0.197 then
|  if Predictor A >= 0.13 then
|  |  if Predictor C < -1.04 then Class = 1
|  |  else Class = 2
else Class = 3
```
As you can imagine, with many variables, these trees can become very complex
A random forest fits many such trees to resampled data and combines their predictions; its key tuning parameters are:
mtry (the number of predictors randomly sampled at each split)
min_n (the minimum number of data points a node must contain to be split further)
trees (the number of trees in the ensemble)
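A sketch of how these parameters appear in a parsnip model specification follows; the choice of the ranger engine and of which parameters to tune are illustrative assumptions, not necessarily the lab's settings.

```r
library(tidymodels)

rf_spec <- rand_forest(
  mtry  = tune(),   # number of predictors sampled at each split
  min_n = tune(),   # minimum cases in a node before splitting further
  trees = 1000      # number of trees in the forest
) %>%
  set_engine("ranger") %>%
  set_mode("classification")
```

The tune() placeholders could then be filled in by, for example, tune_grid() evaluated over resamples like the cross-validation folds above.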
Boosted trees, like those created using the xgboost package (engine), sequentially build trees where each new tree attempts to correct the errors made by the previous ones
By focusing on correcting mistakes and optimizing the model iteratively, boosted trees can achieve better performance compared to random forests
Boosted trees offer more hyperparameters to fine-tune, such as the learning rate and tree depth, which can make tuning more challenging
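A parallel sketch with parsnip's boost_tree() and the xgboost engine is below; which parameters to tune, and the number of trees, are illustrative assumptions.

```r
library(tidymodels)

xgb_spec <- boost_tree(
  trees      = 1000,     # number of sequentially grown trees
  learn_rate = tune(),   # how strongly each new tree corrects earlier errors
  tree_depth = tune()    # maximum depth of each individual tree
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
```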