Diving Deep on Features and Feature Engineering

Module 1 - Essential Readings

Author

LASER Institute

Published

July 19, 2024

Overview

At this point, you are familiar with the key parts of the supervised machine learning workflow. These readings are intended to help you deepen your understanding of feature engineering and of decisions about which features to include (Baker et al., 2023; Bosch, 2021), and to see how feature engineering is used in a specific empirical analysis (Rodriguez et al., 2021).

Readings

These readings are all in the /lit folder.

Baker, R. S., Esbenshade, L., Vitale, J., & Karumbaiah, S. (2023). Using demographic data as predictor variables: A questionable choice. Journal of Educational Data Mining, 15(2), 22-52.

Bosch, N. (2021). AutoML feature engineering for student modeling yields high accuracy, but limited interpretability. Journal of Educational Data Mining, 13(2), 55-79.

Rodriguez, F., Lee, H. R., Rutherford, T., Fischer, C., Potma, E., & Warschauer, M. (2021, April). Using clickstream data mining techniques to understand and support first-generation college students in an online chemistry course. In LAK21: 11th International Learning Analytics and Knowledge Conference (pp. 313-322).

Reflection

To help guide your reflection on the readings, a set of guiding questions is provided below. After you have had a chance to work through one or more of the readings, we encourage you to contribute to our learning community by creating a new post in our machine-learning channel on Slack. Your post might contain a response to one or more of the guiding questions, questions you still have about the topics addressed, or insights gained into your own research.

Baker et al. (2023)

In what ways could the inclusion of demographic variables as predictors potentially reinforce biases in predictive analytics within education?
The authors argue that demographic variables should be used to validate fairness rather than as predictors within models. What are the potential benefits and drawbacks of this approach?
How could the limitations of categorization impact the use of demographic variables in predictive models?

Bosch (2021)

Bosch highlights the trade-off between accuracy and interpretability in using AutoML for student modeling. Discuss the implications of this trade-off for educators and policymakers. How can they balance the need for accurate predictions with the necessity of understanding and explaining model decisions?
The paper applies AutoML to feature engineering. What are the potential benefits and limitations of this approach in the context of student modeling, and how might these affect the deployment of predictive analytics in educational environments?
Reflect on the ethical considerations of using highly accurate but less interpretable models in education. How can transparency and accountability be maintained when employing such models to make decisions affecting students?

Rodriguez et al. (2021)

The study uses clickstream data to identify self-regulated learning patterns among first-generation college students. Discuss the four distinct learning patterns identified and their implications for student support and intervention strategies.
The authors found that early planning behaviors were particularly predictive of course success. How can educators design interventions or course structures to promote these behaviors among students, especially first-generation students?
Reflect on the potential of using clickstream data to support underrepresented students in online learning environments. What are the benefits and challenges of implementing such data-driven approaches in practice?