What Makes Supervised Machine Learning Distinct

Essential Readings

Published

July 19, 2024

Overview

These readings highlight what makes supervised machine learning distinct from other analytic approaches. Breiman (2001) is a classic paper that gives a high-level introduction to how the goals of an analyst or scholar using supervised machine learning resemble, and differ from, those of someone using more traditional (i.e., inferential) models. The Estrellado et al. (2020) chapter works through an example similar to the one in this chapter, but with a different data set and a different set of packages (including caret) that you may encounter when using R for supervised machine learning; although the code differs in some key places, consider how the modeling process is similar. Kuzilek et al. (2017) provides an overview of the OULAD dataset, which is useful if you are interested in using it for research (or any other purpose).

Readings

These readings are all in the /lit folder.

Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199-231.

Estrellado, R. A., Freer, E. A., Mostipak, J., Rosenberg, J. M., & Velásquez, I. C. (2020). Predicting students’ final grades using machine learning methods with online course data (Chapter 14). In Data science in education using R. Routledge. http://www.datascienceineducation.com/

Kuzilek, J., Hlosta, M., & Zdrahal, Z. (2017). Data descriptor: Open University learning analytics dataset. Scientific Data, 4, 1-8.

Reflection

To help guide your reflection on the readings, a set of guiding questions is provided below. After you have had a chance to work through one or more of the readings, we encourage you to contribute to our learning community by creating a new post in our machine-learning channel on Slack. Your post might contain a response to one or more of the guiding questions, questions you still have about the topics addressed, or insights gained into your own research.

Breiman (2001)

  • Breiman contrasts the data modeling culture and the algorithmic modeling culture. How do these two cultures differ in their approach to statistical modeling, and what are the potential benefits and drawbacks of each?

  • In the context of educational data science, how can the principles of the algorithmic modeling culture be applied to improve the prediction of student outcomes? Provide examples of methods that align with this culture.

  • Breiman emphasizes the importance of prediction accuracy over interpretability in some cases. Discuss scenarios in educational data science where this trade-off might be justified and how educators can balance these competing priorities.

Estrellado et al. (2020)

  • Discuss the importance of data preprocessing and feature engineering in building predictive models for educational data. What specific techniques were used in this chapter, and how did they impact the model’s performance?

  • How do the authors address the issue of missing data in online course datasets? What alternative strategies could be considered for handling missing data, and how might these affect the results?

  • Which algorithm do Estrellado et al. (2020) use, and why? What are the drawbacks and limitations of this model?

Kuzilek et al. (2017)

  • The Open University Learning Analytics Dataset (OULAD) includes demographic data and aggregated clickstream data. Discuss how this combination of data types can be used to gain insights into student behavior and learning outcomes. What are the potential benefits and limitations of this dataset?

  • The authors describe the process of anonymizing the dataset to protect student privacy. Reflect on the ethical considerations of using and sharing educational data, and how anonymization techniques can address these concerns.

  • How might you use the OULAD for your research?