Modeling for educational researchers builds on insights from Exploratory Data Analysis (EDA) to develop predictive or explanatory models that guide decision-making and enhance learning outcomes.
This involves selecting appropriate models, preparing and transforming data, training and validating the models, and interpreting results. Effective modeling helps uncover key factors influencing educational success, allowing for targeted interventions and informed policy decisions.
Module Objectives
By the end of this module:
Introduction to Modeling:
Learners will understand the importance of modeling in the learning analytics workflow and how it helps quantify insights from data.
Creating and Interpreting Correlations:
Learners will create and interpret correlation matrices using the {corrr} package and create APA-formatted correlation tables using the {apaTables} package.
Applying Various Modeling Techniques:
Learners will fit and understand linear regression models to prepare for the case study.
The pandas corr() method computes the pairwise correlation between the columns of a DataFrame, using the Pearson coefficient by default.
# Select specific variables for analysis
selected_data = demo_data[['completion_percentage', 'hours_online']]

# Calculate the correlation matrix
correlation_matrix = selected_data.corr()

# Print the correlation matrix
print(correlation_matrix)
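To make the output of corr() easier to interpret, the sketch below runs the same steps on a small invented data set and recomputes the Pearson coefficient by hand. The demo_data values here are hypothetical stand-ins, not the course data; only the column names come from the module.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for demo_data; the values are invented for illustration.
demo_data = pd.DataFrame({
    "hours_online": [2.0, 4.5, 6.0, 8.5, 10.0, 12.5],
    "completion_percentage": [35.0, 50.0, 55.0, 72.0, 80.0, 91.0],
})

selected_data = demo_data[["completion_percentage", "hours_online"]]

# DataFrame.corr() uses the Pearson method unless another method is requested.
correlation_matrix = selected_data.corr()

# The same coefficient computed by hand:
# sum of cross-deviations divided by the product of the deviation norms.
x = selected_data["hours_online"]
y = selected_data["completion_percentage"]
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

print(correlation_matrix.round(3))
print(round(r_manual, 3))
```

The diagonal of the matrix is always 1 (each variable correlates perfectly with itself), and the off-diagonal entry should match the hand-computed value. Values near 1 or -1 indicate a strong linear relationship; values near 0 indicate a weak one.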
Linear regression is implemented using the statsmodels library.
Import the statsmodels.api module as sm.
👉 Your Turn⤵ -> Answer
# Import statsmodels
import statsmodels.api as sm
Instructions:
Clean the data.
Create a linear regression model with hours_online as the independent variable and completion_percentage as the dependent variable.
Fit the model and inspect the summary.
Visualize the model with a scatter plot and regression line.
# Step 1: Drop rows with missing values
data_cleaned = selected_data.dropna()

# Step 2: Define independent and dependent variables
X = data_cleaned[['hours_online']]  # Independent variable
y = data_cleaned['completion_percentage']  # Dependent variable

# Step 3: Add a constant to the model (for the intercept)
X = sm.add_constant(X)

# Step 4: Fit the model
model = sm.OLS(y, X).fit()

# Step 5: Output the summary of the model
print(model.summary())
Create a linear regression model with hours_online and boredom_score as the independent variables and completion_percentage as the dependent variable.
Fit the model and inspect the summary.
# Import plotting libraries
import seaborn as sns  # seaborn
import matplotlib.pyplot as plt  # matplotlib

# Step 1: Drop rows with missing values in any of the selected columns
model_columns = ['hours_online', 'boredom_score', 'completion_percentage']
cleaned_data = demo_data.dropna(subset=model_columns)

# Step 1a: Verify that there are no missing values in the selected columns
print(cleaned_data[model_columns].isna().sum())

# Step 2: Define independent and dependent variables
X = cleaned_data[['hours_online', 'boredom_score']]  # Independent variables
y = cleaned_data['completion_percentage']  # Dependent variable

# Step 3: Add a constant to the model (for the intercept)
X = sm.add_constant(X)

# Step 4: Fit the model
model = sm.OLS(y, X, missing='drop').fit()

# Step 5: Output the summary of the model
print(model.summary())