The setting of this study was a public provider of individual online courses in a Midwestern state. Data were collected over two semesters from five online science courses, with the aim of understanding students’ motivation.
Research Question
Is there a relationship between the time students spend in a learning management system and their final course grade?
In Python, packages are equivalent to R’s libraries, containing functions, modules, and documentation. They can be installed using pip and imported into your scripts.
# Import pandas for data manipulation and matplotlib for visualization
import pandas as pd
import matplotlib.pyplot as plt

# If not installed, you can run in the terminal:
# python3 -m pip install matplotlib plotly
# python3 -m pip install pandas
# or, on Windows:
# py -m pip install matplotlib plotly
# py -m pip install pandas
The pandas library in Python is used for data manipulation and analysis. Similar to the functions in R’s {readr} package, pd.read_csv() imports rectangular data from delimited text files such as comma-separated values (CSV), a preferred file format for reproducible research.
# Load data into the Python environment from a CSV file
sci_data = pd.read_csv("data/sci-online-classes.csv")
Inspecting the dataset in Python can be done by displaying the first few rows of the DataFrame.
# Display the first five rows of the data
print(sci_data.head())
Functions are like verbs: pd.read_csv() is the function used to read a CSV file into a pandas DataFrame.
Objects are the nouns: In this case, sci_data becomes the object that stores the DataFrame created by pd.read_csv("data/sci-online-classes.csv").
Arguments are like adverbs: "data/sci-online-classes.csv" is the argument to pd.read_csv(), specifying the path to the CSV file. As with R’s read_csv(), pandas by default takes the column names from the first row of the file, so no additional argument is needed.
Operators are like “punctuation”: = is the assignment operator in Python, used to assign the DataFrame returned by pd.read_csv("data/sci-online-classes.csv") to the object sci_data.
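To see these four pieces working together, here is a minimal sketch. It reads a small in-memory CSV rather than the course file, so the values are made up for illustration, and csv_text and demo_data are hypothetical names that do not appear in the lab.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for a real file on disk
csv_text = (
    "student_id,TimeSpent,FinalGradeCEMS\n"
    "1,1200.5,85.2\n"
    "2,300.0,61.7\n"
    "3,950.25,78.9\n"
)

# Function: pd.read_csv()   Argument: the file-like object
# Operator: =               Object: demo_data
demo_data = pd.read_csv(io.StringIO(csv_text))

# Column names are taken from the first row of the file
print(list(demo_data.columns))
print(demo_data.shape)
```

Passing a path such as "data/sci-online-classes.csv" instead of the io.StringIO object works the same way; only the argument changes.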
Wrangle
Data wrangling is the process of cleaning, “tidying”, and transforming data. In Learning Analytics, it often involves merging (or joining) data from multiple sources.
Data wrangling in Python is primarily done using pandas, allowing for cleaning, filtering, and transforming data.
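As a rough illustration of merging and filtering, the sketch below joins two small tables on a shared key. The DataFrames lms_log and gradebook, the student_id column, and all values are hypothetical; only the column names TimeSpent and FinalGradeCEMS come from the lab.

```python
import pandas as pd

# Hypothetical LMS log data: one row per student
lms_log = pd.DataFrame(
    {"student_id": [1, 2, 3], "TimeSpent": [1200.5, 300.0, 950.25]}
)

# Hypothetical gradebook: student 3 has no grade, student 4 has no log
gradebook = pd.DataFrame(
    {"student_id": [1, 2, 4], "FinalGradeCEMS": [85.2, 61.7, 90.1]}
)

# An inner join keeps only students present in both sources
merged = pd.merge(lms_log, gradebook, on="student_id", how="inner")

# Filter to students with more than 500 units of time in the LMS
engaged = merged[merged["TimeSpent"] > 500]

print(merged)
print(engaged)
```

Switching how="inner" to how="left" or how="outer" would keep unmatched students and fill the missing side with NaN, which is often worth inspecting before dropping anyone.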
Since we are interested in the relationship between time spent in an online course and final grades, let’s select the FinalGradeCEMS and TimeSpent columns from sci_data.
# Selecting specific columns and creating a new DataFrame
sci_data_selected = sci_data[['FinalGradeCEMS', 'TimeSpent']]
Exploratory data analysis in Python involves processes of describing your data numerically or graphically, which often includes:
calculating summary statistics like frequency, means, and standard deviations
visualizing your data through charts and graphs
EDA can be used to help answer research questions, generate new questions about your data, discover relationships between and among variables, and create new variables (i.e., feature engineering) for data modeling.
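The numeric side of EDA might look like the following sketch. The data are invented, the subject column is hypothetical, and treating TimeSpent as minutes is an assumption made only to show how a new variable can be derived.

```python
import pandas as pd

# Hypothetical data; 'subject' and all values are invented for illustration
demo = pd.DataFrame(
    {
        "subject": ["Biology", "Physics", "Biology", "Chemistry"],
        "TimeSpent": [1200.0, 300.0, 900.0, 600.0],
        "FinalGradeCEMS": [85.0, 60.0, 80.0, 75.0],
    }
)

# Frequencies for a categorical variable
subject_counts = demo["subject"].value_counts()

# Means and standard deviations for the numeric variables
summary = demo[["TimeSpent", "FinalGradeCEMS"]].agg(["mean", "std"])

# Feature engineering: express time in hours,
# ASSUMING TimeSpent is recorded in minutes
demo["TimeSpentHours"] = demo["TimeSpent"] / 60

print(subject_counts)
print(summary)
```

For a quicker overview, demo.describe() reports the count, mean, standard deviation, and quartiles of every numeric column in one call.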
The workflow for making a graph typically involves:
Choosing the type of plot or visualization you want to create.
Using the plotting function directly with your data.
Optionally customizing the plot with titles, labels, and other aesthetic features.
A simple template for creating a scatter plot could look like this:
# Import packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Basic scatter plot template
# sns.scatterplot(x=<VARIABLE1>, y=<VARIABLE2>, data=<DATA>)
# plt.show()
Scatter plots are useful for visualizing the relationship between two continuous variables. Here’s how to create one with seaborn.
# Create a scatter plot showing the relationship between time spent and final grades
sns.scatterplot(x='TimeSpent', y='FinalGradeCEMS', data=sci_data)
plt.xlabel('Time Spent')
plt.ylabel('Final Grade CEMS')
plt.show()
Before adding a constant or fitting the model, ensure the data doesn’t contain NaNs (Not a Number) or infinite values. In a pandas DataFrame, you can use a combination of methods provided by pandas and NumPy.
import numpy as np

# Replace infinite values with NaN so they are treated as missing
sci_data_clean = sci_data.replace([np.inf, -np.inf], np.nan)

# Drop rows with NaNs in 'TimeSpent' or 'FinalGradeCEMS'
sci_data_clean = sci_data_clean.dropna(subset=['TimeSpent', 'FinalGradeCEMS'])
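Before dropping anything, it can help to count how many values are actually affected. The sketch below does this on a small made-up DataFrame; demo, cols, and the values are hypothetical and stand in for sci_data.

```python
import numpy as np
import pandas as pd

# Hypothetical data containing one NaN and one infinite value
demo = pd.DataFrame(
    {
        "TimeSpent": [1200.0, np.nan, 900.0, 600.0],
        "FinalGradeCEMS": [85.0, 60.0, np.inf, 75.0],
    }
)

cols = ["TimeSpent", "FinalGradeCEMS"]

# Count missing and infinite values per column before cleaning
n_missing = demo[cols].isna().sum()
n_infinite = np.isinf(demo[cols]).sum()

# Convert infinities to NaN, then drop incomplete rows
demo_clean = demo.replace([np.inf, -np.inf], np.nan).dropna(subset=cols)

print(n_missing)
print(n_infinite)
print(len(demo), "rows before,", len(demo_clean), "rows after")
```

Comparing the row counts before and after cleaning is a simple way to report how many observations were excluded from the model.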
To fit a model in Python, you can use libraries such as statsmodels or scikit-learn. We’ll dive much deeper into modeling in subsequent learning labs, but for now let’s see if there is a statistically significant relationship between students’ final grades, FinalGradeCEMS, and TimeSpent in the LMS:
# Import statsmodels
import statsmodels.api as sm

# Add a constant term for the intercept to the independent variable
X = sm.add_constant(sci_data_clean['TimeSpent'])  # Independent variable
y = sci_data_clean['FinalGradeCEMS']  # Dependent variable

# Fit the model
model = sm.OLS(y, X).fit()
# Print the summary of the model
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:         FinalGradeCEMS   R-squared:                       0.134
Model:                            OLS   Adj. R-squared:                  0.132
Method:                 Least Squares   F-statistic:                     87.99
Date:                Fri, 19 Jul 2024   Prob (F-statistic):           1.53e-19
Time:                        12:14:17   Log-Likelihood:                -2548.5
No. Observations:                 573   AIC:                             5101.
Df Residuals:                     571   BIC:                             5110.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         65.8085      1.491     44.131      0.000      62.880      68.737
TimeSpent      0.0061      0.001      9.380      0.000       0.005       0.007
==============================================================================
Omnibus:                      136.292   Durbin-Watson:                   1.537
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              252.021
Skew:                          -1.381   Prob(JB):                     1.88e-55
Kurtosis:                       4.711   Cond. No.                     3.97e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.97e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
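One way to read the coefficient table is to plug values into the fitted line, FinalGradeCEMS ≈ 65.8085 + 0.0061 × TimeSpent. The sketch below does that arithmetic with the rounded coefficients from the printed summary; the function name predict_grade and the example inputs are chosen here for illustration and are not part of the lab.

```python
# Rounded coefficients copied from the printed regression summary
INTERCEPT = 65.8085  # 'const' row
SLOPE = 0.0061       # 'TimeSpent' row


def predict_grade(time_spent):
    """Predicted final grade for a given TimeSpent value under the fitted line."""
    return INTERCEPT + SLOPE * time_spent


# Compare predictions at a few illustrative TimeSpent values
for t in (0, 1000, 2000):
    print(t, round(predict_grade(t), 2))
```

Under this model, each additional 1,000 units of TimeSpent is associated with roughly 6.1 more grade points. The R-squared of 0.134 indicates that time spent accounts for only about 13% of the variation in final grades, so the relationship is statistically significant but far from a complete explanation.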
Krumm et al. (2018) have outlined the following 3-step process for communicating findings to education stakeholders:
Select. Selecting analyses that are most important and useful to an intended audience, as well as selecting a format for displaying that info (e.g. chart, table).
Polish. Refining or polishing data products, by adding or editing titles, labels, and notations and by working with colors and shapes to highlight key points.
Narrate. Writing a narrative pairing a data product with its related research question and describing how best to interpret and use the data product.
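The Polish step might be sketched as follows, using matplotlib alone. The data values, the title wording, the color choice, and the output filename time_vs_grade.png are all assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values standing in for the course data
demo = pd.DataFrame(
    {
        "TimeSpent": [300.0, 600.0, 900.0, 1200.0],
        "FinalGradeCEMS": [60.0, 75.0, 80.0, 85.0],
    }
)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(demo["TimeSpent"], demo["FinalGradeCEMS"], color="steelblue", alpha=0.7)

# Polish: a title that states the question, plus readable axis labels
ax.set_title("Do students who spend more time in the LMS earn higher grades?")
ax.set_xlabel("Time spent in LMS")
ax.set_ylabel("Final grade")

# Save the figure so it can be shared alongside a written narrative
fig.tight_layout()
fig.savefig("time_vs_grade.png", dpi=150)
```

Phrasing the title as the research question ties the figure back to the Narrate step, where the accompanying text explains how stakeholders should interpret it.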
This work was supported by the National Science Foundation grants DRL-2025090 and DRL-2321128 (ECR:BCSER). Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.