Workflow & Reproducible Research

Orientation: Code Along

Agenda

Python Code-Along

Data-Intensive Research Workflow

  • Prepare
  • Wrangle
  • Explore
  • Model
  • Communicate

Prepare

The Study

The setting of this study was a public provider of individual online courses in a Midwestern state. Data were collected from five online science courses across two semesters, with the aim of understanding students’ motivation.

Research Question

Is there a relationship between the time students spend in a learning management system and their final course grade?

The Tools of Reproducible Research

In Python, packages are equivalent to R’s libraries, containing functions, modules, and documentation. They can be installed using pip and imported into your scripts.

# Import pandas for data manipulation and matplotlib for visualization
import pandas as pd
import matplotlib.pyplot as plt

# If these packages are not installed, run in the terminal:
# python3 -m pip install matplotlib plotly
# python3 -m pip install pandas

# On Windows:
# py -m pip install matplotlib plotly
# py -m pip install pandas

The pandas library in Python is used for data manipulation and analysis. Similar to the {readr} package in R, its pd.read_csv() function imports rectangular data from delimited text files such as comma-separated values (CSV) files, a preferred format for reproducible research.

# Load data into the Python environment from a CSV file
sci_data = pd.read_csv("data/sci-online-classes.csv")

Inspecting the dataset in Python can be done by displaying the first few rows of the DataFrame.

# Display the first five rows of the data
print(sci_data.head())
   student_id      course_id  total_points_possible  total_points_earned  \
0       43146  FrScA-S216-02                   3280                 2220   
1       44638   OcnA-S116-01                   3531                 2672   
2       47448  FrScA-S216-01                   2870                 1897   
3       47979   OcnA-S216-01                   4562                 3090   
4       48797  PhysA-S116-01                   2207                 1910   

   percentage_earned subject semester  section  \
0           0.676829   FrScA     S216        2   
1           0.756726    OcnA     S116        1   
2           0.660976   FrScA     S216        1   
3           0.677335    OcnA     S216        1   
4           0.865428   PhysA     S116        1   

                        Gradebook_Item  Grade_Category  ...   q7   q8   q9  \
0  POINTS EARNED & TOTAL COURSE POINTS             NaN  ...  5.0  5.0  4.0   
1                            ATTEMPTED             NaN  ...  4.0  5.0  4.0   
2  POINTS EARNED & TOTAL COURSE POINTS             NaN  ...  4.0  5.0  3.0   
3  POINTS EARNED & TOTAL COURSE POINTS             NaN  ...  4.0  5.0  5.0   
4  POINTS EARNED & TOTAL COURSE POINTS             NaN  ...  4.0  4.0  NaN   

   q10  TimeSpent  TimeSpent_hours  TimeSpent_std  int   pc        uv  
0  5.0  1555.1667        25.919445      -0.180515  5.0  4.5  4.333333  
1  4.0  1382.7001        23.045002      -0.307803  4.2  3.5  4.000000  
2  5.0   860.4335        14.340558      -0.693260  5.0  4.0  3.666667  
3  5.0  1598.6166        26.643610      -0.148447  5.0  3.5  5.000000  
4  3.0  1481.8000        24.696667      -0.234663  3.8  3.5  3.500000  

[5 rows x 30 columns]
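
Beyond head(), a couple of other standard pandas methods are handy for a quick first look at the data; a minimal sketch:

# Dimensions of the DataFrame as (rows, columns)
print(sci_data.shape)

# Column names, data types, and non-null counts
sci_data.info()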

What variables do you think might help us answer our research question?

Python syntax for reading and inspecting data can be intuitive and powerful. For example:

sci_data = pd.read_csv("data/sci-online-classes.csv")

  • Functions are like verbs: pd.read_csv() is the function used to read a CSV file into a pandas DataFrame.
  • Objects are the nouns: In this case, sci_data becomes the object that stores the DataFrame created by pd.read_csv("data/sci-online-classes.csv").
  • Arguments are like adverbs: "data/sci-online-classes.csv" is the argument to pd.read_csv(), specifying the path to the CSV file. Like R’s read_csv(), pandas infers column names from the first row by default, so there is no need for a col_names argument; the corresponding pandas parameter is header (see the sketch after this list).
  • Operators are like “punctuation”: = is the assignment operator in Python, used to assign the DataFrame returned by pd.read_csv("data/sci-online-classes.csv") to the object sci_data.
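
To make those defaults explicit, the same call can be written with keyword arguments; a minimal sketch (header and names are standard parameters of pandas’ read_csv, and the alternative column names shown are purely hypothetical):

# Equivalent to the call above, with the default header behavior spelled out
sci_data = pd.read_csv("data/sci-online-classes.csv", header=0)

# If a file had no header row, column names could be supplied instead, e.g.:
# pd.read_csv("some-headerless-file.csv", header=None, names=["student_id", "course_id"])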

Wrangle

Data wrangling is the process of cleaning, “tidying”, and transforming data. In Learning Analytics, it often involves merging (or joining) data from multiple sources.

  • Data wrangling in Python is primarily done using pandas, which supports cleaning, filtering, transforming, and joining data (a brief merge sketch follows below).
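
As mentioned above, Learning Analytics work often involves joining data from multiple sources. A minimal sketch with two small, hypothetical DataFrames (pd.merge() is the standard pandas join function):

# Two hypothetical data sources that share a student_id key
gradebook = pd.DataFrame({'student_id': [1, 2, 3], 'final_grade': [88, 92, 75]})
log_data = pd.DataFrame({'student_id': [1, 2, 3], 'time_spent': [1200, 950, 400]})

# A left join keeps every gradebook row and adds matching columns from log_data
merged = pd.merge(gradebook, log_data, on='student_id', how='left')
print(merged)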

Since we are interested in the relationship between time spent in an online course and final grades, let’s select the FinalGradeCEMS and TimeSpent columns from sci_data (the pandas equivalent of dplyr’s select()).

# Selecting specific columns and creating a new DataFrame
sci_data_selected = sci_data[['FinalGradeCEMS', 'TimeSpent']]
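
Filtering rows and creating new columns are equally common wrangling steps. A brief sketch building on sci_data_selected (the 60-minute cutoff is an arbitrary illustrative value; dividing TimeSpent by 60 mirrors the TimeSpent_hours column seen earlier):

# Keep only students who spent more than 60 minutes in the LMS
active_students = sci_data_selected[sci_data_selected['TimeSpent'] > 60]

# Add a column converting minutes to hours
active_students = active_students.assign(TimeSpent_hours=active_students['TimeSpent'] / 60)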

Explore

Exploratory data analysis (EDA) in Python involves describing your data numerically or graphically, which often includes:

  • calculating summary statistics like frequencies, means, and standard deviations (a brief sketch follows below)

  • visualizing your data through charts and graphs

EDA can be used to help answer research questions, generate new questions about your data, discover relationships between and among variables, and create new variables (i.e., feature engineering) for data modeling.
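
For example, summary statistics for our two variables of interest, along with frequencies and group means by course subject, can be computed directly with pandas; a minimal sketch using columns from sci_data:

# Numeric summaries (count, mean, std, quartiles) for the two focal variables
print(sci_data[['TimeSpent', 'FinalGradeCEMS']].describe())

# Frequency of enrollments by course subject
print(sci_data['subject'].value_counts())

# Mean final grade by course subject
print(sci_data.groupby('subject')['FinalGradeCEMS'].mean())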

The workflow for making a graph typically involves:

  1. Choosing the type of plot or visualization you want to create.

  2. Using the plotting function directly with your data.

  3. Optionally customizing the plot with titles, labels, and other aesthetic features.

A simple template for creating a scatter plot could look like this:

# Import packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Basic scatter plot template
# sns.scatterplot(x=<VARIABLE1>, y=<VARIABLE2>, data=<DATA>)
# plt.show()

Scatter plots are useful for visualizing the relationship between two continuous variables. Here’s how to create one with seaborn.

# Create a scatter plot showing the relationship between time spent and final grades
sns.scatterplot(x='TimeSpent', y='FinalGradeCEMS', data=sci_data)
plt.xlabel('Time Spent')
plt.ylabel('Final Grade CEMS')
plt.show()

Model

Before adding a constant or fitting the model, ensure the data doesn’t contain NaNs (Not a Number) or infinite values. In a pandas DataFrame, you can use a combination of methods provided by pandas and NumPy.

import numpy as np

# Replace infinite values (if any) with NaN
sci_data_clean = sci_data.replace([np.inf, -np.inf], np.nan)

# Drop rows with NaNs in 'TimeSpent' or 'FinalGradeCEMS'
sci_data_clean = sci_data_clean.dropna(subset=['TimeSpent', 'FinalGradeCEMS'])
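
As an optional sanity check (assuming both columns are numeric), you can confirm that no missing or non-finite values remain:

# Count remaining missing and non-finite values (both should be all zeros)
check_cols = ['TimeSpent', 'FinalGradeCEMS']
print(sci_data_clean[check_cols].isna().sum())
print((~np.isfinite(sci_data_clean[check_cols])).sum())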

You can use libraries such as statsmodels or scikit-learn. We’ll dive much deeper into modeling in subsequent learning labs, but for now let’s see if there is a statistically significant relationship between students’ final grades (FinalGradeCEMS) and the time they spent in the LMS (TimeSpent):

# Import statsmodels for fitting an ordinary least squares (OLS) regression
import statsmodels.api as sm

# Add a constant term for the intercept to the independent variable
X = sm.add_constant(sci_data_clean['TimeSpent'])  # Independent variable
y = sci_data_clean['FinalGradeCEMS']  # Dependent variable

# Fit the model
model = sm.OLS(y, X).fit()
# Print the summary of the model
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:         FinalGradeCEMS   R-squared:                       0.134
Model:                            OLS   Adj. R-squared:                  0.132
Method:                 Least Squares   F-statistic:                     87.99
Date:                Fri, 19 Jul 2024   Prob (F-statistic):           1.53e-19
Time:                        12:14:17   Log-Likelihood:                -2548.5
No. Observations:                 573   AIC:                             5101.
Df Residuals:                     571   BIC:                             5110.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         65.8085      1.491     44.131      0.000      62.880      68.737
TimeSpent      0.0061      0.001      9.380      0.000       0.005       0.007
==============================================================================
Omnibus:                      136.292   Durbin-Watson:                   1.537
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              252.021
Skew:                          -1.381   Prob(JB):                     1.88e-55
Kurtosis:                       4.711   Cond. No.                     3.97e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.97e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
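
For comparison, the same model can be fit with scikit-learn’s LinearRegression, which reports coefficients and R-squared but not p-values; a minimal sketch, assuming scikit-learn is installed:

from sklearn.linear_model import LinearRegression

# scikit-learn expects a 2-D feature matrix and a 1-D target
X_sk = sci_data_clean[['TimeSpent']]
y_sk = sci_data_clean['FinalGradeCEMS']

lm = LinearRegression().fit(X_sk, y_sk)
print(lm.intercept_, lm.coef_[0])  # should be close to the statsmodels estimates
print(lm.score(X_sk, y_sk))        # R-squared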

Communicate

Krumm et al. (2018) have outlined the following three-step process for communicating findings with education stakeholders:

  1. Select. Selecting the analyses that are most important and useful to the intended audience, as well as a format for displaying that information (e.g., a chart or table).

  2. Polish. Refining or polishing data products by adding or editing titles, labels, and annotations, and by working with colors and shapes to highlight key points (a brief sketch follows this list).

  3. Narrate. Writing a narrative that pairs a data product with its related research question and describes how best to interpret and use the data product.
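
To illustrate the Polish step, here is one way to refine the earlier scatter plot with a title, clearer axis labels, and color (the specific wording and color choices are illustrative):

# A polished version of the earlier scatter plot
ax = sns.scatterplot(x='TimeSpent', y='FinalGradeCEMS', data=sci_data_clean,
                     color='steelblue', alpha=0.6)
ax.set_title('Time Spent in the LMS and Final Course Grade')
ax.set_xlabel('Time spent in LMS (minutes)')
ax.set_ylabel('Final grade (CEMS)')
plt.show()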

What’s Next?

Our First LASER Badge! Next you will complete an interactive “case study,” a key component of each learning lab.

Navigate to the Files tab and open the following file:

laser-lab-case-study.RMD

Acknowledgements

This work was supported by the National Science Foundation grants DRL-2025090 and DRL-2321128 (ECR:BCSER). Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.