Explore

Foundations Python Module 2: Code-Along

Welcome to the Foundations code-along for Module 2.

Exploratory Data Analysis (EDA) for educational researchers involves investigating and summarizing data sets to uncover patterns, spot anomalies, and test hypotheses, using statistical graphics and other data visualization methods.

This process helps researchers understand underlying trends in educational data before applying more complex analytical techniques.

Module Objectives

By the end of this module:

  • Data Visualization with Matplotlib and Seaborn: Learners will understand how to use Matplotlib and Seaborn to create various types of plots and graphs, enabling them to visualize data effectively and identify patterns and trends.

  • Data Transformation and Preprocessing: Learners will gain proficiency in transforming and preprocessing raw data using Python (pandas), ensuring the data is clean, correctly structured, and ready for analysis.

Explore setup

  • Data Visualization

  • Data Transformation

  • Data Preprocessing (DP)

  • Feature Engineering (FE)

Missing Values

import pandas as pd
import numpy as np

# read and load `sci-online-classes.csv` from data folder
time_spent = pd.read_csv("data/sci-online-classes.csv")

# Find cells with missing values
null_data = time_spent.isnull()

# Calculate the number of missing values
missing_count = null_data.sum()

print(missing_count)
student_id                 0
course_id                  0
total_points_possible      0
total_points_earned        0
percentage_earned          0
subject                    0
semester                   0
section                    0
Gradebook_Item             0
Grade_Category           603
FinalGradeCEMS            30
Points_Possible            0
Points_Earned             92
Gender                     0
q1                       123
q2                       126
q3                       123
q4                       125
q5                       127
q6                       127
q7                       129
q8                       129
q9                       129
q10                      129
TimeSpent                  5
TimeSpent_hours            5
TimeSpent_std              5
int                       76
pc                        75
uv                        75
dtype: int64
# Find the number of rows in the DataFrame
total_entries = len(time_spent)

# Calculate the percentage of missing values in each column
missing_percentage = (missing_count / total_entries) * 100
print(missing_percentage)
student_id                 0.000000
course_id                  0.000000
total_points_possible      0.000000
total_points_earned        0.000000
percentage_earned          0.000000
subject                    0.000000
semester                   0.000000
section                    0.000000
Gradebook_Item             0.000000
Grade_Category           100.000000
FinalGradeCEMS             4.975124
Points_Possible            0.000000
Points_Earned             15.257048
Gender                     0.000000
q1                        20.398010
q2                        20.895522
q3                        20.398010
q4                        20.729685
q5                        21.061360
q6                        21.061360
q7                        21.393035
q8                        21.393035
q9                        21.393035
q10                       21.393035
TimeSpent                  0.829187
TimeSpent_hours            0.829187
TimeSpent_std              0.829187
int                       12.603648
pc                        12.437811
uv                        12.437811
dtype: float64
# Calculate the percentage of missing data in each column
missing_percentage = (time_spent.isnull().sum() / len(time_spent)) * 100
print("Percentage of missing data in each column:")
print(missing_percentage[missing_percentage > 0])  # Only display columns that have missing values
Percentage of missing data in each column:
Grade_Category     100.000000
FinalGradeCEMS       4.975124
Points_Earned       15.257048
q1                  20.398010
q2                  20.895522
q3                  20.398010
q4                  20.729685
q5                  21.061360
q6                  21.061360
q7                  21.393035
q8                  21.393035
q9                  21.393035
q10                 21.393035
TimeSpent            0.829187
TimeSpent_hours      0.829187
TimeSpent_std        0.829187
int                 12.603648
pc                  12.437811
uv                  12.437811
dtype: float64
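
The count and percentage steps above can also be combined into one summary table. The sketch below is only an illustration: the `demo` DataFrame is a hypothetical stand-in for `time_spent` (it reuses a few of the real column names with made-up values), and `missing_summary` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course data (made-up values, not the real CSV)
demo = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "q1": [4.0, np.nan, 5.0, 3.0],
    "TimeSpent": [120.5, 80.0, np.nan, np.nan],
})

# Put the missing count and the missing percentage side by side.
# isnull().mean() is the share of missing cells per column.
missing_summary = pd.DataFrame({
    "missing_count": demo.isnull().sum(),
    "missing_percent": demo.isnull().mean() * 100,
})

# Keep only columns with gaps and list the largest share first
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
missing_summary = missing_summary.sort_values("missing_percent", ascending=False)

print(missing_summary)
```

Swapping `demo` for `time_spent` should give the same kind of table for the real data, assuming the CSV has been loaded as shown earlier.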

Handling Missing Data

import pandas as pd

# Sample DataFrame
data = {
    'A': [1, 2, None, 4, 5],
    'B': [None, 2, 3, 4, None],
    'C': [1, None, None, 4, 5]
}
df = pd.DataFrame(data)
missing_data = df.isnull().sum()
print("Missing values in each column:\n", missing_data)
Missing values in each column:
 A    1
B    2
C    2
dtype: int64
  • Use dropna() to remove rows or columns with missing values.
# Drop rows with any missing values
df_cleaned = df.dropna()

# Drop columns with any missing values
df_cleaned_cols = df.dropna(axis=1)

# Drop rows with missing values in a specific column
df_cleaned_specific = df.dropna(subset=['A'])
  • Use fillna() to fill missing values with a specific value or method.
# Fill missing values with a specific value
df_filled = df.fillna(999)

# Fill missing values with the mean of the column
df_filled_mean = df.fillna(df.mean())

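# Forward fill (propagate last valid observation forward)
# df.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
df_filled_ffill = df.ffill()

# Backward fill (propagate next valid observation backward)
# df.bfill() replaces fillna(method='bfill') for the same reason
df_filled_bfill = df.bfill()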
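
To see what each choice costs, it can help to compare how much data survives. This is a minimal sketch using the same sample values as above. The variable names (`rows_kept`, `cols_kept`, `rows_kept_a`, `filled`) are introduced here only for illustration.

```python
import pandas as pd

# Same sample values as the slide, rebuilt so this sketch runs on its own
df = pd.DataFrame({
    "A": [1, 2, None, 4, 5],
    "B": [None, 2, 3, 4, None],
    "C": [1, None, None, 4, 5],
})

# How much data is left after each dropna() variant?
rows_kept = len(df.dropna())                # rows with no missing values at all
cols_kept = df.dropna(axis=1).shape[1]      # columns with no missing values at all
rows_kept_a = len(df.dropna(subset=["A"]))  # rows where column A is present

print("Rows kept (any missing dropped):", rows_kept)
print("Columns kept (any missing dropped):", cols_kept)
print("Rows kept (only A checked):", rows_kept_a)

# Mean imputation keeps every row and fills each gap with its column mean
filled = df.fillna(df.mean())
print("Missing values after mean fill:", filled.isnull().sum().sum())
```

On this small sample, dropping rows keeps only one of the five rows and dropping columns keeps none, while filling keeps every row. Which trade-off is acceptable depends on the research question.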

Visualization

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It supports many types of data visualization, including line plots, histograms, power spectra, bar charts, error charts, and scatterplots. Matplotlib is particularly useful for creating high-quality graphs and figures for data analysis and publication.

Seaborn is a Python visualization library built on top of Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
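
Because Seaborn sits on top of Matplotlib, a Seaborn plotting function returns a Matplotlib `Axes` object that can then be customized with Matplotlib methods. The sketch below illustrates this with a hypothetical `toy` DataFrame (not the course data). The `Agg` backend line is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the course dataset
toy = pd.DataFrame({"subject": ["Biology", "Physics", "Biology", "Chemistry"]})

# Seaborn draws the bars and returns the underlying Matplotlib Axes
ax = sns.countplot(data=toy, x="subject")

# Because ax is a Matplotlib Axes, Matplotlib methods can customize it
ax.set_title("Toy Enrollment Counts")
ax.set_xlabel("Subject")
ax.set_ylabel("Count")

print(type(ax))
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` called at the end to display the figure.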

Visualize

import matplotlib.pyplot as plt
import seaborn as sns

# Create a bar plot
sns.countplot(data=time_spent, x='subject')
plt.show()

sns.countplot(data=time_spent, x='subject')
plt.title("Number of Student Enrollments per Subject")
plt.xlabel("Subject")
plt.ylabel("Count")
plt.show()

# read and load `data_to_explore.csv` from data folder
data_to_explore = pd.read_csv("data/data_to_explore.csv")

# Ensure gender NaNs are replaced
data_to_explore['gender'] = data_to_explore['gender'].fillna('Not Provided')

# Create a stacked bar plot
sns.histplot(data=data_to_explore, x='subject', hue='gender', multiple='stack', shrink=0.8)
plt.title("Stacked Gender Distribution Across Subjects")
plt.xlabel("Subject")
plt.ylabel("Count")
plt.show()
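
The numbers behind a stacked bar plot like this one can be checked with a cross-tabulation. This is a sketch on hypothetical `toy` rows that borrow the `subject` and `gender` column names. It is not the real `data_to_explore` file, and `counts` is a name introduced here.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as data_to_explore
toy = pd.DataFrame({
    "subject": ["Biology", "Biology", "Physics", "Physics", "Physics"],
    "gender": ["F", "M", "F", None, "F"],
})

# Same NaN handling as the slide: label missing gender explicitly
toy["gender"] = toy["gender"].fillna("Not Provided")

# Rows are subjects, columns are gender categories, cells are counts
counts = pd.crosstab(toy["subject"], toy["gender"])
print(counts)
```

Each row of `counts` corresponds to one bar, and each cell to one stacked segment, so the table is a quick way to confirm that the plot shows what is expected.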

# Create a scatter plot with color based on enrollment_status
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data_to_explore, x='time_spent_hours', y='proportion_earned', hue='enrollment_status')
plt.show()

👉 Your Turn: In your Python script, add a title and axis labels to the scatter plot above.

# (YOUR CODE HERE)
#
#

👉 Your Turn -> Answer

# (YOUR CODE HERE)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data_to_explore, x='time_spent_hours', y='proportion_earned', hue='enrollment_status')
plt.title("How Time Spent on Course LMS is Related to Points Earned in the Course by Enrollment Status")
plt.xlabel("Time Spent (Hours)")
plt.ylabel("Proportion of Points Earned")
plt.show()

What’s next?





  • Complete the Explore parts of the Case Study.
  • Complete the badge requirement document: Foundations badge - Data Sources.
  • Do the required readings for the next module, Foundations Module 3.