Explore

Foundations Python Module 2: Code-Along

Welcome to the Foundations code-along for Module 2

Exploratory Data Analysis (EDA) for educational researchers involves investigating and summarizing data sets to uncover patterns, spot anomalies, and test hypotheses, using statistical graphics and other data visualization methods.

This process helps researchers understand underlying trends in educational data before applying more complex analytical techniques.

Module Objectives

By the end of this module:

Data Visualization with Matplotlib and Seaborn:
Learners will understand how to use Matplotlib and Seaborn to create various types of plots and graphs, enabling them to visualize data effectively and identify patterns and trends.

Data Transformation and Preprocessing:
Learners will gain proficiency in transforming and preprocessing raw data using Python (pandas), ensuring the data is clean, structured correctly, and ready for analysis.

Explore setup

Data Visualization
Data Transformation
Data Preprocessing (DP)
Feature Engineering (FE)

EXPLORE PHASE This is essentially exploratory data analysis. It lets us gain enough understanding of the data to decide on a course of action and to identify areas we can explore in the modeling phase. It entails the use of descriptive statistics and data visualizations, applied in an unstructured way to uncover initial patterns, characteristics, and points of interest. Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again. The goal of data exploration is to generate many promising leads that you can later explore in more depth.
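The descriptive statistics mentioned above can be sketched with a few pandas calls. This is a minimal illustration on a hypothetical toy DataFrame: the column names echo the course data, but the subject labels and values are invented for the example.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "subject": ["Biology", "Biology", "Physics", "Physics", "Forensics"],
    "TimeSpent_hours": [10.0, 20.0, 30.0, None, 40.0],
})

# Numeric summary: count, mean, std, min, quartiles, max (missing values are skipped)
summary = toy["TimeSpent_hours"].describe()
print(summary)

# How often each category appears in a text column
subject_counts = toy["subject"].value_counts()
print(subject_counts)

# Mean of a numeric column within each category
mean_hours_by_subject = toy.groupby("subject")["TimeSpent_hours"].mean()
print(mean_hours_by_subject)
```

Each call answers a quick question about the data, which fits the generate-a-hypothesis, test, repeat loop described above.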

VISUALIZATION One goal in this phase is to explore the questions that drive the original analysis and to develop new questions and hypotheses to test in later stages. ggplot2, one of the core members of the tidyverse in R, is built on a grammar of graphics that organizes and makes sense of the different elements of a plot (Wilkinson, 2005). In this Python module, Matplotlib and Seaborn are used for visualization.

TRANSFORMATION DATA PREPROCESSING in the wrangling phase is usually about getting large volumes of data from the sources (databases, object stores, data lakes, etc.) and performing basic data cleaning and data wrangling to prepare them for later stages. To explore the data here, you may need to wrangle or preprocess the data further to get descriptive data.

FEATURE ENGINEERING is the process of transforming raw data (that has already been preprocessed) into features that better represent the underlying problem to predictive models, with the aim of improving model accuracy on unseen data.
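As a rough sketch of what feature engineering can look like in pandas, the example below derives new columns from existing ones. The rows are hypothetical, the assumption that `TimeSpent` is recorded in minutes is mine, and the `high_engagement` flag with its 1.5-hour cutoff is an illustrative name and threshold, not part of the course data.

```python
import pandas as pd

# Hypothetical toy rows; values are invented for illustration only.
toy = pd.DataFrame({
    "total_points_possible": [100, 200, 50],
    "total_points_earned": [80, 150, 50],
    "TimeSpent": [120.0, 90.0, 30.0],  # assumed here to be minutes
})

# Derive ratio and unit-converted features from the raw columns
features = toy.assign(
    proportion_earned=lambda d: d["total_points_earned"] / d["total_points_possible"],
    time_spent_hours=lambda d: d["TimeSpent"] / 60,
)

# Hypothetical binary feature: at least 1.5 hours spent in the LMS
features["high_engagement"] = features["time_spent_hours"] >= 1.5

print(features)
```

Ratios and flags like these are often easier for a model to use than raw totals, because they put students from courses with different point scales on a common footing.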

import pandas as pd
import numpy as np

# read and load `sci-online-classes.csv` from data folder
time_spent = pd.read_csv("data/sci-online-classes.csv")

# Find cells with missing values
null_data = time_spent.isnull()

# Calculate the number of missing values
missing_count = null_data.sum()

print(missing_count)

student_id                 0
course_id                  0
total_points_possible      0
total_points_earned        0
percentage_earned          0
subject                    0
semester                   0
section                    0
Gradebook_Item             0
Grade_Category           603
FinalGradeCEMS            30
Points_Possible            0
Points_Earned             92
Gender                     0
q1                       123
q2                       126
q3                       123
q4                       125
q5                       127
q6                       127
q7                       129
q8                       129
q9                       129
q10                      129
TimeSpent                  5
TimeSpent_hours            5
TimeSpent_std              5
int                       76
pc                        75
uv                        75
dtype: int64

# Find the number of rows in the DataFrame
total_entries = len(time_spent)

# Calculate the percentage of missing values in each column
missing_percentage = (missing_count / total_entries) * 100
print(missing_percentage)

student_id                 0.000000
course_id                  0.000000
total_points_possible      0.000000
total_points_earned        0.000000
percentage_earned          0.000000
subject                    0.000000
semester                   0.000000
section                    0.000000
Gradebook_Item             0.000000
Grade_Category           100.000000
FinalGradeCEMS             4.975124
Points_Possible            0.000000
Points_Earned             15.257048
Gender                     0.000000
q1                        20.398010
q2                        20.895522
q3                        20.398010
q4                        20.729685
q5                        21.061360
q6                        21.061360
q7                        21.393035
q8                        21.393035
q9                        21.393035
q10                       21.393035
TimeSpent                  0.829187
TimeSpent_hours            0.829187
TimeSpent_std              0.829187
int                       12.603648
pc                        12.437811
uv                        12.437811
dtype: float64

# Calculate the percentage of missing data in each column
missing_percentage = (time_spent.isnull().sum() / len(time_spent)) * 100
print("Percentage of missing data in each column:")
print(missing_percentage[missing_percentage > 0])  # Only display columns that have missing values

Percentage of missing data in each column:
Grade_Category     100.000000
FinalGradeCEMS       4.975124
Points_Earned       15.257048
q1                  20.398010
q2                  20.895522
q3                  20.398010
q4                  20.729685
q5                  21.061360
q6                  21.061360
q7                  21.393035
q8                  21.393035
q9                  21.393035
q10                 21.393035
TimeSpent            0.829187
TimeSpent_hours      0.829187
TimeSpent_std        0.829187
int                 12.603648
pc                  12.437811
uv                  12.437811
dtype: float64

Method chaining is a powerful technique that allows us to perform multiple operations in a concise and readable manner by linking method calls together in a single line of code.

First, we use the isnull() method on the time_spent DataFrame to identify missing values. This method creates a DataFrame of the same shape but with Boolean values—True for missing values and False for non-missing values. Next, we call the sum() method on the result of isnull(), which sums up the True values (treated as 1) for each column, giving us the total number of missing values in each column.

We then use the len() function to find the total number of rows in the time_spent DataFrame. This value serves as the denominator for calculating the percentage of missing values. By dividing the number of missing values in each column by the total number of rows and multiplying by 100, we convert the ratio into a percentage. This operation is done using the expression (time_spent.isnull().sum() / len(time_spent)) * 100.
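The chain described above can be extended with a filter and a sort. The sketch below uses a hypothetical toy DataFrame rather than the course file, and relies on the fact that the mean of a Boolean column equals the fraction of True values, so `.mean()` gives the same result as `.sum() / len(...)`.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "q1": [1.0, None, 3.0, None],
    "q2": [None, 2.0, 3.0, 4.0],
    "Gender": ["F", "M", "F", "M"],
})

missing_pct = (
    toy.isnull()                    # Boolean frame: True where a value is missing
    .mean()                         # mean of Booleans = fraction missing per column
    .mul(100)                       # convert the fraction to a percentage
    .loc[lambda s: s > 0]           # keep only columns that have missing values
    .sort_values(ascending=False)   # most incomplete columns first
)
print(missing_pct)

# Same numbers as the sum/len form used in the text
same_as_sum_len = (toy.isnull().sum() / len(toy)) * 100
```

Wrapping the chain in parentheses lets each step sit on its own line with a comment, which keeps longer chains readable.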

Handling Missing Data

Mock generation code
Identify
Drop rows or columns
Filling

import pandas as pd

# Sample DataFrame
data = {
    'A': [1, 2, None, 4, 5],
    'B': [None, 2, 3, 4, None],
    'C': [1, None, None, 4, 5]
}
df = pd.DataFrame(data)

missing_data = df.isnull().sum()
print("Missing values in each column:\n", missing_data)

Missing values in each column:
 A    1
B    2
C    2
dtype: int64

Use dropna() to remove rows or columns with missing values.

# Drop rows with any missing values
df_cleaned = df.dropna()

# Drop columns with any missing values
df_cleaned_cols = df.dropna(axis=1)

# Drop rows with missing values in a specific column
df_cleaned_specific = df.dropna(subset=['A'])
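To see how much data each option discards, it can help to compare shapes. The sketch below reuses the sample DataFrame from above and adds `thresh`, a `dropna()` parameter that keeps rows with at least that many non-missing values.

```python
import pandas as pd

# Same sample DataFrame as above
df = pd.DataFrame({
    "A": [1, 2, None, 4, 5],
    "B": [None, 2, 3, 4, None],
    "C": [1, None, None, 4, 5],
})

print(df.shape)                        # (5, 3) original
print(df.dropna().shape)               # (1, 3) only one fully complete row
print(df.dropna(axis=1).shape)         # (5, 0) every column has a missing value
print(df.dropna(subset=["A"]).shape)   # (4, 3) rows where A is present
print(df.dropna(thresh=2).shape)       # (4, 3) rows with at least 2 non-missing values
```

Here dropping every row with any missing value leaves a single row, which is a good reason to check shapes before and after and to consider `subset` or `thresh` instead.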

Use fillna() to fill missing values with a specific value, or ffill() and bfill() to propagate neighboring values.

# Fill missing values with a specific value
df_filled = df.fillna(999)

# Fill missing values with the mean of the column
df_filled_mean = df.fillna(df.mean())

# Forward fill (propagate last valid observation forward)
# Note: fillna(method='ffill') is deprecated in recent pandas versions
df_filled_ffill = df.ffill()

# Backward fill (propagate next valid observation backward)
df_filled_bfill = df.bfill()
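Two details are worth checking when filling: forward and backward fills cannot fill a gap at the edge they start from, and different columns may call for different fill values. The sketch below reuses the sample DataFrame; the particular choices of mean, median, and 0 per column are illustrative assumptions, not a recommendation.

```python
import pandas as pd

# Same sample DataFrame as above
df = pd.DataFrame({
    "A": [1, 2, None, 4, 5],
    "B": [None, 2, 3, 4, None],
    "C": [1, None, None, 4, 5],
})

# Forward fill has nothing to copy into the first row of B,
# and backward fill has nothing to copy into the last row of B.
ffilled = df.ffill()
bfilled = df.bfill()
print(ffilled["B"].tolist())  # [nan, 2.0, 3.0, 4.0, 4.0]
print(bfilled["B"].tolist())  # [2.0, 2.0, 3.0, 4.0, nan]

# A dict passed to fillna() sets a separate fill value for each column
per_column = df.fillna({
    "A": df["A"].mean(),     # mean of 1, 2, 4, 5 is 3.0
    "B": df["B"].median(),   # median of 2, 3, 4 is 3.0
    "C": 0,
})
print(per_column)
```

Calling `isnull().sum()` on the result is a quick way to confirm whether any gaps remain after filling.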

Visualization

Matplotlib Package
Seaborn Package
Functions 101

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is used for various types of data visualization, allowing users to generate plots, histograms, power spectra, bar charts, error charts, scatterplots, and more. Matplotlib is particularly useful for creating high-quality graphs and figures for data analysis and publication.

Seaborn is a Python visualization library built on top of Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
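To make the relationship between the two libraries concrete, the sketch below draws the same bar chart twice, once with plain Matplotlib and once with Seaborn. The toy DataFrame and its subject labels are hypothetical and stand in for the course data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "subject": ["Biology", "Biology", "Physics", "Forensics", "Physics", "Biology"]
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(10, 4))

# Matplotlib: count the categories yourself, then draw the bars
counts = toy["subject"].value_counts()
ax_mpl.bar(counts.index, counts.values)
ax_mpl.set_title("Matplotlib: bar()")

# Seaborn: countplot() does the counting and draws onto a Matplotlib axes
sns.countplot(data=toy, x="subject", ax=ax_sns)
ax_sns.set_title("Seaborn: countplot()")

fig.tight_layout()
plt.show()
```

Because Seaborn draws onto Matplotlib axes, Matplotlib calls such as `set_title()` or `plt.xlabel()` still work on a Seaborn plot.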

Visualize

Basic plot
Add labels
Stacked Bar plot
Basic Scatter Plot
👉 Your Turn ⤵
ANSWER

import matplotlib.pyplot as plt
import seaborn as sns

# Create a bar plot
sns.countplot(data=time_spent, x='subject')
plt.show()

sns.countplot(data=time_spent, x='subject')
plt.title("Number of Student Enrollments per Subject")
plt.xlabel("Subject")
plt.ylabel("Count")
plt.show()

# read and load `data_to_explore.csv` from data folder
data_to_explore = pd.read_csv("data/data_to_explore.csv")

# Ensure gender NaNs are replaced
data_to_explore['gender'] = data_to_explore['gender'].fillna('Not Provided')

# Create a stacked bar plot
sns.histplot(data=data_to_explore, x='subject', hue='gender', multiple='stack', shrink=0.8)
plt.title("Stacked Gender Distribution Across Subjects")
plt.xlabel("Subject")
plt.ylabel("Count")
plt.show()

# Create a scatter plot with color based on enrollment_status
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data_to_explore, x='time_spent_hours', y='proportion_earned', hue='enrollment_status')
plt.show()

Use your Python script to add labels to the visualization.

# (YOUR CODE HERE)
#
#

👉 Your Turn ⤵ -> Answer

# Answer: scatter plot with a title and axis labels
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data_to_explore, x='time_spent_hours', y='proportion_earned', hue='enrollment_status')
plt.title("How Time Spent on Course LMS is Related to Points Earned in the Course by Enrollment Status")
plt.xlabel("Time Spent (Hours)")
plt.ylabel("Proportion of Points Earned")
plt.show()

What’s next?

Complete the Explore parts of the Case Study.
Complete the Badge requirement document Foundations badge - Data Sources
Do the required readings for the next module, Foundations Module 3.