Exploratory Data Analysis

Foundations Module 2: A Code-Along

Welcome to the Foundations code-along for Module 2.

Exploratory Data Analysis (EDA) for educational researchers involves investigating and summarizing data sets to uncover patterns, spot anomalies, and test hypotheses, using statistical graphics and other data visualization methods.

This process helps researchers understand underlying trends in educational data before applying more complex analytical techniques.

Module Objectives

By the end of this module:

  • Data Visualization with ggplot2: Learners will understand how to use ggplot2 to create various types of plots and graphs, enabling them to visualize data effectively and identify patterns and trends.

  • Data Transformation and Preprocessing: Learners will gain proficiency in transforming and preprocessing raw data using R, ensuring the data is clean, structured correctly, and ready for analysis.

Explore setup

  • Data Visualization

  • Data Transformation

  • Data Preprocessing (DP)

  • Feature Engineering (FE)
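
To make the preprocessing and feature engineering steps concrete, here is a minimal sketch on a small made-up data frame. The toy values, the name toy_clean, and the assumption that time_spent is recorded in minutes are illustrative only; the course data may differ.

```r
library(dplyr)

# Hypothetical toy data standing in for 'data_to_explore'
toy <- tibble(
  subject = c("OcnA", "PhysA", "BioA", "OcnA"),
  time_spent = c(120, 90, NA, 300)  # assumed to be minutes
)

toy_clean <- toy %>%
  # preprocessing: drop rows with a missing time_spent value
  filter(!is.na(time_spent)) %>%
  # feature engineering: derive hours from minutes (an assumption)
  mutate(time_spent_hours = time_spent / 60)

toy_clean
```

Deriving time_spent_hours once, up front, keeps later plotting code free of unit conversions.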

SKIMR FUNCTION

👉 Your Turn


#LOAD SKIMR package and use skim() function to skim 'data_to_explore'
#
#

👉 Your Turn -> Answer


#LOAD SKIMR package and use skim() function to skim 'data_to_explore'
#load libraries
library(skimr)
library(dplyr)  # provides %>%, select(), and filter()

#skim the full data set
skim(data_to_explore)

#skim selected variables for two subjects only
data_to_explore %>% 
  select(c('subject', 'gender', 'proportion_earned', 'time_spent')) %>% 
  filter(subject == "OcnA" | subject == "PhysA") %>%
  skim() 

ggplot2

Do you need all of these things to create a graph?

ggplot(data_to_explore, aes(x=subject)) + 
  geom_bar()

ggplot(data_to_explore)+
  geom_bar(aes(x=subject))

data_to_explore %>% 
  ggplot(aes(x = subject)) +
  geom_bar()
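
The three snippets above draw the same bar chart; they differ only in where the data and the aesthetic mapping are supplied. As a hedged check on a made-up data frame (the toy tibble and the names p1, p2, p3, and counts are hypothetical), layer_data() can be used to compare the bar heights each version computes.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data standing in for 'data_to_explore'
toy <- tibble(subject = c("OcnA", "OcnA", "PhysA", "BioA", "BioA", "BioA"))

# data and mapping both in ggplot()
p1 <- ggplot(toy, aes(x = subject)) + geom_bar()
# mapping moved into the geom
p2 <- ggplot(toy) + geom_bar(aes(x = subject))
# data piped into ggplot()
p3 <- toy %>% ggplot(aes(x = subject)) + geom_bar()

# computed count per bar, taken from the first layer of each plot
counts <- lapply(list(p1, p2, p3), function(p) layer_data(p, 1)$count)
counts
```

So the data and the mapping are required, but where they are placed is a matter of style.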

R Software Handbook

ggplot2 - Histogram

👉 Your Turn

In the corresponding code script, add the code for a basic histogram of ‘time_spent_hours’.

#
#
#

👉 Your Turn -> Answer

# Layer 1: add data and aesthetic mapping
data_to_explore %>% 
  ggplot(aes(x = time_spent_hours)) +
# layer 2: add histogram geom
  geom_histogram()

# Layer 1: add data and aesthetic mapping
data_to_explore %>% 
  ggplot(aes(x = time_spent_hours)) +
# layer 2: add histogram geom 
# layer 3a: set the number of bins
  geom_histogram(bins = 10)

# Layer 1: add data and aesthetic mapping
data_to_explore %>% 
  ggplot(aes(x = time_spent_hours)) +
# layer 2: add histogram geom 
# layer 3a: set the number of bins
# layer 3b: add color
  geom_histogram(bins = 30,
                 fill = "red",
                 colour = "black") 


# Layer 1: add data and aesthetic mapping
data_to_explore %>% 
  ggplot(aes(x = time_spent_hours)) +
# layer 2: add histogram geom 
# layer 3a: set the number of bins
# layer 3b: add color
  geom_histogram(bins = 30, fill = "red", colour = "black") +
# layer 4: add labels
  labs(title = "Time Spent on LMS histogram plot", x = "Time Spent (hours)", y = "Count") +
  theme_classic()

How would we interpret this graph?
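
One way to support a visual reading of a histogram is to compare the mean and the median: a mean well above the median suggests a right-skewed distribution with a long upper tail. A minimal sketch on made-up values (the hours vector is hypothetical, not the course data):

```r
# Hypothetical hours values; not the course data
hours <- c(1, 2, 2, 3, 3, 4, 5, 8, 15, 40)

# five-number summary plus the mean
summary(hours)

# a mean noticeably larger than the median hints at a right skew
mean(hours) > median(hours)
```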

SCATTERPLOT

#layer 1: add data and aesthetics mapping 
ggplot(data_to_explore,
       aes(x = time_spent_hours, 
           y = proportion_earned)) +
#layer 2: +  geom function type
  geom_point()

#layer 1: add data and aesthetics mapping 
#layer 3: add color scale by type
ggplot(data_to_explore, 
       aes(x = time_spent_hours, 
           y = proportion_earned,
           color = enrollment_status)) +
#layer 2: +  geom function type
  geom_point()

#layer 1: add data and aesthetics mapping 
#layer 3: add color scale by type
ggplot(data_to_explore, 
       aes(x = time_spent_hours, 
           y = proportion_earned,
           color = enrollment_status)) +
#layer 2: +  geom function type
  geom_point() +
#layer 4: add labels
  labs(title="How Time Spent on Course LMS is Related to Points Earned in the Course",
       x="Time Spent (Hours)",
       y = "Proportion of Points Earned")

#layer 1: add data and aesthetics mapping 
#layer 3: add color scale by type
viz1 <- ggplot(data_to_explore, aes(x = time_spent_hours, y = proportion_earned, color = enrollment_status)) +
#layer 2: +  geom function type
  geom_point() +
#layer 4: add labels
  labs(title="How Time Spent on Course LMS is Related to Points Earned in the Course", 
       x="Time Spent (Hours)",
       y = "Proportion of Points Earned") +
#layer 5: add facet wrap
  facet_wrap(~ subject)

In the corresponding line of your R script, type the name of the visualization object we just created and run the code:

# 
#
#

How would you interpret this graph?

What’s next?





  • Complete the Explore parts of the Case Study.
  • Complete the badge requirement document: Foundations badge - Data Sources.
  • Do the required readings for the next module, Foundations Module 3.