Prepare and Wrangle

Foundations Python Module 1: Code-Along

Welcome to the Foundations code-along for Module 1

The Foundations of Learning Analytics modules are designed for those seeking an introductory understanding of learning analytics along with basic R or Python programming skills, particularly in the context of STEM education research. This code-along prepares you for the first section of the case study.



Module Objectives

By the end of this module, you will:

  • Know how to read in the data:
    • Identify and describe different types of learning environments, explaining their unique features and applications in educational research.
  • Understand the characteristics of data:
    • Recognize and categorize the data formats commonly used in educational research.

Loading and Installing packages

Install Packages

  • First time using a package
  • Do this ONLY ONCE in the “terminal”

Load Packages

Let’s start by creating a new Python script and loading some essential packages introduced in LA Workflows:

  • First, install the package in the terminal

MAC/LINUX: $ python3 -m pip install pandas


Windows: $ py -m pip install pandas


  • Second, load the package in your Python script or Quarto document
# Load necessary libraries
import pandas as pd
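
A quick way to confirm that the install worked is to import the package and print its version. This is a minimal sketch, assuming pandas was installed into the same Python environment that runs your script; `installed_version` is just an illustrative variable name.

```python
# Confirm that pandas imports cleanly and report the installed version
import pandas as pd

installed_version = pd.__version__  # version string of the installed pandas
print(f"pandas version: {installed_version}")
```

If the import raises `ModuleNotFoundError`, the package was most likely installed into a different Python environment than the one running the script.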

NumPy is a Python package for scientific computing. It is used for data manipulation and mathematical operations, particularly when working with large datasets or complex computations.

Use your Python script to import the numpy package as np.


If this is your first time using NumPy, you will first need to install it in the terminal:


MAC/LINUX: $ python3 -m pip install numpy


Windows: $ py -m pip install numpy


# YOUR CODE HERE
#
#

👉 Your Turn -> Answer


MAC/LINUX: $ python3 -m pip install numpy


Windows: $ py -m pip install numpy


# YOUR CODE HERE
import numpy as np
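
To give a feel for the kind of mathematical operation NumPy handles, here is a small sketch. The numbers are the total_points_earned and total_points_possible values from the first three rows of the sci-online-classes.csv preview in the Reading in Data section; the array names are illustrative and not part of the case study.

```python
import numpy as np

# Points earned and points possible for three students
# (values copied from the sci-online-classes.csv preview in this module)
points_earned = np.array([2220, 2672, 1897])
points_possible = np.array([3280, 3531, 2870])

# Element-wise division: one proportion per student, no loop needed
proportion_earned = points_earned / points_possible
print(proportion_earned.round(6))

# Summary statistics are available as array methods
print(proportion_earned.mean())
```

The proportions should agree with the percentage_earned column in the preview, which suggests that column was computed the same way.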

Reading in Data

Reading and Inspecting data

Copy the code below into your script and run it. What do you see?

import pandas as pd

# read and load `sci-online-classes.csv` from data folder
time_spent = pd.read_csv("data/sci-online-classes.csv")

# inspect data
time_spent.head()
   student_id      course_id  total_points_possible  total_points_earned  percentage_earned  subject  semester  section                       Gradebook_Item  Grade_Category  ...   q7   q8   q9  q10  TimeSpent  TimeSpent_hours  TimeSpent_std  int   pc        uv
0       43146  FrScA-S216-02                   3280                 2220           0.676829    FrScA      S216        2  POINTS EARNED & TOTAL COURSE POINTS             NaN  ...  5.0  5.0  4.0  5.0  1555.1667        25.919445      -0.180515  5.0  4.5  4.333333
1       44638   OcnA-S116-01                   3531                 2672           0.756726     OcnA      S116        1                            ATTEMPTED             NaN  ...  4.0  5.0  4.0  4.0  1382.7001        23.045002      -0.307803  4.2  3.5  4.000000
2       47448  FrScA-S216-01                   2870                 1897           0.660976    FrScA      S216        1  POINTS EARNED & TOTAL COURSE POINTS             NaN  ...  4.0  5.0  3.0  5.0   860.4335        14.340558      -0.693260  5.0  4.0  3.666667
3       47979   OcnA-S216-01                   4562                 3090           0.677335     OcnA      S216        1  POINTS EARNED & TOTAL COURSE POINTS             NaN  ...  4.0  5.0  5.0  5.0  1598.6166        26.643610      -0.148447  5.0  3.5  5.000000
4       48797  PhysA-S116-01                   2207                 1910           0.865428    PhysA      S116        1  POINTS EARNED & TOTAL COURSE POINTS             NaN  ...  4.0  4.0  NaN  3.0  1481.8000        24.696667      -0.234663  3.8  3.5  3.500000

5 rows × 30 columns
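
Beyond head(), a few other standard pandas tools help when inspecting a freshly loaded table: shape, dtypes, and isna().sum(). The sketch below builds a small in-memory CSV from three rows and five columns of the preview above so that it runs without the data file; `csv_text` and `sample` are illustrative names, and with the real file you would call the same methods on time_spent.

```python
import io

import pandas as pd

# A tiny stand-in for the CSV file, using values from the preview above
csv_text = """student_id,course_id,total_points_possible,total_points_earned,percentage_earned
43146,FrScA-S216-02,3280,2220,0.676829
44638,OcnA-S116-01,3531,2672,0.756726
47448,FrScA-S216-01,2870,1897,0.660976
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)         # (number of rows, number of columns)
print(sample.dtypes)        # data type pandas inferred for each column
print(sample.isna().sum())  # count of missing values per column
```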

Use your Python script to read in log-data.csv from the data folder, using the pd.read_csv function. Save it to a new object called online_classes.

# YOUR CODE HERE
#
#

👉 Your Turn -> Answer

import pandas as pd

# read and load the log file from the data folder
online_classes = pd.read_csv("data/log-data.csv")

# inspect data
online_classes.head()
   student_id      course_id  gender                   enrollment_reason  enrollment_status  time_spent
0       60186  AnPhA-S116-01       M  Course Unavailable at Local School  Approved/Enrolled   2087.0501
1       66693  AnPhA-S116-01       M  Course Unavailable at Local School  Approved/Enrolled   2309.0334
2       66811  AnPhA-S116-01       F  Course Unavailable at Local School  Approved/Enrolled   5298.8507
3       66862  AnPhA-S116-01       F  Course Unavailable at Local School  Approved/Enrolled   1746.9667
4       67508  AnPhA-S116-01       F                 Scheduling Conflict  Approved/Enrolled   2668.1830

Joining data

Let’s create some mock data to practice joining

# Create mock data
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'major': ['Math', 'Physics', 'Biology', 'Computer Science']
})
students
   student_id     name             major
0           1    Alice              Math
1           2      Bob           Physics
2           3  Charlie           Biology
3           4    David  Computer Science
# Create mock data
scores = pd.DataFrame({
    'student_id': [1, 2, 3, 5],
    'score': [85, 90, 75, 80]
})
scores
   student_id  score
0           1     85
1           2     90
2           3     75
3           5     80
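
Before joining, it can help to check which student_id values the two tables share, since that overlap determines what each join type returns. This is a minimal sketch using plain Python sets on the same mock data; the names `in_both`, `only_in_students`, and `only_in_scores` are illustrative.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'major': ['Math', 'Physics', 'Biology', 'Computer Science']
})
scores = pd.DataFrame({
    'student_id': [1, 2, 3, 5],
    'score': [85, 90, 75, 80]
})

# Collect the join keys from each table as sets of plain Python ints
student_keys = set(students['student_id'].tolist())
score_keys = set(scores['student_id'].tolist())

in_both = sorted(student_keys & score_keys)           # kept by every join type
only_in_students = sorted(student_keys - score_keys)  # kept by left and full joins
only_in_scores = sorted(score_keys - student_keys)    # kept by right and full joins

print(in_both)           # [1, 2, 3]
print(only_in_students)  # [4]
print(only_in_scores)    # [5]
```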
# Inner join: Returns rows that have matching values in both tables
inner_join_result = pd.merge(students, scores, on='student_id', how='inner')
inner_join_result
   student_id     name    major  score
0           1    Alice     Math     85
1           2      Bob  Physics     90
2           3  Charlie  Biology     75

# Left join: Returns all rows from the left table and the matching rows from the right table. Where there is no match, the right-table columns are NaN.
left_join_result = pd.merge(students, scores, on='student_id', how='left')
left_join_result
   student_id     name             major  score
0           1    Alice              Math   85.0
1           2      Bob           Physics   90.0
2           3  Charlie           Biology   75.0
3           4    David  Computer Science    NaN

# Right join: Returns all rows from the right table and the matching rows from the left table. Where there is no match, the left-table columns are NaN.
right_join_result = pd.merge(students, scores, on='student_id', how='right')
right_join_result
   student_id     name    major  score
0           1    Alice     Math     85
1           2      Bob  Physics     90
2           3  Charlie  Biology     75
3           5      NaN      NaN     80

# Full join: Returns all rows from both tables. Where there is no match, the missing values are NaN.
full_join_result = pd.merge(students, scores, on='student_id', how='outer')
full_join_result
   student_id     name             major  score
0           1    Alice              Math   85.0
1           2      Bob           Physics   90.0
2           3  Charlie           Biology   75.0
3           4    David  Computer Science    NaN
4           5      NaN               NaN   80.0
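
One way to see where each row of a join came from is the indicator argument of pd.merge, which adds a _merge column marking rows as both, left_only, or right_only. The sketch below applies it to the same mock data; the names `audit` and `unmatched` are illustrative, and the explicit sort is only there so the row order does not depend on the pandas version.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'major': ['Math', 'Physics', 'Biology', 'Computer Science']
})
scores = pd.DataFrame({
    'student_id': [1, 2, 3, 5],
    'score': [85, 90, 75, 80]
})

# Full join with an extra _merge column describing each row's origin
audit = pd.merge(students, scores, on='student_id', how='outer', indicator=True)
audit = audit.sort_values('student_id').reset_index(drop=True)
print(audit[['student_id', '_merge']])

# Rows that appear in only one of the two tables
unmatched = audit[audit['_merge'] != 'both']
print(unmatched)
```

Rows flagged left_only or right_only are the ones that would be dropped by an inner join, which makes this a convenient check before choosing a join type.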

What’s next?

  • Complete the Prepare and Wrangle parts of the Case Study.
  • Complete the badge requirement document: Foundations badge - Data Sources.
  • Do the required readings for Foundations Module 2.