Foundations Python Module 1: Code-Along
The Foundations of Learning Analytics modules are designed for those seeking an introductory understanding of learning analytics and basic programming skills in either R or Python, particularly in the context of STEM education research. This code-along prepares you for the first section of the case study.
By the end of this module:
Data:
Let’s start by creating a new Python script and loading some essential packages introduced in LA Workflows:
NumPy is a Python package for scientific computing. It is used for data manipulation and mathematical operations, particularly when working with large datasets or complex numerical computations.
Use your Python script to import the numpy package as np.
If this is your first time using NumPy, you will need to install the package from the terminal first:
macOS/Linux: $ python3 -m pip install numpy
Windows: $ py -m pip install numpy
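Once the package is installed, the import and a first calculation might look like the sketch below. The values are borrowed from the TimeSpent column of the course data used later in this module; the variable names and the assumption that TimeSpent is recorded in minutes are for illustration only.

```python
import numpy as np

# Five TimeSpent values from the course data (assumed to be minutes)
time_spent_minutes = np.array([1555.1667, 1382.7001, 860.4335, 1598.6166, 1481.8000])

# Arithmetic on a NumPy array applies to every element at once
time_spent_hours = time_spent_minutes / 60

print(time_spent_hours)
print(time_spent_hours.mean())
```

Dividing the whole array in one step, rather than looping over each value, is the main convenience NumPy adds over plain Python lists.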
Copy the code below into your script and run it. The code uses the pandas package to read a CSV file into a DataFrame. What do you see?
import pandas as pd
# read and load `sci-online-classes.csv` from data folder
time_spent = pd.read_csv("data/sci-online-classes.csv")
# inspect the first five rows
time_spent.head()
| | student_id | course_id | total_points_possible | total_points_earned | percentage_earned | subject | semester | section | Gradebook_Item | Grade_Category | ... | q7 | q8 | q9 | q10 | TimeSpent | TimeSpent_hours | TimeSpent_std | int | pc | uv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43146 | FrScA-S216-02 | 3280 | 2220 | 0.676829 | FrScA | S216 | 2 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 5.0 | 5.0 | 4.0 | 5.0 | 1555.1667 | 25.919445 | -0.180515 | 5.0 | 4.5 | 4.333333 |
| 1 | 44638 | OcnA-S116-01 | 3531 | 2672 | 0.756726 | OcnA | S116 | 1 | ATTEMPTED | NaN | ... | 4.0 | 5.0 | 4.0 | 4.0 | 1382.7001 | 23.045002 | -0.307803 | 4.2 | 3.5 | 4.000000 |
| 2 | 47448 | FrScA-S216-01 | 2870 | 1897 | 0.660976 | FrScA | S216 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 4.0 | 5.0 | 3.0 | 5.0 | 860.4335 | 14.340558 | -0.693260 | 5.0 | 4.0 | 3.666667 |
| 3 | 47979 | OcnA-S216-01 | 4562 | 3090 | 0.677335 | OcnA | S216 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 4.0 | 5.0 | 5.0 | 5.0 | 1598.6166 | 26.643610 | -0.148447 | 5.0 | 3.5 | 5.000000 |
| 4 | 48797 | PhysA-S116-01 | 2207 | 1910 | 0.865428 | PhysA | S116 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 4.0 | 4.0 | NaN | 3.0 | 1481.8000 | 24.696667 | -0.234663 | 3.8 | 3.5 | 3.500000 |
5 rows × 30 columns
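Beyond head(), a few other attributes help when inspecting a freshly loaded DataFrame. The sketch below reads a hand-typed three-row excerpt of the table above from an in-memory string, so it runs without the CSV file; the names csv_text and sample are hypothetical.

```python
import io
import pandas as pd

# A three-row, three-column excerpt of the table above, typed in by hand
csv_text = """student_id,course_id,TimeSpent
43146,FrScA-S216-02,1555.1667
44638,OcnA-S116-01,1382.7001
47448,FrScA-S216-01,860.4335
"""

# pd.read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # data type inferred for each column
print(sample.head(2))  # first two rows only
```

With the real file, the same calls on time_spent would report the full row count and all 30 columns.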
Use your Python script to read in log-data.csv from the data folder, using the pd.read_csv function. Save it to a new object called online_classes.
👉 Your Turn ⤵ -> Answer
import pandas as pd
# read and load the log file from the data folder
online_classes = pd.read_csv("data/log-data.csv")
# inspect the first five rows
online_classes.head()
| | student_id | course_id | gender | enrollment_reason | enrollment_status | time_spent |
|---|---|---|---|---|---|---|
| 0 | 60186 | AnPhA-S116-01 | M | Course Unavailable at Local School | Approved/Enrolled | 2087.0501 |
| 1 | 66693 | AnPhA-S116-01 | M | Course Unavailable at Local School | Approved/Enrolled | 2309.0334 |
| 2 | 66811 | AnPhA-S116-01 | F | Course Unavailable at Local School | Approved/Enrolled | 5298.8507 |
| 3 | 66862 | AnPhA-S116-01 | F | Course Unavailable at Local School | Approved/Enrolled | 1746.9667 |
| 4 | 67508 | AnPhA-S116-01 | F | Scheduling Conflict | Approved/Enrolled | 2668.1830 |
Let’s create some mock data to practice joining tables.
# Create mock data
import pandas as pd
students = pd.DataFrame({
'student_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'major': ['Math', 'Physics', 'Biology', 'Computer Science']
})
students
| | student_id | name | major |
|---|---|---|---|
| 0 | 1 | Alice | Math |
| 1 | 2 | Bob | Physics |
| 2 | 3 | Charlie | Biology |
| 3 | 4 | David | Computer Science |
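The joins that follow also use a second table called scores, whose definition does not appear in this section. The sketch below is an assumed reconstruction inferred from the join results: students 1, 2, and 3 have a score, student 4 does not, and student 5 has a score but no row in students.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'major': ['Math', 'Physics', 'Biology', 'Computer Science']
})

# Assumed definition of `scores`, inferred from the join outputs in this section
scores = pd.DataFrame({
    'student_id': [1, 2, 3, 5],
    'score': [85, 90, 75, 80]
})

print(scores)
```

The deliberate mismatch (student 4 only in students, student 5 only in scores) is what makes the three join types return different rows.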
# Left join: returns all rows from the left table and the matched rows from the right table. Where there is no match, the right-table columns are NaN.
left_join_result = pd.merge(students, scores, on='student_id', how='left')
left_join_result
| | student_id | name | major | score |
|---|---|---|---|---|
| 0 | 1 | Alice | Math | 85.0 |
| 1 | 2 | Bob | Physics | 90.0 |
| 2 | 3 | Charlie | Biology | 75.0 |
| 3 | 4 | David | Computer Science | NaN |
# Right join: returns all rows from the right table and the matched rows from the left table. Where there is no match, the left-table columns are NaN.
right_join_result = pd.merge(students, scores, on='student_id', how='right')
right_join_result
| | student_id | name | major | score |
|---|---|---|---|---|
| 0 | 1 | Alice | Math | 85 |
| 1 | 2 | Bob | Physics | 90 |
| 2 | 3 | Charlie | Biology | 75 |
| 3 | 5 | NaN | NaN | 80 |
# Full join: returns all rows from both tables. Where a row has no match in the other table, that table's columns are NaN.
full_join_result = pd.merge(students, scores, on='student_id', how='outer')
full_join_result
| | student_id | name | major | score |
|---|---|---|---|---|
| 0 | 1 | Alice | Math | 85.0 |
| 1 | 2 | Bob | Physics | 90.0 |
| 2 | 3 | Charlie | Biology | 75.0 |
| 3 | 4 | David | Computer Science | NaN |
| 4 | 5 | NaN | NaN | 80.0 |
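To see which table each row of a full join came from, pd.merge accepts indicator=True, which adds a _merge column. This is an optional extra rather than part of the original walkthrough, and the scores table here is the same assumed reconstruction inferred from the outputs above.

```python
import pandas as pd

students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'major': ['Math', 'Physics', 'Biology', 'Computer Science']
})

# Assumed definition of `scores`, inferred from the join outputs
scores = pd.DataFrame({
    'student_id': [1, 2, 3, 5],
    'score': [85, 90, 75, 80]
})

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'
checked = pd.merge(students, scores, on='student_id', how='outer', indicator=True)

print(checked[['student_id', 'name', 'score', '_merge']])
```

Rows marked left_only or right_only are the ones that would be dropped or filled with NaN in a one-sided join, which makes this a quick check before choosing a join type.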
What’s next?
Complete the Prepare and Wrangle parts of the Case Study.