Foundations Python Module 1: Code-Along
The Foundations of Learning Analytics modules are designed for those seeking an introductory understanding of learning analytics and either basic R programming skills or basic Python skills, particularly in the context of STEM education research. The following code-along is aimed at preparing you for the first section of the case study.
By the end of this module:
Data:
Let’s start by creating a new Python script and loading some essential packages introduced in LA Workflows:
NumPy is a Python package for scientific computing. It is used for data manipulation and mathematical operations, particularly when working with large datasets or complex computations.
Use your Python script to import the numpy package as np.
If this is your first time using the package, you will need to install it from the terminal first:
macOS/Linux: $ python3 -m pip install numpy
Windows: $ py -m pip install numpy
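Once NumPy is installed, the import and a first calculation might look like the minimal sketch below. The `minutes` values are made up for illustration and are not read from the course data; the point is only to show the `np` alias and array arithmetic.

```python
import numpy as np

# Hypothetical values: minutes spent in a course by five students
minutes = np.array([1555.2, 1382.7, 860.4, 1598.6, 1481.8])

# Vectorized arithmetic: every value is converted to hours in one step
hours = minutes / 60

# Summary statistics computed over the whole array
mean_hours = hours.mean()
max_hours = hours.max()

print(hours.round(2))
print(round(mean_hours, 2), round(max_hours, 2))
```

Dividing the array by 60 applies the operation to each element, which is the main reason to prefer a NumPy array over a plain Python list for numeric work.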
Copy the code below into your script and run it. What do you see?
import pandas as pd
# read and load `sci-online-classes.csv` from the data folder
time_spent = pd.read_csv("data/sci-online-classes.csv")
# inspect the first five rows
time_spent.head()

| | student_id | course_id | total_points_possible | total_points_earned | percentage_earned | subject | semester | section | Gradebook_Item | Grade_Category | ... | q7 | q8 | q9 | q10 | TimeSpent | TimeSpent_hours | TimeSpent_std | int | pc | uv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43146 | FrScA-S216-02 | 3280 | 2220 | 0.676829 | FrScA | S216 | 2 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 5.0 | 5.0 | 4.0 | 5.0 | 1555.1667 | 25.919445 | -0.180515 | 5.0 | 4.5 | 4.333333 |
| 1 | 44638 | OcnA-S116-01 | 3531 | 2672 | 0.756726 | OcnA | S116 | 1 | ATTEMPTED | NaN | ... | 4.0 | 5.0 | 4.0 | 4.0 | 1382.7001 | 23.045002 | -0.307803 | 4.2 | 3.5 | 4.000000 |
| 2 | 47448 | FrScA-S216-01 | 2870 | 1897 | 0.660976 | FrScA | S216 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 4.0 | 5.0 | 3.0 | 5.0 | 860.4335 | 14.340558 | -0.693260 | 5.0 | 4.0 | 3.666667 |
| 3 | 47979 | OcnA-S216-01 | 4562 | 3090 | 0.677335 | OcnA | S216 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 4.0 | 5.0 | 5.0 | 5.0 | 1598.6166 | 26.643610 | -0.148447 | 5.0 | 3.5 | 5.000000 |
| 4 | 48797 | PhysA-S116-01 | 2207 | 1910 | 0.865428 | PhysA | S116 | 1 | POINTS EARNED & TOTAL COURSE POINTS | NaN | ... | 4.0 | 4.0 | NaN | 3.0 | 1481.8000 | 24.696667 | -0.234663 | 3.8 | 3.5 | 3.500000 |
5 rows × 30 columns
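If you do not have the data folder at hand, the same calls can be tried on an inline stand-in. The sketch below is an assumption-laden miniature: it copies three columns from the first three rows of the output above into a string and reads it with `pd.read_csv`, which accepts a file-like object as well as a path. The name `sample` is hypothetical.

```python
import io
import pandas as pd

# Inline stand-in for a CSV file; values copied from the first rows shown above
csv_text = """student_id,course_id,TimeSpent
43146,FrScA-S216-02,1555.1667
44638,OcnA-S116-01,1382.7001
47448,FrScA-S216-01,860.4335
"""

# pd.read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # column types inferred by pandas
print(sample.head(2))  # first two rows only
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed the way you expected before moving on to wrangling.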
Use your Python script to read in log-data.csv from the data folder, using the pd.read_csv function. Save it to a new object called online_classes.
👉 Your Turn ⤵ -> Answer
import pandas as pd
# read and load the log file from the data folder
online_classes = pd.read_csv("data/log-data.csv")
# inspect the first five rows
online_classes.head()

| | student_id | course_id | gender | enrollment_reason | enrollment_status | time_spent |
|---|---|---|---|---|---|---|
| 0 | 60186 | AnPhA-S116-01 | M | Course Unavailable at Local School | Approved/Enrolled | 2087.0501 |
| 1 | 66693 | AnPhA-S116-01 | M | Course Unavailable at Local School | Approved/Enrolled | 2309.0334 |
| 2 | 66811 | AnPhA-S116-01 | F | Course Unavailable at Local School | Approved/Enrolled | 5298.8507 |
| 3 | 66862 | AnPhA-S116-01 | F | Course Unavailable at Local School | Approved/Enrolled | 1746.9667 |
| 4 | 67508 | AnPhA-S116-01 | F | Scheduling Conflict | Approved/Enrolled | 2668.1830 |
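Let’s create some mock data:
# Create mock data
import pandas as pd
students = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'major': ['Math', 'Physics', 'Biology', 'Computer Science']
})
# scores table used by the joins that follow; values taken from the join outputs
scores = pd.DataFrame({
    'student_id': [1, 2, 3, 5],
    'score': [85, 90, 75, 80]
})
students

| | student_id | name | major |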
|---|---|---|---|
| 0 | 1 | Alice | Math |
| 1 | 2 | Bob | Physics |
| 2 | 3 | Charlie | Biology |
| 3 | 4 | David | Computer Science |
# Left join: returns all rows from the left table and the matched rows from the right table. Where there is no match, the result is NaN.
left_join_result = pd.merge(students, scores, on='student_id', how='left')
left_join_result

| | student_id | name | major | score |
|---|---|---|---|---|
| 0 | 1 | Alice | Math | 85.0 |
| 1 | 2 | Bob | Physics | 90.0 |
| 2 | 3 | Charlie | Biology | 75.0 |
| 3 | 4 | David | Computer Science | NaN |
# Right join: returns all rows from the right table and the matched rows from the left table. Where there is no match, the result is NaN.
right_join_result = pd.merge(students, scores, on='student_id', how='right')
right_join_result

| | student_id | name | major | score |
|---|---|---|---|---|
| 0 | 1 | Alice | Math | 85 |
| 1 | 2 | Bob | Physics | 90 |
| 2 | 3 | Charlie | Biology | 75 |
| 3 | 5 | NaN | NaN | 80 |
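Notice that score prints as 85.0 in the left join but as 85 in the right join. The sketch below illustrates why, assuming a scores table with the four values visible in the outputs: NaN is a floating-point value, so pandas converts an integer column to float whenever a join introduces a missing value into it.

```python
import pandas as pd

students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "David"],
    "major": ["Math", "Physics", "Biology", "Computer Science"],
})
# Assumed scores table, reconstructed from the join outputs
scores = pd.DataFrame({
    "student_id": [1, 2, 3, 5],
    "score": [85, 90, 75, 80],
})

left = pd.merge(students, scores, on="student_id", how="left")
right = pd.merge(students, scores, on="student_id", how="right")

# Left join: David has no score, so the column holds a NaN and becomes float
print(left["score"].dtype)
# Right join: every row comes from scores, so the column stays integer
print(right["score"].dtype)
# The missing values land in the columns that come from students instead
print(right["name"].isna().sum())
```

If whole-number scores matter for later steps, it is worth checking the column type after each join rather than assuming it is unchanged.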
# Full join: returns all rows from both tables. Where there is no match, the missing values are NaN.
full_join_result = pd.merge(students, scores, on='student_id', how='outer')
full_join_result

| | student_id | name | major | score |
|---|---|---|---|---|
| 0 | 1 | Alice | Math | 85.0 |
| 1 | 2 | Bob | Physics | 90.0 |
| 2 | 3 | Charlie | Biology | 75.0 |
| 3 | 4 | David | Computer Science | NaN |
| 4 | 5 | NaN | NaN | 80.0 |
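All three joins above keep at least some unmatched rows. For comparison, here is a hedged sketch of two related options in `pd.merge`: `how='inner'`, which keeps only the rows found in both tables, and `indicator=True`, which adds a `_merge` column recording where each row came from. The scores table is again an assumption reconstructed from the outputs, and the names `inner_join_result` and `audit` are hypothetical.

```python
import pandas as pd

students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "David"],
    "major": ["Math", "Physics", "Biology", "Computer Science"],
})
# Assumed scores table, reconstructed from the join outputs
scores = pd.DataFrame({
    "student_id": [1, 2, 3, 5],
    "score": [85, 90, 75, 80],
})

# Inner join: only student_id values present in both tables survive
inner_join_result = pd.merge(students, scores, on="student_id", how="inner")

# Full join with an audit column: 'both', 'left_only', or 'right_only'
audit = pd.merge(students, scores, on="student_id", how="outer", indicator=True)

print(inner_join_result)
print(audit[["student_id", "_merge"]])
```

The `_merge` column is a convenient check before analysis: rows marked `left_only` or `right_only` are exactly the ones that would receive NaN values in a left, right, or full join.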
What’s next?
Next, move on to the Prepare and Wrangle parts of the Case Study.