Lab 3: Assessing High School Students’ Machine Learning Literacy
Author
LASER Institute
Published
July 19, 2024
Background
The rapid advancement of artificial intelligence (AI) and machine learning (ML) technologies has significantly transformed various sectors, from healthcare and finance to education and entertainment. This transformation underscores the necessity of integrating machine learning literacy into educational curricula, particularly at the high school level, to prepare students for a future where AI plays a central role. Efforts have been made to develop K-12 AI and ML curricula. Meanwhile, assessing ML literacy among high school students is becoming increasingly crucial, serving as a key metric for understanding how well students grasp these complex subjects.
Research question
What is high school students’ machine learning literacy before and after participating in an AI curriculum?
Install package
The openai package is the official Python client for the OpenAI API, which provides access to OpenAI’s models like GPT (Generative Pre-trained Transformer). Developers use this package to integrate OpenAI’s AI models into their applications for tasks such as text classification.
backoff is a library that helps in adding retrying logic to Python applications. It’s particularly useful when making requests to web services that might be temporarily unavailable or overloaded. By using backoff, we can automatically retry failed requests using a variety of backoff strategies (e.g., fixed delay, exponential backoff) to gracefully handle intermittent failures and improve the robustness of our applications.
!pip install openai
!pip install backoff
Requirement already satisfied: openai in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (1.25.0)
Requirement already satisfied: anyio<5,>=3.5.0 in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from openai) (4.2.0)
Requirement already satisfied: distro<2,>=1.7.0 in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from openai) (1.9.0)
Requirement already satisfied: httpx<1,>=0.23.0 in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from openai) (0.26.0)
Requirement already satisfied: pydantic<3,>=1.9.0 in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from openai) (2.5.3)
Requirement already satisfied: sniffio in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from openai) (1.3.0)
Requirement already satisfied: tqdm>4 in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from openai) (4.66.4)
Requirement already satisfied: typing-extensions<5,>=4.7 in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from openai) (4.11.0)
Requirement already satisfied: idna>=2.8 in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from anyio<5,>=3.5.0->openai) (3.7)
Requirement already satisfied: certifi in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from httpx<1,>=0.23.0->openai) (2024.7.4)
Requirement already satisfied: httpcore==1.* in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from httpx<1,>=0.23.0->openai) (1.0.2)
Requirement already satisfied: h11<0.15,>=0.13 in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.14.0)
Requirement already satisfied: annotated-types>=0.4.0 in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from pydantic<3,>=1.9.0->openai) (0.6.0)
Requirement already satisfied: pydantic-core==2.14.6 in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (from pydantic<3,>=1.9.0->openai) (2.14.6)
Requirement already satisfied: backoff in /opt/anaconda3/envs/py311/lib/python3.11/site-packages (2.2.1)
Now let’s set up a connection to OpenAI’s GPT service using the openai library.
import openai
from openai import OpenAI
Note that you need to use your own key. You can get one by referring to How to get your ChatGPT API key (4 steps). Once you have it, replace the xxx with your own key, remove the # in front of the code, and run the cell.
# Create a client to interact with the OpenAI API
#gpt_key = "xxx" # use your own key
client = OpenAI(api_key=gpt_key)
The following code imports the popular data manipulation library pandas under the alias pd, which is a common convention. pandas is commonly used for data analysis and manipulation tasks.
It also imports the tqdm function from the tqdm library. tqdm is used to create progress bars in Python, which are useful for tracking the progress of iterative tasks such as loops or data processing operations.
import backoff
import time
import pandas as pd
from tqdm import tqdm
Next, let’s define a function named completions_with_backoff that is decorated with the @backoff.on_exception decorator. This decorator is from the backoff library and configures a backoff strategy for retrying a function in case it raises a specified exception.
@backoff.on_exception(backoff.expo, openai.RateLimitError)
def completions_with_backoff(**kwargs):
    '''This function will automatically try the API call again if it fails.'''
    return client.chat.completions.create(**kwargs)
Now, let’s set up some variables related to the OpenAI GPT model.
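The cell that defined these variables is not shown above, so here is a minimal sketch. The model name and token limit below are assumptions; adjust them to the model and budget you plan to use.
model = "gpt-3.5-turbo"   # assumed model name; replace with the model you want to use
MAX_TOKENS = 300          # assumed upper bound on tokens returned per completion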
Next, let’s read the dataset and do some initialization.
DATA_FILENAME = 'ml literacy.csv'
df = pd.read_csv(DATA_FILENAME, encoding='utf-8')
token_usage = 0 # This initializes a variable token_usage to keep track of the total number of tokens used during the process.
Zero-shot
The following loop iterates over each row in the DataFrame df using df.iterrows(). The tqdm function is used to display a progress bar for the loop.
responses = [] # This initializes an empty list to store responses generated from the OpenAI GPT model.

for i, row in tqdm(df.iterrows(), total=len(df)):
    value = str(row['response']) # Retrieves the value of the 'response' column from the current row and converts it to a string.

    # Constructs a prompt by combining a prefix and the 'response' value from the row.
    prefix = "Based on the student's response provided in:"
    postfix = "evaluate and return only the student's machine learning literacy level. The assessment should categorize the student into one of the following three levels: novice, intermediate, or advanced."
    prompt = ' '.join([prefix, value, postfix])

    # Creates a list of messages containing the prompt.
    messages = [{"role": "user", "content": prompt}]

    # Attempts to generate a completion using the completions_with_backoff function, passing the GPT model, messages, and maximum tokens as arguments.
    try:
        completion = completions_with_backoff(
            model=model,
            messages=messages,
            max_tokens=MAX_TOKENS
        )
    except openai.APIError as e:
        print('ERROR: while accessing the API.')
        print(f'Failed on item {i}.')
        print(e)
        print("Prompt:", prompt)
        raise e

    # Retrieves the response from the completion and appends it to the responses list.
    response = completion.choices[0].message.content
    responses.append(response)

    # Updates the token_usage counter with the total tokens used in the completion.
    token_usage += completion.usage.total_tokens

    # Need to wait to not exceed the rate limit.
    time.sleep(5)
Now, we can write the text classification results to a new CSV file.
df["literacy_llm"] = responses #Adds a new column named 'literacy_llm' to the DataFrame df and populates it with the responses generated by the GPT model.#df.to_csv("modified.csv") #Writes the modified DataFrame df to a new CSV file named 'modified.csv'.
Look at the outcomes of text classification using large language models. Firstly, rather than producing concise assessments of machine learning literacy (i.e., novice, intermediate, and advanced), these models tend to generate extended narratives. Secondly, the accuracy of the results is often questionable. Lastly, there is a noticeable inconsistency, as the model yields varied outcomes upon each execution.
To address these issues, we can use more advanced models as well as change the prompts. For example, we can edit the prompt as follows.
responses = [] # This initializes an empty list to store responses generated from the OpenAI GPT model.

for i, row in tqdm(df.iterrows(), total=len(df)):
    value = str(row['response']) # Retrieves the value of the 'response' column from the current row and converts it to a string.

    # Constructs a prompt by combining a prefix and the 'response' value from the row.
    prefix = "Based on the student's response provided in:"
    postfix = "evaluate and return only the student's machine learning literacy level. The assessment should categorize the student into one of the following three levels: novice, intermediate, or advanced. Do not provide any additional commentary or explanation; only specify the level."
    prompt = ' '.join([prefix, value, postfix])

    # Creates a list of messages containing the prompt.
    messages = [{"role": "user", "content": prompt}]

    # Attempts to generate a completion using the completions_with_backoff function, passing the GPT model, messages, and maximum tokens as arguments.
    try:
        completion = completions_with_backoff(
            model=model,
            messages=messages,
            max_tokens=MAX_TOKENS
        )
    except openai.APIError as e:
        print('ERROR: while accessing the API.')
        print(f'Failed on item {i}.')
        print(e)
        print("Prompt:", prompt)
        raise e

    # Retrieves the response from the completion and appends it to the responses list.
    response = completion.choices[0].message.content
    responses.append(response)

    # Updates the token_usage counter with the total tokens used in the completion.
    token_usage += completion.usage.total_tokens

    # Need to wait to not exceed the rate limit.
    time.sleep(5)
The result now only contains the three levels of machine learning literacy. However, the issue of accuracy remains a challenge that needs to be addressed. To improve the accuracy of these classifications, several strategies could be implemented.
First, we can provide a definition of the three levels of machine learning literacy directly within the prompt, clearly outlining the characteristics and knowledge expected at the novice, intermediate, and advanced levels. This contextual guidance can assist the model in making more informed and precise assessments by directly comparing the content of the student’s response against these definitions.
Second, instead of zero-shot, we can employ few-shot learning techniques by providing the model with a few examples of text classified into each of the three levels of machine learning literacy before it makes a prediction. This approach can help the model better understand the context and criteria for each category.
Third, employing a “chain of thought” prompting strategy can be beneficial. This method involves guiding the model to break down its reasoning process into intermediate steps before arriving at a final classification. By explicitly asking the model to first identify key concepts or skills demonstrated in the student’s response and then match these to the criteria defined for novice, intermediate, and advanced levels, we can enhance the model’s accuracy in determining the appropriate literacy level. This approach not only aims to improve the precision of the classification but also adds a layer of transparency to how the model reaches its conclusion.
Fourth, leveraging an “assertion” approach can refine the model’s decision-making process. This involves structuring the prompt to encourage the model to make a direct assertion about the student’s machine learning literacy level based on evidence found in the response. By prompting the model to state its conclusion as an assertion and then justify it with specific examples or reasoning from the student’s text, the process becomes more focused and deliberate, potentially increasing the accuracy and reliability of the classification.
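As a rough illustration of this assertion approach (this prompt is not used later in the lab, and its exact wording is an assumption), the postfix could be restructured along the following lines while the rest of the loop stays the same.
# Illustrative assertion-style postfix; the wording below is an assumption, not the lab's prompt.
postfix = (
    "First state an assertion of the form 'Assertion: the student's machine learning literacy level is X', "
    "where X is novice, intermediate, or advanced. "
    "Then justify the assertion with specific evidence quoted or paraphrased from the student's response."
)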
Add definitions (coding book)
Let’s experiment with the first strategy and look at the result.
responses = [] # This initializes an empty list to store responses generated from the OpenAI GPT model.

for i, row in tqdm(df.iterrows(), total=len(df)):
    value = str(row['response']) # Retrieves the value of the 'response' column from the current row and converts it to a string.

    # Constructs a prompt by combining a prefix and the 'response' value from the row.
    prefix = "Based on the student's response provided in:"
    postfix = "evaluate and return only the student's machine learning literacy level. The assessment should categorize the student into one of the following three levels: novice, intermediate, or advanced. Consider the following definitions for each level to guide your assessment: Novice: The student tends to oversimplify or inaccurately describe machine learning concepts, indicating a lack of depth in their understanding. Intermediate: The student demonstrates a foundational understanding of machine learning operations, showing they grasp the basics but may not delve into complexities; Advanced: The student exhibits a thorough and detailed knowledge of machine learning processes, reflecting a deep understanding that spans beyond foundational concepts. Do not provide any additional commentary or explanation; only specify the level."
    prompt = ' '.join([prefix, value, postfix])

    # Creates a list of messages containing the prompt.
    messages = [{"role": "user", "content": prompt}]

    # Attempts to generate a completion using the completions_with_backoff function, passing the GPT model, messages, and maximum tokens as arguments.
    try:
        completion = completions_with_backoff(
            model=model,
            messages=messages,
            max_tokens=MAX_TOKENS
        )
    except openai.APIError as e:
        print('ERROR: while accessing the API.')
        print(f'Failed on item {i}.')
        print(e)
        print("Prompt:", prompt)
        raise e

    # Retrieves the response from the completion and appends it to the responses list.
    response = completion.choices[0].message.content
    responses.append(response)

    # Updates the token_usage counter with the total tokens used in the completion.
    token_usage += completion.usage.total_tokens

    # Need to wait to not exceed the rate limit.
    time.sleep(5)

df["literacy_llm_definition"] = responses
Adding definitions does not result in better accuracy. This could be due to several factors, including the model’s inherent limitations in interpreting and applying nuanced definitions to varied responses, or the possibility that the definitions themselves are not distinct enough for the model to differentiate clearly between levels. Let’s experiment with the second and third strategies and look at the result.
responses = [] # This initializes an empty list to store responses generated from the OpenAI GPT model.

for i, row in tqdm(df.iterrows(), total=len(df)):
    value = str(row['response']) # Retrieves the value of the 'response' column from the current row and converts it to a string.

    # Constructs a prompt by combining a prefix and the 'response' value from the row.
    prefix = "Based on the student's response provided in:"

    # Define the base instructions
    instructions = (
        "Evaluate and return only the student's machine learning literacy level. "
        "The assessment should categorize the student into one of the following three levels: "
        "novice, intermediate, or advanced."
    )

    # Define the example and chain of thought for the novice level
    novice_example = (
        "Novice: 'Machine learning is kind of intelligence where computers learn on their own.'"
    )
    chain_of_thought = (
        "Chain of Thought: In this example, the student's description of machine learning "
        "focuses on a broad, generalized understanding without delving into specifics about how "
        "machine learning algorithms work or are applied. The emphasis on 'intelligence' and "
        "'learning on their own' suggests a lack of detailed knowledge about the processes and "
        "techniques involved in machine learning, which is characteristic of a novice level of understanding."
    )

    # Define the reminder
    reminder = (
        "Remember, your task is to specify the literacy level as either "
        "novice, intermediate, or advanced without adding any additional commentary or explanation."
    )

    # Combine all parts into the final postfix message
    postfix = f"{instructions} To guide your evaluation, consider the following example and the associated chain of thought process: {novice_example} {chain_of_thought} {reminder}"

    prompt = ' '.join([prefix, value, postfix])

    # Creates a list of messages containing the prompt.
    messages = [{"role": "user", "content": prompt}]

    # Attempts to generate a completion using the completions_with_backoff function, passing the GPT model, messages, and maximum tokens as arguments.
    try:
        completion = completions_with_backoff(
            model=model,
            messages=messages,
            max_tokens=MAX_TOKENS
        )
    except openai.APIError as e:
        print('ERROR: while accessing the API.')
        print(f'Failed on item {i}.')
        print(e)
        print("Prompt:", prompt)
        raise e

    # Retrieves the response from the completion and appends it to the responses list.
    response = completion.choices[0].message.content
    responses.append(response)

    # Updates the token_usage counter with the total tokens used in the completion.
    token_usage += completion.usage.total_tokens

    # Need to wait to not exceed the rate limit.
    time.sleep(5)

df["literacy_llm_oneshotcot"] = responses
It’s working a bit better. You can see that the prompt becomes longer. Long prompts can provide more detailed guidance, enhancing the model’s understanding of the task and potentially improving accuracy. However, they also pose challenges such as increased processing time and the risk of overwhelming the model with too much information, which might detract from its ability to focus on the core aspects of the task. To balance detail with efficiency, it’s crucial to ensure that every part of the prompt is directly relevant and structured to lead the model toward the desired outcome without unnecessary complexity.
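If your dataset also includes human-assigned levels, one way to quantify whether a prompt variant actually works better is to compare each generated column against that ground truth. The column name literacy_human below is hypothetical; substitute whatever your human-coded column is called.
# Simple agreement check against a hypothetical human-coded column ('literacy_human').
for col in ["literacy_llm", "literacy_llm_definition", "literacy_llm_oneshotcot"]:
    agreement = (
        df[col].str.strip().str.lower() == df["literacy_human"].str.strip().str.lower()
    ).mean()
    print(f"{col}: {agreement:.0%} agreement with human coding")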
Good practices for effective prompt engineering
Effective prompt engineering is a nuanced art that involves crafting inputs to guide machine learning models, particularly large language models (LLMs), to produce desired outputs with greater accuracy. Here are key practices to consider for successful prompt engineering:
Clarity and Precision: Ensure your prompts are clear and unambiguous. Use precise language to reduce the model’s likelihood of misinterpretation. This involves directly stating what you want the model to do, possibly including the format or structure of the desired response.
Contextual Guidance: Incorporate enough context to guide the model’s response without overwhelming it with information. Context helps the model understand the task better but too much can lead to decreased performance or irrelevant details in the output.
Simplicity and Brevity: While detail is important, efficiency in communication should not be overlooked. Aim for the sweet spot where the prompt is brief yet comprehensive enough to guide the model effectively. Avoid unnecessary complexity that could confuse both the model and the reader.
Use of Examples: Including examples within your prompt can significantly improve model performance, especially for tasks that might be open to interpretation. Examples act as direct guidance, showing the model exactly what is expected.
Adaptation and Iteration: Not all prompts work well on the first try. Be prepared to adapt and iterate your prompts based on the outputs you receive. This might mean refining your language, adding more context, or simplifying the request.
Understanding Model Capabilities: Tailor your prompts according to the specific strengths and weaknesses of the model you are using. Different models may perform better with different types of prompts based on their training data and architecture.
Ethical Considerations: Ensure your prompts do not inadvertently guide the model to generate harmful, biased, or sensitive content. Being mindful of the ethical implications of your prompts is crucial for responsible AI use.
Feedback Loops: Incorporate mechanisms for feedback on the model’s outputs. This can involve human review or automated checks to refine prompts further based on performance.
Chain of Thought Prompting: For complex tasks, guiding the model through a chain of thought can help break down the problem into manageable parts, leading to more accurate and logical outputs.
Experimentation: Finally, effective prompt engineering often involves a degree of experimentation. Testing different prompt styles and structures can help identify what works best for your specific task and model.