Same Model, Different Analytic Goals

Code Along

Process

  • Create a .R file in /module-1
  • Then, copy, paste, and run the code in this presentation as we talk through each step

Quick discussion

  • What concerns do you have about coding in R (or Python)?

We’ll take it easy in this part!

Code-along - Regression

Our aim: What relates to whether a Pokémon is legendary, that is, one of the “incredibly rare and often very powerful Pokémon”?

Data Dictionary

| Column | Type | Description | Example Values |
|---|---|---|---|
| name | Character | The official name of the Pokémon. | Pikachu, Bulbasaur |
| type_1 | Categorical | The primary elemental type. Determines many battle strengths/weaknesses. | Water, Fire, Grass, Electric |
| type_2 | Categorical | The secondary elemental type, if applicable (often missing/NA for single-type Pokémon). | Flying, Poison, NA |
| hp | Numeric | Base “Health Points” indicating how much damage a Pokémon can take before fainting. | 35, 60, 100 |
| attack | Numeric | Base Attack stat. Affects damage dealt by Physical moves. | 55, 82, 134 |
| defense | Numeric | Base Defense stat. Affects damage received from Physical moves. | 40, 80, 95 |
| sp_atk | Numeric | Base Special Attack stat. Affects damage dealt by Special moves (e.g., Flamethrower). | 50, 90, 120 |
| sp_def | Numeric | Base Special Defense stat. Affects damage received from Special moves. | 50, 85, 125 |
| speed | Numeric | Base Speed stat, determining which Pokémon moves first in battle. | 35, 100, 130 |
| generation | Integer or Factor | Numerical indicator of the game generation the Pokémon was introduced (1, 2, 3, etc.). | 1, 2, 3 |
| legendary | Boolean | Indicates if the Pokémon is Legendary (TRUE/FALSE, 1/0). | FALSE, TRUE |
| total | Numeric | Sum of all base stats (HP + Attack + Defense + Sp. Atk + Sp. Def + Speed). | 320, 540, 680 |
| height | Numeric | Pokémon’s approximate height (units vary by dataset, often meters). | 0.4, 1.7, 2.0 |
| weight | Numeric | Pokémon’s approximate weight (units vary by dataset, often kilograms). | 6.0, 90.5, 210.0 |
| early_gen | Numeric | Whether or not a Pokémon is 1st or 2nd gen. | 1, 0 |
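The two derived columns in the dictionary, total and early_gen, can be reproduced from the others. The sketch below is only an illustration in plain Python: the two rows and their names (example_a, example_b) are made up, and the rule "generation 1 or 2 means early_gen = 1" is taken from the description in the table, not from the dataset's own documentation.

```python
# Hypothetical rows using the column names from the data dictionary
# (the values are made up for illustration, not read from the dataset)
rows = [
    {"name": "example_a", "hp": 35, "attack": 55, "defense": 40,
     "sp_atk": 50, "sp_def": 50, "speed": 90, "generation": 1},
    {"name": "example_b", "hp": 100, "attack": 134, "defense": 95,
     "sp_atk": 120, "sp_def": 125, "speed": 106, "generation": 4},
]

# The six base stats that the dictionary says are summed into `total`
STAT_COLS = ["hp", "attack", "defense", "sp_atk", "sp_def", "speed"]

for row in rows:
    # total = HP + Attack + Defense + Sp. Atk + Sp. Def + Speed
    row["total"] = sum(row[col] for col in STAT_COLS)
    # early_gen = 1 for generation 1 or 2, otherwise 0 (assumed coding)
    row["early_gen"] = 1 if row["generation"] in (1, 2) else 0

for row in rows:
    print(row["name"], row["total"], row["early_gen"])
```

Running a check like this against the real file is a quick way to confirm that the dictionary matches the data before modeling.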

R Code

Loading, setting up

library(tidyverse)

pokemon <- read_csv("data/pokemon-data.csv")

pokemon %>% 
    glimpse()

Fit model: to begin with a very simple model, we’ll use just three variables. How do height, weight, and HP relate to whether a Pokémon is legendary?

m1 <- glm(is_legendary ~ height_m + weight_kg + hp,
          data = pokemon,
          family = "binomial")

Interpret fit statistics, coefficients and standard errors, and p-values

summary(m1) # what do you notice about the coefficients?

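Optionally, convert log-odds coefficients to odds and probabilities (the probability is directly interpretable only for the intercept; for the other terms, the exponentiated coefficient is an odds ratio)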

log_odds <- coef(m1)

odds <- exp(log_odds)

probabilities <- odds / (1 + odds)

results <- tibble(
  Term = names(log_odds),
  Log_Odds = log_odds,
  Odds = odds,
  Probability = probabilities
)

results
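To see why the probability column is only directly meaningful for the intercept, here is a minimal Python sketch. The coefficient values are made up (they are not output from m1), and the names intercept, slope_hp, and inv_logit are illustrative. It assumes a logistic model with an intercept and a single slope.

```python
import math


def inv_logit(log_odds):
    """Convert a value on the log-odds scale to a probability."""
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical coefficients on the log-odds scale (NOT estimates from the Pokemon data)
intercept = -5.0
slope_hp = 0.03

# exp(slope) is an odds ratio: the multiplicative change in the odds
# for a one-unit increase in hp
odds_ratio_hp = math.exp(slope_hp)

# The intercept alone converts to a probability: the predicted probability when hp = 0
baseline_prob = inv_logit(intercept)

# For any other hp value, add the terms on the log-odds scale first, then convert once
prob_hp_50 = inv_logit(intercept + slope_hp * 50)
prob_hp_100 = inv_logit(intercept + slope_hp * 100)

print(round(odds_ratio_hp, 3), round(baseline_prob, 4))
print(round(prob_hp_50, 4), round(prob_hp_100, 4))
```

The change in probability between two hp values depends on where you start, which is why a slope has no single probability equivalent, while its odds ratio stays the same everywhere.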

Python Code

Loading, setting up

import pandas as pd 

pokemon_df = pd.read_csv('data/pokemon-data.csv')

pokemon_df.head()

Fit model

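import statsmodels.formula.api as smf

# Convert the outcome to 0/1 integers so the formula treats it as numeric rather than categorical
# (assumes is_legendary has no missing values)
pokemon_df['is_legendary'] = pokemon_df['is_legendary'].astype(int)

# Same outcome and predictors as the R model above
model = smf.logit('is_legendary ~ height_m + weight_kg + hp', data=pokemon_df).fit()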

Evaluate model

model.summary()

Code-along - SML

Our aim: How well can we predict whether a Pokémon is legendary, that is, one of the “incredibly rare and often very powerful Pokémon”?

R Code

Loading, setting up

library(tidymodels)

Split data

# skipped for now, to be introduced in the next module!

Engineer features and specify recipe

# classification models in tidymodels need the outcome to be a factor
pokemon <- pokemon %>% 
     mutate(is_legendary = as.factor(is_legendary))

pokemon_recipe <- recipe(is_legendary ~ height_m + weight_kg + hp, 
                         data = pokemon)

Set model and workflow

my_mod <- logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification")

my_wf <- workflow() %>%
    add_recipe(pokemon_recipe) %>%
    add_model(my_mod)

Fit model

log_reg_fit <- fit(my_wf, data = pokemon) # no split yet, so fit on the full dataset

Evaluate accuracy

# predict on the same data, then attach the true outcome for comparison
pokemon_preds <- predict(log_reg_fit, pokemon, type = "class") %>% 
  bind_cols(pokemon %>% select(is_legendary))

metrics(pokemon_preds, truth = is_legendary, estimate = .pred_class)

Python Code

Loading, setting up

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

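# scikit-learn won't work with rows that have nulls, so drop rows missing any model column
# (a plain dropna() would also drop every Pokemon whose type_2 is missing)
pokemon_df = pokemon_df.dropna(subset=['height_m', 'weight_kg', 'hp', 'is_legendary'])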

# TODO(Josh): Determine if we want to talk about imputation; e.g., median filling

Select features and outcome (train/test split skipped for now)

dependent_col = 'is_legendary'
independent_cols = ['height_m', 'weight_kg', 'hp']

X = pokemon_df[independent_cols]
y = pokemon_df[dependent_col]

Define and train model

model = LogisticRegression()
model.fit(X, y)  # Fit the model on the entire dataset

Predict and Evaluate

y_preds = model.predict(X)  # Then predict on all the data as if you didn't know the dependent variable
accuracy = accuracy_score(y, y_preds)  # Compare your predictions vs the actual y's

round(accuracy, 3)
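Because legendary Pokémon are rare, an accuracy number is easier to judge next to a baseline. The sketch below uses made-up counts (8 legendary out of 100, not the real dataset) and two small helper functions, accuracy and recall, written here for illustration. It shows what a "model" that always predicts "not legendary" would score.

```python
# Hypothetical outcome labels: 1 = legendary, 0 = not legendary (made-up counts)
y_true = [1] * 8 + [0] * 92

# A "model" that always predicts the majority class (not legendary)
y_majority = [0] * len(y_true)


def accuracy(truth, preds):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)


def recall(truth, preds, positive=1):
    """Share of actual positives that were predicted positive."""
    preds_for_positives = [p for t, p in zip(truth, preds) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


baseline_accuracy = accuracy(y_true, y_majority)
baseline_recall = recall(y_true, y_majority)

print(baseline_accuracy, baseline_recall)
```

With these counts the always-"no" baseline is right most of the time while finding none of the legendary Pokémon, so the fitted model's accuracy is worth comparing against the share of non-legendary rows in the real data.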

Discussion

  • What do you notice about the differences in the output between regression and SML?
  • What went well? What was frustrating?