Same Model, Different Analytic Goals

Code Along

Process

Create a .R file in /module-1
Then, copy, paste, and run the code in this presentation as we talk through each step

Quick discussion

What concerns do you have about coding in R (or Python)?

We’ll take it easy with this part!

Code-along - Regression

Our aim: What relates to whether a Pokémon is legendary – that is, one of the “incredibly rare and often very powerful Pokémon”?

Data Dictionary

Column	Type	Description	Example Values
name	Character	The official name of the Pokémon.	Pikachu, Bulbasaur
type_1	Categorical	The primary elemental type. Determines many battle strengths/weaknesses.	Water, Fire, Grass, Electric
type_2	Categorical	The secondary elemental type, if applicable (often missing/NA for single-type Pokémon).	Flying, Poison, NA
hp	Numeric	Base “Health Points” indicating how much damage a Pokémon can take before fainting.	35, 60, 100
attack	Numeric	Base Attack stat. Affects damage dealt by Physical moves.	55, 82, 134
defense	Numeric	Base Defense stat. Affects damage received from Physical moves.	40, 80, 95
sp_atk	Numeric	Base Special Attack stat. Affects damage dealt by Special moves (e.g., Flamethrower).	50, 90, 120
sp_def	Numeric	Base Special Defense stat. Affects damage received from Special moves.	50, 85, 125
speed	Numeric	Base Speed stat, determining which Pokémon moves first in battle.	35, 100, 130
generation	Integer or Factor	Numerical indicator of the game generation the Pokémon was introduced (1, 2, 3, etc.).	1, 2, 3
legendary	Boolean	Indicates if the Pokémon is Legendary (TRUE/FALSE, 1/0).	FALSE, TRUE
total	Numeric	Sum of all base stats (HP + Attack + Defense + Sp. Atk + Sp. Def + Speed).	320, 540, 680
height	Numeric	Pokémon’s approximate height (units vary by dataset, often meters).	0.4, 1.7, 2.0
weight	Numeric	Pokémon’s approximate weight (units vary by dataset, often kilograms).	6.0, 90.5, 210.0
early_gen	Numeric	Whether the Pokémon was introduced in generation 1 or 2 (as opposed to a later generation).	1, 0

R Code

Loading, setting up

library(tidyverse)

pokemon <- read_csv("data/pokemon-data.csv")

pokemon %>% 
    glimpse()

Fit model – we’ll start with a very simple model that uses just three variables: how do height, weight, and HP relate to a Pokémon being legendary?

m1 <- glm(is_legendary ~ height_m + weight_kg + hp,
          data = pokemon,
          family = "binomial")

Interpret fit statistics, coefficients and standard errors, and p-values

summary(m1) # what do you notice about the coefficients?
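To make the coefficients in the summary more concrete, it can help to see how a logistic regression turns them into a predicted probability. The sketch below is plain Python and uses made-up coefficient values (they are not the output of m1) and a made-up Pokémon, so treat every number as a placeholder for whatever summary(m1) reports.

```python
import math

# Hypothetical coefficients on the log-odds scale (NOT the fitted values from m1)
coefs = {"intercept": -6.0, "height_m": 0.5, "weight_kg": 0.002, "hp": 0.03}

# A hypothetical Pokémon to score
pokemon = {"height_m": 2.0, "weight_kg": 200.0, "hp": 100.0}

def predicted_probability(coefs, row):
    """Linear predictor on the log-odds scale, then the inverse-logit transform."""
    log_odds = coefs["intercept"] + sum(coefs[name] * value for name, value in row.items())
    return 1 / (1 + math.exp(-log_odds))

p_legendary = predicted_probability(coefs, pokemon)
print(round(p_legendary, 3))
```

Each coefficient shifts the log-odds, and only the final transform puts the result on the 0 to 1 probability scale.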

Optionally, convert log-odds coefficients to probabilities

log_odds <- coef(m1)

odds <- exp(log_odds)

probabilities <- odds / (1 + odds)

results <- tibble(
  Term = names(log_odds),
  Log_Odds = log_odds,
  Odds = odds,
  Probability = probabilities
)

results
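One caution when reading this table: exponentiating a slope gives an odds ratio, and odds / (1 + odds) is a probability of the outcome only for the intercept (the odds of being legendary when every predictor is zero). The plain-Python sketch below uses made-up values, not the fitted m1 coefficients, to show that the same odds ratio moves the probability by different amounts depending on the starting point.

```python
import math

def inv_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical values, for illustration only (not taken from m1)
slope_hp = 0.03                      # log-odds change per 1 point of hp
odds_ratio_hp = math.exp(slope_hp)   # multiplicative change in the odds

# Effect of +10 hp from two different baselines on the log-odds scale
changes = {}
for baseline_log_odds in (-4.0, 0.0):
    before = inv_logit(baseline_log_odds)
    after = inv_logit(baseline_log_odds + 10 * slope_hp)
    changes[baseline_log_odds] = after - before
    print(f"baseline p = {before:.3f}, after +10 hp = {after:.3f}")

print(f"odds ratio per hp point: {odds_ratio_hp:.3f}")
```

The change in probability is larger near 0.5 than near 0, even though the odds ratio is identical in both cases.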

Python Code

Loading, setting up

import pandas as pd 

pokemon_df = pd.read_csv('data/pokemon-data.csv')

pokemon_df.head()

Fit model

import statsmodels.formula.api as smf

# Same outcome and predictors as the R model above
model = smf.logit('is_legendary ~ height_m + weight_kg + hp', data=pokemon_df).fit()

Evaluate model

model.summary()

Code-along - SML

Our aim: How well can we predict whether a Pokémon is legendary – that is, one of the “incredibly rare and often very powerful Pokémon”?

R Code

Loading, setting up

library(tidymodels)

Split data

# skipped for now, to be introduced in the next module!

Engineer features and specify recipe

pokemon <- pokemon %>% 
     mutate(is_legendary = as.factor(is_legendary)) # classification models need a factor outcome

pokemon_recipe <- recipe(is_legendary ~ height_m + weight_kg + hp, 
                         data = pokemon)

Set model and workflow

my_mod <- logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification")

my_wf <- workflow() %>%
    add_recipe(pokemon_recipe) %>%
    add_model(my_mod)

Fit model

log_reg_fit <- fit(my_wf, data = pokemon) # fit on the full data, since we skipped the split

Evaluate accuracy

pokemon_preds <- predict(log_reg_fit, pokemon, type = "class") %>% 
  bind_cols(pokemon %>% select(is_legendary))

metrics(pokemon_preds, truth = is_legendary, estimate = .pred_class)

Python Code

Loading, setting up

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

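# scikit-learn won't work with rows that have nulls, so drop rows missing any model column
# (a bare dropna() would also drop every single-type Pokémon, since type_2 is null for them)
pokemon_df = pokemon_df.dropna(subset=['height_m', 'weight_kg', 'hp', 'is_legendary'])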

#Todo(Josh): Determine if we want to talk about imputation; e.g., median filling

Split data

dependent_col = 'is_legendary'
independent_cols = ['height_m', 'weight_kg', 'hp']

X = pokemon_df[independent_cols]
y = pokemon_df[dependent_col]

Define and train model

model = LogisticRegression()  # note: scikit-learn applies L2 regularization by default, unlike glm
model.fit(X, y)  # Fit the model on the entire dataset

Predict and evaluate

y_preds = model.predict(X)  # Then predict on all the data as if you didn't know the dependent variable
accuracy = accuracy_score(y, y_preds)  # Compare your predictions vs the actual y's

round(accuracy, 3)
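Because legendary Pokémon are rare, it is worth comparing accuracy against a baseline that always predicts “not legendary”. The sketch below uses plain Python and a small made-up set of labels and predictions (not the real dataset) to show how a model can score high accuracy while finding only half of the rare class.

```python
# Hypothetical labels: 1 = legendary, 0 = not legendary (made up for illustration)
y_true = [0] * 18 + [1] * 2
# Hypothetical model predictions: catches one legendary, misses the other
y_pred = [0] * 18 + [1, 0]

def accuracy(truth, preds):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)

model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = accuracy(y_true, [0] * len(y_true))  # always predict "not legendary"

# Share of true legendaries the model actually found (recall)
found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = found / sum(y_true)

print(model_accuracy, baseline_accuracy, recall)
```

When the model's accuracy is only slightly above the baseline, the accuracy figure alone says little about how well the rare class is predicted.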

Discussion

What do you notice about the differences in the output between regression and SML?
What went well? What was frustrating?