Same Model, Different Analytic Goals

Code Along

Getting started

Process

  • Create a .R file in /module-1
  • Then copy, paste, and run the code in this presentation as we talk through each step

Quick discussion

  • How comfortable are you with using RStudio?
  • What concerns do you have about coding?

We’ll take it easy on this part!

Code-along - Regression

R Code

Our aim: How well can we predict survey respondents’ marital status based on their age and the hours of TV they watch? Let’s simplify our task (for now) to just distinguishing between married and not married.

Loading, setting up

library(tidyverse)
library(tidymodels)

gss_cat # built-in data that ships with forcats (loaded by tidyverse), so no file import is needed

gss_cat %>% 
    count(marital)

gss_cat <- gss_cat %>% 
    mutate(is_married = if_else(marital == "Married", 1, 0))

Fit model

m1 <- glm(is_married ~ age + tvhours, # a very simple model!
          data = gss_cat, 
          family = "binomial")

Interpret fit statistics, coefficients and standard errors, and p-values

summary(m1)
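
The coefficients in a logistic regression summary are on the log-odds scale, which is hard to read directly. Below is a minimal sketch, in plain Python with only the standard library, of the two conversions people usually apply: exponentiating a coefficient to get an odds ratio, and passing the linear predictor through the logistic function to get a probability. The coefficient values and the helper names (odds_ratio, predicted_probability) are made up for illustration; they are not the fitted values from m1.

```python
import math

# Hypothetical log-odds coefficients (NOT the fitted values from m1)
coefs = {"intercept": -1.0, "age": 0.02, "tvhours": -0.1}

def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(beta)

def predicted_probability(age, tvhours):
    """Apply the logistic (inverse-logit) function to the linear predictor."""
    eta = coefs["intercept"] + coefs["age"] * age + coefs["tvhours"] * tvhours
    return 1.0 / (1.0 + math.exp(-eta))

for name in ("age", "tvhours"):
    print(f"{name}: odds ratio = {odds_ratio(coefs[name]):.3f}")

print(f"P(married | age=40, tvhours=2) = {predicted_probability(40, 2):.3f}")
```

An odds ratio above 1 means the odds of being married rise as the predictor increases; below 1 means they fall. In R, the analogous step would be exp(coef(m1)).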

Python code

Example code only; it uses a Star Wars CSV file rather than gss_cat

import pandas as pd
import statsmodels.api as sm

# Load and preprocess the data
starwars = pd.read_csv('path_to_starwars.csv')  # Load your data file
starwars = starwars.dropna(subset=['height', 'mass'])  # Logit cannot be fit with missing predictor values
starwars['species_human'] = (starwars['species'] == 'Human').astype(int)  # 1 = Human, 0 = not human

# Regression model
X_reg = starwars[['height', 'mass']]
y_reg = starwars['species_human']
X_reg = sm.add_constant(X_reg)  # Add a constant term for the intercept

reg_model = sm.Logit(y_reg, X_reg).fit()
print(reg_model.summary())

Code-along - Supervised machine learning (SML)

R Code

Loading, setting up

gss_cat # same built-in data as before; tidyverse and tidymodels must already be loaded

gss_cat <- gss_cat %>% 
    mutate(is_married = as.factor(if_else(marital == "Married", 1, 0))) # as a factor

Split data

# We will skip this step for now, but know this is common!
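
Splitting means holding out part of the data before fitting, so that accuracy can later be checked on rows the model never saw. Here is a rough sketch of the idea in plain Python; the helper train_test_split_rows and the rows themselves are hypothetical, not part of this code-along. In a tidymodels workflow this step is normally handled by the rsample functions initial_split(), training(), and testing().

```python
import random

def train_test_split_rows(rows, prop=0.8, seed=123):
    """Shuffle a copy of rows and split it into (train, test) by proportion."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * prop)
    return shuffled[:cut], shuffled[cut:]

# Made-up rows standing in for (age, tvhours, is_married)
rows = [(20 + i, i % 5, i % 2) for i in range(10)]

train, test = train_test_split_rows(rows, prop=0.8)
print(len(train), "training rows;", len(test), "test rows")
```

The model would then be fit on the training rows only and evaluated on the test rows.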

Engineer features

# predicting marital status based on the effects of age and tvhours

my_rec <- recipe(is_married ~ age + tvhours, data = gss_cat) 

Specify recipe, model, and workflow

# specify model
my_mod <- logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification")

# specify workflow
my_wf <- workflow() %>%
    add_model(my_mod) %>% 
    add_recipe(my_rec)

Fit model

fit_model <- fit(my_wf, data = gss_cat)

Evaluate accuracy

predictions <- predict(fit_model, gss_cat) %>% 
    bind_cols(gss_cat) %>% 
    mutate(is_married = as.factor(is_married))

predictions %>% 
    select(.pred_class, is_married)

predictions %>%
  metrics(is_married, .pred_class) %>%
  filter(.metric == "accuracy")
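
Accuracy is simply the share of rows where the predicted class matches the observed class. Here is a minimal sketch of that calculation in plain Python, using made-up labels rather than the gss_cat predictions; accuracy and confusion_counts are hypothetical helper names.

```python
from collections import Counter

def accuracy(truth, predicted):
    """Proportion of predictions that match the observed labels."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)

def confusion_counts(truth, predicted):
    """Count how often each (observed, predicted) pair occurs."""
    return Counter(zip(truth, predicted))

# Made-up labels: 1 = married, 0 = not married
truth     = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy:", accuracy(truth, predicted))
for (obs, pred), n in sorted(confusion_counts(truth, predicted).items()):
    print(f"observed={obs} predicted={pred}: {n}")
```

Because the split step was skipped, the accuracy computed in this code-along is measured on the same rows the model was fit to, so it tends to be optimistic compared with accuracy on held-out data.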

Python code

Example code only; it uses a Star Wars CSV file rather than gss_cat

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load and preprocess the data
starwars = pd.read_csv('path_to_starwars.csv')  # Load your data file
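starwars = starwars.dropna(subset=['height', 'mass'])  # LogisticRegression does not accept missing values
starwars['species_human'] = starwars['species'].apply(lambda x: 'Human' if x == 'Human' else 'Not human')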

# Prepare the features and target
X = starwars[['height', 'mass']]
y = starwars['species_human']

# Specify model and fit
clf = LogisticRegression()
clf.fit(X, y)

# Evaluate accuracy on the training data
y_pred = clf.predict(X)
print(classification_report(y, y_pred))

Discussion

  • What do you notice about the differences in the output between regression and SML?
  • What do you notice is different about the modeling approach?