How to Split Data into Training and Testing Sets

Code Along

Getting started

Process

  • Again, create an .R (or .py) file — this time in /module-2
  • Then, copy, paste, and run the code in this presentation as we talk through each step

Quick discussion

  • Which parts of the supervised machine learning process are most unclear?

Code-along

R Code

Loading, setting up

library(tidyverse)
library(tidymodels)

pokemon <- read_csv("data/pokemon-data.csv")

pokemon %>% 
    glimpse()

Split data

set.seed(42)  # make the random split reproducible
pokemon_split <- initial_split(pokemon, prop = 0.8)
train <- training(pokemon_split)
test <- testing(pokemon_split)

Engineer features

pokemon_recipe <- recipe(early_gen ~ height_m + weight_kg + hp, 
                         data = train) %>% 
    step_mutate(early_gen = as.factor(early_gen))

Specify recipe, model, and workflow

my_mod <- logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification")

my_wf <- workflow() %>%
    add_recipe(pokemon_recipe) %>%
    add_model(my_mod)

Fit model

log_reg_fit <- last_fit(my_wf, pokemon_split)

Evaluate accuracy

collect_metrics(log_reg_fit)

Python Code

Data loading, setting up

import pandas as pd 

pokemon_df = pd.read_csv('data/pokemon-data.csv')

print(pokemon_df.head())  # print() is needed to see output when running a .py file

Split data

from sklearn.model_selection import train_test_split  # import the built-in train-test split tool from scikit-learn

train_df, test_df = train_test_split(pokemon_df, test_size=0.2, random_state=42)  
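
What the split does, in plain Python

The sketch below illustrates the idea behind a random train-test split: shuffle the rows, then hold out a fraction of them. It is not scikit-learn's actual implementation; the function name simple_train_test_split and the 100-row stand-in data are hypothetical and used only for illustration.

```python
import random


def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle rows and hold out a fraction of them as a test set."""
    rng = random.Random(seed)        # seeded so the split is reproducible
    shuffled = list(rows)            # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled rows become the test set; the rest are training rows
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # hypothetical stand-in for 100 rows of data
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 80 20
```

Every row lands in exactly one of the two sets, and reusing the same seed gives the same split each time, which is the role random_state=42 plays above.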

Fit model

import statsmodels.formula.api as smf

model = smf.logit('early_gen ~ height_m + weight_kg + hp', data=train_df).fit()

Evaluate model

print(model.summary())  # print() is needed to see output when running a .py file
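
Accuracy on the test set (sketch)

The summary describes the fit to the training data, so it does not show how well the model predicts the held-out test rows. The sketch below shows one way to compute test accuracy, assuming early_gen is coded 0/1 and a 0.5 cutoff (both assumptions, not stated in these slides). The function name accuracy_at_threshold and the example values are hypothetical. In practice the probabilities would likely come from model.predict(test_df) and the labels from test_df['early_gen'].

```python
def accuracy_at_threshold(actual, predicted_probs, threshold=0.5):
    """Share of rows whose thresholded prediction matches the actual label."""
    # Turn each predicted probability into a 0/1 class prediction
    predicted = [1 if p >= threshold else 0 for p in predicted_probs]
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical values standing in for the test labels and predicted probabilities
actual = [1, 0, 1, 1, 0]
predicted_probs = [0.9, 0.2, 0.4, 0.7, 0.6]
print(accuracy_at_threshold(actual, predicted_probs))  # 0.6
```

This is the Python counterpart to the accuracy that collect_metrics() reports in the R code: it is computed only on rows the model did not see during fitting.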

Discussion

  • In the big picture, what is the use of the training data relative to the testing data?