How to Split Data into Training and Testing Sets

Code Along

Getting started

Process

Again, create a .R file — this time in /module-2
Then copy, paste, and run the code in this presentation as we talk through each step

Quick discussion

Which parts of the supervised machine learning process are most unclear?

Code-along

R Code

Loading, setting up

library(tidyverse)
library(tidymodels)

gss_cat # this is built-in data, so it's loaded differently than typical data

gss_cat <- gss_cat %>% 
    mutate(is_married = as.factor(if_else(marital == "Married", 1, 0))) # as a factor

gss_cat %>% 
    count(is_married)

Split data

set.seed(42) # make the random split reproducible
train_test_split <- initial_split(gss_cat, prop = .80, strata = "is_married")
data_train <- training(train_test_split)
data_test <- testing(train_test_split)

Engineer features

# predicting marital status based on the independent effects of age and TV hours
# the recipe is built from the training data only
my_rec <- recipe(is_married ~ age + tvhours, data = data_train)

Specify model and workflow

# specify model
my_mod <- logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification")

# specify workflow
my_wf <- workflow() %>%
    add_model(my_mod) %>% 
    add_recipe(my_rec)

Fit model

# fit on the training data only, so the test data stays unseen
fit_model <- fit(my_wf, data = data_train)

Evaluate accuracy

# predict on the held-out test data, which the model has not seen
predictions <- predict(fit_model, data_test) %>% 
    bind_cols(data_test)

predictions %>%
  metrics(is_married, .pred_class) %>%
  filter(.metric == "accuracy")

Python Code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

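# Load and preprocess the data
starwars = pd.read_csv('path_to_starwars.csv')  # Load your data file
# Drop rows with missing height or mass; LogisticRegression cannot fit on NaN values
starwars = starwars.dropna(subset=['height', 'mass'])
starwars['species_human'] = starwars['species'].apply(lambda x: 'Human' if x == 'Human' else 'Not human')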

# Split data
train, test = train_test_split(starwars, test_size=0.2, random_state=42)
X_train, y_train = train[['height', 'mass']], train['species_human']
X_test, y_test = test[['height', 'mass']], test['species_human']

# Specify model and fit
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate accuracy
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
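
Stratifying the split in Python

The R code stratifies the split on the outcome with the strata argument, while the Python split above does not. In scikit-learn, the corresponding argument to train_test_split is stratify. The sketch below is a minimal illustration on made-up data (the arrays X and y are hypothetical stand-ins, not the course data); it only shows that the share of positive cases stays about the same in the training and testing pieces.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 rows with an imbalanced 0/1 outcome
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))             # two made-up predictors
y = (rng.random(1000) < 0.3).astype(int)   # roughly 30% positive cases

# stratify=y asks for the same class balance in both pieces
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Row counts for each piece, then the share of positives overall and per piece
print(len(y_train), len(y_test))
print(y.mean(), y_train.mean(), y_test.mean())
```

Without stratify, the two shares can drift apart by chance, which matters most when one class is rare or the data set is small.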

Discussion

In the big picture, what is the use of the training data relative to the testing data?
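
One way to explore this question is to compute accuracy twice: once on the rows the model was fit to, and once on rows it never saw. The sketch below does that on simulated data (the predictors, the outcome rule, and the noise level are all assumptions made up for illustration), using the same logistic regression setup as the Python code above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Simulated data (hypothetical): the outcome depends on two predictors plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
noise = rng.normal(scale=1.0, size=500)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression()
clf.fit(X_train, y_train)  # the model only ever sees the training rows

# Accuracy on rows used for fitting vs. rows held out
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print(f"training accuracy: {train_acc:.3f}")
print(f"testing accuracy:  {test_acc:.3f}")
```

The training number describes how well the model fits data it has already seen; the testing number is the estimate of how it would do on new data.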