# Load packages needed for this analysis
library(tidyverse)
library(tidymodels)
3 Overfitting
Settling In
- Sit in your NEW randomly assigned group
- Introduce yourselves and choose a team name (you’ll need this later)
- Review the finalized Course Syllabus
- Catch up on any Slack messages you might have missed
- Prepare to take notes:
- Locate the QMDs for today’s activity (see the Course Schedule)
- Save these documents in your “STAT 253 > Notes” folder
- You can open the Part 1 QMD…
- …but DO NOT open Part 2 (yet)
Learning Goals
- Explain why training/in-sample model evaluation metrics can provide a misleading view of true test/out-of-sample performance
- Implement testing and training sets in R using the tidymodels package
Small Group Discussion
To start class today, we’re going to do a Model Evaluation Experiment!
Open the Part 1 QMD file.
Directions
Let’s build and evaluate a predictive model of an adult’s height (\(y\)) using some predictors \(x_i\) (e.g., age, weight, etc.).
- Introduce yourself in whatever way you feel appropriate and check in with each other as human beings
- Come up with a team name
- Work through the steps below as a group, after you are told your group number
- Each group will be given a different sample of 40 adults
- Start by predicting height (in) using hip circumference (cm)
- Evaluate the model on your sample.
- Be prepared to share your answers to:
- How good is your simple model?
- What would happen if we added more predictors?
Questions
Goal:
Let’s build and evaluate a predictive model of an adult’s height (\(y\)) using some predictors \(x_i\) (eg: age, weight, etc).
Since \(y\) is quantitative this is a regression task.
There are countless possible models of \(y\) vs \(x\). We’ll utilize a linear regression model:
\[y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon\]
- And after building this model, we’ll evaluate it.
Data: Each group will be given a different sample of 40 adults.
# Load your data: fill in the blanks at end of url with your number
# group 1 = 50
# group 2 = 143
# group 3 = 160
# group 4 = 174
# group 5 = 86
humans <- read.csv("https://kegrinde.github.io/stat253_coursenotes/data/bodyfat___.csv") %>%
  filter(ankle < 30) %>%
  rename(body_fat = fatSiri)
# Check out a density plot of your y outcomes
ggplot(humans, aes(x = height)) +
geom_density()
Model building: Build a linear regression model of height (in) by hip circumference (cm).
# STEP 1: model specification
lm_spec <- ___() %>%
  set_mode(___) %>%
  set_engine(___)

# STEP 2: model estimation
model_1 <- ___ %>%
  ___(height ~ hip, data = humans)
# Check out the coefficients
# Do all groups have the same coefficients? Should they?
Solution
Each group will have slightly different coefficients because they have different samples of data.
# This is an example with one of the samples
humans <- read.csv("https://kegrinde.github.io/stat253_coursenotes/data/bodyfat50.csv") %>%
  filter(ankle < 30) %>%
  rename(body_fat = fatSiri)
# STEP 1: model specification
lm_spec <- linear_reg() %>%
  set_mode('regression') %>%
  set_engine('lm')

# STEP 2: model estimation
model_1 <- lm_spec %>%
  fit(height ~ hip, data = humans)
# Check out the coefficients
model_1 %>%
  tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 52.5 7.68 6.84 0.0000000460
2 hip 0.179 0.0778 2.30 0.0272
Model evaluation: How good is our model?
# Calculate the R^2 for model_1
# Use your model to predict height for your subjects
# Just print the first 6 results
model_1 %>%
  ___(new_data = ___) %>%
  head()

# Calculate the MAE, i.e. typical prediction error, for your model
model_1 %>%
  augment(new_data = humans) %>%
  ___(truth = ___, estimate = ___)
Solution
Again, each group will have slightly different answers here because they have different samples of data.
# Calculate the R^2 for model_1
model_1 %>%
  glance()
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.125 0.101 2.26 5.29 0.0272 1 -86.1 178. 183.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Use your model to predict height for your subjects
# Just print the first 6 results
model_1 %>%
  augment(new_data = humans) %>%
  head()
# A tibble: 6 × 21
.pred .resid fatBrozek body_fat density age weight height adiposity
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 70.3 -1.09 24.7 25.4 1.04 43 177 69.2 26
2 70.8 -1.52 22 22.5 1.05 38 187. 69.2 27.5
3 70.2 -1.19 9.4 8.8 1.08 29 161. 69 23.8
4 68.9 4.58 7.1 6.3 1.08 49 153. 73.5 19.9
5 69.3 2.91 9.9 9.4 1.08 23 160. 72.2 21.6
6 70.2 -2.48 22.7 23.3 1.05 52 167 67.8 25.6
# ℹ 12 more variables: fatFreeWeight <dbl>, neck <dbl>, chest <dbl>,
# abdomen <dbl>, hip <dbl>, thigh <dbl>, knee <dbl>, ankle <dbl>,
# biceps <dbl>, forearm <dbl>, wrist <dbl>, hipin <dbl>
# Calculate the MAE, i.e. typical prediction error, for your model
model_1 %>%
  augment(new_data = humans) %>%
  mae(truth = height, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.88
Reflection
In addition to hip circumference, suppose we incorporated more predictors into our model of height. What would happen to \(R^2\)? To the MAE?
Solution
\(R^2\) would increase and MAE would decrease.
Exercises (Part 1)
Directions
- Take 5 minutes to complete exercises 1 and 2 (choosing one of three models).
- We’ll pause for a few minutes to discuss each group’s answers to these exercises.
- Then, and only then, you can finish exercises 3 - 5.
REMINDERS:
- Be kind to yourself/each other. You will make mistakes!
- Collaborate:
- actively contribute to discussion (don’t work on your own)
- actively include all group members in discussion
- create a space where others feel comfortable making mistakes and sharing their ideas
- stay in sync
Questions
- Select a model
Consider 3 different models of height, estimated below. As a group, use your data to choose which is the best predictive model of height. Calculate the MAE for this model.
# height vs hip
model_1 <- lm_spec %>%
  fit(height ~ hip, data = humans)
model_1 %>%
  tidy()

# height vs hip & weight
model_2 <- lm_spec %>%
  fit(height ~ hip + weight, data = humans)
model_2 %>%
  tidy()

# height vs a lot of predictors (AND some interaction terms)
model_3 <- lm_spec %>%
  fit(height ~ chest * age * weight * body_fat * abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist, data = humans)
model_3 %>%
  tidy()

# Calculate the MAE for your model
___ %>%
  augment(new_data = humans) %>%
  mae(truth = height, estimate = .pred)
Solution
Will vary by group. MAE is calculated here for each model.
# Build the models
model_1 <- lm_spec %>%
  fit(height ~ hip, data = humans)
model_2 <- lm_spec %>%
  fit(height ~ hip + weight, data = humans)
model_3 <- lm_spec %>%
  fit(height ~ chest * age * weight * body_fat * abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist, data = humans)

# Evaluate the models
model_1 %>%
  augment(new_data = humans) %>%
  mae(truth = height, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.88
model_2 %>%
  augment(new_data = humans) %>%
  mae(truth = height, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.67
model_3 %>%
  augment(new_data = humans) %>%
  mae(truth = height, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.53e-10
- Share your results
Only when you’re done with exercise 1:
- Open this “Top Model Competition” Google Doc.
- Record your team name.
- Record which model you chose (1, 2, or 3).
- Record the MAE for your model.
- WAIT. Don’t keep going.
Don’t peek
What do you know?! 40 new people just walked into the doctor’s office and the doctor wants to predict their height:
# Import the new data
new_patients <- read.csv("https://kegrinde.github.io/stat253_coursenotes/data/bodyfat182.csv") %>%
  filter(ankle < 30) %>%
  rename(body_fat = fatSiri)
- Intuition
Consider using your model to predict height for these 40 new subjects. On average, do you think these predictions will be better or worse than for your original patients? Why?
- How well does your model do in the real world?
Use your model to predict height for the new patients and calculate the typical prediction error (MAE). Record this in the Google sheet.
___ %>%
  augment(new_data = new_patients) %>%
  mae(truth = height, estimate = .pred)
Solution
# Predict height (assume, for example, I choose model_1)
model_1 %>%
  augment(new_data = new_patients) %>%
  head()
# A tibble: 6 × 21
.pred .resid fatBrozek body_fat density age weight height adiposity
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 71.5 -1.96 27.1 28 1.04 62 201. 69.5 29.3
2 70.4 -0.141 20.9 21.3 1.05 42 163 70.2 23.3
3 69.7 -0.497 26.1 27 1.04 72 168 69.2 24.7
4 69.1 -1.39 4.1 3 1.09 35 152. 67.8 23.4
5 68.5 -2.99 1.9 0.7 1.1 35 126. 65.5 20.6
6 71.9 -1.91 31 32.3 1.03 57 206. 70 29.5
# ℹ 12 more variables: fatFreeWeight <dbl>, neck <dbl>, chest <dbl>,
# abdomen <dbl>, hip <dbl>, thigh <dbl>, knee <dbl>, ankle <dbl>,
# biceps <dbl>, forearm <dbl>, wrist <dbl>, hipin <dbl>
# Calculate the MAE for model_1
model_1 %>%
  augment(new_data = new_patients) %>%
  mae(truth = height, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.73
# Calculate the MAE for model_2
model_2 %>%
  augment(new_data = new_patients) %>%
  mae(truth = height, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.68
# Calculate the MAE for model_3
model_3 %>%
  augment(new_data = new_patients) %>%
  mae(truth = height, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 105.
- Reflection
In summary, which model seems best? What’s the central theme here?
Notes
Overfitting
When we add more and more predictors into a model, it can become overfit to the noise in our sample data:
- our model loses the broader trend / big picture
- thus does not generalize to new data
- thus results in bad predictions and a bad understanding of the relationship among the new data points
Preventing overfitting: training and testing
- In-sample metrics, i.e. measures of how well the model performs on the same sample data that we used to build it, tend to be overly optimistic and lead to overfitting.
- Instead, we should build and evaluate, or train and test, our model using different data.
R Code
This section is for future reference. It is a summary of code you’ll learn below for creating and applying training and testing data. Throughout, suppose we wish to build and evaluate a linear regression model of y vs x1 and x2 using our sample_data.
# Load packages
library(tidyverse)
library(tidymodels)
Split the sample data into training and test sets
# Set the random number seed
set.seed(___)
# Split the sample_data
# "prop" is the proportion of data assigned to the training set
# it must be some number between 0 and 1
data_split <- initial_split(sample_data, strata = y, prop = ___)

# Get the training data from the split
data_train <- data_split %>%
  training()

# Get the testing data from the split
data_test <- data_split %>%
  testing()
Build a training model
# STEP 1: model specification
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# STEP 2: model estimation using the training data
model_train <- lm_spec %>%
  fit(y ~ x1 + x2, data = data_train)
Use the training model to make predictions for the test data
# Make predictions
model_train %>%
  augment(new_data = data_test)
Evaluate the training model using the test data
# Calculate the test MAE
model_train %>%
  augment(new_data = data_test) %>%
  mae(truth = y, estimate = .pred)
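Note: MAE isn’t the only metric you can calculate this way. If you also want, say, the test R-squared or RMSE, yardstick’s rsq() and rmse() use the same truth/estimate interface as mae(). Below is a minimal sketch under the same hypothetical sample_data setup; the name test_metrics is just an illustrative choice.
# OPTIONAL: other test metrics (a sketch, same setup as above)
model_train %>%
  augment(new_data = data_test) %>%
  rsq(truth = y, estimate = .pred)

# Or compute several metrics at once with a metric set
# (test_metrics is just an illustrative name)
test_metrics <- metric_set(mae, rmse, rsq)
model_train %>%
  augment(new_data = data_test) %>%
  test_metrics(truth = y, estimate = .pred)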
Exercises (Part 2)
Directions
- Open the Part 2 QMD file
- Same directions as before:
- Be kind to yourself/each other
- Collaborate
- We will not discuss these exercises as a class. Be sure to ask questions as I walk around the room.
Questions
The following exercises are inspired by Chapter 5.3.1 of ISLR.
# Load packages & data
# NOTE: You might first need to install the ISLR package
library(tidyverse)
library(tidymodels)
library(ISLR)
data(Auto)
cars <- Auto %>%
  dplyr::select(mpg, horsepower, year)
Let’s use the cars data to compare three linear regression models of fuel efficiency in miles per gallon (mpg) by engine power (horsepower):
# Raw data
cars_plot <- ggplot(cars, aes(x = horsepower, y = mpg)) +
  geom_point()
cars_plot

# model 1: 1 predictor (y = b0 + b1 x)
cars_plot +
  geom_smooth(method = "lm", se = FALSE)

# model 2: 2 predictors (y = b0 + b1 x + b2 x^2)
cars_plot +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ poly(x, 2))

# model 3: 19 predictors (y = b0 + b1 x + b2 x^2 + ... + b19 x^19)
cars_plot +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ poly(x, 19))
Goal
Let’s evaluate and compare these models by training and testing them using different data.
- 155 review: set.seed()
Run the two chunks below multiple times each. Afterward, summarize what set.seed() does and why it’s important for being able to reproduce a random sample.
sample_n(cars, 2)
set.seed(253)
sample_n(cars, 2)
Solution
set.seed() is used to create the same “random numbers” each time a random function is called.
Note that if you want to get exactly the same random result, set.seed() needs to be run right before the call to the random function, every time.
It is important so that you can reproduce the same random sample every time you knit your work.
There might be different results across computers/platforms, as they might be using different pseudo-random number generators. The most important thing is for your code to be consistent.
- Training and test sets
Let’s randomly split our original 392 sample cars into two separate pieces: select 80% of the cars to train (build) the model and the other 20% to test (evaluate) the model.
# Set the random number seed
set.seed(8)
# Split the cars data into 80% / 20%
# Ensure that the sub-samples are similar with respect to mpg
cars_split <- initial_split(cars, strata = mpg, prop = 0.8)
# Check it out
# What do these numbers mean?
cars_split
# Get the training data from the split
cars_train <- cars_split %>%
  training()

# Get the testing data from the split
cars_test <- cars_split %>%
  testing()
# The original data has 392 cars
nrow(cars)
# How many cars are in cars_train?
# How many cars are in cars_test?
Solution
# Set the random number seed
set.seed(8)
# Split the cars data into 80% / 20%
# Ensure that the sub-samples are similar with respect to mpg
cars_split <- initial_split(cars, strata = mpg, prop = 0.8)
cars_split
<Training/Testing/Total>
<312/80/392>
# Get the training data from the split
cars_train <- cars_split %>%
  training()

# Get the testing data from the split
cars_test <- cars_split %>%
  testing()
# The original data has 392 cars
nrow(cars)
[1] 392
# How many cars are in cars_train?
nrow(cars_train)
[1] 312
# How many cars are in cars_test?
nrow(cars_test)
[1] 80
- Reflect on the above code
- Why do we want the training and testing data to be similar with respect to mpg (strata = mpg)? What if they weren’t?
- Why did we need all this new code instead of just using the first 80% of cars in the sample for training and the last 20% for testing?
Solution
- Suppose, for example, the training cars all had higher mpg than the test cars. Then the training model likely would not perform well on the test cars, thus we’d get an overly pessimistic measure of model quality.
- If the cars are ordered in some way (eg: from biggest to smallest) then our training and testing samples would have systematically different properties.
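If you’d like to see the effect of strata = mpg for yourself, one quick (optional) check is sketched below, assuming the cars_train and cars_test objects created above: the two sets should have similar mpg distributions.
# OPTIONAL sketch (not part of the exercise):
# check that the stratified split produced similar mpg distributions
cars_train %>%
  summarize(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
cars_test %>%
  summarize(mean_mpg = mean(mpg), sd_mpg = sd(mpg))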
- Build the training model
# STEP 1: model specification
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# STEP 2: model estimation using the training data
# Construct the 19th order polynomial model using the TRAINING data
model_19_train <- ___ %>%
  ___(mpg ~ poly(horsepower, 19), data = ___)
Solution
# STEP 1: model specification
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# STEP 2: model estimation using the training data
# Construct the 19th order polynomial model using the TRAINING data
model_19_train <- lm_spec %>%
  fit(mpg ~ poly(horsepower, 19), data = cars_train)
- Evaluate the training model
# How well does the TRAINING model predict the TRAINING data?
# Calculate the training (in-sample) MAE
model_19_train %>%
  augment(new_data = ___) %>%
  mae(truth = mpg, estimate = .pred)

# How well does the TRAINING model predict the TEST data?
# Calculate the test MAE
model_19_train %>%
  augment(new_data = ___) %>%
  mae(truth = mpg, estimate = .pred)
Solution
# How well does the training model predict the training data?
# Calculate the training (in-sample) MAE
model_19_train %>%
  augment(new_data = cars_train) %>%
  mae(truth = mpg, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 2.99
# How well does the training model predict the test data?
# Calculate the test MAE
model_19_train %>%
  augment(new_data = cars_test) %>%
  mae(truth = mpg, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 6.59
- Punchline
The table below summarizes your results for model_19_train as well as the other two models of interest. (You should confirm the other two model results outside of class!)
| Model | Training MAE | Testing MAE |
|---|---|---|
| mpg ~ horsepower | 3.78 | 4.00 |
| mpg ~ poly(horsepower, 2) | 3.20 | 3.49 |
| mpg ~ poly(horsepower, 19) | 2.99 | 6.59 |
Answer the following and reflect on why each answer makes sense:
- Within each model, how do the training errors compare to the testing errors? (This isn’t always the case, but is common.)
- What about the training and test errors for the third model suggests that it is overfit to our sample data?
- Which model seems the best with respect to the training errors?
- Which model is the best with respect to the testing errors?
- Which model would you choose?
Solution
- the training errors are smaller
- the test MAE is much larger than the training MAE
- the 19th order polynomial
- the quadratic
- the quadratic
Code for the curious
I wrote a function calculate_MAE() to automate the calculations in the table. If you’re curious, pick through this code!
# Write function to calculate MAEs
calculate_MAE <- function(poly_order){
  # Construct a training model
  model <- lm_spec %>%
    fit(mpg ~ poly(horsepower, poly_order), cars_train)

  # Calculate the training MAE
  train_it <- model %>%
    augment(new_data = cars_train) %>%
    mae(truth = mpg, estimate = .pred)

  # Calculate the testing MAE
  test_it <- model %>%
    augment(new_data = cars_test) %>%
    mae(truth = mpg, estimate = .pred)

  # Return the results
  return(data.frame(train_MAE = train_it$.estimate, test_MAE = test_it$.estimate))
}
# Calculate training and testing MAEs
calculate_MAE(poly_order = 1)
train_MAE test_MAE
1 3.779331 4.004333
calculate_MAE(poly_order = 2)
train_MAE test_MAE
1 3.199882 3.487022
calculate_MAE(poly_order = 19)
train_MAE test_MAE
1 2.989305 6.592341
# For those of you interested in trying all orders...
results <- purrr::map_df(1:19, calculate_MAE) %>%
  mutate(order = 1:19) %>%
  pivot_longer(cols = 1:2, names_to = 'Metric', values_to = 'MAE')

results %>%
  ggplot(aes(x = order, y = MAE, color = Metric)) +
geom_line() +
geom_point(data = results %>% filter(Metric == 'test_MAE') %>% slice_min(MAE)) +
geom_point(data = results %>% filter(Metric == 'train_MAE') %>% slice_min(MAE))
- Final reflection
- The training / testing procedure provided a more honest evaluation and comparison of our model predictions. How might we improve upon this procedure? What problems can you anticipate in splitting our data into 80% / 20%?
- Summarize the key themes from today in your own words.
Solution
This will be discussed in the next video!
- STAT 155 REVIEW: data drill
- Construct and interpret a plot of mpg vs horsepower and year.
- Calculate the average mpg.
- Calculate the average mpg for each year. HINT: group_by()
- Plot the average mpg by year.
Solution
# a. One of many options
ggplot(cars, aes(x = horsepower, y = mpg, color = year)) +
geom_point()
# b
cars %>%
  summarize(mean(mpg))
mean(mpg)
1 23.44592
# c
cars %>%
  group_by(year) %>%
  summarize(mean_mpg = mean(mpg))
# A tibble: 13 × 2
year mean_mpg
<dbl> <dbl>
1 70 17.7
2 71 21.1
3 72 18.7
4 73 17.1
5 74 22.8
6 75 20.3
7 76 21.6
8 77 23.4
9 78 24.1
10 79 25.1
11 80 33.8
12 81 30.2
13 82 32
# d
cars %>%
  group_by(year) %>%
  summarize(mean_mpg = mean(mpg)) %>%
ggplot(aes(y = mean_mpg, x = year)) +
geom_point()
- Digging deeper (optional)
Check out the online solutions for exercise 6. Instead of calculating MAE from scratch for 3 different models, I wrote a function calculate_MAE() to automate the process. After picking through this code, adapt the function so that it also returns the \(R^2\) value of each model.
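If you get stuck, one possible adaptation is sketched below (a sketch only, not the official solution): it mirrors calculate_MAE() and pulls the training \(R^2\) from glance(); the name calculate_MAE_R2 is just an illustrative choice.
# One possible adaptation (a sketch, not the official solution)
calculate_MAE_R2 <- function(poly_order){
  # Construct a training model
  model <- lm_spec %>%
    fit(mpg ~ poly(horsepower, poly_order), cars_train)

  # Calculate the training and testing MAE, as in calculate_MAE()
  train_it <- model %>%
    augment(new_data = cars_train) %>%
    mae(truth = mpg, estimate = .pred)
  test_it <- model %>%
    augment(new_data = cars_test) %>%
    mae(truth = mpg, estimate = .pred)

  # Also grab the (training) R^2 reported by glance()
  r2 <- model %>%
    glance() %>%
    pull(r.squared)

  # Return the results
  return(data.frame(train_MAE = train_it$.estimate,
                    test_MAE = test_it$.estimate,
                    R2 = r2))
}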
Done!
- Knit your notes.
- Check the solutions on the course website.
- If you finish all that during class, start your homework!
Wrap-Up
Finishing the Activity
- If you didn’t finish the activity, no problem! Be sure to complete the activity outside of class, review the solutions in the online manual, and ask any questions on Slack or in office hours.
- Re-organize and review your notes to help deepen your understanding, solidify your learning, and make homework go more smoothly!
After Class
- An R code video, posted under the pre-course materials for today’s class (see the “Schedule” page on this website), talks through the new code. This video is OPTIONAL. Decide what’s right for you.
- Continue to check in on Slack. I’ll be posting announcements there from now on.
- Upcoming due dates:
- CP3: due 10 minutes before our next class.
- There are two (short) videos to watch in advance.
- HW1 (Regression Model Evaluation): due next Tuesday at 11:59 pm
- Start today, even if you just review the directions and scan the exercises. Homework is not designed to be completed in one sitting!
- Invite others to work with you!
- Stop by office hours (preceptors or mine) with any questions