# STEP 1: model specification
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# STEP 2: model estimation
my_model <- lm_spec %>%
  fit(
    y ~ x1 + x2,
    data = sample_data
  )
4 Cross-Validation
Settling In
- Sit with the *same group* as last class
- If you missed class Tuesday, or if you’re in the *9:40 section*, check with me for your (potentially) new group #!
- Introduce yourselves/check-in with each other, then discuss:
- What has/hasn’t worked well for you in the past when it comes to working on in-class exercises in small groups?
- What would you like to try, or avoid, with this group?
- Prepare to take notes (Schedule > 9/19 > In Class > QMD)
- See the #announcements on Slack about upcoming events
Learning Goals
- Accurately describe all steps of cross-validation to estimate the test/out-of-sample version of a model evaluation metric
- Explain what role CV has in a predictive modeling analysis and its connection to overfitting
- Explain the pros/cons of higher vs. lower k in k-fold CV in terms of sample size and computing time
- Implement cross-validation in R using the `tidymodels` package
- Use these tools and concepts to:
- Inform and justify data analysis and modeling process and the resulting conclusions with clear, organized, logical, and compelling details that adapt to the background, values, and motivations of the audience and context in which communication occurs
Notes: Cross-Validation
Context: Evaluating Regression Models
A reminder of our current context:
- world = supervised learning: we want to build a model of some output variable \(y\) by some predictors \(x\)
- task = regression: \(y\) is quantitative
- model = linear regression model via the least squares algorithm

We’ll assume that the relationship between \(y\) and \(x\) can be represented by

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \varepsilon\]
GOAL: model evaluation
We want more honest metrics of prediction quality that
- assess how well our model predicts new outcomes; and
- help prevent overfitting.
Why is overfitting so bad?
Not only can overfitting produce misleading models, it can have serious societal impacts. Examples:
A former Amazon algorithm built to help sift through resumes was overfit to its current employees in leadership positions (who weren’t representative of the general population or candidate pool).
Facial recognition algorithms are often overfit to the people who build them (who are not broadly representative of society). As one example, this has led to disproportionate bias in policing. For more on this topic, you might check out Coded Bias, a documentary by Shalini Kantayya which features MIT Media Lab researcher Joy Buolamwini.
Polygenic risk scores (PRSs), which aim to predict a person’s risk of developing a particular disease/trait based on their genetics, are often overfit to the data on which they are built (which, historically, has exclusively—or at least primarily—included individuals of European ancestry). As a result, PRS predictions tend to be more accurate in European populations and new research suggests that their continued use in clinical settings could exacerbate health disparities.
k-fold Cross Validation
We can use k-fold cross-validation to estimate the typical error in our model predictions for new data:
- Divide the data into \(k\) folds (or groups) of approximately equal size.
- Repeat the following procedures for each fold \(j = 1,2,...,k\):
- Remove fold \(j\) from the data set.
- Fit a model using the data in the other \(k-1\) folds (training).
- Use this model to predict the responses for the \(n_j\) cases in fold \(j\): \(\hat{y}_1, ..., \hat{y}_{n_j}\).
- Calculate the MAE for fold \(j\) (testing): \(\text{MAE}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} |y_i - \hat{y}_i|\).
- Combine this information into one measure of model quality: \[\text{CV}_{(k)} = \frac{1}{k} \sum_{j=1}^k \text{MAE}_j\]
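To make these steps concrete, here is a from-scratch sketch of k-fold CV for a linear model. It assumes a hypothetical data frame `sample_data` with outcome `y` and predictors `x1` and `x2` (the same placeholder names used in the R code later in these notes); in practice we’ll let tidymodels handle the bookkeeping for us.

```r
# From-scratch k-fold CV sketch (illustration only; tidymodels automates this below)
# Assumes a hypothetical data frame `sample_data` with columns y, x1, x2
set.seed(253)
k <- 10
n <- nrow(sample_data)

# Randomly assign each case to one of k folds of roughly equal size
fold_id <- sample(rep(1:k, length.out = n))

mae_by_fold <- sapply(1:k, function(j) {
  train <- sample_data[fold_id != j, ]    # the other k-1 folds (training)
  test  <- sample_data[fold_id == j, ]    # fold j (testing)
  fit   <- lm(y ~ x1 + x2, data = train)  # fit the model without fold j
  preds <- predict(fit, newdata = test)   # predict the held-out cases
  mean(abs(test$y - preds))               # MAE_j
})

mean(mae_by_fold)  # CV_(k) = average MAE across the k test folds
```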
Small Group Discussion
Algorithms and Tuning
Definitions
algorithm = a step-by-step procedure for solving a problem (Merriam-Webster)
tuning parameter = a parameter or quantity upon which an algorithm depends, that must be selected or tuned to “optimize” the algorithm
Prompts
- Algorithms
Why is \(k\)-fold cross-validation an algorithm?
What is the tuning parameter of this algorithm and what values can this take?
Solution
- Yes, it is an algorithm: it follows a list of steps to reach its goal.
- \(k\), the number of folds, is a tuning parameter. \(k\) can be any integer from 2 to \(n\), where \(n\) is our sample size.
- Tuning the k-fold Cross-Validation algorithm
Let’s explore k-fold cross-validation with some personal experience. Our class has a representative sample of cards from a non-traditional population (no “face cards”, not equal numbers, etc). We want to use these to predict whether a new card will be odd or even (a classification task).
- Based on all of our cards, do we predict the next card will be odd or even?
- You’ve been split into 2 groups. Use 2-fold cross-validation to estimate the possible error of using our sample of cards to predict whether a new card will be odd or even. How’s this different than validation?
- Repeat for 3-fold cross-validation. Why might this be better than 2-fold cross-validation?
- Repeat for LOOCV, i.e. n-fold cross-validation where n is the number of students in this room. Why might this be worse than 3-fold cross-validation?
- What value of k do you think practitioners typically use?
Solution
- Use the percentage of odd and percentage of even among the sample of cards to help you make a prediction.
- We use both groups as training and testing, in turn.
- We have a larger dataset to train our model on. We are less likely to get an unrepresentative set as our training data.
- Prediction error for 1 person is highly variable.
- In practice, \(k = 10\) and \(k=7\) are common choices for cross-validation. This has been shown to hit the ‘sweet spot’ between the extremes of \(k=n\) (LOOCV) and \(k=2\).
- \(k=2\) only utilizes 50% of the data for each training model, thus might result in overestimating the prediction error
- \(k=n\) leave-one-out cross-validation (LOOCV) requires us to build \(n\) training models, thus might be computationally expensive for larger sample sizes \(n\). Further, with only one data point in each test set, the training sets have a lot of overlap. This correlation among the training sets can make the ultimate corresponding estimate of prediction error less reliable.
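To see this tradeoff concretely, the sketch below builds a made-up 100-row data frame (purely for illustration) and inspects the folds produced for a few values of \(k\) (the `v` argument of `vfold_cv()`). The printed split sizes show how much data each training model gets.

```r
library(rsample)  # attached automatically with tidymodels

# Made-up data, just to inspect fold sizes
toy_data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

set.seed(253)
vfold_cv(toy_data, v = 2)    # 2 folds: each model trains on only ~50% of the cases
vfold_cv(toy_data, v = 10)   # 10 folds: each model trains on ~90% of the cases
vfold_cv(toy_data, v = 100)  # LOOCV: 100 models, with heavily overlapping training sets
```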
- R Code Preview
We’ve been doing a 2-step process to build linear regression models using the tidymodels package:
For k-fold cross-validation, we can tweak STEP 2.
- Discuss the code below and why we need to set the seed.
# k-fold cross-validation
set.seed(___)
my_model_cv <- lm_spec %>%
  fit_resamples(
    y ~ x1 + x2,
    resamples = vfold_cv(sample_data, v = ___),
    metrics = metric_set(mae, rsq)
  )
Solution
The process of creating the folds is random, so we should set the seed to make our work reproducible.
Notes: R code
Suppose we wish to build and evaluate a linear regression model of `y` vs `x1` and `x2` using our `sample_data`.
First, load the appropriate packages
# Load packages
library(tidyverse)
library(tidymodels)
Obtain k-fold cross-validated estimates of MAE and \(R^2\)
(Review above for discussion of these steps.)
# model specification
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# k-fold cross-validation
# For "v", put your number of folds k
set.seed(___)
model_cv <- lm_spec %>%
  fit_resamples(
    y ~ x1 + x2,
    resamples = vfold_cv(sample_data, v = ___),
    metrics = metric_set(mae, rsq)
  )
Obtain the cross-validated metrics
model_cv %>%
  collect_metrics()
Get the MAE and R-squared for each test fold
# MAE for each test fold: Model 1
model_cv %>%
  unnest(.metrics)
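To see these pieces run end-to-end, here is a self-contained sketch using the built-in `mtcars` data (chosen only so the code runs as-is; the exercises below use `humans` instead).

```r
library(tidymodels)

# STEP 1: model specification
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# STEP 2: 5-fold cross-validation instead of a single fit
set.seed(253)
mpg_cv <- lm_spec %>%
  fit_resamples(
    mpg ~ wt + hp,
    resamples = vfold_cv(mtcars, v = 5),
    metrics = metric_set(mae, rsq)
  )

# CV MAE and R-squared, averaged across the 5 test folds
mpg_cv %>%
  collect_metrics()

# Fold-by-fold metrics
mpg_cv %>%
  unnest(.metrics)
```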
Exercises
Instructions
- Go to the Course Schedule and find the QMD template for today
- Save this in your STAT 253 Notes folder, NOT your downloads!
- Work through the exercises implementing CV to compare two possible models predicting `height`
- Same directions as before:
- Be kind to yourself/each other
- Collaborate
- DON’T edit starter code (i.e., code with blanks `___`). Instead, copy-paste into a new code chunk below and edit from there.
- Ask me questions as I move around the room
Questions
# Load packages and data
library(tidyverse)
library(tidymodels)
humans <- read.csv("https://kegrinde.github.io/stat253_coursenotes/data/bodyfat50.csv") %>%
  filter(ankle < 30) %>%
  rename(body_fat = fatSiri)
- Review: In-sample metrics
Use the `humans` data to build two separate models of `height`:
# STEP 1: model specification
lm_spec <- ___() %>%
  set_mode(___) %>%
  set_engine(___)

# STEP 2: model estimation
model_1 <- ___ %>%
  ___(height ~ hip + weight + thigh + knee + ankle, data = humans)

model_2 <- ___ %>%
  ___(height ~ chest * age * weight * body_fat + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist, data = humans)
Calculate the in-sample R-squared for both models:
# IN-SAMPLE R^2 for model_1 = ???
model_1 %>%
  ___()

# IN-SAMPLE R^2 for model_2 = ???
model_2 %>%
  ___()
Calculate the in-sample MAE for both models:
# IN-SAMPLE MAE for model_1 = ???
model_1 %>%
  ___(new_data = ___) %>%
  mae(truth = ___, estimate = ___)

# IN-SAMPLE MAE for model_2 = ???
model_2 %>%
  ___(new_data = ___) %>%
  mae(truth = ___, estimate = ___)
Solution
# STEP 1: model specification
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# STEP 2: model estimation
model_1 <- lm_spec %>%
  fit(height ~ hip + weight + thigh + knee + ankle, data = humans)

model_2 <- lm_spec %>%
  fit(height ~ chest * age * weight * body_fat + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist, data = humans)
# IN-SAMPLE R^2 for model_1 = 0.40
model_1 %>%
  glance()
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.401 0.310 1.98 4.42 0.00345 5 -78.8 172. 183.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# IN-SAMPLE R^2 for model_2 = 0.87
model_2 %>%
  glance()
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.874 0.680 1.35 4.51 0.00205 23 -48.4 147. 188.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# IN-SAMPLE MAE for model_1 = 1.55
model_1 %>%
  augment(new_data = humans) %>%
  mae(truth = height, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 1.55
# IN-SAMPLE MAE for model_2 = 0.64
model_2 %>%
  augment(new_data = humans) %>%
  mae(truth = height, estimate = .pred)
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 0.646
- In-sample model comparison
Which model seems “better” by the in-sample metrics you calculated above? Any concerns about either of these models?
Solution
The in-sample metrics are better for `model_2`, but from experience in our previous class, we should expect this to be overfit.
10-fold CV
Complete the code to run 10-fold cross-validation for our two models.
- `model_1`: `height ~ hip + weight + thigh + knee + ankle`
- `model_2`: `height ~ chest * age * weight * body_fat + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist`
# 10-fold cross-validation for model_1
set.seed(253)
model_1_cv <- ___ %>%
  ___(
    ___,
    ___ = vfold_cv(___, v = ___),
    ___ = metric_set(mae, rsq)
  )
# 10-fold cross-validation for model_2
set.seed(253)
model_2_cv <- ___ %>%
  ___(
    ___,
    ___ = vfold_cv(___, v = ___),
    ___ = metric_set(mae, rsq)
  )
Solution
# 10-fold cross-validation for model_1
set.seed(253)
model_1_cv <- lm_spec %>%
  fit_resamples(
    height ~ hip + weight + thigh + knee + ankle,
    resamples = vfold_cv(humans, v = 10),
    metrics = metric_set(mae, rsq)
  )
# STEP 2: 10-fold cross-validation for model_2
set.seed(253)
model_2_cv <- lm_spec %>%
  fit_resamples(
    height ~ chest * age * weight * body_fat + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist,
    resamples = vfold_cv(humans, v = 10),
    metrics = metric_set(mae, rsq)
  )
- Calculating the CV MAE
- Use `collect_metrics()` to obtain the cross-validated MAE and \(R^2\) for both models.
# HINT
___ %>%
  collect_metrics()
- Interpret the cross-validated MAE and \(R^2\) for `model_1`.
Solution
# model_1
# CV MAE = 1.87, CV R-squared = 0.41
model_1_cv %>%
  collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 mae standard 1.87 10 0.159 Preprocessor1_Model1
2 rsq standard 0.409 10 0.124 Preprocessor1_Model1
# model_2
# CV MAE = 2.47, CV R-squared = 0.53
model_2_cv %>%
  collect_metrics()
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 mae standard 2.47 10 0.396 Preprocessor1_Model1
2 rsq standard 0.526 10 0.122 Preprocessor1_Model1
- We expect our first model to explain roughly 40% of variability in height among new adults, and to produce predictions of height that are off by 1.9 inches on average.
- Details: fold-by-fold results
`collect_metrics()` gave the final CV MAE, or the average MAE across all 10 test folds. `unnest(.metrics)` provides the MAE from each test fold.
- Obtain the fold-by-fold results for the `model_1` cross-validation procedure using `unnest(.metrics)`.
# HINT
___ %>%
  unnest(.metrics)
- Which fold had the worst average prediction error and what was it?
- Recall that `collect_metrics()` reported a final CV MAE of 1.87 for `model_1`. Confirm this calculation by wrangling the fold-by-fold results from part a.
Solution
# a. model_1 MAE for each test fold
model_1_cv %>%
  unnest(.metrics) %>%
  filter(.metric == "mae")
# A tibble: 10 × 7
splits id .metric .estimator .estimate .config .notes
<list> <chr> <chr> <chr> <dbl> <chr> <list>
1 <split [35/4]> Fold01 mae standard 2.22 Preprocessor1_Mo… <tibble>
2 <split [35/4]> Fold02 mae standard 2.34 Preprocessor1_Mo… <tibble>
3 <split [35/4]> Fold03 mae standard 2.56 Preprocessor1_Mo… <tibble>
4 <split [35/4]> Fold04 mae standard 1.51 Preprocessor1_Mo… <tibble>
5 <split [35/4]> Fold05 mae standard 1.81 Preprocessor1_Mo… <tibble>
6 <split [35/4]> Fold06 mae standard 2.43 Preprocessor1_Mo… <tibble>
7 <split [35/4]> Fold07 mae standard 1.61 Preprocessor1_Mo… <tibble>
8 <split [35/4]> Fold08 mae standard 1.84 Preprocessor1_Mo… <tibble>
9 <split [35/4]> Fold09 mae standard 1.28 Preprocessor1_Mo… <tibble>
10 <split [36/3]> Fold10 mae standard 1.10 Preprocessor1_Mo… <tibble>
# b. fold 3 had the worst error (2.56)
# c. use these metrics to confirm the 1.87 CV MAE for model_1
model_1_cv %>%
  unnest(.metrics) %>%
  filter(.metric == "mae") %>%
  summarize(mean(.estimate))
# A tibble: 1 × 1
`mean(.estimate)`
<dbl>
1 1.87
Comparing models
The table below summarizes the in-sample and 10-fold CV MAE for both models.
| Model   | IN-SAMPLE MAE | 10-fold CV MAE |
|---------|--------------:|---------------:|
| model_1 | 1.55          | 1.87           |
| model_2 | 0.64          | 2.47           |
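If you’d like to assemble the CV half of this comparison in code rather than by hand, one possible approach (a sketch, assuming the `model_1_cv` and `model_2_cv` objects created above) is:

```r
# Gather the CV MAE for both models into one small table
bind_rows(
  model_1_cv %>% collect_metrics() %>% mutate(model = "model_1"),
  model_2_cv %>% collect_metrics() %>% mutate(model = "model_2")
) %>%
  filter(.metric == "mae") %>%
  select(model, cv_mae = mean)
```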
- Based on the in-sample MAE alone, which model appears better?
- Based on the CV MAE alone, which model appears better?
- Based on all of these results, which model would you pick?
- Do the in-sample and CV MAE suggest that `model_1` is overfit to our `humans` sample data? What about `model_2`?
Solution
- `model_2`
- `model_1`
- `model_1`, since `model_2` produces bad predictions for new adults
- `model_1` is NOT overfit: its predictions of height for new adults seem roughly as accurate as its predictions for the adults in our sample. `model_2` IS overfit: its predictions of height for new adults are worse than its predictions for the adults in our sample.
- LOOCV
- Reconsider `model_1`. Instead of estimating its prediction accuracy using the 10-fold CV MAE, use the LOOCV MAE. THINK: How many people are in our `humans` sample?
- How does the LOOCV MAE compare to the 10-fold CV MAE of 1.87? NOTE: These are just two different approaches to estimating the same thing: the typical prediction error when applying our model to new data. Thus we should expect them to be similar.
- Explain why we technically don’t need to `set.seed()` for the LOOCV algorithm.
Solution
- There are 39 people in our sample, thus LOOCV is equivalent to 39-fold CV:
nrow(humans)
[1] 39
model_1_loocv <- lm_spec %>%
  fit_resamples(
    height ~ hip + weight + thigh + knee + ankle,
    resamples = vfold_cv(humans, v = nrow(humans)),
    metrics = metric_set(mae)
  )
model_1_loocv %>%
  collect_metrics()
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 mae standard 1.82 39 0.179 Preprocessor1_Model1
- The LOOCV MAE (1.82) is very similar to the 10-fold CV MAE (1.87).
- There’s no randomness in the test folds. Each test fold is a single person.
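A quick sketch of that last point (reusing `lm_spec` and `humans` from above): running LOOCV with two different seeds returns the exact same CV MAE, because the seed only shuffles which single person gets labeled Fold01, Fold02, and so on.

```r
# LOOCV MAE with two different seeds: the results should match exactly
loocv_mae <- function(seed) {
  set.seed(seed)
  lm_spec %>%
    fit_resamples(
      height ~ hip + weight + thigh + knee + ankle,
      resamples = vfold_cv(humans, v = nrow(humans)),
      metrics = metric_set(mae)
    ) %>%
    collect_metrics()
}

loocv_mae(1)
loocv_mae(2000)  # same CV MAE as above
```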
- Data drill
- Calculate the average height of people under 40 years old vs people 40+ years old.
- Plot height vs age among our subjects that are 30+ years old.
- Fix this code:
model_3<-lm_spec%>%fit(height~age,data=humans)
model_3%>%tidy()
Solution
# a (one of many solutions)
humans %>%
  mutate(younger_older = age < 40) %>%
  group_by(younger_older) %>%
  summarize(mean(height))
# A tibble: 2 × 2
younger_older `mean(height)`
<lgl> <dbl>
1 FALSE 70.4
2 TRUE 69.8
# b
humans %>%
  filter(age >= 30) %>%
  ggplot(aes(x = age, y = height)) +
  geom_point()
# c
model_3 <- lm_spec %>%
  fit(height ~ age, data = humans)

model_3 %>%
  tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 71.1 1.63 43.7 1.96e-33
2 age -0.0210 0.0363 -0.577 5.67e- 1
Reflection: Part 1
The “regular” exercises are over but class is not done! Your group should agree to either work on HW1 or the remaining reflection questions.

This is the end of Unit 1 on “Regression: Model Evaluation”! Let’s reflect on the technical content of this unit:
- What was the main motivation / goal behind this unit?
- What are the four main questions that were important to this unit?
- For each of the following tools, describe how they work and what questions they help us address:
- R-squared
- residual plots
- out-of-sample MAE
- in-sample MAE
- validation
- cross-validation
- In your own words, define the following:
- overfitting
- algorithm
- tuning parameter
- Review the new `tidymodels` syntax from this unit. Identify key themes and patterns.
Reflection: Part 2
The reflection above addresses your understanding of/progress toward our course learning goals. Consider the other components that have helped you work toward this learning throughout Unit 1.

With respect to collaboration, reflect upon your strengths and what you might change in the next unit:
- How actively did you contribute to group discussions?
- How actively did you include all other group members in discussion?
- In what ways did you (or did you not) help create a space where others feel comfortable making mistakes & sharing their ideas?
With respect to engagement, reflect upon your strengths and what you might change in the next unit:
- Did you regularly attend, arrive on time for, & stay for the full class sessions?
- Did you miss no more than 3 in-person class sessions?
- Were you actively present during class (eg: not on your phone, not working on other courses, etc)?
- Did you stay updated on Slack?
- When you had questions, did you ask them on Slack or in OH?
With respect to preparation, how many of checkpoints 1–3 did you complete and pass?
With respect to exploration, did you complete and pass HW0? Are you on track to complete and pass HW1?
Done!
- Knit/render your notes.
- Check the solutions in the online manual.
- Check out the wrap-up steps below.
- If you finish all that during class, work on your homework!
Wrap-Up
Today’s Material
- If you didn’t finish the activity, no problem! Be sure to complete the activity outside of class, review the solutions in the course site, and ask any questions on Slack or in office hours.
- This is the end of Unit 1, so there are reflection questions at the bottom to help you organize the concepts in your mind.
- An R Tutorial Video, talking through the new code, is posted under the materials for today’s class on the Course Schedule. This video is OPTIONAL. Decide what’s right for you.
Upcoming Deadlines
- CP4:
- due 10 minutes before our next class
- covers one R code video
- HW1:
- due next Tuesday at 11:59 pm
- start today if you haven’t already!
- review the homework and late work/extension policies on Moodle/Syllabus
- universal flexibility: pass/revise grading (as long as your original submission meets certain criteria including on-time submission), late work grace period
- deadline is so we can get timely feedback to you; if you cannot make a deadline, please send me an email/Slack DM (in advance!) and let me know how much time you need