5  Model Selection

Settling In

  • Sit with your NEW group
    • Introduce yourselves!
    • Come up with a team name that reflects the majors represented in your group WITHOUT including the names of those majors (e.g., “data rocks!” instead of “statistics geology”)
  • Catch up on any Slack messages you might have missed

Prepare to take notes

  • Locate and download the Part 1 QMD for today (see the Schedule page).
  • Save this QMD in your “STAT 253 > Notes” folder.
  • Render the QMD and check out the general structure.
  • You’ll need the reshape2 package today. Install this if you haven’t already.
  • WAIT to open Part 2 until later in class!
  • I also encourage you to take notes (in a notebook) on any key concepts and new code structures we see today.


Learning Goals

Statistical Machine Learning Concepts

  • Gain intuition about different approaches to variable selection
  • Clearly describe the forward and backward stepwise selection algorithms and why they are examples of greedy algorithms
  • Compare best subset and stepwise algorithms in terms of optimality of output and computational time
  • Describe how selection algorithms can give a measure of variable importance


General Skills

Collaborative Learning:

  • Understand and demonstrate characteristics of effective collaboration (team roles, interpersonal communication, self-reflection, awareness of social dynamics, advocating for yourself and others).
  • Develop a common purpose and agreement on goals.
  • Be able to contribute questions or concerns in a respectful way.
  • Share and contribute to the group’s learning in an equitable manner.

See the Learning Goals section of our course website for more info!

Reflection

Collaborative Learning

Take 5 minutes to free-write in response to the following prompts, reflecting upon your strengths and areas for growth with respect to collaboration.

In Unit 1:

  • How actively did you contribute to group discussions?
  • How actively did you include ALL other group members in discussion?
  • In what ways did you (or did you not) help create a space where others feel comfortable making mistakes & sharing their ideas?

More generally:

  • What has/hasn’t worked well for you when it comes to working on in-class exercises in small groups (in this class or others)?
  • What would you like to try, or avoid, with this new group?

Unit 1 Reflection (continued)

If you did not finish Exercises #9 and #10 from last class, please take time after class today to do so.


Notes: Model Selection

Context

  • world = supervised learning
    We want to model some output variable \(y\) using a set of potential predictors (\(x_1, x_2, ..., x_p\)).

  • task = regression
    \(y\) is quantitative

  • model = linear regression
    We’ll assume that the relationship between \(y\) and (\(x_1, x_2, ..., x_p\)) can be represented by

    \[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \varepsilon\]

Inferential v. Predictive Models

In model building, the decision of which predictors to use depends upon our goal.

Inferential models

  • Goal: Explore & test hypotheses about a specific relationship.
  • Predictors: Defined by the goal.
  • Example: An economist wants to understand how salaries (\(y\)) vary by age (\(x_1\)) while controlling for education level (\(x_2\)).

Predictive models

  • Goal: Produce the “best” possible predictions of \(y\).
  • Predictors: Any combination of predictors that help us meet this goal.
  • Example: A mapping app wants to provide users with quality estimates of arrival time (\(y\)) utilizing any useful predictors (eg: time of day, distance, route, speed limit, weather, day of week, traffic radar…)

Model Selection Goals

Model selection algorithms can help us build a predictive model of \(y\) using a set of potential predictors (\(x_1, x_2, ..., x_p\)).

There are 3 general approaches to this task:

  1. Variable selection (today)
    Identify a subset of predictors to use in our model of \(y\).

  2. Shrinkage / regularization (next class)
    Shrink / regularize the coefficients of all predictors toward or to 0.

  3. Dimension reduction (later in the semester)
    Combine the predictors into a smaller set of new predictors.


Exercises: Part 1

Instructions

As a group, you’ll design a variable selection algorithm to pick which predictors to use in a predictive model of height. Specifically, you will:

  • 15 mins: come up with one algorithm, document it, and try it
  • 5 mins: try another group’s algorithm

NOTE: This will NOT be perfect! Our goals are to:

  • Have fun and work together!
  • Tap into your intuition for key questions and challenges in variable selection.
  • Deepen your understanding of “algorithms” and “tuning parameters” by designing and communicating your own.

Questions

Let’s build a predictive model of height in inches using one or more of 12 possible predictors. Other than age and weight, these are circumferences measured in cm.

# Load packages
library(tidyverse)
library(tidymodels)

# Load data
humans <- read.csv("https://mac-stat.github.io/data/bodyfat1.csv")
names(humans)
 [1] "age"     "weight"  "neck"    "chest"   "abdomen" "hip"     "thigh"  
 [8] "knee"    "ankle"   "biceps"  "forearm" "wrist"   "height" 


A heat map displays correlations for each pair of variables in our dataset. Not only is height correlated with multiple predictors, but the predictors are also correlated with one another (multicollinearity)! We don’t need all of them in our model.

# Get the correlation matrix
library(reshape2)
cor_matrix <- cor(humans)
cor_matrix[lower.tri(cor_matrix)] <- NA
cor_matrix <- cor_matrix %>% 
  melt() %>% 
  na.omit() %>% 
  rename(correlation = value)

# Visualize the correlation for each pair of variables
ggplot(cor_matrix, aes(x = Var1, y = Var2, fill = correlation)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(
    low = "blue", high = "red", mid = "white", 
    midpoint = 0, limit = c(-1,1)) +
  labs(x = "", y = "") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
  coord_fixed()

  1. Design your own algorithm (15 minutes)
    • Do not use any materials from outside this class.
    • Document your algorithm in words (not code) in this google doc.
    • Your algorithm must:
      • be clear to other humans
      • be clear to a machine (cannot utilize context)
      • lead to a single model that uses 0-12 of our predictors
      • define and provide directions for selecting any tuning parameters
    • Implement as many steps of your algorithm as possible in the time allotted. You can modify the code below to build and evaluate the models in your algorithm:
# STEP 1: model specification
lm_spec <- linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("lm")

# STEP 2: model estimation
height_model_1 <- lm_spec %>% 
  fit(height ~ age + weight + neck + chest + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist, data = humans)

# Check it out
height_model_1 %>% 
  tidy()

# CV MAE
set.seed(253)
lm_spec %>% 
  fit_resamples(
    height ~ age + weight + neck + chest + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist,
    resamples = vfold_cv(humans, v = 10), 
    metrics = metric_set(mae)
  ) %>% 
  collect_metrics()


  2. Test another group’s algorithm (5 minutes)
    Try to implement the next algorithm below yours (or the first algorithm if your group’s is last). Think: Are the steps clear? What are the drawbacks to the algorithm?


Notes: Variable Selection

Open the Part 2 QMD to take notes.

Goal

Let’s consider three existing variable selection algorithms.

Heads up: these algorithms are important to building intuition for the questions and challenges in model selection, BUT have major drawbacks.

Example 1: Best Subset Selection Algorithm

  • Build all \(2^p\) possible models that use any combination of the available predictors \((x_1, x_2,..., x_p)\).
  • Identify the best model with respect to some chosen metric (eg: CV MAE) and context.

Suppose we used this algorithm for our height model with 12 possible predictors. What’s the main drawback?

Example 2: Backward Stepwise Selection Algorithm

  • Build a model with all \(p\) possible predictors, \((x_1, x_2,..., x_p)\).
  • Repeat the following until only 1 predictor remains in the model:
    • Remove the 1 predictor with the biggest p-value.
    • Build a model with the remaining predictors.
  • You now have \(p\) competing models: one with all \(p\) predictors, one with \(p-1\) predictors, …, and one with 1 predictor. Identify the “best” model with respect to some metric (eg: CV MAE) and context.

. . .

Let’s try out the first few steps!

. . .

# Load packages and data
library(tidyverse)
library(tidymodels)
humans <- read.csv("https://mac-stat.github.io/data/bodyfat1.csv")
# STEP 1: model specifications
lm_spec <- linear_reg() %>% 
  set_mode("regression") %>% 
  set_engine("lm")
# STEP 2: model estimate (using all 12 predictors to start)
# Pick apart this code and make it easier to identify the least "significant" predictor!!!
lm_spec %>% 
  fit(height ~ age + weight + neck + chest + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist, 
      data = humans) %>% 
  tidy() %>% 
  filter(term != "(Intercept)") %>% 
  mutate(p.value = round(p.value, 4))
# 11 predictors (tweak the code)
lm_spec %>% 
  fit(height ~ age + weight + neck + chest + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist,
      data = humans) %>% 
  tidy() %>% 
  filter(term != "(Intercept)") %>% 
  mutate(p.value = round(p.value, 4))
# 10 predictors (tweak the code)
lm_spec %>% 
  fit(height ~ age + weight + neck + chest + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist,
      data = humans) %>% 
  tidy() %>% 
  filter(term != "(Intercept)") %>% 
  mutate(p.value = round(p.value, 4))

Example 3: Backward Stepwise Selection Step-by-Step Results

Below is the complete model sequence along with 10-fold CV MAE for each model (using set.seed(253)).

predictors   CV MAE   predictor list
    12        5.728   weight, hip, forearm, thigh, chest, abdomen, age, ankle, wrist, knee, neck, biceps
    11        5.523   weight, hip, forearm, thigh, chest, abdomen, age, ankle, wrist, knee, neck
    10        5.413   weight, hip, forearm, thigh, chest, abdomen, age, ankle, wrist, knee
     9        5.368   weight, hip, forearm, thigh, chest, abdomen, age, ankle, wrist
     8        5.047   weight, hip, forearm, thigh, chest, abdomen, age, ankle
     7        5.013   weight, hip, forearm, thigh, chest, abdomen, age
     6        4.684   weight, hip, forearm, thigh, chest, abdomen
     5        4.460   weight, hip, forearm, thigh, chest
     4        4.386   weight, hip, forearm, thigh
     3        4.091   weight, hip, forearm
     2        3.733   weight, hip
     1        3.658   weight

DISCUSS:

  1. (Review) Interpret the CV MAE for the model of height by weight alone.

  2. Is this algorithm more or less computationally expensive than the best subset algorithm?

  3. The predictors neck and wrist, in that order, are the most strongly correlated with height. Where do these appear in the backward sequence and what does this mean?

cor(humans)[,'height'] %>% 
  sort()
      thigh         hip         age     abdomen        knee       chest 
-0.11301249 -0.10648937 -0.05853538 -0.02173587  0.02345904  0.05838830 
     biceps       ankle      weight     forearm       wrist        neck 
 0.07441696  0.07920867  0.11228791  0.16968040  0.28967468  0.29147610 
     height 
 1.00000000 
  4. We deleted predictors one at a time. Why is this better than deleting a collection of multiple predictors at the same time (eg: kicking out all predictors with p-value > 0.1)?

Example 4: Backward Stepwise Selection Final Model

We have to pick just 1 of the 12 models as our final model.

That is, we have to pick a value for our tuning parameter, the number of predictors.

It helps to plot the CV MAE for each model in the sequence.

Here’s what we saw above:

Code
data.frame(
    predictors = c(12:1), 
    mae = c(5.728, 5.523, 5.413, 5.368, 5.047, 5.013, 4.684, 4.460, 4.386, 4.091, 3.733, 3.658)) %>% 
  ggplot(aes(x = predictors, y = mae)) + 
    geom_point() + 
    geom_line() + 
    scale_x_continuous(breaks = c(1:12))
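
If you’d rather not eyeball the plot, you can pull out the model size with the smallest CV MAE programmatically. A minimal sketch reusing the same values (the mae_results name is our own; the data frame above wasn’t assigned a name):

mae_results <- data.frame(
    predictors = c(12:1), 
    mae = c(5.728, 5.523, 5.413, 5.368, 5.047, 5.013, 4.684, 4.460, 4.386, 4.091, 3.733, 3.658))

# Which model size minimizes the CV MAE?
mae_results %>% 
  filter(mae == min(mae))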

  1. Which model do you pick?!?

  2. In the odd “Goldilocks” fairy tale, a kid comes upon a bear den: the first bear’s bed is too hard, the second bear’s is too soft, and the third bear’s is just right. The second plot, below, illustrates a Goldilocks problem in tuning the number of predictors in our backward stepwise model. Explain.

  • When the number of predictors is too small, the MAE increases because the model is too….
  • When the number of predictors is too large, the MAE increases because the model is too….

Example 5: Machine Learning vs Human Learning

When tuning or finalizing a model building algorithm, we (humans!) have our own choices to make. For one, we need to decide what we prefer:

  • a model with the lowest prediction errors; or
  • a more parsimonious model: one with slightly higher prediction errors but fewer predictors

In deciding, here are some human considerations:

  • goal: How will the model be used? Should it be easy for humans to interpret and apply?
  • cost: How many resources (time, money, computer memory, etc.) do the model and the data it needs require?
  • impact: What are the consequences of a bad prediction?

For each scenario below, which model would you pick: (1) the model with the lowest prediction errors; or (2) a parsimonious model with slightly worse predictions?

  1. Google asks us to re-build their search algorithm.

  2. A small non-profit hires us to help them build a predictive model of the donation dollars they’ll receive throughout the year.

Example 6: Forward Stepwise Selection Algorithm

  • How do you think this works?
  • Is it more or less computationally expensive than backward stepwise?

WARNING

Variable selection algorithms are a nice, intuitive place to start our discussion of model selection techniques.

BUT we will not use them.

They are frowned upon in the broader ML community, so much so that tidymodels doesn’t even implement them! Why?

  • Best subset selection is computationally expensive.
  • Backward stepwise selection:
    • is greedy – it makes locally optimal decisions, thus often misses the globally optimal model
    • overestimates the significance of remaining predictors, thus shouldn’t be used for inference
  • Forward stepwise selection:
    • is computationally expensive
    • can produce odd combinations of predictors (eg: a new predictor may render previously included predictors non-significant).


Exercises: Part 2

Instructions

  • Goal: become familiar with new code structures (recipes and workflows)
  • Ask me questions as I move around the room.

Questions

The video for today introduced the concepts of recipes and workflows in the tidymodels framework. These concepts will become important to our new modeling algorithms. Though they aren’t necessary for linear regression models, let’s explore them in this familiar setting.

Run through the following discussion and code one step at a time. Take note of the general process, concepts, and questions you have.

STEP 1: model specification

This specifies the structure or general modeling algorithm we plan to use.

It does not specify anything about the variables of interest or our data.

lm_spec <- linear_reg() %>%
  set_mode("regression") %>% 
  set_engine("lm")

# Check it out
lm_spec

STEP 2: recipe specification

Just as a cooking recipe specifies the ingredients and how to prepare them, a tidymodels recipe specifies:

  • the variables in our relationship of interest (the ingredients)
  • how to pre-process or wrangle these variables (how to prepare the ingredients)
  • the data we’ll use to explore these variables (where to find the ingredients)

It does not specify anything about the model structure we’ll use to explore this relationship.

# A simple recipe with NO pre-processing
data_recipe <- recipe(height ~ wrist + ankle, data = humans)

# Check it out
data_recipe
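
For contrast, here’s a sketch of a recipe that does include a pre-processing step. (step_normalize() and all_numeric_predictors() come from the recipes package; normalizing isn’t needed for plain linear regression, it’s just an illustration.)

# A recipe WITH pre-processing: standardize all numeric predictors
normalized_recipe <- recipe(height ~ wrist + ankle, data = humans) %>% 
  step_normalize(all_numeric_predictors())

# Check it out
normalized_recipe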

STEP 3: workflow creation (model + recipe)

This specifies the general workflow of our modeling process, including our model structure and our variable recipe.

model_workflow <- workflow() %>%
  add_recipe(data_recipe) %>%
  add_model(lm_spec)

# Check it out
model_workflow

STEP 4: Model estimation

This step estimates or fits our model of interest using our entire sample data.

The model (lm_spec) and variable details (here just height ~ wrist + ankle) are specified in the workflow, so we do not need to give that information again!

my_model <- model_workflow %>% 
  fit(data = humans)

STEP 5: Model evaluation

To get in-sample metrics, use my_model like normal.

# example: calculate in-sample metrics
my_model %>% 
  glance()

To get CV metrics, pass the workflow to fit_resamples along with information about how to randomly create folds.

set.seed(253)
my_model_cv <- model_workflow %>% 
  fit_resamples(resamples = vfold_cv(humans, v = 10),
                metrics = metric_set(rsq))

Then, proceed as usual… (my_model_cv %>% collect_metrics(), etc.)
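
One payoff of workflows is that we can swap out pieces without rebuilding everything. Here’s a sketch using update_recipe() from the workflows package (the new_recipe name and its predictor set are our own choices, just for illustration):

# Swap in a recipe with a different predictor set
new_recipe <- recipe(height ~ wrist + ankle + forearm, data = humans)
new_workflow <- model_workflow %>% 
  update_recipe(new_recipe)

# Re-fit and peek at the coefficients
new_workflow %>% 
  fit(data = humans) %>% 
  extract_fit_parsnip() %>% 
  tidy()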


Solutions

Exercises: Part 1

  1. Design your own algorithm.
Solution

There is no “correct” answer to this question — we just want to see what you can come up with! With that said, a “good” algorithm should:

  • be clear to other humans
  • be clear to a machine (eg cannot utilize context)
  • lead to a single model that uses 0–12 of our predictors
  • define and provide directions for selecting any tuning parameters


  2. Test another group’s algorithm.
Solution

Again, there is no “correct” answer here. But you should reflect on the following questions: Are the steps clear? What are the drawbacks to the algorithm? Share any feedback you have in the google doc!


Notes: Variable Selection

Example 1: Best Subset Selection Algorithm

Solution

It’s computationally expensive. For our humans example, we’d need to build 4096 models:

2^12
[1] 4096
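
Equivalently, we can count the models of each size and sum (a quick sanity check using base R’s choose()):

# 1 model with 0 predictors + 12 models with 1 predictor + ... + 1 model with all 12
sum(choose(12, 0:12))
[1] 4096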


Example 2: Backward Stepwise Selection Algorithm

Solution
# All 12 predictors
lm_spec %>% 
  fit(height ~ age + weight + neck + chest + abdomen + hip + thigh + knee + ankle + biceps + forearm + wrist,
      data = humans) %>% 
  tidy() %>%  # use tidy to get p-values for each coefficient
  filter(term != "(Intercept)") %>% # exclude the intercept
  mutate(p.value = round(p.value, 4)) %>% # round the p-values for easier viewing
  arrange(desc(p.value)) # added this line to arrange from largest to smallest p-value
# A tibble: 12 × 5
   term    estimate std.error statistic p.value
   <chr>      <dbl>     <dbl>     <dbl>   <dbl>
 1 biceps   -0.0808     0.746    -0.108  0.915 
 2 neck      0.139      1.17      0.119  0.906 
 3 knee      0.151      0.941     0.160  0.874 
 4 wrist     0.836      2.32      0.361  0.721 
 5 ankle    -0.888      1.28     -0.693  0.494 
 6 abdomen   0.283      0.354     0.798  0.432 
 7 age      -0.112      0.132    -0.847  0.405 
 8 chest    -0.459      0.473    -0.971  0.340 
 9 forearm   2.25       1.80      1.25   0.223 
10 weight    0.379      0.213     1.78   0.0864
11 hip      -0.921      0.510    -1.81   0.0822
12 thigh    -1.24       0.646    -1.92   0.066 
# 11 predictors (got rid of biceps)
lm_spec %>% 
  fit(height ~ age + weight + neck + chest + abdomen + hip + thigh + knee + ankle + forearm + wrist,
      data = humans) %>% 
  tidy() %>% 
  filter(term != "(Intercept)") %>% 
  mutate(p.value = round(p.value, 4)) %>% 
  arrange(desc(p.value))
# A tibble: 11 × 5
   term    estimate std.error statistic p.value
   <chr>      <dbl>     <dbl>     <dbl>   <dbl>
 1 neck       0.161     1.14      0.142  0.888 
 2 knee       0.180     0.886     0.203  0.841 
 3 wrist      0.907     2.18      0.416  0.681 
 4 ankle     -0.878     1.26     -0.699  0.490 
 5 abdomen    0.281     0.348     0.809  0.425 
 6 age       -0.111     0.130    -0.858  0.398 
 7 chest     -0.453     0.461    -0.982  0.334 
 8 forearm    2.17      1.62      1.34   0.192 
 9 hip       -0.902     0.470    -1.92   0.0652
10 weight     0.369     0.190     1.94   0.0623
11 thigh     -1.26      0.602    -2.09   0.0454
# 10 predictors (got rid of neck)
lm_spec %>% 
  fit(height ~ age + weight  + chest + abdomen + hip + thigh + knee + ankle + forearm + wrist,
      data = humans) %>% 
  tidy() %>% 
  filter(term != "(Intercept)") %>% 
  mutate(p.value = round(p.value, 4)) %>% 
  arrange(desc(p.value))
# A tibble: 10 × 5
   term    estimate std.error statistic p.value
   <chr>      <dbl>     <dbl>     <dbl>   <dbl>
 1 knee       0.166     0.866     0.191  0.850 
 2 wrist      0.985     2.07      0.475  0.639 
 3 ankle     -0.884     1.23     -0.716  0.480 
 4 age       -0.111     0.127    -0.869  0.392 
 5 abdomen    0.298     0.322     0.924  0.363 
 6 chest     -0.460     0.451    -1.02   0.316 
 7 forearm    2.29      1.37      1.66   0.107 
 8 weight     0.377     0.179     2.11   0.0435
 9 thigh     -1.26      0.591    -2.14   0.0409
10 hip       -0.931     0.416    -2.24   0.0331


Example 3: Backward Stepwise Selection Step-by-Step Results

Solution

If you’re curious, here’s some code to implement all steps of the backward stepwise selection algorithm:

Code
# setup
predictors <- c("age", "weight", "neck", "chest", "abdomen", "hip", "thigh", "knee", "ankle", "biceps", "forearm", "wrist")
p <- length(predictors)
kick_out <- rep(0, p)
cvs <- rep(0, p)

# loop through model sizes: from all p predictors down to 1
for(i in 1:p){
  # fit the model and tidy its coefficient table, largest p-value first
  my_model <- lm_spec %>% 
    fit(as.formula(paste("height ~ ", paste(predictors, collapse = "+"))), 
        data = humans) %>% 
    tidy() %>% 
    filter(term != "(Intercept)") %>% 
    arrange(desc(p.value))
  
  # use 10-fold CV to get MAE for model
  set.seed(253)
  cv_process <- lm_spec %>% 
      fit_resamples(
        as.formula(paste("height ~ ", paste(predictors, collapse = "+"))),
        resamples = vfold_cv(humans, v = 10), 
        metrics = metric_set(mae)
      ) %>% 
      collect_metrics()
  
  # get name of worst variable (biggest p-value)
  worst <- as.data.frame(my_model)[1,1]
  kick_out[i] <- worst
  
  # get rid of worst variable from predictor list
  predictors <- predictors[predictors != worst]
  
  # save CV MAE for this model
  cvs[i] <- as.data.frame(cv_process)$mean
}

kick_out
##  [1] "biceps"  "neck"    "knee"    "wrist"   "ankle"   "age"     "abdomen"
##  [8] "chest"   "thigh"   "forearm" "hip"     "weight"
cvs
##  [1] 5.727898 5.522730 5.413292 5.368143 5.047014 5.013322 4.684182 4.460353
##  [9] 4.385650 4.090791 3.733412 3.658251
  1. Using a linear model with only weight to predict height, our predictions would be off by, on average, roughly 3.66 inches from the truth on new data. In other words, we estimate that this model can predict a new person’s height within, on average, 3.66 inches.
  2. Less. We only have to build 12 models.
  3. Both neck and wrist are kicked out early! The 1-predictor model produced by this algorithm isn’t necessarily the best 1-predictor model (same for any number of predictors).
cor(humans)[,'height'] %>% 
  sort()
      thigh         hip         age     abdomen        knee       chest 
-0.11301249 -0.10648937 -0.05853538 -0.02173587  0.02345904  0.05838830 
     biceps       ankle      weight     forearm       wrist        neck 
 0.07441696  0.07920867  0.11228791  0.16968040  0.28967468  0.29147610 
     height 
 1.00000000 
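
To check this claim yourself, you could compare the CV MAE of the weight-only model to a 1-predictor model using neck, the predictor most strongly correlated with height (a sketch; run it to see the results):

# 1-predictor model chosen by backward stepwise: weight
set.seed(253)
lm_spec %>% 
  fit_resamples(
    height ~ weight,
    resamples = vfold_cv(humans, v = 10), 
    metrics = metric_set(mae)
  ) %>% 
  collect_metrics()

# 1-predictor model using neck instead
set.seed(253)
lm_spec %>% 
  fit_resamples(
    height ~ neck,
    resamples = vfold_cv(humans, v = 10), 
    metrics = metric_set(mae)
  ) %>% 
  collect_metrics()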
  4. The value of the coefficient (and thus the p-value) is dependent on the other variables in the model as we are accounting for or conditioning on them.


Example 4: Backward Stepwise Selection Final Model

Solution
  1. Based on our data, I think the model with 1 predictor seems pretty reasonable! It’s simple and has the smallest CV MAE. If you’re worried that model is too simple, the model with 2 predictors also has a similarly small MAE.
    If I were looking at the other MAE plot, though, I might pick a model with 1 (the simplest), 2 (still simple, but better MAE than 1 predictor), or 5 predictors (the model with the best CV MAE).
  2. Too few predictors: the model is too simple. Too many predictors: the model is overfit.


Example 5: Machine Learning vs Human Learning

Solution
  1. 1
  2. 2


Example 6: Forward Stepwise Selection Algorithm

Solution
  • Start with 0 predictors. Fit all possible models with 1 predictor. Add the predictor with the smallest p-value. To this model, add a second predictor with the smallest p-value. Continue until all predictors are in the model, then identify the “best” of these models with respect to some metric (eg: CV MAE). (More details in ISLR Section 6.1.2.)

  • More. For 12 predictors, we’d have to build 12 models in step 1, 11 models in step 2, etc. Thus 12 + 11 + … + 1 = 78 models total.
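
If you’re curious, here’s a rough sketch of the forward algorithm, mirroring the backward loop shown earlier. It assumes the lm_spec and humans objects from above; this is our own illustration, not a tidymodels function:

# setup: start with no predictors in the model
all_preds <- c("age", "weight", "neck", "chest", "abdomen", "hip", "thigh",
               "knee", "ankle", "biceps", "forearm", "wrist")
in_model <- c()

# at each step, add the candidate predictor with the smallest p-value
for(step in 1:length(all_preds)){
  remaining <- setdiff(all_preds, in_model)
  pvals <- sapply(remaining, function(x){
    form <- as.formula(paste("height ~", paste(c(in_model, x), collapse = " + ")))
    lm_spec %>% 
      fit(form, data = humans) %>% 
      tidy() %>% 
      filter(term == x) %>% 
      pull(p.value)
  })
  in_model <- c(in_model, remaining[which.min(pvals)])
}

# the order in which predictors entered the model
in_model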


Exercises: Part 2

No solutions for this part.


Wrap-Up

  • Today’s Material
    • If you didn’t finish the exercises, be sure to finish outside of class, check solutions, and bring questions to office hours (or post on Slack)!
  • Upcoming Due Dates
    • TODAY by 11:59 pm: HW1
    • Before our next class: CP5
    • Next week: HW2 (start today!)