library(fivethirtyeight)
library(tidyverse)

data("candy_rankings")

candy_final <- candy_rankings %>%
  select(-competitorname) %>%
  mutate_if(is.logical, as.factor)
20 Principal Component Regression
Unsupervised & supervised learning are friends!
Settling In
- Sit with the same group as last class
- If you were gone: check in with me to get your new group number
- Re-introduce yourselves!
- Prepare to take notes
- Catch up on any announcements you’ve missed on Slack
Announcements
Quiz 3 is coming up next week!
- Format: same as Quizzes 1 and 2
- Content: cumulative, but focus on unsupervised learning
- Study Tips:
- Use the course Learning Goals as a study guide
- Fill out the STAT 253 Concepts Maps (slides 10–12)
- Work on Group Assignment 3 (instructions on Moodle)
- Review old CPs, HWs, and in-class exercises
Notes: PC Regression
Context
We’ve been distinguishing 2 broad areas in machine learning:
- supervised learning: when we want to predict / classify some outcome y using predictors x
- unsupervised learning: when we don’t have any outcome variable y, only features x
- clustering: examine structure among the rows with respect to x
- dimension reduction: examine & combine structure among the columns x
BUT sometimes we can combine these ideas.
Combining Forces: Clustering + Regression
Use dimension reduction to visualize / summarize lots of features and notice interesting groups.
Example: many physical characteristics of penguins, many characteristics of songs, etc.

Use clustering to identify interesting groups.

Example: types (species) of penguins, types (genres) of songs, etc.

These groups might then become our \(y\) outcome variable in future analysis.
Example: classify new songs as one of the “genres” we identified
EXAMPLE: K-means clustering + Classification of news articles
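To make this cluster-then-classify pipeline concrete, here is a minimal sketch using base R’s kmeans() and the rpart package on the built-in iris measurements. The data choice and object names (features, km, labeled, tree_fit) are my own illustration, standing in for the penguin / song / news-article examples rather than reproducing them:

library(rpart)

# Unsupervised step: find groups among the rows using only the features x
features <- iris[, 1:4]
set.seed(253)
km <- kmeans(scale(features), centers = 3)

# The discovered groups become a categorical outcome y ...
labeled <- data.frame(features, group = factor(km$cluster))

# ... which a supervised classifier can then use to assign new observations to a group
tree_fit <- rpart(group ~ ., data = labeled)
predict(tree_fit, newdata = features[1:3, ], type = "class")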
Dimension Reduction + Regression: Dealing with lots of predictors
Suppose we have an outcome variable \(y\) (quantitative OR categorical) and lots of potential predictors \(x_1, x_2, ..., x_p\).
Perhaps we even have more predictors than data points (\(p > n\))!
This idea of measuring lots of things on a sample is common in genetics, image processing, video processing, or really any scenario where we can grab data on a bunch of different features at once.
For simplicity, computational efficiency, avoiding overfitting, etc., it might benefit us to simplify our set of predictors.
There are a few approaches:
- variable selection (eg: using backward stepwise)
  Simply kick out some of the predictors. NOTE: This doesn’t work when \(p > n\).
- regularization (eg: using LASSO)
  Shrink the coefficients toward / to 0. NOTE: This doesn’t work when \(p > n\).
- feature extraction (eg: using PCA)
  Identify & utilize only the most salient features of the original predictors. Specifically, combine the original, possibly correlated predictors into a smaller set of uncorrelated predictors which retain most of the original information. NOTE: This does work when \(p > n\).
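As a quick illustration of the feature extraction idea, the sketch below runs PCA on a few correlated numeric variables from the built-in mtcars data (an assumption for illustration only, not course data) and checks that the resulting PCs are uncorrelated while retaining most of the original variance:

pca <- prcomp(mtcars[, c("disp", "hp", "wt", "drat")], scale. = TRUE)
summary(pca)          # proportion of the original variance retained by each PC
round(cor(pca$x), 2)  # the PCs are uncorrelated: off-diagonal correlations are ~0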
Principal Component Regression (PCR)
Step 1
Ignore \(y\) for now. Use PCA to combine the \(p\) original, correlated predictors \(x\) into a set of \(p\) uncorrelated PCs.
Step 2
Keep only the first \(k\) PCs which retain a “sufficient” amount of information from the original predictors.
Step 3
Model \(y\) by these first \(k\) PCs.
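Here is a minimal base R sketch of Steps 1–3 on the built-in mtcars data (an illustration with hypothetical choices of data and k; the tidymodels version we actually use is in the R code notes at the end):

# Step 1: PCA on the predictors only, ignoring the outcome (mpg, column 1)
x   <- mtcars[, -1]
pca <- prcomp(x, scale. = TRUE)

# Step 2: keep only the first k PCs
k   <- 2
pcs <- as.data.frame(pca$x[, 1:k])

# Step 3: model the outcome by those k PCs
pcr_fit <- lm(mtcars$mpg ~ ., data = pcs)
summary(pcr_fit)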
PCR vs Partial Least Squares
When combining the original predictors \(x\) into a smaller set of PCs, PCA ignores \(y\). Thus PCA might not produce the strongest possible predictors of \(y\).
Partial least squares provides an alternative.
Like PCA, it combines the original predictors into a smaller set of uncorrelated features, but considers which predictors are most associated with \(y\) in the process.
Chapter 6.3.2 in ISLR provides an optional overview.
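For the curious, here is a hedged sketch using the pls package (an assumption on my part; this package is not part of our course toolkit) to fit a partial least squares model on built-in data:

library(pls)

# Partial least squares with 2 components; the same package also offers pcr() for comparison
pls_fit <- plsr(mpg ~ ., ncomp = 2, data = mtcars, scale = TRUE, validation = "CV")
summary(pls_fit)  # variance explained and cross-validated error by number of components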
Small Group Discussion
EXAMPLE 1
For each scenario below, indicate which would (typically) be preferable in modeling y by a large set of predictors x: (1) PCR; or (2) variable selection or regularization.
- We have more potential predictors than data points (\(p > n\)).
- It’s important to understand the specific relationships between y and x.
- The x are NOT very correlated.
Solution:
- PCR: we (typically) can’t do variable selection or regularization when \(p > n\).
- Variable selection or regularization: the PCs lose the original meaning of the predictors, so PCR hurts interpretability.
- Variable selection or regularization: when the \(x\) aren’t very correlated, PCR wouldn’t simplify things much (we’d need a lot of PCs to retain the original information).
Exercises
- Make the most of your work time in class!
- These exercises are on HW6.
- IMPORTANT: Remember to set.seed(253) on any exercises that involve randomness.
- If you finish HW6, start Group Assignment 3.
- Principal components regression (PCR)
  Returning to the original candy_rankings data, our final goal will be to build a regression model of winpercent, the popularity of a candy. Let’s start with fresh data, in case anything has been overwritten in the previous exercises (re-run the candy_final setup code at the top of these notes):
- Build a PCR model of winpercent. In doing so:
  - Set the seed to 253.
  - Try utilizing 1 PC, 2 PCs, …, up to 11 PCs (the number of original predictors).
  - Calculate the 10-fold CV MAE for each model.
- Plot the CV MAE vs the number of PCs in each model.
- Using the plot, argue for keeping only one PC in our final model.
- Finalize the PCR model of winpercent using only one PC AND print its tidy() table of coefficients. HINT: finalize_workflow(parameters = list(num_comp = 1))
- Report and interpret the 10-fold CV MAE for this model. HINT: collect_metrics().
- Comparison to LASSO regression
  Next, consider 50 possible LASSO regression models of winpercent, starting from the 11 original predictors:
library(tidymodels)
# STEP 1: LASSO model specification
lasso_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("glmnet") %>%
  set_args(mixture = 1, penalty = tune())
# STEP 2: variable recipe
lasso_recipe <- recipe(winpercent ~ ., data = candy_final) %>%
  step_dummy(all_nominal_predictors())
# STEP 3: workflow
lasso_workflow <- workflow() %>%
  add_recipe(lasso_recipe) %>%
  add_model(lasso_spec)
# STEP 4: Estimate multiple LASSO models using a range of possible lambda values
set.seed(253)
lasso_models <- lasso_workflow %>%
  tune_grid(
    grid = grid_regular(penalty(range = c(-5, 1)), levels = 50),
    resamples = vfold_cv(candy_final, v = 10),
    metrics = metric_set(mae)
  )
- Finalize the LASSO: build a parsimonious LASSO model of winpercent.
- Print a tidy() table of the LASSO coefficients AND report how many of the 11 original predictors are utilized in this model.
- Calculate and interpret the 10-fold CV MAE for this LASSO model.
- Which model of winpercent would you use: the LASSO model or the PCR model? Provide an explanation that would apply to choosing between a LASSO and principal components model in general.
Wrapping Up
Upcoming Due Dates:
- HW6: due TOMORROW (12/4)
- Quiz 2 Revisions: bring to next class (12/5)
- Quiz 3: next Tuesday (12/10)
- Group Assignment 3: next Wednesday (12/11)
- Final Learning Reflection: finals week (12/17)
Notes: R code
Suppose we have a set of sample_data with multiple predictors x, a quantitative outcome y, and (possibly) a column named data_id which labels each data point. We could adjust this code if y were categorical.
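For example, here is a hypothetical toy sample_data (purely an assumption so the template below can be test-driven; substitute your own data in practice):

set.seed(253)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- x1 + rnorm(n, sd = 0.3)   # x1 and x3 are correlated
sample_data <- data.frame(data_id = 1:n, x1 = x1, x2 = x2, x3 = x3,
                          y = 2 * x1 - x2 + rnorm(n))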
RUN THE PCR algorithm
library(tidymodels)
library(tidyverse)
# STEP 1: specify a linear regression model
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")
# STEP 2: variable recipe
# Add a pre-processing step that does PCA on the predictors
# num_comp is the number of PCs to keep (we need to tune it!)
pcr_recipe <- recipe(y ~ ., data = sample_data) %>%
  update_role(data_id, new_role = "id") %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = tune())
# STEP 3: workflow
pcr_workflow <- workflow() %>%
  add_recipe(pcr_recipe) %>%
  add_model(lm_spec)
# STEP 4: Estimate multiple PCR models trying out different numbers of PCs to keep
# For the range, the biggest number you can try is the number of predictors you started with
# Put the same number in levels
set.seed(___)
pcr_models <- pcr_workflow %>%
  tune_grid(
    grid = grid_regular(num_comp(range = c(1, ___)), levels = ___),
    resamples = vfold_cv(sample_data, v = 10),
    metrics = metric_set(mae)
  )
FOLLOW-UP
Processing and applying the results is the same as for our other tidymodels algorithms!
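For example, here is a hedged sketch of that follow-up using standard tune / workflows functions, assuming the blanks above have been filled in (the object names best_k and pcr_final are my own):

autoplot(pcr_models)                               # CV MAE vs number of PCs kept

best_k <- pcr_models %>%
  select_by_one_std_err(metric = "mae", num_comp)  # simplest model within 1 SE of the best

pcr_final <- pcr_workflow %>%
  finalize_workflow(parameters = best_k) %>%
  fit(data = sample_data)

pcr_final %>% extract_fit_parsnip() %>% tidy()     # coefficients on the kept PCs
pcr_models %>% collect_metrics()                   # CV MAE for every number of PCs tried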