9 LOESS & Splines
Settling In
- Prepare to take notes (find today’s QMD in the usual spot)
- Check Slack for any recent messages you may have missed
Important Announcements
- HW3 due next Tuesday (Feb 25)
- review the Important Notes before you begin!
- designed to help you start studying for Quiz 1 (in class on Mar 6)
- Regression review / group assignment during our next two class periods
- read the Group Assignment 1 instructions before our next class
- designed to help you synthesize and practice the concepts from Units 1–3 and prepare for Quiz 1
- MSCS Capstone Days next week (Feb 27–28)
- no class on Thursday!
- instead, attend at least three capstone talks, and submit a short reflection (link on Moodle) by end-of-day Friday
Learning Goals
- Explain the advantages of splines over global transformations and other types of piecewise polynomials
- Explain how splines are constructed by drawing connections to variable transformations and least squares
- Explain how the number of knots relates to the bias-variance tradeoff
- Clearly describe the local regression algorithm for making a prediction
- Explain how bandwidth (span) relates to the bias-variance tradeoff
Context
- world = supervised learning: we want to model some output variable \(y\) using a set of potential predictors (\(x_1, x_2, ..., x_p\)), and the relationship could be non-linear
- task = regression: \(y\) is quantitative
- algorithm = a flexible algorithm
GOAL
Our standard parametric models (eg: linear regression) may be too rigid to represent the relationship between \(y\) and our predictors \(x\). Thus we need more flexible models.
KNN
Discuss the following questions with your group to review key concepts from recent activities.
Small Group Discussion
- KNN Review
In the previous activity, we modeled college Grad.Rate versus Expend, Enroll, and Private using data on 775 schools.
- We chose the KNN with K = 33 because it minimized the CV MAE, i.e. the errors when predicting grad rate for schools outside our sample. We don’t typically worry about a more parsimonious KNN, i.e. a model that has slightly higher prediction errors but is easier to interpret, apply, etc. Why?
- What assumptions did the KNN model make about the relationship of Grad.Rate with Expend, Enroll, and Private? Is this a pro or con?
- What did the KNN model tell us about the relationship of Grad.Rate with Expend, Enroll, and Private? For example, did it give you a sense of whether grad rates are higher at private or public institutions? At institutions with higher or lower enrollments? Is this a pro or con?
- Nonparametric vs Parametric
Nonparametric KNN vs parametric least squares and LASSO:
- When should we use a nonparametric algorithm like KNN?
- When shouldn’t we?
Splines
Notes
Goal:
Build a flexible parametric regression model of \(y\) by one quantitative predictor \(x\),
\[y = f(x) + \varepsilon\]
. . .
Idea:
Fit polynomial models in small localized regions, and constrain them so that the overall function is smooth.
. . .
Definition:
A spline is a piecewise polynomial of degree \(k\) that is a continuous function and has continuous derivatives (of orders \(1,...,k-1\)) at its knot points (break points).
. . .
A spline function \(f\) with knot points \(t_1<...<t_m\),
- is a polynomial of degree \(k\) on each of the intervals \((-\infty,t_1],[t_1,t_2],\ldots,[t_m,\infty)\), and
- the \(j\)th derivative of the spline function \(f\) is continuous at \(t_1,...,t_m\) for \(j=0,1,...,k-1\).
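For example (an optional illustration, not part of the definition above), a degree \(k=1\) spline with a single knot at \(t_1\) can be written using a “truncated” term:
\[f(x) = \beta_0 + \beta_1 x + \beta_2 (x - t_1)_+, \qquad \text{where } (x - t_1)_+ = \max(0, x - t_1)\]
Both pieces equal \(\beta_0 + \beta_1 t_1\) at the knot, so \(f\) is continuous there (the \(j=0\) requirement), but the slope jumps from \(\beta_1\) to \(\beta_1 + \beta_2\), which the definition allows since continuity of derivatives is only required up to order \(k-1 = 0\).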
. . .
More Definitions:
Spline functions can be represented using a basis and can be used within a linear regression framework. We often call this regression splines or B-splines and represent the spline function as
\[f(x) = \beta_0 + \beta_1f_1(x) + \cdots + \beta_{p}f_{p}(x)\]
where the functions \(f_1(x)\),…,\(f_p(x)\) are complex basis functions defined by knots and the degree of the polynomial.
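As a small sketch of this idea (using simulated data, not anything from class), the basis functions are just new columns computed from \(x\), so the spline fit is an ordinary least squares fit on those columns:
# A sketch with simulated data: a natural cubic spline fit is just least
# squares on basis-transformed versions of x
library(splines)

set.seed(253)
sim_data <- data.frame(x = runif(200, 0, 10))
sim_data$y <- sin(sim_data$x) + rnorm(200, sd = 0.3)

# ns() builds the basis columns f_1(x), ..., f_p(x) (here p = 4, i.e. 4 df)
head(ns(sim_data$x, df = 4))

# lm() then estimates beta_0, beta_1, ..., beta_p by least squares
spline_fit <- lm(y ~ ns(x, df = 4), data = sim_data)
coef(spline_fit)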
Piecewise Polynomials in Stat 155
If you had a quadratic model (degree = 2) with 0 knots, how could you fit that model using the tools of STAT 155?
. . .
lm(y ~ x + x2, data) # if you created x2 = x^2
lm(y ~ poly(x, 2), data)
. . .
If you had a piecewise linear model (degree = 1) with 1 knot at \(x=5\), how could you fit that using the tools from Stat 155?
. . .
lm(y ~ x*I(x > 5), data)
. . .
These implementation methods don’t constrain the pieces to be continuous or smooth, so we need additional tools from the splines package!
Splines in tidymodels
To build models with splines in tidymodels, we proceed with the same structure as we use for ordinary linear regression models, but we’ll add some pre-processing steps to our recipe.
. . .
To work with splines, we’ll use tools from the splines package in conjunction with recipes.
- The step_ns() function, based on ns() in that package, creates the transformations needed to create a natural cubic spline function for a quantitative predictor.
- The step_bs() function, based on bs() in that package, creates the transformations needed to create a basis spline (typically called a B-spline) function for a quantitative predictor.
. . .
# Linear Regression Model Spec
lm_spec <- linear_reg() %>%
  set_mode('regression') %>%
  set_engine(engine = 'lm')

# Original Recipe
lm_recipe <- recipe(y ~ x, data = sample_data)

# Polynomial Recipe
poly_recipe <- lm_recipe %>%
  step_poly(x, degree = __) # polynomial (higher degree means more flexible)

# Natural Spline Recipe
ns2_recipe <- lm_recipe %>%
  step_ns(x, deg_free = __) # natural cubic spline (higher deg_free means more knots)

# Basis Spline Recipe
bs_recipe <- lm_recipe %>%
  step_bs(x, options = list(knots = c(__,__))) # b-spline (cubic by default, more knots means higher deg_free)
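As a usage sketch (assuming you’ve filled in the blanks above and have a dataset called sample_data), any one of these recipes then plugs into the usual workflow structure:
# Sketch: combine the natural spline recipe with the linear model spec
ns_wf <- workflow() %>%
  add_recipe(ns2_recipe) %>%
  add_model(lm_spec)

# Fit to the data; the spline transformations happen inside the recipe
ns_fit <- ns_wf %>%
  fit(data = sample_data)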
. . .
- The deg_free argument in step_ns() stands for degrees of freedom: deg_free = # of knots + 1
  - The degrees of freedom are the number of coefficients in the transformation functions that are free to vary (essentially the number of underlying parameters behind the transformations).
  - The knots are chosen using percentiles of the observed values.
- The options argument in step_bs() allows you to specify specific knots: options = list(knots = c(2, 4)) sets a knot at 2 and a knot at 4. (The choice of knots should depend on the observed range of the quantitative predictor and where you see a change in the relationship.)
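If you want to see exactly what columns these steps create (an optional sketch, assuming the ns2_recipe template above with its blank filled in), you can prep and bake the recipe:
# Sketch: peek at the spline basis columns that step_ns() adds
ns2_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  head()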
. . .
What is the difference between natural splines and B-splines?
- B-spline is a tool to incorporate splines into a linear regression setting.
- You have to choose the knots (which can be tricky sometimes).
- These functions can be unstable (high variance) near the boundaries, especially with higher polynomial degrees.
- Natural spline is a variant of the B-spline with additional constraints on the boundaries to reduce variance (the degree of the polynomial near the boundaries is lower).
- Cubic natural splines are the most common
- Typically knots are chosen based on quantiles of the predictor (e.g. 1 knot will be placed at the median, 2 knots will be placed at the 33rd and 66th percentiles, etc.)
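One optional way to see the boundary difference for yourself (a sketch with simulated data, not part of the activity) is to fit both spline types and compare predictions near the edges of the observed \(x\) range:
# Sketch: compare a B-spline fit and a natural spline fit near the boundaries
library(splines)

set.seed(253)
sim_data <- data.frame(x = runif(200, 0, 10))
sim_data$y <- sin(sim_data$x) + rnorm(200, sd = 0.3)

bs_fit <- lm(y ~ bs(x, df = 5), data = sim_data) # cubic B-spline
ns_fit <- lm(y ~ ns(x, df = 5), data = sim_data) # natural spline: linear beyond the boundary knots

# Predictions at the extremes of the observed range often differ most here
edge_x <- data.frame(x = c(0.05, 9.95))
predict(bs_fit, newdata = edge_x)
predict(ns_fit, newdata = edge_x)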
Small Group Discussion
- Splines & the Bias-Variance Tradeoff
See the figure below. For either natural splines or cubic splines, for what values of deg_free (small, medium, large) do you get the following:
- high bias but low variance:
- low bias but high variance:
- moderate bias and low variance:
- splines & LASSO
Could we use LASSO with a spline or polynomial recipe? Yes or No? Explain.
LOESS
Notes
Local Regression or Locally Estimated Scatterplot Smoothing (LOESS)
. . .
Goal:
Build a flexible regression model of \(y\) by one quantitative predictor \(x\),
\[y = f(x) + \varepsilon\]
. . .
Idea:
Fit regression models in small localized regions, where nearby data have greater influence than far data.
. . .
Algorithm:
Define the span, aka bandwidth, tuning parameter \(h\) where \(0 \le h \le 1\). Take the following steps to estimate \(f(x)\) at each possible predictor value \(x\):
- Identify a neighborhood consisting of the \(100 \cdot h\)% of cases that are closest to \(x\).
- Putting more weight on the neighbors closest to \(x\) (i.e. allowing them to have more influence), fit a linear model in this neighborhood.
- Use the local linear model to estimate \(f(x)\).
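The following rough sketch (simplified weighting and simulated data; not the exact scheme geom_smooth() uses) carries out these steps by hand for a single target value \(x_0\):
# Rough sketch of one LOESS prediction at x0
library(tidyverse)

set.seed(253)
sim_data <- tibble(x = runif(200, 0, 10),
                   y = sin(x) + rnorm(200, sd = 0.3))

h <- 0.3 # span: use the closest 30% of cases
x0 <- 5  # where we want to estimate f(x0)

neighborhood <- sim_data %>%
  mutate(dist = abs(x - x0)) %>%
  slice_min(dist, prop = h) %>%            # step 1: identify the neighborhood
  mutate(w = (1 - (dist / max(dist))^3)^3) # step 2: closer cases get more weight

local_fit <- lm(y ~ x, data = neighborhood, weights = w) # local weighted linear model
predict(local_fit, newdata = data.frame(x = x0))         # step 3: estimate f(x0)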
. . .
In pictures:
Small Group Discussion
Open the QMD and find the LOESS discussion questions. Work on the following questions with your group.
- LOESS in R
We can plot LOESS models using geom_smooth()
. Play around with the span
parameter below.
- What happens as we increase the span from roughly 0 to roughly 1?
- What is one “pro” of this nonparametric algorithm, relative to KNN?
- What questions do you have about this algorithm?
library(tidyverse)
library(ISLR)
data(College)
ggplot(College, aes(x = Expend, y = Grad.Rate)) +
geom_point() +
geom_smooth(method = "loess", span = 0.5)
Note: you’ll find that you can specify span
greater than 1. Use your resources to figure out what that means in terms of the algorithm.
- LOESS & the Bias-Variance Tradeoff
Run the shiny app code, below. Explore the impact of the span tuning parameter h on the LOESS performance across different datasets. Continue to click the Go! button to get different datasets.
For what values of \(h\) (near 0, somewhere in the middle, near 1) do you get the following:
- high bias but low variance
- low bias but high variance
- moderate bias and low variance
# Load packages & data
library(shiny)
library(tidyverse)
# Define a LOESS plotting function
<- function(h, plot_data){
plot_loess ggplot(plot_data, aes(x = x, y = y)) +
geom_point() +
geom_smooth(span = h, se = FALSE) +
labs(title = paste("h = ", h)) +
lims(y = c(-5,12))
}
# BUILD THE SERVER
# These are instructions for building the app - what plot to make, what quantities to calculate, etc
<- function(input, output) {
server_LOESS <- eventReactive(input$do, {
new_data <- c(runif(25, -6, -2.5), runif(50, -3, 3), runif(25, 2.5, 6))
x <- 9 - x^2 + rnorm(100, sd = 0.75)
y c(1:25, 76:100)] <- rnorm(50, sd = 0.75)
y[data.frame(x, y)
})$loesspic <- renderPlot({
outputplot_loess(h = input$hTune, plot_data = new_data())
})
}
# BUILD THE USER INTERFACE (UI)
# The UI controls the layout, appearance, and widgets (eg: slide bars).
<- fluidPage(
ui_LOESS sidebarLayout(
sidebarPanel(
h4("Sample 100 observations:"),
actionButton("do", "Go!"),
h4("Tune the LOESS:"),
sliderInput("hTune", "h", min = 0.05, max = 1, value = 0.05)
),mainPanel(
h4("LOESS plot:"),
plotOutput("loesspic")
)
)
)
# RUN THE SHINY APP!
shinyApp(ui = ui_LOESS, server = server_LOESS)
Exercises
Part 1: Implementing Splines
Goals
- Implement and explore splines in R.
- Explore how this algorithm compares to others, hence how it fits into our broader ML workflow.
Context
Using the College
data in the ISLR
package, we’ll build a predictive model of Grad.Rate
by 6 predictors: Private
, PhD
(percent of faculty with PhDs), Room.Board
, Personal
(estimated personal spending), Outstate
(out-of-state tuition), perc.alumni
(percent of alumni who donate)
Before proceeding, install the splines
package by entering install.packages("splines")
in the Console.
# Load packages
library(tidymodels)
library(tidyverse)
library(ISLR)
# Load data
data(College)
# Wrangle the data
college_sub <- College %>%
  filter(Grad.Rate <= 100) %>%
  filter((Grad.Rate > 50 | Expend < 40000)) %>%
  select(Grad.Rate, Private, PhD, Room.Board, Personal, Outstate, perc.alumni)
- Evaluating a fully linear model
We will model Grad.Rate
as a function of the 6 predictors in college_sub
.
- Make scatterplots of each of the quantitative predictors and the outcome with 2 different smoothing lines to explore potential nonlinearity. Adapt the following code to create a scatterplot with a smooth LOESS blue trend line and a red linear trend line. What do you notice about the form of the relationships?
ggplot(___, aes(___)) +
geom_point() +
geom_smooth(color = "blue", se = FALSE) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
theme_classic()
- Use tidymodels to fit a linear regression model (no splines yet) with the following specifications:
  - Use 8-fold CV.
  - Use CV mean absolute error (MAE) to evaluate models.
  - Use the LASSO engine to do variable selection, selecting the simplest model for which the metric is within one standard error of the best metric.
- Fit your “best” model and look at coefficients of that final model. Which variables are “least” important to predicting
Grad.Rate
?
set.seed(253)

# Lasso Model Spec with tune
lasso_spec <- linear_reg() %>%
  set_mode('regression') %>%
  set_engine(engine = 'glmnet') %>%
  set_args(mixture = 1, penalty = tune())

# Recipe
lasso_recipe <- recipe(__ ~ ___, data = college_sub) %>%
  step_dummy(all_nominal_predictors())

# Workflow (Recipe + Model)
lasso_wf_tune <- workflow() %>%
  add_recipe(lasso_recipe) %>%
  add_model(lasso_spec)

# Tune Model (trying a variety of values of Lambda penalty)
lasso_models <- tune_grid(
  lasso_wf_tune, # workflow
  resamples = vfold_cv(__, v = __),
  metrics = metric_set(___),
  grid = grid_regular(penalty(range = c(-3, 1)), levels = 30)
)

# Select parsimonious model & fit
parsimonious_penalty <- lasso_models %>%
  select_by_one_std_err(metric = 'mae', desc(penalty))

# Finalize Model
ls_mod <- lasso_wf_tune %>%
  finalize_workflow(parameters = parsimonious_penalty) %>%
  fit(data = college_sub)

# Note which variable is the "least" important
ls_mod %>%
  tidy()
- Make a plot of the residuals vs. fitted values from
ls_mod
and then plots of the residuals v. each of the quantitative predictors to evaluate whether the model is “right”.
ls_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(__)) +
  ___ +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()
- Evaluating a spline model
We’ll compare our best LASSO model with models with spline functions for some of the quantitative predictors.
- Update your recipe from Exercise 1 to fit a linear model (with the lm engine rather than glmnet) with all of the predictors. Allow the quantitative predictors that had some non-linear relationships to be modeled with natural splines that have 4 degrees of freedom. Evaluate this model with 8-fold CV via fit_resamples (same folds as before) to compare MAE, and then fit the model to the whole training data. Call this fitted model ns_mod.
# Model Spec
lm_spec <- linear_reg() %>%
  set_mode('regression') %>%
  set_engine(engine = 'lm')

# New Recipe (remove steps needed for LASSO, add splines)

# Create Workflow (Recipe + Model)

set.seed(253)

# CV to Evaluate
ns_models <- fit_resamples(
  ___, # workflow
  resamples = vfold_cv(__, v = 8), # cv folds
  metrics = metric_set(___)
)

ns_models %>% collect_metrics()

# Fit with all data
ns_mod <- fit(
  ___, # workflow from above
  data = college_sub
)
- Make a plot of the residuals vs. fitted values from
ns_mod
and then plots of the residuals v. each of the quantitative predictors to evaluate whether the model is right.
# Residual plots
- Compare the CV MAE between models with and without the splines.
lasso_models %>%
  collect_metrics() %>%
  filter(penalty == (parsimonious_penalty %>% pull(penalty)))

ns_models %>%
  collect_metrics()
Pick an algorithm
Our overall goal is to build a predictive model of Grad.Rate (\(y\)) by 6 predictors:
\[y = f(x_1,x_2,...,x_{6}) + \varepsilon\]
The table below summarizes the CV MAE for 3 possible algorithms: least squares, KNN, and splines. After examining the results, explain which model you would choose and why. NOTE: There are no typos here! All models had virtually the same CV MAE. Code is provided in the online manual if you’re curious.
| method | type | assumption about \(f(x_1,x_2,...,x_{6})\) | CV MAE |
|--------|------|-------------------------------------------|--------|
| least squares | parametric | \(\beta_0 + \beta_1x_1 + \cdots + \beta_{6} x_{6}\) | 10.2 |
| LASSO w/ \(\lambda=1.2\) | parametric | \(\beta_0 + \beta_1x_1 + \cdots + \beta_{6} x_{6}\) | 10.3 |
| KNN w/ \(K = 33\) | nonparametric | average \(y\) among neighbors | 10.2 |
| splines | parametric | \(\beta_0 + f_1(x_1) + \cdots + f_{6}(x_{6})\) | 10.2 |
Part 2: Reflection
In these final exercises, you’ll reflect upon some algorithms we’ve learned thus far. This might feel bumpy if you haven’t reviewed the material recently, so be kind to yourself!
Some of these exercises are similar to those on Homework 3. Thus solutions are not provided. The following directions are important to deepening your learning and growth.
- Typing the questions into ChatGPT does not require any reflection: You may NOT use GenAI or other online resources for these questions or on Homework 3.
- Locating is not learning: Challenge yourself to write out concepts in your own words. Do not rely on definitions in the activities, videos, or elsewhere.
- Interpretability & flexibility
When picking between different model building algorithms, there are several considerations, including flexibility and interpretability. Consider the following graphic from ISLR:
Convince yourself that the placement of Subset Selection (e.g. backward stepwise), LASSO, Least Squares, and Splines (which is highly related to Generalized Additive Models) makes sense. (We’ll address other algorithms later this semester.)
Where would you place KNN on this graphic?
- Differences and similarities
For each pair of algorithms below, try to identify a key similarity, a key difference, and any scenario in which they’re “equivalent”.
- KNN vs LOESS
- LOESS vs Splines
- Splines vs least squares
- least squares vs LASSO
- Pros & cons (there’s a similar question on HW3)
Summarize at least 1 pro and 1 con about each model building algorithm. You cannot use the same pro or con more than once!
- least squares
- LASSO
- KNN
- Splines
- Lingo (there’s a similar question on HW3)
In your own words, describe what each of these “everyday” words means in the context of ML algorithms:
- greedy
- biased
- parsimonious
- (computationally) expensive
- goldilocks problem
Deeper Learning (OPTIONAL)
KNN, regression splines, and LOESS aren’t the only flexible regression algorithms! Chapter 7 of ISLR also discusses:
- piecewise polynomials (parametric)
- smoothing splines (nonparametric)
- generalized additive models (nonparametric)
Read Chapter 7 of the book, and watch Prof Johnson’s video on these topics to learn more:
Chapter 7.7 specifically provides more information about generalized additive models (GAMs). These models take the ideas of splines and LOESS and combine them into one framework that allows each predictor to have a non-linear relationship. They are implemented in tidymodels but can be computationally intensive. They allow each relationship to be estimated with either LOESS or smoothing splines, both of which are nonparametric approaches.
In general, we can think of GAM as
\[y = \beta_0 + f_1(x_1) + \cdots + f_{p}(x_{p}) + \epsilon\]
where the functions \(f_j(x)\) are estimated nonparametrically.
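If you’d like to try a GAM, one option (a sketch, not required for this activity) is the mgcv package, whose gam() function estimates each \(f_j\) with a penalized smooth term by default:
# Sketch: a GAM for Grad.Rate with a smooth term s() for each quantitative
# predictor; Private enters linearly since it's categorical
library(mgcv)
library(ISLR)
data(College)

gam_mod <- gam(Grad.Rate ~ Private + s(PhD) + s(Room.Board) + s(Personal) +
                 s(Outstate) + s(perc.alumni),
               data = College)

summary(gam_mod)         # effective degrees of freedom for each smooth
plot(gam_mod, pages = 1) # estimated f_j curves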
Chapter 7.5 provides more information about smoothing splines. In general, these techniques estimate the \(f(x)\) by penalizing wiggly behavior where wiggliness is measured by the second derivative of \(f(x)\), i.e. the degree to which the slope changes. Specifically, \(f(x)\) is estimated to minimize:
\[\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int f''(t)^2 dt\]
That is, we want \(f(x)\) to produce small squared residuals (\(\sum_{i=1}^n (y_i - f(x_i))^2\)) and small total change in slope over the range of x (\(\int f''(t)^2 dt\)).
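To see this in action with base R (an optional sketch for a single predictor), smooth.spline() minimizes this penalized criterion and can pick \(\lambda\) via (generalized) cross-validation:
# Sketch: a smoothing spline of Grad.Rate vs Expend, with the penalty chosen
# automatically by generalized cross-validation
library(ISLR)
data(College)

ss_fit <- smooth.spline(x = College$Expend, y = College$Grad.Rate)

ss_fit$lambda # the selected penalty
ss_fit$df     # the corresponding effective degrees of freedom

plot(College$Expend, College$Grad.Rate, col = "grey")
lines(ss_fit, col = "blue", lwd = 2)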
Solutions
Small Group Discussions
- KNN Review
Solution
- The output of the model is: one prediction for one observational unit. There is nothing to interpret (no plots, no coefficients, etc.). Also: there is no way to tune the model to use fewer predictors.
- The only assumption we made was that the outcome values of \(y\) should be similar if the predictor values of \(x\) are similar. No other assumptions are made.
- This is a pro if we want flexibility due to non-linear relationships and that assumption is true.
- This is a con if relationships are actually linear or could be modeled with a parametric model.
- Nothing. I’d say this is a con. There is nothing to interpret, so the model is more of a black box in terms of knowing why it gives you a particular prediction.
Further details are in the solutions for the previous activity.
- Nonparametric vs Parametric
Solution
Use nonparametric methods when parametric model assumptions are too rigid. Forcing a parametric method in this situation can produce misleading conclusions.
Use parametric methods when the model assumptions hold. In such cases, parametric models provide more contextual insight (eg: meaningful coefficients) and the ability to detect which predictors are beneficial to the model.
- Splines & the Bias-Variance Tradeoff
Solution
- high bias but low variance: small deg_free
- low bias but high variance: large deg_free
- moderate bias and low variance: medium deg_free
- splines & LASSO
Solution
Not necessarily or at least we’d want to proceed carefully…
- LASSO could remove an intermediate term, so the full complex function would not be specified – would we be ok with that?
- Maybe LASSO could help us choose our knots & degrees of freedom for regression splines.
- If we specified the spline differently, we could even use LASSO for knot selection; see the paper Knot Selection for Regression Splines using LASSO.
- Smoothing splines take the idea of a penalty from LASSO and use it to tune the degrees of freedom.
- LOESS in R
Solution
- As we increase the span, the model becomes smoother and simpler.
- Compared to KNN, LOESS is smoother and closer to the shape of the relationship we’re trying to estimate.
- LOESS & the Bias-Variance Tradeoff
Solution
- h near 1
- h near 0
- h somewhere in the middle
Exercises
- Evaluating a fully linear model
Solution
#a
# There might be a slight non-linear relationship
ggplot(college_sub, aes(y = Grad.Rate, x = PhD)) +
geom_point() +
geom_smooth(color = "blue", se = FALSE) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
theme_classic()
# Only deviation in the extremes of Room and Board
ggplot(college_sub, aes(y = Grad.Rate, x = Room.Board)) +
geom_point() +
geom_smooth(color = "blue", se = FALSE) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
theme_classic()
# linear relationship doesn't fit well but some extreme values in Personal
ggplot(college_sub, aes(y = Grad.Rate, x = Personal)) +
geom_point() +
geom_smooth(color = "blue", se = FALSE) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
theme_classic()
# Close to linear
ggplot(college_sub, aes(y = Grad.Rate, x = Outstate)) +
geom_point() +
geom_smooth(color = "blue", se = FALSE) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
theme_classic()
# Close to linear
ggplot(college_sub, aes(y = Grad.Rate, x = perc.alumni)) +
geom_point() +
geom_smooth(color = "blue", se = FALSE) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
theme_classic()
The relationships between many of these predictors and Grad.Rate look reasonably linear. Perhaps two exceptions to this are Personal and PhD.
#b
set.seed(253)

# Lasso Model Spec with tune
lasso_spec <- linear_reg() %>%
  set_mode('regression') %>%
  set_engine(engine = 'glmnet') %>%
  set_args(mixture = 1, penalty = tune())

# Recipe
lasso_recipe <- recipe(Grad.Rate ~ ., data = college_sub) %>% # use all predictors ( ~ . )
  step_dummy(all_nominal_predictors())

# Workflow (Recipe + Model)
lasso_wf_tune <- workflow() %>%
  add_recipe(lasso_recipe) %>%
  add_model(lasso_spec)

# Tune Model (trying a variety of values of Lambda penalty)
lasso_models <- tune_grid(
  lasso_wf_tune, # workflow
  resamples = vfold_cv(college_sub, v = 8), # 8-fold CV
  metrics = metric_set(mae), # calculate MAE
  grid = grid_regular(penalty(range = c(-5, 2)), levels = 30)
)

# Select parsimonious model & fit
parsimonious_penalty <- lasso_models %>%
  select_by_one_std_err(metric = 'mae', desc(penalty))

ls_mod <- lasso_wf_tune %>%
  finalize_workflow(parameters = parsimonious_penalty) %>%
  fit(data = college_sub)

# Note which variable is the "least" important
ls_mod %>%
  tidy()
# A tibble: 7 × 3
term estimate penalty
<chr> <dbl> <dbl>
1 (Intercept) 37.1 1.17
2 PhD 0.0582 1.17
3 Room.Board 0.00102 1.17
4 Personal -0.000740 1.17
5 Outstate 0.00136 1.17
6 perc.alumni 0.285 1.17
7 Private_Yes 0 1.17
Private_Yes has a coefficient estimate of zero, so it is the “least” important.
# c
# Residuals v. Fitted/Predicted
ls_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals v. PhD
ls_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = PhD, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals v. Personal
ls_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = Personal, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals v. Room.Board
ls_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = Room.Board, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals v. Outstate
ls_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = Outstate, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals v. perc.alumni
ls_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = perc.alumni, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()
These residual plots look generally fine to me. Perhaps there is a bit of heteroskedasticity in the residual vs fitted plot. We also look to be over-estimating (residuals < 0) graduation rates for schools with low numbers of PhDs and mid-range levels (2000–5000) of personal spending.
- Evaluating a spline model
Solution
- See below:
# Model Spec
lm_spec <- linear_reg() %>%
  set_mode('regression') %>%
  set_engine(engine = 'lm')

# New Recipe (remove steps needed for LASSO, add splines)
spline_recipe <- recipe(Grad.Rate ~ ., data = college_sub) %>%
  step_ns(Personal, deg_free = 4) %>% # splines with 4 df for Personal
  step_ns(PhD, deg_free = 4) # splines with 4 df for PhD

# Create Workflow (Recipe + Model)
spline_wf <- workflow() %>%
  add_recipe(spline_recipe) %>%
  add_model(lm_spec)

set.seed(253)

# CV to Evaluate
ns_models <- fit_resamples(
  spline_wf, # workflow
  resamples = vfold_cv(college_sub, v = 8), # cv folds
  metrics = metric_set(mae) # calculate MAE
)

# check out CV metrics
ns_models %>%
  collect_metrics()
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 mae standard 10.2 8 0.229 Preprocessor1_Model1
# Fit with all data
ns_mod <- fit(
  spline_wf, # workflow from above
  data = college_sub
)
#b. Residual plots
# Residuals v. Fitted/Predicted
ns_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals v. PhD
ns_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = PhD, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals v. Personal
ns_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = Personal, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals v. Room.Board
ns_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = Room.Board, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals v. Outstate
ns_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = Outstate, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()

# Residuals v. perc.alumni
ns_mod %>%
  augment(new_data = college_sub) %>%
  mutate(.resid = Grad.Rate - .pred) %>%
  ggplot(aes(x = perc.alumni, y = .resid)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = 0, color = "red") +
  theme_classic()
I still notice the same mild heteroskedasticity in the residual vs fitted plot. However, the systematic over-estimation of graduation rates for schools with low numbers of PhDs and mid-range personal spending now looks better (the “typical” residual is closer to zero over those ranges).
#c. compare CV MAE
lasso_models %>%
  collect_metrics() %>%
  filter(penalty == (parsimonious_penalty %>% pull(penalty)))

ns_models %>%
  collect_metrics()
The MAEs are very similar.
- Pick an algorithm
Solution
You can argue a couple of ways.
If you noted the non-linear relationships in the plots above: splines is more informative than KNN and less wrong than least squares.
If you didn’t think above that splines were warranted, you should pick least squares (or LASSO). It’s definitely simpler and easier to interpret.
Code
# least squares
set.seed(253)
ls_model <- lm_spec %>% # re-use lm specs from above
  # don't need a recipe since we're not doing any pre-processing
  fit_resamples( # get 10-fold CV MAE
    Grad.Rate ~ .,
    resamples = vfold_cv(college_sub, v = 10),
    metrics = metric_set(mae)
  )

# look at CV MAE
ls_model %>%
  collect_metrics()
# A tibble: 1 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 mae standard 10.2 10 0.365 Preprocessor1_Model1
# KNN
# specifications
knn_spec <- nearest_neighbor() %>%
  set_mode("regression") %>%
  set_engine(engine = "kknn") %>%
  set_args(neighbors = tune())

# recipe
variable_recipe <- recipe(Grad.Rate ~ ., data = college_sub) %>%
  step_nzv(all_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# workflow
knn_workflow <- workflow() %>%
  add_model(knn_spec) %>%
  add_recipe(variable_recipe)

# tune K using 10-fold CV
set.seed(253)
knn_models <- knn_workflow %>%
  tune_grid(
    grid = grid_regular(neighbors(range = c(1, 200)), levels = 50),
    resamples = vfold_cv(college_sub, v = 10),
    metrics = metric_set(mae)
  )

# select best K
best_K <- select_best(knn_models, metric = "mae")
best_K
# A tibble: 1 × 2
neighbors .config
<int> <chr>
1 33 Preprocessor1_Model09
# look at CV MAE for the best K
knn_models %>%
  collect_metrics() %>%
  filter(neighbors == best_K$neighbors)
# A tibble: 1 × 7
neighbors .metric .estimator mean n std_err .config
<int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 33 mae standard 10.2 10 0.337 Preprocessor1_Model09
Wrap-Up
- Finish the activity, check the solutions, review the optional resources, and reach out with questions
- Before our next class:
- review instructions for Group Assignment 1
- if you will be absent: alert your group members and come up with a plan to contribute to the collaboration outside class
- Upcoming important dates:
- HW3: due Feb 25
- MSCS Capstone Days: Feb 27–28 (no class; reflection due in class Mar 4)
- GA1: due Mar 5
- Quiz 1: in class Mar 6