9  LOESS & Splines

Settling In

  • Prepare to take notes (find today’s QMD in the usual spot)
  • Check Slack for any recent messages you may have missed


Important Announcements

  • HW3 due next Tuesday (Feb 25)
    • review the Important Notes before you begin!
    • designed to help you start studying for Quiz 1 (in class on Mar 6)
  • Regression review / group assignment during our next two class periods
    • read the Group Assignment 1 instructions before our next class
    • designed to help you synthesize and practice the concepts from Units 1–3 and prepare for Quiz 1
  • MSCS Capstone Days next week (Feb 27–28)
    • no class on Thursday!
    • instead, attend at least three capstone talks, and submit a short reflection (link on Moodle) by end-of-day Friday



Learning Goals

  • Explain the advantages of splines over global transformations and other types of piecewise polynomials
  • Explain how splines are constructed by drawing connections to variable transformations and least squares
  • Explain how the number of knots relates to the bias-variance tradeoff
  • Clearly describe the local regression algorithm for making a prediction
  • Explain how bandwidth (span) relates to the bias-variance tradeoff



Context

  • world = supervised learning
    We want to model some output variable \(y\) using a set of potential predictors (\(x_1, x_2, ..., x_p\)) and the relationship could be non-linear.

  • task = regression
    \(y\) is quantitative

  • a flexible algorithm


GOAL

Our standard parametric models (eg: linear regression) may be too rigid to represent the relationship between \(y\) and our predictors \(x\). Thus we need more flexible models.



KNN

Discuss the following questions with your group to review key concepts from recent activities.

Small Group Discussion

  1. KNN Review

In the previous activity, we modeled college Grad.Rate versus Expend, Enroll, and Private using data on 775 schools.

  1. We chose the KNN with K = 33 because it minimized the CV MAE, i.e. the errors when predicting grad rate for schools outside our sample. We don’t typically worry about a more parsimonious KNN, i.e. a model that has slightly higher prediction errors but is easier to interpret, apply, etc. Why?

  2. What assumptions did the KNN model make about the relationship of Grad.Rate with Expend, Enroll, and Private? Is this a pro or con?

  3. What did the KNN model tell us about the relationship of Grad.Rate with Expend, Enroll, and Private? For example, did it give you a sense of whether grad rates are higher at private or public institutions? At institutions with higher or lower enrollments? Is this a pro or con?


  2. Nonparametric vs Parametric

Nonparametric KNN vs parametric least squares and LASSO:

  • When should we use a nonparametric algorithm like KNN?
  • When shouldn’t we?



Splines

Notes

Goal:
Build a flexible parametric regression model of \(y\) by one quantitative predictor \(x\),

\[y = f(x) + \varepsilon\]


Idea:

Fit polynomial models in small localized regions and constrain them so that the overall function is smooth.



Definition:

A spline is a piecewise polynomial of degree \(k\) that is a continuous function and has continuous derivatives (of orders \(1,...,k-1\)) at its knot points (break points).



A spline function \(f\) with knot points \(t_1<...<t_m\),

  • is a polynomial of degree \(k\) on each of the intervals \((-\infty,t_1], [t_1,t_2], \ldots, [t_m,\infty)\) and
  • the \(j\)th derivative of the spline function \(f\) is continuous at \(t_1,...,t_m\) for \(j=0,1,...,k-1\).


More Definitions:

Spline functions can be represented using a basis and can be used within a linear regression framework. We often call this regression splines or B-splines and represent the spline function as

\[f(x) = \beta_0 + \beta_1f_1(x) + \cdots + \beta_{p}f_{p}(x)\]

where the functions \(f_1(x)\),…,\(f_p(x)\) are complex basis functions defined by knots and the degree of the polynomial.
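
To make this concrete, here is a minimal sketch (not part of the activity; the simulated x and y are made up for illustration) that peeks at the basis columns \(f_1(x),\ldots,f_p(x)\) created by the splines package and then hands them to least squares via lm():

library(splines)

set.seed(253)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

# bs() and ns() return a matrix whose columns are basis functions evaluated at x
head(bs(x, knots = c(3, 7)))  # cubic B-spline basis with knots at 3 and 7
head(ns(x, df = 4))           # natural cubic spline basis with 4 degrees of freedom

# Least squares then estimates one coefficient per basis column
spline_fit <- lm(y ~ ns(x, df = 4))
coef(spline_fit)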





Piecewise Polynomials in Stat 155

If you had a quadratic model (degree = 2) with 0 knots, how could you fit that model using the tools of STAT 155?


lm(y ~ x + x2, data) #if you created x2 = x^2

lm(y ~ poly(x, 2), data)


If you had a piecewise linear model (degree = 1) with 1 knot at \(x=5\), how could you fit that using the tools from Stat 155?


lm(y ~ x*I(x > 5), data)


These implementations don’t constrain the fitted pieces to be continuous or smooth at the knots, so we need additional tools from the splines package!
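
As an aside, one way to force continuity at the knot by hand is to replace the interaction with a “hinge” term, \(\max(x-5, 0)\). A minimal sketch with simulated data (not part of the activity):

set.seed(253)
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- ifelse(toy$x < 5, 2 * toy$x, 10 + 0.5 * (toy$x - 5)) + rnorm(100, sd = 0.5)

# Unconstrained piecewise line: separate intercepts and slopes, so the fit can jump at x = 5
fit_jump  <- lm(y ~ x * I(x > 5), data = toy)

# Continuous piecewise line: a base slope plus an extra slope that "turns on" after the knot
fit_hinge <- lm(y ~ x + pmax(x - 5, 0), data = toy)
coef(fit_hinge)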

Splines in tidymodels

To build models with splines in tidymodels, we proceed with the same structure as we use for ordinary linear regression models but we’ll add some pre-processing steps to our recipe.


To work with splines, we’ll use tools from the splines package in conjunction with recipes.

  • The step_ns() function, based on ns() in that package, creates the transformations needed to build a natural cubic spline function for a quantitative predictor.
  • The step_bs() function, based on bs() in that package, creates the transformations needed to build a basis spline (typically called a B-spline) function for a quantitative predictor.


# Linear Regression Model Spec
lm_spec <- 
  linear_reg() %>%
  set_mode('regression') %>%
  set_engine(engine = 'lm') 
  

# Original Recipe
lm_recipe <- recipe(y ~ x, data = sample_data)

# Polynomial Recipe
poly_recipe <- lm_recipe %>%
  step_poly(x, degree = __) # polynomial (higher degree means more flexible)

# Natural Spline Recipe
ns2_recipe <- lm_recipe %>%
  step_ns(x, deg_free = __) # natural cubic spline (higher deg_free means more knots)

# Basis Spline Recipe
bs_recipe <- lm_recipe %>%
  step_bs(x, options = list(knots = c(__,__))) # b-spline (cubic by default, more knots means higher deg_free)
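
Once the blanks above are filled in, any of these recipes is combined with lm_spec in a workflow and fit as usual. A quick sketch (assuming sample_data exists with columns y and x):

# Combine a spline recipe with the linear model spec and fit
ns2_wf <- workflow() %>% 
  add_recipe(ns2_recipe) %>%
  add_model(lm_spec)

ns2_mod <- ns2_wf %>% 
  fit(data = sample_data)

ns2_mod %>% tidy()  # one coefficient per spline basis function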


  • The deg_free argument in step_ns() stands for degrees of freedom:
    • deg_free = # of knots + 1
    • The degrees of freedom are the number of coefficients in the transformation functions that are free to vary (essentially the number of underlying parameters behind the transformations).
    • The knots are chosen using percentiles of the observed values.
  • The options argument in step_bs() allows you to specify specific knots:
    • options = list(knots = c(2, 4)) sets a knot at 2 and a knot at 4 (the choice of knots should depend on the observed range of the quantitative predictor and where you see a change in the relationship)
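
A quick sketch (not part of the notes) that verifies these two facts for ns() directly: the basis has deg_free columns, and the interior knots sit at percentiles of the observed values.

library(splines)
library(ISLR)
data(College)

ns_basis <- ns(College$Expend, df = 4)
ncol(ns_basis)           # 4 basis columns (the deg_free)
attr(ns_basis, "knots")  # 3 interior knots, placed at percentiles of Expend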



What is the difference between natural splines and B-splines?

  • B-spline is a tool to incorporate splines into a linear regression setting.
    • You have to choose the knots (which can be tricky sometimes).
    • These functions can be unstable (high variance) near the boundaries, especially with higher polynomial degrees.
  • Natural spline is a variant of the B-spline with additional constraints on the boundaries to reduce variance (the degree of the polynomial near the boundaries is lower).
    • Cubic natural splines are the most common
    • Typically knots are chosen based on quantiles of the predictor (e.g. 1 knot will be placed at the median, 2 knots will be placed at the 33rd and 66th percentiles, etc.)



Small Group Discussion

  3. Splines & the Bias-Variance Tradeoff

See the figure below. For either natural splines or cubic splines, for what values of deg_free (small, medium, large) do you get the following:

  1. high bias but low variance:
  2. low bias but high variance:
  3. moderate bias and low variance:


  4. Splines & LASSO

Could we use LASSO with a spline or polynomial recipe? Yes or No? Explain.



LOESS

Notes

Local Regression or Locally Estimated Scatterplot Smoothing (LOESS)


Goal:
Build a flexible regression model of \(y\) by one quantitative predictor \(x\),

\[y = f(x) + \varepsilon\]


Idea:

Fit regression models in small localized regions, where nearby data have greater influence than data farther away.


Algorithm:

Define the span, aka bandwidth, tuning parameter \(h\) where \(0 \le h \le 1\). Take the following steps to estimate \(f(x)\) at each possible predictor value \(x\):

  1. Identify a neighborhood consisting of the \(100 \cdot h\)% of cases that are closest to \(x\).
  2. Putting more weight on the neighbors closest to \(x\) (i.e. allowing them to have more influence), fit a linear model in this neighborhood.
  3. Use the local linear model to estimate \(f(x)\).
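
Base R’s loess() function implements this kind of local weighted regression; here is a minimal sketch with simulated data, where span plays the role of \(h\) and degree sets the degree of the local polynomial fits:

set.seed(253)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

fit_smooth <- loess(y ~ x, span = 0.75, degree = 1)  # wide neighborhoods --> smoother estimate
fit_wiggly <- loess(y ~ x, span = 0.10, degree = 1)  # narrow neighborhoods --> wigglier estimate

# Estimate f(x) at a new value of x
predict(fit_smooth, newdata = data.frame(x = 5))
predict(fit_wiggly, newdata = data.frame(x = 5))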


In pictures:

Small Group Discussion

Open the QMD and find the LOESS discussion questions. Work on the following questions with your group.


  5. LOESS in R

We can plot LOESS models using geom_smooth(). Play around with the span parameter below.

  • What happens as we increase the span from roughly 0 to roughly 1?
  • What is one “pro” of this nonparametric algorithm, relative to KNN?
  • What questions do you have about this algorithm?
library(tidyverse)
library(ISLR)
data(College)

ggplot(College, aes(x = Expend, y = Grad.Rate)) + 
  geom_point() + 
  geom_smooth(method = "loess", span = 0.5)

Note: you’ll find that you can specify a span greater than 1. Use your resources to figure out what that means in terms of the algorithm.


  6. LOESS & the Bias-Variance Tradeoff

Run the shiny app code below. Explore the impact of the span tuning parameter h on LOESS performance across different datasets. Continue to click the Go! button to get different datasets.

For what values of \(h\) (near 0, somewhere in the middle, near 1) do you get the following:

  1. high bias but low variance
  2. low bias but high variance
  3. moderate bias and low variance
# Load packages & data
library(shiny)
library(tidyverse)

# Define a LOESS plotting function
plot_loess <- function(h, plot_data){
  ggplot(plot_data, aes(x = x, y = y)) + 
    geom_point() + 
    geom_smooth(span = h, se = FALSE) + 
    labs(title = paste("h = ", h)) +
    lims(y = c(-5,12))
}

# BUILD THE SERVER
# These are instructions for building the app - what plot to make, what quantities to calculate, etc
server_LOESS <- function(input, output) {
  new_data <- eventReactive(input$do, {
    x <- c(runif(25, -6, -2.5), runif(50, -3, 3), runif(25, 2.5, 6))
    y <- 9 - x^2 + rnorm(100, sd = 0.75) 
    y[c(1:25, 76:100)] <- rnorm(50, sd = 0.75)
    data.frame(x, y)
  })
  output$loesspic <- renderPlot({
    plot_loess(h = input$hTune, plot_data = new_data())
  })
}


# BUILD THE USER INTERFACE (UI)
# The UI controls the layout, appearance, and widgets (eg: slide bars).
ui_LOESS <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      h4("Sample 100 observations:"), 
      actionButton("do", "Go!"),
      h4("Tune the LOESS:"), 
      sliderInput("hTune", "h", min = 0.05, max = 1, value = 0.05)
    ),
    mainPanel(
      h4("LOESS plot:"), 
      plotOutput("loesspic")
    )
  )
)


# RUN THE SHINY APP!
shinyApp(ui = ui_LOESS, server = server_LOESS)



Exercises

Part 1: Implementing Splines

Goals

  • Implement and explore splines in R.
  • Explore how this algorithm compares to others, hence how it fits into our broader ML workflow.

Context

Using the College data in the ISLR package, we’ll build a predictive model of Grad.Rate by 6 predictors: Private, PhD (percent of faculty with PhDs), Room.Board (room and board costs), Personal (estimated personal spending), Outstate (out-of-state tuition), and perc.alumni (percent of alumni who donate).

Before proceeding, note that the splines package ships with base R, so you do not need to install it with install.packages("splines").

# Load packages
library(tidymodels)
library(tidyverse)
library(ISLR)

# Load data
data(College)

# Wrangle the data
college_sub <- College %>% 
  filter(Grad.Rate <= 100) %>% 
  filter((Grad.Rate > 50 | Expend < 40000)) %>% 
  select(Grad.Rate, Private, PhD, Room.Board, Personal, Outstate, perc.alumni)



  1. Evaluating a fully linear model

We will model Grad.Rate as a function of the 6 predictors in college_sub.

  a. Make scatterplots of each of the quantitative predictors and the outcome with 2 different smoothing lines to explore potential nonlinearity. Adapt the following code to create a scatterplot with a smooth LOESS blue trend line and a red linear trend line. What do you notice about the form of the relationships?
ggplot(___, aes(___)) +
    geom_point() +
    geom_smooth(color = "blue", se = FALSE) +
    geom_smooth(method = "lm", color = "red", se = FALSE) +
    theme_classic()
  b. Use tidymodels to fit a linear regression model (no splines yet) with the following specifications:
    • Use 8-fold CV.
    • Use CV mean absolute error (MAE) to evaluate models.
    • Use the LASSO engine to do variable selection, selecting the simplest model for which the metric is within one standard error of the best metric.
    • Fit your “best” model and look at the coefficients of that final model. Which variables are “least” important to predicting Grad.Rate?
set.seed(253)

# Lasso Model Spec with tune
lasso_spec <- linear_reg() %>%
  set_mode('regression')  %>%
  set_engine(engine = 'glmnet') %>% 
  set_args(mixture = 1, penalty = tune()) 
  
# Recipe
lasso_recipe <- recipe(__ ~ ___, data = college_sub) %>%
  step_dummy(all_nominal_predictors())

# Workflow (Recipe + Model)
lasso_wf_tune <- workflow() %>% 
  add_recipe(lasso_recipe) %>%
  add_model(lasso_spec) 

# Tune Model (trying a variety of values of Lambda penalty)
lasso_models <- tune_grid( 
  lasso_wf_tune, # workflow
  resamples = vfold_cv(__, v = __), 
  metrics = metric_set(___),
  grid = grid_regular(penalty(range = c(-3, 1)), levels = 30) 
)

# Select parsimonious model & fit
parsimonious_penalty <- lasso_models %>% 
  select_by_one_std_err(metric = 'mae', desc(penalty))

# Finalize Model
ls_mod <-  lasso_wf_tune  %>% 
  finalize_workflow(parameters = parsimonious_penalty) %>%
  fit(data = college_sub) 
    
# Note which variable is the "least" important    
ls_mod %>% 
  tidy()
  c. Make a plot of the residuals vs. fitted values from ls_mod and then plots of the residuals vs. each of the quantitative predictors to evaluate whether the model is “right”.
ls_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(__)) +
    ___ +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()


  2. Evaluating a spline model

We’ll compare our best LASSO model to a model that uses spline functions for some of the quantitative predictors.

  a. Update your recipe from Exercise 1 to fit a linear model (with the lm engine rather than glmnet) with all predictors. Allow the quantitative predictors that had some non-linear relationships to be modeled with natural splines that have 4 degrees of freedom. Fit this model with 8-fold CV via fit_resamples (same folds as before) to compare MAE, and then fit the model to the whole training data. Call this fitted model ns_mod.
# Model Spec
lm_spec <-
  linear_reg() %>%
  set_mode('regression') %>%
  set_engine(engine = 'lm')
  

# New Recipe (remove steps needed for LASSO, add splines)



# Create Workflow (Recipe + Model)



set.seed(253)
# CV to Evaluate
ns_models <- fit_resamples(
  ___, # workflow
  resamples = vfold_cv(__, v = 8), # cv folds
  metrics = metric_set(___)
)

ns_models %>% collect_metrics()

# Fit with all data
ns_mod <- fit( ___, #workflow from above
  data = college_sub
)
  b. Make a plot of the residuals vs. fitted values from ns_mod and then plots of the residuals vs. each of the quantitative predictors to evaluate whether the model is right.
# Residual plots
  c. Compare the CV MAE between models with and without the splines.
lasso_models %>% 
  collect_metrics() %>% 
  filter(penalty == (parsimonious_penalty %>% pull(penalty)))

ns_models %>% 
  collect_metrics()


  3. Pick an algorithm
    Our overall goal is to build a predictive model of Grad.Rate (\(y\)) by 6 predictors:

    \[y = f(x_1,x_2,...,x_{6}) + \varepsilon\]

    The table below summarizes the CV MAE for four algorithms: least squares, LASSO, KNN, and splines. After examining the results, explain which model you would choose and why. NOTE: There are no typos here! All models had virtually the same CV MAE. Code is provided in the online manual if you’re curious.

    | method | type | assumption about \(f(x_1,x_2,...,x_{6})\) | CV MAE |
    |---|---|---|---|
    | least squares | parametric | \(\beta_0 + \beta_1x_1 + \cdots + \beta_{6} x_{6}\) | 10.2 |
    | LASSO w/ \(\lambda=1.2\) | parametric | \(\beta_0 + \beta_1x_1 + \cdots + \beta_{6} x_{6}\) | 10.3 |
    | KNN w/ \(K = 33\) | nonparametric | average \(y\) among neighbors | 10.2 |
    | splines | parametric | \(\beta_0 + f_1(x_1) + \cdots + f_{6}(x_{6})\) | 10.2 |



Part 2: Reflection

In these final exercises, you’ll reflect upon some algorithms we’ve learned thus far. This might feel bumpy if you haven’t reviewed the material recently, so be kind to yourself!

Some of these exercises are similar to those on Homework 3, so solutions are not provided. The following directions are important for deepening your learning and growth.

  • Typing the questions into ChatGPT does not require any reflection: You may NOT use GenAI or other online resources for these questions or on Homework 3.
  • Locating is not learning: Challenge yourself to write out concepts in your own words. Do not rely on definitions in the activities, videos, or elsewhere.



  4. Interpretability & flexibility
    When picking between different model building algorithms, there are several considerations, including flexibility and interpretability. Consider the following graphic from ISLR:

  1. Convince yourself that the placement of Subset Selection (e.g. backward stepwise), LASSO, Least Squares, and Splines (which is highly related to Generalized Additive Models) makes sense. (We’ll address other algorithms later this semester.)

  2. Where would you place KNN on this graphic?


  5. Differences and similarities
    For each pair of algorithms below, try to identify a key similarity, a key difference, and any scenario in which they’re “equivalent”.
    1. KNN vs LOESS
    2. LOESS vs Splines
    3. Splines vs least squares
    4. least squares vs LASSO


  6. Pros & cons (there’s a similar question on HW3)
    Summarize at least 1 pro and 1 con about each model building algorithm. You cannot use the same pro or con more than once!
    1. least squares
    2. LASSO
    3. KNN
    4. Splines


  7. Lingo (there’s a similar question on HW3)
    In your own words, describe what each of these “everyday” words means in the context of ML algorithms:
    • greedy
    • biased
    • parsimonious
    • (computationally) expensive
    • goldilocks problem



Deeper Learning (OPTIONAL)

KNN, regression splines, and LOESS aren’t the only flexible regression algorithms! Chapter 7 of ISLR also discusses:

  • piecewise polynomials (parametric)
  • smoothing splines (nonparametric)
  • generalized additive models (nonparametric)

Read Chapter 7 of the book, and watch Prof Johnson’s video on these topics to learn more:

https://youtu.be/sYC3PuONUy8


Chapter 7.7 specifically provides more information about generalized additive models (GAMs). These models take the ideas of splines and LOESS and combine them into one framework, allowing each predictor to have its own non-linear relationship with the outcome. They are implemented in tidymodels but can be computationally intensive. Each relationship can be estimated with either LOESS or smoothing splines, both of which are nonparametric approaches.

In general, we can think of GAM as

\[y = \beta_0 + f_1(x_1) + \cdots + f_{p}(x_{p}) + \epsilon\]

where the functions \(f_j(x)\) are estimated nonparametrically.
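
For the curious, here is a minimal sketch of a GAM using the mgcv package (an assumption on our part; the notes only state that GAMs are available in tidymodels). Each s() term is a smooth function \(f_j(x_j)\):

library(mgcv)
library(ISLR)
data(College)

# A GAM for Grad.Rate: Private enters linearly, the others get smooth functions
gam_mod <- gam(Grad.Rate ~ Private + s(Expend) + s(Outstate) + s(perc.alumni),
               data = College, method = "REML")
summary(gam_mod)
plot(gam_mod, pages = 1)  # plot each estimated smooth f_j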

Chapter 7.5 provides more information about smoothing splines. In general, these techniques estimate \(f(x)\) by penalizing wiggly behavior, where wiggliness is measured by the second derivative of \(f(x)\), i.e. the degree to which the slope changes. Specifically, \(f(x)\) is estimated to minimize:

\[\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int f''(t)^2 dt\]

That is, we want \(f(x)\) to produce small squared residuals (\(\sum_{i=1}^n (y_i - f(x_i))^2\)) and small total change in slope over the range of x (\(\int f''(t)^2 dt\)).
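
A minimal sketch (not from the notes): base R’s smooth.spline() fits this penalized criterion (up to internal scaling of \(\lambda\)) and chooses the penalty by generalized cross-validation by default.

set.seed(253)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

ss_fit <- smooth.spline(x, y)  # lambda chosen automatically (GCV)
ss_fit$lambda                  # the selected penalty
predict(ss_fit, x = 5)         # estimate f(5)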



Solutions

Small Group Discussions

  1. KNN Review
Solution
  1. The output of the model is just a prediction for each observational unit. There is nothing to interpret (no plots, no coefficients, etc.). Also, there is no way to tune the model to use fewer predictors.
  2. The only assumption we made was that the outcome values of \(y\) should be similar if the predictor values of \(x\) are similar. No other assumptions are made.
    • This is a pro if we want flexibility due to non-linear relationships and that assumption is true.
    • This is a con if relationships are actually linear or could be modeled with a parametric model.
  3. Nothing. I’d say this is a con. There is nothing to interpret, so the model is more of a black box in terms of knowing why it gives you a particular prediction.

Further details are in solutions for previous activity.


  2. Nonparametric vs Parametric
Solution
  • Use nonparametric methods when parametric model assumptions are too rigid. Forcing a parametric method in this situation can produce misleading conclusions.

  • Use parametric methods when the model assumptions hold. In such cases, parametric models provide more contextual insight (eg: meaningful coefficients) and the ability to detect which predictors are beneficial to the model.


  3. Splines & the Bias-Variance Tradeoff
Solution

  1. high bias but low variance: small deg_free
  2. low bias but high variance: large deg_free
  3. moderate bias and low variance: medium deg_free


  4. Splines & LASSO
Solution

Not necessarily or at least we’d want to proceed carefully…

  • LASSO could remove an intermediate term, so the full complex function would not be specified – would we be ok with that?
  • Maybe LASSO could help us choose our knots & degrees of freedom for regression splines.


  5. LOESS in R
Solution
  • as we increase the span, the model becomes smoother and simpler
  • compared to KNN, LOESS is smoother and closer to the shape of the relationship we’re trying to estimate


  6. LOESS & the Bias-Variance Tradeoff
Solution
  1. h near 1
  2. h near 0
  3. h somewhere in the middle


Exercises

  1. Evaluating a fully linear model
Solution
#a
# There might be a slight non-linear relationship
ggplot(college_sub, aes(y = Grad.Rate, x = PhD)) +
    geom_point() +
    geom_smooth(color = "blue", se = FALSE) +
    geom_smooth(method = "lm", color = "red", se = FALSE) +
    theme_classic()

# Only deviation in the extremes of Room and Board
ggplot(college_sub, aes(y = Grad.Rate, x = Room.Board)) +
    geom_point() +
    geom_smooth(color = "blue", se = FALSE) +
    geom_smooth(method = "lm", color = "red", se = FALSE) +
    theme_classic()

# linear relationship doesn't fit well but some extreme values in Personal
ggplot(college_sub, aes(y = Grad.Rate, x = Personal)) +
    geom_point() +
    geom_smooth(color = "blue", se = FALSE) +
    geom_smooth(method = "lm", color = "red", se = FALSE) +
    theme_classic()

# Close to linear
ggplot(college_sub, aes(y = Grad.Rate, x = Outstate)) +
    geom_point() +
    geom_smooth(color = "blue", se = FALSE) +
    geom_smooth(method = "lm", color = "red", se = FALSE) +
    theme_classic()

# Close to linear
ggplot(college_sub, aes(y = Grad.Rate, x = perc.alumni)) +
    geom_point() +
    geom_smooth(color = "blue", se = FALSE) +
    geom_smooth(method = "lm", color = "red", se = FALSE) +
    theme_classic()

The relationships between many of these predictors and Grad.Rate look reasonably linear. Perhaps two exceptions to this are Personal and PhD.


#b
set.seed(253)

# Lasso Model Spec with tune
lasso_spec <- linear_reg() %>%
  set_mode('regression')  %>%
  set_engine(engine = 'glmnet') %>% 
  set_args(mixture = 1, penalty = tune()) 
  
# Recipe
lasso_recipe <- recipe(Grad.Rate ~ ., data = college_sub) %>% # use all predictors ( ~ . )
  step_dummy(all_nominal_predictors())

# Workflow (Recipe + Model)
lasso_wf_tune <- workflow() %>% 
  add_recipe(lasso_recipe) %>%
  add_model(lasso_spec) 

# Tune Model (trying a variety of values of Lambda penalty)
lasso_models <- tune_grid( 
  lasso_wf_tune, # workflow
  resamples = vfold_cv(college_sub, v = 8), # 8-fold CV
  metrics = metric_set(mae), # calculate MAE
  grid = grid_regular(penalty(range = c(-5, 2)), levels = 30) 
)

# Select parsimonious model & fit
parsimonious_penalty <- lasso_models %>% 
  select_by_one_std_err(metric = 'mae', desc(penalty))

ls_mod <-  lasso_wf_tune  %>% 
  finalize_workflow(parameters = parsimonious_penalty) %>%
  fit(data = college_sub) 
    
# Note which variable is the "least" important    
ls_mod %>% 
  tidy()
# A tibble: 7 × 3
  term         estimate penalty
  <chr>           <dbl>   <dbl>
1 (Intercept) 37.1         1.17
2 PhD          0.0582      1.17
3 Room.Board   0.00102     1.17
4 Personal    -0.000740    1.17
5 Outstate     0.00136     1.17
6 perc.alumni  0.285       1.17
7 Private_Yes  0           1.17

Private_Yes has a coefficient estimate of zero, so it is “least” important.


# c
# Residuals v. Fitted/Predicted
ls_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = .pred, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()


# Residuals v. PhD
ls_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = PhD, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()

# Residuals v. Personal
ls_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = Personal, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()

# Residuals v. Room.Board
ls_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = Room.Board, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()

# Residuals v. Outstate
ls_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = Outstate, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()

# Residuals v. perc.alumni
ls_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = perc.alumni, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()

These residual plots look generally fine to me. Perhaps there is a bit of heteroskedasticity in the residual vs fitted plot. We also look to be over-estimating (residuals < 0) graduation rates for schools with low numbers of PhDs and mid-range levels (2000–5000) of personal spending.


  2. Evaluating a spline model
Solution
  a. See below:
# Model Spec
lm_spec <-
  linear_reg() %>%
  set_mode('regression') %>%
  set_engine(engine = 'lm')
  

# New Recipe (remove steps needed for LASSO, add splines)
spline_recipe <- recipe(Grad.Rate ~ ., data = college_sub) %>%
  step_ns(Personal, deg_free = 4) %>% # splines with 4 df for Personal
  step_ns(PhD, deg_free = 4) # splines with 4 df for PhD


# Create Workflow (Recipe + Model)
spline_wf <- workflow() %>% 
  add_recipe(spline_recipe) %>%
  add_model(lm_spec) 


set.seed(253)
# CV to Evaluate
ns_models <- fit_resamples(
  spline_wf, # workflow
  resamples = vfold_cv(college_sub, v = 8), # cv folds
  metrics = metric_set(mae) # calculate MAE
)

# check out CV metrics
ns_models %>% 
  collect_metrics()
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard    10.2     8   0.229 Preprocessor1_Model1
# Fit with all data
ns_mod <- fit( spline_wf, #workflow from above
  data = college_sub
)


#b. Residual plots

# Residuals v. Fitted/Predicted
ns_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = .pred, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()


# Residuals v. PhD
ns_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = PhD, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()

# Residuals v. Personal
ns_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = Personal, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()

# Residuals v. Room.Board
ns_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = Room.Board, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()

# Residuals v. Outstate
ns_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = Outstate, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()

# Residuals v. perc.alumni
ns_mod %>%
    augment(new_data = college_sub) %>%
    mutate(.resid = Grad.Rate - .pred) %>%
    ggplot(aes(x = perc.alumni, y = .resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") + 
    theme_classic()

I still notice the same mild heteroskedasticity in the residual vs fitted plot. However, the systematic over-estimation of graduation rates for schools with low numbers of PhDs and mid-range personal spending now looks better (the “typical” residual is closer to zero over those ranges).


#c. compare CV MAE
lasso_models %>% 
  collect_metrics() %>% 
  filter(penalty == (parsimonious_penalty %>% pull(penalty)))

ns_models %>% 
  collect_metrics()

The MAEs are very similar.


  3. Pick an algorithm
Solution

You can argue a couple of ways.

  • If you noted the non-linear relationships in the plots above: splines is more informative than KNN and less wrong than least squares.

  • If you didn’t think above that splines were warranted, you should pick least squares (or LASSO). It’s definitely simpler and easier to interpret.

Code

# least squares

set.seed(253)
ls_model <- lm_spec %>% # re-use lm specs from above
  # don't need a recipe since we're not doing any pre-processing
  fit_resamples(   # get 10-fold CV MAE
    Grad.Rate ~ .,
    resamples = vfold_cv(college_sub, v = 10), 
    metrics = metric_set(mae)
  )

# look at CV MAE
ls_model %>% 
  collect_metrics()
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard    10.2    10   0.365 Preprocessor1_Model1
# KNN
# specifications
knn_spec <- nearest_neighbor() %>%
  set_mode("regression") %>% 
  set_engine(engine = "kknn") %>% 
  set_args(neighbors = tune())

# recipe
variable_recipe <- recipe(Grad.Rate ~ ., data = college_sub) %>% 
  step_nzv(all_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# workflow
knn_workflow <- workflow() %>% 
  add_model(knn_spec) %>% 
  add_recipe(variable_recipe)

# tune K using 10-fold CV
set.seed(253)
knn_models <- knn_workflow %>% 
  tune_grid(
    grid = grid_regular(neighbors(range = c(1, 200)), levels = 50),
    resamples = vfold_cv(college_sub, v = 10),
    metrics = metric_set(mae)
  )

# select best K 
best_K <- select_best(knn_models, metric = "mae")
best_K
# A tibble: 1 × 2
  neighbors .config              
      <int> <chr>                
1        33 Preprocessor1_Model09
# fit final model
knn_models %>% 
  collect_metrics() %>% 
  filter(neighbors == best_K$neighbors)
# A tibble: 1 × 7
  neighbors .metric .estimator  mean     n std_err .config              
      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1        33 mae     standard    10.2    10   0.337 Preprocessor1_Model09


Wrap-Up

  • Finish the activity, check the solutions, review the optional resources, and reach out with questions
  • Before our next class:
    • review instructions for Group Assignment 1
    • if you will be absent: alert your group members and come up with a plan to contribute to the collaboration outside class
  • Upcoming important dates:
    • HW3: due Feb 25
    • MSCS Capstone Days: Feb 27–28 (no class; reflection due in class Mar 4)
    • GA1: due Mar 5
    • Quiz 1: in class Mar 6