21  Unsupervised Learning Review

Settling In

  • Sit with the same group as last class
  • Check Slack for recent announcements and other posts
  • Start the end-of-course survey (link below)


End-of-Course Survey

LINK

Also available on Moodle: Spring 2026 Statistical Machine Learning > Feedback > End of Course Survey (EOCS)


Important

Just as feedback from instructors and preceptors is helpful for guiding your learning, we also rely heavily on your feedback to guide updates / improvements to this class!

I would GREATLY appreciate hearing your thoughts on what worked well for you in this class, and any ideas you have on ways it could be improved. Please take ~15 minutes to fill out the End-of-Course Survey to share your thoughts.


FYI: Your responses to this survey are anonymous, and I will not have access to them until after final grades have been submitted.



Preparing for Quiz 3

Reminders

  • During assigned final exam block
  • Format: same as Quizzes 1 and 2
  • Content: cumulative, with focus on unsupervised learning + big picture concepts
  • Study Tips:
    • Use the course Learning Goals as a study guide
    • Fill out the STAT 253 Concept Maps
    • Work through the Additional Review Questions, below
    • Review old CPs, HWs, and in-class exercises
    • Attend preceptor and instructor office hours
    • Work on Group Assignment 3 (today!)


Additional Review Questions

Here are a few more review questions that I encourage you to come back to while studying for the quiz. Stop by office hours if you want to check your answers!

Part 1: enduring, big picture concepts

Slide 1 of the STAT 253 Concept Maps (link above) presents a set of enduring, big picture questions that are critical to doing, critiquing, and understanding machine learning analyses. I hope these stick with you for years to come. They are also important for Quiz 3.

Respond in your own words!

  • When do we use a supervised vs an unsupervised learning algorithm?
  • Within supervised learning, when do we use a regression vs a classification algorithm?
  • What is the importance of model evaluation and what questions does it address?
  • What is overfitting and why is it bad?
  • What is cross-validation? How does it work and what problem is it trying to address?
  • What is the bias-variance tradeoff? What models tend to have high bias? High variance?
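
To make the overfitting and cross-validation questions above more concrete, here is a minimal base-R sketch of 5-fold cross-validation. It is only an illustration under some assumptions: R's built-in mtcars data stands in for real course data, the helper names cv_mse and in_sample_mse are hypothetical, and the polynomial degrees are arbitrary choices.

```r
# Minimal 5-fold cross-validation sketch (base R only)
# Assumptions: mtcars is a stand-in dataset; cv_mse() and in_sample_mse()
# are hypothetical helper names used only for this illustration
set.seed(253)

k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = n))

# Cross-validated MSE for a polynomial model of mpg by hp
cv_mse <- function(degree) {
  fold_mse <- sapply(1:k, function(j) {
    train <- mtcars[folds != j, ]
    test  <- mtcars[folds == j, ]
    # Fit on k - 1 folds, then evaluate on the held-out fold
    fit  <- lm(mpg ~ poly(hp, degree), data = train)
    pred <- predict(fit, newdata = test)
    mean((test$mpg - pred)^2)
  })
  mean(fold_mse)
}

# In-sample MSE: the model is evaluated on the same data used to build it
in_sample_mse <- function(degree) {
  fit <- lm(mpg ~ poly(hp, degree), data = mtcars)
  mean(residuals(fit)^2)
}

degrees <- c(1, 2, 6)
results <- data.frame(
  degree        = degrees,
  in_sample_mse = sapply(degrees, in_sample_mse),
  cv_mse        = sapply(degrees, cv_mse)
)
results
```

The in-sample error can only go down as the polynomial gets more flexible, whereas the cross-validated error need not. Comparing the two columns is one way to think about what overfitting looks like.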

Part 2: supervised learning

Slides 2–8 present a variety of regression & classification algorithms & concepts. On Quiz 3, you won’t be asked about the nitty gritty (eg: how to interpret coefficients, make predictions, do the algorithm by hand). But you should have the following bigger picture understanding of how all of the algorithms fit together.

For each algorithm:

  • In what situations is it useful? Could you use it for regression, classification, or both?
  • Is the algorithm parametric or nonparametric? What’s the difference?
  • In general, how does the algorithm work?

Part 3: unsupervised learning

Slide 9 presents important unsupervised tasks and algorithms. In the table, take notes on the following:

  • What’s the goal of clustering? Dimension reduction?
  • How are these goals similar? How do they differ?
  • Reflect upon the hierarchical clustering algorithm. And think:
    • What are the steps of the algorithm?
    • Can you implement this algorithm by hand for a small sample?
    • Can you interpret and use a dendrogram?
    • What’s the difference between complete, single, centroid, and average linkage? What role do these play in hierarchical clustering?
    • What are some pros and cons of this algorithm?
  • Reflect upon the K-means algorithm. And think:
    • What is K?
    • What values can K take and what impact does this have on our results?
    • What are the steps of the algorithm?
    • What are some pros and cons of this algorithm?
  • Reflect upon the principal component analysis algorithm.
    • What’s the goal?
    • What does PCA produce?
    • Can you interpret PCs and understand how they’re defined (the idea, not the math)?
    • Can you interpret loadings plots, scree plots, and score plots?
    • What are some pros and cons of this algorithm?
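
For the "by hand" hierarchical clustering questions above, it can help to check your work on a tiny example. The sketch below uses 5 made-up observations on a single feature (the values and the names a to e are invented for illustration) and compares complete and single linkage using base R's dist, hclust, and cutree.

```r
# Toy sample: 5 observations on one feature (made-up values)
toy <- data.frame(x = c(1, 2, 4, 8, 13))
rownames(toy) <- c("a", "b", "c", "d", "e")

# Step 1: compute all pairwise (Euclidean) distances
d <- dist(toy)

# Step 2 onward: repeatedly fuse the two closest clusters.
# The linkage defines what "closest" means for clusters with 2+ points.
hc_complete <- hclust(d, method = "complete")  # max pairwise distance
hc_single   <- hclust(d, method = "single")    # min pairwise distance

# Heights at which the fusions happen (the dendrogram's y-axis)
hc_complete$height   # 1 3 5 12
hc_single$height     # 1 2 4 5

# Cutting each dendrogram into 2 clusters gives different answers
clusters_complete <- cutree(hc_complete, k = 2)  # {a, b, c} and {d, e}
clusters_single   <- cutree(hc_single, k = 2)    # {a, b, c, d} and {e}

# To draw the dendrogram: plot(hc_complete)
```

Try reproducing the fusion heights by hand from the distance matrix d. The first fusion (a with b, at height 1) is the same under both linkages; they differ from the second fusion onward.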

Part 4: supervised + unsupervised

Principal Component Regression (PCR) combines supervised and unsupervised ideas. It has been added to Slide 11.

Reflect upon the following:

  • What is the goal of PCR?
  • What are the general steps of the PCR algorithm?
  • How does PCR differ from LASSO? How does it differ from just kicking out some predictors (eg: using backward stepwise regression)?
  • What are some pros and cons of PCR relative to LASSO or other variable selection techniques?
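
The general steps of PCR can be sketched in a few lines of base R. This is a rough illustration, not necessarily the workflow we used in class: the built-in mtcars data stands in for real data, mpg plays the role of the outcome, the five predictors are an arbitrary subset, and keeping 2 PCs is an arbitrary choice.

```r
# PCR sketch in base R
# Assumptions: mtcars is a stand-in dataset, mpg is the outcome,
# and retaining 2 PCs is an arbitrary choice for illustration
predictors <- c("disp", "hp", "drat", "wt", "qsec")

# Step 1 (unsupervised): PCA on the standardized predictors, ignoring the outcome
pca <- prcomp(mtcars[, predictors], center = TRUE, scale. = TRUE)

# Step 2: keep the scores on the first few PCs
n_pcs    <- 2
pcr_data <- data.frame(mpg = mtcars$mpg, pca$x[, 1:n_pcs])

# Step 3 (supervised): ordinary linear regression of the outcome on those scores
pcr_fit <- lm(mpg ~ ., data = pcr_data)
coef(pcr_fit)
r_squared <- summary(pcr_fit)$r.squared
r_squared

# Every original predictor still contributes to the model through the loadings
loadings <- pca$rotation[, 1:n_pcs]
loadings
```

Note that no predictor is dropped: each retained PC is a combination of all five predictors. Keep that in mind when comparing PCR to LASSO or stepwise selection.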



Exercises: (Mini) Group Assignment 3

Today, we’ll be doing an abridged version of Group Assignment 3. See how far you can get during class time today. If you don’t finish, I recommend continuing to work on this as a way of studying for Quiz 3. However, you will not turn anything in. As long as you are in class today and are actively engaged in this activity, you will pass Group Assignment 3.

Note

If you miss class today, please get in touch with your instructor as soon as possible to discuss instructions and expectations for making up this assignment.


Goals

Our goals for this assignment are as follows:

  • Conduct an open-ended unsupervised learning analysis and deepen your understanding of these concepts.
  • Build confidence by working on more open-ended tasks, outside the context of scaffolded in-class exercises or homework prompts.
  • Learn to collaborate, and learn from collaborating
  • Prepare for Quiz 3!

With the final goal in mind, please use this activity as an opportunity to review the content (and code) we covered in the last few weeks of class. If you get stuck, talk it through with your group, review the course website and past assignments, and then ask your instructor for help (in that order!), before (and, ideally, instead of!) accessing any outside materials.


Getting Started

Each person should create a new Quarto document where you can document your analysis and take notes.

Each group should also create a Google Doc where you can document your overall process, insights, and questions as you work through the tasks below as a group. Share this Google Doc with the instructor, with editing access.


Data Context and Prep

Our data for this assignment come from a 2021 TidyTuesday challenge focused on data from the Billboard Top 100 list.

Some Mac professors did some initial data cleaning for us. We’ll access a “cleaner” version of the data here:

# load data
library(tidyverse)
music <- read_csv("https://bcheggeseth.github.io/253_spring_2024/data/billboard.csv")

# check it out
head(music)
# A tibble: 6 × 17
  performer      song       duration_ms danceability energy   key loudness  mode
  <chr>          <chr>            <dbl>        <dbl>  <dbl> <dbl>    <dbl> <dbl>
1 Andy Williams  ......And…      166106        0.154  0.185     5   -14.1      1
2 Sandy Nelson   ...And Th…      172066        0.588  0.672    11   -17.3      0
3 Britney Spears ...Baby O…      211066        0.759  0.699     0    -5.74     0
4 Taylor Swift   ...Ready …      208186        0.613  0.764     2    -6.51     1
5 Paul Davis     '65 Love …      219813        0.647  0.686     2    -4.25     0
6 Tammy Wynette  'til I Ca…      182080        0.45   0.294     7   -12.0      1
# ℹ 9 more variables: speechiness <dbl>, acousticness <dbl>,
#   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
#   time_signature <dbl>, spotify_popularity <dbl>, billboard_weeks <dbl>

You can find a codebook and additional details about the data here. Take a few minutes to read this documentation and explore the dataset in R before you move on.






This is a big dataset. Let’s focus on ONE artist who has at least 25 songs in the dataset:

# Check out artists with at least 25 songs
music %>% 
  count(performer) %>% 
  filter(n >= 25) %>% 
  select(performer)
# A tibble: 101 × 1
   performer       
   <chr>           
 1 Aerosmith       
 2 Andy Williams   
 3 Anne Murray     
 4 Aretha Franklin 
 5 Ariana Grande   
 6 B.B. King       
 7 Barbra Streisand
 8 Barry Manilow   
 9 Bee Gees        
10 Beyonce         
# ℹ 91 more rows


Pick one of these artists to focus on for the remainder of your analysis. Use the code below to filter the data to only include songs by this artist and do a little extra data cleaning.

# Pick just one of these artists to study
my_artist <- music %>% 
  filter(performer == "___") %>% 
  select(-performer) %>% 
  group_by(song) %>%       # Some songs appear more than once: keep one random row per song
  slice_sample(n = 1) %>% 
  ungroup()

“Solution”

I decided to focus on U2 since they were the first band I ever saw in concert!

# Pick just one of these artists to study
my_artist <- music %>% 
  filter(performer == "U2") %>% 
  select(-performer) %>% 
  group_by(song) %>%       # Some songs appear more than once: keep one random row per song
  slice_sample(n = 1) %>% 
  ungroup()

# check it out
head(my_artist)
# A tibble: 6 × 16
  song          duration_ms danceability energy   key loudness  mode speechiness
  <chr>               <dbl>        <dbl>  <dbl> <dbl>    <dbl> <dbl>       <dbl>
1 (Pride) In T…      228426        0.464  0.839     4    -7.11     1      0.0329
2 All I Want I…      389933        0.239  0.442     1   -11.5      1      0.0308
3 Angel Of Har…      229266        0.519  0.694     0   -10.8      1      0.0264
4 Beautiful Day      246400        0.537  0.921     2    -6.47     1      0.0481
5 Desire             179360        0.49   0.827     8    -9.48     1      0.0482
6 Discotheque        318666        0.581  0.825     4    -9.69     0      0.0373
# ℹ 8 more variables: acousticness <dbl>, instrumentalness <dbl>,
#   liveness <dbl>, valence <dbl>, tempo <dbl>, time_signature <dbl>,
#   spotify_popularity <dbl>, billboard_weeks <dbl>




BEFORE you conduct any analyses, take some time to familiarize yourself with these data.

  • What does each observation represent?
  • What does each variable represent?
  • Which variables are quantitative, and which are categorical? Do you need to do any preprocessing to ensure R knows which is which?
  • What other preprocessing might be needed?
  • Etc.
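
As one possible way to approach the preprocessing question, the sketch below checks how R stores each column and converts number-coded categories to factors. The rows are made up (they only mimic a few column names from the song data), and treating key, mode, and time_signature as categorical is an assumption you should confirm against the codebook.

```r
# Made-up rows that mimic a few columns of the song data (not real values)
toy_songs <- data.frame(
  song           = c("Song A", "Song B", "Song C"),
  energy         = c(0.62, 0.81, 0.35),
  key            = c(0, 7, 2),
  mode           = c(1, 0, 1),
  time_signature = c(4, 4, 3)
)

# How does R currently store each column?
classes_before <- sapply(toy_songs, class)
classes_before

# Assumption: these columns are numeric codes for categories (check the codebook!)
code_vars <- c("key", "mode", "time_signature")
toy_songs[code_vars] <- lapply(toy_songs[code_vars], factor)

# Confirm the change
classes_after <- sapply(toy_songs, class)
classes_after
```

Whether and how you use categorical variables in a clustering or PCA analysis is a separate decision, since both rely on distances or variances of quantitative features.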




Clustering Analysis

Your first task is to conduct a clustering analysis and write 1–2 paragraphs (bullet points are fine) with some key takeaways. Be sure to address:

  • which algorithm you used, and why
  • any decisions you had to make about how to implement that algorithm (the type of linkage to use, if applicable; how many clusters to create; etc.), and why
  • the various clusters you identify
  • the features of these clusters

Create at least 3 visualizations that support these takeaways. These visualizations must each present unique information about the data, i.e. not simply be different approaches to displaying similar information.

NOTE: We’ve discussed two clustering techniques in this class. Try both, compare/contrast your results, and think about which you would prefer in this setting if you could only present results from one. Suggestion: you might have two members of your group work on one, and two on the other, and then come back together to discuss what you learn and pros/cons of each.
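
If you want a starting point for comparing the two techniques, here is a minimal base-R sketch. It runs on simulated stand-in data (two artificial, well-separated groups with hypothetical features feat1 and feat2), not on the song data, and the choices of complete linkage and 2 clusters are arbitrary.

```r
set.seed(253)

# Simulated stand-in data: 2 well-separated groups of 20 "songs" on 2 features
sim <- data.frame(
  feat1 = c(rnorm(20, mean = 0), rnorm(20, mean = 10)),
  feat2 = c(rnorm(20, mean = 0), rnorm(20, mean = 10))
)

# Standardize so that no single feature dominates the distance calculations
sim_scaled <- scale(sim)

# Hierarchical clustering: build the full tree, then cut it into 2 clusters
hc          <- hclust(dist(sim_scaled), method = "complete")
hc_clusters <- cutree(hc, k = 2)

# K-means: K must be chosen up front; nstart tries several random starts
km          <- kmeans(sim_scaled, centers = 2, nstart = 20)
km_clusters <- km$cluster

# Compare the two sets of cluster assignments
# (cluster labels are arbitrary, so look at the pattern, not the numbers)
comparison <- table(hierarchical = hc_clusters, kmeans = km_clusters)
comparison

# Describe the clusters: average of each original feature within each cluster
cluster_means <- aggregate(sim, by = list(cluster = km_clusters), FUN = mean)
cluster_means
```

With real data the two algorithms will rarely agree this cleanly, which is exactly what makes the compare/contrast discussion worthwhile.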




Dimension Reduction

Next, conduct dimension reduction using Principal Component Analysis. Like above, write 1–2 paragraphs with some key takeaways. For this analysis, be sure to address:

  • the features of the first 2 principal components
  • how many PCs you might retain and why
  • the amount of the original information for which your retained PCs account

Create at least 2 visualizations that support these takeaways. These visualizations must each present unique information about the data, i.e. not simply be different approaches to displaying similar information.
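
As a reminder of what PCA produces, the sketch below uses base R's prcomp on the built-in USArrests data, which is a stand-in here (not the song data). The 80% threshold for retaining PCs is an arbitrary choice for illustration.

```r
# Assumption: USArrests (built into R) stands in for our song features
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings: how each original variable contributes to each PC
round(pca$rotation[, 1:2], 2)

# Proportion of variance explained (PVE) by each PC, and cumulatively
pve     <- pca$sdev^2 / sum(pca$sdev^2)
cum_pve <- cumsum(pve)
data.frame(pc = seq_along(pve), pve = round(pve, 3), cum_pve = round(cum_pve, 3))

# Smallest number of PCs that retain at least 80% of the original variance
# (the 80% cutoff is arbitrary)
n_keep <- which(cum_pve >= 0.80)[1]
n_keep

# Scores: each observation's coordinates on the new PC axes
head(pca$x[, 1:2])
```

The pve and cum_pve vectors are what a scree plot displays, and the first two columns of pca$x are what a score plot displays.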




Optional: Principal Component Regression

Use principal component regression to predict billboard_weeks and discuss your takeaways.

Some questions to consider:

  • What additional preprocessing is needed, if any, for this analysis? (Are there any variables you want to exclude?)
  • How many PCs do you want to include in this model? Why?
  • How “good” is your final model?
  • What pros and cons does this model offer compared to other possible models/algorithms we could use for this regression task?
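
One way to think about "how many PCs?" is to compare cross-validated error across candidate values. The base-R sketch below is an illustration under assumptions: mtcars stands in for the song data (with mpg in the role of billboard_weeks), cv_rmse_pcr is a hypothetical helper, and the PCA is re-fit within each training fold so that held-out rows never influence the PCs.

```r
set.seed(253)

# Assumptions: mtcars is a stand-in dataset with mpg as the outcome;
# cv_rmse_pcr() is a hypothetical helper written for this illustration
predictors <- c("disp", "hp", "drat", "wt", "qsec")
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

cv_rmse_pcr <- function(n_pcs) {
  sq_err <- unlist(lapply(1:k, function(j) {
    train <- mtcars[folds != j, ]
    test  <- mtcars[folds == j, ]

    # PCA uses the training folds only, to avoid leaking held-out data
    pca <- prcomp(train[, predictors], center = TRUE, scale. = TRUE)
    train_pcs <- data.frame(mpg = train$mpg, pca$x[, 1:n_pcs, drop = FALSE])

    # Project the held-out rows onto the PCs learned from the training folds
    test_scores <- predict(pca, newdata = test[, predictors])
    test_pcs    <- as.data.frame(test_scores[, 1:n_pcs, drop = FALSE])

    fit <- lm(mpg ~ ., data = train_pcs)
    (test$mpg - predict(fit, newdata = test_pcs))^2
  }))
  sqrt(mean(sq_err))
}

# Cross-validated RMSE for each possible number of PCs
cv_results <- data.frame(n_pcs = 1:5, cv_rmse = sapply(1:5, cv_rmse_pcr))
cv_results
```

Using all 5 PCs is equivalent to ordinary least squares on the 5 original predictors, so the last row gives a useful baseline for judging whether fewer PCs cost much predictive accuracy.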




Wrapping Up

Upcoming Due Dates:

  • HW7: Apr 27
  • Quiz 2 Revisions: Apr 29
  • GA3: N/A (if you are actively engaged in class)
  • Reflection 4: May 4
    • Steps 1–3: complete before our last day of class
    • Step 4: due by 11:59 pm on our last day of class
  • Quiz 3: during our assigned final exam block
  • End-of-Course Survey (link above): May 11


Reflection 4

Your fourth (and final!) learning reflection will be due by 11:59 pm on the last day of class.

This reflection will include three components:

  • pre-class preparation (Steps 1–3)
  • in-class discussion (Step 3)
  • post-class reflection (Step 4)

See the Reflection 4 assignment on Moodle for more detailed instructions. Give yourself at least 90 minutes to complete the pre-class preparation steps!