library(palmerpenguins)
library(tidyverse)
data("penguins")

penguins %>% 
  select(bill_length_mm, bill_depth_mm, body_mass_g) %>% 
  head()
18 Hierarchical Clustering
Settling In
- Sit wherever you want!
- Prepare to take notes (find the QMD for today in the usual spot)
- Catch up on any announcements you’ve missed on Slack
- Fill out the Group Assignment Feedback Survey (Moodle > Surveys)
- Please be specific about who you want to work with for the next group assignment
Registration Tips
Thinking about what MSCS course to take next semester? Consider the following:
- Interested in working with data (eg wrangling, plotting, data acquisition, advanced data processing)? Try:
- COMP/STAT 112 (Intro to Data Science)
- COMP/STAT 212 (Intermediate Data Science)
- Interested in data and society? Apply to be part of this pilot class:
- STAT 202 (2-credit Data and Society)
- Interested in developing a mathematical foundation for better/deeper understanding of statistical methodology? Try:
- MATH/STAT 354 (Probability)
- STAT 452 (Correlated Data)
- Interested in learning some computing techniques that are fundamental to data science? Try:
- COMP 123 (Core Concepts in Comp Sci)
- COMP 127 (Object-Oriented Programming and Abstraction)
NOTE: all of the highlighted courses above require permission of instructor to register. Fill out the corresponding interest form by April 16th if you’re interested in taking any of these courses: https://www.macalester.edu/mscs/mscs-course-waitlists/
Learning Goals
- Clearly describe / implement by hand the hierarchical clustering algorithm
- Compare and contrast k-means and hierarchical clustering in their outputs and algorithms
- Interpret cuts of the dendrogram for single and complete linkage
- Describe the rationale for how clustering algorithms work in terms of within-cluster variation
- Describe the tradeoff of more vs. fewer clusters in terms of interpretability
- Implement strategies for interpreting / contextualizing the clusters
Notes: Unsupervised Learning
Context
GOALS
Suppose we have a set of feature variables \((x_1,x_2,...,x_k)\) but NO outcome variable \(y\).
Instead of our goal being to predict/classify/explain \(y\), we might simply want to…
- Examine the structure of our data.
- Utilize this examination as a jumping off point for further analysis.
Unsupervised Methods
Cluster Analysis (Unit 6):
- Focus: Structure among the rows, i.e. individual cases or data points.
- Goal: Identify and examine clusters or distinct groups of cases with respect to their features x.
- Methods: hierarchical clustering & K-means clustering
Dimension Reduction (Unit 7):
- Focus: Structure among the columns, i.e. features x.
- Goal: Combine groups of correlated features x into a smaller set of uncorrelated features which preserve the majority of information in the data. (We’ll discuss the motivation later!)
- Methods: Principal components
Examples
Remember this from the first day of class?
For more examples, check out the Unit 6 Motivating Question page here.
Notes: Hierarchical Cluster Analysis
Let’s recap the main ideas from the videos you watched before class today.
Goal
Create a hierarchy of clusters where clusters consist of similar data points.
Algorithm
Suppose we have a set of p feature variables (\(x_1, x_2,..., x_p\)) on each of n data points. Each data point starts as a leaf.
- Compute the Euclidean distance between all pairs of data points with respect to x.
- Fuse the 2 closest data points into a single cluster or branch.
- Continue to fuse the 2 closest clusters until all cases are in 1 cluster.
NOTE: This is referred to as an “agglomerative” algorithm.
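To connect this algorithm to code: the hclust() function carries out exactly these steps. Below is a minimal sketch, assuming the penguins data loaded above (the object names penguin_features and hier_demo are just for illustration):

library(palmerpenguins)
library(dplyr)

# Standardized features for the penguins (more on standardizing below)
penguin_features <- penguins %>% 
  select(bill_length_mm, bill_depth_mm, body_mass_g) %>% 
  na.omit() %>% 
  scale()

# Step 1: Euclidean distances between all pairs of penguins
# Steps 2-3: repeatedly fuse the 2 closest clusters (complete linkage here)
hier_demo <- hclust(dist(penguin_features), method = "complete")

head(hier_demo$merge)   # which cases / clusters were fused at each step
head(hier_demo$height)  # the distance at which each fusion happened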
Dendrograms
The hierarchical clustering algorithm produces a dendrogram (of or relating to trees).
To use a dendrogram:
- Start with the leaves at the bottom (unlike classification trees!). Each leaf represents a single case / row in our dataset.
- Moving up the tree, fuse similar leaves into branches, fuse similar branches into bigger branches, fuse all branches into one big trunk (all cases).
- The more similar two cases, the sooner their branches will fuse. The height of the first fusion between two cases’ branches measures the “distance” between them.
- The horizontal distance between 2 leaves does not reflect distance!
Measuring Distance
There are several linkages we can use to measure distance between 2 clusters / branches. Unless specified otherwise, we’ll use the complete linkage method.
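To see in code how the linkage changes things, here's a hedged sketch on a toy data frame (the names toy, complete_tree, and single_tree are illustrative only). The linkage only changes how the distance between two clusters is measured, and therefore the heights at which branches fuse:

# Toy data: 5 points with 2 features
toy <- data.frame(x1 = c(1, 2, 6, 7, 8),
                  x2 = c(1, 1, 5, 5, 6))
d <- dist(toy)  # Euclidean distances between all pairs of points

# Complete linkage: distance between 2 clusters = MAX pairwise distance
complete_tree <- hclust(d, method = "complete")
# Single linkage: distance between 2 clusters = MIN pairwise distance
single_tree <- hclust(d, method = "single")

complete_tree$height  # fuse heights under complete linkage
single_tree$height    # fuse heights under single linkage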
Small Group Discussion
EXAMPLE 1: Standardizing the features
Let’s start by using hierarchical clustering to identify similar groups (and possible species!) of penguins with respect to their bill lengths (mm), bill depths (mm), and body masses (g).
This algorithm relies on calculating the distances between each pair of penguins.
Why is it important to first standardize our 3 features to the same scale (centered around a mean of 0 with standard deviation 1)?
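If you want to poke at this in code before answering, here's a short sketch (assuming the penguins data loaded above). The scale() function centers each feature at 0 and rescales it to standard deviation 1:

library(palmerpenguins)
library(dplyr)

# Compare the 3 features on their raw scales...
penguins %>% 
  select(bill_length_mm, bill_depth_mm, body_mass_g) %>% 
  summary()

# ...to the same features after standardizing with scale()
penguins %>% 
  select(bill_length_mm, bill_depth_mm, body_mass_g) %>% 
  na.omit() %>% 
  scale() %>% 
  summary()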
EXAMPLE 2: Interpreting a dendrogram
Check out the standardized data, heat map, and dendrogram for a sample of just 50 penguins.
Is penguin 30 more similar to penguin 33 or 12?
Identify and interpret the distance, calculated using the complete linkage, between penguins 30 and 33.
Consider the far left penguin cluster on the dendrogram, starting with penguin 33 and ending with penguin 30. Use the heat map to describe what these penguins have in common.
EXAMPLE 3: By Hand
To really get the details, you’ll perform hierarchical clustering by hand for a small example.
Ask your instructor for a copy of the paper handout.
Code for importing the data is here:
# Record the data
penguins_small <- data.frame(
  width = c(-1.3, -0.8, 0.2, 0.8, 1.0),
  length = c(0.6, 1.4, -1.0, -0.2, -0.7))
# Plot the data
ggplot(penguins_small, aes(x = width, y = length)) +
geom_point() +
geom_text(aes(label = c(1:5)), vjust = 1.5) +
theme_minimal()
# Calculate the distance between each pair of penguins
round(dist(penguins_small), 2)
1 2 3 4
2 0.94
3 2.19 2.60
4 2.25 2.26 1.00
5 2.64 2.77 0.85 0.54
# Type out the R code from Part b below
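As a quick sanity check of the distance matrix above (this isn't the Part b answer, just arithmetic): each entry is an ordinary Euclidean distance. For example, for penguins 4 and 5:

# Euclidean distance between penguin 4 (width 0.8, length -0.2)
# and penguin 5 (width 1.0, length -0.7)
sqrt((0.8 - 1.0)^2 + (-0.2 - (-0.7))^2)  # sqrt(0.04 + 0.25) = 0.539, which rounds to 0.54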
EXAMPLE 4: Explore Clusters
Now, let’s go back to a sample of 50 penguins.
Run the 2 chunks below to build a shiny app that we’ll use to build some intuition for hierarchical clustering.
Put the slider at 9 clusters. These 9 clusters are represented in the dendrogram and scatterplot of the data. Do you think that 9 clusters is too many, too few, or just right?
Now set the slider to 5. Simply take note of how the clusters appear in the data.
Now sloooooowly set the slider to 4, then 3, then 2. Each time, notice what happens to the clusters in the data plot. Describe what you observe.
What are your thoughts about the 2-cluster solution? What happened?!
Code
# Load the data
library(tidyverse)
set.seed(253)
more_penguins <- penguins %>% 
  sample_n(50) %>% 
  select(bill_length_mm, bill_depth_mm) %>% 
  na.omit()
# Run hierarchical clustering
penguin_cluster <- hclust(dist(scale(more_penguins)), method = "complete")
# Record cluster assignments
clusters <- more_penguins %>% 
  mutate(k = rep(1, nrow(more_penguins)), cluster = rep(1, nrow(more_penguins)))
for(i in 2:12){
  clusters <- more_penguins %>% 
    mutate(k = rep(i, nrow(more_penguins)),
           cluster = cutree(penguin_cluster, k = i)) %>% 
    bind_rows(clusters)
}
Code
library(shiny)
library(factoextra)
library(RColorBrewer)
# Build the shiny server
server_hc <- function(input, output) {
  dend_plot <- reactive({
    cols = brewer.pal(n = input$k_pick, "Set1")
    fviz_dend(penguin_cluster, k = input$k_pick, k_colors = cols)
  })
  output$model_plot <- renderPlot({
    cols = brewer.pal(n = input$k_pick, "Set1")
    dend <- attributes(dend_plot())$dendrogram
    tree_order <- order.dendrogram(dend)
    clusters_k <- clusters %>% 
      filter(k == input$k_pick)
    clusters_k <- clusters_k %>% 
      mutate(cluster = factor(cluster, levels = unique(clusters_k$cluster[tree_order])))
    names(cols) = unique(clusters_k$cluster[tree_order])
    clusters_k %>% 
      ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = factor(cluster))) +
      geom_point(size = 3) +
      scale_color_manual(values = cols) +
      theme_minimal() +
      theme(legend.position = "none")
  })
  output$dendrogram <- renderPlot({
    dend_plot()
  })
}
# Build the shiny user interface
ui_hc <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      h4("Pick the number of clusters:"),
      sliderInput("k_pick", "cluster number", min = 1, max = 9, value = 9, step = 1, round = TRUE)
    ),
    mainPanel(
      plotOutput("dendrogram"),
      plotOutput("model_plot")
    )
  )
)
# Run the shiny app!
shinyApp(ui = ui_hc, server = server_hc)
EXAMPLE 5: Details
Is hierarchical clustering greedy?
We learned in the video that, though they both produce tree-like output, the hierarchical clustering and classification tree algorithms are not the same thing! Similarly, though they both calculate distances between each pair of data points, the hierarchical clustering and KNN algorithms are not the same thing! Explain.
Note: When all features x are quantitative or logical (TRUE/FALSE), we measure the similarity of 2 data points with respect to the Euclidean distance between their standardized x values. But if at least 1 feature x is a factor / categorical variable, we measure the similarity of 2 data points with respect to their Gower distance. The idea is similar to what we did in KNN (converting categorical x variables to dummies, and then standardizing), but the details are different.
If you’re interested:
library(cluster)
?daisy
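For instance, here's a small sketch with a made-up mixed data frame (the names mixed_data and gower_hier are illustrative only):

library(cluster)

# One quantitative feature and one categorical feature
mixed_data <- data.frame(
  flipper_length_mm = c(181, 186, 210, 195),
  sex = factor(c("female", "male", "female", "male")))

# Gower distance handles the mixed feature types; hclust() takes it from there
gower_hier <- hclust(daisy(mixed_data, metric = "gower"), method = "complete")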
Exercises
For the rest of class, work together on Exercises 1-3 of HW6. We'll cover the material needed for the remaining exercises in the next class.
Future Reference
Notes: R code
The tidymodels package is built for models of some outcome variable y. We can't use it for clustering.
Instead, we’ll use a variety of new packages that use specialized, but short, syntax.
Suppose we have a set of sample_data with multiple feature columns x, and (possibly) a column named id which labels each data point.
# Install packages
library(tidyverse)
library(cluster) # to build the hierarchical clustering algorithm
library(factoextra) # to draw the dendrograms
PROCESS THE DATA
If there’s a column that’s an identifying variable or label, not a feature of the data points, convert it to a row name.
sample_data <- sample_data %>% 
  column_to_rownames("id")
RUN THE CLUSTERING ALGORITHM
# Scenario 1: ALL features x are quantitative OR logical (TRUE/FALSE)
# Use either a "complete", "single", "average", or "centroid" linkage_method
hier_model <- hclust(dist(scale(sample_data)), method = ___)
# Scenario 2: AT LEAST 1 feature x is a FACTOR (categorical)
# Use either a "complete", "single", "average", or "centroid" linkage
hier_model <- hclust(daisy(sample_data, metric = "gower"), method = ___)
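For example, here's Scenario 1 with its blank filled in using the complete linkage we default to in these notes (just one reasonable choice):

hier_model <- hclust(dist(scale(sample_data)), method = "complete")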
VISUALIZING THE CLUSTERING: HEAT MAPS AND DENDROGRAMS
NOTE: Heat maps can be goofy if at least 1 x feature is categorical
# Heat maps: ordered by the id variable (not clustering)
heatmap(scale(data.matrix(sample_data)), Colv = NA, Rowv = NA)
# Heat maps: ordered by dendrogram / clustering
heatmap(scale(data.matrix(sample_data)), Colv = NA)
# Dendrogram (change font size w/ cex)
fviz_dend(hier_model, cex = 1)
fviz_dend(hier_model, horiz = TRUE, cex = 1) # Plot the dendrogram horizontally to read longer labels
DEFINING & PLOTTING CLUSTER ASSIGNMENTS
# Assign each sample case to a cluster
# You specify the number of clusters, k
# We typically want to store this in a new dataset so that the cluster assignments aren't
# accidentally used as features in a later analysis!
cluster_data <- sample_data %>% 
  mutate(hier_cluster_k = as.factor(cutree(hier_model, k = ___)))
# Visualize the clusters on the dendrogram (change font size w/ cex)
fviz_dend(hier_model, k = ___, cex = 1)
fviz_dend(hier_model, k = ___, horiz = TRUE, cex = 1) # Plot the dendrogram horizontally to read longer labels
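INTERPRETING / CONTEXTUALIZING THE CLUSTERS

One common follow-up, sketched here under the assumption that cluster_data was built as above and that all features are quantitative: summarize the original features within each cluster.

# Average feature values by cluster (assumes quantitative features)
cluster_data %>% 
  group_by(hier_cluster_k) %>% 
  summarize(across(everything(), mean))

# Number of cases assigned to each cluster
cluster_data %>% 
  count(hier_cluster_k)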
Wrapping Up
- As usual, take time after class to finish any remaining exercises, check solutions, reflect on key concepts from today, and come to office hours with questions
- Upcoming due dates:
- CP13: before next class (April 17)
- Quiz 2 Revisions: bring to class next Tuesday (April 22)
- HW6: end-of-day next Tuesday (April 22)
Solutions
Small Group Discussion
EXAMPLE 1: Standardizing the features
Solution:
The features are on different scales. It's important to standardize so that one feature doesn't have more influence over the distance measure, simply due to its scale.
EXAMPLE 2: Interpreting a dendrogram
Solution:
- Penguin 33 – penguins 30 and 33 fuse together sooner (at a lower height on the dendrogram) than penguins 30 and 12.
- This clustering uses the complete linkage method, so a fuse height is the maximum distance between any pair of penguins in the two merging clusters. Thus the distance between penguins 30 and 33, based on their standardized bill lengths, bill depths, and body masses, is at most 2.
- That group generally has large bill length and body mass but small bill depth.
EXAMPLE 3: By Hand
Solution:
COMPLETE LINKAGE
penguin_cluster <- hclust(dist(penguins_small), method = "complete")
plot(penguin_cluster)
Step 1
The closest pair are 4 and 5 with a distance of 0.54
|   | 1    | 2    | 3    | 4    | 5 |
|---|------|------|------|------|---|
| 1 | 0    |      |      |      |   |
| 2 | 0.94 | 0    |      |      |   |
| 3 | 2.19 | 2.60 | 0    |      |   |
| 4 | 2.25 | 2.26 | 1.00 | 0    |   |
| 5 | 2.64 | 2.77 | 0.85 | 0.54 | 0 |
Step 2
The distance matrix below gives the current distance between clusters 1, 2, 3, and (4 & 5) using complete linkage (the distance between 2 clusters is the maximum distance between any pair of penguins in the 2 clusters). By this, penguins 1 & 2 are the closest, with a distance of 0.94:
|       | 1    | 2    | 3    | 4 & 5 |
|-------|------|------|------|-------|
| 1     | 0    |      |      |       |
| 2     | 0.94 | 0    |      |       |
| 3     | 2.19 | 2.60 | 0    |       |
| 4 & 5 | 2.64 | 2.77 | 1.00 | 0     |
Step 3
The distance matrix below gives the current distance between clusters (1 & 2), 3, and (4 & 5) using complete linkage. By this, clusters 3 and (4 & 5) are the closest, with a distance of 1:
|       | 1 & 2 | 3    | 4 & 5 |
|-------|-------|------|-------|
| 1 & 2 | 0     |      |       |
| 3     | 2.60  | 0    |       |
| 4 & 5 | 2.77  | 1.00 | 0     |
Step 4
The distance matrix below gives the current distance between our 2 clusters, (1 & 2) and (3 & 4 & 5) using complete linkage. The distance between them is 2.77 (hence the height of the final merge in our dendrogram):
|         | 1 & 2 | 3, 4, 5 |
|---------|-------|---------|
| 1 & 2   | 0     |         |
| 3, 4, 5 | 2.77  | 0       |
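A quick check of the by-hand work: the fuse heights stored by hclust() should match the distances found in Steps 1-4, namely 0.54, 0.94, 1.00, and 2.77.

# Fuse heights from the complete-linkage dendrogram built above
round(penguin_cluster$height, 2)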
SINGLE LINKAGE
<- hclust(dist(penguins_small), method = "single")
penguin_cluster plot(penguin_cluster)
Step 1
The closest pair are 4 and 5 with a distance of 0.54
|   | 1    | 2    | 3    | 4    | 5 |
|---|------|------|------|------|---|
| 1 | 0    |      |      |      |   |
| 2 | 0.94 | 0    |      |      |   |
| 3 | 2.19 | 2.60 | 0    |      |   |
| 4 | 2.25 | 2.26 | 1.00 | 0    |   |
| 5 | 2.64 | 2.77 | 0.85 | 0.54 | 0 |
Step 2
The distance matrix below gives the current distance between clusters 1, 2, 3, and (4 & 5) using single linkage (the distance between 2 clusters is the minimum distance between any pair of penguins in the 2 clusters). By this, penguin 3 is closest to the cluster of penguins 4 & 5, with a distance of 0.85:
|       | 1    | 2    | 3    | 4 & 5 |
|-------|------|------|------|-------|
| 1     | 0    |      |      |       |
| 2     | 0.94 | 0    |      |       |
| 3     | 2.19 | 2.60 | 0    |       |
| 4 & 5 | 2.25 | 2.26 | 0.85 | 0     |
Step 3
The distance matrix below gives the current distance between clusters 1, 2, and (3, 4, 5) using single linkage. By this, clusters 1 and 2 are the closest, with a distance of 0.94:
|         | 1    | 2    | 3, 4, 5 |
|---------|------|------|---------|
| 1       | 0    |      |         |
| 2       | 0.94 | 0    |         |
| 3, 4, 5 | 2.19 | 2.26 | 0       |
Step 4
The distance matrix below gives the current distance between our 2 clusters, (1 & 2) and (3 & 4 & 5) using single linkage. The distance between them is 2.19 (hence the height of the final merge in our dendrogram):
|         | 1 & 2 | 3, 4, 5 |
|---------|-------|---------|
| 1 & 2   | 0     |         |
| 3, 4, 5 | 2.19  | 0       |
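The same check for single linkage: the fuse heights should now be 0.54, 0.85, 0.94, and 2.19.

# Fuse heights from the single-linkage dendrogram built above
round(penguin_cluster$height, 2)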
EXAMPLE 4: Explore Clusters
Solution:
- too many. they’re very specific
- ….
- the clusters continue to merge as we go up the dendrogram. (eg: the 5-cluster solution is nested in the 4-cluster solution)
- yikes! this solution doesn’t capture the more natural looking clusters in the data. this algorithm is greedy – it makes the best decisions at each step, but the results aren’t necessarily globally optimal.
EXAMPLE 5: Details
Solution:
- Yes. See Example 4.
- Clustering combines similar cases based on the full set of features x, with no outcome variable in sight. KNN and classification trees also compare cases, but they measure similarity in terms of the predictors with the goal of producing useful predictions of an outcome y.
Exercises
Solutions will not be provided. These exercises are part of your homework this week.