1 Introductions

Important

Please check the Schedule for a list of tasks that should be completed before class today.

Welcome!

Sit where you want… but note that we’ll move in a few minutes!
Take out a notebook and writing utensil, but don’t unpack completely (see above).
Meet the people around you, and check in with each other.
⚠️ If you haven’t done so already, please fill out the Background Survey!

Note: everything you need for class today is on the course website: https://kegrinde.github.io/STAT253/

Learning Goals

Meet your classmates and instructor
Understand the basic structure of this course
Identify the appropriate task (regression, classification, unsupervised) for a given research question
Develop a foundation for formulating research questions that align with regression, classification, or unsupervised learning tasks

Brief Overview of STAT 253

What’s Machine Learning?

“Machine Learning” was coined back in 1959 by Arthur Samuel, an early contributor to AI.

From Kohavi & Provost (1998): Machine Learning is the exploration & application of algorithms that can learn from existing patterns and make predictions using data.

IMPORTANT: humans are in charge of the “exploration & application”!

From James et al. (2021) [link]: Statistical Learning refers to a vast set of tools for understanding data.

In STAT 253 we will…

Pick up where STAT 155 left off, acquiring tools that can be used to learn from data in greater depth and a wider variety of settings. (STAT 155 is a foundational subset of ML!)
Explore universal ML concepts using tools and software common among statisticians (hence “statistical” machine learning).
Survey a breadth of modern ML tools and algorithms that fall into the workflow below. Part of the cognitive load will be:
- keeping all the tools straight (what they are and when to use them)
- understanding the connections between the tools
- adapting (not memorizing) code to implement each tool
- keeping up with a new topic almost every day
We’ll focus on concepts and applications over mathematical theory. (Come chat with me in office hours if you’re interested in learning more about the theory!)

Meet Your Classmates!

I used a machine learning algorithm, one we’ll learn later this semester, to form groups based on your responses to the pre-course informational survey. BUT it didn’t explain why it picked these groups. For that, we need humans.

Get into your assigned group.
Introduce yourselves in whatever way feels appropriate (ideas: name, pronouns, how you’re feeling at the moment, things you’re looking forward to, best part of break, why you are motivated to take this class).
Try to figure out why the algorithm put you into a group together. (I don’t personally know the answer!)
Prepare to introduce your group to the bigger class:
- Each person will introduce themself
- One person will explain why they think the group was put together

Meet Your Instructor

A few highlights from my answers to the Background Survey…

Preferred name: “Kelsey” or “Professor Grinde”
Pronouns: she/her/hers
Hometown(s) (and walking distance!):
- Plymouth, MN (7 hr walk / 30 min drive from Mac)
- St. Paul, MN (20 min walk from Mac)
Major(s):
- BA in Math @ St. Olaf College (Northfield, MN)
- PhD in Biostatistics @ University of Washington (Seattle, WA)
About You: How have you been spending your time over break? What’s particularly important or meaningful to you? What brings you joy? What do you do when you’re not in class? What is on your mind right now?

Notes: Machine Learning Overview

Types of ML Tasks

Statistical machine learning tools can be classified as follows:

supervised or unsupervised
within supervised learning: regression vs classification
within unsupervised learning: clustering vs dimension reduction

Knowing which of these scenarios your research question falls into is an important first step in identifying which tool to use!
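As a rough sketch of that first step, the decision can be written as a tiny helper that looks only at the output variable, using the definitions given in the subsections that follow. The function name `identify_task` and its string labels are made up for illustration, and it is written in Python only to keep the example self-contained (the course itself uses R).

```python
from typing import Optional


def identify_task(output_type: Optional[str]) -> str:
    """Return the broad ML task implied by the output variable y.

    output_type is "quantitative", "categorical", or None when the
    research question has no output variable at all.
    (Hypothetical helper: the name and labels are illustrative only.)
    """
    if output_type is None:
        # No y: unsupervised learning (clustering or dimension
        # reduction, depending on the goal).
        return "unsupervised"
    if output_type == "quantitative":
        return "regression"
    if output_type == "categorical":
        return "classification"
    raise ValueError(f"unknown output type: {output_type!r}")


# Example: a yes/no outcome is categorical.
task = identify_task("categorical")
```

Note that the helper never looks at the inputs \(x\): the type of the predictors does not change which task you are facing.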

Supervised Learning

We want to model the relationship between some output variable¹ \(y\) and input variables² \(x = (x_1, x_2,..., x_p)\):

\[ \begin{split} y & = f(x) + \varepsilon \\ & = \text{(trend in the relationship) } + \text{ (residual deviation from the trend)} \\ \end{split} \]
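To make the decomposition concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the straight-line trend `f`, the noise level, and the seed are made up, and Python is used only to keep the example self-contained (the course itself uses R). It generates data as trend plus random deviation, then fits a least-squares line to estimate the trend and splits each observed \(y\) into a fitted value and a residual.

```python
import random


def f(x: float) -> float:
    """Hypothetical 'true' trend, chosen only for illustration."""
    return 2.0 + 0.5 * x


random.seed(253)  # fixed seed so the simulation is reproducible

xs = [float(i) for i in range(20)]
# epsilon: random deviation from the trend, centered at 0
eps = [random.gauss(0.0, 1.0) for _ in xs]
ys = [f(x) + e for x, e in zip(xs, eps)]

# Estimate the trend with a simple least-squares line.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Each observation = (estimated trend) + (residual deviation).
fitted = [intercept + slope * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, fitted)]
```

The estimated slope and intercept will be close to, but not exactly, the values used in `f`, because the fit only sees the noisy \(y\) values.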

Types of supervised learning tasks:

regression: \(y\) is quantitative
- example: \(y\) = number of dental caries (cavities); \(x\) = (genetic information at millions of markers, sex, age, age\(^2\), etc.)
- project details
classification: \(y\) is categorical
- example: \(y\) = whether a patient experienced adverse surgery outcomes after undergoing an upper endoscopy (yes, no); \(x\) = (administration of sedation [anesthesia professional, nurse], age, medical comorbidities [e.g. sleep apnea], etc.)
- project details

Unsupervised Learning

We have some input variables \(x = (x_1, x_2,..., x_p)\) but there’s no output variable \(y\). Thus the goal is to use \(x\) to understand and/or modify the structure of our data.

Types of unsupervised learning tasks:

clustering: Identify and examine groups or clusters of data points that are similar with respect to their \(x_i\) values.
- example: \(x\) = (genetic data)
- project details (led by a Mac alum!)
dimension reduction: Turn the original set of \(p\) input variables, which are potentially correlated, into a smaller set of \(k < p\) variables that still preserve most of the information in the originals.
- example: \(x\) = (genetic data)
- project details
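To give a feel for what a clustering algorithm does with only \(x\) values, here is a minimal sketch of k-means, one common clustering method. It is an assumption for illustration, not necessarily the algorithm used in the projects above, and the data points and starting centers are made up. Python is used only to keep the example self-contained (the course itself uses R).

```python
import math


def kmeans(points, centers, n_iter=10):
    """Plain k-means: alternate between assigning each point to its
    nearest center and moving each center to the mean of its points."""
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for each point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])
        centers = new_centers
    return labels, centers


# Made-up data: two visibly separated groups measured on (x1, x2).
points = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9),
          (5.0, 5.1), (5.2, 4.8), (4.9, 5.3)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
# labels is [0, 0, 0, 1, 1, 1]: no y was supplied, only the x values.
```

The algorithm returns group labels but no explanation of what the groups mean. Interpreting them is still a job for humans.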

Supplemental Figure 2A from Barragan et al. (2023) [link] uses *both* clustering and dimension reduction!

Exercises

Instructions

Discuss the following scenarios as a group, talking through your ideas, questions, and reasoning as you go.
Write down your answers, and any insights or questions that come up while working, in your notebook.
- (I’ll give you a Quarto template next time.)
I’ll move around to groups to check in on your progress and see what questions you have.
You can check your answers by clicking the drop-down “Solutions” button.

Questions

Indicate whether each scenario below represents a regression, classification, or clustering task.

How is the number of people that rent bikes on a given day in Washington, D.C. (\(y\)) explained by the temperature (\(x_1\)) and whether or not it’s a weekend (\(x_2\))?

Solution

regression: there’s a quantitative output variable \(y\).

Given the observed bill length (\(x_1\)) and bill depth (\(x_2\)) on a set of penguins, how many different penguin species might there be?

Solution

clustering: there’s no output variable \(y\).

How can we determine whether somebody has a certain infection (\(y\)) based on two different blood sample measurements, Measure A (\(x_1\)) and Measure B (\(x_2\))?

Solution

classification: there’s a categorical output variable \(y\).

Machine learn about each other! Scenario A.
I collected some data on STAT 253 students (you!) and analyzed it using a machine learning algorithm. In your groups: (1) brainstorm what research question is being investigated; (2) determine whether this is a regression, classification, or clustering task; and (3) summarize what the output tells you about your classmates.

Solution

predict someone’s major based on other survey responses
classification (\(y\) = major is categorical)
(answers will vary by semester – what do you learn about the majors represented in this class and the variables that are useful for predicting major?)

Machine learn about each other! Scenario B.
Same directions as for Scenario A: (1) brainstorm what research question is being investigated; (2) determine whether this is a regression, classification, or clustering task; and (3) summarize what the output tells you about your classmates.

Solution

predict walk time to Mac based on photo rating and class year
regression (\(y\) = walk time to Mac is quantitative)
(answers will vary by semester – what do you learn about the relationships between these variables?)

Machine learn about each other! Scenario C.
Same directions as for Scenario A: (1) brainstorm what research question is being investigated; (2) determine whether this is a regression, classification, or clustering task; and (3) summarize what the output tells you about your classmates.

Solution

which students in the class are most similar to which other students?
clustering (no outcome \(y\), interested in structure/similarities across observations)
(answers will vary by semester – to whom are you most similar? how many “natural” groupings of students do you see? etc.)

Use Spotify users’ previous listening behavior to identify groups of similar users.

Solution

clustering

Predict workers’ wages by their years of experience.

Solution

regression (\(y\) = wages)

Predict workers’ wages by their college major.

Solution

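regression (\(y\) = wages is quantitative; a categorical predictor like major doesn’t change the task, which is determined by \(y\))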

Use a customer’s age to predict whether they’ve seen the Barbie movie.

Solution

classification (\(y\) = whether or not they’ve seen the film)

Look for similarities among genetic samples taken from a group of patients.

Solution

clustering (no outcome \(y\))

If you finish all of the questions above, move on to the Scavenger Hunt in the next section.

Wrapping Up

Scavenger Hunt

Working with your table, take a few minutes to find all of the following. Write down the “answers” in your notebook. Then, bookmark any important sites.

course website
syllabus
textbook
STAT 253 Slack
office hour times and locations
assignment deadlines
information on what you need to complete before class each day (including your homework for next class!)
in-class activities
assignment instructions / submission

What’s next?

What to work on after class today:

complete the pre-class tasks for next class (videos/reading/checkpoint)
- find this info on the schedule page
- review the checkpoint instructions & policies on Moodle before you start!
start HW0
- due Friday (at 11:59 pm)
- review the STAT 155 Review resources as needed

join Slack (invite link)
complete the required R and RStudio Setup steps
carefully review the syllabus
- if time allows, we’ll discuss a few highlights now!
- more to come in the next few class sessions

¹ otherwise known as outcome, response, dependent variable
² otherwise known as predictors, features, independent variables