16 Classification Review

Settling In

Sit with your assigned group
Catch up on any announcements you’ve missed on Slack
Open Group Assignment 2

Announcements

Upcoming Deadlines:

HW5: Wednesday, Nov 13
- See note on Moodle about LIMITED extension opportunities
- Solutions will be posted by Saturday to help with studying for Quiz 2
CP12: Thursday, Nov 14
- We will be starting our Unsupervised Learning unit on Thursday!
- This will be covered on Quiz 3, but not Quiz 2
HW4 Revisions (if applicable): Friday, Nov 15
Quiz 2: Tuesday, Nov 19
- Focus on classification (Units 4–5), though many concepts carry over
- Structure will be the same as Quiz 2 (Part 1 + Part 2)
Group Assignment 2: Tuesday, Nov 26

Small Group Discussion

Discuss Questions 1–5 with your group. Then, prepare to share answers, questions, etc. with the rest of the class.

Question 1

Trees vs Forests:

What is the difference between a tree and a forest? (in ML, not nature)
How do I predict the outcome of a new observation using a tree? What about a forest?
What is a potential advantage of using a forest compared to a tree? What is a potential drawback?

Question 2

Bagging vs Random Forests:

How are these techniques similar?
How do they differ?

Question 3

CV vs OOB:

What does OOB stand for?
Why do we typically use OOB metrics instead of CV metrics when evaluating forests?
Are these in-sample or out-of-sample metrics? Why does it matter?

Question 4

Review the learning goals for Units 4 and 5 listed on the course webpage here.

Rate your confidence about/current level of understanding for each of these items, on a scale of 1 to 5, where:

1 = I have never heard of this
5 = I could correctly answer a question about this right now, without notes

Across your entire group:

Which topic are you collectively most confident about?
Least confident?
Most split (ie some confident, some not)?

Question 5

Discuss your study strategies:

How did you study for Quiz 1?
Is there anything you found to be particularly helpful?
Is there anything you are planning to try differently for this next quiz?

Exercises

Part 1: Starting Group Assignment 2

Decide on what “adventure” you’d like to take for the Group Assignment 2.
Reply in thread to my recent post on the #help-group-assignments channel so I know what you’ve picked.
Get started on that adventure.
- Get data on your local computers
- Set up communication avenues for asynchronous discussions
- Divide / delegate leadership on tasks

Please spend AT LEAST 20 minutes on these tasks before moving on to Part 2.

Part 2: Review & Reflection

STAT 253 is a survey course of statistical machine learning techniques and concepts. It’s important to continuously reflect on these and how they fit together.

Though you won’t hand anything in, you’re strongly encouraged to complete this activity. This material is designed to help you reflect upon:

ML concepts
- enduring, big picture concepts
- technical concepts
- tidymodels code

Follow the links below and make a copy of the STAT 253 concept maps (or find and modify the copy you made while reviewing the regression unit).

You’ll be given some relevant prompts below, but you should use these materials in whatever way suits you! Take notes, add more content, rearrange, etc.

STAT 253 concept maps

Review slides 6–7 (classification) of the concept map, and mark up slides 1, 6, and 7 with respect to the prompts below.

Enduring, big picture concepts

IMPORTANT to your learning: Respond in your own words.

When do we perform a supervised vs unsupervised learning algorithm?
Within supervised learning, when do we use a regression vs a classification algorithm?
What is the importance of “model evaluation” and what questions does it address?
What is “overfitting” and why is it bad?
What is “cross-validation” and what problem is it trying to address?
What is the “bias-variance tradeoff”?

Technical concepts

On page 6, identify some general themes for each model algorithm listed in the lefthand table:

What’s the goal?
Is the algorithm parametric or nonparametric?
Does the algorithm have any tuning parameters? What are they, how do we tune them, and how is this a goldilocks problem?
What are the key pros & cons of the algorithm?

For each algorithm, you should also reflect upon these important technical concepts:

Can you summarize the steps of this algorithm?
Is the algorithm parametric or nonparametric? (addressed above)
What is the bias-variance tradeoff when working with or tuning this algorithm?
Is it important to scale / pre-process our predictors before feeding them into this algorithm?
Is this algorithm “computationally expensive”?
Can you interpret the technical (RStudio) output for this algorithm? (eg: CV plots, etc)?

And some details:

If this algorithm is parametric, could you:
- interpret its coefficients?
- calculate / predict the probability of different y outcomes from these coefficients?
- come up with a classification rule for a given probability cut-off?
If this algorithm is non-parametric:
- Could you implement the alghorithm “by hand” for a small sample of data points?
If this algorithm is a tree-based method:
- Could you explain the difference between in-sample, OOB, and CV metrics?

And what about narrowing down to important predictors?

What tools do we have to give us a sense of important predictors?
- binary outcome?
- multiclass outcome?

Model evaluation

On page 6, the righthand table lists some model evaluation metrics for binary classification algorithms. Do the following:

Define each metric. THINK: Could you calculate these metrics if given a confusion matrix?
Explain the steps of the CV algorithm.

Algorithm comparisons

Use page 7 to make other observations about the Unit 4-5 modeling algorithms and their connections.

Wrapping Up

See reminders above and on the course schedule about upcoming deadlines