16 Classification Review
Settling In
- Sit with your assigned group
- Catch up on any announcements you’ve missed on Slack
- Open Group Assignment 2
Announcements
Upcoming Deadlines:
- HW5: Wednesday, Nov 13
- See note on Moodle about LIMITED extension opportunities
- Solutions will be posted by Saturday to help with studying for Quiz 2
- CP12: Thursday, Nov 14
- We will be starting our Unsupervised Learning unit on Thursday!
- This will be covered on Quiz 3, but not Quiz 2
- HW4 Revisions (if applicable): Friday, Nov 15
- Quiz 2: Tuesday, Nov 19
- Focus on classification (Units 4–5), though many concepts carry over
- Structure will be the same as Quiz 1 (Part 1 + Part 2)
- Group Assignment 2: Tuesday, Nov 26
Small Group Discussion
Discuss Questions 1–5 with your group. Then, prepare to share answers, questions, etc. with the rest of the class.
Question 1
Trees vs Forests:
- What is the difference between a tree and a forest? (in ML, not nature)
- How do I predict the outcome of a new observation using a tree? What about a forest?
- What is a potential advantage of using a forest compared to a tree? What is a potential drawback?
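As a starting point for the prediction question, here is a minimal sketch in plain Python (not the tidymodels code used in class). It assumes a classification forest that predicts by majority vote. The three "trees" are hypothetical one-split rules invented for illustration, not fitted models, and `predict_forest` is a made-up helper name.

```python
from collections import Counter

# Hypothetical "trees": each maps an observation to a class label
# using a single made-up split (a real tree would be fit to data).
def tree_a(obs):
    return "yes" if obs["x1"] > 5 else "no"

def tree_b(obs):
    return "yes" if obs["x2"] < 2 else "no"

def tree_c(obs):
    return "yes" if obs["x1"] + obs["x2"] > 8 else "no"

def predict_forest(trees, obs):
    """Classify obs by majority vote over the individual trees' predictions."""
    votes = Counter(tree(obs) for tree in trees)
    return votes.most_common(1)[0][0]

new_obs = {"x1": 6, "x2": 3}
single_tree_prediction = tree_b(new_obs)                        # one tree's answer
forest_prediction = predict_forest([tree_a, tree_b, tree_c], new_obs)
print(single_tree_prediction, forest_prediction)
```

Notice that the forest can disagree with any single tree, and that it requires evaluating every tree for each new observation.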
Question 2
Bagging vs Random Forests:
- How are these techniques similar?
- How do they differ?
Question 3
CV vs OOB:
- What does OOB stand for?
- Why do we typically use OOB metrics instead of CV metrics when evaluating forests?
- Are these in-sample or out-of-sample metrics? Why does it matter?
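To make the OOB idea concrete, the sketch below (plain Python, not the tidymodels code used in class) draws one bootstrap sample of row indices and records which rows were left out. The helper name `bootstrap_and_oob`, the sample size, and the seed are all assumptions chosen for illustration.

```python
import random

def bootstrap_and_oob(n, rng):
    """Draw n row indices with replacement.

    Returns the in-bag indices (used to fit one tree) and the set of
    out-of-bag indices (rows that this tree never saw).
    """
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = set(range(n)) - set(in_bag)
    return in_bag, oob

rng = random.Random(253)   # arbitrary seed, for reproducibility only
n = 1000                   # hypothetical number of rows in the data
in_bag, oob = bootstrap_and_oob(n, rng)

# For large n, roughly (1 - 1/n)^n, or about 37%, of rows end up out-of-bag.
print(len(oob) / n)
```

Because each tree never saw its own out-of-bag rows, those rows can serve as test data for that tree. This is a useful thing to keep in mind when comparing OOB metrics with CV metrics, which require refitting the model on each fold.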
Question 4
Review the learning goals for Units 4 and 5 listed on the course webpage.
Rate your confidence in, or current level of understanding of, each of these items on a scale of 1 to 5, where:
- 1 = I have never heard of this
- 5 = I could correctly answer a question about this right now, without notes
Across your entire group:
- Which topic are you collectively most confident about?
- Least confident?
- Most split (i.e., some confident, some not)?
Question 5
Discuss your study strategies:
- How did you study for Quiz 1?
- Is there anything you found to be particularly helpful?
- Is there anything you are planning to try differently for this next quiz?
Exercises
Part 1: Starting Group Assignment 2
- Decide on what “adventure” you’d like to take for Group Assignment 2.
- Reply in thread to my recent post on the #help-group-assignments channel so I know what you’ve picked.
- Get started on that adventure.
- Get data on your local computers
- Set up communication avenues for asynchronous discussions
- Divide / delegate leadership on tasks
Please spend AT LEAST 20 minutes on these tasks before moving on to Part 2.
Part 2: Review & Reflection
STAT 253 is a survey course of statistical machine learning techniques and concepts. It’s important to continuously reflect on these and how they fit together.
Though you won’t hand anything in, you’re strongly encouraged to complete this activity. This material is designed to help you reflect upon:
- ML concepts
- enduring, big picture concepts
- technical concepts
- tidymodels code
Follow the links below and make a copy of the STAT 253 concept maps (or find and modify the copy you made while reviewing the regression unit).
You’ll be given some relevant prompts below, but you should use these materials in whatever way suits you! Take notes, add more content, rearrange, etc.
STAT 253 concept maps
Review slides 6–7 (classification) of the concept map, and mark up slides 1, 6, and 7 with respect to the prompts below.
Enduring, big picture concepts
IMPORTANT to your learning: Respond in your own words.
- When do we use a supervised vs an unsupervised learning algorithm?
- Within supervised learning, when do we use a regression vs a classification algorithm?
- What is the importance of “model evaluation” and what questions does it address?
- What is “overfitting” and why is it bad?
- What is “cross-validation” and what problem is it trying to address?
- What is the “bias-variance tradeoff”?
Technical concepts
On slide 6, identify some general themes for each model algorithm listed in the left-hand table:
- What’s the goal?
- Is the algorithm parametric or nonparametric?
- Does the algorithm have any tuning parameters? What are they, how do we tune them, and how is this a Goldilocks problem?
- What are the key pros & cons of the algorithm?
For each algorithm, you should also reflect upon these important technical concepts:
- Can you summarize the steps of this algorithm?
- Is the algorithm parametric or nonparametric? (addressed above)
- What is the bias-variance tradeoff when working with or tuning this algorithm?
- Is it important to scale / pre-process our predictors before feeding them into this algorithm?
- Is this algorithm “computationally expensive”?
- Can you interpret the technical (RStudio) output for this algorithm (e.g., CV plots)?
And some details:
- If this algorithm is parametric, could you:
- interpret its coefficients?
- calculate / predict the probability of different y outcomes from these coefficients?
- come up with a classification rule for a given probability cut-off?
- If this algorithm is non-parametric:
- Could you implement the algorithm “by hand” for a small sample of data points?
- If this algorithm is a tree-based method:
- Could you explain the difference between in-sample, OOB, and CV metrics?
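For the parametric prompts above (interpreting coefficients, predicting probabilities, and forming a classification rule), here is a minimal sketch in plain Python rather than tidymodels. It assumes the parametric algorithm is logistic regression with a single predictor. The coefficient values, the predictor value, and the function names are hypothetical, chosen only for illustration.

```python
import math

def predicted_probability(intercept, slope, x):
    """Convert the log-odds (intercept + slope * x) into P(y = 1)."""
    log_odds = intercept + slope * x
    return 1 / (1 + math.exp(-log_odds))

def classify(prob, cutoff=0.5):
    """Classification rule: predict class 1 when prob meets the cutoff."""
    return 1 if prob >= cutoff else 0

# Hypothetical coefficients (not from any course dataset)
b0, b1 = -2.0, 0.5

# Coefficient interpretation: a 1-unit increase in x multiplies
# the odds of y = 1 by exp(b1).
odds_multiplier = math.exp(b1)

p = predicted_probability(b0, b1, x=6)   # log-odds = -2 + 0.5 * 6 = 1
print(round(p, 3), classify(p, cutoff=0.5), classify(p, cutoff=0.8))
```

Trying a few different cutoffs with the same probability is a quick way to see how the classification rule, and therefore sensitivity and specificity, depends on that choice.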
And what about narrowing down to important predictors?
- What tools do we have to give us a sense of important predictors?
- binary outcome?
- multiclass outcome?
Model evaluation
On slide 6, the right-hand table lists some model evaluation metrics for binary classification algorithms. Do the following:
- Define each metric. THINK: Could you calculate these metrics if given a confusion matrix?
- Explain the steps of the CV algorithm.
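To check your definitions against a confusion matrix, the sketch below (plain Python, not tidymodels) computes three common binary classification metrics from the four cell counts. Which metrics appear on the concept map is not stated here, so overall accuracy, sensitivity, and specificity are an assumption, and the counts are made up.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute metrics from confusion matrix counts.

    tp: true positives,  fn: false negatives,
    fp: false positives, tn: true negatives.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,      # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),      # share of actual positives caught
        "specificity": tn / (tn + fp),      # share of actual negatives caught
    }

# Hypothetical counts: 50 actual positives, 50 actual negatives
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=30)
print(metrics)
```

With these made-up counts, 70 of 100 cases are classified correctly, 40 of the 50 actual positives are caught, and 30 of the 50 actual negatives are caught.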
Algorithm comparisons
Use slide 7 to make other observations about the Unit 4–5 modeling algorithms and their connections.
Wrapping Up
- See reminders above and on the course schedule about upcoming deadlines