16 Classification Review
Most of today’s class time will be devoted to working on Group Assignment 2.
To make the most of our time together, I strongly recommend that you get started on the assignment before today’s class.
Settling In
- Sit with your Group Assignment 2 group
- Catch up on any recent announcements you’ve missed on Slack
- Check out the HW5 Solutions posted on Moodle (individual feedback coming soon!)
Preparing for Quiz 2
Logistics
When: April 8
- first 60 minutes = Quiz 2
- last 30 minutes = GA2 Work Time
Topics:
- Classification (Units 4–5) + Enduring Concepts (Units 0–5)
- concepts and code
- (use the Course Learning Goals as a study guide!)
Format:
- On paper (no computers!)
- Closed notes, except for an instructor-provided R notesheet
- Questions will range in style: multiple choice, fill in the blank, short response, matching, etc.
Questions?
Study Tips
- Create a study guide using course Learning Goals.
- Complete the provided review activities:
- Group Assignment 2
- STAT 253 concept maps (Slides 1, 6, and 7)
- tidymodels code comparison
- Review past checkpoints, in-class exercises, and homework problems.
- This should include carefully reviewing the feedback you’ve gotten on those assignments. Try implementing any suggestions.
- Come to office hours with questions!
Study Resources (Review After Class)
STAT 253 is a survey course of statistical machine learning techniques and concepts. It’s important to continuously reflect on these and how they fit together.
Though you won’t hand anything in, you’re strongly encouraged to complete this activity. This material is designed to help you reflect upon:
- ML concepts
  - enduring, big picture concepts
  - technical concepts
- tidymodels code
Follow the links below and make a copy of the STAT 253 concept maps (or find and modify the copy you made while reviewing the regression unit).
You’ll be given some relevant prompts below, but you should use these materials in whatever way suits you! Take notes, add more content, rearrange, etc.
STAT 253 concept maps
Review slides 6–7 (classification) of the concept map, and mark up slides 1, 6, and 7 with respect to the prompts below.
Enduring, big picture concepts
IMPORTANT to your learning: Respond in your own words.
- When do we perform a supervised vs unsupervised learning algorithm?
- Within supervised learning, when do we use a regression vs a classification algorithm?
- What is the importance of “model evaluation” and what questions does it address?
- What is “overfitting” and why is it bad?
- What is “cross-validation” and what problem is it trying to address? (A code sketch follows this list.)
- What is the “bias-variance tradeoff”?
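As a refresher, here is a minimal tidymodels sketch of the cross-validation idea, assuming a placeholder data frame `my_data` with a binary factor outcome `y` and predictors `x1` and `x2` (swap in your own data and variables):

```r
library(tidymodels)

# Split the cases into 10 folds
set.seed(253)
data_cv <- vfold_cv(my_data, v = 10)

# Specify a logistic regression model
logistic_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Fit the model on each combination of 9 folds and evaluate it on the held-out fold
cv_results <- fit_resamples(
  logistic_spec,
  y ~ x1 + x2,
  resamples = data_cv,
  metrics = metric_set(accuracy, sens, spec)
)

# Averaging the 10 held-out estimates gives a more honest sense of how the model
# will perform on NEW data than in-sample metrics, which reward overfitting
cv_results %>% collect_metrics()
```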
Technical concepts
On slide 6, identify some general themes for each model algorithm listed in the left-hand table:
- What’s the goal?
- Is the algorithm parametric or nonparametric?
- Does the algorithm have any tuning parameters? What are they, how do we tune them, and how is this a Goldilocks problem? (A tuning sketch follows this list.)
- What are the key pros & cons of the algorithm?
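To make the Goldilocks idea concrete, here is a hedged sketch of tuning the number of neighbors K in a KNN classifier; `my_data`, the formula, and the grid are placeholders for illustration only:

```r
library(tidymodels)

# Specify a KNN classifier whose number of neighbors will be tuned
knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Candidate K values: too small = overly flexible (overfit),
# too large = overly rigid (underfit)
knn_grid <- grid_regular(neighbors(range = c(1, 100)), levels = 20)

# Use 10-fold CV to estimate accuracy for each candidate K
set.seed(253)
knn_res <- tune_grid(
  knn_spec,
  y ~ .,
  resamples = vfold_cv(my_data, v = 10),
  grid = knn_grid
)

# Pick the "just right" K
knn_res %>% select_best(metric = "accuracy")
```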
For each algorithm, you should also reflect upon these important technical concepts:
- Can you summarize the steps of this algorithm?
- Is the algorithm parametric or nonparametric? (addressed above)
- What is the bias-variance tradeoff when working with or tuning this algorithm?
- Is it important to scale / pre-process our predictors before feeding them into this algorithm?
- Is this algorithm “computationally expensive”?
- Can you interpret the technical (RStudio) output for this algorithm (e.g., CV plots)? (A pre-processing and CV plot sketch follows this list.)
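On the pre-processing and output questions, here is a hedged sketch that builds on the KNN tuning example above (reusing the placeholder `my_data`, `knn_spec`, and `knn_grid`). Distance-based algorithms like KNN (and penalized algorithms like LASSO) are sensitive to predictor scale, so we typically normalize predictors in a recipe; tree-based methods don't need this step.

```r
# Normalize (center and scale) the quantitative predictors so no single
# predictor dominates KNN's distance calculations
knn_recipe <- recipe(y ~ ., data = my_data) %>%
  step_normalize(all_numeric_predictors())

# Bundle the recipe with the tunable KNN model
knn_workflow <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec)

set.seed(253)
knn_res <- tune_grid(
  knn_workflow,
  resamples = vfold_cv(my_data, v = 10),
  grid = knn_grid
)

# CV plot: estimated metric vs K. Look for where accuracy peaks (or error
# bottoms out) -- the "just right" amount of flexibility
autoplot(knn_res)
```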
And some details:
- If this algorithm is parametric, could you:
- interpret its coefficients?
- calculate / predict the probability of different y outcomes from these coefficients?
- come up with a classification rule for a given probability cut-off? (A logistic regression sketch follows this list.)
- If this algorithm is non-parametric:
- Could you implement the algorithm “by hand” for a small sample of data points?
- If this algorithm is a tree-based method:
- Could you explain the difference between in-sample, OOB, and CV metrics?
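For the parametric bullets above, here is a hedged logistic regression sketch; the data frame `my_data`, the new cases `new_cases`, the variables, the outcome levels "yes"/"no", and the 0.5 cut-off are all placeholders:

```r
library(tidymodels)

# Fit a parametric (logistic regression) model of a binary outcome y
logistic_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(y ~ x1 + x2, data = my_data)

# Coefficients are reported on the log-odds scale; exponentiate to interpret
# them multiplicatively, as odds ratios
logistic_fit %>% tidy() %>% mutate(odds_ratio = exp(estimate))

# Predict the probability of each y outcome for new cases, then apply a
# classification rule with a probability cut-off (here 0.5, assuming the
# outcome levels are "yes" / "no")
logistic_fit %>%
  predict(new_data = new_cases, type = "prob") %>%
  mutate(y_prediction = ifelse(.pred_yes >= 0.5, "yes", "no"))
```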
And what about narrowing down to important predictors?
- What tools do we have to give us a sense of important predictors? (A variable importance sketch follows this list.)
  - for a binary outcome?
  - for a multiclass outcome?
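One common tool is a variable importance measure from a tree-based method. Here is a hedged sketch that assumes the `vip` package and ranger's impurity-based importance (placeholder data again); the same code applies whether `y` is binary or multiclass:

```r
library(tidymodels)
library(vip)    # assumed helper package for variable importance plots

# Fit a random forest, asking the ranger engine to track impurity-based
# variable importance as it builds the trees
rf_fit <- rand_forest(trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification") %>%
  fit(y ~ ., data = my_data)

# Plot the most important predictors
rf_fit$fit %>% vip(num_features = 10)

# Bonus, tying back to the tree-based bullet above: the out-of-bag (OOB) error
# evaluates each tree on the cases it did NOT sample, giving a test-error
# estimate without a separate CV loop; in-sample metrics are overly optimistic
rf_fit$fit$prediction.error
```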
Model evaluation
On slide 6, the right-hand table lists some model evaluation metrics for binary classification algorithms. Do the following:
- Define each metric. THINK: Could you calculate these metrics if given a confusion matrix? (A worked example follows this list.)
- Explain the steps of the CV algorithm.
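To practice, here is a small made-up confusion matrix with the metric calculations worked out in comments, followed by the corresponding yardstick code; the data frame `results` and the positive level "yes" are placeholders:

```r
library(tidymodels)

# Made-up confusion matrix (positive class = "yes"):
#
#                  truth = yes   truth = no
#  predicted yes        40            10        (40 TP, 10 FP)
#  predicted no         20            30        (20 FN, 30 TN)
#
# accuracy              = (40 + 30) / 100 = 0.70
# sensitivity (TP rate) = 40 / (40 + 20)  = 0.667
# specificity (TN rate) = 30 / (30 + 10)  = 0.75

# The same metrics via yardstick, given a data frame `results` with the observed
# outcome y and the predicted class .pred_class for each case
results %>% conf_mat(truth = y, estimate = .pred_class)
results %>% accuracy(truth = y, estimate = .pred_class)
results %>% sens(truth = y, estimate = .pred_class)
results %>% spec(truth = y, estimate = .pred_class)
```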
Algorithm comparisons
Use slide 7 to make other observations about the Units 4–5 modeling algorithms and their connections.
Group Assignment 2
Suggestions from GA1
- Collaboration:
- Each individual needs to submit the Group Feedback Survey!
- As you divide tasks, make sure each individual is involved in all of the following: model building/code, writing, and editing
- All group members should be involved in choosing the final model (discussing the pros/cons of each option, all together, will be useful review for the quiz!)
- If someone takes the lead on a particular model, visualization, or section of the report, have at least one other person review that work. (Review the overall report for flow / cohesion, too!)
- Be specific when you write your Collaborations summary.
- Communication:
- Remember your target audience!
- Data visualizations are a form of communication, so use them effectively: update axis labels, describe each plot in the text, and only include a plot if it’s relevant to your narrative.
- Be concise.
- Content:
- Clearly describe and justify ALL choices! (state / explain / interpret: why you did what you did, and why you didn’t do what you didn’t do)
- Provide evidence to back up your claims.
- Interpret results in context
- Code:
- Use the Appendix for extra code/visualizations
- Write comments so that someone with less familiarity with the data/ML can follow (if you open this document again in one year, will you be able to read your own code?)
- Check the rendered HTML for formatting issues, unnecessary code/output, etc. before submitting
- Use code and syntax we covered in class
In general: review the rubric and instructions (on Moodle) carefully.
Examples
Let’s check out a few example reports…
GA2 Work Time
Check in with each other as humans.
Go around the table and report (individually):
- What progress have you made since your last check-in?
- What insights have you gained about the data or models thus far?
- What questions do you still have about the data, the model(s) you’ve tried, etc?
As a group, decide:
- which predictors you are going to use (and how to justify those choices!)
- what modifications you need to make, if any, to the variables (eg combining categories, changing units)
- how you are going to handle the missing observations (if any)
- which one predictive model you will present in your report as your final/“best” model (and how to justify that choice!)
- how you will divide up the writing (and reviewing/editing!) of each section of the report
- what questions you want to ask your instructor today
Then, get to work! Check in with each other and the instructor as you go.
Before you leave, make sure you have a clear plan for any remaining tasks.
Wrapping Up
Upcoming deadlines:
- Quiz 2: April 8, during class
- GA2: April 10, 11:59 pm (report and feedback survey)
- CP12: April 13, before class