17 Classification Review
Settling In
- Sit with your assigned group
- Catch up on any recent messages you’ve missed on Slack
- Open Group Assignment 2
Announcements
Upcoming Deadlines:
- HW5: end-of-day Friday, April 4
- Group Assignment 2: end-of-day Wednesday, April 9
- Quiz 2: in class Thursday, April 10
- Learning Reflection 2: end-of-day Tuesday, April 15
Small Group Discussion
Discuss Questions 1–5 with your group. Then, prepare to share answers, questions, etc. with the rest of the class.
Question 1
Trees vs Forests:
- What is the difference between a tree and a forest? (in ML, not nature)
- How do I predict the outcome of a new observation using a tree? What about a forest?
- What is a potential advantage of using a forest compared to a tree? What is a potential drawback?
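As a hint for the prediction question above, here is a minimal base-R sketch (the tree labels are invented for illustration): a single tree predicts from its own leaf, while a forest aggregates many trees' predictions by majority vote.

```r
# Hypothetical predictions for ONE new observation from 5 trees in a forest
tree_preds <- c("spam", "ham", "spam", "spam", "ham")

# A single tree would just report its own prediction;
# a forest takes the majority vote across all of its trees
counts <- table(tree_preds)
forest_pred <- names(counts)[which.max(counts)]
forest_pred  # "spam" (3 of 5 trees vote spam)
```

Averaging over many (de-correlated) trees is also a clue for the advantage/drawback question: lower variance, but less interpretability.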
Question 2
Bagging vs Random Forests:
- How are these techniques similar?
- How do they differ?
Question 3
CV vs OOB:
- What does OOB stand for?
- Why do we typically use OOB metrics instead of CV metrics when evaluating forests?
- Are these in-sample or out-of-sample metrics? Why does it matter?
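A small base-R simulation (seed and sample size are arbitrary) hints at why OOB evaluation works: each tree is fit to a bootstrap sample, which leaves out roughly a third of the observations, and those left-out observations serve as a built-in test set for that tree.

```r
# Each tree sees a bootstrap sample: n draws from n observations, WITH replacement.
# The chance a given observation is excluded is (1 - 1/n)^n, which approaches
# exp(-1) ~ 0.368 -- so ~36.8% of observations are "out of bag" for each tree.
set.seed(253)
n <- 10000
boot <- sample(1:n, size = n, replace = TRUE)
oob_fraction <- mean(!(1:n %in% boot))
oob_fraction  # close to exp(-1) ~ 0.368
```

Because these out-of-bag observations were not used to fit the tree, OOB metrics are out-of-sample, like CV metrics, but come for free without refitting the forest multiple times.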
Question 4
Review the learning goals for Units 4 and 5 listed on the course webpage.
Rate your confidence in (i.e., your current level of understanding of) each of these items, on a scale of 1 to 5, where:
- 1 = I have never heard of this
- 5 = I could correctly answer a question about this right now, without notes
Across your entire group:
- Which topic are you collectively most confident about?
- Least confident?
- Most split (i.e., some confident, some not)?
Question 5
Discuss your study strategies:
- How did you study for Quiz 1?
- Is there anything you found to be particularly helpful?
- Is there anything you are planning to try differently for this next quiz?
Notes: Preparing for Quiz 2
Logistics
Content:
- Classification (Units 4–5)
- Questions will range in style: multiple choice, fill in the blank, short response, matching, etc.
Part 1:
- focused on concepts
- on paper
- closed people, closed notes
Part 2:
- focused on code
- on paper
- closed people, instructor-provided notesheet
Study Tips
- Complete the provided review activities:
- Group Assignment 2 (starting in class today!)
- Concept Maps (see below)
- tidymodels Code Comparison (see below)
- Create a study guide using the “Learning Goals” page on the course website
- Review past checkpoints, in-class exercises, and homework problems (and try quizzing yourself!)
- For Part 2:
- Focus on patterns in code, functions we’ve seen many times, etc.
- Resources: tidymodels code comparison, R Notes in the course manual, HW3 Exercise 4 (try replicating it with the new classification tools), …
- Come to office hours with questions!
Study Resources (Review After Class)
STAT 253 is a survey course of statistical machine learning techniques and concepts. It’s important to continuously reflect on these and how they fit together.
Though you won’t hand anything in, you’re strongly encouraged to complete this activity. This material is designed to help you reflect upon:
- ML concepts
- enduring, big picture concepts
- technical concepts
- tidymodels code
Follow the links below and make a copy of the STAT 253 concept maps (or find and modify the copy you made while reviewing the regression unit).
You’ll be given some relevant prompts below, but you should use these materials in whatever way suits you! Take notes, add more content, rearrange, etc.
STAT 253 concept maps
Review slides 6–7 (classification) of the concept map, and mark up slides 1, 6, and 7 with respect to the prompts below.
Enduring, big picture concepts
IMPORTANT to your learning: Respond in your own words.
- When do we perform a supervised vs unsupervised learning algorithm?
- Within supervised learning, when do we use a regression vs a classification algorithm?
- What is the importance of “model evaluation” and what questions does it address?
- What is “overfitting” and why is it bad?
- What is “cross-validation” and what problem is it trying to address?
- What is the “bias-variance tradeoff”?
Technical concepts
On page 6, identify some general themes for each model algorithm listed in the left-hand table:
- What’s the goal?
- Is the algorithm parametric or nonparametric?
- Does the algorithm have any tuning parameters? What are they, how do we tune them, and how is this a Goldilocks problem?
- What are the key pros & cons of the algorithm?
For each algorithm, you should also reflect upon these important technical concepts:
- Can you summarize the steps of this algorithm?
- Is the algorithm parametric or nonparametric? (addressed above)
- What is the bias-variance tradeoff when working with or tuning this algorithm?
- Is it important to scale / pre-process our predictors before feeding them into this algorithm?
- Is this algorithm “computationally expensive”?
- Can you interpret the technical (RStudio) output for this algorithm (e.g., CV plots)?
And some details:
- If this algorithm is parametric, could you:
- interpret its coefficients?
- calculate / predict the probability of different y outcomes from these coefficients?
- come up with a classification rule for a given probability cut-off?
- If this algorithm is non-parametric:
- Could you implement the algorithm “by hand” for a small sample of data points?
- If this algorithm is a tree-based method:
- Could you explain the difference between in-sample, OOB, and CV metrics?
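For the parametric prompts above, a minimal base-R sketch with made-up logistic regression coefficients shows the path from coefficients, to a predicted probability, to a classification rule:

```r
# Hypothetical logistic regression coefficients (intercept + one predictor)
b0 <- -1.5
b1 <- 0.8
x_new <- 2  # predictor value for a new observation

# Predicted probability of y = 1 via the inverse logit: exp(z) / (1 + exp(z))
p_hat <- plogis(b0 + b1 * x_new)

# Classification rule for a 0.5 probability cut-off
y_pred <- ifelse(p_hat >= 0.5, 1, 0)
```

Try recomputing `y_pred` with a different cut-off (say 0.7) to see how the classification rule, but not `p_hat`, changes.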
And what about narrowing down to important predictors?
- What tools do we have to give us a sense of important predictors?
- binary outcome?
- multiclass outcome?
Model evaluation
On page 6, the right-hand table lists some model evaluation metrics for binary classification algorithms. Do the following:
- Define each metric. THINK: Could you calculate these metrics if given a confusion matrix?
- Explain the steps of the CV algorithm.
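As a self-check for the first bullet, here is a base-R sketch computing some common metrics from a hypothetical confusion matrix (all counts are invented):

```r
# Hypothetical confusion matrix counts for a binary classifier
tp <- 40  # predicted yes, actually yes
fp <- 10  # predicted yes, actually no
fn <- 5   # predicted no,  actually yes
tn <- 45  # predicted no,  actually no

accuracy    <- (tp + tn) / (tp + fp + fn + tn)  # overall fraction correct
sensitivity <- tp / (tp + fn)                   # true positive rate
specificity <- tn / (tn + fp)                   # true negative rate
```

If you can reproduce numbers like these by hand from a confusion matrix, you are in good shape for this part of the quiz.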
Algorithm comparisons
Use page 7 to make other observations about the Units 4–5 modeling algorithms and their connections.
Exercises
Use the rest of class time to work on Group Assignment 2!
Suggestions from GA1
- Collaboration:
- Make sure each group member is involved in both model building/code and writing
- All group members should be involved in choosing the final model (discussing pros/cons of each option, all together, will be useful review for quiz!)
- If someone takes the lead on a particular model, visualization, or section of the report, have at least one other person review that work. (Review the overall report for flow / cohesion, too!)
- Please be specific when you write your Collaborations summary!
- Communication:
- Consider your target audience!
- Data visualizations are a form of communication – use them effectively. Update axis labels. Describe in text. Only include if relevant to narrative. Etc.
- Be concise.
- Content:
- Clearly describe and justify all choices! (state / explain / interpret; why did you do what you did and why didn’t you do what you didn’t do)
- Provide evidence to back up your claims.
- Interpret results in context
- Code:
- Use Appendix for extra code/viz
- Write comments so that someone with less familiarity with the data/ML can follow (if you open this document again in one year, will you be able to read your own code?)
- Check HTML for formatting issues, unnecessary code/output, etc. before submitting
tl;dr: review the rubric and instructions (link on Moodle) carefully.
GA2 Work Time
- Pick a dataset
- Get data on your local computers
- Start exploring the data
- Visualizations of outcome vs predictors (take notes as you go!)
- Any data cleaning needed? (remove variables, modify variables, create variables, remove observations, handle missing observations)
- Before you leave class:
- Make a plan: how to decide which predictors to use, how many and which models to try, how to evaluate each model
- Set up communication avenues for asynchronous discussions
- Divide / delegate leadership on tasks
Wrapping Up
- See reminders above and on the course schedule about upcoming deadlines