16 Classification Review
Most of today’s class time will be devoted to working on Group Assignment 2.
To make the most of our time together, I strongly recommend that you get started on the assignment before today’s class.
Settling In
- Sit with your Group Assignment 2 group
- Catch up on any recent announcements you’ve missed on Slack
- Check out the HW5 Solutions posted on Moodle (individual feedback coming soon!)
Preparing for Quiz 2
Logistics
When: April 8
- first 60 minutes = Quiz 2
- last 30 minutes = GA2 Work Time
Topics:
- Classification (Units 4–5) + Enduring Concepts (Units 0–5)
- concepts and code
- (use the Course Learning Goals as a study guide!)
Format:
- On paper (no computers!)
- Closed notes, except for an instructor-provided R notesheet
- Questions will range in style: multiple choice, fill in the blank, short response, matching, etc.
Questions?
Study Tips
- Create a study guide using course Learning Goals.
- Complete the provided review activities:
- Group Assignment 2
- STAT 253 concept maps (Slides 1, 6, and 7)
- tidymodels code comparison
- Review past checkpoints, in-class exercises, and homework problems.
- This should include carefully reviewing the feedback you’ve gotten on those assignments. Try implementing any suggestions.
- Come to office hours with questions!
Study Resources (Review After Class)
STAT 253 is a survey course of statistical machine learning techniques and concepts. It’s important to continuously reflect on these and how they fit together.
Though you won’t hand anything in, you’re strongly encouraged to complete this activity. This material is designed to help you reflect upon:
- ML concepts
  - enduring, big picture concepts
  - technical concepts
- tidymodels code
Follow the links below and make a copy of the STAT 253 concept maps (or find and modify the copy you made while reviewing the regression unit).
You’ll be given some relevant prompts below, but you should use these materials in whatever way suits you! Take notes, add more content, rearrange, etc.
STAT 253 concept maps
Review slides 6–7 (classification) of the concept map, and mark up slides 1, 6, and 7 with respect to the prompts below.
Enduring, big picture concepts
IMPORTANT to your learning: Respond in your own words.
- When do we perform a supervised vs unsupervised learning algorithm?
- Within supervised learning, when do we use a regression vs a classification algorithm?
- What is the importance of “model evaluation” and what questions does it address?
- What is “overfitting” and why is it bad?
- What is “cross-validation” and what problem is it trying to address? (A code sketch follows this list.)
- What is the “bias-variance tradeoff”?
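As a refresher, here is a minimal tidymodels sketch of the cross-validation idea, assuming a placeholder data frame `my_data` with a binary factor outcome `y` and predictors `x1` and `x2` (swap in your own data and variables):

```r
library(tidymodels)

# Split the cases into 10 folds
set.seed(253)
data_cv <- vfold_cv(my_data, v = 10)

# Specify a logistic regression model
logistic_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Fit the model on each combination of 9 folds and evaluate it on the held-out fold
cv_results <- fit_resamples(
  logistic_spec,
  y ~ x1 + x2,
  resamples = data_cv,
  metrics = metric_set(accuracy, sens, spec)
)

# Averaging the 10 held-out estimates gives a more honest sense of how the model
# will perform on NEW data than in-sample metrics, which reward overfitting
cv_results %>% collect_metrics()
```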
Technical concepts
On slide 6, identify some general themes for each model algorithm listed in the left-hand table:
- What’s the goal?
- Is the algorithm parametric or nonparametric?
- Does the algorithm have any tuning parameters? What are they, how do we tune them, and how is this a Goldilocks problem? (A tuning sketch follows this list.)
- What are the key pros & cons of the algorithm?
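To make the Goldilocks idea concrete, here is a hedged sketch of tuning the number of neighbors K in a KNN classifier; `my_data`, the formula, and the grid are placeholders for illustration only:

```r
library(tidymodels)

# Specify a KNN classifier whose number of neighbors will be tuned
knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Candidate K values: too small = overly flexible (overfit),
# too large = overly rigid (underfit)
knn_grid <- grid_regular(neighbors(range = c(1, 100)), levels = 20)

# Use 10-fold CV to estimate accuracy for each candidate K
set.seed(253)
knn_res <- tune_grid(
  knn_spec,
  y ~ .,
  resamples = vfold_cv(my_data, v = 10),
  grid = knn_grid
)

# Pick the "just right" K
knn_res %>% select_best(metric = "accuracy")
```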
For each algorithm, you should also reflect upon these important technical concepts:
- Can you summarize the steps of this algorithm?
- Is the algorithm parametric or nonparametric? (addressed above)
- What is the bias-variance tradeoff when working with or tuning this algorithm?
- Is it important to scale / pre-process our predictors before feeding them into this algorithm?
- Is this algorithm “computationally expensive”?
- Can you interpret the technical (RStudio) output for this algorithm (e.g., CV plots)? (A pre-processing and CV plot sketch follows this list.)
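On the pre-processing and output questions, here is a hedged sketch that builds on the KNN tuning example above (reusing the placeholder `my_data`, `knn_spec`, and `knn_grid`). Distance-based algorithms like KNN (and penalized algorithms like LASSO) are sensitive to predictor scale, so we typically normalize predictors in a recipe; tree-based methods don't need this step.

```r
# Normalize (center and scale) the quantitative predictors so no single
# predictor dominates KNN's distance calculations
knn_recipe <- recipe(y ~ ., data = my_data) %>%
  step_normalize(all_numeric_predictors())

# Bundle the recipe with the tunable KNN model
knn_workflow <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec)

set.seed(253)
knn_res <- tune_grid(
  knn_workflow,
  resamples = vfold_cv(my_data, v = 10),
  grid = knn_grid
)

# CV plot: estimated metric vs K. Look for where accuracy peaks (or error
# bottoms out) -- the "just right" amount of flexibility
autoplot(knn_res)
```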
And some details:
- If this algorithm is parametric, could you:
- interpret its coefficients?
- calculate / predict the probability of different y outcomes from these coefficients?
- come up with a classification rule for a given probability cut-off? (A logistic regression sketch follows this list.)
- If this algorithm is non-parametric:
- Could you implement the algorithm “by hand” for a small sample of data points?
- If this algorithm is a tree-based method:
- Could you explain the difference between in-sample, OOB, and CV metrics?
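For the parametric bullets above, here is a hedged logistic regression sketch; the data frame `my_data`, the new cases `new_cases`, the variables, the outcome levels "yes"/"no", and the 0.5 cut-off are all placeholders:

```r
library(tidymodels)

# Fit a parametric (logistic regression) model of a binary outcome y
logistic_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(y ~ x1 + x2, data = my_data)

# Coefficients are reported on the log-odds scale; exponentiate to interpret
# them multiplicatively, as odds ratios
logistic_fit %>% tidy() %>% mutate(odds_ratio = exp(estimate))

# Predict the probability of each y outcome for new cases, then apply a
# classification rule with a probability cut-off (here 0.5, assuming the
# outcome levels are "yes" / "no")
logistic_fit %>%
  predict(new_data = new_cases, type = "prob") %>%
  mutate(y_prediction = ifelse(.pred_yes >= 0.5, "yes", "no"))
```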
And what about narrowing down to important predictors?
- What tools do we have to give us a sense of important predictors? (A variable importance sketch follows this list.)
  - for a binary outcome?
  - for a multiclass outcome?
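One common tool is a variable importance measure from a tree-based method. Here is a hedged sketch that assumes the `vip` package and ranger's impurity-based importance (placeholder data again); the same code applies whether `y` is binary or multiclass:

```r
library(tidymodels)
library(vip)    # assumed helper package for variable importance plots

# Fit a random forest, asking the ranger engine to track impurity-based
# variable importance as it builds the trees
rf_fit <- rand_forest(trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification") %>%
  fit(y ~ ., data = my_data)

# Plot the most important predictors
rf_fit$fit %>% vip(num_features = 10)

# Bonus, tying back to the tree-based bullet above: the out-of-bag (OOB) error
# evaluates each tree on the cases it did NOT sample, giving a test-error
# estimate without a separate CV loop; in-sample metrics are overly optimistic
rf_fit$fit$prediction.error
```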
Model evaluation
On slide 6, the right-hand table lists some model evaluation metrics for binary classification algorithms. Do the following:
- Define each metric. THINK: Could you calculate these metrics if given a confusion matrix? (A worked example follows this list.)
- Explain the steps of the CV algorithm.
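To practice, here is a small made-up confusion matrix with the metric calculations worked out in comments, followed by the corresponding yardstick code; the data frame `results` and the positive level "yes" are placeholders:

```r
library(tidymodels)

# Made-up confusion matrix (positive class = "yes"):
#
#                  truth = yes   truth = no
#  predicted yes        40            10        (40 TP, 10 FP)
#  predicted no         20            30        (20 FN, 30 TN)
#
# accuracy              = (40 + 30) / 100 = 0.70
# sensitivity (TP rate) = 40 / (40 + 20)  = 0.667
# specificity (TN rate) = 30 / (30 + 10)  = 0.75

# The same metrics via yardstick, given a data frame `results` with the observed
# outcome y and the predicted class .pred_class for each case
results %>% conf_mat(truth = y, estimate = .pred_class)
results %>% accuracy(truth = y, estimate = .pred_class)
results %>% sens(truth = y, estimate = .pred_class)
results %>% spec(truth = y, estimate = .pred_class)
```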
Algorithm comparisons
Use slide 7 to make other observations about the Units 4–5 modeling algorithms and their connections.
Group Assignment 2
Suggestions from GA1
- Collaboration:
- Each individual needs to submit the Group Feedback Survey!
- As you divide tasks, make sure each individual is involved in all of the following: model building/code, writing, and editing
- All group members should be involved in choosing the final model (discussing the pros/cons of each option, all together, will be useful review for the quiz!)
- If someone takes the lead on a particular model, visualization, or section of the report, have at least one other person review that work. (Review the overall report for flow / cohesion, too!)
- Be specific when you write your Collaborations summary.
- Communication:
- Remember your target audience!
- Data visualizations are a form of communication, so use them effectively: update axis labels, describe each plot in the text, and only include a plot if it’s relevant to your narrative.
- Be concise.
- Content:
- Clearly describe and justify ALL choices! (state / explain / interpret: why you did what you did, and why you didn’t do what you didn’t do)
- Provide evidence to back up your claims.
- Interpret results in context
- Code:
- Use the Appendix for extra code/visualizations
- Write comments so that someone with less familiarity with the data/ML can follow (if you open this document again in one year, will you be able to read your own code?)
- Check the rendered HTML for formatting issues, unnecessary code/output, etc. before submitting
- Use code and syntax we covered in class
In general: review the rubric and instructions (on Moodle) carefully.
Examples
Let’s check out a few example reports…
GA2 Work Time
Check in with each other as humans.
Go around the table and report (individually):
- What progress have you made since your last check-in?
- What insights have you gained about the data or models thus far?
- What questions do you still have about the data, the model(s) you’ve tried, etc?
As a group, decide:
- which predictors you are going to use (and how to justify those choices!)
- what modifications you need to make, if any, to the variables (eg combining categories, changing units)
- how you are going to handle the missing observations (if any)
- which one predictive model you will present in your report as your final/“best” model (and how to justify that choice!)
- how you will divide up the writing (and reviewing/editing!) of each section of the report
- what questions you want to ask your instructor today
Then, get to work! Check in with each other and the instructor as you go.
Before you leave, make sure you have a clear plan for any remaining tasks.
Wrapping Up
Upcoming deadlines:
- Quiz 2: April 8, during class
- GA2: April 10, 11:59 pm (report and feedback survey)
- CP12: April 13, before class