Learning Goals
This class is an introduction to the exciting world of statistical machine learning.
The goals of this course are two-fold:
- to gain a working understanding of various machine learning algorithms, and
- to further develop general skills necessary for statistics and data science.
Specific skills and course topics are listed below. Use this list to guide your synthesis of video and reading material for specific topics, and your learning more generally, throughout the semester. It also serves as a study guide for quizzes and other assessments.
General Skills
Computational Thinking
- Decomposition: Break a task into smaller tasks to be able to explain the process to another person or computer
- Pattern Recognition: Recognize patterns across tasks by noticing similarities and recurring differences
- Abstraction: Represent an idea or process in general terms so that you can apply it to other, similar problems
- Algorithmic Thinking: Develop a step-by-step strategy for solving a problem
Ethical Data Thinking
- Identify ethical issues associated with applications of statistical machine learning in a variety of settings
- Assess and critique the actions of individuals and organizations as they relate to the ethical use of data
Data Communication
- In written and oral formats: Inform and justify the data analysis and modeling process and the resulting conclusions with clear, organized, logical, and compelling details that adapt to the background, values, and motivations of the audience and the context in which communication occurs.
Collaborative Learning
- Understand and demonstrate characteristics of effective collaboration (team roles, interpersonal communication, self-reflection, awareness of social dynamics, advocating for yourself and others).
- Develop a common purpose and agreement on goals.
- Be able to contribute questions or concerns in a respectful way.
- Share and contribute to the group’s learning in an equitable manner.
Reflection
- Regularly reflect on your learning to make note of and celebrate your progress, identify opportunities for continued growth, and set goals
Overarching Themes
For each machine learning method that we learn, the 6 themes below will recur. We recommend setting up a note system that allows you to compare these themes across methods (e.g., a table with a row for each of the 6 themes and a column for each method).
Algorithmic understanding: Students can write a pseudocode paragraph that explains how the algorithm works, including how quantitative evaluation metrics are used at different parts of the algorithm and when the algorithm terminates.
Bias-variance tradeoff: Students can explain what happens to the fitted model in terms of low/high bias and low/high variance when key tuning parameter(s) of the method are low vs. high. Their explanation also shows an understanding of what bias and variance mean and how bias and variance relate to overall test error.
Interpretation of output: Students can explain how outputs from model building can be used to understand relationships between variables and the usefulness of different variables in the predictive ability of the model. They can also explain any caveats that are important to keep in mind when interpreting output.
Scaling of variables: Students can explain why it is or is not necessary to standardize predictor variables for this method by indicating which part of the algorithm would be affected by variable scaling and how quantitative evaluation metrics are affected by scaling.
Computational time: Students can explain how tuning parameters could affect the computation run time of the method and why this happens based on the algorithm.
Parametric/nonparametric: Students can explain where on the parametric-nonparametric spectrum this method falls by discussing how parameters show up in the model description.
Course Topics
Unit 0
Introduction to Statistical Machine Learning
- Formulate research questions that align with regression, classification, or unsupervised learning tasks.
- Identify the appropriate task (regression, classification, unsupervised) for a given research question.
Unit 1
Evaluating Regression Models
- Create and interpret residuals vs. fitted values and residuals vs. predictor plots to identify improvements in modeling and to address ethical concerns.
- Calculate and interpret MSE, RMSE, MAE, and R-squared in a contextually meaningful way.
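The evaluation metrics above can all be computed by hand from residuals; here is a minimal Python sketch (the toy data are ours, purely illustrative):

```python
import numpy as np

y = np.array([3., 5., 7., 9.])        # observed outcomes (toy data)
pred = np.array([2.5, 5.0, 7.5, 8.0])  # model predictions
resid = y - pred

mse = np.mean(resid ** 2)              # mean squared error
rmse = np.sqrt(mse)                    # root MSE: same units as y
mae = np.mean(np.abs(resid))           # mean absolute error
# R-squared: fraction of outcome variation explained by the model
r_squared = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
```

Note that RMSE and MAE are on the scale of the outcome, which is what makes their contextual interpretation natural.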
Overfitting and Cross-Validation
- Explain why training/in-sample model evaluation metrics can provide a misleading view of true test/out-of-sample performance
- Accurately describe all steps of cross-validation to estimate the test/out-of-sample version of a model evaluation metric
- Explain what role CV has in a predictive modeling analysis and its connection to overfitting
- Explain the pros/cons of higher vs. lower k in k-fold CV in terms of sample size and computing time
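The cross-validation steps above can be sketched in code. This is a minimal Python illustration, not course code; the `fit`/`predict` function arguments and the toy mean-only model are our own placeholders:

```python
import numpy as np

def kfold_cv_mse(x, y, fit, predict, k=5, seed=0):
    """Estimate test MSE by k-fold CV: fit on k-1 folds, evaluate on the held-out fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # shuffle once, then split into k folds
    folds = np.array_split(idx, k)
    fold_mses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])      # fit using only the other k-1 folds
        pred = predict(model, x[test])       # predict on data the model never saw
        fold_mses.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(fold_mses))         # average held-out MSE across folds

# usage: a toy "model" that always predicts the training mean
x = np.arange(20, dtype=float)
y = 2 * x + 1
cv_mse = kfold_cv_mse(x, y,
                      fit=lambda xt, yt: yt.mean(),
                      predict=lambda m, xs: np.full(len(xs), m))
```

Because every evaluation happens on a held-out fold, the averaged metric estimates out-of-sample rather than in-sample performance.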
Unit 2
Variable Selection
- Explain the difference between inferential models and predictive models and how the model building processes differ
- Clearly describe the backward stepwise selection algorithm and why it is an example of a greedy algorithm
- Compare best subset and stepwise algorithms in terms of optimality of output and computational time
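To make the greediness of backward stepwise selection concrete, here is a minimal Python sketch (function names and toy data are ours): at each step it considers only single-predictor deletions and keeps the one that hurts RSS least, never revisiting earlier choices.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of OLS on the given predictor columns (plus intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X[:, cols]]) if cols else np.ones((len(y), 1))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

def backward_stepwise(X, y):
    """Start with all predictors; repeatedly drop the one whose removal increases RSS least."""
    cols = list(range(X.shape[1]))
    path = [cols.copy()]
    while cols:
        # greedy step: evaluate each single deletion, commit to the best one
        drop = min(cols, key=lambda j: rss(X, y, [c for c in cols if c != j]))
        cols.remove(drop)
        path.append(cols.copy())
    return path   # nested sequence of models of size p, p-1, ..., 0

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=100)   # only predictor 0 truly matters
path = backward_stepwise(X, y)                  # predictor 0 should survive longest
```

Best subset would instead fit all 2^p models; this sketch fits only O(p^2), which is the computational-time tradeoff the bullet above refers to.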
LASSO (shrinkage/regularization)
- Explain how ordinary and penalized least squares are similar and different with regard to (1) the form of the objective function (i.e., the function we are trying to minimize) and (2) the goal of variable selection
- Explain how the lambda tuning parameter affects model performance and how this is related to overfitting
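The contrast between the ordinary and penalized objective functions can be written directly in code; a minimal Python sketch (function names are ours):

```python
import numpy as np

def ols_objective(beta, X, y):
    """Ordinary least squares: minimize the residual sum of squares only."""
    return np.sum((y - X @ beta) ** 2)

def lasso_objective(beta, X, y, lam):
    """Penalized least squares: RSS plus lambda times the sum of |beta_j|."""
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

X = np.array([[1., 2.], [2., 1.], [3., 3.]])
y = np.array([1., 2., 3.])
beta = np.array([0.5, -0.5])

base = ols_objective(beta, X, y)
penalized = lasso_objective(beta, X, y, lam=10.0)   # larger lambda -> larger penalty
```

With lambda = 0 the two objectives coincide; as lambda grows, nonzero coefficients become more costly, which is what pushes some coefficients exactly to zero (variable selection) and counteracts overfitting.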
Unit 3
KNN Regression and the Bias-Variance Tradeoff
- Clearly describe / implement by hand the KNN algorithm for making a regression prediction
- Explain how the number of neighbors relates to the bias-variance tradeoff
- Explain the difference between parametric and nonparametric methods
- Explain how the curse of dimensionality relates to the performance of KNN
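The KNN regression prediction described above is short enough to implement by hand; a minimal one-predictor Python sketch (toy data ours):

```python
import numpy as np

def knn_regress(x_train, y_train, x_new, k):
    """Predict at x_new as the average response of the k nearest training points."""
    dist = np.abs(x_train - x_new)      # distance from x_new to every training point
    nearest = np.argsort(dist)[:k]      # indices of the k nearest neighbors
    return y_train[nearest].mean()      # average their responses

x = np.array([1., 2., 3., 10.])
y = np.array([1., 2., 3., 10.])
pred = knn_regress(x, y, x_new=2.5, k=2)   # neighbors are x = 2 and x = 3
```

Small k averages over few, very local points (low bias, high variance); large k averages over many, more distant points (higher bias, lower variance).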
Local Regression and Splines
- Clearly describe the local regression algorithm for making a prediction
- Explain how the bandwidth (span) relates to the bias-variance tradeoff
- Explain the advantages of splines over global transformations (e.g., y ~ poly(x, 2)) and other types of piecewise polynomials
- Explain how splines are constructed by drawing connections to variable transformations and least squares
- Explain how the number of knots relates to the bias-variance tradeoff
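The connection between splines, variable transformations, and least squares can be seen in a minimal Python sketch of a piecewise-linear spline built from a truncated (hinge) basis; the function name, knot placement, and toy data are ours:

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Design matrix [1, x, (x - k)_+ for each knot]: just transformed variables."""
    cols = [np.ones_like(x), x]
    for k in knots:
        cols.append(np.maximum(x - k, 0.0))   # hinge feature "turns on" past the knot
    return np.column_stack(cols)

# fitting the spline is ordinary least squares on the transformed variables
x = np.linspace(0, 10, 50)
y = np.where(x < 5, x, 10 - x)                # true slope changes at x = 5
X = linear_spline_basis(x, knots=[5.0])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # recovers intercept 0, slope 1, slope change -2
```

Each knot adds one basis column, so more knots mean a more flexible (lower bias, higher variance) fit, while the pieces still join continuously, unlike unconstrained piecewise polynomials.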
Unit 4
Classification via Logistic regression
- Use a logistic regression model to make hard (class) and soft (probability) predictions
- Interpret non-intercept coefficients from logistic regression models in the data context
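The hard vs. soft prediction distinction can be sketched with a hypothetical already-fitted model (the coefficients below are made up for illustration):

```python
import numpy as np

# hypothetical fitted logistic model: log-odds = -1 + 2*x
beta0, beta1 = -1.0, 2.0

def soft_predict(x):
    """Soft prediction: the model's predicted probability of class 1."""
    return 1 / (1 + np.exp(-(beta0 + beta1 * x)))

def hard_predict(x, threshold=0.5):
    """Hard prediction: class 1 if the predicted probability reaches the threshold."""
    return (soft_predict(x) >= threshold).astype(int)

x = np.array([0.0, 0.5, 2.0])
probs = soft_predict(x)          # probabilities
classes = hard_predict(x)        # 0/1 class labels
odds_ratio = np.exp(beta1)       # multiplicative change in odds per 1-unit increase in x
```

The exponentiated non-intercept coefficient is the usual interpretation device: here each 1-unit increase in x multiplies the odds of class 1 by exp(2), holding other predictors fixed.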
Evaluating Classification Models
- Calculate (by hand from confusion matrices) and contextually interpret overall accuracy, sensitivity, and specificity
- Construct and interpret plots of predicted probabilities across classes
- Explain how a ROC curve is constructed and the rationale behind AUC as an evaluation metric
- Appropriately use and interpret the no-information rate to evaluate accuracy metrics
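The by-hand confusion-matrix calculations look like this in Python (the counts are hypothetical):

```python
# hypothetical confusion matrix counts
tp, fn = 30, 10   # actual positives: predicted positive / predicted negative
fp, tn = 5, 55    # actual negatives: predicted positive / predicted negative
n = tp + fn + fp + tn

accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)     # of the actual positives, fraction correctly caught
specificity = tn / (tn + fp)     # of the actual negatives, fraction correctly caught
# no-information rate: accuracy of always guessing the majority class
nir = max(tp + fn, fp + tn) / n
```

Comparing accuracy (0.85 here) to the no-information rate (0.60 here) guards against being impressed by a model that merely tracks class imbalance.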
Unit 5
KNN Classification
- Clearly describe / implement by hand the KNN algorithm for making a classification prediction
- Interpret a KNN classification region plot
- Discuss the pros and cons of KNN classification relative to other classification tools (e.g., logistic regression, decision trees)
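KNN classification differs from KNN regression only in the final step: a majority vote among the neighbors instead of an average. A minimal Python sketch (toy data ours):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k):
    """Predict the most common class among the k nearest training points."""
    dist = np.linalg.norm(X_train - x_new, axis=1)      # Euclidean distances
    nearest = np.argsort(dist)[:k]                      # k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y = np.array(["a", "a", "b", "b"])
label = knn_classify(X, y, np.array([4.5, 5.2]), k=3)   # 2 of 3 neighbors are class "b"
```

The voted class boundaries are what a KNN classification region plot displays.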
Decision Trees
- Clearly describe the recursive binary splitting algorithm for tree building for both regression and classification
- Compute the weighted average Gini index to measure the quality of a classification tree split
- Compute the sum of squared residuals to measure the quality of a regression tree split
- Explain how recursive binary splitting is a greedy algorithm
- Explain how different tree parameters relate to the bias-variance tradeoff
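The weighted-average Gini computation for evaluating a candidate split can be done by hand; a minimal Python sketch (function names and counts are ours):

```python
def gini(counts):
    """Gini index of a node: 1 - sum_k p_k^2 (0 means the node is pure)."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def split_gini(left_counts, right_counts):
    """Quality of a split: child Gini indices weighted by child sizes."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

# candidate split: 8 of class A go left; 2 of class A and 8 of class B go right
score = split_gini([8, 0], [2, 8])   # lower is better; recursive binary
                                     # splitting greedily picks the lowest-score split
```

For regression trees, the same structure applies with the sum of squared residuals around each child's mean in place of the Gini index.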
Bagging and Random Forests
- Explain the rationale for bagging
- Explain the rationale for selecting a random subset of predictors at each split (random forests)
- Explain how the size of the random subset of predictors at each split relates to the bias-variance tradeoff
- Explain the rationale for and implement out-of-bag error estimation for both regression and classification
- Explain the rationale behind the random forest variable importance measure and why it is biased towards quantitative predictors (in class)
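Out-of-bag error estimation works because each bootstrap sample leaves out roughly a third of the observations, so those observations serve as a built-in test set for that tree. A small numpy check of this fact (not course code):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
boot = rng.integers(0, n, size=n)      # bootstrap sample: n indices drawn with replacement
in_bag = np.unique(boot)               # observations that appear at least once
oob_fraction = 1 - len(in_bag) / n     # observations never drawn are "out-of-bag"
# theory: P(never drawn) = (1 - 1/n)^n, which approaches e^{-1} ≈ 0.368
```

Averaging each observation's prediction over only the trees for which it was out-of-bag yields an honest test-error estimate without a separate validation set.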
Unit 6
Hierarchical Clustering
- Clearly describe / implement by hand the hierarchical clustering algorithm
- Compare and contrast k-means and hierarchical clustering in their outputs and algorithms
- Interpret cuts of the dendrogram for single and complete linkage
- Describe the rationale for how clustering algorithms work in terms of within-cluster variation
- Describe the tradeoff of more vs. fewer clusters in terms of interpretability
- Implement strategies for interpreting / contextualizing the clusters
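A naive by-hand version of the agglomerative algorithm, showing the single vs. complete linkage choice, in Python (function name and toy 1-D data are ours; real dendrograms come from library implementations):

```python
def hierarchical_cluster(points, n_clusters, linkage="complete"):
    """Naive agglomerative clustering of 1-D points."""
    clusters = [[p] for p in points]          # start: every point is its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = [abs(a - b) for a in clusters[i] for b in clusters[j]]
                # complete linkage: farthest pair; single linkage: closest pair
                dist = max(d) if linkage == "complete" else min(d)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the two closest clusters
        del clusters[j]
    return clusters

clusters = hierarchical_cluster([0.0, 0.2, 5.0, 5.1], n_clusters=2)
```

Stopping the merging at a chosen number of clusters corresponds to cutting the dendrogram at a chosen height.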
K-Means Clustering
- Clearly describe / implement by hand the k-means algorithm
- Describe the rationale for how clustering algorithms work in terms of within-cluster variation
- Describe the tradeoff of more vs. fewer clusters in terms of interpretability
- Implement strategies for interpreting / contextualizing the clusters
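The k-means algorithm alternates two steps until assignments stabilize; a minimal Python sketch of Lloyd's algorithm (toy data and function name ours):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Alternate (1) assign points to nearest center, (2) move centers to cluster means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initialize at k data points
    for _ in range(n_iter):
        # step 1: assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 2: move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])  # two obvious blobs
labels, centers = kmeans(X, k=2)
```

Each step can only decrease the total within-cluster variation, which is the rationale-by-objective shared with hierarchical clustering.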
Unit 7
Principal Component Analysis
- Explain the goal of dimension reduction and how this can be useful in a supervised learning setting
- Interpret and use the information provided by principal component loadings and scores
- Interpret and use a scree plot to guide dimension reduction
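Loadings, scores, and scree-plot heights can all be obtained from an SVD of the centered data; a minimal Python sketch (function name and simulated data ours):

```python
import numpy as np

def pca(X):
    """PCA via SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)                  # center each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt.T                          # columns: variable weights defining each PC
    scores = Xc @ loadings                   # coordinates of each case on the PCs
    var_explained = S**2 / np.sum(S**2)      # proportion of variance per PC (scree plot)
    return loadings, scores, var_explained

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)         # two highly correlated variables
loadings, scores, var_explained = pca(np.column_stack([x1, x2]))
# nearly all variance lands on PC1, so one dimension summarizes both variables
```

A scree plot is simply `var_explained` drawn against component number; a sharp drop-off suggests how many components to keep.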
Principal Component Regression
- Clearly describe / implement the principal component regression algorithm
- Describe the tradeoff of choice of principal components (\(k\)) in terms of the bias-variance tradeoff
- Implement strategies for choosing \(k\)
- Discuss the pros and cons of principal component regression relative to variable selection and LASSO
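Principal component regression chains the two ideas above: compute PC scores, then run ordinary least squares on the first \(k\) of them. A minimal Python sketch (function name and simulated data ours):

```python
import numpy as np

def pcr_fit(X, y, k):
    """PCR: regress y on the scores of the first k principal components of X."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                       # loadings of the first k components
    Z = Xc @ V_k                         # scores: k new, uncorrelated predictors
    Zd = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(Zd, y, rcond=None)   # OLS on the scores

    def predict(X_new):
        Z_new = (X_new - x_mean) @ V_k   # project new cases onto the same components
        return beta[0] + Z_new @ beta[1:]
    return predict

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1., 0.5, 0., 0., 0.]) + 0.1 * rng.normal(size=100)
predict = pcr_fit(X, y, k=2)             # k is the tuning parameter
preds = predict(X)
```

Small \(k\) discards directions of variation (more bias, less variance); \(k\) equal to the number of predictors reproduces ordinary least squares. Unlike variable selection or LASSO, every original predictor contributes to each component, which is PCR's main interpretability cost.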