Resampling Methods
Statistical Learning - Chapter 5
Peter (邱飞)
Zhejiang Wanli University
Introduction
Welcome to Chapter 5: Resampling Methods! 🎉
This chapter introduces powerful statistical techniques that help us better understand and utilize our models. We will learn two key resampling methods: Cross-Validation and Bootstrap. These methods are essential tools in the data scientist’s toolkit!
Introduction: What Will Be Covered?
Introduction: Roadmap
The roadmap for this chapter covers two families of resampling techniques:
- Cross-Validation: Used for model assessment and selection. Techniques include the validation set approach, leave-one-out cross-validation (LOOCV), and k-fold cross-validation.
- Bootstrap: A method for quantifying uncertainty in our estimates.
What are Resampling Methods?
- Resampling methods are indispensable tools in modern statistics. 💯
- They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample.
- Imagine baking a cake: by varying the amount of each ingredient, you can produce many versions of the cake. Resampling works the same way: you build many “cakes” (models) from repeated samples of the same ingredients (data).
Resampling Methods: Detailed Explanation
- This allows us to obtain additional information about the fitted model.
- Example: Consider linear regression. Resampling lets us examine the variability of the coefficient estimates. How much do the slope and intercept change with different samples?
- This information is usually difficult to get with a single model fit on the original training sample alone. We only have one set of estimates!
- Resampling can be computationally expensive.
- We refit the same statistical method multiple times with different data subsets.
- However, with modern computing power, this is usually not a major obstacle. 💪
Key Concepts
- Model Assessment: Evaluating a model’s performance.
- Crucially, we want to estimate the test error. This tells us how well the model will generalize to unseen data.
- Model Selection: Choosing the best level of flexibility for a model.
- Example: For polynomial regression, what degree of polynomial should we use? For k-NN, how many neighbors?
- Bootstrap: A method to measure the accuracy (or uncertainty) of a parameter estimate or a statistical learning method.
- The Bootstrap helps us quantify how much our estimates might vary if we had different samples.
Two Main Resampling Methods
We’ll cover two commonly used resampling methods:
- Cross-Validation:
- Primarily for estimating test error. Essential for both model assessment (how good is a model?) and model selection (which model is best?).
- It gives us an idea of how well our model will perform on new, unseen data.
- Bootstrap:
- Primarily for quantifying uncertainty. Example: calculating standard errors of coefficient estimates.
- Useful when standard software doesn’t give us the uncertainty, or when assumptions of standard methods aren’t valid.
Cross-Validation: Introduction
Remember the critical distinction between test error and training error (from Chapter 2)?
- Training Error: Calculated on the data used to train the model. It often underestimates the test error. (The model is optimized for this specific data.)
- Test Error: The average error on new, unseen data. This is what we truly care about! It shows how well the model generalizes.
- Our goal: statistical learning methods with low test error.
Ideally, we’d have a large, separate test set. But, we often don’t!
Cross-validation cleverly estimates test error using only the available training data. 😉
The Core Idea of Cross-Validation
- Hold out a subset of the training observations, fit the model on the remaining observations, and then evaluate the fitted model on the held-out subset.
- The error on the held-out observations serves as an estimate of the test error.
Different Cross-Validation Techniques
- Several different cross-validation techniques exist, based on how we hold out the data. We’ll discuss these next.
The Core Idea of Cross-Validation: Visualized
- This figure illustrates the validation set approach, a basic form of cross-validation.
- A set of n observations is randomly split into two parts:
- Training set (shown in blue).
- Validation set (shown in beige).
Validation Set Approach: Procedure
1. Randomly split the *n* observations into a *training set* and a *validation set* (hold-out set).
2. Fit the model on the *training set* only.
3. Use the fitted model to *predict* the responses for the observations in the *validation set*.
4. Calculate the validation set error.
- **Example:** For regression, use Mean Squared Error (MSE).
- This error is our estimate of the *test error*.
Validation Set Approach: Example (Auto Data)
Recall the `Auto` dataset (Chapter 3). We saw a non-linear relationship between `mpg` and `horsepower`.
Let’s use the validation set approach to compare models:
- Linear model: `mpg ~ horsepower`
- Quadratic model: `mpg ~ horsepower + horsepower²`
- Cubic model: `mpg ~ horsepower + horsepower² + horsepower³`

Which model predicts `mpg` best? Cross-validation helps us decide!
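As a concrete illustration, here is a minimal R sketch of the validation set approach on the `Auto` data, comparing the three polynomial fits. It assumes the `ISLR2` package (which provides `Auto`) is installed; the variable names are illustrative.

```r
# Validation set approach on the Auto data (sketch)
library(ISLR2)   # provides the Auto data frame (assumed installed)

set.seed(1)
n     <- nrow(Auto)
train <- sample(n, n / 2)    # indices of a random 50% training split

# Validation MSE for polynomial degrees 1 (linear), 2 (quadratic), 3 (cubic)
val_mse <- sapply(1:3, function(d) {
  fit  <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
  pred <- predict(fit, Auto)[-train]        # predictions on the held-out half
  mean((Auto$mpg[-train] - pred)^2)         # validation set MSE
})
val_mse
```

Rerunning with a different seed changes the exact numbers, which previews the split-to-split variability discussed below.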
Validation Set Approach: Example (Auto Data) - Left Panel
- Left Panel: Shows the validation set MSE for a single random split of the data.
- Blue: Linear model.
- Orange: Quadratic model.
- Green: Cubic model.
- The quadratic model (orange) has lower MSE than the linear model (blue), suggesting a better fit.
- The cubic model (green) has slightly higher MSE than the quadratic, suggesting that the extra flexibility of the cubic term does not improve prediction.
Validation Set Approach: Example (Auto Data) - Right Panel
- Right Panel: Shows validation set MSE for ten different random splits.
- Notice the variability! The MSE changes depending on which observations are in the training and validation sets.
- This illustrates a key drawback: The validation set approach can be highly variable.
LOOCV: Procedure
Leave-one-out cross-validation (LOOCV) uses a single observation as the validation set at each step.
1. For each observation *i* (from 1 to *n*):
- Hold out observation *i* as the validation set.
- Train the model on the remaining *n-1* observations.
- *Predict* the response for the held-out observation *i*.
- Calculate the error for observation *i*.
- **Example (Regression):** $(y_i - \hat{y}_i)^2$ (squared error).
- $y_i$: Actual value for observation *i*.
- $\hat{y}_i$: Predicted value for observation *i*.
LOOCV: Procedure (Continued)
2. Calculate the LOOCV estimate of the test MSE by averaging the *n* individual errors:
$$
CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}MSE_i
$$
Where $MSE_i = (y_i - \hat{y}_i)^2$.
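For least squares models fit with `glm()`, the `boot` package's `cv.glm()` function computes this LOOCV estimate directly. A minimal sketch, again assuming the `ISLR2` package for the `Auto` data:

```r
# LOOCV estimate of test MSE for mpg ~ horsepower (sketch)
library(ISLR2)   # Auto data (assumed installed)
library(boot)    # cv.glm()

fit <- glm(mpg ~ horsepower, data = Auto)   # glm() with the default family is a least squares fit
cv  <- cv.glm(Auto, fit)                    # K defaults to n, i.e. leave-one-out CV
cv$delta[1]                                 # LOOCV estimate of the test MSE
```

Using `glm()` rather than `lm()` here is only a convenience so that the fit can be passed to `cv.glm()`.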
LOOCV: Visualized
- Schematic of LOOCV.
- n data points are repeatedly split:
- Training set (blue): Contains all but one observation (n-1 observations).
- Validation set (beige): Contains only that one observation.
- The test error is estimated by averaging the n resulting MSEs.
LOOCV: Auto Data Example - Left Panel
- Left Panel: The LOOCV error curve for different polynomial models predicting `mpg` from `horsepower`.
- The curve shows how the LOOCV error changes as the model complexity (degree of the polynomial) increases.
LOOCV: Auto Data Example - Right Panel
- Right Panel: Shows multiple 10-fold CV curves (we’ll discuss k-fold CV shortly).
- This demonstrates that 10-fold CV can produce different results depending on the random splits, while LOOCV is deterministic (always gives the same result).
LOOCV: A Computational Shortcut
For least squares linear (or polynomial) regression, the LOOCV estimate can be computed from a single fit on the full dataset:
\[
CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2
\]
LOOCV Shortcut: Explanation
- \(\hat{y}_i\) is the ith fitted value from the original least squares fit (using the entire dataset).
- \(h_i\) is the leverage statistic.
- Leverage measures how much an observation influences its own fitted value. Observations with high leverage have a larger influence on the fitted model.
- Key Point: This formula allows us to calculate LOOCV with the cost of just one model fit! This is a huge computational saving.
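A minimal sketch of this shortcut in R: `residuals()` and `hatvalues()` give the full-data residuals and leverages from a single `lm()` fit (assuming the `ISLR2` package for the `Auto` data):

```r
# One-fit LOOCV shortcut for least squares (sketch)
library(ISLR2)                 # Auto data (assumed installed)

fit <- lm(mpg ~ horsepower, data = Auto)

r_i <- residuals(fit)          # y_i - y_hat_i from the single full-data fit
h_i <- hatvalues(fit)          # leverage of each observation

mean((r_i / (1 - h_i))^2)      # equals the brute-force n-fit LOOCV estimate for this model
```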
LOOCV Shortcut: Important Note
This shortcut does not generally apply to other models (like logistic regression or more complex models).
It’s specific to least squares linear and polynomial regression.
For other models, you typically need to perform the full LOOCV procedure (fitting the model n times).
k-Fold CV: Procedure
1. Randomly divide the *n* observations into *k* folds of roughly equal size.
2. For each fold *j* (from 1 to *k*):
- Treat fold *j* as the validation set.
- Train the model on the remaining *k-1* folds.
- Compute the error (e.g., MSE) on the held-out fold *j*.
k-Fold CV: Procedure (Continued)
3. Calculate the k-fold CV estimate by averaging the *k* errors:
$$
CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_i
$$
Where $MSE_i$ is the mean squared error calculated on fold *i*.
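In R, `cv.glm()` also performs k-fold CV via its `K` argument. A minimal sketch comparing polynomial degrees with 10-fold CV (assuming `ISLR2` for the `Auto` data):

```r
# 10-fold CV across polynomial degrees 1-5 (sketch)
library(ISLR2)   # Auto data (assumed installed)
library(boot)    # cv.glm()

set.seed(17)
cv_error <- sapply(1:5, function(d) {
  fit <- glm(mpg ~ poly(horsepower, d), data = Auto)
  cv.glm(Auto, fit, K = 10)$delta[1]   # 10-fold CV estimate of test MSE
})
cv_error   # one estimate per polynomial degree
```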
k-Fold CV: Visualized
- Schematic of 5-fold CV.
- n observations are randomly split into five non-overlapping groups.
- Each “fifth” acts as a validation set (beige), and the remainder as the training set (blue).
- The test error is estimated by averaging the five resulting MSE estimates.
k-Fold CV: Choosing k
Common choices for k are 5 or 10. These values have been empirically shown to provide good estimates of the test error.
Computational Advantage: k-fold CV (with k < n) is less computationally expensive than LOOCV. It requires only k model fits, not n. This is significant for large datasets.
5.1.4 Bias-Variance Trade-Off for k-Fold CV
- Bias:
- LOOCV: Nearly unbiased (uses almost all data for training).
- k-fold CV: Slightly more bias (uses (k-1)n/k observations for training).
- Validation set approach: Most bias (uses roughly half the data, leading to overestimation of test error).
Bias-Variance Trade-Off for k-Fold CV (Continued)
- Variance:
- LOOCV: Higher variance than k-fold CV. The n fitted models in LOOCV are highly correlated (they share almost all their training data). Averaging highly correlated quantities has higher variance.
- k-fold CV: Averages k models with less overlap in training data, leading to lower variance.
- Conclusion: 5-fold or 10-fold CV often achieve a good balance between bias and variance. They offer a reasonable compromise between accuracy and computational cost.
Cross-Validation on Simulated Data
- Blue: True test MSE (known because it’s simulated data).
- Black Dashed: LOOCV estimate.
- Orange Solid: 10-fold CV estimate.
- Crosses: Minimum points of each curve.
- The plots show that CV curves can sometimes underestimate the true test MSE.
- However, they generally identify the correct level of flexibility (the model with the lowest test error).
5.1.5 Cross-Validation for Classification
So far, we’ve focused on regression (quantitative response).
Cross-validation works similarly for classification (qualitative response).
Instead of MSE, we use the misclassification rate (the fraction of misclassified observations) to quantify error.
Cross-Validation for Classification (Continued)
- Example: LOOCV error rate:
\[
CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}Err_i
\]
where \(Err_i = I(y_i \neq \hat{y}_i)\).
- \(I(y_i \neq \hat{y}_i)\) is an indicator function:
  - 1 if observation *i* is misclassified (the predicted class \(\hat{y}_i\) differs from the true class \(y_i\)).
  - 0 otherwise.
- The k-fold CV error rate is calculated similarly, averaging the misclassification rates across the k folds.
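A minimal sketch of a 10-fold CV error rate for logistic regression, using the `Default` data (assumed available from the `ISLR2` package) and a user-supplied cost function that counts misclassifications at a 0.5 probability threshold:

```r
# 10-fold CV misclassification rate for logistic regression (sketch)
library(ISLR2)   # Default data with binary response 'default' (assumed installed)
library(boot)    # cv.glm()

fit <- glm(default ~ balance + income, data = Default, family = binomial)

# cost(): fraction of observations misclassified at a 0.5 threshold
cost <- function(y, prob) mean(abs(y - prob) > 0.5)

set.seed(1)
cv.glm(Default, fit, cost = cost, K = 10)$delta[1]   # estimated error rate
```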
CV for Classification: Example - Decision Boundaries
- Purple Dashed Line: Bayes decision boundary (the optimal decision boundary, usually unknown in practice).
- Black Lines: Decision boundaries from logistic regression with different polynomial degrees. Observe how the decision boundary changes with model complexity.
CV for Classification: Example - Error Curves
- Left: Logistic regression with polynomial terms.
- Right: KNN classifier with different values of K (number of neighbors).
- Brown: True test error.
- Blue: Training error (typically overly optimistic).
- Black: 10-fold CV error.
- CV curves often underestimate the true test error.
- Crucially, they tend to identify the minimum, corresponding to the best model complexity.
5.2 The Bootstrap
A powerful and widely applicable tool for quantifying the uncertainty associated with an estimator or statistical learning method.
It helps us understand how much our estimates might vary if we had different samples from the population.
Example: Estimating standard errors of regression coefficients. (Standard software does this for linear regression, but the bootstrap is useful in more general cases.)
Bootstrap: The Core Idea
Problem: We usually cannot generate new samples from the population. We only have our one observed dataset.
Bootstrap Solution: Repeatedly sample observations with replacement from the original dataset to create bootstrap datasets.
- With Replacement: The same observation can appear multiple times in a bootstrap dataset. This is the key to the bootstrap!
Bootstrap: The Core Idea (Continued)
- We treat the original dataset as if it were the population. This allows us to simulate the process of drawing multiple samples.
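The mechanics are simple enough to write by hand. A minimal sketch estimating the standard error of a sample mean by resampling with replacement (the data here are simulated purely for illustration):

```r
# Bootstrap standard error of the sample mean (sketch)
set.seed(1)
x <- rnorm(100, mean = 10, sd = 3)   # stand-in for an observed dataset

B <- 1000
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))  # resample with replacement, recompute

sd(boot_means)            # bootstrap estimate of SE(mean)
sd(x) / sqrt(length(x))   # textbook formula, for comparison
```

The two numbers should be close, which is the point: the bootstrap recovers the sampling variability using only the one observed dataset.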
Bootstrap: Visualized
- Graphical illustration of the bootstrap with a small sample (n = 3).
- Each bootstrap dataset contains n observations, sampled with replacement from the original data.
- Note that the same observation can appear multiple times in a single bootstrap sample.
Bootstrap: Investment Example
- We want to invest a fraction \(\alpha\) of our money in asset X and the remaining \(1 - \alpha\) in asset Y.
- Our goal: Minimize the risk (variance) of the combined return, \(\mathrm{Var}(\alpha X + (1 - \alpha)Y)\).
- The optimal fraction \(\alpha\) to invest in X is:
\[
\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}
\]
Investment Example: Explanation
- \(\sigma_X^2\): Variance of returns for asset X.
- \(\sigma_Y^2\): Variance of returns for asset Y.
- \(\sigma_{XY}\): Covariance between returns of assets X and Y.
- In practice, these population variances and covariance are unknown. We only have estimates from our observed data.
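A minimal sketch of bootstrapping the estimate of α with the `boot` package, assuming the `Portfolio` data (columns `X` and `Y` of asset returns) from the `ISLR2` package; `alpha_fn` is an illustrative name:

```r
# Bootstrap SE of the estimated alpha (sketch)
library(ISLR2)   # Portfolio data (assumed installed)
library(boot)    # boot()

# Statistic: plug-in estimate of alpha computed on the rows selected by 'index'
alpha_fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

set.seed(7)
boot(Portfolio, alpha_fn, R = 1000)   # reports the original estimate and its bootstrap SE
```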
Bootstrap Example - Simulated Data
- Each panel shows 100 simulated returns for investments X and Y.
- This simulates different possible scenarios for asset returns.
Bootstrap: Example - Histograms
- Left: Histogram of α estimates from 1,000 simulated datasets (if we could repeatedly sample from the true population).
- Center: Histogram of α estimates from 1,000 bootstrap samples from a single dataset (what we can do in practice).
- Right: Boxplots comparing the two distributions.
Bootstrap Example: Conclusion
- The bootstrap accurately estimates the variability of α!
- The distribution of bootstrap estimates (center) closely resembles the distribution of estimates from repeated sampling (left).
- This demonstrates how the bootstrap can provide a good estimate of uncertainty even with only one dataset.
Summary
- Resampling methods are essential for:
- Assessing model performance (cross-validation).
- Quantifying uncertainty (bootstrap).
Summary: Cross-Validation
- Cross-validation:
- Estimates test error by holding out data (creating a “fake” test set).
- Common techniques: Validation set, LOOCV, and k-fold CV.
- k-fold CV (k=5 or k=10) often provides a good bias-variance trade-off, balancing accuracy and computation.
Summary: Bootstrap
- Bootstrap:
- Estimates uncertainty by resampling with replacement from the original data.
- Widely applicable, especially when standard error formulas are unavailable or assumptions are violated.
Thoughts and Discussion 🤔
- When might you prefer LOOCV over k-fold CV, despite the higher computational cost?
- Answer: With very small datasets, the bias reduction of LOOCV might be crucial. Also, when computational cost isn’t a major concern (simple models).
- Can you think of situations where the bootstrap would be particularly useful?
- Answer: When working with a statistic without a simple standard error formula (e.g., median, custom metric), or when assumptions of standard methods (like normality) are violated.
Thoughts and Discussion 🤔 (Continued)
- How do these resampling methods relate to the concepts of bias and variance we discussed in earlier chapters?
- Answer: Cross-validation helps us choose models with a good balance of bias and variance (by estimating test error). The bootstrap helps us estimate the variance of parameter estimates.
- What are the limitations of these methods? When might they not be appropriate?
- Answer: Resampling can be computationally intensive. Cross-validation assumes independent and identically distributed (i.i.d.) data. The bootstrap relies on the original sample being representative of the population. Severe violations of these assumptions can lead to misleading results.