Last time, we explored the fundamentals of linear regression, a powerful tool for modeling relationships between variables. It allows us to understand how a dependent variable changes with one or more independent variables.
However, the simple linear model, while interpretable and often effective, has limitations. In this chapter we extend the linear model and enhance its capabilities; we will discuss subset selection, shrinkage (regularization), and dimension reduction.
The primary goals of these extensions are twofold: better prediction accuracy and better model interpretability.
In essence, data mining and machine learning seek, from a collection of candidate models, the one that best fits the data at hand (both the training data and the test data).
```mermaid
graph LR
  A[Data Mining] --> C(Common Ground)
  B[Machine Learning] --> C
  D[Statistical Learning] --> C
  C --> E[Insights & Predictions]
```
Let’s recall our standard linear model:
\[ Y = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p + \epsilon \]
Where:
\(Y\) is the response variable.
\(X_1, \dots, X_p\) are the predictor variables.
\(\beta_0, \dots, \beta_p\) are the coefficients (parameters) to be estimated.
\(\epsilon\) is the error term.
We usually use Least Squares to fit this model, finding the coefficients that minimize the sum of squared differences between the observed and predicted values.
The linear model has distinct advantages in terms of inference (understanding the relationships between variables) and is often competitive with non-linear methods in terms of prediction.
However, plain least squares has some limitations.
When do we need a fitting procedure other than least squares?
We will discuss these limitations from two perspectives: prediction accuracy and model interpretability.
Overfitting: A model that fits the training data too well, capturing noise and random fluctuations rather than the true underlying relationship. It won’t generalize well to new data.
A good model should not only fit the training data well but also have good predictive performance on new data (test data). Therefore, when we say a model is good, we generally consider two aspects: its fit to the training data and its fit to the test data.
A model with fewer, carefully selected variables is often easier to understand and explain. It highlights the key drivers of the response.
To address these limitations, we explore three main classes of methods that offer alternatives to least squares:
Subset Selection: Identify a subset of the p predictors that are most related to the response. Fit a model using least squares on this reduced set of variables. This simplifies the model and improves interpretability.
Shrinkage (Regularization): Fit a model with all p predictors, but shrink the estimated coefficients towards zero. This reduces variance and can improve prediction accuracy. Some methods (like the lasso) can even set coefficients exactly to zero, performing variable selection.
Dimension Reduction: Project the p predictors into an M-dimensional subspace (M < p). This means creating M linear combinations (projections) of the original variables. Use these projections as predictors in a least squares model. This reduces the complexity of the problem.
We will introduce several methods to select subsets of predictors. Here we consider best subset and stepwise model selection procedures. The goal is to find a smaller group of predictors that still explain the response well.
The Idea: Fit a separate least squares regression for every possible combination of the p predictors. This is an exhaustive search through all possible models. Then, choose the “best” model from this set based on some criterion.
Exhaustive Search: If you have p predictors, you have \(2^p\) possible models!
Algorithm 6.1 Best subset selection
Null Model (M0): A model with no predictors. It simply predicts the sample mean (\(\bar{y}\)) of the response for all observations. This serves as a baseline.
For k = 1, 2, …, p (where k is the number of predictors): fit all \(\binom{p}{k}\) models that contain exactly k predictors, and pick the best of them (smallest RSS, or equivalently largest \(R^2\)); call it Mk.
Select the ultimate best model: From the models M0, M1, …, Mp (one best model for each size), choose the single best model using a method that estimates the test error, such as validation set error, cross-validation, Cp, BIC, or adjusted R2.
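As a concrete illustration of Algorithm 6.1, here is a minimal Python sketch (not from the text) that enumerates every subset with scikit-learn's LinearRegression and keeps the best model of each size by training RSS; the final choice among M0, …, Mp would still be made with an estimate of test error.

```python
# Minimal best subset selection sketch: for each size k, fit all (p choose k)
# least squares models and keep the one with the smallest training RSS.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset_selection(X, y):
    """Return {k: (selected column indices, training RSS)} for k = 0, 1, ..., p."""
    n, p = X.shape
    # M0: the null model predicts the sample mean of y for every observation.
    best = {0: ((), float(np.sum((y - y.mean()) ** 2)))}
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in combinations(range(p), k):        # all (p choose k) candidate models
            cols = list(cols)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best[k] = (tuple(best_cols), best_rss)        # M_k, the best model with k predictors
    return best
```

Because the loop visits all \(2^p\) subsets, this is only practical for small p.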
Credit data: RSS and \(R^2\) for all possible models. The red frontier tracks the best model for each number of predictors.
Figure 6.1: Shows RSS and R2 for all possible models on the Credit dataset. This dataset contains information about credit card holders, and the goal is to predict their credit card balance.
The data contains ten predictors, but the x-axis ranges to 11. The reason is that one of the predictors is categorical, taking three values. It is split up into two dummy variables.
A categorical variable is also called a qualitative variable. Examples of categorical variables include gender (male, female), region (North, South, East, West), education level (high school, bachelor’s, master’s, doctorate), etc.
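For example (a hypothetical three-level variable, not the Credit data itself), dummy encoding with pandas shows where the extra model column comes from:

```python
# A three-level categorical predictor becomes two 0/1 dummy variables,
# which is why ten predictors can occupy eleven positions on the x-axis.
import pandas as pd

df = pd.DataFrame({"region": ["East", "West", "South", "East"]})  # hypothetical variable
dummies = pd.get_dummies(df, drop_first=True)  # columns region_South and region_West
print(dummies)
```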
The red line connects the best models for each size (lowest RSS or highest R2). For each number of predictors, the red line indicates the model that performs best on the training data.
As expected, RSS decreases and R2 increases as more variables are added. However, the improvements become very small after just a few variables. This suggests that adding more variables beyond a certain point doesn’t significantly improve the model’s fit to the training data.
The RSS of these p + 1 models decreases monotonically, and the R² increases monotonically, as the number of features included in the models increases. So we can’t use them to select the best model!
Training Error vs. Test Error: Low RSS and high R2 indicate a good fit to the training data. But we want a model that performs well on new, unseen data (low test error). Training error is often much smaller than test error! This is because the model is specifically optimized to fit the training data.
Need a Different Criterion: We can’t use RSS or R2 directly to select the best model (from among M0, M1, …, Mp) because they only reflect the fit on the training data. We need to estimate the test error.
Best subset selection is often computationally infeasible for large p. Thus, stepwise methods are attractive alternatives. They offer a more efficient way to search for a good model.
Algorithm 6.2 Forward stepwise selection
Null Model (M0): Start with the model containing no predictors.
For k = 0, 1, …, p−1: consider all p − k models that augment the predictors in Mk with one additional predictor; choose the best of them (smallest RSS or highest R2) and call it Mk+1.
Select the ultimate best model: Choose the single best model from M0, M1, …, Mp using validation set error, cross-validation, Cp, BIC, or adjusted R2. This step uses a method that estimates the test error.
Not Guaranteed Optimal: Forward stepwise selection is not guaranteed to find the best possible model out of all \(2^p\) possibilities. It is a greedy algorithm: it makes the locally optimal choice at each step, which may not lead to the globally optimal solution. It might miss the true best model.
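A related greedy search is available in scikit-learn's SequentialFeatureSelector; note that, unlike Algorithm 6.2, it scores each candidate addition by cross-validation rather than training RSS. A minimal sketch on synthetic data:

```python
# Greedy forward selection with scikit-learn (cross-validated scoring at each step).
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,    # desired model size; in practice also chosen by CV
    direction="forward",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())       # boolean mask of the selected predictors
```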
Example:
| # Variables | Best Subset | Forward Stepwise |
|---|---|---|
| One | rating | rating |
| Two | rating, income | rating, income |
| Three | rating, income, student | rating, income, student |
| Four | cards, income, student, limit | rating, income, student, limit |
n < p Case: Forward stepwise selection can be used even when n < p (more predictors than observations). This is a major advantage.
Limitation: In this case, you can only build models up to size \(M_{n-1}\), because least squares does not give a unique fit when \(p \ge n\). You cannot include more variables than you have observations.
Algorithm 6.3 Backward stepwise selection
Full Model (Mp): Begin with the model containing all p predictors.
For k = p, p−1, …, 1: consider all k models that contain all but one of the predictors in Mk (i.e., drop one predictor at a time); choose the best of them (smallest RSS or highest R2) and call it Mk−1.
Select the ultimate best model: Select the single best model from M0, …, Mp using validation set error, cross-validation, Cp, BIC, or adjusted R2. This uses a method that estimates test error.
Computational Advantage: Like forward stepwise, backward stepwise considers only 1 + p(p+1)/2 models, making it computationally efficient.
Not Guaranteed Optimal: Like forward stepwise, it’s not guaranteed to find the best possible model.
Requirement: n > p: Backward stepwise selection requires that n > p (more observations than predictors) so that the full model (with all p predictors) can be fit. This is a significant limitation compared to forward stepwise.
Forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.
Combine Forward and Backward: Hybrid methods combine aspects of forward and backward stepwise selection. They try to get the benefits of both.
Add and Remove: Variables are added sequentially (like forward). But, after adding each new variable, the method may also remove any variables that no longer contribute significantly to the model fit (based on some criterion). This allows the algorithm to “correct” earlier decisions.
Goal: Try to mimic best subset selection while retaining the computational advantages of stepwise methods. They aim for a better solution than pure forward or backward stepwise, while still being computationally efficient.
Best subset selection, forward selection, and backward selection result in the creation of a set of models, each of which contains a subset of the p predictors.
The Challenge: How do we choose the best model from among the set of models (M0, …, Mp) generated by subset selection or stepwise selection? We cannot simply use the model that has the smallest RSS and the largest R2! Those metrics are based on the training data and are likely to be overly optimistic.
Need to Estimate Test Error: We need to estimate the test error of each model – how well it will perform on new data.
Two Main Approaches:
Training Error is Deceptive: The training set MSE (RSS/n) generally underestimates the test MSE. This is because least squares specifically minimizes the training RSS. The model is optimized for the training data, so it will naturally perform better on that data than on unseen data.
Adjusting for Model Size: We need to adjust the training error to account for the fact that it tends to be too optimistic. Several techniques do this:
For a least squares model with d predictors, these statistics are computed as:
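With \(\hat{\sigma}^2\) an estimate of the variance of the error \(\epsilon\) and TSS the total sum of squares, the definitions used in ISLR are:

\[ C_p = \frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right), \qquad BIC = \frac{1}{n}\left(RSS + \log(n)\,d\hat{\sigma}^2\right) \]

\[ \text{Adjusted } R^2 = 1 - \frac{RSS/(n - d - 1)}{TSS/(n - 1)} \]

For least squares models, AIC is proportional to \(C_p\), so the two criteria select the same model.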
Low Values are Good (Cp, AIC, BIC): For Cp, AIC, and BIC, we choose the model with the lowest value. Lower values indicate a better trade-off between model fit and complexity.
High Values are Good (Adjusted R2): For adjusted R2, we choose the model with the highest value. Higher values indicate a better fit, adjusted for the number of predictors.
Theoretical Justification: Cp, AIC, and BIC have theoretical justifications (though they rely on assumptions that may not always hold). They are derived from principles of statistical information theory. Adjusted R2 is more intuitive but less theoretically grounded.
Cp, BIC, and adjusted R2 for the best models of each size on the Credit data.
Figure 6.2: Shows these statistics for the Credit dataset.
Cp and BIC are estimates of test MSE. They aim to approximate how well the model would perform on new data.
BIC selects a model with 4 variables (income, limit, cards, student).
Cp selects a 6-variable model.
Adjusted R2 selects a 7-variable model.
BIC, validation set error, and cross-validation error for the best models of each size on the Credit data.
Alternative to Subset Selection: Instead of selecting a subset of variables, shrinkage methods (also called regularization methods) fit a model with all p predictors, but constrain or regularize the coefficient estimates. They “shrink” the coefficients towards zero.
How it Works: Shrinkage methods shrink the coefficient estimates towards zero.
Why Shrink?: Shrinking the coefficients can significantly reduce their variance. This can improve prediction accuracy, especially when the least squares estimates have high variance.
Two Main Techniques: ridge regression and the lasso.
Recall that least squares chooses the coefficients to minimize
\[ RSS = \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij})^2 \]
Ridge regression instead chooses them to minimize
\[ \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij})^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2 \]
where \(\lambda \ge 0\) is a tuning parameter and \(\lambda\sum_{j=1}^{p}\beta_j^2\) is the shrinkage penalty.
No Shrinkage on Intercept: Notice that the shrinkage penalty is not applied to the intercept (\(\beta_0\)).
Why?: We want to shrink the coefficients of the predictors, but not the intercept, which represents the average value of the response when all predictors are zero (or at their mean values, if centered). Shrinking the intercept would bias the predictions.
Centering Predictors: If the predictors are centered (mean of zero) before performing ridge regression, then the estimated intercept will be the sample mean of the response: \(\hat{\beta}_0 = \bar{y}\). Centering simplifies the calculations and ensures the intercept has a meaningful interpretation.
Standardized ridge regression coefficients for the Credit data, as a function of \(\lambda\) and \(\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2\).
Figure 6.4: Shows ridge regression coefficient estimates for the Credit data.
Left Panel: Coefficients plotted against λ.
Right Panel: Coefficients plotted against \(\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2\).
\(\|\hat{\beta}_\lambda^R\|_2\) is the \(\ell_2\) norm of the ridge regression coefficient vector.
\(\|\hat{\beta}\|_2\) is the \(\ell_2\) norm of the least squares coefficient vector.
The x-axis can therefore be read as the amount by which the ridge regression coefficient estimates have been shrunken towards zero.
Scale Equivariance (Least Squares): Least squares coefficient estimates are scale equivariant. Multiplying a predictor by a constant c simply scales the corresponding coefficient by 1/c. The overall prediction remains unchanged.
Scale Dependence (Ridge Regression): Ridge regression coefficients can change substantially when a predictor is multiplied by a constant, because of the \(\sum_{j}\beta_j^2\) term in the penalty. Rescaling a predictor rescales its coefficient, which changes that coefficient's contribution to the penalty, so the ridge solution depends on the units of the predictors.
\[ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}} \]
This formula divides each predictor by its standard deviation (estimated on the training sample), so every standardized predictor has a standard deviation of one.
With all predictors on a common scale, the penalty treats them equally regardless of their original units.
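A minimal scikit-learn sketch of this workflow (synthetic data; the settings are illustrative): standardize the predictors, then let cross-validation choose \(\lambda\) (called alpha in scikit-learn).

```python
# Ridge regression with standardized predictors and lambda chosen by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

model = make_pipeline(
    StandardScaler(),                        # put every predictor on the same scale
    RidgeCV(alphas=np.logspace(-3, 3, 50)),  # grid of candidate penalty values
)
model.fit(X, y)
print(model.named_steps["ridgecv"].alpha_)   # the lambda selected by cross-validation
```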
Bias-Variance Trade-Off: Ridge regression’s advantage comes from the bias-variance trade-off. This is a fundamental concept in statistical learning.
Finding the Sweet Spot: The goal is to find a value of λ that reduces variance more than it increases bias, leading to a lower test MSE (Mean Squared Error).
Squared bias, variance, and test MSE for ridge regression on a simulated dataset.
Figure 6.5: Shows bias, variance, and test MSE for ridge regression on a simulated dataset.
As λ increases, variance decreases rapidly at first, with only a small increase in bias. This leads to a decrease in MSE.
Eventually, the decrease in variance slows, and the increase in bias accelerates, causing the MSE to increase. The penalty becomes too strong, and the model underfits.
The minimum MSE is achieved at a moderate value of λ. This is the optimal level of shrinkage.
Disadvantage of Ridge Regression: Ridge regression includes all p predictors in the final model. The penalty shrinks coefficients towards zero, but it doesn’t set any of them exactly to zero (unless λ = ∞). This can make interpretation difficult when p is large. It doesn’t perform variable selection.
The Lasso: An Alternative: The lasso is a more recent alternative to ridge regression that overcomes this disadvantage. It can perform variable selection.
\[ \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij})^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j| \]
Shrinkage and Selection: Like ridge regression, the lasso shrinks coefficients towards zero.
Key Difference: The \(\ell_1\) penalty has the effect of forcing some coefficients to be exactly zero when λ is sufficiently large. This is the crucial difference from ridge regression.
Variable Selection: This means the lasso performs variable selection! It automatically excludes some variables from the model.
Sparse Models: The lasso yields sparse models – models that involve only a subset of the variables. “Sparse” means that many of the coefficients are zero.
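A minimal sketch of this sparsity (synthetic data, illustrative only): with \(\lambda\) chosen by cross-validation, most of the estimated coefficients come out exactly zero.

```python
# The lasso produces sparse coefficient vectors: many entries are exactly zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

lasso = make_pipeline(StandardScaler(), LassoCV(cv=10))
lasso.fit(X, y)
coef = lasso.named_steps["lassocv"].coef_
print("non-zero coefficients:", int(np.sum(coef != 0)), "of", coef.size)
```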
Standardized lasso coefficients for the Credit data, as a function of \(\lambda\) and \(\|\hat{\beta}_\lambda^L\|_1 / \|\hat{\beta}\|_1\).
Contours of the error and constraint functions for the lasso (left) and ridge regression (right).
Why does the lasso set coefficients to zero, while ridge regression doesn’t?
The solution is the first point where the “ellipse” (contour of constant RSS) touches the constraint region. The ellipses represent combinations of coefficients that have the same RSS.
Because the lasso constraint has corners, the ellipse often intersects at an axis, setting one coefficient to zero. When the ellipse hits a corner of the diamond, one of the coefficients is zero.
Ridge regression’s circular constraint doesn’t have corners, so this rarely happens. The ellipse is unlikely to intersect the circle exactly on an axis.
Interpretability: The lasso has a major advantage in terms of interpretability, producing simpler models with fewer variables. This makes it easier to understand the key factors influencing the response.
Prediction Accuracy: Which method is better for prediction depends on the true underlying relationship between the predictors and the response. It’s data-dependent.
Unknown Truth: In practice, we don’t know which scenario is true. Cross-validation can help us choose the best approach for a particular dataset.
Lasso and ridge regression on a simulated dataset where all predictors are related to the response.
Lasso and ridge regression on a simulated dataset where only two predictors are related to the response.
We consider a simple special case: n = p, the design matrix X is a diagonal matrix with ones on the diagonal and zeros elsewhere, and the regression has no intercept.
Then, it can be shown:
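In this setting the least squares estimates are simply \(\hat{\beta}_j = y_j\), and the ridge and lasso estimates have closed forms:

\[ \hat{\beta}_j^R = \frac{y_j}{1 + \lambda} \]

\[ \hat{\beta}_j^L = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2 \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2 \\ 0 & \text{if } |y_j| \le \lambda/2 \end{cases} \]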
Ridge and lasso coefficient estimates in the simple case.
Figure 6.10 shows that:
Ridge regression shrinks every least squares coefficient by the same proportion.
The lasso shrinks each coefficient toward zero by a constant amount (\(\lambda/2\)); coefficients that are smaller than this amount in absolute value are shrunken all the way to zero (soft thresholding).
Cross-validation error and coefficient estimates for ridge regression on the Credit data.
Cross-validation error and coefficient estimates for the lasso on the simulated data from Figure 6.9.
Different Approach: Instead of working directly with the original predictors (X1, …, Xp), dimension reduction methods transform the predictors and then fit a least squares model using the transformed variables. They create new, fewer variables that are combinations of the original ones.
Linear Combinations: Create M linear combinations (Z1, …, ZM) of the original p predictors, where M < p.
\[ Z_m = \sum_{j=1}^{p}\phi_{jm}X_j \]
- \(\phi_{jm}\) are constants (weights) that define the linear combinations. Each \(Z_m\) is a weighted sum of all the original predictors.
\[ y_i = \theta_0 + \sum_{m=1}^{M}\theta_mz_{im} + \epsilon_i \]
- This reduces the problem from estimating \(p + 1\) coefficients (in the original model) to estimating \(M + 1\) coefficients, which can significantly simplify the model.
\[ \beta_j = \sum_{m=1}^{M}\theta_m\phi_{jm} \]
- The original coefficients (\(\beta_j\)) are now expressed in terms of the new coefficients (\(\theta_m\)) and the weights (\(\phi_{jm}\)).
Principal Components Analysis (PCA): PCA is a technique for deriving a low-dimensional set of features from a larger set of variables. (More detail in Chapter 12.) It finds new variables (principal components) that capture the most variation in the original data.
Unsupervised: PCA is an unsupervised method – it identifies linear combinations that best represent the predictors (X), without considering the response (Y). It only looks at the relationships among the predictors.
PCA seeks the directions along which the observations vary the most. These directions are the principal components.
First Principal Component: The first principal component is the direction in the data with the greatest variance. It’s the line that best captures the spread of the data.
Population size and ad spending for 100 cities. The first principal component is shown in green, and the second in blue.
\[ Z_1 = 0.839 \times (pop - \overline{pop}) + 0.544 \times (ad - \overline{ad}) \]
- 0.839 and 0.544 are the *principal component loadings* (the \(\phi_{jm}\) values). They define the direction of the first principal component.
- \(\overline{pop}\) and \(\overline{ad}\) are the means of pop and ad, respectively. The variables are centered (mean-subtracted).
\[ Z_{i1} = 0.839 \times (pop_i - \overline{pop}) + 0.544 \times (ad_i - \overline{ad}) \]
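A small sketch of how such loadings and scores are computed (hypothetical pop/ad values generated at random, not the advertising data):

```python
# Computing first principal component loadings and scores with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
pop = rng.normal(40.0, 10.0, size=100)           # hypothetical population sizes
ad = 0.6 * pop + rng.normal(0.0, 4.0, size=100)  # hypothetical ad spending, correlated with pop

X = np.column_stack([pop, ad])
pca = PCA(n_components=2)
scores = pca.fit_transform(X)     # z_i1 and z_i2 for each city (PCA centers the variables)
print(pca.components_[0])         # loadings (phi_11, phi_21) of the first principal component
```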
The first principal component direction, with distances to the observations shown as dashed lines.
Plots of the first principal component scores versus pop and ad.
More than One: You can construct up to p distinct principal components.
Second Principal Component: The second principal component (Z2) is a linear combination of the variables that is uncorrelated with Z1 and has the largest variance subject to that constraint.
Successive Components: Each subsequent principal component captures the maximum remaining variance, subject to being uncorrelated with the previous components. Each component captures a different “direction” of variation in the data.
Plots of the second principal component scores versus pop and ad.
The Idea: Use the first M principal components (Z1, …, ZM) as predictors in a linear regression model. This is PCR.
Assumption: We assume that the directions in which X1, …, Xp show the most variation are the directions that are associated with Y. This is the key assumption of PCR. If it holds, PCR can be effective.
Potential for Improvement: If this assumption holds, PCR can outperform least squares, especially when M << p, by reducing variance.
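A minimal PCR sketch (synthetic data; the number of components M is fixed here for illustration but would normally be chosen by cross-validation): standardize, project onto the first M principal components, then fit least squares on the scores.

```python
# Principal components regression as a scikit-learn pipeline.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

M = 5  # number of principal components; in practice chosen by cross-validation
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))  # training R^2; test error should be estimated separately
```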
PCR applied to two simulated datasets.
Figure 6.18: Shows PCR applied to the simulated datasets from Figures 6.8 and 6.9.
PCR with an appropriate choice of M can improve substantially over least squares.
However, in this example, PCR does not perform as well as ridge regression or the lasso. This is because the data were generated in a way that required many principal components to model the response well.
Linear Combinations: PCR is not a feature selection method. Each principal component is a linear combination of all p original features. It doesn’t exclude any of the original variables.
Example: In the advertising data, Z1 was a combination of both pop and ad.
Relationship to Ridge Regression: PCR is more closely related to ridge regression than to the lasso. Both involve using all the original predictors, albeit in a transformed way.
PCR standardized coefficient estimates and cross-validation MSE on the Credit data.
Figure 6.20: Shows cross-validation for PCR on the Credit data.
The lowest cross-validation error occurs with M = 10, which is almost no dimension reduction.
Standardization: It’s generally recommended to standardize each predictor before performing PCA (and thus PCR). This ensures that all variables are on the same scale. Otherwise, variables with larger variances will dominate the principal components.
Supervised Dimension Reduction: PLS is a supervised dimension reduction technique. Unlike PCR (which is unsupervised), PLS uses the response (Y) to help identify the new features (Z1, …, ZM). It takes the response into account when creating the linear combinations.
Goal: Find directions that explain both the response and the predictors. It tries to find components that are relevant to predicting the response.
Standardize Predictors: Standardize the p predictors (subtract the mean and divide by the standard deviation).
Simple Linear Regressions: Compute the coefficient from the simple linear regression of Y onto each Xj (separately for each predictor). This measures the individual relationship between each predictor and the response.
First PLS Direction: Set each \(\phi_{j1}\) in the equation for Z1 (Equation 6.16) equal to this coefficient. This means PLS places the highest weight on the variables that are most strongly related to the response (based on the simple linear regressions).
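A minimal sketch with scikit-learn's PLSRegression (synthetic data): on standardized predictors, its first weight vector is, up to normalization, the vector of simple regression coefficients described above.

```python
# Partial least squares with scikit-learn; M (n_components) would be chosen by CV.
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

pls = PLSRegression(n_components=2)  # two PLS directions, for illustration
pls.fit(X, y)                        # scikit-learn standardizes X and y internally by default
print(pls.x_weights_[:, 0])          # weights defining the first PLS direction
```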
First PLS direction (solid line) and first PCR direction (dotted line) for the advertising data.
Choosing M: The number of PLS directions (M) is a tuning parameter, typically chosen by cross-validation.
Standardization: It’s generally recommended to standardize both the predictors and the response before performing PLS.
Performance: In practice, PLS often performs no better than ridge regression or PCR. The supervised dimension reduction of PLS can reduce bias, but it also has the potential to increase variance.
Low-Dimensional Setting: Most traditional statistical techniques are designed for the low-dimensional setting, where n (number of observations) is much greater than p (number of features).
High-Dimensional Setting: In recent years, new technologies have led to a dramatic increase in the number of features that can be measured. We often encounter datasets where p is large, possibly even larger than n. This is the “high-dimensional” setting.
The lasso performed with varying numbers of features (p) and a fixed sample size (n).
Figure 6.24: The lasso fit with n = 100 observations and p = 20, 50, or 2000 features.
As the number of features increases, the test set error increases, highlighting the curse of dimensionality.
Three Key Points:
Multicollinearity is Extreme: In the high-dimensional setting, multicollinearity is extreme. Any variable can be written as a linear combination of all the other variables. This makes it impossible to isolate the effect of individual predictors.
Cannot Identify True Predictors: This means we can never know exactly which variables (if any) are truly predictive of the outcome. We can only identify variables that are correlated with the true predictors.
Caution in Reporting: Be very cautious when reporting results. Don’t overstate conclusions. Avoid claiming to have found the “true” predictors.
Never Use Training Data for Evaluation: Never use training data measures (sum of squared errors, p-values, R2) as evidence of a good model fit in the high-dimensional setting. These measures will be misleadingly optimistic. They will always look good, even if the model is terrible.
Use Test Data or Cross-Validation: Always report results on an independent test set or using cross-validation. These are the only reliable ways to assess model performance in high dimensions.
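A small illustration of this point (synthetic, pure-noise data): when p exceeds the number of training observations, least squares interpolates the training set, so the training \(R^2\) looks perfect while the test \(R^2\) is useless.

```python
# Training-data metrics are misleading in high dimensions: noise features still
# yield a perfect training fit, but the model has no predictive value.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 90))   # p = 90 features of pure noise
y = rng.normal(size=100)         # response unrelated to X

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
fit = LinearRegression().fit(X_tr, y_tr)
print("training R^2:", fit.score(X_tr, y_tr))  # essentially 1: the model interpolates
print("test R^2:    ", fit.score(X_te, y_te))  # near zero or negative
```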
Goals: These methods aim to improve prediction accuracy and model interpretability.
High-Dimensional Data: These methods are particularly important in the high-dimensional setting (p ≥ n), where least squares fails. They provide ways to fit models even when the number of predictors is large.
Cross-Validation: Cross-validation is a powerful tool for selecting tuning parameters and estimating the test error of different models. It’s essential for reliable model selection.
The choice of modeling method, the choice of tuning parameter, and the choice of assessment metrics all become especially important in high dimensions.
邱飞(peter) 💌 [email protected]