Last time, we explored the fundamentals of linear regression, a powerful tool for modeling relationships between variables. It allows us to understand how a dependent variable changes with one or more independent variables.
However, the simple linear model, while interpretable and often effective, has limitations. In this chapter we extend the linear model and enhance its capabilities; we will discuss subset selection, shrinkage (regularization), and dimension reduction.
The primary goals of these extensions are twofold: better prediction accuracy and better model interpretability.
In essence, data mining and machine learning seek, from a collection of candidate models, the one that best fits the data at hand (both the training data and the test data).
```mermaid
graph LR
  A[Data Mining] --> C(Common Ground)
  B[Machine Learning] --> C
  D[Statistical Learning] --> C
  C --> E[Insights & Predictions]
```
Let’s recall our standard linear model:
\[ Y = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p + \epsilon \]
Where:
\(Y\) is the response variable.
\(X_1, \dots, X_p\) are the predictor variables.
\(\beta_0, \dots, \beta_p\) are the coefficients (parameters) to be estimated.
\(\epsilon\) is the error term.
We usually use Least Squares to fit this model, finding the coefficients that minimize the sum of squared differences between the observed and predicted values.
The linear model has distinct advantages in terms of inference (understanding the relationships between variables) and is often competitive with non-linear methods in terms of prediction.
However, plain least squares has some limitations.
When do we need a fitting procedure other than least squares?
We will discuss these limitations from two perspectives: prediction accuracy and model interpretability.
Overfitting: A model that fits the training data too well, capturing noise and random fluctuations rather than the true underlying relationship. It won’t generalize well to new data.
A good model should not only fit the training data well but also have good predictive performance on new data (test data). Therefore, when we say a model is good, we generally consider two aspects: its fit to the training data and its fit to the test data.
A model with fewer, carefully selected variables is often easier to understand and explain. It highlights the key drivers of the response.
To address these limitations, we explore three main classes of methods that offer alternatives to least squares:
Subset Selection: Identify a subset of the p predictors that are most related to the response. Fit a model using least squares on this reduced set of variables. This simplifies the model and improves interpretability.
Shrinkage (Regularization): Fit a model with all p predictors, but shrink the estimated coefficients towards zero. This reduces variance and can improve prediction accuracy. Some methods (like the lasso) can even set coefficients exactly to zero, performing variable selection.
Dimension Reduction: Project the p predictors into an M-dimensional subspace (M < p). This means creating M linear combinations (projections) of the original variables. Use these projections as predictors in a least squares model. This reduces the complexity of the problem.
We will introduce several methods to select subsets of predictors. Here we consider best subset and stepwise model selection procedures. The goal is to find a smaller group of predictors that still explain the response well.
The Idea: Fit a separate least squares regression for every possible combination of the p predictors. This is an exhaustive search through all possible models. Then, choose the “best” model from this set based on some criterion.
Exhaustive Search: If you have p predictors, you have \(2^p\) possible models!
Algorithm 6.1 Best subset selection
Null Model (M0): A model with no predictors. It simply predicts the sample mean (\(\bar{y}\)) of the response for all observations. This serves as a baseline.
For k = 1, 2, …, p (where k is the number of predictors): fit all \(\binom{p}{k}\) models that contain exactly k predictors, and pick the best of them (smallest RSS, or equivalently largest \(R^2\)); call it Mk.
Select the ultimate best model: From the models M0, M1, …, Mp (one best model for each size), choose the single best model using a method that estimates the test error, such as validation set error, cross-validation, Cp, BIC, or adjusted R2.
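As a concrete illustration of Algorithm 6.1, here is a minimal Python sketch (not from the text) that enumerates every subset with scikit-learn's LinearRegression and keeps the best model of each size by training RSS; the final choice among M0, …, Mp would still be made with an estimate of test error.

```python
# Minimal best subset selection sketch: for each size k, fit all (p choose k)
# least squares models and keep the one with the smallest training RSS.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset_selection(X, y):
    """Return {k: (selected column indices, training RSS)} for k = 0, 1, ..., p."""
    n, p = X.shape
    # M0: the null model predicts the sample mean of y for every observation.
    best = {0: ((), float(np.sum((y - y.mean()) ** 2)))}
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for cols in combinations(range(p), k):        # all (p choose k) candidate models
            cols = list(cols)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        best[k] = (tuple(best_cols), best_rss)        # M_k, the best model with k predictors
    return best
```

Because the loop visits all \(2^p\) subsets, this is only practical for small p.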
Credit data: RSS and \(R^2\) for all possible models. The red frontier tracks the best model for each number of predictors.
Figure 6.1: Shows RSS and R2 for all possible models on the Credit dataset. This dataset contains information about credit card holders, and the goal is to predict their credit card balance.
The data contains ten predictors, but the x-axis ranges to 11. The reason is that one of the predictors is categorical, taking three values. It is split up into two dummy variables.
A categorical variable is also called a qualitative variable. Examples of categorical variables include gender (male, female), region (North, South, East, West), education level (high school, bachelor’s, master’s, doctorate), etc.
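For example (a hypothetical three-level variable, not the Credit data itself), dummy encoding with pandas shows where the extra model column comes from:

```python
# A three-level categorical predictor becomes two 0/1 dummy variables,
# which is why ten predictors can occupy eleven positions on the x-axis.
import pandas as pd

df = pd.DataFrame({"region": ["East", "West", "South", "East"]})  # hypothetical variable
dummies = pd.get_dummies(df, drop_first=True)  # columns region_South and region_West
print(dummies)
```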
The red line connects the best models for each size (lowest RSS or highest R2). For each number of predictors, the red line indicates the model that performs best on the training data.
As expected, RSS decreases and R2 increases as more variables are added. However, the improvements become very small after just a few variables. This suggests that adding more variables beyond a certain point doesn’t significantly improve the model’s fit to the training data.
The RSS of these p + 1 models decreases monotonically, and the R² increases monotonically, as the number of features included in the models increases. So we can’t use them to select the best model!
Training Error vs. Test Error: Low RSS and high R2 indicate a good fit to the training data. But we want a model that performs well on new, unseen data (low test error). Training error is often much smaller than test error! This is because the model is specifically optimized to fit the training data.
Need a Different Criterion: We can’t use RSS or R2 directly to select the best model (from among M0, M1, …, Mp) because they only reflect the fit on the training data. We need to estimate the test error.
Best subset selection is often computationally infeasible for large p. Thus, stepwise methods are attractive alternatives. They offer a more efficient way to search for a good model.
Algorithm 6.2 Forward stepwise selection
Null Model (M0): Start with the model containing no predictors.
For k = 0, 1, …, p−1: consider all p − k models that augment the predictors in Mk with one additional predictor; choose the best of them (smallest RSS or highest R2) and call it Mk+1.
Select the ultimate best model: Choose the single best model from M0, M1, …, Mp using validation set error, cross-validation, Cp, BIC, or adjusted R2. This step uses a method that estimates the test error.
Not Guaranteed Optimal: Forward stepwise selection is not guaranteed to find the best possible model out of all \(2^p\) possibilities. It is a greedy algorithm: it makes the locally optimal choice at each step, which may not lead to the globally optimal solution. It might miss the true best model.
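A related greedy search is available in scikit-learn's SequentialFeatureSelector; note that, unlike Algorithm 6.2, it scores each candidate addition by cross-validation rather than training RSS. A minimal sketch on synthetic data:

```python
# Greedy forward selection with scikit-learn (cross-validated scoring at each step).
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,    # desired model size; in practice also chosen by CV
    direction="forward",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support())       # boolean mask of the selected predictors
```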
Example:
| # Variables | Best Subset | Forward Stepwise |
|---|---|---|
| One | rating | rating |
| Two | rating, income | rating, income |
| Three | rating, income, student | rating, income, student |
| Four | cards, income, student, limit | rating, income, student, limit |
n < p Case: Forward stepwise selection can be used even when n < p (more predictors than observations). This is a major advantage.
Limitation: In this case, you can only build models up to size \(M_{n-1}\), because least squares does not give a unique fit when \(p \ge n\). You cannot include more variables than you have observations.
Algorithm 6.3 Backward stepwise selection
Full Model (Mp): Begin with the model containing all p predictors.
For k = p, p−1, …, 1: consider all k models that contain all but one of the predictors in Mk (i.e., drop one predictor at a time); choose the best of them (smallest RSS or highest R2) and call it Mk−1.
Select the ultimate best model: Select the single best model from M0, …, Mp using validation set error, cross-validation, Cp, BIC, or adjusted R2. This uses a method that estimates test error.
Computational Advantage: Like forward stepwise, backward stepwise considers only 1 + p(p+1)/2 models, making it computationally efficient.
Not Guaranteed Optimal: Like forward stepwise, it’s not guaranteed to find the best possible model.
Requirement: n > p: Backward stepwise selection requires that n > p (more observations than predictors) so that the full model (with all p predictors) can be fit. This is a significant limitation compared to forward stepwise.
Forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large.
Combine Forward and Backward: Hybrid methods combine aspects of forward and backward stepwise selection. They try to get the benefits of both.
Add and Remove: Variables are added sequentially (like forward). But, after adding each new variable, the method may also remove any variables that no longer contribute significantly to the model fit (based on some criterion). This allows the algorithm to “correct” earlier decisions.
Goal: Try to mimic best subset selection while retaining the computational advantages of stepwise methods. They aim for a better solution than pure forward or backward stepwise, while still being computationally efficient.
Best subset selection, forward selection, and backward selection result in the creation of a set of models, each of which contains a subset of the p predictors.
The Challenge: How do we choose the best model from among the set of models (M0, …, Mp) generated by subset selection or stepwise selection? We cannot simply use the model that has the smallest RSS and the largest R2! Those metrics are based on the training data and are likely to be overly optimistic.
Need to Estimate Test Error: We need to estimate the test error of each model – how well it will perform on new data.
Two Main Approaches:
Training Error is Deceptive: The training set MSE (RSS/n) generally underestimates the test MSE. This is because least squares specifically minimizes the training RSS. The model is optimized for the training data, so it will naturally perform better on that data than on unseen data.
Adjusting for Model Size: We need to adjust the training error to account for the fact that it tends to be too optimistic. Several techniques do this:
For a least squares model with d predictors, these statistics are computed as:
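With \(\hat{\sigma}^2\) an estimate of the variance of the error \(\epsilon\) and TSS the total sum of squares, the definitions used in ISLR are:

\[ C_p = \frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right), \qquad BIC = \frac{1}{n}\left(RSS + \log(n)\,d\hat{\sigma}^2\right) \]

\[ \text{Adjusted } R^2 = 1 - \frac{RSS/(n - d - 1)}{TSS/(n - 1)} \]

For least squares models, AIC is proportional to \(C_p\), so the two criteria select the same model.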
Low Values are Good (Cp, AIC, BIC): For Cp, AIC, and BIC, we choose the model with the lowest value. Lower values indicate a better trade-off between model fit and complexity.
High Values are Good (Adjusted R2): For adjusted R2, we choose the model with the highest value. Higher values indicate a better fit, adjusted for the number of predictors.
Theoretical Justification: Cp, AIC, and BIC have theoretical justifications (though they rely on assumptions that may not always hold). They are derived from principles of statistical information theory. Adjusted R2 is more intuitive but less theoretically grounded.
Cp, BIC, and adjusted R2 for the best models of each size on the Credit data.
Figure 6.2: Shows these statistics for the Credit dataset.
Cp and BIC are estimates of test MSE. They aim to approximate how well the model would perform on new data.
BIC selects a model with 4 variables (income, limit, cards, student).
Cp selects a 6-variable model.
Adjusted R2 selects a 7-variable model.
BIC, validation set error, and cross-validation error for the best models of each size on the Credit data.
Alternative to Subset Selection: Instead of selecting a subset of variables, shrinkage methods (also called regularization methods) fit a model with all p predictors, but constrain or regularize the coefficient estimates. They “shrink” the coefficients towards zero.
How it Works: Shrinkage methods shrink the coefficient estimates towards zero.
Why Shrink?: Shrinking the coefficients can significantly reduce their variance. This can improve prediction accuracy, especially when the least squares estimates have high variance.
Two Main Techniques: ridge regression and the lasso.
Recall that least squares chooses the coefficients to minimize
\[ RSS = \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij})^2 \]
Ridge regression instead chooses them to minimize
\[ \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij})^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2 \]
where \(\lambda \ge 0\) is a tuning parameter and \(\lambda\sum_{j=1}^{p}\beta_j^2\) is the shrinkage penalty.
No Shrinkage on Intercept: Notice that the shrinkage penalty is not applied to the intercept (\(\beta_0\)).
Why?: We want to shrink the coefficients of the predictors, but not the intercept, which represents the average value of the response when all predictors are zero (or at their mean values, if centered). Shrinking the intercept would bias the predictions.
Centering Predictors: If the predictors are centered (mean of zero) before performing ridge regression, then the estimated intercept will be the sample mean of the response: \(\hat{\beta}_0 = \bar{y}\). Centering simplifies the calculations and ensures the intercept has a meaningful interpretation.
Standardized ridge regression coefficients for the Credit data, as a function of \(\lambda\) and \(\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2\).
Figure 6.4: Shows ridge regression coefficient estimates for the Credit data.
Left Panel: Coefficients plotted against λ.
Right Panel: Coefficients plotted against \(\|\hat{\beta}_\lambda^R\|_2 / \|\hat{\beta}\|_2\).
\(\|\hat{\beta}_\lambda^R\|_2\) is the \(\ell_2\) norm of the ridge regression coefficient vector.
\(\|\hat{\beta}\|_2\) is the \(\ell_2\) norm of the least squares coefficient vector.
The x-axis can therefore be read as the amount by which the ridge regression coefficient estimates have been shrunken towards zero.
Scale Equivariance (Least Squares): Least squares coefficient estimates are scale equivariant. Multiplying a predictor by a constant c simply scales the corresponding coefficient by 1/c. The overall prediction remains unchanged.
Scale Dependence (Ridge Regression): Ridge regression coefficients can change substantially when a predictor is multiplied by a constant, because of the \(\sum_{j}\beta_j^2\) term in the penalty. Rescaling a predictor rescales its coefficient, which changes that coefficient's contribution to the penalty, so the ridge solution depends on the units of the predictors.
\[ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}} \]
This formula divides each predictor by its standard deviation (estimated on the training sample), so every standardized predictor has a standard deviation of one.
With all predictors on a common scale, the penalty treats them equally regardless of their original units.
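A minimal scikit-learn sketch of this workflow (synthetic data; the settings are illustrative): standardize the predictors, then let cross-validation choose \(\lambda\) (called alpha in scikit-learn).

```python
# Ridge regression with standardized predictors and lambda chosen by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

model = make_pipeline(
    StandardScaler(),                        # put every predictor on the same scale
    RidgeCV(alphas=np.logspace(-3, 3, 50)),  # grid of candidate penalty values
)
model.fit(X, y)
print(model.named_steps["ridgecv"].alpha_)   # the lambda selected by cross-validation
```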
Bias-Variance Trade-Off: Ridge regression’s advantage comes from the bias-variance trade-off. This is a fundamental concept in statistical learning.
Finding the Sweet Spot: The goal is to find a value of λ that reduces variance more than it increases bias, leading to a lower test MSE (Mean Squared Error).
Squared bias, variance, and test MSE for ridge regression on a simulated dataset.
Figure 6.5: Shows bias, variance, and test MSE for ridge regression on a simulated dataset.
As λ increases, variance decreases rapidly at first, with only a small increase in bias. This leads to a decrease in MSE.
Eventually, the decrease in variance slows, and the increase in bias accelerates, causing the MSE to increase. The penalty becomes too strong, and the model underfits.
The minimum MSE is achieved at a moderate value of λ. This is the optimal level of shrinkage.
Disadvantage of Ridge Regression: Ridge regression includes all p predictors in the final model. The penalty shrinks coefficients towards zero, but it doesn’t set any of them exactly to zero (unless λ = ∞). This can make interpretation difficult when p is large. It doesn’t perform variable selection.
The Lasso: An Alternative: The lasso is a more recent alternative to ridge regression that overcomes this disadvantage. It can perform variable selection.
\[ \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_jx_{ij})^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j| \]
Shrinkage and Selection: Like ridge regression, the lasso shrinks coefficients towards zero.
Key Difference: The \(\ell_1\) penalty has the effect of forcing some coefficients to be exactly zero when λ is sufficiently large. This is the crucial difference from ridge regression.
Variable Selection: This means the lasso performs variable selection! It automatically excludes some variables from the model.
Sparse Models: The lasso yields sparse models – models that involve only a subset of the variables. “Sparse” means that many of the coefficients are zero.
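A minimal sketch of this sparsity (synthetic data, illustrative only): with \(\lambda\) chosen by cross-validation, most of the estimated coefficients come out exactly zero.

```python
# The lasso produces sparse coefficient vectors: many entries are exactly zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

lasso = make_pipeline(StandardScaler(), LassoCV(cv=10))
lasso.fit(X, y)
coef = lasso.named_steps["lassocv"].coef_
print("non-zero coefficients:", int(np.sum(coef != 0)), "of", coef.size)
```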
Standardized lasso coefficients for the Credit data, as a function of \(\lambda\) and \(\|\hat{\beta}_\lambda^L\|_1 / \|\hat{\beta}\|_1\).
Contours of the error and constraint functions for the lasso (left) and ridge regression (right).
Why does the lasso set coefficients to zero, while ridge regression doesn’t?
The solution is the first point where the “ellipse” (contour of constant RSS) touches the constraint region. The ellipses represent combinations of coefficients that have the same RSS.
Because the lasso constraint has corners, the ellipse often intersects at an axis, setting one coefficient to zero. When the ellipse hits a corner of the diamond, one of the coefficients is zero.
Ridge regression’s circular constraint doesn’t have corners, so this rarely happens. The ellipse is unlikely to intersect the circle exactly on an axis.
Interpretability: The lasso has a major advantage in terms of interpretability, producing simpler models with fewer variables. This makes it easier to understand the key factors influencing the response.
Prediction Accuracy: Which method is better for prediction depends on the true underlying relationship between the predictors and the response. It’s data-dependent.
Unknown Truth: In practice, we don’t know which scenario is true. Cross-validation can help us choose the best approach for a particular dataset.
Lasso and ridge regression on a simulated dataset where all predictors are related to the response.
Lasso and ridge regression on a simulated dataset where only two predictors are related to the response.
We consider a simple special case: n = p, the design matrix X is a diagonal matrix with ones on the diagonal and zeros elsewhere, and the regression has no intercept.
Then, it can be shown:
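In this setting the least squares estimates are simply \(\hat{\beta}_j = y_j\), and the ridge and lasso estimates have closed forms:

\[ \hat{\beta}_j^R = \frac{y_j}{1 + \lambda} \]

\[ \hat{\beta}_j^L = \begin{cases} y_j - \lambda/2 & \text{if } y_j > \lambda/2 \\ y_j + \lambda/2 & \text{if } y_j < -\lambda/2 \\ 0 & \text{if } |y_j| \le \lambda/2 \end{cases} \]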
Ridge and lasso coefficient estimates in the simple case.
Figure 6.10 shows that:
Ridge regression shrinks every least squares coefficient by the same proportion.
The lasso shrinks each coefficient toward zero by a constant amount (\(\lambda/2\)); coefficients that are smaller than this amount in absolute value are shrunken all the way to zero (soft thresholding).
Cross-validation error and coefficient estimates for ridge regression on the Credit data.
Cross-validation error and coefficient estimates for the lasso on the simulated data from Figure 6.9.
Different Approach: Instead of working directly with the original predictors (X1, …, Xp), dimension reduction methods transform the predictors and then fit a least squares model using the transformed variables. They create new, fewer variables that are combinations of the original ones.
Linear Combinations: Create M linear combinations (Z1, …, ZM) of the original p predictors, where M < p.
\[ Z_m = \sum_{j=1}^{p}\phi_{jm}X_j \]
- \(\phi_{jm}\) are constants (weights) that define the linear combinations. Each \(Z_m\) is a weighted sum of all the original predictors.
\[ y_i = \theta_0 + \sum_{m=1}^{M}\theta_mz_{im} + \epsilon_i \]
- This reduces the problem from estimating \(p + 1\) coefficients (in the original model) to estimating \(M + 1\) coefficients, which can significantly simplify the model.
\[ \beta_j = \sum_{m=1}^{M}\theta_m\phi_{jm} \]
- The original coefficients (\(\beta_j\)) are now expressed in terms of the new coefficients (\(\theta_m\)) and the weights (\(\phi_{jm}\)).
Principal Components Analysis (PCA): PCA is a technique for deriving a low-dimensional set of features from a larger set of variables. (More detail in Chapter 12.) It finds new variables (principal components) that capture the most variation in the original data.
Unsupervised: PCA is an unsupervised method – it identifies linear combinations that best represent the predictors (X), without considering the response (Y). It only looks at the relationships among the predictors.
PCA seeks the directions along which the observations vary the most. These directions are the principal components.
First Principal Component: The first principal component is the direction in the data with the greatest variance. It’s the line that best captures the spread of the data.
Population size and ad spending for 100 cities. The first principal component is shown in green, and the second in blue.
\[ Z_1 = 0.839 \times (pop - \overline{pop}) + 0.544 \times (ad - \overline{ad}) \]
- 0.839 and 0.544 are the *principal component loadings* (the \(\phi_{jm}\) values). They define the direction of the first principal component.
- \(\overline{pop}\) and \(\overline{ad}\) are the means of pop and ad, respectively. The variables are centered (mean-subtracted).
\[ Z_{i1} = 0.839 \times (pop_i - \overline{pop}) + 0.544 \times (ad_i - \overline{ad}) \]
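A small sketch of how such loadings and scores are computed (hypothetical pop/ad values generated at random, not the advertising data):

```python
# Computing first principal component loadings and scores with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
pop = rng.normal(40.0, 10.0, size=100)           # hypothetical population sizes
ad = 0.6 * pop + rng.normal(0.0, 4.0, size=100)  # hypothetical ad spending, correlated with pop

X = np.column_stack([pop, ad])
pca = PCA(n_components=2)
scores = pca.fit_transform(X)     # z_i1 and z_i2 for each city (PCA centers the variables)
print(pca.components_[0])         # loadings (phi_11, phi_21) of the first principal component
```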
The first principal component direction, with distances to the observations shown as dashed lines.
Plots of the first principal component scores versus pop and ad.
More than One: You can construct up to p distinct principal components.
Second Principal Component: The second principal component (Z2) is a linear combination of the variables that is uncorrelated with Z1 and has the largest variance subject to that constraint.
Successive Components: Each subsequent principal component captures the maximum remaining variance, subject to being uncorrelated with the previous components. Each component captures a different “direction” of variation in the data.
Plots of the second principal component scores versus pop and ad.
The Idea: Use the first M principal components (Z1, …, ZM) as predictors in a linear regression model. This is PCR.
Assumption: We assume that the directions in which X1, …, Xp show the most variation are the directions that are associated with Y. This is the key assumption of PCR. If it holds, PCR can be effective.
Potential for Improvement: If this assumption holds, PCR can outperform least squares, especially when M << p, by reducing variance.
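A minimal PCR sketch (synthetic data; the number of components M is fixed here for illustration but would normally be chosen by cross-validation): standardize, project onto the first M principal components, then fit least squares on the scores.

```python
# Principal components regression as a scikit-learn pipeline.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

M = 5  # number of principal components; in practice chosen by cross-validation
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))  # training R^2; test error should be estimated separately
```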
PCR applied to two simulated datasets.
Figure 6.18: Shows PCR applied to the simulated datasets from Figures 6.8 and 6.9.
PCR with an appropriate choice of M can improve substantially over least squares.
However, in this example, PCR does not perform as well as ridge regression or the lasso. This is because the data were generated in a way that required many principal components to model the response well.
Linear Combinations: PCR is not a feature selection method. Each principal component is a linear combination of all p original features. It doesn’t exclude any of the original variables.
Example: In the advertising data, Z1 was a combination of both pop and ad.
Relationship to Ridge Regression: PCR is more closely related to ridge regression than to the lasso. Both involve using all the original predictors, albeit in a transformed way.
PCR standardized coefficient estimates and cross-validation MSE on the Credit data.
Figure 6.20: Shows cross-validation for PCR on the Credit data.
The lowest cross-validation error occurs with M = 10, which is almost no dimension reduction.
Standardization: It’s generally recommended to standardize each predictor before performing PCA (and thus PCR). This ensures that all variables are on the same scale. Otherwise, variables with larger variances will dominate the principal components.
Supervised Dimension Reduction: PLS is a supervised dimension reduction technique. Unlike PCR (which is unsupervised), PLS uses the response (Y) to help identify the new features (Z1, …, ZM). It takes the response into account when creating the linear combinations.
Goal: Find directions that explain both the response and the predictors. It tries to find components that are relevant to predicting the response.
Standardize Predictors: Standardize the p predictors (subtract the mean and divide by the standard deviation).
Simple Linear Regressions: Compute the coefficient from the simple linear regression of Y onto each Xj (separately for each predictor). This measures the individual relationship between each predictor and the response.
First PLS Direction: Set each \(\phi_{j1}\) in the equation for Z1 (Equation 6.16) equal to this coefficient. This means PLS places the highest weight on the variables that are most strongly related to the response (based on the simple linear regressions).
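A minimal sketch with scikit-learn's PLSRegression (synthetic data): on standardized predictors, its first weight vector is, up to normalization, the vector of simple regression coefficients described above.

```python
# Partial least squares with scikit-learn; M (n_components) would be chosen by CV.
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

pls = PLSRegression(n_components=2)  # two PLS directions, for illustration
pls.fit(X, y)                        # scikit-learn standardizes X and y internally by default
print(pls.x_weights_[:, 0])          # weights defining the first PLS direction
```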
First PLS direction (solid line) and first PCR direction (dotted line) for the advertising data.
Choosing M: The number of PLS directions (M) is a tuning parameter, typically chosen by cross-validation.
Standardization: It’s generally recommended to standardize both the predictors and the response before performing PLS.
Performance: In practice, PLS often performs no better than ridge regression or PCR. The supervised dimension reduction of PLS can reduce bias, but it also has the potential to increase variance.
Low-Dimensional Setting: Most traditional statistical techniques are designed for the low-dimensional setting, where n (number of observations) is much greater than p (number of features).
High-Dimensional Setting: In recent years, new technologies have led to a dramatic increase in the number of features that can be measured. We often encounter datasets where p is large, possibly even larger than n. This is the “high-dimensional” setting.
The lasso performed with varying numbers of features (p) and a fixed sample size (n).
Figure 6.24: The lasso fit with n = 100 observations and p = 20, 50, or 2000 features.
As the number of features increases, the test set error increases, highlighting the curse of dimensionality.
Three Key Points:
Multicollinearity is Extreme: In the high-dimensional setting, multicollinearity is extreme. Any variable can be written as a linear combination of all the other variables. This makes it impossible to isolate the effect of individual predictors.
Cannot Identify True Predictors: This means we can never know exactly which variables (if any) are truly predictive of the outcome. We can only identify variables that are correlated with the true predictors.
Caution in Reporting: Be very cautious when reporting results. Don’t overstate conclusions. Avoid claiming to have found the “true” predictors.
Never Use Training Data for Evaluation: Never use training data measures (sum of squared errors, p-values, R2) as evidence of a good model fit in the high-dimensional setting. These measures will be misleadingly optimistic. They will always look good, even if the model is terrible.
Use Test Data or Cross-Validation: Always report results on an independent test set or using cross-validation. These are the only reliable ways to assess model performance in high dimensions.
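A small illustration of this point (synthetic, pure-noise data): when p exceeds the number of training observations, least squares interpolates the training set, so the training \(R^2\) looks perfect while the test \(R^2\) is useless.

```python
# Training-data metrics are misleading in high dimensions: noise features still
# yield a perfect training fit, but the model has no predictive value.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 90))   # p = 90 features of pure noise
y = rng.normal(size=100)         # response unrelated to X

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
fit = LinearRegression().fit(X_tr, y_tr)
print("training R^2:", fit.score(X_tr, y_tr))  # essentially 1: the model interpolates
print("test R^2:    ", fit.score(X_te, y_te))  # near zero or negative
```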
Goals: These methods aim to improve prediction accuracy and model interpretability.
High-Dimensional Data: These methods are particularly important in the high-dimensional setting (p ≥ n), where least squares fails. They provide ways to fit models even when the number of predictors is large.
Cross-Validation: Cross-validation is a powerful tool for selecting tuning parameters and estimating the test error of different models. It’s essential for reliable model selection.
The choice of modeling method, the choice of tuning parameter, and the choice of assessment metrics all become especially important in high dimensions.
邱飞(peter) 💌 [email protected]