Introduction to Data Mining, Machine Learning, and Statistical Learning

Let’s start with the basics! What are data mining, machine learning, and statistical learning?

Data Mining: The process of discovering patterns, anomalies, and knowledge from large datasets. Think of it as “mining” for valuable insights in a mountain of data. ⛏️ We’re looking for hidden treasures!

Machine Learning: A subset of artificial intelligence (AI). It’s about enabling systems to learn from data without being explicitly programmed. Algorithms learn patterns and make predictions, like teaching a computer to learn by example. 🤖

Statistical Learning: A framework of tools for understanding data. It’s closely related to both data mining and machine learning, but with a stronger emphasis on statistical models and inference. 🤔

Note

These fields are highly interdisciplinary!

Relationship: Data Mining, Machine Learning, and Statistical Learning

graph LR
    A[Data Mining] --> C(Common Ground)
    B[Machine Learning] --> C
    D[Statistical Learning] --> C
    C --> E[Insights & Predictions]

  • Common Goal: All three aim to extract insights and make predictions from data. They’re different paths to the same destination! 🗺️
  • Data Mining: Often emphasizes discovering previously unknown patterns. It’s like exploratory detective work. 🕵️‍♀️
  • Machine Learning: Focuses on prediction accuracy. It’s like building a super-powered prediction machine. ⚙️
  • Statistical Learning: Emphasizes model interpretability and quantifying uncertainty. It’s like building a model and understanding how confident we are in its predictions. 📊

Why Go Beyond Linearity?

Linear models (like linear regression) are great! They’re simple, interpretable, and a good starting point. But… they have limitations.

Limitations of Linear Models:

  • Linearity Assumption: They assume a straight-line relationship between predictors and the response. This is often too simplistic for the real world.
  • Limited Predictive Power: If the true relationship is not linear, linear models will give poor predictions.
  • Additivity Assumption: The effect of a change in one predictor on the response does not depend on the values of the other predictors (no interactions).

Analogy

Imagine trying to fit a straight line through a curved set of points. You’ll miss the real pattern!

Linear Models vs. Reality (Example)

The figure shows simulated data.

  • Red line: The true (but unknown) relationship between the predictor and the response.
  • Blue line: The linear model we fitted to the data.

Key Takeaway: Linear models might not capture the full complexity of the relationship.

Examples of Linear Models

Here are some familiar examples of linear models:

  • Linear Regression: The foundation! Predicts a continuous response.
  • Ridge Regression: Adds a penalty to reduce model complexity and prevent overfitting. It shrinks the coefficients towards zero.
  • Lasso: Similar to ridge, but uses a different penalty that can perform feature selection (setting some coefficients to zero).
  • PCR (Principal Components Regression): Reduces dimensionality using PCA before applying linear regression.

Introduction to Non-Linear Approaches

This chapter is all about relaxing the linearity assumption! We’ll explore techniques that can capture curved relationships.

Goal: Find models that are both flexible (can fit complex patterns) and interpretable (we can understand how they work).

Non-Linear Approaches: Overview

Here’s a roadmap of what we’ll cover:

  • Polynomial Regression: Add polynomial terms (e.g., \(x^2\), \(x^3\)) to capture curves.
  • Step Functions: Divide the predictor’s range into bins and fit a constant in each. Like creating categories.
  • Regression Splines: More flexible! Piecewise polynomials with smoothness constraints.
  • Smoothing Splines: Minimize RSS plus a penalty for roughness. Balance fit and smoothness.
  • Local Regression: Fit a model locally using only nearby data.
  • Generalized Additive Models (GAMs): Extend these ideas to multiple predictors.

Polynomial Regression

Instead of a straight line:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

We use a polynomial:

\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + ... + \beta_d x_i^d + \epsilon_i \]

This allows us to model curves! 📈

Polynomial Regression: Important Points

  • Linear in Coefficients: The equation is linear in the coefficients (\(\beta_0, \beta_1, ...\)). This means we can still use least squares! 👍
  • Focus on the Fitted Function: We usually don’t care about the individual coefficients. We look at the overall shape of the fitted function.
  • Degree (d): The highest power (\(d\)) is the degree. We rarely use \(d > 3\) or \(4\) because high degrees can lead to overly flexible and strange curves.
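
Because the model is linear in the coefficients, a plain least-squares fit is all we need. Below is a minimal sketch in Python using simulated data (the simulated "ages" and variable names are illustrative, not the Wage data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(18, 80, 300)                                   # simulated "ages"
y = 50 + 2.5 * x - 0.025 * x**2 + rng.normal(0, 10, x.size)    # curved truth plus noise

# Design matrix with columns 1, x, x^2, x^3, x^4 -- linear in the coefficients
X = sm.add_constant(np.column_stack([x**d for d in range(1, 5)]))

fit = sm.OLS(y, X).fit()          # ordinary least squares still applies
print(fit.params)                 # beta_0 ... beta_4 (rarely interpreted individually)
```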

Polynomial Regression: Example (Wage Data)

We’ll use the Wage dataset (from the ISLR book) to predict wage based on age. Let’s see how a degree-4 polynomial fits the data.

Visualizing the Polynomial Fit (Wage Data)

  • Blue Curve: The degree-4 polynomial fit. It captures the non-linear trend!
  • Dashed Curves: 95% confidence interval. This shows the uncertainty in our fit.

Understanding the Confidence Interval

The confidence interval (dashed lines) tells us how much our fitted curve might vary. It’s calculated like this:

  1. Fitted Value: For a specific age (\(x_0\)), we get the fitted value: \(\hat{f}(x_0)\).
  2. Variance: We estimate the variance of the fit at that point: \(\text{Var}[\hat{f}(x_0)]\).
  3. Standard Error: The pointwise standard error is: \(\sqrt{\text{Var}[\hat{f}(x_0)]}\).
  4. Confidence Interval: The 95% confidence interval is: \(\hat{f}(x_0) \pm 2 \cdot \text{SE}[\hat{f}(x_0)]\).
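
The numbered steps above can be carried out directly from the least-squares output. A minimal sketch on simulated data (the variance formula \(\text{Var}[\hat{f}(x_0)] = x_0^\top \widehat{\text{Cov}}(\hat{\beta})\, x_0\) follows from \(\hat{f}(x_0) = x_0^\top \hat{\beta}\)):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(18, 80, 300)
y = 50 + 2.5 * x - 0.025 * x**2 + rng.normal(0, 10, x.size)

def poly_design(v, d=4):
    return sm.add_constant(np.column_stack([v**k for k in range(1, d + 1)]))

fit = sm.OLS(y, poly_design(x)).fit()

grid = np.linspace(18, 80, 100)
Xg = poly_design(grid)
fhat = Xg @ fit.params                                       # 1. fitted values f_hat(x0)
var = np.einsum("ij,jk,ik->i", Xg, fit.cov_params(), Xg)     # 2. Var[f_hat(x0)] at each grid point
se = np.sqrt(var)                                            # 3. pointwise standard errors
lower, upper = fhat - 2 * se, fhat + 2 * se                  # 4. approximate 95% confidence band
```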

Interpretation

We’re 95% confident that the true relationship lies within the dashed curves.

Polynomial Logistic Regression

We can also use polynomial terms in logistic regression to model a binary outcome (e.g., yes/no, 0/1).

Example: Model the probability that wage > 250 (i.e., earnings above $250,000 per year), given age.

Polynomial Logistic Regression: Equation

The equation looks like this:

\[ \text{Pr}(y_i > 250 | x_i) = \frac{\exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + ... + \beta_d x_i^d)}{1 + \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + ... + \beta_d x_i^d)} \]

This models the probability as a non-linear function of the predictor.
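
As a hedged sketch (simulated binary outcomes standing in for \(I(\text{wage} > 250)\); statsmodels fits the model by maximum likelihood):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
age = rng.uniform(18, 80, 1000)
z = (age - 50) / 30                                # rescale age for numerical stability
eta = -2.0 - 1.5 * z**2                            # hump-shaped "true" log-odds (illustrative)
high_earner = rng.binomial(1, 1 / (1 + np.exp(-eta)))

def poly(v, d=4):
    return sm.add_constant(np.column_stack([v**k for k in range(1, d + 1)]))

fit = sm.Logit(high_earner, poly(z)).fit(disp=0)   # logistic regression, polynomial in (scaled) age

grid = (np.linspace(18, 80, 100) - 50) / 30
prob = fit.predict(poly(grid))                     # fitted non-linear probability curve
```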

Visualizing Polynomial Logistic Regression (Wage Data)

  • Key Point: The confidence intervals are wider for older ages. This means we’re less certain about our predictions in that range, because we have less data.

Step Functions

  • Global vs. Local: Polynomial regression imposes a global structure (the same polynomial applies everywhere). Step functions are local.

  • How They Work:

    1. Bins: Divide the range of the predictor (\(X\)) into bins using cutpoints (\(c_1, c_2, ..., c_K\)).
    2. Constant Fit: Fit a constant within each bin. The predicted value is the same for all values of \(X\) within a bin.

Step Functions: Creating Indicator Variables

To do this, we create indicator variables:

\[\begin{aligned} C_0(X) &= I(X < c_1), \\ C_1(X) &= I(c_1 \le X < c_2), \\ C_2(X) &= I(c_2 \le X < c_3), \\ &\vdots \\ C_{K-1}(X) &= I(c_{K-1} \le X < c_K), \\ C_K(X) &= I(c_K \le X), \end{aligned}\]

where \(I(\cdot)\) is an indicator function. It’s 1 if the condition is true, and 0 otherwise.

Step Functions: Regression Equation

We then use least squares with \(C_1(X), C_2(X), ..., C_K(X)\) as predictors:

\[y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i\]

  • Multicollinearity: We exclude \(C_0(X)\) to avoid perfect multicollinearity (the indicator functions sum to 1).
  • Interpretation:
    • \(\beta_0\): Mean value of \(Y\) for \(X < c_1\) (the baseline group).
    • \(\beta_j\): Average increase in the response for \(X\) in bin \(j\), relative to the baseline.
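
A minimal sketch of the whole recipe on simulated data (pandas' cut builds the bins and get_dummies builds the indicator variables):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
age = rng.uniform(18, 80, 500)
wage = 40 + 0.9 * age - 0.008 * age**2 + rng.normal(0, 8, age.size)

cutpoints = [18, 35, 50, 65, 80]                         # c_1, ..., c_K plus the range endpoints
bins = pd.cut(age, bins=cutpoints, include_lowest=True)  # ordered categorical version of age
C = pd.get_dummies(bins, drop_first=True)                # drop C_0 to avoid perfect multicollinearity
C.columns = C.columns.astype(str)                        # readable column names for the fit

fit = sm.OLS(wage, sm.add_constant(C.astype(float))).fit()
print(fit.params)   # intercept = mean wage in the baseline bin; other betas = shifts vs. baseline
```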

Step Functions: Converting Continuous to Categorical

  • Ordered Categorical Variable: We’ve turned a continuous variable into an ordered categorical variable. The order of the categories matters!

Step Functions: Example (Wage Data)

Let’s fit step functions to the Wage data.

  • Left Panel: Piecewise constant fit to wage.
  • Right Panel: Fitted probabilities from logistic regression for wage > 250 (also using step functions).

Visualizing Step Functions (Wage Data)

  • Limitation: Step functions can miss overall trends (like the initial increase of wage with age).
  • Use Cases: Common in biostatistics and epidemiology (e.g., grouping ages into 5-year intervals).

Basis Functions: A General Framework

  • Generalization: Polynomial and piecewise-constant regression are special cases of the basis function approach.

  • How it Works:

    1. Choose Basis Functions: Select transformations to apply to \(X\): \(b_1(X), b_2(X), ..., b_K(X)\).

    2. Linear Model: Fit a linear model using these basis functions:

      \[y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + ... + \beta_K b_K(x_i) + \epsilon_i\]

Basis Functions: Important Points

  • Fixed and Known: The basis functions are fixed (we choose them).
  • Still a Linear Model!: This is still a linear model in the coefficients! We can use least squares.
  • Inference Tools: All the usual inference tools (hypothesis tests, confidence intervals) still apply! This is a huge advantage.

Regression Splines: Combining Polynomials and Piecewise-Constants

  • Motivation: Regression splines combine the best of both worlds! They’re more flexible than either approach alone.

  • Piecewise Polynomials: Fit separate low-degree polynomials in different regions of \(X\), instead of one high-degree polynomial everywhere.

Regression Splines: Piecewise Polynomials (Details)

  • Example: Piecewise Cubic:

    \[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i\]

    but the coefficients (\(\beta_0, \beta_1, \beta_2, \beta_3\)) change in different regions.

  • Knots: The points where the coefficients change are called knots.

Piecewise Polynomials: Example (One Knot)

Consider a piecewise cubic with one knot at \(c\):

\[y_i = \begin{cases} \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i & \text{if } x_i < c \\ \beta_{02} + \beta_{12}x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i & \text{if } x_i \ge c \end{cases}\]

Two different cubic polynomials!

Unconstrained Piecewise Cubic Polynomial

  • Problem: This fit is discontinuous at the knot. This is usually undesirable.
  • Degrees of Freedom: 8 (2 sets of 4 parameters).

Constraints and Splines: Making it Smooth

To make it better, we add constraints:

  • Continuity: The function must be continuous at the knot (the pieces must meet).
  • Continuous 1st Derivative: The first derivative must be continuous (no sharp turns).
  • Continuous 2nd Derivative: Requiring the second derivative to be continuous as well gives an even smoother fit; a cubic spline imposes all three constraints at each knot.

Visualizing Constraints: Unconstrained

Let’s see the effect of adding constraints. First, the unconstrained version:

Visualizing Constraints: Continuous

Now, with the continuity constraint:

Visualizing Constraints: Cubic Spline

Finally, the cubic spline (continuity, continuous 1st and 2nd derivatives):

Constraints and Degrees of Freedom

  • Reducing Degrees of Freedom: Each constraint reduces the degrees of freedom. We have fewer parameters to estimate.
  • Cubic Spline: A cubic spline with K knots has 4 + K degrees of freedom.
    • 4: From the base cubic polynomial.
    • K: One for each knot.

Linear Spline

A linear spline is continuous, fitting straight lines that meet at the knots:

The Spline Basis Representation

  • Basis Function Model: A cubic spline with K knots can be written as:

    \[y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + ... + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i\]

  • Simple Basis: A common choice uses \(x, x^2, x^3\) together with one truncated power basis function per knot:

    \[h(x, \xi) = (x - \xi)^3_+ = \begin{cases} (x-\xi)^3 & \text{if } x > \xi \\ 0 & \text{otherwise} \end{cases}\]

    where \(\xi\) is a knot.

The Spline Basis Representation: Truncated Power Basis

  • Truncated Power Basis Function: \(h(x, \xi)\) is zero until it reaches the knot (\(\xi\)), then it becomes a cubic. This adds flexibility locally.
  • Fitting: We fit this model using least squares.
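
A minimal sketch: build the truncated power basis by hand and fit it with ordinary least squares (simulated data; three illustrative knots):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, x.size)

def trunc_power(x, knot):
    """h(x, xi) = (x - xi)^3 for x > xi, and 0 otherwise."""
    return np.where(x > knot, (x - knot) ** 3, 0.0)

knots = [2.5, 5.0, 7.5]                                   # K = 3 knots
X = np.column_stack([np.ones_like(x), x, x**2, x**3] +
                    [trunc_power(x, k) for k in knots])   # intercept + (3 + K) predictors

beta, *_ = np.linalg.lstsq(X, y, rcond=None)              # plain least squares
fitted = X @ beta                                         # a cubic spline with 3 knots
```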

The Spline Basis Representation: Summary

  • Cubic Spline with K Knots: Use least squares with an intercept and 3 + K predictors: \(X, X^2, X^3, h(X, \xi_1), ..., h(X, \xi_K)\).

  • Boundary Variance: Splines can have high variance at the boundaries (extreme values of X).

Natural Splines

  • Boundary Constraints: A natural spline adds boundary constraints: the function is linear outside the boundary knots. This stabilizes the fit.

The figure shows a cubic spline and a natural cubic spline. The natural spline is linear beyond the boundary knots.

Choosing the Number and Locations of the Knots

  • Knot Placement: Where do we put the knots?

    • More Knots Where Function Varies: More knots where the function changes rapidly, fewer where it’s stable.
    • Uniform Placement: Often, place knots uniformly (e.g., at quantiles of the data).
    • Software Automation: Specify the degrees of freedom, and the software places the knots.

Choosing the Number and Locations of the Knots: Example

This natural cubic spline was fit by specifying 4 degrees of freedom and letting the software place the knots at quantiles of the data.

Choosing the Number of Knots: Cross-Validation

  • How Many Knots? Use cross-validation!

    1. Fit Splines: Fit splines with different numbers of knots.
    2. Cross-Validated RSS: Compute the cross-validated RSS for each.
    3. Minimize: Choose the number of knots that minimizes the cross-validated RSS. This gives the best generalization.
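
A hedged sketch of this search, assuming scikit-learn ≥ 1.0 (its SplineTransformer uses a B-spline basis, which yields the same fits as the truncated power basis):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 300).reshape(-1, 1)
y = np.sin(x.ravel()) + rng.normal(0, 0.3, x.shape[0])

cv_mse = {}
for n_knots in range(3, 11):                              # candidate numbers of knots
    model = make_pipeline(SplineTransformer(n_knots=n_knots, degree=3),  # cubic spline basis
                          LinearRegression())
    scores = cross_val_score(model, x, y, cv=10, scoring="neg_mean_squared_error")
    cv_mse[n_knots] = -scores.mean()                      # cross-validated MSE

best = min(cv_mse, key=cv_mse.get)                        # number of knots with lowest CV error
print(best, cv_mse[best])
```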

Cross-Validation Example

The plot shows ten-fold cross-validated MSE. Lower is better! The optimal degrees of freedom are where the curve is lowest.

Comparison to Polynomial Regression

  • Splines vs. Polynomials: Splines often give better results (lower test error).

  • Flexibility:

    • Splines: Increase flexibility by adding knots, keeping the degree fixed (usually cubic).
    • Polynomials: Need high degrees for flexibility, which can lead to problems.

Splines vs. Polynomials: Visual Comparison

  • Key Point: High-degree polynomials can oscillate wildly. Splines are more stable.

Smoothing Splines: A Different Approach

  • Alternative: Smoothing splines are another way to get a spline, but they take a different path.

  • Goal: Find a function \(g(x)\) that fits the data well (small RSS) and is smooth.

Smoothing Splines: The Optimization Problem

  • Minimization: Find \(g(x)\) that minimizes:

    \[\sum_{i=1}^{n}(y_i - g(x_i))^2 + \lambda \int g''(t)^2 dt\]

  • Loss + Penalty: This is a loss + penalty formulation.

    • Loss: \(\sum_{i=1}^{n}(y_i - g(x_i))^2\) (RSS). Measures fit.
    • Penalty: \(\lambda \int g''(t)^2 dt\). Penalizes roughness.

Smoothing Splines: Understanding the Penalty

  • Roughness Penalty: \(\lambda \int g''(t)^2 dt\) measures the total change in the function’s slope.

    • \(g''(t)\): Second derivative. Measures the rate of change of the slope.
    • \(\int g''(t)^2 dt\): Integrates the squared second derivative. Large value = wiggly. Small value = smooth.
    • \(\lambda\): Tuning parameter. Controls the trade-off between fit and smoothness.

Smoothing Splines: The Role of λ

  • Effect of λ:
    • Larger \(\lambda\) \(\Rightarrow\) smoother \(g\).
    • \(\lambda = 0\): \(g\) interpolates the data (perfect fit, but probably overfit).
    • \(\lambda \to \infty\): \(g\) becomes a straight line (least squares line).
  • Bias-Variance Trade-Off: \(\lambda\) controls the bias-variance trade-off.
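
A hedged sketch, assuming SciPy ≥ 1.10 (its make_smoothing_spline solves this penalized problem; its lam argument plays the role of \(\lambda\)):

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, x.size)

wiggly = make_smoothing_spline(x, y, lam=1e-4)   # small lambda: close to interpolating the data
smooth = make_smoothing_spline(x, y, lam=10.0)   # large lambda: approaches the least-squares line
auto = make_smoothing_spline(x, y)               # lam=None: chosen automatically by GCV

grid = np.linspace(0, 10, 100)
print(wiggly(grid)[:3], smooth(grid)[:3], auto(grid)[:3])
```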

Smoothing Splines: The Solution

  • Natural Cubic Spline: The solution is a natural cubic spline!
  • Knots at Data Points: This spline has knots at every unique \(x_i\) (every data point)! The penalty (\(\lambda\)) controls the complexity.

Choosing the Smoothing Parameter (λ)

  • Effective Degrees of Freedom: \(\lambda\) controls the effective degrees of freedom (\(df_{\lambda}\)).
    • As \(\lambda\) increases, \(df_{\lambda}\) decreases from \(n\) to 2.
  • LOOCV: Use leave-one-out cross-validation (LOOCV) to choose \(\lambda\).

LOOCV for Smoothing Splines

The LOOCV formula is:

\[\text{RSS}_{cv}(\lambda) = \sum_{i=1}^{n} \left( \frac{y_i - \hat{g}_{\lambda}^{(-i)}(x_i)}{1 - \{\mathbf{S}_{\lambda}\}_{ii}} \right)^2\]

  • \(\hat{g}_{\lambda}^{(-i)}(x_i)\): Fitted value leaving out observation \(i\).
  • \(\mathbf{S}_{\lambda}\): The smoother matrix (it depends on \(\lambda\)), which maps the responses to the fitted values: \(\hat{\mathbf{g}}_{\lambda} = \mathbf{S}_{\lambda} \mathbf{y}\).
  • Efficiency: We don’t need to refit \(n\) times! LOOCV can be computed efficiently.
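
A small numerical check of this leverage shortcut for a generic linear smoother (here ordinary least squares, whose smoother matrix is the hat matrix; the same identity is what makes LOOCV cheap for smoothing splines):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x, x**2])          # any linear smoother works; OLS used here
y = np.sin(x) + rng.normal(0, 0.3, n)

S = X @ np.linalg.solve(X.T @ X, X.T)               # smoother matrix: y_hat = S @ y
y_hat = S @ y
shortcut = np.sum(((y - y_hat) / (1 - np.diag(S))) ** 2)

brute = 0.0                                         # brute force: actually refit n times
for i in range(n):
    keep = np.arange(n) != i
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    brute += (y[i] - X[i] @ beta) ** 2

print(np.isclose(shortcut, brute))                  # True -- no need to refit n times
```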

Smoothing Splines: Example (Wage Data)

Let’s see smoothing splines on the Wage data.

  • Red Curve: 16 effective degrees of freedom.
  • Blue Curve: \(\lambda\) chosen by LOOCV (6.8 effective degrees of freedom).

Visualizing Smoothing Splines (Wage Data)

  • Key Point: We often prefer the simpler model (blue curve) unless there’s strong evidence for the more complex one.

Local Regression: Fitting Locally

  • Local Fits: Fit a model locally, using only nearby training points.

  • Algorithm:

    1. Gather Neighbors: For a target point \(x_0\), gather the fraction \(s = k/n\) of closest points (\(k\) neighbors, \(n\) total points).

    2. Assign Weights: Weight each neighbor: \(K_{i0} = K(x_i, x_0)\).

      • Farthest points: Weight = 0.
      • Closest points: Highest weight.
    3. Weighted Least Squares: Fit weighted least squares:

      \[\text{minimize } \sum_{i=1}^{n} K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2\]

    4. Fitted Value: \(\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0\).
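
A minimal sketch of these four steps (the tricube kernel used here is the usual loess choice, but it is an assumption, since the slide leaves \(K\) unspecified):

```python
import numpy as np

def local_linear(x, y, x0, span=1/3):
    """Return f_hat(x0) from a weighted least-squares line fit around x0."""
    n = len(x)
    k = max(2, int(np.ceil(span * n)))                 # 1. gather the s = k/n closest points
    dist = np.abs(x - x0)
    nearest = np.argsort(dist)[:k]
    w = np.zeros(n)
    w[nearest] = (1 - (dist[nearest] / dist[nearest].max()) ** 3) ** 3   # 2. tricube weights
    X = np.column_stack([np.ones(n), x])
    XtW = X.T * w                                      # 3. weighted least squares
    b0, b1 = np.linalg.solve(XtW @ X, XtW @ y)
    return b0 + b1 * x0                                # 4. fitted value at x0

rng = np.random.default_rng(8)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, x.size)

grid = np.linspace(0.5, 9.5, 50)
fit = np.array([local_linear(x, y, x0) for x0 in grid])   # evaluate the fit along a grid
```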

Local Regression: Illustration

The shaded region shows the fraction \(s = 1/3\) of training points closest to the target point \(x_0\).

Local Regression: The Span (s)

  • Key Choice: The Span (s): Determines the neighborhood size.

    • Smaller \(s\): More local, wigglier fit.
    • Larger \(s\): Smoother, more global fit.
  • Cross-Validation: Use cross-validation to choose \(s\).

Local Regression: Example (Wage Data)

The figure shows two local linear fits. A smaller span (s) gives a wigglier curve, and a larger span gives a smoother curve.

Local Regression: Generalizations

  • Varying Coefficient Model: Local regression is a type of varying coefficient model (coefficients vary locally).
  • Multi-dimensional: Can be generalized to multiple predictors.

Generalized Additive Models (GAMs)

  • Extending Linear Regression: GAMs allow non-linear functions of predictors, while maintaining additivity.

  • The GAM Equation:

    \[y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + ... + f_p(x_{ip}) + \epsilon_i\]

    • \(f_j\): A (potentially) non-linear function for predictor \(j\).

GAMs: Key Features

  • Non-Linear Functions: Allow non-linear \(f_j\) for each predictor.
  • Additivity: The model is additive. Effects of predictors are added together. This helps with interpretability.
  • Building Blocks: Use various functions for \(f_j\):
    • Splines (natural, smoothing)
    • Local regression
    • Polynomials
    • Step functions
    • Other basis functions
  • Fitting: Use backfitting.
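
As a minimal sketch of backfitting (two predictors; the univariate smoother here is just a cubic polynomial fit of the partial residuals, an illustrative stand-in for a spline or local-regression smoother):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400
x1 = rng.uniform(-3, 3, n)
x2 = rng.uniform(-3, 3, n)
y = np.sin(x1) + 0.5 * x2**2 + rng.normal(0, 0.2, n)    # additive truth

def smoother(x, partial_resid, degree=3):
    """Univariate smoother: a cubic polynomial fit to the partial residuals."""
    return np.polyval(np.polyfit(x, partial_resid, degree), x)

beta0 = y.mean()
f1 = np.zeros(n)
f2 = np.zeros(n)

for _ in range(20):                                     # backfitting iterations
    f1 = smoother(x1, y - beta0 - f2)                   # update f1 holding f2 fixed
    f1 -= f1.mean()                                     # center each f_j so beta0 stays identifiable
    f2 = smoother(x2, y - beta0 - f1)                   # update f2 holding f1 fixed
    f2 -= f2.mean()

print(np.var(y - (beta0 + f1 + f2)))                    # small residual variance: additive fit works
```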

GAMs: Example (Wage Data)

\[ \text{wage} = \beta_0 + f_1(\text{year}) + f_2(\text{age}) + f_3(\text{education}) + \epsilon \]
  • year, age: Quantitative.
  • education: Qualitative (5 levels).

GAMs: Example (Wage Data) - Implementation

  • Natural Splines: Use natural splines for year and age.
  • Dummy Variables: Use dummy variables for education.
  • Large Regression: The GAM is a large regression on spline basis variables and dummy variables.
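
A hedged sketch of that single large regression, using a simulated stand-in for the Wage data (the column names and education levels are hypothetical) and patsy's bs() B-spline basis in place of the natural splines mentioned above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n = 1000
wage_df = pd.DataFrame({                               # hypothetical stand-in for the Wage data
    "year": rng.integers(2003, 2010, n),
    "age": rng.uniform(18, 80, n),
    "education": rng.choice(["<HS", "HS", "SomeCollege", "College", "Advanced"], n),
})
edu_effect = wage_df["education"].map({"<HS": 0, "HS": 8, "SomeCollege": 15,
                                       "College": 25, "Advanced": 40})
wage_df["wage"] = (60 + 1.0 * (wage_df["year"] - 2003)
                   + 30 * np.exp(-((wage_df["age"] - 45) / 15) ** 2)
                   + edu_effect + rng.normal(0, 10, n))

# One big least-squares fit: spline bases for year and age, dummy variables for education
gam_fit = smf.ols("wage ~ bs(year, df=4) + bs(age, df=5) + C(education)", data=wage_df).fit()
print(gam_fit.params.head())
```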

GAMs: Example (Wage Data) - Natural Splines

Each panel shows the effect of a predictor, holding others constant. Year and age are non-linear. Education’s effect is shown by level shifts.

GAMs: Example (Wage Data) - Smoothing Splines

  • Smoothing Splines: Can also use smoothing splines.
  • Similar Results: Natural and smoothing splines usually give similar results.

GAMs for Classification Problems

  • Qualitative Response: GAMs can be used when \(Y\) is qualitative.

  • Logistic Regression GAM: For a binary response:

    \[\log\left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + f_1(X_1) + f_2(X_2) + ... + f_p(X_p)\]

    where \(p(X) = \text{Pr}(Y = 1 | X)\).
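
A hedged sketch of the classification version, again as one large regression on spline bases and dummy variables, fit as a binomial GLM in statsmodels (simulated, hypothetical stand-in data as before):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 2000
df = pd.DataFrame({                                   # hypothetical stand-in for the Wage data
    "year": rng.integers(2003, 2010, n),
    "age": rng.uniform(18, 80, n),
    "education": rng.choice(["<HS", "HS", "SomeCollege", "College", "Advanced"], n),
})
eta = (-4 + 2.5 * np.exp(-((df["age"] - 50) / 12) ** 2)
       + df["education"].map({"<HS": -1.0, "HS": 0.0, "SomeCollege": 0.5,
                              "College": 1.0, "Advanced": 1.8}))
df["high_wage"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))   # stand-in for I(wage > 250)

# Additive model for the log-odds: spline terms for year and age, dummies for education
logit_gam = smf.glm("high_wage ~ bs(year, df=4) + bs(age, df=5) + C(education)",
                    data=df, family=sm.families.Binomial()).fit()
print(logit_gam.params.head())
```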

GAMs for Classification: Example

  • Example: Predict wage > 250 using a GAM.

This shows a logistic regression GAM. Year and age have non-linear effects. The plot for education shows the effect of each level on the log-odds of high earnings.

Pros and Cons of GAMs

  • Pros:
    • Automatic Non-Linearity: Model non-linearity automatically.
    • Potentially More Accurate: Better predictions when relationships are non-linear.
    • Interpretability: Additivity helps with interpretation.
    • Smoothness Summarization: Smoothness is summarized by effective degrees of freedom.
  • Cons:
    • Additivity Restriction: GAMs are additive. They can miss interactions (unless explicitly added).
    • Interaction Handling
      • The effect of a change in one predictor on the response may depend on the values of other predictors; this is called an interaction.
      • By default, a GAM assumes no interactions between predictors.
      • We can manually add interaction terms, e.g., a joint term of two predictors: y ~ x1 + x2 + f(x3, x4).

Summary

  • Beyond Linearity: We explored techniques for going beyond linear models.
  • Non-Linear Techniques: Learned about:
    • Polynomial regression
    • Step functions
    • Regression splines
    • Smoothing splines
    • Local regression
    • GAMs
  • Flexibility and Interpretability: These methods offer more flexibility while maintaining interpretability.
  • Cross-Validation: Crucial for choosing tuning parameters to avoid overfitting.
  • GAMs: Extend non-linear ideas to multiple predictors.

Thoughts and Discussion

  • Linear vs. Non-Linear: When should we choose a linear model over a non-linear one? Consider the trade-off between simplicity and accuracy.
  • Comparing Techniques: How do the techniques compare in flexibility and interpretability?
  • Beyond GAMs: When might a GAM not be enough? (Complex interactions, highly non-linear relationships).
  • Interactions in GAMs: How can you add interactions? Trade-offs?
  • Smoothing Splines vs. Regression Splines: Discuss the trade-offs between smoothing and regression splines. Consider computation, implementation, and knot placement.