Chapter 5: Linear Models

Welcome to Chapter 5: Linear Models

  • The intersection of machine learning and econometrics.
  • The art and science of finding linear relationships in data.
  • An indispensable foundation for building more complex models.

Goal 1: Build Powerful Predictive Capabilities

Linear models are the workhorses of economic forecasting.

  • Macroeconomics: Predicting GDP growth, inflation, and unemployment rates.
  • Financial Markets: Forecasting stock returns and asset price volatility.
  • Micro-level Behavior: Predicting consumer purchasing habits and corporate sales.

Mastering them means you possess the fundamental tools to quantitatively forecast the future.

Goal 2: Conduct Rigorous Causal Inference

When certain assumptions are met (e.g., no omitted variables), linear models are the gold standard for causal inference.

  • Policy Evaluation: What is the impact of a new tax policy on consumption?
  • Socioeconomics: How does education level affect lifetime income?
  • Business Decisions: What is the precise effect of a price adjustment on product sales?

This allows us to not only predict ‘what’ but also to explain ‘why’.

Goal 3: Establish the Foundation for Advanced Models

Nearly all modern, advanced machine learning models incorporate linear transformations at their core.

  • Neural Networks: Each neuron performs a weighted sum (a linear transformation) followed by a non-linear activation function.
  • Factor Models: In finance, asset returns are modeled as linear exposures to various risk factors.
  • Generalized Linear Models (GLMs): Extend the linear predictor to data with various distributions via a link function.

A solid understanding of linear models is the gateway to the broader world of machine learning.

This Chapter’s Learning Roadmap

We will follow a path from simple to complex to comprehensively master the ‘family’ of linear models.

Core Topic                 Key Models                       Core Problem Solved
Basic Linear Models        Linear Regression                Predicting continuous values (e.g., prices)
                           Logistic Regression              Predicting probabilities/binary classes
Regularization/Sparsity    Ridge & Lasso Regression         Preventing overfitting, feature selection
Maximum-Margin Idea        Support Vector Machine (SVM)     Finding the most robust decision boundary
Handling Complexity        Multiclass & Class Imbalance     Tackling more complex real-world tasks

The Core Idea: What is a Linear Model?

The central assumption of a linear model is that the target variable can be expressed as a weighted sum of the input features.

For a sample \(\mathbf{x} = (x_1, x_2, \ldots, x_d)\) with d features, the prediction function is:

\[ \large{f(\mathbf{x}; \mathbf{w}, b) = w_1x_1 + w_2x_2 + \ldots + w_dx_d + b} \]

Using vector notation, this can be written concisely as:

\[ \large{f(\mathbf{x}; \mathbf{w}, b) = \mathbf{w}^T\mathbf{x} + b} \]

  • \(\mathbf{w} = (w_1, \ldots, w_d)\): The weight vector, which determines the importance of each feature.
  • \(b\): The bias or intercept term, which acts as the model’s baseline.
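As a quick illustration, here is a minimal NumPy sketch of this prediction function; the feature values, weights, and bias below are made up for the example.

import numpy as np

# Hypothetical sample with d = 3 features (e.g., GDP growth, inflation, unemployment)
x = np.array([2.1, 3.0, 4.5])
w = np.array([0.8, -0.5, -0.3])   # weight vector: importance of each feature
b = 1.2                           # bias / intercept: the model's baseline

# f(x; w, b) = w^T x + b
f_x = w @ x + b
print(f_x)   # ≈ 0.03: a single real-valued prediction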

Anatomy of a Linear Model

This diagram illustrates how a linear model combines input features to produce a single predictive value.

Figure: Anatomy of a linear model — the input features x₁ … x_d are multiplied by the weights w₁ … w_d, summed together with the bias term b, and produce the output f(x) = wᵀx + b.

The Geometry: It Defines a Decision Hyperplane

The model’s equation, \(\mathbf{w}^T\mathbf{x} + b = 0\), geometrically defines a hyperplane.

  • In 2D space (d=2): This is a straight line (\(w_1x_1 + w_2x_2 + b = 0\)).
  • In 3D space (d=3): This is a flat plane (\(w_1x_1 + w_2x_2 + w_3x_3 + b = 0\)).

This hyperplane divides the feature space into two halves, forming the decision boundary for all linear classifiers.

Hyperplanes Take Different Forms in Different Dimensions

Figure: Hyperplanes in different dimensions — a point in 1D space, a line in 2D space, and a plane in 3D space.

Example: A Linear Classifier in 2D Space

Suppose we predict if the economy is in an ‘expansion’ (blue circles) or ‘recession’ (red diamonds) based on two indicators: \(x_1\) (GDP growth) and \(x_2\) (inflation). A linear classifier finds a line to separate these two classes.

Figure: A linear decision boundary in the (x₁ = GDP growth, x₂ = inflation) plane separating 'Expansion' (blue circles) from 'Recession' (red diamonds), with the weight vector w drawn orthogonal to the boundary.

The Weight Vector w Determines the Hyperplane’s Orientation

The weight vector \(\mathbf{w}\) is not just for weighting features; geometrically, it is always perpendicular to the decision hyperplane.

  • The direction of \(\mathbf{w}\) points in the direction of the fastest increase in the function \(f(\mathbf{x})\).
  • For any two points \(\mathbf{x}_A, \mathbf{x}_B\) on the hyperplane, we have \(\mathbf{w}^T(\mathbf{x}_A - \mathbf{x}_B) = 0\), proving that \(\mathbf{w}\) is orthogonal to any vector lying within the hyperplane.

In the previous slide, the vector \(\mathbf{w}\) (teal arrow) is perpendicular to the decision boundary (grey line).

From Geometry to Prediction: The Decision Rule

Once we have the hyperplane \(\mathbf{w}^T\mathbf{x} + b = 0\), classification is straightforward.

For a new data point \(\mathbf{x}\), we compute the value of \(f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b\):

  • If \(f(\mathbf{x}) > 0\), the point lies on the side of the hyperplane pointed to by \(\mathbf{w}\). We predict it as the positive class (e.g., label +1).
  • If \(f(\mathbf{x}) < 0\), the point lies on the other side. We predict it as the negative class (e.g., label -1).

This decision function is often written as: \(\hat{y} = \text{sign}(\mathbf{w}^T\mathbf{x} + b)\).

Key Math: Distance from a Point to the Hyperplane

What is the distance from a sample point \(\mathbf{x}\) to the decision boundary \(\mathbf{w}^T\mathbf{x} + b = 0\)? This concept is crucial for Support Vector Machines (SVMs).

From analytic geometry, the distance \(r\) is given by:

\[ \large{r = \frac{|\mathbf{w}^T\mathbf{x} + b|}{\|\mathbf{w}\|}} \]

Where \(\|\mathbf{w}\|\) is the L2 norm (Euclidean length) of the weight vector, \(\|\mathbf{w}\| = \sqrt{w_1^2 + w_2^2 + \ldots + w_d^2}\).

This formula shows that \(f(\mathbf{x})\) does double duty: its sign gives us the class, and its magnitude is proportional to the point’s distance from the boundary.
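A small sketch (with made-up numbers) that ties these two ideas together: the sign of \(f(\mathbf{x})\) gives the predicted class, and \(|f(\mathbf{x})| / \|\mathbf{w}\|\) gives the distance to the boundary.

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weights defining the hyperplane
b = -0.5                    # hypothetical bias
x = np.array([1.5, 0.5])    # a new sample to classify

f_x = w @ x + b                          # signed score: 2*1.5 - 1*0.5 - 0.5 = 2.0
y_hat = np.sign(f_x)                     # decision rule: +1 (positive class)
distance = abs(f_x) / np.linalg.norm(w)  # distance to the hyperplane: 2.0 / sqrt(5)

print(f'score={f_x:.2f}, class={y_hat:+.0f}, distance={distance:.3f}')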

5.1 From Classification to Regression: Linear Regression

If our goal is not to predict a discrete class (like ‘Expansion/Recession’) but a continuous value (like house prices, stock prices), the linear model becomes Linear Regression.

We directly use the model’s output as the predicted value:

\[ \large{\hat{y} = f(\mathbf{x}; \mathbf{w}, b) = \mathbf{w}^T\mathbf{x} + b} \]

The objective now becomes: Find the best \(\mathbf{w}\) and \(b\) such that the predicted value \(\hat{y}\) is as close as possible to the true observed value \(y\).

Measuring ‘Closeness’: Mean Squared Error (MSE)

We use a Loss Function to quantify the ‘error’ of our predictions. For linear regression, the most common loss function is the Mean Squared Error (MSE).

For a dataset of \(N\) samples \(\{(\mathbf{x}_n, y_n)\}_{n=1}^N\), the MSE is defined as:

\[ \large{J(\mathbf{w}, b) = \frac{1}{N} \sum_{n=1}^N (y_n - \hat{y}_n)^2 = \frac{1}{N} \sum_{n=1}^N (y_n - (\mathbf{w}^T\mathbf{x}_n + b))^2} \]

Our goal is to find the \(\mathbf{w}\) and \(b\) that minimize this \(J(\mathbf{w}, b)\). This is the famous Least Squares Method.

Geometric Intuition of the MSE Loss Function

The method of least squares seeks to find a line that minimizes the sum of the squared vertical distances (residuals) from each data point to the line.

Figure: Ordinary least squares intuition — data points, a fitted regression line, and the vertical residuals \(\varepsilon_i\) whose squared sum \(\sum_i \varepsilon_i^2\) is minimized.

Solution 1: The Normal Equation Provides a Direct Formula

Because the MSE loss function is convex with respect to \(\mathbf{w}\) and \(b\), we can find its minimum using calculus. By merging \(b\) into \(\mathbf{w}\) (adding a constant feature of 1 to each \(\mathbf{x}\)), the loss function simplifies to \(J(\mathbf{w}) = \frac{1}{N} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2\). Taking the gradient with respect to \(\mathbf{w}\) and setting it to zero, \(\nabla_{\mathbf{w}} J(\mathbf{w}) = 0\), yields a closed-form solution called the Normal Equation:

\[ \large{\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}} \]

  • \(\mathbf{X}\) is the \(N \times (d+1)\) design matrix, where each row is a sample.
  • \(\mathbf{y}\) is the \(N \times 1\) vector of true labels.

The derivation is standard calculus: take the partial derivative of the loss with respect to the parameters, set it to zero, and solve. Because MSE is convex, this critical point is the global minimum. The formula is theoretically elegant: as long as \(\mathbf{X}^T\mathbf{X}\) is invertible, the optimal weights can be computed in a single step.

The Normal Equation Has Pros and Cons

Pros:

  • One-step Solution: No iteration needed; gives the exact optimal solution.
  • No Hyperparameters: No learning rate to tune.

Cons:

  • Computationally Expensive: Inverting the matrix \((\mathbf{X}^T\mathbf{X})^{-1}\) has a complexity of roughly \(O(d^3)\). It becomes extremely slow when the number of features, \(d\), is very large.
  • Non-invertible Matrix: If features are multicollinear, or if the number of features exceeds the number of samples, \(\mathbf{X}^T\mathbf{X}\) is not invertible, and the equation cannot be solved.

Solution 2: Gradient Descent Iteratively Finds the Minimum

When the number of features is large, we typically use an iterative method called Gradient Descent.

The Core Idea: Imagine a blindfolded person trying to walk to the bottom of a valley. At each step, they feel for the steepest path downhill (the opposite direction of the gradient) and take a small step. They repeat this until they reach the valley floor (the minimum of the loss function).

Figure: Visualizing gradient descent — a contour plot of the loss surface with the optimization path moving opposite to the gradient at each step toward the global minimum.

The algorithm proceeds as follows (a minimal sketch appears after the steps):
  1. Randomly initialize \(\mathbf{w}\) and \(b\).
  2. Compute the gradient of the loss function with respect to \(\mathbf{w}\) and \(b\).
  3. Update the parameters in the opposite direction of the gradient: \(\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}}J\); \(b \leftarrow b - \eta \nabla_{b}J\)
  4. Repeat steps 2 and 3 until convergence. (\(\eta\) is the learning rate).
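A minimal NumPy sketch of batch gradient descent for the MSE loss on synthetic data, compared against the normal-equation solution; the data, learning rate, and iteration count are illustrative choices, not prescriptions.

import numpy as np

# Synthetic data: y = 3*x1 - 2*x2 + 1 + noise
rng = np.random.default_rng(0)
N, d = 200, 2
X = rng.normal(size=(N, d))
y = X @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.1, size=N)

# Gradient descent on J(w, b) = (1/N) * sum (y - (Xw + b))^2
w, b = np.zeros(d), 0.0
eta = 0.1                                  # learning rate
for _ in range(500):
    err = X @ w + b - y                    # prediction errors
    grad_w = 2 / N * X.T @ err             # dJ/dw
    grad_b = 2 / N * err.sum()             # dJ/db
    w -= eta * grad_w
    b -= eta * grad_b

# Normal equation on the augmented design matrix [X, 1]
X_aug = np.hstack([X, np.ones((N, 1))])
w_star = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

print('gradient descent :', np.round(np.append(w, b), 3))
print('normal equation  :', np.round(w_star, 3))

With enough iterations, the two solutions should agree to several decimal places, which is a useful sanity check when implementing gradient descent by hand.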

Practice: Predicting California Housing Prices

Let’s tackle a classic problem: predicting house prices. We’ll use the California Housing dataset available in scikit-learn.

Task: Build a linear regression model to predict the median house value based on several features (e.g., median income in the block, house age, etc.).

import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the data
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='MedHouseVal ($100k)')

print('Features (first 5 rows):')
print(X.head())
print('\nTarget (Median House Value, first 5 rows):')
print(y.head())
Features (first 5 rows):
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  

Target (Median House Value, first 5 rows):
0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal ($100k), dtype: float64

Data Preparation: Splitting into Training and Test Sets

To evaluate a model’s ability to generalize, we must split our data into two parts:

  • Training Set: Used to learn the model parameters \(\mathbf{w}\) and \(b\).
  • Test Set: Data the model has never seen, used to evaluate its performance on unknown data.
Figure: Train-test split — the full dataset is divided into a training set (e.g., 80%) used for model learning and parameter tuning, and a test set (e.g., 20%) used to evaluate final generalization performance.

Python Code: Training and Evaluating the Model

We use scikit-learn’s LinearRegression to implement the model.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Reload data to keep the code block self-contained
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target)

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# 3. Make predictions on the test set
y_pred = model.predict(X_test)

# 4. Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error (MSE) on Test Set: {mse:.4f}')
print(f'R-squared (R²) on Test Set: {r2:.4f}')
Mean Squared Error (MSE) on Test Set: 0.5559
R-squared (R²) on Test Set: 0.5758

Results: Predicted vs. Actual Values

For a good model, the predicted values should be closely aligned with the actual values. We can visualize this relationship with a scatter plot.

Figure 1: Housing Price Predictions vs. Actual Values

If the points fall perfectly on the red line, the prediction is perfect. Our model shows a clear positive trend, but there is still room for improvement.

Interpreting Coefficients Reveals Feature Importance

One of the greatest advantages of linear models is interpretability. We can directly inspect the learned weights \(\mathbf{w}\) (coefficients).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing

# Reload and train for a self-contained block
housing = fetch_california_housing()
X_df = pd.DataFrame(housing.data, columns=housing.feature_names)
y_s = pd.Series(housing.target)
X_train, _, y_train, _ = train_test_split(X_df, y_s, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Inspect coefficients
coeffs = pd.Series(model.coef_, index=X_df.columns).sort_values()
print('Regression Coefficients (Weights) for each feature:')
print(coeffs)
Regression Coefficients (Weights) for each feature:
Longitude    -0.433708
Latitude     -0.419792
AveRooms     -0.123323
AveOccup     -0.003526
Population   -0.000002
HouseAge      0.009724
MedInc        0.448675
AveBedrms     0.783145
dtype: float64
  • Positive Coefficient: As this feature increases, the house price tends to increase (e.g., MedInc).
  • Negative Coefficient: As this feature increases, the house price tends to decrease.
  • Caution: The magnitude of coefficients can only be directly compared to gauge importance if the features are on a similar scale.
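One common way to make the magnitudes comparable is to standardize the features before fitting; a sketch of this approach, reusing the same dataset and split as above:

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

housing = fetch_california_housing()
X_df = pd.DataFrame(housing.data, columns=housing.feature_names)
y_s = pd.Series(housing.target)
X_train, _, y_train, _ = train_test_split(X_df, y_s, test_size=0.2, random_state=42)

# Standardize features to mean 0 and standard deviation 1, then fit OLS
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

# Each coefficient now reflects the effect of a one-standard-deviation change in that feature
std_coeffs = pd.Series(pipe.named_steps['linearregression'].coef_, index=X_df.columns)
print(std_coeffs.sort_values())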

Problem: What if Features are Numerous or Correlated?

Standard linear regression (OLS) runs into trouble in certain situations:

  1. Overfitting: When the number of features \(d\) is close to or exceeds the number of samples \(N\), the model can become overly complex. It may fit the training data perfectly but perform poorly on new data.
  2. Multicollinearity: When features are highly correlated (e.g., using both ‘house area’ and ‘number of rooms’), the \(\mathbf{X}^T\mathbf{X}\) matrix becomes nearly singular (non-invertible). This leads to unstable and unreliable weight estimates \(\mathbf{w}\).

A Visual Example of Overfitting

Figure: Good fit vs. overfitting — a well-fitted model captures the underlying trend in the data, while an overfitted model memorizes the noise.

An overfit model learns the ‘noise’ in the training data, not just the underlying ‘signal’.

The Core Trade-off: Bias vs. Variance

  • Bias: The systematic difference between a model’s predictions and the true values. High bias means the model is too simple (underfitting).
  • Variance: The variability of a model’s predictions across different training sets. High variance means the model is too sensitive to the training data (overfitting).

Our goal is to find a model that achieves a good balance between bias and variance.

Figure: The bias-variance tradeoff — as model complexity grows, bias² falls and variance rises; total error is minimized at the optimal balance between the underfitting and overfitting zones.

The Solution: Regularization Penalizes Complexity

The core idea of regularization is to penalize model complexity while minimizing training error.

We achieve this by adding a Penalty Term to the loss function, which is related to the magnitude of the weights \(\mathbf{w}\).

\[ \large{J_{\text{reg}}(\mathbf{w}, b) = \text{Training Error (e.g., MSE)} + \lambda \cdot \text{Complexity Penalty}} \]

  • \(\lambda \ge 0\) is the regularization parameter, a hyperparameter we set. It controls the strength of the penalty.
  • \(\lambda = 0\): No penalty; this reverts to standard linear regression.
  • \(\lambda \to \infty\): The penalty is extreme, forcing all weights toward zero.

5.2 L2 Regularization: Ridge Regression

Ridge Regression uses the squared L2 norm of the weight vector, \(\|\mathbf{w}\|_2^2 = \sum_{j=1}^d w_j^2\), as its penalty term.

The objective function is:

\[ \large{J_{\text{Ridge}}(\mathbf{w}, b) = \text{MSE}(\mathbf{w},b) + \lambda \sum_{j=1}^d w_j^2} \]

Effect:

  • It causes shrinkage of the coefficients, pulling them towards zero but rarely making them exactly zero.
  • By penalizing large weights, it makes the model smoother and reduces variance.
  • It effectively handles multicollinearity, making the model more stable.

L1 Regularization: Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator) Regression uses the L1 norm of the weight vector, \(\|\mathbf{w}\|_1 = \sum_{j=1}^d |w_j|\), as its penalty.

The objective function is:

\[ \large{J_{\text{Lasso}}(\mathbf{w}, b) = \text{MSE}(\mathbf{w},b) + \lambda \sum_{j=1}^d |w_j|} \]

Effect:

  • Lasso not only shrinks coefficients but can force the coefficients of some unimportant features to be exactly zero.
  • Therefore, Lasso performs automatic Feature Selection, producing a sparser, more interpretable model, which is highly valuable in economic analysis.

Geometric View: Why Lasso Produces Sparse Solutions

The difference between Lasso and Ridge can be seen in the constraints they place on the weights.

  • Ridge: The constraint \(\|\mathbf{w}\|_2^2 \le \alpha\) is a circle (or sphere).
  • Lasso: The constraint \(\|\mathbf{w}\|_1 \le \alpha\) is a diamond (or high-dimensional polyhedron).

When the loss function’s contours (ellipses) expand to meet the constraint region, the sharp corners of the Lasso diamond make it more likely for the contact point to be on an axis (where some \(w_j=0\)).

Figure 2: Geometric Interpretation of Lasso (left) vs. Ridge (right)

Practice: Comparing OLS, Ridge, and Lasso

Let’s create a scenario with multicollinearity and irrelevant features to see how these three models perform.

Task:

  1. Create a dataset where some features are useful, and others are pure noise.
  2. Train models using Ordinary Least Squares (OLS), Ridge, and Lasso.
  3. Compare their learned coefficients to the true coefficients.

Python Code: Build Dataset and Train Models

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# 1. Generate synthetic data
np.random.seed(42)
n_samples, n_features = 100, 20
X = np.random.randn(n_samples, n_features)

# Create true coefficients, where only 5 are non-zero
true_coef = np.zeros(n_features)
true_coef[:5] = np.array([5, -3, 2, 4, -1.5])
y = X @ true_coef + np.random.normal(0, 2.5, n_samples)

# 2. Train the models
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

Visualization: Comparing Learned Coefficients

Let’s plot the coefficients learned by each of the three models.

Figure 3: Coefficient comparison between OLS, Ridge, and Lasso

Observations:

  • OLS: Coefficients fluctuate wildly, incorrectly assigning significant weights to many noise features (index 5 and above).
  • Ridge: All coefficients are shrunk towards zero. The model is more stable than OLS, but no coefficient becomes exactly zero.
  • Lasso: Successfully forces most of the noise feature coefficients to exactly zero, most closely recovering the true sparse coefficients.
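These observations can also be checked numerically. The short sketch below recreates the same synthetic data and models, counts how many coefficients each model drives to (essentially) zero, and measures the distance to the true coefficient vector; with these settings, Lasso should zero out most of the noise features.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Recreate the same synthetic data so this block is self-contained
np.random.seed(42)
X = np.random.randn(100, 20)
true_coef = np.zeros(20)
true_coef[:5] = np.array([5, -3, 2, 4, -1.5])
y = X @ true_coef + np.random.normal(0, 2.5, 100)

models = {'OLS': LinearRegression(), 'Ridge': Ridge(alpha=5.0), 'Lasso': Lasso(alpha=0.2)}
for name, model in models.items():
    model.fit(X, y)
    n_zero = np.sum(np.abs(model.coef_) < 1e-6)       # coefficients driven to (almost) zero
    err = np.linalg.norm(model.coef_ - true_coef)     # distance from the true coefficients
    print(f'{name:5s}  near-zero coefficients: {n_zero:2d}/20   ||w - w_true|| = {err:.3f}')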

5.3 Logistic Regression

Let’s return to classification. What are the problems with using the raw output of a linear model, \(\mathbf{w}^T\mathbf{x}+b\), directly for classification?

  1. Mismatched Output Range: The output is \((-\infty, +\infty)\), but we want a value representing probability, which should be in the \([0, 1]\) interval.
  2. Sensitivity to Outliers: A single outlier far from the decision boundary can drastically shift the regression line, thereby altering the classification outcome.

Logistic regression solves these issues with a clever ‘squashing’ function.

The Sigmoid Function Maps Real Numbers to Probabilities

Logistic regression uses the Sigmoid function (also called the Logistic function) to transform the linear model’s output into a probability.

\[ \large{\sigma(z) = \frac{1}{1 + e^{-z}}} \]

where \(z = \mathbf{w}^T\mathbf{x} + b\).

Figure 4: The Sigmoid Function Curve

Properties:

  1. The output is always between 0 and 1, perfectly matching the definition of probability.
  2. When \(z=0\), \(\sigma(z)=0.5\); as \(z \to +\infty\), \(\sigma(z) \to 1\); as \(z \to -\infty\), \(\sigma(z) \to 0\).

Probabilistic Interpretation of Logistic Regression

The logistic regression model assumes the probability of a sample belonging to the positive class (y=1) is:

\[ \large{P(y=1 | \mathbf{x}; \mathbf{w}, b) = \sigma(\mathbf{w}^T\mathbf{x} + b)} \]

Therefore, the probability of it belonging to the negative class (y=0) is:

\[ \large{P(y=0 | \mathbf{x}; \mathbf{w}, b) = 1 - P(y=1 | \mathbf{x}; \mathbf{w}, b)} \]

A decision threshold of 0.5 is typically used: if \(P(y=1 | \mathbf{x}) > 0.5\) (which means \(\mathbf{w}^T\mathbf{x} + b > 0\)), we predict 1; otherwise, we predict 0.

The Loss Function for Logistic Regression is Cross-Entropy

Logistic regression isn’t optimized using Mean Squared Error. Instead, it uses an idea derived from Maximum Likelihood Estimation (MLE).

For the entire dataset, we want to maximize the joint probability of observing the given labels. Taking the logarithm and negating it gives us the loss function to minimize, known as Log Loss or Binary Cross-Entropy:

\[ \large{J(\mathbf{w}, b) = -\frac{1}{N} \sum_{n=1}^N \left[ y_n \log(\hat{p}_n) + (1-y_n) \log(1 - \hat{p}_n) \right]} \]

where \(\hat{p}_n = \sigma(\mathbf{w}^T\mathbf{x}_n + b)\). This loss function is convex and can be efficiently solved using methods like gradient descent.
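A minimal NumPy sketch of this loss on made-up labels and linear scores, to make the formula concrete:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p_hat, eps=1e-12):
    # J = -(1/N) * sum[ y*log(p) + (1-y)*log(1-p) ]; eps avoids log(0)
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1, 0])                # hypothetical true labels
z = np.array([2.0, -1.5, 0.3, -0.2, -3.0])   # hypothetical linear scores w^T x + b
p = sigmoid(z)                               # predicted probabilities of the positive class

print('probabilities:', np.round(p, 3))
print('log loss     :', round(binary_cross_entropy(y, p), 4))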

Intuition Behind the Cross-Entropy Loss

Figure: Intuition for cross-entropy loss — when the true label is y=1 the loss is −log(p̂), which approaches 0 as p̂ → 1; when y=0 the loss is −log(1−p̂), which approaches 0 as p̂ → 0. Confident but wrong predictions are penalized heavily.

Practice: Predicting Customer Churn

Task: A bank wants to predict whether a customer will leave (‘churn’) based on their profile (e.g., credit score, age, balance). This is a classic binary classification problem.

We’ll use a synthetic customer dataset and model it with scikit-learn’s LogisticRegression, simplifying to two features for visualization.

Python Code: Training and Visualizing the Decision Boundary
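The original code for this practice is not reproduced here; below is a minimal sketch of what it might look like, using synthetic two-feature data from make_classification as a stand-in for the customer dataset (the feature names and plotting details are illustrative assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for customer data: 2 features, binary churn label
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

clf = LogisticRegression().fit(X, y)

# Evaluate predicted churn probability on a grid to draw the boundary
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
proba = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, proba, levels=20, cmap='RdBu_r', alpha=0.6)   # probability shading
plt.contour(xx, yy, proba, levels=[0.5], colors='black')           # boundary at p = 0.5
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu_r', edgecolor='k', s=20)
plt.xlabel('Feature 1 (e.g., age, scaled)')
plt.ylabel('Feature 2 (e.g., balance, scaled)')
plt.title('Logistic Regression Decision Boundary (p = 0.5)')
plt.show()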

Figure 5: Decision Boundary of a Logistic Regression Model

The color shading represents the predicted probability of churn, and the black line is the decision boundary where the probability is 0.5.

5.4 Support Vector Machines (SVM)

Logistic regression finds a boundary that separates the data, but is it the best boundary?

As seen below, multiple lines can perfectly separate the two classes. The core idea of SVM is: Don’t just separate the classes, separate them with the largest possible ‘margin’. The most robust boundary is the one that is as far as possible from the nearest points of both classes.

Figure 6: Which separating line is the best?

SVM Core Concepts: Margin and Support Vectors

  • Decision Boundary: The hyperplane \(\mathbf{w}^T\mathbf{x} + b = 0\).
  • Margin: The “empty” region between the decision boundary and the data points on either side. SVM aims to maximize the width of this region.
  • Support Vectors: The data points that lie exactly on the edges of the margin. These critical points alone determine the position of the decision boundary. Moving other points won’t affect the model.
Figure 7: The margin and support vectors in an SVM

SVM Math (Linearly Separable Case)

To maximize the margin, we first define its width. By scaling \(\mathbf{w}\) and \(b\), we can set the margin such that for any support vector \(\mathbf{x}_s\), we have \(|\mathbf{w}^T\mathbf{x}_s + b| = 1\). The distance from a point to the hyperplane is \(\frac{|\mathbf{w}^T\mathbf{x} + b|}{\|\mathbf{w}\|}\). So, the distance from a support vector to the hyperplane is \(1/\|\mathbf{w}\|\).

The total width of the margin is therefore \(2 / \|\mathbf{w}\|\).

Maximizing the margin \(\iff\) Maximizing \(2 / \|\mathbf{w}\| \iff\) Minimizing \(\|\mathbf{w}\| \iff\) Minimizing \(\frac{1}{2}\|\mathbf{w}\|^2\).

Simultaneously, all points must be classified correctly, meaning for each sample \((\mathbf{x}_n, y_n)\) (where \(y_n \in \{-1, 1\}\)): \(y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1\).

The SVM Optimization Problem (Hard Margin)

In summary, the optimization problem for a linearly separable (hard-margin) SVM is:

\[ \large{\min_{\mathbf{w}, b} \quad \frac{1}{2}\|\mathbf{w}\|^2} \] \[ \large{\text{subject to} \quad y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1, \quad \forall n=1, \ldots, N} \]

This is a convex quadratic programming problem with inequality constraints, which can be solved using methods like Lagrange duality.

The Real World is Messy: What if Data isn’t Linearly Separable?

In real-world economic data, perfect linear separability is almost never the case. There will always be some noise or outliers.

If we force a hard-margin SVM on such data, we might either find no solution or find a poor boundary that overfits to the noisy points.

Solution: Introduce the Soft Margin, which allows the model to make a few mistakes.

Soft-Margin SVMs Tolerate Errors via Slack Variables

We introduce a slack variable \(\xi_n \ge 0\) for each data point.

The constraint is relaxed to: \[ \large{y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1 - \xi_n} \]

  • If \(\xi_n = 0\), the point is correctly classified and outside the margin.
  • If \(0 < \xi_n \le 1\), the point is within the margin but still correctly classified.
  • If \(\xi_n > 1\), the point is misclassified.

We then add a penalty for these ‘mistakes’ to the objective function:

\[ \large{\min_{\mathbf{w}, b, \mathbf{\xi}} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \xi_n} \]

The Hyperparameter C Balances Margin Width and Errors

\(C\) is a crucial hyperparameter that controls the penalty for slack variables. Think of it as the inverse of the regularization parameter \(\lambda\).

  • Small \(C\): Low penalty for errors. The model prioritizes a wide margin, even if it means some points are inside the margin or misclassified. High tolerance, strong regularization, may underfit.
  • Large \(C\): High penalty for errors. The model tries very hard to classify every point correctly, which can lead to a narrow margin and overfitting to the training data. Low tolerance, weak regularization, may overfit.
Figure: The margin vs. error trade-off via the C parameter — a small C prioritizes a wide margin and tolerates misclassifications (simpler model, better generalization); a large C narrows the margin and contorts the boundary to classify every point correctly (more complex model, risk of overfitting).

Practice: How the C Parameter Affects the SVM Boundary

Figure 8: The effect of C on the SVM decision boundary
  • Left (C=0.1): The margin is wider, classifying one point incorrectly to achieve a simpler boundary. The model is more ‘tolerant’.
  • Right (C=100): The margin is narrower, contorting to classify every point correctly. The model is more ‘strict’.
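The code behind Figure 8 is not included here; the following is a hedged sketch of how such a comparison could be produced with scikit-learn's SVC on small synthetic blobs (the data and the two C values are illustrative choices):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.3, random_state=6)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, C in zip(axes, [0.1, 100]):
    clf = SVC(kernel='linear', C=C).fit(X, y)

    # Plot points, the decision boundary (level 0) and the margins (levels ±1)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolor='k', s=25)
    xx, yy = np.meshgrid(np.linspace(*ax.get_xlim(), 200),
                         np.linspace(*ax.get_ylim(), 200))
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors='k', linestyles=['--', '-', '--'])
    # Highlight the support vectors that define the margin
    ax.scatter(*clf.support_vectors_.T, s=120, facecolors='none', edgecolors='k')
    ax.set_title(f'C = {C}')
plt.tight_layout()
plt.show()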

5.5 Multiclass Linear Models

We have focused on binary classification, but many real-world tasks involve multiple categories. For example:

  • Segmenting customers into ‘High-Value’, ‘Mid-Value’, and ‘Low-Value’.
  • Recognizing handwritten digits (0-9, a 10-class problem).
  • Predicting the state of the economy: ‘Recovery’, ‘Boom’, ‘Recession’, or ‘Depression’.

How can we extend binary classifiers to handle multiclass scenarios?

Strategy 1: One-vs-Rest (OvR)

This strategy involves training one binary classifier for each class, which is trained to distinguish that class from all other classes combined.

  • For K classes, you train K classifiers.
  • To predict, a new sample is fed to all K classifiers. The class corresponding to the classifier with the highest confidence score is chosen.
Figure: One-vs-Rest (OvR) strategy for three classes A, B, C — three binary classifiers: A vs. rest, B vs. rest, C vs. rest.
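In scikit-learn, OvR can be applied explicitly by wrapping any binary classifier; a minimal sketch on the Iris data (three classes, so three underlying binary classifiers):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes -> OvR trains 3 binary classifiers

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print('number of binary classifiers:', len(ovr.estimators_))      # K = 3
print('predicted classes for first 5 samples:', ovr.predict(X[:5]))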

Strategy 2: One-vs-One (OvO)

This strategy involves training a binary classifier for every pair of classes.

  • For K classes, you train \(K(K-1)/2\) classifiers.
  • To predict, a new sample is run through all classifiers, and each classifier ‘votes’ for a class. The class with the most votes wins.
Figure: One-vs-One (OvO) strategy for three classes A, B, C — three binary classifiers: A vs. B, A vs. C, B vs. C.

OvR vs. OvO: A Comparison

  • Number of classifiers: OvR trains K; OvO trains K(K-1)/2.
  • Training data: Each OvR classifier uses all the data (which can be imbalanced); each OvO classifier uses only two classes (more balanced).
  • Use case: OvR is efficient when K is small; OvO can be faster to train when K is very large.
  • Commonly used with: Logistic Regression (the default) for OvR; Support Vector Machines for OvO.

Direct Extension: Softmax Regression

A more direct approach is Softmax Regression, which generalizes logistic regression to multiple classes.

For K classes, the model learns K weight vectors \(\{\mathbf{w}_1, \ldots, \mathbf{w}_K\}\). For a sample \(\mathbf{x}\), we compute K scores: \(s_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + b_k\). The Softmax function then converts these scores into a probability distribution:

\[ \large{P(y=k | \mathbf{x}) = \text{softmax}(s_k) = \frac{e^{s_k(\mathbf{x})}}{\sum_{j=1}^K e^{s_j(\mathbf{x})}}} \]

  • The probabilities for all classes sum to 1.
  • The loss function is the multiclass version of Cross-Entropy Loss.

Practice: Classifying the Iris Dataset

The Iris dataset is the ‘Hello World’ of machine learning. It contains measurements for three species of iris flowers (Setosa, Versicolour, Virginica).

Task: Build a multiclass model to classify the iris species based on its sepal length and width. We will use scikit-learn’s LogisticRegression and set multi_class='multinomial' to use Softmax regression directly.

Python: Training and Visualizing the Multiclass Boundary

We use only two features (sepal length and width) for easy visualization.

Figure: Softmax regression decision boundaries on the Iris dataset

The model learns linear decision boundaries that divide the 2D feature space into three distinct regions, one for each iris species.

5.6 The Class Imbalance Problem

In many important real-world applications, the event of interest is very rare.

  • Financial Fraud Detection: The vast majority of transactions are legitimate.
  • Rare Disease Diagnosis: Most people are healthy.
  • Ad Click-Through Prediction: A very small fraction of users click on an ad.

This situation is known as Class Imbalance. For example, a dataset might contain 99% negative samples and only 1% positive samples.

Why is Class Imbalance a Problem? The Accuracy Paradox

Standard machine learning models aim to maximize overall Accuracy. On a dataset with 99% negative samples, a naive model that simply predicts ‘negative’ for every single sample will achieve 99% accuracy. However, this model is completely useless because it fails to identify any positive samples.

The Core Issue: The model’s learning is dominated by the majority class, and it neglects the minority class. We need better evaluation metrics (the sketch below illustrates the paradox).
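A tiny sketch of the accuracy paradox using an 'always predict the majority class' baseline (scikit-learn's DummyClassifier); the 99%/1% split mirrors the illustrative ratio in the text:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Roughly 99% negative / 1% positive, as in the example above
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
print('class counts:', Counter(y))

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)  # always predicts the majority class
y_pred = baseline.predict(X)

print('accuracy:', accuracy_score(y, y_pred))   # ~0.99, looks great...
print('recall  :', recall_score(y, y_pred))     # 0.0 — it finds no positives at all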
A Better Evaluation Tool: The Confusion Matrix

The Confusion Matrix is a table that visualizes the performance of a classification model.

Figure: The confusion matrix — a 2×2 table of the four outcomes of a binary classifier, arranged by actual class versus predicted class: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN).

Key Metrics: Precision and Recall

Based on the confusion matrix, we can define two more meaningful metrics:

  • Precision: Of all samples predicted as positive, how many were actually positive? \[ \large{\text{Precision} = \frac{TP}{TP + FP}} \] Measures how ‘correct’ the positive predictions are.

  • Recall (or Sensitivity): Of all samples that were actually positive, how many did the model successfully find? \[ \large{\text{Recall} = \frac{TP}{TP + FN}} \] Measures how ‘complete’ the positive predictions are.

In fraud detection, we care deeply about recall (we don’t want to miss any fraudulent transactions).
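A short sketch computing both metrics from a hypothetical set of confusion-matrix counts, to make the definitions concrete:

# Hypothetical confusion-matrix counts for a fraud detector
TP, FP, FN, TN = 80, 40, 20, 9860

precision = TP / (TP + FP)   # of flagged transactions, how many were really fraud?
recall    = TP / (TP + FN)   # of all fraudulent transactions, how many did we catch?

print(f'Precision = {precision:.2f}')   # 0.67
print(f'Recall    = {recall:.2f}')      # 0.80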

Solution 1: Data-Level Resampling

The most direct approach is to fix the imbalance at the data level. There are two main strategies:

Figure: Resampling methods for class imbalance — starting from the original imbalanced data, undersampling removes majority-class samples, while oversampling adds or synthesizes minority-class samples.

Pros and Cons of Resampling Methods

Undersampling

  • Pro: Faster training time.
  • Con: May discard important information from the majority class.

Oversampling

  • Pro: No information loss.
  • Con: May lead to overfitting on the minority class.
  • Popular Algorithm: SMOTE (Synthetic Minority Over-sampling Technique).

SMOTE Creates Synthetic Minority Samples

SMOTE is one of the most effective oversampling techniques.

Core Idea: For each minority sample, find its k-nearest neighbors (which are also minority samples). Then, randomly pick a point along the line segment connecting the sample to one of its neighbors and create a new, synthetic sample there.

This is like ‘interpolating’ within the minority class region, creating new data that is similar to the original data but not identical, which helps expand the decision region for the minority class.
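The interpolation step itself is simple; a minimal NumPy sketch of the core idea (not the full k-nearest-neighbour algorithm), using two made-up minority samples:

import numpy as np

rng = np.random.default_rng(0)

x_a = np.array([2.0, 1.0])    # a minority-class sample
x_b = np.array([3.0, 2.5])    # one of its minority-class nearest neighbours

# New synthetic sample: a random point on the segment between x_a and x_b
lam = rng.uniform(0, 1)
x_new = x_a + lam * (x_b - x_a)

print('interpolation factor:', round(lam, 3))
print('synthetic sample    :', np.round(x_new, 3))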

How the SMOTE Algorithm Works

Figure: How SMOTE works — starting from imbalanced data with a majority and a minority class, SMOTE ① selects a minority sample A, ② finds a minority-class neighbour B, and ③ synthesizes a new sample on the line segment between A and B.

Practice: Improving Imbalanced Classification with SMOTE

We’ll create a highly imbalanced dataset and compare the performance of logistic regression before and after applying SMOTE.

Task:

  1. Create a dataset with a 95% vs. 5% class imbalance.
  2. Train a logistic regression model on the original data and evaluate.
  3. Use SMOTE from the imbalanced-learn library to oversample the training data.
  4. Train a new model on the resampled data and evaluate.

# Ensure imbalanced-learn is installed: pip install imbalanced-learn
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

# 1. Create imbalanced data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0, 
                           n_classes=2, weights=[0.95, 0.05], random_state=42)
print(f'Original data distribution: {Counter(y)}')

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

# 2. Model without SMOTE
model_plain = LogisticRegression(solver='liblinear', random_state=42)
model_plain.fit(X_train, y_train)
print('\n--- Classification Report without SMOTE (minority class is 1) ---')
print(classification_report(y_test, model_plain.predict(X_test)))

# 3. Model with SMOTE (using a pipeline to prevent data leakage)
pipeline_smote = make_pipeline(SMOTE(random_state=42), LogisticRegression(solver='liblinear', random_state=42))
pipeline_smote.fit(X_train, y_train)
print('\n--- Classification Report with SMOTE ---')
print(classification_report(y_test, pipeline_smote.predict(X_test)))
Original data distribution: Counter({np.int64(0): 945, np.int64(1): 55})

--- Classification Report without SMOTE (minority class is 1) ---
              precision    recall  f1-score   support

           0       0.95      0.99      0.97       236
           1       0.50      0.14      0.22        14

    accuracy                           0.94       250
   macro avg       0.73      0.57      0.60       250
weighted avg       0.93      0.94      0.93       250


--- Classification Report with SMOTE ---
              precision    recall  f1-score   support

           0       0.98      0.78      0.87       236
           1       0.17      0.79      0.28        14

    accuracy                           0.78       250
   macro avg       0.58      0.78      0.57       250
weighted avg       0.94      0.78      0.83       250

Observation: After using SMOTE, the recall for the minority class (1) increased dramatically from 0.14 to 0.79, at the cost of a sharp drop in precision (from 0.50 to 0.17). In applications like fraud detection, this is often the desired trade-off.

Solution 2: Algorithm-Level Adjustments

Besides modifying the data, we can also adjust the learning algorithm itself.

  • Adjusting Class Weights: We can assign a higher penalty for misclassifying minority class samples in the loss function.
    • For example, setting class_weight='balanced' in scikit-learn models.
    • This forces the model to pay more attention to correctly classifying the minority class during optimization.
  • Changing the Decision Threshold: Logistic regression uses a 0.5 probability threshold by default. For imbalanced problems, we can lower this threshold (e.g., to 0.3) to increase recall for the minority class.
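Both adjustments are one-liners in scikit-learn; a hedged sketch reusing the same style of imbalanced data as the SMOTE practice above (the 0.3 threshold is just an example value):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Same style of imbalanced data as in the SMOTE practice
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
                           n_classes=2, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

# 1. Class weights: penalize minority-class mistakes more heavily during training
weighted = LogisticRegression(solver='liblinear', class_weight='balanced', random_state=42)
weighted.fit(X_train, y_train)
print(classification_report(y_test, weighted.predict(X_test)))

# 2. Threshold moving: predict class 1 whenever P(y=1) exceeds 0.3 instead of 0.5
plain = LogisticRegression(solver='liblinear', random_state=42).fit(X_train, y_train)
y_pred_03 = (plain.predict_proba(X_test)[:, 1] > 0.3).astype(int)
print(classification_report(y_test, y_pred_03))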

Chapter Summary: A Unified View of Linear Models

We’ve explored the entire ‘family’ of linear models, but they all share a unified underlying philosophy:

  1. Core Engine: Every model starts with the linear function \(\mathbf{w}^T\mathbf{x} + b\).
  2. Task Adaptation:
    • Regression: Use the linear output directly.
    • Classification: Map the output to probabilities using Sigmoid/Softmax.
    • SVM: Focus on the geometric margin around the output.
  3. Optimization Goal: Learn the optimal weights \(\mathbf{w}\) by defining different loss functions and regularization terms.
    • Loss Functions: Mean Squared Error, Cross-Entropy, Hinge Loss (SVM).
    • Regularization: L1 and L2 norms.

The Linear Model Family at a Glance

This diagram summarizes the models we’ve discussed, showing how different combinations of loss functions and regularization penalties lead to different models.

Figure: The linear model family — a core linear predictor wᵀx + b combined with different loss functions (MSE → linear regression, cross-entropy → logistic regression, hinge/margin → SVM) and different penalties (L1 → Lasso, L2 → Ridge).

Thank You!

Q & A