Using vector notation, this can be written concisely as:
\[ \large{f(\mathbf{x}; \mathbf{w}, b) = \mathbf{w}^T\mathbf{x} + b} \]
\(\mathbf{w} = (w_1, \ldots, w_d)\): The weight vector, which determines the importance of each feature.
\(b\): The bias or intercept term, which acts as the model’s baseline.
Anatomy of a Linear Model
This diagram illustrates how a linear model combines input features to produce a single predictive value.
The Geometry: It Defines a Decision Hyperplane
The model’s equation, \(\mathbf{w}^T\mathbf{x} + b = 0\), geometrically defines a hyperplane.
In 2D space (d=2): This is a straight line (\(w_1x_1 + w_2x_2 + b = 0\)).
In 3D space (d=3): This is a flat plane (\(w_1x_1 + w_2x_2 + w_3x_3 + b = 0\)).
This hyperplane divides the feature space into two halves, forming the decision boundary for all linear classifiers.
Hyperplanes Take Different Forms in Different Dimensions
Example: A Linear Classifier in 2D Space
Suppose we predict if the economy is in an ‘expansion’ (blue circles) or ‘recession’ (red diamonds) based on two indicators: \(x_1\) (GDP growth) and \(x_2\) (inflation). A linear classifier finds a line to separate these two classes.
The Weight Vector w Determines the Hyperplane’s Orientation
The weight vector \(\mathbf{w}\) is not just for weighting features; geometrically, it is always perpendicular to the decision hyperplane.
The direction of \(\mathbf{w}\) points in the direction of the fastest increase in the function \(f(\mathbf{x})\).
For any two points \(\mathbf{x}_A, \mathbf{x}_B\) on the hyperplane, we have \(\mathbf{w}^T(\mathbf{x}_A - \mathbf{x}_B) = 0\), proving that \(\mathbf{w}\) is orthogonal to any vector lying within the hyperplane.
In the previous slide, the vector \(\mathbf{w}\) (teal arrow) is perpendicular to the decision boundary (grey line).
From Geometry to Prediction: The Decision Rule
Once we have the hyperplane \(\mathbf{w}^T\mathbf{x} + b = 0\), classification is straightforward.
For a new data point \(\mathbf{x}\), we compute the value of \(f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b\):
If \(f(\mathbf{x}) > 0\), the point lies on the side of the hyperplane pointed to by \(\mathbf{w}\). We predict it as the positive class (e.g., label +1).
If \(f(\mathbf{x}) < 0\), the point lies on the other side. We predict it as the negative class (e.g., label -1).
This decision function is often written as: \(\hat{y} = \text{sign}(\mathbf{w}^T\mathbf{x} + b)\).
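As a minimal illustration, here is a NumPy sketch of this decision rule; the weights, bias, and sample values are made up for demonstration.

```python
import numpy as np

# Hypothetical parameters of a trained 2D linear classifier (illustrative values)
w = np.array([0.8, -1.2])   # weight vector
b = 0.5                     # bias term

# A new sample to classify (illustrative values)
x = np.array([1.5, 0.3])

# Decision function and sign-based prediction
f_x = w @ x + b
y_hat = np.sign(f_x)        # +1 or -1
print(f'f(x) = {f_x:.3f}, predicted label = {y_hat:+.0f}')
```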
Key Math: Distance from a Point to the Hyperplane
What is the distance from a sample point \(\mathbf{x}\) to the decision boundary \(\mathbf{w}^T\mathbf{x} + b = 0\)? This concept is crucial for Support Vector Machines (SVMs).
From analytic geometry, the distance \(r\) is given by:
\[ \large{r = \frac{|\mathbf{w}^T\mathbf{x} + b|}{\|\mathbf{w}\|}} \]
where \(\|\mathbf{w}\|\) is the L2 norm (Euclidean length) of the weight vector, \(\|\mathbf{w}\| = \sqrt{w_1^2 + w_2^2 + \ldots + w_d^2}\).
This formula shows that \(f(\mathbf{x})\) not only gives us the class (via its sign), but its magnitude \(|f(\mathbf{x})|\) is also proportional to the point’s distance from the boundary.
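A minimal NumPy check of this formula, reusing the hypothetical \(\mathbf{w}\), \(b\), and \(\mathbf{x}\) from the sketch above:

```python
import numpy as np

w = np.array([0.8, -1.2])
b = 0.5
x = np.array([1.5, 0.3])

# Distance r = |w^T x + b| / ||w||_2
r = abs(w @ x + b) / np.linalg.norm(w)
print(f'Distance from x to the hyperplane: {r:.3f}')
```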
5.1 From Classification to Regression: Linear Regression
If our goal is not to predict a discrete class (like ‘Expansion/Recession’) but a continuous value (like house prices, stock prices), the linear model becomes Linear Regression.
We directly use the model’s output as the predicted value:
\[ \large{\hat{y} = f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b} \]
The objective now becomes: Find the best \(\mathbf{w}\) and \(b\) such that the predicted value \(\hat{y}\) is as close as possible to the true observed value \(y\).
Measuring ‘Closeness’: Mean Squared Error (MSE)
We use a Loss Function to quantify the ‘error’ of our predictions. For linear regression, the most common loss function is the Mean Squared Error (MSE).
For a dataset of \(N\) samples \(\{(\mathbf{x}_n, y_n)\}_{n=1}^N\), the MSE is defined as:
\[ \large{J(\mathbf{w}, b) = \frac{1}{N}\sum_{n=1}^N \left(y_n - (\mathbf{w}^T\mathbf{x}_n + b)\right)^2} \]
Our goal is to find the \(\mathbf{w}\) and \(b\) that minimize this \(J(\mathbf{w}, b)\). This is the famous Least Squares Method.
Geometric Intuition of the MSE Loss Function
The method of least squares seeks to find a line that minimizes the sum of the squared vertical distances (residuals) from each data point to the line.
## Solution 1: The Normal Equation Provides a Direct Formula
Because the MSE loss function is convex with respect to $\mathbf{w}$ and $b$, we can find its minimum using calculus.
By merging $b$ into $\mathbf{w}$ (by adding a constant feature of 1 to each $\mathbf{x}$), the loss function simplifies to $J(\mathbf{w}) = \frac{1}{N} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$.
Taking the gradient with respect to $\mathbf{w}$ and setting it to zero, $\nabla_{\mathbf{w}} J(\mathbf{w}) = 0$, yields a closed-form solution called the **Normal Equation**:
$$ \large{\mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}} $$
- $\mathbf{X}$ is the $N \times (d+1)$ design matrix, where each row is a sample.
- $\mathbf{y}$ is the $N \times 1$ vector of true labels.
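Below is a minimal NumPy sketch of the Normal Equation on synthetic data (the true weights, noise level, and sample size are arbitrary illustrative choices). In practice, solving the linear system with `np.linalg.solve` is preferred over forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X_raw = rng.normal(size=(N, d))
true_w = np.array([2.0, -1.0, 0.5])
y = X_raw @ true_w + 3.0 + rng.normal(scale=0.1, size=N)  # intercept 3.0 plus noise

# Add a constant column of 1s to absorb the bias b into w (design matrix is N x (d+1))
X = np.hstack([X_raw, np.ones((N, 1))])

# Normal Equation: w* = (X^T X)^{-1} X^T y, solved without an explicit inverse
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print('Estimated [w1, w2, w3, b]:', np.round(w_star, 3))
```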
::: {.notes}
For those who have studied linear algebra and calculus, this derivation should be familiar. We take the partial derivative of the loss function with respect to the parameters, set it to zero, and solve. This gives us the critical point. Since MSE is a convex function, this point is the global minimum. This formula is theoretically elegant; it tells us that as long as the matrix is invertible, we can calculate the optimal weights in a single step.
:::
## The Normal Equation Has Pros and Cons
:::: {.columns}
::: {.column width="50%"}
#### Pros
- **One-step Solution**: No iteration needed; gives the exact optimal solution.
- **No Hyperparameters**: No learning rate to tune.
:::
::: {.column width="50%"}
#### Cons
- **Computationally Expensive**: Computing the inverse $(\mathbf{X}^T\mathbf{X})^{-1}$ has a complexity of roughly $O(d^3)$. It becomes extremely slow when the number of features, $d$, is very large.
- **Non-invertible Matrix**: If features are multicollinear, or if the number of features exceeds the number of samples, $\mathbf{X}^T\mathbf{X}$ is not invertible, and the equation cannot be solved.
:::
::::
## Solution 2: Gradient Descent Iteratively Finds the Minimum
When the number of features is large, we typically use an iterative method called **Gradient Descent**.
**The Core Idea**: Imagine a blindfolded person trying to walk to the bottom of a valley. At each step, they feel for the **steepest** path downhill (the opposite direction of the gradient) and take a small step. They repeat this until they reach the valley floor (the minimum of the loss function).
1. Randomly initialize \(\mathbf{w}\) and \(b\).
2. Compute the gradient of the loss function with respect to \(\mathbf{w}\) and \(b\).
3. Update the parameters in the opposite direction of the gradient: \(\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}}J\); \(b \leftarrow b - \eta \nabla_{b}J\).
4. Repeat steps 2 and 3 until convergence. (\(\eta\) is the learning rate.)
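A compact NumPy sketch of these four steps for linear regression with the MSE loss; the synthetic data, learning rate, and iteration count are illustrative choices, not values from the original slides.

```python
import numpy as np

rng = np.random.default_rng(42)
N, d = 200, 2
X = rng.normal(size=(N, d))
y = X @ np.array([1.5, -2.0]) + 0.7 + rng.normal(scale=0.1, size=N)

# 1. Randomly initialize w and b
w, b = rng.normal(size=d), 0.0
eta, n_iters = 0.1, 500          # learning rate and number of iterations

for _ in range(n_iters):
    # 2. Gradient of the MSE loss J = (1/N) * sum((Xw + b - y)^2)
    residual = X @ w + b - y
    grad_w = (2 / N) * X.T @ residual
    grad_b = (2 / N) * residual.sum()
    # 3. Update parameters in the opposite direction of the gradient
    w -= eta * grad_w
    b -= eta * grad_b

# 4. After convergence, w and b approximate the least-squares solution
print('w ≈', np.round(w, 3), ' b ≈', round(b, 3))
```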
Practice: Predicting California Housing Prices
Let’s tackle a classic problem: predicting house prices. We’ll use the California Housing dataset available in scikit-learn.
Task: Build a linear regression model to predict the median house value based on several features (e.g., median income in the block, house age, etc.).
```python
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the data
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='MedHouseVal ($100k)')

print('Features (first 5 rows):')
print(X.head())
print('\nTarget (Median House Value, first 5 rows):')
print(y.head())
```
Data Preparation: Splitting into Training and Test Sets
To evaluate a model’s ability to generalize, we must split our data into two parts:
Training Set: Used to learn the model parameters \(\mathbf{w}\) and \(b\).
Test Set: Data the model has never seen, used to evaluate its performance on unknown data.
Python Code: Training and Evaluating the Model
We use scikit-learn’s LinearRegression to implement the model.
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Reload data to keep the code block self-contained
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target)

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# 3. Make predictions on the test set
y_pred = model.predict(X_test)

# 4. Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error (MSE) on Test Set: {mse:.4f}')
print(f'R-squared (R²) on Test Set: {r2:.4f}')
```
Mean Squared Error (MSE) on Test Set: 0.5559
R-squared (R²) on Test Set: 0.5758
Results: Predicted vs. Actual Values
For a good model, the predicted values should be closely aligned with the actual values. We can visualize this relationship with a scatter plot.
Figure 1: Housing Price Predictions vs. Actual Values
If the points fall perfectly on the red line, the prediction is perfect. Our model shows a clear positive trend, but there is still room for improvement.
One of the greatest advantages of linear models is interpretability. We can directly inspect the learned weights \(\mathbf{w}\) (coefficients).
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing

# Reload and train for a self-contained block
housing = fetch_california_housing()
X_df = pd.DataFrame(housing.data, columns=housing.feature_names)
y_s = pd.Series(housing.target)
X_train, _, y_train, _ = train_test_split(X_df, y_s, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Inspect coefficients
coeffs = pd.Series(model.coef_, index=X_df.columns).sort_values()
print('Regression Coefficients (Weights) for each feature:')
print(coeffs)
```
Regression Coefficients (Weights) for each feature:
Longitude -0.433708
Latitude -0.419792
AveRooms -0.123323
AveOccup -0.003526
Population -0.000002
HouseAge 0.009724
MedInc 0.448675
AveBedrms 0.783145
dtype: float64
Positive Coefficient: As this feature increases, the house price tends to increase (e.g., MedInc).
Negative Coefficient: As this feature increases, the house price tends to decrease.
Caution: The magnitude of coefficients can only be directly compared to gauge importance if the features are on a similar scale.
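One way to make the magnitudes comparable is to standardize the features before fitting. The sketch below reuses the California Housing setup from the previous blocks and adds a `StandardScaler`; it is an illustrative extension, not part of the original example.

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

housing = fetch_california_housing()
X_df = pd.DataFrame(housing.data, columns=housing.feature_names)
y_s = pd.Series(housing.target)
X_train, _, y_train, _ = train_test_split(X_df, y_s, test_size=0.2, random_state=42)

# Standardize features so that coefficient magnitudes are directly comparable
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)
coeffs_std = pd.Series(pipe[-1].coef_, index=X_df.columns).sort_values()
print('Coefficients on standardized features:')
print(coeffs_std)
```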
Problem: What if Features are Numerous or Correlated?
Standard linear regression (OLS) runs into trouble in certain situations:
Overfitting: When the number of features \(d\) is close to or exceeds the number of samples \(N\), the model can become overly complex. It may fit the training data perfectly but perform poorly on new data.
Multicollinearity: When features are highly correlated (e.g., using both ‘house area’ and ‘number of rooms’), the \(\mathbf{X}^T\mathbf{X}\) matrix becomes nearly singular (non-invertible). This leads to unstable and unreliable weight estimates \(\mathbf{w}\).
A Visual Example of Overfitting
An overfit model learns the ‘noise’ in the training data, not just the underlying ‘signal’.
The Core Trade-off: Bias vs. Variance
Bias: The systematic difference between a model’s predictions and the true values. High bias means the model is too simple (underfitting).
Variance: The variability of a model’s predictions across different training sets. High variance means the model is too sensitive to the training data (overfitting).
Our goal is to find a model that achieves a good balance between bias and variance.
The Solution: Regularization Penalizes Complexity
The core idea of regularization is to penalize model complexity while minimizing training error.
We achieve this by adding a Penalty Term to the loss function, related to the magnitude of the weights \(\mathbf{w}\):
\[ \large{J(\mathbf{w}, b) = \text{MSE}(\mathbf{w}, b) + \lambda \cdot \text{Penalty}(\mathbf{w})} \]
\(\lambda \ge 0\) is the regularization parameter, a hyperparameter we set. It controls the strength of the penalty.
\(\lambda = 0\): No penalty; this reverts to standard linear regression.
\(\lambda \to \infty\): The penalty is extreme, forcing all weights toward zero.
5.2 L2 Regularization: Ridge Regression
Ridge Regression uses the squared L2 norm of the weight vector, \(\|\mathbf{w}\|_2^2 = \sum_{j=1}^d w_j^2\), as its penalty term.
The objective function is:
\[ \large{J_{\text{Ridge}}(\mathbf{w}, b) = \text{MSE}(\mathbf{w},b) + \lambda \sum_{j=1}^d w_j^2} \]
Effect:
- It causes shrinkage of the coefficients, pulling them towards zero but rarely making them exactly zero.
- By penalizing large weights, it makes the model smoother and reduces variance.
- It effectively handles multicollinearity, making the model more stable.
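As an aside, Ridge also has a closed-form solution, obtained by adding \(\lambda\mathbf{I}\) to \(\mathbf{X}^T\mathbf{X}\) in the Normal Equation. Below is a minimal NumPy sketch on synthetic data, using the unaveraged loss \(\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2\) (the convention scikit-learn's `Ridge` follows); the bias term is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.5, size=N)

lam = 1.0  # regularization strength (arbitrary illustrative value)

# Ridge closed form: w* = (X^T X + lambda * I)^{-1} X^T y
# (intercept not included here; in practice the intercept is not penalized)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print('OLS weights:  ', np.round(w_ols, 3))
print('Ridge weights:', np.round(w_ridge, 3))
```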
L1 Regularization: Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) Regression uses the L1 norm of the weight vector, \(\|\mathbf{w}\|_1 = \sum_{j=1}^d |w_j|\), as its penalty.
The objective function is:
\[ \large{J_{\text{Lasso}}(\mathbf{w}, b) = \text{MSE}(\mathbf{w},b) + \lambda \sum_{j=1}^d |w_j|} \]
Effect:
- Lasso not only shrinks coefficients but can force the coefficients of some unimportant features to be exactly zero.
- Therefore, Lasso performs automatic Feature Selection, producing a sparser, more interpretable model, which is highly valuable in economic analysis.
The difference between Lasso and Ridge can be seen in the constraints they place on the weights.
Ridge: The constraint \(\|\mathbf{w}\|_2^2 \le \alpha\) is a circle (or sphere).
Lasso: The constraint \(\|\mathbf{w}\|_1 \le \alpha\) is a diamond (or high-dimensional polyhedron).
When the loss function’s contours (ellipses) expand to meet the constraint region, the sharp corners of the Lasso diamond make it more likely for the contact point to be on an axis (where some \(w_j=0\)).
Figure 2: Geometric Interpretation of Lasso (left) vs. Ridge (right)
Practice: Comparing OLS, Ridge, and Lasso
Let’s create a scenario with multicollinearity and irrelevant features to see how these three models perform.
Task:
1. Create a dataset where some features are useful, and others are pure noise.
2. Train models using Ordinary Least Squares (OLS), Ridge, and Lasso.
3. Compare their learned coefficients to the true coefficients.
Python Code: Build Dataset and Train Models
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# 1. Generate synthetic data
np.random.seed(42)
n_samples, n_features = 100, 20
X = np.random.randn(n_samples, n_features)

# Create true coefficients, where only 5 are non-zero
true_coef = np.zeros(n_features)
true_coef[:5] = np.array([5, -3, 2, 4, -1.5])
y = X @ true_coef + np.random.normal(0, 2.5, n_samples)

# 2. Train the models
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)
```
Visualization: Comparing Learned Coefficients
Let’s plot the coefficients learned by each of the three models.
Figure 3: Coefficient comparison between OLS, Ridge, and Lasso
Observations:
- OLS: Coefficients fluctuate wildly, incorrectly assigning significant weights to many noise features (index 5 and above).
- Ridge: All coefficients are shrunk towards zero. The model is more stable than OLS, but no coefficient becomes exactly zero.
- Lasso: Successfully forces most of the noise feature coefficients to exactly zero, most closely recovering the true sparse coefficients.
5.3 Logistic Regression
Let’s return to classification. What are the problems with using the raw output of a linear model, \(\mathbf{w}^T\mathbf{x}+b\), directly for classification?
Mismatched Output Range: The output is \((-\infty, +\infty)\), but we want a value representing probability, which should be in the \([0, 1]\) interval.
Sensitivity to Outliers: A single outlier far from the decision boundary can drastically shift the regression line, thereby altering the classification outcome.
Logistic regression solves these issues with a clever ‘squashing’ function.
The Sigmoid Function Maps Real Numbers to Probabilities
Logistic regression uses the Sigmoid function (also called the Logistic function) to transform the linear model’s output into a probability.
\[ \large{\sigma(z) = \frac{1}{1 + e^{-z}}} \]
where \(z = \mathbf{w}^T\mathbf{x} + b\).
Figure 4: The Sigmoid Function Curve
Properties:
1. The output is always between 0 and 1, perfectly matching the definition of probability.
2. When \(z=0\), \(\sigma(z)=0.5\); as \(z \to +\infty\), \(\sigma(z) \to 1\); as \(z \to -\infty\), \(\sigma(z) \to 0\).
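A quick numerical check of these properties (a minimal sketch; the numerically stable form used below is an implementation detail, not something required by the math):

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigma(z) = 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For negative z, the equivalent form exp(z) / (1 + exp(z)) avoids overflow
    out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))
    return out

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for zi, si in zip(z, sigmoid(z)):
    print(f'sigma({zi:+5.1f}) = {si:.4f}')
```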
Probabilistic Interpretation of Logistic Regression
The logistic regression model assumes the probability of a sample belonging to the positive class (y=1) is:
\[ \large{P(y=1 | \mathbf{x}; \mathbf{w}, b) = \sigma(\mathbf{w}^T\mathbf{x} + b)} \]
Therefore, the probability of it belonging to the negative class (y=0) is:
\[ \large{P(y=0 | \mathbf{x}; \mathbf{w}, b) = 1 - P(y=1 | \mathbf{x}; \mathbf{w}, b)} \]
A decision threshold of 0.5 is typically used: if \(P(y=1 | \mathbf{x}) > 0.5\) (which means \(\mathbf{w}^T\mathbf{x} + b > 0\)), we predict 1; otherwise, we predict 0.
The Loss Function for Logistic Regression is Cross-Entropy
Logistic regression isn’t optimized using Mean Squared Error. Instead, it uses an idea derived from Maximum Likelihood Estimation (MLE).
For the entire dataset, we want to maximize the joint probability of observing the given labels. Taking the logarithm and negating it gives us the loss function to minimize, known as Log Loss or Binary Cross-Entropy:
\[ \large{J(\mathbf{w}, b) = -\frac{1}{N}\sum_{n=1}^N \left[ y_n \log \hat{p}_n + (1 - y_n)\log(1 - \hat{p}_n) \right]} \]
where \(\hat{p}_n = \sigma(\mathbf{w}^T\mathbf{x}_n + b)\). This loss function is convex and can be efficiently solved using methods like gradient descent.
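A short NumPy sketch of this loss on hand-made labels and predicted probabilities (illustrative values only):

```python
import numpy as np

def binary_cross_entropy(y_true, p_hat, eps=1e-12):
    """Log loss: -(1/N) * sum[y*log(p) + (1-y)*log(1-p)]."""
    p_hat = np.clip(p_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

y_true = np.array([1, 0, 1, 1, 0])
p_good = np.array([0.9, 0.1, 0.8, 0.7, 0.2])   # confident and correct -> low loss
p_bad  = np.array([0.2, 0.9, 0.3, 0.4, 0.8])   # confident and wrong   -> high loss
print('Loss (good predictions):', round(binary_cross_entropy(y_true, p_good), 4))
print('Loss (bad predictions): ', round(binary_cross_entropy(y_true, p_bad), 4))
```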
Intuition Behind the Cross-Entropy Loss
Practice: Predicting Customer Churn
Task: A bank wants to predict whether a customer will leave (‘churn’) based on their profile (e.g., credit score, age, balance). This is a classic binary classification problem.
We’ll use a synthetic customer dataset and model it with scikit-learn’s LogisticRegression, simplifying to two features for visualization.
Python Code: Training and Visualizing the Decision Boundary
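The original code for this slide is not reproduced here; the following is a plausible self-contained sketch that produces a figure of this kind, using `make_classification` as a stand-in for the synthetic customer data (all dataset parameters are assumptions).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature 'churn' data (stand-in for the customer dataset)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.5, random_state=0)
clf = LogisticRegression().fit(X, y)

# Evaluate the predicted churn probability on a grid
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
proba = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, proba, levels=20, cmap='RdBu_r', alpha=0.6)  # probability shading
plt.contour(xx, yy, proba, levels=[0.5], colors='black')          # 0.5 decision boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdBu_r', edgecolor='k', s=25)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Logistic regression decision boundary')
plt.show()
```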
Figure 5: Decision Boundary of a Logistic Regression Model
The color shading represents the predicted probability of churn, and the black line is the decision boundary where the probability is 0.5.
5.4 Support Vector Machines (SVM)
Logistic regression finds a boundary that separates the data, but is it the best boundary?
As seen below, multiple lines can perfectly separate the two classes. The core idea of SVM is: Don’t just separate the classes, separate them with the largest possible ‘margin’. The most robust boundary is the one that is as far as possible from the nearest points of both classes.
Figure 6: Which separating line is the best?
SVM Core Concepts: Margin and Support Vectors
Decision Boundary: The hyperplane \(\mathbf{w}^T\mathbf{x} + b = 0\).
Margin: The “empty” region between the decision boundary and the data points on either side. SVM aims to maximize the width of this region.
Support Vectors: The data points that lie exactly on the edges of the margin. These critical points alone determine the position of the decision boundary. Moving other points won’t affect the model.
Figure 7: The margin and support vectors in an SVM
SVM Math (Linearly Separable Case)
To maximize the margin, we first define its width. By scaling \(\mathbf{w}\) and \(b\), we can set the margin such that for any support vector \(\mathbf{x}_s\), we have \(|\mathbf{w}^T\mathbf{x}_s + b| = 1\). The distance from a point to the hyperplane is \(\frac{|\mathbf{w}^T\mathbf{x} + b|}{\|\mathbf{w}\|}\). So, the distance from a support vector to the hyperplane is \(1/\|\mathbf{w}\|\).
The total width of the margin is therefore \(2 / \|\mathbf{w}\|\).
Simultaneously, all points must be classified correctly, meaning for each sample \((\mathbf{x}_n, y_n)\) (where \(y_n \in \{-1, 1\}\)): \(y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1\).
The SVM Optimization Problem (Hard Margin)
In summary, the optimization problem for a linearly separable (hard-margin) SVM is:
\[ \large{\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1, \ n = 1, \ldots, N} \]
Maximizing the margin \(2/\|\mathbf{w}\|\) is equivalent to minimizing \(\|\mathbf{w}\|\) (or, for convenience, \(\frac{1}{2}\|\mathbf{w}\|^2\)).
The Hyperparameter C Balances Margin Width and Errors
In practice, data is rarely perfectly separable, so the soft-margin SVM introduces slack variables \(\xi_n \ge 0\) that allow some points to violate the margin:
\[ \large{\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^N \xi_n \quad \text{subject to} \quad y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1 - \xi_n} \]
\(C\) is a crucial hyperparameter that controls the penalty for these slack variables. Think of it as the inverse of the regularization parameter \(\lambda\).
Small \(C\): Low penalty for errors. The model prioritizes a wide margin, even if it means some points are inside the margin or misclassified. High tolerance, strong regularization, may underfit.
Large \(C\): High penalty for errors. The model tries very hard to classify every point correctly, which can lead to a narrow margin and overfitting to the training data. Low tolerance, weak regularization, may overfit.
Practice: How the C Parameter Affects the SVM Boundary
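The code behind this figure is not shown on the slide; below is a plausible sketch that reproduces the comparison with scikit-learn's `SVC` on stand-in data (the blob parameters are assumptions).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping clusters (illustrative stand-in data)
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.3, random_state=6)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, C in zip(axes, [0.1, 100]):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolor='k')
    # Draw the decision boundary (level 0) and the margin edges (levels -1 and +1)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors='k', linestyles=['--', '-', '--'])
    ax.set_title(f'C = {C}')
plt.tight_layout()
plt.show()
```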
Figure 8: The effect of C on the SVM decision boundary
Left (C=0.1): The margin is wider, classifying one point incorrectly to achieve a simpler boundary. The model is more ‘tolerant’.
Right (C=100): The margin is narrower, contorting to classify every point correctly. The model is more ‘strict’.
5.5 Multiclass Linear Models
We have focused on binary classification, but many real-world tasks involve multiple categories. For example:
Segmenting customers into ‘High-Value’, ‘Mid-Value’, and ‘Low-Value’.
Recognizing handwritten digits (0-9, a 10-class problem).
Predicting the state of the economy: ‘Recovery’, ‘Boom’, ‘Recession’, or ‘Depression’.
How can we extend binary classifiers to handle multiclass scenarios?
Strategy 1: One-vs-Rest (OvR)
This strategy involves training one binary classifier for each class, which is trained to distinguish that class from all other classes combined.
For K classes, you train K classifiers.
To predict, a new sample is fed to all K classifiers. The class corresponding to the classifier with the highest confidence score is chosen.
Strategy 2: One-vs-One (OvO)
This strategy involves training a binary classifier for every pair of classes.
For K classes, you train \(K(K-1)/2\) classifiers.
To predict, a new sample is run through all classifiers, and each classifier ‘votes’ for a class. The class with the most votes wins.
## OvR vs. OvO: A Comparison
| Feature | One-vs-Rest (OvR) | One-vs-One (OvO) |
| :----------------- | :------------------------------------------- | :---------------------------------------------- |
| **# of Classifiers** | K | K(K-1)/2 |
| **Training Data** | Each classifier uses all data (can be imbalanced) | Each classifier uses only two classes (more balanced) |
| **Use Case** | Efficient when K is small. | Each classifier trains on only two classes' data, which can make training faster for algorithms that scale poorly with dataset size (e.g., kernel SVMs). |
| **Commonly used with**| Logistic Regression (default) | Support Vector Machines |
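A short sketch of both strategies using scikit-learn's `OneVsRestClassifier` and `OneVsOneClassifier` wrappers (shown on the Iris dataset purely for concreteness):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
base = LogisticRegression(max_iter=1000)

ovr = OneVsRestClassifier(base)   # trains K binary classifiers
ovo = OneVsOneClassifier(base)    # trains K(K-1)/2 pairwise classifiers

print('OvR accuracy:', cross_val_score(ovr, X, y, cv=5).mean().round(3))
print('OvO accuracy:', cross_val_score(ovo, X, y, cv=5).mean().round(3))
print('OvO classifiers after fit:', len(ovo.fit(X, y).estimators_))  # 3 classes -> 3 pairs
```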
## Direct Extension: Softmax Regression {#sec-softmax}
A more direct approach is **Softmax Regression**, which generalizes logistic regression to multiple classes.
For K classes, the model learns K weight vectors $\{\mathbf{w}_1, \ldots, \mathbf{w}_K\}$. For a sample $\mathbf{x}$, we compute K scores: $s_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + b_k$.
The **Softmax function** then converts these scores into a probability distribution:
$$ \large{P(y=k | \mathbf{x}) = \text{softmax}(s_k) = \frac{e^{s_k(\mathbf{x})}}{\sum_{j=1}^K e^{s_j(\mathbf{x})}}}
$$
- The probabilities for all classes sum to 1.
- The loss function is the multiclass version of **Cross-Entropy Loss**.
## Practice: Classifying the Iris Dataset {#sec-iris-goal}
The Iris dataset is the 'Hello World' of machine learning. It contains measurements for three species of iris flowers (Setosa, Versicolour, Virginica).
**Task**: Build a multiclass model to classify the iris species based on its sepal length and width.
We will use `scikit-learn`'s `LogisticRegression` and set `multi_class='multinomial'` to use Softmax regression directly.
## Python: Training and Visualizing the Multiclass Boundary {#sec-iris-code}
We use only two features (sepal length and width) for easy visualization.
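The original code is not reproduced here; a plausible self-contained sketch along these lines is:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Use only sepal length and sepal width (first two features)
iris = load_iris()
X, y = iris.data[:, :2], iris.target

# Softmax (multinomial) logistic regression
clf = LogisticRegression(multi_class='multinomial', max_iter=1000).fit(X, y)

# Predict over a grid to visualize the three decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.title('Softmax regression on the Iris dataset')
plt.show()
```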
Figure: Multiclass decision boundaries on the Iris dataset (sepal length vs. sepal width).
The model learns linear decision boundaries that divide the 2D feature space into three distinct regions, one for each iris species.
## 5.6 The Class Imbalance Problem {#sec-imbalance-intro}
In many important real-world applications, the event of interest is very rare.
- **Financial Fraud Detection**: The vast majority of transactions are legitimate.
- **Rare Disease Diagnosis**: Most people are healthy.
- **Ad Click-Through Prediction**: A very small fraction of users click on an ad.
This situation is known as **Class Imbalance**. For example, a dataset might contain 99% negative samples and only 1% positive samples.
## Why is Class Imbalance a Problem? The Accuracy Paradox
Standard machine learning models aim to maximize overall **Accuracy**.
On a dataset with 99% negative samples, a naive model that simply predicts 'negative' for every single sample will achieve 99% accuracy. However, this model is completely useless because it fails to identify any positive samples.
**The Core Issue**: The model's learning is dominated by the majority class, and it neglects the minority class. We need better evaluation metrics.
## A Better Evaluation Tool: The Confusion Matrix
The **Confusion Matrix** is a table that visualizes the performance of a classification model.
| | Predicted Positive | Predicted Negative |
| :--- | :--- | :--- |
| **Actually Positive** | True Positive (TP) | False Negative (FN) |
| **Actually Negative** | False Positive (FP) | True Negative (TN) |
Key Metrics: Precision and Recall
Based on the confusion matrix, we can define two more meaningful metrics:
Precision: Of all samples predicted as positive, how many were actually positive?
\[ \large{\text{Precision} = \frac{TP}{TP + FP}} \]
This measures how ‘correct’ the positive predictions are.
Recall (or Sensitivity): Of all samples that were actually positive, how many did the model successfully find?
\[ \large{\text{Recall} = \frac{TP}{TP + FN}} \]
This measures how ‘complete’ the positive predictions are.
In fraud detection, we care deeply about recall (we don’t want to miss any fraudulent transactions).
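A tiny sketch computing both metrics with scikit-learn on hand-made labels (illustrative values):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made example: 1 = fraud (positive class), 0 = legitimate
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f'TP={tp}, FP={fp}, FN={fn}, TN={tn}')
print('Precision =', precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4
print('Recall    =', recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4
```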
Solution 1: Data-Level Resampling
The most direct approach is to fix the imbalance at the data level. There are two main strategies: undersampling the majority class and oversampling the minority class.
Pros and Cons of Resampling Methods
Undersampling
Pro: Faster training time.
Con: May discard important information from the majority class.
Oversampling
Pro: No information loss.
Con: May lead to overfitting on the minority class.
Popular Algorithm: SMOTE (Synthetic Minority Over-sampling Technique).
SMOTE Creates Synthetic Minority Samples
SMOTE is one of the most effective oversampling techniques.
Core Idea: For each minority sample, find its k-nearest neighbors (which are also minority samples). Then, randomly pick a point along the line segment connecting the sample to one of its neighbors and create a new, synthetic sample there.
This is like ‘interpolating’ within the minority class region, creating new data that is similar to the original data but not identical, which helps expand the decision region for the minority class.
How the SMOTE Algorithm Works
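The interpolation step at the heart of SMOTE is simple. Below is a minimal NumPy sketch of generating one synthetic sample between a minority point and one of its neighbors (hand-picked points; the full algorithm also performs the k-nearest-neighbor search).

```python
import numpy as np

rng = np.random.default_rng(0)

x_i = np.array([2.0, 3.0])         # a minority-class sample (illustrative values)
x_neighbor = np.array([2.5, 3.8])  # one of its k nearest minority neighbors

# The new synthetic point lies somewhere on the segment between the two
gap = rng.uniform(0, 1)
x_new = x_i + gap * (x_neighbor - x_i)
print('Synthetic sample:', np.round(x_new, 3))
```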
Practice: Improving Imbalanced Classification with SMOTE
We’ll create a highly imbalanced dataset and compare the performance of logistic regression before and after applying SMOTE.
Task:
1. Create a dataset with a 95% vs. 5% class imbalance.
2. Train a logistic regression model on the original data and evaluate.
3. Use SMOTE from the imbalanced-learn library to oversample the training data.
4. Train a new model on the resampled data and evaluate.
```python
# Ensure imbalanced-learn is installed: pip install imbalanced-learn
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

# 1. Create imbalanced data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, weights=[0.95, 0.05],
                           random_state=42)
print(f'Original data distribution: {Counter(y)}')
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

# 2. Model without SMOTE
model_plain = LogisticRegression(solver='liblinear', random_state=42)
model_plain.fit(X_train, y_train)
print('\n--- Classification Report without SMOTE (minority class is 1) ---')
print(classification_report(y_test, model_plain.predict(X_test)))

# 3. Model with SMOTE (using a pipeline to prevent data leakage)
pipeline_smote = make_pipeline(SMOTE(random_state=42),
                               LogisticRegression(solver='liblinear', random_state=42))
pipeline_smote.fit(X_train, y_train)
print('\n--- Classification Report with SMOTE ---')
print(classification_report(y_test, pipeline_smote.predict(X_test)))
```
Observation: After using SMOTE, the recall for the minority class (1) increased dramatically from 0.62 to 0.85, at the cost of a slight drop in precision. This is often the desired trade-off.
Solution 2: Algorithm-Level Adjustments
Besides modifying the data, we can also adjust the learning algorithm itself.
Adjusting Class Weights: We can assign a higher penalty for misclassifying minority class samples in the loss function.
For example, setting class_weight='balanced' in scikit-learn models.
This forces the model to pay more attention to correctly classifying the minority class during optimization.
Changing the Decision Threshold: Logistic regression uses a 0.5 probability threshold by default. For imbalanced problems, we can lower this threshold (e.g., to 0.3) to increase recall for the minority class.
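A sketch of both adjustments with scikit-learn on synthetic imbalanced data (the dataset parameters and the 0.3 threshold are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

# 1. Penalize minority-class mistakes more heavily via class weights
clf_weighted = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_train, y_train)
print('Recall with class_weight="balanced":',
      round(recall_score(y_test, clf_weighted.predict(X_test)), 3))

# 2. Keep the default model but lower the decision threshold from 0.5 to 0.3
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred_03 = (clf.predict_proba(X_test)[:, 1] >= 0.3).astype(int)
print('Recall with a 0.3 threshold:', round(recall_score(y_test, y_pred_03), 3))
```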
Chapter Summary: A Unified View of Linear Models
We’ve explored the entire ‘family’ of linear models, but they all share a unified underlying philosophy:
Core Engine: Every model starts with the linear function \(\mathbf{w}^T\mathbf{x} + b\).
Task Adaptation:
Regression: Use the linear output directly.
Classification: Map the output to probabilities using Sigmoid/Softmax.
SVM: Focus on the geometric margin around the output.
Optimization Goal: Learn the optimal weights \(\mathbf{w}\) by defining different loss functions and regularization terms.
Loss Functions: Mean Squared Error, Cross-Entropy, Hinge Loss (SVM).
Regularization: L1 and L2 norms.
The Linear Model Family at a Glance
This diagram summarizes the models we’ve discussed, showing how different combinations of loss functions and regularization penalties lead to different models.