06: Non-linear Models

The Core Question of This Chapter

What should we do when the real world isn’t linear?

Most of our studies so far have focused on linear models, which are the cornerstones of econometrics and machine learning.

However, economic and financial relationships in the real world are often complex and non-linear.

The Limitations of Linear Models: An Intuitive Example

Suppose we use linear regression to fit a dataset that clearly has a curved relationship.

Figure: Linear model vs. non-linear data. A straight regression line is fitted to points that follow a parabolic curve; the highlighted residuals show regions that are systematically under- and overestimated.

Observation: The model systematically underestimates at both ends and overestimates in the middle. The Mean Squared Error (MSE) is high.

Non-linear Relationships in the Real World

  • Credit Default Risk: The relationship between income and default rate may not be linear. Both very low and very high-income individuals might have low risk, while middle-income groups could have higher risk due to over-leveraging.
  • Asset Returns: An asset’s relationship with market factors (its Beta) might change with market volatility, exhibiting non-linearity.
  • Consumer Behavior: The effect of price discounts on sales volume diminishes over time, following a saturation curve rather than a straight line.

Our Journey of Discovery: Three Non-linear Paths

In this chapter, we will systematically learn three powerful classes of non-linear models to overcome the limitations of linear ones.

Figure: Roadmap for exploring non-linear models. Starting from linear models, we follow three paths: (1) Piecewise Linear: approximate with lines; (2) QDA: directly fit a curve; (3) Kernel Trick: lift the dimension to turn the curve into a line.

Part 1: Piecewise Linear Discriminant

Piecewise Linear Discriminant: The ‘Brute Force’ Approach

Core Idea: If one straight line doesn’t work, let’s use multiple straight lines to piece together an approximation.

This strategy typically involves two steps:

  1. Subclass Partitioning: Further divide one of the original classes (e.g., the blue dots) into multiple subclasses, such that each subclass is linearly separable from the other classes.
  2. Class Merging: When making the final prediction, merge the results from all subclasses that belong to the same original parent class.

Geometric Intuition (1/3): The Initial Problem

First, we have a dataset that cannot be separated by a single straight line.

Figure: The initial non-linear classification problem. Purple ‘At Risk’ (Class 0) and blue ‘Good Credit’ (Class 1) customers form a moon shape when plotted against Customer Metric 1 and Customer Metric 2; no single straight line separates them.

Geometric Intuition (2/3): Partition and Conquer

We treat the ‘Good Credit’ (blue) customer group as two potential subgroups and then use a different line to separate each subgroup from the ‘At Risk’ group.

  • Line 1: Separates the purple group and the blue ‘bottom-left’ subgroup.
  • Line 2: Separates the purple group and the blue ‘top-right’ subgroup.
Figure: The piecewise linear discriminant idea. The purple and blue point sets are separated by two independent dashed lines, Line 1 and Line 2.

Geometric Intuition (3/3): The Final Boundary

The final decision boundary is a ‘V’ shape, formed by combining parts of these two lines.

Figure: The final piecewise linear boundary. Parts of the two lines combine into a V-shaped decision boundary that separates the data points.

Pros and Cons of Piecewise Linear Discriminant

| Pros | Cons |
|---|---|
| Simple idea, easy to understand | How to partition subclasses is a key challenge |
| Can leverage existing linear model algorithms | Number of subclasses is a hyperparameter |
| Can fit any complex shape with enough segments | Prone to overfitting |
| Conceptually similar to models like decision trees | Lacks ‘holistic’ model elegance |

Conclusion: Piecewise linear discriminant is an effective ‘heuristic’ method, but in practice, we often seek more systematic and elegant solutions.

Part 2: Quadratic Discriminant Analysis (QDA)

QDA: Directly Fitting the Curve

Quadratic Discriminant Analysis (QDA) doesn’t piece together lines; it directly constructs a quadratic decision boundary.

For our moon-shaped data, an elegant parabola can perform the classification task very well.

Figure: QDA's quadratic boundary. A parabola elegantly separates the purple and blue data points.

The Form of the QDA Discriminant Function

The QDA discriminant function \(f(\mathbf{x}_n)\) is a quadratic function of the input vector \(\mathbf{x}_n\):

\[ \large{ f(\mathbf{x}_n, \mathbf{W}, \mathbf{w}, b) = \mathbf{x}_n^T \mathbf{W} \mathbf{x}_n + \mathbf{w}^T \mathbf{x}_n + b } \]

  • \(\mathbf{x}_n \in \mathbb{R}^d\): a d-dimensional feature vector.
  • \(\mathbf{W} \in \mathbb{R}^{d \times d}\): The quadratic term coefficient matrix, which ‘bends’ the decision boundary.
  • \(\mathbf{w} \in \mathbb{R}^d\): The linear term coefficient vector, which ‘shifts’ the decision boundary.
  • \(b \in \mathbb{R}\): The bias term.
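
As a quick numerical illustration (the values of \(\mathbf{W}\), \(\mathbf{w}\), and \(b\) below are hand-picked assumptions, not fitted parameters), setting \(\mathbf{W} = \mathbf{I}\), \(\mathbf{w} = \mathbf{0}\), \(b = -1\) turns the boundary \(f(\mathbf{x}) = 0\) into the unit circle \(x_1^2 + x_2^2 = 1\):

import numpy as np

# Illustrative (not fitted) parameters: W = I, w = 0, b = -1
W = np.eye(2)
w = np.zeros(2)
b = -1.0

def quadratic_discriminant(x, W, w, b):
    """Evaluate f(x) = x^T W x + w^T x + b."""
    return x @ W @ x + w @ x + b

# A point inside, on, and outside the unit circle
for x in [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 0.0])]:
    print(x, quadratic_discriminant(x, W, w, b))
# f is negative inside, zero on, and positive outside the boundary x1^2 + x2^2 = 1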

The Core of QDA: The Quadratic Term Matrix \(\mathbf{W}\)

In the discriminant function, the term that gives QDA its non-linear power is the quadratic term \(\mathbf{x}_n^T \mathbf{W} \mathbf{x}_n\).

  • If \(\mathbf{W} = \mathbf{0}\), QDA degenerates into the familiar Linear Discriminant Analysis (LDA). The decision boundary is linear.
  • If \(\mathbf{W} \neq \mathbf{0}\), the decision boundary \(f(\mathbf{x}_n) = 0\) is a quadratic surface (in 2D space, this is a conic section like an ellipse, parabola, or hyperbola).

The Geometric Meaning of \(\mathbf{W}\): It Shapes the Boundary

Different \(\mathbf{W}\) matrices can generate various shapes of quadratic surface boundaries.

Figure: Possible boundary shapes from QDA. Depending on \(\mathbf{W}\), the decision boundary can be an ellipse, a parabola, or a hyperbola.

The ‘Cost’ of QDA: An Explosion in Parameters

This flexibility comes at a cost. Let’s count the number of parameters in the model:

  • Matrix \(\mathbf{W}\): Since \(\mathbf{W}\) can be assumed to be symmetric, it has \(d(d+1)/2\) independent parameters.
  • Vector \(\mathbf{w}\): Has \(d\) parameters.
  • Scalar \(b\): Has \(1\) parameter.

The total number of parameters is \(\frac{d(d+1)}{2} + d + 1\), which is on the order of \(O(d^2)\).

Consequences of Parameter Explosion

| Feature Dimension (d) | QDA Parameter Count (approx. d²/2) |
|---|---|
| 2 | 6 |
| 10 | 66 |
| 50 | 1,326 |
| 100 | 5,151 |
| 1000 | ~500,000 |

This leads to two serious practical problems:

  1. Huge Computational Load: Solving a model with \(d^2\)-level parameters is very time-consuming.
  2. Large Sample Size Requirement: To robustly estimate so many parameters, a very large sample size is needed, otherwise, the model is highly prone to overfitting.

Understanding QDA from a Probabilistic View: A More Practical Path

In practice, we don’t directly fit that massive \(\mathbf{W}\) matrix. Instead, we derive QDA through the lens of a probabilistic generative model.

Core Assumption: We assume that the data for each class \(k\) follows a Multivariate Gaussian Distribution.

\[ \large{ p(\mathbf{x} | y=k) = \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) } \]

  • \(\boldsymbol{\mu}_k\): The mean vector for class \(k\) (the center of the distribution).
  • \(\boldsymbol{\Sigma}_k\): The covariance matrix for class \(k\) (the shape and orientation of the distribution).
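
Under this assumption, ‘training’ QDA reduces to estimating a mean vector, a covariance matrix, and a class prior for each class. A minimal NumPy sketch on the moons data used in the case study below (variable names are illustrative):

import numpy as np
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

means, covs, priors = {}, {}, {}
for k in np.unique(y):
    X_k = X[y == k]
    means[k] = X_k.mean(axis=0)              # mu_k: class mean vector
    covs[k] = np.cov(X_k, rowvar=False)      # Sigma_k: class covariance matrix
    priors[k] = len(X_k) / len(X)            # pi_k: class prior probability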

The Key Difference Between QDA and LDA: The Covariance Matrix

Both QDA and LDA assume the data follows a Gaussian distribution, but they differ in their assumptions about the covariance matrix.

  • LDA (Linear Discriminant Analysis): Assumes that the covariance matrices for all classes are the same: \(\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \dots = \boldsymbol{\Sigma}_K = \boldsymbol{\Sigma}\). This strong assumption leads to a linear decision boundary.

  • QDA (Quadratic Discriminant Analysis): Allows each class to have its own distinct covariance matrix: \(\boldsymbol{\Sigma}_i \neq \boldsymbol{\Sigma}_j\) for \(i \neq j\). This more flexible assumption is precisely the source of the quadratic decision boundary.

Visual Comparison: Assumptions of LDA vs. QDA

Figure: LDA vs. QDA assumptions. Left, LDA: shared covariance (Σ₁ = Σ₂), so same-shaped contours yield a linear boundary. Right, QDA: separate covariances (Σ₁ ≠ Σ₂), so differently shaped contours yield a quadratic boundary.

Derivation of the QDA Discriminant Function

Using Bayes’ theorem, the posterior probability is \(P(y=k|\mathbf{x}) \propto p(\mathbf{x}|y=k) P(y=k)\). By taking the logarithm and expanding the PDF of the Gaussian distribution, we get the discriminant function (ignoring constant terms):

\[ \large{ \delta_k(\mathbf{x}) = -\frac{1}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) + \log \pi_k } \]

This is a quadratic function of \(\mathbf{x}\)! The quadratic term comes from \(\mathbf{x}^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x}\).
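
To connect this back to the earlier \(f(\mathbf{x}_n, \mathbf{W}, \mathbf{w}, b)\) form, expand the quadratic form and collect terms (per-class quantities carry the subscript \(k\)):

\[ \large{ \delta_k(\mathbf{x}) = \mathbf{x}^T \underbrace{\left( -\tfrac{1}{2} \boldsymbol{\Sigma}_k^{-1} \right)}_{\mathbf{W}_k} \mathbf{x} + \underbrace{\left( \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k \right)^T}_{\mathbf{w}_k^T} \mathbf{x} + \underbrace{\left( -\tfrac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k - \tfrac{1}{2} \log|\boldsymbol{\Sigma}_k| + \log \pi_k \right)}_{b_k} } \]

So the quadratic coefficient matrix is simply \(\mathbf{W}_k = -\frac{1}{2}\boldsymbol{\Sigma}_k^{-1}\): it comes for free from the estimated class covariance, which is why we never need to fit \(\mathbf{W}\) directly.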

What is Mahalanobis Distance?

The core of the QDA discriminant function is the squared Mahalanobis distance:

\[ \large{ D_M^2(\mathbf{x}, \boldsymbol{\mu}_k) = (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) } \]

  • Intuitive Understanding: It measures the ‘statistical distance’ from a point \(\mathbf{x}\) to the center of a distribution \(\boldsymbol{\mu}_k\).
  • It takes into account the covariance structure \(\boldsymbol{\Sigma}_k\) of the data itself. If the data varies greatly in a certain direction (high variance), distances in that direction are ‘shrunk’.
  • QDA essentially classifies a sample point to the class with the smallest Mahalanobis distance.
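
A short numerical illustration (the covariance matrix and the two points are assumptions chosen for clarity): two points at the same Euclidean distance from the center can have very different Mahalanobis distances.

import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],    # large variance along x1
                  [0.0, 0.25]])  # small variance along x2
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x, mu, Sigma_inv):
    d = x - mu
    return d @ Sigma_inv @ d

A = np.array([2.0, 0.0])  # along the high-variance direction
B = np.array([0.0, 2.0])  # along the low-variance direction

print(np.linalg.norm(A - mu), np.linalg.norm(B - mu))                      # Euclidean: 2.0 and 2.0
print(mahalanobis_sq(A, mu, Sigma_inv), mahalanobis_sq(B, mu, Sigma_inv))  # Mahalanobis squared: 1.0 vs 16.0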

Visualizing: Euclidean vs. Mahalanobis Distance

Figure: Intuition vs. form for the Mahalanobis distance. Points A and B are equidistant from the mean in Euclidean terms, \(d(A,\boldsymbol{\mu}) \approx d(B,\boldsymbol{\mu})\), yet B is a statistical outlier once the data's covariance is taken into account.

  • Formal definition: Euclidean \(d(\mathbf{x},\boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^T(\mathbf{x}-\boldsymbol{\mu})}\); Mahalanobis \(D_M(\mathbf{x},\boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}\).
  • Key difference: the inverse covariance \(\boldsymbol{\Sigma}^{-1}\) ‘rescales’ the space and removes correlations.
  • Intuitive explanation: Point A sits in the mainstream region of the data, so a perturbation \(\Delta d\) is normal fluctuation; Point B, though equidistant in Euclidean terms, lies on the edge of the distribution, so the same perturbation \(\Delta d\) makes it an extreme outlier.

Conclusion: Point A lies along the ‘major axis’ of the data distribution. Although its Euclidean distance from the mean is just as large as B's, it is statistically ‘closer’.

QDA Decision Rule

For a two-class problem (classes \(\omega_1, \omega_2\)), we can define a discriminant function \(f_i(\mathbf{x})\):

\[ \large{ f_i(\mathbf{x}) = \log\pi_i - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) } \]

Decision Rule:

  • If \(f_1(\mathbf{x}) \geq f_2(\mathbf{x})\), predict class \(\omega_1\).
  • Otherwise, predict class \(\omega_2\).

Since each class’s \(f_i(\mathbf{x})\) has its own covariance matrix \(\boldsymbol{\Sigma}_i\), the decision boundary \(f_1(\mathbf{x}) = f_2(\mathbf{x})\) is a quadratic function.
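
Putting the pieces together, a minimal ‘manual QDA’ classifier can be sketched as follows (it reuses the illustrative means, covs, and priors dictionaries estimated in the earlier sketch):

import numpy as np

def qda_discriminant(x, mu, Sigma, prior):
    """f_i(x) = log(pi_i) - 0.5*log|Sigma_i| - 0.5*(x - mu_i)^T Sigma_i^{-1} (x - mu_i)"""
    d = x - mu
    return (np.log(prior)
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.solve(Sigma, d))

def qda_predict(x, means, covs, priors):
    scores = {k: qda_discriminant(x, means[k], covs[k], priors[k]) for k in means}
    return max(scores, key=scores.get)  # predict the class with the largest discriminant value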

Case Study: QDA for Customer Classification in Python

Let’s apply LDA and QDA to the moons dataset using scikit-learn to visually compare their differences.

Objective:

  1. Train an LDA model.
  2. Train a QDA model.
  3. Visualize and compare their decision boundaries.

Preparing Data and Environment

We will use the LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis classes from scikit-learn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from matplotlib.colors import ListedColormap

# Use previously generated data
X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

# Define a function to plot the decision boundary
def plot_decision_boundary(clf, X, y, ax, title):
    """A helper function to plot classifier decision boundaries"""
    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    # create a meshgrid of features
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # predict the class for each point in the mesh
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # define color map
    cm_bright = ListedColormap(['#FF9999', '#9999FF'])
    
    # plot the decision regions and data points
    ax.contourf(xx, yy, Z, cmap=cm_bright, alpha=.5)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(['mediumpurple', 'dodgerblue']), edgecolors='k')
    ax.set_title(title, fontsize=16)
    ax.set_xlabel('Customer Metric 1')
    ax.set_ylabel('Customer Metric 2')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

LDA’s Failure: A Single Line Isn’t Enough

We first apply LDA. As expected, LDA tries to find the best straight line to separate the data, but it performs poorly on this ‘moon-shaped’ dataset.

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

fig, ax = plt.subplots(1, 1, figsize=(9, 5.5))
plot_decision_boundary(lda, X, y, ax, 'Decision Boundary of Linear Discriminant Analysis (LDA)')
plt.show()
Figure 1: LDA attempts to separate non-linear data with a straight line, with poor results.

QDA’s Success: The Curve Fits Perfectly

Now, we apply QDA. Since QDA can learn a different covariance structure for each class, it is able to generate a flexible quadratic curve boundary that perfectly separates the two ‘moons’.

qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)

fig, ax = plt.subplots(1, 1, figsize=(9, 5.5))
plot_decision_boundary(qda, X, y, ax, 'Decision Boundary of Quadratic Discriminant Analysis (QDA)')
plt.show()
Figure 2: QDA generates a quadratic curve boundary, successfully solving the non-linear classification problem.

QDA Summary: When Should You Use It?

  • When to Use: QDA is a great choice when you believe that data from different classes have different covariance structures. It offers a good balance between LDA and more complex non-parametric methods (like K-Nearest Neighbors).
  • Advantages: The model form is explicit, and it’s less computationally expensive than many non-linear methods (like kernel SVM).
  • Disadvantages: It relies on a strong assumption of Gaussian distribution. When the feature dimension \(d\) is high, the number of parameters becomes very large, requiring substantial data to avoid overfitting and increasing computational cost.

Part 3: The Kernel Trick

The Kernel Method: A ‘Game-Changing’ Idea

While QDA is effective, it is limited to quadratic functions. What if we want to fit more complex boundaries?

The Kernel Method provides an extremely elegant and powerful idea:

Instead of learning a complex non-linear model in the original low-dimensional space, we map the data via a non-linear function \(\phi(\mathbf{x})\) to a higher-dimensional ‘feature space’ and then learn a simple linear model in that high-dimensional space.

This idea, often called the ‘Kernel Trick’, is a cornerstone of modern machine learning.

Core Intuition: Lifting the Dimension (1/3)

Imagine some data points in a one-dimensional space that cannot be separated by a single ‘point’ (a 0-dimensional hyperplane).

Figure: One-dimensional, linearly inseparable data. Blue and purple dots lie on a line and cannot be separated by any single point.

Core Intuition: Lifting the Dimension (2/3)

We define a simple non-linear mapping \(\phi(x) = (x, x^2)\), which maps the one-dimensional data point \(x\) into a two-dimensional space.

Figure: The mapping from 1D to 2D. Each point x is lifted onto a parabola via φ(x) = (x, x²).

Core Intuition: Lifting the Dimension (3/3)

A miracle happens! In this new 2D space, the data points become separable by a straight line.

Figure: Linear separability in the 2D space. After the mapping, a straight line separates the previously inseparable points.
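
A tiny numerical version of this intuition (the sample points, labels, and threshold are illustrative assumptions):

import numpy as np

x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = (np.abs(x) > 1).astype(int)   # class 1 sits on both sides of class 0: no single point separates them

phi = np.column_stack([x, x**2])  # lift each point to (x, x^2)
print(phi[y == 0, 1].max(), phi[y == 1, 1].min())
# 0.25 < 2.25, so any horizontal line x^2 = c with 0.25 < c < 2.25 separates the classes in 2D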

Complete Flowchart of the Kernel Method

Figure: The kernel trick flowchart. Data that is non-linearly separable in the input space becomes linearly separable in a higher-dimensional feature space via the mapping φ(x).

The New Home After Mapping: Hilbert Space

The high-dimensional space where the mapping \(\phi(\mathbf{x})\) leads is mathematically known as a feature space. For our mathematical tools (like distance, angle) to work properly, we require this space to be a Hilbert Space.

For economics students, you don’t need to delve into its rigorous mathematical definition. You can understand it intuitively as:

A Hilbert space is a well-behaved, possibly infinite-dimensional generalization of Euclidean space.

Here, we can confidently compute lengths, distances, and angles, just as in ordinary space.

Key Property of Hilbert Space: The Inner Product

The most important property of a Hilbert space is that it defines an Inner Product, denoted as \(\langle \cdot, \cdot \rangle_{\mathcal{H}}\).

The inner product is a generalization of the familiar dot product: \[ \large{ \langle \mathbf{x}_1, \mathbf{x}_2 \rangle = \mathbf{x}_1^T \mathbf{x}_2 = \sum_{i=1}^d x_{1,i} x_{2,i} } \]

With an inner product, we can define Norm (length) and Distance, just as in Euclidean space. The inner product also contains information about the angle between vectors, allowing us to measure similarity.

\[ \large{ \langle \mathbf{x}_1, \mathbf{x}_2 \rangle = ||\mathbf{x}_1|| \cdot ||\mathbf{x}_2|| \cos\theta } \]

The ‘Magic’ of the Kernel Trick

Recall that the final computations of many linear algorithms (like SVM, PCA, linear regression) can be expressed solely in terms of the inner products of sample points, \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\).

So, if we map the data to a high-dimensional space \(\phi(\mathbf{x})\), the algorithm would need to compute the inner product after the mapping: \(\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}\).

The Problem:

  1. The mapping function \(\phi(\mathbf{x})\) can be very complex, and we might not even know its explicit form.
  2. The high-dimensional space could have a very high, or even infinite, dimension, making a direct computation of the inner product impossible.

The Kernel Function: A Shortcut to High Dimensions

The Kernel Function \(K(\mathbf{x}_i, \mathbf{x}_j)\) is defined to solve this very problem:

A function \(K(\mathbf{x}_i, \mathbf{x}_j)\) is a kernel function if it can be written as the inner product of some mapping \(\phi\) in a high-dimensional space, i.e., \(K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}\).

The essence of the kernel trick: We can directly compute the value of \(K(\mathbf{x}_i, \mathbf{x}_j)\) in the original low-dimensional space, without ever needing to know what \(\phi\) is or going to that high-dimensional space. This is equivalent to computing the inner product in the high-dimensional space, thus avoiding the ‘curse of dimensionality’.

Example: A Simple Quadratic Kernel

Suppose our original data is two-dimensional: \(\mathbf{x} = (x_1, x_2)\). Consider a simple kernel function: \(K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T \mathbf{x}_j)^2\).

Let’s expand it: \[ \begin{aligned} \large{ K(\mathbf{x}_i, \mathbf{x}_j) } &= \large{ (x_{i1}x_{j1} + x_{i2}x_{j2})^2 } \\ &= \large{ x_{i1}^2x_{j1}^2 + x_{i2}^2x_{j2}^2 + 2x_{i1}x_{j1}x_{i2}x_{j2} } \\ &= \large{ \langle (x_{i1}^2, x_{i2}^2, \sqrt{2}x_{i1}x_{i2}), (x_{j1}^2, x_{j2}^2, \sqrt{2}x_{j1}x_{j2}) \rangle } \end{aligned} \] This shows that this simple kernel implicitly corresponds to a mapping from 2D to 3D: \(\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1x_2)\).

We obtained the inner product in the high-dimensional space using nothing more than a dot product in the original space followed by a squaring operation!
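
This identity is easy to verify numerically; a minimal check (the two sample points are arbitrary):

import numpy as np

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])

# Kernel evaluated directly in the original 2D space
k_direct = (xi @ xj) ** 2

# Explicit mapping phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), then an ordinary dot product in 3D
phi = lambda x: np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])
k_mapped = phi(xi) @ phi(xj)

print(k_direct, k_mapped)  # both equal 16.0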

How to Identify a Valid Kernel Function?

What conditions must a function \(K(\mathbf{x}_i, \mathbf{x}_j)\) satisfy to guarantee that there is always an underlying Hilbert space and a mapping \(\phi\)?

Mercer’s Theorem provides the answer:

For any finite set of data points \(\{\mathbf{x}_1, \dots, \mathbf{x}_N\}\), if the Gram matrix \(\mathbf{K}\) computed from the kernel function is always positive semi-definite, then \(K\) is a valid kernel function.

What is a Gram Matrix?

The Gram matrix \(\mathbf{K}\) is an \(N \times N\) symmetric matrix where each element \(K_{ij}\) is the value computed by the kernel function on the \(i\)-th and \(j\)-th data points:

\[ \large{ \mathbf{K} = \begin{pmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & K(\mathbf{x}_1, \mathbf{x}_2) & \dots & K(\mathbf{x}_1, \mathbf{x}_N) \\ K(\mathbf{x}_2, \mathbf{x}_1) & K(\mathbf{x}_2, \mathbf{x}_2) & \dots & K(\mathbf{x}_2, \mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ K(\mathbf{x}_N, \mathbf{x}_1) & K(\mathbf{x}_N, \mathbf{x}_2) & \dots & K(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix} } \]

Intuitive Understanding: The Gram matrix is a pairwise similarity matrix that describes the relationships between all sample points in the dataset. The positive semi-definite property ensures that this ‘similarity’ is geometrically consistent (not self-contradictory).
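
In practice, we can build a Gram matrix and check positive semi-definiteness numerically. A minimal sketch (it uses scikit-learn's rbf_kernel, introduced a few slides later; the gamma value is arbitrary):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_moons(n_samples=50, noise=0.1, random_state=42)

K = rbf_kernel(X, X, gamma=2.0)   # N x N Gram matrix with K_ij = K(x_i, x_j)
eigvals = np.linalg.eigvalsh(K)   # eigenvalues of the symmetric matrix K
print(eigvals.min() >= -1e-10)    # True: all eigenvalues are (numerically) non-negative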

Common Kernel Functions: Your ‘Arsenal’

Fortunately, we don’t need to verify Mercer’s theorem ourselves every time. There are many ready-to-use, proven kernel functions available.

Basic Principle for Choosing a Kernel: The more similar the sample attribute vectors \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are, the larger the value of the kernel function \(K(\mathbf{x}_i, \mathbf{x}_j)\) should be.

Next, we will introduce some of the most commonly used kernel functions.

Common Kernel 1: Linear Kernel

\[ \large{ K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j } \]

  • Essence: It’s simply the dot product in the original space.
  • Corresponding Map: \(\phi(\mathbf{x}) = \mathbf{x}\), meaning no mapping is performed.
  • Application: When the data is already linearly separable, an SVM with a linear kernel is the standard linear SVM. This is the simplest and fastest baseline.

Common Kernel 2: Polynomial Kernel

\[ \large{ K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma \mathbf{x}_i^T \mathbf{x}_j + r)^d } \]

  • Parameters:
    • \(d\): The degree of the polynomial.
    • \(\gamma\): A scaling coefficient.
    • \(r\): A constant term coefficient.
  • Essence: Implicitly maps data to a feature space that includes all interaction terms up to degree \(d\).
  • Application: Very effective for problems where interactions between features are important. For instance, in financial risk control, \(x_1 \times x_2\) (product of income and debt ratio) might be a stronger predictor than \(x_1\) or \(x_2\) alone.

Common Kernel 3: Gaussian (RBF) Kernel

The Radial Basis Function (RBF) Kernel, also commonly known as the Gaussian kernel, is the most frequently used and most powerful kernel function.

\[ \large{ K(\mathbf{x}_i, \mathbf{x}_j) = \exp \left( -\frac{||\mathbf{x}_i - \mathbf{x}_j||^2}{2\sigma^2} \right) = \exp(-\gamma ||\mathbf{x}_i - \mathbf{x}_j||^2) } \]

  • Parameter:
    • \(\sigma\) (or \(\gamma = 1/(2\sigma^2)\)): Controls the ‘width’ of the kernel.
  • Essence: Similarity is determined entirely by the Euclidean distance between two points. The closer two points are, the closer the kernel value is to 1; the farther apart, the closer to 0.
  • Corresponding Map: This is a very powerful kernel because it corresponds to an infinite-dimensional feature space.
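
A minimal sanity check of this definition, comparing a hand-rolled computation against scikit-learn's rbf_kernel (the points and gamma are arbitrary):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

xi = np.array([[0.0, 0.0]])
xj = np.array([[1.0, 1.0]])
gamma = 0.5

k_manual = np.exp(-gamma * np.sum((xi - xj) ** 2))  # exp(-gamma * ||xi - xj||^2)
k_sklearn = rbf_kernel(xi, xj, gamma=gamma)[0, 0]

print(k_manual, k_sklearn)  # both ~0.3679 = exp(-1); nearby points give values near 1, distant points near 0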

Intuitive Understanding of the RBF Kernel

You can think of each data point as placing a Gaussian ‘hill’ in space. The decision boundary of an RBF kernel is formed by combining the contour lines of these ‘hills’.

Figure: How γ determines the RBF kernel's ‘influence radius’. Since \(K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma d^2)\), a larger γ means the influence decays faster with distance \(d\): a large γ gives a narrow, ‘sharp’ influence, while a small γ gives a wide, ‘broad’ influence.
  • Small \(\gamma\) (Large \(\sigma\)): The kernel has a large radius, giving it a wide range of influence. The decision boundary is very smooth.
  • Large \(\gamma\) (Small \(\sigma\)): The kernel has a small radius, limiting the influence of each data point. The decision boundary becomes very complex and wiggly, and is prone to overfitting.

Conclusion: If you have no prior knowledge about the data’s structure, the RBF kernel is usually the default first choice.

Common Kernel 4: Sigmoid Kernel

\[ \large{ K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\alpha \mathbf{x}_i^T \mathbf{x}_j + \beta) } \]

  • Parameters: \(\alpha\) and \(\beta\).
  • Origin: Its form is inspired by the activation functions used in neural networks.
  • Caution: The Sigmoid kernel only satisfies Mercer’s condition for certain parameter values, making it less commonly used in practice than RBF and polynomial kernels.

Application: Kernelized Learning Algorithms

Applying the Kernel Method: ‘Kernelization’ of Algorithms

The beauty of the kernel trick lies in its universality. Any algorithm whose computations can be expressed entirely in terms of inner products between data points can be ‘kernelized’ by replacing the inner product with a kernel function \(K(\mathbf{x}_i, \mathbf{x}_j)\).

\[ \large{ \langle \mathbf{x}_i, \mathbf{x}_j \rangle \quad \xrightarrow{\text{Kernelize}} \quad K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}} } \]

This allows us to seamlessly upgrade linear algorithms, endowing them with powerful non-linear processing capabilities.

Classic Applications

We will introduce two classic applications:

  1. Kernel Principal Component Analysis (Kernel PCA)
  2. Kernel Support Vector Machine (Kernel SVM)

Application 1: Kernel Principal Component Analysis (KPCA)

Limitations of PCA

Review of PCA:

  1. Compute the covariance matrix of the data: \(\mathbf{\Sigma} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \mathbf{x}_n^T\) (assuming data is centered).
  2. Perform eigenvalue decomposition on \(\mathbf{\Sigma}\) to find the principal eigenvectors (principal components).

Problem: PCA can only discover linear structures in the data. For non-linear structures, PCA fails.

The Idea Behind KPCA

The idea of KPCA:

  1. Map the data to a high-dimensional feature space: \(\mathbf{x}_n \to \phi(\mathbf{x}_n)\).
  2. Compute the covariance matrix in the high-dimensional space: \(\mathbf{\Sigma}_\phi = \frac{1}{N} \sum_{n=1}^N \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T\).
  3. Perform eigenvalue decomposition on \(\mathbf{\Sigma}_\phi\).

The Computational Trick of KPCA

Directly computing \(\mathbf{\Sigma}_\phi\) is infeasible, as it could be an infinite-dimensional matrix.

Fortunately, through a series of mathematical derivations, it can be shown that an eigenvector \(\mathbf{v}\) in the high-dimensional space can be expressed as a linear combination of the mapped data points: \(\mathbf{v} = \sum_{n=1}^N \alpha_n \phi(\mathbf{x}_n)\).

Ultimately, the problem of finding the eigenvectors is transformed into an eigenvalue problem for the \(N \times N\) Gram matrix \(\mathbf{K}\):

\[ \large{ N\lambda \boldsymbol{\alpha} = \mathbf{K} \boldsymbol{\alpha} } \]

Here, \(\mathbf{K}\) is the kernel matrix we defined earlier, with elements \(K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)\). The entire computation completely avoids the explicit use of \(\phi\).
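
A compact sketch of this computation on the concentric-circles data used below (the RBF kernel, the gamma value, and the scaling convention for the projections are assumptions; scikit-learn's KernelPCA wraps the same steps):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_circles(n_samples=400, factor=.3, noise=.05, random_state=0)
N = X.shape[0]

K = rbf_kernel(X, X, gamma=10.0)                     # Gram matrix
one_n = np.full((N, N), 1.0 / N)
K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n  # center the kernel matrix in feature space

eigvals, eigvecs = np.linalg.eigh(K_c)               # eigenvalues in ascending order
idx = np.argsort(eigvals)[::-1][:2]                  # indices of the two largest eigenvalues
X_kpca = eigvecs[:, idx] * np.sqrt(eigvals[idx])     # one common convention for the projected training data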

Case Study: Non-linear Dimensionality Reduction with KPCA

Let’s look at an example where standard PCA fails, but KPCA succeeds. We’ll use the make_circles dataset from scikit-learn.

Objective:

  1. Generate a ‘concentric circles’ dataset, which clearly has a non-linear structure.
  2. Reduce its dimensionality to one dimension using both standard PCA and KPCA with an RBF kernel.
  3. Compare the results after dimensionality reduction.

Preparing Data and Environment (KPCA)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Generate concentric circles data
X, y = make_circles(n_samples=400, factor=.3, noise=.05, random_state=0)

Visualizing the Original Data: Concentric Circles

This is our original data: a large circle enclosing a smaller one. We want to find a single dimension that can separate these two circles.

Figure 3: The concentric circles dataset cannot be separated by a linear projection.

The Failure of Standard PCA

Standard PCA tries to find the direction of maximum variance for projection. For the concentric circles data, any linear projection will mix the points from both circles together.

Figure 4: After standard PCA reduction, the two classes completely overlap.

The Success of KPCA

KPCA, using an RBF kernel, maps the data into an infinite-dimensional space. In this space, the inner and outer circles become separable. When we project this back to one dimension, we can see the two classes are clearly separated.

Figure 5: After KPCA with an RBF kernel, the two classes are perfectly separated.
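
For reference, a sketch of the kind of scikit-learn code behind these two projections, continuing from the data prepared above (the gamma value is an assumption that works for this dataset):

from sklearn.decomposition import PCA, KernelPCA

# Standard PCA: a linear projection onto one component
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X)

# Kernel PCA with an RBF kernel: a non-linear projection onto one component
kpca = KernelPCA(n_components=1, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)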

Application 2: Kernel Support Vector Machine (KSVM)

Review of Linear SVM

  • Objective: Find a hyperplane that separates two classes of data with the maximum margin.
  • Mathematically, this is equivalent to solving a constrained quadratic optimization problem.

The Dual Problem

  • Through Lagrangian duality, the SVM optimization problem can be converted into its ‘dual form’.
  • Magically, in the dual form, both the optimization objective and the decision function depend only on the inner products of the data points, \(\mathbf{x}_i^T \mathbf{x}_j\).

From SVM to KSVM: A Single Substitution

The dual optimization objective for a linear SVM is: \[ \large{ \max_{\boldsymbol{\lambda}} \sum_{n=1}^N \lambda_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m y_n y_m (\mathbf{x}_m^T \mathbf{x}_n) } \]

To get a Kernel SVM (KSVM), we just need to make one simple substitution: \[ \large{ \mathbf{x}_m^T \mathbf{x}_n \quad \longrightarrow \quad K(\mathbf{x}_m, \mathbf{x}_n) } \]

The final optimization objective becomes: \[ \large{ \max_{\boldsymbol{\lambda}} \sum_{n=1}^N \lambda_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m y_n y_m K(\mathbf{x}_m, \mathbf{x}_n) } \]

This allows the SVM to find the maximum margin hyperplane in a high-dimensional feature space, resulting in a non-linear decision boundary in the original space.
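
This ‘inner products only’ property can be made explicit in scikit-learn by passing a precomputed Gram matrix to the SVM. A minimal sketch (the RBF Gram matrix and the gamma value are illustrative choices):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

K_train = rbf_kernel(X, X, gamma=2.0)     # N x N Gram matrix: all the optimizer ever sees
svc = SVC(kernel='precomputed', C=1.0)
svc.fit(K_train, y)                       # trains on pairwise kernel values, never on raw features

K_test = rbf_kernel(X[:5], X, gamma=2.0)  # kernel values between new points and the training points
print(svc.predict(K_test))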

Case Study: Solving the Moons Problem with KSVM

Let’s return to our original ‘moons’ classification problem. We previously solved it using QDA. Now let’s see how SVMs with different kernels perform.

Objective:

  1. Train an SVM with a linear kernel.
  2. Train an SVM with a polynomial kernel.
  3. Train an SVM with an RBF kernel.
  4. Visualize and compare their decision boundaries.

Preparing Data and Environment (KSVM)

from sklearn.svm import SVC

# Use the previous moons dataset
X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

# Create classifier instances
svc_linear = SVC(kernel='linear', C=1.0)
svc_poly = SVC(kernel='poly', degree=3, C=1.0)
svc_rbf = SVC(kernel='rbf', gamma=2, C=1.0) # gamma=2 works well for this problem

# Train the models
svc_linear.fit(X, y);
svc_poly.fit(X, y);
svc_rbf.fit(X, y);

print("All SVM models have been trained.")
All SVM models have been trained.

KSVM Decision Boundary Visualization

Figure 6: Comparison of different SVM kernels on a non-linear problem.

Observations:

  1. Linear Kernel: Fails to separate the data, just like LDA.
  2. Polynomial Kernel: Successfully fits a curved boundary.
  3. RBF Kernel: Also succeeds, and the boundary appears very smooth and natural.

Chapter Summary

Comparison of Three Non-linear Modeling Philosophies

| Feature / Method | Piecewise Linear Discriminant | Quadratic Discriminant Analysis (QDA) | Kernel Method (Kernel Trick) |
|---|---|---|---|
| Core Idea | Approximate with line segments | Directly fit a quadratic curve | Lift dimension, turn curve into line |
| Flexibility | High (can fit any shape in theory) | Medium (only quadratic forms) | Very High (RBF can fit any shape) |
| Computational Cost | Depends on number of segments | Medium (\(O(d^2)\)) | High (\(O(N^2)\) to \(O(N^3)\)) |
| Interpretability | Medium (like a decision tree) | High (quadratic function) | Low (high-dim space is not intuitive) |
| Main Assumption | No specific distribution assumption | Data follows a Gaussian distribution | No specific distribution assumption |
| Key Parameters | Number of subclasses/segments | None (but relies on Gaussian assumption) | Kernel type and its parameters (gamma, d) |
| Use Case | Intuitive modeling, tree-like scenarios | Different class covariances, low feature dimension | Complex non-linear problems, the powerful default tool |

Key Takeaways

  1. Embrace Non-linearity: The real world is complex. Linear models have their limits, and learning non-linear models is essential.
  2. QDA: Generates quadratic boundaries by assuming different covariance matrices for each class. It’s a natural extension of LDA and strikes a good balance between LDA and more complex models.
  3. Kernel Trick:
    • Core Idea: Lift dimension + Apply a linear model.
    • Uses a kernel function \(K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle\) as a computational shortcut, avoiding the curse of dimensionality.
    • Can ‘kernelize’ any algorithm that relies on inner products (like PCA, SVM), giving it powerful non-linear capabilities.
    • The RBF kernel is the most powerful and commonly used default option.

Thank You!

Q & A