What should we do when the real world isn’t linear?
Most of our studies so far have focused on linear models, which are the cornerstones of econometrics and machine learning.
However, economic and financial relationships in the real world are often complex and non-linear.
The Limitations of Linear Models: An Intuitive Example
Suppose we use linear regression to fit a dataset that clearly has a curved relationship.
Observation: The model systematically underestimates at both ends and overestimates in the middle. The Mean Squared Error (MSE) is high.
Non-linear Relationships in the Real World
Credit Default Risk: The relationship between income and default rate may not be linear. Both very low and very high-income individuals might have low risk, while middle-income groups could have higher risk due to over-leveraging.
Asset Returns: An asset’s relationship with market factors (its Beta) might change with market volatility, exhibiting non-linearity.
Consumer Behavior: The effect of price discounts on sales volume diminishes over time, following a saturation curve rather than a straight line.
Our Journey of Discovery: Three Non-linear Paths
In this chapter, we will systematically learn three powerful classes of non-linear models to overcome the limitations of linear ones.
Part 1: Piecewise Linear Discriminant
Piecewise Linear Discriminant: The ‘Brute Force’ Approach
Core Idea: If one straight line doesn’t work, let’s use multiple straight lines to piece together an approximation.
This strategy typically involves two steps:
Subclass Partitioning: Further divide one of the original classes (e.g., the blue dots) into multiple subclasses, such that each subclass is linearly separable from the other classes.
Class Merging: When making the final prediction, merge the results from all subclasses that belong to the same original parent class.
Geometric Intuition (1/3): The Initial Problem
First, we have a dataset that cannot be separated by a single straight line.
Geometric Intuition (2/3): Partition and Conquer
We imagine the ‘Good Credit’ (blue) customer group as two potential subgroups. Then we use two different lines to separate them.
Line 1: Separates the purple group and the blue ‘bottom-left’ subgroup.
Line 2: Separates the purple group and the blue ‘top-right’ subgroup.
Geometric Intuition (3/3): The Final Boundary
The final decision boundary is a ‘V’ shape, formed by combining parts of these two lines.
Related Idea: Decision Trees
A decision tree model essentially performs a type of piecewise partitioning. It slices the feature space into multiple rectangular regions through a series of ‘splits’ parallel to the coordinate axes.
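As a quick, illustrative sketch (not part of this chapter's main pipeline), a shallow scikit-learn decision tree already carves a non-linear dataset into axis-aligned rectangular regions; the dataset and tree depth below are illustrative choices.

```python
# Illustrative sketch: a depth-3 decision tree partitions the feature space
# with axis-parallel splits, i.e., into rectangular regions.
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=200, noise=0.1, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
print(f"Training accuracy with only axis-parallel splits: {tree.score(X_demo, y_demo):.3f}")
```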
Pros and Cons of Piecewise Linear Discriminant
| Pros | Cons |
|---|---|
| Simple idea, easy to understand | How to partition subclasses is a key challenge |
| Can leverage existing linear model algorithms | Number of subclasses is a hyperparameter |
| Can fit any complex shape with enough segments | Prone to overfitting |
| Conceptually similar to models like decision trees | Lacks ‘holistic’ model elegance |
Conclusion: Piecewise linear discriminant is an effective ‘heuristic’ method, but in practice, we often seek more systematic and elegant solutions.
Part 2: Quadratic Discriminant Analysis (QDA)
QDA: Directly Fitting the Curve
Quadratic Discriminant Analysis (QDA) doesn’t piece together lines; it directly constructs a quadratic decision boundary.
For our moon-shaped data, an elegant parabola can perform the classification task very well.
The Form of the QDA Discriminant Function
The QDA discriminant function \(f(\mathbf{x}_n)\) is a quadratic function of the input vector \(\mathbf{x}_n\):
\[
\large{ f(\mathbf{x}_n, \mathbf{W}, \mathbf{w}, b) = \mathbf{x}_n^T \mathbf{W} \mathbf{x}_n + \mathbf{w}^T \mathbf{x}_n + b }
\]
\(\mathbf{x}_n \in \mathbb{R}^d\): a d-dimensional feature vector.
\(\mathbf{W} \in \mathbb{R}^{d \times d}\): The quadratic term coefficient matrix, which ‘bends’ the decision boundary.
\(\mathbf{w} \in \mathbb{R}^d\): The linear term coefficient vector, which ‘shifts’ the decision boundary.
\(b \in \mathbb{R}\): The bias term.
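As a minimal sketch of how this function is evaluated (the parameter values below are arbitrary placeholders, not fitted values):

```python
# Evaluate f(x) = x^T W x + w^T x + b for a 2-dimensional input.
import numpy as np

W = np.array([[1.0, 0.5],
              [0.5, -2.0]])   # symmetric quadratic-term matrix (placeholder values)
w = np.array([0.3, -1.0])     # linear-term vector (placeholder values)
b = 0.7                       # bias term (placeholder value)

def quadratic_discriminant(x, W, w, b):
    """f(x) = x^T W x + w^T x + b; its sign/ordering drives the class decision."""
    return x @ W @ x + w @ x + b

x_n = np.array([1.0, 2.0])
print(quadratic_discriminant(x_n, W, w, b))   # -6.0 for these placeholder values
```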
The Core of QDA: The Quadratic Term Matrix \(\mathbf{W}\)
In the discriminant function, the term that gives QDA its non-linear power is the quadratic term \(\mathbf{x}_n^T \mathbf{W} \mathbf{x}_n\).
If \(\mathbf{W} = \mathbf{0}\), QDA degenerates into the familiar Linear Discriminant Analysis (LDA). The decision boundary is linear.
If \(\mathbf{W} \neq \mathbf{0}\), the decision boundary \(f(\mathbf{x}_n) = 0\) is a quadratic surface (in 2D space, this is a conic section like an ellipse, parabola, or hyperbola).
The Geometric Meaning of \(\mathbf{W}\): It Shapes the Boundary
Different \(\mathbf{W}\) matrices can generate various shapes of quadratic surface boundaries.
The ‘Cost’ of QDA: An Explosion in Parameters
This flexibility comes at a cost. Let’s count the number of parameters in the model:
Matrix \(\mathbf{W}\): Since \(\mathbf{W}\) can be assumed to be symmetric, it has \(d(d+1)/2\) independent parameters.
Vector \(\mathbf{w}\): Has \(d\) parameters.
Scalar \(b\): Has \(1\) parameter.
The total number of parameters is \(\frac{d(d+1)}{2} + d + 1\), which is on the order of \(O(d^2)\).
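A one-line sanity check of this count, reproducing the figures in the table below:

```python
def qda_param_count(d):
    """Free parameters of the quadratic discriminant: d(d+1)/2 + d + 1."""
    return d * (d + 1) // 2 + d + 1

for d in (2, 10, 50, 100, 1000):
    print(d, qda_param_count(d))   # 6, 66, 1326, 5151, 501501
```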
Consequences of Parameter Explosion
| Feature Dimension (d) | QDA Parameter Count (approx. d²/2) |
|---|---|
| 2 | 6 |
| 10 | 66 |
| 50 | 1,326 |
| 100 | 5,151 |
| 1,000 | ~500,000 |
This leads to two serious practical problems:
Huge Computational Load: Solving a model with \(d^2\)-level parameters is very time-consuming.
Large Sample Size Requirement: Robustly estimating so many parameters requires a very large sample; otherwise, the model is highly prone to overfitting.
Understanding QDA from a Probabilistic View: A More Practical Path
In practice, we don’t directly fit that massive \(\mathbf{W}\) matrix. Instead, we derive QDA through the lens of a probabilistic generative model.
Core Assumption: We assume that the data for each class \(k\) follows a Multivariate Gaussian Distribution.
\(\boldsymbol{\mu}_k\): The mean vector for class \(k\) (the center of the distribution).
\(\boldsymbol{\Sigma}_k\): The covariance matrix for class \(k\) (the shape and orientation of the distribution).
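For reference, the class-conditional density implied by this assumption is:
\[
\large{ p(\mathbf{x} \mid y = k) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_k|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right) }
\]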
The Key Difference Between QDA and LDA: The Covariance Matrix
Both QDA and LDA assume the data follows a Gaussian distribution, but they differ in their assumptions about the covariance matrix.
LDA (Linear Discriminant Analysis): Assumes that the covariance matrices of all classes are identical, \(\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \dots = \boldsymbol{\Sigma}_K = \boldsymbol{\Sigma}\). This strong assumption leads to a linear decision boundary.
QDA (Quadratic Discriminant Analysis): Allows each class to have its own distinct covariance matrix, \(\boldsymbol{\Sigma}_i \neq \boldsymbol{\Sigma}_j\) for \(i \neq j\). This more flexible assumption is precisely the source of the quadratic decision boundary.
Visual Comparison: Assumptions of LDA vs. QDA
Derivation of the QDA Discriminant Function
Using Bayes’ theorem, the posterior probability is \(P(y=k|\mathbf{x}) \propto p(\mathbf{x}|y=k) P(y=k)\). Taking the logarithm and expanding the PDF of the Gaussian distribution (and ignoring constant terms) gives the discriminant function:
\[
\large{ f_k(\mathbf{x}) = -\frac{1}{2} \ln |\boldsymbol{\Sigma}_k| - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) + \ln P(y=k) }
\]
Intuitive Understanding: the quadratic form \((\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\) is the squared Mahalanobis distance. It measures the ‘statistical distance’ from a point \(\mathbf{x}\) to the center of a distribution \(\boldsymbol{\mu}_k\).
It takes into account the covariance structure \(\boldsymbol{\Sigma}_k\) of the data itself. If the data varies greatly in a certain direction (high variance), distances along that direction are ‘shrunk’.
QDA essentially classifies a sample point to the class with the smallest Mahalanobis distance.
Visualizing: Euclidean vs. Mahalanobis Distance
Conclusion: Point A lies along the ‘major axis’ of the data distribution. Although its Euclidean distance to the center is larger, its Mahalanobis distance is smaller, so it is statistically ‘closer’.
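A minimal numeric sketch of this effect (the mean and covariance below are hypothetical):

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 0.25]])        # high variance along the first axis, low along the second
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu, Sigma_inv):
    """Mahalanobis distance from x to the distribution center mu."""
    diff = x - mu
    return np.sqrt(diff @ Sigma_inv @ diff)

A = np.array([3.0, 0.0])   # lies along the high-variance 'major axis'
B = np.array([0.0, 1.5])   # lies along the low-variance direction

print(np.linalg.norm(A - mu), np.linalg.norm(B - mu))                # Euclidean: 3.0 vs 1.5
print(mahalanobis(A, mu, Sigma_inv), mahalanobis(B, mu, Sigma_inv))  # Mahalanobis: 1.5 vs 3.0
```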
QDA Decision Rule
For a two-class problem (classes \(\omega_1, \omega_2\)), we can define a discriminant function \(f_i(\mathbf{x})\):
Decision Rule: - If \(f_1(\mathbf{x}) \geq f_2(\mathbf{x})\), predict class \(\omega_1\). - Otherwise, predict class \(\omega_2\).
Since each \(f_i(\mathbf{x})\) uses its own covariance matrix \(\boldsymbol{\Sigma}_i\), the quadratic terms do not cancel, and the decision boundary \(f_1(\mathbf{x}) = f_2(\mathbf{x})\) is a quadratic surface.
Case Study: QDA for Customer Classification in Python
Let’s apply LDA and QDA to the moons dataset using scikit-learn to visually compare their differences.
Objective:
Train an LDA model.
Train a QDA model.
Visualize and compare their decision boundaries.
Preparing Data and Environment
We will use the LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis classes from scikit-learn.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from matplotlib.colors import ListedColormap

# Use previously generated data
X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

# Define a function to plot the decision boundary
def plot_decision_boundary(clf, X, y, ax, title):
    """A helper function to plot classifier decision boundaries"""
    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    # create a meshgrid of features
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # predict the class for each point in the mesh
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # define color map
    cm_bright = ListedColormap(['#FF9999', '#9999FF'])
    # plot the decision regions and data points
    ax.contourf(xx, yy, Z, cmap=cm_bright, alpha=.5)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(['mediumpurple', 'dodgerblue']), edgecolors='k')
    ax.set_title(title, fontsize=16)
    ax.set_xlabel('Customer Metric 1')
    ax.set_ylabel('Customer Metric 2')
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
```
LDA’s Failure: A Single Line Isn’t Enough
We first apply LDA. As expected, LDA tries to find the best straight line to separate the data, but it performs poorly on this ‘moon-shaped’ dataset.
```python
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

fig, ax = plt.subplots(1, 1, figsize=(9, 5.5))
plot_decision_boundary(lda, X, y, ax, 'Decision Boundary of Linear Discriminant Analysis (LDA)')
plt.show()
```
Figure 1: LDA attempts to separate non-linear data with a straight line, with poor results.
QDA’s Success: The Curve Fits Perfectly
Now, we apply QDA. Since QDA can learn a different covariance structure for each class, it is able to generate a flexible quadratic curve boundary that perfectly separates the two ‘moons’.
```python
qda = QuadraticDiscriminantAnalysis()
qda.fit(X, y)

fig, ax = plt.subplots(1, 1, figsize=(9, 5.5))
plot_decision_boundary(qda, X, y, ax, 'Decision Boundary of Quadratic Discriminant Analysis (QDA)')
plt.show()
```
Figure 2: QDA generates a quadratic curve boundary, successfully solving the non-linear classification problem.
QDA Summary: When Should You Use It?
When to Use: QDA is a great choice when you believe that data from different classes have different covariance structures. It offers a good balance between LDA and more complex non-parametric methods (like K-Nearest Neighbors).
Advantages: The model form is explicit, and it’s less computationally expensive than many non-linear methods (like kernel SVM).
Disadvantages: It relies on a strong assumption of Gaussian distribution. When the feature dimension \(d\) is high, the number of parameters becomes very large, requiring substantial data to avoid overfitting and increasing computational cost.
Part 3: The Kernel Trick
The Kernel Method: A ‘Game-Changing’ Idea
While QDA is effective, it is limited to quadratic functions. What if we want to fit more complex boundaries?
The Kernel Method provides an extremely elegant and powerful idea:
Instead of learning a complex non-linear model in the original low-dimensional space, we map the data via a non-linear function \(\phi(\mathbf{x})\) to a higher-dimensional ‘feature space’ and then learn a simple linear model in that high-dimensional space.
This idea, often called the ‘Kernel Trick’, is a cornerstone of modern machine learning.
Core Intuition: Lifting the Dimension (1/3)
Imagine some data points in a one-dimensional space that cannot be separated by a single ‘point’ (a 0-dimensional hyperplane).
Core Intuition: Lifting the Dimension (2/3)
We define a simple non-linear mapping \(\phi(x) = (x, x^2)\), which maps the one-dimensional data point \(x\) into a two-dimensional space.
Core Intuition: Lifting the Dimension (3/3)
A miracle happens! In this new 2D space, the data points become separable by a straight line.
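A minimal sketch of this lifting with made-up one-dimensional data:

```python
# phi(x) = (x, x^2): 1-D data that no single threshold can separate
# becomes linearly separable in 2-D.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])   # class 1 sits between two clusters of class 0

Phi = np.column_stack([x, x ** 2])          # lift each point to (x, x^2)

plt.scatter(Phi[:, 0], Phi[:, 1], c=y, cmap='coolwarm', edgecolors='k')
plt.axhline(2.0, linestyle='--', color='gray')   # the horizontal line x^2 = 2 now separates the classes
plt.xlabel('$x$')
plt.ylabel('$x^2$')
plt.show()
```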
Complete Flowchart of the Kernel Method
The New Home After Mapping: Hilbert Space
The high-dimensional space where the mapping \(\phi(\mathbf{x})\) leads is mathematically known as a feature space. For our mathematical tools (like distance, angle) to work properly, we require this space to be a Hilbert Space.
For economics students, you don’t need to delve into its rigorous mathematical definition. You can understand it intuitively as:
A Hilbert space is a well-behaved, possibly infinite-dimensional generalization of Euclidean space.
Here, we can confidently compute lengths, distances, and angles, just as in ordinary space.
Key Property of Hilbert Space: The Inner Product
The most important property of a Hilbert space is that it defines an Inner Product, denoted as \(\langle \cdot, \cdot \rangle_{\mathcal{H}}\).
The inner product is a generalization of the familiar dot product: \[
\large{ \langle \mathbf{x}_1, \mathbf{x}_2 \rangle = \mathbf{x}_1^T \mathbf{x}_2 = \sum_{i=1}^d x_{1,i} x_{2,i} }
\]
With an inner product, we can define Norm (length) and Distance, just as in Euclidean space. The inner product also contains information about the angle between vectors, allowing us to measure similarity.
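Concretely, norm, distance, and angle are all recovered from the inner product:
\[
\large{ \|\mathbf{x}\|_{\mathcal{H}} = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle_{\mathcal{H}}}, \qquad d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_{\mathcal{H}}, \qquad \cos\theta = \frac{\langle \mathbf{x}_1, \mathbf{x}_2 \rangle_{\mathcal{H}}}{\|\mathbf{x}_1\|_{\mathcal{H}} \, \|\mathbf{x}_2\|_{\mathcal{H}}} }
\]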
Recall that the final computations of many linear algorithms (like SVM, PCA, linear regression) can be expressed solely in terms of the inner products of sample points, \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\).
So, if we map the data to a high-dimensional space \(\phi(\mathbf{x})\), the algorithm would need to compute the inner product after the mapping: \(\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}\).
The Problem:
The mapping function \(\phi(\mathbf{x})\) can be very complex, and we might not even know its explicit form.
The high-dimensional space could have a very high, or even infinite, dimension, making a direct computation of the inner product impossible.
The Kernel Function: A Shortcut to High Dimensions
The Kernel Function \(K(\mathbf{x}_i, \mathbf{x}_j)\) is defined to solve this very problem:
A function \(K(\mathbf{x}_i, \mathbf{x}_j)\) is a kernel function if it can be written as the inner product of some mapping \(\phi\) in a high-dimensional space, i.e., \(K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}\).
The essence of the kernel trick: We can directly compute the value of \(K(\mathbf{x}_i, \mathbf{x}_j)\) in the original low-dimensional space, without ever needing to know what \(\phi\) is or going to that high-dimensional space. This is equivalent to computing the inner product in the high-dimensional space, thus avoiding the ‘curse of dimensionality’.
Example: A Simple Quadratic Kernel
Suppose our original data is two-dimensional: \(\mathbf{x} = (x_1, x_2)\). Consider a simple kernel function: \(K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T \mathbf{x}_j)^2\).
Let’s expand it: \[
\begin{aligned}
\large{ K(\mathbf{x}_i, \mathbf{x}_j) } &= \large{ (x_{i1}x_{j1} + x_{i2}x_{j2})^2 } \\
&= \large{ x_{i1}^2x_{j1}^2 + x_{i2}^2x_{j2}^2 + 2x_{i1}x_{j1}x_{i2}x_{j2} } \\
&= \large{ \langle (x_{i1}^2, x_{i2}^2, \sqrt{2}x_{i1}x_{i2}), (x_{j1}^2, x_{j2}^2, \sqrt{2}x_{j1}x_{j2}) \rangle }
\end{aligned}
\] This shows that this simple kernel implicitly corresponds to a mapping from 2D to 3D: \(\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}x_1x_2)\).
We obtained the high-dimensional inner product using only a dot product and a squaring in the original space!
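A quick numerical check of this identity (the two vectors are arbitrary examples):

```python
import numpy as np

def phi(x):
    """Explicit 2-D -> 3-D map implied by the quadratic kernel (x_i^T x_j)^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

print((xi @ xj) ** 2)      # kernel value, computed in the original 2-D space: 1.0
print(phi(xi) @ phi(xj))   # inner product in the 3-D feature space: also 1.0
```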
How to Identify a Valid Kernel Function?
What conditions must a function \(K(\mathbf{x}_i, \mathbf{x}_j)\) satisfy to guarantee that there is always an underlying Hilbert space and a mapping \(\phi\)?
Mercer’s Theorem provides the answer:
For any finite set of data points \(\{\mathbf{x}_1, \dots, \mathbf{x}_N\}\), if the Gram matrix \(\mathbf{K}\) computed from the kernel function is always positive semi-definite, then \(K\) is a valid kernel function.
What is a Gram Matrix?
The Gram matrix \(\mathbf{K}\) is an \(N \times N\) symmetric matrix whose element \(K_{ij}\) is the value of the kernel function on the \(i\)-th and \(j\)-th data points:
\[
\large{ K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}} }
\]
Intuitive Understanding: The Gram matrix is a pairwise similarity matrix that describes the relationships between all sample points in the dataset. The positive semi-definite property ensures that this ‘similarity’ is geometrically consistent (not self-contradictory).
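A minimal sketch of Mercer’s condition in action: build the Gram matrix of an RBF kernel (the data and gamma value are arbitrary) and check that its eigenvalues are non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))   # 20 arbitrary 2-D sample points
gamma = 0.5

# K_ij = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)

eigenvalues = np.linalg.eigvalsh(K)
print(eigenvalues.min() >= -1e-10)  # True: the Gram matrix is (numerically) positive semi-definite
```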
Common Kernel Functions: Your ‘Arsenal’
Fortunately, we don’t need to verify Mercer’s theorem ourselves every time. There are many ready-to-use, proven kernel functions available.
Basic Principle for Choosing a Kernel: The more similar the sample attribute vectors \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are, the larger the value of the kernel function \(K(\mathbf{x}_i, \mathbf{x}_j)\) should be.
Next, we will introduce some of the most commonly used kernel functions.
Common Kernel 1: Linear Kernel
Form: \(K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j\).
Essence: It’s simply the dot product in the original space.
Corresponding Map: \(\phi(\mathbf{x}) = \mathbf{x}\), meaning no mapping is performed.
Application: When the data is already linearly separable, an SVM with a linear kernel is the standard linear SVM. This is the simplest and fastest baseline.
Common Kernel 2: Polynomial Kernel
Form: \(K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T \mathbf{x}_j + c)^d\), with degree \(d\) and constant \(c \geq 0\).
Essence: Implicitly maps data to a feature space that includes all interaction terms up to degree \(d\).
Application: Very effective for problems where interactions between features are important. For instance, in financial risk control, \(x_1 \times x_2\) (the product of income and debt ratio) might be a stronger predictor than \(x_1\) or \(x_2\) alone.
Common Kernel 3: Gaussian (RBF) Kernel
The Radial Basis Function (RBF) Kernel, also commonly known as the Gaussian kernel, is the most frequently used and most powerful kernel function:
\[
\large{ K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2} \right) = \exp\!\left( -\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2 \right) }
\]
\(\sigma\) (or \(\gamma = 1/(2\sigma^2)\)): Controls the ‘width’ of the kernel.
Essence: Similarity is determined entirely by the Euclidean distance between two points. The closer two points are, the closer the kernel value is to 1; the farther apart, the closer to 0.
Corresponding Map: This is a very powerful kernel because it corresponds to an infinite-dimensional feature space.
Intuitive Understanding of the RBF Kernel
You can think of each data point as placing a Gaussian ‘hill’ in space. The decision boundary of an RBF kernel is formed by combining the contour lines of these ‘hills’.
Small \(\gamma\) (Large \(\sigma\)): The kernel has a large radius, giving it a wide range of influence. The decision boundary is very smooth.
Large \(\gamma\) (Small \(\sigma\)): The kernel has a small radius, limiting the influence of each data point. The decision boundary becomes very complex and wiggly, and is prone to overfitting.
Conclusion: If you have no prior knowledge about the data’s structure, the RBF kernel is usually the default first choice.
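A small sketch of this trade-off on the moons data (the gamma values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_demo, y_demo = make_moons(n_samples=200, noise=0.1, random_state=42)

for gamma in (0.1, 2, 100):
    clf = SVC(kernel='rbf', gamma=gamma).fit(X_demo, y_demo)
    print(f"gamma={gamma}: training accuracy = {clf.score(X_demo, y_demo):.3f}")
# Very small gamma -> overly smooth boundary; very large gamma -> wiggly boundary that
# can memorize the training set (high training accuracy, poor generalization).
```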
Common Kernel 4: Sigmoid Kernel
Form: \(K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\beta \, \mathbf{x}_i^T \mathbf{x}_j + \theta)\).
Origin: Its form is inspired by the activation functions used in neural networks.
Caution: The Sigmoid kernel only satisfies Mercer’s condition for certain parameter values, making it less commonly used in practice than RBF and polynomial kernels.
Application: Kernelized Learning Algorithms
Applying the Kernel Method: ‘Kernelization’ of Algorithms
The beauty of the kernel trick lies in its universality. Any algorithm whose computations can be expressed entirely in terms of inner products between data points can be ‘kernelized’ by replacing the inner product with a kernel function \(K(\mathbf{x}_i, \mathbf{x}_j)\).
This allows us to seamlessly upgrade linear algorithms, endowing them with powerful non-linear processing capabilities.
Classic Applications
We will introduce two classic applications:
Kernel Principal Component Analysis (Kernel PCA)
Kernel Support Vector Machine (Kernel SVM)
Application 1: Kernel Principal Component Analysis (KPCA)
Limitations of PCA
Review of PCA:
Compute the covariance matrix of the data: \(\mathbf{\Sigma} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \mathbf{x}_n^T\) (assuming data is centered).
Perform eigenvalue decomposition on \(\mathbf{\Sigma}\) to find the principal eigenvectors (principal components).
Problem: PCA can only discover linear structures in the data. For non-linear structures, PCA fails.
The Idea Behind KPCA
The idea of KPCA:
Map the data to a high-dimensional feature space: \(\mathbf{x}_n \to \phi(\mathbf{x}_n)\).
Compute the covariance matrix in the high-dimensional space: \(\mathbf{\Sigma}_\phi = \frac{1}{N} \sum_{n=1}^N \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T\).
Perform eigenvalue decomposition on \(\mathbf{\Sigma}_\phi\).
The Computational Trick of KPCA
Directly computing \(\mathbf{\Sigma}_\phi\) is infeasible, as it could be an infinite-dimensional matrix.
Fortunately, through a series of mathematical derivations, it can be shown that an eigenvector \(\mathbf{v}\) in the high-dimensional space can be expressed as a linear combination of the mapped data points: \(\mathbf{v} = \sum_{n=1}^N \alpha_n \phi(\mathbf{x}_n)\).
Ultimately, the problem of finding the eigenvectors is transformed into an eigenvalue problem for the \(N \times N\) Gram matrix \(\mathbf{K}\):
\[
\large{ \mathbf{K} \boldsymbol{\alpha} = N \lambda \boldsymbol{\alpha} }
\]
Here, \(\mathbf{K}\) is the kernel matrix we defined earlier, with elements \(K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)\). The entire computation completely avoids the explicit use of \(\phi\).
Case Study: Non-linear Dimensionality Reduction with KPCA
Let’s look at an example where standard PCA fails, but KPCA succeeds. We’ll use the make_circles dataset from scikit-learn.
Objective:
Generate a ‘concentric circles’ dataset, which clearly has a non-linear structure.
Reduce its dimensionality to one dimension using both standard PCA and KPCA with an RBF kernel.
Compare the results after dimensionality reduction.
Preparing Data and Environment (KPCA)
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Generate concentric circles data
X, y = make_circles(n_samples=400, factor=.3, noise=.05, random_state=0)
```
Visualizing the Original Data: Concentric Circles
This is our original data: a large circle enclosing a smaller one. We want to find a single dimension that can separate these two circles.
Figure 3: The concentric circles dataset cannot be separated by a linear projection.
The Failure of Standard PCA
Standard PCA tries to find the direction of maximum variance for projection. For the concentric circles data, any linear projection will mix the points from both circles together.
Figure 4: After standard PCA reduction, the two classes completely overlap.
The Success of KPCA
KPCA, using an RBF kernel, maps the data into an infinite-dimensional space. In this space, the inner and outer circles become separable. When we project this back to one dimension, we can see the two classes are clearly separated.
Figure 5: After KPCA with an RBF kernel, the two classes are perfectly separated.
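For reference, here is a sketch of the projections behind Figures 4 and 5; the gamma value and plotting details are assumptions, not the exact code used above.

```python
# Reduce the circles data to one dimension with standard PCA and with RBF-kernel KPCA.
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X)

kpca = KernelPCA(n_components=1, kernel='rbf', gamma=10)  # gamma=10 is an assumed choice
X_kpca = kpca.fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], np.zeros_like(X_pca[:, 0]), c=y, cmap='coolwarm', edgecolors='k')
axes[0].set_title('Standard PCA (1-D)')
axes[1].scatter(X_kpca[:, 0], np.zeros_like(X_kpca[:, 0]), c=y, cmap='coolwarm', edgecolors='k')
axes[1].set_title('Kernel PCA, RBF kernel (1-D)')
plt.show()
```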
Application 2: Kernel Support Vector Machine (KSVM)
Review of Linear SVM
Objective: Find a hyperplane that separates two classes of data with the maximum margin.
Mathematically, this is equivalent to solving a constrained quadratic optimization problem.
The Dual Problem
Through Lagrangian duality, the SVM optimization problem can be converted into its ‘dual form’.
Magically, in the dual form, both the optimization objective and the decision function depend only on the inner products of the data points, \(\mathbf{x}_i^T \mathbf{x}_j\).
From SVM to KSVM: A Single Substitution
The dual optimization objective for a linear SVM is: \[
\large{ \max_{\boldsymbol{\lambda}} \sum_{n=1}^N \lambda_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m y_n y_m (\mathbf{x}_m^T \mathbf{x}_n) }
\]
To get a Kernel SVM (KSVM), we just need to make one simple substitution: \[
\large{ \mathbf{x}_m^T \mathbf{x}_n \quad \longrightarrow \quad K(\mathbf{x}_m, \mathbf{x}_n) }
\]
This allows the SVM to find the maximum margin hyperplane in a high-dimensional feature space, resulting in a non-linear decision boundary in the original space.
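Accordingly, the decision function of the trained KSVM also involves only kernel evaluations between the new point and the training samples:
\[
\large{ f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{n=1}^N \lambda_n y_n K(\mathbf{x}_n, \mathbf{x}) + b \right) }
\]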
Case Study: Solving the Moons Problem with KSVM
Let’s return to our original ‘moons’ classification problem. We previously solved it using QDA. Now let’s see how SVMs with different kernels perform.
Objective:
Train an SVM with a linear kernel.
Train an SVM with a polynomial kernel.
Train an SVM with an RBF kernel.
Visualize and compare their decision boundaries.
Preparing Data and Environment (KSVM)
```python
from sklearn.svm import SVC

# Use the previous moons dataset
X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

# Create classifier instances
svc_linear = SVC(kernel='linear', C=1.0)
svc_poly = SVC(kernel='poly', degree=3, C=1.0)
svc_rbf = SVC(kernel='rbf', gamma=2, C=1.0)  # gamma=2 works well for this problem

# Train the models
svc_linear.fit(X, y)
svc_poly.fit(X, y)
svc_rbf.fit(X, y)

print("All SVM models have been trained.")
```
All SVM models have been trained.
KSVM Decision Boundary Visualization
Figure 6: Comparison of different SVM kernels on a non-linear problem.
Observations:
Linear Kernel: Fails to separate the data, just like LDA.
Polynomial Kernel: Successfully fits a curved boundary.
RBF Kernel: Also succeeds, and the boundary appears very smooth and natural.
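For reference, Figure 6 can be reproduced with a sketch like the following, reusing the plot_decision_boundary helper defined earlier (the exact plotting code is not shown above).

```python
fig, axes = plt.subplots(1, 3, figsize=(15, 4.5))
plot_decision_boundary(svc_linear, X, y, axes[0], 'Linear Kernel')
plot_decision_boundary(svc_poly, X, y, axes[1], 'Polynomial Kernel (degree=3)')
plot_decision_boundary(svc_rbf, X, y, axes[2], 'RBF Kernel (gamma=2)')
plt.tight_layout()
plt.show()
```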
Chapter Summary
Comparison of Three Non-linear Modeling Philosophies
| Feature / Method | Piecewise Linear Discriminant | Quadratic Discriminant Analysis (QDA) | Kernel Method (Kernel Trick) |
|---|---|---|---|
| Core Idea | Approximate with line segments | Directly fit a quadratic curve | Lift dimension, turn curve into line |
| Flexibility | High (can fit any shape in theory) | Medium (only quadratic forms) | Very High (RBF can fit any shape) |
| Computational Cost | Depends on number of segments | Medium (\(O(d^2)\)) | High (\(O(N^2)\) to \(O(N^3)\)) |
| Interpretability | Medium (like a decision tree) | High (quadratic function) | Low (high-dim space is not intuitive) |
| Main Assumption | No specific distribution assumption | Data follows a Gaussian distribution | No specific distribution assumption |
| Key Parameters | Number of subclasses/segments | None (but relies on Gaussian assumption) | Kernel type and its parameters (gamma, d) |
| Use Case | Intuitive modeling, tree-like scenarios | Different class covariances, low feature dimension | Complex non-linear problems, the powerful default tool |
Key Takeaways
Embrace Non-linearity: The real world is complex. Linear models have their limits, and learning non-linear models is essential.
QDA: Generates quadratic boundaries by assuming different covariance matrices for each class. It’s a natural extension of LDA and strikes a good balance between LDA and more complex models.
Kernel Trick:
Core Idea: Lift dimension + Apply a linear model.
Uses a kernel function \(K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle\) as a computational shortcut, avoiding the curse of dimensionality.
Can ‘kernelize’ any algorithm that relies on inner products (like PCA, SVM), giving it powerful non-linear capabilities.
The RBF kernel is the most powerful and commonly used default option.