Alternative Data: Satellite imagery, news sentiment, supply chain info…
We can easily end up with a dataset of hundreds or even thousands of dimensions.
We Face a ‘Data-Rich, Insight-Poor’ Dilemma
The explosive growth in data volume does not directly translate to an increase in insight. The goal of representation learning is to extract the signal from the noise.
This is the so-called ‘Curse of Dimensionality’.
What is the ‘Curse of Dimensionality’?
As the number of dimensions d increases, the volume of the feature space grows exponentially, so a fixed number of samples becomes increasingly sparse and distances between data points become less and less informative.
The Curse of Dimensionality is an Enemy of Modeling and Analysis
When data dimensionality d is too high, a series of serious problems arise:
| Problem Category | Specific Manifestation | Impact on Economic Research |
|---|---|---|
| Computational Efficiency | Algorithm complexity grows exponentially with d | Models take too long to train, hindering iteration. |
| Data Sparsity | A fixed number of samples becomes very sparse | Samples are not representative; hard to find significant relationships. |
| Model Overfitting | The model learns noise, not the true pattern | Perfect in-sample performance, but poor out-of-sample (predictive) power. |
| Multicollinearity | Many features are highly correlated | Difficult to identify the true impact of individual variables; unstable parameter estimates. |
Representation learning (or dimensionality reduction) is the key to solving this problem.
The Goal of Representation Learning: Simplify with Minimal Information Loss
Our goal is to map a high-dimensional sample set \(X \in \mathbb{R}^{d \times N}\) to a low-dimensional space \(Z \in \mathbb{R}^{l \times N}\), where \(l \ll d\).
Core Requirement: The new representation \(Z\) must preserve the most important ‘structure’ or ‘information’ from the original data \(X\). Different algorithms define ‘structure’ differently, leading to various reduction methods.
Before We Begin: Preprocessing is the Foundation of Success
Before applying any complex dimensionality reduction algorithm, we must clean the raw data. This is like laying the foundation before building a house.
Preprocessing Issue 1: Outliers
Extreme values can severely distort a model’s variance calculation (e.g., in PCA), pulling it towards the direction of the outlier.
Common Treatments: Winsorization, log transformation, or direct removal.
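As a quick illustration, here is a minimal pandas/NumPy sketch of the first two treatments on hypothetical toy data (the series and cutoffs are illustrative, not taken from the case study):

```{python}
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed return series: winsorize at the 1st and 99th percentiles
returns = pd.Series(rng.standard_t(df=3, size=1000))
lower, upper = returns.quantile([0.01, 0.99])
winsorized = returns.clip(lower=lower, upper=upper)

# Log transformation for a strictly positive, right-skewed variable (e.g., market cap)
market_cap = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=1000))
log_market_cap = np.log(market_cap)
```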
Preprocessing Issue 2: Missing Data
Most algorithms cannot handle missing values (NaN).
Common Strategies:
Deletion: If the missing proportion is small, delete the row or column.
Imputation: Fill with the mean, median, or more complex models (like K-Nearest Neighbors).
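A minimal sketch of these strategies using pandas and scikit-learn's imputers on a tiny hypothetical DataFrame (the column names are made up for illustration):

```{python}
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical DataFrame with missing values
df = pd.DataFrame({'pe_ratio': [15.0, np.nan, 22.0, 18.0],
                   'dividend_yield': [2.1, 1.8, np.nan, 3.0]})

# Option 1: drop rows with any missing value (only if the missing share is small)
df_dropped = df.dropna()

# Option 2: fill each column with its median
median_imputed = SimpleImputer(strategy='median').fit_transform(df)

# Option 3: K-Nearest Neighbors imputation (borrows values from similar rows)
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(df)
```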
Preprocessing Issue 3: Inconsistent Scales
If ‘Market Cap’ (trillions) and ‘P/E Ratio’ (tens) are analyzed together, market cap will completely dominate the results.
Solution: Feature Scaling. The most common is Standardization, which transforms data to have a mean of 0 and a variance of 1.
Standardization transforms all features to a distribution with a mean of 0 and a variance of 1. This ensures all features have equal weight in subsequent models like PCA.
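For example, a minimal sketch with scikit-learn's StandardScaler on two hypothetical features with wildly different scales:

```{python}
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: market cap (trillions) and P/E ratio (tens)
X = np.array([[2.5e12, 28.0],
              [1.1e12, 15.0],
              [0.3e12, 42.0]])

X_std = StandardScaler().fit_transform(X)

# Each column now has mean ~0 and unit variance
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```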
PCA seeks the unit-length projection direction \(w\) that maximizes the variance of the projected data, \(w^\top S w\), where \(S\) is the covariance matrix of the (standardized) data. Writing the Lagrangian of this constrained problem, setting its derivative to zero, and rearranging gives the final form, which reveals the core mathematical identity of PCA.
\[ \large{Sw = \lambda w} \]
Conclusion: The optimal projection directions (the principal components) \(w\) are the eigenvectors of the covariance matrix \(S\). The corresponding variance explained by each component is its eigenvalue \(\lambda\). The eigenvector with the largest eigenvalue is the first principal component.
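A quick numerical sanity check of this identity on hypothetical random data: the eigenvectors and eigenvalues of the sample covariance matrix \(S\) coincide (up to sign) with scikit-learn's PCA components and explained variances.

```{python}
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))        # hypothetical standardized data (N x d)

S = np.cov(X, rowvar=False)              # sample covariance matrix S (d x d)
eigvals, eigvecs = np.linalg.eigh(S)     # eigh is designed for symmetric matrices
order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca = PCA().fit(X)

# Eigenvectors match PCA components up to sign; eigenvalues match explained variances
print(np.allclose(np.abs(eigvecs.T), np.abs(pca.components_)))
print(np.allclose(eigvals, pca.explained_variance_))
```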
Summary of Key Steps in PCA
This translates the abstract mathematical theory into a clear operational workflow.
Python Hands-On: Analyzing the U.S. Treasury Yield Curve with PCA
Yield curve movements are central to macroeconomic analysis. Rates at different maturities are highly correlated, making them a perfect candidate for PCA.
We run PCA on the daily changes in yields, as finance is often more concerned with changes than levels.
```{python}
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

# Standardize the daily yield changes (computed earlier)
scaler_yield = StandardScaler()
scaled_changes = scaler_yield.fit_transform(yield_changes)

# Run PCA
pca = PCA()
pca.fit(scaled_changes)

# Visualize explained variance
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio,
       alpha=0.6, color='skyblue', label='Individual explained variance')
ax.step(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio,
        where='mid', color='red', linestyle='--', label='Cumulative explained variance')
ax.set_ylabel('Explained variance ratio')
ax.set_xlabel('Principal component index')
ax.set_title('The First 3 Components Explain Over 95% of Yield Curve Variance', fontsize=16)
ax.axhline(y=0.95, color='gray', linestyle=':', linewidth=2)
ax.text(len(explained_variance_ratio), 0.95, '95% threshold', va='bottom', ha='right')
ax.legend(loc='best')
ax.xaxis.set_major_locator(mticker.MaxNLocator(integer=True))
plt.show()
```
Figure 1: PCA Explained Variance Ratio
The Economic Meaning of the Principal Components: Level, Slope, and Curvature
By examining the loadings (i.e., the eigenvectors) of the principal components, we can assign them economic meaning.
```{python}
#| echo: true
#| warning: false
#| label: fig-pca-components
#| fig-cap: 'Economic Interpretation of Yield Curve Principal Components'
import pandas as pd

components = pd.DataFrame(pca.components_[:3, :].T,
                          columns=['PC1 (Level)', 'PC2 (Slope)', 'PC3 (Curvature)'],
                          index=yield_changes.columns)

# To align with theory, we may need to flip the sign of some eigenvectors
# (sign flips do not change the interpretation): PC1 loads positively everywhere,
# PC2 loads positively at the long end, PC3 loads positively at both ends
if components['PC1 (Level)'].mean() < 0:
    components['PC1 (Level)'] *= -1
if components['PC2 (Slope)'].iloc[-1] < 0:
    components['PC2 (Slope)'] *= -1
if components['PC3 (Curvature)'].iloc[0] < 0:
    components['PC3 (Curvature)'] *= -1

fig, ax = plt.subplots(figsize=(12, 7))
components.plot(ax=ax, marker='o')
ax.set_title('Economic Interpretation of Yield Curve Principal Components', fontsize=16)
ax.set_ylabel('Component Loading')
ax.set_xlabel('Maturity')
ax.axhline(0, color='black', linewidth=0.5, linestyle='--')
ax.legend(title='Principal Components')
plt.show()
```
Economic Interpretation of the Principal Components
PC1 (Level): All maturities have loadings of the same sign. Represents a parallel shift of the entire yield curve. This is the most significant movement, often related to the overall stance of monetary policy.
PC2 (Slope): Short-term loadings are negative, long-term are positive. Represents a change in the slope of the yield curve (steepening or flattening), reflecting market expectations of future short-term rates and economic growth.
PC3 (Curvature): Short and long-term loadings are positive, mid-term are negative. Represents a change in the curvature (the ‘bow’ shape), related to expectations of interest rate volatility.
Linear Discriminant Analysis (LDA): Reduction for Classification
LDA is a supervised learning algorithm for dimensionality reduction. Unlike PCA, which seeks maximum variance, LDA’s goal is to find a projection direction that maximizes the separation between different classes while minimizing the variance within each class.
LDA's Objective: Within-Class and Between-Class Scatter
Within-class scatter matrix (\(S_w\)): measures how tightly the data points cluster around their own class mean.
Between-class scatter matrix (\(S_b\)): measures how far apart the class means are from one another.
LDA looks for a projection that makes \(S_b\) large relative to \(S_w\); the formal definitions follow below.
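Written out for reference (standard Fisher LDA notation: \(\mu_k\) and \(N_k\) are the mean and size of class \(C_k\), and \(\mu\) is the overall mean):

\[
\large{S_w = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top, \qquad
S_b = \sum_{k=1}^{K} N_k (\mu_k - \mu)(\mu_k - \mu)^\top}
\]

\[
\large{J(w) = \frac{w^\top S_b w}{w^\top S_w w}}
\]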
PCA: PCA only cares about overall variance. It would choose the horizontal direction as PC1, which does a poor job of separating the two classes.
LDA: LDA takes the class labels into account and chooses a projection that maximizes the separation between the classes, distinguishing them cleanly.
Solving LDA: A Generalized Eigenvalue Problem
Maximizing the Fisher criterion \(J(w)\) defined above can be transformed into solving a generalized eigenvalue problem:
\[
\large{S_b w = \lambda S_w w}
\]
Multiplying both sides by \(S_w^{-1}\), we get a more familiar form:
\[
\large{S_w^{-1} S_b w = \lambda w}
\]
Conclusion: The optimal projection directions \(w\) for LDA are the eigenvectors of the matrix \(S_w^{-1} S_b\).
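To make this concrete, here is a minimal sketch on hypothetical synthetic blobs (not the Iris case study below) that builds \(S_w\) and \(S_b\) by hand and solves the generalized eigenvalue problem with scipy.linalg.eigh; scikit-learn's LinearDiscriminantAnalysis with solver='eigen' follows the same route.

```{python}
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import make_blobs

# Hypothetical toy data: 3 classes in 5 dimensions
X, y = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
overall_mean = X.mean(axis=0)

# Build the within-class and between-class scatter matrices
d = X.shape[1]
S_w = np.zeros((d, d))
S_b = np.zeros((d, d))
for k in np.unique(y):
    X_k = X[y == k]
    mu_k = X_k.mean(axis=0)
    S_w += (X_k - mu_k).T @ (X_k - mu_k)
    diff = (mu_k - overall_mean).reshape(-1, 1)
    S_b += len(X_k) * (diff @ diff.T)

# Generalized eigenvalue problem: S_b w = lambda * S_w w
eigvals, eigvecs = eigh(S_b, S_w)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]      # top-2 discriminant directions

X_lda = X @ W                  # low-dimensional (2D) representation
print(X_lda.shape)
```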
Python Hands-On: Classifying the Iris Dataset with LDA
The Iris dataset is the ‘Hello World’ of classification algorithms. It contains 150 samples from 3 classes, each described by 4 features. Our goal is to reduce the data to 2 dimensions and visualize the result.
```{python}
#| echo: true
#| warning: false
#| label: fig-lda-iris
#| fig-cap: 'LDA projects the 4D Iris dataset onto 2D'
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the data
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Run LDA, reducing to 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_r = lda.fit(X, y).transform(X)

# Visualization
plt.figure(figsize=(8, 6))
colors = ['navy', 'turquoise', 'darkorange']
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], alpha=.8, color=color, label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of IRIS dataset')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.grid(True)
plt.show()
```
Analysis: Even after compressing from 4D to 2D, LDA excellently separates the three classes, demonstrating its powerful classification capabilities.
Multidimensional Scaling (MDS): Reconstructing a ‘Map’ from Distances
MDS has a completely different starting point from PCA and LDA. It doesn't work with the feature matrix \(X\) directly but starts from a known distance (or dissimilarity) matrix \(D\).
Core Idea: Find a set of points \(Z\) in a low-dimensional space such that the Euclidean distances between these points are as close as possible to the original distance matrix \(D\).
Use Case: When we don’t have the original features but can measure the dissimilarity between objects. For example, survey data on ‘brand similarity’ or the ‘edit distance’ between genetic sequences.
MDS Analogy: Reconstructing a City Map
Imagine you only know the straight-line flight distances between major cities, but you have no latitude or longitude information.
The goal of MDS is to find the optimal 2D coordinates for each city based on this distance matrix.
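A minimal sketch of this idea with scikit-learn's MDS and a small matrix of approximate flight distances (the numbers are illustrative); note the recovered map is only determined up to rotation and reflection.

```{python}
import numpy as np
from sklearn.manifold import MDS

# Approximate straight-line distances (km) between four cities (illustrative values)
cities = ['New York', 'Chicago', 'Houston', 'Los Angeles']
D = np.array([[   0, 1145, 2282, 3944],
              [1145,    0, 1514, 2808],
              [2282, 1514,    0, 2206],
              [3944, 2808, 2206,    0]])

# Reconstruct 2D coordinates whose pairwise distances approximate D
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)

for city, (x_coord, y_coord) in zip(cities, coords):
    print(f'{city}: ({x_coord:.0f}, {y_coord:.0f})')
```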
Part 2: Nonlinear Dimensionality Reduction (Manifold Learning)
The Limitation of Linear Methods: When Data Structure is Curved
Linear methods like PCA and LDA assume that the data lies on a flat hyperplane. But what if the data’s intrinsic structure is curved?
A linear method like PCA would incorrectly project distant points (like A and B) close together, failing to ‘unroll’ the data.
The Core Idea of Manifold Learning: Data Lives on a Low-Dimensional Manifold
Manifold Hypothesis: The high-dimensional data we observe is actually generated by a few latent variables (the intrinsic dimension), and these data points lie on a low-dimensional manifold embedded in the high-dimensional space.
Goal: To ‘unroll’ this manifold and find low-dimensional coordinates that reflect the true neighborhood relationships of the data.
Difference from Linear Methods: Manifold learning focuses on local structure, assuming that Euclidean distances are only reliable between nearby points.
Isomap: Measuring Distance Along the ‘Surface’
Isomap is a clever extension of MDS that replaces Euclidean distance with Geodesic Distance.
Steps:
1. Construct a neighborhood graph: connect each point only to its K nearest neighbors.
2. Compute shortest paths: use a graph algorithm (like Dijkstra's) to compute the shortest path between all pairs of points, approximating the geodesic distance.
3. Apply MDS: use the resulting shortest-path distance matrix as input to the classical MDS algorithm.
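A minimal sketch on the classic synthetic 'Swiss roll', where scikit-learn's Isomap bundles all three steps (the n_neighbors value is illustrative):

```{python}
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Synthetic 3D 'Swiss roll' data lying on a curved 2D manifold
X, color = make_swiss_roll(n_samples=1500, random_state=0)

# K-nearest-neighbor graph + shortest paths + classical MDS, all inside Isomap
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)

plt.figure(figsize=(8, 5))
plt.scatter(X_iso[:, 0], X_iso[:, 1], c=color, cmap='viridis', s=10)
plt.title('Isomap unrolls the Swiss roll into 2D')
plt.show()
```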
Locally Linear Embedding (LLE): Preserving Local Linear Relationships
LLE assumes that each data point can be linearly reconstructed by its neighbors, and this local geometric relationship should be preserved in the low-dimensional space.
The Heart of LLE: Preserve Reconstruction Weights, Not Distances
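A minimal sketch paralleling the Isomap example above, using scikit-learn's LocallyLinearEmbedding on the same synthetic Swiss roll (the n_neighbors value is illustrative):

```{python}
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1500, random_state=0)

# Each point is reconstructed from its K nearest neighbors;
# the reconstruction weights are then preserved in the 2D embedding
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)

plt.figure(figsize=(8, 5))
plt.scatter(X_lle[:, 0], X_lle[:, 1], c=color, cmap='viridis', s=10)
plt.title('LLE embedding of the Swiss roll')
plt.show()
```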
t-SNE: The Swiss Army Knife of Data Visualization
t-SNE (t-distributed Stochastic Neighbor Embedding) is one of the most popular and effective tools for reducing high-dimensional data to 2D or 3D for visualization.
Core Idea (Probabilistic Matching):
In high-D space, convert Euclidean distances between points into conditional probabilities that represent the likelihood that point \(i\) would pick point \(j\) as its neighbor (using a Gaussian distribution).
In low-D space, define analogous probabilities, but using a heavier-tailed Student's t-distribution.
Adjust the positions of points in the low-D space to make the two probability distributions as similar as possible (by minimizing the KL divergence).
Advantage: It is exceptionally good at revealing the clustering structure of data.
t-SNE Key Parameter: Perplexity
Perplexity:
This is the most important parameter. It can be loosely interpreted as the ‘effective number of neighbors’ each point considers.
Typical values are between 5 and 50.
A lower value focuses on local structure, while a higher value considers more of the global structure.
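A sketch of how one might compare a low and a high perplexity side by side on hypothetical clustered data (the dataset and parameter values are illustrative):

```{python}
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Hypothetical clustered data: 5 groups in 20 dimensions
X, y = make_blobs(n_samples=600, centers=5, n_features=20, random_state=0)

# Compare a 'local' (low) and a more 'global' (high) perplexity setting
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, perp in zip(axes, [5, 50]):
    emb = TSNE(n_components=2, perplexity=perp, init='pca',
               learning_rate='auto', random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='tab10', s=8)
    ax.set_title(f'perplexity = {perp}')
plt.show()
```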
Important Cautions for Interpreting t-SNE plots
Do not over-interpret distances between clusters: The distance between two clusters on a t-SNE plot does not meaningfully represent how ‘far apart’ they are in the original space.
Do not over-interpret the size of clusters: The area of a cluster on the plot does not mean it contains more data points or has a larger variance.
t-SNE is an exploratory visualization tool, not a rigorous clustering analysis method.
Python Hands-On: Visualizing Handwritten Digits with t-SNE
We'll use the handwritten digits dataset from scikit-learn. Each digit is an 8x8 image, flattened into a 64-dimensional vector. We will use t-SNE to reduce it to 2 dimensions.
```{python}
#| echo: true
#| warning: false
#| label: fig-tsne-digits
#| fig-cap: 't-SNE reduces 64D handwritten digit data to 2D'
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np

# Load the data
digits = load_digits()
X = digits.data
y = digits.target

# Run t-SNE
# init='pca' speeds up convergence, learning_rate='auto' is recommended in new scikit-learn versions
tsne = TSNE(n_components=2, init='pca', random_state=0, perplexity=30, learning_rate='auto')
X_tsne = tsne.fit_transform(X)

# Visualization
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap=plt.cm.get_cmap('jet', 10))
plt.title('t-SNE visualization of handwritten digits')
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
cbar = plt.colorbar(scatter)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(np.arange(10))
plt.show()
```
Analysis: t-SNE separates the different digits into largely distinct clusters in 2D space, which is extremely helpful for understanding the structure of the high-dimensional data.
Part 3: Advanced Topic: Sparse Representation
Sparse Representation: Building Signals with the Fewest ‘Blocks’
Previous methods aimed to ‘compress’ data, but sparse representation has a different starting point.
Core Idea: Any signal (e.g., an economic time series) can be represented as a linear combination of a few ‘atoms’ (basis vectors) from a ‘dictionary’ \(\Psi\):
\[
\large{x = \Psi s}
\]
Here, \(s\) is a sparse vector, meaning most of its elements are zero.
Economic Intuition: The complex dynamics of the market might be driven by a combination of only a few ‘latent economic states’ or ‘shocks’. Sparse representation aims to find these core drivers.
Compressed Sensing: Recovering the Full Signal from Fewer Samples
The sparsity assumption leads to a surprising conclusion: compressed sensing.
If we know that a signal \(x\) is sparse under a certain dictionary \(\Psi\), we don't need to observe the entire signal \(x\). We can take a small number of random measurements \(z\) and still reconstruct the original signal \(x\) essentially exactly, under suitable conditions on the measurement matrix and the sparsity level.
\[
\large{z = \Phi x = \Phi \Psi s = \Theta s}
\]
\(z\): a small number of observations (\(l \times 1\))
\(\Phi\): the measurement matrix (\(l \times d\), \(l \ll d\))
\(\Theta = \Phi \Psi\): the sensing matrix (\(l \times d\))
Goal: Given \(z\) and \(\Theta\), solve for the sparse vector \(s\). This is the foundation of modern signal processing, MRI, and more.
Reconstruction Algorithm: Matching Pursuit
Finding the sparse vector \(s\) from \(z\) and \(\Theta\) is an NP-hard problem. Matching Pursuit is a greedy algorithm that approximates the solution iteratively.
1. Initialize: set the residual \(r = z\) and the solution \(s = 0\).
2. Find the most correlated atom: find the atom (column) of the dictionary \(\Theta\) that has the largest inner product with the current residual \(r\).
3. Update the solution: add this atom's contribution to the solution \(s\).
4. Update the residual: subtract this atom's contribution from the residual \(r\).
5. Iterate: repeat steps 2-4 until the residual is small enough or a maximum number of iterations is reached (a minimal implementation sketch follows).
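A minimal NumPy sketch of this greedy loop on a hypothetical compressed-sensing setup (random \(\Theta\), a 5-sparse \(s\)); in practice, Orthogonal Matching Pursuit (available in scikit-learn as OrthogonalMatchingPursuit) is usually preferred for more reliable recovery.

```{python}
import numpy as np

rng = np.random.default_rng(0)
d, l, k = 200, 60, 5                       # signal length, number of measurements, sparsity

# Hypothetical setup: random sensing matrix Theta with unit-norm columns ('atoms')
Theta = rng.standard_normal((l, d))
Theta /= np.linalg.norm(Theta, axis=0)
s_true = np.zeros(d)
s_true[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
z = Theta @ s_true                         # the few observations we actually have

# Matching Pursuit
s_hat = np.zeros(d)
r = z.copy()                               # 1. initialize the residual
for _ in range(30):
    corr = Theta.T @ r                     # 2. inner product of every atom with the residual
    j = np.argmax(np.abs(corr))            #    pick the most correlated atom
    s_hat[j] += corr[j]                    # 3. update the solution
    r -= corr[j] * Theta[:, j]             # 4. update the residual
    if np.linalg.norm(r) < 1e-8:           # 5. stop once the residual is negligible
        break

print('true support:     ', np.sort(np.nonzero(s_true)[0]))
print('largest recovered:', np.sort(np.argsort(np.abs(s_hat))[-k:]))
print('relative residual:', np.linalg.norm(r) / np.linalg.norm(z))
```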
Sparse representation can be used to identify structural breaks in time series.
Signal (x): The first difference of an economic time series.
Dictionary (Ψ): An identity matrix.
Sparse Coefficients (s): During a stable period, the differenced series is close to zero, so s is also sparse (close to zero). When a structural break (a sudden change) occurs, the difference will have a large spike, corresponding to a large non-zero element in s.
By finding the non-zero entries in s, we can automatically detect moments when economic regimes or market behaviors have changed.
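A sketch of this idea on a hypothetical series with a single level shift. Because \(\Psi\) is the identity, the 'sparse code' is simply the differenced series, and unusually large entries flag candidate break dates (the threshold rule here is illustrative, not a formal test):

```{python}
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical series: stable around 100, then a sudden regime shift to 110 at t = 120
x = np.concatenate([100 + 0.2 * rng.standard_normal(120),
                    110 + 0.2 * rng.standard_normal(80)])

s = np.diff(x)                         # with an identity dictionary, s is just the differences
threshold = 5 * np.median(np.abs(s))   # flag unusually large one-period changes
breaks = np.where(np.abs(s) > threshold)[0] + 1

print('detected break points:', breaks)

plt.plot(x)
plt.vlines(breaks, x.min(), x.max(), colors='red', linestyles='--', label='detected break')
plt.legend()
plt.show()
```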
Summary: How to Choose the Right Method?
| Method | Type | Core Idea | Pros | Cons |
|---|---|---|---|---|
| PCA | Linear, unsupervised | Maximize variance | Simple, fast, highly interpretable | Cannot handle nonlinear structures |
| LDA | Linear, supervised | Maximize class separability | Excellent for classification | Requires class labels; makes assumptions about class distributions |
| MDS | Distance-driven | Preserve pairwise distances | Flexible; only needs a distance matrix | Computationally expensive; results depend on the distance metric |
| Isomap | Nonlinear, unsupervised | Preserve geodesic distances | Can 'unroll' certain manifolds | Sensitive to 'shortcut' noise; computationally intensive |
| LLE | Nonlinear, unsupervised | Preserve local linear reconstruction | Computationally efficient; handles various manifolds | Sensitive to the choice of K; can produce distortions |
| t-SNE | Nonlinear, unsupervised | Preserve neighborhood probabilities | Excellent for visualization; reveals clusters | Computationally expensive; suited to visualization, not general-purpose feature reduction |
Final Thought: Reduction is a Means to an End, Not the Goal Itself
In this chapter, we have explored a range of representation learning (dimensionality reduction) methods, from linear to nonlinear.
They are powerful tools for exploratory data analysis, helping us discover hidden structures like factors, clusters, and low-dimensional manifolds in seemingly chaotic high-dimensional data.
They are also a crucial preprocessing step for building predictive models, effectively improving model stability and generalization.
The key is to always combine your domain knowledge (economics, finance) to interpret the results of dimensionality reduction, giving real-world meaning to these abstract dimensions and structures.