02 Representation Learning: Extracting Core Insights from High-Dimensional Data

Welcome to Chapter 2: Representation Learning

Today’s Agenda:

  1. Introduction: Why do Economics and Finance need ‘dimensionality reduction’?
  2. Data Preprocessing: The cornerstone of success
  3. Part 1: Linear Methods (PCA, LDA, MDS)
  4. Part 2: Nonlinear Manifold Learning (Isomap, LLE, t-SNE)
  5. Part 3: Advanced Topics (Sparse Representation)
  6. Conclusion: How to choose the right tool for your problem

Core Question: Why Study ‘Dimensionality Reduction’ in Economics?

Imagine we want to predict a company’s stock return. How many variables might we have?

  • Firm-Level Data: Hundreds of financial ratios (P/E, ROA, leverage…)
  • Market Data: Historical prices, trading volumes, volatility…
  • Macroeconomic Data: GDP, interest rates, inflation, unemployment…
  • Alternative Data: Satellite imagery, news sentiment, supply chain info…

We can easily end up with a dataset of hundreds or even thousands of dimensions.

We Face a ‘Data-Rich, Insight-Poor’ Dilemma

The explosive growth in data volume does not directly translate to an increase in insight. The goal of representation learning is to extract the signal from the noise.

[Figure: From High-Dimensional Chaos to Low-Dimensional Clarity — noisy, redundant high-dimensional data is processed by representation learning, which finds meaningful structure and yields a clear, actionable low-dimensional insight space.]

This is the so-called ‘Curse of Dimensionality’.

What is the ‘Curse of Dimensionality’?

As the number of dimensions d increases, the volume of the space grows exponentially, so a fixed number of samples becomes increasingly sparse and data points end up ever farther apart from one another.

[Figure: Curse of Dimensionality — the same six data points become progressively sparser as the space grows from 1D (dense) to 2D (sparse) to 3D (extremely sparse).]
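To make the sparsity effect concrete, here is a minimal sketch (not part of the original material; the sample size and dimensions are arbitrary) showing how pairwise distances ‘concentrate’ as the dimension grows: the nearest point soon becomes almost as far away as the average point.

import numpy as np
from scipy.spatial.distance import pdist

# As dimension d grows, the ratio of the nearest to the average pairwise
# distance approaches 1: all points become roughly equally far apart.
rng = np.random.default_rng(0)
n_samples = 200

for d in [1, 2, 10, 100, 1000]:
    X = rng.uniform(size=(n_samples, d))   # points in the d-dimensional unit cube
    dists = pdist(X)                       # all pairwise Euclidean distances
    print(f'd = {d:4d}   nearest / average distance: {dists.min() / dists.mean():.3f}')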

The Curse of Dimensionality is an Enemy of Modeling and Analysis

When data dimensionality d is too high, a series of serious problems arise:

| Problem Category | Specific Manifestation | Impact on Economic Research |
|---|---|---|
| Computational Efficiency | Algorithm complexity grows exponentially | Models take too long to train, hindering iteration. |
| Data Sparsity | A fixed number of samples becomes very sparse | Samples are not representative; hard to find significant relationships. |
| Model Overfitting | The model learns noise, not the true pattern | Perfect in-sample performance, but poor out-of-sample (predictive) power. |
| Multicollinearity | Many features are highly correlated | Difficult to identify the true impact of individual variables; unstable parameter estimates. |

Representation learning (or dimensionality reduction) is the key to solving this problem.

The Goal of Representation Learning: Simplify with Minimal Information Loss

Our goal is to map a high-dimensional sample set \(X \in \mathbb{R}^{d \times N}\) to a low-dimensional space \(Z \in \mathbb{R}^{l \times N}\), where \(l \ll d\).

\[ \large{ \underbrace{ \begin{pmatrix} z_{1,n} \\ \vdots \\ z_{l,n} \end{pmatrix} }_{Z_n \in \mathbb{R}^{l \times 1}} = \underbrace{ \begin{pmatrix} w_{1,1} & \cdots & w_{1,d} \\ \vdots & \ddots & \vdots \\ w_{l,1} & \cdots & w_{l,d} \end{pmatrix} }_{W^T \in \mathbb{R}^{l \times d}} \underbrace{ \begin{pmatrix} x_{1,n} \\ \vdots \\ x_{d,n} \end{pmatrix} }_{X_n \in \mathbb{R}^{d \times 1}} } \]

Core Requirement: The new representation \(Z\) must preserve the most important ‘structure’ or ‘information’ from the original data \(X\). Different algorithms define ‘structure’ differently, leading to various reduction methods.

Before We Begin: Preprocessing is the Foundation of Success

Before applying any complex dimensionality reduction algorithm, we must clean the raw data. This is like laying the foundation before building a house.

[Figure: Data Preprocessing Pipeline — Raw data (outliers, NaNs, mixed scales) → 1. Clean & impute (handle outliers/NaNs) → 2. Standardize (unify scales) → Ready data.]

Preprocessing Issue 1: Outliers

Extreme values can severely distort variance-based methods (e.g., PCA), pulling the leading component towards the direction of the outlier.

[Figure: Effect of Outliers on Principal Component Analysis — without outliers, PC1 follows the direction of maximum variance; a single outlier skews PC1 toward itself.]

Common Treatments: Winsorization, log transformation, or direct removal.
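As a quick illustration of winsorization (a hedged sketch on simulated data, not the course dataset; the variable name pe_ratio is hypothetical), we can clip a feature at its 1st and 99th percentiles with pandas:

import numpy as np
import pandas as pd

# Simulate a feature with one extreme outlier
rng = np.random.default_rng(0)
pe_ratio = pd.Series(rng.normal(20, 5, size=1000))
pe_ratio.iloc[0] = 500.0                         # inject an outlier

# Winsorize: clip values at the 1st and 99th percentiles
lower, upper = pe_ratio.quantile([0.01, 0.99])
pe_winsorized = pe_ratio.clip(lower=lower, upper=upper)

print('Max before:', round(pe_ratio.max(), 1), ' Max after:', round(pe_winsorized.max(), 1))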

Preprocessing Issue 2: Missing Data

Most algorithms cannot handle missing values (NaN).

  • Common Strategies:
    1. Deletion: If the missing proportion is small, delete the row or column.
    2. Imputation: Fill with the mean, median, or more complex models (like K-Nearest Neighbors).

Preprocessing Issue 3: Inconsistent Scales

If ‘Market Cap’ (trillions) and ‘P/E Ratio’ (tens) are analyzed together, market cap will completely dominate the results.

  • Solution: Feature Scaling. The most common is Standardization, which transforms data to have a mean of 0 and a variance of 1.
  • Formula: \(x'_{i} = \large{\frac{x_i - \mu_i}{\sigma_i}}\)

Hands-On Preprocessing: Cleaning Stock Financial Data with Python

Let’s use a few fundamental indicators for S&P 500 companies as an example to show how to preprocess with scikit-learn.

#| echo: true
#| warning: false

# To ensure reproducibility, we create a mock dataset instead of relying on a live API.
# The structure of this dataset is similar to what you'd get from yfinance.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

sp500_tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'NVDA', 'TSLA', 'JPM', 'V', 'JNJ', 'WMT']
data = {
    'MarketCap': [2.8e12, 2.5e12, 1.8e12, 1.5e12, 1.2e12, 8e11, 4.5e11, 5e11, 4.8e11, 4.2e11],
    'trailingPE': [28.5, 35.2, 26.8, 60.1, 95.3, 120.2, 12.1, 38.5, 25.4, 22.1],
    'forwardPE': [27.1, 33.1, 25.0, 55.6, 70.1, np.nan, 11.5, 36.2, 24.1, 21.0],
    'returnOnEquity': [1.5, 0.45, 0.3, 0.25, 0.6, 0.28, 0.17, 0.22, np.nan, 0.2],
    'priceToBook': [45.1, 12.3, 7.1, 9.8, 30.2, 25.1, 1.8, 12.5, 6.7, 5.4],
    'debtToEquity': [150.1, 50.2, 12.5, 120.8, 30.1, 20.5, np.nan, 55.3, 40.1, 80.2]
}
df = pd.DataFrame(data, index=sp500_tickers)
df.index.name = 'Ticker'

print('Simulated Raw Data (First 5 rows):')
print(df.head())

Preprocessing Step 1: Handle Outliers and Missing Values

A log transformation can mitigate the effect of extreme values (like Market Cap). Then, we’ll fill missing NaN values with the feature’s mean.

#| echo: true
#| warning: false
#| cont: true

# Log transform can mitigate extreme values and right-skewed distributions
df['MarketCap_log'] = np.log(df['MarketCap'])

# Select features for analysis
features = ['MarketCap_log', 'trailingPE', 'forwardPE', 'returnOnEquity', 'priceToBook', 'debtToEquity']
df_features = df[features]

# Impute missing values using the mean
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df_features), columns=features, index=df.index)

print('\nAfter Imputing Missing Values (First 5 rows):')
print(df_imputed.head())

Preprocessing Step 2: Feature Scaling (Standardization)

Standardization transforms all features to a distribution with a mean of 0 and a variance of 1. This ensures all features have equal weight in subsequent models like PCA.

#| echo: true
#| warning: false
#| cont: true

# Standardization
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_imputed), columns=features, index=df.index)

print('\nAfter Standardization (First 5 rows):')
print(df_scaled.head())

Now, our data is ready for the models.

Part 1: Linear Dimensionality Reduction Methods

Principal Component Analysis (PCA): Finding the Directions of Maximum Variance

PCA is the most classic and commonly used linear dimensionality reduction method.

  • Core Idea: Rotate the coordinate system so that the new axes (principal components) explain the maximum possible variance in the data.
  • Goal: To preserve variance is to preserve information.
[Figure: PCA intuition — a scatter plot with PC1 aligned with the direction of maximum variance and PC2 orthogonal to it.]

PCA’s Objective Function: Maximizing Projected Variance

PCA seeks a projection direction (a unit vector \(w\)) that maximizes the variance of the projected data.

  • Projected Data: \(Z = Xw\) (here each row of \(X\) is one observation)
  • Projected Variance: \(\text{Var}(Z) = \text{Var}(Xw) = w^T S w\), where \(S\) is the covariance matrix of \(X\).

The optimization problem is:

\[ \large{\max_{w} \quad w^T S w} \] \[ \large{\text{s.t.} \quad w^T w = 1} \]

PCA’s Derivation: The Lagrangian

We use the method of Lagrange multipliers to solve this constrained optimization problem.

  1. Formulate the Lagrangian: The goal is to maximize \(w^T S w\) subject to the constraint that \(w\) is a unit vector, i.e., \(w^T w = 1\).

    \[ \large{L(w, \lambda) = w^T S w - \lambda(w^T w - 1)} \]

PCA’s Derivation: The First-Order Condition

  2. Take the derivative with respect to \(w\) and set it to zero: This finds the critical points of the Lagrangian function.

    \[ \large{\frac{\partial L}{\partial w} = 2Sw - 2\lambda w = 0} \]

PCA’s Derivation: A Classic Eigenvalue Problem

  3. Rearrange to get the final form: This reveals the core mathematical identity of PCA.

    \[ \large{Sw = \lambda w} \]

Conclusion: The optimal projection directions (the principal components) \(w\) are the eigenvectors of the covariance matrix \(S\). The corresponding variance explained by each component is its eigenvalue \(\lambda\). The eigenvector with the largest eigenvalue is the first principal component.
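As a numerical sanity check (a sketch on simulated data, not from the lecture code), we can verify that the eigen-decomposition of the sample covariance matrix reproduces exactly what scikit-learn’s PCA reports:

import numpy as np
from sklearn.decomposition import PCA

# Simulated data: 1000 samples, 3 correlated features
rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3.0, 1.0, 0.0],
                                 [1.0, 2.0, 0.0],
                                 [0.0, 0.0, 0.5]],
                            size=1000)

# Eigen-decomposition of the covariance matrix S
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Compare with sklearn's PCA
pca = PCA().fit(X)
print('Eigenvalues match explained variance:', np.allclose(eigvals, pca.explained_variance_))
print('Eigenvectors match components (up to sign):',
      np.allclose(np.abs(eigvecs.T), np.abs(pca.components_)))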

Summary of Key Steps in PCA

This translates the abstract mathematical theory into a clear operational workflow.

[Figure: PCA Algorithm Steps — 1. Standardize the data; 2. Compute the covariance matrix \(S\); 3. Find its eigenvalues and eigenvectors; 4. Select the top \(l\) components; 5. Transform the data: \(Z = W^T X\).]

Python Hands-On: Analyzing the U.S. Treasury Yield Curve with PCA

Yield curve movements are central to macroeconomic analysis. Rates at different maturities are highly correlated, making them a perfect candidate for PCA.

# To ensure reproducibility, we generate simulated yield data.
# Real data can be fetched using libraries like fredapi.
import pandas as pd
import numpy as np

# Simulate data
np.random.seed(42)
dates = pd.date_range('2000-01-01', '2024-01-01', freq='B')
n_days = len(dates)
base_level = np.linspace(1.0, 3.0, n_days) + np.random.randn(n_days).cumsum() * 0.05
maturities = ['1M', '3M', '6M', '1Y', '2Y', '3Y', '5Y', '7Y', '10Y', '20Y', '30Y']
n_maturities = len(maturities)

# Create factors
level_factor = np.random.randn(n_days) * 0.1
slope_factor = np.random.randn(n_days) * 0.05
curve_factor = np.random.randn(n_days) * 0.02

# Create yields
slope_loadings = np.linspace(-1, 1, n_maturities)
curve_loadings = np.sin(np.linspace(0, np.pi, n_maturities))
yields = base_level[:, None] + level_factor[:, None] * 1.0 + slope_factor[:, None] * slope_loadings + curve_factor[:, None] * curve_loadings
yield_df = pd.DataFrame(yields, index=dates, columns=maturities)

# Calculate daily changes
yield_changes = yield_df.diff().dropna()
print('Simulated Daily Yield Changes (First 5 rows):')
print(yield_changes.head())
Simulated Daily Yield Changes (First 5 rows):
                  1M        3M        6M        1Y        2Y        3Y  \
2000-01-04 -0.119564 -0.115018 -0.110528 -0.106145 -0.101909 -0.097846   
2000-01-05  0.090287  0.098524  0.105740  0.111013  0.113614  0.113072   
2000-01-06 -0.085673 -0.077000 -0.067781 -0.057523 -0.045833 -0.032462   
2000-01-07  0.148844  0.138733  0.127924  0.115786  0.101820  0.085705   
2000-01-10 -0.138239 -0.116690 -0.094897 -0.072639 -0.049741 -0.026092   

                  5Y        7Y       10Y       20Y       30Y  
2000-01-04 -0.093964 -0.090255 -0.086693 -0.083238 -0.079839  
2000-01-05  0.109226  0.102237  0.092576  0.080972  0.068347  
2000-01-06 -0.017323 -0.000502  0.017750  0.037041  0.056879  
2000-01-07  0.067329  0.046804  0.024450  0.000768 -0.023612  
2000-01-10 -0.001653  0.023538  0.049368  0.075662  0.102201  

Running PCA and Explaining the Variance

We run PCA on the daily changes in yields, as finance is often more concerned with changes than levels.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

# Standardize the data
scaler_yield = StandardScaler()
scaled_changes = scaler_yield.fit_transform(yield_changes)

# Run PCA
pca = PCA()
pca.fit(scaled_changes)

# Visualize explained variance
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)


fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.6, color='skyblue', label='Individual explained variance')
ax.step(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, where='mid', color='red', linestyle='--', label='Cumulative explained variance')
ax.set_ylabel('Explained variance ratio')
ax.set_xlabel('Principal component index')
ax.set_title('The First 3 Components Explain Over 95% of Yield Curve Variance', fontsize=16)
ax.axhline(y=0.95, color='gray', linestyle=':', linewidth=2)
ax.text(len(explained_variance_ratio), 0.95, '95% threshold', va='bottom', ha='right')
ax.legend(loc='best')
ax.xaxis.set_major_locator(mticker.MaxNLocator(integer=True))
plt.show()
Figure 1: PCA Explained Variance Ratio

The Economic Meaning of the Principal Components: Level, Slope, and Curvature

By examining the loadings (i.e., the eigenvectors) of the principal components, we can assign them economic meaning.

#| echo: true
#| warning: false
#| label: fig-pca-components
#| fig-cap: 'Economic Interpretation of Yield Curve Principal Components'
#| cont: true

components = pd.DataFrame(pca.components_[:3, :].T, 
                          columns=['PC1 (Level)', 'PC2 (Slope)', 'PC3 (Curvature)'], 
                          index=yield_changes.columns)
# To align with theory, we may need to flip the sign of some vectors (this doesn't change the interpretation)
if components['PC1 (Level)'].mean() < 0: components['PC1 (Level)'] *= -1
if components['PC2 (Slope)'].iloc[0] > 0: components['PC2 (Slope)'] *= -1  # flip so short-maturity loadings are negative
if components['PC3 (Curvature)'].mean() > 0: components['PC3 (Curvature)'] *= -1


fig, ax = plt.subplots(figsize=(12, 7))
components.plot(ax=ax, marker='o')
ax.set_title('Economic Interpretation of Yield Curve Principal Components', fontsize=16)
ax.set_ylabel('Component Loading')
ax.set_xlabel('Maturity')
ax.axhline(0, color='black', linewidth=0.5, linestyle='--')
ax.legend(title='Principal Components')
plt.show()

Economic Interpretation of the Principal Components

  • PC1 (Level): All maturities have loadings of the same sign. Represents a parallel shift of the entire yield curve. This is the most significant movement, often related to the overall stance of monetary policy.

  • PC2 (Slope): Short-term loadings are negative, long-term are positive. Represents a change in the slope of the yield curve (steepening or flattening), reflecting market expectations of future short-term rates and economic growth.

  • PC3 (Curvature): Short and long-term loadings are positive, mid-term are negative. Represents a change in the curvature (the ‘bow’ shape), related to expectations of interest rate volatility.

Linear Discriminant Analysis (LDA): Reduction for Classification

LDA is a supervised learning algorithm for dimensionality reduction. Unlike PCA, which seeks maximum variance, LDA’s goal is to find a projection direction that maximizes the separation between different classes while minimizing the variance within each class.

[Figure: LDA objective — maximize the between-class distance while minimizing the within-class distance.]

LDA’s Objective: The Within-Class Scatter Matrix

  • Within-class Scatter Matrix (\(S_w\)): Measures the scatter of data points within each class.
    • \(S_w = \sum_{c=1}^{C} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^T\)
    • We want to minimize this. It represents how compact each class is.

LDA’s Objective: The Between-Class Scatter Matrix

  • Between-class Scatter Matrix (\(S_b\)): Measures the scatter of the class means around the overall mean.
    • \(S_b = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^T\)
    • We want to maximize this. It represents how far apart the classes are from each other.

LDA’s Objective Function: Maximizing the Ratio

LDA aims to find the projection matrix \(W\) that maximizes the ratio of the between-class scatter to the within-class scatter.

\[ \large{J(W) = \frac{\text{tr}(W^T S_b W)}{\text{tr}(W^T S_w W)}} \]

This is known as Fisher’s criterion.

PCA vs. LDA: An Intuitive Comparison

PCA

PCA only cares about overall variance. It would choose the horizontal direction as PC1, which does a poor job of separating the two classes.

[Figure: PCA projection — data projected along the axis of maximum variance (PC1), resulting in poor separation of the two classes.]

LDA

LDA considers the class labels and chooses a projection that maximizes the separation between the classes, perfectly distinguishing them.

[Figure: LDA projection — data projected along the axis (LD1) that best separates the classes, yielding a clear distinction.]

Solving LDA: A Generalized Eigenvalue Problem

Maximizing the ratio \(J(W)\) can be transformed into solving a generalized eigenvalue problem:

\[ \large{S_b w = \lambda S_w w} \]

Multiplying both sides by \(S_w^{-1}\), we get a more familiar form:

\[ \large{S_w^{-1} S_b w = \lambda w} \]

Conclusion: The optimal projection directions \(w\) for LDA are the eigenvectors of the matrix \(S_w^{-1} S_b\).
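The following from-scratch sketch (two simulated Gaussian classes; all parameters are illustrative) solves this eigenvalue problem directly with NumPy:

import numpy as np

# Two simulated classes in 2D
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[4.0, 2.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
mu, mu0, mu1 = X.mean(axis=0), X0.mean(axis=0), X1.mean(axis=0)

# Within-class and between-class scatter matrices
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
S_b = (len(X0) * np.outer(mu0 - mu, mu0 - mu)
       + len(X1) * np.outer(mu1 - mu, mu1 - mu))

# Solve S_w^{-1} S_b w = lambda w and keep the leading eigenvector
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print('LDA projection direction:', w / np.linalg.norm(w))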

Python Hands-On: Classifying the Iris Dataset with LDA

The Iris dataset is the ‘Hello World’ of classification algorithms. It contains 3 classes, each with 4 features. Our goal is to reduce it to 2 dimensions and visualize the result.

#| echo: true
#| warning: false
#| label: fig-lda-iris
#| fig-cap: 'LDA projects the 4D Iris dataset onto 2D'

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the data
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Run LDA, reducing to 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_r = lda.fit(X, y).transform(X)

# Visualization
plt.figure(figsize=(8, 6))
colors = ['navy', 'turquoise', 'darkorange']
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of IRIS dataset')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.grid(True)
plt.show()

Analysis: Even after compressing from 4D to 2D, LDA excellently separates the three classes, demonstrating its powerful classification capabilities.

Multidimensional Scaling (MDS): Reconstructing a ‘Map’ from Distances

MDS has a completely different starting point from PCA and LDA. It doesn’t work with the feature matrix \(X\) directly but starts from a known distance (or dissimilarity) matrix \(D\).

  • Core Idea: Find a set of points \(Z\) in a low-dimensional space such that the Euclidean distances between these points are as close as possible to the original distance matrix \(D\).
  • Use Case: When we don’t have the original features but can measure the dissimilarity between objects. For example, survey data on ‘brand similarity’ or the ‘edit distance’ between genetic sequences.

MDS Analogy: Reconstructing a City Map

Imagine you only know the straight-line flight distances between major cities, but you have no latitude or longitude information.

[Figure: MDS analogy with cities — input: a matrix of pairwise distances between cities (km); output: reconstructed 2D coordinates, i.e. a map.]

The goal of MDS is to find the optimal 2D coordinates for each city based on this distance matrix.
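A minimal sketch of this idea with scikit-learn’s MDS (the city names and the distance matrix below are invented for illustration):

import numpy as np
from sklearn.manifold import MDS

cities = ['City A', 'City B', 'City C', 'City D']
# Symmetric matrix of pairwise 'flight distances' (km), invented for illustration
D = np.array([[   0, 1080,  600,  900],
              [1080,    0,  750,  400],
              [ 600,  750,    0,  500],
              [ 900,  400,  500,    0]])

# Metric MDS on a precomputed dissimilarity matrix
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)              # recovered 'map' (unique only up to rotation/reflection)

for city, (x, y) in zip(cities, coords):
    print(f'{city}: ({x:7.1f}, {y:7.1f})')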

Part 2: Nonlinear Dimensionality Reduction (Manifold Learning)

The Limitation of Linear Methods: When Data Structure is Curved

Linear methods like PCA and LDA assume that the data lies on a flat hyperplane. But what if the data’s intrinsic structure is curved?

[Figure: PCA failure on the Swiss roll manifold — points A and B are far apart along the curved 3D manifold (large geodesic distance), but a linear PCA projection maps them close together in 2D, losing the nonlinear structure.]

A linear method like PCA would incorrectly project distant points (like A and B) close together, failing to ‘unroll’ the data.

The Core Idea of Manifold Learning: Data Lives on a Low-Dimensional Manifold

  • Manifold Hypothesis: The high-dimensional data we observe is actually generated by a few latent variables (the intrinsic dimension), and these data points lie on a low-dimensional manifold embedded in the high-dimensional space.
  • Goal: To ‘unroll’ this manifold and find low-dimensional coordinates that reflect the true neighborhood relationships of the data.
  • Difference from Linear Methods: Manifold learning focuses on local structure, assuming that Euclidean distances are only reliable between nearby points.

Isomap: Measuring Distance Along the ‘Surface’

Isomap is a clever extension of MDS that replaces Euclidean distance with Geodesic Distance.

[Figure: Isomap intuition, geodesic vs. Euclidean distance — like an ant constrained to walk on a bent surface, Isomap measures distance along the data manifold (geodesic distance) rather than ‘through’ space (Euclidean distance).]

Steps:

  1. Construct Neighborhood Graph: For each point, connect it only to its K-nearest neighbors.
  2. Compute Shortest Paths: Use a graph algorithm (like Dijkstra’s) to compute the shortest path between all pairs of points, approximating the geodesic distance.
  3. Apply MDS: Use the resulting shortest-path distance matrix as input to the classical MDS algorithm.
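A short sketch (the neighborhood size K=10 is an illustrative choice) applying scikit-learn’s Isomap to the classic Swiss roll:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap
import matplotlib.pyplot as plt

# Generate the 3D Swiss roll; 'color' encodes position along the manifold
X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# Isomap: K-nearest-neighbor graph + shortest paths + classical MDS
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)

plt.figure(figsize=(8, 5))
plt.scatter(X_iso[:, 0], X_iso[:, 1], c=color, cmap='viridis', s=10)
plt.title('Isomap unrolls the Swiss roll into 2D')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()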

Locally Linear Embedding (LLE): Preserving Local Linear Relationships

LLE assumes that each data point can be linearly reconstructed by its neighbors, and this local geometric relationship should be preserved in the low-dimensional space.

[Figure: LLE ‘unrolls’ the manifold — local reconstruction weights \(W_{ij}\) computed in the high-dimensional space are preserved in the low-dimensional embedding; distances and angles are not, so local geometry may change.]

The Heart of LLE: Preserve Reconstruction Weights, Not Distances

[Figure: The core of LLE — it preserves the reconstruction weights \(W_{ij}\), i.e. each point’s proportional position within its neighborhood (\(Z_i = \sum_j W_{ij} Z_j\)), not pairwise distances; a rigid, distance-preserving embedding is explicitly not the goal.]
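For comparison, a similar sketch (parameters again illustrative) using scikit-learn’s LocallyLinearEmbedding on the same Swiss roll:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding
import matplotlib.pyplot as plt

X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# LLE: reconstruct each point from its K neighbors, then preserve those weights in 2D
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)

plt.figure(figsize=(8, 5))
plt.scatter(X_lle[:, 0], X_lle[:, 1], c=color, cmap='viridis', s=10)
plt.title('LLE embedding of the Swiss roll')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()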

t-SNE: The Swiss Army Knife of Data Visualization

t-SNE (t-distributed Stochastic Neighbor Embedding) is currently the most powerful and popular tool for reducing high-dimensional data to 2D or 3D for visualization.

  • Core Idea (Probabilistic Matching):
    1. In high-D space, convert Euclidean distances between points into conditional probabilities that represent the likelihood that point \(i\) would pick point \(j\) as its neighbor (using a Gaussian distribution).
    2. In low-D space, define a similar conditional probability (using a heavier-tailed t-distribution).
    3. Adjust the positions of points in the low-D space to make the two probability distributions as similar as possible (by minimizing the KL divergence).
  • Advantage: It is exceptionally good at revealing the clustering structure of data.

t-SNE Key Parameter: Perplexity

  • Perplexity:
    • This is the most important parameter. It can be loosely interpreted as the ‘effective number of neighbors’ each point considers.
    • Typical values are between 5 and 50.
    • A lower value focuses on local structure, while a higher value considers more of the global structure.
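To build intuition for this parameter, here is a small sketch (synthetic blobs; the perplexity values are illustrative) that embeds the same data under three perplexity settings:

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# 300 points in 20 dimensions, grouped into 4 clusters
X, y = make_blobs(n_samples=300, centers=4, n_features=20, random_state=0)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perp in zip(axes, [5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perp, init='pca',
               learning_rate='auto', random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='tab10', s=15)
    ax.set_title(f'perplexity = {perp}')
plt.show()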

Important Cautions for Interpreting t-SNE plots

  1. Do not over-interpret distances between clusters: The distance between two clusters on a t-SNE plot does not meaningfully represent how ‘far apart’ they are in the original space.
  2. Do not over-interpret the size of clusters: The area of a cluster on the plot does not mean it contains more data points or has a larger variance.
  3. t-SNE is an exploratory visualization tool, not a rigorous clustering analysis method.

Python Hands-On: Visualizing Handwritten Digits with t-SNE

We’ll use the handwritten digits dataset from scikit-learn. Each digit is an 8x8 = 64-dimensional vector. We will use t-SNE to reduce it to 2 dimensions.

#| echo: true
#| warning: false
#| label: fig-tsne-digits
#| fig-cap: 't-SNE reduces 64D handwritten digit data to 2D'
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np

# Load the data
digits = load_digits()
X = digits.data
y = digits.target

# Run t-SNE
# init='pca' speeds up convergence, learning_rate='auto' is recommended in new scikit-learn versions
tsne = TSNE(n_components=2, init='pca', random_state=0, perplexity=30, learning_rate='auto')
X_tsne = tsne.fit_transform(X)

# Visualization
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap=plt.cm.get_cmap('jet', 10))
plt.title('t-SNE visualization of handwritten digits')
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
cbar = plt.colorbar(scatter)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(np.arange(10))
plt.show()

Analysis: t-SNE separates the samples of different digits into largely distinct clusters in 2D space, which is incredibly helpful for understanding the structure of high-dimensional data.

Part 3: Advanced Topic: Sparse Representation

Sparse Representation: Building Signals with the Fewest ‘Blocks’

Previous methods aimed to ‘compress’ data, but sparse representation has a different starting point.

  • Core Idea: Any signal (e.g., an economic time series) can be represented as a linear combination of a few ‘atoms’ (basis vectors) from a ‘dictionary’ \(\Psi\). \[ \large{x = \Psi s} \] Here, \(s\) is a sparse vector, meaning most of its elements are zero.
[Figure: Sparse representation analogy — the signal \(x\) equals the dictionary \(\Psi\) multiplied by a sparse coefficient vector \(s\) in which only a few entries are nonzero.]

Economic Intuition: The complex dynamics of the market might be driven by a combination of only a few ‘latent economic states’ or ‘shocks’. Sparse representation aims to find these core drivers.

Compressed Sensing: Recovering the Full Signal from Fewer Samples

The sparsity assumption leads to a surprising conclusion: compressed sensing.

If we know that a signal \(x\) is sparse under a certain dictionary \(\Psi\), we don’t need to observe the entire signal \(x\). We can take a small number of random measurements \(z\) and still perfectly reconstruct the original signal \(x\).

\[ \large{z = \Phi x = \Phi \Psi s = \Theta s} \]

  • \(z\): a small number of observations (\(l \times 1\))
  • \(\Phi\): the measurement matrix (\(l \times d\), \(l \ll d\))
  • \(\Theta\): the sensing matrix
  • Goal: Given \(z\) and \(\Theta\), solve for the sparse vector \(s\). This is the foundation of modern signal processing, MRI, and more.

Reconstruction Algorithm: Matching Pursuit

Finding the sparse vector \(s\) from \(z\) and \(\Theta\) is an NP-hard problem. Matching Pursuit is a greedy algorithm that approximates the solution iteratively.

  1. Initialize: Residual \(r_0 = z\), sparse solution \(s=0\).
  2. Find Most Correlated Atom: Find the atom in the dictionary \(\Theta\) that is most correlated with the current residual \(r\) (has the largest inner product).
  3. Update Solution: Add the contribution of this atom to the solution \(s\).
  4. Update Residual: Subtract the contribution of this atom from the current residual \(r\).
  5. Iterate: Repeat steps 2-4 until the residual is small enough or a maximum number of iterations is reached.
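A compact NumPy sketch of these steps (not the course’s code; the dimensions and sparsity level are illustrative) that recovers a sparse vector from compressed measurements:

import numpy as np

rng = np.random.default_rng(0)
d, l, k = 100, 30, 3                               # signal length, measurements, sparsity

# Ground truth: a k-sparse vector and random measurements z = Theta @ s
s_true = np.zeros(d)
s_true[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)
Theta = rng.normal(size=(l, d)) / np.sqrt(l)
z = Theta @ s_true

def matching_pursuit(z, Theta, n_iter=50, tol=1e-8):
    s = np.zeros(Theta.shape[1])
    r = z.copy()                                   # 1. initialize residual r = z
    for _ in range(n_iter):
        corr = Theta.T @ r                         # 2. correlation with every atom
        j = np.argmax(np.abs(corr))                #    pick the most correlated atom
        coef = corr[j] / (Theta[:, j] @ Theta[:, j])
        s[j] += coef                               # 3. update the sparse solution
        r = r - coef * Theta[:, j]                 # 4. update the residual
        if np.linalg.norm(r) < tol:                # 5. stop when the residual is tiny
            break
    return s

s_hat = matching_pursuit(z, Theta)
print('True support:        ', np.sort(np.nonzero(s_true)[0]))
print('Recovered (largest k):', np.sort(np.argsort(np.abs(s_hat))[-k:]))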

Economic Application: Identifying Structural Breaks

Sparse representation can be used to identify structural breaks in time series.

  • Signal (x): The first difference of an economic time series.
  • Dictionary (Ψ): An identity matrix.
  • Sparse Coefficients (s): During a stable period, the differenced series is close to zero, so s is also sparse (close to zero). When a structural break (a sudden change) occurs, the difference will have a large spike, corresponding to a large non-zero element in s.

By finding the non-zero entries in s, we can automatically detect moments when economic regimes or market behaviors have changed.
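A toy sketch of this idea (the simulated series and the threshold rule are invented for illustration): with an identity dictionary, the sparse coefficients are simply the first differences, and a break appears as an unusually large entry.

import numpy as np
import pandas as pd

# Simulate a monthly series that jumps to a new regime two-thirds of the way through
rng = np.random.default_rng(0)
dates = pd.date_range('2015-01-01', periods=200, freq='MS')
series = pd.Series(np.r_[rng.normal(2.0, 0.1, 130),
                         rng.normal(4.0, 0.1, 70)], index=dates)

# With Psi = I, the sparse coefficients s are simply the first differences
s = series.diff().dropna()

# Flag entries that are far larger than typical fluctuations
threshold = 5 * s.abs().median()
breaks = s[s.abs() > threshold]
print('Detected break dates:')
print(breaks)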

Summary: How to Choose the Right Method?

| Method | Type | Core Idea | Pros | Cons |
|---|---|---|---|---|
| PCA | Linear, Unsupervised | Maximize variance | Simple, fast, highly interpretable | Cannot handle nonlinear structures |
| LDA | Linear, Supervised | Maximize class separability | Excellent for classification | Requires class labels; has assumptions about class distributions |
| MDS | Distance-driven | Preserve pairwise distances | Flexible, only needs a distance matrix | Computationally expensive; results depend on the distance metric |
| Isomap | Nonlinear, Unsupervised | Preserve geodesic distance | Can ‘unroll’ certain manifolds | Sensitive to ‘shortcut’ noise; computationally intensive |
| LLE | Nonlinear, Unsupervised | Preserve local linear reconstruction | Computationally efficient; handles various manifolds | Sensitive to the choice of K; can produce distortions |
| t-SNE | Nonlinear, Unsupervised | Preserve neighborhood probabilities | Excellent for visualization; reveals clusters | Computationally expensive; suited to visualization, not general-purpose feature extraction |

Final Thought: Reduction is a Means to an End, Not the Goal Itself

In this chapter, we have explored a range of representation learning (dimensionality reduction) methods, from linear to nonlinear.

  • They are powerful tools for exploratory data analysis, helping us discover hidden structures like factors, clusters, and low-dimensional manifolds in seemingly chaotic high-dimensional data.
  • They are also a crucial preprocessing step for building predictive models, effectively improving model stability and generalization.

The key is to always combine your domain knowledge (economics, finance) to interpret the results of dimensionality reduction, giving real-world meaning to these abstract dimensions and structures.

Thank You!

Questions & Discussion