02 Representation Learning: Extracting Core Insights from High-Dimensional Data

Welcome to Chapter 2: Representation Learning

Today’s Agenda:

  1. Introduction: Why do Economics and Finance need ‘dimensionality reduction’?
  2. Data Preprocessing: The cornerstone of success
  3. Part 1: Linear Methods (PCA, LDA, MDS)
  4. Part 2: Nonlinear Manifold Learning (Isomap, LLE, t-SNE)
  5. Part 3: Advanced Topics (Sparse Representation)
  6. Conclusion: How to choose the right tool for your problem

Core Question: Why Study ‘Dimensionality Reduction’ in Economics?

Imagine we want to predict a company’s stock return. How many variables might we have?

  • Firm-Level Data: Hundreds of financial ratios (P/E, ROA, leverage…)
  • Market Data: Historical prices, trading volumes, volatility…
  • Macroeconomic Data: GDP, interest rates, inflation, unemployment…
  • Alternative Data: Satellite imagery, news sentiment, supply chain info…

We can easily end up with a dataset of hundreds or even thousands of dimensions.

We Face a ‘Data-Rich, Insight-Poor’ Dilemma

The explosive growth in data volume does not directly translate to an increase in insight. The goal of representation learning is to extract the signal from the noise.

[Figure: From High-Dimensional Chaos to Low-Dimensional Clarity — noisy, redundant high-dimensional data is processed by representation learning, which finds meaningful structure and yields a clear, actionable low-dimensional insight space.]

This is the so-called ‘Curse of Dimensionality’.

What is the ‘Curse of Dimensionality’?

As the number of dimensions d increases, the volume of the space grows exponentially, so a fixed number of samples becomes increasingly sparse and data points end up ever farther apart from one another.

[Figure: Curse of Dimensionality — the same six data points become progressively sparser as the space grows from 1D (dense) to 2D (sparse) to 3D (extremely sparse).]
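To make the sparsity effect concrete, here is a minimal sketch (not part of the original material; the sample size and dimensions are arbitrary) showing how pairwise distances ‘concentrate’ as the dimension grows: the nearest point soon becomes almost as far away as the average point.

import numpy as np
from scipy.spatial.distance import pdist

# As dimension d grows, the ratio of the nearest to the average pairwise
# distance approaches 1: all points become roughly equally far apart.
rng = np.random.default_rng(0)
n_samples = 200

for d in [1, 2, 10, 100, 1000]:
    X = rng.uniform(size=(n_samples, d))   # points in the d-dimensional unit cube
    dists = pdist(X)                       # all pairwise Euclidean distances
    print(f'd = {d:4d}   nearest / average distance: {dists.min() / dists.mean():.3f}')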

The Curse of Dimensionality is an Enemy of Modeling and Analysis

When data dimensionality d is too high, a series of serious problems arise:

| Problem Category | Specific Manifestation | Impact on Economic Research |
|---|---|---|
| Computational Efficiency | Algorithm complexity grows exponentially | Models take too long to train, hindering iteration. |
| Data Sparsity | A fixed number of samples becomes very sparse | Samples are not representative; hard to find significant relationships. |
| Model Overfitting | The model learns noise, not the true pattern | Perfect in-sample performance, but poor out-of-sample (predictive) power. |
| Multicollinearity | Many features are highly correlated | Difficult to identify the true impact of individual variables; unstable parameter estimates. |

Representation learning (or dimensionality reduction) is the key to solving this problem.

The Goal of Representation Learning: Simplify with Minimal Information Loss

Our goal is to map a high-dimensional sample set \(X \in \mathbb{R}^{d \times N}\) to a low-dimensional space \(Z \in \mathbb{R}^{l \times N}\), where \(l \ll d\).

\[ \large{ \underbrace{ \begin{pmatrix} z_{1,n} \\ \vdots \\ z_{l,n} \end{pmatrix} }_{Z_n \in \mathbb{R}^{l \times 1}} = \underbrace{ \begin{pmatrix} w_{1,1} & \cdots & w_{1,d} \\ \vdots & \ddots & \vdots \\ w_{l,1} & \cdots & w_{l,d} \end{pmatrix} }_{W^T \in \mathbb{R}^{l \times d}} \underbrace{ \begin{pmatrix} x_{1,n} \\ \vdots \\ x_{d,n} \end{pmatrix} }_{X_n \in \mathbb{R}^{d \times 1}} } \]

Core Requirement: The new representation \(Z\) must preserve the most important ‘structure’ or ‘information’ from the original data \(X\). Different algorithms define ‘structure’ differently, leading to various reduction methods.

Before We Begin: Preprocessing is the Foundation of Success

Before applying any complex dimensionality reduction algorithm, we must clean the raw data. This is like laying the foundation before building a house.

[Figure: Data Preprocessing Pipeline — Raw data (outliers, NaNs, mixed scales) → 1. Clean & impute (handle outliers/NaNs) → 2. Standardize (unify scales) → Ready data.]

Preprocessing Issue 1: Outliers

Extreme values can severely distort variance-based methods (e.g., PCA), pulling the leading component towards the direction of the outlier.

[Figure: Effect of Outliers on Principal Component Analysis — without outliers, PC1 follows the direction of maximum variance; a single outlier skews PC1 toward itself.]

Common Treatments: Winsorization, log transformation, or direct removal.
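As a quick illustration of winsorization (a hedged sketch on simulated data, not the course dataset; the variable name pe_ratio is hypothetical), we can clip a feature at its 1st and 99th percentiles with pandas:

import numpy as np
import pandas as pd

# Simulate a feature with one extreme outlier
rng = np.random.default_rng(0)
pe_ratio = pd.Series(rng.normal(20, 5, size=1000))
pe_ratio.iloc[0] = 500.0                         # inject an outlier

# Winsorize: clip values at the 1st and 99th percentiles
lower, upper = pe_ratio.quantile([0.01, 0.99])
pe_winsorized = pe_ratio.clip(lower=lower, upper=upper)

print('Max before:', round(pe_ratio.max(), 1), ' Max after:', round(pe_winsorized.max(), 1))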

Preprocessing Issue 2: Missing Data

Most algorithms cannot handle missing values (NaN).

  • Common Strategies:
    1. Deletion: If the missing proportion is small, delete the row or column.
    2. Imputation: Fill with the mean, median, or more complex models (like K-Nearest Neighbors).

Preprocessing Issue 3: Inconsistent Scales

If ‘Market Cap’ (trillions) and ‘P/E Ratio’ (tens) are analyzed together, market cap will completely dominate the results.

  • Solution: Feature Scaling. The most common is Standardization, which transforms data to have a mean of 0 and a variance of 1.
  • Formula: \(x'_{i} = \large{\frac{x_i - \mu_i}{\sigma_i}}\)

Hands-On Preprocessing: Cleaning Stock Financial Data with Python

Let’s use a few fundamental indicators for S&P 500 companies as an example to show how to preprocess with scikit-learn.

#| echo: true
#| warning: false

# To ensure reproducibility, we create a mock dataset instead of relying on a live API.
# The structure of this dataset is similar to what you'd get from yfinance.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

sp500_tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'NVDA', 'TSLA', 'JPM', 'V', 'JNJ', 'WMT']
data = {
    'MarketCap': [2.8e12, 2.5e12, 1.8e12, 1.5e12, 1.2e12, 8e11, 4.5e11, 5e11, 4.8e11, 4.2e11],
    'trailingPE': [28.5, 35.2, 26.8, 60.1, 95.3, 120.2, 12.1, 38.5, 25.4, 22.1],
    'forwardPE': [27.1, 33.1, 25.0, 55.6, 70.1, np.nan, 11.5, 36.2, 24.1, 21.0],
    'returnOnEquity': [1.5, 0.45, 0.3, 0.25, 0.6, 0.28, 0.17, 0.22, np.nan, 0.2],
    'priceToBook': [45.1, 12.3, 7.1, 9.8, 30.2, 25.1, 1.8, 12.5, 6.7, 5.4],
    'debtToEquity': [150.1, 50.2, 12.5, 120.8, 30.1, 20.5, np.nan, 55.3, 40.1, 80.2]
}
df = pd.DataFrame(data, index=sp500_tickers)
df.index.name = 'Ticker'

print('Simulated Raw Data (First 5 rows):')
print(df.head())

Preprocessing Step 1: Handle Outliers and Missing Values

A log transformation can mitigate the effect of extreme values (like Market Cap). Then, we’ll fill missing NaN values with the feature’s mean.

#| echo: true
#| warning: false
#| cont: true

# Log transform can mitigate extreme values and right-skewed distributions
df['MarketCap_log'] = np.log(df['MarketCap'])

# Select features for analysis
features = ['MarketCap_log', 'trailingPE', 'forwardPE', 'returnOnEquity', 'priceToBook', 'debtToEquity']
df_features = df[features]

# Impute missing values using the mean
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df_features), columns=features, index=df.index)

print('\nAfter Imputing Missing Values (First 5 rows):')
print(df_imputed.head())

Preprocessing Step 2: Feature Scaling (Standardization)

Standardization transforms all features to a distribution with a mean of 0 and a variance of 1. This ensures all features have equal weight in subsequent models like PCA.

#| echo: true
#| warning: false
#| cont: true

# Standardization
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_imputed), columns=features, index=df.index)

print('\nAfter Standardization (First 5 rows):')
print(df_scaled.head())

Now, our data is ready for the models.

Part 1: Linear Dimensionality Reduction Methods

Principal Component Analysis (PCA): Finding the Directions of Maximum Variance

PCA is the most classic and commonly used linear dimensionality reduction method.

  • Core Idea: Rotate the coordinate system so that the new axes (principal components) explain the maximum possible variance in the data.
  • Goal: To preserve variance is to preserve information.
[Figure: PCA intuition — a scatter plot with PC1 aligned with the direction of maximum variance and PC2 orthogonal to it.]

PCA’s Objective Function: Maximizing Projected Variance

PCA seeks a projection direction (a unit vector \(w\)) that maximizes the variance of the projected data.

  • Projected Data: \(Z = Xw\) (here each row of \(X\) is one observation)
  • Projected Variance: \(\text{Var}(Z) = \text{Var}(Xw) = w^T S w\), where \(S\) is the covariance matrix of \(X\).

The optimization problem is:

\[ \large{\max_{w} \quad w^T S w} \] \[ \large{\text{s.t.} \quad w^T w = 1} \]

PCA’s Derivation: The Lagrangian

We use the method of Lagrange multipliers to solve this constrained optimization problem.

  1. Formulate the Lagrangian: The goal is to maximize \(w^T S w\) subject to the constraint that \(w\) is a unit vector, i.e., \(w^T w = 1\).

    \[ \large{L(w, \lambda) = w^T S w - \lambda(w^T w - 1)} \]

PCA’s Derivation: The First-Order Condition

  2. Take the derivative with respect to \(w\) and set it to zero: This finds the critical points of the Lagrangian function.

    \[ \large{\frac{\partial L}{\partial w} = 2Sw - 2\lambda w = 0} \]

PCA’s Derivation: A Classic Eigenvalue Problem

  3. Rearrange to get the final form: This reveals the core mathematical identity of PCA.

    \[ \large{Sw = \lambda w} \]

Conclusion: The optimal projection directions (the principal components) \(w\) are the eigenvectors of the covariance matrix \(S\). The corresponding variance explained by each component is its eigenvalue \(\lambda\). The eigenvector with the largest eigenvalue is the first principal component.
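As a numerical sanity check (a sketch on simulated data, not from the lecture code), we can verify that the eigen-decomposition of the sample covariance matrix reproduces exactly what scikit-learn’s PCA reports:

import numpy as np
from sklearn.decomposition import PCA

# Simulated data: 1000 samples, 3 correlated features
rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3.0, 1.0, 0.0],
                                 [1.0, 2.0, 0.0],
                                 [0.0, 0.0, 0.5]],
                            size=1000)

# Eigen-decomposition of the covariance matrix S
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Compare with sklearn's PCA
pca = PCA().fit(X)
print('Eigenvalues match explained variance:', np.allclose(eigvals, pca.explained_variance_))
print('Eigenvectors match components (up to sign):',
      np.allclose(np.abs(eigvecs.T), np.abs(pca.components_)))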

Summary of Key Steps in PCA

This translates the abstract mathematical theory into a clear operational workflow.

[Figure: PCA Algorithm Steps — 1. Standardize the data; 2. Compute the covariance matrix \(S\); 3. Find its eigenvalues and eigenvectors; 4. Select the top \(l\) components; 5. Transform the data: \(Z = W^T X\).]

Python Hands-On: Analyzing the U.S. Treasury Yield Curve with PCA

Yield curve movements are central to macroeconomic analysis. Rates at different maturities are highly correlated, making them a perfect candidate for PCA.

# To ensure reproducibility, we generate simulated yield data.
# Real data can be fetched using libraries like fredapi.
import pandas as pd
import numpy as np

# Simulate data
np.random.seed(42)
dates = pd.date_range('2000-01-01', '2024-01-01', freq='B')
n_days = len(dates)
base_level = np.linspace(1.0, 3.0, n_days) + np.random.randn(n_days).cumsum() * 0.05
maturities = ['1M', '3M', '6M', '1Y', '2Y', '3Y', '5Y', '7Y', '10Y', '20Y', '30Y']
n_maturities = len(maturities)

# Create factors
level_factor = np.random.randn(n_days) * 0.1
slope_factor = np.random.randn(n_days) * 0.05
curve_factor = np.random.randn(n_days) * 0.02

# Create yields
slope_loadings = np.linspace(-1, 1, n_maturities)
curve_loadings = np.sin(np.linspace(0, np.pi, n_maturities))
yields = base_level[:, None] + level_factor[:, None] * 1.0 + slope_factor[:, None] * slope_loadings + curve_factor[:, None] * curve_loadings
yield_df = pd.DataFrame(yields, index=dates, columns=maturities)

# Calculate daily changes
yield_changes = yield_df.diff().dropna()
print('Simulated Daily Yield Changes (First 5 rows):')
print(yield_changes.head())
Simulated Daily Yield Changes (First 5 rows):
                  1M        3M        6M        1Y        2Y        3Y  \
2000-01-04 -0.119564 -0.115018 -0.110528 -0.106145 -0.101909 -0.097846   
2000-01-05  0.090287  0.098524  0.105740  0.111013  0.113614  0.113072   
2000-01-06 -0.085673 -0.077000 -0.067781 -0.057523 -0.045833 -0.032462   
2000-01-07  0.148844  0.138733  0.127924  0.115786  0.101820  0.085705   
2000-01-10 -0.138239 -0.116690 -0.094897 -0.072639 -0.049741 -0.026092   

                  5Y        7Y       10Y       20Y       30Y  
2000-01-04 -0.093964 -0.090255 -0.086693 -0.083238 -0.079839  
2000-01-05  0.109226  0.102237  0.092576  0.080972  0.068347  
2000-01-06 -0.017323 -0.000502  0.017750  0.037041  0.056879  
2000-01-07  0.067329  0.046804  0.024450  0.000768 -0.023612  
2000-01-10 -0.001653  0.023538  0.049368  0.075662  0.102201  

Running PCA and Explaining the Variance

We run PCA on the daily changes in yields, as finance is often more concerned with changes than levels.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

# Standardize the data
scaler_yield = StandardScaler()
scaled_changes = scaler_yield.fit_transform(yield_changes)

# Run PCA
pca = PCA()
pca.fit(scaled_changes)

# Visualize explained variance
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)


fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio, alpha=0.6, color='skyblue', label='Individual explained variance')
ax.step(range(1, len(cumulative_variance_ratio) + 1), cumulative_variance_ratio, where='mid', color='red', linestyle='--', label='Cumulative explained variance')
ax.set_ylabel('Explained variance ratio')
ax.set_xlabel('Principal component index')
ax.set_title('The First 3 Components Explain Over 95% of Yield Curve Variance', fontsize=16)
ax.axhline(y=0.95, color='gray', linestyle=':', linewidth=2)
ax.text(len(explained_variance_ratio), 0.95, '95% threshold', va='bottom', ha='right')
ax.legend(loc='best')
ax.xaxis.set_major_locator(mticker.MaxNLocator(integer=True))
plt.show()
Figure 1: PCA Explained Variance Ratio

The Economic Meaning of the Principal Components: Level, Slope, and Curvature

By examining the loadings (i.e., the eigenvectors) of the principal components, we can assign them economic meaning.

#| echo: true
#| warning: false
#| label: fig-pca-components
#| fig-cap: 'Economic Interpretation of Yield Curve Principal Components'
#| cont: true

components = pd.DataFrame(pca.components_[:3, :].T, 
                          columns=['PC1 (Level)', 'PC2 (Slope)', 'PC3 (Curvature)'], 
                          index=yield_changes.columns)
# To align with theory, we may need to flip the sign of some vectors (this doesn't change the interpretation)
if components['PC1 (Level)'].mean() < 0: components['PC1 (Level)'] *= -1
if components['PC2 (Slope)'].iloc[0] > 0: components['PC2 (Slope)'] *= -1  # flip so short-maturity loadings are negative
if components['PC3 (Curvature)'].mean() > 0: components['PC3 (Curvature)'] *= -1


fig, ax = plt.subplots(figsize=(12, 7))
components.plot(ax=ax, marker='o')
ax.set_title('Economic Interpretation of Yield Curve Principal Components', fontsize=16)
ax.set_ylabel('Component Loading')
ax.set_xlabel('Maturity')
ax.axhline(0, color='black', linewidth=0.5, linestyle='--')
ax.legend(title='Principal Components')
plt.show()

Economic Interpretation of the Principal Components

  • PC1 (Level): All maturities have loadings of the same sign. Represents a parallel shift of the entire yield curve. This is the most significant movement, often related to the overall stance of monetary policy.

  • PC2 (Slope): Short-term loadings are negative, long-term are positive. Represents a change in the slope of the yield curve (steepening or flattening), reflecting market expectations of future short-term rates and economic growth.

  • PC3 (Curvature): Short and long-term loadings are positive, mid-term are negative. Represents a change in the curvature (the ‘bow’ shape), related to expectations of interest rate volatility.

Linear Discriminant Analysis (LDA): Reduction for Classification

LDA is a supervised learning algorithm for dimensionality reduction. Unlike PCA, which seeks maximum variance, LDA’s goal is to find a projection direction that maximizes the separation between different classes while minimizing the variance within each class.

[Figure: LDA objective — maximize the between-class distance while minimizing the within-class distance.]

LDA’s Objective: The Within-Class Scatter Matrix

  • Within-class Scatter Matrix (\(S_w\)): Measures the scatter of data points within each class.
    • \(S_w = \sum_{c=1}^{C} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^T\)
    • We want to minimize this. It represents how compact each class is.

LDA’s Objective: The Between-Class Scatter Matrix

  • Between-class Scatter Matrix (\(S_b\)): Measures the scatter of the class means around the overall mean.
    • \(S_b = \sum_{c=1}^{C} N_c (\mu_c - \mu)(\mu_c - \mu)^T\)
    • We want to maximize this. It represents how far apart the classes are from each other.

LDA’s Objective Function: Maximizing the Ratio

LDA aims to find the projection matrix \(W\) that maximizes the ratio of the between-class scatter to the within-class scatter.

\[ \large{J(W) = \frac{\text{tr}(W^T S_b W)}{\text{tr}(W^T S_w W)}} \]

This is known as Fisher’s criterion.

PCA vs. LDA: An Intuitive Comparison

PCA

PCA only cares about overall variance. It would choose the horizontal direction as PC1, which does a poor job of separating the two classes.

[Figure: PCA projection — data projected along the axis of maximum variance (PC1), resulting in poor separation of the two classes.]

LDA

LDA considers the class labels and chooses a projection that maximizes the separation between the classes, perfectly distinguishing them.

[Figure: LDA projection — data projected along the axis (LD1) that best separates the classes, yielding a clear distinction.]

Solving LDA: A Generalized Eigenvalue Problem

Maximizing the ratio \(J(W)\) can be transformed into solving a generalized eigenvalue problem:

\[ \large{S_b w = \lambda S_w w} \]

Multiplying both sides by \(S_w^{-1}\), we get a more familiar form:

\[ \large{S_w^{-1} S_b w = \lambda w} \]

Conclusion: The optimal projection directions \(w\) for LDA are the eigenvectors of the matrix \(S_w^{-1} S_b\).
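The following from-scratch sketch (two simulated Gaussian classes; all parameters are illustrative) solves this eigenvalue problem directly with NumPy:

import numpy as np

# Two simulated classes in 2D
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[4.0, 2.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
mu, mu0, mu1 = X.mean(axis=0), X0.mean(axis=0), X1.mean(axis=0)

# Within-class and between-class scatter matrices
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
S_b = (len(X0) * np.outer(mu0 - mu, mu0 - mu)
       + len(X1) * np.outer(mu1 - mu, mu1 - mu))

# Solve S_w^{-1} S_b w = lambda w and keep the leading eigenvector
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print('LDA projection direction:', w / np.linalg.norm(w))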

Python Hands-On: Classifying the Iris Dataset with LDA

The Iris dataset is the ‘Hello World’ of classification algorithms. It contains 3 classes, each with 4 features. Our goal is to reduce it to 2 dimensions and visualize the result.

#| echo: true
#| warning: false
#| label: fig-lda-iris
#| fig-cap: 'LDA projects the 4D Iris dataset onto 2D'

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load the data
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# Run LDA, reducing to 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_r = lda.fit(X, y).transform(X)

# Visualization
plt.figure(figsize=(8, 6))
colors = ['navy', 'turquoise', 'darkorange']
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('LDA of IRIS dataset')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.grid(True)
plt.show()

Analysis: Even after compressing from 4D to 2D, LDA excellently separates the three classes, demonstrating its powerful classification capabilities.

Multidimensional Scaling (MDS): Reconstructing a ‘Map’ from Distances

MDS has a completely different starting point from PCA and LDA. It doesn’t work with the feature matrix \(X\) directly but starts from a known distance (or dissimilarity) matrix \(D\).

  • Core Idea: Find a set of points \(Z\) in a low-dimensional space such that the Euclidean distances between these points are as close as possible to the original distance matrix \(D\).
  • Use Case: When we don’t have the original features but can measure the dissimilarity between objects. For example, survey data on ‘brand similarity’ or the ‘edit distance’ between genetic sequences.

MDS Analogy: Reconstructing a City Map

Imagine you only know the straight-line flight distances between major cities, but you have no latitude or longitude information.

[Figure: MDS analogy with cities — input: a matrix of pairwise distances between cities (km); output: reconstructed 2D coordinates, i.e. a map.]

The goal of MDS is to find the optimal 2D coordinates for each city based on this distance matrix.
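A minimal sketch of this idea with scikit-learn’s MDS (the city names and the distance matrix below are invented for illustration):

import numpy as np
from sklearn.manifold import MDS

cities = ['City A', 'City B', 'City C', 'City D']
# Symmetric matrix of pairwise 'flight distances' (km), invented for illustration
D = np.array([[   0, 1080,  600,  900],
              [1080,    0,  750,  400],
              [ 600,  750,    0,  500],
              [ 900,  400,  500,    0]])

# Metric MDS on a precomputed dissimilarity matrix
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)              # recovered 'map' (unique only up to rotation/reflection)

for city, (x, y) in zip(cities, coords):
    print(f'{city}: ({x:7.1f}, {y:7.1f})')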

Part 2: Nonlinear Dimensionality Reduction (Manifold Learning)

The Limitation of Linear Methods: When Data Structure is Curved

Linear methods like PCA and LDA assume that the data lies on a flat hyperplane. But what if the data’s intrinsic structure is curved?

[Figure: PCA failure on the Swiss roll manifold — points A and B are far apart along the curved 3D manifold (large geodesic distance), but a linear PCA projection maps them close together in 2D, losing the nonlinear structure.]

A linear method like PCA would incorrectly project distant points (like A and B) close together, failing to ‘unroll’ the data.

The Core Idea of Manifold Learning: Data Lives on a Low-Dimensional Manifold

  • Manifold Hypothesis: The high-dimensional data we observe is actually generated by a few latent variables (the intrinsic dimension), and these data points lie on a low-dimensional manifold embedded in the high-dimensional space.
  • Goal: To ‘unroll’ this manifold and find low-dimensional coordinates that reflect the true neighborhood relationships of the data.
  • Difference from Linear Methods: Manifold learning focuses on local structure, assuming that Euclidean distances are only reliable between nearby points.

Isomap: Measuring Distance Along the ‘Surface’

Isomap is a clever extension of MDS that replaces Euclidean distance with Geodesic Distance.

[Figure: Isomap intuition, geodesic vs. Euclidean distance — like an ant constrained to walk on a bent surface, Isomap measures distance along the data manifold (geodesic distance) rather than ‘through’ space (Euclidean distance).]

Steps:

  1. Construct Neighborhood Graph: For each point, connect it only to its K-nearest neighbors.
  2. Compute Shortest Paths: Use a graph algorithm (like Dijkstra’s) to compute the shortest path between all pairs of points, approximating the geodesic distance.
  3. Apply MDS: Use the resulting shortest-path distance matrix as input to the classical MDS algorithm.
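A short sketch (the neighborhood size K=10 is an illustrative choice) applying scikit-learn’s Isomap to the classic Swiss roll:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap
import matplotlib.pyplot as plt

# Generate the 3D Swiss roll; 'color' encodes position along the manifold
X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# Isomap: K-nearest-neighbor graph + shortest paths + classical MDS
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)

plt.figure(figsize=(8, 5))
plt.scatter(X_iso[:, 0], X_iso[:, 1], c=color, cmap='viridis', s=10)
plt.title('Isomap unrolls the Swiss roll into 2D')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()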

Locally Linear Embedding (LLE): Preserving Local Linear Relationships

LLE assumes that each data point can be linearly reconstructed by its neighbors, and this local geometric relationship should be preserved in the low-dimensional space.

[Figure: LLE ‘unrolls’ the manifold — local reconstruction weights \(W_{ij}\) computed in the high-dimensional space are preserved in the low-dimensional embedding; distances and angles are not, so local geometry may change.]

The Heart of LLE: Preserve Reconstruction Weights, Not Distances

[Figure: The core of LLE — it preserves the reconstruction weights \(W_{ij}\), i.e. each point’s proportional position within its neighborhood (\(Z_i = \sum_j W_{ij} Z_j\)), not pairwise distances; a rigid, distance-preserving embedding is explicitly not the goal.]
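For comparison, a similar sketch (parameters again illustrative) using scikit-learn’s LocallyLinearEmbedding on the same Swiss roll:

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding
import matplotlib.pyplot as plt

X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# LLE: reconstruct each point from its K neighbors, then preserve those weights in 2D
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)

plt.figure(figsize=(8, 5))
plt.scatter(X_lle[:, 0], X_lle[:, 1], c=color, cmap='viridis', s=10)
plt.title('LLE embedding of the Swiss roll')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()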

t-SNE: The Swiss Army Knife of Data Visualization

t-SNE (t-distributed Stochastic Neighbor Embedding) is currently the most powerful and popular tool for reducing high-dimensional data to 2D or 3D for visualization.

  • Core Idea (Probabilistic Matching):
    1. In high-D space, convert Euclidean distances between points into conditional probabilities that represent the likelihood that point \(i\) would pick point \(j\) as its neighbor (using a Gaussian distribution).
    2. In low-D space, define a similar conditional probability (using a heavier-tailed t-distribution).
    3. Adjust the positions of points in the low-D space to make the two probability distributions as similar as possible (by minimizing the KL divergence).
  • Advantage: It is exceptionally good at revealing the clustering structure of data.

t-SNE Key Parameter: Perplexity

  • Perplexity:
    • This is the most important parameter. It can be loosely interpreted as the ‘effective number of neighbors’ each point considers.
    • Typical values are between 5 and 50.
    • A lower value focuses on local structure, while a higher value considers more of the global structure.
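To build intuition for this parameter, here is a small sketch (synthetic blobs; the perplexity values are illustrative) that embeds the same data under three perplexity settings:

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# 300 points in 20 dimensions, grouped into 4 clusters
X, y = make_blobs(n_samples=300, centers=4, n_features=20, random_state=0)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, perp in zip(axes, [5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perp, init='pca',
               learning_rate='auto', random_state=0).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap='tab10', s=15)
    ax.set_title(f'perplexity = {perp}')
plt.show()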

Important Cautions for Interpreting t-SNE plots

  1. Do not over-interpret distances between clusters: The distance between two clusters on a t-SNE plot does not meaningfully represent how ‘far apart’ they are in the original space.
  2. Do not over-interpret the size of clusters: The area of a cluster on the plot does not mean it contains more data points or has a larger variance.
  3. t-SNE is an exploratory visualization tool, not a rigorous clustering analysis method.

Python Hands-On: Visualizing Handwritten Digits with t-SNE

We’ll use the handwritten digits dataset from scikit-learn. Each digit is an 8x8 = 64-dimensional vector. We will use t-SNE to reduce it to 2 dimensions.

#| echo: true
#| warning: false
#| label: fig-tsne-digits
#| fig-cap: 't-SNE reduces 64D handwritten digit data to 2D'
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import numpy as np

# Load the data
digits = load_digits()
X = digits.data
y = digits.target

# Run t-SNE
# init='pca' speeds up convergence, learning_rate='auto' is recommended in new scikit-learn versions
tsne = TSNE(n_components=2, init='pca', random_state=0, perplexity=30, learning_rate='auto')
X_tsne = tsne.fit_transform(X)

# Visualization
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap=plt.cm.get_cmap('jet', 10))
plt.title('t-SNE visualization of handwritten digits')
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
cbar = plt.colorbar(scatter)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(np.arange(10))
plt.show()

Analysis: t-SNE separates the samples of different digits into largely distinct clusters in 2D space, which is incredibly helpful for understanding the structure of high-dimensional data.

Part 3: Advanced Topic: Sparse Representation

Sparse Representation: Building Signals with the Fewest ‘Blocks’

Previous methods aimed to ‘compress’ data, but sparse representation has a different starting point.

  • Core Idea: Any signal (e.g., an economic time series) can be represented as a linear combination of a few ‘atoms’ (basis vectors) from a ‘dictionary’ \(\Psi\). \[ \large{x = \Psi s} \] Here, \(s\) is a sparse vector, meaning most of its elements are zero.
[Figure: Sparse representation analogy — the signal \(x\) equals the dictionary \(\Psi\) multiplied by a sparse coefficient vector \(s\) in which only a few entries are nonzero.]

Economic Intuition: The complex dynamics of the market might be driven by a combination of only a few ‘latent economic states’ or ‘shocks’. Sparse representation aims to find these core drivers.

Compressed Sensing: Recovering the Full Signal from Fewer Samples

The sparsity assumption leads to a surprising conclusion: compressed sensing.

If we know that a signal \(x\) is sparse under a certain dictionary \(\Psi\), we don’t need to observe the entire signal \(x\). We can take a small number of random measurements \(z\) and still perfectly reconstruct the original signal \(x\).

\[ \large{z = \Phi x = \Phi \Psi s = \Theta s} \]

  • \(z\): a small number of observations (\(l \times 1\))
  • \(\Phi\): the measurement matrix (\(l \times d\), \(l \ll d\))
  • \(\Theta\): the sensing matrix
  • Goal: Given \(z\) and \(\Theta\), solve for the sparse vector \(s\). This is the foundation of modern signal processing, MRI, and more.

Reconstruction Algorithm: Matching Pursuit

Finding the sparse vector \(s\) from \(z\) and \(\Theta\) is an NP-hard problem. Matching Pursuit is a greedy algorithm that approximates the solution iteratively.

  1. Initialize: Residual \(r_0 = z\), sparse solution \(s=0\).
  2. Find Most Correlated Atom: Find the atom in the dictionary \(\Theta\) that is most correlated with the current residual \(r\) (has the largest inner product).
  3. Update Solution: Add the contribution of this atom to the solution \(s\).
  4. Update Residual: Subtract the contribution of this atom from the current residual \(r\).
  5. Iterate: Repeat steps 2-4 until the residual is small enough or a maximum number of iterations is reached.
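A compact NumPy sketch of these steps (not the course’s code; the dimensions and sparsity level are illustrative) that recovers a sparse vector from compressed measurements:

import numpy as np

rng = np.random.default_rng(0)
d, l, k = 100, 30, 3                               # signal length, measurements, sparsity

# Ground truth: a k-sparse vector and random measurements z = Theta @ s
s_true = np.zeros(d)
s_true[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)
Theta = rng.normal(size=(l, d)) / np.sqrt(l)
z = Theta @ s_true

def matching_pursuit(z, Theta, n_iter=50, tol=1e-8):
    s = np.zeros(Theta.shape[1])
    r = z.copy()                                   # 1. initialize residual r = z
    for _ in range(n_iter):
        corr = Theta.T @ r                         # 2. correlation with every atom
        j = np.argmax(np.abs(corr))                #    pick the most correlated atom
        coef = corr[j] / (Theta[:, j] @ Theta[:, j])
        s[j] += coef                               # 3. update the sparse solution
        r = r - coef * Theta[:, j]                 # 4. update the residual
        if np.linalg.norm(r) < tol:                # 5. stop when the residual is tiny
            break
    return s

s_hat = matching_pursuit(z, Theta)
print('True support:        ', np.sort(np.nonzero(s_true)[0]))
print('Recovered (largest k):', np.sort(np.argsort(np.abs(s_hat))[-k:]))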

Economic Application: Identifying Structural Breaks

Sparse representation can be used to identify structural breaks in time series.

  • Signal (x): The first difference of an economic time series.
  • Dictionary (Ψ): An identity matrix.
  • Sparse Coefficients (s): During a stable period, the differenced series is close to zero, so s is also sparse (close to zero). When a structural break (a sudden change) occurs, the difference will have a large spike, corresponding to a large non-zero element in s.

By finding the non-zero entries in s, we can automatically detect moments when economic regimes or market behaviors have changed.
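A toy sketch of this idea (the simulated series and the threshold rule are invented for illustration): with an identity dictionary, the sparse coefficients are simply the first differences, and a break appears as an unusually large entry.

import numpy as np
import pandas as pd

# Simulate a monthly series that jumps to a new regime two-thirds of the way through
rng = np.random.default_rng(0)
dates = pd.date_range('2015-01-01', periods=200, freq='MS')
series = pd.Series(np.r_[rng.normal(2.0, 0.1, 130),
                         rng.normal(4.0, 0.1, 70)], index=dates)

# With Psi = I, the sparse coefficients s are simply the first differences
s = series.diff().dropna()

# Flag entries that are far larger than typical fluctuations
threshold = 5 * s.abs().median()
breaks = s[s.abs() > threshold]
print('Detected break dates:')
print(breaks)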

Summary: How to Choose the Right Method?

| Method | Type | Core Idea | Pros | Cons |
|---|---|---|---|---|
| PCA | Linear, Unsupervised | Maximize variance | Simple, fast, highly interpretable | Cannot handle nonlinear structures |
| LDA | Linear, Supervised | Maximize class separability | Excellent for classification | Requires class labels; has assumptions about class distributions |
| MDS | Distance-driven | Preserve pairwise distances | Flexible, only needs a distance matrix | Computationally expensive; results depend on the distance metric |
| Isomap | Nonlinear, Unsupervised | Preserve geodesic distance | Can ‘unroll’ certain manifolds | Sensitive to ‘shortcut’ noise; computationally intensive |
| LLE | Nonlinear, Unsupervised | Preserve local linear reconstruction | Computationally efficient; handles various manifolds | Sensitive to the choice of K; can produce distortions |
| t-SNE | Nonlinear, Unsupervised | Preserve neighborhood probabilities | Excellent for visualization; reveals clusters | Computationally expensive; suited to visualization, not general-purpose feature extraction |

Final Thought: Reduction is a Means to an End, Not the Goal Itself

In this chapter, we have explored a range of representation learning (dimensionality reduction) methods, from linear to nonlinear.

  • They are powerful tools for exploratory data analysis, helping us discover hidden structures like factors, clusters, and low-dimensional manifolds in seemingly chaotic high-dimensional data.
  • They are also a crucial preprocessing step for building predictive models, effectively improving model stability and generalization.

The key is to always combine your domain knowledge (economics, finance) to interpret the results of dimensionality reduction, giving real-world meaning to these abstract dimensions and structures.

Thank You!

Questions & Discussion