13: Unsupervised Learning Methods

Supervised vs. Unsupervised Learning

| Dimension | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Target variable | Yes — \(Y\) is known | No — no labels |
| Goal | Predict \(Y\) from \(X\) | Discover structure in \(X\) |
| Evaluation | Clear metrics (AUC, MSE) | Ambiguous — no ground truth |
| Examples | Regression, Classification | Clustering, Dimensionality Reduction |
| Financial use | Credit scoring, price prediction | Client segmentation, factor discovery |

The fundamental challenge: Without labels, how do we know if our answer is right?

This chapter covers two families:

  1. Clustering — group similar observations (K-means, Hierarchical)
  2. Dimensionality Reduction — compress features while preserving structure (PCA, t-SNE, UMAP)

Dirty Work: The Rorschach Test — Clusters in Pure Noise

The trap: K-means will always return \(K\) clusters, even when the data has no structure.

Experiment: Generate 1,000 points from a uniform distribution — pure noise, zero structure.

Apply K-means with \(K = 3\):

  • Result: 3 “clean” clusters with clear boundaries
  • Silhouette score: 0.3916 — appears reasonable!

Compare: Our real YRD company clustering produces silhouette = 0.3012, which is lower than the noise!

The lesson: Every clustering algorithm will produce an answer. The algorithm has no concept of “there are no clusters.” It’s your job to validate that the clusters are:

  1. Stable — reproduced across subsamples
  2. Meaningful — aligned with business domain knowledge
  3. Non-trivial — better than what random data would produce
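The experiment above can be reproduced in a few lines. A minimal sketch assuming scikit-learn; the seed, sample size, and exact score are illustrative:

```python
# Cluster pure uniform noise and score it: K-means still returns K "clusters".
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X_noise = rng.uniform(size=(1000, 2))   # pure noise, zero structure

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_noise)
sil = silhouette_score(X_noise, labels)
print(f"Silhouette on pure noise: {sil:.3f}")   # a "reasonable-looking" score
```

The score is well above zero even though there is nothing to find, which is exactly the trap.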

Dirty Work: Scaling Effects — The Feature That Ate the Distance

Euclidean distance treats all features equally:

\[\large{ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{j=1}^p (x_j - y_j)^2} }\]

The problem: If Total Assets is in billions (range: 0.1–500) and ROA is in percentages (range: −10 to 20), then:

Distance is 99.9% determined by Total Assets — ROA is effectively invisible.

The fix: Always standardize before clustering — subtract mean, divide by standard deviation:

\[\large{ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} }\]

This is not optional. Forgetting to standardize is one of the most common mistakes in applied clustering. After standardization, all features contribute equally to the distance metric.
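A minimal sketch of the fix, assuming scikit-learn; the two features (assets in billions vs. ROA in percent) are illustrative:

```python
# Standardize each column so every feature contributes equally to distance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.5, 12.0],
              [450.0, -3.0],
              [30.0, 8.0]])            # columns: [total_assets, roa]

Z = StandardScaler().fit_transform(X)  # z = (x - mean) / std, per column
print(Z.mean(axis=0))                  # ~0 for each feature
print(Z.std(axis=0))                   # ~1 for each feature
```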

K-Means: The Workhorse Algorithm

Objective: Minimize total within-cluster sum of squares:

\[\large{ \min_{C_1,...,C_K} \sum_{k=1}^K \sum_{x_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 }\]

Lloyd’s Algorithm:

  1. Initialize: Randomly choose \(K\) centroids
  2. Assign: Each point → nearest centroid
  3. Update: Each centroid → mean of its assigned points
  4. Repeat steps 2–3 until no assignments change

Convergence guarantee: The objective \(J\) is bounded below (≥ 0) and non-increasing at each step → must converge. But: it converges only to a local minimum — different initializations may give different results.

Practical fix: Run K-means 10+ times with different random seeds and keep the best solution (lowest \(J\)).
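In scikit-learn this restart strategy is built in: `n_init` runs Lloyd's algorithm from that many random seeds and keeps the solution with the lowest inertia (the objective \(J\)). A sketch on synthetic blobs (locations and scales are illustrative):

```python
# K-means with 10 random restarts; the best solution (lowest J) is kept.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [3, 3], [0, 3])])    # 3 synthetic blobs

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Best within-cluster SS (J):", round(km.inertia_, 2))
```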

Choosing K: The Elbow, Silhouette, and Gap Methods

Method 1: Elbow Method — Plot WCSS vs. \(K\), look for the “bend”

  • Con: Often subjective — where exactly is the elbow?

Method 2: Silhouette Score — For each point \(i\):

\[\large{ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \in [-1, 1] }\]

  • \(a(i)\) = average distance to points in same cluster
  • \(b(i)\) = average distance to points in nearest other cluster
  • \(s(i) \approx 1\): well clustered | \(s(i) \approx 0\): on boundary | \(s(i) < 0\): misclassified

Method 3: Gap Statistic — Compares WCSS to expected WCSS under a null (uniform) distribution

\[\large{ \text{Gap}(K) = E[\log(W_K^*)] - \log(W_K) }\]

Choose the smallest \(K\) such that \(\text{Gap}(K) \ge \text{Gap}(K+1) - s_{K+1}\), where \(s_{K+1}\) is the simulation standard error.
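The elbow and silhouette diagnostics can be swept in one loop. A sketch on synthetic data assuming scikit-learn (blob positions and the candidate range of \(K\) are illustrative; the gap statistic is omitted since it needs its own null simulation):

```python
# Sweep K: record WCSS (for the elbow plot) and silhouette (for the peak).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(150, 2))
               for c in ([0, 0], [6, 0], [3, 6])])    # 3 well-separated blobs

wcss, sil = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                       # always decreases with K
    sil[k] = silhouette_score(X, km.labels_)    # peaks at the "right" K

best_k = max(sil, key=sil.get)
print("WCSS by K:", {k: round(v, 1) for k, v in wcss.items()})
print("Best K by silhouette:", best_k)
```

Note that WCSS alone cannot pick \(K\) (it falls monotonically), which is why the silhouette peak is read alongside the elbow.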

Case Study: Financial Profiling of YRD Companies

Data: 1,856 YRD listed companies (2023) | 4 features (ROA, Debt Ratio, Current Ratio, ln Assets) | K=4 selected via Elbow + Silhouette

| Cluster | n | Avg ROA | Avg Debt (%) | Avg CR | Avg ln(Assets) | Profile |
|---|---|---|---|---|---|---|
| 1 | 397 | 2.96% | 60.4 | 1.50 | 5.49 | Large, leveraged |
| 2 | 409 | 5.49% | 17.0 | 5.23 | 3.29 | Asset-light, efficient |
| 3 | 161 | −10.04% | 56.1 | 1.27 | 3.62 | Financially distressed |
| 4 | 889 | 4.36% | 39.4 | 2.35 | 3.69 | Mid-size, balanced |

Cluster 3 (8.7% of firms): ROA deeply negative, high leverage — these are potential ST candidates. Cross-referencing: many overlap with actual ST-flagged companies from Chapter 11.

Hierarchical Clustering: Bottom-Up Assembly

Agglomerative (bottom-up) approach:

  1. Start: Each point is its own cluster (\(n\) clusters)
  2. Merge the two closest clusters
  3. Repeat until only 1 cluster remains
  4. Cut the dendrogram at desired height to get \(K\) clusters

Which pair is “closest”? — Linkage methods:

| Linkage | Distance Between Clusters | Behavior |
|---|---|---|
| Single | \(\min\) distance between any pair | Chaining effect → elongated clusters |
| Complete | \(\max\) distance between any pair | Compact, spherical clusters |
| Average | Mean pairwise distance | Compromise between single and complete |
| Ward | Minimize increase in total within-cluster variance | Most similar to K-means; generally preferred |

Advantage over K-means: No need to pre-specify \(K\) — the dendrogram reveals the natural hierarchy.
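The merge-and-cut procedure maps directly onto SciPy's `linkage` and `fcluster`. A sketch with Ward linkage on two synthetic groups (group positions and sizes are illustrative):

```python
# Agglomerative clustering: build the merge tree, then cut it into K clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5])])            # 2 synthetic groups

Z = linkage(X, method="ward")    # bottom-up merge tree (dendrogram data)
labels = fcluster(Z, t=2, criterion="maxclust")       # cut to get 2 clusters
print("Cluster sizes:", np.bincount(labels)[1:])
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree, so the cut height can be chosen visually instead of fixing \(K\) in advance.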

PCA: The Math — Maximizing Variance

Goal: Find the direction \(\mathbf{w}\) that captures maximum variance in the data.

Optimization problem:

\[\large{ \max_{\mathbf{w}} \mathbf{w}' \mathbf{S} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}'\mathbf{w} = 1 }\]

Solution via Lagrangian:

\[\large{ \mathcal{L} = \mathbf{w}'\mathbf{S}\mathbf{w} - \lambda(\mathbf{w}'\mathbf{w} - 1) }\]

Taking the derivative and setting to zero:

\[\large{ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = 0 \implies \mathbf{S}\mathbf{w} = \lambda\mathbf{w} }\]

This is the eigenvalue equation! The first PC = eigenvector of the largest eigenvalue of \(\mathbf{S}\).

Geometric interpretation: PCA finds the principal axes of the data’s ellipsoidal cloud. The eigenvalues are the variances along those axes.
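The eigenvalue route to PC1 can be verified numerically. A sketch on a synthetic cloud stretched along the first axis (the stretch factors are illustrative):

```python
# PC1 = eigenvector of the sample covariance S for its largest eigenvalue.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.5])   # stretched cloud
Xc = X - X.mean(axis=0)                 # center the data
S = np.cov(Xc, rowvar=False)            # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)    # eigh: symmetric S, ascending order
w1 = eigvecs[:, -1]                     # PC1 = top eigenvector, unit norm
print("Largest eigenvalue:", round(eigvals[-1], 3))
print("PC1 direction:", np.round(w1, 3))   # aligns with the stretched axis
```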

PCA Results: Compressing 4 Features into 2

Applied to YRD company financial profiles (standardized data):

| Component | Eigenvalue | Variance Explained | Cumulative |
|---|---|---|---|
| PC1 | 2.122 | 53.06% | 53.06% |
| PC2 | 1.043 | 26.07% | 79.14% |
| PC3 | 0.604 | 15.09% | 94.18% |
| PC4 | 0.231 | 5.82% | 100.00% |

Interpretation of loadings:

  • PC1 ≈ “Leverage vs. Liquidity axis” (Debt Ratio positive, Current Ratio negative)
  • PC2 ≈ “Profitability axis” (ROA dominates)

Two components capture 79% of total variance — we reduced 4 dimensions to 2 with minimal information loss.

Kaiser rule: Keep components with eigenvalue > 1 → keeps PC1 and PC2 (both > 1).
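The eigenvalue table and the Kaiser rule can be reproduced with scikit-learn. A sketch on synthetic data, not the YRD sample: four standardized features built as two correlated pairs, so roughly two eigenvalues exceed 1:

```python
# PCA on standardized data; Kaiser rule keeps components with eigenvalue > 1.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
base = rng.normal(size=(300, 2))
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * rng.normal(size=300),
                     base[:, 1], base[:, 1] + 0.1 * rng.normal(size=300)])

Z = StandardScaler().fit_transform(X)   # standardizing = correlation-matrix PCA
pca = PCA().fit(Z)
eigvals = pca.explained_variance_       # sorted in descending order
keep = int((eigvals > 1).sum())         # Kaiser rule
print("Eigenvalues:", np.round(eigvals, 3))
print("Components kept by Kaiser rule:", keep)
```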

t-SNE and UMAP: Nonlinear Dimensionality Reduction

PCA limitation: It finds only linear combinations. Complex nonlinear structures are invisible.

t-SNE (t-distributed Stochastic Neighbor Embedding):

  • Minimizes KL divergence between high-dimensional and low-dimensional neighborhoods:

\[\large{ D_{KL}(P||Q) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}} }\]

  • Excels at: Revealing local cluster structure and manifold topology
  • Limitation: Does not preserve global distances; different runs give different layouts

UMAP (Uniform Manifold Approximation and Projection):

  • Based on Riemannian geometry and topological data analysis
  • Faster than t-SNE and better preserves global structure
  • Increasingly the default choice for exploratory visualization

| Feature | PCA | t-SNE | UMAP |
|---|---|---|---|
| Structure | Linear | Nonlinear (local) | Nonlinear (local + global) |
| Speed | Fast | Slow | Medium |
| Reproducible | Yes | No (stochastic) | No (stochastic) |
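A t-SNE sketch using scikit-learn (UMAP lives in the separate `umap-learn` package and is not shown). The blob geometry and perplexity are illustrative; by design, different seeds give different layouts:

```python
# Embed two 5-D blobs into 2-D with t-SNE; local structure survives the map.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 5))
               for c in (np.zeros(5), np.full(5, 4.0))])   # 2 blobs in 5-D

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print("Embedding shape:", emb.shape)
```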

Clustering Evaluation: Internal and External Metrics

Internal metrics (no ground truth needed):

| Metric | Formula | Interpretation |
|---|---|---|
| Silhouette | \(\frac{b(i)-a(i)}{\max\{a(i), b(i)\}}\) | [−1, 1]; higher = better separation |
| Calinski-Harabasz | \(\frac{SS_B / (K-1)}{SS_W / (n-K)}\) | Higher = better; ratio of between/within variance |
| Davies-Bouldin | \(\frac{1}{K}\sum_i \max_{j \neq i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}\) | Lower = better; ratio of cluster spread to separation |

Our YRD clustering quality:

  • Silhouette = 0.3012 — moderate (recall: noise data gave 0.3916!)
  • Calinski-Harabasz = 840.68 — relatively high (good between-cluster separation)
  • Davies-Bouldin = 1.0640 — moderate (some cluster overlap)

External metrics (when true labels exist): ARI, NMI, V-measure — compare discovered clusters to known groups.
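All three internal metrics are one call each in scikit-learn. A sketch on two well-separated synthetic blobs (positions and scales are illustrative):

```python
# Compute the three internal metrics for one clustering solution.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [4, 4])])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sil = silhouette_score(X, labels)
ch = calinski_harabasz_score(X, labels)
db = davies_bouldin_score(X, labels)
print(f"Silhouette:        {sil:.3f}  (higher = better)")
print(f"Calinski-Harabasz: {ch:.1f}  (higher = better)")
print(f"Davies-Bouldin:    {db:.3f}  (lower = better)")
```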

Heuristic 1: Spurious Clusters — The Noise Benchmark

The most dangerous mistake in clustering: Interpreting structure where none exists.

Comparative experiment:

| Dataset | K | Silhouette | Interpretation |
|---|---|---|---|
| Real data (1,856 companies) | 4 | 0.3012 | Actual financial profiles |
| Random noise (1,000 uniform points) | 3 | 0.3916 | NO structure at all |

Random noise scores HIGHER than real data!

Why: K-means can always partition space into regions. Uniform data creates equally-spaced centroids with clean-looking Voronoi regions — high silhouette by geometry, not by structure.

The antidote:

  1. Compare your clustering metrics to a null distribution (permuted or random data)
  2. Test stability: run on subsamples — do the same clusters re-appear?
  3. Demand business interpretability: can you explain why these groups exist?
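The first antidote, a null benchmark, can be sketched directly: score the real clustering, then score the same pipeline on random data of the same shape many times, and demand that the real score clearly beat the null distribution. Everything here (blob geometry, 20 null draws) is illustrative:

```python
# Null benchmark: real silhouette vs. silhouettes from random uniform data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_silhouette(X, k, seed=0):
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)

rng = np.random.default_rng(8)
X_real = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                    for c in ([0, 0], [5, 0], [0, 5])])   # structured data

real_score = kmeans_silhouette(X_real, k=3)
null_scores = [kmeans_silhouette(rng.uniform(size=X_real.shape), k=3, seed=s)
               for s in range(20)]

print("Real:", round(real_score, 3), "| Null max:", round(max(null_scores), 3))
```

If the real score sits inside the null distribution, as the YRD score of 0.3012 does against the noise score of 0.3916, the clusters have not earned an interpretation.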

Heuristic 2: The Curse of Dimensionality

In high dimensions, all points become nearly equidistant.

The relative contrast of distances:

\[\large{ R = \frac{d_{\max} - d_{\min}}{d_{\min}} \to 0 \quad \text{as dimension} \to \infty }\]

| Dimension | Max Distance | Min Distance | Contrast \(R\) |
|---|---|---|---|
| 2 | 1.32 | 0.08 | 15.5 |
| 10 | 3.02 | 1.89 | 0.60 |
| 100 | 5.85 | 4.51 | 0.30 |
| 1000 | 13.21 | 12.09 | 0.09 |

At 1000 dimensions, the farthest point is only 9% farther than the nearest!

Practical implication: Clustering and distance-based methods break down in high dimensions. You must reduce dimensionality first (PCA, feature selection) before clustering.
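The vanishing contrast can be simulated in a few lines. A sketch with uniform data; the dimensions and sample size are illustrative, so the exact numbers differ from the table above:

```python
# Distance contrast R = (d_max - d_min) / d_min shrinks as dimension grows.
import numpy as np

rng = np.random.default_rng(9)
contrast = {}
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))
    q = X[0]                                   # a query point
    dist = np.linalg.norm(X[1:] - q, axis=1)   # distances to all other points
    contrast[d] = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  contrast R = {contrast[d]:.3f}")
```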

Summary: The Unsupervised Learning Toolkit

| Topic | Key Takeaway |
|---|---|
| Supervised vs. Unsupervised | No labels → must validate with domain expertise |
| Rorschach Test | Algorithms ALWAYS find clusters, even in pure noise |
| Scaling | ALWAYS standardize before computing distances |
| K-Means | Simple, fast, local optima; run multiple initializations |
| K Selection | Elbow + Silhouette + Gap; no single metric is sufficient |
| YRD Case | 4 profiles: leveraged giants, asset-light stars, distressed, balanced |
| Hierarchical | No pre-specified K; Ward linkage generally preferred |
| PCA | Eigenvalue decomposition; 2 PCs capture 79% of variance |
| t-SNE / UMAP | Nonlinear visualization; UMAP preserves global structure better |
| Curse of Dimensionality | Distance contrast → 0 in high-d; reduce dimensions first |
| Validation | Always: null benchmark + stability + business meaning |

The meta-lesson: Unsupervised learning is powerful for exploration, but demands rigorous skepticism. Clusters must earn your trust.