13: Unsupervised Learning Methods

Supervised vs. Unsupervised Learning

| Dimension | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Target variable | Yes — \(Y\) is known | No — no labels |
| Goal | Predict \(Y\) from \(X\) | Discover structure in \(X\) |
| Evaluation | Clear metrics (AUC, MSE) | Ambiguous — no ground truth |
| Examples | Regression, Classification | Clustering, Dimensionality Reduction |
| Financial use | Credit scoring, price prediction | Client segmentation, factor discovery |

The fundamental challenge: Without labels, how do we know if our answer is right?

This chapter covers two families:

  1. Clustering — group similar observations (K-means, Hierarchical)
  2. Dimensionality Reduction — compress features while preserving structure (PCA, t-SNE, UMAP)

Dirty Work: The Rorschach Test — Clusters in Pure Noise

The trap: K-means will always return \(K\) clusters, even when the data has no structure.

Experiment: Generate 1,000 points from a uniform distribution — pure noise, zero structure.

Apply K-means with \(K = 3\):

  • Result: 3 “clean” clusters with clear boundaries
  • Silhouette score: 0.3916 — appears reasonable!

Compare: Our real YRD company clustering produces silhouette = 0.3012, which is lower than the noise!

The lesson: Every clustering algorithm will produce an answer. The algorithm has no concept of “there are no clusters.” It’s your job to validate that the clusters are:

  1. Stable — reproduced across subsamples
  2. Meaningful — aligned with business domain knowledge
  3. Non-trivial — better than what random data would produce
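The experiment above can be reproduced in a few lines. A minimal sketch assuming scikit-learn; the seed, sample size, and exact score are illustrative:

```python
# Cluster pure uniform noise and score it: K-means still returns K "clusters".
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X_noise = rng.uniform(size=(1000, 2))   # pure noise, zero structure

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_noise)
sil = silhouette_score(X_noise, labels)
print(f"Silhouette on pure noise: {sil:.3f}")   # a "reasonable-looking" score
```

The score is well above zero even though there is nothing to find, which is exactly the trap.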

Dirty Work: Scaling Effects — The Feature That Ate the Distance

Euclidean distance treats all features equally:

\[\large{ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{j=1}^p (x_j - y_j)^2} }\]

The problem: If Total Assets is in billions (range: 0.1–500) and ROA is in percentages (range: −10 to 20), then:

Distance is 99.9% determined by Total Assets — ROA is effectively invisible.

The fix: Always standardize before clustering — subtract mean, divide by standard deviation:

\[\large{ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} }\]

This is not optional. Forgetting to standardize is one of the most common mistakes in applied clustering. After standardization, all features contribute equally to the distance metric.
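A minimal sketch of the fix, assuming scikit-learn; the two features (assets in billions vs. ROA in percent) are illustrative:

```python
# Standardize each column so every feature contributes equally to distance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.5, 12.0],
              [450.0, -3.0],
              [30.0, 8.0]])            # columns: [total_assets, roa]

Z = StandardScaler().fit_transform(X)  # z = (x - mean) / std, per column
print(Z.mean(axis=0))                  # ~0 for each feature
print(Z.std(axis=0))                   # ~1 for each feature
```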

K-Means: The Workhorse Algorithm

Objective: Minimize total within-cluster sum of squares:

\[\large{ \min_{C_1,...,C_K} \sum_{k=1}^K \sum_{x_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 }\]

Lloyd’s Algorithm:

  1. Initialize: Randomly choose \(K\) centroids
  2. Assign: Each point → nearest centroid
  3. Update: Each centroid → mean of its assigned points
  4. Repeat steps 2–3 until no assignments change

Convergence guarantee: The objective \(J\) is bounded below (≥ 0) and non-increasing at each step → must converge. But: it converges only to a local minimum — different initializations may give different results.

Practical fix: Run K-means 10+ times with different random seeds and keep the best solution (lowest \(J\)).
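In scikit-learn this restart strategy is built in: `n_init` runs Lloyd's algorithm from that many random seeds and keeps the solution with the lowest inertia (the objective \(J\)). A sketch on synthetic blobs (locations and scales are illustrative):

```python
# K-means with 10 random restarts; the best solution (lowest J) is kept.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [3, 3], [0, 3])])    # 3 synthetic blobs

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Best within-cluster SS (J):", round(km.inertia_, 2))
```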

Choosing K: The Elbow, Silhouette, and Gap Methods

Method 1: Elbow Method — Plot WCSS vs. \(K\), look for the “bend”

  • Con: Often subjective — where exactly is the elbow?

Method 2: Silhouette Score — For each point \(i\):

\[\large{ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \in [-1, 1] }\]

  • \(a(i)\) = average distance to points in same cluster
  • \(b(i)\) = average distance to points in nearest other cluster
  • \(s(i) \approx 1\): well clustered | \(s(i) \approx 0\): on boundary | \(s(i) < 0\): misclassified

Method 3: Gap Statistic — Compares WCSS to expected WCSS under a null (uniform) distribution

\[\large{ \text{Gap}(K) = E[\log(W_K^*)] - \log(W_K) }\]

Choose the smallest \(K\) such that \(\text{Gap}(K) \ge \text{Gap}(K+1) - s_{K+1}\), where \(s_{K+1}\) is the simulation standard error.
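The elbow and silhouette diagnostics can be swept in one loop. A sketch on synthetic data assuming scikit-learn (blob positions and the candidate range of \(K\) are illustrative; the gap statistic is omitted since it needs its own null simulation):

```python
# Sweep K: record WCSS (for the elbow plot) and silhouette (for the peak).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(150, 2))
               for c in ([0, 0], [6, 0], [3, 6])])    # 3 well-separated blobs

wcss, sil = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                       # always decreases with K
    sil[k] = silhouette_score(X, km.labels_)    # peaks at the "right" K

best_k = max(sil, key=sil.get)
print("WCSS by K:", {k: round(v, 1) for k, v in wcss.items()})
print("Best K by silhouette:", best_k)
```

Note that WCSS alone cannot pick \(K\) (it falls monotonically), which is why the silhouette peak is read alongside the elbow.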

Case Study: Financial Profiling of YRD Companies

Data: 1,856 YRD listed companies (2023) | 4 features (ROA, Debt Ratio, Current Ratio, ln Assets) | K=4 selected via Elbow + Silhouette

| Cluster | n | Avg ROA | Avg Debt (%) | Avg CR | Avg ln(Assets) | Profile |
|---|---|---|---|---|---|---|
| 1 | 397 | 2.96% | 60.4 | 1.50 | 5.49 | Large, leveraged |
| 2 | 409 | 5.49% | 17.0 | 5.23 | 3.29 | Asset-light, efficient |
| 3 | 161 | −10.04% | 56.1 | 1.27 | 3.62 | Financially distressed |
| 4 | 889 | 4.36% | 39.4 | 2.35 | 3.69 | Mid-size, balanced |

Cluster 3 (8.7% of firms): ROA deeply negative, high leverage — these are potential ST candidates. Cross-referencing: many overlap with actual ST-flagged companies from Chapter 11.

Hierarchical Clustering: Bottom-Up Assembly

Agglomerative (bottom-up) approach:

  1. Start: Each point is its own cluster (\(n\) clusters)
  2. Merge the two closest clusters
  3. Repeat until only 1 cluster remains
  4. Cut the dendrogram at desired height to get \(K\) clusters

Which pair is “closest”? — Linkage methods:

| Linkage | Distance Between Clusters | Behavior |
|---|---|---|
| Single | \(\min\) distance between any pair | Chaining effect → elongated clusters |
| Complete | \(\max\) distance between any pair | Compact, spherical clusters |
| Average | Mean pairwise distance | Compromise between single and complete |
| Ward | Minimize increase in total within-cluster variance | Most similar to K-means; generally preferred |

Advantage over K-means: No need to pre-specify \(K\) — the dendrogram reveals the natural hierarchy.
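The merge-and-cut procedure maps directly onto SciPy's `linkage` and `fcluster`. A sketch with Ward linkage on two synthetic groups (group positions and sizes are illustrative):

```python
# Agglomerative clustering: build the merge tree, then cut it into K clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5])])            # 2 synthetic groups

Z = linkage(X, method="ward")    # bottom-up merge tree (dendrogram data)
labels = fcluster(Z, t=2, criterion="maxclust")       # cut to get 2 clusters
print("Cluster sizes:", np.bincount(labels)[1:])
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree, so the cut height can be chosen visually instead of fixing \(K\) in advance.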

PCA: The Math — Maximizing Variance

Goal: Find the direction \(\mathbf{w}\) that captures maximum variance in the data.

Optimization problem:

\[\large{ \max_{\mathbf{w}} \mathbf{w}' \mathbf{S} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}'\mathbf{w} = 1 }\]

Solution via Lagrangian:

\[\large{ \mathcal{L} = \mathbf{w}'\mathbf{S}\mathbf{w} - \lambda(\mathbf{w}'\mathbf{w} - 1) }\]

Taking the derivative and setting to zero:

\[\large{ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = 0 \implies \mathbf{S}\mathbf{w} = \lambda\mathbf{w} }\]

This is the eigenvalue equation! The first PC = eigenvector of the largest eigenvalue of \(\mathbf{S}\).

Geometric interpretation: PCA finds the principal axes of the data’s ellipsoidal cloud. The eigenvalues are the variances along those axes.
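The eigenvalue route to PC1 can be verified numerically. A sketch on a synthetic cloud stretched along the first axis (the stretch factors are illustrative):

```python
# PC1 = eigenvector of the sample covariance S for its largest eigenvalue.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.5])   # stretched cloud
Xc = X - X.mean(axis=0)                 # center the data
S = np.cov(Xc, rowvar=False)            # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)    # eigh: symmetric S, ascending order
w1 = eigvecs[:, -1]                     # PC1 = top eigenvector, unit norm
print("Largest eigenvalue:", round(eigvals[-1], 3))
print("PC1 direction:", np.round(w1, 3))   # aligns with the stretched axis
```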

PCA Results: Compressing 4 Features into 2

Applied to YRD company financial profiles (standardized data):

| Component | Eigenvalue | Variance Explained | Cumulative |
|---|---|---|---|
| PC1 | 2.122 | 53.06% | 53.06% |
| PC2 | 1.043 | 26.07% | 79.14% |
| PC3 | 0.604 | 15.09% | 94.18% |
| PC4 | 0.231 | 5.82% | 100.00% |

Interpretation of loadings:

  • PC1 ≈ “Leverage vs. Liquidity axis” (Debt Ratio positive, Current Ratio negative)
  • PC2 ≈ “Profitability axis” (ROA dominates)

Two components capture 79% of total variance — we reduced 4 dimensions to 2 with minimal information loss.

Kaiser rule: Keep components with eigenvalue > 1 → keeps PC1 and PC2 (both > 1).
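The eigenvalue table and the Kaiser rule can be reproduced with scikit-learn. A sketch on synthetic data, not the YRD sample: four standardized features built as two correlated pairs, so roughly two eigenvalues exceed 1:

```python
# PCA on standardized data; Kaiser rule keeps components with eigenvalue > 1.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
base = rng.normal(size=(300, 2))
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * rng.normal(size=300),
                     base[:, 1], base[:, 1] + 0.1 * rng.normal(size=300)])

Z = StandardScaler().fit_transform(X)   # standardizing = correlation-matrix PCA
pca = PCA().fit(Z)
eigvals = pca.explained_variance_       # sorted in descending order
keep = int((eigvals > 1).sum())         # Kaiser rule
print("Eigenvalues:", np.round(eigvals, 3))
print("Components kept by Kaiser rule:", keep)
```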

t-SNE and UMAP: Nonlinear Dimensionality Reduction

PCA limitation: It finds only linear combinations. Complex nonlinear structures are invisible.

t-SNE (t-distributed Stochastic Neighbor Embedding):

  • Minimizes KL divergence between high-dimensional and low-dimensional neighborhoods:

\[\large{ D_{KL}(P||Q) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}} }\]

  • Excels at: Revealing local cluster structure and manifold topology
  • Limitation: Does not preserve global distances; different runs give different layouts

UMAP (Uniform Manifold Approximation and Projection):

  • Based on Riemannian geometry and topological data analysis
  • Faster than t-SNE and better preserves global structure
  • Increasingly the default choice for exploratory visualization

| Feature | PCA | t-SNE | UMAP |
|---|---|---|---|
| Structure | Linear | Nonlinear (local) | Nonlinear (local + global) |
| Speed | Fast | Slow | Medium |
| Reproducible | Yes | No (stochastic) | No (stochastic) |
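A t-SNE sketch using scikit-learn (UMAP lives in the separate `umap-learn` package and is not shown). The blob geometry and perplexity are illustrative; by design, different seeds give different layouts:

```python
# Embed two 5-D blobs into 2-D with t-SNE; local structure survives the map.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 5))
               for c in (np.zeros(5), np.full(5, 4.0))])   # 2 blobs in 5-D

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print("Embedding shape:", emb.shape)
```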

Clustering Evaluation: Internal and External Metrics

Internal metrics (no ground truth needed):

| Metric | Formula | Interpretation |
|---|---|---|
| Silhouette | \(\frac{b(i)-a(i)}{\max\{a(i), b(i)\}}\) | [−1, 1]; higher = better separation |
| Calinski-Harabasz | \(\frac{SS_B / (K-1)}{SS_W / (n-K)}\) | Higher = better; ratio of between/within variance |
| Davies-Bouldin | \(\frac{1}{K}\sum_i \max_{j \neq i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}\) | Lower = better; ratio of cluster spread to separation |

Our YRD clustering quality:

  • Silhouette = 0.3012 — moderate (recall: noise data gave 0.3916!)
  • Calinski-Harabasz = 840.68 — relatively high (good between-cluster separation)
  • Davies-Bouldin = 1.0640 — moderate (some cluster overlap)

External metrics (when true labels exist): ARI, NMI, V-measure — compare discovered clusters to known groups.
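All three internal metrics are one call each in scikit-learn. A sketch on two well-separated synthetic blobs (positions and scales are illustrative):

```python
# Compute the three internal metrics for one clustering solution.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [4, 4])])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sil = silhouette_score(X, labels)
ch = calinski_harabasz_score(X, labels)
db = davies_bouldin_score(X, labels)
print(f"Silhouette:        {sil:.3f}  (higher = better)")
print(f"Calinski-Harabasz: {ch:.1f}  (higher = better)")
print(f"Davies-Bouldin:    {db:.3f}  (lower = better)")
```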

Heuristic 1: Spurious Clusters — The Noise Benchmark

The most dangerous mistake in clustering: Interpreting structure where none exists.

Comparative experiment:

| Dataset | K | Silhouette | Interpretation |
|---|---|---|---|
| Real data (1,856 companies) | 4 | 0.3012 | Actual financial profiles |
| Random noise (1,000 uniform points) | 3 | 0.3916 | NO structure at all |

Random noise scores HIGHER than real data!

Why: K-means can always partition space into regions. Uniform data creates equally-spaced centroids with clean-looking Voronoi regions — high silhouette by geometry, not by structure.

The antidote:

  1. Compare your clustering metrics to a null distribution (permuted or random data)
  2. Test stability: run on subsamples — do the same clusters re-appear?
  3. Demand business interpretability: can you explain why these groups exist?
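The first antidote, a null benchmark, can be sketched directly: score the real clustering, then score the same pipeline on random data of the same shape many times, and demand that the real score clearly beat the null distribution. Everything here (blob geometry, 20 null draws) is illustrative:

```python
# Null benchmark: real silhouette vs. silhouettes from random uniform data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_silhouette(X, k, seed=0):
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)

rng = np.random.default_rng(8)
X_real = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                    for c in ([0, 0], [5, 0], [0, 5])])   # structured data

real_score = kmeans_silhouette(X_real, k=3)
null_scores = [kmeans_silhouette(rng.uniform(size=X_real.shape), k=3, seed=s)
               for s in range(20)]

print("Real:", round(real_score, 3), "| Null max:", round(max(null_scores), 3))
```

If the real score sits inside the null distribution, as the YRD score of 0.3012 does against the noise score of 0.3916, the clusters have not earned an interpretation.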

Heuristic 2: The Curse of Dimensionality

In high dimensions, all points become nearly equidistant.

The relative contrast of distances:

\[\large{ R = \frac{d_{\max} - d_{\min}}{d_{\min}} \to 0 \quad \text{as dimension} \to \infty }\]

| Dimension | Max Distance | Min Distance | Contrast \(R\) |
|---|---|---|---|
| 2 | 1.32 | 0.08 | 15.5 |
| 10 | 3.02 | 1.89 | 0.60 |
| 100 | 5.85 | 4.51 | 0.30 |
| 1000 | 13.21 | 12.09 | 0.09 |

At 1000 dimensions, the farthest point is only 9% farther than the nearest!

Practical implication: Clustering and distance-based methods break down in high dimensions. You must reduce dimensionality first (PCA, feature selection) before clustering.
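The vanishing contrast can be simulated in a few lines. A sketch with uniform data; the dimensions and sample size are illustrative, so the exact numbers differ from the table above:

```python
# Distance contrast R = (d_max - d_min) / d_min shrinks as dimension grows.
import numpy as np

rng = np.random.default_rng(9)
contrast = {}
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))
    q = X[0]                                   # a query point
    dist = np.linalg.norm(X[1:] - q, axis=1)   # distances to all other points
    contrast[d] = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  contrast R = {contrast[d]:.3f}")
```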

Summary: The Unsupervised Learning Toolkit

| Topic | Key Takeaway |
|---|---|
| Supervised vs. Unsupervised | No labels → must validate with domain expertise |
| Rorschach Test | Algorithms ALWAYS find clusters, even in pure noise |
| Scaling | ALWAYS standardize before computing distances |
| K-Means | Simple, fast, local optima; run multiple initializations |
| K Selection | Elbow + Silhouette + Gap; no single metric is sufficient |
| YRD Case | 4 profiles: leveraged giants, asset-light stars, distressed, balanced |
| Hierarchical | No pre-specified K; Ward linkage generally preferred |
| PCA | Eigenvalue decomposition; 2 PCs capture 79% of variance |
| t-SNE / UMAP | Nonlinear visualization; UMAP preserves global structure better |
| Curse of Dimensionality | Distance contrast → 0 in high-d; reduce dimensions first |
| Validation | Always: null benchmark + stability + business meaning |

The meta-lesson: Unsupervised learning is powerful for exploration, but demands rigorous skepticism. Clusters must earn your trust.