| Dimension | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Target variable | Yes — \(Y\) is known | No — no labels |
| Goal | Predict \(Y\) from \(X\) | Discover structure in \(X\) |
| Evaluation | Clear metrics (AUC, MSE) | Ambiguous — no ground truth |
| Examples | Regression, Classification | Clustering, Dimensionality Reduction |
| Financial use | Credit scoring, price prediction | Client segmentation, factor discovery |
The fundamental challenge: Without labels, how do we know if our answer is right?
This chapter covers two families of methods: clustering (finding groups of similar observations) and dimensionality reduction (compressing many features into a few informative directions).
The trap: K-means will always return \(K\) clusters, even when the data has no structure.
Experiment: Generate 1,000 points from a uniform distribution — pure noise, zero structure.
Apply K-means with \(K = 3\): the algorithm dutifully returns three clusters, with silhouette = 0.3916.
Compare: our real YRD company clustering produces silhouette = 0.3012 — lower than the pure noise!
The lesson: Every clustering algorithm will produce an answer. The algorithm has no concept of “there are no clusters.” It’s your job to validate that the clusters are stable across resamples, better than a null (noise) benchmark, and meaningful to domain experts.
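The noise experiment can be reproduced in a few lines. This is a sketch assuming scikit-learn is available; the exact score depends on the random seed, so it will not match the 0.3916 quoted above precisely:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X_noise = rng.uniform(size=(1000, 2))  # 1,000 points of pure noise, zero structure

# K-means happily partitions the square into 3 Voronoi cells
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_noise)
score = silhouette_score(X_noise, km.labels_)
print(f"Silhouette on pure noise: {score:.4f}")  # comfortably positive despite no clusters
```

The score is positive purely because any Voronoi partition separates points geometrically, not because structure exists.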
Euclidean distance treats all features equally:
\[\large{ d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{j=1}^p (x_j - y_j)^2} }\]
The problem: If Total Assets is in billions (range: 0.1–500) and ROA is in percentages (range: −10 to 20), then:
Distance is 99.9% determined by Total Assets — ROA is effectively invisible.
The fix: Always standardize before clustering — subtract mean, divide by standard deviation:
\[\large{ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} }\]
This is not optional. Forgetting to standardize is one of the most common mistakes in applied clustering. After standardization, all features contribute equally to the distance metric.
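The z-score formula above is one line of NumPy. A minimal sketch with a toy two-feature matrix (hypothetical numbers, chosen only to show the scale mismatch):

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract the mean, divide by the sample std (s_j)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Toy example: total assets in billions vs. ROA in percent, wildly different scales
X = np.array([[120.0,  3.2],
              [  0.5,  8.1],
              [450.0, -2.4]])
Z = standardize(X)
print(Z.mean(axis=0))          # each column now has mean ~0
print(Z.std(axis=0, ddof=1))   # and sample standard deviation 1
```

scikit-learn's `StandardScaler` does the same thing (with `ddof=0` by default), which is immaterial for clustering.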
Objective: Minimize total within-cluster sum of squares:
\[\large{ \min_{C_1,...,C_K} \sum_{k=1}^K \sum_{x_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 }\]
Lloyd’s Algorithm:
1. Initialize \(K\) centroids (e.g., randomly chosen data points).
2. Assignment step: assign each point to its nearest centroid.
3. Update step: recompute each centroid as the mean of its assigned points.
4. Repeat steps 2–3 until assignments stop changing.
Convergence guarantee: The objective \(J\) is bounded below (≥ 0) and monotonically decreasing at each step → must converge. But: converges to a local minimum only — different initializations may give different results.
Practical fix: Run K-means 10+ times with different random seeds and keep the best solution (lowest \(J\)).
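The restart strategy can be sketched explicitly (random placeholder data stands in for the standardized features; scikit-learn's `inertia_` attribute is the objective \(J\)):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))  # stand-in for standardized financial features

# Run Lloyd's algorithm from 10 different random initializations
# and keep the fit with the lowest objective J.
fits = [KMeans(n_clusters=4, n_init=1, random_state=s).fit(X) for s in range(10)]
best = min(fits, key=lambda km: km.inertia_)
print(f"J ranged over [{min(f.inertia_ for f in fits):.1f}, "
      f"{max(f.inertia_ for f in fits):.1f}]; kept J = {best.inertia_:.1f}")
```

In practice a single `KMeans(n_clusters=4, n_init=10)` call does this internally.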
Method 1: Elbow Method — Plot WCSS vs. \(K\), look for the “bend”
Method 2: Silhouette Score — For each point \(i\), let \(a(i)\) be its mean distance to points in its own cluster and \(b(i)\) its mean distance to points in the nearest other cluster:
\[\large{ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \in [-1, 1] }\]
Method 3: Gap Statistic — Compares WCSS to expected WCSS under a null (uniform) distribution
\[\large{ \text{Gap}(K) = E[\log(W_K^*)] - \log(W_K) }\]
Choose the smallest \(K\) such that \(\text{Gap}(K) \geq \text{Gap}(K+1) - s_{K+1}\), where \(s_{K+1}\) is the simulation standard error of the null reference.
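The elbow and silhouette scans are a short loop. A sketch on synthetic blobs with a known \(K = 4\) (assuming scikit-learn; on real data you would inspect the WCSS curve for the bend rather than trust any single number):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # known structure: 4 blobs

wcss, sil = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = km.inertia_                      # elbow: plot WCSS vs. k, find the bend
    sil[k] = silhouette_score(X, km.labels_)   # silhouette: higher is better

best_k = max(sil, key=sil.get)
print(f"Silhouette picks k = {best_k}")
```

WCSS always decreases as \(K\) grows, which is why the elbow's *bend*, not its minimum, is the signal.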
Data: 1,856 YRD listed companies (2023) | 4 features (ROA, Debt Ratio, Current Ratio, ln Assets) | K=4 selected via Elbow + Silhouette
| Cluster | n | Avg ROA | Avg Debt(%) | Avg CR | Avg ln(Assets) | Profile |
|---|---|---|---|---|---|---|
| 1 | 397 | 2.96% | 60.4 | 1.50 | 5.49 | Large, leveraged |
| 2 | 409 | 5.49% | 17.0 | 5.23 | 3.29 | Asset-light, efficient |
| 3 | 161 | −10.04% | 56.1 | 1.27 | 3.62 | Financially distressed |
| 4 | 889 | 4.36% | 39.4 | 2.35 | 3.69 | Mid-size, balanced |
Cluster 3 (8.7% of firms): ROA deeply negative, high leverage — these are potential ST candidates. Cross-referencing: many overlap with actual ST-flagged companies from Chapter 11.
Agglomerative (bottom-up) approach:
1. Start with each observation as its own cluster (\(n\) clusters).
2. Merge the closest pair of clusters.
3. Repeat until one cluster remains, recording every merge in a dendrogram.
Which pair is “closest”? — Linkage methods:
| Linkage | Distance Between Clusters | Behavior |
|---|---|---|
| Single | \(\min\) distance between any pair | Chaining effect → elongated clusters |
| Complete | \(\max\) distance between any pair | Compact, spherical clusters |
| Average | Mean pairwise distance | Compromise between single and complete |
| Ward | Minimize increase in total within-cluster variance | Most similar to K-means; generally preferred |
Advantage over K-means: No need to pre-specify \(K\) — the dendrogram reveals the natural hierarchy.
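A minimal Ward-linkage sketch using SciPy (random placeholder data stands in for the YRD company profiles):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # placeholder for 50 standardized company profiles

Z = linkage(X, method="ward")   # each merge minimizes the increase in within-cluster SS
labels = fcluster(Z, t=4, criterion="maxclust")  # cut the dendrogram into at most 4 clusters
# scipy.cluster.hierarchy.dendrogram(Z) plots the full hierarchy (needs matplotlib)
print(f"{len(set(labels))} clusters from {Z.shape[0]} merges")
```

The linkage matrix `Z` encodes all \(n-1\) merges, so you can cut it at any height after the fact, which is exactly the "no pre-specified \(K\)" advantage.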
Goal: Find the direction \(\mathbf{w}\) that captures maximum variance in the data.
Optimization problem:
\[\large{ \max_{\mathbf{w}} \mathbf{w}' \mathbf{S} \mathbf{w} \quad \text{subject to} \quad \mathbf{w}'\mathbf{w} = 1 }\]
Solution via Lagrangian:
\[\large{ \mathcal{L} = \mathbf{w}'\mathbf{S}\mathbf{w} - \lambda(\mathbf{w}'\mathbf{w} - 1) }\]
Taking the derivative and setting to zero:
\[\large{ \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = 0 \implies \mathbf{S}\mathbf{w} = \lambda\mathbf{w} }\]
This is the eigenvalue equation! The first PC = eigenvector of the largest eigenvalue of \(\mathbf{S}\).
Geometric interpretation: PCA finds the principal axes of the data’s ellipsoidal cloud. The eigenvalues are the squared lengths of those axes.
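The derivation can be checked numerically in a few lines (a sketch with random data; `eigh` is used because \(\mathbf{S}\) is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)                  # center the data

S = np.cov(Xc, rowvar=False)             # sample covariance matrix S
eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # re-sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

w1 = eigvecs[:, 0]                       # first principal direction w
proj_var = np.var(Xc @ w1, ddof=1)       # variance of the data projected onto w
print(np.isclose(proj_var, eigvals[0]))  # True: w'Sw equals the top eigenvalue
print(eigvals / eigvals.sum())           # variance-explained proportions
```

The final line is how the "Variance Explained" column in tables like the one below is computed: each eigenvalue divided by the trace of \(\mathbf{S}\).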
Applied to YRD company financial profiles (standardized data):
| Component | Eigenvalue | Variance Explained | Cumulative |
|---|---|---|---|
| PC1 | 2.122 | 53.06% | 53.06% |
| PC2 | 1.043 | 26.07% | 79.14% |
| PC3 | 0.604 | 15.09% | 94.23% |
| PC4 | 0.231 | 5.77% | 100.00% |
Interpretation of loadings:
Two components capture 79% of total variance — we reduced 4 dimensions to 2 with minimal information loss.
Kaiser rule: Keep components with eigenvalue > 1 → keeps PC1 and PC2 (both > 1).
PCA limitation: It finds only linear combinations. Complex nonlinear structures are invisible.
t-SNE (t-distributed Stochastic Neighbor Embedding): converts pairwise distances into neighbor probabilities, \(p_{ij}\) in the high-dimensional space and \(q_{ij}\) in the low-dimensional map (using a heavy-tailed Student-t kernel), then minimizes the KL divergence between the two distributions:
\[\large{ D_{KL}(P||Q) = \sum_i \sum_j p_{ij} \log\frac{p_{ij}}{q_{ij}} }\]
UMAP (Uniform Manifold Approximation and Projection): builds a weighted nearest-neighbor graph of the data and optimizes a low-dimensional layout that preserves its topology. It is typically faster than t-SNE and better at preserving global structure.
| Feature | PCA | t-SNE | UMAP |
|---|---|---|---|
| Structure | Linear | Nonlinear (local) | Nonlinear (local + global) |
| Speed | Fast | Slow | Medium |
| Reproducible | Yes | No (stochastic) | No (stochastic) |
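A common workflow is PCA first (for speed and noise reduction), then a nonlinear method for the final 2-D plot. A sketch using scikit-learn's t-SNE on placeholder data (UMAP lives in the separate third-party `umap-learn` package, so it is omitted here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))           # placeholder high-dimensional data

# Step 1: PCA to a moderate dimension; Step 2: t-SNE down to 2-D for plotting
X_pca = PCA(n_components=5).fit_transform(X)
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)
print(X_2d.shape)  # (300, 2)
```

Note that fixing `random_state` makes a single run repeatable, but small changes to the data or parameters can still rearrange the map, which is the non-reproducibility flagged in the table.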
Internal metrics (no ground truth needed):
| Metric | Formula | Interpretation |
|---|---|---|
| Silhouette | \(\frac{b(i)-a(i)}{\max\{a,b\}}\) | [−1, 1]; higher = better separation |
| Calinski-Harabasz | \(\frac{SS_B / (K-1)}{SS_W / (n-K)}\) | Higher = better; ratio of between/within variance |
| Davies-Bouldin | \(\frac{1}{K}\sum \max_{j \neq i}\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}\) | Lower = better; ratio of cluster spread to separation |
Our YRD clustering quality: silhouette = 0.3012 — modest separation, which is exactly why a null benchmark is essential.
External metrics (when true labels exist): ARI, NMI, V-measure — compare discovered clusters to known groups.
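All three internal metrics are one call each in scikit-learn. A sketch on synthetic blobs, not the YRD data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)         # in [-1, 1], higher = better
ch = calinski_harabasz_score(X, labels)   # > 0, higher = better
db = davies_bouldin_score(X, labels)      # >= 0, lower = better
print(f"Silhouette {sil:.3f} | Calinski-Harabasz {ch:.1f} | Davies-Bouldin {db:.3f}")
```

The external metrics mentioned above are available in the same module as `adjusted_rand_score`, `normalized_mutual_info_score`, and `v_measure_score`.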
The most dangerous mistake in clustering: Interpreting structure where none exists.
Comparative experiment:
| Dataset | K | Silhouette | Interpretation |
|---|---|---|---|
| Real data (1,856 companies) | 4 | 0.3012 | Actual financial profiles |
| Random noise (1,000 uniform points) | 3 | 0.3916 | NO structure at all |
Random noise scores HIGHER than real data!
Why: K-means can always partition space into regions. Uniform data creates equally-spaced centroids with clean-looking Voronoi regions — high silhouette by geometry, not by structure.
The antidote:
1. Null benchmark: cluster shuffled or simulated noise data and compare the scores.
2. Stability: re-run on bootstrap samples; real clusters recur, artifacts do not.
3. Business meaning: the clusters must make sense to domain experts.
In high dimensions, all pairwise distances become nearly equal.
The relative contrast of distances:
\[\large{ R = \frac{d_{\max} - d_{\min}}{d_{\min}} \to 0 \quad \text{as dimension} \to \infty }\]
| Dimension | Max Distance | Min Distance | Contrast \(R\) |
|---|---|---|---|
| 2 | 1.32 | 0.08 | 15.5 |
| 10 | 3.02 | 1.89 | 0.60 |
| 100 | 5.85 | 4.51 | 0.30 |
| 1000 | 13.21 | 12.09 | 0.09 |
At 1000 dimensions, the farthest point is only 9% farther than the nearest!
Practical implication: Clustering and distance-based methods break down in high dimensions. You must reduce dimensionality first (PCA, feature selection) before clustering.
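The contrast collapse in the table above can be reproduced directly. A sketch with 200 uniform points per dimension; exact values vary with sample size and seed, but the downward trend is robust:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
contrasts = {}
for p in (2, 10, 100, 1000):
    X = rng.uniform(size=(200, p))               # uniform points in the unit hypercube
    d = pdist(X)                                 # all pairwise Euclidean distances
    contrasts[p] = (d.max() - d.min()) / d.min() # relative contrast R
    print(f"p = {p:4d}   contrast R = {contrasts[p]:.2f}")
```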
| Topic | Key Takeaway |
|---|---|
| Supervised vs. Unsupervised | No labels → must validate with domain expertise |
| Rorschach Test | Algorithms ALWAYS find clusters, even in pure noise |
| Scaling | ALWAYS standardize before computing distances |
| K-Means | Simple, fast, local optima; run multiple initializations |
| K Selection | Elbow + Silhouette + Gap; no single metric is sufficient |
| YRD Case | 4 profiles: leveraged giants, asset-light stars, distressed, balanced |
| Hierarchical | No pre-specified K; Ward linkage generally preferred |
| PCA | Eigenvalue decomposition; 2 PCs capture 79% of variance |
| t-SNE / UMAP | Nonlinear visualization; UMAP preserves global structure better |
| Curse of Dimensionality | Distance contrast → 0 in high-d; reduce dimensions first |
| Validation | Always: null benchmark + stability + business meaning |
The meta-lesson: Unsupervised learning is powerful for exploration, but demands rigorous skepticism. Clusters must earn your trust.