Imagine you are the CEO of Netflix, holding viewing data for hundreds of millions of users.
How would you recommend movies that users might like?
How would you produce original content that caters to local tastes in different markets?
How would you discover and serve emerging, niche viewing communities?
This Chapter’s Learning Roadmap
We will embark on a comprehensive journey to master the core concepts and practices of cluster analysis.
The Core Task is to Discover Structure in Unstructured Data
Rather than treating every user the same, a better strategy is to identify groups of users with similar tastes and then make targeted recommendations to each group.
This process of ‘grouping like with like’ is the core of what we will study today: Clustering. It is the art and science of automatically grouping similar data objects into clusters without predefined categories.
Clustering is a Form of Unsupervised Learning
Clustering is a classic Unsupervised Learning method.
Supervised Learning: Data has clear labels (e.g., an email is marked as ‘spam’ or ‘not spam’), and we learn to predict these labels.
Unsupervised Learning: Data has no labels. Our goal is to discover the inherent, hidden structures or patterns within the data itself.
The task of clustering is to partition similar data points into the same group (cluster).
Clustering vs. Classification: A Critical Distinction
Before diving into the world of clustering, it’s essential to clarify the difference between Clustering and Classification.
Classification (Supervised Learning)
Goal: Predict known labels.
Input: Labeled data (e.g., this email is/is not spam).
Process: Learn the mapping rules from features to labels.
Output: A model that can predict labels for new data.
Clustering (Unsupervised Learning)
Goal: Discover hidden groups in the data.
Input: Unlabeled data.
Process: Group data based on its intrinsic similarity.
Output: A partition of the data into clusters and an interpretation of these clusters.
The Goal is a Rigorous Partition of the Data
Given a dataset \(X = \{x_1, x_2, \ldots, x_N\}\) with N samples, clustering aims to divide it into K clusters \(\{C_1, C_2, \ldots, C_K\}\).
This division must satisfy three fundamental properties to form a Partition.
Partition Property 1/3: Non-empty
Each cluster must contain at least one data point; we do not create meaningless, empty groups.
Partition Property 2/3: Exhaustive
Every data point must be assigned to some cluster; together, the clusters cover the whole dataset.
Partition Property 3/3: Disjoint
Each data point belongs to exactly one cluster; clusters do not overlap.
Grouping 'like with like' requires a distance metric. The two most common choices are Euclidean distance (straight-line distance) and Manhattan distance (the sum of absolute coordinate differences).
Use Case: When movement is restricted to a grid (like city blocks), or when large differences along a single dimension should not be amplified (Manhattan does not square the differences), Manhattan distance can be more robust than Euclidean distance.
Visualizing Distance: Euclidean vs. Manhattan
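Since the original figure is not reproduced here, the following minimal NumPy sketch (with made-up points) shows how the two metrics differ on the same pair of points:

```python
import numpy as np

# Two hypothetical 2-D points (illustrative values only)
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

euclidean = np.linalg.norm(p - q)   # straight-line distance: sqrt(3^2 + 4^2) = 5.0
manhattan = np.abs(p - q).sum()     # city-block distance: |3| + |4| = 7.0

print(f"Euclidean: {euclidean:.1f}, Manhattan: {manhattan:.1f}")
```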
A Critical Practical Step: Feature Scaling
Warning: Before using distance-based clustering algorithms (like K-Means), you must standardize your data!
| Customer | Age (Years) | Annual Income (USD) |
|:---|---:|---:|
| A | 25 | 50,000 |
| B | 30 | 50,100 |
Problem: Without scaling, the squared difference of $100 in income will far outweigh the squared difference of 5 years in age, causing the clustering result to be completely dominated by income.
Solution: Use tools like StandardScaler to bring all features to a similar scale (e.g., mean of 0, variance of 1).
The Impact of Feature Scaling
Without scaling, distance calculations are ‘hijacked’ by features with large numerical ranges.
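To make this concrete, here is a small sketch using the two customers from the table above (values as given there; everything else is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Customers A and B from the table above: [age in years, annual income in USD]
X = np.array([[25.0, 50_000.0],
              [30.0, 50_100.0]])

# Raw distance: the $100 income gap dominates the 5-year age gap
raw_dist = np.linalg.norm(X[0] - X[1])
print(f"Unscaled distance: {raw_dist:.2f}")                    # ~100.12, almost all from income

# Contribution of each feature to the squared distance
sq_diff = (X[0] - X[1]) ** 2
print(f"Share from age:    {sq_diff[0] / sq_diff.sum():.2%}")  # ~0.25%
print(f"Share from income: {sq_diff[1] / sq_diff.sum():.2%}")  # ~99.75%

# After standardization, both features contribute equally to the distance
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(f"Scaled distance: {scaled_dist:.2f}")
```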
Component 2: Defining the ‘Heart’ of a Cluster
The cluster center (or centroid) is the representative point of a cluster. Different algorithms define it differently, mainly in two ways.
1. Mean-based Center
Representative Algorithm: K-Means
Definition: The arithmetic mean of all sample points within the cluster.
External Metrics (1): Purity
Idea: For each cluster, purity is the fraction of its samples that belong to the dominant true class, \(\text{Purity}(C_k) = \max_s p_{ks}\), where \(p_{ks}\) is the proportion of true class \(s\) in cluster \(k\).
Pro: Simple and intuitive. Con: It can be misleadingly high when the number of clusters is large (in the extreme, putting each point in its own cluster yields a purity of 1), because it does not penalize fragmentation.
Purity Calculation Example
Suppose we have 10 samples, with true classes being 5 diamonds and 5 hexagons. A clustering algorithm divides them into two clusters, K1 and K2.
| True Class | Diamond | Hexagon |
|:---|---:|---:|
| Cluster K1 | 5 (dominant) | 1 |
| Cluster K2 | 0 | 4 (dominant) |
Calculate Purity of K1: \(\text{Purity}(K_1) = \max(\frac{5}{6}, \frac{1}{6}) = \frac{5}{6}\)
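A tiny sketch of the same calculation in code (class counts taken from the table above):

```python
import numpy as np

# Counts of true classes inside each cluster: (diamonds, hexagons)
clusters = {"K1": np.array([5, 1]), "K2": np.array([0, 4])}

for name, counts in clusters.items():
    purity = counts.max() / counts.sum()
    print(f"Purity({name}) = {purity:.3f}")   # K1 -> 0.833, K2 -> 1.000

# Overall purity, weighted by cluster size: (5 + 4) / 10 = 0.9
overall = sum(c.max() for c in clusters.values()) / sum(c.sum() for c in clusters.values())
print(f"Overall purity = {overall:.2f}")
```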
External Metrics (2): Entropy
Idea: Measures the ‘disorder’ or ‘impurity’ within a cluster. If a cluster contains a mix of samples from various true classes, its entropy is high. If it’s pure, its entropy is low.
Calculation: For a cluster \(C_k\), its entropy is calculated as: \[ \text{Entropy}(C_k) = - \sum_{s=1}^{S} p_{ks} \log_2(p_{ks}) \]
where \(p_{ks} = \frac{|C_k \cap L_s|}{|C_k|}\) is the probability of true class \(s\) appearing in cluster \(k\).
Entropy Calculation Example
Let’s use cluster K1 from the previous example to calculate its entropy.
K1 has 6 samples.
The probability of the ‘Diamond’ class is \(p_{diamond} = 5/6\).
The probability of the ‘Hexagon’ class is \(p_{hexagon} = 1/6\).
Therefore \(\text{Entropy}(K_1) = -\frac{5}{6}\log_2\frac{5}{6} - \frac{1}{6}\log_2\frac{1}{6} \approx 0.65\).
The closer this value is to 0, the purer the cluster. K2 contains only hexagons, so \(p_{hexagon} = 1\) and \(\text{Entropy}(K_2) = -1 \cdot \log_2(1) = 0\), which is perfectly pure.
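The same computation in code (a small helper written for this example, not a library function):

```python
import numpy as np

def cluster_entropy(counts):
    """Entropy of a cluster, given the counts of each true class inside it."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # by convention, 0 * log2(0) = 0
    return -(p * np.log2(p)).sum()

print(f"Entropy(K1) = {cluster_entropy([5, 1]):.3f}")  # ~0.650
print(f"Entropy(K2) = {cluster_entropy([0, 4]):.3f}")  # 0.0, perfectly pure
```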
External Metrics (3): Homogeneity & Completeness
Purity and Entropy only tell part of the story. Homogeneity and Completeness provide a more comprehensive view.
Homogeneity: Measures whether each cluster contains only members of a single class. If every cluster contains members of exactly one true class, homogeneity is 1. This is analogous to ‘Precision’ in classification.
Completeness: Measures whether all members of a given class are assigned to the same cluster. If all members of a true class are perfectly grouped into one cluster, completeness is high. This is analogous to ‘Recall’.
These two metrics are often in a trade-off.
Homogeneity vs. Completeness: An Intuitive Example
Assume the ground truth is {3 Apples, 3 Pears}.
Perfect Clustering
Cluster 1: {🍎, 🍎, 🍎}
Cluster 2: {🍐, 🍐, 🍐}
Homogeneity = 1
Completeness = 1
High Homogeneity, Low Completeness
Cluster 1: {🍎}
Cluster 2: {🍎, 🍎}
Cluster 3: {🍐, 🍐, 🍐}
Homogeneity = 1 (each cluster is pure)
Completeness < 1 (Apples are split)
Low Homogeneity, High Completeness
Cluster 1: {🍎, 🍎, 🍎, 🍐, 🍐, 🍐} (everything lumped into a single cluster)
Homogeneity < 1 (Cluster 1 is impure)
Completeness = 1 (all same-class items are together)
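scikit-learn implements both metrics; the following sketch encodes the apple/pear example above with made-up integer labels (0 = apple, 1 = pear; cluster IDs are arbitrary):

```python
from sklearn.metrics import homogeneity_score, completeness_score

# Ground truth: 3 apples (0) and 3 pears (1)
truth = [0, 0, 0, 1, 1, 1]

# High homogeneity, low completeness: apples split across two pure clusters
pred_split = [0, 1, 1, 2, 2, 2]
print(homogeneity_score(truth, pred_split))    # 1.0  (every cluster is pure)
print(completeness_score(truth, pred_split))   # < 1.0 (apples are split)

# Low homogeneity, high completeness: everything in one cluster
pred_lumped = [0, 0, 0, 0, 0, 0]
print(homogeneity_score(truth, pred_lumped))   # 0.0
print(completeness_score(truth, pred_lumped))  # 1.0
```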
Internal Metric: Silhouette Coefficient
When we don’t have true labels, how do we evaluate clustering quality? The Silhouette Coefficient is a powerful tool.
Idea: For each data point, we want it to be very close to points in its own cluster, and very far from points in other clusters.
Calculation: For a sample \(x_n\):
\(a(x_n)\): The average distance from \(x_n\) to all other points in its own cluster (measures intra-cluster cohesion).
\(b(x_n)\): The average distance from \(x_n\) to all points in the nearest other cluster (measures inter-cluster separation).
The silhouette coefficient combines the two: \[ SC(x_n) = \frac{b(x_n) - a(x_n)}{\max\{a(x_n), b(x_n)\}} \]
Visualizing the Silhouette Coefficient Calculation
Interpreting the Silhouette Coefficient
The value of the Silhouette Coefficient \(SC(x_n)\) ranges from -1 to 1.
SC ≈ 1: Excellent! This means \(a(x_n)\) is much smaller than \(b(x_n)\). The point is well-matched to its own cluster and poorly-matched to neighboring clusters.
SC ≈ 0: The point is on or very close to the decision boundary between two neighboring clusters.
SC ≈ -1: Very bad! This means the point is likely assigned to the wrong cluster.
We can calculate the average silhouette coefficient over all data points to evaluate the entire clustering result.
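In practice we rarely compute this by hand; here is a minimal scikit-learn sketch on toy blob data (illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

# Toy data with 3 well-separated blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

labels = KMeans(n_clusters=3, random_state=42, n_init="auto").fit_predict(X)

# Average over all points: close to 1 for clean, well-separated blobs
print(f"Average silhouette: {silhouette_score(X, labels):.3f}")

# Per-point values can reveal individual poorly assigned samples
per_point = silhouette_samples(X, labels)
print(f"Per-point range: {per_point.min():.3f} to {per_point.max():.3f}")
```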
Algorithm 1: K-Means Clustering
K-Means is one of the most famous and widely used clustering algorithms.
Core Idea: Iteratively partition the data into K clusters, such that each data point belongs to the cluster with the nearest mean (cluster center), and the within-cluster sum of squares is minimized.
Pros: Simple algorithm, fast computation, works extremely well for spherical, similarly sized clusters.
Cons: Requires the number of clusters K to be specified beforehand, sensitive to initial centroid placement, performs poorly on non-spherical clusters and with outliers.
The Objective Function of K-Means
K-Means aims to minimize the Sum of Squared Errors (SSE) within all clusters, also known as Inertia: \[ \text{SSE} = \sum_{k=1}^{K} \sum_{x_n \in C_k} ||x_n - \mu_k||^2 \]
\(\mu_k\) is the centroid of the \(k\)-th cluster.
\(||x_n - \mu_k||^2\) is the squared Euclidean distance from data point \(x_n\) to the centroid of its assigned cluster.
Each step of the algorithm (assignment and update) is designed to reduce this total error value.
The Iterative Process of the K-Means Algorithm
The K-Means procedure is like a game of “capturing territory,” repeated in a four-step cycle (a minimal code sketch follows the step-by-step walkthrough below):
Initialize: Randomly select K data points as the initial cluster centers.
Assign: For each data point, calculate its distance to all K cluster centers and assign it to the nearest one.
Update: For each cluster, recalculate its center (i.e., the mean of all data points in that cluster).
Repeat: Repeat steps 2 and 3 until the cluster centers no longer change significantly, or the assignment of data points no longer changes.
K-Means Step 1/4: Initialization
Assume K=3. We randomly select three points in the data space as the initial centroids.
K-Means Step 2/4: Assignment
Based on the distance of each point to these three initial centers, color them with the color of the nearest center.
K-Means Step 3/4: Update
For each colored cluster, calculate the mean of all its points and move the cluster center to this new location.
K-Means Step 4/4: Convergence
Repeat the ‘Assign-Update’ steps until the cluster centers no longer move. At this point, we have our final clustering result.
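For intuition, here is a minimal NumPy sketch of this assign/update loop. It is a bare-bones illustration (purely random initialization, single run), not scikit-learn's implementation:

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-Means: random initialization, then alternate assign / update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # 1. initialize
    for _ in range(n_iter):
        # 2. assign each point to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. update each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                     # 4. stop when centers settle
            break
        centers = new_centers
    return labels, centers
```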
The Achilles’ Heel of K-Means: Initialization Matters
Random initialization can lead to suboptimal or even completely wrong clustering results.
Problem: If the initial centers are chosen poorly (e.g., all within the same true cluster), the algorithm can get stuck in a local optimum and fail to find the global optimum.
Solution: K-Means++
Instead of picking all centers purely at random, it picks only the first center at random; each subsequent center is sampled with probability proportional to its squared distance from the nearest already-chosen center, so the initial centers tend to end up far apart from each other.
This greatly increases the probability of finding the global optimum. In scikit-learn, init='k-means++' is the default, so we usually don’t need to worry.
The K-Means++ Initialization Strategy Explained
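Since the original diagram is not reproduced here, the following sketch captures the core seeding idea (scikit-learn's actual implementation adds further refinements, such as trying several candidate points per step):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ seeding sketch: each new center is sampled with probability
    proportional to its squared distance from the nearest existing center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                  # first center: uniform at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                            # far-away points are more likely
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```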
The Most Critical Question: How to Choose the Value of K?
The K-Means algorithm itself cannot tell us what the optimal K is. This is a hyperparameter that we must determine using domain knowledge and data insights.
Fortunately, we have two common heuristic methods to aid this decision:
The Elbow Method
Silhouette Analysis
Choosing K (1): The Elbow Method
Idea: Try different values of K (e.g., from 2 to 10) and calculate the Sum of Squared Errors (SSE, a.k.a. Inertia) for each K.
As K increases, SSE will necessarily decrease. We are looking for the ‘inflection point’ where the rate of decrease slows down significantly—the ‘elbow’.
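A minimal sketch on toy blob data (the dataset and the K range are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for a real, already-scaled feature matrix
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=42)

k_values = range(2, 11)
inertias = [KMeans(n_clusters=k, random_state=42, n_init="auto").fit(X).inertia_
            for k in k_values]

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia (SSE)")
plt.title("Elbow Method")
plt.show()
```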
Choosing K (2): Silhouette Analysis
Idea: For each tested K value, calculate the average silhouette coefficient of the resulting clustering. We choose the K value that yields the highest average silhouette coefficient.
Compared to the Elbow Method, silhouette analysis considers both intra-cluster cohesion and inter-cluster separation, making it a generally more reliable indicator.
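And the corresponding sketch for silhouette-based selection (same illustrative toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42, n_init="auto").fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best K by average silhouette: {best_k} (score {scores[best_k]:.3f})")
```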
Case Study: Mall Customer Segmentation
We’ll use a simulated customer dataset containing annual income and spending scores. Our goal is to identify distinct customer segments.
1. Data Exploration
Table 1: Sample Customer Data
|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---:|---:|:---|---:|---:|---:|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
2. Finding the Optimal K
Before applying K-Means, we use the Elbow Method and Silhouette Analysis to determine the best number of clusters, K. Both methods point to K=5 as a strong choice.
3. Python Implementation
Now we run K-Means with our chosen value, K=5.
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Load and prepare the data
url = 'https://raw.githubusercontent.com/subashgandyer/datasets/main/mall_customers.csv'
# For demonstration purposes, add error handling for the URL
try:
    df = pd.read_csv(url)
except Exception:
    # If the URL fails, create mock data
    print("Failed to load data from URL, using mock data instead.")
    data = {'Annual Income (k$)': np.random.randint(15, 140, 200),
            'Spending Score (1-100)': np.random.randint(1, 100, 200)}
    df = pd.DataFrame(data)

X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# 2. Feature Scaling (a crucial step)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Run K-Means (we have determined K=5)
kmeans = KMeans(n_clusters=5, random_state=42, n_init='auto')

# Use the scaled data for training and prediction
df['Cluster'] = kmeans.fit_predict(X_scaled)
```
```
Failed to load data from URL, using mock data instead.
```
4. Visualizing the Results
Let’s visualize the clustering results to see what customer segments we have discovered.
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Map cluster numbers to meaningful names for the legend
cluster_map = {0: 'Standard', 1: 'Target', 2: 'Careful', 3: 'Frugal', 4: 'Potential'}
# Note: This mapping is based on a visual inspection of the plot from a specific `random_state`.
# The cluster numbers might change if the state changes, but the five groups will be the same.
df['Segment'] = df['Cluster'].map(cluster_map)

# Get centroids and inverse transform them to the original scale for plotting
centroids_scaled = kmeans.cluster_centers_
centroids = scaler.inverse_transform(centroids_scaled)

# Create the plot
plt.style.use('seaborn-v0_8-whitegrid')
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
    hue='Segment', palette='viridis', s=100, alpha=0.8, edgecolor='w'
)

# Plot the centroids
plt.scatter(
    centroids[:, 0], centroids[:, 1],
    s=250, c='red', marker='X', edgecolor='black', label='Centroids'
)

# Beautify the plot
plt.title('Customer Segments', fontsize=16)
plt.xlabel('Annual Income (k$)', fontsize=12)
plt.ylabel('Spending Score (1-100)', fontsize=12)
plt.legend(title='Customer Segment', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
```
Figure 1: The five distinct customer segments identified by K-Means.
5. Deriving Business Insights
The clustering result clearly reveals five distinct customer segments:
Careful: High income, low spending.
Standard: Medium income, medium spending.
Potential: Low income, high spending (likely students or young professionals).
Target: High income, high spending (prime customers).
Frugal: Low income, low spending.
This segmentation provides a clear, data-driven basis for the mall’s targeted marketing, membership management, and resource allocation strategies.
Algorithm 2: Hierarchical Clustering
Unlike K-Means, which partitions data into K clusters in one go, Hierarchical Clustering creates a hierarchy of clusters.
Core Idea: It doesn’t produce a single clustering but a tree-like structure (a Dendrogram) that shows how data points are successively merged (or split).
Two Main Strategies:
Agglomerative: Bottom-up. Starts with each point as its own cluster, then progressively merges the most similar clusters.
Divisive: Top-down. Starts with all points in one cluster, then progressively splits the most dissimilar ones.
The Agglomerative Hierarchical Clustering Workflow
Start: Treat each data point as an individual cluster.
Merge: Find the two closest clusters among all clusters and merge them into a new one.
Repeat: Repeat step 2 until all data points are merged into a single cluster.
The entire history of these merges forms a tree-like hierarchical structure.
The Key Question: How to Measure ‘Cluster-to-Cluster’ Distance?
This is defined by the Linkage Criterion. There are three main types:
Single Linkage: the distance between the closest pair of points, one from each cluster.
Complete Linkage: the distance between the farthest pair of points, one from each cluster.
Average Linkage: the average distance over all pairs of points across the two clusters.
Visualizing Hierarchical Clustering: The Dendrogram
The dendrogram is the signature output of hierarchical clustering. It visually represents the merging process.
Vertical Axis: Represents the distance or dissimilarity between clusters. The height of a merge indicates how dissimilar the two merged clusters were.
Horizontal Axis: Represents the individual data points.
We can obtain a specific number of clusters by making a horizontal ‘cut’ across the dendrogram at a certain height.
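A minimal sketch using SciPy on toy data (Ward linkage is one common choice; the dataset is illustrative):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# Agglomerative clustering: record the full merge history
Z = linkage(X, method="ward")

plt.figure(figsize=(8, 4))
dendrogram(Z)
plt.title("Dendrogram")
plt.ylabel("Merge distance")
plt.show()

# 'Cut' the tree to obtain a flat clustering with 3 clusters (labels are 1..3)
labels = fcluster(Z, t=3, criterion="maxclust")
```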
Algorithm 3: Density-Based Clustering (DBSCAN)
When K-Means Fails: K-Means assumes clusters are spherical. It struggles with irregularly shaped clusters (like crescents or rings).
Enter DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Core Idea: Groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions.
Pros: Does not require specifying the number of clusters, can find arbitrarily shaped clusters, and is robust to noise.
Cons: Sensitive to the choice of its two core parameters (eps and min_samples), struggles with datasets of varying density.
The Core Concepts of DBSCAN
DBSCAN’s logic relies on three key definitions, controlled by two parameters: eps (a neighborhood radius, ε) and min_samples (a minimum number of points).
Core point: a point with at least min_samples points (itself included) within its ε-neighborhood.
Border point: a point that is not a core point but lies within the ε-neighborhood of a core point.
Noise point: a point that is neither a core point nor a border point.
The Intuitive Flow of the DBSCAN Algorithm
The DBSCAN process is like a “snowball effect”:
Randomly select an unvisited point P.
Check if P is a core point.
If it is: Start a new cluster with P. Then, find all density-reachable points through P’s ε-neighborhood (including other core and border points) and add them all to this cluster. This process expands like a chain reaction until the entire density-connected region is covered.
If it is not (i.e., it’s a border or noise point): Temporarily mark it as noise and move to the next point.
Repeat steps 1 and 2 until all points have been visited.
Case Study: DBSCAN Shines Where K-Means Fails
Let’s look at an example that K-Means cannot handle but DBSCAN solves perfectly: the “two moons” dataset.
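A minimal sketch of that comparison (the eps and min_samples values are illustrative, tuned for this toy dataset):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# The classic 'two moons' shape that K-Means cannot separate cleanly
X, _ = make_moons(n_samples=400, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# K-Means splits the moons with a straight boundary
kmeans_labels = KMeans(n_clusters=2, random_state=42, n_init="auto").fit_predict(X)

# DBSCAN follows the density of the crescents; -1 marks noise points
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = list(db.labels_).count(-1)
print(f"DBSCAN found {n_clusters} clusters and {n_noise} noise points")
```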
Algorithm 4: Gaussian Mixture Model (GMM)
Beyond ‘Hard’ Clustering: K-Means performs a ‘black-or-white’ hard assignment for each point. But sometimes, a point might lie on the boundary of two clusters, and we’d prefer a probabilistic description.
Gaussian Mixture Model (GMM) offers a Soft Clustering approach.
Core Idea: Assumes that all data points are generated from a mixture of K different Gaussian (normal) distributions. The clustering process is about finding the parameters (mean, variance, weight) of these K distributions.
Result: GMM doesn’t directly tell us which cluster each point belongs to; instead, it gives us the probability that the point belongs to each cluster.
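A minimal sketch of this soft assignment with scikit-learn (toy data, illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

# Soft assignment: one probability per component for each point; rows sum to 1
probs = gmm.predict_proba(X[:3])
print(np.round(probs, 3))

# A hard label is still available if needed (argmax of the probabilities)
labels = gmm.predict(X)
```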
GMM vs. K-Means: A Comparison
| Feature | K-Means | Gaussian Mixture Model (GMM) |
|:---|:---|:---|
| Cluster Shape | Assumes spherical (circular) | Can adapt to elliptical shapes |
| Assignment | Hard Assignment | Soft Assignment (Probabilistic) |
| Mathematical Basis | Distance-based | Probability-based (Expectation-Maximization) |
| Flexibility | Lower | Higher, better at fitting complex data |
GMM can be seen as a probabilistic generalization of K-Means.
Case Study: GMM Excels at Fitting Elliptical Clusters
When cluster shapes are not circular but elliptical, the advantages of GMM become apparent.
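A minimal sketch of such a comparison, using an anisotropic (stretched) toy dataset. The transformation matrix and random seeds are illustrative, and the exact scores will vary:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Stretch spherical blobs into elongated, elliptical clusters
X, y_true = make_blobs(n_samples=450, centers=3, random_state=170)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])   # anisotropic transformation

km_labels = KMeans(n_clusters=3, random_state=42, n_init="auto").fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, covariance_type="full",
                             random_state=42).fit_predict(X)

# Compare both results against the known generating labels (higher is better)
print(f"K-Means ARI: {adjusted_rand_score(y_true, km_labels):.3f}")
print(f"GMM ARI:     {adjusted_rand_score(y_true, gmm_labels):.3f}")
```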
## Summary: How to Choose the Right Clustering Algorithm?
Today we've learned four major clustering algorithms, each with its own strengths and weaknesses. In practice, there is no 'best' algorithm, only the 'most suitable' one.
| Algorithm | Key Advantages | Key Disadvantages | Best For... |
|:---|:---|:---|:---|
| **K-Means** | Fast, simple, scalable | Requires K, sensitive to spherical assumption | Large datasets with simple, well-separated clusters |
| **Hierarchical** | No need for K, provides hierarchy | Computationally expensive (O(N^2))| Small datasets, exploring data structure |
| **DBSCAN** | No need for K, finds any shape | Sensitive to parameters, struggles with varied density | Non-spherical clusters, datasets with noise |
| **GMM** | Soft clustering, flexible shapes | Complex, high computational cost | Probabilistic output needed, overlapping clusters |
## An Algorithm Selection Decision Tree
Here is a simplified decision-making flowchart to help you choose an appropriate clustering algorithm in practice.
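Since the flowchart graphic is not reproduced here, the following rough sketch encodes the same decision logic as the summary table above (simplified and not authoritative; real projects should still compare several algorithms empirically):

```python
def suggest_algorithm(know_k: bool, non_spherical: bool, need_probabilities: bool,
                      large_dataset: bool) -> str:
    """Rough, illustrative mapping of the summary table to a recommendation."""
    if non_spherical:
        return "DBSCAN"                  # arbitrary shapes, robust to noise
    if need_probabilities:
        return "Gaussian Mixture Model"  # soft, probabilistic assignments
    if not know_k and not large_dataset:
        return "Hierarchical clustering" # explore structure via the dendrogram
    return "K-Means"                     # fast, simple default for large data

print(suggest_algorithm(know_k=True, non_spherical=False,
                        need_probabilities=False, large_dataset=True))  # K-Means
```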
The Business Value of Cluster Analysis
Cluster analysis is more than just a set of algorithms; it’s a powerful tool for business insight.
Marketing: Identify customer segments for targeted advertising and personalized recommendations (e.g., Amazon, Netflix).
Finance: Detect unusual transaction patterns to identify potential fraud (e.g., American Express).
Urban Planning: Group city areas by resident travel patterns to create functional zones (e.g., Smart City projects).
Bioinformatics: Cluster gene expression data to discover groups of genes with similar functions (e.g., Cancer research).
Social Networks: Discover communities and interest groups for friend recommendations (e.g., Facebook, LinkedIn).
Mastering clustering means mastering the key to finding structure and creating value from vast amounts of data.