08 Cluster Analysis

Welcome to the World of Cluster Analysis

Imagine you are the CEO of Netflix, holding viewing data for hundreds of millions of users.

  • How would you recommend movies that users might like?
  • How would you produce original content that caters to local tastes in different markets?
  • How would you discover and serve emerging, niche viewing communities?

This Chapter’s Learning Roadmap

We will embark on a comprehensive journey to master the core concepts and practices of cluster analysis.

Figure: Chapter learning roadmap, a complete path from theory to practice in four stages: 1. Foundations ('What is it?'), 2. Evaluation ('Is it good?'), 3. Core Algorithms ('How to do it?'), and 4. Business Practice ('How to use it?'), building a complete, closed-loop knowledge system.

The Core Task is to Discover Structure in Unstructured Data

Rather than recommending the same popular titles to everyone, a better strategy is to identify groups of users with similar tastes and then make targeted recommendations.

Figure: From unordered to ordered. Clustering analysis transforms unstructured data into identified structures.

This process of ‘grouping like with like’ is the core of what we will study today: Clustering. It is the art and science of automatically grouping similar data objects into clusters without predefined categories.

Clustering is a Form of Unsupervised Learning

Clustering is a classic Unsupervised Learning method.

  • Supervised Learning: Data has clear labels (e.g., an email is marked as ‘spam’ or ‘not spam’), and we learn to predict these labels.
  • Unsupervised Learning: Data has no labels. Our goal is to discover the inherent, hidden structures or patterns within the data itself.
Figure: Supervised learning data includes explicit labels; unsupervised learning data has no preset labels.

The task of clustering is to partition similar data points into the same group (cluster).

Clustering vs. Classification: A Critical Distinction

Before diving into the world of clustering, it’s essential to clarify the difference between Clustering and Classification.

Classification (Supervised Learning)

  • Goal: Predict known labels.
  • Input: Labeled data (e.g., this email is/is not spam).
  • Process: Learn the mapping rules from features to labels.
  • Output: A model that can predict labels for new data.
Figure: Supervised learning (classification): two classes of data points separated by a learned decision boundary.

Clustering (Unsupervised Learning)

  • Goal: Discover hidden groups in the data.
  • Input: Unlabeled data.
  • Process: Group data based on its intrinsic similarity.
  • Output: A partition of the data into clusters and an interpretation of these clusters.
Figure: Unsupervised learning (clustering): unlabeled data is grouped into discovered clusters through the clustering process.

The Goal is a Rigorous Partition of the Data

Given a dataset \(X = \{x_1, x_2, \ldots, x_N\}\) with N samples, clustering aims to divide it into K clusters \(\{C_1, C_2, \ldots, C_K\}\).

This division must satisfy three fundamental properties to form a Partition.

Partition Property 1/3: Non-empty

Each defined cluster must actually contain at least one data point. We don’t make meaningless divisions.

\[ \large{\forall k \in \{1, \dots, K\}, C_k \neq \emptyset} \]

Figure: A valid, non-empty cluster versus an invalid, empty cluster.

Partition Property 2/3: Mutually Exclusive

Each data point can only belong to one single cluster.

\[ \large{\forall k \neq l, C_k \cap C_l = \emptyset} \]

Figure: A data point in the overlapping region of two clusters violates the mutually exclusive property; a point cannot belong to two clusters simultaneously.

Partition Property 3/3: Exhaustive

Every single data point in the dataset must be assigned to some cluster.

\[ \large{\bigcup_{k=1}^{K} C_k = X} \]

Figure: A 'stray' point lying outside all defined clusters violates the exhaustive property; every point must be assigned to some cluster.

The Core Idea: Maximize Intra-cluster Similarity

The intuitive goal of clustering is straightforward:

Samples within a cluster should be as ‘similar’ as possible, while samples between clusters should be as ‘dissimilar’ as possible.

Figure: Points divided into groups showing intra-cluster cohesion (similar) and inter-cluster separation (dissimilar).

The Two Foundational Components of Clustering

To achieve this goal, we must first address two fundamental questions:

Figure: The two core components of cluster analysis: (1) how to measure 'similarity' (the distance metric) and (2) how to define a 'representative' (the cluster center).

Component 1: Defining ‘Similarity’ with Mathematics

In clustering algorithms, ‘dissimilarity’ is typically measured by ‘distance’: the greater the distance, the more dissimilar the points.

An effective distance metric \(DM(x, y)\) must satisfy the following four properties:

| Property | Mathematical Expression | Economic Intuition |
|:---|:---|:---|
| Non-negativity | \(DM(x, y) \geq 0\) | The difference between two customers cannot be negative. |
| Identity | \(DM(x, x) = 0\) | A customer's difference from themselves is zero. |
| Symmetry | \(DM(x, y) = DM(y, x)\) | The difference from customer A to B is the same as from B to A. |
| Triangle Inequality | \(DM(x, z) \leq DM(x, y) + DM(y, z)\) | The indirect difference via a third party is never less than the direct difference. |

Common Distance Metric (1): Euclidean Distance

This is the most common and intuitive definition of distance: the straight-line distance between two points in space.

For two d-dimensional vectors \(x=(x_1, \dots, x_d)\) and \(y=(y_1, \dots, y_d)\), their Euclidean distance is:

\[ \large{DM_{euc}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}} \]

Use Case: When the dimensions of the data are comparable and we care about the absolute physical distance, Euclidean distance is the top choice.

Common Distance Metric (2): Manhattan Distance

Also known as ‘city block distance’, it calculates the sum of the absolute differences of their Cartesian coordinates.

\[ \large{DM_{man}(x, y) = \sum_{i=1}^{d} |x_i - y_i|} \]

Use Case: When movement is restricted to a grid (like a chessboard) or when the scales of different dimensions vary greatly, Manhattan distance can be more robust than Euclidean distance.
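
To make the two metrics concrete, here is a minimal NumPy sketch (the two points are hypothetical illustrative values) computing both distances for the same pair of vectors:

```python
import numpy as np

# Two hypothetical 2-D points for illustration
x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])

# Euclidean distance: straight-line distance between the points
d_euc = np.sqrt(np.sum((x - y) ** 2))   # sqrt(3^2 + 4^2) = 5.0

# Manhattan distance: sum of absolute coordinate differences
d_man = np.sum(np.abs(x - y))           # |3| + |4| = 7.0

print(f"Euclidean: {d_euc:.1f}, Manhattan: {d_man:.1f}")
```

For the same pair of points, the Manhattan distance is never smaller than the Euclidean distance.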

Visualizing Distance: Euclidean vs. Manhattan

Figure: Euclidean distance (straight line) versus Manhattan distance (grid path) from point A to point B.

A Critical Practical Step: Feature Scaling

Warning: Before using distance-based clustering algorithms (like K-Means), you must standardize your data!

| Customer | Age (Years) | Annual Income (USD) |
|:---|---:|---:|
| A | 25 | 50,000 |
| B | 30 | 50,100 |

  • Problem: Without scaling, the squared difference of $100 in income will far outweigh the squared difference of 5 years in age, causing the clustering result to be completely dominated by income.
  • Solution: Use tools like StandardScaler to bring all features to a similar scale (e.g., mean of 0, variance of 1).
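
As a minimal sketch of this step, the snippet below standardizes the two customers from the table above and shows how the raw distance is dominated by income:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Customers A and B from the table above: [Age (years), Annual Income (USD)]
X = np.array([[25.0, 50_000.0],
              [30.0, 50_100.0]])

# Raw Euclidean distance is dominated by the $100 income gap (100 >> 5)
raw_dist = np.linalg.norm(X[0] - X[1])            # ~100.12

# After standardization, each feature has mean 0 and unit variance,
# so age and income contribute comparably to the distance
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(raw_dist, scaled_dist)
```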

The Impact of Feature Scaling

Without scaling, distance calculations are ‘hijacked’ by features with large numerical ranges.

Figure: The effect of feature scaling. Before scaling, distance is dominated by the income range, so A and B appear 'closer'; after scaling, feature contributions are balanced and the true proximity of A and C is revealed.

Component 2: Defining the ‘Heart’ of a Cluster

The cluster center (or centroid) is the representative point of a cluster. Different algorithms define it differently, mainly in two ways.

1. Mean-based Center

  • Representative Algorithm: K-Means
  • Definition: The arithmetic mean of all sample points within the cluster.
  • Formula: \[ \large{\mu_k = \frac{1}{|C_k|} \sum_{x_n \in C_k} x_n} \]
  • Characteristic: Intuitive, but sensitive to outliers.

2. Density-based Center

  • Representative Algorithms: DBSCAN, CFSFDP
  • Definition: The area of highest density.
  • Requirement: A point is a center if its surrounding neighbors have lower density, and it’s sufficiently far from other high-density points.
  • Characteristic: More robust, can find arbitrarily shaped clusters.

How Outliers Affect Centroid Calculation

Figure: Centroid sensitivity to outliers. A mean-based centroid is pulled away from the main cluster by an outlier, while a density-based centroid remains stable in the dense region.

Evaluating Performance: Is Our Grouping Meaningful?

We need a set of objective metrics to evaluate the effectiveness of our clustering. Evaluation methods fall into two main categories:

External Measures

  • Prerequisite: You have ‘ground truth’ class labels (for evaluation only).
  • Goal: Measure the consistency between the clustering result and the true labels.
  • Representative Metrics:
    • Purity
    • Entropy
    • Homogeneity
    • Completeness

Internal Measures

  • Prerequisite: You do not have true labels.
  • Goal: Evaluate the quality of the clustering based only on the data itself.
  • Core Idea: Are clusters sufficiently compact (cohesive) and well-separated?
  • Representative Metric:
    • Silhouette Coefficient

External Metric (1): Purity

Idea: How ‘pure’ is a cluster? We check the proportion of the most frequent ‘true class’ within that cluster.

Calculation:

  1. For each cluster \(C_k\), find the true class \(L_s\) that is most dominant.
  2. Calculate the proportion of samples from this class, \(|C_k \cap L_s|\), relative to the total number of samples in the cluster, \(|C_k|\).
  3. The overall Purity is the weighted average over all clusters.

\[ \large{\text{Purity} = \sum_{k=1}^{K} \frac{|C_k|}{N} \max_s \left( \frac{|C_k \cap L_s|}{|C_k|} \right)} \]

Pro: Simple and intuitive. Con: Can be misleadingly high if the number of clusters is large (e.g., each point is a cluster), as it doesn’t penalize fragmentation.

Purity Calculation Example

Suppose we have 10 samples, with true classes being 5 diamonds and 5 hexagons. A clustering algorithm divides them into two clusters, K1 and K2.

| | True Class: Diamond | True Class: Hexagon |
|:---|---:|---:|
| Cluster K1 | 5 (dominant) | 1 |
| Cluster K2 | 0 | 4 (dominant) |

Calculate Purity of K1: \(Purity(K_1) = \max(\frac{5}{6}, \frac{1}{6}) = \frac{5}{6}\)

Calculate Purity of K2: \(Purity(K_2) = \max(\frac{0}{4}, \frac{4}{4}) = \frac{4}{4} = 1\)

Calculate Overall Purity (Weighted Average): \(Purity_{total} = \frac{|K_1|}{N} \times Purity(K_1) + \frac{|K_2|}{N} \times Purity(K_2) = \frac{6}{10} \times \frac{5}{6} + \frac{4}{10} \times 1 = 0.5 + 0.4 = 0.9\)
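
As a small sketch, the same overall purity can be computed directly from the cluster-by-class contingency table of this example (the counts below are just the example's numbers):

```python
import numpy as np

# Rows = clusters K1, K2; columns = true classes (Diamond, Hexagon)
counts = np.array([[5, 1],
                   [0, 4]])

N = counts.sum()                           # 10 samples in total
# Overall purity: sum of each cluster's dominant class count, divided by N
purity = counts.max(axis=1).sum() / N      # (5 + 4) / 10 = 0.9
print(purity)
```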

External Metric (2): Entropy

Idea: Measures the ‘disorder’ or ‘impurity’ within a cluster. If a cluster contains a mix of samples from various true classes, its entropy is high. If it’s pure, its entropy is low.

Calculation: For a cluster \(C_k\), its entropy is calculated as: \[ \large{\text{Entropy}(C_k) = - \sum_{s=1}^{S} p_{ks} \log_2(p_{ks})} \]

where \(p_{ks} = \frac{|C_k \cap L_s|}{|C_k|}\) is the probability of true class \(s\) appearing in cluster \(k\).

Entropy Calculation Example

Let’s use cluster K1 from the previous example to calculate its entropy.

  • K1 has 6 samples.
  • The probability of the ‘Diamond’ class is \(p_{diamond} = 5/6\).
  • The probability of the ‘Hexagon’ class is \(p_{hexagon} = 1/6\).

\[ \large{\text{Entropy}(K_1) = - \left( \frac{5}{6} \log_2\left(\frac{5}{6}\right) + \frac{1}{6} \log_2\left(\frac{1}{6}\right) \right) \approx 0.65} \]

The closer this value is to 0, the purer the cluster. K2 contains only hexagons, so \(p_{hexagon} = 1\), and \(\text{Entropy}(K_2) = -1 \log_2(1) = 0\), which is perfectly pure.
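
A small sketch of the same entropy calculation, written as a helper function (the function name is only illustrative):

```python
import numpy as np

def cluster_entropy(counts):
    """Entropy of one cluster, given the counts of each true class inside it."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()              # class proportions; empty classes contribute 0
    return float(-(p * np.log2(p)).sum())

print(cluster_entropy([5, 1]))          # ~0.65 for K1
print(cluster_entropy([0, 4]))          # 0.0 for K2 (perfectly pure)
```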

External Metrics (3): Homogeneity & Completeness

Purity and Entropy only tell part of the story. Homogeneity and Completeness provide a more comprehensive view.

  • Homogeneity: Measures whether each cluster contains only members of a single class. If every cluster contains members of only one true class, homogeneity is 1. This is analogous to ‘precision’ in classification.

  • Completeness: Measures whether all members of a given class are assigned to the same cluster. If all members of each true class end up in a single cluster, completeness is 1. This is analogous to ‘recall’.

These two metrics are often in a trade-off.

Homogeneity vs. Completeness: An Intuitive Example

Assume the ground truth is {3 Apples, 3 Pears}.

Perfect Clustering

  • Cluster 1: {🍎, 🍎, 🍎}
  • Cluster 2: {🍐, 🍐, 🍐}
  • Homogeneity = 1
  • Completeness = 1

High Homogeneity, Low Completeness

  • Cluster 1: {🍎}
  • Cluster 2: {🍎, 🍎}
  • Cluster 3: {🍐, 🍐, 🍐}
  • Homogeneity = 1 (each cluster is pure)
  • Completeness < 1 (Apples are split)

Low Homogeneity, High Completeness

  • Cluster 1: {🍎, 🍎, 🍎, 🍐, 🍐, 🍐} (all items in a single cluster)
  • Homogeneity < 1 (Cluster 1 is impure)
  • Completeness = 1 (all same-class items are together)
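
scikit-learn provides homogeneity_score and completeness_score, which reproduce this intuition. A minimal sketch, encoding apples as class 0 and pears as class 1 (the label vectors below are just the examples above re-expressed as integers):

```python
from sklearn.metrics import homogeneity_score, completeness_score

# Ground truth: 3 apples (class 0) and 3 pears (class 1)
y_true = [0, 0, 0, 1, 1, 1]

# Apples split across two pure clusters: homogeneity = 1, completeness < 1
y_split = [0, 1, 1, 2, 2, 2]
print(homogeneity_score(y_true, y_split), completeness_score(y_true, y_split))

# Everything lumped into a single cluster: homogeneity < 1, completeness = 1
y_lumped = [0, 0, 0, 0, 0, 0]
print(homogeneity_score(y_true, y_lumped), completeness_score(y_true, y_lumped))
```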

Internal Metric: Silhouette Coefficient

When we don’t have true labels, how do we evaluate clustering quality? The Silhouette Coefficient is a powerful tool.

Idea: For each data point, we want it to be very close to points in its own cluster, and very far from points in other clusters.

Calculation: For a sample \(x_n\):

  • \(a(x_n)\): The average distance from \(x_n\) to all other points in its own cluster (measures intra-cluster cohesion).
  • \(b(x_n)\): The average distance from \(x_n\) to all points in the nearest other cluster (measures inter-cluster separation).

\[ \large{SC(x_n) = \frac{b(x_n) - a(x_n)}{\max\{a(x_n), b(x_n)\}}} \]

Visualizing the Silhouette Coefficient Calculation

Figure: Ideal scenario for a high silhouette coefficient: a(x_n), the intra-cluster distance, should be as small as possible, while b(x_n), the distance to the nearest other cluster, should be as large as possible.

Interpreting the Silhouette Coefficient

The value of the Silhouette Coefficient \(SC(x_n)\) ranges from -1 to 1.

  • SC ≈ 1: Excellent! This means \(a(x_n)\) is much smaller than \(b(x_n)\). The point is well-matched to its own cluster and poorly-matched to neighboring clusters.
  • SC ≈ 0: The point is on or very close to the decision boundary between two neighboring clusters.
  • SC ≈ -1: Very bad! This means the point is likely assigned to the wrong cluster.

We can calculate the average silhouette coefficient over all data points to evaluate the entire clustering result.
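
In practice, scikit-learn's silhouette_score (the average over all points) and silhouette_samples (per-point values) implement this metric. A minimal sketch on synthetic blobs (make_blobs and the parameter values are only illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

# Three well-separated synthetic blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

labels = KMeans(n_clusters=3, random_state=42, n_init='auto').fit_predict(X)

print(silhouette_score(X, labels))         # average SC over all points (close to 1 here)
print(silhouette_samples(X, labels)[:5])   # per-point SC values for the first 5 samples
```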

Algorithm 1: K-Means Clustering

K-Means is one of the most famous and widely used clustering algorithms.

  • Core Idea: Iteratively partition the data into K clusters, such that each data point belongs to the cluster with the nearest mean (cluster center), and the within-cluster sum of squares is minimized.
  • Pros: Simple algorithm, fast computation, works extremely well for spherical, similarly sized clusters.
  • Cons: Requires the number of clusters K to be specified beforehand, sensitive to initial centroid placement, performs poorly on non-spherical clusters and with outliers.

The Objective Function of K-Means

K-Means aims to minimize the Sum of Squared Errors (SSE) within all clusters, also known as Inertia.

\[ \large{\text{minimize} \sum_{k=1}^{K} \sum_{x_n \in C_k} ||x_n - \mu_k||^2} \]

Where:

  • \(K\) is the number of clusters.
  • \(C_k\) is the \(k\)-th cluster.
  • \(\mu_k\) is the centroid of the \(k\)-th cluster.
  • \(||x_n - \mu_k||^2\) is the squared Euclidean distance from data point \(x_n\) to the centroid of its assigned cluster.

Each step of the algorithm (assignment and update) is designed to reduce this total error value.

The Iterative Process of the K-Means Algorithm

The K-Means algorithm works like a game of "capturing territory", proceeding in four steps:

  1. Initialize: Randomly select K data points as the initial cluster centers.
  2. Assign: For each data point, calculate its distance to all K cluster centers and assign it to the nearest one.
  3. Update: For each cluster, recalculate its center (i.e., the mean of all data points in that cluster).
  4. Repeat: Repeat steps 2 and 3 until the cluster centers no longer change significantly, or the assignment of data points no longer changes.
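
The loop can be written compactly in NumPy. The sketch below is a bare-bones illustration of the assign and update steps (plain random initialization, no K-Means++), not a replacement for scikit-learn's KMeans:

```python
import numpy as np

def simple_kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick K random data points as the initial centers
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # 2. Assign: each point joins the cluster of its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each center to the mean of its assigned points
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        # 4. Repeat until the centers stop moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Each assign and update pass can only decrease (or leave unchanged) the SSE objective defined above, which is why the loop converges.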

K-Means Step 1/4: Initialization

Assume K=3. We randomly select three points in the data space as the initial centroids.

Figure: Step 1. Three cluster centers are randomly initialized among the data points.

K-Means Step 2/4: Assignment

Based on the distance of each point to these three initial centers, color them with the color of the nearest center.

Figure: Step 2. Each data point is assigned (colored) according to its nearest cluster center.

K-Means Step 3/4: Update

For each colored cluster, calculate the mean of all its points and move the cluster center to this new location.

Figure: Step 3. Each cluster center moves to the mean position of its members.

K-Means Step 4/4: Convergence

Repeat the ‘Assign-Update’ steps until the cluster centers no longer move. At this point, we have our final clustering result.

Figure: Step 4. The algorithm converges and the final clustering result is obtained.

The Achilles’ Heel of K-Means: Initialization Matters

Random initialization can lead to suboptimal or even completely wrong clustering results.

  • Problem: If the initial centers are chosen poorly (e.g., all within the same true cluster), the algorithm can get stuck in a local optimum and fail to find the global optimum.
  • Solution: K-Means++
    • Instead of being purely random, it uses a smarter strategy: the chosen initial centers are as far away from each other as possible.
    • This greatly increases the probability of finding the global optimum. In scikit-learn, init='k-means++' is the default, so we usually don’t need to worry.
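
A small sketch comparing the two initialization modes in scikit-learn (make_blobs and the exact parameter values are illustrative; the resulting inertia values depend on the random seed):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# 'k-means++' is the default; n_init controls how many random restarts are tried
km_pp = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=4, init='random', n_init=1, random_state=0).fit(X)

# The k-means++ run (with restarts) typically reaches an inertia at least as low
print(km_pp.inertia_, km_rand.inertia_)
```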

The K-Means++ Initialization Strategy Explained

Figure: The K-Means++ initialization strategy improves clustering quality by choosing initial centers that are far apart: C1 is chosen randomly, C2 is chosen far from C1, and C3 is chosen far from the existing centers.

The Most Critical Question: How to Choose the Value of K?

The K-Means algorithm itself cannot tell us what the optimal K is. This is a hyperparameter that we must determine using domain knowledge and data insights.

Fortunately, we have two common heuristic methods to aid this decision:

  1. The Elbow Method
  2. Silhouette Analysis

Choosing K (1): The Elbow Method

Idea: Try different values of K (e.g., from 2 to 10) and calculate the Sum of Squared Errors (SSE, a.k.a. Inertia) for each K.

As K increases, SSE will necessarily decrease. We are looking for the ‘inflection point’ where the rate of decrease slows down significantly—the ‘elbow’.

Figure: The Elbow Method. SSE decreases as K grows, but the rate of decrease slows significantly at the 'elbow' (here K=4), which marks a good choice.
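
A minimal sketch of the elbow search (make_blobs stands in for real data; the K range and parameters are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Fit K-Means for each candidate K and record the SSE (inertia)
k_values = range(2, 11)
sse = [KMeans(n_clusters=k, random_state=42, n_init='auto').fit(X).inertia_
       for k in k_values]

plt.plot(list(k_values), sse, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('SSE (Inertia)')
plt.title('Elbow Method')
plt.show()
```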

Choosing K (2): Silhouette Analysis

Idea: For each tested K value, calculate the average silhouette coefficient of the resulting clustering. We choose the K value that yields the highest average silhouette coefficient.

Compared to the Elbow Method, silhouette analysis considers both intra-cluster cohesion and inter-cluster separation, making it a generally more reliable indicator.

Figure: Silhouette analysis. The average silhouette score peaks at K=4, indicating the optimal number of clusters.
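
The same search can be driven by the average silhouette score instead; a minimal sketch, again on illustrative make_blobs data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Average silhouette coefficient for each candidate K
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42, n_init='auto').fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```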

Case Study: Mall Customer Segmentation

We’ll use a simulated customer dataset containing annual income and spending scores. Our goal is to identify distinct customer segments.

1. Data Exploration

Table 1: Sample Customer Data

| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---:|---:|:---|---:|---:|---:|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |

2. Finding the Optimal K

Before applying K-Means, we use the Elbow Method and Silhouette Analysis to determine the best number of clusters, K. Both methods point to K=5 as a strong choice.

Figure: Selecting the optimal K for the mall customer data. The Elbow Method (left) and Silhouette Analysis (right) both suggest K=5.

3. Python Implementation

Now we run K-Means with our chosen value, K=5.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Load and prepare the data
url = 'https://raw.githubusercontent.com/subashgandyer/datasets/main/mall_customers.csv'
# For demonstration purposes, add error handling for the URL
try:
    df = pd.read_csv(url)
except Exception:  # If the URL fails, create mock data instead
    print("Failed to load data from URL, using mock data instead.")
    data = {'Annual Income (k$)': np.random.randint(15, 140, 200),
            'Spending Score (1-100)': np.random.randint(1, 100, 200)}
    df = pd.DataFrame(data)

X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# 2. Feature Scaling (a crucial step)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Run K-Means (we have determined K=5)
kmeans = KMeans(n_clusters=5, random_state=42, n_init='auto')
# Use the scaled data for training and prediction
df['Cluster'] = kmeans.fit_predict(X_scaled)
Failed to load data from URL, using mock data instead.

4. Visualizing the Results

Let’s visualize the clustering results to see what customer segments we have discovered.

import matplotlib.pyplot as plt
import seaborn as sns

# Map cluster numbers to meaningful names for the legend
cluster_map = {
    0: 'Standard',
    1: 'Target',
    2: 'Careful',
    3: 'Frugal',
    4: 'Potential'
}
# Note: This mapping is based on a visual inspection of the plot from a specific `random_state`.
# The cluster numbers might change if the state changes, but the five groups will be the same.
df['Segment'] = df['Cluster'].map(cluster_map)

# Get centroids and inverse transform them to the original scale for plotting
centroids_scaled = kmeans.cluster_centers_
centroids = scaler.inverse_transform(centroids_scaled)

# Create the plot
plt.style.use('seaborn-v0_8-whitegrid')
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df,
    x='Annual Income (k$)',
    y='Spending Score (1-100)',
    hue='Segment',
    palette='viridis',
    s=100,
    alpha=0.8,
    edgecolor='w'
)

# Plot the centroids
plt.scatter(
    centroids[:, 0], 
    centroids[:, 1], 
    s=250, 
    c='red', 
    marker='X',
    edgecolor='black',
    label='Centroids'
)

# Beautify the plot
plt.title('Customer Segments', fontsize=16)
plt.xlabel('Annual Income (k$)', fontsize=12)
plt.ylabel('Spending Score (1-100)', fontsize=12)
plt.legend(title='Customer Segment', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Figure 1: The five distinct customer segments identified by K-Means.

5. Deriving Business Insights

The clustering result clearly reveals five distinct customer segments:

  1. Careful: High income, low spending.
  2. Standard: Medium income, medium spending.
  3. Potential: Low income, high spending (likely students or young professionals).
  4. Target: High income, high spending (prime customers).
  5. Frugal: Low income, low spending.

This segmentation provides a clear, data-driven basis for the mall’s targeted marketing, membership management, and resource allocation strategies.

Algorithm 2: Hierarchical Clustering

Unlike K-Means, which partitions data into K clusters in one go, Hierarchical Clustering creates a hierarchy of clusters.

  • Core Idea: It doesn’t produce a single clustering but a tree-like structure (a Dendrogram) that shows how data points are successively merged (or split).
  • Two Main Strategies:
    • Agglomerative: Bottom-up. Starts with each point as its own cluster, then progressively merges the most similar clusters.
    • Divisive: Top-down. Starts with all points in one cluster, then progressively splits the most dissimilar ones.

The Agglomerative Hierarchical Clustering Workflow

  1. Start: Treat each data point as an individual cluster.
  2. Merge: Find the two closest clusters among all clusters and merge them into a new one.
  3. Repeat: Repeat step 2 until all data points are merged into a single cluster.

The entire history of these merges forms a tree-like hierarchical structure.
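
A minimal sketch of agglomerative clustering with SciPy (the toy data and the choice of average linkage are illustrative; 'single', 'complete', and 'ward' are also available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Two small, well-separated groups of 2-D points (toy data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(3, 0.5, (5, 2))])

# Bottom-up merging using average linkage
Z = linkage(X, method='average')

# The dendrogram records the entire merge history
dendrogram(Z)
plt.show()

# 'Cut' the tree to obtain a flat clustering with K=2
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```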

The Key Question: How to Measure ‘Cluster-to-Cluster’ Distance?

This is defined by the Linkage Criterion. There are three main types:

Figure: Linkage criteria for inter-cluster distance: single linkage uses the distance between the closest points, complete linkage the distance between the farthest points, and average linkage the average of all pairwise distances.

Visualizing Hierarchical Clustering: The Dendrogram

The dendrogram is the signature output of hierarchical clustering. It visually represents the merging process.

  • Vertical Axis: Represents the distance or dissimilarity between clusters. The height of a merge indicates how dissimilar the two merged clusters were.
  • Horizontal Axis: Represents the individual data points.

We can obtain a specific number of clusters by making a horizontal ‘cut’ across the dendrogram at a certain height.

Figure: A hierarchical clustering dendrogram of 10 sample points (distance on the vertical axis, sample points on the horizontal axis). Horizontal cut lines show how to obtain K=3 or K=2 clusters.

Algorithm 3: Density-Based Clustering (DBSCAN)

When K-Means Fails: K-Means assumes clusters are spherical. It struggles with irregularly shaped clusters (like crescents or rings).

Enter DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

  • Core Idea: Groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions.
  • Pros: Does not require specifying the number of clusters, can find arbitrarily shaped clusters, and is robust to noise.
  • Cons: Sensitive to the choice of its two core parameters (eps and min_samples), struggles with datasets of varying density.

The Core Concepts of DBSCAN

DBSCAN’s logic relies on three key definitions, controlled by two parameters: eps (a radius) and min_samples (a minimum number of points).

Figure: DBSCAN point classifications (with min_samples = 4): a core point has at least 4 points in its ε-neighborhood (including itself); a border point has fewer neighbors but lies in the neighborhood of a core point; a noise point is neither.

The Intuitive Flow of the DBSCAN Algorithm

The DBSCAN process is like a “snowball effect”:

  1. Randomly select an unvisited point P.
  2. Check if P is a core point.
    • If it is: Start a new cluster with P. Then, find all density-reachable points through P’s ε-neighborhood (including other core and border points) and add them all to this cluster. This process expands like a chain reaction until the entire density-connected region is covered.
    • If it is not (i.e., it’s a border or noise point): Temporarily mark it as noise and move to the next point.
  3. Repeat steps 1 and 2 until all points have been visited.

Case Study: DBSCAN Shines Where K-Means Fails

Let’s look at an example that K-Means cannot handle but DBSCAN solves perfectly: the “two moons” dataset.

Figure: K-Means vs. DBSCAN on the 'two moons' dataset. K-Means incorrectly splits the moons with a straight boundary, while DBSCAN correctly identifies each moon as a separate cluster.
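
A minimal sketch reproducing this comparison with scikit-learn (eps and min_samples are hand-tuned for this toy dataset):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# The 'two moons' dataset: two interleaving crescents with a little noise
X, _ = make_moons(n_samples=400, noise=0.05, random_state=42)

# K-Means with K=2 cuts straight across the moons
km_labels = KMeans(n_clusters=2, random_state=42, n_init='auto').fit_predict(X)

# DBSCAN groups points by density; label -1 marks noise points
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(set(km_labels), set(db_labels))
```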

Algorithm 4: Gaussian Mixture Model (GMM)

Beyond ‘Hard’ Clustering: K-Means performs a ‘black-or-white’ hard assignment for each point. But sometimes, a point might lie on the boundary of two clusters, and we’d prefer a probabilistic description.

Gaussian Mixture Model (GMM) offers a Soft Clustering approach.

  • Core Idea: Assumes that all data points are generated from a mixture of K different Gaussian (normal) distributions. The clustering process is about finding the parameters (mean, variance, weight) of these K distributions.
  • Result: GMM doesn’t directly tell us which cluster each point belongs to; instead, it gives us the probability that the point belongs to each cluster.

GMM vs. K-Means: A Comparison

| Feature | K-Means | Gaussian Mixture Model (GMM) |
|:---|:---|:---|
| Cluster Shape | Assumes spherical (circular) | Can adapt to elliptical shapes |
| Assignment | Hard assignment | Soft assignment (probabilistic) |
| Mathematical Basis | Distance-based | Probability-based (Expectation-Maximization) |
| Flexibility | Lower | Higher; better at fitting complex data |

GMM can be seen as a probabilistic generalization of K-Means.
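
A minimal sketch of GMM's soft assignments with scikit-learn (synthetic blobs and K=3 components are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit a mixture of 3 Gaussians with full covariance matrices (elliptical shapes)
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42).fit(X)

hard_labels = gmm.predict(X)          # hard assignment, comparable to K-Means labels
soft_probs = gmm.predict_proba(X)     # soft assignment: one probability per component

print(hard_labels[:5])
print(np.round(soft_probs[:5], 3))    # each row sums to 1
```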

Case Study: GMM Excels at Fitting Elliptical Clusters

When cluster shapes are not circular but elliptical, the advantages of GMM become apparent.

Figure: GMM vs. K-Means on elliptical clusters. K-Means forces a circular boundary and fits the elliptical data poorly, while GMM uses a covariance matrix to capture the orientation and spread of the data, enabling a more precise fit.

Summary: How to Choose the Right Clustering Algorithm?

Today we've learned four major clustering algorithms, each with its own strengths and weaknesses. In practice, there is no 'best' algorithm, only the 'most suitable' one.

| Algorithm | Key Advantages | Key Disadvantages | Best For... |
|:---|:---|:---|:---|
| K-Means | Fast, simple, scalable | Requires K; sensitive to the spherical assumption | Large datasets with simple, well-separated clusters |
| Hierarchical | No need for K; provides a hierarchy | Computationally expensive (O(N²)) | Small datasets; exploring data structure |
| DBSCAN | No need for K; finds any shape | Sensitive to parameters; struggles with varied density | Non-spherical clusters; datasets with noise |
| GMM | Soft clustering; flexible shapes | Complex; higher computational cost | Probabilistic output needed; overlapping clusters |

An Algorithm Selection Decision Tree

Here is a simplified decision-making flowchart to help you choose an appropriate clustering algorithm in practice.

Figure: Clustering algorithm selection decision tree. If you need to pre-specify the number of clusters K, use K-Means when clusters are expected to be spherical, or GMM when they are elliptical. If you do not need to pre-specify K, use DBSCAN when you need to handle noise/outliers, or Hierarchical clustering when you need a hierarchy.

The Business Value of Cluster Analysis

Cluster analysis is more than just a set of algorithms; it’s a powerful tool for business insight.

  • Marketing: Identify customer segments for targeted advertising and personalized recommendations (e.g., Amazon, Netflix).
  • Finance: Detect unusual transaction patterns to identify potential fraud (e.g., American Express).
  • Urban Planning: Group city areas by resident travel patterns to create functional zones (e.g., Smart City projects).
  • Bioinformatics: Cluster gene expression data to discover groups of genes with similar functions (e.g., Cancer research).
  • Social Networks: Discover communities and interest groups for friend recommendations (e.g., Facebook, LinkedIn).

Mastering clustering means mastering the key to finding structure and creating value from vast amounts of data.

Thank You!

Q & A