Imagine you are the CEO of Netflix, holding viewing data for hundreds of millions of users.
How would you recommend movies that users might like?
How would you produce original content that caters to local tastes in different markets?
How would you discover and serve emerging, niche viewing communities?
This Chapter’s Learning Roadmap
We will embark on a comprehensive journey to master the core concepts and practices of cluster analysis.
The Core Task is to Discover Structure in Unstructured Data
Rather than treating every user the same, a better strategy is to identify groups of users with similar tastes and then make targeted recommendations to each group.
This process of ‘grouping like with like’ is the core of what we will study today: Clustering. It is the art and science of automatically grouping similar data objects into clusters without predefined categories.
Clustering is a Form of Unsupervised Learning
Clustering is a classic Unsupervised Learning method.
Supervised Learning: Data has clear labels (e.g., an email is marked as ‘spam’ or ‘not spam’), and we learn to predict these labels.
Unsupervised Learning: Data has no labels. Our goal is to discover the inherent, hidden structures or patterns within the data itself.
The task of clustering is to partition similar data points into the same group (cluster).
Clustering vs. Classification: A Critical Distinction
Before diving into the world of clustering, it’s essential to clarify the difference between Clustering and Classification.
Classification (Supervised Learning)
Goal: Predict known labels.
Input: Labeled data (e.g., this email is/is not spam).
Process: Learn the mapping rules from features to labels.
Output: A model that can predict labels for new data.
Clustering (Unsupervised Learning)
Goal: Discover hidden groups in the data.
Input: Unlabeled data.
Process: Group data based on its intrinsic similarity.
Output: A partition of the data into clusters and an interpretation of these clusters.
The Goal is a Rigorous Partition of the Data
Given a dataset \(X = \{x_1, x_2, \ldots, x_N\}\) with N samples, clustering aims to divide it into K clusters \(\{C_1, C_2, \ldots, C_K\}\).
This division must satisfy three fundamental properties to form a Partition.
Partition Property 1/3: Non-empty
Each cluster must contain at least one data point; we do not create meaningless, empty groups.
Partition Property 2/3: Exhaustive
Every data point must be assigned to some cluster; together, the clusters cover the whole dataset.
Partition Property 3/3: Disjoint
Each data point belongs to exactly one cluster; clusters do not overlap.
Grouping 'like with like' requires a distance metric. The two most common choices are Euclidean distance (straight-line distance) and Manhattan distance (the sum of absolute coordinate differences).
Use Case: When movement is restricted to a grid (like city blocks), or when large differences along a single dimension should not be amplified (Manhattan does not square the differences), Manhattan distance can be more robust than Euclidean distance.
Visualizing Distance: Euclidean vs. Manhattan
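Since the original figure is not reproduced here, the following minimal NumPy sketch (with made-up points) shows how the two metrics differ on the same pair of points:

```python
import numpy as np

# Two hypothetical 2-D points (illustrative values only)
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

euclidean = np.linalg.norm(p - q)   # straight-line distance: sqrt(3^2 + 4^2) = 5.0
manhattan = np.abs(p - q).sum()     # city-block distance: |3| + |4| = 7.0

print(f"Euclidean: {euclidean:.1f}, Manhattan: {manhattan:.1f}")
```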
A Critical Practical Step: Feature Scaling
Warning: Before using distance-based clustering algorithms (like K-Means), you must standardize your data!
| Customer | Age (Years) | Annual Income (USD) |
|:---|---:|---:|
| A | 25 | 50,000 |
| B | 30 | 50,100 |
Problem: Without scaling, the squared difference of $100 in income will far outweigh the squared difference of 5 years in age, causing the clustering result to be completely dominated by income.
Solution: Use tools like StandardScaler to bring all features to a similar scale (e.g., mean of 0, variance of 1).
The Impact of Feature Scaling
Without scaling, distance calculations are ‘hijacked’ by features with large numerical ranges.
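To make this concrete, here is a small sketch using the two customers from the table above (values as given there; everything else is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Customers A and B from the table above: [age in years, annual income in USD]
X = np.array([[25.0, 50_000.0],
              [30.0, 50_100.0]])

# Raw distance: the $100 income gap dominates the 5-year age gap
raw_dist = np.linalg.norm(X[0] - X[1])
print(f"Unscaled distance: {raw_dist:.2f}")                    # ~100.12, almost all from income

# Contribution of each feature to the squared distance
sq_diff = (X[0] - X[1]) ** 2
print(f"Share from age:    {sq_diff[0] / sq_diff.sum():.2%}")  # ~0.25%
print(f"Share from income: {sq_diff[1] / sq_diff.sum():.2%}")  # ~99.75%

# After standardization, both features contribute equally to the distance
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(f"Scaled distance: {scaled_dist:.2f}")
```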
Component 2: Defining the ‘Heart’ of a Cluster
The cluster center (or centroid) is the representative point of a cluster. Different algorithms define it differently, mainly in two ways.
1. Mean-based Center
Representative Algorithm: K-Means
Definition: The arithmetic mean of all sample points within the cluster.
External Metrics (1): Purity
Idea: For each cluster, purity is the fraction of its samples that belong to the dominant true class, \(\text{Purity}(C_k) = \max_s p_{ks}\), where \(p_{ks}\) is the proportion of true class \(s\) in cluster \(k\).
Pro: Simple and intuitive. Con: It can be misleadingly high when the number of clusters is large (in the extreme, putting each point in its own cluster yields a purity of 1), because it does not penalize fragmentation.
Purity Calculation Example
Suppose we have 10 samples, with true classes being 5 diamonds and 5 hexagons. A clustering algorithm divides them into two clusters, K1 and K2.
| True Class | Diamond | Hexagon |
|:---|---:|---:|
| Cluster K1 | 5 (dominant) | 1 |
| Cluster K2 | 0 | 4 (dominant) |
Calculate Purity of K1: \(\text{Purity}(K_1) = \max(\frac{5}{6}, \frac{1}{6}) = \frac{5}{6}\)
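A tiny sketch of the same calculation in code (class counts taken from the table above):

```python
import numpy as np

# Counts of true classes inside each cluster: (diamonds, hexagons)
clusters = {"K1": np.array([5, 1]), "K2": np.array([0, 4])}

for name, counts in clusters.items():
    purity = counts.max() / counts.sum()
    print(f"Purity({name}) = {purity:.3f}")   # K1 -> 0.833, K2 -> 1.000

# Overall purity, weighted by cluster size: (5 + 4) / 10 = 0.9
overall = sum(c.max() for c in clusters.values()) / sum(c.sum() for c in clusters.values())
print(f"Overall purity = {overall:.2f}")
```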
External Metrics (2): Entropy
Idea: Measures the ‘disorder’ or ‘impurity’ within a cluster. If a cluster contains a mix of samples from various true classes, its entropy is high. If it’s pure, its entropy is low.
Calculation: For a cluster \(C_k\), its entropy is calculated as: \[ \text{Entropy}(C_k) = - \sum_{s=1}^{S} p_{ks} \log_2(p_{ks}) \]
where \(p_{ks} = \frac{|C_k \cap L_s|}{|C_k|}\) is the probability of true class \(s\) appearing in cluster \(k\).
Entropy Calculation Example
Let’s use cluster K1 from the previous example to calculate its entropy.
K1 has 6 samples.
The probability of the ‘Diamond’ class is \(p_{diamond} = 5/6\).
The probability of the ‘Hexagon’ class is \(p_{hexagon} = 1/6\).
Therefore \(\text{Entropy}(K_1) = -\frac{5}{6}\log_2\frac{5}{6} - \frac{1}{6}\log_2\frac{1}{6} \approx 0.65\).
The closer this value is to 0, the purer the cluster. K2 contains only hexagons, so \(p_{hexagon} = 1\) and \(\text{Entropy}(K_2) = -1 \cdot \log_2(1) = 0\), which is perfectly pure.
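The same computation in code (a small helper written for this example, not a library function):

```python
import numpy as np

def cluster_entropy(counts):
    """Entropy of a cluster, given the counts of each true class inside it."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # by convention, 0 * log2(0) = 0
    return -(p * np.log2(p)).sum()

print(f"Entropy(K1) = {cluster_entropy([5, 1]):.3f}")  # ~0.650
print(f"Entropy(K2) = {cluster_entropy([0, 4]):.3f}")  # 0.0, perfectly pure
```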
External Metrics (3): Homogeneity & Completeness
Purity and Entropy only tell part of the story. Homogeneity and Completeness provide a more comprehensive view.
Homogeneity: Measures whether each cluster contains only members of a single class. If every cluster contains members of exactly one true class, homogeneity is 1. This is analogous to ‘Precision’ in classification.
Completeness: Measures whether all members of a given class are assigned to the same cluster. If all members of a true class are perfectly grouped into one cluster, completeness is high. This is analogous to ‘Recall’.
These two metrics are often in a trade-off.
Homogeneity vs. Completeness: An Intuitive Example
Assume the ground truth is {3 Apples, 3 Pears}.
Perfect Clustering
Cluster 1: {🍎, 🍎, 🍎}
Cluster 2: {🍐, 🍐, 🍐}
Homogeneity = 1
Completeness = 1
High Homogeneity, Low Completeness
Cluster 1: {🍎}
Cluster 2: {🍎, 🍎}
Cluster 3: {🍐, 🍐, 🍐}
Homogeneity = 1 (each cluster is pure)
Completeness < 1 (Apples are split)
Low Homogeneity, High Completeness
Cluster 1: {🍎, 🍎, 🍎, 🍐, 🍐, 🍐} (everything lumped into a single cluster)
Homogeneity < 1 (Cluster 1 is impure)
Completeness = 1 (all same-class items are together)
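scikit-learn implements both metrics; the following sketch encodes the apple/pear example above with made-up integer labels (0 = apple, 1 = pear; cluster IDs are arbitrary):

```python
from sklearn.metrics import homogeneity_score, completeness_score

# Ground truth: 3 apples (0) and 3 pears (1)
truth = [0, 0, 0, 1, 1, 1]

# High homogeneity, low completeness: apples split across two pure clusters
pred_split = [0, 1, 1, 2, 2, 2]
print(homogeneity_score(truth, pred_split))    # 1.0  (every cluster is pure)
print(completeness_score(truth, pred_split))   # < 1.0 (apples are split)

# Low homogeneity, high completeness: everything in one cluster
pred_lumped = [0, 0, 0, 0, 0, 0]
print(homogeneity_score(truth, pred_lumped))   # 0.0
print(completeness_score(truth, pred_lumped))  # 1.0
```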
Internal Metric: Silhouette Coefficient
When we don’t have true labels, how do we evaluate clustering quality? The Silhouette Coefficient is a powerful tool.
Idea: For each data point, we want it to be very close to points in its own cluster, and very far from points in other clusters.
Calculation: For a sample \(x_n\):
\(a(x_n)\): The average distance from \(x_n\) to all other points in its own cluster (measures intra-cluster cohesion).
\(b(x_n)\): The average distance from \(x_n\) to all points in the nearest other cluster (measures inter-cluster separation).
The silhouette coefficient combines the two: \[ SC(x_n) = \frac{b(x_n) - a(x_n)}{\max\{a(x_n), b(x_n)\}} \]
Visualizing the Silhouette Coefficient Calculation
Interpreting the Silhouette Coefficient
The value of the Silhouette Coefficient \(SC(x_n)\) ranges from -1 to 1.
SC ≈ 1: Excellent! This means \(a(x_n)\) is much smaller than \(b(x_n)\). The point is well-matched to its own cluster and poorly-matched to neighboring clusters.
SC ≈ 0: The point is on or very close to the decision boundary between two neighboring clusters.
SC ≈ -1: Very bad! This means the point is likely assigned to the wrong cluster.
We can calculate the average silhouette coefficient over all data points to evaluate the entire clustering result.
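In practice we rarely compute this by hand; here is a minimal scikit-learn sketch on toy blob data (illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

# Toy data with 3 well-separated blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

labels = KMeans(n_clusters=3, random_state=42, n_init="auto").fit_predict(X)

# Average over all points: close to 1 for clean, well-separated blobs
print(f"Average silhouette: {silhouette_score(X, labels):.3f}")

# Per-point values can reveal individual poorly assigned samples
per_point = silhouette_samples(X, labels)
print(f"Per-point range: {per_point.min():.3f} to {per_point.max():.3f}")
```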
Algorithm 1: K-Means Clustering
K-Means is one of the most famous and widely used clustering algorithms.
Core Idea: Iteratively partition the data into K clusters, such that each data point belongs to the cluster with the nearest mean (cluster center), and the within-cluster sum of squares is minimized.
Pros: Simple algorithm, fast computation, works extremely well for spherical, similarly sized clusters.
Cons: Requires the number of clusters K to be specified beforehand, sensitive to initial centroid placement, performs poorly on non-spherical clusters and with outliers.
The Objective Function of K-Means
K-Means aims to minimize the Sum of Squared Errors (SSE) within all clusters, also known as Inertia: \[ \text{SSE} = \sum_{k=1}^{K} \sum_{x_n \in C_k} ||x_n - \mu_k||^2 \]
\(\mu_k\) is the centroid of the \(k\)-th cluster.
\(||x_n - \mu_k||^2\) is the squared Euclidean distance from data point \(x_n\) to the centroid of its assigned cluster.
Each step of the algorithm (assignment and update) is designed to reduce this total error value.
The Iterative Process of the K-Means Algorithm
The K-Means procedure is like a game of “capturing territory,” repeated in a four-step cycle (a minimal code sketch follows the step-by-step walkthrough below):
Initialize: Randomly select K data points as the initial cluster centers.
Assign: For each data point, calculate its distance to all K cluster centers and assign it to the nearest one.
Update: For each cluster, recalculate its center (i.e., the mean of all data points in that cluster).
Repeat: Repeat steps 2 and 3 until the cluster centers no longer change significantly, or the assignment of data points no longer changes.
K-Means Step 1/4: Initialization
Assume K=3. We randomly select three points in the data space as the initial centroids.
K-Means Step 2/4: Assignment
Based on the distance of each point to these three initial centers, color them with the color of the nearest center.
K-Means Step 3/4: Update
For each colored cluster, calculate the mean of all its points and move the cluster center to this new location.
K-Means Step 4/4: Convergence
Repeat the ‘Assign-Update’ steps until the cluster centers no longer move. At this point, we have our final clustering result.
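For intuition, here is a minimal NumPy sketch of this assign/update loop. It is a bare-bones illustration (purely random initialization, single run), not scikit-learn's implementation:

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-Means: random initialization, then alternate assign / update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # 1. initialize
    for _ in range(n_iter):
        # 2. assign each point to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. update each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                     # 4. stop when centers settle
            break
        centers = new_centers
    return labels, centers
```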
The Achilles’ Heel of K-Means: Initialization Matters
Random initialization can lead to suboptimal or even completely wrong clustering results.
Problem: If the initial centers are chosen poorly (e.g., all within the same true cluster), the algorithm can get stuck in a local optimum and fail to find the global optimum.
Solution: K-Means++
Instead of picking all centers purely at random, it picks only the first center at random; each subsequent center is sampled with probability proportional to its squared distance from the nearest already-chosen center, so the initial centers tend to end up far apart from each other.
This greatly increases the probability of finding the global optimum. In scikit-learn, init='k-means++' is the default, so we usually don’t need to worry.
The K-Means++ Initialization Strategy Explained
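Since the original diagram is not reproduced here, the following sketch captures the core seeding idea (scikit-learn's actual implementation adds further refinements, such as trying several candidate points per step):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ seeding sketch: each new center is sampled with probability
    proportional to its squared distance from the nearest existing center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                  # first center: uniform at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                            # far-away points are more likely
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```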
The Most Critical Question: How to Choose the Value of K?
The K-Means algorithm itself cannot tell us what the optimal K is. This is a hyperparameter that we must determine using domain knowledge and data insights.
Fortunately, we have two common heuristic methods to aid this decision:
The Elbow Method
Silhouette Analysis
Choosing K (1): The Elbow Method
Idea: Try different values of K (e.g., from 2 to 10) and calculate the Sum of Squared Errors (SSE, a.k.a. Inertia) for each K.
As K increases, SSE will necessarily decrease. We are looking for the ‘inflection point’ where the rate of decrease slows down significantly—the ‘elbow’.
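A minimal sketch on toy blob data (the dataset and the K range are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for a real, already-scaled feature matrix
X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=42)

k_values = range(2, 11)
inertias = [KMeans(n_clusters=k, random_state=42, n_init="auto").fit(X).inertia_
            for k in k_values]

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia (SSE)")
plt.title("Elbow Method")
plt.show()
```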
Choosing K (2): Silhouette Analysis
Idea: For each tested K value, calculate the average silhouette coefficient of the resulting clustering. We choose the K value that yields the highest average silhouette coefficient.
Compared to the Elbow Method, silhouette analysis considers both intra-cluster cohesion and inter-cluster separation, making it a generally more reliable indicator.
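And the corresponding sketch for silhouette-based selection (same illustrative toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42, n_init="auto").fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best K by average silhouette: {best_k} (score {scores[best_k]:.3f})")
```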
Case Study: Mall Customer Segmentation
We’ll use a simulated customer dataset containing annual income and spending scores. Our goal is to identify distinct customer segments.
1. Data Exploration
Table 1: Sample Customer Data
|   | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---:|---:|:---|---:|---:|---:|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
2. Finding the Optimal K
Before applying K-Means, we use the Elbow Method and Silhouette Analysis to determine the best number of clusters, K. Both methods point to K=5 as a strong choice.
3. Python Implementation
Now we run K-Means with our chosen value, K=5.
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 1. Load and prepare the data
url = 'https://raw.githubusercontent.com/subashgandyer/datasets/main/mall_customers.csv'
# For demonstration purposes, add error handling for the URL
try:
    df = pd.read_csv(url)
except Exception:
    # If the URL fails, create mock data
    print("Failed to load data from URL, using mock data instead.")
    data = {'Annual Income (k$)': np.random.randint(15, 140, 200),
            'Spending Score (1-100)': np.random.randint(1, 100, 200)}
    df = pd.DataFrame(data)

X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# 2. Feature Scaling (a crucial step)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Run K-Means (we have determined K=5)
kmeans = KMeans(n_clusters=5, random_state=42, n_init='auto')

# Use the scaled data for training and prediction
df['Cluster'] = kmeans.fit_predict(X_scaled)
```
```
Failed to load data from URL, using mock data instead.
```
4. Visualizing the Results
Let’s visualize the clustering results to see what customer segments we have discovered.
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Map cluster numbers to meaningful names for the legend
cluster_map = {0: 'Standard', 1: 'Target', 2: 'Careful', 3: 'Frugal', 4: 'Potential'}
# Note: This mapping is based on a visual inspection of the plot from a specific `random_state`.
# The cluster numbers might change if the state changes, but the five groups will be the same.
df['Segment'] = df['Cluster'].map(cluster_map)

# Get centroids and inverse transform them to the original scale for plotting
centroids_scaled = kmeans.cluster_centers_
centroids = scaler.inverse_transform(centroids_scaled)

# Create the plot
plt.style.use('seaborn-v0_8-whitegrid')
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df, x='Annual Income (k$)', y='Spending Score (1-100)',
    hue='Segment', palette='viridis', s=100, alpha=0.8, edgecolor='w'
)

# Plot the centroids
plt.scatter(
    centroids[:, 0], centroids[:, 1],
    s=250, c='red', marker='X', edgecolor='black', label='Centroids'
)

# Beautify the plot
plt.title('Customer Segments', fontsize=16)
plt.xlabel('Annual Income (k$)', fontsize=12)
plt.ylabel('Spending Score (1-100)', fontsize=12)
plt.legend(title='Customer Segment', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
```
Figure 1: The five distinct customer segments identified by K-Means.
5. Deriving Business Insights
The clustering result clearly reveals five distinct customer segments:
Careful: High income, low spending.
Standard: Medium income, medium spending.
Potential: Low income, high spending (likely students or young professionals).
Target: High income, high spending (prime customers).
Frugal: Low income, low spending.
This segmentation provides a clear, data-driven basis for the mall’s targeted marketing, membership management, and resource allocation strategies.
Algorithm 2: Hierarchical Clustering
Unlike K-Means, which partitions data into K clusters in one go, Hierarchical Clustering creates a hierarchy of clusters.
Core Idea: It doesn’t produce a single clustering but a tree-like structure (a Dendrogram) that shows how data points are successively merged (or split).
Two Main Strategies:
Agglomerative: Bottom-up. Starts with each point as its own cluster, then progressively merges the most similar clusters.
Divisive: Top-down. Starts with all points in one cluster, then progressively splits the most dissimilar ones.
The Agglomerative Hierarchical Clustering Workflow
Start: Treat each data point as an individual cluster.
Merge: Find the two closest clusters among all clusters and merge them into a new one.
Repeat: Repeat step 2 until all data points are merged into a single cluster.
The entire history of these merges forms a tree-like hierarchical structure.
The Key Question: How to Measure ‘Cluster-to-Cluster’ Distance?
This is defined by the Linkage Criterion. There are three main types:
Single Linkage: the distance between the closest pair of points, one from each cluster.
Complete Linkage: the distance between the farthest pair of points, one from each cluster.
Average Linkage: the average distance over all pairs of points across the two clusters.
Visualizing Hierarchical Clustering: The Dendrogram
The dendrogram is the signature output of hierarchical clustering. It visually represents the merging process.
Vertical Axis: Represents the distance or dissimilarity between clusters. The height of a merge indicates how dissimilar the two merged clusters were.
Horizontal Axis: Represents the individual data points.
We can obtain a specific number of clusters by making a horizontal ‘cut’ across the dendrogram at a certain height.
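A minimal sketch using SciPy on toy data (Ward linkage is one common choice; the dataset is illustrative):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# Agglomerative clustering: record the full merge history
Z = linkage(X, method="ward")

plt.figure(figsize=(8, 4))
dendrogram(Z)
plt.title("Dendrogram")
plt.ylabel("Merge distance")
plt.show()

# 'Cut' the tree to obtain a flat clustering with 3 clusters (labels are 1..3)
labels = fcluster(Z, t=3, criterion="maxclust")
```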
Algorithm 3: Density-Based Clustering (DBSCAN)
When K-Means Fails: K-Means assumes clusters are spherical. It struggles with irregularly shaped clusters (like crescents or rings).
Enter DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Core Idea: Groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions.
Pros: Does not require specifying the number of clusters, can find arbitrarily shaped clusters, and is robust to noise.
Cons: Sensitive to the choice of its two core parameters (eps and min_samples), struggles with datasets of varying density.
The Core Concepts of DBSCAN
DBSCAN’s logic relies on three key definitions, controlled by two parameters: eps (a neighborhood radius, ε) and min_samples (a minimum number of points).
Core point: a point with at least min_samples points (itself included) within its ε-neighborhood.
Border point: a point that is not a core point but lies within the ε-neighborhood of a core point.
Noise point: a point that is neither a core point nor a border point.
The Intuitive Flow of the DBSCAN Algorithm
The DBSCAN process is like a “snowball effect”:
Randomly select an unvisited point P.
Check if P is a core point.
If it is: Start a new cluster with P. Then, find all density-reachable points through P’s ε-neighborhood (including other core and border points) and add them all to this cluster. This process expands like a chain reaction until the entire density-connected region is covered.
If it is not (i.e., it’s a border or noise point): Temporarily mark it as noise and move to the next point.
Repeat steps 1 and 2 until all points have been visited.
Case Study: DBSCAN Shines Where K-Means Fails
Let’s look at an example that K-Means cannot handle but DBSCAN solves perfectly: the “two moons” dataset.
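A minimal sketch of that comparison (the eps and min_samples values are illustrative, tuned for this toy dataset):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# The classic 'two moons' shape that K-Means cannot separate cleanly
X, _ = make_moons(n_samples=400, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# K-Means splits the moons with a straight boundary
kmeans_labels = KMeans(n_clusters=2, random_state=42, n_init="auto").fit_predict(X)

# DBSCAN follows the density of the crescents; -1 marks noise points
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = list(db.labels_).count(-1)
print(f"DBSCAN found {n_clusters} clusters and {n_noise} noise points")
```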
Algorithm 4: Gaussian Mixture Model (GMM)
Beyond ‘Hard’ Clustering: K-Means performs a ‘black-or-white’ hard assignment for each point. But sometimes, a point might lie on the boundary of two clusters, and we’d prefer a probabilistic description.
Gaussian Mixture Model (GMM) offers a Soft Clustering approach.
Core Idea: Assumes that all data points are generated from a mixture of K different Gaussian (normal) distributions. The clustering process is about finding the parameters (mean, variance, weight) of these K distributions.
Result: GMM doesn’t directly tell us which cluster each point belongs to; instead, it gives us the probability that the point belongs to each cluster.
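A minimal sketch of this soft assignment with scikit-learn (toy data, illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

# Soft assignment: one probability per component for each point; rows sum to 1
probs = gmm.predict_proba(X[:3])
print(np.round(probs, 3))

# A hard label is still available if needed (argmax of the probabilities)
labels = gmm.predict(X)
```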
GMM vs. K-Means: A Comparison
| Feature | K-Means | Gaussian Mixture Model (GMM) |
|:---|:---|:---|
| Cluster Shape | Assumes spherical (circular) | Can adapt to elliptical shapes |
| Assignment | Hard Assignment | Soft Assignment (Probabilistic) |
| Mathematical Basis | Distance-based | Probability-based (Expectation-Maximization) |
| Flexibility | Lower | Higher, better at fitting complex data |
GMM can be seen as a probabilistic generalization of K-Means.
Case Study: GMM Excels at Fitting Elliptical Clusters
When cluster shapes are not circular but elliptical, the advantages of GMM become apparent.
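A minimal sketch of such a comparison, using an anisotropic (stretched) toy dataset. The transformation matrix and random seeds are illustrative, and the exact scores will vary:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Stretch spherical blobs into elongated, elliptical clusters
X, y_true = make_blobs(n_samples=450, centers=3, random_state=170)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])   # anisotropic transformation

km_labels = KMeans(n_clusters=3, random_state=42, n_init="auto").fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, covariance_type="full",
                             random_state=42).fit_predict(X)

# Compare both results against the known generating labels (higher is better)
print(f"K-Means ARI: {adjusted_rand_score(y_true, km_labels):.3f}")
print(f"GMM ARI:     {adjusted_rand_score(y_true, gmm_labels):.3f}")
```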
## Summary: How to Choose the Right Clustering Algorithm?
Today we've learned four major clustering algorithms, each with its own strengths and weaknesses. In practice, there is no 'best' algorithm, only the 'most suitable' one.
| Algorithm | Key Advantages | Key Disadvantages | Best For... |
|:---|:---|:---|:---|
| **K-Means** | Fast, simple, scalable | Requires K, sensitive to spherical assumption | Large datasets with simple, well-separated clusters |
| **Hierarchical** | No need for K, provides hierarchy | Computationally expensive (O(N^2))| Small datasets, exploring data structure |
| **DBSCAN** | No need for K, finds any shape | Sensitive to parameters, struggles with varied density | Non-spherical clusters, datasets with noise |
| **GMM** | Soft clustering, flexible shapes | Complex, high computational cost | Probabilistic output needed, overlapping clusters |
## An Algorithm Selection Decision Tree
Here is a simplified decision-making flowchart to help you choose an appropriate clustering algorithm in practice.
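Since the flowchart graphic is not reproduced here, the following rough sketch encodes the same decision logic as the summary table above (simplified and not authoritative; real projects should still compare several algorithms empirically):

```python
def suggest_algorithm(know_k: bool, non_spherical: bool, need_probabilities: bool,
                      large_dataset: bool) -> str:
    """Rough, illustrative mapping of the summary table to a recommendation."""
    if non_spherical:
        return "DBSCAN"                  # arbitrary shapes, robust to noise
    if need_probabilities:
        return "Gaussian Mixture Model"  # soft, probabilistic assignments
    if not know_k and not large_dataset:
        return "Hierarchical clustering" # explore structure via the dendrogram
    return "K-Means"                     # fast, simple default for large data

print(suggest_algorithm(know_k=True, non_spherical=False,
                        need_probabilities=False, large_dataset=True))  # K-Means
```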
The Business Value of Cluster Analysis
Cluster analysis is more than just a set of algorithms; it’s a powerful tool for business insight.
Marketing: Identify customer segments for targeted advertising and personalized recommendations (e.g., Amazon, Netflix).
Finance: Detect unusual transaction patterns to identify potential fraud (e.g., American Express).
Urban Planning: Group city areas by resident travel patterns to create functional zones (e.g., Smart City projects).
Bioinformatics: Cluster gene expression data to discover groups of genes with similar functions (e.g., Cancer research).
Social Networks: Discover communities and interest groups for friend recommendations (e.g., Facebook, LinkedIn).
Mastering clustering means mastering the key to finding structure and creating value from vast amounts of data.