Clustering ➛ "Evaluation"

Unlike supervised learning, unsupervised learning has no ground-truth labels, which makes evaluation more challenging. We therefore rely on internal validation metrics that assess cluster quality from the geometry of the data alone.

Overview of Evaluation Metrics

Clustering evaluation metrics generally focus on two key aspects:

Cohesion (Compactness): How close are points within the same cluster?
Separation: How distinct are different clusters from each other?

Main Evaluation Metrics:

Sum of Squared Error (SSE)
Silhouette Score
Dunn Index
Davies-Bouldin Index
Calinski-Harabasz Index

1. Sum of Squared Error (SSE)

Also known as ➛ Within-Cluster Sum of Squares (WCSS)

The Sum of Squared Error (SSE) quantifies the compactness of clusters by measuring the total squared distance between each point and its cluster centroid.

Key Characteristics:

Lower SSE = Better clustering (more compact clusters).
Primary metric for K-Means optimization.
Used in the Elbow Method to find optimal number of clusters.

Formula:

1. Cluster Centroid ( $c_{i}$ ):

$$c_{i} = \frac{1}{m_{i}} \sum_{x \in C_{i}} x$$

Where:

$c_{i}$ : The centroid of the $i^{\text{th}}$ cluster ( $C_{i}$ )
$m_{i}$ : Number of data points in cluster ( $C_{i}$ )
$x \in C_{i}$ : Data points belonging to cluster ( $C_{i}$ )

2. Sum of Squared Errors (SSE)

$$SSE = \sum_{i = 1}^{K} \sum_{x \in C_{i}} \text{dist}(c_{i}, x)^{2}$$

Where:

$K$ : Total number of clusters
$C_{i}$ : The $i^{\text{th}}$ cluster
$\text{dist}(c_{i}, x)$ : Euclidean distance between the centroid $c_{i}$ and a point $x$ within cluster $C_{i}$ .
$SSE$ : Sum of squared errors, which measures intra-cluster variability
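
The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `centroid` and `sse` and the four sample points are assumptions made for the example, and clusters are represented as lists of coordinate tuples.

```python
from math import dist

def centroid(points):
    """Mean of a list of equal-length coordinate tuples (the c_i formula)."""
    m = len(points)
    return tuple(sum(coords) / m for coords in zip(*points))

def sse(clusters):
    """Total squared Euclidean distance from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(dist(c, x) ** 2 for x in points)
    return total

# Illustrative data: two clusters of two points each.
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],  # centroid (1, 2), each point at distance 1
    [(5.0, 5.0), (7.0, 5.0)],  # centroid (6, 5), each point at distance 1
]

print(centroid(clusters[0]))  # (1.0, 2.0)
print(sse(clusters))          # 4.0
```

Every point lies exactly one unit from its centroid, so each contributes 1 to the sum and the total is 4.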

➢ Interpretation & Usage:

Lower SSE: Points are closer to their centroids → more compact clusters
Higher SSE: Points are spread out → loose clusters
For the best partition at each K, SSE never increases as K grows, and it reaches 0 when every point is its own cluster (K = n). SSE alone therefore cannot pick K; the Elbow Method looks for the K beyond which the decrease levels off.
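
To see why SSE cannot choose K on its own, the sketch below evaluates SSE on hand-picked partitions of six one-dimensional points. The data and the partitions are assumptions chosen for illustration; they are not the output of K-Means.

```python
from math import dist

def sse(clusters):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for points in clusters:
        c = tuple(sum(coords) / len(points) for coords in zip(*points))
        total += sum(dist(c, x) ** 2 for x in points)
    return total

# Three visually obvious groups: {1, 2}, {10, 11}, {20, 21}.
pts = [(1.0,), (2.0,), (10.0,), (11.0,), (20.0,), (21.0,)]

# Hand-picked partitions of the same six points (illustrative only).
partitions = {
    1: [pts],
    2: [pts[:4], pts[4:]],
    3: [pts[:2], pts[2:4], pts[4:]],
    6: [[p] for p in pts],
}

sse_by_k = {k: sse(clusters) for k, clusters in partitions.items()}
for k, value in sse_by_k.items():
    print(f"K={k}: SSE={value:.2f}")
```

SSE falls from about 362.83 (K=1) to 82.50 (K=2) to 1.50 (K=3) and finally to 0 at K=6. The sharp drop ends at K=3, which is the "elbow"; the further reduction to 0 only reflects that each point has become its own cluster.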

Advantages:

Simple and intuitive
Fast to compute
Directly optimized by K-means algorithm

Disadvantages:

Always decreases with more clusters (need elbow method)
Sensitive to outliers
Only measures compactness, not separation
Assumes spherical clusters of similar sizes

2. Silhouette Score

The Silhouette Score is a metric used to evaluate the quality (goodness) of clustering. It measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

Formula:

1. Silhouette Score for a Single Point $i$ :

$$s(i) = \frac{b_{i} - a_{i}}{\max(a_{i}, b_{i})}$$

Where:

$a_{i}$ : The average distance from point $i$ to all other points in the same cluster (intra-cluster distance)
$b_{i}$ : The smallest average distance from point $i$ to all points in another cluster, i.e., the nearest cluster (inter-cluster distance)
$\max(a_{i}, b_{i})$ : Maximum of $a_{i}$ and $b_{i}$ used for normalization

2. Silhouette Score for the Entire Dataset:

To compute the silhouette score for an entire clustering solution, take the mean of $s(i)$ over all $n$ data points:

$$S = \frac{1}{n} \sum_{i = 1}^{n} s(i)$$
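
The per-point and dataset-level formulas can be sketched directly from their definitions. The function name `silhouette_values` and the sample data are assumptions for illustration. The sketch also assumes at least two clusters and no exact duplicate points that would make $\max(a_i, b_i)$ zero, and it assigns $s(i) = 0$ to a point in a singleton cluster, which is a common convention rather than something stated above.

```python
from math import dist

def silhouette_values(clusters):
    """Return s(i) for every point, listed cluster by cluster."""
    scores = []
    for k, own in enumerate(clusters):
        for idx, p in enumerate(own):
            if len(own) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            # a_i: mean distance to the other points in the same cluster
            a = sum(dist(p, q) for j, q in enumerate(own) if j != idx) / (len(own) - 1)
            # b_i: smallest mean distance to the points of any other cluster
            b = min(
                sum(dist(p, q) for q in other) / len(other)
                for m, other in enumerate(clusters)
                if m != k
            )
            scores.append((b - a) / max(a, b))
    return scores

# Two tight, well-separated 1-D clusters.
good = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
s_good = silhouette_values(good)
S = sum(s_good) / len(s_good)
print(round(S, 4))  # 0.8997

# Same layout, but the point 10 is placed in the wrong cluster.
bad = [[(0.0,), (1.0,), (10.0,)], [(11.0,), (12.0,)]]
s_bad = silhouette_values(bad)
print(round(s_bad[2], 4))  # -0.8421
```

In the second partition the point at 10 has $a_i = 9.5$ and $b_i = 1.5$, so its score is negative: it sits closer to the neighboring cluster than to its own.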

➢ Detailed Explanation:

Cohesion ( $a_{i}$ ):
- Measures how similar point $i$ is to other points within its own cluster
- A lower $a_{i}$ value indicates that point $i$ is close to its own cluster members, meaning it is well-clustered (dense cluster)
Separation ( $b_{i}$ ):
- Measures the average distance from point $i$ to points in the nearest neighboring cluster
- A higher $b_{i}$ value indicates that point $i$ is far from other clusters, which is desirable for good clustering
Silhouette Score ( $s (i)$ ):
- Combines both cohesion and separation into a single normalized metric
- When $a_{i} < b_{i}$ : positive score, point is well-clustered
- When $a_{i} > b_{i}$ : negative score, point might belong to another cluster
- When $a_{i} = b_{i}$ : zero score, point is on the boundary
Dataset Silhouette Score ( $S$ ):
- Average of silhouette scores of all points
- Provides an overall quality measure for the clustering solution

➢ Interpretation Guidelines:

$S > 0.7$ : Strong clustering structure
$0.5 < S \leq 0.7$ : Reasonable clustering structure
$0.25 < S \leq 0.5$ : Weak clustering structure, some overlap
$0 \leq S \leq 0.25$ : No substantial clustering structure
$S < 0$ : Points are, on average, closer to a neighboring cluster than to their own; many are likely misassigned.

Advantages:

Intuitive and easy to interpret
Considers both cohesion and separation
Can identify misclassified points (negative scores)
Works with any distance metric
Can be visualized per cluster

Disadvantages:

Computationally expensive: $O (n^{2})$ for n data points
Biased towards convex clusters
Can be misleading for complex cluster shapes (e.g., concentric circles)
Sensitive to cluster size imbalance

Recap

$a_{i}$ = average intra-cluster distance (within cluster)
$b_{i}$ = average inter-cluster distance (to nearest different cluster)
We want $a_{i}$ to be small (tight cluster) and $b_{i}$ to be large (well-separated clusters)

3. Dunn Index

The Dunn Index is a clustering validation index used to evaluate the clustering quality.

It analyzes the compactness and separation of clusters in a dataset.
A higher Dunn Index indicates better clustering, as it favors clusters that are compact (low intra-cluster variation) and well-separated from one another (high inter-cluster distance).

Formula

Refer 👉 permetrics.readthedocs.io ➛ Dunn Index

Let $d_{\min}$ denote the minimal distance between points of different clusters
and $d_{\max}$ the largest within-cluster distance.

The Dunn index ( $C$ ) is defined as the quotient of $d_{\min}$ and $d_{\max}$ :

$$C = \frac{d_{\min}}{d_{\max}}$$
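
A brute-force sketch of this definition follows. The function name `dunn_index` and the sample points are assumptions for illustration. The sketch assumes at least two clusters and at least one cluster containing two distinct points, so that the largest within-cluster distance is defined and non-zero.

```python
from itertools import combinations
from math import dist

def dunn_index(clusters):
    """Smallest between-cluster point distance over largest within-cluster distance."""
    d_min = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    d_max = max(
        dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return d_min / d_max

# Illustrative 1-D data: closest cross-cluster pair is (1, 10), widest cluster spans 2.
clusters = [[(0.0,), (1.0,)], [(10.0,), (12.0,)]]
print(dunn_index(clusters))  # 4.5
```

Both the numerator and the denominator scan pairs of points, which is why the index becomes expensive on large datasets and why a single outlier can change either term.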

➢ Interpretation:

Higher Dunn Index = Better clustering (well-separated and compact clusters)
Lower Dunn Index = Poor clustering (overlapping or loose clusters)
Range: $[0, \infty)$ ; values above 1 mean that even the closest pair of points from different clusters is farther apart than the widest cluster

Advantages:

Intuitive interpretation
Considers both separation and compactness

Disadvantages:

Computationally expensive for large datasets
Sensitive to outliers
May not work well with clusters of different sizes or densities

4. Davies-Bouldin Index (DBI)

The Davies-Bouldin Index (DBI) is a metric designed to evaluate the quality of clustering results by analyzing both the intra-cluster similarity (compactness) and inter-cluster separation (distinctiveness).

Application

Single Metric for Quality: DBI combines compactness and separation into a single numeric score, where lower values indicate better clustering.
Model Comparison: Can be used to compare the performance of different clustering algorithms on the same dataset (e.g., K-Means, DBSCAN).
Hyperparameter Tuning: Helps in selecting the optimal number of clusters ( $K$ ) by minimizing the DBI score.
Clustering Validation: Provides an objective measure to determine how well the algorithm has segmented the data.

How Does DBI Work?

The Davies-Bouldin Index quantifies the ratio of:

Intra-cluster distance (compactness):
- Measures how tightly grouped the points within each cluster are (clusters should be compact).
Inter-cluster distance (separation):
- Measures how far apart different clusters are from each other (clusters should be well-separated).

Formula:

$$DB = \frac{1}{K} \sum_{i = 1}^{K} \max_{j \neq i} \left( \frac{S_{i} + S_{j}}{d(c_{i}, c_{j})} \right)$$

Where:

$K$ : Number of clusters
$S_{i}$ : Average distance of all points in cluster $C_{i}$ to its centroid $c_{i}$ $$S_i = \frac{1}{|C_i|} \sum_{x \in C_i} \text{dist}(x, c_i)$$
$d (c_{i}, c_{j})$ : Distance between centroids of clusters $C_{i}$ and $C_{j}$
The $\max$ selects, for each cluster $C_{i}$ , the "worst case": the other cluster it is most similar to.
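
The formula can be sketched as follows, using the average point-to-centroid distance for $S_i$ and Euclidean distance between centroids. The function names and the three sample clusters are assumptions for illustration; the sketch assumes at least two clusters with distinct centroids.

```python
from math import dist

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def davies_bouldin(clusters):
    cents = [centroid(pts) for pts in clusters]
    # S_i: mean distance from the points of cluster i to its centroid
    scatter = [
        sum(dist(x, c) for x in pts) / len(pts)
        for pts, c in zip(clusters, cents)
    ]
    k = len(clusters)
    # For each cluster, keep the ratio with its most similar other cluster.
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k

# Illustrative data: every cluster has S_i = 1; centroids are (0,1), (10,1), (0,21).
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
    [(0.0, 20.0), (0.0, 22.0)],
]
db = davies_bouldin(clusters)
print(round(db, 4))  # 0.1667
```

Here the worst-case ratios are 0.2, 0.2 and 0.1, so the index is their mean, 0.5 / 3. Moving the clusters closer together or making them more spread out would raise the value.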

Interpretation:

Lower DBI = Better clustering (compact clusters that are far apart)
Higher DBI = Poor clustering (scattered or overlapping clusters)
Range: $[0, \infty)$

Advantages:

Faster to compute than Dunn Index
Considers all clusters, not just extremes
The low complexity and interpretability of DBI make it an important clustering evaluation metric

Disadvantages:

Assumes clusters are convex and isotropic
Biased towards spherical clusters

5. Calinski-Harabasz Index (Variance Ratio Criterion)

The Calinski-Harabasz Index (also known as the Variance Ratio Criterion) evaluates clustering based on the ratio of between-cluster variance to within-cluster variance.

Formula:

$$CH = \frac{SS_{B} / (K - 1)}{SS_{W} / (n - K)}$$

Where:

$K$ : Number of clusters
$n$ : Total number of data points
$SS_{B}$ : Between-cluster sum of squares (variance between clusters) $SS_{B} = \sum_{i = 1}^{K} |C_{i}| \cdot \text{dist}(c_{i}, c)^{2}$
$SS_{W}$ : Within-cluster sum of squares (same as SSE) $SS_{W} = \sum_{i = 1}^{K} \sum_{x \in C_{i}} \text{dist}(x, c_{i})^{2}$
$c$ : Global centroid (mean of all data points)
$c_{i}$ : Centroid of cluster $C_{i}$
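
The ratio can be computed directly from the definitions of $SS_B$ and $SS_W$. The sketch below is illustrative only: the function names and the four sample points are assumptions, and it requires $1 < K < n$ and a non-zero $SS_W$ so that both denominators are defined.

```python
from math import dist

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def calinski_harabasz(clusters):
    all_points = [x for pts in clusters for x in pts]
    n, k = len(all_points), len(clusters)
    c = centroid(all_points)  # global centroid
    # Between-cluster sum of squares, weighted by cluster size
    ss_b = sum(len(pts) * dist(centroid(pts), c) ** 2 for pts in clusters)
    # Within-cluster sum of squares (the SSE)
    ss_w = sum(dist(x, centroid(pts)) ** 2 for pts in clusters for x in pts)
    return (ss_b / (k - 1)) / (ss_w / (n - k))

# Illustrative 1-D data: centroids 1 and 11, global centroid 6.
clusters = [[(0.0,), (2.0,)], [(10.0,), (12.0,)]]
print(calinski_harabasz(clusters))  # 50.0
```

For this data $SS_B = 2 \cdot 25 + 2 \cdot 25 = 100$ and $SS_W = 4$, giving $(100 / 1) / (4 / 2) = 50$.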

Interpretation:

Higher CH Index = Better clustering (tight, well-separated clusters)
Lower CH Index = Poor clustering
No fixed range, depends on dataset

Advantages:

Fast to compute
Works well for convex clusters
Can be used to find optimal number of clusters

Disadvantages:

Biased towards convex clusters
Not suitable for non-convex or density-based clusters

Comparison of Evaluation Metrics

Metric	Range	Best Value	Pros	Cons	Use Case
SSE	$[0, \infty)$	Lower	Simple, fast	Decreases with K, no normalization	K-Means optimization
Silhouette	$[-1, 1]$	Closer to 1	Intuitive, considers both cohesion & separation	Slow for large datasets	General clustering evaluation
Dunn Index	$[0, \infty)$	Higher	Intuitive, considers both separation and compactness	Very slow, sensitive to outliers	Small datasets with clear separation
Davies-Bouldin	$[0, \infty)$	Lower	Fast, intuitive	Assumes spherical clusters	Quick evaluation, spherical clusters
Calinski-Harabasz	$[0, \infty)$	Higher	Very fast, good for finding K	Biased to convex clusters	Large datasets, finding optimal K

Choosing the Right Metric

flowchart TD
    Start(["Need to Evaluate<br/>Clustering"]) --> Q1{"Large Dataset?<br/>>10,000 points"}

    Q1 -->|Yes| Q2{"Need to find<br/>optimal K?"}
    Q1 -->|No| Q3{"Know cluster<br/>shapes?"}

    Q2 -->|Yes| CH["✓ Calinski-Harabasz<br/>Fast & effective for K selection"]
    Q2 -->|No| DB["✓ Davies-Bouldin<br/>Fast evaluation"]

    Q3 -->|Spherical| SIL1["✓ Silhouette Score<br/>Best overall metric"]
    Q3 -->|Arbitrary| Q4{"High separation<br/>expected?"}

    Q4 -->|Yes| DUNN["✓ Dunn Index<br/>Emphasizes separation"]
    Q4 -->|No| SIL2["✓ Silhouette Score<br/>Balanced evaluation"]

    CH --> Tip1["Tip: Use with SSE<br/>for elbow method"]
    DB --> Tip2["Tip: Assumes convex<br/>clusters"]
    SIL1 --> Tip3["Tip: Computationally<br/>intensive"]
    DUNN --> Tip4["Tip: Very slow,<br/>sensitive to outliers"]
    SIL2 --> Tip5["Tip: Most reliable<br/>for various shapes"]
    
    style Start fill:#FFE5E5
    style CH fill:#F3E5FF
    style DB fill:#FFE5F3
    style SIL1 fill:#E5FFE5
    style SIL2 fill:#E5FFE5
    style DUNN fill:#FFF5E5
    style Tip1 fill:#F3E5FF,stroke:#9370DB
    style Tip2 fill:#FFE5F3,stroke:#DB70AB
    style Tip3 fill:#E5FFE5,stroke:#70DB70
    style Tip4 fill:#FFF5E5,stroke:#DBAE70
    style Tip5 fill:#E5FFE5,stroke:#70DB70

Best Practices for Clustering Evaluation

Use Multiple Metrics: Never rely on a single metric. Different metrics capture different aspects of clustering quality.
Consider the Context:
- For K-Means: Use SSE (Elbow method) + Silhouette Score
- For DBSCAN: Use Silhouette Score + manual inspection
- For Hierarchical: Use Dendrogram + Silhouette Score
Find Optimal K:
- Plot metrics for different values of K
- Look for the "elbow" in SSE
- Choose K with highest Silhouette/CH score or lowest DB score
Domain Knowledge:
- Metrics are tools, not absolute truth
- Validate results with domain expertise
- Consider interpretability and business value
Visual Validation:
- Always visualize clusters (use PCA/t-SNE for high dimensions)
- Check if clusters make semantic sense
- Look for outliers and boundary cases
Computational Considerations:
- Large datasets: Use CH or DB index
- Small datasets: Can afford Silhouette or Dunn index
- Real-time applications: Pre-compute metrics or use approximations
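
The "plot metrics for different values of K" advice can be sketched with scikit-learn, which provides `KMeans` (whose `inertia_` attribute is the SSE) along with `silhouette_score`, `davies_bouldin_score` and `calinski_harabasz_score`. The synthetic data below, three tight Gaussian blobs, and the range of K values are assumptions chosen for illustration; real data will rarely be this clean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data: three well-separated blobs of 50 points each (illustrative).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels = model.labels_
    results[k] = {
        "sse": model.inertia_,  # within-cluster sum of squares
        "silhouette": silhouette_score(X, labels),
        "davies_bouldin": davies_bouldin_score(X, labels),
        "calinski_harabasz": calinski_harabasz_score(X, labels),
    }

for k, row in results.items():
    print(k, {name: round(float(value), 3) for name, value in row.items()})

# One possible selection rule: the K with the highest silhouette score.
best_k = max(results, key=lambda k: results[k]["silhouette"])
print("K with the highest silhouette:", best_k)
```

On data like this the silhouette score peaks at the true number of blobs, while SSE keeps shrinking as K grows. Comparing the columns side by side, rather than trusting a single one, is the point of the exercise.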
