K-means Clustering

ML_AI/Claude/images/k-means-1.png

K-Means is a centroid-based, iterative partition algorithm that groups unlabeled data into K distinct, non-overlapping clusters. It is the "workhorse" of clustering due to its efficiency and simplicity.

Key Concepts

How it Works (The Algorithm)

  1. Initialization: Choose K points as initial centroids.
  2. Assignment: For each data point xi, find the closest centroid cj (usually using Euclidean distance).
  3. Centroid Update: For each cluster, calculate the mean of all points assigned to it. This mean becomes the new centroid.
  4. Convergence: Repeat steps 2 and 3 until centroids no longer move significantly or a maximum number of iterations is reached.

Internal Evaluation Metrics

During the iterative process, the algorithm monitors two key metrics:

  1. Within-Cluster Sum of Squares (WCSS) / Inertia:

    • Measures Cohesion (how tight the clusters are).
    • Goal: Minimize this value.
    • During each iterations, WCSS decreases.
  2. Between-Cluster Sum of Squares (BCSS):

    • Measures Separation (how far apart clusters are).
    • Goal: Maximize this value.
    • During each iterations, BCSS increases.

Finding the Optimal 'K' (Model Selection)

1. The Elbow Method

The Elbow method is used to find the "sweet spot" for K.

  1. Run K-Means for a range of K values (e.g., 1 to 10).
  2. Calculate the SSE (Inertia) for each K.
  3. Plot K vs. SSE.
  4. Inflection Point: Identify the "elbow"—the point where the rate of decrease in SSE slows down significantly.

2. Silhouette Analysis

While the Elbow method is popular, Silhouette Analysis is often more robust as it considers both cohesion and separation.

Best Practices for Data Scientists

  1. Always Scale Features: K-Means is extremely sensitive to feature magnitude due to Euclidean distance. Use normalization/standardization is MUST.
  2. K-Means++ Initialization: Standard random initialization can lead to poor local minima. Use init='k-means++'.
  3. Multiple Runs: Use the n_init parameter to run the algorithm with different starting points and pick the one with the lowest inertia.
  4. Post-Processing: Analyze cluster centroids to understand the meaning of each group.

Limitations of K-Means

While K-Means is widely used due to its simplicity and speed, it has several significant limitations:

1. Sensitivity to Initial Centroid Selection:
2. The "K" Problem:
3. Assumes Spherical Clusters:
4. Sensitive to Outliers:
5. Sensitive to Feature Scaling:
6. Hard Assignment (Non-Probabilistic):
7. Inability to Handle Categorical Data:
8. Varying Densities and Sizes:
9. Curse of Dimensionality:

Time Complexity

The time complexity of K-Means is O(n×K×I×D), where:

Generally, K-Means is considered O(n) (linear) with respect to data size, making it one of the most scalable clustering algorithms available.