K-Means#
Welcome to K-Means, where your data points get grouped into squads based purely on vibes (technically, Euclidean distance, but same thing).
It’s the “high school cafeteria” of machine learning:
Each data point just wants to sit with others that look like it. 🍕📚
🎯 What It Does#
K-Means partitions your data into K clusters by:
Guessing K random centers (centroids).
Assigning each data point to its nearest centroid — “join the closest gang.”
Recalculating each centroid as the mean of its members.
Repeating steps 2–3 until everyone’s sitting comfortably (or the model gives up).
It’s literally called K-Means because:
K = number of groups
Means = the average of points in each group
Voilà! 🪄 You’ve got clusters.
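To see these four steps in action, here is a tiny, hand-checkable sketch (a toy example, not part of any library) that runs the assign/update loop on four 1-D points:

```python
import numpy as np

points = np.array([1.0, 2.0, 9.0, 10.0])
centroids = np.array([1.0, 2.0])  # step 1: two (deliberately bad) initial centers

for _ in range(5):  # steps 2-3, repeated until things settle (step 4)
    # step 2: assign each point to its nearest centroid
    assignments = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # step 3: recompute each centroid as the mean of its members
    centroids = np.array([points[assignments == j].mean() for j in range(2)])

print(assignments, centroids)  # [0 0 1 1] [1.5 9.5] -- two tidy squads
```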
🧠 Intuition#
Imagine running a store and you want to group customers by behavior:
Some buy luxury stuff 💎
Some hunt for discounts 🏷️
Some browse but never buy 👀
K-Means figures out these segments automatically — no labels needed, just patterns in the data.
📉 The Objective Function#
K-Means minimizes the within-cluster sum of squared distances:
$$
J = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - \mu_i \|^2
$$
Where:

- \(C_i\) = cluster \(i\)
- \(\mu_i\) = centroid of cluster \(i\)
- \(\|x - \mu_i\|^2\) = how far each point is from its squad leader
In short:
“We want everyone close to their centroid, like an emotionally healthy family.” ❤️
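To make the formula concrete, here is a minimal sketch (toy numbers, not from any real dataset) that computes \(J\) by hand for two 1-D clusters whose centroids are their means:

```python
import numpy as np

cluster_1 = np.array([1.0, 2.0])    # centroid mu_1 = 1.5
cluster_2 = np.array([9.0, 10.0])   # centroid mu_2 = 9.5

# Sum of squared distances of each point to its own centroid
J = np.sum((cluster_1 - cluster_1.mean()) ** 2) + np.sum((cluster_2 - cluster_2.mean()) ** 2)
print(J)  # 0.25 + 0.25 + 0.25 + 0.25 = 1.0
```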
🧮 Python Example#
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Example data -- replace with your own 2-D feature array
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Train K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Assign clusters
labels = kmeans.labels_

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            color='red', marker='X', s=200, label='Centroids')
plt.legend()
plt.title("K-Means Clustering: Everyone Finds Their Squad 💅")
plt.show()
```
📈 Choosing the Right K#
You can’t just pick K = 3 because it “feels right.” We use the Elbow Method — plot K vs inertia (a.k.a. how unhappy the points are):
```python
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 10), inertias, marker='o')
plt.title("Elbow Method – Finding the Ideal K")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.show()
```
Look for the “elbow point” — the spot where adding more clusters doesn’t make much difference.
It’s like realizing that having more group chats doesn’t actually make you happier. 💬😅
🧭 Business Examples#
| Industry | Application | Example |
|---|---|---|
| Marketing | Customer segmentation | Identify budget vs premium buyers |
| Finance | Risk profiling | Group investors by behavior |
| HR | Employee clustering | Group by performance or engagement |
| Retail | Inventory planning | Cluster stores by sales pattern |
Basically, if your data has patterns — K-Means will find them (even if you didn’t ask it to).
⚠️ Common Pitfalls#
- 🚩 Sensitive to scale: Always standardize your data first (see the sketch below).
- 🚩 Random initialization: Different seeds = different clusters.
- 🚩 Assumes spherical clusters: If your data looks like a banana, K-Means gets confused. 🍌
- 🚩 Must predefine K: It won't guess for you. (Unlike ChatGPT 😜)
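The first two flags are easy to address in code. Here is a minimal sketch (assuming `X` is your raw feature matrix) that standardizes the features and pins down the randomness:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # zero mean, unit variance per feature

# Fixing random_state makes the result reproducible, and n_init reruns the
# algorithm several times, keeping the run with the lowest inertia.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)
```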
🧪 Practice Exercise#
Load your favorite customer dataset.
Try K-Means with K = 3, 4, 5.
Plot the clusters and centroids.
Use the elbow method to pick the best K.
Write a one-line business insight for each cluster (e.g., “Cluster 2 = loyal, low-spend customers”).
🧍♀️ Quick Recap#
K-Means = find K groups by minimizing within-cluster variance.
Use standardization and elbow method wisely.
Great for segmentation, clustering, and pattern discovery.
Works best when your data behaves (spoiler: it usually doesn’t).
Next up: GMM – The Drama Queen of Clustering: “Why pick one cluster when you can belong to all?” 🎭
K-Means and Clustering

K-Means is an unsupervised clustering algorithm that partitions a dataset into \(k\) clusters by minimizing the variance within each cluster. It assigns each data point to the nearest cluster centroid and iteratively updates the centroids to optimize the objective function.
Objective Function Minimization

- **Objective:** Minimize the within-cluster sum of squares (WCSS), also known as the inertia, which measures the total squared distance between each point and its assigned cluster centroid.
- **Mathematical formulation:** Given a dataset \(X = \{x_1, x_2, \dots, x_n\}\) with \(n\) points in \(\mathbb{R}^p\), and \(k\) clusters with centroids \(\{\mu_1, \mu_2, \dots, \mu_k\}\), the objective function is

  $$
  J = \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij} \, \| x_i - \mu_j \|^2
  $$

  where:
  - \(r_{ij} = 1\) if point \(x_i\) is assigned to cluster \(j\), and \(0\) otherwise.
  - \(\| x_i - \mu_j \|^2\) is the squared distance between point \(x_i\) and centroid \(\mu_j\).
- **Goal:** Find the optimal assignments \(r_{ij}\) and centroids \(\mu_j\) that minimize \(J\).
Lloyd's Algorithm

Lloyd's algorithm is the standard iterative method for solving the K-Means optimization problem. It alternates between two steps until convergence:

1. **Assignment step:** Assign each data point to the nearest centroid based on a distance metric:

   $$
   r_{ij} =
   \begin{cases}
   1 & \text{if } j = \arg\min_{j'} \| x_i - \mu_{j'} \|^2 \\
   0 & \text{otherwise}
   \end{cases}
   $$

2. **Update step:** Recalculate the centroid of each cluster as the mean of all points assigned to it:

   $$
   \mu_j = \frac{\sum_{i=1}^{n} r_{ij} \, x_i}{\sum_{i=1}^{n} r_{ij}}
   $$

- **Convergence:** The algorithm iterates until the centroids stabilize (i.e., no significant change in \(J\) or assignments) or a maximum number of iterations is reached. K-Means is guaranteed to converge to a local minimum, but the solution may not be globally optimal.
- **Challenges:**
  - Sensitive to initial centroid placement, which can lead to suboptimal solutions.
  - Requires specifying \(k\) in advance.
  - Assumes clusters are spherical and of similar size due to the use of distance-based assignments.
Distance Metrics

The choice of distance metric affects how points are assigned to clusters. Common metrics include:

- **Euclidean distance:** the standard metric used in K-Means:

  $$
  d(x_i, \mu_j) = \sqrt{\sum_{l=1}^{p} (x_{il} - \mu_{jl})^2}
  $$

  Assumes clusters are spherical and works well for continuous, isotropic data. Sensitive to feature scaling.
- **Manhattan distance:** also known as L1 or city-block distance:

  $$
  d(x_i, \mu_j) = \sum_{l=1}^{p} |x_{il} - \mu_{jl}|
  $$

  Better suited for data with non-spherical clusters or when robustness to outliers is desired. Less sensitive to large differences in individual dimensions than Euclidean distance.
- **Other metrics:** Cosine similarity, Mahalanobis distance, or custom metrics can be used for specific applications, but they require modifications to the standard K-Means algorithm.
- **Choosing a metric:** Euclidean is the default for K-Means because it is compatible with the mean-based centroid update. Standardize features (zero mean, unit variance) to ensure fair contributions from all dimensions.
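As a quick illustration, both metrics take one line of NumPy each (the vectors below are toy values, purely for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
mu = np.array([2.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - mu) ** 2))  # sqrt(1 + 4 + 0) ≈ 2.24
manhattan = np.sum(np.abs(x - mu))          # 1 + 2 + 0 = 3.0
print(euclidean, manhattan)
```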
Python: K-Means with Random Initialization

Below is a Python implementation of K-Means using random initialization, followed by a scikit-learn example for comparison.

Implementation from Scratch

```python
import numpy as np


def kmeans(X, k, max_iters=100, distance_metric='euclidean'):
    # Randomly initialize centroids
    n_samples, n_features = X.shape
    idx = np.random.choice(n_samples, k, replace=False)
    centroids = X[idx]

    for _ in range(max_iters):
        # Assignment step
        distances = np.zeros((n_samples, k))
        for j in range(k):
            if distance_metric == 'euclidean':
                distances[:, j] = np.sum((X - centroids[j])**2, axis=1)
            elif distance_metric == 'manhattan':
                distances[:, j] = np.sum(np.abs(X - centroids[j]), axis=1)
        labels = np.argmin(distances, axis=1)

        # Update step
        new_centroids = np.zeros_like(centroids)
        for j in range(k):
            if np.sum(labels == j) > 0:  # Avoid division by zero
                new_centroids[j] = np.mean(X[labels == j], axis=0)
            else:
                new_centroids[j] = centroids[j]  # Keep old centroid if no points assigned

        # Check for convergence
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids

    # Compute WCSS (inertia)
    inertia = 0
    for j in range(k):
        if distance_metric == 'euclidean':
            inertia += np.sum((X[labels == j] - centroids[j])**2)
        elif distance_metric == 'manhattan':
            inertia += np.sum(np.abs(X[labels == j] - centroids[j]))

    return labels, centroids, inertia
```
Example usage#
```python
np.random.seed(42)
X = np.concatenate([np.random.normal(0, 1, (50, 2)),
                    np.random.normal(5, 1, (50, 2)),
                    np.random.normal(10, 1, (50, 2))])  # 3 clusters
k = 3
labels, centroids, inertia = kmeans(X, k, distance_metric='euclidean')

print("Cluster Labels:", labels)
print("Centroids:", centroids)
print("Inertia (WCSS):", inertia)
```

Using scikit-learn

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means
kmeans_sklearn = KMeans(n_clusters=k, init='random', max_iter=100, random_state=42)
labels_sklearn = kmeans_sklearn.fit_predict(X_scaled)
centroids_sklearn = scaler.inverse_transform(kmeans_sklearn.cluster_centers_)  # Transform back
inertia_sklearn = kmeans_sklearn.inertia_

print("scikit-learn Cluster Labels:", labels_sklearn)
print("scikit-learn Centroids:", centroids_sklearn)
print("scikit-learn Inertia (WCSS):", inertia_sklearn)
```

Output Explanation

- Cluster Labels: The cluster assignment for each data point (0 to \(k-1\)).
- Centroids: The coordinates of the cluster centers.
- Inertia (WCSS): The objective function value, measuring the total within-cluster variance.

Notes

- Random initialization: The scratch implementation selects initial centroids at random, which can lead to different results across runs. For better results, consider K-Means++ initialization (available in scikit-learn).
- Standardization: Always standardize the data before clustering to ensure all features contribute equally.
- Choosing \(k\): Use the elbow method (plotting inertia vs. \(k\)) or the silhouette score to select an appropriate number of clusters.
- scikit-learn: More robust, supports K-Means++ initialization, and handles edge cases efficiently.
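Since the notes mention K-Means++ initialization, here is a minimal sketch of how it can be requested in scikit-learn (the parameter values are illustrative; `X_scaled` comes from the example above):

```python
from sklearn.cluster import KMeans

# init='k-means++' spreads out the initial centroids; n_init reruns the
# algorithm several times and keeps the run with the lowest inertia.
kmeans_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels_pp = kmeans_pp.fit_predict(X_scaled)
print("Inertia with k-means++:", kmeans_pp.inertia_)
```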
Summary

- Objective function: K-Means minimizes the within-cluster sum of squares (WCSS).
- Lloyd's algorithm: Iteratively assigns points to the nearest centroid and updates the centroids.
- Distance metrics: Euclidean is standard, but Manhattan or others can be used for specific cases.
- Python implementation: Random initialization works but is sensitive to starting points; scikit-learn's K-Means is recommended for production.
---
K-means
The KMeans algorithm clusters data by trying to separate samples into \(n\) groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales well to large numbers of samples and has been used across a large range of application areas in many different fields.
The k-means algorithm divides a set of \(N\) samples \(X\) into \(K\) disjoint clusters \(C\), each described by the mean \(\mu_j\) of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from \(X\), although they live in the same space.
The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion:

$$
\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \| x_i - \mu_j \|^2 \right)
$$
Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:

- Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters or manifolds with irregular shapes.
- Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called "curse of dimensionality").
- Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.
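As a rough illustration of that last point, PCA and K-Means can be chained in a scikit-learn pipeline; the dataset and the component/cluster counts below are placeholder choices, not recommendations:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Reduce 64-dimensional digit images to 10 principal components, then cluster
X_digits, _ = load_digits(return_X_y=True)
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    KMeans(n_clusters=10, random_state=42),
)
labels = pipeline.fit_predict(X_digits)
print(labels[:20])
```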
---
Step-by-Step: KMeans Clustering#
```python
# Step 1: Import KMeans and Reduce Data
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd

# Use 2D PCA coordinates of stocks (each stock is a point).
# W and tickers are assumed to come from the earlier PCA step.
X_cluster = W.T          # shape (2, num_stocks)
X_cluster = X_cluster.T  # shape (num_stocks, 2)
print("Clustering Data Shape:", X_cluster.shape)  # (10, 2)

# KMeans clustering (e.g., 3 clusters)
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_cluster)

# Step 2: Visualize Clusters in 2D PCA Space
# Plot stocks with cluster colors
plt.figure(figsize=(10, 7))
colors = ['red', 'green', 'blue', 'purple', 'orange']
for i, ticker in enumerate(tickers):
    plt.scatter(X_cluster[i, 0], X_cluster[i, 1], color=colors[labels[i] % len(colors)], s=100)
    plt.text(X_cluster[i, 0] + 0.01, X_cluster[i, 1] + 0.01, ticker, fontsize=12)
plt.title('KMeans Clustering of Stocks in 2D PCA Space')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid(True)
plt.axis('equal')
plt.show()

# Step 3 (Optional): Cluster Membership Table
# Show which stocks belong to which cluster
df_clusters = pd.DataFrame({'Stock': tickers, 'Cluster': labels})
df_clusters = df_clusters.sort_values('Cluster')
print(df_clusters)
```
Elbow Method vs Silhouette Score — Which is More Popular?
Elbow Method

- Idea: Plot total within-cluster variance (inertia) vs. the number of clusters k.
- As k increases, inertia decreases, but after a point the rate of improvement "bends" like an elbow, suggesting the optimal number of clusters.
- Popular because it's simple and visual.
- Works well when you expect compact, spherical clusters.
- Used more often in practice for KMeans, especially in business or EDA contexts.
Silhouette Score

- Idea: Measures how similar a point is to its own cluster vs. other clusters.
- Score ranges from -1 to 1 (higher is better).
- Tells you both how well-separated and how dense the clusters are.
- Better for algorithmic tuning, especially for non-spherical clusters or when comparing across clustering algorithms.
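Here is a minimal sketch of comparing the two in code (assuming `X_cluster` is the feature matrix from the stock example above, with at least k+1 samples):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42)
    labels_k = km.fit_predict(X_cluster)
    score = silhouette_score(X_cluster, labels_k)
    print(f"k={k}: inertia={km.inertia_:.2f}, silhouette={score:.3f}")
```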
---
Which One Should You Use?
| Use Case | Recommended Method |
|---|---|
| Quick visual guess (KMeans) | Elbow Method |
| Want a more objective, numerical score | Silhouette Score |
| Comparing non-KMeans algorithms | Silhouette Score |
| Exploratory, business reporting | Elbow (visual appeal) |
| Small datasets or when clusters are hard to spot | Silhouette (more reliable) |
---
```python
inertias = []
K_range = range(1, 10)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_cluster)
    inertias.append(km.inertia_)

# Plot Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(K_range, inertias, 'o-', color='darkblue')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.grid(True)
plt.show()
```
For reference, here is a from-scratch implementation of K-Means (random initialization plus Lloyd's iterations), tested on synthetic blobs:

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)


def euclidean_distance(x1, x2):
    return np.sqrt(np.sum((x1 - x2) ** 2))


class KMeans:
    def __init__(self, K=5, max_iters=100, plot_steps=False):
        self.K = K
        self.max_iters = max_iters
        self.plot_steps = plot_steps
        # list of sample indices for each cluster
        self.clusters = [[] for _ in range(self.K)]
        # the centers (mean feature vector) for each cluster
        self.centroids = []

    def predict(self, X):
        self.X = X
        self.n_samples, self.n_features = X.shape

        # initialize
        random_sample_idxs = np.random.choice(self.n_samples, self.K, replace=False)
        self.centroids = [self.X[idx] for idx in random_sample_idxs]

        # Optimize clusters
        for _ in range(self.max_iters):
            # Assign samples to closest centroids (create clusters)
            self.clusters = self._create_clusters(self.centroids)
            if self.plot_steps:
                self.plot()

            # Calculate new centroids from the clusters
            centroids_old = self.centroids
            self.centroids = self._get_centroids(self.clusters)

            # check if clusters have changed
            if self._is_converged(centroids_old, self.centroids):
                break

            if self.plot_steps:
                self.plot()

        # Classify samples as the index of their clusters
        return self._get_cluster_labels(self.clusters)

    def _get_cluster_labels(self, clusters):
        # each sample will get the label of the cluster it was assigned to
        labels = np.empty(self.n_samples)
        for cluster_idx, cluster in enumerate(clusters):
            for sample_index in cluster:
                labels[sample_index] = cluster_idx
        return labels

    def _create_clusters(self, centroids):
        # Assign the samples to the closest centroids to create clusters
        clusters = [[] for _ in range(self.K)]
        for idx, sample in enumerate(self.X):
            centroid_idx = self._closest_centroid(sample, centroids)
            clusters[centroid_idx].append(idx)
        return clusters

    def _closest_centroid(self, sample, centroids):
        # distance of the current sample to each centroid
        distances = [euclidean_distance(sample, point) for point in centroids]
        closest_index = np.argmin(distances)
        return closest_index

    def _get_centroids(self, clusters):
        # assign mean value of clusters to centroids
        centroids = np.zeros((self.K, self.n_features))
        for cluster_idx, cluster in enumerate(clusters):
            cluster_mean = np.mean(self.X[cluster], axis=0)
            centroids[cluster_idx] = cluster_mean
        return centroids

    def _is_converged(self, centroids_old, centroids):
        # distances between each old and new centroid, for all centroids
        distances = [
            euclidean_distance(centroids_old[i], centroids[i]) for i in range(self.K)
        ]
        return sum(distances) == 0

    def plot(self):
        fig, ax = plt.subplots(figsize=(12, 8))
        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            ax.scatter(*point)
        for point in self.centroids:
            ax.scatter(*point, marker="x", color="black", linewidth=2)
        plt.show()


# Testing
if __name__ == "__main__":
    from sklearn.datasets import make_blobs

    X, y = make_blobs(
        centers=3, n_samples=500, n_features=2, shuffle=True, random_state=40
    )
    print(X.shape)

    clusters = len(np.unique(y))
    print(clusters)

    k = KMeans(K=clusters, max_iters=150, plot_steps=True)
    y_pred = k.predict(X)

    k.plot()
```