
Welcome to the Marketing Department’s favorite use of machine learning: turning a boring customer spreadsheet into fancy “segments” so everyone can nod wisely in meetings. 😎


🎯 Goal

Use Unsupervised Learning (PCA + K-Means + t-SNE/UMAP) to segment customers based on their behavior.

By the end of this lab, you’ll:

  • Identify meaningful customer clusters 🧍‍♀️🧍‍♂️🧍‍♀️

  • Visualize them beautifully 🎨

  • Give them cool names like “Budget Shoppers” and “Luxury Lovers” 💸


🧩 Step 1: Load and Explore the Data

Let’s start with some fictional customer data — think of an online store with spending patterns.

import pandas as pd

df = pd.read_csv("customers.csv")
df.head()
| CustomerID | Age | Income | SpendingScore | LoyaltyYears | OnlinePurchases |
|------------|-----|--------|---------------|--------------|-----------------|
| 1001       | 25  | 45,000 | 72            | 1            | 12              |
| 1002       | 42  | 85,000 | 35            | 5            | 6               |

Now, take a peek at some stats 👀

df.describe()
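If you don't have a `customers.csv` handy, here's a minimal sketch that builds a synthetic stand-in with the same columns (the column names and value ranges are assumptions based on the table above), so the rest of the lab runs end to end:

```python
import numpy as np
import pandas as pd

# customers.csv isn't bundled with this lab, so generate a synthetic
# stand-in with the same schema (column names assumed from the preview).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "CustomerID": np.arange(1001, 1001 + n),
    "Age": rng.integers(18, 70, n),
    "Income": rng.integers(20_000, 120_000, n),
    "SpendingScore": rng.integers(1, 100, n),
    "LoyaltyYears": rng.integers(0, 10, n),
    "OnlinePurchases": rng.integers(0, 50, n),
})
# Peek at the spread of each feature, just like df.describe() above
print(df.describe().loc[["mean", "min", "max"]])
```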

⚙️ Step 2: Prepare and Scale

Distance-based algorithms hate unscaled data — treat all features equally, or your model will think "Income" is the only thing that matters.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop(columns=["CustomerID"]))
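A quick sanity check on toy numbers shows what the scaler actually does: after `StandardScaler`, every feature has mean ~0 and standard deviation ~1, so no single column dominates the distance computations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy Age/Income data: raw Income values would swamp Age in any
# Euclidean distance; after scaling, both live on the same footing.
X = np.array([[25, 45_000], [42, 85_000], [31, 60_000]], dtype=float)
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_scaled.std(axis=0).round(6))   # ~[1, 1]
```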

🧠 Step 3: Dimensionality Reduction with PCA

Even marketers don’t like 6D scatterplots. Let’s compress it down to two dimensions while keeping most of the variance.

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

Check out how much info we kept:

pca.explained_variance_ratio_.sum()

Usually around 70–90% = good enough for storytelling 🎬
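To see where that percentage comes from, here's a small sketch on synthetic correlated data (the correlations are assumed, mimicking features like Income and SpendingScore moving together): each entry of `explained_variance_ratio_` is one component's share of the total variance, and the cumulative sum tells you how many components you need.

```python
import numpy as np
from sklearn.decomposition import PCA

# Four features that are all noisy copies of one latent factor, so
# the first principal component should capture most of the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(300, 1)) for _ in range(4)])

pca = PCA(n_components=2).fit(X)
ratios = pca.explained_variance_ratio_
print(ratios.round(3))       # per-component share, largest first
print(ratios.sum().round(3)) # total info kept by the 2D projection
```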


💡 Step 4: Cluster with K-Means

Now, the star of the show — K-Means! 💥 (aka “let’s pretend we know how many clusters exist.”)

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X_scaled)

Boom — customers grouped by mysterious mathematical forces. ✨
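Since `n_clusters=4` is an educated guess, one way to hedge it is to score several candidate k values with the silhouette coefficient (higher means tighter, better-separated clusters). A minimal sketch on synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known group structure, standing in for X_scaled.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-Means for a range of k and keep the silhouette score of each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real customer data the curve is rarely this clean, but a clear peak is a decent sanity check before you commit to a segment count.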


🧭 Step 5: Visualize with t-SNE or UMAP

Because management loves visuals.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_vis = tsne.fit_transform(X_scaled)

plt.figure(figsize=(8,6))
plt.scatter(X_vis[:,0], X_vis[:,1], c=df["Cluster"], cmap='tab10')
plt.title("t-SNE Visualization – Customer Segments 🎨")
plt.show()

Each color = a different segment. Try UMAP for faster, calmer results. 🧘‍♀️
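Note that UMAP lives in the separate `umap-learn` package (`pip install umap-learn`), not in scikit-learn, though its API mirrors `fit_transform`. This sketch (on stand-in data) falls back to PCA when the package isn't installed:

```python
import numpy as np

# Stand-in for the scaled customer features.
rng = np.random.default_rng(42)
X_scaled = rng.normal(size=(100, 5))

try:
    from umap import UMAP  # assumed install: pip install umap-learn
    X_vis = UMAP(n_components=2, random_state=42).fit_transform(X_scaled)
except ImportError:
    from sklearn.decomposition import PCA  # fallback so the demo still runs
    X_vis = PCA(n_components=2).fit_transform(X_scaled)

print(X_vis.shape)  # (100, 2) either way
```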


🕵️ Step 6: Interpret the Clusters

Now for the marketing translation step — a.k.a. turning math into personas 😅

df.groupby("Cluster").mean()
| Cluster | Age | Income | SpendingScore | LoyaltyYears | OnlinePurchases |
|---------|-----|--------|---------------|--------------|-----------------|
| 0       | 28  | 40k    | 80            | 1            | 10              |
| 1       | 45  | 90k    | 30            | 5            | 3               |
| 2       | 35  | 60k    | 60            | 2            | 8               |
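Here's what that `groupby` is doing under the hood, on a tiny hand-made frame (the numbers are illustrative): group rows by cluster label and average each feature, and the resulting profile is what you read the personas off of.

```python
import pandas as pd

# Six toy customers, two per cluster, with made-up feature values.
df = pd.DataFrame({
    "Cluster":       [0, 0, 1, 1, 2, 2],
    "Age":           [27, 29, 44, 46, 34, 36],
    "SpendingScore": [78, 82, 28, 32, 58, 62],
})

# Mean of each feature within each cluster = one row per persona.
profile = df.groupby("Cluster").mean()
print(profile)
```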

Possible names:

  • 🧑‍💻 Young Spenders – low loyalty, high impulse

  • 👨‍👩‍👧 Family Budgeters – steady income, average spending

  • 💎 Luxury Loyalists – high income, low churn risk


💬 Step 7: Business Insight Time

Now the fun part — what do we do with these segments?

| Segment          | Strategy                            |
|------------------|-------------------------------------|
| Young Spenders   | Flash sales & online ads            |
| Family Budgeters | Loyalty programs                    |
| Luxury Loyalists | Premium tier or early access offers |

This is where machine learning becomes money learning. 💰📈
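Once you've eyeballed the profiles, pin the human-readable names onto the numeric labels so downstream dashboards speak marketing, not math. The cluster-to-persona mapping below is illustrative; yours comes from inspecting your own profile table.

```python
import pandas as pd

# Hypothetical mapping from K-Means labels to the personas above.
persona = {0: "Young Spenders", 1: "Luxury Loyalists", 2: "Family Budgeters"}

df = pd.DataFrame({"CustomerID": [1001, 1002, 1003], "Cluster": [0, 2, 1]})
df["Segment"] = df["Cluster"].map(persona)  # label -> readable name
print(df)
```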


🔁 Optional: Automate the Pipeline

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("kmeans", KMeans(n_clusters=4, random_state=42))
])

pipeline.fit(df.drop(columns=["CustomerID"]))

Now you’ve got a reusable segmentation pipeline — ready to plug into dashboards or marketing campaigns!
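The payoff of the pipeline is scoring new customers in one call: scale, project, and assign to the nearest K-Means centroid. A self-contained sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic training data standing in for the customer features.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 5))

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=42)),
])
pipeline.fit(X_train)

# A brand-new customer gets scaled, projected, and assigned in one call.
new_customer = rng.normal(size=(1, 5))
print(pipeline.predict(new_customer))  # a cluster id in 0..3
```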


🧍 Recap

| Step | What You Did      | Why It Matters            |
|------|-------------------|---------------------------|
| 1    | Load & Scale Data | Prep for ML               |
| 2    | PCA               | Reduce dimensions         |
| 3    | K-Means           | Find hidden groups        |
| 4    | t-SNE/UMAP        | Visualize beautifully     |
| 5    | Interpret & Act   | Turn insight into strategy |

🏁 Wrap-Up

You just:

  • Found hidden customer patterns 🧠

  • Gave them business meaning 💼

  • Created visuals that your CMO will love 🎨

Next time someone asks,

“Can we use AI to segment our customers?”

You can say confidently —

“Already done. And it looks gorgeous.” 😎
