Welcome to the Marketing Department’s favorite use of machine learning: turning a boring customer spreadsheet into fancy “segments” so everyone can nod wisely in meetings. 😎
🎯 Goal¶
Use Unsupervised Learning (PCA + K-Means + t-SNE/UMAP) to segment customers based on their behavior.
By the end of this lab, you’ll:
Identify meaningful customer clusters 🧍♀️🧍♂️🧍♀️
Visualize them beautifully 🎨
Give them cool names like “Budget Shoppers” and “Luxury Lovers” 💸
🧩 Step 1: Load and Explore the Data¶
Let’s start with some fictional customer data — think of an online store with spending patterns.
import pandas as pd
df = pd.read_csv("customers.csv")
df.head()
| CustomerID | Age | Income | SpendingScore | LoyaltyYears | OnlinePurchases |
|---|---|---|---|---|---|
| 1001 | 25 | 45,000 | 72 | 1 | 12 |
| 1002 | 42 | 85,000 | 35 | 5 | 6 |
Now, take a peek at some stats 👀
df.describe()
⚙️ Step 2: Prepare and Scale¶
Distance-based algorithms hate unscaled data — treat all features equally, or your model will think “Income” is the only thing that matters.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.drop(columns=["CustomerID"]))
🧠 Step 3: Dimensionality Reduction with PCA¶
Even marketers don’t like 6D scatterplots. Let’s compress it down while keeping the main variance.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
Check out how much info we kept:
pca.explained_variance_ratio_.sum()
Usually around 70–90% = good enough for storytelling 🎬
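If hard-coding `n_components=2` feels arbitrary, scikit-learn's `PCA` also accepts a float: it then keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on synthetic stand-in data (the real `customers.csv` columns aren't assumed here, so the exact numbers are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Stand-in for the customer features: 5 columns, one correlated pair
X = rng.normal(size=(200, 5))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.2, size=200)

X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.80)
X_reduced = pca.fit_transform(X_scaled)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

Handy when you want the data, not a magic number, to decide how many dimensions survive.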
💡 Step 4: Cluster with K-Means¶
Now, the star of the show — K-Means! 💥 (aka “let’s pretend we know how many clusters exist.”)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X_scaled)
Boom — customers grouped by mysterious mathematical forces. ✨
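The snark about "pretending we know how many clusters exist" has a standard fix: sweep k and score each run, for example with the silhouette score. A sketch on synthetic blobs; for the lab data you'd pass `X_scaled` instead of `X`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with well-separated groups; swap in X_scaled for real data.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # closer to 1 = tighter, better-separated

best_k = max(scores, key=scores.get)
print(best_k)
```

On well-separated blobs the silhouette typically peaks at the true number of groups; on real customer data the peak is usually softer, so treat it as a guide, not a verdict.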
🧭 Step 5: Visualize with t-SNE or UMAP¶
Because management loves visuals.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_vis = tsne.fit_transform(X_scaled)
plt.figure(figsize=(8,6))
plt.scatter(X_vis[:,0], X_vis[:,1], c=df["Cluster"], cmap='tab10')
plt.title("t-SNE Visualization – Customer Segments 🎨")
plt.show()
Each color = a different segment. Try UMAP for faster, calmer results. 🧘♀️
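UMAP lives in the third-party `umap-learn` package, so it may not be installed; the sketch below falls back to plain PCA when it isn't. `X_scaled` is faked with random numbers here, so only the shapes are meaningful:

```python
import numpy as np

rng = np.random.default_rng(42)
X_scaled = rng.normal(size=(150, 5))  # stand-in for the scaled customer features

try:
    import umap  # provided by the `umap-learn` package
    reducer = umap.UMAP(n_components=2, random_state=42)
    X_vis = reducer.fit_transform(X_scaled)
except ImportError:
    # Fallback if umap-learn isn't available: PCA still gives a 2-D view
    from sklearn.decomposition import PCA
    X_vis = PCA(n_components=2).fit_transform(X_scaled)

print(X_vis.shape)
```

Either way you get an (n_samples, 2) array that drops straight into the `plt.scatter` call above.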
🕵️ Step 6: Interpret the Clusters¶
Now for the marketing translation step — a.k.a. turning math into personas 😅
df.drop(columns=["CustomerID"]).groupby("Cluster").mean()
| Cluster | Age | Income | SpendingScore | LoyaltyYears | OnlinePurchases |
|---|---|---|---|---|---|
| 0 | 28 | 40k | 80 | 1 | 10 |
| 1 | 45 | 90k | 30 | 5 | 3 |
| 2 | 35 | 60k | 60 | 2 | 8 |
Possible names:
🧑💻 Young Spenders – low loyalty, high impulse
👨👩👧 Family Budgeters – steady income, average spending
💎 Luxury Loyalists – high income, low churn risk
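Once you've decided which cluster id is which persona, attaching the labels is a one-liner with `map`. The id→name mapping below is hypothetical: cluster ids are arbitrary and change between runs, so inspect the `groupby` table from your own run before naming anything.

```python
import pandas as pd

# Hypothetical mapping, chosen to match the example table above
persona = {0: "Young Spenders", 1: "Luxury Loyalists", 2: "Family Budgeters"}

df = pd.DataFrame({"CustomerID": [1001, 1002, 1003],
                   "Cluster": [0, 1, 2]})
df["Segment"] = df["Cluster"].map(persona)
print(df[["CustomerID", "Segment"]])
```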
💬 Step 7: Business Insight Time¶
Now the fun part — what do we do with these segments?
| Segment | Strategy |
|---|---|
| Young Spenders | Flash sales & online ads |
| Family Budgeters | Loyalty programs |
| Luxury Loyalists | Premium tier or early access offers |
This is where machine learning becomes money learning. 💰📈
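To push the strategy table into the data itself, one hedged sketch: build a small lookup table and merge it onto the labeled customers. The names and strategies mirror the tables above; the customer rows are made up.

```python
import pandas as pd

# Lookup table mirroring the Segment→Strategy table above
strategies = pd.DataFrame({
    "Segment": ["Young Spenders", "Family Budgeters", "Luxury Loyalists"],
    "Strategy": ["Flash sales & online ads", "Loyalty programs",
                 "Premium tier or early access offers"],
})

customers = pd.DataFrame({"CustomerID": [1001, 1002],
                          "Segment": ["Young Spenders", "Luxury Loyalists"]})

# Left merge keeps every customer and attaches their campaign
targeted = customers.merge(strategies, on="Segment", how="left")
print(targeted)
```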
🔁 Optional: Automate the Pipeline¶
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
("scaler", StandardScaler()),
("pca", PCA(n_components=2)),
("kmeans", KMeans(n_clusters=4, n_init=10, random_state=42))
])
pipeline.fit(df.drop(columns=["CustomerID"]))
Now you’ve got a reusable segmentation pipeline — ready to plug into dashboards or marketing campaigns!
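Because the last step is a KMeans, the fitted pipeline can also assign segments to brand-new customers via `predict`, with scaling and PCA applied automatically on the way in. A sketch on synthetic stand-in data (column names borrowed from the lab, values made up):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
cols = ["Age", "Income", "SpendingScore", "LoyaltyYears", "OnlinePurchases"]
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=cols)  # fake customers

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=42)),
])
pipeline.fit(df)

# New customers flow through the same scale → PCA → cluster steps
new_customers = pd.DataFrame([[30, 1.2, 0.5, -0.3, 0.8]], columns=cols)
labels = pipeline.predict(new_customers)
print(labels)
```

Note that, unlike the step-by-step version earlier, this pipeline clusters on the 2-D PCA projection rather than on all scaled features, so the cluster assignments can differ slightly.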
🧍 Recap¶
| Step | What You Did | Why It Matters |
|---|---|---|
| 1–2 | Load & Scale Data | Prep for ML |
| 3 | PCA | Reduce dimensions |
| 4 | K-Means | Find hidden groups |
| 5 | t-SNE/UMAP | Visualize beautifully |
| 6–7 | Interpret & Act | Turn insight into strategy |
🏁 Wrap-Up¶
You just:
Found hidden customer patterns 🧠
Gave them business meaning 💼
Created visuals that your CMO will love 🎨
Next time someone asks,
“Can we use AI to segment our customers?”
You can say confidently —
“Already done. And it looks gorgeous.” 😎