“Because sometimes, predicting who clicks an ad is harder than convincing your boss why you need more GPUs.” 😅


🎯 Objective

You’re the new data scientist at AdAstra Marketing Co. 🚀 The marketing team wants to predict which customers are most likely to respond to a campaign, so they can stop wasting money on people who think “unsubscribe” means “tell me more.”

Your job:

Build, tune, and evaluate multiple ML models — then choose the one that makes the most business sense, not just the highest accuracy.


🧩 What You’ll Practice

✅ Model comparison with Cross-Validation and Nested CV
✅ Hyperparameter tuning using GridSearchCV / RandomizedSearchCV
✅ Translating model performance → ROI, lift, and profit curves
✅ Choosing the best campaign target group based on predicted response probability


🗂️ Dataset Overview

Use the dataset marketing_campaign.csv (available in the /data folder, or downloadable via the Colab link below).

| Feature | Description |
| --- | --- |
| age | Customer age |
| income | Annual income ($) |
| spending_score | Prior engagement with campaigns |
| channel | Campaign channel (email, social, etc.) |
| response | 1 if responded positively, else 0 |

🧰 Setup & Notebook Access

You can run this lab directly in:

| Option | Link |
| --- | --- |
| 🧮 JupyterLite (in-browser) | Run in JupyterLite ▶️ |
| ☁️ Google Colab | Open in Colab 🚀 |
| 💾 Download Notebook | Download .ipynb |

🧠 Step-by-Step Instructions

1. Load & Explore Data

import pandas as pd

df = pd.read_csv("marketing_campaign.csv")
df.head()

Check data types, missing values, and basic stats. Don’t forget to ask the existential question:

“Why is income missing for half our audience?” 💸
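
The checks above can be sketched as follows. Since marketing_campaign.csv is not reproduced on this page, the snippet builds a tiny stand-in DataFrame with the documented columns; every value in it is invented for illustration, so swap in your real df before trusting any number.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for marketing_campaign.csv (all values are made up).
df = pd.DataFrame({
    "age": [25, 41, 33, 58, 47, 29],
    "income": [42000.0, np.nan, 58000.0, np.nan, 91000.0, 36000.0],
    "spending_score": [61, 35, 80, 22, 55, 70],
    "channel": ["email", "social", "email", "sms", "social", "email"],
    "response": [1, 0, 1, 0, 0, 1],
})

dtypes = df.dtypes                     # data type of each column
missing_share = df.isna().mean()       # fraction of missing values per column
summary = df.describe()                # basic stats for the numeric columns
response_rate = df["response"].mean()  # class balance of the target

print(dtypes)
print(missing_share.sort_values(ascending=False))
print(summary)
print(f"Response rate: {response_rate:.1%}")
```

A high missing share on income is worth investigating before imputing, and the response rate tells you how imbalanced the target is.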


2. Preprocess

  • Encode categorical features (channel)

  • Handle missing income with median imputation

  • Scale continuous variables

Bonus Challenge: Try both StandardScaler and MinMaxScaler — and see if your model’s mood improves. 😆
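
One possible way to wire these three steps together is a scikit-learn ColumnTransformer. This is a sketch, not the lab's official solution: the sample rows are invented, and the column grouping (numeric_cols, categorical_cols) is an assumption based on the dataset overview.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample rows with the documented feature columns (values are made up).
X = pd.DataFrame({
    "age": [25, 41, 33, 58, 47, 29],
    "income": [42000.0, np.nan, 58000.0, np.nan, 91000.0, 36000.0],
    "spending_score": [61, 35, 80, 22, 55, 70],
    "channel": ["email", "social", "email", "sms", "social", "email"],
})

numeric_cols = ["age", "income", "spending_score"]  # assumed grouping
categorical_cols = ["channel"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median imputation for missing income
    ("scale", StandardScaler()),                   # swap in MinMaxScaler() for the bonus challenge
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_pipe, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_prepared = preprocess.fit_transform(X)
print(X_prepared.shape)
```

Keeping the preprocessing in a transformer object means it can later be fitted inside each cross-validation fold, which avoids leaking test-fold statistics into the imputer and scaler.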


3. Split Data & Define Models

Use a Stratified Train-Test Split to preserve response balance.

Try multiple models:

  • Logistic Regression

  • Random Forest

  • XGBoost (optional but fun 💥)
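
A minimal sketch of the split and the model lineup, assuming synthetic data in place of the real CSV. The response rule below is invented purely so the example runs, and the channel column and XGBoost are left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real campaign data).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "income": rng.normal(60000, 15000, n),
    "spending_score": rng.integers(1, 101, n),
})
# Made-up rule: a higher spending_score makes a response more likely.
p = 1 / (1 + np.exp(-(df["spending_score"] - 65) / 10))
df["response"] = (rng.random(n) < p).astype(int)

X, y = df.drop(columns="response"), df["response"]

# stratify=y keeps the response rate (nearly) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Candidate models to compare in the next steps.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}

print(f"Train response rate: {y_train.mean():.3f}")
print(f"Test response rate:  {y_test.mean():.3f}")
```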


4. Evaluate with Cross-Validation

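from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# X, y: the preprocessed features and response labels from the earlier steps
rf = RandomForestClassifier(random_state=42)
# cv=5 uses stratified 5-fold splits for classifiers; F1 is computed on the positive class
scores = cross_val_score(rf, X, y, cv=5, scoring="f1")
print("Average F1:", scores.mean())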

💡 Tip: Keep a leaderboard of all models — treat it like a data science version of The Bachelor 💔🌹
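
The leaderboard idea could look something like this. The data comes from make_classification as a synthetic stand-in, and the leaderboard layout (model, f1_mean, f1_std) is just one assumed format.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (roughly 20% responders).
X, y = make_classification(
    n_samples=400, n_features=6, weights=[0.8, 0.2], random_state=42
)

models = {
    # Scaling lives inside the pipeline so it is refit on each training fold.
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    rows.append({"model": name, "f1_mean": scores.mean(), "f1_std": scores.std()})

# Best mean F1 first.
leaderboard = (
    pd.DataFrame(rows)
    .sort_values("f1_mean", ascending=False)
    .reset_index(drop=True)
)
print(leaderboard)
```

Reporting the standard deviation next to the mean helps you tell a real winner from fold-to-fold noise.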


5. Hyperparameter Tuning

Use GridSearchCV or RandomizedSearchCV to find the sweet spot.

| Model | Key Hyperparameters |
| --- | --- |
| Logistic Regression | C, penalty |
| Random Forest | n_estimators, max_depth, min_samples_split |
| XGBoost | eta (called learning_rate in the scikit-learn wrapper), max_depth, subsample |

Don’t overfit — the goal is ROI, not ego points. 😎
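
Here is a rough sketch of grid search wrapped in nested cross-validation for the Random Forest row of the table. The grid values and fold counts are arbitrary choices to keep the example fast, and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the campaign data.
X, y = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0
)

# Small illustrative grid; widen it for real experiments.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_split": [2, 10],
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # picks hyperparameters
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)  # scores the tuned model

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=inner_cv,
    scoring="f1",
)

# Nested CV: each outer fold reruns the full grid search on its training part,
# so the reported score is not inflated by the tuning itself.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1")
print("Nested CV F1:", nested_scores.mean())

# Fit once on all data to get the hyperparameters you would actually use.
search.fit(X, y)
print("Best params:", search.best_params_)
```

RandomizedSearchCV follows the same pattern with param_distributions and n_iter in place of the grid, which tends to be cheaper when the search space is large.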


6. Business-Aware Evaluation

Compute ROI, lift, and expected profit for each model.

Example profit function:

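import numpy as np

def campaign_profit(y_true, y_pred_proba, threshold=0.5):
    """Net profit ($) from contacting everyone scored at or above `threshold`."""
    # Example payoffs: gain per responder reached, cost per wasted contact, cost per missed responder
    TP_gain, FP_cost, FN_cost = 100, 10, 50
    y_true = np.asarray(y_true)
    preds = (np.asarray(y_pred_proba) >= threshold).astype(int)
    TP = ((preds == 1) & (y_true == 1)).sum()  # responders contacted
    FP = ((preds == 1) & (y_true == 0)).sum()  # non-responders contacted
    FN = ((preds == 0) & (y_true == 1)).sum()  # responders missed
    return TP*TP_gain - FP*FP_cost - FN*FN_cost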

Try different thresholds to find the profit-maximizing one, not just where F1 is max. 💰
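
Lift and ROI, the other two metrics named at the top of this step, can be sketched in the same spirit. The definitions used here are common conventions rather than something this lab prescribes: lift is the response rate in the top-scored fraction divided by the overall rate, and ROI is (revenue - cost) / cost. The parameters revenue_per_response and cost_per_contact are hypothetical, and the toy labels and scores are invented.

```python
import numpy as np

def lift_at_k(y_true, y_score, k=0.1):
    """Response rate among the top-k fraction by score, divided by the overall rate."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_top = max(1, int(round(k * len(y_true))))
    top_idx = np.argsort(-y_score, kind="stable")[:n_top]  # highest scores first
    return y_true[top_idx].mean() / y_true.mean()

def campaign_roi(y_true, y_score, threshold=0.5,
                 revenue_per_response=100, cost_per_contact=10):
    """(revenue - cost) / cost for contacting everyone at or above the threshold.

    The default revenue and cost figures are hypothetical placeholders.
    """
    y_true = np.asarray(y_true)
    contacted = np.asarray(y_score) >= threshold
    cost = contacted.sum() * cost_per_contact
    if cost == 0:
        return 0.0  # nobody contacted, so nothing spent and nothing earned
    revenue = (y_true[contacted] == 1).sum() * revenue_per_response
    return (revenue - cost) / cost

# Toy labels and scores, invented for illustration.
y_true = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1, 0.05, 0.35])

print("Lift @ top 20%:", lift_at_k(y_true, y_score, k=0.2))
print("ROI @ 0.5:", campaign_roi(y_true, y_score, threshold=0.5))
```

A lift above 1 means the model's top-ranked customers respond more often than a random mailing would; ROI then tells you whether that advantage survives the contact costs.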


7. Visualize Profit Curve

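import numpy as np
import matplotlib.pyplot as plt

# y_pred_proba: positive-class probabilities on the test set,
# e.g. model.predict_proba(X_test)[:, 1] for a fitted scikit-learn classifier
thresholds = np.linspace(0, 1, 100)
profits = [campaign_profit(y_test, y_pred_proba, t) for t in thresholds]

plt.plot(thresholds, profits)
plt.xlabel("Threshold")
plt.ylabel("Expected Profit ($)")
# Plain-text title: matplotlib's default font has no glyph for the 💵 emoji
plt.title("Optimize for profit, not just F1")
plt.show()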

8. Interpret & Present

Explain the results to your “boss” (real or imaginary 🧑‍💼):

  • What’s the best model and why?

  • What threshold gives the best ROI?

  • Which features drive campaign response?

Pro tip: Use visuals — executives fear math but love graphs. 📊❤️
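
For the "which features drive response" question, permutation importance is one model-agnostic option (it is a suggestion here, not something the lab mandates). The sketch below uses synthetic data in which, by construction, only spending_score matters, so the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real campaign data).
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "income": rng.normal(60000, 15000, n),
    "spending_score": rng.integers(1, 101, n),
})
# Made-up rule: only spending_score (plus noise) drives the response.
y = (X["spending_score"] + rng.normal(0, 10, n) > 60).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the drop in ROC AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
importances = pd.Series(result.importances_mean, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

A horizontal bar chart of these values is usually enough for the executive slide.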


🧪 Stretch Goal

Deploy a segmenter that tiers customers by their rank on predicted response probability:

  • Top 10% predicted responders = Premium segment

  • Next 30% = Target later

  • Rest = Send memes instead of ads 😜
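
The three tiers above can be sketched as a small ranking function. The name segment_customers, its parameters, and the segment labels are hypothetical; the probabilities are an evenly spaced toy vector, not model output.

```python
import numpy as np
import pandas as pd

def segment_customers(proba, top=0.10, mid=0.30):
    """Label each customer by rank on predicted response probability.

    Top `top` fraction -> "Premium", next `mid` fraction -> "Target later",
    everyone else -> "Rest". Ties are broken by original order.
    """
    proba = pd.Series(proba, dtype=float)
    n = len(proba)
    n_top = int(round(top * n))
    n_mid = int(round(mid * n))
    # rank 1 = highest predicted probability
    rank = proba.rank(method="first", ascending=False)
    labels = np.where(
        rank <= n_top,
        "Premium",
        np.where(rank <= n_top + n_mid, "Target later", "Rest"),
    )
    return pd.Series(labels, index=proba.index, name="segment")

# Toy probabilities for 20 customers, invented for illustration.
proba = np.linspace(0.05, 0.95, 20)
segments = segment_customers(proba)
print(segments.value_counts())
```

Ranking by percentile rather than using a fixed probability cutoff keeps the segment sizes stable even if the model's scores are poorly calibrated.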


🎓 Deliverables

  • Notebook with:

    • Model training + tuning

    • ROI / lift plots

    • Final model comparison table

  • Short business summary:

    “This model improves campaign ROI by 18% with 40% fewer promotions sent.”


💬 Final Thought

“In marketing, knowing who not to target is often worth more than finding who to target.” 💡


# Your code here