Lab – Fraud Detection#

Welcome to the final challenge of our forest adventure — where you’ll train a squad of trees to catch fraudsters faster than your bank’s security team.

This is your chance to turn theory into “Whoa, that’s actually useful for business!”


🎯 Business Scenario#

Imagine you work at TrustBank™, where customers occasionally do… interesting things:

  • $3,000 spent on candles at 2 AM 🕯️

  • A sudden transfer to “Crypto Banana Inc.” 🍌

  • Or 17 transactions in 3 seconds from two continents 🌍

Your job? Train an ML model to detect fraudulent transactions before they drain the company’s coffee fund ☕💸


🧠 What You’ll Build#

In this lab, you’ll:

  1. Load a credit card transactions dataset.

  2. Train Decision Tree, Random Forest, and XGBoost models.

  3. Compare accuracy, precision, recall, and AUC.

  4. Plot feature importances to see which variables catch the bad guys.


🧰 Setup#

You can run this notebook:

  • 🧩 In your browser: via JupyterLite (top-right menu)

  • ☁️ In the cloud: Open in Google Colab

  • 💾 On your machine: download the notebook and run it locally.
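
If you go local, you'll need a few packages first. A minimal setup, assuming a standard Python 3 environment:

pip install pandas scikit-learn xgboost matplotlib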


📦 Step 1: Load & Inspect the Data#

import pandas as pd

url = "https://raw.githubusercontent.com/mlbookcamp/code/master/data/card_fraud.csv"
df = pd.read_csv(url)

df.head()

👀 Look for suspicious clues:

  • amount, time, country, is_fraud

  • Any patterns jump out?
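
If nothing jumps out from df.head() alone, two quick pandas checks can help. A minimal sketch, using only the df loaded above:

# How rare is fraud, really?
print(df["is_fraud"].value_counts(normalize=True))

# Do fraudulent amounts look different from legit ones?
print(df.groupby("is_fraud")["amount"].describe())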


🌳 Step 2: Train a Random Forest#

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Separate the features from the fraud label
X = df.drop("is_fraud", axis=1)
y = df["is_fraud"]

# One-hot encode any categorical columns (e.g. country) so the trees get numbers
X = pd.get_dummies(X)

# Stratify so the rare fraud cases land in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))

🎯 Check if your model catches more frauds than false alarms. Because no one likes blocking Grandma’s grocery card “by accident.” 😅
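
To literally count caught frauds versus false alarms, a confusion matrix is the quickest lens. A short sketch with scikit-learn, reusing y_pred from above:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predictions:
# [[legit passed,  legit blocked (sorry, Grandma)],
#  [fraud missed,  fraud caught]]
print(confusion_matrix(y_test, y_pred))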


⚡ Step 3: Boost Your Accuracy with XGBoost#

from xgboost import XGBClassifier

# Boosting: each new tree focuses on the mistakes of the trees before it
xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    random_state=42
)
xgb.fit(X_train, y_train)
print(classification_report(y_test, xgb.predict(X_test)))

💡 XGBoost tends to outperform others here — because it learns from every mistake like a data-driven perfectionist.


🧩 Step 4: Compare Model Metrics#

| Model         | Accuracy | Precision | Recall | AUC  |
|---------------|----------|-----------|--------|------|
| Decision Tree | ~0.91    | 0.75      | 0.68   | 0.84 |
| Random Forest | ~0.96    | 0.88      | 0.85   | 0.94 |
| XGBoost       | ~0.97    | 0.91      | 0.89   | 0.96 |

(Your results may vary, depending on data and random seed.)
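
Want to reproduce the table yourself? A minimal sketch, assuming a DecisionTreeClassifier trained on the same split as the rf and xgb models above:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

for name, model in [("Decision Tree", dt), ("Random Forest", rf), ("XGBoost", xgb)]:
    pred = model.predict(X_test)
    prob = model.predict_proba(X_test)[:, 1]  # fraud probability, needed for AUC
    print(f"{name}: acc={accuracy_score(y_test, pred):.2f}  "
          f"prec={precision_score(y_test, pred):.2f}  "
          f"rec={recall_score(y_test, pred):.2f}  "
          f"auc={roc_auc_score(y_test, prob):.2f}")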

🎓 Lesson: Ensembles > Lone Trees. Teamwork makes the fraud work (less).


🌟 Step 5: Plot Feature Importance#

import matplotlib.pyplot as plt
from xgboost import plot_importance

plot_importance(xgb, importance_type='gain', title='Feature Importance (XGBoost)')
plt.show()

Check which features are most suspiciously helpful. Typical top suspects:

  • transaction_amount 💰

  • time_of_day

  • country 🌍
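
plot_importance only works for XGBoost. For a comparable view of the Random Forest, here's a short sketch using its feature_importances_ attribute:

# Rank the Random Forest's features by impurity-based importance
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh", title="Feature Importance (Random Forest)")
plt.show()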


🧠 Bonus Challenge: Real-World Twist#

Fraud data is imbalanced: only ~1% of cases are fraud. Try one of these (each sketched in code after the list):

  • Using class_weight='balanced'

  • Or SMOTE to oversample minority cases

  • Or tweaking XGBoost’s scale_pos_weight

Because your model shouldn’t panic every time someone buys something after midnight 🌙
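
Here is a sketch of all three options. Note that SMOTE lives in the separate imbalanced-learn package (pip install imbalanced-learn):

# Option 1: weight the rare class more heavily in the Random Forest
rf_bal = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
rf_bal.fit(X_train, y_train)

# Option 2: oversample fraud cases with SMOTE (on the training set only!)
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Option 3: tell XGBoost how lopsided the classes are (roughly negatives/positives)
ratio = (y_train == 0).sum() / (y_train == 1).sum()
xgb_bal = XGBClassifier(scale_pos_weight=ratio, n_estimators=300, random_state=42)
xgb_bal.fit(X_train, y_train)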


🏁 Summary#

You just:

  • Built 3 fraud detection models

  • Compared ensemble performance

  • Interpreted feature importances

  • And learned that banks and forests have one thing in common: lots of branches. 🌳💳


🎓 What’s Next#

In the next chapter, we move from structured data to high-dimensional adventures — because the world isn’t just tables… it’s images, text, and chaos. 😎

👉 Get ready for: Neural Networks & Deep Learning Foundations 🧠🔥
