Welcome to the final challenge of our forest adventure — where you’ll train a squad of trees to catch fraudsters faster than your bank’s security team.
This is your chance to turn theory into “Whoa, that’s actually useful for business!”
🎯 Business Scenario¶
Imagine you work at TrustBank™, where customers occasionally do... interesting things:
$3,000 spent on candles at 2 AM 🕯️
A sudden transfer to “Crypto Banana Inc.” 🍌
Or 17 transactions in 3 seconds from two continents 🌍
Your job? Train an ML model to detect fraudulent transactions before they drain the company’s coffee fund ☕💸
🧠 What You’ll Build¶
In this lab, you’ll:
Load a credit card transactions dataset.
Train Decision Tree, Random Forest, and XGBoost models.
Compare accuracy, precision, recall, and AUC.
Plot feature importances to see which variables catch the bad guys.
🧰 Setup¶
You can run this notebook:
🧩 In your browser: via JupyterLite (top-right menu)
☁️ In the cloud: Open in Google Colab
💾 Or download the notebook and run locally.
📦 Step 1: Load & Inspect the Data¶
```python
import pandas as pd

url = "https://raw.githubusercontent.com/mlbookcamp/code/master/data/card_fraud.csv"
df = pd.read_csv(url)
df.head()
```

👀 Look for suspicious clues in the columns (`amount`, `time`, `country`, `is_fraud`). Any patterns jump out?
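Before modeling, it's worth checking just how rare fraud actually is. A minimal sketch, using a tiny hand-made frame as a stand-in for the real `df` loaded above:

```python
import pandas as pd

# Tiny stand-in frame (the real df comes from the CSV above)
df = pd.DataFrame({
    "amount": [12.5, 3000.0, 45.0, 9.99, 150.0],
    "time": [14, 2, 11, 18, 3],
    "country": ["US", "US", "DE", "US", "BR"],
    "is_fraud": [0, 1, 0, 0, 1],
})

# Fraction of fraudulent transactions; real fraud data is far more skewed
fraud_rate = df["is_fraud"].mean()
print(f"Fraud rate: {fraud_rate:.1%}")
print(df["is_fraud"].value_counts())
```

On the real dataset, expect this rate to be close to 1%, which matters for every metric you compute later.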
🌳 Step 2: Train a Random Forest¶
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# One-hot encode the categorical country column so the model can use it
X = pd.get_dummies(df.drop("is_fraud", axis=1), columns=["country"])
y = df["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y  # keep the fraud ratio in both splits
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
```

🎯 Check whether your model catches more frauds than it raises false alarms, because no one likes blocking Grandma’s grocery card “by accident.” 😅
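The comparison table in Step 4 also lists a lone Decision Tree, which the code above never trains. A minimal sketch of that baseline, using a synthetic imbalanced dataset as a stand-in for the card-fraud split (the hyperparameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic imbalanced data standing in for the card-fraud features
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# A single tree, depth-limited so it does not memorize the training set
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
print(classification_report(y_test, dt.predict(X_test)))
```

On the real data you would reuse the `X_train`/`y_train` split from above instead of the synthetic one.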
⚡ Step 3: Boost Your Accuracy with XGBoost¶
```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    random_state=42,
)
xgb.fit(X_train, y_train)
print(classification_report(y_test, xgb.predict(X_test)))
```

💡 XGBoost tends to outperform the others here, because it learns from every mistake like a data-driven perfectionist.
🧩 Step 4: Compare Model Metrics¶
| Model | Accuracy | Precision | Recall | AUC |
|---|---|---|---|---|
| Decision Tree | ~0.91 | 0.75 | 0.68 | 0.84 |
| Random Forest | ~0.96 | 0.88 | 0.85 | 0.94 |
| XGBoost | ~0.97 | 0.91 | 0.89 | 0.96 |
(Your results may vary, depending on data and random seed.)
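If you want to compute the table's numbers yourself rather than read them off a report, scikit-learn has one metric function per column. A sketch on synthetic imbalanced data (the dataset and model here are stand-ins for your fitted `rf` and `xgb`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # class probabilities, needed for AUC

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.2f}")
print(f"Precision: {precision_score(y_test, y_pred):.2f}")
print(f"Recall:    {recall_score(y_test, y_pred):.2f}")
print(f"AUC:       {roc_auc_score(y_test, y_prob):.2f}")
```

Note that AUC is computed from predicted probabilities, not hard labels; passing `y_pred` there would understate it.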
🎓 Lesson: Ensembles > lone trees. Teamwork makes the fraud work (less).
🌟 Step 5: Plot Feature Importance¶
```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

plot_importance(xgb, importance_type='gain', title='Feature Importance (XGBoost)')
plt.show()
```

Check which features are most suspiciously helpful. Typical top suspects: `amount` 💰, `time` ⏰, `country` 🌍.
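If you prefer raw numbers to a plot, tree ensembles in scikit-learn expose `feature_importances_` directly. A sketch on synthetic data, with illustrative column names standing in for the real ones:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative names; substitute X.columns from your real DataFrame
feature_names = ["amount", "time", "country_US", "country_DE"]
X, y = make_classification(n_samples=500, n_features=4, random_state=42)

rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Importances are normalized to sum to 1; sort to see the top suspects first
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

This is handy for piping importances into your own tables or dashboards instead of a matplotlib figure.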
🧠 Bonus Challenge: Real-World Twist¶
Fraud data is imbalanced: only ~1% of cases are fraud. Try:

- Using `class_weight='balanced'` in the scikit-learn models
- Using SMOTE to oversample the minority class
- Tweaking XGBoost’s `scale_pos_weight`
Because your model shouldn’t panic every time someone buys something after midnight 🌙
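For the `scale_pos_weight` option, a common heuristic is the ratio of negatives to positives in the training labels. A sketch with synthetic labels standing in for `y_train`:

```python
import numpy as np

# Synthetic labels standing in for y_train: roughly 1% fraud
rng = np.random.default_rng(42)
y_train = (rng.random(10_000) < 0.01).astype(int)

# Heuristic: scale_pos_weight = (# negative samples) / (# positive samples)
neg, pos = int((y_train == 0).sum()), int((y_train == 1).sum())
scale_pos_weight = neg / pos
print(f"negatives={neg}, positives={pos}, scale_pos_weight={scale_pos_weight:.1f}")

# This value is then passed to XGBClassifier(scale_pos_weight=scale_pos_weight, ...)
```

A large ratio tells XGBoost to penalize missed frauds much more heavily than false alarms, which usually trades some precision for better recall.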
🏁 Summary¶
You just:
Built 3 fraud detection models
Compared ensemble performance
Interpreted feature importances
And learned that banks and forests have one thing in common: lots of branches. 🌳💳
🎓 What’s Next¶
In the next chapter, we move from structured data to high-dimensional adventures — because the world isn’t just tables… it’s images, text, and chaos. 😎
👉 Get ready for: Neural Networks & Deep Learning Foundations 🧠🔥