Lab – Fraud Detection#
Welcome to the final challenge of our forest adventure — where you’ll train a squad of trees to catch fraudsters faster than your bank’s security team.
This is your chance to turn theory into “Whoa, that’s actually useful for business!”
🎯 Business Scenario#
Imagine you work at TrustBank™, where customers occasionally do… interesting things:
$3,000 spent on candles at 2 AM 🕯️
A sudden transfer to “Crypto Banana Inc.” 🍌
Or 17 transactions in 3 seconds from two continents 🌍
Your job? Train an ML model to detect fraudulent transactions before they drain the company’s coffee fund ☕💸
🧠 What You’ll Build#
In this lab, you’ll:
Load a credit card transactions dataset.
Train Decision Tree, Random Forest, and XGBoost models.
Compare accuracy, precision, recall, and AUC.
Plot feature importances to see which variables catch the bad guys.
🧰 Setup#
You can run this notebook:
🧩 In your browser: via JupyterLite (top-right menu)
☁️ In the cloud: Open in Google Colab
💾 Or download the notebook and run locally.
📦 Step 1: Load & Inspect the Data#
import pandas as pd

# Load the credit card transactions dataset straight from GitHub
url = "https://raw.githubusercontent.com/mlbookcamp/code/master/data/card_fraud.csv"
df = pd.read_csv(url)
df.head()
👀 Look for suspicious clues in the columns: `amount`, `time`, `country`, `is_fraud`. Any patterns jump out?
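If you're not sure where to start, two quick checks are the class balance and the summary statistics. A minimal sketch, assuming `is_fraud` is a 0/1 label:

# What fraction of transactions are fraudulent? (Spoiler: very few.)
print(df["is_fraud"].value_counts(normalize=True))

# Summary statistics can surface oddities, e.g. extreme amounts
print(df.describe())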
🌳 Step 2: Train a Random Forest#
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X = df.drop("is_fraud", axis=1)
y = df["is_fraud"]
X = pd.get_dummies(X)  # one-hot encode categorical columns (e.g. country); tree models can't consume raw strings
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42  # stratify keeps the fraud ratio equal in both splits
)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
🎯 Check if your model catches more frauds than false alarms. Because no one likes blocking Grandma’s grocery card “by accident.” 😅
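One way to check that tradeoff is a confusion matrix. A quick sketch using scikit-learn:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predictions: [[TN, FP], [FN, TP]]
# FP = blocked legitimate cards (sorry, Grandma); FN = missed fraud
print(confusion_matrix(y_test, y_pred))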
⚡ Step 3: Boost Your Accuracy with XGBoost#
from xgboost import XGBClassifier
xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    random_state=42,
)
xgb.fit(X_train, y_train)
print(classification_report(y_test, xgb.predict(X_test)))
💡 XGBoost often outperforms the others here: each new tree is trained to correct the errors of the ones before it, like a data-driven perfectionist.
🧩 Step 4: Compare Model Metrics#
| Model | Accuracy | Precision | Recall | AUC |
|---|---|---|---|---|
| Decision Tree | ~0.91 | 0.75 | 0.68 | 0.84 |
| Random Forest | ~0.96 | 0.88 | 0.85 | 0.94 |
| XGBoost | ~0.97 | 0.91 | 0.89 | 0.96 |
(Your results may vary, depending on data and random seed.)
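Want to reproduce a table like this yourself? A sketch that also trains the decision tree (not shown above) and scores all three models on the same test split; note that AUC needs predicted probabilities rather than hard labels:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

for name, model in [("Decision Tree", dt), ("Random Forest", rf), ("XGBoost", xgb)]:
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the fraud class
    print(f"{name}: acc={accuracy_score(y_test, pred):.2f} "
          f"prec={precision_score(y_test, pred):.2f} "
          f"rec={recall_score(y_test, pred):.2f} "
          f"auc={roc_auc_score(y_test, proba):.2f}")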
🎓 Lesson: Ensembles > Lone Trees. Teamwork makes the fraud work (less).
🌟 Step 5: Plot Feature Importance#
import matplotlib.pyplot as plt
from xgboost import plot_importance
plot_importance(xgb, importance_type='gain', title='Feature Importance (XGBoost)')  # 'gain' ranks features by average loss reduction when used in a split
plt.show()
Check which features are most suspiciously helpful. Typical top suspects:
- `transaction_amount` 💰
- `time_of_day` ⏰
- `country` 🌍
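Random forests expose the same idea through the `feature_importances_` attribute. A quick sketch for comparison, reusing the model from Step 2:

# Rank features by the random forest's impurity-based importances
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
importances.plot(kind="barh", title="Feature Importance (Random Forest)")
plt.show()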
🧠 Bonus Challenge: Real-World Twist#
Fraud data is imbalanced — only ~1% of cases are fraud. Try:
- Using `class_weight='balanced'` in the scikit-learn models
- Using SMOTE to oversample the minority (fraud) class
- Tweaking XGBoost's `scale_pos_weight` (see the sketch below)
Because your model shouldn’t panic every time someone buys something after midnight 🌙
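Here's the `scale_pos_weight` idea as a minimal sketch: weight the rare fraud class by the negative-to-positive ratio (a common heuristic), reusing the hyperparameters from Step 3:

# Ratio of legitimate to fraudulent transactions in the training set
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()

xgb_balanced = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    scale_pos_weight=neg / pos,  # upweight the rare positive (fraud) class
    random_state=42,
)
xgb_balanced.fit(X_train, y_train)
print(classification_report(y_test, xgb_balanced.predict(X_test)))

Expect recall on the fraud class to rise, usually at some cost in precision; check whether that tradeoff suits TrustBank™.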
🏁 Summary#
You just:
Built 3 fraud detection models
Compared ensemble performance
Interpreted feature importances
And learned that banks and forests have one thing in common: lots of branches. 🌳💳
🎓 What’s Next#
In the next chapter, we move from structured data to high-dimensional adventures — because the world isn’t just tables… it’s images, text, and chaos. 😎
👉 Get ready for: Neural Networks & Deep Learning Foundations 🧠🔥