Lab – Fraud Detection#
Welcome to the final challenge of our forest adventure — where you’ll train a squad of trees to catch fraudsters faster than your bank’s security team.
This is your chance to turn theory into “Whoa, that’s actually useful for business!”
🎯 Business Scenario#
Imagine you work at TrustBank™, where customers occasionally do… interesting things:
$3,000 spent on candles at 2 AM 🕯️
A sudden transfer to “Crypto Banana Inc.” 🍌
Or 17 transactions in 3 seconds from two continents 🌍
Your job? Train an ML model to detect fraudulent transactions before they drain the company’s coffee fund ☕💸
🧠 What You’ll Build#
In this lab, you’ll:
Load a credit card transactions dataset.
Train Decision Tree, Random Forest, and XGBoost models.
Compare accuracy, precision, recall, and AUC.
Plot feature importances to see which variables catch the bad guys.
🧰 Setup#
You can run this notebook:
🧩 In your browser: via JupyterLite (top-right menu)
☁️ In the cloud: Open in Google Colab
💾 Or download the notebook and run locally.
📦 Step 1: Load & Inspect the Data#
import pandas as pd

# Load the credit card transactions dataset straight from GitHub
url = "https://raw.githubusercontent.com/mlbookcamp/code/master/data/card_fraud.csv"
df = pd.read_csv(url)
df.head()
👀 Look for suspicious clues in the columns: `amount`, `time`, `country`, `is_fraud`. Any patterns jump out?
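If you're not sure where to start, two quick checks are the class balance and the summary statistics. A minimal sketch, assuming `is_fraud` is a 0/1 label:

# What fraction of transactions are fraudulent? (Spoiler: very few.)
print(df["is_fraud"].value_counts(normalize=True))

# Summary statistics can surface oddities, e.g. extreme amounts
print(df.describe())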
🌳 Step 2: Train a Random Forest#
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X = df.drop("is_fraud", axis=1)
y = df["is_fraud"]
X = pd.get_dummies(X)  # one-hot encode categorical columns (e.g. country); tree models can't consume raw strings
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42  # stratify keeps the fraud ratio equal in both splits
)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
🎯 Check if your model catches more frauds than false alarms. Because no one likes blocking Grandma’s grocery card “by accident.” 😅
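One way to check that tradeoff is a confusion matrix. A quick sketch using scikit-learn:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predictions: [[TN, FP], [FN, TP]]
# FP = blocked legitimate cards (sorry, Grandma); FN = missed fraud
print(confusion_matrix(y_test, y_pred))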
⚡ Step 3: Boost Your Accuracy with XGBoost#
from xgboost import XGBClassifier
xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    random_state=42,
)
xgb.fit(X_train, y_train)
print(classification_report(y_test, xgb.predict(X_test)))
💡 XGBoost often outperforms the others here: each new tree is trained to correct the errors of the ones before it, like a data-driven perfectionist.
🧩 Step 4: Compare Model Metrics#
| Model | Accuracy | Precision | Recall | AUC |
|---|---|---|---|---|
| Decision Tree | ~0.91 | 0.75 | 0.68 | 0.84 |
| Random Forest | ~0.96 | 0.88 | 0.85 | 0.94 |
| XGBoost | ~0.97 | 0.91 | 0.89 | 0.96 |
(Your results may vary, depending on data and random seed.)
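Want to reproduce a table like this yourself? A sketch that also trains the decision tree (not shown above) and scores all three models on the same test split; note that AUC needs predicted probabilities rather than hard labels:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

dt = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

for name, model in [("Decision Tree", dt), ("Random Forest", rf), ("XGBoost", xgb)]:
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of the fraud class
    print(f"{name}: acc={accuracy_score(y_test, pred):.2f} "
          f"prec={precision_score(y_test, pred):.2f} "
          f"rec={recall_score(y_test, pred):.2f} "
          f"auc={roc_auc_score(y_test, proba):.2f}")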
🎓 Lesson: Ensembles > Lone Trees. Teamwork makes the fraud work (less).
🌟 Step 5: Plot Feature Importance#
import matplotlib.pyplot as plt
from xgboost import plot_importance
plot_importance(xgb, importance_type='gain', title='Feature Importance (XGBoost)')  # 'gain' ranks features by average loss reduction when used in a split
plt.show()
Check which features are most suspiciously helpful. Typical top suspects:
- `transaction_amount` 💰
- `time_of_day` ⏰
- `country` 🌍
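Random forests expose the same idea through the `feature_importances_` attribute. A quick sketch for comparison, reusing the model from Step 2:

# Rank features by the random forest's impurity-based importances
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
importances.plot(kind="barh", title="Feature Importance (Random Forest)")
plt.show()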
🧠 Bonus Challenge: Real-World Twist#
Fraud data is imbalanced — only ~1% of cases are fraud. Try:
- Using `class_weight='balanced'` in the scikit-learn models
- Using SMOTE to oversample the minority (fraud) class
- Tweaking XGBoost's `scale_pos_weight` (see the sketch below)
Because your model shouldn’t panic every time someone buys something after midnight 🌙
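Here's the `scale_pos_weight` idea as a minimal sketch: weight the rare fraud class by the negative-to-positive ratio (a common heuristic), reusing the hyperparameters from Step 3:

# Ratio of legitimate to fraudulent transactions in the training set
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()

xgb_balanced = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    scale_pos_weight=neg / pos,  # upweight the rare positive (fraud) class
    random_state=42,
)
xgb_balanced.fit(X_train, y_train)
print(classification_report(y_test, xgb_balanced.predict(X_test)))

Expect recall on the fraud class to rise, usually at some cost in precision; check whether that tradeoff suits TrustBank™.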
🏁 Summary#
You just:
Built 3 fraud detection models
Compared ensemble performance
Interpreted feature importances
And learned that banks and forests have one thing in common: lots of branches. 🌳💳
🎓 What’s Next#
In the next chapter, we move from structured data to high-dimensional adventures — because the world isn’t just tables… it’s images, text, and chaos. 😎
👉 Get ready for: Neural Networks & Deep Learning Foundations 🧠🔥