Lab – Synthetic Dataset Generation
“Because sometimes you need data… and your boss says, ‘Just make it up.’” 😅
🎯 Objective
In this lab, you’ll create synthetic business data — safely, responsibly, and stylishly — using PyTorch, Faker, and SDV (Synthetic Data Vault).
You’ll generate fake customers, transactions, and behaviors that look real enough to train models — without risking a GDPR horror story. 👻
🧰 1. Setup
Before we start conjuring fake humans and their imaginary spending habits:
pip install torch pandas faker sdv
That’s it — no GPU required, unless you want your fake data to have six-pack abs. 💪
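SDV's import paths changed between its 0.x and 1.x releases, so it helps to know which versions pip actually installed before running the later steps. A minimal sketch using only the standard library; the helper name `installed_versions` is made up for this lab:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The four packages from the pip command above.
print(installed_versions(["torch", "pandas", "faker", "sdv"]))
```

Any `None` in the output means that package still needs installing.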
🧑‍💻 2. Generate Simple Fake Customers (Warm-Up)
Let’s start small — 5 customers who could totally exist, but don’t.
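from faker import Faker
import pandas as pd
import torch

fake = Faker()

# Build one dict per customer, then turn the list into a DataFrame.
data = []
for _ in range(5):
    data.append({
        "customer_id": fake.uuid4(),
        "name": fake.name(),
        # torch.randint excludes the upper bound, so ages run from 18 to 64.
        "age": torch.randint(18, 65, (1,)).item(),
        "country": fake.country(),
        # One normal draw with mean 50,000 and standard deviation 15,000.
        "annual_spend": round(torch.normal(50000.0, 15000.0, (1,)).item(), 2),
    })

df = pd.DataFrame(data)
print(df.head())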
🎉 Congrats! You’ve just created a miniature fake economy. Somewhere, a data privacy lawyer just exhaled in relief.
🧠 3. Scaling It Up: The SDV Way
When your boss says:
“We need 50,000 records. By lunch.”
That’s when you bring in SDV, the Synthetic Data Vault, which began at MIT and is now maintained by DataCebo. 🏦
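# Written against the SDV 1.x API (sdv.single_table); older 0.x releases used sdv.tabular.
from sdv.datasets.demo import download_demo
from sdv.single_table import CTGANSynthesizer

# Use a demo dataset first (downloaded on first use, so this needs network access).
real_data, metadata = download_demo(
    modality="single_table", dataset_name="fake_hotel_guests"
)

model = CTGANSynthesizer(metadata)
model.fit(real_data)

synthetic_data = model.sample(num_rows=5000)
print(synthetic_data.head())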
Boom. You’ve now synthesized 5,000 data points with realistic statistical patterns. Your models can train, your privacy team can sleep, and your GPU can rest. 😴
🧩 4. A Custom Example – E-Commerce Transactions
Let’s simulate some transaction data that feels like it belongs in a real company.
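transactions = pd.DataFrame({
    # Ages 18 to 69 (the upper bound of torch.randint is exclusive).
    "customer_age": torch.randint(18, 70, (1000,)).tolist(),
    # abs() folds the rare negative draws back to positive cart values.
    "cart_value": torch.normal(80.0, 30.0, (1000,)).abs().tolist(),
    "category": [fake.random_element(["Electronics", "Fashion", "Books", "Beauty"]) for _ in range(1000)],
    # 0 = full price, 1 = discounted.
    "is_discounted": torch.randint(0, 2, (1000,)).tolist(),
})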
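Before handing a table like this to a synthesizer, it is worth checking that the columns hold what you think they hold. The sketch below rebuilds the same four columns with a seed and asserts their ranges. `make_transactions` and `check_transactions` are hypothetical helpers, and the standard library's `random.Random` stands in for `fake.random_element` so the snippet does not need Faker:

```python
import random

import pandas as pd
import torch

CATEGORIES = ["Electronics", "Fashion", "Books", "Beauty"]


def make_transactions(n_rows, seed=0):
    """Build a seeded transactions table with the same columns as above."""
    gen = torch.Generator().manual_seed(seed)
    rng = random.Random(seed)  # stand-in for fake.random_element
    return pd.DataFrame({
        "customer_age": torch.randint(18, 70, (n_rows,), generator=gen).tolist(),
        "cart_value": torch.normal(80.0, 30.0, (n_rows,), generator=gen).abs().tolist(),
        "category": [rng.choice(CATEGORIES) for _ in range(n_rows)],
        "is_discounted": torch.randint(0, 2, (n_rows,), generator=gen).tolist(),
    })


def check_transactions(df):
    """Raise AssertionError if any column leaves its expected range."""
    assert df["customer_age"].between(18, 69).all()
    assert (df["cart_value"] >= 0).all()
    assert set(df["category"]) <= set(CATEGORIES)
    assert set(df["is_discounted"]) <= {0, 1}


tx = make_transactions(1000)
check_transactions(tx)
print(tx.describe())
```

The same `check_transactions` call can be pointed at a synthetic copy later, which is a cheap way to spot a generator that drifts outside the valid ranges.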
Then we make a synthetic copy using SDV:
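from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# SDV 1.x needs metadata describing each column; let it infer that from the DataFrame.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(transactions)

model = CTGANSynthesizer(metadata)
model.fit(transactions)
fake_tx = model.sample(num_rows=1000)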
Now you can:
- Train recommendation models
- Test churn predictors
- Build dashboards that impress management
…all without touching a single real credit card number. 🪄
📊 5. Validate Your Synthetic Data
You can check how closely your synthetic data matches the data it was trained on using SDV’s evaluation tools:
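from sdv.evaluation.single_table import evaluate_quality
from sdv.metadata import SingleTableMetadata
metadata = SingleTableMetadata()  # re-detected here so this snippet runs on its own
metadata.detect_from_dataframe(transactions)
report = evaluate_quality(transactions, fake_tx, metadata)
print(f"Synthetic data quality score: {report.get_score():.2f}")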
- Score ≈ 1.0 → Excellent imitation
- Score ≈ 0.5 → “It’s learning.”
- Score ≈ 0.0 → Your GAN just invented a new planet
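To get a feel for what goes into a score like this, you can compute two of the underlying ideas by hand. SDV's quality report draws on column-level statistics such as KSComplement (numeric columns) and TVComplement (categorical columns). The functions below are hand-rolled approximations of those two ideas for illustration, not SDV's own code, and the toy columns are invented:

```python
import numpy as np
import pandas as pd


def tv_complement(real, synth):
    """1 minus the total variation distance between two categorical columns."""
    p = pd.Series(real).value_counts(normalize=True)
    q = pd.Series(synth).value_counts(normalize=True)
    p, q = p.align(q, fill_value=0.0)  # give both the same category index
    return 1.0 - 0.5 * float((p - q).abs().sum())


def ks_complement(real, synth):
    """1 minus the two-sample Kolmogorov-Smirnov statistic for numeric columns."""
    real = np.sort(np.asarray(real, dtype=float))
    synth = np.sort(np.asarray(synth, dtype=float))
    grid = np.concatenate([real, synth])
    # Empirical CDFs of both samples, evaluated at every observed value.
    cdf_real = np.searchsorted(real, grid, side="right") / len(real)
    cdf_synth = np.searchsorted(synth, grid, side="right") / len(synth)
    return 1.0 - float(np.max(np.abs(cdf_real - cdf_synth)))


# Toy columns standing in for a real and a synthetic table.
rng = np.random.default_rng(0)
real_cart = rng.normal(80, 30, 1000)
synth_cart = rng.normal(85, 35, 1000)
real_cat = rng.choice(["Books", "Fashion"], size=1000, p=[0.5, 0.5])
synth_cat = rng.choice(["Books", "Fashion"], size=1000, p=[0.7, 0.3])

print(f"cart_value similarity: {ks_complement(real_cart, synth_cart):.2f}")
print(f"category similarity:   {tv_complement(real_cat, synth_cat):.2f}")
```

Both functions return 1.0 for identical columns and 0.0 for columns that share nothing, which is the same direction as the score scale above.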
💡 6. Optional Challenge: The “Bias Fixer”
Real datasets often overrepresent one category (e.g., 90% “Fashion”). Use synthetic generation to rebalance your data automatically.
Hint:
# Draw 200 rows per category (with replacement) so every class is equally represented.
balanced_tx = (
    fake_tx.groupby("category")
    .sample(n=200, replace=True, random_state=0)
    .reset_index(drop=True)
)
Now every category has exactly 200 rows (some repeated, since we sample with replacement). Thanos would be proud. 🧤✨
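To see the rebalancing at work without training anything, here is a sketch on a deliberately lopsided toy table. `toy_tx` is invented for this example and stands in for `fake_tx`; it uses pandas' `DataFrameGroupBy.sample`:

```python
import pandas as pd

# A lopsided toy table: 90% Fashion, as in the scenario above.
toy_tx = pd.DataFrame({
    "category": ["Fashion"] * 90 + ["Books"] * 6 + ["Beauty"] * 4,
    "cart_value": list(range(100)),
})

balanced = (
    toy_tx.groupby("category")
    .sample(n=200, replace=True, random_state=0)
    .reset_index(drop=True)
)

print(balanced["category"].value_counts())
# Resampling cannot invent new rows, so small categories are mostly repeats.
print("distinct rows:", len(balanced.drop_duplicates()))
```

The counts come out equal, but the distinct-row count shows the limit of plain resampling: the rare categories are the same few rows copied many times. That is the gap a trained synthesizer is meant to fill.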
🧾 7. Business Reflection
Why does this matter for your career (and your sanity)?
| Problem | Traditional Fix | Synthetic Fix |
|---|---|---|
| Not enough labeled data | Cry | Generate more |
| Biased dataset | Re-collect data | Auto-balance with GANs |
| Privacy concerns | Hire lawyers | Use SDV |
| Need sandbox for ML | Manual masking | Synthetic simulation |
🧙‍♂️ 8. Summary
- You can generate, validate, and rebalance data easily.
- PyTorch makes randomness fun.
- SDV gives you AI interns that never complain.
- And best of all, no one gets sued. 🎉
💬 Thought Experiment
Imagine you work at a startup with no data yet, but you want to show investors a “working AI model.”
You can literally:
- Generate synthetic customer & purchase data
- Train a model on it
- Demo it live
…and everyone goes, “Wow, they already have traction!” 😏
(Just remember to replace it later with real data when you actually have customers.)
# Your code here