Lab – Synthetic Dataset Generation

“Because sometimes you need data… and your boss says, ‘Just make it up.’” 😅


🎯 Objective

In this lab, you’ll create synthetic business data — safely, responsibly, and stylishly — using PyTorch, Faker, and SDV (Synthetic Data Vault).

You’ll generate fake customers, transactions, and behaviors that look real enough to train models — without risking a GDPR horror story. 👻


🧰 1. Setup

Before we start conjuring fake humans and their imaginary spending habits:

pip install torch pandas faker sdv

That’s it — no GPU required, unless you want your fake data to have six-pack abs. 💪


🧑‍💻 2. Generate Simple Fake Customers (Warm-Up)

Let’s start small — 5 customers who could totally exist, but don’t.

from faker import Faker
import pandas as pd
import torch

fake = Faker()

data = []
for _ in range(5):
    data.append({
        "customer_id": fake.uuid4(),
        "name": fake.name(),
        "age": torch.randint(18, 65, (1,)).item(),  # upper bound is exclusive: ages 18 to 64
        "country": fake.country(),
        "annual_spend": round(torch.normal(50000, 15000, (1,)).item(), 2)
    })

df = pd.DataFrame(data)
print(df.head())

🎉 Congrats! You’ve just created a miniature fake economy. Somewhere, a data privacy lawyer just exhaled in relief.


🧠 3. Scaling It Up: The SDV Way

When your boss says:

“We need 50,000 records. By lunch.”

That’s when you bring in SDV, the Synthetic Data Vault, a library that started life at MIT. 🏦

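from sdv.datasets.demo import download_demo
from sdv.single_table import CTGANSynthesizer

# Use a demo dataset first (downloads a small public table plus its metadata)
real_data, metadata = download_demo(
    modality="single_table",
    dataset_name="fake_hotel_guests",
)

# SDV 1.x API; releases before 1.0 exposed this model as sdv.tabular.CTGAN
model = CTGANSynthesizer(metadata)
model.fit(real_data)

synthetic_data = model.sample(num_rows=5000)
print(synthetic_data.head())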

Boom. You’ve now synthesized 5,000 data points with realistic statistical patterns. Your models can train, your privacy team can sleep, and your GPU can rest. 😴


🧩 4. A Custom Example – E-Commerce Transactions

Let’s simulate some transaction data that feels like it belongs in a real company.

transactions = pd.DataFrame({
    "customer_age": torch.randint(18, 70, (1000,)).tolist(),
    "cart_value": torch.normal(80, 30, (1000,)).abs().tolist(),
    "category": [fake.random_element(["Electronics", "Fashion", "Books", "Beauty"]) for _ in range(1000)],
    "is_discounted": torch.randint(0, 2, (1000,)).tolist()
})

Then we make a synthetic copy using SDV:

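from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# SDV 1.x needs metadata describing the columns; detect it from the DataFrame
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(transactions)

model = CTGANSynthesizer(metadata)
model.fit(transactions)
fake_tx = model.sample(num_rows=1000)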

Now you can:

  • Train recommendation models

  • Test churn predictors

  • Build dashboards that impress management

…all without touching a single real credit card number. 🪄


📊 5. Validate Your Synthetic Data

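You can check how closely the synthetic copy matches the table it was trained on using SDV’s quality report:

from sdv.evaluation.single_table import evaluate_quality
from sdv.metadata import SingleTableMetadata

# SDV 1.x API; the report needs metadata describing the columns
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(transactions)
report = evaluate_quality(transactions, fake_tx, metadata)
print(f"Synthetic data quality score: {report.get_score():.2f}")

  • Score ≈ 1.0 → Excellent imitation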

  • Score ≈ 0.5 → “It’s learning.”

  • Score ≈ 0.0 → Your GAN just invented a new planet
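A single score hides which column went wrong. As a complement, here is a rough hand-rolled check using only pandas. The function name `compare_columns` and the two checks (gap in means for numeric columns, total variation distance for categorical ones) are illustrative choices, not what SDV computes internally:

```python
import pandas as pd


def compare_columns(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Per-column gap between two tables that share the same columns."""
    rows = []
    for column in real.columns:
        if pd.api.types.is_numeric_dtype(real[column]):
            # Numeric column: absolute difference between the two means
            gap = abs(real[column].mean() - synthetic[column].mean())
            rows.append({"column": column, "check": "mean_gap", "value": float(gap)})
        else:
            # Categorical column: total variation distance between category shares
            real_share = real[column].value_counts(normalize=True)
            synth_share = synthetic[column].value_counts(normalize=True)
            tvd = real_share.subtract(synth_share, fill_value=0).abs().sum() / 2
            rows.append({"column": column, "check": "tv_distance", "value": float(tvd)})
    return pd.DataFrame(rows)


# Tiny made-up tables, just to show the shape of the output
real = pd.DataFrame({"cart_value": [10.0, 20.0, 30.0],
                     "category": ["Books", "Books", "Fashion"]})
synthetic = pd.DataFrame({"cart_value": [12.0, 22.0, 32.0],
                          "category": ["Books", "Fashion", "Fashion"]})
result = compare_columns(real, synthetic)
print(result)
```

Values near zero mean the two tables agree on that column. In the lab you would pass `transactions` and `fake_tx` instead of the toy tables.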


💡 6. Optional Challenge: The “Bias Fixer”

Real datasets often overrepresent one category (e.g., 90% “Fashion”). Use synthetic generation to rebalance your data automatically.

Hint:

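# Draw 200 rows per category; replace=True lets rare categories reach 200 by repeating rows
balanced_tx = (
    fake_tx.groupby("category")
    .sample(n=200, replace=True)
    .reset_index(drop=True)
)

Now every category has exactly 200 rows, though rare categories get there by repeating rows. Thanos would be proud. 🧤✨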
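It is worth confirming that the rebalancing did what you expect, and seeing what it cost. A small sketch on a deliberately skewed toy table, assuming pandas 1.1 or newer for `groupby(...).sample`; the helper name `rebalance` is made up for this example:

```python
import pandas as pd


def rebalance(df: pd.DataFrame, column: str, n_per_class: int, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: draw the same number of rows from every class."""
    return (
        df.groupby(column)
        .sample(n=n_per_class, replace=True, random_state=seed)
        .reset_index(drop=True)
    )


# Toy data: 90% "Fashion", 10% "Books"
skewed = pd.DataFrame({
    "category": ["Fashion"] * 90 + ["Books"] * 10,
    "cart_value": range(100),
})

balanced = rebalance(skewed, "category", n_per_class=50)
counts = balanced["category"].value_counts()
duplicates = int(balanced.duplicated().sum())

print(counts)
print(f"{duplicates} duplicated rows")
```

The class counts come out equal, but the 10 "Books" rows had to be reused to fill 50 slots. If those duplicates bother you, asking the synthesizer for fresh rows per category is the alternative to resampling.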


🧾 7. Business Reflection

Why does this matter for your career (and your sanity)?

| Problem                 | Traditional Fix | Synthetic Fix          |
|-------------------------|-----------------|------------------------|
| Not enough labeled data | Cry             | Generate more          |
| Biased dataset          | Re-collect data | Auto-balance with GANs |
| Privacy concerns        | Hire lawyers    | Use SDV                |
| Need sandbox for ML     | Manual masking  | Synthetic simulation   |


🧙‍♂️ 8. Summary

  • You can generate, validate, and rebalance data easily.

  • PyTorch makes randomness fun.

  • SDV gives you AI interns that never complain.

  • And best of all — no one gets sued. 🎉


💬 Thought Experiment

Imagine you work at a startup with no data yet, but you want to show investors a “working AI model.”

You can literally:

  1. Generate synthetic customer & purchase data

  2. Train a model on it

  3. Demo it live

…and everyone goes, “Wow, they already have traction!” 😏

(Just remember to replace it later with real data when you actually have customers.)
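The three steps above can be sketched in a few lines of PyTorch. Everything here is an assumption made for illustration: the feature names, the made-up rule that decides who "purchases", and the choice of a plain logistic regression. The model can only rediscover the rule we invented, which is exactly why the real data has to replace it later.

```python
import torch

torch.manual_seed(0)  # fixed seed so the demo is repeatable

# 1. Generate synthetic customer and purchase data
n = 2000
age = torch.randint(18, 70, (n,)).float()
cart_value = torch.normal(80.0, 30.0, (n,)).abs()
is_discounted = torch.randint(0, 2, (n,)).float()

# Hypothetical ground truth: discounts and cheaper carts make a purchase more likely
logit = 1.5 * is_discounted - 0.03 * (cart_value - 80.0)
purchased = (torch.rand(n) < torch.sigmoid(logit)).float()

# 2. Train a logistic regression on standardized features
X = torch.stack([age, cart_value, is_discounted], dim=1)
X = (X - X.mean(dim=0)) / X.std(dim=0)
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = torch.nn.BCEWithLogitsLoss()

for _ in range(300):
    optimizer.zero_grad()
    loss = loss_fn(model(X).squeeze(1), purchased)
    loss.backward()
    optimizer.step()

# 3. "Demo" it: accuracy on the training data, so treat it as optimistic
with torch.no_grad():
    predictions = (model(X).squeeze(1) > 0).float()
    accuracy = (predictions == purchased).float().mean().item()

weights = model.weight.detach().squeeze(0)  # order: age, cart_value, is_discounted
print(f"Accuracy: {accuracy:.2f}")
print(f"Learned weights: {weights.tolist()}")
```

The learned weights should point the same way as the invented rule (positive for `is_discounted`, negative for `cart_value`), which is a handy sanity check that the pipeline works end to end before any real customers exist.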

# Your code here