“Because every ML adventure begins with a CSV — and sometimes ends there too.”


🎯 Why This Exists

Welcome, brave data explorer! This section lists all the datasets used throughout Machine Learning for Business — from squeaky-clean demos to real-world messes that make Excel cry.

Each dataset is linked (where possible) and annotated with what it’s good for, what you’ll probably break while using it, and how many nulls you’ll meet along the way.

⚠️ Pro Tip: Real business data doesn’t come labeled, balanced, or documented. These datasets are the polite ones — the real world is chaos with commas.


🧾 Dataset Table of Contents

| Dataset | Public Link(s) | Notes |
|---|---|---|
| churn_data.csv (telco-style churn) | Telco Customer Churn (Kaggle); also an IBM/GitHub mirror (Databricks) | Good match for churn analysis. Rename the download to churn_data.csv. |
| fraud_detection.csv (transaction-fraud dataset) | Synthetic Financial Datasets For Fraud Detection (Kaggle); also Cifer Fraud Detection Dataset AF (Hugging Face) | If you need a Parquet copy, you will have to convert it yourself. |
| market_basket.csv (transactional data) | Public "Market Basket / Association Rules" datasets on Kaggle or UCI (e.g., "Online Retail"). No exact link is listed here; search for "Market Basket dataset retail csv". | Source this one manually. |
| customer_segments.csv (clustering / PCA playground) | Many generic "customer segmentation" datasets exist on Kaggle and UCI. Pick one and rename it accordingly. | Adaptation required. |
| sales_forecasting.csv (monthly product sales) | Any open "time-series retail sales" dataset from a public repo (e.g., Kaggle "Retail Sales Forecasting"), renamed to match. | Adaptation required. |
| clv_dataset.csv, inventory_timeseries.csv, campaign_targeting.csv, feature_drift.csv, capstone_full_dataset/, synthetic_images/, ocr_invoices.zip, huggingface_reviews.csv | These are more specialized, custom datasets. You may need to create or simulate them yourself, or adapt them from publicly available datasets. | Custom work required. |
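
For the "create or simulate them yourself" datasets, a toy generator is often enough to get a notebook running. The sketch below is one possible approach using only the Python standard library. The column names (`customer_id`, `orders`, `avg_order_value`, `total_spend`) and the value ranges are assumptions made up for illustration, not the schema used elsewhere in this book.

```python
import csv
import io
import random


def simulate_customers(n_rows, seed=0):
    """Return toy customer rows as dicts (hypothetical schema, not a real CLV model)."""
    rng = random.Random(seed)  # seeded so the "dataset" is reproducible
    rows = []
    for customer_id in range(1, n_rows + 1):
        orders = rng.randint(1, 20)
        avg_order_value = round(rng.uniform(10.0, 200.0), 2)
        rows.append({
            "customer_id": customer_id,
            "orders": orders,
            "avg_order_value": avg_order_value,
            # Toy target: total spend so far, a stand-in for lifetime value.
            "total_spend": round(orders * avg_order_value, 2),
        })
    return rows


def to_csv_text(rows):
    """Serialize the rows to CSV text (write it to clv_dataset.csv if you like)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = simulate_customers(5, seed=42)
csv_text = to_csv_text(rows)
print(csv_text)
```

Fixing the seed means everyone who runs the sketch gets the same rows, which makes results comparable across machines. Real customer data will be far less tidy than this.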

🧠 Tips for Using the Datasets

  • Back up your work. You will overwrite a CSV eventually.

  • Don’t trust the columns. If you see Unnamed: 0, pretend it’s a feature named “Spreadsheet Chaos Index.”

  • Plot first, panic later. Visualization will tell you where the bodies (outliers) are buried.

  • Version everything. If your model suddenly performs worse, it’s probably your “updated” data.
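
Two of the tips above (distrusting the columns and versioning the data) can be sketched with the standard library alone. The helper names `inspect_csv` and `fingerprint`, the sample rows, and the list of strings treated as "null" are all assumptions for illustration. A leftover index column is assumed to show up either with an empty header or as `Unnamed: <number>`.

```python
import csv
import hashlib
import io
import re

# Matches an empty header or a name like "Unnamed: 0" (a leftover index column).
LEFTOVER_INDEX = re.compile(r"(Unnamed: \d+)?")
NULL_MARKERS = {"", "na", "nan", "null"}  # assumption: adjust to your data


def inspect_csv(csv_text):
    """Drop leftover index columns and count empty cells in the remaining ones."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    keep = [i for i, name in enumerate(header)
            if not LEFTOVER_INDEX.fullmatch(name.strip())]
    columns = [header[i] for i in keep]
    null_counts = {name: 0 for name in columns}
    for row in body:
        for i in keep:
            value = row[i].strip() if i < len(row) else ""  # short rows count as null
            if value.lower() in NULL_MARKERS:
                null_counts[header[i]] += 1
    return columns, null_counts


def fingerprint(csv_text):
    """Short content hash: if it changes, the data changed, whatever the filename says."""
    return hashlib.sha256(csv_text.encode("utf-8")).hexdigest()[:12]


sample = "Unnamed: 0,customer_id,tenure,churn\n0,A1,12,No\n1,A2,,Yes\n2,A3,5,\n"
columns, null_counts = inspect_csv(sample)
print(columns)      # ['customer_id', 'tenure', 'churn']
print(null_counts)  # {'customer_id': 0, 'tenure': 1, 'churn': 1}
print(fingerprint(sample))
```

Logging the fingerprint next to each model run is a cheap way to tell whether a drop in performance lines up with a change in the data.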


💼 Bonus: Real-World Dataset Sources

If you want to go beyond the polite sandbox data:


🧃 Pro-Tip for Business Contexts

When sourcing real business datasets, remember:

  • Always remove personal data (GDPR nightmares aren’t fun).

  • Align features with business KPIs, not just accuracy.

  • Document every transformation — because 6 months later, even you won’t remember why you created customer_value_score_v3_final_final.csv.
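
As a rough illustration of the first and last points, the sketch below drops direct identifiers, replaces the customer ID with a salted hash, and keeps a plain-text log of what was done. The column names in `PII_COLUMNS`, the `scrub` helper, and the sample rows are hypothetical. Note that hashing an ID is pseudonymization rather than anonymization, so treat this as a starting point and not as GDPR compliance.

```python
import hashlib

# Hypothetical list of direct identifiers; adapt it to your own schema.
PII_COLUMNS = {"name", "email", "phone"}


def scrub(rows, id_column, salt):
    """Drop PII columns, pseudonymize the ID column, and log each transformation."""
    cleaned = []
    for row in rows:
        out = {key: value for key, value in row.items() if key not in PII_COLUMNS}
        digest = hashlib.sha256((salt + str(row[id_column])).encode("utf-8")).hexdigest()
        out[id_column] = digest[:16]  # same input ID always maps to the same token
        cleaned.append(out)
    dropped = sorted(PII_COLUMNS & set(rows[0])) if rows else []
    log = [
        f"dropped columns: {dropped}",
        f"replaced {id_column} with salted SHA-256 prefix",
    ]
    return cleaned, log


raw = [
    {"customer_id": "C001", "name": "Ada", "email": "ada@example.com", "monthly_spend": 42.5},
    {"customer_id": "C002", "name": "Bob", "email": "bob@example.com", "monthly_spend": 17.0},
]
cleaned, log = scrub(raw, id_column="customer_id", salt="keep-this-secret")
for line in log:
    print(line)
print(sorted(cleaned[0].keys()))  # ['customer_id', 'monthly_spend']
```

Saving the log alongside the output file gives future you a record of why the cleaned dataset looks the way it does.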


🧩 TL;DR

Every dataset has a story. Most start with “It was supposed to be clean…” and end with “We’ll fix it in preprocessing.”

Keep this index handy — it’s your treasure map through the data jungle. 🗺️
