Exercises¶
Take the messy dataset in the Practice Corner and write a small function
clean_business_df(df)that applies the cleaning steps used above.Modify the worked example to compute total revenue by
Regionand plot a horizontal bar chart suitable for a slide.(Challenge) Create a small function that detects columns with inconsistent capitalization (e.g., ‘south’ vs ‘South’) and standardizes them.
Hints:
- Use `df.groupby('Region')['Revenue'].sum()` for aggregation
- Use `str.strip().str.title()` to standardize text
Data Loading, Wrangling & Visualization¶
Practical patterns for cleaning messy business data and turning it into persuasive charts. This chapter gives runnable recipes you can reuse in analysis and reporting.
Learning goals
Load small CSV-like sources into
pandasand inspect structure.Clean common problems (missing values, wrong types, mis-typed categories).
Produce a simple summary chart suitable for business storytelling.
Data Loading, Wrangling & Visualisation¶
Preparing Real-World Data for Analysis, Machine Learning, and Business Decisions¶
Learning Roadmap¶
This notebook keeps the full original chapter overview and adds a clearer study path around it.
Suggested reading order
Start with this chapter overview.
Move to data loading and database notebooks.
Study cleaning, missing data, joins, and aggregations.
Finish with visualisation and dashboards.
Welcome to the most honest part of Machine Learning — data wrangling, also known as “90% of the job no one posts about on LinkedIn.” 😅
If math was theory, this chapter is practice with mud. You’ll roll up your sleeves, clean messy data, and make it look like something a CEO would actually want to see in a dashboard.
🧠 Why This Matters¶
Machine Learning models are like gourmet chefs — they can only make good predictions if you give them clean ingredients.
Unfortunately, business data often looks like this:
| Customer | Age | Revenue | Gender | Notes |
|---|---|---|---|---|
| A-102 | 27 | $2,000 | F | missing |
| NaN | $500 | ? | typo | |
| C-554 | 45 | -$200 | Male | refund |
| D-999 | 300 | $1,000 | cat | who let this happen |
So before we even think about algorithms, we’ll:
Load data from messy sources.
Clean it like digital laundry.
Transform it into model-ready features.
Visualize it like a storytelling pro.
💾 1. The Data Wrangling Trifecta¶
| Step | Name | Business Goal |
|---|---|---|
| Data Loading | Get data into Python | “Where’s my Excel file again?” |
| Data Cleaning | Fix mistakes & missing values | “Why is revenue negative?” |
| Feature Engineering | Add useful variables | “Let’s create a loyalty score!” |
By the end of this chapter, you’ll make data look so clean it could get a job at McKinsey.
📚 Prerequisite: Python Refresher¶
If you’re new to Python or Pandas, don’t panic — it’s easier than assembling IKEA furniture. 👉 Check out my other book: 📘 Programming for Business It covers everything from reading files to basic Python data manipulation.
💡 Tip: You’ll be using libraries like
pandas,numpy, andmatplotlib. If these look like Pokémon names right now, that book is your Pokédex.
🧩 Practice Corner: “Guess the Data Disaster”¶
Match each messy situation with the tool that saves the day:
| Situation | Tool |
|---|---|
| File is 200MB Excel sheet with multiple tabs | pandas.read_excel() |
| Missing values everywhere | df.fillna() or df.dropna() |
| Categorical columns like “Yes/No” | pd.get_dummies() |
| Data stored in a SQL database | pandas.read_sql() |
| REST API providing JSON data | requests.get() |
✅ Pro tip: Pandas is your Swiss Army Knife for data chaos.
🔍 2. Why Businesses Love Clean Data¶
Messy data → Confused analysts → Wrong dashboards → Angry executives. Clean data → Confident models → Actionable insights → Happy bonuses. 🎉
You’ll soon realize:
Data cleaning is not boring — it’s debugging reality.
For example:
Missing age? → Estimate with median.
Wrong gender field? → Normalize text values.
Negative revenue? → Check for refunds.
Timestamp errors? → Convert to datetime.
You’re not just fixing numbers — you’re restoring business logic.
🎨 3. Visualisation: Turning Data into Business Art¶
Once you’ve tamed the chaos, it’s time to make your data pretty and persuasive.
This section covers:
Histograms that show sales trends 📊
Scatter plots revealing marketing ROI 💸
Correlation heatmaps for KPIs 🔥
Dashboards that make execs say “wow” ✨
Remember: “If it’s not visualized, it didn’t happen.” — Every Data Scientist, ever.
💬 4. Business Analogy: The Data Spa¶
Think of your dataset like a customer entering a spa:
| Step | Data Action | Spa Equivalent |
|---|---|---|
| Loading | Getting checked in | “Welcome, Mr. CSV!” |
| Cleaning | Removing noise & junk | Exfoliation time 🧼 |
| Transformation | Standardizing features | Facial mask & makeover 💅 |
| Visualization | Presenting results | Walking the runway 🕺 |
When your data leaves this spa, it’s ready for the runway — or your next board meeting.
🧩 Practice Corner: “Wrangle This!”¶
Here’s a messy dataset in Python. Try cleaning it up using what you’ll learn in this chapter:
import pandas as pd
data = {
'Customer': ['A1', 'A2', 'A3', 'A4', None],
'Age': [25, None, 300, 40, 32],
'Revenue': [2000, 500, -100, None, 1500],
'Gender': ['M', '?', 'F', 'F', 'unknown']
}
df = pd.DataFrame(data)
print("Original Messy Data:")
print(df)🧽 Challenge:
Replace
Noneand?with proper valuesFix negative revenue
Correct impossible ages
Print the clean version
🧭 5. What’s Coming Up¶
| File | Topic | Funny Summary |
|---|---|---|
| data_loading | Loading data from CSV, Excel, SQL & APIs | “The Great Data Buffet” 🍽️ |
| data_cleaning | Cleaning & preprocessing | “Digital Laundry Day” 🧺 |
| handling_missing_outliers | Fixing missing data & outliers | “CSI: Data Edition” 🕵️ |
| feature_encoding | Encoding categories & scaling features | “Teaching Machines English” 🗣️ |
| eda | Exploratory Data Analysis | “Detective Work with Graphs” 🧠 |
| visualisation | Making plots & charts | “Turning KPIs into art” 🎨 |
| business_dashboards | Interactive dashboards | “Your Data’s TED Talk” 🧑💼 |
🚀 Summary¶
✅ Data wrangling = preparing the battlefield for ML ✅ Visualization = storytelling for business impact ✅ Clean data = clean insights ✅ Dirty data = bad decisions (and maybe a career pivot)
Remember: “Garbage in → Garbage out” — but in business, garbage often comes with formatting errors.
🔜 Next Stop¶
👉 Head to Data Loading (CSV, Excel, SQL, APIs) to learn how to bring all your data under one roof — without crying over file formats.
import pandas as pd
data = {
"Customer": ["A-102", None, "C-554", "D-999"],
"Age": [27, None, 45, 300],
"Revenue": [2000, 500, -200, 1000],
"Region": ["West", "South", "West", "East"],
}
wrangling_df = pd.DataFrame(data)
print("Original business data")
print(wrangling_df)
clean_df = wrangling_df.copy()
clean_df["Customer"] = clean_df["Customer"].fillna("Unknown")
clean_df.loc[clean_df["Age"] > 100, "Age"] = pd.NA
clean_df["Age"] = clean_df["Age"].fillna(clean_df["Age"].median())
clean_df["Revenue"] = clean_df["Revenue"].clip(lower=0)
print("\nCleaned business data")
print(clean_df)Original business data
Customer Age Revenue Region
0 A-102 27.0 2000 West
1 None NaN 500 South
2 C-554 45.0 -200 West
3 D-999 300.0 1000 East
Cleaned business data
Customer Age Revenue Region
0 A-102 27.0 2000 West
1 Unknown 36.0 500 South
2 C-554 45.0 0 West
3 D-999 36.0 1000 East
Chapter Wrap-Up¶
Your original chapter overview already frames the business reality of messy data well. Keep that mindset while moving through the child notebooks: every technique in this chapter exists to make downstream analysis and machine learning more trustworthy.
Quick recap
loading gets data into Python from operational systems
wrangling fixes structure, types, and quality problems
visualisation turns cleaned data into decisions and stories
business value comes from reliable, explainable data workflows
Reflection Questions¶
Which data-quality issue causes the biggest business risk in your domain?
Where in your workflow do data errors usually enter the system?
Which visualisation would best persuade a non-technical stakeholder?
8. Interactive Code¶
Expected output
[120, 145, 138]Expected output
['North', 'South', 'West']