Exercises¶

Take the messy dataset in the Practice Corner and write a small function clean_business_df(df) that applies the cleaning steps used above.
Modify the worked example to compute total revenue by Region and plot a horizontal bar chart suitable for a slide.
(Challenge) Create a small function that detects columns with inconsistent capitalization (e.g., ‘south’ vs ‘South’) and standardizes them.

Hints:
- Use `df.groupby('Region')['Revenue'].sum()` for aggregation
- Use `str.strip().str.title()` to standardize text

Data Loading, Wrangling & Visualization¶

Practical patterns for cleaning messy business data and turning it into persuasive charts. This chapter gives runnable recipes you can reuse in analysis and reporting.

Learning goals

Load small CSV-like sources into pandas and inspect structure.
Clean common problems (missing values, wrong types, mis-typed categories).
Produce a simple summary chart suitable for business storytelling.

Data Loading, Wrangling & Visualisation¶

Preparing Real-World Data for Analysis, Machine Learning, and Business Decisions¶

Learning Roadmap¶

This notebook keeps the full original chapter overview and adds a clearer study path around it.

Welcome to the most honest part of Machine Learning — data wrangling, also known as “90% of the job no one posts about on LinkedIn.” 😅

If math was theory, this chapter is practice with mud. You’ll roll up your sleeves, clean messy data, and make it look like something a CEO would actually want to see in a dashboard.

🧠 Why This Matters¶

Machine Learning models are like gourmet chefs — they can only make good predictions if you give them clean ingredients.

Unfortunately, business data often looks like this:

Customer	Age	Revenue	Gender	Notes
A-102	27	$2,000	F	missing
	NaN	$500	?	typo
C-554	45	-$200	Male	refund
D-999	300	$1,000	cat	who let this happen

So before we even think about algorithms, we’ll:

Load data from messy sources.
Clean it like digital laundry.
Transform it into model-ready features.
Visualize it like a storytelling pro.

💾 1. The Data Wrangling Trifecta¶

Step	Name	Business Goal
Data Loading	Get data into Python	“Where’s my Excel file again?”
Data Cleaning	Fix mistakes & missing values	“Why is revenue negative?”
Feature Engineering	Add useful variables	“Let’s create a loyalty score!”

By the end of this chapter, you’ll make data look so clean it could get a job at McKinsey.

📚 Prerequisite: Python Refresher¶

If you’re new to Python or Pandas, don’t panic — it’s easier than assembling IKEA furniture. 👉 Check out my other book: 📘 Programming for Business It covers everything from reading files to basic Python data manipulation.

💡 Tip: You’ll be using libraries like pandas, numpy, and matplotlib. If these look like Pokémon names right now, that book is your Pokédex.

🧩 Practice Corner: “Guess the Data Disaster”¶

Match each messy situation with the tool that saves the day:

Situation	Tool
File is 200MB Excel sheet with multiple tabs	`pandas.read_excel()`
Missing values everywhere	`df.fillna()` or `df.dropna()`
Categorical columns like “Yes/No”	`pd.get_dummies()`
Data stored in a SQL database	`pandas.read_sql()`
REST API providing JSON data	`requests.get()`

✅ Pro tip: Pandas is your Swiss Army Knife for data chaos.

🔍 2. Why Businesses Love Clean Data¶

Messy data → Confused analysts → Wrong dashboards → Angry executives. Clean data → Confident models → Actionable insights → Happy bonuses. 🎉

You’ll soon realize:

Data cleaning is not boring — it’s debugging reality.

For example:

Missing age? → Estimate with median.
Wrong gender field? → Normalize text values.
Negative revenue? → Check for refunds.
Timestamp errors? → Convert to datetime.

You’re not just fixing numbers — you’re restoring business logic.

🎨 3. Visualisation: Turning Data into Business Art¶

Once you’ve tamed the chaos, it’s time to make your data pretty and persuasive.

This section covers:

Histograms that show sales trends 📊
Scatter plots revealing marketing ROI 💸
Correlation heatmaps for KPIs 🔥
Dashboards that make execs say “wow” ✨

Remember: “If it’s not visualized, it didn’t happen.” — Every Data Scientist, ever.

💬 4. Business Analogy: The Data Spa¶

Think of your dataset like a customer entering a spa:

Step	Data Action	Spa Equivalent
Loading	Getting checked in	“Welcome, Mr. CSV!”
Cleaning	Removing noise & junk	Exfoliation time 🧼
Transformation	Standardizing features	Facial mask & makeover 💅
Visualization	Presenting results	Walking the runway 🕺

When your data leaves this spa, it’s ready for the runway — or your next board meeting.

🧩 Practice Corner: “Wrangle This!”¶

Here’s a messy dataset in Python. Try cleaning it up using what you’ll learn in this chapter:

import pandas as pd

data = {
    'Customer': ['A1', 'A2', 'A3', 'A4', None],
    'Age': [25, None, 300, 40, 32],
    'Revenue': [2000, 500, -100, None, 1500],
    'Gender': ['M', '?', 'F', 'F', 'unknown']
}

df = pd.DataFrame(data)
print("Original Messy Data:")
print(df)

🧽 Challenge:
Replace None and ? with proper values
Fix negative revenue
Correct impossible ages
Print the clean version

🧭 5. What’s Coming Up¶

File	Topic	Funny Summary
data_loading	Loading data from CSV, Excel, SQL & APIs	“The Great Data Buffet” 🍽️
data_cleaning	Cleaning & preprocessing	“Digital Laundry Day” 🧺
handling_missing_outliers	Fixing missing data & outliers	“CSI: Data Edition” 🕵️
feature_encoding	Encoding categories & scaling features	“Teaching Machines English” 🗣️
eda	Exploratory Data Analysis	“Detective Work with Graphs” 🧠
visualisation	Making plots & charts	“Turning KPIs into art” 🎨
business_dashboards	Interactive dashboards	“Your Data’s TED Talk” 🧑‍💼

🚀 Summary¶

✅ Data wrangling = preparing the battlefield for ML ✅ Visualization = storytelling for business impact ✅ Clean data = clean insights ✅ Dirty data = bad decisions (and maybe a career pivot)

Remember: “Garbage in → Garbage out” — but in business, garbage often comes with formatting errors.

🔜 Next Stop¶

👉 Head to Data Loading (CSV, Excel, SQL, APIs) to learn how to bring all your data under one roof — without crying over file formats.

import pandas as pd

data = {
    "Customer": ["A-102", None, "C-554", "D-999"],
    "Age": [27, None, 45, 300],
    "Revenue": [2000, 500, -200, 1000],
    "Region": ["West", "South", "West", "East"],
}

wrangling_df = pd.DataFrame(data)
print("Original business data")
print(wrangling_df)

clean_df = wrangling_df.copy()
clean_df["Customer"] = clean_df["Customer"].fillna("Unknown")
clean_df.loc[clean_df["Age"] > 100, "Age"] = pd.NA
clean_df["Age"] = clean_df["Age"].fillna(clean_df["Age"].median())
clean_df["Revenue"] = clean_df["Revenue"].clip(lower=0)

print("\nCleaned business data")
print(clean_df)

Original business data
  Customer    Age  Revenue Region
0    A-102   27.0     2000   West
1     None    NaN      500  South
2    C-554   45.0     -200   West
3    D-999  300.0     1000   East

Cleaned business data
  Customer   Age  Revenue Region
0    A-102  27.0     2000   West
1  Unknown  36.0      500  South
2    C-554  45.0        0   West
3    D-999  36.0     1000   East

Chapter Wrap-Up¶

Your original chapter overview already frames the business reality of messy data well. Keep that mindset while moving through the child notebooks: every technique in this chapter exists to make downstream analysis and machine learning more trustworthy.

Reflection Questions¶

Which data-quality issue causes the biggest business risk in your domain?
Where in your workflow do data errors usually enter the system?
Which visualisation would best persuade a non-technical stakeholder?

8. Interactive Code¶

9. Guided Practice¶

What is the main goal of data wrangling?¶

To make data larger without reasonWrangling focuses on quality and usability.

To clean and prepare data for analysisCorrect. Wrangling makes data usable.

To avoid checking data qualityQuality checks are a core part of wrangling.

To replace analysis entirelyWrangling supports analysis; it does not replace it.

Why was `strip().title()` useful in the second example?¶

It removed numbers from the listThe example cleaned text labels, not numbers.

It standardized whitespace and capitalizationCorrect. That makes categories more consistent.

It sorted the labels alphabeticallyThose methods do not sort the list.

It converted strings into integersThat is not what those methods do.