Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Exercises

  1. Take the messy dataset in the Practice Corner and write a small function clean_business_df(df) that applies the cleaning steps used above.

  2. Modify the worked example to compute total revenue by Region and plot a horizontal bar chart suitable for a slide.

  3. (Challenge) Create a small function that detects columns with inconsistent capitalization (e.g., ‘south’ vs ‘South’) and standardizes them.

Hints:
- Use `df.groupby('Region')['Revenue'].sum()` for aggregation
- Use `str.strip().str.title()` to standardize text

Data Loading, Wrangling & Visualization

Practical patterns for cleaning messy business data and turning it into persuasive charts. This chapter gives runnable recipes you can reuse in analysis and reporting.

Learning goals

  • Load small CSV-like sources into pandas and inspect structure.

  • Clean common problems (missing values, wrong types, mis-typed categories).

  • Produce a simple summary chart suitable for business storytelling.

Data Loading, Wrangling & Visualisation

Preparing Real-World Data for Analysis, Machine Learning, and Business Decisions

Learning Roadmap

This notebook keeps the full original chapter overview and adds a clearer study path around it.

Suggested reading order
  1. Start with this chapter overview.

  2. Move to data loading and database notebooks.

  3. Study cleaning, missing data, joins, and aggregations.

  4. Finish with visualisation and dashboards.

Welcome to the most honest part of Machine Learning — data wrangling, also known as “90% of the job no one posts about on LinkedIn.” 😅

If math was theory, this chapter is practice with mud. You’ll roll up your sleeves, clean messy data, and make it look like something a CEO would actually want to see in a dashboard.


🧠 Why This Matters

Machine Learning models are like gourmet chefs — they can only make good predictions if you give them clean ingredients.

Unfortunately, business data often looks like this:

CustomerAgeRevenueGenderNotes
A-10227$2,000Fmissing
NaN$500?typo
C-55445-$200Malerefund
D-999300$1,000catwho let this happen

So before we even think about algorithms, we’ll:

  1. Load data from messy sources.

  2. Clean it like digital laundry.

  3. Transform it into model-ready features.

  4. Visualize it like a storytelling pro.


💾 1. The Data Wrangling Trifecta

StepNameBusiness Goal
Data LoadingGet data into Python“Where’s my Excel file again?”
Data CleaningFix mistakes & missing values“Why is revenue negative?”
Feature EngineeringAdd useful variables“Let’s create a loyalty score!”

By the end of this chapter, you’ll make data look so clean it could get a job at McKinsey.


📚 Prerequisite: Python Refresher

If you’re new to Python or Pandas, don’t panic — it’s easier than assembling IKEA furniture. 👉 Check out my other book: 📘 Programming for Business It covers everything from reading files to basic Python data manipulation.

💡 Tip: You’ll be using libraries like pandas, numpy, and matplotlib. If these look like Pokémon names right now, that book is your Pokédex.


🧩 Practice Corner: “Guess the Data Disaster”

Match each messy situation with the tool that saves the day:

SituationTool
File is 200MB Excel sheet with multiple tabspandas.read_excel()
Missing values everywheredf.fillna() or df.dropna()
Categorical columns like “Yes/No”pd.get_dummies()
Data stored in a SQL databasepandas.read_sql()
REST API providing JSON datarequests.get()

Pro tip: Pandas is your Swiss Army Knife for data chaos.


🔍 2. Why Businesses Love Clean Data

Messy data → Confused analysts → Wrong dashboards → Angry executives. Clean data → Confident models → Actionable insights → Happy bonuses. 🎉

You’ll soon realize:

Data cleaning is not boring — it’s debugging reality.

For example:

  • Missing age? → Estimate with median.

  • Wrong gender field? → Normalize text values.

  • Negative revenue? → Check for refunds.

  • Timestamp errors? → Convert to datetime.

You’re not just fixing numbers — you’re restoring business logic.


🎨 3. Visualisation: Turning Data into Business Art

Once you’ve tamed the chaos, it’s time to make your data pretty and persuasive.

This section covers:

  • Histograms that show sales trends 📊

  • Scatter plots revealing marketing ROI 💸

  • Correlation heatmaps for KPIs 🔥

  • Dashboards that make execs say “wow” ✨

Remember: “If it’s not visualized, it didn’t happen.” — Every Data Scientist, ever.


💬 4. Business Analogy: The Data Spa

Think of your dataset like a customer entering a spa:

StepData ActionSpa Equivalent
LoadingGetting checked in“Welcome, Mr. CSV!”
CleaningRemoving noise & junkExfoliation time 🧼
TransformationStandardizing featuresFacial mask & makeover 💅
VisualizationPresenting resultsWalking the runway 🕺

When your data leaves this spa, it’s ready for the runway — or your next board meeting.


🧩 Practice Corner: “Wrangle This!”

Here’s a messy dataset in Python. Try cleaning it up using what you’ll learn in this chapter:

import pandas as pd

data = {
    'Customer': ['A1', 'A2', 'A3', 'A4', None],
    'Age': [25, None, 300, 40, 32],
    'Revenue': [2000, 500, -100, None, 1500],
    'Gender': ['M', '?', 'F', 'F', 'unknown']
}

df = pd.DataFrame(data)
print("Original Messy Data:")
print(df)

🧽 Challenge:

  1. Replace None and ? with proper values

  2. Fix negative revenue

  3. Correct impossible ages

  4. Print the clean version


🧭 5. What’s Coming Up

FileTopicFunny Summary
data_loadingLoading data from CSV, Excel, SQL & APIs“The Great Data Buffet” 🍽️
data_cleaningCleaning & preprocessing“Digital Laundry Day” 🧺
handling_missing_outliersFixing missing data & outliers“CSI: Data Edition” 🕵️
feature_encodingEncoding categories & scaling features“Teaching Machines English” 🗣️
edaExploratory Data Analysis“Detective Work with Graphs” 🧠
visualisationMaking plots & charts“Turning KPIs into art” 🎨
business_dashboardsInteractive dashboards“Your Data’s TED Talk” 🧑‍💼

🚀 Summary

✅ Data wrangling = preparing the battlefield for ML ✅ Visualization = storytelling for business impact ✅ Clean data = clean insights ✅ Dirty data = bad decisions (and maybe a career pivot)

Remember: “Garbage in → Garbage out” — but in business, garbage often comes with formatting errors.


🔜 Next Stop

👉 Head to Data Loading (CSV, Excel, SQL, APIs) to learn how to bring all your data under one roof — without crying over file formats.

import pandas as pd

data = {
    "Customer": ["A-102", None, "C-554", "D-999"],
    "Age": [27, None, 45, 300],
    "Revenue": [2000, 500, -200, 1000],
    "Region": ["West", "South", "West", "East"],
}

wrangling_df = pd.DataFrame(data)
print("Original business data")
print(wrangling_df)

clean_df = wrangling_df.copy()
clean_df["Customer"] = clean_df["Customer"].fillna("Unknown")
clean_df.loc[clean_df["Age"] > 100, "Age"] = pd.NA
clean_df["Age"] = clean_df["Age"].fillna(clean_df["Age"].median())
clean_df["Revenue"] = clean_df["Revenue"].clip(lower=0)

print("\nCleaned business data")
print(clean_df)
Original business data
  Customer    Age  Revenue Region
0    A-102   27.0     2000   West
1     None    NaN      500  South
2    C-554   45.0     -200   West
3    D-999  300.0     1000   East

Cleaned business data
  Customer   Age  Revenue Region
0    A-102  27.0     2000   West
1  Unknown  36.0      500  South
2    C-554  45.0        0   West
3    D-999  36.0     1000   East

Chapter Wrap-Up

Your original chapter overview already frames the business reality of messy data well. Keep that mindset while moving through the child notebooks: every technique in this chapter exists to make downstream analysis and machine learning more trustworthy.

Quick recap
  • loading gets data into Python from operational systems

  • wrangling fixes structure, types, and quality problems

  • visualisation turns cleaned data into decisions and stories

  • business value comes from reliable, explainable data workflows

Reflection Questions

  1. Which data-quality issue causes the biggest business risk in your domain?

  2. Where in your workflow do data errors usually enter the system?

  3. Which visualisation would best persuade a non-technical stakeholder?

8. Interactive Code

Expected output
[120, 145, 138]
Expected output
['North', 'South', 'West']

9. Guided Practice

What is the main goal of data wrangling?

To make data larger without reasonWrangling focuses on quality and usability.
To clean and prepare data for analysisCorrect. Wrangling makes data usable.
To avoid checking data qualityQuality checks are a core part of wrangling.
To replace analysis entirelyWrangling supports analysis; it does not replace it.

Why was `strip().title()` useful in the second example?

It removed numbers from the listThe example cleaned text labels, not numbers.
It standardized whitespace and capitalizationCorrect. That makes categories more consistent.
It sorted the labels alphabeticallyThose methods do not sort the list.
It converted strings into integersThat is not what those methods do.