Welcome to Digital Laundry Day!
Real-world data is messy: typos, missing values, inconsistent formats, rogue Excel exports. This chapter turns chaos into clean, model-ready datasets.
Data scientists spend 80% of their time cleaning data — and the other 20% complaining about it.
Why Clean Data Matters¶
Machine Learning models are like high-end coffee machines:
Garbage In → Garbage Out
Clean In → Insights Out
Your mission: transform raw business data into something your model (and your manager) will love.
Prerequisite: Pandas Power¶
We’ll use pandas — the Swiss Army knife of data manipulation.
New to pandas? Check my book:
Programming for Business
Step-by-Step Cleaning Workflow¶
1. Inspect Before You Clean¶
Always look first:
import pandas as pd
df = pd.read_csv("sales_data.csv")
df.info() # missing values + data types
df.describe() # outliers & strange stats
df.head()     # spot obvious garbage
Common red flags:
| Symptom | Likely Cause |
|---|---|
| NaN / None | Missing data |
| "?", "N/A" | Excel placeholders |
| customer_id as float | CSV parsing error |
| Dates stored as strings | Wrong type |
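The red flags above are easy to spot programmatically. A minimal sketch, using a hypothetical messy CSV (the column names and values are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical messy CSV export: a blank ID cell, "?" placeholders, string dates
raw = io.StringIO(
    "Customer ID,Order Date,Revenue\n"
    "1001,2023-01-05,250.0\n"
    "1002,?,?\n"
    ",2023-01-07,120.5\n"
)
df = pd.read_csv(raw)

print(df.dtypes)             # Customer ID is float64 because of the blank cell
print(df.isin(["?"]).sum())  # "?" placeholders hide in two columns
```

Note that the blank ID forces the whole column to float64 — exactly the "customer_id as float" symptom from the table.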
2. Clean Column Names¶
Messy names = future bugs.
df.columns = (df.columns
              .str.strip()
              .str.lower()
              .str.replace(' ', '_')
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False))
# "Customer Name (v2 FINAL)" → customer_name_v2_final
3. Remove Duplicates¶
Duplicates waste space and skew results.
before = len(df)
df = df.drop_duplicates(subset=['order_id', 'customer_id']) # or all columns
print(f"Removed {before - len(df)} duplicates")
4. Handle Missing Values¶
df.isna().sum()  # see the damage
Smart fixes:
| Situation | Code Example |
|---|---|
| Numeric columns | df['revenue'].fillna(df['revenue'].median()) |
| Categorical | df['region'].fillna('Unknown') |
| Drop empty rows | df.dropna(how='all', inplace=True) |
| Drop if key column missing | df.dropna(subset=['customer_id']) |
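The fixes in the table compose naturally. A minimal sketch on a toy frame (the data is invented; column names follow the examples above):

```python
import numpy as np
import pandas as pd

# Toy frame with the kinds of gaps the table describes
df = pd.DataFrame({
    "customer_id": ["C1", None, "C3", "C4"],
    "revenue":     [100.0, np.nan, 300.0, np.nan],
    "region":      ["west", "east", None, "west"],
})

df = df.dropna(subset=["customer_id"])          # rows without a key are useless
df["revenue"] = df["revenue"].fillna(df["revenue"].median())
df["region"] = df["region"].fillna("Unknown")

print(df.isna().sum().sum())  # 0 — no missing values left
```

Order matters here: dropping key-less rows first means the median is computed only over rows you intend to keep.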
5. Fix Data Types¶
Excel loves turning IDs into floats.
df['customer_id'] = df['customer_id'].astype('Int64').astype(str)
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['product_category'] = df['product_category'].astype('category')
6. Standardize Inconsistent Values¶
# Text cleanup
df['region'] = (df['region']
                .str.strip()
                .str.lower()
                .replace({'n. america': 'north america',
                          'usa': 'united states',
                          'uk': 'united kingdom'}))
# Consistent categories
df['status'] = df['status'].str.title()
7. Quick Outlier Filter (IQR Method)¶
def remove_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[col] >= lower) & (df[col] <= upper)]
df = remove_outliers(df, 'revenue')
Recap: The 7-Step Cleaning Checklist¶
| Step | Goal | Key Methods |
|---|---|---|
| 1 | Inspect | .info(), .describe(), .head() |
| 2 | Clean names | .str.strip(), .str.lower(), .str.replace() |
| 3 | Deduplicate | .drop_duplicates() |
| 4 | Handle missing | .fillna(), .dropna() |
| 5 | Fix types | .astype(), pd.to_datetime() |
| 6 | Standardize values | .str.strip(), .replace() |
| 7 | Filter outliers | IQR method, .quantile() |
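The checklist above can be sketched as a single reusable pipeline. This is only an illustration, not a one-size-fits-all implementation — the function name, fill strategies, and the sample columns are assumptions you should adapt to your own data:

```python
import pandas as pd

def clean_sales_data(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning checklist as one pipeline (illustrative only)."""
    out = df.copy()
    # Step 2: clean column names
    out.columns = (out.columns.str.strip().str.lower()
                   .str.replace(' ', '_'))
    # Step 3: deduplicate
    out = out.drop_duplicates()
    # Step 4: handle missing values (numeric → median, text → 'unknown')
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna('unknown')
    # Step 6: standardize text values
    for col in out.select_dtypes(include='object').columns:
        out[col] = out[col].str.strip().str.lower()
    return out

messy = pd.DataFrame({'  Region ': ['West', 'west ', None],
                      'Revenue': [100.0, 100.0, None]})
clean = clean_sales_data(messy)
print(clean)
```

One design note: deduplicating before standardizing (as the checklist orders it) can miss duplicates that differ only in casing or whitespace; for strict deduplication, run `drop_duplicates()` again after step 6.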
Clean data = faster modeling + happier stakeholders + zero “why is my model broken?” moments.
Next → [Handling Missing Data & Outliers]
(Deep dive into imputation, scaling, and robust outlier strategies)
import numpy as np
import pandas as pd
# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
# Create a new series
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
print("New Series:")
print(s1)
# Add as new column (automatically aligns by index)
df["F"] = s1
print("\nDataFrame with new column:")
df
Setting Values by Label¶
df.at[dates[0], "A"] = 0
print("After setting value by label:")
df
Setting Values by Position¶
df.iat[0, 1] = 0
print("After setting value by position:")
df
Setting with NumPy Array¶
df.loc[:, "D"] = np.array([5] * len(df))
print("After setting column with NumPy array:")
df
Conditional Setting¶
df2 = df.copy()
# Replace positive values with their negative
df2[df2 > 0] = -df2
print("After conditional setting:")
df2
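The boolean-mask assignment above can also be written with pandas' explicit `where()` and `mask()` methods, which return a new frame instead of mutating in place. A small sketch (the frame is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})

# where(): keep values where the condition is True, replace the rest
negated = df.where(df <= 0, -df)   # same effect as df2[df2 > 0] = -df2

# mask() is the mirror image: replace values where the condition is True
same = df.mask(df > 0, -df)

print(negated)
```

Preferring `where()`/`mask()` over in-place boolean assignment keeps the original frame untouched, which is handy in longer cleaning pipelines.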