Welcome to Digital Laundry Day!
Real-world data is messy: typos, missing values, inconsistent formats, rogue Excel exports. This chapter turns chaos into clean, model-ready datasets.

Data scientists spend 80% of their time cleaning data — and the other 20% complaining about it.


Why Clean Data Matters

Machine Learning models are like high-end coffee machines:
Garbage In → Garbage Out
Clean In → Insights Out

Your mission: transform raw business data into something your model (and your manager) will love.


Prerequisite: Pandas Power

We’ll use pandas — the Swiss Army knife of data manipulation.
New to pandas? Check my book:
Programming for Business


Step-by-Step Cleaning Workflow

1. Inspect Before You Clean

Always look first:

import pandas as pd
df = pd.read_csv("sales_data.csv")

df.info()      # missing values + data types
df.describe()  # outliers & strange stats
df.head()      # spot obvious garbage

Common red flags:

| Symptom | Likely Cause |
|---|---|
| `NaN` / `None` | Missing data |
| `"?"`, `"N/A"` | Excel placeholders |
| `customer_id` as float | CSV parsing error |
| Dates stored as strings | Wrong type |
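Placeholders like `"?"` can be converted into real `NaN` values at load time, before any cleaning starts. A minimal sketch (the column names and data here are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical raw export where Excel wrote "?" and "N/A" for missing cells
raw = io.StringIO("customer_id,revenue\n101,250\n102,?\n103,N/A\n")

# na_values tells read_csv which placeholder strings to treat as missing
df = pd.read_csv(raw, na_values=["?", "N/A"])
print(df.isna().sum())  # 'revenue' now shows 2 missing values
```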

2. Clean Column Names

Messy names = future bugs.

df.columns = (df.columns
                .str.strip()
                .str.lower()
                .str.replace(r'[^\w]+', '_', regex=True)  # spaces & punctuation → _
                .str.strip('_'))
# "Customer Name (v2 FINAL)!" → customer_name_v2_final

3. Remove Duplicates

Duplicates waste space and skew results.

before = len(df)
df = df.drop_duplicates(subset=['order_id', 'customer_id'])  # or all columns
print(f"Removed {before - len(df)} duplicates")

4. Handle Missing Values

df.isna().sum()  # see the damage

Smart fixes:

| Situation | Code Example |
|---|---|
| Numeric columns | `df['revenue'].fillna(df['revenue'].median())` |
| Categorical | `df['region'].fillna('Unknown')` |
| Drop empty rows | `df.dropna(how='all', inplace=True)` |
| Drop if key column missing | `df.dropna(subset=['customer_id'])` |
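Put together on a toy frame (the column names mirror the table above; the data is made up):

```python
import numpy as np
import pandas as pd

# Made-up slice of sales data with typical gaps
df = pd.DataFrame({
    'customer_id': ['C1', None, 'C3', 'C4'],
    'revenue':     [100.0, np.nan, 300.0, np.nan],
    'region':      ['east', 'west', None, 'east'],
})

df['revenue'] = df['revenue'].fillna(df['revenue'].median())  # numeric → median
df['region'] = df['region'].fillna('Unknown')                 # categorical → sentinel label
df = df.dropna(subset=['customer_id'])                        # drop rows missing the key
print(df)
```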

5. Fix Data Types

Excel loves turning IDs into floats.

df['customer_id'] = df['customer_id'].astype('Int64').astype(str)     # float → nullable int → string
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')  # unparseable dates become NaT
df['product_category'] = df['product_category'].astype('category')    # saves memory, speeds up groupby
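Why `Int64` rather than plain `int`? The nullable integer dtype survives missing IDs, where `astype(int)` would raise. A quick check (values are made up):

```python
import pandas as pd

# Hypothetical IDs that a CSV round-trip turned into floats, one missing
ids = pd.Series([1001.0, 1002.0, None])

fixed = ids.astype('Int64')  # nullable integer: keeps the gap as <NA>
print(fixed.dtype, fixed.isna().sum())
```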

6. Standardize Inconsistent Values

# Text cleanup
df['region'] = (df['region']
                .str.strip()
                .str.lower()
                .replace({'n. america': 'north america',
                          'usa': 'united states',
                          'uk': 'united kingdom'}))

# Consistent categories
df['status'] = df['status'].str.title()

7. Quick Outlier Filter (IQR Method)

def remove_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[col] >= lower) & (df[col] <= upper)]

df = remove_outliers(df, 'revenue')
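On a tiny made-up revenue column, the fence runs from Q1 − 1.5·IQR to Q3 + 1.5·IQR, so the extreme value falls out:

```python
import pandas as pd

def remove_outliers(df, col):
    # Same IQR filter as above, repeated here so the sketch is self-contained
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[col] >= lower) & (df[col] <= upper)]

# Made-up revenues: 9999 sits far above Q3 + 1.5*IQR
df = pd.DataFrame({'revenue': [100, 120, 110, 105, 9999]})
clean = remove_outliers(df, 'revenue')
print(len(df), '->', len(clean))  # the outlier row is dropped
```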

Recap: The 7-Step Cleaning Checklist

| Step | Goal | Key Methods |
|---|---|---|
| 1 | Inspect | `.info()`, `.describe()`, `.head()` |
| 2 | Clean names | `.str.lower()`, `.str.replace()` |
| 3 | Deduplicate | `.drop_duplicates()` |
| 4 | Handle missing | `.fillna()`, `.dropna()` |
| 5 | Fix types | `.astype()`, `pd.to_datetime()` |
| 6 | Standardize values | `.str.strip()`, `.replace()` |
| 7 | Filter outliers | IQR method, boolean masks |

Clean data = faster modeling + happier stakeholders + zero “why is my model broken?” moments.


Next → [Handling Missing Data & Outliers]
(Deep dive into imputation, scaling, and robust outlier strategies)


🔧 Setting and Modifying Values

Setting a New Column

import numpy as np
import pandas as pd

# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Create a new series
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
print("New Series:")
print(s1)

# Add as new column (automatically aligns by index)
df["F"] = s1
print("\nDataFrame with new column:")
df

Setting Values by Label

df.at[dates[0], "A"] = 0
print("After setting value by label:")
df

Setting Values by Position

df.iat[0, 1] = 0
print("After setting value by position:")
df

Setting with NumPy Array

df.loc[:, "D"] = np.array([5] * len(df))
print("After setting column with NumPy array:")
df

Conditional Setting with a Boolean Mask

df2 = df.copy()
# Replace positive values with their negative
df2[df2 > 0] = -df2
print("After conditional setting:")
df2
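The same result can be reached with `DataFrame.where`, which keeps values where the condition holds and substitutes elsewhere (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, -2.0, 3.0]})

# Route 1: boolean-mask assignment
masked = df.copy()
masked[masked > 0] = -masked

# Route 2: where() keeps entries satisfying the condition, replaces the rest
via_where = df.where(df <= 0, -df)
print(masked.equals(via_where))  # both yield only non-positive values
```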