Data Cleaning & Preprocessing - Programming for Machine Learning and Business

Welcome to Digital Laundry Day!
Real-world data is messy: typos, missing values, inconsistent formats, rogue Excel exports. This chapter turns chaos into clean, model-ready datasets.

Data scientists spend 80% of their time cleaning data — and the other 20% complaining about it.

Why Clean Data Matters¶

Machine Learning models are like high-end coffee machines:
Garbage In → Garbage Out
Clean In → Insights Out

Your mission: transform raw business data into something your model (and your manager) will love.

Prerequisite: Pandas Power¶

We’ll use pandas — the Swiss Army knife of data manipulation.
New to pandas? Check my book:
Programming for Business

Step-by-Step Cleaning Workflow¶

1. Inspect Before You Clean¶

Always look first:

import pandas as pd
df = pd.read_csv("sales_data.csv")

df.info()      # missing values + data types
df.describe()  # outliers & strange stats
df.head()      # spot obvious garbage

Common red flags:

Symptom	Likely Cause
`NaN` / `None`	Missing data
`"?"`, `"N/A"`	Excel placeholders
`customer_id` as float	CSV parsing error
Dates stored as strings	Wrong type

2. Clean Column Names¶

Messy names = future bugs.

df.columns = (df.columns
                .str.strip()
                .str.lower()
                .str.replace(' ', '_')
                .str.replace('(', '')
                .str.replace(')', ''))
# "Customer Name (v2 FINAL)!" → customer_name_v2_final

3. Remove Duplicates¶

Duplicates waste space and skew results.

before = len(df)
df = df.drop_duplicates(subset=['order_id', 'customer_id'])  # or all columns
print(f"Removed {before - len(df)} duplicates")

4. Handle Missing Values¶

df.isna().sum()  # see the damage

Smart fixes:

Situation	Code Example
Numeric columns	`df['revenue'].fillna(df['revenue'].median())`
Categorical	`df['region'].fillna('Unknown')`
Drop empty rows	`df.dropna(how='all', inplace=True)`
Drop if key column missing	`df.dropna(subset=['customer_id'])`

5. Fix Data Types¶

Excel loves turning IDs into floats.

df['customer_id'] = df['customer_id'].astype('Int64').astype(str)
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['product_category'] = df['product_category'].astype('category')

6. Standardize Inconsistent Values¶

# Text cleanup
df['region'] = (df['region']
                .str.strip()
                .str.lower()
                .replace({'n. america': 'north america',
                          'usa': 'united states',
                          'uk': 'united kingdom'}))

# Consistent categories
df['status'] = df['status'].str.title()

7. Quick Outlier Filter (IQR Method)¶

def remove_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[col] >= lower) & (df[col] <= upper)]

df = remove_outliers(df, 'revenue')

Recap: The 7-Step Cleaning Checklist¶

Step	Goal	Key Methods
1	Inspect	`.info()`, `.describe()`, `.head()`
2	Clean names	`.str.lower().replace()`
3	Deduplicate	`.drop_duplicates()`
4	Handle missing	`.fillna()`, `.dropna()`
5	Fix types	`.astype()`, `pd.to_datetime()`
6	Standardize values	`.str.strip()`, `.replace()`
7	Derive new features	arithmetic, `assign()`, date parts

Clean data = faster modeling + happier stakeholders + zero “why is my model broken?” moments.

Next → [Handling Missing Data & Outliers]
(Deep dive into imputation, scaling, and robust outlier strategies)

# Your code here

🔧 Setting and Modifying Values¶

Setting a New Column¶

import numpy as np
import pandas as pd

# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Create a new series
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
print("New Series:")
print(s1)

# Add as new column (automatically aligns by index)
df["F"] = s1
print("\nDataFrame with new column:")
df

Setting Values by Label¶

df.at[dates[0], "A"] = 0
print("After setting value by label:")
df

Setting Values by Position¶

df.iat[0, 1] = 0
print("After setting value by position:")
df

Setting with NumPy Array¶

df.loc[:, "D"] = np.array([5] * len(df))
print("After setting column with NumPy array:")
df

Conditional Setting with where()¶

df2 = df.copy()
# Replace positive values with their negative
df2[df2 > 0] = -df2
print("After conditional setting:")
df2