Data Cleaning & Preprocessing#
Welcome to Digital Laundry Day!
Real-world data is messy: typos, missing values, inconsistent formats, rogue Excel exports. This chapter turns chaos into clean, model-ready datasets.
Data scientists spend 80% of their time cleaning data — and the other 20% complaining about it.
Why Clean Data Matters#
Machine Learning models are like high-end coffee machines:
Garbage In → Garbage Out
Clean In → Insights Out
Your mission: transform raw business data into something your model (and your manager) will love.
Prerequisite: Pandas Power#
We’ll use pandas — the Swiss Army knife of data manipulation.
New to pandas? Check my book:
Programming for Business
Step-by-Step Cleaning Workflow#
1. Inspect Before You Clean#
Always look first:
Common red flags:
Symptom |
Likely Cause |
|---|---|
|
Missing data |
|
Excel placeholders |
|
CSV parsing error |
Dates stored as strings |
Wrong type |
2. Clean Column Names#
Messy names = future bugs.
3. Remove Duplicates#
Duplicates waste space and skew results.
4. Handle Missing Values#
Smart fixes:
Situation |
Code Example |
|---|---|
Numeric columns |
|
Categorical |
|
Drop empty rows |
|
Drop if key column missing |
|
5. Fix Data Types#
Excel loves turning IDs into floats.
6. Standardize Inconsistent Values#
7. Quick Outlier Filter (IQR Method)#
Recap: The 7-Step Cleaning Checklist#
Step |
Goal |
Key Methods |
|---|---|---|
1 |
Inspect |
|
2 |
Clean names |
|
3 |
Deduplicate |
|
4 |
Handle missing |
|
5 |
Fix types |
|
6 |
Standardize values |
|
7 |
Derive new features |
arithmetic, |
Clean data = faster modeling + happier stakeholders + zero “why is my model broken?” moments.
Next → [Handling Missing Data & Outliers]
(Deep dive into imputation, scaling, and robust outlier strategies)
# Your code here
🔧 Setting and Modifying Values#
Setting a New Column#
import numpy as np
import pandas as pd
# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
# Create a new series
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
print("New Series:")
print(s1)
# Add as new column (automatically aligns by index)
df["F"] = s1
print("\nDataFrame with new column:")
df
Setting Values by Label#
df.at[dates[0], "A"] = 0
print("After setting value by label:")
df
Setting Values by Position#
df.iat[0, 1] = 0
print("After setting value by position:")
df
Setting with NumPy Array#
df.loc[:, "D"] = np.array([5] * len(df))
print("After setting column with NumPy array:")
df
Conditional Setting with where()#
df2 = df.copy()
# Replace positive values with their negative
df2[df2 > 0] = -df2
print("After conditional setting:")
df2