Data Cleaning & Preprocessing#

⏳ Loading Pyodide…

Welcome to Digital Laundry Day!
Real-world data is messy: typos, missing values, inconsistent formats, rogue Excel exports. This chapter turns chaos into clean, model-ready datasets.

Data scientists spend 80% of their time cleaning data — and the other 20% complaining about it.


Why Clean Data Matters#

Machine Learning models are like high-end coffee machines:
Garbage In → Garbage Out
Clean In → Insights Out

Your mission: transform raw business data into something your model (and your manager) will love.


Prerequisite: Pandas Power#

We’ll use pandas — the Swiss Army knife of data manipulation.
New to pandas? Check my book:
Programming for Business


Step-by-Step Cleaning Workflow#

1. Inspect Before You Clean#

Always look first:

Common red flags:

Symptom

Likely Cause

NaN / None

Missing data

"?", "N/A"

Excel placeholders

customer_id as float

CSV parsing error

Dates stored as strings

Wrong type

2. Clean Column Names#

Messy names = future bugs.

3. Remove Duplicates#

Duplicates waste space and skew results.

4. Handle Missing Values#

Smart fixes:

Situation

Code Example

Numeric columns

df['revenue'].fillna(df['revenue'].median())

Categorical

df['region'].fillna('Unknown')

Drop empty rows

df.dropna(how='all', inplace=True)

Drop if key column missing

df.dropna(subset=['customer_id'])

5. Fix Data Types#

Excel loves turning IDs into floats.

6. Standardize Inconsistent Values#

7. Quick Outlier Filter (IQR Method)#


Recap: The 7-Step Cleaning Checklist#

Step

Goal

Key Methods

1

Inspect

.info(), .describe(), .head()

2

Clean names

.str.lower().replace()

3

Deduplicate

.drop_duplicates()

4

Handle missing

.fillna(), .dropna()

5

Fix types

.astype(), pd.to_datetime()

6

Standardize values

.str.strip(), .replace()

7

Derive new features

arithmetic, assign(), date parts

Clean data = faster modeling + happier stakeholders + zero “why is my model broken?” moments.


Next → [Handling Missing Data & Outliers]
(Deep dive into imputation, scaling, and robust outlier strategies)

# Your code here

🔧 Setting and Modifying Values#

Setting a New Column#

import numpy as np
import pandas as pd

# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Create a new series
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
print("New Series:")
print(s1)

# Add as new column (automatically aligns by index)
df["F"] = s1
print("\nDataFrame with new column:")
df

Setting Values by Label#

df.at[dates[0], "A"] = 0
print("After setting value by label:")
df

Setting Values by Position#

df.iat[0, 1] = 0
print("After setting value by position:")
df

Setting with NumPy Array#

df.loc[:, "D"] = np.array([5] * len(df))
print("After setting column with NumPy array:")
df

Conditional Setting with where()#

df2 = df.copy()
# Replace positive values with their negative
df2[df2 > 0] = -df2
print("After conditional setting:")
df2