Exercises¶
Implement a function
clean_basic(df)that applies the 7-step checklist to a DataFrame and returns(clean_df, report)wherereportlists the actions taken.Try the Pyodide demo CSV and then modify the cleaning to preserve rows flagged as ‘returned’ in a separate table.
(Challenge) Write a small validator that checks
customer_idformats and flags anomalies.
Hints:
- Use `.isna().sum()` for missingness
- Use `.clip(lower=0)` to remove negative revenuesWhich step should you do first when you receive a new dataset?¶

Data Cleaning & Preprocessing¶
Recognize common messes, apply repeatable cleaning steps, and document your assumptions. This notebook includes a lightweight Pyodide demo that safely runs in the browser using generated data when files are unavailable.
Learning goals
Spot common quality issues and apply pragmatic fixes.
Use
pandasidioms to standardize columns, types, and missing values.Follow a repeatable cleaning checklist and produce consistent, documented datasets.
Notebook Guide¶
The original data-cleaning explanations and examples remain in place. This section makes the study goal explicit.
Learning objectives¶
identify common data-quality problems
distinguish cleaning from transformation
apply repeatable preprocessing steps
document assumptions made during cleanup
Welcome to Digital Laundry Day!
Real-world data is messy: typos, missing values, inconsistent formats, rogue Excel exports. This chapter turns chaos into clean, model-ready datasets.
Data scientists spend 80% of their time cleaning data — and the other 20% complaining about it.
Why Clean Data Matters¶
Machine Learning models are like high-end coffee machines:
Garbage In → Garbage Out
Clean In → Insights Out
Your mission: transform raw business data into something your model (and your manager) will love.
Prerequisite: Pandas Power¶
We’ll use pandas — the Swiss Army knife of data manipulation.
New to pandas? Check my book:
Programming for Business
Step-by-Step Cleaning Workflow¶
1. Inspect Before You Clean¶
Always look first:
import pandas as pd
df = pd.read_csv("sales_data.csv")
df.info() # missing values + data types
df.describe() # outliers & strange stats
df.head() # spot obvious garbageCommon red flags:
| Symptom | Likely Cause |
|---|---|
NaN / None | Missing data |
"?", "N/A" | Excel placeholders |
customer_id as float | CSV parsing error |
| Dates stored as strings | Wrong type |
2. Clean Column Names¶
Messy names = future bugs.
df.columns = (df.columns
.str.strip()
.str.lower()
.str.replace(' ', '_')
.str.replace('(', '')
.str.replace(')', ''))
# "Customer Name (v2 FINAL)!" → customer_name_v2_final3. Remove Duplicates¶
Duplicates waste space and skew results.
before = len(df)
df = df.drop_duplicates(subset=['order_id', 'customer_id']) # or all columns
print(f"Removed {before - len(df)} duplicates")4. Handle Missing Values¶
df.isna().sum() # see the damageSmart fixes:
| Situation | Code Example |
|---|---|
| Numeric columns | df['revenue'].fillna(df['revenue'].median()) |
| Categorical | df['region'].fillna('Unknown') |
| Drop empty rows | df.dropna(how='all', inplace=True) |
| Drop if key column missing | df.dropna(subset=['customer_id']) |
5. Fix Data Types¶
Excel loves turning IDs into floats.
df['customer_id'] = df['customer_id'].astype('Int64').astype(str)
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['product_category'] = df['product_category'].astype('category')6. Standardize Inconsistent Values¶
# Text cleanup
df['region'] = (df['region']
.str.strip()
.str.lower()
.replace({'n. america': 'north america',
'usa': 'united states',
'uk': 'united kingdom'}))
# Consistent categories
df['status'] = df['status'].str.title()7. Quick Outlier Filter (IQR Method)¶
def remove_outliers(df, col):
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
return df[(df[col] >= lower) & (df[col] <= upper)]
df = remove_outliers(df, 'revenue')Recap: The 7-Step Cleaning Checklist¶
| Step | Goal | Key Methods |
|---|---|---|
| 1 | Inspect | .info(), .describe(), .head() |
| 2 | Clean names | .str.lower().replace() |
| 3 | Deduplicate | .drop_duplicates() |
| 4 | Handle missing | .fillna(), .dropna() |
| 5 | Fix types | .astype(), pd.to_datetime() |
| 6 | Standardize values | .str.strip(), .replace() |
| 7 | Derive new features | arithmetic, assign(), date parts |
Clean data = faster modeling + happier stakeholders + zero “why is my model broken?” moments.
Next → [Handling Missing Data & Outliers]
(Deep dive into imputation, scaling, and robust outlier strategies)
# Your code hereimport numpy as np
import pandas as pd
# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
# Create a new series
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
print("New Series:")
print(s1)
# Add as new column (automatically aligns by index)
df["F"] = s1
print("\nDataFrame with new column:")
dfSetting Values by Label¶
df.at[dates[0], "A"] = 0
print("After setting value by label:")
dfSetting Values by Position¶
df.iat[0, 1] = 0
print("After setting value by position:")
dfSetting with NumPy Array¶
df.loc[:, "D"] = np.array([5] * len(df))
print("After setting column with NumPy array:")
dfConditional Setting with where()¶
df2 = df.copy()
# Replace positive values with their negative
df2[df2 > 0] = -df2
print("After conditional setting:")
df2Reflection¶
Cleaning is where analysts make many silent assumptions. The original notebook content is most valuable when you explicitly record what was changed, why it was changed, and what risk remains.
8. Interactive Code¶
Expected output
['Asha', 'Rahul', 'Mira']Expected output
[1200, 1450]