Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Exercises

  1. Implement a function clean_basic(df) that applies the 7-step checklist to a DataFrame and returns (clean_df, report) where report lists the actions taken.

  2. Try the Pyodide demo CSV and then modify the cleaning to preserve rows flagged as ‘returned’ in a separate table.

  3. (Challenge) Write a small validator that checks customer_id formats and flags anomalies.

Hints:
- Use `.isna().sum()` for missingness
- Use `.clip(lower=0)` to remove negative revenues

Which step should you do first when you receive a new dataset?

Inspect the dataset structure (`df.info()`, `df.head()`)Correct — inspection informs the cleaning plan.
Start by dropping columnsDropping early can remove needed context.
Immediately normalize all textNormalization is often needed, but first inspect types and missingness.

Data Cleaning & Preprocessing

Recognize common messes, apply repeatable cleaning steps, and document your assumptions. This notebook includes a lightweight Pyodide demo that safely runs in the browser using generated data when files are unavailable.

Learning goals

  • Spot common quality issues and apply pragmatic fixes.

  • Use pandas idioms to standardize columns, types, and missing values.

  • Follow a repeatable cleaning checklist and produce consistent, documented datasets.

Data Cleaning & Preprocessing

Correcting Quality Issues Before Analysis and Modeling

Notebook Guide

The original data-cleaning explanations and examples remain in place. This section makes the study goal explicit.

Learning objectives

  • identify common data-quality problems

  • distinguish cleaning from transformation

  • apply repeatable preprocessing steps

  • document assumptions made during cleanup

Welcome to Digital Laundry Day!
Real-world data is messy: typos, missing values, inconsistent formats, rogue Excel exports. This chapter turns chaos into clean, model-ready datasets.

Data scientists spend 80% of their time cleaning data — and the other 20% complaining about it.


Why Clean Data Matters

Machine Learning models are like high-end coffee machines:
Garbage In → Garbage Out
Clean In → Insights Out

Your mission: transform raw business data into something your model (and your manager) will love.


Prerequisite: Pandas Power

We’ll use pandas — the Swiss Army knife of data manipulation.
New to pandas? Check my book:
Programming for Business


Step-by-Step Cleaning Workflow

1. Inspect Before You Clean

Always look first:

import pandas as pd
df = pd.read_csv("sales_data.csv")

df.info()      # missing values + data types
df.describe()  # outliers & strange stats
df.head()      # spot obvious garbage

Common red flags:

SymptomLikely Cause
NaN / NoneMissing data
"?", "N/A"Excel placeholders
customer_id as floatCSV parsing error
Dates stored as stringsWrong type

2. Clean Column Names

Messy names = future bugs.

df.columns = (df.columns
                .str.strip()
                .str.lower()
                .str.replace(' ', '_')
                .str.replace('(', '')
                .str.replace(')', ''))
# "Customer Name (v2 FINAL)!" → customer_name_v2_final

3. Remove Duplicates

Duplicates waste space and skew results.

before = len(df)
df = df.drop_duplicates(subset=['order_id', 'customer_id'])  # or all columns
print(f"Removed {before - len(df)} duplicates")

4. Handle Missing Values

df.isna().sum()  # see the damage

Smart fixes:

SituationCode Example
Numeric columnsdf['revenue'].fillna(df['revenue'].median())
Categoricaldf['region'].fillna('Unknown')
Drop empty rowsdf.dropna(how='all', inplace=True)
Drop if key column missingdf.dropna(subset=['customer_id'])

5. Fix Data Types

Excel loves turning IDs into floats.

df['customer_id'] = df['customer_id'].astype('Int64').astype(str)
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['product_category'] = df['product_category'].astype('category')

6. Standardize Inconsistent Values

# Text cleanup
df['region'] = (df['region']
                .str.strip()
                .str.lower()
                .replace({'n. america': 'north america',
                          'usa': 'united states',
                          'uk': 'united kingdom'}))

# Consistent categories
df['status'] = df['status'].str.title()

7. Quick Outlier Filter (IQR Method)

def remove_outliers(df, col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[col] >= lower) & (df[col] <= upper)]

df = remove_outliers(df, 'revenue')

Recap: The 7-Step Cleaning Checklist

StepGoalKey Methods
1Inspect.info(), .describe(), .head()
2Clean names.str.lower().replace()
3Deduplicate.drop_duplicates()
4Handle missing.fillna(), .dropna()
5Fix types.astype(), pd.to_datetime()
6Standardize values.str.strip(), .replace()
7Derive new featuresarithmetic, assign(), date parts

Clean data = faster modeling + happier stakeholders + zero “why is my model broken?” moments.


Next → [Handling Missing Data & Outliers]
(Deep dive into imputation, scaling, and robust outlier strategies)

# Your code here

🔧 Setting and Modifying Values

Setting a New Column

import numpy as np
import pandas as pd

# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Create a new series
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
print("New Series:")
print(s1)

# Add as new column (automatically aligns by index)
df["F"] = s1
print("\nDataFrame with new column:")
df

Setting Values by Label

df.at[dates[0], "A"] = 0
print("After setting value by label:")
df

Setting Values by Position

df.iat[0, 1] = 0
print("After setting value by position:")
df

Setting with NumPy Array

df.loc[:, "D"] = np.array([5] * len(df))
print("After setting column with NumPy array:")
df

Conditional Setting with where()

df2 = df.copy()
# Replace positive values with their negative
df2[df2 > 0] = -df2
print("After conditional setting:")
df2

Reflection

Cleaning is where analysts make many silent assumptions. The original notebook content is most valuable when you explicitly record what was changed, why it was changed, and what risk remains.

8. Interactive Code

Expected output
['Asha', 'Rahul', 'Mira']
Expected output
[1200, 1450]

9. Guided Practice

What is a common data-cleaning task?

Increasing typo countCleaning aims to reduce errors, not introduce them.
Standardizing inconsistent valuesCorrect. Consistency is a major goal.
Avoiding missing-value checksMissing values are an important cleaning concern.
Removing every columnThat would not be useful cleaning.

Why was `None` removed in the second example?

Because Python cannot store lists with `None`Python lists can store `None`.
Because missing values often need explicit handling before analysisCorrect. Missing data can distort later steps.
Because `None` means a negative sale`None` usually indicates missing data, not a negative number.
Because `None` is larger than 1450That is not meaningful here.