Handling Missing Data & Outliers#
Welcome to “The Case of the Missing Values and the Suspicious Outliers” — a thrilling data mystery starring you, the Data Detective 🕵️♀️, and your trusty sidekick, pandas 🐼.
💬 “In every dataset, there’s a missing piece and one number that looks way too rich to be true.”
🎯 Why This Matters#
Before a machine can learn, it must first trust the data. Missing values and outliers are the equivalent of giving your model incomplete homework and fake receipts.
Missing values → incomplete truth
Outliers → exaggerated stories
Your job: decide when to fill, fix, or fire them.
⚙️ Prerequisite: Pandas & Basic Cleaning#
If you haven’t already, check out: 👉 📘 Programming for Business for Python & pandas basics.
And if you skipped the Data Cleaning chapter — go wash your data first 🧼 before coming back here.
🧩 Step 1. Spot the Disappearances#
Let’s begin by investigating how much data is missing.
import pandas as pd
df = pd.read_csv("sales_data.csv")
df.isnull().sum()
💡
.isnull()shows where data is missing..sum()adds up the evidence.
Column |
Missing Count |
% Missing |
|---|---|---|
|
0 |
0% |
|
25 |
2.5% |
|
4 |
0.4% |
💬 “If your data were people, missing values would be the ones who ghosted your survey.”
🧴 Step 2. Filling in the Blanks#
There’s no single “right” way to handle missing data — it depends on context. Let’s explore the 3Rs of Imputation: Remove, Replace, Reimagine.
🔹 Option 1: Remove (if few missing)#
df.dropna(subset=['sales_amount'], inplace=True)
Good when missing data is minimal (<5%).
🔹 Option 2: Replace with Smart Guess#
df['region'].fillna('Unknown', inplace=True)
df['sales_amount'].fillna(df['sales_amount'].median(), inplace=True)
Median works great when data is skewed (like business revenue 💰).
🔹 Option 3: Reimagine with Logic#
df.loc[df['region'].isna() & (df['country'] == 'USA'), 'region'] = 'North America'
Use business context — not just math.
📉 Step 3. Detecting Outliers — Meet the Trouble Makers#
Outliers are those “creative” data points that don’t play by the rules. Think of them as:
“The employee who claims 10,000 hours of work in one week.”
Let’s visualize the drama.
import matplotlib.pyplot as plt
plt.boxplot(df['sales_amount'])
plt.title("Boxplot - Sales Amount Outliers")
plt.show()
🔍 Step 4. Finding Outliers Programmatically#
Using the Interquartile Range (IQR)#
Q1 = df['sales_amount'].quantile(0.25)
Q3 = df['sales_amount'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['sales_amount'] < (Q1 - 1.5 * IQR)) | (df['sales_amount'] > (Q3 + 1.5 * IQR))]
print(outliers)
💡 Anything outside 1.5×IQR is suspicious — maybe fraud, error, or a future CEO bonus.
💸 Step 5. What To Do With Outliers#
Scenario |
Action |
|---|---|
Obvious error (e.g., negative sales) |
Drop |
Real extreme (e.g., large order) |
Keep |
Probably typo |
Cap or Winsorize |
Influencing mean too much |
Use median-based stats |
Example:#
Q1 = df['sales_amount'].quantile(0.25)
Q3 = df['sales_amount'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df['sales_amount'] = df['sales_amount'].clip(lower, upper)
💬 “Clipping outliers is like cutting off split ends — keeps things tidy without losing the essence.”
🧠 Step 6. Visual Double Check#
Visualization helps you confirm your cleanup worked.
import seaborn as sns
sns.histplot(df['sales_amount'], bins=30, kde=True)
plt.title("After Cleaning - Sales Distribution")
plt.show()
✅ A healthy distribution looks like a hill. ❌ A suspicious one looks like Mount Everest ⛰️.
🧪 Step 7. Practice Lab — “The Outlier Hunt”#
Use sales_dirty.csv and try this mini-mission:
Check for missing data with
.isnull().sum().Fill missing regions with
"Unknown".Replace missing sales with the median.
Identify outliers using IQR.
Clip them using
.clip().Plot before & after histograms.
🕵️♂️ Bonus: Count how many outliers were “arrested”.
🧭 Recap#
Task |
Tool |
Business Analogy |
|---|---|---|
Identify missing |
|
Attendance check |
Fill or drop |
|
Substitute values |
Detect outliers |
IQR, Boxplot |
Expense audit |
Handle outliers |
|
Correcting exaggerated claims |
🎯 Clean, realistic data → smarter models → fewer “WTF” moments in your dashboards.
🔜 Next Stop#
👉 Head to Feature Types & Encoding where we’ll teach our models how to speak business — turning categories, dates, and booleans into something algorithms actually understand.
# Your code here