Handling Missing Data & Outliers#

Welcome to “The Case of the Missing Values and the Suspicious Outliers” — a thrilling data mystery starring you, the Data Detective 🕵️‍♀️, and your trusty sidekick, pandas 🐼.

💬 “In every dataset, there’s a missing piece and one number that looks way too rich to be true.”


🎯 Why This Matters#

Before a machine can learn, it must first trust the data. Missing values and outliers are the equivalent of giving your model incomplete homework and fake receipts.

  • Missing values → incomplete truth

  • Outliers → exaggerated stories

Your job: decide when to fill, fix, or fire them.


⚙️ Prerequisite: Pandas & Basic Cleaning#

If you haven’t already, check out: 👉 📘 Programming for Business for Python & pandas basics.

And if you skipped the Data Cleaning chapter — go wash your data first 🧼 before coming back here.


🧩 Step 1. Spot the Disappearances#

Let’s begin by investigating how much data is missing.

import pandas as pd

df = pd.read_csv("sales_data.csv")
df.isnull().sum()

💡 .isnull() shows where data is missing. .sum() adds up the evidence.

Column

Missing Count

% Missing

customer_id

0

0%

region

25

2.5%

sales_amount

4

0.4%

💬 “If your data were people, missing values would be the ones who ghosted your survey.”


🧴 Step 2. Filling in the Blanks#

There’s no single “right” way to handle missing data — it depends on context. Let’s explore the 3Rs of Imputation: Remove, Replace, Reimagine.

🔹 Option 1: Remove (if few missing)#

df.dropna(subset=['sales_amount'], inplace=True)

Good when missing data is minimal (<5%).

🔹 Option 2: Replace with Smart Guess#

df['region'].fillna('Unknown', inplace=True)
df['sales_amount'].fillna(df['sales_amount'].median(), inplace=True)

Median works great when data is skewed (like business revenue 💰).

🔹 Option 3: Reimagine with Logic#

df.loc[df['region'].isna() & (df['country'] == 'USA'), 'region'] = 'North America'

Use business context — not just math.


📉 Step 3. Detecting Outliers — Meet the Trouble Makers#

Outliers are those “creative” data points that don’t play by the rules. Think of them as:

“The employee who claims 10,000 hours of work in one week.”

Let’s visualize the drama.

import matplotlib.pyplot as plt

plt.boxplot(df['sales_amount'])
plt.title("Boxplot - Sales Amount Outliers")
plt.show()

🔍 Step 4. Finding Outliers Programmatically#

Using the Interquartile Range (IQR)#

Q1 = df['sales_amount'].quantile(0.25)
Q3 = df['sales_amount'].quantile(0.75)
IQR = Q3 - Q1

outliers = df[(df['sales_amount'] < (Q1 - 1.5 * IQR)) | (df['sales_amount'] > (Q3 + 1.5 * IQR))]
print(outliers)

💡 Anything outside 1.5×IQR is suspicious — maybe fraud, error, or a future CEO bonus.


💸 Step 5. What To Do With Outliers#

Scenario

Action

Obvious error (e.g., negative sales)

Drop

Real extreme (e.g., large order)

Keep

Probably typo

Cap or Winsorize

Influencing mean too much

Use median-based stats

Example:#

Q1 = df['sales_amount'].quantile(0.25)
Q3 = df['sales_amount'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df['sales_amount'] = df['sales_amount'].clip(lower, upper)

💬 “Clipping outliers is like cutting off split ends — keeps things tidy without losing the essence.”


🧠 Step 6. Visual Double Check#

Visualization helps you confirm your cleanup worked.

import seaborn as sns

sns.histplot(df['sales_amount'], bins=30, kde=True)
plt.title("After Cleaning - Sales Distribution")
plt.show()

✅ A healthy distribution looks like a hill. ❌ A suspicious one looks like Mount Everest ⛰️.


🧪 Step 7. Practice Lab — “The Outlier Hunt”#

Use sales_dirty.csv and try this mini-mission:

  1. Check for missing data with .isnull().sum().

  2. Fill missing regions with "Unknown".

  3. Replace missing sales with the median.

  4. Identify outliers using IQR.

  5. Clip them using .clip().

  6. Plot before & after histograms.

🕵️‍♂️ Bonus: Count how many outliers were “arrested”.


🧭 Recap#

Task

Tool

Business Analogy

Identify missing

.isna().sum()

Attendance check

Fill or drop

.fillna(), .dropna()

Substitute values

Detect outliers

IQR, Boxplot

Expense audit

Handle outliers

.clip(), .replace()

Correcting exaggerated claims

🎯 Clean, realistic data → smarter models → fewer “WTF” moments in your dashboards.


🔜 Next Stop#

👉 Head to Feature Types & Encoding where we’ll teach our models how to speak business — turning categories, dates, and booleans into something algorithms actually understand.

# Your code here