Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Real-world business quote: “Garbage in, garbage out. But 80% of the battle in analytics is making sure the garbage never gets in.” — Former Chief Data Officer, Walmart


🎯 Why Every Business Student MUST Master This (Before Your First Internship)

Business RealityData Problem₹₹₹ Cost if Ignored
Revenue forecastingMissing daily sales₹50L+ wrong inventory order
Customer segmentationOutlier = fake VIP customer₹10L misallocated marketing budget
Credit-risk modelsNegative age typoLoan approved to -20 year old → fraud
Board dashboardOne $999,999,999 typoCEO thinks company made $1Bn in a day → stock jumps 8% → SEC investigation

Bottom line: 70–80% of a data analyst’s time is spent cleaning data. Master this → you become the most valuable intern in the room.

Step 1: Load Data & First Crime-Scene Investigation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("sales_data.csv")
print(df.head())
print(df.info())

Business student tip: Always check dtypes — if sales_amount is object → someone typed “₹50,000” instead of 50000 → disaster!


Step 2: Missing Values → The Ghost Customers

missing = pd.DataFrame({
    'Missing Count': df.isnull().sum(),
    '% Missing': round(df.isnull().sum() / len(df) * 100, 2)
})
print(missing)

Real Business Rules (Add these comments in your code!)

# Business Rule 1: If <5% missing → safe to impute
# Business Rule 2: If >30% missing in a column → consider dropping entire column
# Business Rule 3: Never ever use mean() for revenue/salary → always median()

Smart Imputation Cheat-Sheet for Business Data

Column TypeBest ImputationCode Example
Customer regionMode or “Unknown”df['region'].fillna('Unknown')
Sales amountMedian (skewed data!)df['sales_amount'].fillna(df['sales_amount'].median())
Time-series salesLinear or Seasonal interpolationdf.interpolate(method='time')
Customer ageMedian by segmentdf['age'].fillna(df.groupby('segment')['age'].transform('median'))

Pro move for interviews:

# Segment-wise median — shows you think like a business analyst
df['sales_amount'] = df.groupby('region')['sales_amount'].transform(
    lambda x: x.fillna(x.median())
)

Step 3: Outliers → The ₹9,99,99,999 “Typo” That Crashed Your Model

Visualization First (Always!)

plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
sns.boxplot(x=df['sales_amount'])
plt.title("Boxplot - Spot the Billionaire Typo")

plt.subplot(1,2,2)
sns.histplot(df['sales_amount'], bins=50, kde=True)
plt.title("Histogram - Is that a second peak at 1 Billion?")
plt.tight_layout()
plt.show()

IQR Method (Standard in 90% of companies)

def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outliers = data[(data[column] < lower) | (data[column] > upper)]
    print(f"⚠️ Found {len(outliers)} outliers in {column}")
    return outliers, lower, upper

outliers, lower_bound, upper_bound = detect_outliers_iqr(df, 'sales_amount')

Business Decision Matrix (Paste this in your Jupyter notebook!)

Outlier ExampleLikely CauseActionCode
Sales = ₹999,999,999TypoClip.clip()
Sales = ₹5,00,000Genuine bulk orderKeep + Flagcreate is_large_order
Sales = -500System errorSet to 0.clip(lower=0)
Age = 150Data entry errorReplace with NaN → impute
# Winsorization (used by HDFC Bank risk models)
df['sales_amount'] = df['sales_amount'].clip(lower=lower_bound, upper=upper_bound)

# Or create a flag (Goldman Sachs style)
df['is_outlier'] = (df['sales_amount'] < lower_bound) | (df['sales_amount'] > upper_bound)

Step 4: Before vs After — The Money Shot for Your Resume

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))

ax1.hist(df_raw['sales_amount'], bins=50, alpha=0.7, label='Before', color='red')
ax1.set_title("Before Cleaning - Chaos")
ax1.legend()

ax2.hist(df['sales_amount'], bins=50, alpha=0.7, label='After', color='green')
ax2.set_title("After Cleaning - Beautiful Normal Distribution")
ax2.legend()

plt.suptitle("Impact of Proper Data Cleaning - From Garbage to Gold")
plt.show()

Sample insight to impress your interviewer:

“After cleaning, average ticket size dropped from ₹1,42,000 to ₹68,000 — revealing that 3 typo entries were inflating AOV by 108%. This prevents ₹42 crore over-forecast in annual revenue.”


The 3 Golden Rules for Business Analytics Roles

  1. Never trust raw data — always assume it’s lying

  2. Document every cleaning decision — your future self (or auditor) will thank you

  3. Always validate with business stakeholder — “Is ₹5 crore order possible?” → call sales head!


Analyzing Distribution

  • Description: Quantifying the shape of a distribution by evaluating its asymmetry (skewness) and the heaviness of its tails (kurtosis). These measures provide insights beyond central tendency and dispersion, revealing important characteristics of the data’s distribution.

  • Related Concepts:

    • Skewness: A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. A positive skew indicates a long tail extending towards more positive values (right-skewed), while a negative skew indicates a long tail extending towards more negative values (left-skewed). The formula for sample skewness is:

      Skew=1ni=1n(xiμ)3σ3\text{Skew} = \frac{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^3}{\sigma^3}

      where xix_i are the data points, μ\mu is the sample mean, σ\sigma is the sample standard deviation, and nn is the number of data points.

    • Kurtosis: A measure of the “tailedness” of the probability distribution. High kurtosis indicates that the distribution has heavier tails and a sharper peak around the mean compared to a normal distribution, suggesting more frequent extreme values. The formula for sample kurtosis (often adjusted to excess kurtosis by subtracting 3, where a normal distribution has a kurtosis of 3) is:

      Kurt=1ni=1n(xiμ)4σ4\text{Kurt} = \frac{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^4}{\sigma^4}

      Excess kurtosis is then Kurtexcess=Kurt3\text{Kurt}_{excess} = \text{Kurt} - 3.

  • Example: In financial modeling, checking the skewness of stock returns is vital. A significant negative skew might indicate a higher probability of large negative returns (crashes) compared to large positive returns. Similarly, examining kurtosis helps assess the risk associated with extreme price movements.

Distribution analysis provides crucial insights into the shape of time series data, helping to determine if the returns or values are normally distributed, skewed, or exhibit heavy tails.

Tools for Assessing Distribution Shape:

  • Histogram: Visual inspection can reveal asymmetry and the relative frequency of values in the tails.

  • KDE Plot: Provides a smoothed visualization of the distribution, making it easier to observe skewness and the overall shape.

  • Skewness Coefficient: A numerical measure of asymmetry.

    Skewness>0    Right-tailed (positive skew)\text{Skewness} > 0 \implies \text{Right-tailed (positive skew)}

    Skewness<0    Left-tailed (negative skew)\text{Skewness} < 0 \implies \text{Left-tailed (negative skew)}

    Skewness0    Symmetric\text{Skewness} \approx 0 \implies \text{Symmetric}
  • Kurtosis Coefficient: A numerical measure of the tailedness of the distribution.

    Kurtosis>3    Leptokurtic (heavy tails, sharper peak)\text{Kurtosis} > 3 \implies \text{Leptokurtic (heavy tails, sharper peak)}

    Kurtosis<3    Platykurtic (thin tails, flatter peak)\text{Kurtosis} < 3 \implies \text{Platykurtic (thin tails, flatter peak)}

    Kurtosis3    Mesokurtic (similar to normal distribution)\text{Kurtosis} \approx 3 \implies \text{Mesokurtic (similar to normal distribution)}
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, skewnorm
from scipy.stats import skew, kurtosis

# Generate data points
np.random.seed(42)  # for reproducibility
n_points = 1000

# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=n_points)
skewness_normal = 0

# Right-skewed distribution (using a different distribution for better visual)
right_skewed_data = np.random.exponential(scale=1, size=n_points) - 1 # Shifted exponential
skewness_right = skew(right_skewed_data)

# Left-skewed distribution (mirroring the right-skewed)
left_skewed_data = - (np.random.exponential(scale=1, size=n_points) - 1)
skewness_left = skew(left_skewed_data)

# Plotting histograms
plt.figure(figsize=(14, 6))

plt.subplot(1, 3, 1)
plt.hist(normal_data, bins=30, density=True, alpha=0.7, color='blue', label='Normal')
xmin, xmax = plt.xlim()
x_norm = np.linspace(xmin, xmax, 100)
p_norm = norm.pdf(x_norm, np.mean(normal_data), np.std(normal_data))
plt.plot(x_norm, p_norm, 'k', linewidth=2)
plt.title(f'Normal (Skewness ≈ {skewness_normal:.2f})')
plt.xlabel('Data Value')
plt.ylabel('Probability Density')
plt.legend()

plt.subplot(1, 3, 2)
plt.hist(right_skewed_data, bins=30, density=True, alpha=0.7, color='green', label='Right-Skewed')
xmin_right, xmax_right = plt.xlim()
x_right = np.linspace(xmin_right, xmax_right, 100)
p_right = skewnorm.pdf(x_right, a=skewness_right, loc=np.mean(right_skewed_data), scale=np.std(right_skewed_data)) # Using skewnorm to fit
# Note: Fitting skewnorm might not perfectly match the exponential, but shows the skewed shape
plt.plot(x_right, p_right, 'k', linewidth=2)
plt.title(f'Right-Skewed (Skewness ≈ {skewness_right:.2f})')
plt.xlabel('Data Value')
plt.ylabel('Probability Density')
plt.legend()

plt.subplot(1, 3, 3)
plt.hist(left_skewed_data, bins=30, density=True, alpha=0.7, color='red', label='Left-Skewed')
xmin_left, xmax_left = plt.xlim()
x_left = np.linspace(xmin_left, xmax_left, 100)
p_left = skewnorm.pdf(x_left, a=skewness_left, loc=np.mean(left_skewed_data), scale=np.std(left_skewed_data)) # Using skewnorm to fit
# Note: Fitting skewnorm might not perfectly match the mirrored exponential, but shows the skewed shape
plt.plot(x_left, p_left, 'k', linewidth=2)
plt.title(f'Left-Skewed (Skewness ≈ {skewness_left:.2f})')
plt.xlabel('Data Value')
plt.ylabel('Probability Density')
plt.legend()

plt.tight_layout()
plt.show()
<Figure size 1400x600 with 3 Axes>
# Your code here

🔍 Detecting and Handling Missing Data in Pandas

Reindexing

Reindexing allows you to change/add/delete the index on a specified axis:

import numpy as np
import pandas as pd

# Create sample data
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Reindex with additional column
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
print("DataFrame with missing data:")
df1

Dropping Missing Data

Use .dropna() to remove rows/columns with missing values:

# Drop rows with any missing values
print("After dropping rows with missing data:")
df1.dropna(how="any")

Filling Missing Data

Use .fillna() to fill missing values with a specified value:

print("After filling missing data with 5:")
df1.fillna(value=5)

Detecting Missing Values

Use isna() or isnull() to get a boolean mask:

print("Boolean mask for missing values:")
pd.isna(df1)