Statistical Visualization with Seaborn - Programming for Machine Learning and Business

Communicating Distributions, Relationships, and Group Differences More Efficiently¶

Notebook Guide¶

The detailed original Seaborn content stays in place. This added section clarifies what learners should watch for.

Focus questions¶

What distribution shape is the chart revealing?
What relationship between variables is being suggested?
How do grouping and confidence cues improve interpretation?
When is Seaborn faster than hand-building plots with Matplotlib?

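sns.barplot() + sns.heatmap() = automatic statistics plus polished defaults, the kind of output expected in professional analysis and consulting work.

Seaborn is widely used in academic papers and industry research, though it is not the only tool in use there.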

🎯 Seaborn = Matplotlib on Steroids¶

Plot	Seaborn Code	Business Insight	Replaces
Barplot	`sns.barplot()`	Auto error bars	Manual std dev
Boxplot	`sns.boxplot()`	Outliers detected	Excel bins
Heatmap	`sns.heatmap()`	Correlation matrix	100 Excel cells
Pairplot	`sns.pairplot()`	All correlations	10 scatter plots
Violin	`sns.violinplot()`	Distribution shape	Complex histograms

🚀 Step 1: Barplot = Auto Statistics (Run this!)¶

## !pip install seaborn  # Run once!

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

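## SIMULATED BUSINESS DATA (100 rows: each product/region pair appears 5 times)
np.random.seed(42)
base_sales = [45000, 32000, 18000, 25000, 8000]
data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'] * 20,
    # Repeat the base values to match the 100 rows (all columns must have equal
    # length), then add 10% noise so the error bars have spread to summarize
    'Sales': np.tile(base_sales, 20) * np.random.normal(1.0, 0.1, 100),
    'Region': ['North', 'South', 'East', 'West'] * 25
}
df = pd.DataFrame(data)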

## SEABORN MAGIC: Auto error bars + colors!
plt.figure(figsize=(12, 8))
sns.barplot(data=df, x='Product', y='Sales', hue='Region', palette='Set2')
plt.title('🏆 Product Sales by Region (Auto Error Bars)', fontsize=16, fontweight='bold')
plt.ylabel('Average Sales ($)', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Region')
plt.tight_layout()
plt.show()
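
What do the "auto error bars" mean? In recent Seaborn versions, the default error bar on sns.barplot() is a bootstrapped 95% confidence interval for the mean of each bar. The sketch below reproduces that idea by hand with NumPy. The helper name bootstrap_ci_of_mean and the sample values are assumptions made for illustration, and this is not Seaborn's internal code.

```python
import numpy as np

def bootstrap_ci_of_mean(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean of a 1-D sample."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement n_boot times and record each resample's mean
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Cut off equal tails on both sides of the resampled means
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

# Hypothetical sample: 25 laptop sales figures scattered around 45,000
rng = np.random.default_rng(42)
laptop_sales = rng.normal(45000, 4500, 25)

low, high = bootstrap_ci_of_mean(laptop_sales)
print(f"mean = {laptop_sales.mean():,.0f}, 95% CI = ({low:,.0f}, {high:,.0f})")
```

The interval narrows as the number of rows per bar grows, so a bar built from only a handful of rows will show a wide error bar even when the underlying process is stable.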

🔥 Step 2: Boxplots = Outlier Detection¶

## 800 SALES DATA POINTS (4 products x 200 each)
np.random.seed(42)
sales_data = {
    'Laptop': np.random.normal(45000, 8000, 200),
    'Phone': np.random.normal(32000, 6000, 200),
    'Tablet': np.random.normal(18000, 4000, 200),
    'Monitor': np.random.normal(25000, 5000, 200)
}
df_sales = pd.DataFrame(sales_data).melt(var_name='Product', value_name='Sales')

plt.figure(figsize=(10, 7))
sns.boxplot(data=df_sales, x='Product', y='Sales', palette='Set3')
plt.title('📦 Sales Distribution & Outliers', fontsize=16, fontweight='bold')
plt.ylabel('Sales ($)', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## OUTLIER STATS
for product in df_sales['Product'].unique():
    product_data = df_sales[df_sales['Product'] == product]['Sales']
    q1, q3 = product_data.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = product_data[(product_data < q1 - 1.5*iqr) | (product_data > q3 + 1.5*iqr)]
    print(f"🚨 {product}: {len(outliers)} outliers")
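
A point flagged by the 1.5 × IQR rule is not automatically a data error. For perfectly normal data, the rule still flags a small share of points (roughly 0.7%). The sketch below works that share out from the standard normal distribution using only the Python standard library. It assumes normally distributed data, which matches the simulated sales above but may not match real data.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Quartiles of a standard normal distribution
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# The same fences the boxplot rule uses
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability that a normal draw lands outside the fences
outside = std_normal.cdf(lower_fence) + (1 - std_normal.cdf(upper_fence))

print(f"fences at about ±{upper_fence:.2f} standard deviations")
print(f"expected share flagged: {outside:.2%}")
print(f"expected count in 200 draws: {200 * outside:.1f}")
```

With 200 simulated values per product, one or two flagged points per box is what chance alone would produce, so treat the counts printed above as a prompt to look closer rather than as proof of a problem.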

⚡ Step 3: Heatmap = Performance Matrix¶

## REGIONAL PERFORMANCE MATRIX
regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Phone', 'Tablet', 'Monitor']
sales_matrix = np.array([
    [45000, 42000, 38000, 41000],
    [32000, 30000, 28000, 31000],
    [18000, 16000, 20000, 17000],
    [25000, 24000, 26000, 23000]
])

df_matrix = pd.DataFrame(sales_matrix, index=products, columns=regions)

plt.figure(figsize=(10, 7))
## fmt takes a Python format spec; '$' is not valid there, so the unit goes in the colorbar label
sns.heatmap(df_matrix, annot=True, fmt=',.0f', cmap='YlOrRd',
            cbar_kws={'label': 'Sales ($)'}, linewidths=1)
plt.title('🔥 Regional Sales Heatmap', fontsize=16, fontweight='bold')
plt.ylabel('Product', fontsize=12)
plt.xlabel('Region', fontsize=12)
plt.tight_layout()
plt.show()
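
In a raw sales heatmap the color scale is dominated by the largest product (here Laptop), which can hide regional patterns in smaller products. One option, sketched below with pandas, is to convert each row to regional shares before plotting. The variable name region_share is an assumption for illustration.

```python
import numpy as np
import pandas as pd

regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Phone', 'Tablet', 'Monitor']
sales_matrix = np.array([
    [45000, 42000, 38000, 41000],
    [32000, 30000, 28000, 31000],
    [18000, 16000, 20000, 17000],
    [25000, 24000, 26000, 23000],
])
df_matrix = pd.DataFrame(sales_matrix, index=products, columns=regions)

# Divide each row by its own total so every product sums to 1.0
region_share = df_matrix.div(df_matrix.sum(axis=1), axis=0)
print(region_share.round(3))

# Strongest region per product, independent of the product's overall size
print(region_share.idxmax(axis=1))
```

Passing region_share to sns.heatmap() with fmt='.1%' should then color each cell by its share of the product's total instead of by raw dollars.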

🧠 Step 4: Pairplot = ALL Correlations Instantly¶

## MULTI-VARIABLE ANALYSIS
data_multi = {
    'Marketing': np.random.normal(40000, 10000, 100),
    'Sales': np.random.normal(60000, 15000, 100),
    'Profit': np.random.normal(18000, 5000, 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100)
}
df_multi = pd.DataFrame(data_multi)

## SEABORN PAIRPLOT = 6 PANELS IN 1 CALL (3 KDE diagonals + 3 scatter plots)
sns.pairplot(df_multi[['Marketing', 'Sales', 'Profit']],
             diag_kind='kde', plot_kws={'alpha': 0.6},
             corner=True, height=2)
plt.suptitle('🔍 Multi-Variable Correlation Matrix', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## CORRELATION TABLE
corr_matrix = df_multi[['Marketing', 'Sales', 'Profit']].corr()
print("📊 Correlation Matrix:")
print(corr_matrix.round(3))
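
Note that the Step 4 data draws Marketing, Sales, and Profit independently, so the scatter panels look like shapeless clouds and the correlations sit near zero. To see what a pairplot looks like when variables really move together, the sketch below builds the dependence in on purpose. The coefficients and noise levels are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100

# Assumed relationships, chosen only for illustration
marketing = rng.normal(40000, 10000, n)
sales = 20000 + 1.0 * marketing + rng.normal(0, 8000, n)  # sales rise with marketing
profit = 0.3 * sales + rng.normal(0, 3000, n)             # profit tracks sales

df_linked = pd.DataFrame({'Marketing': marketing, 'Sales': sales, 'Profit': profit})
print(df_linked.corr().round(3))
```

Swapping df_linked into the sns.pairplot() call above should show clear upward-sloping scatter panels, which is the pattern to look for when screening real features.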

📊 Step 5: Violin Plot = Distribution Shape¶

## ADVANCED DISTRIBUTION ANALYSIS
np.random.seed(42)
dist_data = {
    'Laptop': np.random.normal(45000, 8000, 200),
    'Phone': np.random.normal(32000, 6000, 200),
    'Tablet': np.random.normal(18000, 4000, 200)
}
df_dist = pd.DataFrame(dist_data).melt(var_name='Product', value_name='Sales')

plt.figure(figsize=(10, 7))
sns.violinplot(data=df_dist, x='Product', y='Sales', palette='husl', inner='quartile')
plt.title('🎻 Sales Distribution Shape (Violin Plot)', fontsize=16, fontweight='bold')
plt.ylabel('Sales ($)', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
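
The outline of a violin is a kernel density estimate (KDE) of the data, mirrored around the center line. The sketch below computes a Gaussian KDE by hand with NumPy to show what that curve is. The function name gaussian_kde_curve is hypothetical, and the bandwidth uses Scott's rule of thumb as one common choice, so the result may differ in detail from what Seaborn draws.

```python
import numpy as np

def gaussian_kde_curve(values, grid_size=512):
    """Evaluate a Gaussian kernel density estimate on an evenly spaced grid."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Scott's rule of thumb for the bandwidth (one common choice)
    bandwidth = values.std(ddof=1) * n ** (-1 / 5)
    grid = np.linspace(values.min() - 4 * bandwidth,
                       values.max() + 4 * bandwidth, grid_size)
    # One Gaussian bump per observation, averaged at every grid point
    z = (grid[:, None] - values[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))
    return grid, density

rng = np.random.default_rng(42)
tablet_sales = rng.normal(18000, 4000, 200)

grid, density = gaussian_kde_curve(tablet_sales)
area = density.sum() * (grid[1] - grid[0])  # Riemann sum; should be close to 1
peak = grid[density.argmax()]
print(f"area under curve = {area:.3f}, densest sales level = {peak:,.0f}")
```

A wider bandwidth gives a smoother violin and a narrower one shows more bumps, so a lumpy outline on a small sample can be an artifact of the smoothing choice rather than a real feature of the data.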

📋 Seaborn Cheat Sheet (Data Science Gold)¶

Plot	Code	Auto Magic	Business Use
Barplot	`sns.barplot(x,y,hue)`	Error bars	Regional analysis
Boxplot	`sns.boxplot(x,y)`	Outliers	Quality control
Heatmap	`sns.heatmap(df)`	Color scaling	Performance matrix
Pairplot	`sns.pairplot(df)`	All correlations	Feature selection
Violin	`sns.violinplot(x,y)`	Distribution shape	Advanced stats
FacetGrid	`sns.FacetGrid()`	Multi-panel	Executive dashboards

## PRO SETUP (Always!) - replace 'category' and 'metric' with your own column names
sns.set_style("whitegrid")
plt.figure(figsize=(12, 8))
sns.barplot(data=df, x='category', y='metric', palette='Set2')

🏆 YOUR EXERCISE: Build YOUR Seaborn Analysis¶

## MISSION: YOUR statistical visualization suite!

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## YOUR BUSINESS DATA
your_categories = ['???', '???', '???', '???']  # YOUR products/regions
your_base_values = [???, ???, ???, ???]          # YOUR base values (replace ??? with numbers)
your_hue_categories = ['A', 'B', 'C', 'D'] * 25  # YOUR subgroups

## GENERATE YOUR DATA
np.random.seed(42)
your_data = {
    'Category': np.repeat(your_categories, 25),
    'Subgroup': your_hue_categories,
    'Value': []
}
for base in your_base_values:
    your_data['Value'].extend(np.random.normal(base, base*0.2, 25))

df_your = pd.DataFrame(your_data)

## 1. YOUR BARPLOT WITH ERROR BARS
plt.figure(figsize=(12, 8))
sns.barplot(data=df_your, x='Category', y='Value', hue='Subgroup', palette='Set2')
plt.title('🏆 YOUR Business Analysis (Auto Error Bars)', fontsize=16, fontweight='bold')
plt.ylabel('Your Metric', fontsize=12)
plt.xlabel('Your Categories', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Subgroups')
plt.tight_layout()
plt.show()

## 2. YOUR BOXPLOT FOR OUTLIERS
plt.figure(figsize=(10, 7))
sns.boxplot(data=df_your, x='Category', y='Value', palette='Set3')
plt.title('🚨 YOUR Outlier Detection', fontsize=16, fontweight='bold')
plt.ylabel('Your Metric', fontsize=12)
plt.xlabel('Your Categories', fontsize=12)
plt.tight_layout()
plt.show()

print("✅ YOUR SEABORN SUITE COMPLETE!")

Example to test:

your_categories = ['Product A', 'Product B', 'Product C', 'Product D']
your_base_values = [45000, 32000, 18000, 25000]
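
Before plotting, it can help to confirm that the generated table has the shape you expect. The sketch below rebuilds the exercise data with the example values and prints two quick checks. The names df_check and summary are assumptions for illustration, and it uses np.random.default_rng, so the exact numbers will differ from the np.random.seed(42) version above.

```python
import numpy as np
import pandas as pd

your_categories = ['Product A', 'Product B', 'Product C', 'Product D']
your_base_values = [45000, 32000, 18000, 25000]
your_hue_categories = ['A', 'B', 'C', 'D'] * 25

rng = np.random.default_rng(42)
values = np.concatenate([rng.normal(base, base * 0.2, 25) for base in your_base_values])
df_check = pd.DataFrame({
    'Category': np.repeat(your_categories, 25),
    'Subgroup': your_hue_categories,
    'Value': values,
})

# Every category should have 25 rows and a mean near its base value
summary = df_check.groupby('Category')['Value'].agg(['count', 'mean', 'std'])
print(summary.round(0))

# Rows per Category/Subgroup cell: small cells mean wide error bars
print(pd.crosstab(df_check['Category'], df_check['Subgroup']))
```

Each Category/Subgroup cell holds only six or seven rows here, which is why the error bars in the grouped barplot come out fairly wide.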

YOUR MISSION:

Add YOUR categories + base values
Run YOUR statistical analysis
Screenshot → “I do data science visualizations!”

🎉 What You Mastered¶

Seaborn Skill	Status	Data Science Power
Barplot	✅	Auto error bars
Boxplot	✅	Outlier detection
Heatmap	✅	Performance matrix
Pairplot	✅	All correlations
Violin	✅	Distribution analysis
Full toolkit	✅	Publication-quality plots

Next: Plotly Interactive (Clickable dashboards = Stakeholder demos!)

print("🎊" * 20)
print("SEABORN SECTION COMPLETE!")
print("💻 Auto stats + beauty = Publication quality!")
print("🚀 Academic papers + industry research use THESE plots!")
print("🎊" * 20)

Notice how sns.barplot(hue='Region') added error bars and a consistent color scheme that would take far longer to build by hand in a spreadsheet. You went from basic charts to sns.pairplot() grids that show every pairwise relationship in one call, plus sns.heatmap(annot=True) performance matrices and sns.violinplot() distribution shapes. These are the same plot types that appear in research papers and analytics reports, and Seaborn gets you from raw data to a publication-quality figure quickly.

# Your code here

Exercises¶

Exercise 1¶

Exercise 2¶

Exercise 3¶

Summary¶

Seaborn helps move quickly from raw tables to interpretable statistical plots. Keep the original examples as patterns for comparing categories, spotting trends, and describing uncertainty.

8. Interactive Code¶

9. Guided Practice¶

Why is Seaborn popular for statistical visuals?¶

It avoids all plotting abstractions → Seaborn adds useful high-level plotting helpers.

It provides concise interfaces for informative statistical charts → Correct. Seaborn makes many common plots easier.

It stores data in SQLite files → That is unrelated.

It replaces Python dictionaries → Seaborn is a visualization library.

Which region appears most often in the example?¶

North → North appears once.

South → Correct. South appears twice.

West → West appears once.

All are tied → The counts are not equal.