Communicating Distributions, Relationships, and Group Differences More Efficiently

Notebook Guide

The detailed original Seaborn content stays in place. This added section clarifies what learners should watch for.

Focus questions

  • What distribution shape is the chart revealing?

  • What relationship between variables is being suggested?

  • How do grouping and confidence cues improve interpretation?

  • When is Seaborn faster than hand-building plots with Matplotlib?
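
The last focus question can be made concrete with a small side-by-side sketch. It assumes a tiny synthetic table (the column names Product and Sales and all values are illustrative) and draws the same kind of chart twice: once by hand with Matplotlib, once with sns.barplot(). The two panels are not identical: the hand-built version uses the standard error of the mean, while Seaborn's default interval is a bootstrapped 95% confidence interval.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic illustration data (assumed values, not real sales figures)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Product": np.repeat(["Laptop", "Phone", "Tablet"], 30),
    "Sales": np.concatenate([
        rng.normal(45000, 8000, 30),
        rng.normal(32000, 6000, 30),
        rng.normal(18000, 4000, 30),
    ]),
})

fig, (ax_manual, ax_sns) = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

# By hand: aggregate first, then draw the bars and error bars yourself
summary = demo.groupby("Product", sort=False)["Sales"].agg(["mean", "sem"])
ax_manual.bar(summary.index, summary["mean"], yerr=summary["sem"], capsize=4)
ax_manual.set_xlabel("Product")
ax_manual.set_ylabel("Average Sales ($)")
ax_manual.set_title("Matplotlib: manual aggregation")

# Seaborn: aggregation and interval estimation happen inside the call
sns.barplot(data=demo, x="Product", y="Sales", ax=ax_sns)
ax_sns.set_title("Seaborn: one call")

fig.tight_layout()
```

Call plt.show() afterwards to display the figure. The Seaborn version pays off most when the chart needs grouping or interval estimates; for a single custom-drawn figure, plain Matplotlib can be just as quick.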

sns.barplot() and sns.heatmap() cover a large share of everyday statistical charting: Seaborn computes the summary statistics and applies readable default styling for you.

Seaborn plots are widely used in academic papers and industry research.


🎯 Seaborn = Matplotlib on Steroids

| Plot | Seaborn Code | Business Insight | Replaces |
| --- | --- | --- | --- |
| Barplot | sns.barplot() | Auto error bars | Manual std dev |
| Boxplot | sns.boxplot() | Outliers detected | Excel bins |
| Heatmap | sns.heatmap() | Correlation matrix | 100 Excel cells |
| Pairplot | sns.pairplot() | All correlations | 10 scatter plots |
| Violin | sns.violinplot() | Distribution shape | Complex histograms |

🚀 Step 1: Barplot = Auto Statistics (Run this!)

## !pip install seaborn  # Run once!

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

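## SYNTHETIC BUSINESS DATA (100 rows: 5 products x 4 regions x 5 repeats)
np.random.seed(42)
base_sales = [45000, 32000, 18000, 25000, 8000]
data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'] * 20,
    # Repeat the base values to 100 rows and add 10% noise so the error bars are visible
    'Sales': np.array(base_sales * 20) * np.random.normal(1.0, 0.1, 100),
    'Region': ['North', 'South', 'East', 'West'] * 25
}
df = pd.DataFrame(data)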

## SEABORN MAGIC: Auto error bars + colors!
plt.figure(figsize=(12, 8))
sns.barplot(data=df, x='Product', y='Sales', hue='Region', palette='Set2')
plt.title('🏆 Product Sales by Region (Auto Error Bars)', fontsize=16, fontweight='bold')
plt.ylabel('Average Sales ($)', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Region')
plt.tight_layout()
plt.show()
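
What the "auto error bars" mean is worth spelling out. By default sns.barplot() draws the mean of each group with a bootstrapped 95% confidence interval. The sketch below reproduces that idea with NumPy alone; bootstrap_ci_mean is a hypothetical helper written for this illustration, not a Seaborn function, and the sample values are assumed.

```python
import numpy as np

def bootstrap_ci_mean(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean (illustrative helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

# Assumed sample: 20 noisy laptop sales centered on 45,000
rng = np.random.default_rng(42)
laptop_sales = rng.normal(45000, 8000, 20)

low, high = bootstrap_ci_mean(laptop_sales)
print(f"mean = {laptop_sales.mean():,.0f}, 95% CI = ({low:,.0f}, {high:,.0f})")
```

A short bar on the chart therefore signals a well-estimated mean, not a small spread in the raw data. Because the interval comes from random resampling, Seaborn's bars can shift slightly between runs unless a seed is fixed.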

🔥 Step 2: Boxplots = Outlier Detection

## 800 SALES DATA POINTS (4 products x 200 samples)
np.random.seed(42)
sales_data = {
    'Laptop': np.random.normal(45000, 8000, 200),
    'Phone': np.random.normal(32000, 6000, 200),
    'Tablet': np.random.normal(18000, 4000, 200),
    'Monitor': np.random.normal(25000, 5000, 200)
}
df_sales = pd.DataFrame(sales_data).melt(var_name='Product', value_name='Sales')

plt.figure(figsize=(10, 7))
sns.boxplot(data=df_sales, x='Product', y='Sales', palette='Set3')
plt.title('📦 Sales Distribution & Outliers', fontsize=16, fontweight='bold')
plt.ylabel('Sales ($)', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## OUTLIER STATS
for product in df_sales['Product'].unique():
    product_data = df_sales[df_sales['Product'] == product]['Sales']
    q1, q3 = product_data.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = product_data[(product_data < q1 - 1.5*iqr) | (product_data > q3 + 1.5*iqr)]
    print(f"🚨 {product}: {len(outliers)} outliers")

Step 3: Heatmap = Performance Matrix

## REGIONAL PERFORMANCE MATRIX
regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Phone', 'Tablet', 'Monitor']
sales_matrix = np.array([
    [45000, 42000, 38000, 41000],
    [32000, 30000, 28000, 31000],
    [18000, 16000, 20000, 17000],
    [25000, 24000, 26000, 23000]
])

df_matrix = pd.DataFrame(sales_matrix, index=products, columns=regions)

plt.figure(figsize=(10, 7))
## fmt must be a plain format spec ('$,.0f' raises ValueError); the "$" lives in the colorbar label
sns.heatmap(df_matrix, annot=True, fmt=',.0f', cmap='YlOrRd',
            cbar_kws={'label': 'Sales ($)'}, linewidths=1)
plt.title('🔥 Regional Sales Heatmap', fontsize=16, fontweight='bold')
plt.ylabel('Product', fontsize=12)
plt.xlabel('Region', fontsize=12)
plt.tight_layout()
plt.show()
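
The matrix in Step 3 is typed in by hand. Sales records more often arrive in long form, one row per transaction, and have to be reshaped first. A minimal sketch of that reshaping, assuming a hypothetical long_df with Product, Region, and Sales columns (the rows are made up for illustration):

```python
import pandas as pd

# Hypothetical long-form records: one row per sale
long_df = pd.DataFrame({
    "Product": ["Laptop", "Laptop", "Phone", "Phone", "Laptop", "Phone"],
    "Region":  ["North",  "South",  "North", "South", "North",  "South"],
    "Sales":   [45000,    42000,    32000,   30000,   47000,    28000],
})

# Rows become products, columns become regions, cells hold the mean sale
matrix = long_df.pivot_table(index="Product", columns="Region",
                             values="Sales", aggfunc="mean")
print(matrix)
```

The resulting DataFrame has the same shape as df_matrix and can be passed to sns.heatmap(matrix, annot=True) directly.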

🧠 Step 4: Pairplot = ALL Correlations Instantly

## MULTI-VARIABLE ANALYSIS
data_multi = {
    'Marketing': np.random.normal(40000, 10000, 100),
    'Sales': np.random.normal(60000, 15000, 100),
    'Profit': np.random.normal(18000, 5000, 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100)
}
df_multi = pd.DataFrame(data_multi)

## SEABORN PAIRPLOT = 6 PANELS IN 1 CALL! (3 KDE diagonals + 3 scatter plots with corner=True)
sns.pairplot(df_multi[['Marketing', 'Sales', 'Profit']],
             diag_kind='kde', plot_kws={'alpha': 0.6},
             corner=True, height=2)
plt.suptitle('🔍 Multi-Variable Correlation Matrix', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

## CORRELATION TABLE
corr_matrix = df_multi[['Marketing', 'Sales', 'Profit']].corr()
print("📊 Correlation Matrix:")
print(corr_matrix.round(3))
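
Because the three columns in Step 4 are drawn independently of each other, the printed correlations should come out close to zero and the scatter panels look like shapeless clouds. The sketch below builds a variant in which Sales depends on Marketing and Profit depends on Sales, so the pairplot has a visible pattern to show. The coefficients (1.2 and 0.3) and noise levels are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100

marketing = rng.normal(40000, 10000, n)
# Assumed relationships: sales rise with marketing spend, profit is a share of sales
sales = 12000 + 1.2 * marketing + rng.normal(0, 6000, n)
profit = 0.3 * sales + rng.normal(0, 2000, n)

df_linked = pd.DataFrame({"Marketing": marketing, "Sales": sales, "Profit": profit})
print(df_linked.corr().round(3))
```

Passing df_linked to sns.pairplot(df_linked, corner=True) should show elongated, upward-sloping clouds in the scatter panels, which is the visual signature of a positive correlation.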

📊 Step 5: Violin Plot = Distribution Shape

## ADVANCED DISTRIBUTION ANALYSIS
np.random.seed(42)
dist_data = {
    'Laptop': np.random.normal(45000, 8000, 200),
    'Phone': np.random.normal(32000, 6000, 200),
    'Tablet': np.random.normal(18000, 4000, 200)
}
df_dist = pd.DataFrame(dist_data).melt(var_name='Product', value_name='Sales')

plt.figure(figsize=(10, 7))
sns.violinplot(data=df_dist, x='Product', y='Sales', palette='husl', inner='quartile')
plt.title('🎻 Sales Distribution Shape (Violin Plot)', fontsize=16, fontweight='bold')
plt.ylabel('Sales ($)', fontsize=12)
plt.xlabel('Product', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
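
The dashed lines that inner='quartile' draws inside each violin mark the 25th, 50th, and 75th percentiles. A quick way to read off the numbers behind those lines is a grouped quantile table; this sketch regenerates the same synthetic data so it runs on its own, and the column labels Q1, Median, and Q3 are names chosen here.

```python
import numpy as np
import pandas as pd

# Same synthetic setup as the violin example above
np.random.seed(42)
dist_data = {
    'Laptop': np.random.normal(45000, 8000, 200),
    'Phone': np.random.normal(32000, 6000, 200),
    'Tablet': np.random.normal(18000, 4000, 200)
}
df_dist = pd.DataFrame(dist_data).melt(var_name='Product', value_name='Sales')

# One row per product, one column per quartile
quartiles = (
    df_dist.groupby('Product')['Sales']
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ['Q1', 'Median', 'Q3']
print(quartiles.round(0))
```

The width of the violin at any height is a kernel density estimate, so it shows where values cluster, while the quartile lines show how the data split into four equal-sized parts.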

📋 Seaborn Cheat Sheet (Data Science Gold)

| Plot | Code | Auto Magic | Business Use |
| --- | --- | --- | --- |
| Barplot | sns.barplot(x,y,hue) | Error bars | Regional analysis |
| Boxplot | sns.boxplot(x,y) | Outliers | Quality control |
| Heatmap | sns.heatmap(df) | Color scaling | Performance matrix |
| Pairplot | sns.pairplot(df) | All correlations | Feature selection |
| Violin | sns.violinplot(x,y) | Distribution shape | Advanced stats |
| FacetGrid | sns.FacetGrid() | Multi-panel | Executive dashboards |

## PRO SETUP (Always!)
## 'category' and 'metric' are placeholders: swap in your own column names
sns.set_style("whitegrid")
plt.figure(figsize=(12, 8))
sns.barplot(data=df, x='category', y='metric', palette='Set2')

🏆 YOUR EXERCISE: Build YOUR Seaborn Analysis

## MISSION: YOUR statistical visualization suite!

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## YOUR BUSINESS DATA
your_categories = ['???', '???', '???', '???']  # YOUR products/regions
your_base_values = [??? , ???, ???, ???]         # YOUR base values
your_hue_categories = ['A', 'B', 'C', 'D'] * 25  # YOUR subgroups

## GENERATE YOUR DATA
np.random.seed(42)
your_data = {
    'Category': np.repeat(your_categories, 25),
    'Subgroup': your_hue_categories,
    'Value': []
}
for i, base in enumerate(your_base_values):
    your_data['Value'].extend(np.random.normal(base, base*0.2, 25))

df_your = pd.DataFrame(your_data)

## 1. YOUR BARPLOT WITH ERROR BARS
plt.figure(figsize=(12, 8))
sns.barplot(data=df_your, x='Category', y='Value', hue='Subgroup', palette='Set2')
plt.title('🏆 YOUR Business Analysis (Auto Error Bars)', fontsize=16, fontweight='bold')
plt.ylabel('Your Metric', fontsize=12)
plt.xlabel('Your Categories', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Subgroups')
plt.tight_layout()
plt.show()

## 2. YOUR BOXPLOT FOR OUTLIERS
plt.figure(figsize=(10, 7))
sns.boxplot(data=df_your, x='Category', y='Value', palette='Set3')
plt.title('🚨 YOUR Outlier Detection', fontsize=16, fontweight='bold')
plt.ylabel('Your Metric', fontsize=12)
plt.xlabel('Your Categories', fontsize=12)
plt.tight_layout()
plt.show()

print("✅ YOUR SEABORN SUITE COMPLETE!")

Example to test:

your_categories = ['Product A', 'Product B', 'Product C', 'Product D']
your_base_values = [45000, 32000, 18000, 25000]

YOUR MISSION:

  1. Add YOUR categories + base values

  2. Run YOUR statistical analysis

  3. Screenshot: “I do data science visualizations!”


🎉 What You Mastered

| Seaborn Skill | Data Science Power |
| --- | --- |
| Barplot | Auto error bars |
| Boxplot | Outlier detection |
| Heatmap | Performance matrix |
| Pairplot | All correlations |
| Violin | Distribution analysis |
| Statistical plotting overall | Research level |

Next: Plotly Interactive (Clickable dashboards = Stakeholder demos!)

print("🎊" * 20)
print("SEABORN = STATISTICAL PLOTS IN ONE CALL!")
print("💻 Auto stats + beauty = Publication quality!")
print("🚀 The same plot types appear in academic papers and industry research!")
print("🎊" * 20)

To recap: with hue='Region', sns.barplot() splits each bar by subgroup and draws bootstrapped 95% confidence intervals by default, which takes noticeably longer to build by hand in a spreadsheet. sns.pairplot() lays out every pairwise scatter plot in a single call, sns.heatmap(annot=True) turns a table into an annotated color matrix, and sns.violinplot() shows the shape of each distribution. Together these move you from basic charts to publication-quality statistical plots with very little code.

# Your code here

Exercises

Exercise 1


Exercise 2


Exercise 3


Summary

Seaborn helps move quickly from raw tables to interpretable statistical plots. Keep the original examples as patterns for comparing categories, spotting trends, and describing uncertainty.

8. Interactive Code

Expected output
['North', 'South', 'West']
Expected output
{'North': 1, 'South': 2, 'West': 1}
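
The code cell that produced these two outputs did not survive on this page. One plausible reconstruction, assuming a small list of region labels in which South appears twice (the list is a guess chosen to match the expected outputs, not the original cell):

```python
from collections import Counter

# Assumed input: chosen so the results match the expected outputs shown above
regions = ['North', 'South', 'South', 'West']

# Unique labels in alphabetical order
unique_regions = sorted(set(regions))
print(unique_regions)

# How often each label occurs, in order of first appearance
region_counts = dict(Counter(regions))
print(region_counts)
```

Counting categories like this is the tabular equivalent of what sns.countplot() draws as bar heights.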

9. Guided Practice

  • It avoids all plotting abstractions. Feedback: Seaborn adds useful high-level plotting helpers.

  • It provides concise interfaces for informative statistical charts. Feedback: Correct. Seaborn makes many common plots easier.

  • It stores data in SQLite files. Feedback: That is unrelated.

  • It replaces Python dictionaries. Feedback: Seaborn is a visualization library.

Which region appears most often in the example?

  • North. Feedback: North appears once.

  • South. Feedback: Correct. South appears twice.

  • West. Feedback: West appears once.

  • All are tied. Feedback: The counts are not equal.