
Turning Data Into Business Stories

Welcome to the stage where data becomes drama. 🎭

You’ve loaded it, cleaned it, encoded it — now it’s time to interrogate it.

💬 “EDA is like a first date with your dataset — ask questions, spot red flags, and decide if you want to take it to production.”


🧠 Why EDA Matters

EDA is where you move from data janitor to data storyteller.

Here’s what EDA helps you uncover:

  • What’s normal in your data

  • What’s weird (and possibly a bug or a legend)

  • What features actually matter

  • How variables interact

⚡ A good EDA session can save you 10 hours of model debugging later.


🧰 Prerequisite

If you’re new to pandas or visualization, check my other book: 👉 📘 Programming for Business


📦 Step 1. Get to Know Your Dataset

Let’s start with the basics — “Hi, dataset! Tell me about yourself.”

import pandas as pd

df = pd.read_csv("sales_data_clean.csv")

df.info()
df.describe()
df.head()

💬 “.info() is like your dataset’s LinkedIn profile — gives you a summary, but hides the weird stuff.”


🔍 Step 2. Basic Summary Statistics

Numbers can whisper business truths — if you know how to listen.

df.describe(include='all')
Metric              What It Means
mean, median        Typical value
std, 25%/75%        How spread out things are
min, max            Range (and possible outliers)
unique              Categorical diversity

💡 Example: “Average order = $92, but max order = $10,000?” Either a whale customer… or a typo named Jeff.
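One way to chase down suspects like that $10,000 order is the classic IQR rule. Here's a minimal sketch on a toy order_value series (the data and 1.5× cutoff are illustrative, not from the book's dataset):

```python
import pandas as pd

# Toy data: mostly normal orders plus one suspicious whale (or typo)
orders = pd.Series([80, 95, 88, 102, 91, 97, 85, 10_000], name="order_value")

q1, q3 = orders.quantile(0.25), orders.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is a candidate outlier, not a verdict
outliers = orders[(orders < lower) | (orders > upper)]
print(outliers)
```

Whether an outlier is Jeff's typo or your best customer still requires a human — the rule only flags, it doesn't judge.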


🎨 Step 3. Univariate Visuals — “The Solo Acts”

Visualize one variable at a time to see its story.

import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df['sales_amount'], bins=30, kde=True)
plt.title("Sales Distribution")
plt.show()
What to Look For       Meaning
Bell curve             Healthy data
Right-skewed           A few big spenders
Flat or weird shape    Mixed data sources

💬 “If your histogram looks like a cliff, it’s not geology — it’s bad data entry.”


🍰 Categorical Insights

sns.countplot(x='region', data=df)
plt.title("Sales by Region")
plt.show()

💡 “If one region dominates, either it’s your best market — or others are asleep at work.”


🧩 Step 4. Bivariate Analysis — “The Dynamic Duos”

This is where you see relationships between features.

🧮 Numerical vs Numerical

sns.scatterplot(x='marketing_spend', y='sales_amount', data=df)
plt.title("Marketing Spend vs Sales")
plt.show()

💬 “If you see an upward trend, congrats — your marketing isn’t just burning money.”


🎯 Categorical vs Numerical

sns.boxplot(x='region', y='sales_amount', data=df)
plt.title("Sales Distribution by Region")
plt.show()

💡 Boxplots are like HR reports — they show who’s average, who’s overachieving, and who’s way out of line.


🧠 Step 5. Correlation — “The Data Gossip Network”

Find which variables are talking behind each other’s backs.

corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
Value    Meaning
0.9+     Practically twins
0.5      Moderate connection
0        No gossip
-0.9     Total opposites

💬 “If two columns are 0.99 correlated, drop one — they’re basically sending the same email twice.”
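A quick sketch of how you'd actually find those duplicate emails: scan the upper triangle of the (absolute) correlation matrix and flag any column above a threshold. The 0.95 cutoff and column names are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame: 'sales_eur' is just 'sales_usd' in another currency
df = pd.DataFrame({
    'sales_usd': [100, 200, 300, 400, 500],
    'sales_eur': [92, 184, 276, 368, 460],
    'discount':  [5, 3, 8, 2, 7],
})

corr = df.corr(numeric_only=True).abs()
# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```

Here `to_drop` ends up naming `sales_eur` — same signal, different units, one email too many.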


🪄 Step 6. Feature Relationships (Business Examples)

Example 1: Revenue vs Customer Segment

sns.barplot(x='customer_tier', y='sales_amount', data=df, estimator='mean')

“Gold customers really are gold — except when they ask for more discounts.”

Example 2: Monthly Sales Trend

df['month'] = pd.to_datetime(df['date']).dt.month
sns.lineplot(x='month', y='sales_amount', data=df)
plt.show()

“If sales spike every November — that’s not coincidence, that’s marketing season.”


📊 Step 7. Advanced EDA Helpers

When you’re ready to go pro:

!pip install ydata-profiling   # formerly published as pandas-profiling
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Business Data Profile", explorative=True)
profile.to_notebook_iframe()

💬 “This is like an MRI for your dataset — it finds hidden patterns before you even blink.”


Advanced DataFrame Indexing & Selection

After learning how to manipulate and summarize data, the next step is mastering indexing and selection — the key to slicing, dicing, and extracting precise data subsets efficiently.

Pandas provides multiple ways to access rows, columns, and specific cells — each suited for different situations. Understanding them deeply makes your analysis cleaner and faster.

🧱 1. Label-Based & Position-Based Selection

.loc[] — Access by Label (Name)

Use actual labels (names) of rows and columns.

import pandas as pd

data = {
    'Name': ['Asha', 'Raj', 'Kiran', 'Meera', 'Vikram'],
    'Department': ['Sales', 'Finance', 'Sales', 'HR', 'Finance'],
    'Salary': [45000, 55000, 60000, 40000, 58000],
    'Experience': [2, 5, 4, 3, 6]
}

df = pd.DataFrame(data, index=['E1', 'E2', 'E3', 'E4', 'E5'])

# Select a single row and specific columns using labels
print(df.loc['E3', ['Name', 'Salary']])

# Select multiple rows
print(df.loc[['E1', 'E5'], :])

.iloc[] — Access by Position (Integer)

Use integer positions instead of names.

# Select third row (index position 2)
print(df.iloc[2])

# Select first three rows and first two columns
print(df.iloc[0:3, 0:2])

💡 Key difference: .loc[] uses labels, .iloc[] uses numeric positions.
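One subtlety worth a quick demo: label slices with .loc include the endpoint, while .iloc follows normal Python half-open slicing. A minimal sketch on the same kind of labeled index:

```python
import pandas as pd

df = pd.DataFrame({'Salary': [45000, 55000, 60000, 40000, 58000]},
                  index=['E1', 'E2', 'E3', 'E4', 'E5'])

# .loc label slices are INCLUSIVE of the endpoint...
print(df.loc['E1':'E3'])   # rows E1, E2, E3 — three rows

# ...while .iloc position slices are EXCLUSIVE, like list slicing
print(df.iloc[0:3])        # positions 0, 1, 2 — also three rows
```

Both happen to return three rows here, but for a different reason — mixing up the two conventions is a classic off-by-one bug.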


🧩 2. Boolean Indexing — Filter Like a Pro 🔍

Boolean indexing lets you apply conditions directly on DataFrame columns.

# Filter where Salary > 50,000
high_salary = df[df['Salary'] > 50000]
print(high_salary)

# Combine multiple conditions with & (AND), | (OR), ~ (NOT)
filtered = df[(df['Department'] == 'Finance') & (df['Experience'] > 5)]
print(filtered)

💡 Always enclose conditions within parentheses and use &, |, ~ operators.


⚙️ 3. Index Management — Set & Reset

You can change which column acts as the row identifier for faster lookups and clarity.

# Set 'Name' as index
df2 = df.set_index('Name')
print(df2)

# Access row by name
print(df2.loc['Asha'])

# Reset back to default integer index
df2_reset = df2.reset_index()
print(df2_reset)

💡 Why? Setting an index improves clarity and speeds up lookups when a particular column represents unique entities.


✂️ 4. Conditional Slicing

Extract specific subsets by combining slicing and conditions.

# Slice rows 1 to 3 (exclusive of 4)
print(df.iloc[1:4])

# Conditional slicing — employees with experience between 3 and 6 years
print(df[(df['Experience'] >= 3) & (df['Experience'] <= 6)])

# Select specific rows and columns together
df.loc[df['Salary'] > 50000, ['Name', 'Department']]

🧠 5. Fast Scalar Access

For accessing single values, use .at[] or .iat[] — much faster than .loc[] or .iloc[].

# Access specific cell
print(df.at['E2', 'Salary'])   # Using label
print(df.iat[2, 2])            # Using position

# Update value
df.at['E4', 'Salary'] = 42000
print(df)

Summary — Advanced Selection in Pandas

MethodTypeExampleDescription
.loc[]Label-baseddf.loc['A']Select by row/column name
.iloc[]Position-baseddf.iloc[0:2, 1:3]Select by numeric position
Boolean IndexCondition-baseddf[df['Profit'] > 8000]Filter with logic
.set_index()Index managementdf.set_index('City')Change lookup key
.reset_index()Revert indexdf.reset_index()Restore default integers
.at[] / .iat[]Fast scalardf.at['E2', 'Salary']Quick single value access

🧪 Practice Lab — Advanced Indexing

Try the following to solidify your understanding:

  1. Select only the Name and Salary columns using both .loc and .iloc.

  2. Retrieve all employees with experience less than 4 years.

  3. Display the 2nd and 4th rows using .iloc.

  4. Set Department as the new index and display the DataFrame.

  5. Reset the index back to default integer.

  6. Filter employees whose name starts with 'M' or 'V'.

  7. Increase salary by 5% for all employees in the Sales department.

  8. Access the salary of the employee in the 3rd row and update it to 62000.

  9. Create a subset of only Finance department employees using .loc.

  10. Display employees with experience between 3 and 6 years.
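If you get stuck, here is a hedged sketch of tasks 6 and 7 (string filtering and the Sales raise) on the same df from this section:

```python
import pandas as pd

data = {
    'Name': ['Asha', 'Raj', 'Kiran', 'Meera', 'Vikram'],
    'Department': ['Sales', 'Finance', 'Sales', 'HR', 'Finance'],
    'Salary': [45000, 55000, 60000, 40000, 58000],
    'Experience': [2, 5, 4, 3, 6],
}
df = pd.DataFrame(data, index=['E1', 'E2', 'E3', 'E4', 'E5'])

# Task 6: names starting with 'M' or 'V' (str.startswith accepts a tuple)
mv = df[df['Name'].str.startswith(('M', 'V'))]
print(mv['Name'].tolist())   # ['Meera', 'Vikram']

# Task 7: 5% raise for Sales only, assigned in place via .loc
df['Salary'] = df['Salary'].astype('float64')   # avoid int/float dtype surprises
df.loc[df['Department'] == 'Sales', 'Salary'] *= 1.05
print(df.loc[['E1', 'E3'], 'Salary'])           # 47250.0 and 63000.0
```

The .loc-on-the-left pattern in task 7 is the idiomatic way to update a filtered subset without the dreaded SettingWithCopyWarning.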



🧩 Hierarchical (Multi-level) Indexing — Basics

In large datasets, data often has multiple dimensions — like regions, departments, or years — and we want to organize or summarize them efficiently. That’s where Hierarchical Indexing (also called Multi-level Indexing) in Pandas becomes powerful.

It allows you to store and access data across multiple levels of rows or columns, creating a structured and compact representation of complex datasets.


🎯 Learning Objectives

By the end of this section, you will be able to:

  • Create multi-level indexed DataFrames

  • Select data at different index levels

  • Perform stacking and unstacking

  • Apply aggregation based on specific index levels


🧱 1. Creating a Multi-level (Hierarchical) Index

Let’s start with a dataset representing sales performance across regions and years.

import pandas as pd

data = {
    'Sales': [25000, 27000, 30000, 28000, 26000, 31000],
    'Profit': [4000, 5000, 6000, 4500, 4700, 6200]
}

index = [
    ['North', 'North', 'South', 'South', 'East', 'East'],
    [2022, 2023, 2022, 2023, 2022, 2023]
]

df = pd.DataFrame(data, index=index, columns=['Sales', 'Profit'])
df.index.names = ['Region', 'Year']
print(df)

Output:

               Sales  Profit
Region Year
North  2022   25000    4000
       2023   27000    5000
South  2022   30000    6000
       2023   28000    4500
East   2022   26000    4700
       2023   31000    6200

💡 Notice: The index now has two levels — Region and Year.


🔍 2. Selecting Data at Different Index Levels

You can use .loc[] with tuples to access data at specific index combinations.

# Access all rows for North
print(df.loc['North'])

# Access only the row for South in 2022
print(df.loc[('South', 2022)])

Or use index slicing for partial selections.

# Select all years for East and South
print(df.loc[['East', 'South']])

⚙️ 3. Stacking and Unstacking

Stacking and unstacking reshape data between rows and columns — powerful for pivot-style transformations.

# Convert columns into inner-level rows
stacked = df.stack()
print(stacked.head())

# Convert inner-level rows back to columns
unstacked = stacked.unstack()
print(unstacked.head())

💡 Think of it as:

  • .stack() → makes the table taller (more rows, fewer columns)

  • .unstack() → makes the table wider (fewer rows, more columns)


📊 4. Aggregation by Index Levels

You can compute summary statistics for specific index levels using the level parameter of groupby().

# Total sales per Region
print(df.groupby(level='Region').sum())

# Average profit per Year
print(df.groupby(level='Year').mean())

💡 Use level name or number (level=0 for Region, level=1 for Year).


🧪 Practice Exercises

Try these small tasks:

  1. Create a DataFrame with 2-level index: ['Country', 'Year'] showing GDP and Inflation.

  2. Display all records for the country 'India'.

  3. Select the data for 'USA' in 2023.

  4. Compute the total GDP by country.

  5. Compute the average inflation by year.

  6. Convert the table to a stacked format and then back using .unstack().

  7. Add a third index level (Quarter) and print the new structure.

  8. Select only the GDP column for 'India' using .loc.

  9. Convert multi-index to columns using .reset_index().

  10. Sort the multi-index in ascending order of Year.


⚡ Summary

Concept                Function                                     Description
MultiIndex creation    pd.MultiIndex.from_arrays() / nested lists   Creates hierarchical indices
Selection              .loc[] with tuples                           Access multi-level data
Reshaping              .stack() / .unstack()                        Convert between wide and long formats
Aggregation            .groupby(level=)                             Aggregate data at specific index levels
Reset                  .reset_index()                               Flatten the hierarchy
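The summary mentions pd.MultiIndex.from_arrays() without showing it in action, so here is a minimal sketch — equivalent to the nested-list approach used earlier in this section:

```python
import pandas as pd

regions = ['North', 'North', 'South', 'South']
years = [2022, 2023, 2022, 2023]

# Build the hierarchical index explicitly, with named levels
idx = pd.MultiIndex.from_arrays([regions, years], names=['Region', 'Year'])
df = pd.DataFrame({'Sales': [25000, 27000, 30000, 28000]}, index=idx)

print(df.loc[('North', 2023), 'Sales'])   # 27000
print(df.index.names)                     # ['Region', 'Year']
```

from_arrays is handy when the index levels already live in separate lists or Series; from_tuples and from_product cover the other common shapes.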


🧬 Advanced Hierarchical (Multi-Level) Indexing

In the previous topic, you learned the basics of hierarchical indexing — creating multi-level DataFrames, selecting using .loc, and reshaping with .stack() / .unstack(). Now, let’s dive into advanced operations: selecting across multiple index levels, performing cross-sections, and aggregating data efficiently.


🎯 Learning Objectives

By the end of this section, you’ll be able to:

  • Create multi-level indices from multiple columns

  • Perform cross-section selection with .xs()

  • Use stacking/unstacking across levels

  • Apply aggregation at specific index levels

  • Understand index ordering and sorting


🧱 1. Creating Multi-Level Indices from Columns

You can convert existing columns into a multi-index using .set_index().

import pandas as pd

data = {
    'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'Year': [2022, 2023, 2022, 2023, 2022, 2023],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
    'Sales': [25000, 27000, 30000, 28000, 26000, 31000],
    'Profit': [4000, 5000, 6000, 4500, 4700, 6200]
}

df = pd.DataFrame(data)
df = df.set_index(['Region', 'Year', 'Quarter'])
print(df)

Output:

                         Sales  Profit
Region Year Quarter
North  2022 Q1        25000    4000
       2023 Q2        27000    5000
South  2022 Q1        30000    6000
       2023 Q2        28000    4500
East   2022 Q1        26000    4700
       2023 Q2        31000    6200

Now you have a three-level index: Region → Year → Quarter.


🔍 2. Cross-Section Selection with .xs()

When dealing with multiple levels, .xs() is your best friend for selecting data at a specific index level.

# Select all data for the year 2022
print(df.xs(2022, level='Year'))

# Select only Q1 records for all regions
print(df.xs('Q1', level='Quarter'))

# Select for one region and one year
print(df.xs(('South', 2023), level=('Region', 'Year')))

💡 .xs() (short for cross-section) lets you slice through the multi-index without complex tuple-based indexing.


🧩 3. Stacking and Unstacking with Levels

You can reshape data across any index level using the level argument.

# Unstack by Quarter — move it to columns
print(df.unstack(level='Quarter'))

# Stack back Quarter level
print(df.unstack(level='Quarter').stack(level='Quarter'))

💡 Tip:

  • Use stack(level='Quarter') to move Quarter to rows.

  • Use unstack(level='Quarter') to move Quarter to columns.


⚙️ 4. Aggregation at Specific Index Levels

Use groupby(level=) for advanced summarization.

# Total Sales per Region
print(df.groupby(level='Region').sum())

# Average Profit per Year
print(df.groupby(level='Year')['Profit'].mean())

# Total Sales per Region and Year
print(df.groupby(level=['Region', 'Year'])['Sales'].sum())

These aggregations respect the index hierarchy — great for summarizing business performance across regions, time, or departments.


🔄 5. Sorting and Reordering Multi-Level Indices

A well-sorted index helps when performing hierarchical operations.

# Sort by Region then Year
df = df.sort_index(level=['Region', 'Year'])
print(df.head())

# Swap index levels (Year before Region)
df_swapped = df.swaplevel('Region', 'Year')
print(df_swapped.head())

# Sort by index after swapping
df_swapped = df_swapped.sort_index()

💡 Swapping and sorting are common when data is structured differently across systems.


🧮 6. Aggregation Example — Profit Margin by Level

Let’s compute Profit Margin (%) by Region and Year.

df['Profit Margin (%)'] = (df['Profit'] / df['Sales']) * 100

# Average profit margin by Region and Year
result = df.groupby(level=['Region', 'Year'])['Profit Margin (%)'].mean()
print(result)

🧪 Practice Exercises

Try these to get comfortable:

  1. Create a DataFrame with index levels: ['Department', 'Year', 'Quarter'] showing Revenue and Cost.

  2. Convert columns into hierarchical index using .set_index().

  3. Use .xs() to get all Q2 data across departments.

  4. Compute total Revenue by Department.

  5. Find average Cost by Year.

  6. Unstack the Year level and display Revenue as columns.

  7. Stack it back and verify it matches the original DataFrame.

  8. Swap index levels (Department ↔ Year) and sort.

  9. Add a new calculated column Profit = Revenue - Cost.

  10. Compute and display average profit per department.


📚 Summary

Function                 Description
.set_index()             Converts columns into a hierarchical index
.xs()                    Cross-section selection across index levels
.stack() / .unstack()    Reshape DataFrames across levels
.groupby(level=)         Aggregate across multiple index levels
.swaplevel()             Reorder levels in a multi-index
.sort_index()            Sort hierarchical index for better readability

🧮 Complex DataFrame Manipulations

Data manipulation lies at the heart of every data analysis process. Once a dataset is loaded into a Pandas DataFrame, the next step is to explore, filter, summarize, and transform it to extract meaningful insights.

In this section, we’ll cover powerful techniques:

  1. Filtering data using multiple logical conditions

  2. Grouping with aggregation using groupby()

  3. Creating pivot tables for multi-dimensional summaries

  4. Adding calculated columns using existing data


🎯 Learning Objectives

By the end of this section, you will be able to:

  • Filter DataFrame rows using multiple logical conditions

  • Perform grouped aggregations with groupby()

  • Create pivot tables for multi-dimensional summaries

  • Add new calculated columns using existing data


🧮 1. Filtering Rows Based on Multiple Conditions

Filtering allows you to select only the data that meets certain criteria. Pandas uses boolean indexing to achieve this.

import pandas as pd

# Sample dataset
data = {
    'Employee': ['Asha', 'Raj', 'Kiran', 'Meera', 'Vikram'],
    'Department': ['Sales', 'Finance', 'Sales', 'HR', 'Finance'],
    'Salary': [45000, 55000, 60000, 40000, 58000],
    'Experience': [2, 5, 4, 3, 6]
}
df = pd.DataFrame(data)

# Filter employees in Sales with Salary above 45,000
filtered = df[(df['Department'] == 'Sales') & (df['Salary'] > 45000)]
print(filtered)

💡 Explanation:

  • Use & for AND and | for OR conditions.

  • Always wrap each condition in parentheses.


📊 2. Grouping Data with Aggregation

Grouping helps summarize large datasets into meaningful statistics. The groupby() function is one of the most powerful tools in Pandas.

# Average salary by department
grouped = df.groupby('Department')['Salary'].mean()
print(grouped)

# Multiple aggregations
agg_result = df.groupby('Department').agg({'Salary': ['mean', 'max'], 'Experience': 'mean'})
print(agg_result)

💡 Explanation:

  • You can compute multiple aggregations for each group.

  • The result is often a MultiIndex DataFrame, which can be flattened later for easier access.
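As a quick illustration of that flattening, a common idiom is joining the two column levels with an underscore. A sketch on the same kind of df (data values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Finance', 'Sales'],
    'Salary': [45000, 55000, 60000],
    'Experience': [2, 5, 4],
})

agg = df.groupby('Department').agg({'Salary': ['mean', 'max'], 'Experience': 'mean'})
print(agg.columns.tolist())   # MultiIndex: [('Salary', 'mean'), ('Salary', 'max'), ...]

# Flatten the two column levels into single strings
agg.columns = ['_'.join(col) for col in agg.columns]
print(agg.columns.tolist())   # ['Salary_mean', 'Salary_max', 'Experience_mean']
print(agg.loc['Sales', 'Salary_max'])
```

After flattening, plain column access like agg['Salary_max'] works again — no tuples required.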


🪄 3. Creating Pivot Tables

Pivot tables allow you to summarize and reshape data interactively, similar to Excel.

pivot = pd.pivot_table(
    df,
    values='Salary',
    index='Department',
    columns='Experience',
    aggfunc='mean',
    fill_value=0
)
print(pivot)

💡 Explanation:

  • index defines rows, columns defines columns, and values defines what to aggregate.

  • You can specify aggregation functions like mean, sum, or even custom functions.


➕ 4. Adding Calculated Columns

You can create new columns derived from existing ones. This is essential for analysis and feature engineering.

# Add a 10% bonus to each salary
df['Bonus'] = df['Salary'] * 0.10
df['Total Pay'] = df['Salary'] + df['Bonus']
print(df)

💡 Explanation:

  • You can perform vectorized operations directly on DataFrame columns.

  • Avoid loops — Pandas operations are faster and cleaner.
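For instance, a conditional column that might tempt you into a loop can be written as one vectorized expression with np.where. A hedged sketch — the 'Senior' cutoff at 4 years is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Asha', 'Raj', 'Kiran'],
    'Experience': [2, 5, 4],
})

# One expression, no Python-level loop: condition, value-if-true, value-if-false
df['Level'] = np.where(df['Experience'] >= 4, 'Senior', 'Junior')
print(df['Level'].tolist())   # ['Junior', 'Senior', 'Senior']
```

For more than two buckets, np.select or pd.cut extends the same loop-free idea.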


🧪 Practice Exercises

Try the following tasks using your dataset:

  1. Filter records where Experience > 3 and Salary < 60000.

  2. Group data by Department and calculate both average and maximum salaries.

  3. Create a pivot table showing average salary by Department and Experience.

  4. Add a column Tax = 10% of Salary, and compute Net Salary = Salary - Tax.

  5. Find the top 3 employees with the highest total pay (combine with nlargest()).

  6. For each department, display the employee with maximum experience.

  7. Filter all employees whose names start with 'A' or 'M'.

  8. Add a calculated column for Years Until Retirement assuming retirement age is 60.

  9. Identify the lowest paid employee in each department (combine with nsmallest()).

  10. Create a report summarizing total salary and average experience by department.


🏆 Identifying Top-Performing Records

Once you’ve cleaned, indexed, and organized your dataset, the next step is often to find the best (or worst) performers — the top customers, highest-grossing products, or most efficient employees.

Pandas provides multiple efficient methods:

  1. sort_values() — Order DataFrame by column(s)

  2. nlargest() / nsmallest() — Get top/bottom N records directly

  3. Grouping + sorting — Find top performers within groups


1. Sorting Values

Use .sort_values() to arrange data based on one or more columns.

import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Phone', 'Camera', 'Monitor'],
    'Revenue': [80000, 30000, 60000, 20000, 40000],
    'Profit': [20000, 8000, 15000, 4000, 7000]
})

# Sort by Revenue in descending order
df_sorted = df.sort_values(by='Revenue', ascending=False)
print(df_sorted)

Output:

  Product  Revenue  Profit
0  Laptop    80000   20000
2   Phone    60000   15000
4 Monitor    40000    7000
1  Tablet    30000    8000
3  Camera    20000    4000

You can also sort by multiple columns:

df.sort_values(by=['Revenue', 'Profit'], ascending=[False, False], inplace=True)

🎯 2. Getting Top N Records — .nlargest()

When you only care about the top few performers, .nlargest() is faster and simpler than sorting.

top3 = df.nlargest(3, 'Revenue')
print(top3)

Output:

  Product  Revenue  Profit
0  Laptop    80000   20000
2   Phone    60000   15000
4 Monitor    40000    7000

Similarly, .nsmallest() gives you the lowest-performing records:

bottom2 = df.nsmallest(2, 'Profit')
print(bottom2)

Output:

  Product  Revenue  Profit
3  Camera    20000    4000
4 Monitor    40000    7000

🧩 3. Grouping + Sorting — Find Top in Each Group

Find the top-performing record within each category.

data = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Furniture', 'Furniture', 'Clothing'],
    'Product': ['Laptop', 'Phone', 'Chair', 'Desk', 'Jacket'],
    'Revenue': [80000, 60000, 20000, 25000, 15000],
    'Profit': [20000, 15000, 4000, 5000, 3000]
})

# Top product in each category based on Revenue
top_per_category = data.sort_values('Revenue', ascending=False).groupby('Category').head(1)
print(top_per_category)

Output:

      Category Product  Revenue  Profit
0  Electronics  Laptop    80000   20000
3    Furniture    Desk    25000    5000
4     Clothing  Jacket    15000    3000

⚡ 4. Using idxmax() and idxmin()

Quickly get the index of the top or bottom record:

best_product = df.loc[df['Profit'].idxmax()]
print(best_product)

Output:

Product    Laptop
Revenue     80000
Profit      20000
Name: 0, dtype: object

🧠 Summary — Identifying Top Performers

MethodPurposeNotes
sort_values()Sort full DataFrameUse for ordered reports
nlargest() / nsmallest()Get top or bottom N recordsFast and simple
groupby().head()Top N within each groupGreat for categorical comparisons
idxmax() / idxmin()Get best record indexWorks on single columns

📚 Complete Practice Lab — Data Detective Challenge

Now that you’ve mastered EDA, indexing, grouping, and identifying top performers, it’s time to put it all together in a comprehensive analysis challenge.


🎯 Challenge: Company Sales Analysis

You have access to a dataset called company_sales.csv with the following structure:

| Column | Description |
| --- | --- |
| Date | Transaction date |
| Region | Sales region |
| Product_Category | Category of product sold |
| Quantity | Units sold |
| Price_per_Unit | Unit price |
| Sales_Amount | Total sales (Quantity × Price_per_Unit) |
| Marketing_Spend | Marketing spend for that transaction |
| Customer_Tier | Customer segment (Gold/Silver/Bronze) |

📋 Analysis Tasks

Use all the techniques you’ve learned to answer these questions:

Part 1: Data Exploration

  1. Load the dataset and display basic info (shape, columns, data types)

  2. Generate summary statistics for all numeric columns

  3. Check for any missing values or outliers

  4. Create a correlation matrix and visualize it

Part 2: Univariate & Bivariate Analysis

  1. Create a histogram of Sales_Amount — is it normally distributed?

  2. Create a boxplot of Sales_Amount by Region — which region performs best?

  3. Create a scatter plot: Marketing_Spend vs Sales_Amount — is there a relationship?

  4. Analyze which Product_Category has the highest average sales

Part 3: Indexing & Filtering

  1. Filter for transactions with Sales_Amount > $50,000 — how many records?

  2. Select only Gold-tier customers in the North region using .loc[]

  3. Find all transactions from the top 3 months by sales using indexing

Part 4: Grouping & Aggregation

  1. Group by Region and calculate:

    • Average Sales Amount

    • Total Marketing Spend

    • Count of transactions

  2. Create a pivot table showing average Sales Amount by Region and Customer_Tier

  3. Group by Product_Category and find the best-performing category by total revenue

Part 5: Top Performers & Calculated Columns

  1. Add a calculated column: ROI = (Sales_Amount - Marketing_Spend) / Marketing_Spend * 100

  2. Find the top 5 transactions with the highest ROI

  3. For each region, identify the top 3 performing days (by total sales)

  4. Find the bottom 2 performing product categories by average profit margin

Part 6: Advanced Insights

  1. Create a multi-level index: ['Region', 'Customer_Tier'] and calculate total sales

  2. Use .xs() to extract all sales for a specific tier across regions


🏆 Bonus Challenges

  • Visualization: Create a dashboard with 4 plots showing key metrics

  • Time Series: If Date is included, plot sales trends over time

  • Segmentation: Identify high-value customers (top 10% by sales)

  • Forecasting: Predict next month’s total sales based on current trends


💡 Key Takeaways

By completing this lab, you’ll have practiced:

  • Data Loading & Exploration — understanding your dataset

  • Univariate Analysis — single-variable patterns

  • Bivariate Analysis — relationships between variables

  • Filtering & Indexing — extracting relevant subsets

  • Grouping & Aggregation — summarizing by categories

  • Pivot Tables — multi-dimensional views

  • Top Performers — ranking and sorting

  • Feature Engineering — creating new insights


📖 Recap Table

| Topic | Key Functions | Use Case |
| --- | --- | --- |
| Summary Stats | `.describe()`, `.info()` | Understand data distribution |
| Visualization | `sns.histplot()`, `sns.scatterplot()` | Visual pattern recognition |
| Filtering | `df[df['col'] > value]` | Extract subsets |
| Indexing | `.loc[]`, `.iloc[]`, `.set_index()` | Efficient data access |
| Grouping | `.groupby().agg()` | Categorical summaries |
| Pivoting | `.pivot_table()` | Multi-dimensional views |
| Ranking | `.nlargest()`, `.sort_values()` | Top/bottom performers |
| Features | Calculated columns | New derived metrics |

🎓 You’re now a data detective! Go forth and uncover business stories hidden in your data! 🔍

# Your code here
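If you want a nudge to get going, here is one possible starting point — a sketch that uses a tiny synthetic frame in place of company_sales.csv (same columns as the table above) and tackles a few of the tasks:

```python
import pandas as pd

# Synthetic stand-in for company_sales.csv, following the challenge schema
df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'West'],
    'Product_Category': ['Tech', 'Home', 'Tech', 'Home'],
    'Quantity': [10, 5, 8, 12],
    'Price_per_Unit': [6000, 2000, 7000, 1500],
    'Marketing_Spend': [5000, 2000, 4000, 3000],
    'Customer_Tier': ['Gold', 'Silver', 'Gold', 'Bronze'],
})
df['Sales_Amount'] = df['Quantity'] * df['Price_per_Unit']

# Part 3.1 — transactions above $50,000
big = df[df['Sales_Amount'] > 50_000]

# Part 3.2 — Gold-tier customers in the North region via .loc[]
gold_north = df.loc[(df['Customer_Tier'] == 'Gold') & (df['Region'] == 'North')]

# Part 5.1 — ROI as a calculated column
df['ROI'] = (df['Sales_Amount'] - df['Marketing_Spend']) / df['Marketing_Spend'] * 100

# Part 5.2 — top transactions by ROI
print(df.nlargest(2, 'ROI')[['Region', 'Sales_Amount', 'ROI']])
```

Swap the synthetic frame for `pd.read_csv("company_sales.csv")` and the same patterns carry over to the full dataset.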

📊 Data Inspection Basics

Viewing Data

Use .head() and .tail() to view the top and bottom rows of the DataFrame:

import numpy as np
import pandas as pd

# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

print("First 5 rows:")
print(df.head())

print("\nLast 3 rows:")
df.tail(3)

Inspecting Index and Columns

print("Index:")
print(df.index)

print("\nColumns:")
df.columns

Converting to NumPy Array

Return a NumPy representation of the underlying data:

df.to_numpy()

Data Types

A DataFrame can hold a different dtype in each column; when converting to NumPy, pandas will find the common dtype that can hold all of them (often object):

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

print("Data types:")
print(df2.dtypes)

print("\nConverted to numpy:")
df2.to_numpy()

📈 Summary Statistics

The .describe() method provides quick statistical summary of your data:

df.describe()

Transposing Data

df.T

🔄 Sorting Data

Sort by Index

df.sort_index(axis=1, ascending=False)

Sort by Values

df.sort_values(by="B")

🎯 Data Selection

Basic Column Selection

Selecting a single column returns a Series:

df["A"]

Multiple Column Selection

df[["B", "A"]]

Row Slicing

print("First 3 rows:")
print(df[0:3])

print("\nDate range slice:")
df["20130102":"20130104"]

Selection by Label (.loc)

Use .loc[] for label-based indexing:

# Select a single row
print("Single row:")
print(df.loc[dates[0]])

print("\nSelect rows and columns:")
print(df.loc[:, ["A", "B"]])

print("\nSlice with labels (both endpoints included):")
df.loc["20130102":"20130104", ["A", "B"]]

Selection by Position (.iloc)

Use .iloc[] for integer position-based indexing:

# Select by position
print("4th row:")
print(df.iloc[3])

print("\nSlice rows and columns:")
print(df.iloc[3:5, 0:2])

print("\nSpecific positions:")
df.iloc[[1, 2, 4], [0, 2]]

Boolean Indexing

Filter data based on conditions:

# Filter where column A is greater than 0
df[df["A"] > 0]
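Conditions can also be combined with `&` (and) and `|` (or); each condition needs its own parentheses because of Python's operator precedence. A sketch (the random frame is rebuilt here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Same shape of frame as above, rebuilt so this snippet is self-contained
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Rows where A > 0 AND B < 0 — parentheses around each condition are required
both = df[(df["A"] > 0) & (df["B"] < 0)]

# Rows where A > 0 OR B > 0
either = df[(df["A"] > 0) | (df["B"] > 0)]

print(both.shape, either.shape)
```

Plain `and` / `or` raise an error on Series; only the element-wise `&` / `|` operators work here.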

Using .isin() for Filtering

df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
print("DataFrame with new column:")
print(df2)

print("\nFiltered data:")
df2[df2["E"].isin(["two", "four"])]

📊 Statistical Operations

Computing Statistics

Operations generally exclude missing data:

# Mean for each column
print("Mean by column:")
print(df.mean())

print("\nMean by row:")
df.mean(axis=1)

Operations with Different Indices

When operating with objects that have different indices, pandas will align them:

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print("Shifted series:")
print(s)

print("\nSubtraction with alignment:")
df.sub(s, axis="index")

User-Defined Functions

Apply custom functions using .agg() and .transform():

# Aggregation
print("Aggregation with custom function:")
print(df.agg(lambda x: np.mean(x) * 5.6))

print("\nTransformation:")
df.transform(lambda x: x * 101.2)

🎲 Value Counts

.value_counts() gives the frequency of each distinct value:

s = pd.Series(np.random.randint(0, 7, size=10))
print("Sample series:")
print(s)

print("\nValue counts:")
s.value_counts()

👥 Grouping Data

Group-by operations allow you to split data, apply a function, and combine results:

df_group = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
print("Sample DataFrame:")
df_group

Grouping and Aggregating

# Group by column A and sum
df_group.groupby("A")[["C", "D"]].sum()

Grouping by Multiple Columns

df_group.groupby(["A", "B"]).sum()

🔄 Reshaping Data

Stack and Unstack

Stack “compresses” a level in the DataFrame’s columns:

arrays = [
   ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ["one", "two", "one", "two", "one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df_stack = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2_stack = df_stack[:4]
print("Original DataFrame:")
df2_stack

Stacking

stacked = df2_stack.stack()
stacked

Unstacking

print("Unstack (default - last level):")
print(stacked.unstack())

print("\nUnstack level 1:")
print(stacked.unstack(1))

print("\nUnstack level 0:")
stacked.unstack(0)

Pivot Tables

Pivot tables provide a way to create cross-tabulations:

df_pivot = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 3,
        "B": ["A", "B", "C"] * 4,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "D": np.random.randn(12),
        "E": np.random.randn(12),
    }
)
print("Sample DataFrame:")
df_pivot

Creating Pivot Table

pd.pivot_table(df_pivot, values="D", index=["A", "B"], columns=["C"])
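By default pivot_table() aggregates with the mean; the aggfunc and fill_value parameters let you change the aggregation and replace empty cells. A sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical sales data for illustration
sales = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "South"],
    "Tier": ["Gold", "Silver", "Gold", "Gold", "Silver"],
    "Amount": [100, 80, 120, 60, 40],
})

# Sum instead of the default mean; fill_value plugs Region/Tier
# combinations that have no data
table = pd.pivot_table(sales, values="Amount", index="Region",
                       columns="Tier", aggfunc="sum", fill_value=0)
print(table)
```

This is the same pattern you would use for the Region × Customer_Tier pivot in the practice lab, just with a different aggregation.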