
Turning Data Into Business Stories

Welcome to the stage where data becomes drama. 🎭

You’ve loaded it, cleaned it, encoded it — now it’s time to interrogate it.

💬 “EDA is like a first date with your dataset — ask questions, spot red flags, and decide if you want to take it to production.”


🧠 Why EDA Matters

EDA is where you move from data janitor to data storyteller.

Here’s what EDA helps you uncover:

  • What’s normal in your data

  • What’s weird (and possibly a bug or a legend)

  • What features actually matter

  • How variables interact

⚡ A good EDA session can save you 10 hours of model debugging later.


🧰 Prerequisite

If you’re new to pandas or visualization, check my other book: 👉 📘 Programming for Business


📦 Step 1. Get to Know Your Dataset

Let’s start with the basics — “Hi, dataset! Tell me about yourself.”

import pandas as pd

df = pd.read_csv("sales_data_clean.csv")

df.info()
df.describe()
df.head()

💬 “.info() is like your dataset’s LinkedIn profile — gives you a summary, but hides the weird stuff.”


🔍 Step 2. Basic Summary Statistics

Numbers can whisper business truths — if you know how to listen.

df.describe(include='all')
Metric              What It Means
mean, median        Typical value
std, 25%/75%        How spread out things are
min, max            Range (and possible outliers)
unique              Categorical diversity

💡 Example: “Average order = $92, but max order = $10,000?” Either a whale customer… or a typo named Jeff.
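One way to chase down suspects like that $10,000 order is the classic IQR rule. Here's a minimal sketch on a toy order_value series (the data and 1.5× cutoff are illustrative, not from the book's dataset):

```python
import pandas as pd

# Toy data: mostly normal orders plus one suspicious whale (or typo)
orders = pd.Series([80, 95, 88, 102, 91, 97, 85, 10_000], name="order_value")

q1, q3 = orders.quantile(0.25), orders.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is a candidate outlier, not a verdict
outliers = orders[(orders < lower) | (orders > upper)]
print(outliers)
```

Whether an outlier is Jeff's typo or your best customer still requires a human — the rule only flags, it doesn't judge.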


🎨 Step 3. Univariate Visuals — “The Solo Acts”

Visualize one variable at a time to see its story.

import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df['sales_amount'], bins=30, kde=True)
plt.title("Sales Distribution")
plt.show()
What to Look For       Meaning
Bell curve             Healthy data
Right-skewed           A few big spenders
Flat or weird shape    Mixed data sources

💬 “If your histogram looks like a cliff, it’s not geology — it’s bad data entry.”


🍰 Categorical Insights

sns.countplot(x='region', data=df)
plt.title("Sales by Region")
plt.show()

💡 “If one region dominates, either it’s your best market — or others are asleep at work.”


🧩 Step 4. Bivariate Analysis — “The Dynamic Duos”

This is where you see relationships between features.

🧮 Numerical vs Numerical

sns.scatterplot(x='marketing_spend', y='sales_amount', data=df)
plt.title("Marketing Spend vs Sales")
plt.show()

💬 “If you see an upward trend, congrats — your marketing isn’t just burning money.”


🎯 Categorical vs Numerical

sns.boxplot(x='region', y='sales_amount', data=df)
plt.title("Sales Distribution by Region")
plt.show()

💡 Boxplots are like HR reports — they show who’s average, who’s overachieving, and who’s way out of line.


🧠 Step 5. Correlation — “The Data Gossip Network”

Find which variables are talking behind each other’s backs.

corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
Value    Meaning
0.9+     Practically twins
0.5      Moderate connection
0        No gossip
-0.9     Total opposites

💬 “If two columns are 0.99 correlated, drop one — they’re basically sending the same email twice.”
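A quick sketch of how you'd actually find those duplicate emails: scan the upper triangle of the (absolute) correlation matrix and flag any column above a threshold. The 0.95 cutoff and column names are illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame: 'sales_eur' is just 'sales_usd' in another currency
df = pd.DataFrame({
    'sales_usd': [100, 200, 300, 400, 500],
    'sales_eur': [92, 184, 276, 368, 460],
    'discount':  [5, 3, 8, 2, 7],
})

corr = df.corr(numeric_only=True).abs()
# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```

Here `to_drop` ends up naming `sales_eur` — same signal, different units, one email too many.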


🪄 Step 6. Feature Relationships (Business Examples)

Example 1: Revenue vs Customer Segment

sns.barplot(x='customer_tier', y='sales_amount', data=df, estimator='mean')

“Gold customers really are gold — except when they ask for more discounts.”

Example 2: Monthly Sales Trend

df['month'] = pd.to_datetime(df['date']).dt.month
sns.lineplot(x='month', y='sales_amount', data=df)
plt.show()

“If sales spike every November — that’s not coincidence, that’s marketing season.”


📊 Step 7. Advanced EDA Helpers

When you’re ready to go pro:

!pip install ydata-profiling   # formerly published as pandas-profiling
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Business Data Profile", explorative=True)
profile.to_notebook_iframe()

💬 “This is like an MRI for your dataset — it finds hidden patterns before you even blink.”


Advanced DataFrame Indexing & Selection

After learning how to manipulate and summarize data, the next step is mastering indexing and selection — the key to slicing, dicing, and extracting precise data subsets efficiently.

Pandas provides multiple ways to access rows, columns, and specific cells — each suited for different situations. Understanding them deeply makes your analysis cleaner and faster.

🧱 1. Label-Based & Position-Based Selection

.loc[] — Access by Label (Name)

Use actual labels (names) of rows and columns.

import pandas as pd

data = {
    'Name': ['Asha', 'Raj', 'Kiran', 'Meera', 'Vikram'],
    'Department': ['Sales', 'Finance', 'Sales', 'HR', 'Finance'],
    'Salary': [45000, 55000, 60000, 40000, 58000],
    'Experience': [2, 5, 4, 3, 6]
}

df = pd.DataFrame(data, index=['E1', 'E2', 'E3', 'E4', 'E5'])

# Select a single row and specific columns using labels
print(df.loc['E3', ['Name', 'Salary']])

# Select multiple rows
print(df.loc[['E1', 'E5'], :])

.iloc[] — Access by Position (Integer)

Use integer positions instead of names.

# Select third row (index position 2)
print(df.iloc[2])

# Select first three rows and first two columns
print(df.iloc[0:3, 0:2])

💡 Key difference: .loc[] uses labels, .iloc[] uses numeric positions.
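One subtlety worth a quick demo: label slices with .loc include the endpoint, while .iloc follows normal Python half-open slicing. A minimal sketch on the same kind of labeled index:

```python
import pandas as pd

df = pd.DataFrame({'Salary': [45000, 55000, 60000, 40000, 58000]},
                  index=['E1', 'E2', 'E3', 'E4', 'E5'])

# .loc label slices are INCLUSIVE of the endpoint...
print(df.loc['E1':'E3'])   # rows E1, E2, E3 — three rows

# ...while .iloc position slices are EXCLUSIVE, like list slicing
print(df.iloc[0:3])        # positions 0, 1, 2 — also three rows
```

Both happen to return three rows here, but for a different reason — mixing up the two conventions is a classic off-by-one bug.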


🧩 2. Boolean Indexing — Filter Like a Pro 🔍

Boolean indexing lets you apply conditions directly on DataFrame columns.

# Filter where Salary > 50,000
high_salary = df[df['Salary'] > 50000]
print(high_salary)

# Combine multiple conditions with & (AND), | (OR), ~ (NOT)
filtered = df[(df['Department'] == 'Finance') & (df['Experience'] > 5)]
print(filtered)

💡 Always enclose conditions within parentheses and use &, |, ~ operators.


⚙️ 3. Index Management — Set & Reset

You can change which column acts as the row identifier for faster lookups and clarity.

# Set 'Name' as index
df2 = df.set_index('Name')
print(df2)

# Access row by name
print(df2.loc['Asha'])

# Reset back to default integer index
df2_reset = df2.reset_index()
print(df2_reset)

💡 Why? Setting an index improves clarity and speeds up lookups when a particular column represents unique entities.


✂️ 4. Conditional Slicing

Extract specific subsets by combining slicing and conditions.

# Slice rows 1 to 3 (exclusive of 4)
print(df.iloc[1:4])

# Conditional slicing — employees with experience between 3 and 6 years
print(df[(df['Experience'] >= 3) & (df['Experience'] <= 6)])

# Select specific rows and columns together
df.loc[df['Salary'] > 50000, ['Name', 'Department']]

🧠 5. Fast Scalar Access

For accessing single values, use .at[] or .iat[] — much faster than .loc[] or .iloc[].

# Access specific cell
print(df.at['E2', 'Salary'])   # Using label
print(df.iat[2, 2])            # Using position

# Update value
df.at['E4', 'Salary'] = 42000
print(df)

Summary — Advanced Selection in Pandas

MethodTypeExampleDescription
.loc[]Label-baseddf.loc['A']Select by row/column name
.iloc[]Position-baseddf.iloc[0:2, 1:3]Select by numeric position
Boolean IndexCondition-baseddf[df['Profit'] > 8000]Filter with logic
.set_index()Index managementdf.set_index('City')Change lookup key
.reset_index()Revert indexdf.reset_index()Restore default integers
.at[] / .iat[]Fast scalardf.at['E2', 'Salary']Quick single value access

🧪 Practice Lab — Advanced Indexing

Try the following to solidify your understanding:

  1. Select only the Name and Salary columns using both .loc and .iloc.

  2. Retrieve all employees with experience less than 4 years.

  3. Display the 2nd and 4th rows using .iloc.

  4. Set Department as the new index and display the DataFrame.

  5. Reset the index back to default integer.

  6. Filter employees whose name starts with 'M' or 'V'.

  7. Increase salary by 5% for all employees in the Sales department.

  8. Access the salary of the employee in the 3rd row and update it to 62000.

  9. Create a subset of only Finance department employees using .loc.

  10. Display employees with experience between 3 and 6 years.
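If you get stuck, here is a hedged sketch of tasks 6 and 7 (string filtering and the Sales raise) on the same df from this section:

```python
import pandas as pd

data = {
    'Name': ['Asha', 'Raj', 'Kiran', 'Meera', 'Vikram'],
    'Department': ['Sales', 'Finance', 'Sales', 'HR', 'Finance'],
    'Salary': [45000, 55000, 60000, 40000, 58000],
    'Experience': [2, 5, 4, 3, 6],
}
df = pd.DataFrame(data, index=['E1', 'E2', 'E3', 'E4', 'E5'])

# Task 6: names starting with 'M' or 'V' (str.startswith accepts a tuple)
mv = df[df['Name'].str.startswith(('M', 'V'))]
print(mv['Name'].tolist())   # ['Meera', 'Vikram']

# Task 7: 5% raise for Sales only, assigned in place via .loc
df['Salary'] = df['Salary'].astype('float64')   # avoid int/float dtype surprises
df.loc[df['Department'] == 'Sales', 'Salary'] *= 1.05
print(df.loc[['E1', 'E3'], 'Salary'])           # 47250.0 and 63000.0
```

The .loc-on-the-left pattern in task 7 is the idiomatic way to update a filtered subset without the dreaded SettingWithCopyWarning.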



🧩 Hierarchical (Multi-level) Indexing — Basics

In large datasets, data often has multiple dimensions — like regions, departments, or years — and we want to organize or summarize them efficiently. That’s where Hierarchical Indexing (also called Multi-level Indexing) in Pandas becomes powerful.

It allows you to store and access data across multiple levels of rows or columns, creating a structured and compact representation of complex datasets.


🎯 Learning Objectives

By the end of this section, you will be able to:

  • Create multi-level indexed DataFrames

  • Select data at different index levels

  • Perform stacking and unstacking

  • Apply aggregation based on specific index levels


🧱 1. Creating a Multi-level (Hierarchical) Index

Let’s start with a dataset representing sales performance across regions and years.

import pandas as pd

data = {
    'Sales': [25000, 27000, 30000, 28000, 26000, 31000],
    'Profit': [4000, 5000, 6000, 4500, 4700, 6200]
}

index = [
    ['North', 'North', 'South', 'South', 'East', 'East'],
    [2022, 2023, 2022, 2023, 2022, 2023]
]

df = pd.DataFrame(data, index=index, columns=['Sales', 'Profit'])
df.index.names = ['Region', 'Year']
print(df)

Output:

               Sales  Profit
Region Year
North  2022   25000    4000
       2023   27000    5000
South  2022   30000    6000
       2023   28000    4500
East   2022   26000    4700
       2023   31000    6200

💡 Notice: The index now has two levels — Region and Year.


🔍 2. Selecting Data at Different Index Levels

You can use .loc[] with tuples to access data at specific index combinations.

# Access all rows for North
print(df.loc['North'])

# Access only the row for South in 2022
print(df.loc[('South', 2022)])

Or use index slicing for partial selections.

# Select all years for East and South
print(df.loc[['East', 'South']])

⚙️ 3. Stacking and Unstacking

Stacking and unstacking reshape data between rows and columns — powerful for pivot-style transformations.

# Convert columns into inner-level rows
stacked = df.stack()
print(stacked.head())

# Convert inner-level rows back to columns
unstacked = stacked.unstack()
print(unstacked.head())

💡 Think of it as:

  • .stack() → makes the table taller (more rows, fewer columns)

  • .unstack() → makes the table wider (fewer rows, more columns)


📊 4. Aggregation by Index Levels

You can compute summary statistics for specific index levels using the level parameter of groupby().

# Total sales per Region
print(df.groupby(level='Region').sum())

# Average profit per Year
print(df.groupby(level='Year').mean())

💡 Use level name or number (level=0 for Region, level=1 for Year).


🧪 Practice Exercises

Try these small tasks:

  1. Create a DataFrame with 2-level index: ['Country', 'Year'] showing GDP and Inflation.

  2. Display all records for the country 'India'.

  3. Select the data for 'USA' in 2023.

  4. Compute the total GDP by country.

  5. Compute the average inflation by year.

  6. Convert the table to a stacked format and then back using .unstack().

  7. Add a third index level (Quarter) and print the new structure.

  8. Select only the GDP column for 'India' using .loc.

  9. Convert multi-index to columns using .reset_index().

  10. Sort the multi-index in ascending order of Year.


⚡ Summary

Concept                Function                                     Description
MultiIndex creation    pd.MultiIndex.from_arrays() / nested lists   Creates hierarchical indices
Selection              .loc[] with tuples                           Access multi-level data
Reshaping              .stack() / .unstack()                        Convert between wide and long formats
Aggregation            .groupby(level=)                             Aggregate data at specific index levels
Reset                  .reset_index()                               Flatten the hierarchy
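The summary mentions pd.MultiIndex.from_arrays() without showing it in action, so here is a minimal sketch — equivalent to the nested-list approach used earlier in this section:

```python
import pandas as pd

regions = ['North', 'North', 'South', 'South']
years = [2022, 2023, 2022, 2023]

# Build the hierarchical index explicitly, with named levels
idx = pd.MultiIndex.from_arrays([regions, years], names=['Region', 'Year'])
df = pd.DataFrame({'Sales': [25000, 27000, 30000, 28000]}, index=idx)

print(df.loc[('North', 2023), 'Sales'])   # 27000
print(df.index.names)                     # ['Region', 'Year']
```

from_arrays is handy when the index levels already live in separate lists or Series; from_tuples and from_product cover the other common shapes.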


🧬 Advanced Hierarchical (Multi-Level) Indexing

In the previous topic, you learned the basics of hierarchical indexing — creating multi-level DataFrames, selecting using .loc, and reshaping with .stack() / .unstack(). Now, let’s dive into advanced operations: selecting across multiple index levels, performing cross-sections, and aggregating data efficiently.


🎯 Learning Objectives

By the end of this section, you’ll be able to:

  • Create multi-level indices from multiple columns

  • Perform cross-section selection with .xs()

  • Use stacking/unstacking across levels

  • Apply aggregation at specific index levels

  • Understand index ordering and sorting


🧱 1. Creating Multi-Level Indices from Columns

You can convert existing columns into a multi-index using .set_index().

import pandas as pd

data = {
    'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'Year': [2022, 2023, 2022, 2023, 2022, 2023],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
    'Sales': [25000, 27000, 30000, 28000, 26000, 31000],
    'Profit': [4000, 5000, 6000, 4500, 4700, 6200]
}

df = pd.DataFrame(data)
df = df.set_index(['Region', 'Year', 'Quarter'])
print(df)

Output:

                         Sales  Profit
Region Year Quarter
North  2022 Q1        25000    4000
       2023 Q2        27000    5000
South  2022 Q1        30000    6000
       2023 Q2        28000    4500
East   2022 Q1        26000    4700
       2023 Q2        31000    6200

Now you have a three-level index: Region → Year → Quarter.


🔍 2. Cross-Section Selection with .xs()

When dealing with multiple levels, .xs() is your best friend for selecting data at a specific index level.

# Select all data for the year 2022
print(df.xs(2022, level='Year'))

# Select only Q1 records for all regions
print(df.xs('Q1', level='Quarter'))

# Select for one region and one year
print(df.xs(('South', 2023), level=('Region', 'Year')))

💡 .xs() (short for cross-section) lets you slice through the multi-index without complex tuple-based indexing.


🧩 3. Stacking and Unstacking with Levels

You can reshape data across any index level using the level argument.

# Unstack by Quarter — move it to columns
print(df.unstack(level='Quarter'))

# Stack back Quarter level
print(df.unstack(level='Quarter').stack(level='Quarter'))

💡 Tip:

  • Use stack(level='Quarter') to move Quarter to rows.

  • Use unstack(level='Quarter') to move Quarter to columns.


⚙️ 4. Aggregation at Specific Index Levels

Use groupby(level=) for advanced summarization.

# Total Sales per Region
print(df.groupby(level='Region').sum())

# Average Profit per Year
print(df.groupby(level='Year')['Profit'].mean())

# Total Sales per Region and Year
print(df.groupby(level=['Region', 'Year'])['Sales'].sum())

These aggregations respect the index hierarchy — great for summarizing business performance across regions, time, or departments.


🔄 5. Sorting and Reordering Multi-Level Indices

A well-sorted index helps when performing hierarchical operations.

# Sort by Region then Year
df = df.sort_index(level=['Region', 'Year'])
print(df.head())

# Swap index levels (Year before Region)
df_swapped = df.swaplevel('Region', 'Year')
print(df_swapped.head())

# Sort by index after swapping
df_swapped = df_swapped.sort_index()

💡 Swapping and sorting are common when data is structured differently across systems.


🧮 6. Aggregation Example — Profit Margin by Level

Let’s compute Profit Margin (%) by Region and Year.

df['Profit Margin (%)'] = (df['Profit'] / df['Sales']) * 100

# Average profit margin by Region and Year
result = df.groupby(level=['Region', 'Year'])['Profit Margin (%)'].mean()
print(result)

🧪 Practice Exercises

Try these to get comfortable:

  1. Create a DataFrame with index levels: ['Department', 'Year', 'Quarter'] showing Revenue and Cost.

  2. Convert columns into hierarchical index using .set_index().

  3. Use .xs() to get all Q2 data across departments.

  4. Compute total Revenue by Department.

  5. Find average Cost by Year.

  6. Unstack the Year level and display Revenue as columns.

  7. Stack it back and verify it matches the original DataFrame.

  8. Swap index levels (Department ↔ Year) and sort.

  9. Add a new calculated column Profit = Revenue - Cost.

  10. Compute and display average profit per department.


📚 Summary

Function                 Description
.set_index()             Converts columns into a hierarchical index
.xs()                    Cross-section selection across index levels
.stack() / .unstack()    Reshape DataFrames across levels
.groupby(level=)         Aggregate across multiple index levels
.swaplevel()             Reorder levels in a multi-index
.sort_index()            Sort hierarchical index for better readability

🧮 Complex DataFrame Manipulations

Data manipulation lies at the heart of every data analysis process. Once a dataset is loaded into a Pandas DataFrame, the next step is to explore, filter, summarize, and transform it to extract meaningful insights.

In this section, we’ll cover powerful techniques:

  1. Filtering data using multiple logical conditions

  2. Grouping with aggregation using groupby()

  3. Creating pivot tables for multi-dimensional summaries

  4. Adding calculated columns using existing data


🎯 Learning Objectives

By the end of this section, you will be able to:

  • Filter DataFrame rows using multiple logical conditions

  • Perform grouped aggregations with groupby()

  • Create pivot tables for multi-dimensional summaries

  • Add new calculated columns using existing data


🧮 1. Filtering Rows Based on Multiple Conditions

Filtering allows you to select only the data that meets certain criteria. Pandas uses boolean indexing to achieve this.

import pandas as pd

# Sample dataset
data = {
    'Employee': ['Asha', 'Raj', 'Kiran', 'Meera', 'Vikram'],
    'Department': ['Sales', 'Finance', 'Sales', 'HR', 'Finance'],
    'Salary': [45000, 55000, 60000, 40000, 58000],
    'Experience': [2, 5, 4, 3, 6]
}
df = pd.DataFrame(data)

# Filter employees in Sales with Salary above 45,000
filtered = df[(df['Department'] == 'Sales') & (df['Salary'] > 45000)]
print(filtered)

💡 Explanation:

  • Use & for AND and | for OR conditions.

  • Always wrap each condition in parentheses.


📊 2. Grouping Data with Aggregation

Grouping helps summarize large datasets into meaningful statistics. The groupby() function is one of the most powerful tools in Pandas.

# Average salary by department
grouped = df.groupby('Department')['Salary'].mean()
print(grouped)

# Multiple aggregations
agg_result = df.groupby('Department').agg({'Salary': ['mean', 'max'], 'Experience': 'mean'})
print(agg_result)

💡 Explanation:

  • You can compute multiple aggregations for each group.

  • The result is often a MultiIndex DataFrame, which can be flattened later for easier access.
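As a quick illustration of that flattening, a common idiom is joining the two column levels with an underscore. A sketch on the same kind of df (data values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Finance', 'Sales'],
    'Salary': [45000, 55000, 60000],
    'Experience': [2, 5, 4],
})

agg = df.groupby('Department').agg({'Salary': ['mean', 'max'], 'Experience': 'mean'})
print(agg.columns.tolist())   # MultiIndex: [('Salary', 'mean'), ('Salary', 'max'), ...]

# Flatten the two column levels into single strings
agg.columns = ['_'.join(col) for col in agg.columns]
print(agg.columns.tolist())   # ['Salary_mean', 'Salary_max', 'Experience_mean']
print(agg.loc['Sales', 'Salary_max'])
```

After flattening, plain column access like agg['Salary_max'] works again — no tuples required.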


🪄 3. Creating Pivot Tables

Pivot tables allow you to summarize and reshape data interactively, similar to Excel.

pivot = pd.pivot_table(
    df,
    values='Salary',
    index='Department',
    columns='Experience',
    aggfunc='mean',
    fill_value=0
)
print(pivot)

💡 Explanation:

  • index defines rows, columns defines columns, and values defines what to aggregate.

  • You can specify aggregation functions like mean, sum, or even custom functions.


➕ 4. Adding Calculated Columns

You can create new columns derived from existing ones. This is essential for analysis and feature engineering.

# Add a 10% bonus to each salary
df['Bonus'] = df['Salary'] * 0.10
df['Total Pay'] = df['Salary'] + df['Bonus']
print(df)

💡 Explanation:

  • You can perform vectorized operations directly on DataFrame columns.

  • Avoid loops — Pandas operations are faster and cleaner.
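For instance, a conditional column that might tempt you into a loop can be written as one vectorized expression with np.where. A hedged sketch — the 'Senior' cutoff at 4 years is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Asha', 'Raj', 'Kiran'],
    'Experience': [2, 5, 4],
})

# One expression, no Python-level loop: condition, value-if-true, value-if-false
df['Level'] = np.where(df['Experience'] >= 4, 'Senior', 'Junior')
print(df['Level'].tolist())   # ['Junior', 'Senior', 'Senior']
```

For more than two buckets, np.select or pd.cut extends the same loop-free idea.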


🧪 Practice Exercises

Try the following tasks using your dataset:

  1. Filter records where Experience > 3 and Salary < 60000.

  2. Group data by Department and calculate both average and maximum salaries.

  3. Create a pivot table showing average salary by Department and Experience.

  4. Add a column Tax = 10% of Salary, and compute Net Salary = Salary - Tax.

  5. Find the top 3 employees with the highest total pay (combine with nlargest()).

  6. For each department, display the employee with maximum experience.

  7. Filter all employees whose names start with 'A' or 'M'.

  8. Add a calculated column for Years Until Retirement assuming retirement age is 60.

  9. Identify the lowest paid employee in each department (combine with nsmallest()).

  10. Create a report summarizing total salary and average experience by department.


🏆 Identifying Top-Performing Records

Once you’ve cleaned, indexed, and organized your dataset, the next step is often to find the best (or worst) performers — the top customers, highest-grossing products, or most efficient employees.

Pandas provides multiple efficient methods:

  1. sort_values() — Order DataFrame by column(s)

  2. nlargest() / nsmallest() — Get top/bottom N records directly

  3. Grouping + sorting — Find top performers within groups


1. Sorting Values

Use .sort_values() to arrange data based on one or more columns.

import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Phone', 'Camera', 'Monitor'],
    'Revenue': [80000, 30000, 60000, 20000, 40000],
    'Profit': [20000, 8000, 15000, 4000, 7000]
})

# Sort by Revenue in descending order
df_sorted = df.sort_values(by='Revenue', ascending=False)
print(df_sorted)

Output:

  Product  Revenue  Profit
0  Laptop    80000   20000
2   Phone    60000   15000
4 Monitor    40000    7000
1  Tablet    30000    8000
3  Camera    20000    4000

You can also sort by multiple columns:

df.sort_values(by=['Revenue', 'Profit'], ascending=[False, False], inplace=True)

🎯 2. Getting Top N Records — .nlargest()

When you only care about the top few performers, .nlargest() is faster and simpler than sorting.

top3 = df.nlargest(3, 'Revenue')
print(top3)

Output:

  Product  Revenue  Profit
0  Laptop    80000   20000
2   Phone    60000   15000
4 Monitor    40000    7000

Similarly, .nsmallest() gives you the lowest-performing records:

bottom2 = df.nsmallest(2, 'Profit')
print(bottom2)

Output:

  Product  Revenue  Profit
3  Camera    20000    4000
4 Monitor    40000    7000

🧩 3. Grouping + Sorting — Find Top in Each Group

Find the top-performing record within each category.

data = pd.DataFrame({
    'Category': ['Electronics', 'Electronics', 'Furniture', 'Furniture', 'Clothing'],
    'Product': ['Laptop', 'Phone', 'Chair', 'Desk', 'Jacket'],
    'Revenue': [80000, 60000, 20000, 25000, 15000],
    'Profit': [20000, 15000, 4000, 5000, 3000]
})

# Top product in each category based on Revenue
top_per_category = data.sort_values('Revenue', ascending=False).groupby('Category').head(1)
print(top_per_category)

Output:

      Category Product  Revenue  Profit
0  Electronics  Laptop    80000   20000
3    Furniture    Desk    25000    5000
4     Clothing  Jacket    15000    3000

⚡ 4. Using idxmax() and idxmin()

Quickly get the index of the top or bottom record:

best_product = df.loc[df['Profit'].idxmax()]
print(best_product)

Output:

Product    Laptop
Revenue     80000
Profit      20000
Name: 0, dtype: object

🧠 Summary — Identifying Top Performers

MethodPurposeNotes
sort_values()Sort full DataFrameUse for ordered reports
nlargest() / nsmallest()Get top or bottom N recordsFast and simple
groupby().head()Top N within each groupGreat for categorical comparisons
idxmax() / idxmin()Get best record indexWorks on single columns

📚 Complete Practice Lab — Data Detective Challenge

Now that you’ve mastered EDA, indexing, grouping, and identifying top performers, it’s time to put it all together in a comprehensive analysis challenge.


🎯 Challenge: Company Sales Analysis

You have access to a dataset called company_sales.csv with the following structure:

| Column | Description |
| --- | --- |
| Date | Transaction date |
| Region | Sales region |
| Product_Category | Category of product sold |
| Quantity | Units sold |
| Price_per_Unit | Unit price |
| Sales_Amount | Total sales (Quantity × Price_per_Unit) |
| Marketing_Spend | Marketing spend for that transaction |
| Customer_Tier | Customer segment (Gold/Silver/Bronze) |

📋 Analysis Tasks

Use all the techniques you’ve learned to answer these questions:

Part 1: Data Exploration

  1. Load the dataset and display basic info (shape, columns, data types)

  2. Generate summary statistics for all numeric columns

  3. Check for any missing values or outliers

  4. Create a correlation matrix and visualize it

Part 2: Univariate & Bivariate Analysis

  1. Create a histogram of Sales_Amount — is it normally distributed?

  2. Create a boxplot of Sales_Amount by Region — which region performs best?

  3. Create a scatter plot: Marketing_Spend vs Sales_Amount — is there a relationship?

  4. Analyze which Product_Category has the highest average sales

Part 3: Indexing & Filtering

  1. Filter for transactions with Sales_Amount > $50,000 — how many records?

  2. Select only Gold-tier customers in the North region using .loc[]

  3. Find all transactions from the top 3 months by sales using indexing

Part 4: Grouping & Aggregation

  1. Group by Region and calculate:

    • Average Sales Amount

    • Total Marketing Spend

    • Count of transactions

  2. Create a pivot table showing average Sales Amount by Region and Customer_Tier

  3. Group by Product_Category and find the best-performing category by total revenue

Part 5: Top Performers & Calculated Columns

  1. Add a calculated column: ROI = (Sales_Amount - Marketing_Spend) / Marketing_Spend * 100

  2. Find the top 5 transactions with the highest ROI

  3. For each region, identify the top 3 performing days (by total sales)

  4. Find the bottom 2 performing product categories by average profit margin

Part 6: Advanced Insights

  1. Create a multi-level index: ['Region', 'Customer_Tier'] and calculate total sales

  2. Use .xs() to extract all sales for a specific tier across regions


🏆 Bonus Challenges

  • Visualization: Create a dashboard with 4 plots showing key metrics

  • Time Series: If Date is included, plot sales trends over time

  • Segmentation: Identify high-value customers (top 10% by sales)

  • Forecasting: Predict next month’s total sales based on current trends


💡 Key Takeaways

By completing this lab, you’ll have practiced:

  • Data Loading & Exploration — understanding your dataset

  • Univariate Analysis — single-variable patterns

  • Bivariate Analysis — relationships between variables

  • Filtering & Indexing — extracting relevant subsets

  • Grouping & Aggregation — summarizing by categories

  • Pivot Tables — multi-dimensional views

  • Top Performers — ranking and sorting

  • Feature Engineering — creating new insights


📖 Recap Table

| Topic | Key Functions | Use Case |
| --- | --- | --- |
| Summary Stats | `.describe()`, `.info()` | Understand data distribution |
| Visualization | `sns.histplot()`, `sns.scatterplot()` | Visual pattern recognition |
| Filtering | `df[df['col'] > value]` | Extract subsets |
| Indexing | `.loc[]`, `.iloc[]`, `.set_index()` | Efficient data access |
| Grouping | `.groupby().agg()` | Categorical summaries |
| Pivoting | `.pivot_table()` | Multi-dimensional views |
| Ranking | `.nlargest()`, `.sort_values()` | Top/bottom performers |
| Features | Calculated columns | New derived metrics |

🎓 You’re now a data detective! Go forth and uncover business stories hidden in your data! 🔍

# Your code here
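If you want a nudge to get going, here is one possible starting point — a sketch that uses a tiny synthetic frame in place of company_sales.csv (same columns as the table above) and tackles a few of the tasks:

```python
import pandas as pd

# Synthetic stand-in for company_sales.csv, following the challenge schema
df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'West'],
    'Product_Category': ['Tech', 'Home', 'Tech', 'Home'],
    'Quantity': [10, 5, 8, 12],
    'Price_per_Unit': [6000, 2000, 7000, 1500],
    'Marketing_Spend': [5000, 2000, 4000, 3000],
    'Customer_Tier': ['Gold', 'Silver', 'Gold', 'Bronze'],
})
df['Sales_Amount'] = df['Quantity'] * df['Price_per_Unit']

# Part 3.1 — transactions above $50,000
big = df[df['Sales_Amount'] > 50_000]

# Part 3.2 — Gold-tier customers in the North region via .loc[]
gold_north = df.loc[(df['Customer_Tier'] == 'Gold') & (df['Region'] == 'North')]

# Part 5.1 — ROI as a calculated column
df['ROI'] = (df['Sales_Amount'] - df['Marketing_Spend']) / df['Marketing_Spend'] * 100

# Part 5.2 — top transactions by ROI
print(df.nlargest(2, 'ROI')[['Region', 'Sales_Amount', 'ROI']])
```

Swap the synthetic frame for `pd.read_csv("company_sales.csv")` and the same patterns carry over to the full dataset.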

📊 Data Inspection Basics

Viewing Data

Use .head() and .tail() to view the top and bottom rows of the DataFrame:

import numpy as np
import pandas as pd

# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

print("First 5 rows:")
print(df.head())

print("\nLast 3 rows:")
df.tail(3)

Inspecting Index and Columns

print("Index:")
print(df.index)

print("\nColumns:")
df.columns

Converting to NumPy Array

Return a NumPy representation of the underlying data:

df.to_numpy()

Data Types

A DataFrame can hold a different dtype in each column; when converting to NumPy, pandas will find the common dtype that can hold all of them (often object):

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

print("Data types:")
print(df2.dtypes)

print("\nConverted to numpy:")
df2.to_numpy()

📈 Summary Statistics

The .describe() method provides quick statistical summary of your data:

df.describe()

Transposing Data

df.T

🔄 Sorting Data

Sort by Index

df.sort_index(axis=1, ascending=False)

Sort by Values

df.sort_values(by="B")

🎯 Data Selection

Basic Column Selection

Selecting a single column returns a Series:

df["A"]

Multiple Column Selection

df[["B", "A"]]

Row Slicing

print("First 3 rows:")
print(df[0:3])

print("\nDate range slice:")
df["20130102":"20130104"]

Selection by Label (.loc)

Use .loc[] for label-based indexing:

# Select a single row
print("Single row:")
print(df.loc[dates[0]])

print("\nSelect rows and columns:")
print(df.loc[:, ["A", "B"]])

print("\nSlice with labels (both endpoints included):")
df.loc["20130102":"20130104", ["A", "B"]]

Selection by Position (.iloc)

Use .iloc[] for integer position-based indexing:

# Select by position
print("4th row:")
print(df.iloc[3])

print("\nSlice rows and columns:")
print(df.iloc[3:5, 0:2])

print("\nSpecific positions:")
df.iloc[[1, 2, 4], [0, 2]]

Boolean Indexing

Filter data based on conditions:

# Filter where column A is greater than 0
df[df["A"] > 0]
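Conditions can also be combined with `&` (and) and `|` (or); each condition needs its own parentheses because of Python's operator precedence. A sketch (the random frame is rebuilt here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Same shape of frame as above, rebuilt so this snippet is self-contained
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

# Rows where A > 0 AND B < 0 — parentheses around each condition are required
both = df[(df["A"] > 0) & (df["B"] < 0)]

# Rows where A > 0 OR B > 0
either = df[(df["A"] > 0) | (df["B"] > 0)]

print(both.shape, either.shape)
```

Plain `and` / `or` raise an error on Series; only the element-wise `&` / `|` operators work here.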

Using .isin() for Filtering

df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
print("DataFrame with new column:")
print(df2)

print("\nFiltered data:")
df2[df2["E"].isin(["two", "four"])]

📊 Statistical Operations

Computing Statistics

Operations generally exclude missing data:

# Mean for each column
print("Mean by column:")
print(df.mean())

print("\nMean by row:")
df.mean(axis=1)

Operations with Different Indices

When operating with objects that have different indices, pandas will align them:

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print("Shifted series:")
print(s)

print("\nSubtraction with alignment:")
df.sub(s, axis="index")

User-Defined Functions

Apply custom functions using .agg() and .transform():

# Aggregation
print("Aggregation with custom function:")
print(df.agg(lambda x: np.mean(x) * 5.6))

print("\nTransformation:")
df.transform(lambda x: x * 101.2)

🎲 Value Counts

.value_counts() gives the frequency of each distinct value:

s = pd.Series(np.random.randint(0, 7, size=10))
print("Sample series:")
print(s)

print("\nValue counts:")
s.value_counts()

👥 Grouping Data

Group-by operations allow you to split data, apply a function, and combine results:

df_group = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
print("Sample DataFrame:")
df_group

Grouping and Aggregating

# Group by column A and sum
df_group.groupby("A")[["C", "D"]].sum()

Grouping by Multiple Columns

df_group.groupby(["A", "B"]).sum()

🔄 Reshaping Data

Stack and Unstack

Stack “compresses” a level in the DataFrame’s columns:

arrays = [
   ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ["one", "two", "one", "two", "one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df_stack = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2_stack = df_stack[:4]
print("Original DataFrame:")
df2_stack

Stacking

stacked = df2_stack.stack()
stacked

Unstacking

print("Unstack (default - last level):")
print(stacked.unstack())

print("\nUnstack level 1:")
print(stacked.unstack(1))

print("\nUnstack level 0:")
stacked.unstack(0)

Pivot Tables

Pivot tables provide a way to create cross-tabulations:

df_pivot = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 3,
        "B": ["A", "B", "C"] * 4,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "D": np.random.randn(12),
        "E": np.random.randn(12),
    }
)
print("Sample DataFrame:")
df_pivot

Creating Pivot Table

pd.pivot_table(df_pivot, values="D", index=["A", "B"], columns=["C"])
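By default pivot_table() aggregates with the mean; the aggfunc and fill_value parameters let you change the aggregation and replace empty cells. A sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical sales data for illustration
sales = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "South"],
    "Tier": ["Gold", "Silver", "Gold", "Gold", "Silver"],
    "Amount": [100, 80, 120, 60, 40],
})

# Sum instead of the default mean; fill_value plugs Region/Tier
# combinations that have no data
table = pd.pivot_table(sales, values="Amount", index="Region",
                       columns="Tier", aggfunc="sum", fill_value=0)
print(table)
```

This is the same pattern you would use for the Region × Customer_Tier pivot in the practice lab, just with a different aggregation.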