Exploratory Data Analysis#
Turning Data Into Business Stories
Welcome to the stage where data becomes drama. 🎭
You’ve loaded it, cleaned it, encoded it — now it’s time to interrogate it.
💬 “EDA is like a first date with your dataset — ask questions, spot red flags, and decide if you want to take it to production.”
🧠 Why EDA Matters#
EDA is where you move from data janitor to data storyteller.
Here’s what EDA helps you uncover:
- What’s normal in your data
- What’s weird (and possibly a bug or a legend)
- What features actually matter
- How variables interact
⚡ A good EDA session can save you 10 hours of model debugging later.
🧰 Prerequisite#
If you’re new to pandas or visualization, check my other book: 👉 📘 Programming for Business
📦 Step 1. Get to Know Your Dataset#
Let’s start with the basics — “Hi, dataset! Tell me about yourself.”
💬 “.info() is like your dataset’s LinkedIn profile — gives you a summary, but hides the weird stuff.”
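A minimal first look, using a small made-up orders table (the column names and values here are purely illustrative — in practice you would load your own file, e.g. with `pd.read_csv`):

```python
import pandas as pd

# Hypothetical sample data for illustration
df = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "region": ["North", "South", "North", "East"],
    "amount": [92.0, 88.5, 110.0, 10000.0],
})

print(df.shape)      # (rows, columns)
print(df.head())     # first few rows
df.info()            # dtypes, non-null counts, memory usage
```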
🔍 Step 2. Basic Summary Statistics#
Numbers can whisper business truths — if you know how to listen.
| Metric | What It Means |
|---|---|
| `mean` | Typical value |
| `std` | How spread out things are |
| `min` / `max` | Range (and possible outliers) |
| `nunique` | Categorical diversity |
💡 Example: “Average order = $92, but max order = $10,000?” Either a whale customer… or a typo named Jeff.
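A sketch with made-up order amounts, including one suspicious $10,000 record:

```python
import pandas as pd

# Illustrative amounts — note the outlier at the end
amounts = pd.Series([92, 88, 95, 101, 10000], name="order_amount")

print(amounts.describe())   # count, mean, std, min, quartiles, max
print(amounts.mean())       # typical value (pulled up by the outlier)
print(amounts.max())        # the whale... or the typo
```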
🎨 Step 3. Univariate Visuals — “The Solo Acts”#
Visualize one variable at a time to see its story.
| What to Look For | Meaning |
|---|---|
| Bell curve | Healthy data |
| Right-skewed | A few big spenders |
| Flat or weird shape | Mixed data sources |
💬 “If your histogram looks like a cliff, it’s not geology — it’s bad data entry.”
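A histogram sketch on synthetic, right-skewed amounts (the `Agg` backend line just makes the script runnable off-screen — drop it in a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic right-skewed "order amounts"
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=4.5, sigma=0.6, size=1000))

amounts.hist(bins=30)
plt.xlabel("Order amount")
plt.ylabel("Frequency")
plt.title("Right-skewed: a few big spenders")
plt.show()
```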
🍰 Categorical Insights#
💡 “If one region dominates, either it’s your best market — or others are asleep at work.”
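Counting categories is one line; here on a hypothetical region column:

```python
import pandas as pd

# Illustrative region labels
region = pd.Series(["North", "North", "South", "East", "North", "South"])

counts = region.value_counts()
print(counts)                            # frequency per region
print(counts.idxmax())                   # the dominant category
print((counts / counts.sum()).round(2))  # shares of the total
```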
🧩 Step 4. Bivariate Analysis — “The Dynamic Duos”#
This is where you see relationships between features.
🧮 Numerical vs Numerical#
💬 “If you see an upward trend, congrats — your marketing isn’t just burning money.”
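A scatter-plot sketch with made-up spend/sales pairs; the correlation check quantifies the upward trend:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop in a notebook
import pandas as pd

# Hypothetical marketing-spend vs sales data
df = pd.DataFrame({
    "marketing_spend": [100, 200, 300, 400, 500],
    "sales": [1150, 1900, 3200, 3900, 5100],
})

df.plot.scatter(x="marketing_spend", y="sales")
print(df["marketing_spend"].corr(df["sales"]))  # near 1 -> strong upward trend
```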
🎯 Categorical vs Numerical#
💡 Boxplots are like HR reports — they show who’s average, who’s overachieving, and who’s way out of line.
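A boxplot sketch on an invented segment/revenue table; the group medians tell the same story numerically:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop in a notebook
import pandas as pd

# Illustrative data with one Gold outlier
df = pd.DataFrame({
    "segment": ["Gold", "Gold", "Gold", "Silver", "Silver", "Silver"],
    "revenue": [900, 1100, 5000, 480, 500, 520],
})

df.boxplot(column="revenue", by="segment")
print(df.groupby("segment")["revenue"].median())  # who's average, per group
```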
🧠 Step 5. Correlation — “The Data Gossip Network”#
Find which variables are talking behind each other’s backs.
| Value | Meaning |
|---|---|
| 0.9+ | Practically twins |
| 0.5 | Moderate connection |
| 0 | No gossip |
| -0.9 | Total opposites |
💬 “If two columns are 0.99 correlated, drop one — they’re basically sending the same email twice.”
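A correlation matrix on invented columns, including a deliberately redundant pair:

```python
import pandas as pd

# Hypothetical data: price_with_tax is just price scaled
df = pd.DataFrame({
    "price": [10, 20, 30, 40],
    "price_with_tax": [11, 22, 33, 44],
    "units_sold": [42, 31, 20, 11],
})

corr = df.corr()
print(corr.round(2))
# price vs price_with_tax is 1.0 -> keep just one of them
```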
🪄 Step 6. Feature Relationships (Business Examples)#
Example 1: Revenue vs Customer Segment#
“Gold customers really are gold — except when they ask for more discounts.”
Example 2: Seasonal Trends#
“If sales spike every November — that’s not coincidence, that’s marketing season.”
📊 Step 7. Advanced EDA Helpers#
When you’re ready to go pro:
💬 “This is like an MRI for your dataset — it finds hidden patterns before you even blink.”
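The specific helper the author had in mind isn’t shown; as a stand-in, here is a tiny pandas-only “mini profiler” sketch (the function name and sample data are made up) that scans every column in one pass:

```python
import pandas as pd

def mini_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Toy one-stop summary: dtype, missing count, unique count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })

df = pd.DataFrame({"region": ["N", "S", None], "amount": [92.0, 88.5, 110.0]})
print(mini_profile(df))
```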
🧭 Advanced DataFrame Indexing & Selection#
After learning how to manipulate and summarize data, the next step is mastering indexing and selection — the key to slicing, dicing, and extracting precise data subsets efficiently.
Pandas provides multiple ways to access rows, columns, and specific cells — each suited for different situations. Understanding them deeply makes your analysis cleaner and faster.
🎯 Learning Objectives#
By the end of this section, you will be able to:
- Select rows and columns using `.loc[]` and `.iloc[]`
- Apply boolean indexing for flexible filtering
- Set and reset indices dynamically
- Slice data using conditions and custom logic
🧱 1. Selecting Rows and Columns Using .loc[] and .iloc[]#
.loc[] → Label-based indexing#
You use actual labels (names) of rows and columns.
.iloc[] → Integer position-based indexing#
You use integer positions instead of names.
💡 Key difference:
- `.loc[]` → use labels
- `.iloc[]` → use numeric positions
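A side-by-side sketch on a hypothetical employee table (names and labels invented for the example):

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Meena", "Vikram", "Arun"],
     "Salary": [50000, 60000, 55000]},
    index=["e1", "e2", "e3"],
)

print(df.loc["e2", "Salary"])   # label-based     -> 60000
print(df.iloc[1, 1])            # position-based  -> 60000
print(df.loc[:, ["Name"]])      # all rows, column selected by name
print(df.iloc[0:2, :])          # first two rows by position
```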
🧩 2. Boolean Indexing#
Boolean indexing allows filtering based on logical conditions.
💡 Tip: Combine multiple conditions using & (AND), | (OR), and ~ (NOT).
Always enclose conditions within parentheses.
⚙️ 3. Setting and Resetting Index#
Indexes act like row identifiers. You can change them to columns that make sense for your data.
💡 Why do this? Setting an index improves clarity and speeds up lookups when a particular column represents unique entities.
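A minimal round trip, on invented data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Meena", "Vikram"], "Salary": [50000, 60000]})

indexed = df.set_index("Name")     # Name is now the row identifier
print(indexed.loc["Vikram"])       # direct label lookup

restored = indexed.reset_index()   # back to the default integer index
print(restored.index.tolist())     # [0, 1]
```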
✂️ 4. Slicing Based on Conditions#
You can extract specific portions of your DataFrame by slicing or conditional selection.
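A sketch combining a positional slice with a condition, on a hypothetical employee table:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Meena", "Vikram", "Arun", "Divya"],
    "Experience": [2, 5, 3, 7],
    "Salary": [50000, 60000, 55000, 70000],
})

# positional slice first, then a condition on the slice
first_three = df.iloc[0:3]
print(first_three[first_three["Experience"] >= 3])  # Vikram, Arun

# condition plus column selection in a single .loc call
print(df.loc[df["Salary"].between(52000, 65000), ["Name", "Salary"]])
```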
🧠 5. Selecting Specific Cells and Values#
You can directly access or modify specific cells using .at[] or .iat[] for faster scalar access.
💡 .at[] and .iat[] are optimized for single value access — much faster than .loc[] or .iloc[].
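A scalar-access sketch (sample data invented):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Meena", "Vikram"], "Salary": [50000, 60000]})

print(df.at[1, "Salary"])    # label-based scalar    -> 60000
print(df.iat[1, 1])          # position-based scalar -> 60000

df.at[1, "Salary"] = 62000   # fast single-cell update
print(df.at[1, "Salary"])    # 62000
```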
🧪 Practice Exercises#
Try the following to solidify your understanding:
Select only the
NameandSalarycolumns using both.locand.iloc.Retrieve all employees with experience less than 4 years.
Display the 2nd and 4th rows using
.iloc.Set
Departmentas the new index and display the DataFrame.Reset the index back to default integer.
Filter employees whose name starts with
'M'or'V'.Increase salary by 5% for all employees in the Sales department.
Access the salary of the employee in the 3rd row and update it to
62000.Create a subset of only Finance department employees using
.loc.Display employees with experience between 3 and 6 years.
⚡ Summary#
Concept |
Method |
Description |
|---|---|---|
|
Label-based |
Access by names (rows/columns) |
|
Position-based |
Access by numeric index |
|
Fast scalar access |
Quick value retrieval |
|
Index management |
Reassign or restore indices |
Boolean indexing |
Condition-based |
Logical filtering of rows |
🧩 Hierarchical (Multi-level) Indexing — Basics#
In large datasets, data often has multiple dimensions — like regions, departments, or years — and we want to organize or summarize them efficiently. That’s where Hierarchical Indexing (also called Multi-level Indexing) in Pandas becomes powerful.
It allows you to store and access data across multiple levels of rows or columns, creating a structured and compact representation of complex datasets.
🎯 Learning Objectives#
By the end of this section, you will be able to:
- Create multi-level indexed DataFrames
- Select data at different index levels
- Perform stacking and unstacking
- Apply aggregation based on specific index levels
🧱 1. Creating a Multi-level (Hierarchical) Index#
Let’s start with a dataset representing sales performance across regions and years.
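The construction code isn’t shown here; one way to build it is with `pd.MultiIndex.from_tuples` (sample values chosen to match the output below):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("North", 2022), ("North", 2023), ("South", 2022),
     ("South", 2023), ("East", 2022), ("East", 2023)],
    names=["Region", "Year"],
)
df = pd.DataFrame(
    {"Sales": [25000, 27000, 30000, 28000, 26000, 31000],
     "Profit": [4000, 5000, 6000, 4500, 4700, 6200]},
    index=index,
)
print(df)
```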
Output:
Sales Profit
Region Year
North 2022 25000 4000
2023 27000 5000
South 2022 30000 6000
2023 28000 4500
East 2022 26000 4700
2023 31000 6200
💡 Notice: The index now has two levels — Region and Year.
🔍 2. Selecting Data at Different Index Levels#
You can use .loc[] with tuples to access data at specific index combinations.
Or use index slicing for partial selections.
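A selection sketch on a small two-level index (built sorted, so label slicing works):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["North", "South"], [2022, 2023]],
                                 names=["Region", "Year"])
df = pd.DataFrame({"Sales": [25000, 27000, 30000, 28000]}, index=idx)

print(df.loc[("North", 2023)])   # tuple: one exact (Region, Year) pair
print(df.loc["North"])           # partial selection: every year for North
print(df.loc[("North", 2023):("South", 2022)])  # label slice across levels
```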
⚙️ 3. Stacking and Unstacking#
Stacking and unstacking reshape data between rows and columns — powerful for pivot-style transformations.
💡 Think of it as:
- `.stack()` → makes the table taller (more rows, fewer columns)
- `.unstack()` → makes the table wider (fewer rows, more columns)
📊 4. Aggregation by Index Levels#
You can compute summary statistics for specific index levels using the `level` parameter of `groupby()`. (The older `df.sum(level=...)` shortcut was removed in pandas 2.0.)
💡 Use level name or number (level=0 for Region, level=1 for Year).
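A per-level aggregation sketch (sample data invented):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["North", "South"], [2022, 2023]],
                                 names=["Region", "Year"])
df = pd.DataFrame({"Sales": [25000, 27000, 30000, 28000]}, index=idx)

print(df.groupby(level="Region")["Sales"].sum())  # level 0, by name
print(df.groupby(level=1)["Sales"].mean())        # level 1 (Year), by number
```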
🧪 Practice Exercises#
Try these small tasks:
1. Create a DataFrame with a 2-level index: `['Country', 'Year']` showing GDP and Inflation.
2. Display all records for the country `'India'`.
3. Select the data for `'USA'` in `2023`.
4. Compute the total GDP by country.
5. Compute the average inflation by year.
6. Convert the table to a stacked format and then back using `.unstack()`.
7. Add a third index level (`Quarter`) and print the new structure.
8. Select only the GDP column for `'India'` using `.loc`.
9. Convert the multi-index to columns using `.reset_index()`.
10. Sort the multi-index in ascending order of `Year`.
⚡ Summary#
| Concept | Function | Description |
|---|---|---|
| MultiIndex creation | `pd.MultiIndex.from_tuples()` / `.set_index()` | Creates hierarchical indices |
| Selection | `.loc[]` with tuples / partial labels | Access multi-level data |
| Reshaping | `.stack()` / `.unstack()` | Convert between wide and long formats |
| Aggregation | `.groupby(level=...)` | Aggregate data at specific index levels |
| Reset | `.reset_index()` | Flatten the hierarchy |
🧬 Advanced Hierarchical (Multi-Level) Indexing#
In the previous topic, you learned the basics of hierarchical indexing — creating multi-level DataFrames, selecting using .loc, and reshaping with .stack() / .unstack().
Now, let’s dive into advanced operations: selecting across multiple index levels, performing cross-sections, and aggregating data efficiently.
🎯 Learning Objectives#
By the end of this section, you’ll be able to:
- Create multi-level indices from multiple columns
- Perform cross-section selection with `.xs()`
- Use stacking/unstacking across levels
- Apply aggregation at specific index levels
- Understand index ordering and sorting
🧱 1. Creating Multi-Level Indices from Columns#
You can convert existing columns into a multi-index using .set_index().
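The original code isn’t shown; a sketch that reproduces the output below (sample values inferred from it):

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "East", "East"],
    "Year": [2022, 2023, 2022, 2023, 2022, 2023],
    "Quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "Sales": [25000, 27000, 30000, 28000, 26000, 31000],
    "Profit": [4000, 5000, 6000, 4500, 4700, 6200],
})
df = df.set_index(["Region", "Year", "Quarter"])
print(df)
```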
Output:
Sales Profit
Region Year Quarter
North 2022 Q1 25000 4000
2023 Q2 27000 5000
South 2022 Q1 30000 6000
2023 Q2 28000 4500
East 2022 Q1 26000 4700
2023 Q2 31000 6200
Now you have a three-level index: Region → Year → Quarter.
🔍 2. Cross-Section Selection with .xs()#
When dealing with multiple levels, .xs() is your best friend for selecting data at a specific index level.
💡 .xs() (short for cross-section) lets you slice through the multi-index without complex tuple-based indexing.
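A cross-section sketch on a small three-level table (data invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Year": [2022, 2023, 2022, 2023],
    "Quarter": ["Q1", "Q2", "Q1", "Q2"],
    "Sales": [25000, 27000, 30000, 28000],
}).set_index(["Region", "Year", "Quarter"])

print(df.xs("Q1", level="Quarter"))  # every Q1 row, any region/year
print(df.xs(2023, level="Year"))     # every 2023 row
```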
🧩 3. Stacking and Unstacking with Levels#
You can reshape data across any index level using the level argument.
💡 Tip:
- Use `stack(level='Quarter')` to move Quarter to rows.
- Use `unstack(level='Quarter')` to move Quarter to columns.
⚙️ 4. Aggregation at Specific Index Levels#
Use groupby(level=) for advanced summarization.
These aggregations respect the index hierarchy — great for summarizing business performance across regions, time, or departments.
🔄 5. Sorting and Reordering Multi-Level Indices#
A well-sorted index helps when performing hierarchical operations.
💡 Swapping and sorting are common when data is structured differently across systems.
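A swap-and-sort sketch, starting from a deliberately unsorted index (data invented):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("South", 2023), ("North", 2022), ("South", 2022), ("North", 2023)],
    names=["Region", "Year"])
df = pd.DataFrame({"Sales": [28000, 25000, 30000, 27000]}, index=idx)

df = df.sort_index()                                   # Region, then Year
print(df)
swapped = df.swaplevel("Region", "Year").sort_index()  # Year is now outermost
print(swapped)
```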
🧮 6. Aggregation Example — Profit Margin by Level#
Let’s compute Profit Margin (%) by Region and Year.
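A sketch with invented Sales/Profit numbers; the derived column name `Margin %` is my own choice:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["North", "South"], [2022, 2023]],
                                 names=["Region", "Year"])
df = pd.DataFrame({"Sales": [25000, 27000, 30000, 28000],
                   "Profit": [4000, 5000, 6000, 4500]}, index=idx)

df["Margin %"] = df["Profit"] / df["Sales"] * 100
print(df["Margin %"].round(1))                                 # per Region+Year
print(df.groupby(level="Region")["Margin %"].mean().round(1))  # per Region
```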
🧪 Practice Exercises#
Try these to get comfortable:
1. Create a DataFrame with index levels: `['Department', 'Year', 'Quarter']` showing Revenue and Cost.
2. Convert columns into a hierarchical index using `.set_index()`.
3. Use `.xs()` to get all Q2 data across departments.
4. Compute total Revenue by Department.
5. Find average Cost by Year.
6. Unstack the Year level and display Revenue as columns.
7. Stack it back and verify it matches the original DataFrame.
8. Swap index levels (`Department` ↔ `Year`) and sort.
9. Add a new calculated column `Profit = Revenue - Cost`.
10. Compute and display average profit per department.
📚 Summary#
| Function | Description |
|---|---|
| `.set_index()` | Converts columns into a hierarchical index |
| `.xs()` | Cross-section selection across index levels |
| `.stack()` / `.unstack()` | Reshape DataFrames across levels |
| `.groupby(level=...)` | Aggregate across multiple index levels |
| `.swaplevel()` | Reorder levels in a multi-index |
| `.sort_index()` | Sort hierarchical index for better readability |
🧮 Complex DataFrame Manipulations#
Data manipulation lies at the heart of every data analysis process. Once a dataset is loaded into a Pandas DataFrame, the next step is to explore, filter, summarize, and transform it to extract meaningful insights.
In this section, we’ll cover powerful techniques:
- Filtering data using multiple logical conditions
- Grouping with aggregation using `groupby()`
- Creating pivot tables for multi-dimensional summaries
- Adding calculated columns using existing data
🎯 Learning Objectives#
By the end of this section, you will be able to:
- Filter DataFrame rows using multiple logical conditions
- Perform grouped aggregations with `groupby()`
- Create pivot tables for multi-dimensional summaries
- Add new calculated columns using existing data
🧮 1. Filtering Rows Based on Multiple Conditions#
Filtering allows you to select only the data that meets certain criteria. Pandas uses boolean indexing to achieve this.
💡 Explanation:
- Use `&` for AND and `|` for OR conditions.
- Always wrap each condition in parentheses.
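A filtering sketch on a made-up employee table:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Manoj", "Ravi", "Alka"],
    "Department": ["Sales", "HR", "Sales", "Finance"],
    "Experience": [5, 2, 4, 6],
    "Salary": [52000, 38000, 61000, 75000],
})

mask = (df["Experience"] > 3) & (df["Salary"] < 60000)
print(df[mask])                                   # Asha only

# string conditions work too
print(df[df["Name"].str.startswith(("A", "M"))])  # Asha, Manoj, Alka
```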
📊 2. Grouping Data with Aggregation#
Grouping helps summarize large datasets into meaningful statistics. The groupby() function is one of the most powerful tools in Pandas.
💡 Explanation:
You can compute multiple aggregations for each group.
The result is often a MultiIndex DataFrame, which can be flattened later for easier access.
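A grouped-aggregation sketch (sample data invented), including one way to flatten the resulting labels:

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR"],
    "Salary": [50000, 60000, 40000, 44000],
})

summary = df.groupby("Department")["Salary"].agg(["mean", "max"])
print(summary)

# flatten the column labels for easier downstream access
summary.columns = [f"salary_{c}" for c in summary.columns]
print(summary)
```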
🪄 3. Creating Pivot Tables#
Pivot tables allow you to summarize and reshape data interactively, similar to Excel.
💡 Explanation:
- `index` defines rows, `columns` defines columns, and `values` defines what to aggregate.
- You can specify aggregation functions like `mean`, `sum`, or even custom functions.
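A pivot-table sketch on invented salary data:

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR"],
    "Experience": [2, 5, 2, 5],
    "Salary": [50000, 62000, 40000, 48000],
})

pivot = pd.pivot_table(df, values="Salary",
                       index="Department", columns="Experience",
                       aggfunc="mean")
print(pivot)
```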
➕ 4. Adding Calculated Columns#
You can create new columns derived from existing ones. This is essential for analysis and feature engineering.
💡 Explanation:
You can perform vectorized operations directly on DataFrame columns.
Avoid loops — Pandas operations are faster and cleaner.
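A calculated-column sketch (data and the 10% tax rate are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Ravi"], "Salary": [50000, 60000]})

# vectorized column arithmetic — no loops required
df["Tax"] = df["Salary"] * 0.10
df["Net Salary"] = df["Salary"] - df["Tax"]
print(df)
```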
🧪 Practice Exercises#
Try the following tasks using your dataset:
1. Filter records where `Experience > 3` and `Salary < 60000`.
2. Group data by `Department` and calculate both average and maximum salaries.
3. Create a pivot table showing average salary by `Department` and `Experience`.
4. Add a column `Tax` = 10% of `Salary`, and compute `Net Salary` = `Salary - Tax`.
5. Find the top 3 employees with the highest total pay (combine with `nlargest()`).
6. For each department, display the employee with maximum experience.
7. Filter all employees whose names start with `'A'` or `'M'`.
8. Add a calculated column for Years Until Retirement assuming retirement age is 60.
9. Identify the lowest paid employee in each department (combine with `nsmallest()`).
10. Create a report summarizing total salary and average experience by department.
🏆 Identifying Top-Performing Records#
Once you’ve cleaned, indexed, and organized your dataset, the next step is often to find the best (or worst) performers — the top customers, highest-grossing products, or most efficient employees.
Pandas provides multiple efficient methods:
- `sort_values()` — Order DataFrame by column(s)
- `nlargest()` / `nsmallest()` — Get top/bottom N records directly
- Grouping + sorting — Find top performers within groups
1. Sorting Values#
Use .sort_values() to arrange data based on one or more columns.
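The original code isn’t shown; a sketch that reproduces the output below (sample products reconstructed from it):

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Tablet", "Phone", "Camera", "Monitor"],
    "Revenue": [80000, 30000, 60000, 20000, 40000],
    "Profit": [20000, 8000, 15000, 4000, 7000],
})
print(df.sort_values(by="Revenue", ascending=False))
```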
Output:
Product Revenue Profit
0 Laptop 80000 20000
2 Phone 60000 15000
4 Monitor 40000 7000
1 Tablet 30000 8000
3 Camera 20000 4000
You can also sort by multiple columns:
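For instance (sample data invented), with one `ascending` flag per column:

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Phone", "Tablet"],
    "Revenue": [80000, 60000, 30000],
    "Profit": [20000, 15000, 8000],
})
# highest profit first; revenue breaks any ties
print(df.sort_values(by=["Profit", "Revenue"], ascending=[False, False]))
```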
🎯 2. Getting Top N Records — .nlargest()#
When you only care about the top few performers, .nlargest() is faster and simpler than sorting.
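A sketch reproducing the output below:

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Tablet", "Phone", "Camera", "Monitor"],
    "Revenue": [80000, 30000, 60000, 20000, 40000],
    "Profit": [20000, 8000, 15000, 4000, 7000],
})
print(df.nlargest(3, "Revenue"))  # top 3 by revenue, no full sort needed
```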
Output:
Product Revenue Profit
0 Laptop 80000 20000
2 Phone 60000 15000
4 Monitor 40000 7000
Similarly, .nsmallest() gives you the lowest-performing records:
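A sketch matching the output below (which ranks by `Profit`, judging from the rows shown):

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Tablet", "Phone", "Camera", "Monitor"],
    "Revenue": [80000, 30000, 60000, 20000, 40000],
    "Profit": [20000, 8000, 15000, 4000, 7000],
})
print(df.nsmallest(2, "Profit"))  # bottom 2 by profit
```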
Output:
Product Revenue Profit
3 Camera 20000 4000
4 Monitor 40000 7000
🧩 3. Grouping + Sorting — Find Top in Each Group#
Find the top-performing record within each category.
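One way to do it, sketched to reproduce the output below (sort first, then keep each category’s first row):

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Electronics", "Electronics", "Furniture", "Furniture", "Clothing"],
    "Product": ["Laptop", "Phone", "Chair", "Desk", "Jacket"],
    "Revenue": [80000, 60000, 18000, 25000, 15000],
    "Profit": [20000, 15000, 3500, 5000, 3000],
})

# sort by revenue, then keep the first (= best) row of each category
top = (df.sort_values("Revenue", ascending=False)
         .groupby("Category", sort=False)
         .head(1))
print(top)
```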
Output:
Category Product Revenue Profit
0 Electronics Laptop 80000 20000
3 Furniture Desk 25000 5000
4 Clothing Jacket 15000 3000
⚡ 4. Using idxmax() and idxmin()#
Quickly get the index of the top or bottom record:
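A sketch matching the output below:

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Tablet", "Phone"],
    "Revenue": [80000, 30000, 60000],
    "Profit": [20000, 8000, 15000],
})

best = df["Revenue"].idxmax()  # index label of the top-revenue row
print(df.loc[best])            # the full record
```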
Output:
Product Laptop
Revenue 80000
Profit 20000
Name: 0, dtype: object
🧠 Summary — Identifying Top Performers#
| Method | Purpose | Notes |
|---|---|---|
| `sort_values()` | Sort full DataFrame | Use for ordered reports |
| `nlargest()` / `nsmallest()` | Get top or bottom N records | Fast and simple |
| `sort_values()` + `groupby().head(n)` | Top N within each group | Great for categorical comparisons |
| `idxmax()` / `idxmin()` | Get best record index | Works on single columns |
📚 Complete Practice Lab — Data Detective Challenge#
Now that you’ve mastered EDA, indexing, grouping, and identifying top performers, it’s time to put it all together in a comprehensive analysis challenge.
🎯 Challenge: Company Sales Analysis#
You have access to a dataset called company_sales.csv with the following structure:
| Column | Description |
|---|---|
| `Date` | Transaction date |
| `Region` | Sales region |
| `Product_Category` | Category of product sold |
| `Quantity` | Units sold |
| `Price_per_Unit` | Unit price |
| `Sales_Amount` | Total sales (Quantity × Price_per_Unit) |
| `Marketing_Spend` | Marketing spend for that transaction |
| `Customer_Tier` | Customer segment (Gold/Silver/Bronze) |
📋 Analysis Tasks#
Use all the techniques you’ve learned to answer these questions:
Part 1: Data Exploration#
1. Load the dataset and display basic info (shape, columns, data types)
2. Generate summary statistics for all numeric columns
3. Check for any missing values or outliers
4. Create a correlation matrix and visualize it
Part 2: Univariate & Bivariate Analysis#
1. Create a histogram of `Sales_Amount` — is it normally distributed?
2. Create a boxplot of `Sales_Amount` by `Region` — which region performs best?
3. Create a scatter plot: `Marketing_Spend` vs `Sales_Amount` — is there a relationship?
4. Analyze which `Product_Category` has the highest average sales
Part 3: Indexing & Filtering#
1. Filter for transactions with `Sales_Amount > $50,000` — how many records?
2. Select only Gold-tier customers in the North region using `.loc[]`
3. Find all transactions from the top 3 months by sales using indexing
Part 4: Grouping & Aggregation#
1. Group by `Region` and calculate:
   - Average Sales Amount
   - Total Marketing Spend
   - Count of transactions
2. Create a pivot table showing average Sales Amount by `Region` and `Customer_Tier`
3. Group by `Product_Category` and find the best-performing category by total revenue
Part 5: Top Performers & Calculated Columns#
1. Add a calculated column: `ROI = (Sales_Amount - Marketing_Spend) / Marketing_Spend * 100`
2. Find the top 5 transactions with the highest ROI
3. For each region, identify the top 3 performing days (by total sales)
4. Find the bottom 2 performing product categories by average profit margin
Part 6: Advanced Insights#
1. Create a multi-level index: `['Region', 'Customer_Tier']` and calculate total sales
2. Use `.xs()` to extract all sales for a specific tier across regions
🏆 Bonus Challenges#
Visualization: Create a dashboard with 4 plots showing key metrics
Time Series: If Date is included, plot sales trends over time
Segmentation: Identify high-value customers (top 10% by sales)
Forecasting: Predict next month’s total sales based on current trends
💡 Key Takeaways#
By completing this lab, you’ll have practiced:
✅ Data Loading & Exploration — Understanding your dataset
✅ Univariate Analysis — Single variable patterns
✅ Bivariate Analysis — Relationships between variables
✅ Filtering & Indexing — Extracting relevant subsets
✅ Grouping & Aggregation — Summarizing by categories
✅ Pivot Tables — Multi-dimensional views
✅ Top Performers — Ranking and sorting
✅ Feature Engineering — Creating new insights
📖 Recap Table#
| Topic | Key Functions | Use Case |
|---|---|---|
| Summary Stats | `.describe()`, `.info()` | Understand data distribution |
| Visualization | `.hist()`, `.boxplot()`, `.plot.scatter()` | Visual pattern recognition |
| Filtering | Boolean indexing | Extract subsets |
| Indexing | `.loc[]`, `.iloc[]` | Efficient data access |
| Grouping | `.groupby()` | Categorical summaries |
| Pivoting | `.pivot_table()` | Multi-dimensional views |
| Ranking | `.sort_values()`, `.nlargest()` | Top/bottom performers |
| Features | Calculated columns | New derived metrics |
🎓 You’re now a data detective! Go forth and uncover business stories hidden in your data! 🔍
📊 Data Inspection Basics#
Viewing Data#
Use .head() and .tail() to view the top and bottom rows of the DataFrame:
import numpy as np
import pandas as pd
# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
print("First 5 rows:")
print(df.head())
print("\nLast 3 rows:")
df.tail(3)
Inspecting Index and Columns#
print("Index:")
print(df.index)
print("\nColumns:")
df.columns
Converting to NumPy Array#
Return a NumPy representation of the underlying data:
df.to_numpy()
Data Types#
A DataFrame can hold a different dtype per column; when converting to NumPy, pandas finds a single common dtype that can represent them all (here, `object`):
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
print("Data types:")
print(df2.dtypes)
print("\nConverted to numpy:")
df2.to_numpy()
📈 Summary Statistics#
The .describe() method provides a quick statistical summary of your data:
df.describe()
Transposing Data#
df.T
🔄 Sorting Data#
Sort by Index#
df.sort_index(axis=1, ascending=False)
Sort by Values#
df.sort_values(by="B")
🎯 Data Selection#
Basic Column Selection#
Selecting a single column returns a Series:
df["A"]
Multiple Column Selection#
df[["B", "A"]]
Row Slicing#
print("First 3 rows:")
print(df[0:3])
print("\nDate range slice:")
df["20130102":"20130104"]
Selection by Label (.loc)#
Use .loc[] for label-based indexing:
# Select a single row
print("Single row:")
print(df.loc[dates[0]])
print("\nSelect rows and columns:")
print(df.loc[:, ["A", "B"]])
print("\nSlice with labels (both endpoints included):")
df.loc["20130102":"20130104", ["A", "B"]]
Selection by Position (.iloc)#
Use .iloc[] for integer position-based indexing:
# Select by position
print("4th row:")
print(df.iloc[3])
print("\nSlice rows and columns:")
print(df.iloc[3:5, 0:2])
print("\nSpecific positions:")
df.iloc[[1, 2, 4], [0, 2]]
Boolean Indexing#
Filter data based on conditions:
# Filter where column A is greater than 0
df[df["A"] > 0]
Using .isin() for Filtering#
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
print("DataFrame with new column:")
print(df2)
print("\nFiltered data:")
df2[df2["E"].isin(["two", "four"])]
📊 Statistical Operations#
Computing Statistics#
Operations generally exclude missing data:
# Mean for each column
print("Mean by column:")
print(df.mean())
print("\nMean by row:")
df.mean(axis=1)
Operations with Different Indices#
When operating with objects that have different indices, pandas will align them:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print("Shifted series:")
print(s)
print("\nSubtraction with alignment:")
df.sub(s, axis="index")
User-Defined Functions#
Apply custom functions using .agg() and .transform():
# Aggregation
print("Aggregation with custom function:")
print(df.agg(lambda x: np.mean(x) * 5.6))
print("\nTransformation:")
df.transform(lambda x: x * 101.2)
🎲 Value Counts#
Value counts provide frequency distribution of values:
s = pd.Series(np.random.randint(0, 7, size=10))
print("Sample series:")
print(s)
print("\nValue counts:")
s.value_counts()
👥 Grouping Data#
Group-by operations allow you to split data, apply a function, and combine results:
df_group = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
print("Sample DataFrame:")
df_group
Grouping and Aggregating#
# Group by column A and sum
df_group.groupby("A")[["C", "D"]].sum()
Grouping by Multiple Columns#
df_group.groupby(["A", "B"]).sum()
🔄 Reshaping Data#
Stack and Unstack#
Stack “compresses” a level in the DataFrame’s columns:
arrays = [
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df_stack = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2_stack = df_stack[:4]
print("Original DataFrame:")
df2_stack
Stacking#
stacked = df2_stack.stack()
stacked
Unstacking#
print("Unstack (default - last level):")
print(stacked.unstack())
print("\nUnstack level 1:")
print(stacked.unstack(1))
print("\nUnstack level 0:")
stacked.unstack(0)
Pivot Tables#
Pivot tables provide a way to create cross-tabulations:
df_pivot = pd.DataFrame(
{
"A": ["one", "one", "two", "three"] * 3,
"B": ["A", "B", "C"] * 4,
"C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
"D": np.random.randn(12),
"E": np.random.randn(12),
}
)
print("Sample DataFrame:")
df_pivot
Creating Pivot Table#
pd.pivot_table(df_pivot, values="D", index=["A", "B"], columns=["C"])