Exploratory Data Analysis#

Turning Data Into Business Stories

Welcome to the stage where data becomes drama. 🎭

You’ve loaded it, cleaned it, encoded it — now it’s time to interrogate it.

💬 “EDA is like a first date with your dataset — ask questions, spot red flags, and decide if you want to take it to production.”


🧠 Why EDA Matters#

EDA is where you move from data janitor to data storyteller.

Here’s what EDA helps you uncover:

  • What’s normal in your data

  • What’s weird (and possibly a bug or a legend)

  • What features actually matter

  • How variables interact

⚡ A good EDA session can save you 10 hours of model debugging later.


🧰 Prerequisite#

If you’re new to pandas or visualization, check my other book: 👉 📘 Programming for Business


📦 Step 1. Get to Know Your Dataset#

Let’s start with the basics — “Hi, dataset! Tell me about yourself.”

💬 “.info() is like your dataset’s LinkedIn profile — gives you a summary, but hides the weird stuff.”
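A quick sketch of that first conversation (the file name sales.csv is a stand-in for your own dataset):

import pandas as pd

df = pd.read_csv("sales.csv")   # hypothetical file name, use your own

print(df.shape)      # rows x columns
print(df.head())     # first five rows
df.info()            # dtypes, non-null counts, memory usage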


🔍 Step 2. Basic Summary Statistics#

Numbers can whisper business truths — if you know how to listen.

| Metric | What It Means |
|---|---|
| mean, median | Typical value |
| std, IQR | How spread out things are |
| min, max | Range (and possible outliers) |
| unique | Categorical diversity |

💡 Example: “Average order = $92, but max order = $10,000?” Either a whale customer… or a typo named Jeff.
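A minimal sketch, continuing with the df loaded above (the column name is hypothetical):

# Count, mean, std, min, quartiles, and max for every numeric column
print(df.describe())

# Categorical diversity
print(df["region"].nunique(), "unique regions")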


🎨 Step 3. Univariate Visuals — “The Solo Acts”#

Visualize one variable at a time to see its story.

| What to Look For | Meaning |
|---|---|
| Bell curve | Healthy data |
| Right-skewed | A few big spenders |
| Flat or weird shape | Mixed data sources |

💬 “If your histogram looks like a cliff, it’s not geology — it’s bad data entry.”
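One way to plot a solo act, assuming a numeric order_value column (a made-up name):

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram with a smoothed density overlay
sns.histplot(df["order_value"], bins=30, kde=True)
plt.title("Distribution of Order Value")
plt.show()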


🍰 Categorical Insights#

💡 “If one region dominates, either it’s your best market — or the other regions are asleep at the wheel.”
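A sketch for checking category balance (region is again a hypothetical column):

import matplotlib.pyplot as plt
import seaborn as sns

# Share of records per category
print(df["region"].value_counts(normalize=True).round(2))

# Bar chart of the counts
sns.countplot(data=df, x="region")
plt.title("Orders by Region")
plt.show()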


🧩 Step 4. Bivariate Analysis — “The Dynamic Duos”#

This is where you see relationships between features.

🧮 Numerical vs Numerical#

💬 “If you see an upward trend, congrats — your marketing isn’t just burning money.”
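A scatter-plot sketch, with hypothetical marketing_spend and revenue columns:

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(data=df, x="marketing_spend", y="revenue")
plt.title("Marketing Spend vs Revenue")
plt.show()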


🎯 Categorical vs Numerical#

💡 Boxplots are like HR reports — they show who’s average, who’s overachieving, and who’s way out of line.
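A boxplot sketch (column names are hypothetical):

import matplotlib.pyplot as plt
import seaborn as sns

# One box per region: median, quartiles, and the outliers way out of line
sns.boxplot(data=df, x="region", y="order_value")
plt.title("Order Value by Region")
plt.show()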


🧠 Step 5. Correlation — “The Data Gossip Network”#

Find which variables are talking behind each other’s backs.

| Value | Meaning |
|---|---|
| 0.9+ | Practically twins |
| 0.5 | Moderate connection |
| 0 | No gossip |
| -0.9 | Total opposites |

💬 “If two columns are 0.99 correlated, drop one — they’re basically sending the same email twice.”
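A heatmap sketch for listening in on the gossip:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations among numeric columns only
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.show()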


🪄 Step 6. Feature Relationships (Business Examples)#

Example 1: Revenue vs Customer Segment#

“Gold customers really are gold — except when they ask for more discounts.”
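One way to check that claim, assuming hypothetical customer_segment and revenue columns:

# Average revenue per segment, best first
print(df.groupby("customer_segment")["revenue"].mean().sort_values(ascending=False))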


📊 Step 7. Advanced EDA Helpers#

When you’re ready to go pro:

💬 “This is like an MRI for your dataset — it finds hidden patterns before you even blink.”
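One such helper is seaborn’s pairplot; automated profilers (e.g., ydata-profiling) go further, but this sketch needs nothing extra:

import seaborn as sns

# Scatter plots for every numeric pair, histograms on the diagonal
# (can be slow on wide datasets)
sns.pairplot(df.select_dtypes("number"))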


Advanced DataFrame Indexing & Selection#

After learning how to manipulate and summarize data, the next step is mastering indexing and selection — the key to slicing, dicing, and extracting precise data subsets efficiently.

Pandas provides multiple ways to access rows, columns, and specific cells — each suited for different situations. Understanding them deeply makes your analysis cleaner and faster.

🧱 1. Label-Based & Position-Based Selection#

.loc[] — Access by Label (Name)#

Use actual labels (names) of rows and columns.

.iloc[] — Access by Position (Integer)#

Use integer positions instead of names.

💡 Key difference: .loc[] uses labels, .iloc[] uses numeric positions.
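A minimal sketch with a hypothetical employee table and string row labels:

import pandas as pd

df = pd.DataFrame(
    {"Name": ["Meera", "Arjun", "Vikram"],
     "Salary": [50000, 75000, 45000]},
    index=["E1", "E2", "E3"],
)

print(df.loc["E2", "Salary"])   # by label: 75000
print(df.iloc[1, 1])            # by position: the same cell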


🧩 2. Boolean Indexing — Filter Like a Pro 🔍#

Boolean indexing lets you apply conditions directly on DataFrame columns.

💡 Always enclose conditions within parentheses and use &, |, ~ operators.
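Continuing with the same hypothetical table:

# Salary above 48000 AND name not 'Meera'
print(df[(df["Salary"] > 48000) & ~(df["Name"] == "Meera")])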


⚙️ 3. Index Management — Set & Reset#

You can change which column acts as the row identifier for faster lookups and clarity.

💡 Why? Setting an index improves clarity and speeds up lookups when a particular column represents unique entities.
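For example, with the same table:

df_by_name = df.set_index("Name")   # Name becomes the row identifier
print(df_by_name.loc["Arjun"])      # lookup by name
print(df_by_name.reset_index())     # back to a default integer index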


✂️ 4. Conditional Slicing#

Extract specific subsets by combining slicing and conditions.
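A sketch combining a condition with column selection in one .loc call:

# Names of everyone earning more than 48000
print(df.loc[df["Salary"] > 48000, ["Name"]])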


🧠 5. Fast Scalar Access#

For accessing single values, use .at[] or .iat[] — much faster than .loc[] or .iloc[].
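For example:

print(df.at["E2", "Salary"])    # fast label-based scalar read
print(df.iat[2, 1])             # fast position-based scalar read
df.at["E3", "Salary"] = 47000   # single-cell update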


Summary — Advanced Selection in Pandas#

| Method | Type | Example | Description |
|---|---|---|---|
| .loc[] | Label-based | df.loc['A'] | Select by row/column name |
| .iloc[] | Position-based | df.iloc[0:2, 1:3] | Select by numeric position |
| Boolean Index | Condition-based | df[df['Profit'] > 8000] | Filter with logic |
| .set_index() | Index management | df.set_index('City') | Change lookup key |
| .reset_index() | Revert index | df.reset_index() | Restore default integers |
| .at[] / .iat[] | Fast scalar | df.at['E2', 'Salary'] | Quick single-value access |


🧪 Practice Lab — Advanced Indexing#

Try the following to solidify your understanding (a starter table is sketched after the list):

  1. Select only the Name and Salary columns using both .loc and .iloc.

  2. Retrieve all employees with experience less than 4 years.

  3. Display the 2nd and 4th rows using .iloc.

  4. Set Department as the new index and display the DataFrame.

  5. Reset the index back to default integer.

  6. Filter employees whose name starts with 'M' or 'V'.

  7. Increase salary by 5% for all employees in the Sales department.

  8. Access the salary of the employee in the 3rd row and update it to 62000.

  9. Create a subset of only Finance department employees using .loc.

  10. Display employees with experience between 3 and 6 years.
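A hypothetical starter table for these exercises (the column names match the tasks; the values are made up):

import pandas as pd

df = pd.DataFrame({
    "Name": ["Meera", "Arjun", "Vikram", "Sara", "Mohan"],
    "Department": ["Sales", "Finance", "Sales", "HR", "Finance"],
    "Experience": [3, 6, 2, 5, 4],
    "Salary": [50000, 75000, 45000, 60000, 58000],
})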



🧩 Hierarchical (Multi-level) Indexing — Basics#

In large datasets, data often has multiple dimensions — like regions, departments, or years — and we want to organize or summarize them efficiently. That’s where Hierarchical Indexing (also called Multi-level Indexing) in Pandas becomes powerful.

It allows you to store and access data across multiple levels of rows or columns, creating a structured and compact representation of complex datasets.


🎯 Learning Objectives#

By the end of this section, you will be able to:

  • Create multi-level indexed DataFrames

  • Select data at different index levels

  • Perform stacking and unstacking

  • Apply aggregation based on specific index levels


🧱 1. Creating a Multi-level (Hierarchical) Index#

Let’s start with a dataset representing sales performance across regions and years.
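One way to build it, a sketch using pd.MultiIndex.from_arrays() (values chosen to match the output below):

import pandas as pd

arrays = [
    ["North", "North", "South", "South", "East", "East"],
    [2022, 2023, 2022, 2023, 2022, 2023],
]
index = pd.MultiIndex.from_arrays(arrays, names=["Region", "Year"])
df = pd.DataFrame(
    {"Sales": [25000, 27000, 30000, 28000, 26000, 31000],
     "Profit": [4000, 5000, 6000, 4500, 4700, 6200]},
    index=index,
)
print(df)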

Output:

               Sales  Profit
Region Year
North  2022   25000    4000
       2023   27000    5000
South  2022   30000    6000
       2023   28000    4500
East   2022   26000    4700
       2023   31000    6200

💡 Notice: The index now has two levels — Region and Year.


🔍 2. Selecting Data at Different Index Levels#

You can use .loc[] with tuples to access data at specific index combinations.

Or use index slicing for partial selections.
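For example, with the df built above:

# One exact (Region, Year) combination
print(df.loc[("North", 2023)])

# Partial selection: every year for one region
print(df.loc["South"])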


⚙️ 3. Stacking and Unstacking#

Stacking and unstacking reshape data between rows and columns — powerful for pivot-style transformations.

💡 Think of it as:

  • .stack() → makes the table taller (more rows, fewer columns)

  • .unstack() → makes the table wider (fewer rows, more columns)
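A sketch with the same df:

# Wider: the Year level moves into the columns
wide = df.unstack(level="Year")
print(wide)

# Taller: Year moves back into the rows
print(wide.stack(level="Year"))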


📊 4. Aggregation by Index Levels#

You can compute summary statistics for specific index levels by passing the level parameter to groupby().

💡 Use level name or number (level=0 for Region, level=1 for Year).
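For example:

# Total sales per region (level 0)
print(df.groupby(level="Region")["Sales"].sum())

# Average profit per year (level 1)
print(df.groupby(level=1)["Profit"].mean())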


🧪 Practice Exercises#

Try these small tasks:

  1. Create a DataFrame with 2-level index: ['Country', 'Year'] showing GDP and Inflation.

  2. Display all records for the country 'India'.

  3. Select the data for 'USA' in 2023.

  4. Compute the total GDP by country.

  5. Compute the average inflation by year.

  6. Convert the table to a stacked format and then back using .unstack().

  7. Add a third index level (Quarter) and print the new structure.

  8. Select only the GDP column for 'India' using .loc.

  9. Convert multi-index to columns using .reset_index().

  10. Sort the multi-index in ascending order of Year.


⚡ Summary#

| Concept | Function | Description |
|---|---|---|
| MultiIndex creation | pd.MultiIndex.from_arrays() / nested lists | Creates hierarchical indices |
| Selection | .loc[] with tuples | Access multi-level data |
| Reshaping | .stack() / .unstack() | Convert between wide and long formats |
| Aggregation | .groupby(level=) | Aggregate data at specific index levels |
| Reset | .reset_index() | Flatten the hierarchy |



🧬 Advanced Hierarchical (Multi-Level) Indexing#

In the previous topic, you learned the basics of hierarchical indexing — creating multi-level DataFrames, selecting using .loc, and reshaping with .stack() / .unstack(). Now, let’s dive into advanced operations: selecting across multiple index levels, performing cross-sections, and aggregating data efficiently.


🎯 Learning Objectives#

By the end of this section, you’ll be able to:

  • Create multi-level indices from multiple columns

  • Perform cross-section selection with .xs()

  • Use stacking/unstacking across levels

  • Apply aggregation at specific index levels

  • Understand index ordering and sorting


🧱 1. Creating Multi-Level Indices from Columns#

You can convert existing columns into a multi-index using .set_index().
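A sketch, with values chosen to match the output below:

import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "East", "East"],
    "Year": [2022, 2023, 2022, 2023, 2022, 2023],
    "Quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "Sales": [25000, 27000, 30000, 28000, 26000, 31000],
    "Profit": [4000, 5000, 6000, 4500, 4700, 6200],
})
df = df.set_index(["Region", "Year", "Quarter"])
print(df)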

Output:

                         Sales  Profit
Region Year Quarter
North  2022 Q1        25000    4000
       2023 Q2        27000    5000
South  2022 Q1        30000    6000
       2023 Q2        28000    4500
East   2022 Q1        26000    4700
       2023 Q2        31000    6200

Now you have a three-level index: Region → Year → Quarter.


🔍 2. Cross-Section Selection with .xs()#

When dealing with multiple levels, .xs() is your best friend for selecting data at a specific index level.

💡 .xs() (short for cross-section) lets you slice through the multi-index without complex tuple-based indexing.
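For example, with the df above:

# Every Q2 row, across all regions and years
print(df.xs("Q2", level="Quarter"))

# One region, keeping the remaining levels
print(df.xs("North", level="Region"))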


🧩 3. Stacking and Unstacking with Levels#

You can reshape data across any index level using the level argument.

💡 Tip:

  • Use stack(level='Quarter') to move Quarter to rows.

  • Use unstack(level='Quarter') to move Quarter to columns.
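A sketch:

# Quarter moves into the columns
by_quarter = df.unstack(level="Quarter")
print(by_quarter)

# …and back into the rows
print(by_quarter.stack(level="Quarter"))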


⚙️ 4. Aggregation at Specific Index Levels#

Use groupby(level=) for advanced summarization.

These aggregations respect the index hierarchy — great for summarizing business performance across regions, time, or departments.
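For example:

# Total sales per region, ignoring year and quarter
print(df.groupby(level="Region")["Sales"].sum())

# Average profit per year
print(df.groupby(level="Year")["Profit"].mean())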


🔄 5. Sorting and Reordering Multi-Level Indices#

A well-sorted index helps when performing hierarchical operations.

💡 Swapping and sorting are common when data is structured differently across systems.
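For example:

# Swap Region and Year, then sort so lookups stay fast
swapped = df.swaplevel("Region", "Year").sort_index()
print(swapped)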


🧮 6. Aggregation Example — Profit Margin by Level#

Let’s compute Profit Margin (%) by Region and Year.
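A sketch with the same df:

# Profit margin (%) per row, then averaged by Region and Year
margin = df["Profit"] / df["Sales"] * 100
print(margin.groupby(level=["Region", "Year"]).mean().round(2))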


🧪 Practice Exercises#

Try these to get comfortable:

  1. Create a DataFrame with index levels: ['Department', 'Year', 'Quarter'] showing Revenue and Cost.

  2. Convert columns into hierarchical index using .set_index().

  3. Use .xs() to get all Q2 data across departments.

  4. Compute total Revenue by Department.

  5. Find average Cost by Year.

  6. Unstack the Year level and display Revenue as columns.

  7. Stack it back and verify it matches the original DataFrame.

  8. Swap index levels (Department ↔ Year) and sort.

  9. Add a new calculated column Profit = Revenue - Cost.

  10. Compute and display average profit per department.


📚 Summary#

| Function | Description |
|---|---|
| .set_index() | Converts columns into a hierarchical index |
| .xs() | Cross-section selection across index levels |
| .stack() / .unstack() | Reshape DataFrames across levels |
| .groupby(level=) | Aggregate across multiple index levels |
| .swaplevel() | Reorder levels in a multi-index |
| .sort_index() | Sort hierarchical index for better readability |


🧮 Complex DataFrame Manipulations#

Data manipulation lies at the heart of every data analysis process. Once a dataset is loaded into a Pandas DataFrame, the next step is to explore, filter, summarize, and transform it to extract meaningful insights.

In this section, we’ll cover powerful techniques:

  1. Filtering data using multiple logical conditions

  2. Grouping with aggregation using groupby()

  3. Creating pivot tables for multi-dimensional summaries

  4. Adding calculated columns using existing data


🎯 Learning Objectives#

By the end of this section, you will be able to:

  • Filter DataFrame rows using multiple logical conditions

  • Perform grouped aggregations with groupby()

  • Create pivot tables for multi-dimensional summaries

  • Add new calculated columns using existing data


🧮 1. Filtering Rows Based on Multiple Conditions#

Filtering allows you to select only the data that meets certain criteria. Pandas uses boolean indexing to achieve this.

💡 Explanation:

  • Use & for AND and | for OR conditions.

  • Always wrap each condition in parentheses.
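A sketch using a hypothetical employee table (the same columns the exercises below assume):

import pandas as pd

df = pd.DataFrame({
    "Name": ["Asha", "Meera", "Vikram", "Rohan", "Maya"],
    "Department": ["Sales", "Finance", "Sales", "HR", "Finance"],
    "Experience": [2, 5, 7, 3, 4],
    "Salary": [45000, 70000, 65000, 50000, 58000],
})

# AND: both conditions must hold, each wrapped in parentheses
print(df[(df["Experience"] > 3) & (df["Salary"] < 60000)])

# OR and NOT
print(df[(df["Department"] == "Sales") | (df["Experience"] < 3)])
print(df[~(df["Department"] == "HR")])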


📊 2. Grouping Data with Aggregation#

Grouping helps summarize large datasets into meaningful statistics. The groupby() function is one of the most powerful tools in Pandas.

💡 Explanation:

  • You can compute multiple aggregations for each group.

  • The result is often a MultiIndex DataFrame, which can be flattened later for easier access.
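For example, with the table above:

# Several aggregations per department in one call
print(df.groupby("Department")["Salary"].agg(["mean", "max", "count"]))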


🪄 3. Creating Pivot Tables#

Pivot tables allow you to summarize and reshape data interactively, similar to Excel.

💡 Explanation:

  • index defines rows, columns defines columns, and values defines what to aggregate.

  • You can specify aggregation functions like mean, sum, or even custom functions.
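A sketch:

# Average salary by Department (rows) and Experience (columns)
print(df.pivot_table(values="Salary", index="Department",
                     columns="Experience", aggfunc="mean"))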


➕ 4. Adding Calculated Columns#

You can create new columns derived from existing ones. This is essential for analysis and feature engineering.

💡 Explanation:

  • You can perform vectorized operations directly on DataFrame columns.

  • Avoid loops — Pandas operations are faster and cleaner.
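For example:

# Vectorized column math (no loops needed)
df["Tax"] = df["Salary"] * 0.10
df["Net_Salary"] = df["Salary"] - df["Tax"]
print(df[["Name", "Salary", "Tax", "Net_Salary"]])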


🧪 Practice Exercises#

Try the following tasks using your dataset:

  1. Filter records where Experience > 3 and Salary < 60000.

  2. Group data by Department and calculate both average and maximum salaries.

  3. Create a pivot table showing average salary by Department and Experience.

  4. Add a column Tax = 10% of Salary, and compute Net Salary = Salary - Tax.

  5. Find the top 3 employees with the highest total pay (combine with nlargest()).

  6. For each department, display the employee with maximum experience.

  7. Filter all employees whose names start with 'A' or 'M'.

  8. Add a calculated column for Years Until Retirement assuming retirement age is 60.

  9. Identify the lowest paid employee in each department (combine with nsmallest()).

  10. Create a report summarizing total salary and average experience by department.



🏆 Identifying Top-Performing Records#

Once you’ve cleaned, indexed, and organized your dataset, the next step is often to find the best (or worst) performers — the top customers, highest-grossing products, or most efficient employees.

Pandas provides multiple efficient methods:

  1. sort_values() — Order DataFrame by column(s)

  2. nlargest() / nsmallest() — Get top/bottom N records directly

  3. Grouping + sorting — Find top performers within groups


1. Sorting Values#

Use .sort_values() to arrange data based on one or more columns.
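A sketch with a hypothetical product table (values chosen to match the outputs below):

import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop", "Tablet", "Phone", "Camera", "Monitor"],
    "Revenue": [80000, 30000, 60000, 20000, 40000],
    "Profit": [20000, 8000, 15000, 4000, 7000],
})

# Highest revenue first
print(df.sort_values(by="Revenue", ascending=False))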

Output:

  Product  Revenue  Profit
0  Laptop    80000   20000
2   Phone    60000   15000
4 Monitor    40000    7000
1  Tablet    30000    8000
3  Camera    20000    4000

You can also sort by multiple columns:
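For example, breaking revenue ties with profit (same hypothetical df):

print(df.sort_values(by=["Revenue", "Profit"], ascending=[False, False]))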


🎯 2. Getting Top N Records — .nlargest()#

When you only care about the top few performers, .nlargest() is faster and simpler than sorting.
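Continuing with the same df:

# Top 3 by revenue, no full sort required
print(df.nlargest(3, "Revenue"))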

Output:

  Product  Revenue  Profit
0  Laptop    80000   20000
2   Phone    60000   15000
4 Monitor    40000    7000

Similarly, .nsmallest() gives you the lowest-performing records:
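Here by profit (matching the output below):

print(df.nsmallest(2, "Profit"))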

Output:

  Product  Revenue  Profit
3  Camera    20000    4000
4 Monitor    40000    7000

🧩 3. Grouping + Sorting — Find Top in Each Group#

Find the top-performing record within each category.
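A sketch with a hypothetical category table (values chosen to match the output below):

df_cat = pd.DataFrame({
    "Category": ["Electronics", "Electronics", "Electronics", "Furniture", "Clothing"],
    "Product": ["Laptop", "Tablet", "Phone", "Desk", "Jacket"],
    "Revenue": [80000, 30000, 60000, 25000, 15000],
    "Profit": [20000, 8000, 15000, 5000, 3000],
})

# Sort once, then keep the first (= top) row of each category
print(df_cat.sort_values("Revenue", ascending=False).groupby("Category").head(1))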

Output:

      Category Product  Revenue  Profit
0  Electronics  Laptop    80000   20000
3    Furniture    Desk    25000    5000
4     Clothing  Jacket    15000    3000

⚡ 4. Using idxmax() and idxmin()#

Quickly get the index of the top or bottom record:
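For example, with the product table above:

# The full row holding the highest revenue
print(df.loc[df["Revenue"].idxmax()])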

Output:

Product    Laptop
Revenue     80000
Profit      20000
Name: 0, dtype: object

🧠 Summary — Identifying Top Performers#

| Method | Purpose | Notes |
|---|---|---|
| sort_values() | Sort full DataFrame | Use for ordered reports |
| nlargest() / nsmallest() | Get top or bottom N records | Fast and simple |
| groupby().head() | Top N within each group | Great for categorical comparisons |
| idxmax() / idxmin() | Get best record index | Works on single columns |


📚 Complete Practice Lab — Data Detective Challenge#

Now that you’ve mastered EDA, indexing, grouping, and identifying top performers, it’s time to put it all together in a comprehensive analysis challenge.


🎯 Challenge: Company Sales Analysis#

You have access to a dataset called company_sales.csv with the following structure:

| Column | Description |
|---|---|
| Date | Transaction date |
| Region | Sales region |
| Product_Category | Category of product sold |
| Quantity | Units sold |
| Price_per_Unit | Unit price |
| Sales_Amount | Total sales (Quantity × Price_per_Unit) |
| Marketing_Spend | Marketing spend for that transaction |
| Customer_Tier | Customer segment (Gold/Silver/Bronze) |
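A loading sketch to get started (the file name comes from the challenge; parsing Date is an assumption worth making for the bonus tasks):

import pandas as pd

df = pd.read_csv("company_sales.csv", parse_dates=["Date"])
print(df.info())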


📋 Analysis Tasks#

Use all the techniques you’ve learned to answer these questions:

Part 1: Data Exploration#

  1. Load the dataset and display basic info (shape, columns, data types)

  2. Generate summary statistics for all numeric columns

  3. Check for any missing values or outliers

  4. Create a correlation matrix and visualize it

Part 2: Univariate & Bivariate Analysis#

  1. Create a histogram of Sales_Amount — is it normally distributed?

  2. Create a boxplot of Sales_Amount by Region — which region performs best?

  3. Create a scatter plot: Marketing_Spend vs Sales_Amount — is there a relationship?

  4. Analyze which Product_Category has the highest average sales

Part 3: Indexing & Filtering#

  1. Filter for transactions with Sales_Amount > $50,000 — how many records?

  2. Select only Gold-tier customers in the North region using .loc[]

  3. Find all transactions from the top 3 months by sales using indexing

Part 4: Grouping & Aggregation#

  1. Group by Region and calculate:

    • Average Sales Amount

    • Total Marketing Spend

    • Count of transactions

  2. Create a pivot table showing average Sales Amount by Region and Customer_Tier

  3. Group by Product_Category and find the best-performing category by total revenue

Part 5: Top Performers & Calculated Columns#

  1. Add a calculated column: ROI = (Sales_Amount - Marketing_Spend) / Marketing_Spend * 100

  2. Find the top 5 transactions with the highest ROI

  3. For each region, identify the top 3 performing days (by total sales)

  4. Find the bottom 2 performing product categories by average profit margin

Part 6: Advanced Insights#

  1. Create a multi-level index: ['Region', 'Customer_Tier'] and calculate total sales

  2. Use .xs() to extract all sales for a specific tier across regions


🏆 Bonus Challenges#

  • Visualization: Create a dashboard with 4 plots showing key metrics

  • Time Series: If Date is included, plot sales trends over time

  • Segmentation: Identify high-value customers (top 10% by sales)

  • Forecasting: Predict next month’s total sales based on current trends


💡 Key Takeaways#

By completing this lab, you’ll have practiced:

  • Data Loading & Exploration — Understanding your dataset

  • Univariate Analysis — Single variable patterns

  • Bivariate Analysis — Relationships between variables

  • Filtering & Indexing — Extracting relevant subsets

  • Grouping & Aggregation — Summarizing by categories

  • Pivot Tables — Multi-dimensional views

  • Top Performers — Ranking and sorting

  • Feature Engineering — Creating new insights


📖 Recap Table#

| Topic | Key Functions | Use Case |
|---|---|---|
| Summary Stats | .describe(), .info() | Understand data distribution |
| Visualization | sns.histplot(), sns.scatterplot() | Visual pattern recognition |
| Filtering | df[df['col'] > value] | Extract subsets |
| Indexing | .loc[], .iloc[], .set_index() | Efficient data access |
| Grouping | .groupby().agg() | Categorical summaries |
| Pivoting | .pivot_table() | Multi-dimensional views |
| Ranking | .nlargest(), .sort_values() | Top/bottom performers |
| Features | Calculated columns | New derived metrics |


🎓 You’re now a data detective! Go forth and uncover business stories hidden in your data! 🔍


📊 Data Inspection Basics#

Viewing Data#

Use .head() and .tail() to view the top and bottom rows of the DataFrame:

import numpy as np
import pandas as pd

# Create sample dataframe
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

print("First 5 rows:")
print(df.head())

print("\nLast 3 rows:")
df.tail(3)

Inspecting Index and Columns#

print("Index:")
print(df.index)

print("\nColumns:")
df.columns

Converting to NumPy Array#

Return a NumPy representation of the underlying data:

df.to_numpy()

Data Types#

When working with mixed data types, pandas will find a common dtype:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

print("Data types:")
print(df2.dtypes)

print("\nConverted to numpy:")
df2.to_numpy()

📈 Summary Statistics#

The .describe() method provides a quick statistical summary of your data:

df.describe()

Transposing Data#

df.T

🔄 Sorting Data#

Sort by Index#

df.sort_index(axis=1, ascending=False)

Sort by Values#

df.sort_values(by="B")

🎯 Data Selection#

Basic Column Selection#

Selecting a single column returns a Series:

df["A"]

Multiple Column Selection#

df[["B", "A"]]

Row Slicing#

print("First 3 rows:")
print(df[0:3])

print("\nDate range slice:")
df["20130102":"20130104"]

Selection by Label (.loc)#

Use .loc[] for label-based indexing:

# Select a single row
print("Single row:")
print(df.loc[dates[0]])

print("\nSelect rows and columns:")
print(df.loc[:, ["A", "B"]])

print("\nSlice with labels (both endpoints included):")
df.loc["20130102":"20130104", ["A", "B"]]

Selection by Position (.iloc)#

Use .iloc[] for integer position-based indexing:

# Select by position
print("4th row:")
print(df.iloc[3])

print("\nSlice rows and columns:")
print(df.iloc[3:5, 0:2])

print("\nSpecific positions:")
df.iloc[[1, 2, 4], [0, 2]]

Boolean Indexing#

Filter data based on conditions:

# Filter where column A is greater than 0
df[df["A"] > 0]

Using .isin() for Filtering#

df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
print("DataFrame with new column:")
print(df2)

print("\nFiltered data:")
df2[df2["E"].isin(["two", "four"])]

📊 Statistical Operations#

Computing Statistics#

Operations generally exclude missing data:

# Mean for each column
print("Mean by column:")
print(df.mean())

print("\nMean by row:")
df.mean(axis=1)

Operations with Different Indices#

When operating with objects that have different indices, pandas will align them:

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
print("Shifted series:")
print(s)

print("\nSubtraction with alignment:")
df.sub(s, axis="index")

User-Defined Functions#

Apply custom functions using .agg() and .transform():

# Aggregation
print("Aggregation with custom function:")
print(df.agg(lambda x: np.mean(x) * 5.6))

print("\nTransformation:")
df.transform(lambda x: x * 101.2)

🎲 Value Counts#

Value counts provide frequency distribution of values:

s = pd.Series(np.random.randint(0, 7, size=10))
print("Sample series:")
print(s)

print("\nValue counts:")
s.value_counts()

👥 Grouping Data#

Group-by operations allow you to split data, apply a function, and combine results:

df_group = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
print("Sample DataFrame:")
df_group

Grouping and Aggregating#

# Group by column A and sum
df_group.groupby("A")[["C", "D"]].sum()

Grouping by Multiple Columns#

df_group.groupby(["A", "B"]).sum()

🔄 Reshaping Data#

Stack and Unstack#

Stack “compresses” a level in the DataFrame’s columns:

arrays = [
   ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ["one", "two", "one", "two", "one", "two", "one", "two"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
df_stack = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2_stack = df_stack[:4]
print("Original DataFrame:")
df2_stack

Stacking#

stacked = df2_stack.stack()
stacked

Unstacking#

print("Unstack (default - last level):")
print(stacked.unstack())

print("\nUnstack level 1:")
print(stacked.unstack(1))

print("\nUnstack level 0:")
stacked.unstack(0)

Pivot Tables#

Pivot tables provide a way to create cross-tabulations:

df_pivot = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 3,
        "B": ["A", "B", "C"] * 4,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "D": np.random.randn(12),
        "E": np.random.randn(12),
    }
)
print("Sample DataFrame:")
df_pivot

Creating Pivot Table#

pd.pivot_table(df_pivot, values="D", index=["A", "B"], columns=["C"])