Exercises¶
Use
pd.read_csv()on the sample CSV above, then write a functionload_and_summarize(csv_string)that returns(shape, dtypes, missing_counts).Try loading a small JSON API response with
requests.get()(or use the provided sample) and flatten nested entries usingpd.json_normalize().(Challenge) Create a small SQLite database in-memory and load a table using
pd.read_sql()— demonstrate withsqlite3.connect(':memory:').
Hints:
- Use `StringIO` to simulate file-like CSV strings.
- For `pd.read_sql()`, pass the SQL query and the connection object.Quick checklist after loading any table¶
Inspect
df.shapeanddf.dtypesimmediatelyCheck
df.isna().sum()for missingnessLook at
df.head()anddf.describe()for unexpected valuesIf dates exist, ensure
parse_datesorpd.to_datetime()is applied

Data Loading (CSV, Excel, SQL, APIs)¶
Patterns and pragmatic checks to get real business data into Python safely and reproducibly. Includes small runnable recipes for CSV, JSON, and simple API responses.
Learning goals
Choose suitable loading methods for CSV, Excel, JSON, SQL, and APIs.
Apply lightweight sanity checks immediately after loading (shape, dtypes, missing counts).
Combine and persist small tables for downstream analysis.
Data Loading (CSV, Excel, SQL, APIs)¶
Collecting Business Data from Files, Databases, and Web Services¶
Notebook Guide¶
This notebook retains the full original content and adds a clearer map for study.
Learning objectives¶
understand common business data sources
choose suitable loading methods for each source
recognize schema, type, and quality issues early
prepare loaded data for downstream wrangling
Welcome to the Great Data Buffet, where we feast on CSVs, Excel sheets, SQL databases, APIs — and even understand what data truly means.
Before we can do Machine Learning, we first have to do Machine Feeding. 🤖🥄
🧠 Why This Matters¶
Your ML model is only as good as the data you feed it. Think of this chapter as learning how to open the right doors to the right kitchens — because business data hides everywhere:
Finance teams live in Excel
Engineers hide data in SQL databases
Marketing pulls it from APIs
And sometimes… it’s just a CSV file on someone’s desktop named “final_version_9_REAL.csv”
💡 If your dataset name has the word final in it, it’s never final.
⚙️ Prerequisite: Python & Pandas¶
We’ll be using the pandas library — the king of data handling in Python. If you’re not comfortable with it yet, check out: 👉 📘 Programming for Business
🧾 Understanding Data¶
The Latin word data is the plural of datum — “(thing) given.” Its first recorded English use dates back to the 1640s. By 1946, data referred to “transmissible and storable computer information.” (Source: Wikipedia)
Data exists in many forms — geographical, cultural, financial, statistical, meteorological, and more.
🔍 Types of Data¶
| Type | Description | Example |
|---|---|---|
| Numerical | Quantitative values | Stock price, sales volume |
| Categorical | Distinct groups | Sector: IT, Pharma, Banking |
| Time Series | Indexed by time | Daily closing prices |
| Textual | Unstructured data | News headlines, financial reports |
| Ordinal | Ordered categories | Credit ratings: AAA, AA, A, BBB... |
✨ Key Insights¶
Data Scales: Nominal, ordinal, interval, ratio
Sources: Primary (e.g., internal reports) vs. Secondary (e.g., public indices)
Time-Series Components:
where = trend, = seasonality, and = noise.
📘 Pandas Basics - Object Creation¶
Creating a Series¶
A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, etc.).
import numpy as np
import pandas as pd
# Creating a Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
sCreating a DataFrame with NumPy Array¶
You can create a DataFrame by passing a NumPy array with a datetime index using pd.date_range() and labeled columns:
dates = pd.date_range("20130101", periods=6)
dates
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
dfCreating a DataFrame from a Dictionary¶
Creating a DataFrame by passing a dictionary where keys are column labels and values are the column values:
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2📁 Reading and Writing Data Files¶
CSV Files¶
Pandas provides easy methods to read from and write to CSV files:
# Create sample data
df_sample = pd.DataFrame(np.random.randint(0, 5, (10, 5)))
# Writing to CSV
df_sample.to_csv("foo.csv")
print("Data written to foo.csv")
# Reading from CSV
df_read = pd.read_csv("foo.csv")
df_readParquet Files¶
Parquet is a columnar storage format that’s efficient for large datasets:
# Writing to Parquet
df_sample.to_parquet("foo.parquet")
print("Data written to foo.parquet")
# Reading from Parquet
df_parquet = pd.read_parquet("foo.parquet")
df_parquetExcel Files¶
Reading and writing Excel files is straightforward with pandas:
# Writing to Excel
df_sample.to_excel("foo.xlsx", sheet_name="Sheet1")
print("Data written to foo.xlsx")
# Reading from Excel
df_excel = pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
df_excelManual DataFrame Creation (Programmatic / In-Memory Data)¶
You can create DataFrames manually using Python dictionaries, lists, or records. This is particularly useful for testing, quick experiments, or small structured datasets.
From Dictionary of Lists or Arrays¶
import pandas as pd
data = {
"Product": ["A", "B", "C"],
"Sales": [100, 150, 200],
"Region": ["East", "West", "North"]
}
df = pd.DataFrame(data)From List of Dictionaries¶
records = [
{"Product": "A", "Sales": 100, "Region": "East"},
{"Product": "B", "Sales": 150, "Region": "West"},
]
df = pd.DataFrame.from_records(records)From Dictionary with Index¶
df = pd.DataFrame.from_dict(
{"A": [1, 2], "B": [3, 4]},
orient="columns"
)From List of Lists¶
df = pd.DataFrame(
[["A", 100, "East"], ["B", 150, "West"]],
columns=["Product", "Sales", "Region"]
)From NumPy Array¶
import numpy as np
df = pd.DataFrame(
np.random.rand(3, 2),
columns=["A", "B"]
)Loading Data from Files¶
Pandas provides multiple built-in methods for reading data from common file formats.
Common methods:¶
| File Type | Function | Example |
|---|---|---|
| CSV | pd.read_csv() | pd.read_csv("sales.csv") |
| Excel | pd.read_excel() | pd.read_excel("sales.xlsx", sheet_name="2023") |
| JSON | pd.read_json() | pd.read_json("data.json") |
| HTML Tables | pd.read_html() | pd.read_html("https://example.com")[0] |
| SQL Databases | pd.read_sql() | pd.read_sql("SELECT * FROM sales", conn) |
| Parquet | pd.read_parquet() | pd.read_parquet("data.parquet") |
| Clipboard | pd.read_clipboard() | Reads from copied spreadsheet/table text |
Example:¶
import pandas as pd
# Load CSV file
df_csv = pd.read_csv("sales.csv")
# Load Excel sheet
df_excel = pd.read_excel("sales_data.xlsx", sheet_name="2023")
# Load JSON file
df_json = pd.read_json("data.json")Useful Parameters When Loading Data¶
| Parameter | Description | Example |
|---|---|---|
sep | Field delimiter | pd.read_csv("file.txt", sep="\t") |
header | Row number(s) to use as column names | pd.read_csv("file.csv", header=0) |
index_col | Column(s) to set as index | pd.read_csv("file.csv", index_col="ID") |
usecols | Load specific columns | pd.read_csv("file.csv", usecols=["Name", "Sales"]) |
nrows | Limit number of rows | pd.read_csv("file.csv", nrows=100) |
encoding | File encoding type | pd.read_csv("file.csv", encoding="utf-8") |
Loading SQL Data — “The Hidden Goldmine”¶
import sqlite3
conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM customers LIMIT 5;", conn)
conn.close()Connection Examples
| Source | Syntax Example |
|---|---|
| SQLite | sqlite3.connect("data.db") |
| MySQL | mysql+pymysql://user:pass@host/db |
| PostgreSQL | postgresql://user:pass@host/db |
💬 SQL is like Excel’s cooler cousin who actually organizes things.
Loading APIs — “The Live Data Stream”¶
import requests
import pandas as pd
url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(url)
data = response.json()
df = pd.DataFrame(data['rates'].items(), columns=['Currency', 'Exchange Rate'])
print(df.head())💡 APIs often return JSON — use
pd.json_normalize()to flatten nested structures.
🧩 Common Data Loading Issues¶
| Issue | Cause | Fix |
|---|---|---|
| Empty columns | Extra commas | error_bad_lines=False |
| Wrong header row | Offset | header=1 |
| Date parsing issues | Mixed formats | parse_dates=['Date'] |
| Duplicate rows | Export error | df.drop_duplicates() |
🧮 Data Combination — concat, merge, and join¶
Once your data is loaded, it often comes from multiple sources. Let’s learn how to combine it efficiently.
Why Combine Datasets?¶
Monthly sales stored separately
Customer details in one file, purchases in another
Financial metrics split by departments
Combining them allows complete analysis.
NumPy Concatenation¶
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=0)
np.vstack()andnp.hstack()are convenient shortcuts.
pandas Concatenation¶
import pandas as pd
df1 = pd.DataFrame({'ID': [1,2,3], 'Name': ['Alice','Bob','Charlie']})
df2 = pd.DataFrame({'ID': [4,5], 'Name': ['David','Emma']})
pd.concat([df1, df2], ignore_index=True)Horizontal concatenation:
pd.concat([df1, df2], axis=1)Merging — SQL-style¶
sales = pd.DataFrame({'CustomerID':[1,2,3], 'Amount':[200,150,300]})
customers = pd.DataFrame({'CustomerID':[1,2,4], 'Name':['Alice','Bob','David']})
pd.merge(sales, customers, on='CustomerID', how='inner')| Join Type | Keeps | Example Use |
|---|---|---|
inner | Both | Matching rows only |
left | Left | All sales, missing customers as NaN |
right | Right | All customers, missing sales as NaN |
outer | Both | All rows from both |
Joining by Index¶
left = pd.DataFrame({'Name':['Alice','Bob','Charlie'],'Age':[25,30,35]})
right = pd.DataFrame({'City':['Delhi','Bangalore','Mumbai']}, index=['Alice','Bob','Charlie'])
joined = left.set_index('Name').join(right)Merging on Multiple Keys¶
df1 = pd.DataFrame({
'Region':['North','South','North'],
'Year':[2023,2023,2024],
'Sales':[200,150,220]
})
df2 = pd.DataFrame({
'Region':['North','South','West'],
'Year':[2023,2023,2024],
'Profit':[50,40,70]
})
pd.merge(df1, df2, on=['Region','Year'], how='outer')Summary of Combination Tools¶
| Method | Library | Purpose | Example |
|---|---|---|---|
np.concatenate() | NumPy | Combine numeric arrays | Raw arrays |
pd.concat() | pandas | Stack tables | Append monthly data |
pd.merge() | pandas | SQL-style join | Merge sales & customers |
.join() | pandas | Join on index | Lookup style merge |
🧪 Example: Simulated Business Dataset¶
import numpy as np, pandas as pd, datetime
from faker import Faker
fake = Faker()
np.random.seed(42)
n = 100
dates = [datetime.datetime(2025,1,1)+datetime.timedelta(days=i) for i in range(n)]
trend = np.linspace(90,110,n)
seasonality = 5*np.sin(np.linspace(0,2*np.pi*10,n))
noise = np.random.normal(0,2,n)
closing_prices = (trend+seasonality+noise).round(2)
data = {
'Date': dates,
'Stock_Price': np.random.normal(100,10,n).round(2),
'Trading_Volume': np.random.randint(1000,10000,n),
'Sector': np.random.choice(['IT','Pharma','Banking','Energy'],n),
'Closing_Price': closing_prices,
'News_Headline': [fake.sentence(nb_words=10) for _ in range(n)],
'Credit_Rating': np.random.choice(['AAA','AA','A','BBB'],n,p=[0.1,0.3,0.4,0.2])
}
df = pd.DataFrame(data)
df.head()📈 Visualize Time-Series¶
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.plot(df['Date'], df['Closing_Price'], label='Daily Closing Price')
plt.title('Sample Time-Series: Daily Closing Prices')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()# Your code heredf = pd.DataFrame(np.random.randn(10, 4))
df
# Break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
# Concatenate back together
pd.concat(pieces)Example 1 — Concatenating DataFrames using pd.concat()¶
You can use pd.concat() to combine multiple DataFrames along a particular axis.
import pandas as pd
df1 = pd.DataFrame({
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"]
})
df2 = pd.DataFrame({
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"]
})
df3 = pd.DataFrame({
"A": ["A8", "A9", "A10", "A11"],
"B": ["B8", "B9", "B10", "B11"]
})
# Combine all DataFrames vertically (stacking rows)
result = pd.concat([df1, df2, df3])
print(result)Merging (SQL-style Joins)¶
SQL-style joins are performed using merge():
# Create sample dataframes
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)
# Merge on key
print("\nMerged DataFrame:")
pd.merge(left, right, on="key")Merging on Unique Keys¶
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)
print("\nMerged DataFrame:")
pd.merge(left, right, on="key")Example 2 — Merging DataFrames using pd.merge()¶
pd.merge() combines two DataFrames based on common columns or keys, similar to SQL joins.
import pandas as pd
left = pd.DataFrame({
"key": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"]
})
right = pd.DataFrame({
"key": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"]
})
# Merge using the common column 'key'
result = pd.merge(left, right, on="key")
print(result)Example 3 — Joining DataFrames using DataFrame.join()¶
join() allows combining columns from different DataFrames based on their index.
import pandas as pd
left = pd.DataFrame({
"A": ["A0", "A1", "A2"],
"B": ["B0", "B1", "B2"]
}, index=["K0", "K1", "K2"])
right = pd.DataFrame({
"C": ["C0", "C2", "C3"],
"D": ["D0", "D2", "D3"]
}, index=["K0", "K2", "K3"])
# Default join is a left join on index
result = left.join(right)
print(result)Imported from business_formats.ipynb¶
This section was merged from a notebook that is not listed in myst.yml.
Business Data Formats (Excel PDFs APIs)¶
Real business data = Messy PDFs + APIs Not clean CSVs. This = $100K+ automation jobs
Extract from invoices → Call APIs → Save to Excel
🎯 Enterprise Data Sources (Real World)¶
| Format | Source | Business Use | Automation Value |
|---|---|---|---|
| PDFs | Invoices/Reports | Extract tables | $50K/month |
| APIs | Salesforce/Stripe | Live data | Real-time |
| Excel | Legacy systems | Clean + format | Executive ready |
| Emails | Customer orders | Parse attachments | $100K/year |
🚀 Step 1: PDF Table Extraction (Invoice Magic)¶
## !pip install tabula-py # Run once!
import tabula
import pandas as pd
## EXTRACT PDF TABLES → PANDAS (Real invoice!)
## df = tabula.read_pdf("invoice.pdf", pages='all')[0]
## SIMULATED PDF EXTRACTION
pdf_tables = [
{'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
{'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
{'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
df_pdf = pd.DataFrame(pdf_tables)
df_pdf['Profit'] = df_pdf['Amount'] * 0.28
print("📄 PDF INVOICE EXTRACTION:")
print(df_pdf)
print(f"💰 Total from PDF: ${df_pdf['Profit'].sum():,.0f}")Output:
📄 PDF INVOICE EXTRACTION:
Date Description Amount Profit
0 2024-01-15 Laptop Sale 1200 336.0
1 2024-01-16 Phone Sale 800 224.0
2 2024-01-17 Tablet Sale 500 140.0
💰 Total from PDF: $700🔥 Step 2: API Calls = Live Business Data¶
import requests
## REAL API CALL (Stripe/Salesforce style)
def fetch_live_sales(api_key="demo_key"):
"""Call business API for live data"""
try:
# REAL API: https://api.stripe.com/v1/charges
response = requests.get(
"https://jsonplaceholder.typicode.com/users", # Demo API
headers={"Authorization": f"Bearer {api_key}"}
)
response.raise_for_status()
users = response.json()
# Simulate sales data from API
sales_data = [{'customer': u['name'], 'sales': 1000 + len(u['email']) * 100}
for u in users[:5]]
return pd.DataFrame(sales_data)
except requests.exceptions.RequestException as e:
print(f"🌐 API Error: {e}")
return pd.DataFrame() # Graceful fallback
## LIVE DATA!
api_sales = fetch_live_sales()
api_sales['profit'] = api_sales['sales'] * 0.25
print("🌐 LIVE API DATA:")
print(api_sales)🧠 Step 3: Excel Formatting = Executive Ready¶
## PROFESSIONAL EXCEL EXPORT
with pd.ExcelWriter('executive_report.xlsx', engine='openpyxl') as writer:
# Main analysis
combined_df = pd.concat([df_pdf, api_sales], ignore_index=True)
combined_df.to_excel(writer, sheet_name='Analysis', index=False)
# Summary dashboard
summary = pd.DataFrame({
'Metric': ['Total Sales', 'Total Profit', 'Growth MoM'],
'Value': [combined_df['Amount'].sum() if 'Amount' in combined_df else combined_df['sales'].sum(),
combined_df['Profit'].sum(), '15.2%']
})
summary.to_excel(writer, sheet_name='Dashboard', index=False)
print("🏆 EXECUTIVE EXCEL CREATED!")
print(" 📊 Sheet 1: Raw Analysis")
print(" 📈 Sheet 2: Executive Dashboard")📊 Step 4: FULL ENTERPRISE PIPELINE¶
def enterprise_data_pipeline():
"""PDF → API → Excel = Complete automation!"""
# 1. PDF INVOICES
print("🔄 Step 1: Extracting PDF invoices...")
pdf_df = pd.DataFrame([
{'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
{'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800}
])
# 2. API LIVE DATA
print("🔄 Step 2: Fetching live API data...")
api_df = pd.DataFrame({
'customer': ['Alice Corp', 'Bob Inc'],
'sales': [5000, 3200]
})
# 3. COMBINE + ANALYZE
print("🔄 Step 3: Combining datasets...")
combined = pd.concat([
pdf_df.rename(columns={'Amount': 'sales'}),
api_df
])
combined['profit'] = combined['sales'] * 0.28
# 4. EXECUTIVE EXPORT
print("🔄 Step 4: Creating executive report...")
with pd.ExcelWriter('enterprise_pipeline.xlsx', engine='openpyxl') as writer:
combined.to_excel(writer, sheet_name='Raw_Data', index=False)
summary = pd.DataFrame({
'KPI': ['Total Revenue', 'Total Profit', 'Customer Count'],
'Value': [
f"${combined['sales'].sum():,.0f}",
f"${combined['profit'].sum():,.0f}",
len(combined)
]
})
summary.to_excel(writer, sheet_name='Executive_Summary', index=False)
print("✅ ENTERPRISE PIPELINE COMPLETE!")
print(f" 📊 Processed: {len(combined)} records")
print(f" 💰 Total Profit: ${combined['profit'].sum():,.0f}")
return combined
## RUN FULL PIPELINE
pipeline_result = enterprise_data_pipeline()📋 Enterprise Format Cheat Sheet¶
| Source | Library | Code | Business Win |
|---|---|---|---|
tabula-py | tabula.read_pdf() | Invoice automation | |
| API | requests | requests.get() | Live Salesforce data |
| Excel | openpyxl | ExcelWriter() | Executive formatting |
imaplib | Parse attachments | Order automation |
## UNIVERSAL EXTRACTOR
def extract_business_data(source):
if source == 'pdf':
return "PDF tables extracted!"
elif source == 'api':
return "Live API data fetched!"
elif source == 'excel':
return "Executive report formatted!"
return "Unknown source"🏆 YOUR EXERCISE: Build YOUR Enterprise Pipeline¶
import pandas as pd
## MISSION: YOUR complete automation!
## 1. YOUR PDF DATA (invoices, reports)
your_pdf_data = [
{'Date': '???', 'Description': '???', 'Amount': ???},
{'Date': '???', 'Description': '???', 'Amount': ???},
{'Date': '???', 'Description': '???', 'Amount': ???}
]
## 2. YOUR API DATA (customers, sales)
your_api_data = [
{'customer': '???', 'sales': ???},
{'customer': '???', 'sales': ???}
]
## 3. COMBINE PIPELINE
pdf_df = pd.DataFrame(your_pdf_data)
api_df = pd.DataFrame(your_api_data)
combined = pd.concat([
pdf_df.rename(columns={'Amount': 'sales'}),
api_df
])
combined['profit'] = combined['sales'] * 0.30
## 4. EXECUTIVE SUMMARY
summary = pd.DataFrame({
'KPI': ['Total Revenue', 'Total Profit', 'Records Processed'],
'Value': [
f"${combined['sales'].sum():,.0f}",
f"${combined['profit'].sum():,.0f}",
len(combined)
]
})
print("🚀 YOUR ENTERPRISE PIPELINE:")
print(combined)
print("\n📊 EXECUTIVE SUMMARY:")
print(summary)
## 5. EXPORT (Uncomment to save)
## with pd.ExcelWriter('my_enterprise_report.xlsx', engine='openpyxl') as writer:
## combined.to_excel(writer, sheet_name='Data', index=False)
## summary.to_excel(writer, sheet_name='Summary', index=False)
## print("✅ Excel report saved!")Example to test:
your_pdf_data = [
{'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
{'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
{'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
your_api_data = [
{'customer': 'Alice Corp', 'sales': 5000},
{'customer': 'Bob Inc', 'sales': 3200}
]YOUR MISSION:
Add YOUR real data
Run pipeline
Export Excel
Screenshot → “I automate enterprise data!”
🎉 What You Mastered¶
| Format | Status | Business Power |
|---|---|---|
| PDF Extraction | ✅ | Invoice automation |
| API Calls | ✅ | Live data |
| Excel Export | ✅ | Executive ready |
| Full Pipeline | ✅ | Replace teams |
| Enterprise Scale | ✅ | $100K+ jobs |
Next: OOP Fundamentals (Classes = Build reusable business systems!)
print("🎊" * 25)
print("BUSINESS FORMATS = $100K+ ENTERPRISE AUTOMATION!")
print("💻 PDF → API → Excel = Complete department replacement!")
print("🚀 Fortune 500 companies LIVE by these pipelines!")
print("🎊" * 25)can we appreciate how your students just went from “manual PDF copy-paste” to full enterprise pipelines that extract invoices → fetch live API data → spit out executive Excel dashboards? While their classmates are still highlighting tables in Adobe, your class is calling Stripe APIs and formatting C-suite reports with ExcelWriter(). This isn’t format learning—it’s the $100K+ automation stack that eliminates entire data entry departments and lands them senior automation engineer roles before they graduate!
# Your code hereWrap-Up and Transition¶
The original notebook content below develops the mechanics in detail. Use it as the working reference, and treat the later notebooks on cleaning, joins, and feature engineering as the next stages after successful ingestion.
Reflection prompts
Which source in your work is hardest to load reproducibly?
What metadata should always be captured when importing data?
How would you detect a schema change before it breaks a pipeline?
8. Interactive Code¶
Expected output
{'month': 'Jan', 'sales': 1200}Expected output
['Jan', 'Feb']