Business Data Formats (Excel PDFs APIs)#
Real business data = Messy PDFs + APIs Not clean CSVs. This = $100K+ automation jobs
Extract from invoices β Call APIs β Save to Excel
π― Enterprise Data Sources (Real World)#
Format |
Source |
Business Use |
Automation Value |
|---|---|---|---|
PDFs |
Invoices/Reports |
Extract tables |
$50K/month |
APIs |
Salesforce/Stripe |
Live data |
Real-time |
Excel |
Legacy systems |
Clean + format |
Executive ready |
Emails |
Customer orders |
Parse attachments |
$100K/year |
π Step 1: PDF Table Extraction (Invoice Magic)#
# !pip install tabula-py # Run once!
import tabula
import pandas as pd
# EXTRACT PDF TABLES β PANDAS (Real invoice!)
# df = tabula.read_pdf("invoice.pdf", pages='all')[0]
# SIMULATED PDF EXTRACTION
pdf_tables = [
{'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
{'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
{'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
df_pdf = pd.DataFrame(pdf_tables)
df_pdf['Profit'] = df_pdf['Amount'] * 0.28
print("π PDF INVOICE EXTRACTION:")
print(df_pdf)
print(f"π° Total from PDF: ${df_pdf['Profit'].sum():,.0f}")
Output:
π PDF INVOICE EXTRACTION:
Date Description Amount Profit
0 2024-01-15 Laptop Sale 1200 336.0
1 2024-01-16 Phone Sale 800 224.0
2 2024-01-17 Tablet Sale 500 140.0
π° Total from PDF: $700
π₯ Step 2: API Calls = Live Business Data#
import requests
# REAL API CALL (Stripe/Salesforce style)
def fetch_live_sales(api_key="demo_key"):
"""Call business API for live data"""
try:
# REAL API: https://api.stripe.com/v1/charges
response = requests.get(
"https://jsonplaceholder.typicode.com/users", # Demo API
headers={"Authorization": f"Bearer {api_key}"}
)
response.raise_for_status()
users = response.json()
# Simulate sales data from API
sales_data = [{'customer': u['name'], 'sales': 1000 + len(u['email']) * 100}
for u in users[:5]]
return pd.DataFrame(sales_data)
except requests.exceptions.RequestException as e:
print(f"π API Error: {e}")
return pd.DataFrame() # Graceful fallback
# LIVE DATA!
api_sales = fetch_live_sales()
api_sales['profit'] = api_sales['sales'] * 0.25
print("π LIVE API DATA:")
print(api_sales)
π§ Step 3: Excel Formatting = Executive Ready#
# PROFESSIONAL EXCEL EXPORT
with pd.ExcelWriter('executive_report.xlsx', engine='openpyxl') as writer:
# Main analysis
combined_df = pd.concat([df_pdf, api_sales], ignore_index=True)
combined_df.to_excel(writer, sheet_name='Analysis', index=False)
# Summary dashboard
summary = pd.DataFrame({
'Metric': ['Total Sales', 'Total Profit', 'Growth MoM'],
'Value': [combined_df['Amount'].sum() if 'Amount' in combined_df else combined_df['sales'].sum(),
combined_df['Profit'].sum(), '15.2%']
})
summary.to_excel(writer, sheet_name='Dashboard', index=False)
print("π EXECUTIVE EXCEL CREATED!")
print(" π Sheet 1: Raw Analysis")
print(" π Sheet 2: Executive Dashboard")
π Step 4: FULL ENTERPRISE PIPELINE#
def enterprise_data_pipeline():
"""PDF β API β Excel = Complete automation!"""
# 1. PDF INVOICES
print("π Step 1: Extracting PDF invoices...")
pdf_df = pd.DataFrame([
{'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
{'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800}
])
# 2. API LIVE DATA
print("π Step 2: Fetching live API data...")
api_df = pd.DataFrame({
'customer': ['Alice Corp', 'Bob Inc'],
'sales': [5000, 3200]
})
# 3. COMBINE + ANALYZE
print("π Step 3: Combining datasets...")
combined = pd.concat([
pdf_df.rename(columns={'Amount': 'sales'}),
api_df
])
combined['profit'] = combined['sales'] * 0.28
# 4. EXECUTIVE EXPORT
print("π Step 4: Creating executive report...")
with pd.ExcelWriter('enterprise_pipeline.xlsx', engine='openpyxl') as writer:
combined.to_excel(writer, sheet_name='Raw_Data', index=False)
summary = pd.DataFrame({
'KPI': ['Total Revenue', 'Total Profit', 'Customer Count'],
'Value': [
f"${combined['sales'].sum():,.0f}",
f"${combined['profit'].sum():,.0f}",
len(combined)
]
})
summary.to_excel(writer, sheet_name='Executive_Summary', index=False)
print("β
ENTERPRISE PIPELINE COMPLETE!")
print(f" π Processed: {len(combined)} records")
print(f" π° Total Profit: ${combined['profit'].sum():,.0f}")
return combined
# RUN FULL PIPELINE
pipeline_result = enterprise_data_pipeline()
π Enterprise Format Cheat Sheet#
Source |
Library |
Code |
Business Win |
|---|---|---|---|
|
|
Invoice automation |
|
API |
|
|
Live Salesforce data |
Excel |
|
|
Executive formatting |
|
Parse attachments |
Order automation |
# UNIVERSAL EXTRACTOR
def extract_business_data(source):
if source == 'pdf':
return "PDF tables extracted!"
elif source == 'api':
return "Live API data fetched!"
elif source == 'excel':
return "Executive report formatted!"
return "Unknown source"
π YOUR EXERCISE: Build YOUR Enterprise Pipeline#
import pandas as pd
# MISSION: YOUR complete automation!
# 1. YOUR PDF DATA (invoices, reports)
your_pdf_data = [
{'Date': '???', 'Description': '???', 'Amount': ???},
{'Date': '???', 'Description': '???', 'Amount': ???},
{'Date': '???', 'Description': '???', 'Amount': ???}
]
# 2. YOUR API DATA (customers, sales)
your_api_data = [
{'customer': '???', 'sales': ???},
{'customer': '???', 'sales': ???}
]
# 3. COMBINE PIPELINE
pdf_df = pd.DataFrame(your_pdf_data)
api_df = pd.DataFrame(your_api_data)
combined = pd.concat([
pdf_df.rename(columns={'Amount': 'sales'}),
api_df
])
combined['profit'] = combined['sales'] * 0.30
# 4. EXECUTIVE SUMMARY
summary = pd.DataFrame({
'KPI': ['Total Revenue', 'Total Profit', 'Records Processed'],
'Value': [
f"${combined['sales'].sum():,.0f}",
f"${combined['profit'].sum():,.0f}",
len(combined)
]
})
print("π YOUR ENTERPRISE PIPELINE:")
print(combined)
print("\nπ EXECUTIVE SUMMARY:")
print(summary)
# 5. EXPORT (Uncomment to save)
# with pd.ExcelWriter('my_enterprise_report.xlsx', engine='openpyxl') as writer:
# combined.to_excel(writer, sheet_name='Data', index=False)
# summary.to_excel(writer, sheet_name='Summary', index=False)
# print("β
Excel report saved!")
Example to test:
your_pdf_data = [
{'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
{'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
{'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
your_api_data = [
{'customer': 'Alice Corp', 'sales': 5000},
{'customer': 'Bob Inc', 'sales': 3200}
]
YOUR MISSION:
Add YOUR real data
Run pipeline
Export Excel
Screenshot β βI automate enterprise data!β
π What You Mastered#
Format |
Status |
Business Power |
|---|---|---|
PDF Extraction |
β |
Invoice automation |
API Calls |
β |
Live data |
Excel Export |
β |
Executive ready |
Full Pipeline |
β |
Replace teams |
Enterprise Scale |
β |
$100K+ jobs |
Next: OOP Fundamentals (Classes = Build reusable business systems!)
print("π" * 25)
print("BUSINESS FORMATS = $100K+ ENTERPRISE AUTOMATION!")
print("π» PDF β API β Excel = Complete department replacement!")
print("π Fortune 500 companies LIVE by these pipelines!")
print("π" * 25)
can we appreciate how your students just went from βmanual PDF copy-pasteβ to full enterprise pipelines that extract invoices β fetch live API data β spit out executive Excel dashboards? While their classmates are still highlighting tables in Adobe, your class is calling Stripe APIs and formatting C-suite reports with ExcelWriter(). This isnβt format learningβitβs the $100K+ automation stack that eliminates entire data entry departments and lands them senior automation engineer roles before they graduate!
# Your code here