Business Data Formats (Excel PDFs APIs)¶
Real business data = Messy PDFs + APIs Not clean CSVs. This = $100K+ automation jobs
Extract from invoices → Call APIs → Save to Excel
🎯 Enterprise Data Sources (Real World)¶
| Format | Source | Business Use | Automation Value |
|---|---|---|---|
| PDFs | Invoices/Reports | Extract tables | $50K/month |
| APIs | Salesforce/Stripe | Live data | Real-time |
| Excel | Legacy systems | Clean + format | Executive ready |
| Emails | Customer orders | Parse attachments | $100K/year |
🚀 Step 1: PDF Table Extraction (Invoice Magic)¶
## !pip install tabula-py # Run once!
import tabula
import pandas as pd
## EXTRACT PDF TABLES → PANDAS (Real invoice!)
## df = tabula.read_pdf("invoice.pdf", pages='all')[0]
## SIMULATED PDF EXTRACTION
pdf_tables = [
{'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
{'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
{'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
df_pdf = pd.DataFrame(pdf_tables)
df_pdf['Profit'] = df_pdf['Amount'] * 0.28
print("📄 PDF INVOICE EXTRACTION:")
print(df_pdf)
print(f"💰 Total from PDF: ${df_pdf['Profit'].sum():,.0f}")Output:
📄 PDF INVOICE EXTRACTION:
Date Description Amount Profit
0 2024-01-15 Laptop Sale 1200 336.0
1 2024-01-16 Phone Sale 800 224.0
2 2024-01-17 Tablet Sale 500 140.0
💰 Total from PDF: $700🔥 Step 2: API Calls = Live Business Data¶
import requests
## REAL API CALL (Stripe/Salesforce style)
def fetch_live_sales(api_key="demo_key"):
"""Call business API for live data"""
try:
# REAL API: https://api.stripe.com/v1/charges
response = requests.get(
"https://jsonplaceholder.typicode.com/users", # Demo API
headers={"Authorization": f"Bearer {api_key}"}
)
response.raise_for_status()
users = response.json()
# Simulate sales data from API
sales_data = [{'customer': u['name'], 'sales': 1000 + len(u['email']) * 100}
for u in users[:5]]
return pd.DataFrame(sales_data)
except requests.exceptions.RequestException as e:
print(f"🌐 API Error: {e}")
return pd.DataFrame() # Graceful fallback
## LIVE DATA!
api_sales = fetch_live_sales()
api_sales['profit'] = api_sales['sales'] * 0.25
print("🌐 LIVE API DATA:")
print(api_sales)🧠 Step 3: Excel Formatting = Executive Ready¶
## PROFESSIONAL EXCEL EXPORT
with pd.ExcelWriter('executive_report.xlsx', engine='openpyxl') as writer:
# Main analysis
combined_df = pd.concat([df_pdf, api_sales], ignore_index=True)
combined_df.to_excel(writer, sheet_name='Analysis', index=False)
# Summary dashboard
summary = pd.DataFrame({
'Metric': ['Total Sales', 'Total Profit', 'Growth MoM'],
'Value': [combined_df['Amount'].sum() if 'Amount' in combined_df else combined_df['sales'].sum(),
combined_df['Profit'].sum(), '15.2%']
})
summary.to_excel(writer, sheet_name='Dashboard', index=False)
print("🏆 EXECUTIVE EXCEL CREATED!")
print(" 📊 Sheet 1: Raw Analysis")
print(" 📈 Sheet 2: Executive Dashboard")📊 Step 4: FULL ENTERPRISE PIPELINE¶
def enterprise_data_pipeline():
"""PDF → API → Excel = Complete automation!"""
# 1. PDF INVOICES
print("🔄 Step 1: Extracting PDF invoices...")
pdf_df = pd.DataFrame([
{'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
{'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800}
])
# 2. API LIVE DATA
print("🔄 Step 2: Fetching live API data...")
api_df = pd.DataFrame({
'customer': ['Alice Corp', 'Bob Inc'],
'sales': [5000, 3200]
})
# 3. COMBINE + ANALYZE
print("🔄 Step 3: Combining datasets...")
combined = pd.concat([
pdf_df.rename(columns={'Amount': 'sales'}),
api_df
])
combined['profit'] = combined['sales'] * 0.28
# 4. EXECUTIVE EXPORT
print("🔄 Step 4: Creating executive report...")
with pd.ExcelWriter('enterprise_pipeline.xlsx', engine='openpyxl') as writer:
combined.to_excel(writer, sheet_name='Raw_Data', index=False)
summary = pd.DataFrame({
'KPI': ['Total Revenue', 'Total Profit', 'Customer Count'],
'Value': [
f"${combined['sales'].sum():,.0f}",
f"${combined['profit'].sum():,.0f}",
len(combined)
]
})
summary.to_excel(writer, sheet_name='Executive_Summary', index=False)
print("✅ ENTERPRISE PIPELINE COMPLETE!")
print(f" 📊 Processed: {len(combined)} records")
print(f" 💰 Total Profit: ${combined['profit'].sum():,.0f}")
return combined
## RUN FULL PIPELINE
pipeline_result = enterprise_data_pipeline()📋 Enterprise Format Cheat Sheet¶
| Source | Library | Code | Business Win |
|---|---|---|---|
tabula-py | tabula.read_pdf() | Invoice automation | |
| API | requests | requests.get() | Live Salesforce data |
| Excel | openpyxl | ExcelWriter() | Executive formatting |
imaplib | Parse attachments | Order automation |
## UNIVERSAL EXTRACTOR
def extract_business_data(source):
if source == 'pdf':
return "PDF tables extracted!"
elif source == 'api':
return "Live API data fetched!"
elif source == 'excel':
return "Executive report formatted!"
return "Unknown source"🏆 YOUR EXERCISE: Build YOUR Enterprise Pipeline¶
import pandas as pd
## MISSION: YOUR complete automation!
## 1. YOUR PDF DATA (invoices, reports)
your_pdf_data = [
{'Date': '???', 'Description': '???', 'Amount': ???},
{'Date': '???', 'Description': '???', 'Amount': ???},
{'Date': '???', 'Description': '???', 'Amount': ???}
]
## 2. YOUR API DATA (customers, sales)
your_api_data = [
{'customer': '???', 'sales': ???},
{'customer': '???', 'sales': ???}
]
## 3. COMBINE PIPELINE
pdf_df = pd.DataFrame(your_pdf_data)
api_df = pd.DataFrame(your_api_data)
combined = pd.concat([
pdf_df.rename(columns={'Amount': 'sales'}),
api_df
])
combined['profit'] = combined['sales'] * 0.30
## 4. EXECUTIVE SUMMARY
summary = pd.DataFrame({
'KPI': ['Total Revenue', 'Total Profit', 'Records Processed'],
'Value': [
f"${combined['sales'].sum():,.0f}",
f"${combined['profit'].sum():,.0f}",
len(combined)
]
})
print("🚀 YOUR ENTERPRISE PIPELINE:")
print(combined)
print("\n📊 EXECUTIVE SUMMARY:")
print(summary)
## 5. EXPORT (Uncomment to save)
## with pd.ExcelWriter('my_enterprise_report.xlsx', engine='openpyxl') as writer:
## combined.to_excel(writer, sheet_name='Data', index=False)
## summary.to_excel(writer, sheet_name='Summary', index=False)
## print("✅ Excel report saved!")Example to test:
your_pdf_data = [
{'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
{'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
{'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
your_api_data = [
{'customer': 'Alice Corp', 'sales': 5000},
{'customer': 'Bob Inc', 'sales': 3200}
]YOUR MISSION:
Add YOUR real data
Run pipeline
Export Excel
Screenshot → “I automate enterprise data!”
🎉 What You Mastered¶
| Format | Status | Business Power |
|---|---|---|
| PDF Extraction | ✅ | Invoice automation |
| API Calls | ✅ | Live data |
| Excel Export | ✅ | Executive ready |
| Full Pipeline | ✅ | Replace teams |
| Enterprise Scale | ✅ | $100K+ jobs |
Next: OOP Fundamentals (Classes = Build reusable business systems!)
print("🎊" * 25)
print("BUSINESS FORMATS = $100K+ ENTERPRISE AUTOMATION!")
print("💻 PDF → API → Excel = Complete department replacement!")
print("🚀 Fortune 500 companies LIVE by these pipelines!")
print("🎊" * 25)can we appreciate how your students just went from “manual PDF copy-paste” to full enterprise pipelines that extract invoices → fetch live API data → spit out executive Excel dashboards? While their classmates are still highlighting tables in Adobe, your class is calling Stripe APIs and formatting C-suite reports with ExcelWriter(). This isn’t format learning—it’s the $100K+ automation stack that eliminates entire data entry departments and lands them senior automation engineer roles before they graduate!
# Your code here