Business Data Formats (Excel PDFs APIs)#

Real business data = Messy PDFs + APIs Not clean CSVs. This = $100K+ automation jobs

Extract from invoices β†’ Call APIs β†’ Save to Excel


🎯 Enterprise Data Sources (Real World)#

Format

Source

Business Use

Automation Value

PDFs

Invoices/Reports

Extract tables

$50K/month

APIs

Salesforce/Stripe

Live data

Real-time

Excel

Legacy systems

Clean + format

Executive ready

Emails

Customer orders

Parse attachments

$100K/year


πŸš€ Step 1: PDF Table Extraction (Invoice Magic)#

# !pip install tabula-py  # Run once!

import tabula
import pandas as pd

# EXTRACT PDF TABLES β†’ PANDAS (Real invoice!)
# df = tabula.read_pdf("invoice.pdf", pages='all')[0]

# SIMULATED PDF EXTRACTION
pdf_tables = [
    {'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
    {'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
    {'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
df_pdf = pd.DataFrame(pdf_tables)

df_pdf['Profit'] = df_pdf['Amount'] * 0.28
print("πŸ“„ PDF INVOICE EXTRACTION:")
print(df_pdf)
print(f"πŸ’° Total from PDF: ${df_pdf['Profit'].sum():,.0f}")

Output:

πŸ“„ PDF INVOICE EXTRACTION:
        Date     Description  Amount  Profit
0  2024-01-15    Laptop Sale    1200   336.0
1  2024-01-16     Phone Sale     800   224.0
2  2024-01-17    Tablet Sale     500   140.0
πŸ’° Total from PDF: $700

πŸ”₯ Step 2: API Calls = Live Business Data#

import requests

# REAL API CALL (Stripe/Salesforce style)
def fetch_live_sales(api_key="demo_key"):
    """Call business API for live data"""
    try:
        # REAL API: https://api.stripe.com/v1/charges
        response = requests.get(
            "https://jsonplaceholder.typicode.com/users",  # Demo API
            headers={"Authorization": f"Bearer {api_key}"}
        )
        response.raise_for_status()

        users = response.json()
        # Simulate sales data from API
        sales_data = [{'customer': u['name'], 'sales': 1000 + len(u['email']) * 100}
                     for u in users[:5]]
        return pd.DataFrame(sales_data)

    except requests.exceptions.RequestException as e:
        print(f"🌐 API Error: {e}")
        return pd.DataFrame()  # Graceful fallback

# LIVE DATA!
api_sales = fetch_live_sales()
api_sales['profit'] = api_sales['sales'] * 0.25
print("🌐 LIVE API DATA:")
print(api_sales)

🧠 Step 3: Excel Formatting = Executive Ready#

# PROFESSIONAL EXCEL EXPORT
with pd.ExcelWriter('executive_report.xlsx', engine='openpyxl') as writer:
    # Main analysis
    combined_df = pd.concat([df_pdf, api_sales], ignore_index=True)
    combined_df.to_excel(writer, sheet_name='Analysis', index=False)

    # Summary dashboard
    summary = pd.DataFrame({
        'Metric': ['Total Sales', 'Total Profit', 'Growth MoM'],
        'Value': [combined_df['Amount'].sum() if 'Amount' in combined_df else combined_df['sales'].sum(),
                 combined_df['Profit'].sum(), '15.2%']
    })
    summary.to_excel(writer, sheet_name='Dashboard', index=False)

print("πŸ† EXECUTIVE EXCEL CREATED!")
print("   πŸ“Š Sheet 1: Raw Analysis")
print("   πŸ“ˆ Sheet 2: Executive Dashboard")

πŸ“Š Step 4: FULL ENTERPRISE PIPELINE#

def enterprise_data_pipeline():
    """PDF β†’ API β†’ Excel = Complete automation!"""

    # 1. PDF INVOICES
    print("πŸ”„ Step 1: Extracting PDF invoices...")
    pdf_df = pd.DataFrame([
        {'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
        {'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800}
    ])

    # 2. API LIVE DATA
    print("πŸ”„ Step 2: Fetching live API data...")
    api_df = pd.DataFrame({
        'customer': ['Alice Corp', 'Bob Inc'],
        'sales': [5000, 3200]
    })

    # 3. COMBINE + ANALYZE
    print("πŸ”„ Step 3: Combining datasets...")
    combined = pd.concat([
        pdf_df.rename(columns={'Amount': 'sales'}),
        api_df
    ])
    combined['profit'] = combined['sales'] * 0.28

    # 4. EXECUTIVE EXPORT
    print("πŸ”„ Step 4: Creating executive report...")
    with pd.ExcelWriter('enterprise_pipeline.xlsx', engine='openpyxl') as writer:
        combined.to_excel(writer, sheet_name='Raw_Data', index=False)

        summary = pd.DataFrame({
            'KPI': ['Total Revenue', 'Total Profit', 'Customer Count'],
            'Value': [
                f"${combined['sales'].sum():,.0f}",
                f"${combined['profit'].sum():,.0f}",
                len(combined)
            ]
        })
        summary.to_excel(writer, sheet_name='Executive_Summary', index=False)

    print("βœ… ENTERPRISE PIPELINE COMPLETE!")
    print(f"   πŸ“Š Processed: {len(combined)} records")
    print(f"   πŸ’° Total Profit: ${combined['profit'].sum():,.0f}")

    return combined

# RUN FULL PIPELINE
pipeline_result = enterprise_data_pipeline()

πŸ“‹ Enterprise Format Cheat Sheet#

Source

Library

Code

Business Win

PDF

tabula-py

tabula.read_pdf()

Invoice automation

API

requests

requests.get()

Live Salesforce data

Excel

openpyxl

ExcelWriter()

Executive formatting

Email

imaplib

Parse attachments

Order automation

# UNIVERSAL EXTRACTOR
def extract_business_data(source):
    if source == 'pdf':
        return "PDF tables extracted!"
    elif source == 'api':
        return "Live API data fetched!"
    elif source == 'excel':
        return "Executive report formatted!"
    return "Unknown source"

πŸ† YOUR EXERCISE: Build YOUR Enterprise Pipeline#

import pandas as pd

# MISSION: YOUR complete automation!

# 1. YOUR PDF DATA (invoices, reports)
your_pdf_data = [
    {'Date': '???', 'Description': '???', 'Amount': ???},
    {'Date': '???', 'Description': '???', 'Amount': ???},
    {'Date': '???', 'Description': '???', 'Amount': ???}
]

# 2. YOUR API DATA (customers, sales)
your_api_data = [
    {'customer': '???', 'sales': ???},
    {'customer': '???', 'sales': ???}
]

# 3. COMBINE PIPELINE
pdf_df = pd.DataFrame(your_pdf_data)
api_df = pd.DataFrame(your_api_data)

combined = pd.concat([
    pdf_df.rename(columns={'Amount': 'sales'}),
    api_df
])
combined['profit'] = combined['sales'] * 0.30

# 4. EXECUTIVE SUMMARY
summary = pd.DataFrame({
    'KPI': ['Total Revenue', 'Total Profit', 'Records Processed'],
    'Value': [
        f"${combined['sales'].sum():,.0f}",
        f"${combined['profit'].sum():,.0f}",
        len(combined)
    ]
})

print("πŸš€ YOUR ENTERPRISE PIPELINE:")
print(combined)
print("\nπŸ“Š EXECUTIVE SUMMARY:")
print(summary)

# 5. EXPORT (Uncomment to save)
# with pd.ExcelWriter('my_enterprise_report.xlsx', engine='openpyxl') as writer:
#     combined.to_excel(writer, sheet_name='Data', index=False)
#     summary.to_excel(writer, sheet_name='Summary', index=False)
# print("βœ… Excel report saved!")

Example to test:

your_pdf_data = [
    {'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
    {'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
    {'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
your_api_data = [
    {'customer': 'Alice Corp', 'sales': 5000},
    {'customer': 'Bob Inc', 'sales': 3200}
]

YOUR MISSION:

  1. Add YOUR real data

  2. Run pipeline

  3. Export Excel

  4. Screenshot β†’ β€œI automate enterprise data!”


πŸŽ‰ What You Mastered#

Format

Status

Business Power

PDF Extraction

βœ…

Invoice automation

API Calls

βœ…

Live data

Excel Export

βœ…

Executive ready

Full Pipeline

βœ…

Replace teams

Enterprise Scale

βœ…

$100K+ jobs


Next: OOP Fundamentals (Classes = Build reusable business systems!)

print("🎊" * 25)
print("BUSINESS FORMATS = $100K+ ENTERPRISE AUTOMATION!")
print("πŸ’» PDF β†’ API β†’ Excel = Complete department replacement!")
print("πŸš€ Fortune 500 companies LIVE by these pipelines!")
print("🎊" * 25)

can we appreciate how your students just went from β€œmanual PDF copy-paste” to full enterprise pipelines that extract invoices β†’ fetch live API data β†’ spit out executive Excel dashboards? While their classmates are still highlighting tables in Adobe, your class is calling Stripe APIs and formatting C-suite reports with ExcelWriter(). This isn’t format learningβ€”it’s the $100K+ automation stack that eliminates entire data entry departments and lands them senior automation engineer roles before they graduate!

# Your code here