Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Business Data Formats (Excel PDFs APIs)

Real business data = Messy PDFs + APIs Not clean CSVs. This = $100K+ automation jobs

Extract from invoices → Call APIs → Save to Excel


🎯 Enterprise Data Sources (Real World)

FormatSourceBusiness UseAutomation Value
PDFsInvoices/ReportsExtract tables$50K/month
APIsSalesforce/StripeLive dataReal-time
ExcelLegacy systemsClean + formatExecutive ready
EmailsCustomer ordersParse attachments$100K/year

🚀 Step 1: PDF Table Extraction (Invoice Magic)

## !pip install tabula-py  # Run once!

import tabula
import pandas as pd

## EXTRACT PDF TABLES → PANDAS (Real invoice!)
## df = tabula.read_pdf("invoice.pdf", pages='all')[0]

## SIMULATED PDF EXTRACTION
pdf_tables = [
    {'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
    {'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
    {'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
df_pdf = pd.DataFrame(pdf_tables)

df_pdf['Profit'] = df_pdf['Amount'] * 0.28
print("📄 PDF INVOICE EXTRACTION:")
print(df_pdf)
print(f"💰 Total from PDF: ${df_pdf['Profit'].sum():,.0f}")

Output:

📄 PDF INVOICE EXTRACTION:
        Date     Description  Amount  Profit
0  2024-01-15    Laptop Sale    1200   336.0
1  2024-01-16     Phone Sale     800   224.0
2  2024-01-17    Tablet Sale     500   140.0
💰 Total from PDF: $700

🔥 Step 2: API Calls = Live Business Data

import requests

## REAL API CALL (Stripe/Salesforce style)
def fetch_live_sales(api_key="demo_key"):
    """Call business API for live data"""
    try:
        # REAL API: https://api.stripe.com/v1/charges
        response = requests.get(
            "https://jsonplaceholder.typicode.com/users",  # Demo API
            headers={"Authorization": f"Bearer {api_key}"}
        )
        response.raise_for_status()

        users = response.json()
        # Simulate sales data from API
        sales_data = [{'customer': u['name'], 'sales': 1000 + len(u['email']) * 100}
                     for u in users[:5]]
        return pd.DataFrame(sales_data)

    except requests.exceptions.RequestException as e:
        print(f"🌐 API Error: {e}")
        return pd.DataFrame()  # Graceful fallback

## LIVE DATA!
api_sales = fetch_live_sales()
api_sales['profit'] = api_sales['sales'] * 0.25
print("🌐 LIVE API DATA:")
print(api_sales)

🧠 Step 3: Excel Formatting = Executive Ready

## PROFESSIONAL EXCEL EXPORT
with pd.ExcelWriter('executive_report.xlsx', engine='openpyxl') as writer:
    # Main analysis
    combined_df = pd.concat([df_pdf, api_sales], ignore_index=True)
    combined_df.to_excel(writer, sheet_name='Analysis', index=False)

    # Summary dashboard
    summary = pd.DataFrame({
        'Metric': ['Total Sales', 'Total Profit', 'Growth MoM'],
        'Value': [combined_df['Amount'].sum() if 'Amount' in combined_df else combined_df['sales'].sum(),
                 combined_df['Profit'].sum(), '15.2%']
    })
    summary.to_excel(writer, sheet_name='Dashboard', index=False)

print("🏆 EXECUTIVE EXCEL CREATED!")
print("   📊 Sheet 1: Raw Analysis")
print("   📈 Sheet 2: Executive Dashboard")

📊 Step 4: FULL ENTERPRISE PIPELINE

def enterprise_data_pipeline():
    """PDF → API → Excel = Complete automation!"""

    # 1. PDF INVOICES
    print("🔄 Step 1: Extracting PDF invoices...")
    pdf_df = pd.DataFrame([
        {'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
        {'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800}
    ])

    # 2. API LIVE DATA
    print("🔄 Step 2: Fetching live API data...")
    api_df = pd.DataFrame({
        'customer': ['Alice Corp', 'Bob Inc'],
        'sales': [5000, 3200]
    })

    # 3. COMBINE + ANALYZE
    print("🔄 Step 3: Combining datasets...")
    combined = pd.concat([
        pdf_df.rename(columns={'Amount': 'sales'}),
        api_df
    ])
    combined['profit'] = combined['sales'] * 0.28

    # 4. EXECUTIVE EXPORT
    print("🔄 Step 4: Creating executive report...")
    with pd.ExcelWriter('enterprise_pipeline.xlsx', engine='openpyxl') as writer:
        combined.to_excel(writer, sheet_name='Raw_Data', index=False)

        summary = pd.DataFrame({
            'KPI': ['Total Revenue', 'Total Profit', 'Customer Count'],
            'Value': [
                f"${combined['sales'].sum():,.0f}",
                f"${combined['profit'].sum():,.0f}",
                len(combined)
            ]
        })
        summary.to_excel(writer, sheet_name='Executive_Summary', index=False)

    print("✅ ENTERPRISE PIPELINE COMPLETE!")
    print(f"   📊 Processed: {len(combined)} records")
    print(f"   💰 Total Profit: ${combined['profit'].sum():,.0f}")

    return combined

## RUN FULL PIPELINE
pipeline_result = enterprise_data_pipeline()

📋 Enterprise Format Cheat Sheet

SourceLibraryCodeBusiness Win
PDFtabula-pytabula.read_pdf()Invoice automation
APIrequestsrequests.get()Live Salesforce data
ExcelopenpyxlExcelWriter()Executive formatting
EmailimaplibParse attachmentsOrder automation
## UNIVERSAL EXTRACTOR
def extract_business_data(source):
    if source == 'pdf':
        return "PDF tables extracted!"
    elif source == 'api':
        return "Live API data fetched!"
    elif source == 'excel':
        return "Executive report formatted!"
    return "Unknown source"

🏆 YOUR EXERCISE: Build YOUR Enterprise Pipeline

import pandas as pd

## MISSION: YOUR complete automation!

## 1. YOUR PDF DATA (invoices, reports)
your_pdf_data = [
    {'Date': '???', 'Description': '???', 'Amount': ???},
    {'Date': '???', 'Description': '???', 'Amount': ???},
    {'Date': '???', 'Description': '???', 'Amount': ???}
]

## 2. YOUR API DATA (customers, sales)
your_api_data = [
    {'customer': '???', 'sales': ???},
    {'customer': '???', 'sales': ???}
]

## 3. COMBINE PIPELINE
pdf_df = pd.DataFrame(your_pdf_data)
api_df = pd.DataFrame(your_api_data)

combined = pd.concat([
    pdf_df.rename(columns={'Amount': 'sales'}),
    api_df
])
combined['profit'] = combined['sales'] * 0.30

## 4. EXECUTIVE SUMMARY
summary = pd.DataFrame({
    'KPI': ['Total Revenue', 'Total Profit', 'Records Processed'],
    'Value': [
        f"${combined['sales'].sum():,.0f}",
        f"${combined['profit'].sum():,.0f}",
        len(combined)
    ]
})

print("🚀 YOUR ENTERPRISE PIPELINE:")
print(combined)
print("\n📊 EXECUTIVE SUMMARY:")
print(summary)

## 5. EXPORT (Uncomment to save)
## with pd.ExcelWriter('my_enterprise_report.xlsx', engine='openpyxl') as writer:
## combined.to_excel(writer, sheet_name='Data', index=False)
## summary.to_excel(writer, sheet_name='Summary', index=False)
## print("✅ Excel report saved!")

Example to test:

your_pdf_data = [
    {'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
    {'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
    {'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
your_api_data = [
    {'customer': 'Alice Corp', 'sales': 5000},
    {'customer': 'Bob Inc', 'sales': 3200}
]

YOUR MISSION:

  1. Add YOUR real data

  2. Run pipeline

  3. Export Excel

  4. Screenshot“I automate enterprise data!”


🎉 What You Mastered

FormatStatusBusiness Power
PDF ExtractionInvoice automation
API CallsLive data
Excel ExportExecutive ready
Full PipelineReplace teams
Enterprise Scale$100K+ jobs

Next: OOP Fundamentals (Classes = Build reusable business systems!)

print("🎊" * 25)
print("BUSINESS FORMATS = $100K+ ENTERPRISE AUTOMATION!")
print("💻 PDF → API → Excel = Complete department replacement!")
print("🚀 Fortune 500 companies LIVE by these pipelines!")
print("🎊" * 25)

can we appreciate how your students just went from “manual PDF copy-paste” to full enterprise pipelines that extract invoices → fetch live API data → spit out executive Excel dashboards? While their classmates are still highlighting tables in Adobe, your class is calling Stripe APIs and formatting C-suite reports with ExcelWriter(). This isn’t format learning—it’s the $100K+ automation stack that eliminates entire data entry departments and lands them senior automation engineer roles before they graduate!

# Your code here

Exercises

Exercise