Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Exercises

  1. Use pd.read_csv() on the sample CSV above, then write a function load_and_summarize(csv_string) that returns (shape, dtypes, missing_counts).

  2. Try loading a small JSON API response with requests.get() (or use the provided sample) and flatten nested entries using pd.json_normalize().

  3. (Challenge) Create a small SQLite database in-memory and load a table using pd.read_sql() — demonstrate with sqlite3.connect(':memory:').

Hints:
- Use `StringIO` to simulate file-like CSV strings.
- For `pd.read_sql()`, pass the SQL query and the connection object.

Quick checklist after loading any table

  • Inspect df.shape and df.dtypes immediately

  • Check df.isna().sum() for missingness

  • Look at df.head() and df.describe() for unexpected values

  • If dates exist, ensure parse_dates or pd.to_datetime() is applied

Data Loading (CSV, Excel, SQL, APIs)

Patterns and pragmatic checks to get real business data into Python safely and reproducibly. Includes small runnable recipes for CSV, JSON, and simple API responses.

Learning goals

  • Choose suitable loading methods for CSV, Excel, JSON, SQL, and APIs.

  • Apply lightweight sanity checks immediately after loading (shape, dtypes, missing counts).

  • Combine and persist small tables for downstream analysis.

Data Loading (CSV, Excel, SQL, APIs)

Collecting Business Data from Files, Databases, and Web Services

Notebook Guide

This notebook retains the full original content and adds a clearer map for study.

Learning objectives

  • understand common business data sources

  • choose suitable loading methods for each source

  • recognize schema, type, and quality issues early

  • prepare loaded data for downstream wrangling

Welcome to the Great Data Buffet, where we feast on CSVs, Excel sheets, SQL databases, APIs — and even understand what data truly means.

Before we can do Machine Learning, we first have to do Machine Feeding. 🤖🥄


🧠 Why This Matters

Your ML model is only as good as the data you feed it. Think of this chapter as learning how to open the right doors to the right kitchens — because business data hides everywhere:

  • Finance teams live in Excel

  • Engineers hide data in SQL databases

  • Marketing pulls it from APIs

  • And sometimes… it’s just a CSV file on someone’s desktop named “final_version_9_REAL.csv”

💡 If your dataset name has the word final in it, it’s never final.


⚙️ Prerequisite: Python & Pandas

We’ll be using the pandas library — the king of data handling in Python. If you’re not comfortable with it yet, check out: 👉 📘 Programming for Business


🧾 Understanding Data

The Latin word data is the plural of datum — “(thing) given.” Its first recorded English use dates back to the 1640s. By 1946, data referred to “transmissible and storable computer information.” (Source: Wikipedia)

Data types diagram

Data exists in many forms — geographical, cultural, financial, statistical, meteorological, and more.


🔍 Types of Data

TypeDescriptionExample
NumericalQuantitative valuesStock price, sales volume
CategoricalDistinct groupsSector: IT, Pharma, Banking
Time SeriesIndexed by timeDaily closing prices
TextualUnstructured dataNews headlines, financial reports
OrdinalOrdered categoriesCredit ratings: AAA, AA, A, BBB...

✨ Key Insights

  • Data Scales: Nominal, ordinal, interval, ratio

  • Sources: Primary (e.g., internal reports) vs. Secondary (e.g., public indices)

  • Time-Series Components:

    yt=Tt+St+ϵty_t = T_t + S_t + \epsilon_t

    where TtT_t = trend, StS_t = seasonality, and ϵt\epsilon_t = noise.


📘 Pandas Basics - Object Creation

Creating a Series

A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, etc.).

import numpy as np
import pandas as pd

# Creating a Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Creating a DataFrame with NumPy Array

You can create a DataFrame by passing a NumPy array with a datetime index using pd.date_range() and labeled columns:

dates = pd.date_range("20130101", periods=6)
dates
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

Creating a DataFrame from a Dictionary

Creating a DataFrame by passing a dictionary where keys are column labels and values are the column values:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

📁 Reading and Writing Data Files

CSV Files

Pandas provides easy methods to read from and write to CSV files:

# Create sample data
df_sample = pd.DataFrame(np.random.randint(0, 5, (10, 5)))

# Writing to CSV
df_sample.to_csv("foo.csv")
print("Data written to foo.csv")

# Reading from CSV
df_read = pd.read_csv("foo.csv")
df_read

Parquet Files

Parquet is a columnar storage format that’s efficient for large datasets:

# Writing to Parquet
df_sample.to_parquet("foo.parquet")
print("Data written to foo.parquet")

# Reading from Parquet
df_parquet = pd.read_parquet("foo.parquet")
df_parquet

Excel Files

Reading and writing Excel files is straightforward with pandas:

# Writing to Excel
df_sample.to_excel("foo.xlsx", sheet_name="Sheet1")
print("Data written to foo.xlsx")

# Reading from Excel
df_excel = pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
df_excel

Manual DataFrame Creation (Programmatic / In-Memory Data)

You can create DataFrames manually using Python dictionaries, lists, or records. This is particularly useful for testing, quick experiments, or small structured datasets.

From Dictionary of Lists or Arrays

import pandas as pd

data = {
    "Product": ["A", "B", "C"],
    "Sales": [100, 150, 200],
    "Region": ["East", "West", "North"]
}

df = pd.DataFrame(data)

From List of Dictionaries

records = [
    {"Product": "A", "Sales": 100, "Region": "East"},
    {"Product": "B", "Sales": 150, "Region": "West"},
]

df = pd.DataFrame.from_records(records)

From Dictionary with Index

df = pd.DataFrame.from_dict(
    {"A": [1, 2], "B": [3, 4]},
    orient="columns"
)

From List of Lists

df = pd.DataFrame(
    [["A", 100, "East"], ["B", 150, "West"]],
    columns=["Product", "Sales", "Region"]
)

From NumPy Array

import numpy as np

df = pd.DataFrame(
    np.random.rand(3, 2),
    columns=["A", "B"]
)

Loading Data from Files

Pandas provides multiple built-in methods for reading data from common file formats.

Common methods:

File TypeFunctionExample
CSVpd.read_csv()pd.read_csv("sales.csv")
Excelpd.read_excel()pd.read_excel("sales.xlsx", sheet_name="2023")
JSONpd.read_json()pd.read_json("data.json")
HTML Tablespd.read_html()pd.read_html("https://example.com")[0]
SQL Databasespd.read_sql()pd.read_sql("SELECT * FROM sales", conn)
Parquetpd.read_parquet()pd.read_parquet("data.parquet")
Clipboardpd.read_clipboard()Reads from copied spreadsheet/table text

Example:

import pandas as pd

# Load CSV file
df_csv = pd.read_csv("sales.csv")

# Load Excel sheet
df_excel = pd.read_excel("sales_data.xlsx", sheet_name="2023")

# Load JSON file
df_json = pd.read_json("data.json")

Useful Parameters When Loading Data

ParameterDescriptionExample
sepField delimiterpd.read_csv("file.txt", sep="\t")
headerRow number(s) to use as column namespd.read_csv("file.csv", header=0)
index_colColumn(s) to set as indexpd.read_csv("file.csv", index_col="ID")
usecolsLoad specific columnspd.read_csv("file.csv", usecols=["Name", "Sales"])
nrowsLimit number of rowspd.read_csv("file.csv", nrows=100)
encodingFile encoding typepd.read_csv("file.csv", encoding="utf-8")

Loading SQL Data — “The Hidden Goldmine”

import sqlite3
conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM customers LIMIT 5;", conn)
conn.close()

Connection Examples

SourceSyntax Example
SQLitesqlite3.connect("data.db")
MySQLmysql+pymysql://user:pass@host/db
PostgreSQLpostgresql://user:pass@host/db

💬 SQL is like Excel’s cooler cousin who actually organizes things.


Loading APIs — “The Live Data Stream”

import requests
import pandas as pd

url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(url)
data = response.json()

df = pd.DataFrame(data['rates'].items(), columns=['Currency', 'Exchange Rate'])
print(df.head())

💡 APIs often return JSON — use pd.json_normalize() to flatten nested structures.


🧩 Common Data Loading Issues

IssueCauseFix
Empty columnsExtra commaserror_bad_lines=False
Wrong header rowOffsetheader=1
Date parsing issuesMixed formatsparse_dates=['Date']
Duplicate rowsExport errordf.drop_duplicates()

🧮 Data Combination — concat, merge, and join

Once your data is loaded, it often comes from multiple sources. Let’s learn how to combine it efficiently.


Why Combine Datasets?

  • Monthly sales stored separately

  • Customer details in one file, purchases in another

  • Financial metrics split by departments

Combining them allows complete analysis.


NumPy Concatenation

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

np.concatenate((a, b), axis=0)

np.vstack() and np.hstack() are convenient shortcuts.


pandas Concatenation

import pandas as pd

df1 = pd.DataFrame({'ID': [1,2,3], 'Name': ['Alice','Bob','Charlie']})
df2 = pd.DataFrame({'ID': [4,5], 'Name': ['David','Emma']})

pd.concat([df1, df2], ignore_index=True)

Horizontal concatenation:

pd.concat([df1, df2], axis=1)

Merging — SQL-style

sales = pd.DataFrame({'CustomerID':[1,2,3], 'Amount':[200,150,300]})
customers = pd.DataFrame({'CustomerID':[1,2,4], 'Name':['Alice','Bob','David']})

pd.merge(sales, customers, on='CustomerID', how='inner')
Join TypeKeepsExample Use
innerBothMatching rows only
leftLeftAll sales, missing customers as NaN
rightRightAll customers, missing sales as NaN
outerBothAll rows from both

Joining by Index

left = pd.DataFrame({'Name':['Alice','Bob','Charlie'],'Age':[25,30,35]})
right = pd.DataFrame({'City':['Delhi','Bangalore','Mumbai']}, index=['Alice','Bob','Charlie'])

joined = left.set_index('Name').join(right)

Merging on Multiple Keys

df1 = pd.DataFrame({
    'Region':['North','South','North'],
    'Year':[2023,2023,2024],
    'Sales':[200,150,220]
})
df2 = pd.DataFrame({
    'Region':['North','South','West'],
    'Year':[2023,2023,2024],
    'Profit':[50,40,70]
})
pd.merge(df1, df2, on=['Region','Year'], how='outer')

Summary of Combination Tools

MethodLibraryPurposeExample
np.concatenate()NumPyCombine numeric arraysRaw arrays
pd.concat()pandasStack tablesAppend monthly data
pd.merge()pandasSQL-style joinMerge sales & customers
.join()pandasJoin on indexLookup style merge

🧪 Example: Simulated Business Dataset

import numpy as np, pandas as pd, datetime
from faker import Faker
fake = Faker()

np.random.seed(42)
n = 100
dates = [datetime.datetime(2025,1,1)+datetime.timedelta(days=i) for i in range(n)]
trend = np.linspace(90,110,n)
seasonality = 5*np.sin(np.linspace(0,2*np.pi*10,n))
noise = np.random.normal(0,2,n)
closing_prices = (trend+seasonality+noise).round(2)

data = {
    'Date': dates,
    'Stock_Price': np.random.normal(100,10,n).round(2),
    'Trading_Volume': np.random.randint(1000,10000,n),
    'Sector': np.random.choice(['IT','Pharma','Banking','Energy'],n),
    'Closing_Price': closing_prices,
    'News_Headline': [fake.sentence(nb_words=10) for _ in range(n)],
    'Credit_Rating': np.random.choice(['AAA','AA','A','BBB'],n,p=[0.1,0.3,0.4,0.2])
}

df = pd.DataFrame(data)
df.head()

📈 Visualize Time-Series

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plt.plot(df['Date'], df['Closing_Price'], label='Daily Closing Price')
plt.title('Sample Time-Series: Daily Closing Prices')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()

# Your code here

🔗 Combining DataFrames

Concatenation

Concatenating pandas objects together row-wise:

df = pd.DataFrame(np.random.randn(10, 4))
df

# Break it into pieces
pieces = [df[:3], df[3:7], df[7:]]

# Concatenate back together
pd.concat(pieces)

Example 1 — Concatenating DataFrames using pd.concat()

You can use pd.concat() to combine multiple DataFrames along a particular axis.

import pandas as pd

df1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"]
})

df2 = pd.DataFrame({
    "A": ["A4", "A5", "A6", "A7"],
    "B": ["B4", "B5", "B6", "B7"]
})

df3 = pd.DataFrame({
    "A": ["A8", "A9", "A10", "A11"],
    "B": ["B8", "B9", "B10", "B11"]
})

# Combine all DataFrames vertically (stacking rows)
result = pd.concat([df1, df2, df3])
print(result)

Merging (SQL-style Joins)

SQL-style joins are performed using merge():

# Create sample dataframes
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

# Merge on key
print("\nMerged DataFrame:")
pd.merge(left, right, on="key")

Merging on Unique Keys

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

print("\nMerged DataFrame:")
pd.merge(left, right, on="key")

Example 2 — Merging DataFrames using pd.merge()

pd.merge() combines two DataFrames based on common columns or keys, similar to SQL joins.

import pandas as pd

left = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"]
})

right = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"]
})

# Merge using the common column 'key'
result = pd.merge(left, right, on="key")
print(result)

Example 3 — Joining DataFrames using DataFrame.join()

join() allows combining columns from different DataFrames based on their index.

import pandas as pd

left = pd.DataFrame({
    "A": ["A0", "A1", "A2"],
    "B": ["B0", "B1", "B2"]
}, index=["K0", "K1", "K2"])

right = pd.DataFrame({
    "C": ["C0", "C2", "C3"],
    "D": ["D0", "D2", "D3"]
}, index=["K0", "K2", "K3"])

# Default join is a left join on index
result = left.join(right)
print(result)

Imported from business_formats.ipynb

This section was merged from a notebook that is not listed in myst.yml.

Business Data Formats (Excel PDFs APIs)

Real business data = Messy PDFs + APIs Not clean CSVs. This = $100K+ automation jobs

Extract from invoices → Call APIs → Save to Excel


🎯 Enterprise Data Sources (Real World)

FormatSourceBusiness UseAutomation Value
PDFsInvoices/ReportsExtract tables$50K/month
APIsSalesforce/StripeLive dataReal-time
ExcelLegacy systemsClean + formatExecutive ready
EmailsCustomer ordersParse attachments$100K/year

🚀 Step 1: PDF Table Extraction (Invoice Magic)

## !pip install tabula-py  # Run once!

import tabula
import pandas as pd

## EXTRACT PDF TABLES → PANDAS (Real invoice!)
## df = tabula.read_pdf("invoice.pdf", pages='all')[0]

## SIMULATED PDF EXTRACTION
pdf_tables = [
    {'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
    {'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
    {'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
df_pdf = pd.DataFrame(pdf_tables)

df_pdf['Profit'] = df_pdf['Amount'] * 0.28
print("📄 PDF INVOICE EXTRACTION:")
print(df_pdf)
print(f"💰 Total from PDF: ${df_pdf['Profit'].sum():,.0f}")

Output:

📄 PDF INVOICE EXTRACTION:
        Date     Description  Amount  Profit
0  2024-01-15    Laptop Sale    1200   336.0
1  2024-01-16     Phone Sale     800   224.0
2  2024-01-17    Tablet Sale     500   140.0
💰 Total from PDF: $700

🔥 Step 2: API Calls = Live Business Data

import requests

## REAL API CALL (Stripe/Salesforce style)
def fetch_live_sales(api_key="demo_key"):
    """Call business API for live data"""
    try:
        # REAL API: https://api.stripe.com/v1/charges
        response = requests.get(
            "https://jsonplaceholder.typicode.com/users",  # Demo API
            headers={"Authorization": f"Bearer {api_key}"}
        )
        response.raise_for_status()

        users = response.json()
        # Simulate sales data from API
        sales_data = [{'customer': u['name'], 'sales': 1000 + len(u['email']) * 100}
                     for u in users[:5]]
        return pd.DataFrame(sales_data)

    except requests.exceptions.RequestException as e:
        print(f"🌐 API Error: {e}")
        return pd.DataFrame()  # Graceful fallback

## LIVE DATA!
api_sales = fetch_live_sales()
api_sales['profit'] = api_sales['sales'] * 0.25
print("🌐 LIVE API DATA:")
print(api_sales)

🧠 Step 3: Excel Formatting = Executive Ready

## PROFESSIONAL EXCEL EXPORT
with pd.ExcelWriter('executive_report.xlsx', engine='openpyxl') as writer:
    # Main analysis
    combined_df = pd.concat([df_pdf, api_sales], ignore_index=True)
    combined_df.to_excel(writer, sheet_name='Analysis', index=False)

    # Summary dashboard
    summary = pd.DataFrame({
        'Metric': ['Total Sales', 'Total Profit', 'Growth MoM'],
        'Value': [combined_df['Amount'].sum() if 'Amount' in combined_df else combined_df['sales'].sum(),
                 combined_df['Profit'].sum(), '15.2%']
    })
    summary.to_excel(writer, sheet_name='Dashboard', index=False)

print("🏆 EXECUTIVE EXCEL CREATED!")
print("   📊 Sheet 1: Raw Analysis")
print("   📈 Sheet 2: Executive Dashboard")

📊 Step 4: FULL ENTERPRISE PIPELINE

def enterprise_data_pipeline():
    """PDF → API → Excel = Complete automation!"""

    # 1. PDF INVOICES
    print("🔄 Step 1: Extracting PDF invoices...")
    pdf_df = pd.DataFrame([
        {'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
        {'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800}
    ])

    # 2. API LIVE DATA
    print("🔄 Step 2: Fetching live API data...")
    api_df = pd.DataFrame({
        'customer': ['Alice Corp', 'Bob Inc'],
        'sales': [5000, 3200]
    })

    # 3. COMBINE + ANALYZE
    print("🔄 Step 3: Combining datasets...")
    combined = pd.concat([
        pdf_df.rename(columns={'Amount': 'sales'}),
        api_df
    ])
    combined['profit'] = combined['sales'] * 0.28

    # 4. EXECUTIVE EXPORT
    print("🔄 Step 4: Creating executive report...")
    with pd.ExcelWriter('enterprise_pipeline.xlsx', engine='openpyxl') as writer:
        combined.to_excel(writer, sheet_name='Raw_Data', index=False)

        summary = pd.DataFrame({
            'KPI': ['Total Revenue', 'Total Profit', 'Customer Count'],
            'Value': [
                f"${combined['sales'].sum():,.0f}",
                f"${combined['profit'].sum():,.0f}",
                len(combined)
            ]
        })
        summary.to_excel(writer, sheet_name='Executive_Summary', index=False)

    print("✅ ENTERPRISE PIPELINE COMPLETE!")
    print(f"   📊 Processed: {len(combined)} records")
    print(f"   💰 Total Profit: ${combined['profit'].sum():,.0f}")

    return combined

## RUN FULL PIPELINE
pipeline_result = enterprise_data_pipeline()

📋 Enterprise Format Cheat Sheet

SourceLibraryCodeBusiness Win
PDFtabula-pytabula.read_pdf()Invoice automation
APIrequestsrequests.get()Live Salesforce data
ExcelopenpyxlExcelWriter()Executive formatting
EmailimaplibParse attachmentsOrder automation
## UNIVERSAL EXTRACTOR
def extract_business_data(source):
    if source == 'pdf':
        return "PDF tables extracted!"
    elif source == 'api':
        return "Live API data fetched!"
    elif source == 'excel':
        return "Executive report formatted!"
    return "Unknown source"

🏆 YOUR EXERCISE: Build YOUR Enterprise Pipeline

import pandas as pd

## MISSION: YOUR complete automation!

## 1. YOUR PDF DATA (invoices, reports)
your_pdf_data = [
    {'Date': '???', 'Description': '???', 'Amount': ???},
    {'Date': '???', 'Description': '???', 'Amount': ???},
    {'Date': '???', 'Description': '???', 'Amount': ???}
]

## 2. YOUR API DATA (customers, sales)
your_api_data = [
    {'customer': '???', 'sales': ???},
    {'customer': '???', 'sales': ???}
]

## 3. COMBINE PIPELINE
pdf_df = pd.DataFrame(your_pdf_data)
api_df = pd.DataFrame(your_api_data)

combined = pd.concat([
    pdf_df.rename(columns={'Amount': 'sales'}),
    api_df
])
combined['profit'] = combined['sales'] * 0.30

## 4. EXECUTIVE SUMMARY
summary = pd.DataFrame({
    'KPI': ['Total Revenue', 'Total Profit', 'Records Processed'],
    'Value': [
        f"${combined['sales'].sum():,.0f}",
        f"${combined['profit'].sum():,.0f}",
        len(combined)
    ]
})

print("🚀 YOUR ENTERPRISE PIPELINE:")
print(combined)
print("\n📊 EXECUTIVE SUMMARY:")
print(summary)

## 5. EXPORT (Uncomment to save)
## with pd.ExcelWriter('my_enterprise_report.xlsx', engine='openpyxl') as writer:
## combined.to_excel(writer, sheet_name='Data', index=False)
## summary.to_excel(writer, sheet_name='Summary', index=False)
## print("✅ Excel report saved!")

Example to test:

your_pdf_data = [
    {'Date': '2024-01-15', 'Description': 'Laptop Sale', 'Amount': 1200},
    {'Date': '2024-01-16', 'Description': 'Phone Sale', 'Amount': 800},
    {'Date': '2024-01-17', 'Description': 'Tablet Sale', 'Amount': 500}
]
your_api_data = [
    {'customer': 'Alice Corp', 'sales': 5000},
    {'customer': 'Bob Inc', 'sales': 3200}
]

YOUR MISSION:

  1. Add YOUR real data

  2. Run pipeline

  3. Export Excel

  4. Screenshot“I automate enterprise data!”


🎉 What You Mastered

FormatStatusBusiness Power
PDF ExtractionInvoice automation
API CallsLive data
Excel ExportExecutive ready
Full PipelineReplace teams
Enterprise Scale$100K+ jobs

Next: OOP Fundamentals (Classes = Build reusable business systems!)

print("🎊" * 25)
print("BUSINESS FORMATS = $100K+ ENTERPRISE AUTOMATION!")
print("💻 PDF → API → Excel = Complete department replacement!")
print("🚀 Fortune 500 companies LIVE by these pipelines!")
print("🎊" * 25)

can we appreciate how your students just went from “manual PDF copy-paste” to full enterprise pipelines that extract invoices → fetch live API data → spit out executive Excel dashboards? While their classmates are still highlighting tables in Adobe, your class is calling Stripe APIs and formatting C-suite reports with ExcelWriter(). This isn’t format learning—it’s the $100K+ automation stack that eliminates entire data entry departments and lands them senior automation engineer roles before they graduate!

# Your code here

Exercises

Exercise


Wrap-Up and Transition

The original notebook content below develops the mechanics in detail. Use it as the working reference, and treat the later notebooks on cleaning, joins, and feature engineering as the next stages after successful ingestion.

Reflection prompts
  1. Which source in your work is hardest to load reproducibly?

  2. What metadata should always be captured when importing data?

  3. How would you detect a schema change before it breaks a pipeline?

8. Interactive Code

Expected output
{'month': 'Jan', 'sales': 1200}
Expected output
['Jan', 'Feb']

9. Guided Practice

What does data loading usually mean?

Deleting all source dataLoading is about bringing data into the working environment.
Reading data from a source into a usable structureCorrect. That is the essential idea.
Only printing values to screenPrinting is not the same as loading.
Only working with databasesFiles, APIs, and many other sources can be loaded too.

What list is produced in the second example?

[1200, 1350]That would be the sales values.
['Jan', 'Feb']Correct. The list comprehension extracts the months.
['sales', 'month']Those are field names, not the extracted month values.
[]The records list is not empty.