Data Loading (CSV, Excel, SQL, APIs)#

Welcome to the Great Data Buffet, where we feast on CSVs, Excel sheets, SQL databases, and even APIs.

Because before we can do Machine Learning, we first have to do Machine Feeding. 🤖🥄


🧠 Why This Matters#

Your ML model is only as good as the data you feed it. Think of this chapter as learning how to open the right doors to the right kitchens — because business data hides everywhere:

  • Finance teams live in Excel

  • Engineers hide data in SQL databases

  • Marketing pulls it from APIs

  • And sometimes… it’s just a CSV file on someone’s desktop named “final_version_9_REAL.csv”

💡 If your dataset name has the word final in it, it’s never final.


⚙️ Prerequisite: Python & Pandas#

We’ll be using the pandas library — the king of data handling in Python. If you’re not comfortable with it yet, grab a quick refresher from my other book: 👉 📘 Programming for Business


📂 1. Loading CSV Files — “The Business Standard”#

CSV (Comma-Separated Values) files are the backbone of corporate analytics. They’re lightweight, universal, and occasionally chaotic.

Example:#

import pandas as pd

df = pd.read_csv("sales_data.csv")
print(df.head())

🧩 Pro Tip: Use df.info() to check data types and see if pandas guessed correctly.

Common Issues:#

| Problem | Fix |
| --- | --- |
| Wrong delimiter | `pd.read_csv('file.txt', sep='\t')` |
| Encoding error | `encoding='utf-8'` or `'latin1'` |
| Missing headers | `header=None` |
| Big file | `chunksize=10000` to process in parts |
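The `chunksize` option from the table deserves a quick demo. This sketch first writes a throwaway `sales_data.csv` (the file name and columns are invented) so it runs anywhere, then streams the file back in 2,000-row chunks:

```python
import pandas as pd

# Build a small demo CSV so the sketch is self-contained;
# in practice you would point read_csv at your real (large) file.
pd.DataFrame({"order_id": range(1, 10_001), "amount": 5.0}).to_csv(
    "sales_data.csv", index=False
)

# chunksize makes read_csv return an iterator of DataFrames instead of
# one big frame, so memory use stays flat no matter how large the file is.
total_rows = 0
for chunk in pd.read_csv("sales_data.csv", chunksize=2_000):
    total_rows += len(chunk)

print(total_rows)  # 10000
```

Each `chunk` is an ordinary DataFrame, so any aggregation you can do on the whole file you can accumulate chunk by chunk.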

💬 “If Excel can open it, pandas can too — faster and with fewer crashes.”


📊 2. Loading Excel Files — “The Corporate Comfort Zone”#

Business runs on Excel — from invoices to entire company strategies.

Example:#

df = pd.read_excel("revenue_report.xlsx", sheet_name="Q1")
print(df.head())

🧠 Tip: Pass sheet_name=None to load every sheet at once — you get a dict mapping sheet names to DataFrames, which is also a quick way to see what sheets exist.

Multi-Sheet Example:#

all_sheets = pd.read_excel("revenue_report.xlsx", sheet_name=None)
for name, sheet in all_sheets.items():
    print(f"Sheet {name} has {len(sheet)} rows.")

💬 Excel: where “data model” means 47 columns, 6 merged cells, and one hidden formula from 2013.


🧰 3. Loading SQL Data — “The Hidden Goldmine”#

Many companies store mission-critical data in SQL databases — clean, structured, and surprisingly friendly once you meet them.

Example (SQLite demo):#

import sqlite3
import pandas as pd

conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM customers LIMIT 5;", conn)
print(df.head())
conn.close()

🧩 Use sqlalchemy for production connections (MySQL, PostgreSQL, etc.).

| Source | Connection Example |
| --- | --- |
| SQLite | `sqlite3.connect("data.db")` |
| MySQL | `mysql+pymysql://user:password@host/db` |
| PostgreSQL | `postgresql://user:password@host/db` |
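A minimal SQLAlchemy sketch, using an in-memory SQLite URL so it runs without any server; for MySQL or PostgreSQL you would swap in the connection strings from the table above. The table name and columns here are made up for the demo:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite keeps the demo self-contained; a production URL
# would look like "postgresql://user:password@host/db".
engine = create_engine("sqlite:///:memory:")

# Write a tiny demo table, then read it back the way you would in production.
pd.DataFrame({"CustomerID": [1, 2], "Name": ["Alice", "Bob"]}).to_sql(
    "customers", engine, index=False
)
df = pd.read_sql("SELECT * FROM customers", engine)
print(df)
```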

💬 “SQL is like Excel’s cooler cousin who actually organizes things.”


🌐 4. Loading Data from APIs — “The Live Data Stream”#

APIs (Application Programming Interfaces) let you pull fresh data straight from online services — no more static CSVs!

Example:#

import requests
import pandas as pd

url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly if the request didn't succeed
data = response.json()

df = pd.DataFrame(list(data['rates'].items()), columns=['Currency', 'Exchange Rate'])
print(df.head())

💡 APIs often return JSON — pandas + pd.json_normalize() is your friend.
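To see what pd.json_normalize() actually does, here is a hand-made nested record (the structure is illustrative, not from a real API):

```python
import pandas as pd

# A nested payload like those many JSON APIs return (invented for the demo).
payload = {
    "base": "USD",
    "rates": {"EUR": 0.92, "GBP": 0.79},
    "meta": {"source": "demo"},
}

# Nested dicts become flat columns with dotted names like "rates.EUR".
flat = pd.json_normalize(payload)
print(sorted(flat.columns))
```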


🧩 Practice Corner: “Find the Source!”#

Match each situation with the right loading tool:

| Scenario | Tool |
| --- | --- |
| You got a file from accounting in `.xlsx` | `pd.read_excel()` |
| The sales system has an SQL backend | `pd.read_sql()` |
| Marketing dashboard exports `.csv` | `pd.read_csv()` |
| Finance API gives daily data | `requests.get()` + `pd.DataFrame()` |

Pro tip: The goal isn’t just to load data — it’s to load it reproducibly.
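One way to make loading reproducible is to wrap each source in a small function so the parsing options live in code, not in someone's memory. A minimal sketch (the file name and columns are made up):

```python
import pandas as pd

def load_sales(path: str) -> pd.DataFrame:
    # All parsing decisions live in one place, so every notebook
    # loads the file exactly the same way.
    return pd.read_csv(path, parse_dates=["Date"])

# Demo: write a tiny file, then load it through the function.
pd.DataFrame({"Date": ["2025-01-01", "2025-01-02"], "Amount": [100, 150]}).to_csv(
    "sales.csv", index=False
)
sales = load_sales("sales.csv")
print(sales.dtypes)
```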


🧾 5. Handling Common Data Loading Drama#

| Issue | Cause | Fix |
| --- | --- | --- |
| Ragged or empty columns | Extra commas in CSV | `on_bad_lines='skip'` (replaces the removed `error_bad_lines=False`) |
| Header mismatch | Wrong row as header | `header=1` |
| Strange dates | Format confusion | `parse_dates=['Date']` |
| Duplicate rows | Export error | `df.drop_duplicates()` |
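Two of those fixes in action, on an inline CSV (the contents are invented for the demo):

```python
import pandas as pd
from io import StringIO

# One duplicated row, and a date column stored as text.
raw = StringIO("Date,Amount\n2025-01-01,100\n2025-01-01,100\n2025-01-02,150\n")

df = pd.read_csv(raw, parse_dates=["Date"])  # 'Date' becomes datetime64, not object
df = df.drop_duplicates()                    # remove the exported-twice row
print(df)
```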

💬 Pandas doesn’t judge bad data — it just cleans it silently (unlike your boss).


🧩 Quick Practice — “Business Data Sampler”#

# Load three mini datasets
sales = pd.read_csv("sales.csv")
customers = pd.read_excel("customers.xlsx")
rates = pd.read_json("exchange_rates.json")

# Peek at each
print(sales.shape, customers.shape, rates.shape)

🧠 Try joining them later when we reach the “Data Cleaning & Preprocessing” section.


🧭 6. Recap#

✅ CSV → your everyday data sandwich
✅ Excel → for legacy systems and CFOs
✅ SQL → for scalable, reliable data
✅ APIs → for real-time freshness

🎯 You can now load any data type that the business throws at you — even that mystery file from “Final_Version_Really_Final_3.xlsx”.


🔜 Next Stop#

👉 Head to Data Cleaning & Preprocessing where we roll up our sleeves for Digital Laundry Day 🧺 — removing typos, missing values, and other spreadsheet sins.

🧩 Combining Datasets in Python — concat, merge, and join#

🧠 1. Why Combine Datasets?#

In real-world business or analytics scenarios, data rarely comes from one source. You may have:

  • Monthly sales data stored separately for each month.

  • Customer details in one file and purchase records in another.

  • Financial metrics split by department.

Combining these into a single table makes analysis much easier.


🧮 Using NumPy#

NumPy is useful when dealing with raw numeric data (not labeled like pandas DataFrames).


2. Concatenation in NumPy#

np.concatenate() joins arrays along a specified axis.

import numpy as np

# Create sample arrays
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# Stack vertically (add rows)
vertical = np.concatenate((a, b), axis=0)
print("Vertical Concatenation:\n", vertical)

# Stack horizontally (add columns)
c = np.array([[7], [8], [9]])
horizontal = np.concatenate((np.vstack((a, b)), c), axis=1)
print("\nHorizontal Concatenation:\n", horizontal)

Output:

Vertical Concatenation:
[[1 2]
 [3 4]
 [5 6]]

Horizontal Concatenation:
[[1 2 7]
 [3 4 8]
 [5 6 9]]

3. Stacking Helpers in NumPy#

NumPy has shortcuts for cleaner syntax:

np.vstack((a, b))    # Vertical stack (add rows)
np.hstack((a, b.T))  # Horizontal stack (b.T has the matching row count)
np.column_stack((np.array([1, 2]), np.array([3, 4])))  # Pair equal-length 1-D arrays as columns

🧾 Using pandas#

pandas offers concat, merge, and join — each useful for different tasks.


4. Concatenation with pd.concat()#

Used to stack DataFrames vertically or horizontally.

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [4, 5], 'Name': ['David', 'Emma']})

# Vertical concatenation (same columns)
result = pd.concat([df1, df2])
print(result)

Output:

   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie
0   4    David
1   5     Emma

You can reset the index:

result = pd.concat([df1, df2], ignore_index=True)

🔄 Horizontal concatenation#

If you want to combine side by side:

pd.concat([df1, df2], axis=1)
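Note that axis=1 aligns rows by index label, not by position. With the df1 and df2 defined above, df1's third row (index 2) has no partner in df2, so its new columns come back as NaN:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [4, 5], "Name": ["David", "Emma"]})

# Index 2 exists only in df1, so df2's columns are NaN on that row.
wide = pd.concat([df1, df2], axis=1)
print(wide)
```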

5. Merging with pd.merge()#

merge() is similar to SQL joins — you can merge based on common columns or indices.

Example 1: Simple Inner Join#

sales = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Amount': [200, 150, 300]
})

customers = pd.DataFrame({
    'CustomerID': [1, 2, 4],
    'Name': ['Alice', 'Bob', 'David']
})

merged = pd.merge(sales, customers, on='CustomerID', how='inner')
print(merged)

Output:

   CustomerID  Amount   Name
0           1     200  Alice
1           2     150    Bob

Example 2: Left, Right, Outer Joins#

pd.merge(sales, customers, on='CustomerID', how='left')
pd.merge(sales, customers, on='CustomerID', how='right')
pd.merge(sales, customers, on='CustomerID', how='outer')

| Join Type | Keeps Rows From | Example Use |
| --- | --- | --- |
| inner | Both tables | Only matching customers |
| left | Left table | All sales, even missing customers |
| right | Right table | All customers, even missing sales |
| outer | Both | All records, fills missing with NaN |
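When debugging a join, merge's indicator=True option (a standard pandas parameter) adds a _merge column telling you which table each row came from — handy for spotting unmatched keys:

```python
import pandas as pd

sales = pd.DataFrame({"CustomerID": [1, 2, 3], "Amount": [200, 150, 300]})
customers = pd.DataFrame({"CustomerID": [1, 2, 4], "Name": ["Alice", "Bob", "David"]})

# _merge is 'both', 'left_only', or 'right_only' for each row.
audit = pd.merge(sales, customers, on="CustomerID", how="outer", indicator=True)
print(audit)
```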


6. Joining with .join()#

.join() is a simpler method for joining on index (like Excel’s VLOOKUP).

left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
right = pd.DataFrame({'City': ['Delhi', 'Bangalore', 'Mumbai']}, index=['Alice', 'Bob', 'Charlie'])

.join() aligns rows by index, so put both frames on the same index first:

# right is already indexed by name, so index left by Name before joining
joined = left.set_index('Name').join(right)
print(joined)

Output:

          Age      City
Name
Alice      25     Delhi
Bob        30  Bangalore
Charlie    35    Mumbai

7. Combine by Column Keys (Multiple Keys)#

You can merge using multiple columns:

df1 = pd.DataFrame({
    'Region': ['North', 'South', 'North'],
    'Year': [2023, 2023, 2024],
    'Sales': [200, 150, 220]
})

df2 = pd.DataFrame({
    'Region': ['North', 'South', 'West'],
    'Year': [2023, 2023, 2024],
    'Profit': [50, 40, 70]
})

merged = pd.merge(df1, df2, on=['Region', 'Year'], how='outer')
print(merged)

Output:

  Region  Year  Sales  Profit
0  North  2023  200.0    50.0
1  South  2023  150.0    40.0
2  North  2024  220.0     NaN
3   West  2024    NaN    70.0

8. Summary Table#

| Method | Library | Purpose | Example |
| --- | --- | --- | --- |
| `np.concatenate()` | NumPy | Stack arrays vertically/horizontally | Combine numeric arrays |
| `pd.concat()` | pandas | Combine DataFrames by rows/columns | Append monthly data |
| `pd.merge()` | pandas | SQL-style join on keys | Combine customer & sales info |
| `.join()` | pandas | Join by index | Combine side tables easily |


9. Practice Exercises#

  1. Concatenate two arrays using NumPy vertically and horizontally.

  2. Merge two DataFrames — orders and customers — using an inner join.

  3. Perform an outer join to include all rows even if missing in one table.

  4. Join two DataFrames on index (try .set_index() + .join()).

  5. Create a dataset of quarterly sales for each region and combine them into one final table using pd.concat().


Data#


The Latin word data is the plural of datum, “(thing) given,” and the neuter past participle of dare, “to give”. The first English use of the word “data” is from the 1640s. The word “data” was first used to mean “transmissible and storable computer information” in 1946. The expression “data processing” was first used in 1954 (https://en.wikipedia.org/wiki/Data).


These are some of the different types of data: Geographical, Cultural, Scientific, Financial, Statistical, Meteorological, Natural, Transport


1. Types of Data#


  • Description: Understanding the different forms of data: numerical (continuous, discrete), categorical (nominal, ordinal), and time-series.

  • Relevance: Different data types require specific handling in forecasting models (e.g., time-series models for stock prices, classification models for credit risk).

  • Related Concepts:

    • Data Scales: Nominal, ordinal, interval, and ratio scales.

    • Data Sources: Primary (e.g., company sales records) vs. secondary (e.g., market indices).

    • Time-Series Components: Trend, seasonality, and noise in financial data.

  • Example: Numerical: Stock prices (continuous). Categorical: Industry sectors (nominal). Time-series: Daily sales.

  • Mathematical Context: Time-series data is often modeled additively as $y_t = T_t + S_t + \epsilon_t$, where $T_t$ is the trend, $S_t$ is the seasonal component, and $\epsilon_t$ is random noise.

  • Teaching Tip: Use real-world examples (e.g., stock prices vs. customer ratings) and ask students to classify data types from a sample dataset.

| Type | Description | Example |
| --- | --- | --- |
| Numerical | Quantitative values | Stock price, volume |
| Categorical | Distinct groups | Sector: IT, Pharma, Banking |
| Time Series | Indexed by time | Daily closing prices |
| Textual | Unstructured data | News headlines, financial reports |
| Ordinal | Ordered categories | Credit ratings: AAA, AA, A, BBB… |


pip install faker
import numpy as np
import pandas as pd
from faker import Faker
import datetime

# Set random seed for reproducibility
np.random.seed(42)

# Initialize Faker for generating realistic textual data
fake = Faker()

# Number of samples
n_samples = 100

# 1. Numerical Data (Continuous and Discrete)
# Continuous: Stock prices (e.g., normally distributed around $100 with some variation)
stock_prices = np.random.normal(loc=100, scale=10, size=n_samples).round(2)
# Discrete: Trading volume (e.g., random integers between 1000 and 10000)
trading_volumes = np.random.randint(1000, 10000, size=n_samples)

# 2. Categorical Data (Nominal)
# Sectors: Randomly choose from IT, Pharma, Banking, Energy
sectors = np.random.choice(['IT', 'Pharma', 'Banking', 'Energy'], size=n_samples)

# 3. Time Series Data
# Generate daily dates starting from Jan 1, 2025
start_date = datetime.datetime(2025, 1, 1)
dates = [start_date + datetime.timedelta(days=i) for i in range(n_samples)]
# Time-series: Daily closing prices with trend, seasonality, and noise
trend = np.linspace(90, 110, n_samples)  # Linear trend from 90 to 110
seasonality = 5 * np.sin(np.linspace(0, 2 * np.pi * 10, n_samples))  # Seasonal component
noise = np.random.normal(0, 2, n_samples)  # Random noise
closing_prices = (trend + seasonality + noise).round(2)

# 4. Textual Data
# Generate fake financial news headlines
headlines = [fake.sentence(nb_words=10) for _ in range(n_samples)]

# 5. Ordinal Data
# Credit ratings: Ordered categories AAA, AA, A, BBB
credit_ratings = np.random.choice(['AAA', 'AA', 'A', 'BBB'], size=n_samples, p=[0.1, 0.3, 0.4, 0.2])

# Create a DataFrame to hold all sample data
data = {
    'Date': dates,
    'Stock_Price': stock_prices,
    'Trading_Volume': trading_volumes,
    'Sector': sectors,
    'Closing_Price': closing_prices,
    'News_Headline': headlines,
    'Credit_Rating': credit_ratings
}
df = pd.DataFrame(data)
df.head()
Date Stock_Price Trading_Volume Sector Closing_Price News_Headline Credit_Rating
0 2025-01-01 104.97 5413 IT 92.17 Focus a place building me you way security som... A
1 2025-01-02 98.62 4748 Energy 95.27 Really lay future imagine perform such trial s... BBB
2 2025-01-03 106.48 1663 Energy 92.42 Mother seat financial attorney left meeting fo... BBB
3 2025-01-04 115.23 2998 Energy 93.46 Down maintain standard recognize TV relate off... A
4 2025-01-05 97.66 8994 Energy 94.67 Board market morning western feel whether pape... AA
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Closing_Price'], label='Daily Closing Price')
plt.title('Sample Time-Series: Daily Closing Prices')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()
# Your code here