Data Loading (CSV, Excel, SQL, APIs)#

Welcome to the Great Data Buffet, where we feast on CSVs, Excel sheets, SQL databases, and even APIs.

Because before we can do Machine Learning, we first have to do Machine Feeding. 🤖🥄


🧠 Why This Matters#

Your ML model is only as good as the data you feed it. Think of this chapter as learning how to open the right doors to the right kitchens — because business data hides everywhere:

  • Finance teams live in Excel

  • Engineers hide data in SQL databases

  • Marketing pulls it from APIs

  • And sometimes… it’s just a CSV file on someone’s desktop named “final_version_9_REAL.csv”

💡 If your dataset name has the word final in it, it’s never final.


⚙️ Prerequisite: Python & Pandas#

We’ll be using the pandas library — the king of data handling in Python. If you’re not comfortable with it yet, grab a quick refresher from my other book: 👉 📘 Programming for Business


📂 1. Loading CSV Files — “The Business Standard”#

CSV (Comma-Separated Values) files are the backbone of corporate analytics. They’re lightweight, universal, and occasionally chaotic.

Example:#

import pandas as pd

df = pd.read_csv("sales_data.csv")
print(df.head())

🧩 Pro Tip: Use df.info() to check data types and see if pandas guessed correctly.

Common Issues:#

| Problem | Fix |
| --- | --- |
| Wrong delimiter | `pd.read_csv('file.txt', sep='\t')` |
| Encoding error | `encoding='utf-8'` or `'latin1'` |
| Missing headers | `header=None` |
| Big file | `chunksize=10000` to process in parts |
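The `chunksize` option from the table deserves a quick demo. This sketch first writes a throwaway `sales_data.csv` (the file name and columns are invented) so it runs anywhere, then streams the file back in 2,000-row chunks:

```python
import pandas as pd

# Build a small demo CSV so the sketch is self-contained;
# in practice you would point read_csv at your real (large) file.
pd.DataFrame({"order_id": range(1, 10_001), "amount": 5.0}).to_csv(
    "sales_data.csv", index=False
)

# chunksize makes read_csv return an iterator of DataFrames instead of
# one big frame, so memory use stays flat no matter how large the file is.
total_rows = 0
for chunk in pd.read_csv("sales_data.csv", chunksize=2_000):
    total_rows += len(chunk)

print(total_rows)  # 10000
```

Each `chunk` is an ordinary DataFrame, so any aggregation you can do on the whole file you can accumulate chunk by chunk.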

💬 “If Excel can open it, pandas can too — faster and with fewer crashes.”


📊 2. Loading Excel Files — “The Corporate Comfort Zone”#

Business runs on Excel — from invoices to entire company strategies.

Example:#

df = pd.read_excel("revenue_report.xlsx", sheet_name="Q1")
print(df.head())

🧠 Tip: Pass sheet_name=None to load every sheet at once — you get a dict mapping sheet names to DataFrames, which is also a quick way to see what sheets exist.

Multi-Sheet Example:#

all_sheets = pd.read_excel("revenue_report.xlsx", sheet_name=None)
for name, sheet in all_sheets.items():
    print(f"Sheet {name} has {len(sheet)} rows.")

💬 Excel: where “data model” means 47 columns, 6 merged cells, and one hidden formula from 2013.


🧰 3. Loading SQL Data — “The Hidden Goldmine”#

Many companies store mission-critical data in SQL databases — clean, structured, and surprisingly friendly once you meet them.

Example (SQLite demo):#

import sqlite3
import pandas as pd

conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM customers LIMIT 5;", conn)
print(df.head())
conn.close()

🧩 Use sqlalchemy for production connections (MySQL, PostgreSQL, etc.).

| Source | Connection Example |
| --- | --- |
| SQLite | `sqlite3.connect("data.db")` |
| MySQL | `mysql+pymysql://user:password@host/db` |
| PostgreSQL | `postgresql://user:password@host/db` |
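A minimal SQLAlchemy sketch, using an in-memory SQLite URL so it runs without any server; for MySQL or PostgreSQL you would swap in the connection strings from the table above. The table name and columns here are made up for the demo:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite keeps the demo self-contained; a production URL
# would look like "postgresql://user:password@host/db".
engine = create_engine("sqlite:///:memory:")

# Write a tiny demo table, then read it back the way you would in production.
pd.DataFrame({"CustomerID": [1, 2], "Name": ["Alice", "Bob"]}).to_sql(
    "customers", engine, index=False
)
df = pd.read_sql("SELECT * FROM customers", engine)
print(df)
```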

💬 “SQL is like Excel’s cooler cousin who actually organizes things.”


🌐 4. Loading Data from APIs — “The Live Data Stream”#

APIs (Application Programming Interfaces) let you pull fresh data straight from online services — no more static CSVs!

Example:#

import requests
import pandas as pd

url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly if the request didn't succeed
data = response.json()

df = pd.DataFrame(list(data['rates'].items()), columns=['Currency', 'Exchange Rate'])
print(df.head())

💡 APIs often return JSON — pandas + pd.json_normalize() is your friend.
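To see what pd.json_normalize() actually does, here is a hand-made nested record (the structure is illustrative, not from a real API):

```python
import pandas as pd

# A nested payload like those many JSON APIs return (invented for the demo).
payload = {
    "base": "USD",
    "rates": {"EUR": 0.92, "GBP": 0.79},
    "meta": {"source": "demo"},
}

# Nested dicts become flat columns with dotted names like "rates.EUR".
flat = pd.json_normalize(payload)
print(sorted(flat.columns))
```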


🧩 Practice Corner: “Find the Source!”#

Match each situation with the right loading tool:

| Scenario | Tool |
| --- | --- |
| You got a file from accounting in `.xlsx` | `pd.read_excel()` |
| The sales system has an SQL backend | `pd.read_sql()` |
| Marketing dashboard exports `.csv` | `pd.read_csv()` |
| Finance API gives daily data | `requests.get()` + `pd.DataFrame()` |

Pro tip: The goal isn’t just to load data — it’s to load it reproducibly.
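One way to make loading reproducible is to wrap each source in a small function so the parsing options live in code, not in someone's memory. A minimal sketch (the file name and columns are made up):

```python
import pandas as pd

def load_sales(path: str) -> pd.DataFrame:
    # All parsing decisions live in one place, so every notebook
    # loads the file exactly the same way.
    return pd.read_csv(path, parse_dates=["Date"])

# Demo: write a tiny file, then load it through the function.
pd.DataFrame({"Date": ["2025-01-01", "2025-01-02"], "Amount": [100, 150]}).to_csv(
    "sales.csv", index=False
)
sales = load_sales("sales.csv")
print(sales.dtypes)
```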


🧾 5. Handling Common Data Loading Drama#

| Issue | Cause | Fix |
| --- | --- | --- |
| Ragged or empty columns | Extra commas in CSV | `on_bad_lines='skip'` (replaces the removed `error_bad_lines=False`) |
| Header mismatch | Wrong row as header | `header=1` |
| Strange dates | Format confusion | `parse_dates=['Date']` |
| Duplicate rows | Export error | `df.drop_duplicates()` |
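Two of those fixes in action, on an inline CSV (the contents are invented for the demo):

```python
import pandas as pd
from io import StringIO

# One duplicated row, and a date column stored as text.
raw = StringIO("Date,Amount\n2025-01-01,100\n2025-01-01,100\n2025-01-02,150\n")

df = pd.read_csv(raw, parse_dates=["Date"])  # 'Date' becomes datetime64, not object
df = df.drop_duplicates()                    # remove the exported-twice row
print(df)
```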

💬 Pandas doesn’t judge bad data — it just cleans it silently (unlike your boss).


🧩 Quick Practice — “Business Data Sampler”#

# Load three mini datasets
sales = pd.read_csv("sales.csv")
customers = pd.read_excel("customers.xlsx")
rates = pd.read_json("exchange_rates.json")

# Peek at each
print(sales.shape, customers.shape, rates.shape)

🧠 Try joining them later when we reach the “Data Cleaning & Preprocessing” section.


🧭 6. Recap#

✅ CSV → your everyday data sandwich
✅ Excel → for legacy systems and CFOs
✅ SQL → for scalable, reliable data
✅ APIs → for real-time freshness

🎯 You can now load any data type that the business throws at you — even that mystery file from “Final_Version_Really_Final_3.xlsx”.


🔜 Next Stop#

👉 Head to Data Cleaning & Preprocessing where we roll up our sleeves for Digital Laundry Day 🧺 — removing typos, missing values, and other spreadsheet sins.

🧩 Combining Datasets in Python — concat, merge, and join#

🧠 1. Why Combine Datasets?#

In real-world business or analytics scenarios, data rarely comes from one source. You may have:

  • Monthly sales data stored separately for each month.

  • Customer details in one file and purchase records in another.

  • Financial metrics split by department.

Combining these into a single table makes analysis much easier.


🧮 Using NumPy#

NumPy is useful when dealing with raw numeric data (not labeled like pandas DataFrames).


2. Concatenation in NumPy#

np.concatenate() joins arrays along a specified axis.

import numpy as np

# Create sample arrays
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

# Stack vertically (add rows)
vertical = np.concatenate((a, b), axis=0)
print("Vertical Concatenation:\n", vertical)

# Stack horizontally (add columns)
c = np.array([[7], [8], [9]])
horizontal = np.concatenate((np.vstack((a, b)), c), axis=1)
print("\nHorizontal Concatenation:\n", horizontal)

Output:

Vertical Concatenation:
[[1 2]
 [3 4]
 [5 6]]

Horizontal Concatenation:
[[1 2 7]
 [3 4 8]
 [5 6 9]]

3. Stacking Helpers in NumPy#

NumPy has shortcuts for cleaner syntax:

np.vstack((a, b))    # Vertical stack (add rows)
np.hstack((a, b.T))  # Horizontal stack (b.T has the matching row count)
np.column_stack((np.array([1, 2]), np.array([3, 4])))  # Pair equal-length 1-D arrays as columns

🧾 Using pandas#

pandas offers concat, merge, and join — each useful for different tasks.


4. Concatenation with pd.concat()#

Used to stack DataFrames vertically or horizontally.

import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [4, 5], 'Name': ['David', 'Emma']})

# Vertical concatenation (same columns)
result = pd.concat([df1, df2])
print(result)

Output:

   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie
0   4    David
1   5     Emma

You can reset the index:

result = pd.concat([df1, df2], ignore_index=True)

🔄 Horizontal concatenation#

If you want to combine side by side:

pd.concat([df1, df2], axis=1)
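Note that axis=1 aligns rows by index label, not by position. With the df1 and df2 defined above, df1's third row (index 2) has no partner in df2, so its new columns come back as NaN:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [4, 5], "Name": ["David", "Emma"]})

# Index 2 exists only in df1, so df2's columns are NaN on that row.
wide = pd.concat([df1, df2], axis=1)
print(wide)
```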

5. Merging with pd.merge()#

merge() is similar to SQL joins — you can merge based on common columns or indices.

Example 1: Simple Inner Join#

sales = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Amount': [200, 150, 300]
})

customers = pd.DataFrame({
    'CustomerID': [1, 2, 4],
    'Name': ['Alice', 'Bob', 'David']
})

merged = pd.merge(sales, customers, on='CustomerID', how='inner')
print(merged)

Output:

   CustomerID  Amount   Name
0           1     200  Alice
1           2     150    Bob

Example 2: Left, Right, Outer Joins#

pd.merge(sales, customers, on='CustomerID', how='left')
pd.merge(sales, customers, on='CustomerID', how='right')
pd.merge(sales, customers, on='CustomerID', how='outer')

| Join Type | Keeps Rows From | Example Use |
| --- | --- | --- |
| inner | Both tables | Only matching customers |
| left | Left table | All sales, even missing customers |
| right | Right table | All customers, even missing sales |
| outer | Both | All records, fills missing with NaN |
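When debugging a join, merge's indicator=True option (a standard pandas parameter) adds a _merge column telling you which table each row came from — handy for spotting unmatched keys:

```python
import pandas as pd

sales = pd.DataFrame({"CustomerID": [1, 2, 3], "Amount": [200, 150, 300]})
customers = pd.DataFrame({"CustomerID": [1, 2, 4], "Name": ["Alice", "Bob", "David"]})

# _merge is 'both', 'left_only', or 'right_only' for each row.
audit = pd.merge(sales, customers, on="CustomerID", how="outer", indicator=True)
print(audit)
```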


6. Joining with .join()#

.join() is a simpler method for joining on index (like Excel’s VLOOKUP).

left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
right = pd.DataFrame({'City': ['Delhi', 'Bangalore', 'Mumbai']}, index=['Alice', 'Bob', 'Charlie'])

.join() aligns rows by index, so put both frames on the same index first:

# right is already indexed by name, so index left by Name before joining
joined = left.set_index('Name').join(right)
print(joined)

Output:

          Age      City
Name
Alice      25     Delhi
Bob        30  Bangalore
Charlie    35    Mumbai

7. Combine by Column Keys (Multiple Keys)#

You can merge using multiple columns:

df1 = pd.DataFrame({
    'Region': ['North', 'South', 'North'],
    'Year': [2023, 2023, 2024],
    'Sales': [200, 150, 220]
})

df2 = pd.DataFrame({
    'Region': ['North', 'South', 'West'],
    'Year': [2023, 2023, 2024],
    'Profit': [50, 40, 70]
})

merged = pd.merge(df1, df2, on=['Region', 'Year'], how='outer')
print(merged)

Output:

  Region  Year  Sales  Profit
0  North  2023  200.0    50.0
1  South  2023  150.0    40.0
2  North  2024  220.0     NaN
3   West  2024    NaN    70.0

8. Summary Table#

| Method | Library | Purpose | Example |
| --- | --- | --- | --- |
| `np.concatenate()` | NumPy | Stack arrays vertically/horizontally | Combine numeric arrays |
| `pd.concat()` | pandas | Combine DataFrames by rows/columns | Append monthly data |
| `pd.merge()` | pandas | SQL-style join on keys | Combine customer & sales info |
| `.join()` | pandas | Join by index | Combine side tables easily |


9. Practice Exercises#

  1. Concatenate two arrays using NumPy vertically and horizontally.

  2. Merge two DataFrames — orders and customers — using an inner join.

  3. Perform an outer join to include all rows even if missing in one table.

  4. Join two DataFrames on index (try .set_index() + .join()).

  5. Create a dataset of quarterly sales for each region and combine them into one final table using pd.concat().


Data#


The Latin word data is the plural of datum, “(thing) given,” and the neuter past participle of dare, “to give”. The first English use of the word “data” is from the 1640s. The word “data” was first used to mean “transmissible and storable computer information” in 1946. The expression “data processing” was first used in 1954 (https://en.wikipedia.org/wiki/Data).


These are some of the different types of data: Geographical, Cultural, Scientific, Financial, Statistical, Meteorological, Natural, Transport


1. Types of Data#


  • Description: Understanding the different forms of data: numerical (continuous, discrete), categorical (nominal, ordinal), and time-series.

  • Relevance: Different data types require specific handling in forecasting models (e.g., time-series models for stock prices, classification models for credit risk).

  • Related Concepts:

    • Data Scales: Nominal, ordinal, interval, and ratio scales.

    • Data Sources: Primary (e.g., company sales records) vs. secondary (e.g., market indices).

    • Time-Series Components: Trend, seasonality, and noise in financial data.

  • Example: Numerical: Stock prices (continuous). Categorical: Industry sectors (nominal). Time-series: Daily sales.

  • Mathematical Context: Time-series data is often modeled additively as $y_t = T_t + S_t + \epsilon_t$, where $T_t$ is the trend, $S_t$ is the seasonal component, and $\epsilon_t$ is random noise.

  • Teaching Tip: Use real-world examples (e.g., stock prices vs. customer ratings) and ask students to classify data types from a sample dataset.

| Type | Description | Example |
| --- | --- | --- |
| Numerical | Quantitative values | Stock price, volume |
| Categorical | Distinct groups | Sector: IT, Pharma, Banking |
| Time Series | Indexed by time | Daily closing prices |
| Textual | Unstructured data | News headlines, financial reports |
| Ordinal | Ordered categories | Credit ratings: AAA, AA, A, BBB… |


pip install faker
import numpy as np
import pandas as pd
from faker import Faker
import datetime

# Set random seed for reproducibility
np.random.seed(42)

# Initialize Faker for generating realistic textual data
fake = Faker()

# Number of samples
n_samples = 100

# 1. Numerical Data (Continuous and Discrete)
# Continuous: Stock prices (e.g., normally distributed around $100 with some variation)
stock_prices = np.random.normal(loc=100, scale=10, size=n_samples).round(2)
# Discrete: Trading volume (e.g., random integers between 1000 and 10000)
trading_volumes = np.random.randint(1000, 10000, size=n_samples)

# 2. Categorical Data (Nominal)
# Sectors: Randomly choose from IT, Pharma, Banking, Energy
sectors = np.random.choice(['IT', 'Pharma', 'Banking', 'Energy'], size=n_samples)

# 3. Time Series Data
# Generate daily dates starting from Jan 1, 2025
start_date = datetime.datetime(2025, 1, 1)
dates = [start_date + datetime.timedelta(days=i) for i in range(n_samples)]
# Time-series: Daily closing prices with trend, seasonality, and noise
trend = np.linspace(90, 110, n_samples)  # Linear trend from 90 to 110
seasonality = 5 * np.sin(np.linspace(0, 2 * np.pi * 10, n_samples))  # Seasonal component
noise = np.random.normal(0, 2, n_samples)  # Random noise
closing_prices = (trend + seasonality + noise).round(2)

# 4. Textual Data
# Generate fake financial news headlines
headlines = [fake.sentence(nb_words=10) for _ in range(n_samples)]

# 5. Ordinal Data
# Credit ratings: Ordered categories AAA, AA, A, BBB
credit_ratings = np.random.choice(['AAA', 'AA', 'A', 'BBB'], size=n_samples, p=[0.1, 0.3, 0.4, 0.2])

# Create a DataFrame to hold all sample data
data = {
    'Date': dates,
    'Stock_Price': stock_prices,
    'Trading_Volume': trading_volumes,
    'Sector': sectors,
    'Closing_Price': closing_prices,
    'News_Headline': headlines,
    'Credit_Rating': credit_ratings
}
df = pd.DataFrame(data)
df.head()
Date Stock_Price Trading_Volume Sector Closing_Price News_Headline Credit_Rating
0 2025-01-01 104.97 5413 IT 92.17 Focus a place building me you way security som... A
1 2025-01-02 98.62 4748 Energy 95.27 Really lay future imagine perform such trial s... BBB
2 2025-01-03 106.48 1663 Energy 92.42 Mother seat financial attorney left meeting fo... BBB
3 2025-01-04 115.23 2998 Energy 93.46 Down maintain standard recognize TV relate off... A
4 2025-01-05 97.66 8994 Energy 94.67 Board market morning western feel whether pape... AA
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Closing_Price'], label='Daily Closing Price')
plt.title('Sample Time-Series: Daily Closing Prices')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()
# Your code here