Data Loading (CSV, Excel, SQL, APIs)#
Welcome to the Great Data Buffet, where we feast on CSVs, Excel sheets, SQL databases, and even APIs.
Because before we can do Machine Learning, we first have to do Machine Feeding. 🤖🥄
🧠 Why This Matters#
Your ML model is only as good as the data you feed it. Think of this chapter as learning how to open the right doors to the right kitchens — because business data hides everywhere:
Finance teams live in Excel
Engineers hide data in SQL databases
Marketing pulls it from APIs
And sometimes… it’s just a CSV file on someone’s desktop named “final_version_9_REAL.csv”
💡 If your dataset name has the word final in it, it’s never final.
⚙️ Prerequisite: Python & Pandas#
We’ll be using the pandas library — the king of data handling in Python. If you’re not comfortable with it yet, grab a quick refresher from my other book: 👉 📘 Programming for Business
📂 1. Loading CSV Files — “The Business Standard”#
CSV (Comma-Separated Values) files are the backbone of corporate analytics. They’re lightweight, universal, and occasionally chaotic.
Example:#
import pandas as pd
df = pd.read_csv("sales_data.csv")
print(df.head())
🧩 Pro Tip: Use `df.info()` to check data types and see if pandas guessed correctly.
Common Issues:#
| Problem | Fix |
|---|---|
| Wrong delimiter | Pass `sep=';'` (or whatever the file actually uses) |
| Encoding error | Try `encoding='utf-8'` or `encoding='latin-1'` |
| Missing headers | Use `header=None` and supply `names=[...]` |
| Big file | Read in pieces with `chunksize=` |
💬 “If Excel can open it, pandas can too — faster and with fewer crashes.”
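Here's how those fixes look in practice — a minimal sketch, assuming a hypothetical `messy_sales.csv` that manages to have all of these problems at once:
```python
import pandas as pd

# Hypothetical file: semicolon-delimited, Latin-1 encoded, no header row
df = pd.read_csv(
    "messy_sales.csv",
    sep=";",                              # wrong delimiter → tell pandas which one
    encoding="latin-1",                   # encoding error → name the encoding
    header=None,                          # missing headers → don't eat the first row...
    names=["date", "region", "amount"],   # ...and supply column names yourself
)

# Big file → read it in chunks instead of all at once
chunks = pd.read_csv("big_sales.csv", chunksize=100_000)
total_rows = sum(len(chunk) for chunk in chunks)
```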
📊 2. Loading Excel Files — “The Corporate Comfort Zone”#
Business runs on Excel — from invoices to entire company strategies.
Example:#
df = pd.read_excel("revenue_report.xlsx", sheet_name="Q1")
print(df.head())
🧠 Tip: Pass `sheet_name=None` to see all available sheets — you get back a dict of DataFrames, one per sheet.
Multi-Sheet Example:#
all_sheets = pd.read_excel("revenue_report.xlsx", sheet_name=None)
for name, sheet in all_sheets.items():
print(f"Sheet {name} has {len(sheet)} rows.")
💬 Excel: where “data model” means 47 columns, 6 merged cells, and one hidden formula from 2013.
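🗄️ 3. Loading SQL Data — “The Engineering Vault”#
SQL databases are where engineers keep the data that actually has to stay consistent. pandas can query them directly with `pd.read_sql`, which returns the result set as a DataFrame.
Example:#
A minimal sketch using Python's built-in `sqlite3` driver — the database file `company.db` and the `sales` table are hypothetical stand-ins; for Postgres or MySQL you'd pass a SQLAlchemy engine as the connection instead:
```python
import sqlite3
import pandas as pd

# Hypothetical SQLite database file with a hypothetical `sales` table
conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM sales WHERE amount > 100", conn)
conn.close()
print(df.head())
```
💡 Tip: Push filtering into the query itself (`WHERE`, `LIMIT`) so the database does the heavy lifting before pandas ever sees the rows.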
🌐 4. Loading Data from APIs — “The Live Data Stream”#
APIs (Application Programming Interfaces) let you pull fresh data straight from online services — no more static CSVs!
Example:#
import requests
import pandas as pd
url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(url)
data = response.json()
df = pd.DataFrame(data['rates'].items(), columns=['Currency', 'Exchange Rate'])
print(df.head())
💡 APIs often return JSON — pandas' `pd.json_normalize()` is your friend.
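A quick sketch of what that looks like — the nested payload below is made up for illustration:
```python
import pandas as pd

# Hypothetical nested API response
payload = [
    {"id": 1, "customer": {"name": "Alice", "city": "Delhi"}, "amount": 200},
    {"id": 2, "customer": {"name": "Bob", "city": "Mumbai"}, "amount": 150},
]

# json_normalize flattens nested fields into dotted column names
df = pd.json_normalize(payload)
print(df)
#    id  amount customer.name customer.city
# 0   1     200         Alice         Delhi
# 1   2     150           Bob        Mumbai
```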
🧩 Practice Corner: “Find the Source!”#
Match each situation with the right loading tool:
| Scenario | Tool |
|---|---|
| You got a file from accounting in `.xlsx` | `pd.read_excel()` |
| The sales system has an SQL backend | `pd.read_sql()` |
| Marketing dashboard exports | `pd.read_csv()` |
| Finance API gives daily data | `requests` + `pd.json_normalize()` |
✅ Pro tip: The goal isn’t just to load data — it’s to load it reproducibly.
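What “reproducibly” can look like — a minimal sketch of a loader that pins down the choices pandas would otherwise guess (the file name and columns are hypothetical):
```python
import pandas as pd

def load_sales(path: str = "sales_data.csv") -> pd.DataFrame:
    """Load the sales extract the same way, every time."""
    return pd.read_csv(
        path,
        dtype={"region": "string", "units": "Int64"},  # pin dtypes instead of letting pandas guess
        parse_dates=["order_date"],                    # parse dates explicitly
    )

df = load_sales()
```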
🧾 5. Handling Common Data Loading Drama#
| Issue | Cause | Fix |
|---|---|---|
| Empty columns | Extra commas in CSV | Use `df.dropna(axis=1, how='all')` |
| Header mismatch | Wrong row as header | Pass `header=1` (or `skiprows=`) to `read_csv` |
| Strange dates | Format confusion | Use `parse_dates=` or `pd.to_datetime()` with an explicit `format=` |
| Duplicate rows | Export error | `df.drop_duplicates()` |
💬 Pandas doesn’t judge bad data — it just loads it silently (unlike your boss).
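The table's fixes in one place — a hedged sketch on a hypothetical messy export (`quarterly_export.csv` and its columns are made up):
```python
import pandas as pd

df = pd.read_csv(
    "quarterly_export.csv",
    header=1,                       # the real header sits on the second row
    parse_dates=["invoice_date"],   # parse dates at load time
)

df = df.dropna(axis=1, how="all")   # drop columns that are entirely empty
df = df.drop_duplicates()           # remove rows duplicated by the export
```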
🧩 Quick Practice — “Business Data Sampler”#
# Load three mini datasets
sales = pd.read_csv("sales.csv")
customers = pd.read_excel("customers.xlsx")
rates = pd.read_json("exchange_rates.json")
# Peek at each
print(sales.shape, customers.shape, rates.shape)
🧠 Try joining them later when we reach the “Data Cleaning & Preprocessing” section.
🧭 6. Recap#
✅ CSV → your everyday data sandwich
✅ Excel → for legacy systems and CFOs
✅ SQL → for scalable, reliable data
✅ APIs → for real-time freshness
🎯 You can now load any data type that the business throws at you — even that mystery file from “Final_Version_Really_Final_3.xlsx”.
🔜 Next Stop#
👉 Head to Data Cleaning & Preprocessing where we roll up our sleeves for Digital Laundry Day 🧺 — removing typos, missing values, and other spreadsheet sins.
🧩 Combining Datasets in Python — concat, merge, and join#
🧠 1. Why Combine Datasets?#
In real-world business or analytics scenarios, data rarely comes from one source. You may have:
Monthly sales data stored separately for each month.
Customer details in one file and purchase records in another.
Financial metrics split by department.
Combining these into a single table makes analysis much easier.
🧮 Using NumPy#
NumPy is useful when dealing with raw numeric data — plain arrays, without the row and column labels a pandas DataFrame carries.
2. Concatenation in NumPy#
np.concatenate() joins arrays along a specified axis.
import numpy as np
# Create sample arrays
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
# Stack vertically (add rows)
vertical = np.concatenate((a, b), axis=0)
print("Vertical Concatenation:\n", vertical)
# Stack horizontally (add columns) — row counts must match,
# so first stack a and b vertically (3x2), then append c (3x1)
c = np.array([[7], [8], [9]])
horizontal = np.concatenate((np.vstack((a, b)), c), axis=1)
print("\nHorizontal Concatenation:\n", horizontal)
✅ Output:
Vertical Concatenation:
[[1 2]
[3 4]
[5 6]]
Horizontal Concatenation:
[[1 2 7]
[3 4 8]
[5 6 9]]
3. Stacking Helpers in NumPy#
NumPy has shortcuts for cleaner syntax:
np.vstack((a, b))        # vertical stack — same as concatenate(axis=0)
np.hstack((a, b.T))      # horizontal stack — b.T is 2x1, matching a's two rows
np.column_stack(([1, 2, 3], [4, 5, 6]))  # pairs equal-length 1-D sequences as columns
🧾 Using pandas#
pandas offers concat, merge, and join — each useful for different tasks.
4. Concatenation with pd.concat()#
Used to stack DataFrames vertically or horizontally.
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [4, 5], 'Name': ['David', 'Emma']})
# Vertical concatenation (same columns)
result = pd.concat([df1, df2])
print(result)
✅ Output:
ID Name
0 1 Alice
1 2 Bob
2 3 Charlie
0 4 David
1 5 Emma
You can reset the index:
result = pd.concat([df1, df2], ignore_index=True)
🔄 Horizontal concatenation#
If you want to combine side by side, rows are aligned on the index — rows missing from one DataFrame are filled with NaN:
pd.concat([df1, df2], axis=1)
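With `df1` (3 rows) and `df2` (2 rows) from above, a quick sketch of what to expect:
```python
side = pd.concat([df1, df2], axis=1)
print(side)
#    ID     Name   ID   Name
# 0   1    Alice  4.0  David
# 1   2      Bob  5.0   Emma
# 2   3  Charlie  NaN    NaN
```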
5. Merging with pd.merge()#
merge() is similar to SQL joins — you can merge based on common columns or indices.
Example 1: Simple Inner Join#
sales = pd.DataFrame({
'CustomerID': [1, 2, 3],
'Amount': [200, 150, 300]
})
customers = pd.DataFrame({
'CustomerID': [1, 2, 4],
'Name': ['Alice', 'Bob', 'David']
})
merged = pd.merge(sales, customers, on='CustomerID', how='inner')
print(merged)
✅ Output:
CustomerID Amount Name
0 1 200 Alice
1 2 150 Bob
Example 2: Left, Right, Outer Joins#
pd.merge(sales, customers, on='CustomerID', how='left')
pd.merge(sales, customers, on='CustomerID', how='right')
pd.merge(sales, customers, on='CustomerID', how='outer')
| Join Type | Keeps Rows From | Example Use |
|---|---|---|
| `inner` | Both tables (matches only) | Only matching customers |
| `left` | Left table | All sales, even missing customers |
| `right` | Right table | All customers, even missing sales |
| `outer` | Both | All records, fills missing with NaN |
6. Joining with .join()#
.join() is a simpler method for joining on index (like Excel’s VLOOKUP).
left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
right = pd.DataFrame({'City': ['Delhi', 'Bangalore', 'Mumbai']}, index=['Alice', 'Bob', 'Charlie'])
# right is already indexed by name, so give left the same index before joining
joined = left.set_index('Name').join(right)
print(joined)
✅ Output:
Age City
Name
Alice 25 Delhi
Bob 30 Bangalore
Charlie 35 Mumbai
7. Combine by Column Keys (Multiple Keys)#
You can merge using multiple columns:
df1 = pd.DataFrame({
'Region': ['North', 'South', 'North'],
'Year': [2023, 2023, 2024],
'Sales': [200, 150, 220]
})
df2 = pd.DataFrame({
'Region': ['North', 'South', 'West'],
'Year': [2023, 2023, 2024],
'Profit': [50, 40, 70]
})
merged = pd.merge(df1, df2, on=['Region', 'Year'], how='outer')
print(merged)
✅ Output:
Region Year Sales Profit
0 North 2023 200.0 50.0
1 South 2023 150.0 40.0
2 North 2024 220.0 NaN
3 West 2024 NaN 70.0
8. Summary Table#
| Method | Library | Purpose | Example |
|---|---|---|---|
| `np.concatenate()` | NumPy | Stack arrays vertically/horizontally | Combine numeric arrays |
| `pd.concat()` | pandas | Combine DataFrames by rows/columns | Append monthly data |
| `pd.merge()` | pandas | SQL-style join on keys | Combine customer & sales info |
| `.join()` | pandas | Join by index | Combine side tables easily |
9. Practice Exercises#
1. Concatenate two arrays using NumPy vertically and horizontally.
2. Merge two DataFrames — `orders` and `customers` — using an inner join.
3. Perform an outer join to include all rows even if missing in one table.
4. Join two DataFrames on index (try `.set_index()` + `.join()`).
5. Create a dataset of quarterly sales for each region and combine them into one final table using `pd.concat()`.
Data#
The Latin word data is the plural of datum, “(thing) given,” and the neuter past participle of dare, “to give”. The first English use of the word “data” is from the 1640s. The word “data” was first used to mean “transmissible and storable computer information” in 1946. The expression “data processing” was first used in 1954 (https://en.wikipedia.org/wiki/Data).
These are some of the different types of data: Geographical, Cultural, Scientific, Financial, Statistical, Meteorological, Natural, Transport
1. Types of Data#
Description: Understanding the different forms of data: numerical (continuous, discrete), categorical (nominal, ordinal), and time-series.
Relevance: Different data types require specific handling in forecasting models (e.g., time-series models for stock prices, classification models for credit risk).
Related Concepts:
Data Scales: Nominal, ordinal, interval, and ratio scales.
Data Sources: Primary (e.g., company sales records) vs. secondary (e.g., market indices).
Time-Series Components: Trend, seasonality, and noise in financial data.
Example: Numerical: Stock prices (continuous). Categorical: Industry sectors (nominal). Time-series: Daily sales.
Mathematical Context: Time-series data often follows an additive model: $y_t = T_t + S_t + \epsilon_t$, where $T_t$ is the trend, $S_t$ is the seasonal component, and $\epsilon_t$ is random noise.
Teaching Tip: Use real-world examples (e.g., stock prices vs. customer ratings) and ask students to classify data types from a sample dataset.
| Type | Description | Example |
|---|---|---|
| Numerical | Quantitative values | Stock price, volume |
| Categorical | Distinct groups | Sector: IT, Pharma, Banking |
| Time Series | Indexed by time | Daily closing prices |
| Textual | Unstructured data | News headlines, financial reports |
| Ordinal | Ordered categories | Credit ratings: AAA, AA, A, BBB… |
pip install faker
import numpy as np
import pandas as pd
from faker import Faker
import datetime
# Set random seed for reproducibility
np.random.seed(42)
# Initialize Faker for generating realistic textual data
fake = Faker()
# Number of samples
n_samples = 100
# 1. Numerical Data (Continuous and Discrete)
# Continuous: Stock prices (e.g., normally distributed around $100 with some variation)
stock_prices = np.random.normal(loc=100, scale=10, size=n_samples).round(2)
# Discrete: Trading volume (e.g., random integers between 1000 and 10000)
trading_volumes = np.random.randint(1000, 10000, size=n_samples)
# 2. Categorical Data (Nominal)
# Sectors: Randomly choose from IT, Pharma, Banking, Energy
sectors = np.random.choice(['IT', 'Pharma', 'Banking', 'Energy'], size=n_samples)
# 3. Time Series Data
# Generate daily dates starting from Jan 1, 2025
start_date = datetime.datetime(2025, 1, 1)
dates = [start_date + datetime.timedelta(days=i) for i in range(n_samples)]
# Time-series: Daily closing prices with trend, seasonality, and noise
trend = np.linspace(90, 110, n_samples) # Linear trend from 90 to 110
seasonality = 5 * np.sin(np.linspace(0, 2 * np.pi * 10, n_samples)) # Seasonal component
noise = np.random.normal(0, 2, n_samples) # Random noise
closing_prices = (trend + seasonality + noise).round(2)
# 4. Textual Data
# Generate fake financial news headlines
headlines = [fake.sentence(nb_words=10) for _ in range(n_samples)]
# 5. Ordinal Data
# Credit ratings: Ordered categories AAA, AA, A, BBB
credit_ratings = np.random.choice(['AAA', 'AA', 'A', 'BBB'], size=n_samples, p=[0.1, 0.3, 0.4, 0.2])
# Create a DataFrame to hold all sample data
data = {
'Date': dates,
'Stock_Price': stock_prices,
'Trading_Volume': trading_volumes,
'Sector': sectors,
'Closing_Price': closing_prices,
'News_Headline': headlines,
'Credit_Rating': credit_ratings
}
df = pd.DataFrame(data)
df.head()
|   | Date | Stock_Price | Trading_Volume | Sector | Closing_Price | News_Headline | Credit_Rating |
|---|---|---|---|---|---|---|---|
| 0 | 2025-01-01 | 104.97 | 5413 | IT | 92.17 | Focus a place building me you way security som... | A |
| 1 | 2025-01-02 | 98.62 | 4748 | Energy | 95.27 | Really lay future imagine perform such trial s... | BBB |
| 2 | 2025-01-03 | 106.48 | 1663 | Energy | 92.42 | Mother seat financial attorney left meeting fo... | BBB |
| 3 | 2025-01-04 | 115.23 | 2998 | Energy | 93.46 | Down maintain standard recognize TV relate off... | A |
| 4 | 2025-01-05 | 97.66 | 8994 | Energy | 94.67 | Board market morning western feel whether pape... | AA |
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Closing_Price'], label='Daily Closing Price')
plt.title('Sample Time-Series: Daily Closing Prices')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()
# Your code here