Data Loading (CSV, Excel, SQL, APIs)#
Welcome to the Great Data Buffet, where we feast on CSVs, Excel sheets, SQL databases, APIs — and even understand what data truly means.
Before we can do Machine Learning, we first have to do Machine Feeding. 🤖🥄
🧠 Why This Matters#
Your ML model is only as good as the data you feed it. Think of this chapter as learning how to open the right doors to the right kitchens — because business data hides everywhere:
Finance teams live in Excel
Engineers hide data in SQL databases
Marketing pulls it from APIs
And sometimes… it’s just a CSV file on someone’s desktop named “final_version_9_REAL.csv”
💡 If your dataset name has the word final in it, it’s never final.
⚙️ Prerequisite: Python & Pandas#
We’ll be using the pandas library — the king of data handling in Python. If you’re not comfortable with it yet, check out: 👉 📘 Programming for Business
🧾 Understanding Data#
The Latin word data is the plural of datum — “(thing) given.” Its first recorded English use dates back to the 1640s. By 1946, data referred to “transmissible and storable computer information.” (Source: Wikipedia)
Data exists in many forms — geographical, cultural, financial, statistical, meteorological, and more.
🔍 Types of Data#
| Type | Description | Example |
|---|---|---|
| Numerical | Quantitative values | Stock price, sales volume |
| Categorical | Distinct groups | Sector: IT, Pharma, Banking |
| Time Series | Indexed by time | Daily closing prices |
| Textual | Unstructured data | News headlines, financial reports |
| Ordinal | Ordered categories | Credit ratings: AAA, AA, A, BBB… |
✨ Key Insights#
Data Scales: Nominal, ordinal, interval, ratio
Sources: Primary (e.g., internal reports) vs. Secondary (e.g., public indices)
Time-Series Components: $y_t = T_t + S_t + \epsilon_t$, where $T_t$ = trend, $S_t$ = seasonality, and $\epsilon_t$ = noise.
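The additive decomposition above can be illustrated with a small synthetic series (all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(24)                              # 24 monthly observations

trend = 0.5 * t                                # T_t: linear upward trend
seasonality = 3 * np.sin(2 * np.pi * t / 12)   # S_t: yearly cycle
noise = rng.normal(0, 0.5, size=t.size)        # epsilon_t: random noise

y = trend + seasonality + noise                # y_t = T_t + S_t + epsilon_t
print(y[:3])
```

In practice you would estimate $T_t$ and $S_t$ from observed data (e.g. with a moving average), rather than construct them.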
📘 Pandas Basics - Object Creation#
Creating a Series#
A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, etc.).
```python
import numpy as np
import pandas as pd

# Creating a Series from a list; np.nan marks a missing value
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
```
Creating a DataFrame with NumPy Array#
You can create a DataFrame by passing a NumPy array with a datetime index using pd.date_range() and labeled columns:
```python
dates = pd.date_range("20130101", periods=6)
dates
```

```python
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
```
Creating a DataFrame from a Dictionary#
Creating a DataFrame by passing a dictionary where keys are column labels and values are the column values:
```python
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2
```
📁 Reading and Writing Data Files#
CSV Files#
Pandas provides easy methods to read from and write to CSV files:
```python
# Create sample data
df_sample = pd.DataFrame(np.random.randint(0, 5, (10, 5)))

# Writing to CSV
df_sample.to_csv("foo.csv")
print("Data written to foo.csv")

# Reading from CSV
df_read = pd.read_csv("foo.csv")
df_read
```
Parquet Files#
Parquet is a columnar storage format that’s efficient for large datasets:
```python
# Writing to Parquet
df_sample.to_parquet("foo.parquet")
print("Data written to foo.parquet")

# Reading from Parquet
df_parquet = pd.read_parquet("foo.parquet")
df_parquet
```
Excel Files#
Reading and writing Excel files is straightforward with pandas:
```python
# Writing to Excel (requires the openpyxl package for .xlsx files)
df_sample.to_excel("foo.xlsx", sheet_name="Sheet1")
print("Data written to foo.xlsx")

# Reading from Excel
df_excel = pd.read_excel("foo.xlsx", sheet_name="Sheet1", index_col=None, na_values=["NA"])
df_excel
```
Manual DataFrame Creation (Programmatic / In-Memory Data)#
You can create DataFrames manually using Python dictionaries, lists, or records. This is particularly useful for testing, quick experiments, or small structured datasets.
From Dictionary of Lists or Arrays#
From List of Dictionaries#
From Dictionary with Index#
From List of Lists#
From NumPy Array#
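Each construction pattern listed above can be sketched as follows (the column names and values are illustrative):

```python
import numpy as np
import pandas as pd

# From a dictionary of lists or arrays (keys become columns)
df_dict = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "price": [190.0, 410.0]})

# From a list of dictionaries (one dict per row)
df_records = pd.DataFrame([{"ticker": "AAPL", "price": 190.0},
                           {"ticker": "MSFT", "price": 410.0}])

# From a dictionary with an explicit index
df_indexed = pd.DataFrame({"price": [190.0, 410.0]}, index=["AAPL", "MSFT"])

# From a list of lists, naming the columns explicitly
df_lists = pd.DataFrame([["AAPL", 190.0], ["MSFT", 410.0]],
                        columns=["ticker", "price"])

# From a NumPy array
df_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
print(df_array)
```

All five produce ordinary DataFrames; pick whichever matches the shape your data already has in memory.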
Loading Data from Files#
Pandas provides multiple built-in methods for reading data from common file formats.
Common methods:#
| File Type | Function | Example |
|---|---|---|
| CSV | `pd.read_csv()` | `pd.read_csv("data.csv")` |
| Excel | `pd.read_excel()` | `pd.read_excel("data.xlsx")` |
| JSON | `pd.read_json()` | `pd.read_json("data.json")` |
| HTML Tables | `pd.read_html()` | `pd.read_html("page.html")` |
| SQL Databases | `pd.read_sql()` | `pd.read_sql(query, conn)` |
| Parquet | `pd.read_parquet()` | `pd.read_parquet("data.parquet")` |
| Clipboard | `pd.read_clipboard()` | Reads from copied spreadsheet/table text |
Example:#
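A minimal sketch of loading a CSV; the file contents here are simulated in memory with `io.StringIO`, which `read_csv()` accepts like a file:

```python
import io
import pandas as pd

# Simulate a small CSV file in memory (the tickers and prices are made up)
csv_text = "ticker,price\nAAPL,190.0\nMSFT,410.0\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With a real file you would pass the path instead, e.g. `pd.read_csv("data.csv")`.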
Useful Parameters When Loading Data#
| Parameter | Description | Example |
|---|---|---|
| `sep` | Field delimiter | `pd.read_csv("f.csv", sep=";")` |
| `header` | Row number(s) to use as column names | `header=0` |
| `index_col` | Column(s) to set as index | `index_col="Date"` |
| `usecols` | Load specific columns | `usecols=["A", "B"]` |
| `nrows` | Limit number of rows | `nrows=100` |
| `encoding` | File encoding type | `encoding="utf-8"` |
Loading APIs — “The Live Data Stream”#
💡 APIs often return JSON — use `pd.json_normalize()` to flatten nested structures.
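A sketch of flattening an API-style payload; the nested structure below is made up to stand in for a real response (in practice you would fetch it first, e.g. with the `requests` library):

```python
import pandas as pd

# A nested payload of the kind a JSON API might return (illustrative schema)
payload = [
    {"ticker": "AAPL", "quote": {"price": 190.0, "currency": "USD"}},
    {"ticker": "MSFT", "quote": {"price": 410.0, "currency": "USD"}},
]

# Nested dicts become dotted column names: "quote.price", "quote.currency"
df = pd.json_normalize(payload)
print(df.columns.tolist())
```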
🧩 Common Data Loading Issues#
| Issue | Cause | Fix |
|---|---|---|
| Empty columns | Extra commas | `df.dropna(axis=1, how="all")` |
| Wrong header row | Offset | `pd.read_csv(..., header=1)` or `skiprows=` |
| Date parsing issues | Mixed formats | `pd.to_datetime(col, format=...)` |
| Duplicate rows | Export error | `df.drop_duplicates()` |
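These common fixes can be sketched on a small synthetic "messy export" (the data is made up):

```python
import io
import pandas as pd

# A messy export: a stray title row, an all-empty trailing column,
# and a duplicated data row
raw = (
    "report,,\n"
    "date,value,\n"
    "2024-01-01,10,\n"
    "2024-01-01,10,\n"
    "2024-01-02,12,\n"
)

df = pd.read_csv(io.StringIO(raw), header=1)   # skip the stray title row
df = df.dropna(axis=1, how="all")              # drop the all-empty column
df = df.drop_duplicates()                      # remove the duplicated row
df["date"] = pd.to_datetime(df["date"])        # parse dates explicitly
print(df)
```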
🧮 Data Combination — concat, merge, and join#
Once your data is loaded, it often comes from multiple sources. Let’s learn how to combine it efficiently.
Why Combine Datasets?#
Monthly sales stored separately
Customer details in one file, purchases in another
Financial metrics split by departments
Combining them allows complete analysis.
NumPy Concatenation#
`np.vstack()` and `np.hstack()` are convenient shortcuts.
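A minimal sketch of both shortcuts on two small arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# Stack rows (vertically); equivalent to np.concatenate([a, b], axis=0)
v = np.vstack([a, b])

# Stack columns (horizontally); equivalent to np.concatenate([a, b], axis=1)
h = np.hstack([a, b])
print(v.shape, h.shape)
```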
pandas Concatenation#
Horizontal concatenation:
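Both directions can be sketched with `pd.concat()`; the frames below are made up:

```python
import pandas as pd

q1 = pd.DataFrame({"sales": [100, 120]}, index=["Jan", "Feb"])
q2 = pd.DataFrame({"sales": [130, 150]}, index=["Mar", "Apr"])
costs = pd.DataFrame({"costs": [80, 90]}, index=["Jan", "Feb"])

# Vertical (default, axis=0): stack rows, e.g. appending monthly data
stacked = pd.concat([q1, q2])

# Horizontal (axis=1): align on the index, place columns side by side
side_by_side = pd.concat([q1, costs], axis=1)
print(side_by_side)
```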
Merging — SQL-style#
| Join Type | Keeps | Example Use |
|---|---|---|
| `inner` | Both | Matching rows only |
| `left` | Left | All sales, missing customers as NaN |
| `right` | Right | All customers, missing sales as NaN |
| `outer` | Both | All rows from both |
Joining by Index#
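A quick sketch with made-up sales and region lookup tables, aligned on their indexes:

```python
import pandas as pd

sales = pd.DataFrame({"revenue": [100, 200]}, index=["C1", "C2"])
regions = pd.DataFrame({"region": ["EU", "US"]}, index=["C1", "C3"])

# join() aligns on the index; the default is a left join,
# so unmatched rows from `sales` get NaN in the region column
joined = sales.join(regions)
print(joined)
```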
Merging on Multiple Keys#
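When passing a list to `on=`, rows match only when every key agrees. A sketch with made-up budget and spend tables:

```python
import pandas as pd

left = pd.DataFrame({
    "year": [2023, 2023, 2024],
    "dept": ["IT", "HR", "IT"],
    "budget": [100, 50, 120],
})
right = pd.DataFrame({
    "year": [2023, 2024, 2024],
    "dept": ["IT", "IT", "HR"],
    "spend": [90, 110, 40],
})

# Rows match only when BOTH year and dept agree
result = pd.merge(left, right, on=["year", "dept"], how="inner")
print(result)
```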
Summary of Combination Tools#
| Method | Library | Purpose | Example |
|---|---|---|---|
| `np.vstack()` / `np.hstack()` | NumPy | Combine numeric arrays | Raw arrays |
| `pd.concat()` | pandas | Stack tables | Append monthly data |
| `pd.merge()` | pandas | SQL-style join | Merge sales & customers |
| `DataFrame.join()` | pandas | Join on index | Lookup style merge |
🧪 Example: Simulated Business Dataset#
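One way to build such a dataset, with a weekly sales cycle plus noise (every value here is simulated, not real business data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=90, freq="D")

# Simulated daily sales: a base level, a weekly cycle, and random noise
biz = pd.DataFrame({
    "date": dates,
    "sales": 1000
             + 50 * np.sin(2 * np.pi * np.arange(90) / 7)
             + rng.normal(0, 20, 90),
    "region": rng.choice(["North", "South", "East", "West"], 90),
}).set_index("date")
print(biz.head())
```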
📈 Visualize Time-Series#
# Your code here
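One possible sketch, plotting a made-up daily series with matplotlib (the `Agg` backend saves to a file instead of opening a window):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# A made-up upward-trending daily series
dates = pd.date_range("2024-01-01", periods=90, freq="D")
sales = pd.Series(1000 + np.arange(90) * 2.0, index=dates, name="sales")

ax = sales.plot(title="Daily Sales (simulated)")
ax.set_xlabel("Date")
ax.set_ylabel("Sales")
plt.savefig("sales.png")
```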
🔗 Combining DataFrames#
Concatenation#
Concatenating pandas objects together row-wise:
```python
df = pd.DataFrame(np.random.randn(10, 4))

# Break it into pieces
pieces = [df[:3], df[3:7], df[7:]]

# Concatenate back together
pd.concat(pieces)
```
Example 1 — Concatenating DataFrames using pd.concat()#
You can use pd.concat() to combine multiple DataFrames along a particular axis.
```python
import pandas as pd

df1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"]
})
df2 = pd.DataFrame({
    "A": ["A4", "A5", "A6", "A7"],
    "B": ["B4", "B5", "B6", "B7"]
})
df3 = pd.DataFrame({
    "A": ["A8", "A9", "A10", "A11"],
    "B": ["B8", "B9", "B10", "B11"]
})

# Combine all DataFrames vertically (stacking rows)
result = pd.concat([df1, df2, df3])
print(result)
```
Merging (SQL-style Joins)#
SQL-style joins are performed using merge():
```python
# Create sample dataframes
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

# Merge on key: duplicate keys match pairwise, giving a Cartesian product (4 rows)
print("\nMerged DataFrame:")
pd.merge(left, right, on="key")
```
Merging on Unique Keys#
```python
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

print("\nMerged DataFrame:")
pd.merge(left, right, on="key")
```
Example 2 — Merging DataFrames using pd.merge()#
pd.merge() combines two DataFrames based on common columns or keys, similar to SQL joins.
```python
import pandas as pd

left = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"]
})
right = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"]
})

# Merge using the common column 'key'
result = pd.merge(left, right, on="key")
print(result)
```
Example 3 — Joining DataFrames using DataFrame.join()#
join() allows combining columns from different DataFrames based on their index.
```python
import pandas as pd

left = pd.DataFrame({
    "A": ["A0", "A1", "A2"],
    "B": ["B0", "B1", "B2"]
}, index=["K0", "K1", "K2"])
right = pd.DataFrame({
    "C": ["C0", "C2", "C3"],
    "D": ["D0", "D2", "D3"]
}, index=["K0", "K2", "K3"])

# Default join is a left join on index
result = left.join(right)
print(result)
```