Welcome to the Great Data Buffet, where we feast on CSVs, Excel sheets, SQL databases, APIs — and even understand what data truly means.
Before we can do Machine Learning, we first have to do Machine Feeding. 🤖🥄
🧠 Why This Matters¶
Your ML model is only as good as the data you feed it. Think of this chapter as learning how to open the right doors to the right kitchens — because business data hides everywhere:
Finance teams live in Excel
Engineers hide data in SQL databases
Marketing pulls it from APIs
And sometimes… it’s just a CSV file on someone’s desktop named “final_version_9_REAL.csv”
💡 If your dataset name has the word final in it, it’s never final.
⚙️ Prerequisite: Python & Pandas¶
We’ll be using the pandas library — the king of data handling in Python. If you’re not comfortable with it yet, check out: 👉 📘 Programming for Business
🧾 Understanding Data¶
The Latin word data is the plural of datum — “(thing) given.” Its first recorded English use dates back to the 1640s. By 1946, data referred to “transmissible and storable computer information.” (Source: Wikipedia)
Data exists in many forms — geographical, cultural, financial, statistical, meteorological, and more.
🔍 Types of Data¶
| Type | Description | Example |
|---|---|---|
| Numerical | Quantitative values | Stock price, sales volume |
| Categorical | Distinct groups | Sector: IT, Pharma, Banking |
| Time Series | Indexed by time | Daily closing prices |
| Textual | Unstructured data | News headlines, financial reports |
| Ordinal | Ordered categories | Credit ratings: AAA, AA, A, BBB... |
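A quick sketch of how these types map onto pandas dtypes — the column names and values below are illustrative, not from a real dataset:

```python
import pandas as pd

# One column per data type from the table above
df = pd.DataFrame({
    "price": [101.5, 99.2, 100.8],                    # Numerical
    "sector": pd.Categorical(["IT", "Pharma", "IT"]), # Categorical
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03"]),  # Time series
    "headline": ["Stocks rally", "Markets dip",
                 "Flat day"],                         # Textual
    "rating": pd.Categorical(
        ["AAA", "AA", "A"],
        categories=["BBB", "A", "AA", "AAA"],
        ordered=True),                                # Ordinal
})
print(df.dtypes)
```

Note that ordinal data is just a categorical with `ordered=True`, which makes comparisons like `df["rating"] < "AAA"` meaningful.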
✨ Key Insights¶
Data Scales: Nominal, ordinal, interval, ratio
Sources: Primary (e.g., internal reports) vs. Secondary (e.g., public indices)
Time-Series Components:

$$y_t = T_t + S_t + \varepsilon_t$$

where $T_t$ = trend, $S_t$ = seasonality, and $\varepsilon_t$ = noise.
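The trend + seasonality + noise decomposition can be sketched numerically; the numbers here are purely illustrative:

```python
import numpy as np

n = 12
t = np.arange(n)
trend = 100 + 0.5 * t                        # slow upward drift
seasonality = 3 * np.sin(2 * np.pi * t / 4)  # repeating cycle, period 4
rng = np.random.default_rng(0)
noise = rng.normal(0, 1, n)                  # random fluctuations

y = trend + seasonality + noise              # additive time series
print(y.round(2))
```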
📘 Pandas Basics - Object Creation¶
Creating a Series¶
A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, etc.).
import numpy as np
import pandas as pd
# Creating a Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Creating a DataFrame with NumPy Array¶
You can create a DataFrame by passing a NumPy array with a datetime index using pd.date_range() and labeled columns:
dates = pd.date_range("20130101", periods=6)
dates
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
Creating a DataFrame from a Dictionary¶
Creating a DataFrame by passing a dictionary where keys are column labels and values are the column values:
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
📁 Reading and Writing Data Files¶
CSV Files¶
Pandas provides easy methods to read from and write to CSV files:
# Create sample data
df_sample = pd.DataFrame(np.random.randint(0, 5, (10, 5)))
# Writing to CSV
df_sample.to_csv("foo.csv")
print("Data written to foo.csv")
# Reading from CSV
df_read = pd.read_csv("foo.csv")
df_read
Parquet Files¶
Parquet is a columnar storage format that’s efficient for large datasets:
# Writing to Parquet
df_sample.to_parquet("foo.parquet")
print("Data written to foo.parquet")
# Reading from Parquet
df_parquet = pd.read_parquet("foo.parquet")
df_parquet
Excel Files¶
Reading and writing Excel files is straightforward with pandas:
# Writing to Excel
df_sample.to_excel("foo.xlsx", sheet_name="Sheet1")
print("Data written to foo.xlsx")
# Reading from Excel
df_excel = pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
df_excel
Manual DataFrame Creation (Programmatic / In-Memory Data)¶
You can create DataFrames manually using Python dictionaries, lists, or records. This is particularly useful for testing, quick experiments, or small structured datasets.
From Dictionary of Lists or Arrays¶
import pandas as pd
data = {
"Product": ["A", "B", "C"],
"Sales": [100, 150, 200],
"Region": ["East", "West", "North"]
}
df = pd.DataFrame(data)
From List of Dictionaries¶
records = [
{"Product": "A", "Sales": 100, "Region": "East"},
{"Product": "B", "Sales": 150, "Region": "West"},
]
df = pd.DataFrame.from_records(records)
From Dictionary with Index¶
df = pd.DataFrame.from_dict(
{"A": [1, 2], "B": [3, 4]},
orient="columns"
)
From List of Lists¶
df = pd.DataFrame(
[["A", 100, "East"], ["B", 150, "West"]],
columns=["Product", "Sales", "Region"]
)
From NumPy Array¶
import numpy as np
df = pd.DataFrame(
np.random.rand(3, 2),
columns=["A", "B"]
)
Loading Data from Files¶
Pandas provides multiple built-in methods for reading data from common file formats.
Common methods:¶
| File Type | Function | Example |
|---|---|---|
| CSV | pd.read_csv() | pd.read_csv("sales.csv") |
| Excel | pd.read_excel() | pd.read_excel("sales.xlsx", sheet_name="2023") |
| JSON | pd.read_json() | pd.read_json("data.json") |
| HTML Tables | pd.read_html() | pd.read_html("https://example.com")[0] |
| SQL Databases | pd.read_sql() | pd.read_sql("SELECT * FROM sales", conn) |
| Parquet | pd.read_parquet() | pd.read_parquet("data.parquet") |
| Clipboard | pd.read_clipboard() | Reads from copied spreadsheet/table text |
Example:¶
import pandas as pd
# Load CSV file
df_csv = pd.read_csv("sales.csv")
# Load Excel sheet
df_excel = pd.read_excel("sales_data.xlsx", sheet_name="2023")
# Load JSON file
df_json = pd.read_json("data.json")
Useful Parameters When Loading Data¶
| Parameter | Description | Example |
|---|---|---|
| sep | Field delimiter | pd.read_csv("file.txt", sep="\t") |
| header | Row number(s) to use as column names | pd.read_csv("file.csv", header=0) |
| index_col | Column(s) to set as index | pd.read_csv("file.csv", index_col="ID") |
| usecols | Load specific columns | pd.read_csv("file.csv", usecols=["Name", "Sales"]) |
| nrows | Limit number of rows | pd.read_csv("file.csv", nrows=100) |
| encoding | File encoding type | pd.read_csv("file.csv", encoding="utf-8") |
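Several of these parameters can be combined in one call. Here we read from an in-memory string via `io.StringIO` so the example is self-contained; the column names are made up:

```python
import io
import pandas as pd

# A tab-delimited "file" held in memory
raw = ("ID\tName\tSales\tRegion\n"
       "1\tAlice\t100\tEast\n"
       "2\tBob\t150\tWest\n"
       "3\tCarol\t200\tNorth\n")

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                         # tab-delimited fields
    header=0,                         # first row holds column names
    index_col="ID",                   # use ID as the index
    usecols=["ID", "Name", "Sales"],  # skip the Region column
    nrows=2,                          # only the first two data rows
)
print(df)
```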
Loading SQL Data — “The Hidden Goldmine”¶
import sqlite3
conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM customers LIMIT 5;", conn)
conn.close()
Connection Examples¶
| Source | Syntax Example |
|---|---|
| SQLite | sqlite3.connect("data.db") |
| MySQL | mysql+pymysql://user:pass@host/db |
| PostgreSQL | postgresql://user:pass@host/db |
💬 SQL is like Excel’s cooler cousin who actually organizes things.
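The snippet above assumes an existing company.db file; here is a self-contained round trip using an in-memory SQLite database (the table and column names are illustrative):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Create a table from a DataFrame
customers = pd.DataFrame({"CustomerID": [1, 2, 3],
                          "Name": ["Alice", "Bob", "Charlie"]})
customers.to_sql("customers", conn, index=False)

# Read it back with SQL
df = pd.read_sql("SELECT * FROM customers WHERE CustomerID <= 2;", conn)
print(df)
conn.close()
```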
Loading APIs — “The Live Data Stream”¶
import requests
import pandas as pd
url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(url)
data = response.json()
df = pd.DataFrame(data['rates'].items(), columns=['Currency', 'Exchange Rate'])
print(df.head())

💡 APIs often return JSON — use pd.json_normalize() to flatten nested structures.
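A minimal sketch of pd.json_normalize() on a nested payload; the structure below is made up for illustration, not what the exchange-rate API above returns:

```python
import pandas as pd

# Hypothetical nested JSON, e.g. the result of response.json()
payload = [
    {"id": 1, "user": {"name": "Alice", "city": "Delhi"}},
    {"id": 2, "user": {"name": "Bob", "city": "Mumbai"}},
]

# Nested keys are flattened into dotted column names
flat = pd.json_normalize(payload)
print(flat.columns.tolist())
```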
🧩 Common Data Loading Issues¶
| Issue | Cause | Fix |
|---|---|---|
| Empty columns | Extra commas | on_bad_lines="skip" (error_bad_lines was removed in pandas 2.0) |
| Wrong header row | Offset | header=1 |
| Date parsing issues | Mixed formats | parse_dates=['Date'] |
| Duplicate rows | Export error | df.drop_duplicates() |
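Two of the fixes from the table, sketched on a small in-memory CSV (the data is made up):

```python
import io
import pandas as pd

# Duplicate row and string dates, as from a messy export
raw = ("Date,Sales\n"
       "2024-01-01,100\n"
       "2024-01-01,100\n"
       "2024-01-02,150\n")

# parse_dates converts the Date column to datetime64 on read
df = pd.read_csv(io.StringIO(raw), parse_dates=["Date"])
print(df.dtypes)

# drop_duplicates removes the accidental repeated row
df = df.drop_duplicates()
print(len(df))
```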
🧮 Data Combination — concat, merge, and join¶
Once your data is loaded, it often comes from multiple sources. Let’s learn how to combine it efficiently.
Why Combine Datasets?¶
Monthly sales stored separately
Customer details in one file, purchases in another
Financial metrics split by departments
Combining them allows complete analysis.
NumPy Concatenation¶
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=0)
np.vstack() and np.hstack() are convenient shortcuts.
pandas Concatenation¶
import pandas as pd
df1 = pd.DataFrame({'ID': [1,2,3], 'Name': ['Alice','Bob','Charlie']})
df2 = pd.DataFrame({'ID': [4,5], 'Name': ['David','Emma']})
pd.concat([df1, df2], ignore_index=True)
Horizontal concatenation:
pd.concat([df1, df2], axis=1)
Merging — SQL-style¶
sales = pd.DataFrame({'CustomerID':[1,2,3], 'Amount':[200,150,300]})
customers = pd.DataFrame({'CustomerID':[1,2,4], 'Name':['Alice','Bob','David']})
pd.merge(sales, customers, on='CustomerID', how='inner')
| Join Type | Keeps | Example Use |
|---|---|---|
| inner | Keys in both | Matching rows only |
| left | All left keys | All sales, missing customers as NaN |
| right | All right keys | All customers, missing sales as NaN |
| outer | All keys | All rows from both |
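The difference between join types shows up in which rows survive and where NaN appears. A sketch using the same sales/customers frames as above:

```python
import pandas as pd

sales = pd.DataFrame({"CustomerID": [1, 2, 3],
                      "Amount": [200, 150, 300]})
customers = pd.DataFrame({"CustomerID": [1, 2, 4],
                          "Name": ["Alice", "Bob", "David"]})

# left keeps every sale; CustomerID 3 has no customer match, so Name is NaN
left = pd.merge(sales, customers, on="CustomerID", how="left")

# outer keeps the union of keys from both frames
outer = pd.merge(sales, customers, on="CustomerID", how="outer")

print(left)
print(outer)
```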
Joining by Index¶
left = pd.DataFrame({'Name':['Alice','Bob','Charlie'],'Age':[25,30,35]})
right = pd.DataFrame({'City':['Delhi','Bangalore','Mumbai']}, index=['Alice','Bob','Charlie'])
joined = left.set_index('Name').join(right)
Merging on Multiple Keys¶
df1 = pd.DataFrame({
'Region':['North','South','North'],
'Year':[2023,2023,2024],
'Sales':[200,150,220]
})
df2 = pd.DataFrame({
'Region':['North','South','West'],
'Year':[2023,2023,2024],
'Profit':[50,40,70]
})
pd.merge(df1, df2, on=['Region','Year'], how='outer')
Summary of Combination Tools¶
| Method | Library | Purpose | Example |
|---|---|---|---|
| np.concatenate() | NumPy | Combine numeric arrays | Raw arrays |
| pd.concat() | pandas | Stack tables | Append monthly data |
| pd.merge() | pandas | SQL-style join | Merge sales & customers |
| .join() | pandas | Join on index | Lookup-style merge |
🧪 Example: Simulated Business Dataset¶
import numpy as np, pandas as pd, datetime
from faker import Faker
fake = Faker()
np.random.seed(42)
n = 100
dates = [datetime.datetime(2025,1,1)+datetime.timedelta(days=i) for i in range(n)]
trend = np.linspace(90,110,n)
seasonality = 5*np.sin(np.linspace(0,2*np.pi*10,n))
noise = np.random.normal(0,2,n)
closing_prices = (trend+seasonality+noise).round(2)
data = {
'Date': dates,
'Stock_Price': np.random.normal(100,10,n).round(2),
'Trading_Volume': np.random.randint(1000,10000,n),
'Sector': np.random.choice(['IT','Pharma','Banking','Energy'],n),
'Closing_Price': closing_prices,
'News_Headline': [fake.sentence(nb_words=10) for _ in range(n)],
'Credit_Rating': np.random.choice(['AAA','AA','A','BBB'],n,p=[0.1,0.3,0.4,0.2])
}
df = pd.DataFrame(data)
df.head()
📈 Visualize Time-Series¶
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.plot(df['Date'], df['Closing_Price'], label='Daily Closing Price')
plt.title('Sample Time-Series: Daily Closing Prices')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()
df = pd.DataFrame(np.random.randn(10, 4))
df
# Break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
# Concatenate back together
pd.concat(pieces)
Example 1 — Concatenating DataFrames using pd.concat()¶
You can use pd.concat() to combine multiple DataFrames along a particular axis.
import pandas as pd
df1 = pd.DataFrame({
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"]
})
df2 = pd.DataFrame({
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"]
})
df3 = pd.DataFrame({
"A": ["A8", "A9", "A10", "A11"],
"B": ["B8", "B9", "B10", "B11"]
})
# Combine all DataFrames vertically (stacking rows)
result = pd.concat([df1, df2, df3])
print(result)
Merging (SQL-style Joins)¶
SQL-style joins are performed using merge():
# Create sample dataframes
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)
# Merge on key
print("\nMerged DataFrame:")
pd.merge(left, right, on="key")
Merging on Unique Keys¶
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)
print("\nMerged DataFrame:")
pd.merge(left, right, on="key")
Example 2 — Merging DataFrames using pd.merge()¶
pd.merge() combines two DataFrames based on common columns or keys, similar to SQL joins.
import pandas as pd
left = pd.DataFrame({
"key": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"]
})
right = pd.DataFrame({
"key": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"]
})
# Merge using the common column 'key'
result = pd.merge(left, right, on="key")
print(result)
Example 3 — Joining DataFrames using DataFrame.join()¶
join() allows combining columns from different DataFrames based on their index.
import pandas as pd
left = pd.DataFrame({
"A": ["A0", "A1", "A2"],
"B": ["B0", "B1", "B2"]
}, index=["K0", "K1", "K2"])
right = pd.DataFrame({
"C": ["C0", "C2", "C3"],
"D": ["D0", "D2", "D3"]
}, index=["K0", "K2", "K3"])
# Default join is a left join on index
result = left.join(right)
print(result)