Data Loading (CSV, Excel, SQL, APIs)#

Welcome to the Great Data Buffet, where we feast on CSVs, Excel sheets, SQL databases, APIs — and even understand what data truly means.

Before we can do Machine Learning, we first have to do Machine Feeding. 🤖🥄


🧠 Why This Matters#

Your ML model is only as good as the data you feed it. Think of this chapter as learning how to open the right doors to the right kitchens — because business data hides everywhere:

  • Finance teams live in Excel

  • Engineers hide data in SQL databases

  • Marketing pulls it from APIs

  • And sometimes… it’s just a CSV file on someone’s desktop named “final_version_9_REAL.csv”

💡 If your dataset name has the word final in it, it’s never final.


⚙️ Prerequisite: Python & Pandas#

We’ll be using the pandas library — the king of data handling in Python. If you’re not comfortable with it yet, check out: 👉 📘 Programming for Business


🧾 Understanding Data#

The Latin word data is the plural of datum — “(thing) given.” Its first recorded English use dates back to the 1640s. By 1946, data referred to “transmissible and storable computer information.” (Source: Wikipedia)

Data types diagram

Data exists in many forms — geographical, cultural, financial, statistical, meteorological, and more.


🔍 Types of Data#

| Type | Description | Example |
|------|-------------|---------|
| Numerical | Quantitative values | Stock price, sales volume |
| Categorical | Distinct groups | Sector: IT, Pharma, Banking |
| Time Series | Indexed by time | Daily closing prices |
| Textual | Unstructured data | News headlines, financial reports |
| Ordinal | Ordered categories | Credit ratings: AAA, AA, A, BBB… |

✨ Key Insights#

  • Data Scales: Nominal, ordinal, interval, ratio

  • Sources: Primary (e.g., internal reports) vs. Secondary (e.g., public indices)

  • Time-Series Components: $y_t = T_t + S_t + \epsilon_t$, where $T_t$ = trend, $S_t$ = seasonality, and $\epsilon_t$ = noise.


📘 Pandas Basics - Object Creation#

Creating a Series#

A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, etc.).

import numpy as np
import pandas as pd

# Creating a Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Creating a DataFrame with NumPy Array#

You can create a DataFrame by passing a NumPy array with a datetime index using pd.date_range() and labeled columns:

dates = pd.date_range("20130101", periods=6)
dates
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

Creating a DataFrame from a Dictionary#

Creating a DataFrame by passing a dictionary where keys are column labels and values are the column values:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

📁 Reading and Writing Data Files#

CSV Files#

Pandas provides easy methods to read from and write to CSV files:

# Create sample data
df_sample = pd.DataFrame(np.random.randint(0, 5, (10, 5)))

# Writing to CSV
df_sample.to_csv("foo.csv")
print("Data written to foo.csv")

# Reading from CSV
df_read = pd.read_csv("foo.csv")
df_read

Parquet Files#

Parquet is a columnar storage format that’s efficient for large datasets:

# Writing to Parquet
df_sample.to_parquet("foo.parquet")
print("Data written to foo.parquet")

# Reading from Parquet
df_parquet = pd.read_parquet("foo.parquet")
df_parquet

Excel Files#

Reading and writing Excel files is straightforward with pandas:

# Writing to Excel
df_sample.to_excel("foo.xlsx", sheet_name="Sheet1")
print("Data written to foo.xlsx")

# Reading from Excel
df_excel = pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
df_excel

Manual DataFrame Creation (Programmatic / In-Memory Data)#

You can create DataFrames manually using Python dictionaries, lists, or records. This is particularly useful for testing, quick experiments, or small structured datasets.

From Dictionary of Lists or Arrays#
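A minimal sketch (the column names and values here are illustrative):

import pandas as pd

# Each dictionary key becomes a column; all lists must be the same length
df_dict = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Sales": [250, 300, 175],
})
df_dict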

From List of Dictionaries#
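A minimal sketch (illustrative records):

import pandas as pd

# Each dictionary becomes a row; missing keys are filled with NaN
df_records = pd.DataFrame([
    {"Name": "Alice", "Sales": 250},
    {"Name": "Bob", "Sales": 300, "Region": "East"},
])
df_records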

From Dictionary with Index#
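A minimal sketch (the index labels are illustrative):

import pandas as pd

# An explicit index labels the rows instead of the default 0, 1, 2, ...
df_indexed = pd.DataFrame(
    {"Sales": [250, 300, 175]},
    index=["Q1", "Q2", "Q3"],
)
df_indexed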

From List of Lists#
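A minimal sketch (illustrative rows):

import pandas as pd

# Each inner list is a row; column names are supplied separately
df_rows = pd.DataFrame(
    [["Alice", 250], ["Bob", 300]],
    columns=["Name", "Sales"],
)
df_rows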

From NumPy Array#
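A minimal sketch (random data, illustrative column names):

import numpy as np
import pandas as pd

# A 2-D NumPy array becomes a DataFrame; column labels are optional but recommended
df_array = pd.DataFrame(np.random.randn(3, 2), columns=["X", "Y"])
df_array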


Loading Data from Files#

Pandas provides multiple built-in methods for reading data from common file formats.

Common methods:#

| File Type | Function | Example |
|-----------|----------|---------|
| CSV | pd.read_csv() | pd.read_csv("sales.csv") |
| Excel | pd.read_excel() | pd.read_excel("sales.xlsx", sheet_name="2023") |
| JSON | pd.read_json() | pd.read_json("data.json") |
| HTML Tables | pd.read_html() | pd.read_html("https://example.com")[0] |
| SQL Databases | pd.read_sql() | pd.read_sql("SELECT * FROM sales", conn) |
| Parquet | pd.read_parquet() | pd.read_parquet("data.parquet") |
| Clipboard | pd.read_clipboard() | Reads from copied spreadsheet/table text |

Example:#

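A minimal sketch, assuming a file named sales.csv exists in the working directory (the filename is illustrative):

import pandas as pd

# Load the CSV and inspect the first few rows
df = pd.read_csv("sales.csv")
df.head()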


Useful Parameters When Loading Data#

| Parameter | Description | Example |
|-----------|-------------|---------|
| sep | Field delimiter | pd.read_csv("file.txt", sep="\t") |
| header | Row number(s) to use as column names | pd.read_csv("file.csv", header=0) |
| index_col | Column(s) to set as index | pd.read_csv("file.csv", index_col="ID") |
| usecols | Load specific columns | pd.read_csv("file.csv", usecols=["Name", "Sales"]) |
| nrows | Limit number of rows | pd.read_csv("file.csv", nrows=100) |
| encoding | File encoding type | pd.read_csv("file.csv", encoding="utf-8") |


Loading SQL Data — “The Hidden Goldmine”#

Connection examples:

| Source | Syntax Example |
|--------|----------------|
| SQLite | sqlite3.connect("data.db") |
| MySQL | mysql+pymysql://user:pass@host/db |
| PostgreSQL | postgresql://user:pass@host/db |
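A minimal sketch using the standard-library sqlite3 module (the database file and table name are illustrative):

import sqlite3
import pandas as pd

# Open a connection and pull a whole table into a DataFrame
conn = sqlite3.connect("data.db")
df_sql = pd.read_sql("SELECT * FROM sales", conn)
conn.close()
df_sql.head()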

💬 SQL is like Excel’s cooler cousin who actually organizes things.


Loading APIs — “The Live Data Stream”#

💡 APIs often return JSON — use pd.json_normalize() to flatten nested structures.
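A minimal sketch with the requests library (the URL and the shape of the JSON are hypothetical):

import pandas as pd
import requests

# Fetch JSON from an API endpoint (hypothetical URL)
response = requests.get("https://api.example.com/sales")
records = response.json()

# Flatten any nested fields into ordinary columns
df_api = pd.json_normalize(records)
df_api.head()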


🧩 Common Data Loading Issues#

| Issue | Cause | Fix |
|-------|-------|-----|
| Empty columns | Extra commas | on_bad_lines="skip" (replaces the deprecated error_bad_lines=False) |
| Wrong header row | Offset | header=1 |
| Date parsing issues | Mixed formats | parse_dates=["Date"] |
| Duplicate rows | Export error | df.drop_duplicates() |


🧮 Data Combination — concat, merge, and join#

Once your data is loaded, it often comes from multiple sources. Let’s learn how to combine it efficiently.


Why Combine Datasets?#

  • Monthly sales stored separately

  • Customer details in one file, purchases in another

  • Financial metrics split by departments

Combining them allows complete analysis.


NumPy Concatenation#

np.concatenate() joins arrays along an existing axis; np.vstack() and np.hstack() are convenient shortcuts for stacking vertically (rows) and horizontally (columns).
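A quick sketch on small illustrative arrays:

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

np.vstack([a, b])                # stack rows -> shape (4, 2)
np.hstack([a, b])                # stack columns -> shape (2, 4)
np.concatenate([a, b], axis=0)   # same result as vstack here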


pandas Concatenation#

pd.concat() stacks objects row-wise by default (axis=0); pass axis=1 for horizontal concatenation.
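A minimal sketch (illustrative DataFrames):

import pandas as pd

df_left = pd.DataFrame({"A": [1, 2]})
df_right = pd.DataFrame({"B": [3, 4]})

# axis=1 places the DataFrames side by side, aligned on the index
pd.concat([df_left, df_right], axis=1)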


Merging — SQL-style#

| Join Type | Keeps | Example Use |
|-----------|-------|-------------|
| inner | Keys present in both | Matching rows only |
| left | All left rows | All sales, missing customers as NaN |
| right | All right rows | All customers, missing sales as NaN |
| outer | All rows from both | Union of both tables |


Joining by Index#
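A minimal sketch of an index-aligned join (Example 3 below walks through this in more detail):

import pandas as pd

sales = pd.DataFrame({"Sales": [250, 300]}, index=["Alice", "Bob"])
region = pd.DataFrame({"Region": ["East", "West"]}, index=["Alice", "Bob"])

# .join() aligns on the index by default (left join)
sales.join(region)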


Merging on Multiple Keys#
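A minimal sketch, passing a list of column names to on= (the key columns are illustrative):

import pandas as pd

orders = pd.DataFrame({
    "store": ["S1", "S1", "S2"],
    "month": ["Jan", "Feb", "Jan"],
    "sales": [100, 120, 90],
})
targets = pd.DataFrame({
    "store": ["S1", "S2"],
    "month": ["Jan", "Jan"],
    "target": [110, 95],
})

# Rows match only when both store AND month agree
pd.merge(orders, targets, on=["store", "month"], how="left")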


Summary of Combination Tools#

| Method | Library | Purpose | Example |
|--------|---------|---------|---------|
| np.concatenate() | NumPy | Combine numeric arrays | Raw arrays |
| pd.concat() | pandas | Stack tables | Append monthly data |
| pd.merge() | pandas | SQL-style join | Merge sales & customers |
| .join() | pandas | Join on index | Lookup-style merge |


🧪 Example: Simulated Business Dataset#
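One possible sketch of a simulated dataset (all values are randomly generated, not real business data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dates = pd.date_range("2023-01-01", periods=90, freq="D")

business = pd.DataFrame({
    "date": dates,
    "revenue": rng.normal(1000, 150, size=90).round(2),
    "region": rng.choice(["North", "South", "East", "West"], size=90),
})
business.head()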


📈 Visualize Time-Series#


# Your code here
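One possible solution sketch, reusing the simulated business DataFrame from above (assumes matplotlib is installed):

import matplotlib.pyplot as plt

# Plot daily revenue over time
business.plot(x="date", y="revenue", figsize=(10, 4), title="Daily Revenue")
plt.show()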

🔗 Combining DataFrames#

Concatenation#

Concatenating pandas objects together row-wise:

df = pd.DataFrame(np.random.randn(10, 4))
df

# Break it into pieces
pieces = [df[:3], df[3:7], df[7:]]

# Concatenate back together
pd.concat(pieces)

Example 1 — Concatenating DataFrames using pd.concat()#

You can use pd.concat() to combine multiple DataFrames along a particular axis.

import pandas as pd

df1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"]
})

df2 = pd.DataFrame({
    "A": ["A4", "A5", "A6", "A7"],
    "B": ["B4", "B5", "B6", "B7"]
})

df3 = pd.DataFrame({
    "A": ["A8", "A9", "A10", "A11"],
    "B": ["B8", "B9", "B10", "B11"]
})

# Combine all DataFrames vertically (stacking rows)
result = pd.concat([df1, df2, df3])
print(result)

Merging (SQL-style Joins)#

SQL-style joins are performed using merge():

# Create sample dataframes
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

# Merge on key
print("\nMerged DataFrame:")
pd.merge(left, right, on="key")

Merging on Unique Keys#

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

print("\nMerged DataFrame:")
pd.merge(left, right, on="key")

Example 2 — Merging DataFrames using pd.merge()#

pd.merge() combines two DataFrames based on common columns or keys, similar to SQL joins.

import pandas as pd

left = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"]
})

right = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"]
})

# Merge using the common column 'key'
result = pd.merge(left, right, on="key")
print(result)

Example 3 — Joining DataFrames using DataFrame.join()#

join() allows combining columns from different DataFrames based on their index.

import pandas as pd

left = pd.DataFrame({
    "A": ["A0", "A1", "A2"],
    "B": ["B0", "B1", "B2"]
}, index=["K0", "K1", "K2"])

right = pd.DataFrame({
    "C": ["C0", "C2", "C3"],
    "D": ["D0", "D2", "D3"]
}, index=["K0", "K2", "K3"])

# Default join is a left join on index
result = left.join(right)
print(result)