
Welcome to the Great Data Buffet, where we feast on CSVs, Excel sheets, SQL databases, APIs — and even understand what data truly means.

Before we can do Machine Learning, we first have to do Machine Feeding. 🤖🥄


🧠 Why This Matters

Your ML model is only as good as the data you feed it. Think of this chapter as learning how to open the right doors to the right kitchens — because business data hides everywhere:

  • Finance teams live in Excel

  • Engineers hide data in SQL databases

  • Marketing pulls it from APIs

  • And sometimes… it’s just a CSV file on someone’s desktop named “final_version_9_REAL.csv”

💡 If your dataset name has the word final in it, it’s never final.


⚙️ Prerequisite: Python & Pandas

We’ll be using the pandas library — the king of data handling in Python. If you’re not comfortable with it yet, check out: 👉 📘 Programming for Business


🧾 Understanding Data

The Latin word data is the plural of datum — “(thing) given.” Its first recorded English use dates back to the 1640s. By 1946, data referred to “transmissible and storable computer information.” (Source: Wikipedia)

*Figure: data types diagram*

Data exists in many forms — geographical, cultural, financial, statistical, meteorological, and more.


🔍 Types of Data

| Type | Description | Example |
|------|-------------|---------|
| Numerical | Quantitative values | Stock price, sales volume |
| Categorical | Distinct groups | Sector: IT, Pharma, Banking |
| Time Series | Indexed by time | Daily closing prices |
| Textual | Unstructured data | News headlines, financial reports |
| Ordinal | Ordered categories | Credit ratings: AAA, AA, A, BBB... |
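These types map naturally onto pandas dtypes. A minimal sketch with made-up values (all column names and data below are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical mini-dataset illustrating each data type
df = pd.DataFrame({
    "Stock_Price": [101.5, 99.8, 103.2],                      # numerical
    "Sector": pd.Categorical(["IT", "Pharma", "Banking"]),    # categorical
    "Date": pd.to_datetime(["2025-01-01", "2025-01-02",
                            "2025-01-03"]),                   # time series
    "Headline": ["Markets rally", "Tech dips", "Banks gain"], # textual
    "Rating": pd.Categorical(
        ["AAA", "AA", "A"],
        categories=["BBB", "A", "AA", "AAA"],  # worst to best
        ordered=True,                          # ordinal: order matters
    ),
})
print(df.dtypes)
```

An ordered `Categorical` is what lets pandas compare ratings meaningfully (e.g. `df["Rating"].max()` returns the best rating rather than failing).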

✨ Key Insights

  • Data Scales: Nominal, ordinal, interval, ratio

  • Sources: Primary (e.g., internal reports) vs. Secondary (e.g., public indices)

  • Time-Series Components:

    $$y_t = T_t + S_t + \epsilon_t$$

    where $T_t$ = trend, $S_t$ = seasonality, and $\epsilon_t$ = noise.
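The additive decomposition above can be simulated in a few lines of NumPy (the trend slope, cycle length, and noise scale below are arbitrary choices for illustration):

```python
import numpy as np

n = 100
t = np.arange(n)
trend = 0.5 * t                                # T_t: linear trend
seasonality = 3 * np.sin(2 * np.pi * t / 12)   # S_t: 12-step cycle
rng = np.random.default_rng(0)
noise = rng.normal(0, 1, n)                    # epsilon_t: random noise

y = trend + seasonality + noise                # y_t = T_t + S_t + epsilon_t
```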


📘 Pandas Basics - Object Creation

Creating a Series

A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, etc.).

import numpy as np
import pandas as pd

# Creating a Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Creating a DataFrame with NumPy Array

You can create a DataFrame by passing a NumPy array with a datetime index using pd.date_range() and labeled columns:

dates = pd.date_range("20130101", periods=6)
dates
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

Creating a DataFrame from a Dictionary

Creating a DataFrame by passing a dictionary where keys are column labels and values are the column values:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

📁 Reading and Writing Data Files

CSV Files

Pandas provides easy methods to read from and write to CSV files:

# Create sample data
df_sample = pd.DataFrame(np.random.randint(0, 5, (10, 5)))

# Writing to CSV
df_sample.to_csv("foo.csv")
print("Data written to foo.csv")

# Reading from CSV
df_read = pd.read_csv("foo.csv")
df_read

Parquet Files

Parquet is a columnar storage format that’s efficient for large datasets:

# Writing to Parquet
df_sample.to_parquet("foo.parquet")
print("Data written to foo.parquet")

# Reading from Parquet
df_parquet = pd.read_parquet("foo.parquet")
df_parquet

Excel Files

Reading and writing Excel files is straightforward with pandas:

# Writing to Excel
df_sample.to_excel("foo.xlsx", sheet_name="Sheet1")
print("Data written to foo.xlsx")

# Reading from Excel
df_excel = pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
df_excel

Manual DataFrame Creation (Programmatic / In-Memory Data)

You can create DataFrames manually using Python dictionaries, lists, or records. This is particularly useful for testing, quick experiments, or small structured datasets.

From Dictionary of Lists or Arrays

import pandas as pd

data = {
    "Product": ["A", "B", "C"],
    "Sales": [100, 150, 200],
    "Region": ["East", "West", "North"]
}

df = pd.DataFrame(data)

From List of Dictionaries

records = [
    {"Product": "A", "Sales": 100, "Region": "East"},
    {"Product": "B", "Sales": 150, "Region": "West"},
]

df = pd.DataFrame.from_records(records)

From Dictionary with Index

df = pd.DataFrame.from_dict(
    {"A": [1, 2], "B": [3, 4]},
    orient="columns"
)

From List of Lists

df = pd.DataFrame(
    [["A", 100, "East"], ["B", 150, "West"]],
    columns=["Product", "Sales", "Region"]
)

From NumPy Array

import numpy as np

df = pd.DataFrame(
    np.random.rand(3, 2),
    columns=["A", "B"]
)

Loading Data from Files

Pandas provides multiple built-in methods for reading data from common file formats.

Common methods:

| File Type | Function | Example |
|-----------|----------|---------|
| CSV | `pd.read_csv()` | `pd.read_csv("sales.csv")` |
| Excel | `pd.read_excel()` | `pd.read_excel("sales.xlsx", sheet_name="2023")` |
| JSON | `pd.read_json()` | `pd.read_json("data.json")` |
| HTML Tables | `pd.read_html()` | `pd.read_html("https://example.com")[0]` |
| SQL Databases | `pd.read_sql()` | `pd.read_sql("SELECT * FROM sales", conn)` |
| Parquet | `pd.read_parquet()` | `pd.read_parquet("data.parquet")` |
| Clipboard | `pd.read_clipboard()` | Reads copied spreadsheet/table text |

Example:

import pandas as pd

# Load CSV file
df_csv = pd.read_csv("sales.csv")

# Load Excel sheet
df_excel = pd.read_excel("sales_data.xlsx", sheet_name="2023")

# Load JSON file
df_json = pd.read_json("data.json")

Useful Parameters When Loading Data

| Parameter | Description | Example |
|-----------|-------------|---------|
| `sep` | Field delimiter | `pd.read_csv("file.txt", sep="\t")` |
| `header` | Row number(s) to use as column names | `pd.read_csv("file.csv", header=0)` |
| `index_col` | Column(s) to set as index | `pd.read_csv("file.csv", index_col="ID")` |
| `usecols` | Load specific columns | `pd.read_csv("file.csv", usecols=["Name", "Sales"])` |
| `nrows` | Limit number of rows | `pd.read_csv("file.csv", nrows=100)` |
| `encoding` | File encoding | `pd.read_csv("file.csv", encoding="utf-8")` |
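Several of these parameters combined in one call (here an in-memory string via `io.StringIO` stands in for a real file, so the snippet runs as-is; in practice you would pass a file path):

```python
import io
import pandas as pd

# In-memory stand-in for a semicolon-separated file on disk
raw = "ID;Name;Sales;Notes\n1;Alice;100;x\n2;Bob;150;y\n3;Cara;200;z\n"

df = pd.read_csv(
    io.StringIO(raw),                 # replace with "file.txt" in practice
    sep=";",                          # non-default delimiter
    index_col="ID",                   # use the ID column as the index
    usecols=["ID", "Name", "Sales"],  # skip the Notes column
    nrows=2,                          # read only the first two data rows
)
print(df)
```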

Loading SQL Data — “The Hidden Goldmine”

import sqlite3
conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM customers LIMIT 5;", conn)
conn.close()
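The snippet above assumes a `company.db` file already exists. For a fully self-contained version, an in-memory SQLite database works the same way (the table name and rows below are made up):

```python
import sqlite3
import pandas as pd

# In-memory SQLite database, so the example runs without any files
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Cara")],
)
conn.commit()

# Query straight into a DataFrame
df = pd.read_sql("SELECT * FROM customers WHERE id <= 2;", conn)
conn.close()
print(df)
```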

Connection Examples

| Source | Syntax Example |
|--------|----------------|
| SQLite | `sqlite3.connect("data.db")` |
| MySQL | `mysql+pymysql://user:pass@host/db` |
| PostgreSQL | `postgresql://user:pass@host/db` |

💬 SQL is like Excel’s cooler cousin who actually organizes things.


Loading APIs — “The Live Data Stream”

import requests
import pandas as pd

url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(url)
data = response.json()

df = pd.DataFrame(data['rates'].items(), columns=['Currency', 'Exchange Rate'])
print(df.head())

💡 APIs often return JSON — use pd.json_normalize() to flatten nested structures.
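For example, a hypothetical nested payload flattens into dotted column names:

```python
import pandas as pd

# Made-up nested API payload (not from the exchange-rate API above)
payload = [
    {"currency": "EUR", "info": {"rate": 0.92, "updated": "2025-01-01"}},
    {"currency": "INR", "info": {"rate": 83.1, "updated": "2025-01-01"}},
]

# Nested 'info' keys become 'info.rate' and 'info.updated' columns
df = pd.json_normalize(payload)
print(df)
```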


🧩 Common Data Loading Issues

| Issue | Cause | Fix |
|-------|-------|-----|
| Malformed rows | Extra commas | `on_bad_lines="skip"` (replaces the deprecated `error_bad_lines=False`) |
| Wrong header row | Offset | `header=1` |
| Date parsing issues | Mixed formats | `parse_dates=['Date']` |
| Duplicate rows | Export error | `df.drop_duplicates()` |
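Two of these fixes in action (the tiny CSV below is made up; note the duplicated export row):

```python
import io
import pandas as pd

# An export with string dates and an accidentally duplicated row
raw = "Date,Amount\n2025-01-01,100\n2025-01-01,100\n2025-01-02,150\n"

# parse_dates converts the Date column to real timestamps;
# drop_duplicates removes the repeated export row
df = pd.read_csv(io.StringIO(raw), parse_dates=["Date"])
df = df.drop_duplicates()
print(df)
```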

🧮 Data Combination — concat, merge, and join

Once your data is loaded, it often comes from multiple sources. Let’s learn how to combine it efficiently.


Why Combine Datasets?

  • Monthly sales stored separately

  • Customer details in one file, purchases in another

  • Financial metrics split by departments

Combining them allows complete analysis.


NumPy Concatenation

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

np.concatenate((a, b), axis=0)

np.vstack() and np.hstack() are convenient shortcuts.
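Both shortcuts reproduce `np.concatenate()` with a fixed axis:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

v = np.vstack((a, b))   # same as np.concatenate((a, b), axis=0)
h = np.hstack((a, a))   # same as np.concatenate((a, a), axis=1)
print(v.shape, h.shape)  # (3, 2) (2, 4)
```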


pandas Concatenation

import pandas as pd

df1 = pd.DataFrame({'ID': [1,2,3], 'Name': ['Alice','Bob','Charlie']})
df2 = pd.DataFrame({'ID': [4,5], 'Name': ['David','Emma']})

pd.concat([df1, df2], ignore_index=True)

Horizontal concatenation:

pd.concat([df1, df2], axis=1)

Merging — SQL-style

sales = pd.DataFrame({'CustomerID':[1,2,3], 'Amount':[200,150,300]})
customers = pd.DataFrame({'CustomerID':[1,2,4], 'Name':['Alice','Bob','David']})

pd.merge(sales, customers, on='CustomerID', how='inner')

| Join Type | Keeps | Example Use |
|-----------|-------|-------------|
| inner | Both | Matching rows only |
| left | Left | All sales, missing customers as NaN |
| right | Right | All customers, missing sales as NaN |
| outer | Both | All rows from both |
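Reusing the `sales` and `customers` frames above, a left join shows the NaN behaviour described in the table: customer 3 has no match, so its Name comes back missing.

```python
import pandas as pd

sales = pd.DataFrame({'CustomerID': [1, 2, 3], 'Amount': [200, 150, 300]})
customers = pd.DataFrame({'CustomerID': [1, 2, 4], 'Name': ['Alice', 'Bob', 'David']})

# Left join keeps every sale; unmatched CustomerID 3 gets Name = NaN
left = pd.merge(sales, customers, on='CustomerID', how='left')
print(left)
```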

Joining by Index

left = pd.DataFrame({'Name':['Alice','Bob','Charlie'],'Age':[25,30,35]})
right = pd.DataFrame({'City':['Delhi','Bangalore','Mumbai']}, index=['Alice','Bob','Charlie'])

joined = left.set_index('Name').join(right)

Merging on Multiple Keys

df1 = pd.DataFrame({
    'Region':['North','South','North'],
    'Year':[2023,2023,2024],
    'Sales':[200,150,220]
})
df2 = pd.DataFrame({
    'Region':['North','South','West'],
    'Year':[2023,2023,2024],
    'Profit':[50,40,70]
})
pd.merge(df1, df2, on=['Region','Year'], how='outer')

Summary of Combination Tools

| Method | Library | Purpose | Example |
|--------|---------|---------|---------|
| `np.concatenate()` | NumPy | Combine numeric arrays | Raw arrays |
| `pd.concat()` | pandas | Stack tables | Append monthly data |
| `pd.merge()` | pandas | SQL-style join | Merge sales & customers |
| `.join()` | pandas | Join on index | Lookup-style merge |

🧪 Example: Simulated Business Dataset

import numpy as np, pandas as pd, datetime
from faker import Faker
fake = Faker()

np.random.seed(42)
n = 100
dates = [datetime.datetime(2025,1,1)+datetime.timedelta(days=i) for i in range(n)]
trend = np.linspace(90,110,n)
seasonality = 5*np.sin(np.linspace(0,2*np.pi*10,n))
noise = np.random.normal(0,2,n)
closing_prices = (trend+seasonality+noise).round(2)

data = {
    'Date': dates,
    'Stock_Price': np.random.normal(100,10,n).round(2),
    'Trading_Volume': np.random.randint(1000,10000,n),
    'Sector': np.random.choice(['IT','Pharma','Banking','Energy'],n),
    'Closing_Price': closing_prices,
    'News_Headline': [fake.sentence(nb_words=10) for _ in range(n)],
    'Credit_Rating': np.random.choice(['AAA','AA','A','BBB'],n,p=[0.1,0.3,0.4,0.2])
}

df = pd.DataFrame(data)
df.head()

📈 Visualize Time-Series

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plt.plot(df['Date'], df['Closing_Price'], label='Daily Closing Price')
plt.title('Sample Time-Series: Daily Closing Prices')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()


🔗 Combining DataFrames

Concatenation

Concatenating pandas objects together row-wise:

df = pd.DataFrame(np.random.randn(10, 4))
df

# Break it into pieces
pieces = [df[:3], df[3:7], df[7:]]

# Concatenate back together
pd.concat(pieces)

Example 1 — Concatenating DataFrames using pd.concat()

You can use pd.concat() to combine multiple DataFrames along a particular axis.

import pandas as pd

df1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"]
})

df2 = pd.DataFrame({
    "A": ["A4", "A5", "A6", "A7"],
    "B": ["B4", "B5", "B6", "B7"]
})

df3 = pd.DataFrame({
    "A": ["A8", "A9", "A10", "A11"],
    "B": ["B8", "B9", "B10", "B11"]
})

# Combine all DataFrames vertically (stacking rows)
result = pd.concat([df1, df2, df3])
print(result)

Merging (SQL-style Joins)

SQL-style joins are performed using merge():

# Create sample dataframes
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

# Merge on key
print("\nMerged DataFrame:")
pd.merge(left, right, on="key")

Merging on Unique Keys

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

print("\nMerged DataFrame:")
pd.merge(left, right, on="key")

Example 2 — Merging DataFrames using pd.merge()

pd.merge() combines two DataFrames based on common columns or keys, similar to SQL joins.

import pandas as pd

left = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"]
})

right = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"]
})

# Merge using the common column 'key'
result = pd.merge(left, right, on="key")
print(result)

Example 3 — Joining DataFrames using DataFrame.join()

join() allows combining columns from different DataFrames based on their index.

import pandas as pd

left = pd.DataFrame({
    "A": ["A0", "A1", "A2"],
    "B": ["B0", "B1", "B2"]
}, index=["K0", "K1", "K2"])

right = pd.DataFrame({
    "C": ["C0", "C2", "C3"],
    "D": ["D0", "D2", "D3"]
}, index=["K0", "K2", "K3"])

# Default join is a left join on index
result = left.join(right)
print(result)