Welcome to the Great Data Buffet, where we feast on CSVs, Excel sheets, SQL databases, APIs — and even understand what data truly means.
Before we can do Machine Learning, we first have to do Machine Feeding. 🤖🥄
🧠 Why This Matters¶
Your ML model is only as good as the data you feed it. Think of this chapter as learning how to open the right doors to the right kitchens — because business data hides everywhere:
Finance teams live in Excel
Engineers hide data in SQL databases
Marketing pulls it from APIs
And sometimes… it’s just a CSV file on someone’s desktop named “final_version_9_REAL.csv”
💡 If your dataset name has the word final in it, it’s never final.
⚙️ Prerequisite: Python & Pandas¶
We’ll be using the pandas library — the king of data handling in Python. If you’re not comfortable with it yet, check out: 👉 📘 Programming for Business
🧾 Understanding Data¶
The Latin word data is the plural of datum — “(thing) given.” Its first recorded English use dates back to the 1640s. By 1946, data referred to “transmissible and storable computer information.” (Source: Wikipedia)
Data exists in many forms — geographical, cultural, financial, statistical, meteorological, and more.
🔍 Types of Data¶
| Type | Description | Example |
|---|---|---|
| Numerical | Quantitative values | Stock price, sales volume |
| Categorical | Distinct groups | Sector: IT, Pharma, Banking |
| Time Series | Indexed by time | Daily closing prices |
| Textual | Unstructured data | News headlines, financial reports |
| Ordinal | Ordered categories | Credit ratings: AAA, AA, A, BBB... |
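A quick sketch of how these types map onto pandas dtypes — the column names and values below are illustrative, not from a real dataset:

```python
import pandas as pd

# One column per data type from the table above
df = pd.DataFrame({
    "price": [101.5, 99.2, 100.8],                    # Numerical
    "sector": pd.Categorical(["IT", "Pharma", "IT"]), # Categorical
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03"]),  # Time series
    "headline": ["Stocks rally", "Markets dip",
                 "Flat day"],                         # Textual
    "rating": pd.Categorical(
        ["AAA", "AA", "A"],
        categories=["BBB", "A", "AA", "AAA"],
        ordered=True),                                # Ordinal
})
print(df.dtypes)
```

Note that ordinal data is just a categorical with `ordered=True`, which makes comparisons like `df["rating"] < "AAA"` meaningful.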
✨ Key Insights¶
Data Scales: Nominal, ordinal, interval, ratio
Sources: Primary (e.g., internal reports) vs. Secondary (e.g., public indices)
Time-Series Components:

$$y_t = T_t + S_t + \varepsilon_t$$

where $T_t$ = trend, $S_t$ = seasonality, and $\varepsilon_t$ = noise.
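The trend + seasonality + noise decomposition can be sketched numerically; the numbers here are purely illustrative:

```python
import numpy as np

n = 12
t = np.arange(n)
trend = 100 + 0.5 * t                        # slow upward drift
seasonality = 3 * np.sin(2 * np.pi * t / 4)  # repeating cycle, period 4
rng = np.random.default_rng(0)
noise = rng.normal(0, 1, n)                  # random fluctuations

y = trend + seasonality + noise              # additive time series
print(y.round(2))
```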
📘 Pandas Basics - Object Creation¶
Creating a Series¶
A Series is a one-dimensional labeled array that can hold data of any type (integers, strings, floats, Python objects, etc.).
import numpy as np
import pandas as pd
# Creating a Series from a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Creating a DataFrame with NumPy Array¶
You can create a DataFrame by passing a NumPy array with a datetime index using pd.date_range() and labeled columns:
dates = pd.date_range("20130101", periods=6)
dates
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
Creating a DataFrame from a Dictionary¶
Creating a DataFrame by passing a dictionary where keys are column labels and values are the column values:
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
📁 Reading and Writing Data Files¶
CSV Files¶
Pandas provides easy methods to read from and write to CSV files:
# Create sample data
df_sample = pd.DataFrame(np.random.randint(0, 5, (10, 5)))
# Writing to CSV
df_sample.to_csv("foo.csv")
print("Data written to foo.csv")
# Reading from CSV
df_read = pd.read_csv("foo.csv")
df_read
Parquet Files¶
Parquet is a columnar storage format that’s efficient for large datasets:
# Writing to Parquet
df_sample.to_parquet("foo.parquet")
print("Data written to foo.parquet")
# Reading from Parquet
df_parquet = pd.read_parquet("foo.parquet")
df_parquet
Excel Files¶
Reading and writing Excel files is straightforward with pandas:
# Writing to Excel
df_sample.to_excel("foo.xlsx", sheet_name="Sheet1")
print("Data written to foo.xlsx")
# Reading from Excel
df_excel = pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
df_excel
Manual DataFrame Creation (Programmatic / In-Memory Data)¶
You can create DataFrames manually using Python dictionaries, lists, or records. This is particularly useful for testing, quick experiments, or small structured datasets.
From Dictionary of Lists or Arrays¶
import pandas as pd
data = {
"Product": ["A", "B", "C"],
"Sales": [100, 150, 200],
"Region": ["East", "West", "North"]
}
df = pd.DataFrame(data)
From List of Dictionaries¶
records = [
{"Product": "A", "Sales": 100, "Region": "East"},
{"Product": "B", "Sales": 150, "Region": "West"},
]
df = pd.DataFrame.from_records(records)
From Dictionary with Index¶
df = pd.DataFrame.from_dict(
{"A": [1, 2], "B": [3, 4]},
orient="columns"
)
From List of Lists¶
df = pd.DataFrame(
[["A", 100, "East"], ["B", 150, "West"]],
columns=["Product", "Sales", "Region"]
)
From NumPy Array¶
import numpy as np
df = pd.DataFrame(
np.random.rand(3, 2),
columns=["A", "B"]
)
Loading Data from Files¶
Pandas provides multiple built-in methods for reading data from common file formats.
Common methods:¶
| File Type | Function | Example |
|---|---|---|
| CSV | pd.read_csv() | pd.read_csv("sales.csv") |
| Excel | pd.read_excel() | pd.read_excel("sales.xlsx", sheet_name="2023") |
| JSON | pd.read_json() | pd.read_json("data.json") |
| HTML Tables | pd.read_html() | pd.read_html("https://example.com")[0] |
| SQL Databases | pd.read_sql() | pd.read_sql("SELECT * FROM sales", conn) |
| Parquet | pd.read_parquet() | pd.read_parquet("data.parquet") |
| Clipboard | pd.read_clipboard() | Reads from copied spreadsheet/table text |
Example:¶
import pandas as pd
# Load CSV file
df_csv = pd.read_csv("sales.csv")
# Load Excel sheet
df_excel = pd.read_excel("sales_data.xlsx", sheet_name="2023")
# Load JSON file
df_json = pd.read_json("data.json")
Useful Parameters When Loading Data¶
| Parameter | Description | Example |
|---|---|---|
| sep | Field delimiter | pd.read_csv("file.txt", sep="\t") |
| header | Row number(s) to use as column names | pd.read_csv("file.csv", header=0) |
| index_col | Column(s) to set as index | pd.read_csv("file.csv", index_col="ID") |
| usecols | Load specific columns | pd.read_csv("file.csv", usecols=["Name", "Sales"]) |
| nrows | Limit number of rows | pd.read_csv("file.csv", nrows=100) |
| encoding | File encoding type | pd.read_csv("file.csv", encoding="utf-8") |
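Several of these parameters can be combined in one call. Here we read from an in-memory string via `io.StringIO` so the example is self-contained; the column names are made up:

```python
import io
import pandas as pd

# A tab-delimited "file" held in memory
raw = ("ID\tName\tSales\tRegion\n"
       "1\tAlice\t100\tEast\n"
       "2\tBob\t150\tWest\n"
       "3\tCarol\t200\tNorth\n")

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                         # tab-delimited fields
    header=0,                         # first row holds column names
    index_col="ID",                   # use ID as the index
    usecols=["ID", "Name", "Sales"],  # skip the Region column
    nrows=2,                          # only the first two data rows
)
print(df)
```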
Loading SQL Data — “The Hidden Goldmine”¶
import sqlite3
conn = sqlite3.connect("company.db")
df = pd.read_sql("SELECT * FROM customers LIMIT 5;", conn)
conn.close()
Connection Examples¶
| Source | Syntax Example |
|---|---|
| SQLite | sqlite3.connect("data.db") |
| MySQL | mysql+pymysql://user:pass@host/db |
| PostgreSQL | postgresql://user:pass@host/db |
💬 SQL is like Excel’s cooler cousin who actually organizes things.
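The snippet above assumes an existing company.db file; here is a self-contained round trip using an in-memory SQLite database (the table and column names are illustrative):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Create a table from a DataFrame
customers = pd.DataFrame({"CustomerID": [1, 2, 3],
                          "Name": ["Alice", "Bob", "Charlie"]})
customers.to_sql("customers", conn, index=False)

# Read it back with SQL
df = pd.read_sql("SELECT * FROM customers WHERE CustomerID <= 2;", conn)
print(df)
conn.close()
```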
Loading APIs — “The Live Data Stream”¶
import requests
import pandas as pd
url = "https://api.exchangerate-api.com/v4/latest/USD"
response = requests.get(url)
data = response.json()
df = pd.DataFrame(data['rates'].items(), columns=['Currency', 'Exchange Rate'])
print(df.head())

💡 APIs often return JSON — use pd.json_normalize() to flatten nested structures.
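A minimal sketch of pd.json_normalize() on a nested payload; the structure below is made up for illustration, not what the exchange-rate API above returns:

```python
import pandas as pd

# Hypothetical nested JSON, e.g. the result of response.json()
payload = [
    {"id": 1, "user": {"name": "Alice", "city": "Delhi"}},
    {"id": 2, "user": {"name": "Bob", "city": "Mumbai"}},
]

# Nested keys are flattened into dotted column names
flat = pd.json_normalize(payload)
print(flat.columns.tolist())
```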
🧩 Common Data Loading Issues¶
| Issue | Cause | Fix |
|---|---|---|
| Empty columns | Extra commas | on_bad_lines="skip" (error_bad_lines was removed in pandas 2.0) |
| Wrong header row | Offset | header=1 |
| Date parsing issues | Mixed formats | parse_dates=['Date'] |
| Duplicate rows | Export error | df.drop_duplicates() |
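Two of the fixes from the table, sketched on a small in-memory CSV (the data is made up):

```python
import io
import pandas as pd

# Duplicate row and string dates, as from a messy export
raw = ("Date,Sales\n"
       "2024-01-01,100\n"
       "2024-01-01,100\n"
       "2024-01-02,150\n")

# parse_dates converts the Date column to datetime64 on read
df = pd.read_csv(io.StringIO(raw), parse_dates=["Date"])
print(df.dtypes)

# drop_duplicates removes the accidental repeated row
df = df.drop_duplicates()
print(len(df))
```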
🧮 Data Combination — concat, merge, and join¶
Once your data is loaded, it often comes from multiple sources. Let’s learn how to combine it efficiently.
Why Combine Datasets?¶
Monthly sales stored separately
Customer details in one file, purchases in another
Financial metrics split by departments
Combining them allows complete analysis.
NumPy Concatenation¶
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=0)
np.vstack() and np.hstack() are convenient shortcuts.
pandas Concatenation¶
import pandas as pd
df1 = pd.DataFrame({'ID': [1,2,3], 'Name': ['Alice','Bob','Charlie']})
df2 = pd.DataFrame({'ID': [4,5], 'Name': ['David','Emma']})
pd.concat([df1, df2], ignore_index=True)
Horizontal concatenation:
pd.concat([df1, df2], axis=1)
Merging — SQL-style¶
sales = pd.DataFrame({'CustomerID':[1,2,3], 'Amount':[200,150,300]})
customers = pd.DataFrame({'CustomerID':[1,2,4], 'Name':['Alice','Bob','David']})
pd.merge(sales, customers, on='CustomerID', how='inner')
| Join Type | Keeps | Example Use |
|---|---|---|
| inner | Keys in both | Matching rows only |
| left | All left keys | All sales, missing customers as NaN |
| right | All right keys | All customers, missing sales as NaN |
| outer | All keys | All rows from both |
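The difference between join types shows up in which rows survive and where NaN appears. A sketch using the same sales/customers frames as above:

```python
import pandas as pd

sales = pd.DataFrame({"CustomerID": [1, 2, 3],
                      "Amount": [200, 150, 300]})
customers = pd.DataFrame({"CustomerID": [1, 2, 4],
                          "Name": ["Alice", "Bob", "David"]})

# left keeps every sale; CustomerID 3 has no customer match, so Name is NaN
left = pd.merge(sales, customers, on="CustomerID", how="left")

# outer keeps the union of keys from both frames
outer = pd.merge(sales, customers, on="CustomerID", how="outer")

print(left)
print(outer)
```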
Joining by Index¶
left = pd.DataFrame({'Name':['Alice','Bob','Charlie'],'Age':[25,30,35]})
right = pd.DataFrame({'City':['Delhi','Bangalore','Mumbai']}, index=['Alice','Bob','Charlie'])
joined = left.set_index('Name').join(right)
Merging on Multiple Keys¶
df1 = pd.DataFrame({
'Region':['North','South','North'],
'Year':[2023,2023,2024],
'Sales':[200,150,220]
})
df2 = pd.DataFrame({
'Region':['North','South','West'],
'Year':[2023,2023,2024],
'Profit':[50,40,70]
})
pd.merge(df1, df2, on=['Region','Year'], how='outer')
Summary of Combination Tools¶
| Method | Library | Purpose | Example |
|---|---|---|---|
| np.concatenate() | NumPy | Combine numeric arrays | Raw arrays |
| pd.concat() | pandas | Stack tables | Append monthly data |
| pd.merge() | pandas | SQL-style join | Merge sales & customers |
| .join() | pandas | Join on index | Lookup-style merge |
🧪 Example: Simulated Business Dataset¶
import numpy as np, pandas as pd, datetime
from faker import Faker
fake = Faker()
np.random.seed(42)
n = 100
dates = [datetime.datetime(2025,1,1)+datetime.timedelta(days=i) for i in range(n)]
trend = np.linspace(90,110,n)
seasonality = 5*np.sin(np.linspace(0,2*np.pi*10,n))
noise = np.random.normal(0,2,n)
closing_prices = (trend+seasonality+noise).round(2)
data = {
'Date': dates,
'Stock_Price': np.random.normal(100,10,n).round(2),
'Trading_Volume': np.random.randint(1000,10000,n),
'Sector': np.random.choice(['IT','Pharma','Banking','Energy'],n),
'Closing_Price': closing_prices,
'News_Headline': [fake.sentence(nb_words=10) for _ in range(n)],
'Credit_Rating': np.random.choice(['AAA','AA','A','BBB'],n,p=[0.1,0.3,0.4,0.2])
}
df = pd.DataFrame(data)
df.head()
📈 Visualize Time-Series¶
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.plot(df['Date'], df['Closing_Price'], label='Daily Closing Price')
plt.title('Sample Time-Series: Daily Closing Prices')
plt.xlabel('Date')
plt.ylabel('Price ($)')
plt.legend()
plt.grid(True)
plt.show()
df = pd.DataFrame(np.random.randn(10, 4))
df
# Break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
# Concatenate back together
pd.concat(pieces)
Example 1 — Concatenating DataFrames using pd.concat()¶
You can use pd.concat() to combine multiple DataFrames along a particular axis.
import pandas as pd
df1 = pd.DataFrame({
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"]
})
df2 = pd.DataFrame({
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"]
})
df3 = pd.DataFrame({
"A": ["A8", "A9", "A10", "A11"],
"B": ["B8", "B9", "B10", "B11"]
})
# Combine all DataFrames vertically (stacking rows)
result = pd.concat([df1, df2, df3])
print(result)
Merging (SQL-style Joins)¶
SQL-style joins are performed using merge():
# Create sample dataframes
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)
# Merge on key
print("\nMerged DataFrame:")
pd.merge(left, right, on="key")
Merging on Unique Keys¶
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)
print("\nMerged DataFrame:")
pd.merge(left, right, on="key")
Example 2 — Merging DataFrames using pd.merge()¶
pd.merge() combines two DataFrames based on common columns or keys, similar to SQL joins.
import pandas as pd
left = pd.DataFrame({
"key": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"]
})
right = pd.DataFrame({
"key": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"]
})
# Merge using the common column 'key'
result = pd.merge(left, right, on="key")
print(result)
Example 3 — Joining DataFrames using DataFrame.join()¶
join() allows combining columns from different DataFrames based on their index.
import pandas as pd
left = pd.DataFrame({
"A": ["A0", "A1", "A2"],
"B": ["B0", "B1", "B2"]
}, index=["K0", "K1", "K2"])
right = pd.DataFrame({
"C": ["C0", "C2", "C3"],
"D": ["D0", "D2", "D3"]
}, index=["K0", "K2", "K3"])
# Default join is a left join on index
result = left.join(right)
print(result)