Data Cleaning & Preprocessing#

Welcome to Digital Laundry Day! In this chapter, we’ll wash away the sins of dirty datasets — typos, missing values, inconsistent formats, and all those “creative” Excel exports.

💬 Data scientists spend 80% of their time cleaning data and the other 20% complaining about it.


🧠 Why This Matters#

Machine Learning models are like fancy coffee machines ☕ — they only brew good results if you put in clean beans (data).

  • Garbage In → Garbage Out 🗑️

  • Clean In → Insights Out 💡

Our mission: turn messy business data into something your model (and your manager) can actually understand.


⚙️ Prerequisite: Pandas Power 🐼#

We’ll use pandas — your digital vacuum cleaner. If you’re new to it, check my other book: 👉 📘 Programming for Business


🧹 Step 1. Inspect Before You Disinfect#

Before cleaning, always look at the dirt.

import pandas as pd

df = pd.read_csv("sales_data.csv")
df.info()
df.describe()
df.head()

💡 Pro Tip: .info() tells you which columns have missing values, and .describe() helps spot weird outliers.

| Common Symptom | Diagnosis |
| --- | --- |
| NaN or None values | Missing data |
| “?” or “N/A” strings | Excel ghosts 👻 |
| Wrong data types | CSV mischief |
| Customer ID as float | Export gone rogue |
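Many of those “?” and “N/A” ghosts can be exorcised at load time. A minimal sketch, assuming those are the placeholder strings your export uses:

import pandas as pd

# Tell pandas which placeholder strings should become real NaN values
df = pd.read_csv("sales_data.csv", na_values=["?", "N/A", "n/a", ""])
df.isna().sum()  # the ghosts now show up as honest missing values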


🧽 Step 2. Cleaning Column Names#

Half of business data problems begin with column names like Customer Name (v2 FINAL).

Fix it:#

df.columns = (df.columns.str.strip().str.lower()
                        .str.replace(r'[^a-z0-9]+', '_', regex=True).str.strip('_'))
print(df.columns)

💬 Now “Customer Name (v2 FINAL)” becomes customer_name_v2_final. Still chaotic, but consistently chaotic.


🧴 Step 3. Removing Duplicates#

Duplicates in business data are like duplicate meetings — unnecessary and confusing.

before = df.shape[0]
df = df.drop_duplicates()
after = df.shape[0]
print(f"Removed {before - after} duplicate rows.")

✅ Use subset=['column_name'] if you only want to deduplicate based on a specific column.
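For example, to keep only the latest record per customer (a sketch assuming customer_id and purchase_date columns, which appear later in this chapter):

# Keep the most recent row per customer, dropping older duplicates
df = df.sort_values('purchase_date').drop_duplicates(subset=['customer_id'], keep='last')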


🧊 Step 4. Handling Missing Values#

Missing data is the ghost of analytics — invisible, but it haunts your models.

Quick Peek:#

df.isna().sum()

Strategy Options:#

| Situation | Fix |
| --- | --- |
| Missing numeric values | df['price'].fillna(df['price'].mean()) |
| Missing categories | df['region'].fillna('Unknown') |
| Rows totally empty | df.dropna(how='all', inplace=True) |
| Missing values in key columns | df.dropna(subset=['customer_id']) |
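Putting a few of these together in one minimal sketch (remember that fillna() and dropna() return new objects unless you assign the result or pass inplace=True):

df['price'] = df['price'].fillna(df['price'].mean())   # numeric: fill with the column mean
df['region'] = df['region'].fillna('Unknown')          # categorical: explicit 'Unknown' bucket
df = df.dropna(how='all')                              # drop rows that are completely empty
df = df.dropna(subset=['customer_id'])                 # drop rows missing the key column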

💬 Remember: “Dropping data” sounds scary, but so does training on garbage.


🧪 Step 5. Fixing Data Types#

When you see customer_id as a float, you know Excel has been involved. 🙄

Example:#

df['customer_id'] = df['customer_id'].astype(str)
df['purchase_date'] = pd.to_datetime(df['purchase_date'])

| Wrong Type | Correct Fix |
| --- | --- |
| object → int | astype(int) |
| object → datetime | pd.to_datetime() |
| float → category | astype('category') |
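One gotcha: astype(int) raises an error if the column still contains NaN. A hedged sketch of a safer route using pd.to_numeric and pandas’ nullable Int64 dtype (the quantity column is only an illustration):

# Coerce non-numeric junk to NaN first, then use the nullable integer dtype
df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce').astype('Int64')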

💡 Clean types = clean joins, clean analysis, and fewer model tantrums.


🍋 Step 6. Fixing Inconsistent Values#

When your regions are labeled as North, north, and NORTH, the model cries.

Example:#

df['region'] = df['region'].str.lower().str.strip()
df['region'] = df['region'].replace({'n.america': 'north america'})

| Problem | Example | Fix |
| --- | --- | --- |
| Case sensitivity | North vs north | .str.lower() |
| Extra spaces | " East " | .str.strip() |
| Abbreviations | n.america | .replace() |

💬 “Standardizing values is like standardizing coffee sizes — you avoid fights later.”


🧯 Step 7. Outlier Treatment Lite (Preview)#

We’ll go deep on this in the next chapter, but here’s a teaser:

import numpy as np

Q1 = df['sales'].quantile(0.25)
Q3 = df['sales'].quantile(0.75)
IQR = Q3 - Q1

df = df[(df['sales'] >= Q1 - 1.5 * IQR) & (df['sales'] <= Q3 + 1.5 * IQR)]

💡 This is like filtering out extremely suspicious “$9,999,999” sales rows entered by accident (or by optimism).


🧩 Practice Challenge: “The Sales Spa”#

Download the dataset messy_sales.csv and perform the following:

  1. Rename columns to snake_case.

  2. Remove duplicate rows.

  3. Fill missing regions with “Unknown”.

  4. Convert date to datetime.

  5. Clean inconsistent region names.

Bonus: Count how many rows survived the cleaning!

# Your turn!
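Stuck? Here is one possible sketch; the region and date columns come from the task list, and the rest of messy_sales.csv's layout is assumed:

import pandas as pd

df = pd.read_csv("messy_sales.csv")
rows_before = df.shape[0]

# 1. snake_case column names
df.columns = df.columns.str.strip().str.lower().str.replace(r'[^a-z0-9]+', '_', regex=True).str.strip('_')

# 2. Remove duplicate rows
df = df.drop_duplicates()

# 3. Fill missing regions
df['region'] = df['region'].fillna('Unknown')

# 4. Convert date to datetime
df['date'] = pd.to_datetime(df['date'])

# 5. Standardize region spelling (this also lowercases 'Unknown', which is fine)
df['region'] = df['region'].str.strip().str.lower()

# Bonus: how many rows survived
print(f"{df.shape[0]} of {rows_before} rows survived the cleaning.")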

💬 “When in doubt, print .head() — pandas never lies.”


🧭 Recap#

| Step | Goal | Tool |
| --- | --- | --- |
| Inspect | See what’s wrong | .info(), .describe() |
| Clean names | Consistent columns | .str.strip() |
| Drop duplicates | Remove redundancy | .drop_duplicates() |
| Handle missing | Fill or drop | .fillna(), .dropna() |
| Fix types | Correct schema | .astype(), pd.to_datetime() |
| Standardize values | Normalize text | .str.lower() |

🎯 Clean data = smooth analysis + happy ML models + one less headache for you.


🔜 Next Stop#

👉 Head to Handling Missing Data & Outliers — where we’ll face the real-world monsters of messy business data: missing info, freak values, and mystery spikes.

Adding Calculated Columns#

Creating calculated columns is one of the most common and powerful operations in data analysis. It allows you to derive new insights or metrics based on existing data — for example, calculating profit from sales and cost or computing growth percentages.

Let’s see how to add new columns in different ways:

Example 1: Simple Arithmetic Operations#

import pandas as pd

df = pd.DataFrame({
    'Revenue': [1000, 1500, 800, 1200],
    'Cost': [400, 700, 300, 600]
})

# Calculate profit
df['Profit'] = df['Revenue'] - df['Cost']
print(df)

Output:

   Revenue  Cost  Profit
0     1000   400     600
1     1500   700     800
2      800   300     500
3     1200   600     600

This simple operation shows how easily you can create derived columns using vectorized operations.

Example 2: Calculating Percentages#

You can also compute ratios or percentages directly:

df['Profit Margin (%)'] = (df['Profit'] / df['Revenue']) * 100
print(df)

Output:

   Revenue  Cost  Profit  Profit Margin (%)
0     1000   400     600               60.0
1     1500   700     800               53.3
2      800   300     500               62.5
3     1200   600     600               50.0

Example 3: Adding Columns Based on Conditions#

Sometimes you need to create new columns based on logic, such as flagging rows as “Profitable” when profit is positive and “Loss” otherwise.

df['Status'] = ['Profitable' if x > 0 else 'Loss' for x in df['Profit']]
print(df)

Output:

   Revenue  Cost  Profit  Profit Margin (%)      Status
0     1000   400     600               60.0  Profitable
1     1500   700     800               53.3  Profitable
2      800   300     500               62.5  Profitable
3     1200   600     600               50.0  Profitable

Example 4: Using assign() Method#

You can use assign() to chain operations neatly:

df = (df
      .assign(Tax=lambda x: x['Revenue'] * 0.1,
              Net_Profit=lambda x: x['Profit'] - x['Tax']))
print(df)

The advantage of assign() is method chaining — it allows you to write clean, readable code that flows step by step.

Example 5: Using apply() for Custom Logic#

If you have a more complex calculation that can’t be expressed with basic arithmetic:

def performance(row):
    if row['Profit Margin (%)'] > 55:
        return 'Excellent'
    elif row['Profit Margin (%)'] > 50:
        return 'Good'
    else:
        return 'Average'

df['Performance'] = df.apply(performance, axis=1)
print(df)

Summary: Adding Calculated Columns#

| Method | Use Case | Notes |
| --- | --- | --- |
| Vectorized operations | Fast, simple arithmetic | Best for numeric data |
| assign() | Clean method chaining | Ideal for pipelines |
| List comprehension | Simple conditional logic | Readable and concise |
| apply() | Complex row-based logic | Slower but flexible |


Practical Exercise#

Try these tasks to master calculated columns:

  1. Create a Discount column that gives a 10% discount on Revenue if Profit Margin > 60.

  2. Add a Revenue_after_Discount column that reflects the new revenue.

  3. Add a Cost_to_Revenue_Ratio column.

  4. Use apply() to classify performance as “Excellent,” “Moderate,” or “Needs Improvement.”

  5. Use assign() to add Tax and Net_Profit columns in one go.
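One possible solution sketch, continuing with the df from the examples above (the classification thresholds in step 4 are assumptions, not part of the exercise):

import numpy as np

# 1. 10% discount on Revenue where Profit Margin > 60
df['Discount'] = np.where(df['Profit Margin (%)'] > 60, df['Revenue'] * 0.10, 0)

# 2. Revenue after the discount
df['Revenue_after_Discount'] = df['Revenue'] - df['Discount']

# 3. Cost-to-revenue ratio
df['Cost_to_Revenue_Ratio'] = df['Cost'] / df['Revenue']

# 4. Row-wise classification with apply() (thresholds are illustrative)
def classify(row):
    if row['Profit Margin (%)'] > 58:
        return 'Excellent'
    elif row['Profit Margin (%)'] > 52:
        return 'Moderate'
    return 'Needs Improvement'

df['Performance'] = df.apply(classify, axis=1)

# 5. Tax and Net_Profit in one chained assign()
df = df.assign(Tax=lambda x: x['Revenue'] * 0.1,
               Net_Profit=lambda x: x['Profit'] - x['Tax'])
print(df)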


💡 Fun Analogy: Think of adding new columns like giving your dataset a promotion — you’re teaching it new skills! It already knew “Revenue” and “Cost,” but now it can calculate “Profit” and “Performance.” You’re turning it from a clerk into a business analyst.



3. Data Selection#


Definition: Data selection is the process of identifying the data that is relevant to the forecasting objective before any modeling begins.

Key Methods:#

  • Boolean indexing: Filter rows by conditions

  • Column selection: Choose specific features

  • Time-based selection: data[data['date'].between('2023-01-01', '2023-12-31')]
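The three methods above in a minimal, self-contained sketch (the tiny example frame and its column names are assumptions for illustration):

import pandas as pd

data = pd.DataFrame({
    'date': pd.to_datetime(['2022-12-30', '2023-03-15', '2023-07-01']),
    'region': ['north', 'south', 'north'],
    'amount': [500, 1500, 2500],
})

high_value = data[data['amount'] > 1000]                              # boolean indexing
features = data[['date', 'amount']]                                   # column selection
year_2023 = data[data['date'].between('2023-01-01', '2023-12-31')]    # time-based selection
print(high_value, features, year_2023, sep='\n\n')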

Modern forecasting leverages vast datasets and computational power, but strategic data selection remains critical to avoid garbage-in-garbage-out (GIGO) scenarios.

  1. Relevance to Objective

    • Example:

      • Stock price prediction → Historical prices, trading volume, volatility indices.

      • Credit risk → Transaction history, credit scores, macroeconomic indicators.

    • Tech Impact: AI models (e.g., LSTM, Transformer) can process high-dimensional data but require domain-driven feature selection.

  2. Data Types

    • Structured: Time-series (stock prices), tabular (balance sheets).

    • Unstructured: Earnings call transcripts (NLP), satellite images (CV).

    • Tech Tools:

      • APIs (YFinance, Alpha Vantage) for real-time structured data.

      • Scrapy/Selenium for web scraping unstructured data.

  3. Quality & Preprocessing

    • Timeliness: Use streaming pipelines (Kafka, Spark) for real-time forecasts.

    • Missing Data: Auto-imputation (KNN, MICE) vs. deletion (threshold-based).

    • Code Example:

      # Automated missing data handling
      from sklearn.impute import KNNImputer
      imputer = KNNImputer(n_neighbors=3)
      clean_data = imputer.fit_transform(raw_data)
      
  4. Feature Engineering

    • Traditional: Lags, rolling averages (Pandas).

    • AI-Driven: Automated feature generation (FeatureTools, AutoML).

    • Math Context: \(\text{Rolling Mean}_t = \frac{1}{w}\sum_{i=t-w+1}^{t} x_i\), where \(w\) is the window size.


Implementation (Python)#

# Sample data selection pipeline for stock forecasting
import yfinance as yf
from sklearn.feature_selection import mutual_info_regression

# 1. Extract data
data = yf.download("AAPL", start="2020-01-01", end="2023-12-31")

# 2. Select features (e.g., OHLCV + technical indicators)
data['MA_50'] = data['Close'].rolling(50).mean()  # Moving average
data['Daily_Return'] = data['Close'].pct_change()

# 3. Filter by relevance (e.g., mutual information with the target)
target = data['Close'].shift(-1)  # Predict next-day close
selected_features = data[['Open', 'Volume', 'MA_50', 'Daily_Return']]

# Drop rows with NaNs (from the rolling window, pct_change, and shift)
# so that features and target stay aligned before scoring
aligned = selected_features.assign(target=target).dropna()
mi_scores = mutual_info_regression(aligned[selected_features.columns], aligned['target'])
  • Ethics Discussion: Bias in data selection (e.g., excluding small-cap stocks in index forecasting).


Why This Matters#

  • Overfitting Risk: More data ≠ better models. Curated features beat brute force.

  • Tech Stack Alignment:

| Method | Data Needs | Tools |
| --- | --- | --- |
| Holt-Winters | Clean univariate series | Statsmodels, Pandas |
| Prophet | Time + regressors | Facebook Prophet |
| Deep Learning | High-dimension data (e.g., tick-level) | TensorFlow, PyTorch |



4. Data Preparation for Time Series#


  • Description: Cleaning and transforming data (e.g., handling missing values, normalizing, removing outliers).

  • Relevance: Clean data is critical for reliable forecasts.

  • Related Concepts:

    • Imputation: Filling missing values (e.g., mean imputation).

    • Normalization: Scaling data to a common range (e.g., min-max scaling: \( x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \)).

    • Outlier Detection: Using z-scores or IQR.

  • Example: Replacing missing sales data with the average of the previous month.


1. Imputation#

  1. Forward Fill (ffill):

    • \(\hat{y}_j = y_i\), where \(y_i\) is the last valid observation before the missing value at time \(j\).

  2. Backward Fill (bfill):

    • \(\hat{y}_j = y_k\), where \(y_k\) is the next valid observation after the missing value at time \(j\).

  3. Linear Interpolation:

    • \(\hat{y}_j = y_i + (y_k - y_i) \times \frac{j - i}{k - i}\), where \(y_i\) and \(y_k\) are the nearest valid observations.

  4. Polynomial Interpolation (Quadratic):

    • \(\hat{y}_j = P(j)\), where \(P(t) = at^2 + bt + c\) is a second-degree polynomial fitted to nearby points.

  5. Time-based Interpolation:

    • Analogous to linear interpolation, but \(i, j, k\) represent timestamps, considering irregular intervals.

  6. Rolling Mean Imputation:

    • \(\hat{y}_j = \frac{1}{w} \sum_{i=j-w}^{j-1} y_i\), where \(w\) is the window size.

  7. Rolling Median Imputation:

    • \(\hat{y}_j = \text{median}(y_{j-w}, \dots, y_{j-1})\), where \(w\) is the window size.

  8. Seasonal Decomposition and Imputation:

    • \(y_t = T_t + S_t + R_t\) (additive model). Missing values are imputed using the estimated trend (\(T_t\)) and seasonal (\(S_t\)) components at the time of the missing value.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# 1. Create Sample Time Series with Missing Values
dates = pd.date_range(start='2024-01-01', periods=30, freq='D')
data = pd.Series([10, 12, np.nan, 15, 13, np.nan, 20, 22, 25, np.nan,
                   18, 21, 23, 26, np.nan, 29, 31, 30, 28, np.nan,
                   35, 38, 40, np.nan, 42, 45, 43, 41, np.nan, 48], index=dates)

# 2. Implement Different Missing Data Handling Methods
df_imputed = pd.DataFrame({'original': data})
df_imputed['ffill'] = data.ffill()
df_imputed['bfill'] = data.bfill()
df_imputed['linear_interp'] = data.interpolate(method='linear')
df_imputed['quadratic_interp'] = data.interpolate(method='polynomial', order=2)
df_imputed['time_interp'] = data.interpolate(method='time')
df_imputed['rolling_mean'] = data.rolling(window=3, min_periods=1).mean().fillna(data)
df_imputed['rolling_median'] = data.rolling(window=3, min_periods=1).median().fillna(data)

# Seasonal Decomposition and Imputation
try:
    decomposition = seasonal_decompose(data.interpolate(), model='additive', period=7, extrapolate_trend='freq')
    seasonal_component = decomposition.seasonal
    seasonal_imputed_values = data.copy()
    for i in seasonal_imputed_values[seasonal_imputed_values.isna()].index:
        seasonal_imputed_values[i] = decomposition.seasonal[i] + decomposition.trend[i] + decomposition.resid[i]
    df_imputed['seasonal_imputed'] = seasonal_imputed_values
except Exception as e:
    print(f"Error during seasonal decomposition: {e}")
    df_imputed['seasonal_imputed'] = np.nan


# 3. Plot the Results for Comparison
plt.figure(figsize=(15, 12))

plt.subplot(4, 2, 1)
plt.plot(df_imputed.index, df_imputed['original'], marker='o', label='Original (with NaN)')
plt.title('Original Data')
plt.legend()

plt.subplot(4, 2, 2)
plt.plot(df_imputed.index, df_imputed['ffill'], marker='x', label='Forward Fill')
plt.title('Forward Fill')
plt.legend()

plt.subplot(4, 2, 3)
plt.plot(df_imputed.index, df_imputed['bfill'], marker='x', label='Backward Fill')
plt.title('Backward Fill')
plt.legend()

plt.subplot(4, 2, 4)
plt.plot(df_imputed.index, df_imputed['linear_interp'], marker='x', label='Linear Interpolation')
plt.title('Linear Interpolation')
plt.legend()

plt.subplot(4, 2, 5)
plt.plot(df_imputed.index, df_imputed['quadratic_interp'], marker='x', label='Quadratic Interpolation')
plt.title('Quadratic Interpolation')
plt.legend()

plt.subplot(4, 2, 6)
plt.plot(df_imputed.index, df_imputed['time_interp'], marker='x', label='Time-based Interpolation')
plt.title('Time-based Interpolation')
plt.legend()

plt.subplot(4, 2, 7)
plt.plot(df_imputed.index, df_imputed['rolling_mean'], marker='x', label='Rolling Mean (Window=3)')
plt.title('Rolling Mean (Window=3)')
plt.legend()

plt.subplot(4, 2, 8)
plt.plot(df_imputed.index, df_imputed['rolling_median'], marker='x', label='Rolling Median (Window=3)')
plt.title('Rolling Median (Window=3)')
plt.legend()

plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 5))
plt.plot(df_imputed.index, df_imputed['original'], marker='o', label='Original (with NaN)')
plt.plot(df_imputed.index, df_imputed['seasonal_imputed'], marker='x', label='Seasonal Imputation')
plt.title('Seasonal Imputation')
plt.legend()
plt.tight_layout()
plt.show()
df_imputed
original ffill bfill linear_interp quadratic_interp time_interp rolling_mean rolling_median seasonal_imputed
2024-01-01 10.0 10.0 10.0 10.0 10.000000 10.0 10.000000 10.0 10.0
2024-01-02 12.0 12.0 12.0 12.0 12.000000 12.0 11.000000 11.0 12.0
2024-01-03 NaN 12.0 15.0 13.5 14.371799 13.5 11.000000 11.0 13.5
2024-01-04 15.0 15.0 15.0 15.0 15.000000 15.0 13.500000 13.5 15.0
2024-01-05 13.0 13.0 13.0 13.0 13.000000 13.0 14.000000 14.0 13.0
2024-01-06 NaN 13.0 20.0 16.5 15.756569 16.5 14.000000 14.0 16.5
2024-01-07 20.0 20.0 20.0 20.0 20.000000 20.0 16.500000 16.5 20.0
2024-01-08 22.0 22.0 22.0 22.0 22.000000 22.0 21.000000 21.0 22.0
2024-01-09 25.0 25.0 25.0 25.0 25.000000 25.0 22.333333 22.0 25.0
2024-01-10 NaN 25.0 18.0 21.5 21.503347 21.5 23.500000 23.5 21.5
2024-01-11 18.0 18.0 18.0 18.0 18.000000 18.0 21.500000 21.5 18.0
2024-01-12 21.0 21.0 21.0 21.0 21.000000 21.0 19.500000 19.5 21.0
2024-01-13 23.0 23.0 23.0 23.0 23.000000 23.0 20.666667 21.0 23.0
2024-01-14 26.0 26.0 26.0 26.0 26.000000 26.0 23.333333 23.0 26.0
2024-01-15 NaN 26.0 29.0 27.5 27.640680 27.5 24.500000 24.5 27.5
2024-01-16 29.0 29.0 29.0 29.0 29.000000 29.0 27.500000 27.5 29.0
2024-01-17 31.0 31.0 31.0 31.0 31.000000 31.0 30.000000 30.0 31.0
2024-01-18 30.0 30.0 30.0 30.0 30.000000 30.0 30.000000 30.0 30.0
2024-01-19 28.0 28.0 28.0 28.0 28.000000 28.0 29.666667 30.0 28.0
2024-01-20 NaN 28.0 35.0 31.5 30.536134 31.5 29.000000 29.0 31.5
2024-01-21 35.0 35.0 35.0 35.0 35.000000 35.0 31.500000 31.5 35.0
2024-01-22 38.0 38.0 38.0 38.0 38.000000 38.0 36.500000 36.5 38.0
2024-01-23 40.0 40.0 40.0 40.0 40.000000 40.0 37.666667 38.0 40.0
2024-01-24 NaN 40.0 42.0 41.0 40.634635 41.0 39.000000 39.0 41.0
2024-01-25 42.0 42.0 42.0 42.0 42.000000 42.0 41.000000 41.0 42.0
2024-01-26 45.0 45.0 45.0 45.0 45.000000 45.0 43.500000 43.5 45.0
2024-01-27 43.0 43.0 43.0 43.0 43.000000 43.0 43.333333 43.0 43.0
2024-01-28 41.0 41.0 41.0 41.0 41.000000 41.0 43.000000 43.0 41.0
2024-01-29 NaN 41.0 48.0 44.5 42.524938 44.5 42.000000 42.0 44.5
2024-01-30 48.0 48.0 48.0 48.0 48.000000 48.0 44.500000 44.5 48.0

2. Normalization & Scaling#


  • original: The original time series data.

  • minmax: The data scaled using the Min-Max Scaler, which transforms values to a range between 0 and 1.

    • Mathematically, for each value \(x\), the scaled value \(x'\) is \(x' = \frac{x - \min(X)}{\max(X) - \min(X)}\), where \(\min(X)\) and \(\max(X)\) are the minimum and maximum values in the dataset \(X\).

  • robust: The data scaled using the Robust Scaler, which is less sensitive to outliers. It scales the data based on the interquartile range (IQR).

    • Mathematically, the scaled value \(x'\) is \(x' = \frac{x - \text{median}(X)}{\text{IQR}(X)}\), where \(\text{IQR}(X) = Q_3(X) - Q_1(X)\) is the difference between the 75th and 25th percentiles.

  • standard: The data scaled using the StandardScaler (Z-score normalization), which centers the data around zero with a standard deviation of one.

    • Mathematically, the scaled value \(x'\) is \(x' = \frac{x - \mu}{\sigma}\), where \(\mu\) is the mean of the data and \(\sigma\) is the standard deviation.

  • maxabs: The data scaled using the MaxAbsScaler, which scales each value by the maximum absolute value in the dataset, resulting in a range of \([-1, 1]\).

    • Mathematically, the scaled value \(x'\) is \(x' = \frac{x}{\max(|X|)}\), where \(\max(|X|)\) is the maximum of the absolute values in the dataset \(X\).

  • power_yj: The data transformed using the Yeo-Johnson Power Transformer. This transformation aims to make the data more normally distributed and can handle both positive and negative values. The transformation is defined piecewise and involves a logarithmic or power transformation depending on the value and the estimated parameter \(\lambda\).

  • quantile_uniform: The data transformed using the Quantile Transformer with a uniform output distribution. This maps the data to the percentiles of a uniform distribution, resulting in values between 0 and 1. It can help to reduce the impact of outliers and non-linearities.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler, MaxAbsScaler, PowerTransformer, QuantileTransformer, Normalizer

# Sample financial data (volatile with outliers and some trend)
prices = pd.Series([100, 102, 101, 150, 99, 105, 200, 98, 103, 110, 115, 120, 118, 130])
index = pd.RangeIndex(len(prices))  # Using a simple index for these scalers

# Create a DataFrame for comparison
df_scaled = pd.DataFrame({'original': prices.values})

# 1. Min-Max Scaling (range [0, 1])
scaler_minmax = MinMaxScaler()
df_scaled['minmax'] = scaler_minmax.fit_transform(prices.values.reshape(-1, 1)).flatten()

# 2. Robust Scaling (resistant to outliers)
scaler_robust = RobustScaler()
df_scaled['robust'] = scaler_robust.fit_transform(prices.values.reshape(-1, 1)).flatten()

# 3. StandardScaler (Z-score Normalization)
scaler_standard = StandardScaler()
df_scaled['standard'] = scaler_standard.fit_transform(prices.values.reshape(-1, 1)).flatten()

# 4. MaxAbsScaler (range [-1, 1])
scaler_maxabs = MaxAbsScaler()
df_scaled['maxabs'] = scaler_maxabs.fit_transform(prices.values.reshape(-1, 1)).flatten()

# 5. Power Transformer (Yeo-Johnson)
power_transformer_yj = PowerTransformer(method='yeo-johnson')
df_scaled['power_yj'] = power_transformer_yj.fit_transform(prices.values.reshape(-1, 1)).flatten()

# 6. Quantile Transformer (Uniform Output)
quantile_transformer_uniform = QuantileTransformer(output_distribution='uniform', n_quantiles=len(prices))
df_scaled['quantile_uniform'] = quantile_transformer_uniform.fit_transform(prices.values.reshape(-1, 1)).flatten()


# Plotting the results
plt.figure(figsize=(12, 10))
for i, column in enumerate(df_scaled.columns):
    plt.subplot(4, 2, i + 1)
    plt.plot(df_scaled.index, df_scaled[column], label=column)
    plt.title(f'{column.capitalize()} Scaling')
    plt.xlabel('Time Step')
    plt.ylabel('Scaled Value')
    plt.legend()
    plt.grid(True)
plt.tight_layout()
plt.show()
df_scaled
original minmax robust standard maxabs power_yj quantile_uniform
0 100 0.019608 -0.410959 -0.671063 0.500 -1.015589 0.153846
1 102 0.039216 -0.301370 -0.596203 0.510 -0.758435 0.307692
2 101 0.029412 -0.356164 -0.633633 0.505 -0.883538 0.230769
3 150 0.509804 2.328767 1.200427 0.750 1.526262 0.923077
4 99 0.009804 -0.465753 -0.708493 0.495 -1.155050 0.076923
5 105 0.068627 -0.136986 -0.483914 0.525 -0.420687 0.461538
6 200 1.000000 5.068493 3.071918 1.000 1.884688 1.000000
7 98 0.000000 -0.520548 -0.745923 0.490 -1.302415 0.000000
8 103 0.049020 -0.246575 -0.558774 0.515 -0.639853 0.384615
9 110 0.117647 0.136986 -0.296765 0.550 0.037821 0.538462
10 115 0.166667 0.410959 -0.109616 0.575 0.395385 0.615385
11 120 0.215686 0.684932 0.077533 0.600 0.677243 0.769231
12 118 0.196078 0.575342 0.002674 0.590 0.572290 0.692308
13 130 0.313725 1.232877 0.451831 0.650 1.081871 0.846154

3. Outlier Detection & Treatment#

  • original: The original time series data.

  • cleaned_iqr_median_rolling: The time series after treating outliers (detected using the IQR method) by replacing them with the median of a rolling window.

    • IQR Outlier Detection: An observation \(y_t\) is considered an outlier if \(y_t < Q_1 - 1.5 \times IQR\) or \(y_t > Q_3 + 1.5 \times IQR\), where \(Q_1\) is the first quartile, \(Q_3\) is the third quartile, and \(IQR = Q_3 - Q_1\).

    • Median Rolling Treatment: For each detected outlier at index \(i\), the treated value is \(\hat{y}_i = \text{median}(y_{i-w/2}, \dots, y_{i}, \dots, y_{i+w/2})\), where \(w\) is the window size (here, \(w = 5\)), and the median is calculated over the valid observations within that window.

  • cleaned_zscore_median_rolling: The time series after treating outliers (detected using the Z-score method) by replacing them with the median of a rolling window.

    • Z-score Outlier Detection: An observation \(y_t\) is considered an outlier if \(|y_t - \mu| > z \times \sigma\), where \(\mu\) is the mean of the time series, \(\sigma\) is the standard deviation, and \(z\) is a chosen threshold (here, \(z = 3\)).

    • Median Rolling Treatment: (Same as above)

  • cleaned_es_median_rolling: The time series after treating outliers (detected using the Exponential Smoothing deviation method) by replacing them with the median of a rolling window.

    • Exponential Smoothing Deviation Outlier Detection: Let \(S_t\) be the exponentially smoothed value. An observation \(y_t\) is an outlier if \(|y_t - S_t| > k \times \sigma_{\text{residuals}}\), where \(k\) is a threshold (here, \(k = 2\)) and \(\sigma_{\text{residuals}}\) is the standard deviation of the residuals.

    • Median Rolling Treatment: (Same as above)

  • winsorized: The time series after applying Winsorization, where extreme values below the 5th percentile and above the 95th percentile are capped and floored respectively.

    • Let \(P_{lower}\) be the value at the lower percentile (e.g., 5th) and \(P_{upper}\) be the value at the upper percentile (e.g., 95th). The winsorized value \(\hat{y}_t\) is: \[\hat{y}_t = \begin{cases} P_{lower} & \text{if } y_t < P_{lower} \\ y_t & \text{if } P_{lower} \le y_t \le P_{upper} \\ P_{upper} & \text{if } y_t > P_{upper} \end{cases}\]

  • interpolated_linear_iqr: The time series after treating outliers (detected using the IQR method) by replacing them with NaN and then performing linear interpolation.

    • IQR Outlier Detection: (Same as above)

    • Linear Interpolation Treatment: For a replaced outlier at time \(j\) between two valid points \(y_i\) and \(y_k\) (\(i < j < k\)), the interpolated value is \(\hat{y}_j = y_i + (y_k - y_i) \times \frac{j - i}{k - i}\).


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# 1. Create realistic time series data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=100, freq='D')
values = np.random.normal(loc=50, scale=5, size=100).round(1)
values[15] = 150
values[30:33] = [200, 205, 190]
ts = pd.Series(values, index=dates)


# 2. Outlier Detection Functions (for reference in treatment)
def detect_outliers_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    return series[(series < (Q1 - 1.5 * IQR)) | (series > (Q3 + 1.5 * IQR))]

def detect_outliers_zscore(series, threshold=3):
    z_scores = np.abs((series - series.mean()) / series.std())
    return series[z_scores > threshold]

def detect_outliers_exponential_smoothing(series, alpha=0.2, std_multiplier=2):
    model = SimpleExpSmoothing(series).fit(smoothing_level=alpha, optimized=False)
    smoothed = model.fittedvalues
    residuals = series - smoothed
    std_dev = residuals.std()
    upper_bound = smoothed + (std_multiplier * std_dev)
    lower_bound = smoothed - (std_multiplier * std_dev)
    return series[(series > upper_bound) | (series < lower_bound)]

# 3. Outlier Treatment Functions
def treat_outliers_median_rolling_iqr(series, window=5):
    cleaned = series.copy()
    outlier_indices = detect_outliers_iqr(series).index
    cleaned[outlier_indices] = series.rolling(
        window, min_periods=1, center=True
    ).median()[outlier_indices]
    return cleaned

def treat_outliers_winsorize(series, lower_percentile=5, upper_percentile=95):
    lower_bound = series.quantile(lower_percentile / 100)
    upper_bound = series.quantile(upper_percentile / 100)
    cleaned = series.copy()
    cleaned[series < lower_bound] = lower_bound
    cleaned[series > upper_bound] = upper_bound
    return cleaned

def treat_outliers_interpolate_iqr(series, method='linear'):
    cleaned = series.copy()
    outlier_indices = detect_outliers_iqr(series).index
    cleaned[outlier_indices] = np.nan
    cleaned = cleaned.interpolate(method=method)
    return cleaned

def treat_outliers_median_rolling_zscore(series, window=5):
    cleaned = series.copy()
    outlier_indices = detect_outliers_zscore(series).index
    cleaned[outlier_indices] = series.rolling(
        window, min_periods=1, center=True
    ).median()[outlier_indices]
    return cleaned

def treat_outliers_median_rolling_es(series, window=5):
    cleaned = series.copy()
    outlier_indices = detect_outliers_exponential_smoothing(series).index
    cleaned[outlier_indices] = series.rolling(
        window, min_periods=1, center=True
    ).median()[outlier_indices]
    return cleaned

# 4. Apply Treatment and Create DataFrame
df_treated = pd.DataFrame({'original': ts})
df_treated['cleaned_iqr_median_rolling'] = treat_outliers_median_rolling_iqr(ts)
df_treated['cleaned_zscore_median_rolling'] = treat_outliers_median_rolling_zscore(ts)
df_treated['cleaned_es_median_rolling'] = treat_outliers_median_rolling_es(ts)
df_treated['winsorized'] = treat_outliers_winsorize(ts)
df_treated['interpolated_linear_iqr'] = treat_outliers_interpolate_iqr(ts, method='linear')

print("\nDataFrame with Original and Treated Values:")
print(df_treated.head())

# 5. Visual Verification
plt.figure(figsize=(15, 15))

plt.subplot(3, 2, 1)
plt.plot(df_treated.index, df_treated['original'], 'b-', alpha=0.5, label='Original')
plt.plot(df_treated.index, df_treated['cleaned_iqr_median_rolling'], 'r--', linewidth=2, label='Cleaned (IQR - Median Rolling)')
outliers_iqr = detect_outliers_iqr(ts)
plt.scatter(outliers_iqr.index, outliers_iqr, c='black', marker='x', s=50, label='Outliers (IQR)')
plt.legend()
plt.title("Outlier Treatment with IQR - Median Rolling")

plt.subplot(3, 2, 2)
plt.plot(df_treated.index, df_treated['original'], 'b-', alpha=0.5, label='Original')
plt.plot(df_treated.index, df_treated['cleaned_zscore_median_rolling'], 'g--', linewidth=2, label='Cleaned (Z-score - Median Rolling)')
outliers_zscore = detect_outliers_zscore(ts)
plt.scatter(outliers_zscore.index, outliers_zscore, c='black', marker='x', s=50, label='Outliers (Z-score)')
plt.legend()
plt.title("Outlier Treatment with Z-score - Median Rolling")

plt.subplot(3, 2, 3)
plt.plot(df_treated.index, df_treated['original'], 'b-', alpha=0.5, label='Original')
plt.plot(df_treated.index, df_treated['cleaned_es_median_rolling'], 'c--', linewidth=2, label='Cleaned (Exp. Smooth - Median Rolling)')
outliers_es = detect_outliers_exponential_smoothing(ts)
plt.scatter(outliers_es.index, outliers_es, c='black', marker='x', s=50, label='Outliers (Exp. Smooth)')
plt.legend()
plt.title("Outlier Treatment with Exponential Smoothing - Median Rolling")

plt.subplot(3, 2, 4)
plt.plot(df_treated.index, df_treated['original'], 'b-', alpha=0.5, label='Original')
plt.plot(df_treated.index, df_treated['winsorized'], 'y--', linewidth=2, label='Winsorized')
winsorized_outlier_indices = ts[ts < ts.quantile(0.05)].index.union(ts[ts > ts.quantile(0.95)].index)
plt.scatter(winsorized_outlier_indices, ts[winsorized_outlier_indices], c='black', marker='x', s=50, label='Winsorized Points')
plt.legend()
plt.title("Outlier Treatment with Winsorization")

plt.subplot(3, 2, 5)
plt.plot(df_treated.index, df_treated['original'], 'b-', alpha=0.5, label='Original')
plt.plot(df_treated.index, df_treated['interpolated_linear_iqr'], 'k--', linewidth=2, label='Interpolated (Linear - IQR)')
outliers_iqr = detect_outliers_iqr(ts)
plt.scatter(outliers_iqr.index, ts[outliers_iqr.index], c='black', marker='x', s=50, label='Outliers (for Interp)')
plt.legend()
plt.title("Outlier Treatment with Linear Interpolation (IQR)")

plt.tight_layout()
plt.show()
DataFrame with Original and Treated Values:
            original  cleaned_iqr_median_rolling  \
2023-01-01      52.5                        52.5   
2023-01-02      49.3                        49.3   
2023-01-03      53.2                        53.2   
2023-01-04      57.6                        57.6   
2023-01-05      48.8                        48.8   

            cleaned_zscore_median_rolling  cleaned_es_median_rolling  \
2023-01-01                           52.5                       52.5   
2023-01-02                           49.3                       49.3   
2023-01-03                           53.2                       53.2   
2023-01-04                           57.6                       57.6   
2023-01-05                           48.8                       48.8   

            winsorized  interpolated_linear_iqr  
2023-01-01        52.5                     52.5  
2023-01-02        49.3                     49.3  
2023-01-03        53.2                     53.2  
2023-01-04        57.6                     57.6  
2023-01-05        48.8                     48.8  
df_treated
original cleaned_iqr_median_rolling cleaned_zscore_median_rolling cleaned_es_median_rolling winsorized interpolated_linear_iqr
2023-01-01 52.5 52.5 52.5 52.5 52.5 52.5
2023-01-02 49.3 49.3 49.3 49.3 49.3 49.3
2023-01-03 53.2 53.2 53.2 53.2 53.2 53.2
2023-01-04 57.6 57.6 57.6 57.6 57.6 57.6
2023-01-05 48.8 48.8 48.8 48.8 48.8 48.8
... ... ... ... ... ... ...
2023-04-06 42.7 42.7 42.7 42.7 42.7 42.7
2023-04-07 51.5 51.5 51.5 51.5 51.5 51.5
2023-04-08 51.3 51.3 51.3 51.3 51.3 51.3
2023-04-09 50.0 50.0 50.0 50.0 50.0 50.0
2023-04-10 48.8 48.8 48.8 48.8 48.8 48.8

100 rows × 6 columns


4. Feature Engineering for Time Series#


Time series data exhibit temporal dependencies, trends, and sometimes seasonality. Effective preprocessing helps models learn these patterns. Below are key feature engineering techniques:

  1. Lags (Autoregression)

Lag features use past values of the target variable to capture autoregressive behavior.

\[ \text{lag}_k(t) = y(t - k) \]

For example:

  • lag_1 captures the value at time $t-1$

  • lag_2 captures the value at time $t-2$

These are helpful for models like ARIMA or any regression-based forecasting.

  2. Rolling Windows (Moving Statistics)

Rolling windows smooth the data and extract local trends or volatility. Two common statistics:

  • Rolling Mean:

    \[ \text{roll\_mean}_w(t) = \frac{1}{w} \sum_{i=0}^{w-1} y(t - i) \]
  • Rolling Standard Deviation:

    \[ \text{roll\_std}_w(t) = \sqrt{\frac{1}{w} \sum_{i=0}^{w-1} (y(t - i) - \text{roll\_mean}_w(t))^2} \]

This helps detect moving averages and changes in volatility.

  3. Differencing (Stationarity)

Differencing removes trends and stabilizes the mean of a time series. It is especially useful before applying ARIMA:

  • First-order differencing:

    \[ \text{diff}_1(t) = y(t) - y(t - 1) \]

This transformation makes non-stationary time series more stationary by removing linear trends.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Simulate time series data
np.random.seed(42)
date_range = pd.date_range(start='2020-01-01', periods=100, freq='D')
data = pd.DataFrame({
    'value': np.cumsum(np.random.randn(100)) + 10
}, index=date_range)

# --- Debugging Tip: Check for missing values before processing
print("Initial Missing Values:\n", data.isnull().sum())

# --- Lags: Capture autoregressive patterns
data['lag_1'] = data['value'].shift(1)
data['lag_2'] = data['value'].shift(2)

# --- Rolling Windows: Smooth noise and extract trends/volatility
data['roll_mean_7'] = data['value'].rolling(window=7).mean()
data['roll_std_7'] = data['value'].rolling(window=7).std()

# --- Differencing: Remove trend and stabilize the mean
data['diff_1'] = data['value'].diff()

# --- Debugging Tip: Check for new NaNs after feature engineering
print("\nMissing Values After Transformations:\n", data.isnull().sum())

# --- Visualization: Plot transformations
plt.figure(figsize=(14, 12))

# 1. Lags
plt.subplot(4, 1, 1)
data[['value', 'lag_1', 'lag_2']].plot(ax=plt.gca(), title='Original and Lag Features')

# 2. Rolling Mean
plt.subplot(4, 1, 2)
data[['value', 'roll_mean_7']].plot(ax=plt.gca(), title='7-Day Rolling Mean')

# 3. Rolling Std Dev (compared to value)
plt.subplot(4, 1, 3)
data[['value', 'roll_std_7']].plot(ax=plt.gca(), title='7-Day Rolling Standard Deviation')

# 4. Differencing (compared to value)
plt.subplot(4, 1, 4)
data[['value', 'diff_1']].plot(ax=plt.gca(), title='First-order Differencing')

plt.tight_layout()
plt.show()
Initial Missing Values:
 value    0
dtype: int64

Missing Values After Transformations:
 value          0
lag_1          1
lag_2          2
roll_mean_7    6
roll_std_7     6
diff_1         1
dtype: int64
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- Simulate time series with trend, seasonality, and noise
np.random.seed(42)
date_range = pd.date_range(start='2020-01-01', periods=200, freq='D')

trend = np.linspace(0, 10, 200)                            # Linear trend
seasonality = 5 * np.sin(2 * np.pi * date_range.dayofyear / 30)  # Monthly seasonality
noise = np.random.normal(0, 0.5, size=200)                 # Small Gaussian noise

values = trend + seasonality + noise

data = pd.DataFrame({'value': values}, index=date_range)

# --- Debugging Tip: Check for missing values before processing
print("Initial Missing Values:\n", data.isnull().sum())

# --- Feature Engineering
data['lag_1'] = data['value'].shift(1)
data['lag_2'] = data['value'].shift(2)

data['roll_mean_7'] = data['value'].rolling(window=7).mean()
data['roll_std_7'] = data['value'].rolling(window=7).std()

data['diff_1'] = data['value'].diff()

# --- Debugging Tip: Check for missing values after transformations
print("\nMissing Values After Transformations:\n", data.isnull().sum())

# --- Visualization
plt.figure(figsize=(14, 12))

# 1. Lags
plt.subplot(4, 1, 1)
data[['value', 'lag_1', 'lag_2']].plot(ax=plt.gca(), title='Original and Lag Features')

# 2. Rolling Mean
plt.subplot(4, 1, 2)
data[['value', 'roll_mean_7']].plot(ax=plt.gca(), title='7-Day Rolling Mean')

# 3. Rolling Std Dev
plt.subplot(4, 1, 3)
data[['value', 'roll_std_7']].plot(ax=plt.gca(), title='7-Day Rolling Standard Deviation')

# 4. Differencing
plt.subplot(4, 1, 4)
data[['value', 'diff_1']].plot(ax=plt.gca(), title='First-order Differencing')

plt.tight_layout()
plt.show()

# --- Final null summary
print("\nFinal Missing Values Summary:\n", data.isnull().sum())
Initial Missing Values:
 value    0
dtype: int64

Missing Values After Transformations:
 value          0
lag_1          1
lag_2          2
roll_mean_7    6
roll_std_7     6
diff_1         1
dtype: int64
Final Missing Values Summary:
 value          0
lag_1          1
lag_2          2
roll_mean_7    6
roll_std_7     6
diff_1         1
dtype: int64

7. Data Classification#


Data classification in forecasting refers to the process of organizing and categorizing input data based on its characteristics or properties. This helps in selecting the appropriate preprocessing methods and forecasting models.

  1. Structured vs Unstructured Data

    • Structured: Data organized in tables (e.g., stock prices in a CSV)

    • Unstructured: Text, images, audio (e.g., news articles used for sentiment forecasting)

  2. Quantitative vs Qualitative

    • Quantitative: Numeric values (e.g., revenue, temperature)

    • Qualitative: Descriptive categories (e.g., region, product type)

  3. Time Series vs Cross-sectional

    • Time Series: Observations indexed in time order (e.g., daily sales)

    • Cross-sectional: Data collected at a single point in time (e.g., customer survey results)

  4. Categorical vs Continuous

    • Categorical: Finite set of categories (e.g., “low”, “medium”, “high” risk)

    • Continuous: Real-valued measurements (e.g., $x \in \mathbb{R}$)
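These categories map directly onto pandas dtypes. A small illustrative sketch (the example frame is an assumption) that separates quantitative from qualitative columns:

import pandas as pd

df = pd.DataFrame({
    'daily_sales': [120.5, 98.0, 143.2],                      # continuous, quantitative
    'region': ['north', 'south', 'north'],                    # qualitative
    'risk_level': pd.Categorical(['low', 'high', 'medium']),  # categorical
})

quantitative = df.select_dtypes(include='number')
qualitative = df.select_dtypes(exclude='number')
print(quantitative.columns.tolist(), qualitative.columns.tolist())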


8. Data Partition#


  • Description: Splitting data into training, validation, and test sets.

  • Relevance: Prevents overfitting (e.g., memorizing vs. learning) and evaluates model performance.

  • Related Concepts:

    • Cross-Validation: K-fold validation for robust testing.

Data partitioning is the process of splitting a dataset into subsets for training, validation, and testing.

Standard Splits:#

  • Training Set (60-70%): Used to train the forecasting model.

  • Validation Set (10-20%): Used to tune model parameters.

  • Test Set (20-30%): Used to evaluate model performance.

Mathematically:

If \( D \) is the full dataset,

\[ D = D_{train} \cup D_{val} \cup D_{test} \quad \text{and} \quad D_{train} \cap D_{val} = D_{train} \cap D_{test} = D_{val} \cap D_{test} = \emptyset \]
import pandas as pd
from sklearn.model_selection import train_test_split

# Using the same dataset from earlier (Monthly Total Equity Flows)
url = "https://raw.githubusercontent.com/datasets/investor-flow-of-funds-us/master/data/weekly.csv"
df = pd.read_csv(url)
df['date'] = pd.to_datetime(df['Date'])
df = df[['date', 'Total Equity']]
df.columns = ['date', 'equity_flow']
df = df.resample('ME', on='date').sum().dropna()

# Create lag features for supervised learning
df['equity_flow_lag1'] = df['equity_flow'].shift(1)
df['equity_flow_lag2'] = df['equity_flow'].shift(2)
df.dropna(inplace=True)

# Split into features and target
X = df[['equity_flow_lag1', 'equity_flow_lag2']]
y = df['equity_flow']

# Split into Train and Test (80-20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
X shape: (157, 2)
y shape: (157,)
X_train shape: (125, 2)
X_test shape: (32, 2)
y_train shape: (125,)
y_test shape: (32,)
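For the cross-validation mentioned above, ordinary shuffled K-fold would leak future information into training. A minimal sketch with scikit-learn's TimeSeriesSplit, reusing the X and y built in the code above:

from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on an expanding window of past observations
# and validates on the block that immediately follows it
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}: train rows = {len(train_idx)}, validation rows = {len(val_idx)}")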
# Your code here