Data Loading, Wrangling & Visualisation#

Welcome to the most honest part of Machine Learning — data wrangling, also known as “90% of the job no one posts about on LinkedIn.” 😅

If math was theory, this chapter is practice with mud. You’ll roll up your sleeves, clean messy data, and make it look like something a CEO would actually want to see in a dashboard.


🧠 Why This Matters#

Machine Learning models are like gourmet chefs — they can only make good predictions if you give them clean ingredients.

Unfortunately, business data often looks like this:

| Customer | Age | Revenue | Gender | Notes |
|----------|-----|---------|--------|-------|
| A-102 | 27 | $2,000 | F | missing |
| NaN |  | $500 | ? | typo |
| C-554 | 45 | -$200 | Male | refund |
| D-999 | 300 | $1,000 | cat | who let this happen |

So before we even think about algorithms, we’ll:

  1. Load data from messy sources.

  2. Clean it like digital laundry.

  3. Transform it into model-ready features.

  4. Visualize it like a storytelling pro.


💾 1. The Data Wrangling Trifecta#

| Step | Goal | Business Reality |
|------|------|------------------|
| Data Loading | Get data into Python | “Where’s my Excel file again?” |
| Data Cleaning | Fix mistakes & missing values | “Why is revenue negative?” |
| Feature Engineering | Add useful variables | “Let’s create a loyalty score!” |

By the end of this chapter, you’ll make data look so clean it could get a job at McKinsey.


📚 Prerequisite: Python Refresher#

If you’re new to Python or Pandas, don’t panic — it’s easier than assembling IKEA furniture. 👉 Check out my other book, 📘 Programming for Business. It covers everything from reading files to basic Python data manipulation.

💡 Tip: You’ll be using libraries like pandas, numpy, and matplotlib. If these look like Pokémon names right now, that book is your Pokédex.


🧩 Practice Corner: “Guess the Data Disaster”#

Match each messy situation with the tool that saves the day:

| Situation | Tool |
|-----------|------|
| File is a 200MB Excel sheet with multiple tabs | pandas.read_excel() |
| Missing values everywhere | df.fillna() or df.dropna() |
| Categorical columns like “Yes/No” | pd.get_dummies() |
| Data stored in a SQL database | pandas.read_sql() |
| REST API providing JSON data | requests.get() |

Pro tip: Pandas is your Swiss Army Knife for data chaos.
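
Here is a minimal sketch of a few of these tools in action. The DataFrame stands in for data you might otherwise load with pandas.read_excel() or pandas.read_sql(); all column names are invented for illustration.

import pandas as pd

# A tiny in-memory stand-in for data loaded from Excel, SQL, or an API
df = pd.DataFrame({
    "Customer": ["A1", "A2", None],
    "Revenue": [2000.0, None, 500.0],
    "Subscribed": ["Yes", "No", "Yes"],
})

# Missing values everywhere -> fillna() / dropna()
df["Revenue"] = df["Revenue"].fillna(df["Revenue"].median())
df = df.dropna(subset=["Customer"])

# Categorical columns like "Yes/No" -> get_dummies()
df = pd.get_dummies(df, columns=["Subscribed"])
print(df)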


🔍 2. Why Businesses Love Clean Data#

Messy data → Confused analysts → Wrong dashboards → Angry executives. Clean data → Confident models → Actionable insights → Happy bonuses. 🎉

You’ll soon realize:

Data cleaning is not boring — it’s debugging reality.

For example:

  • Missing age? → Estimate with median.

  • Wrong gender field? → Normalize text values.

  • Negative revenue? → Check for refunds.

  • Timestamp errors? → Convert to datetime.

You’re not just fixing numbers — you’re restoring business logic.
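
A quick sketch of those four fixes on a toy DataFrame; the column names and values below are made up for illustration, and each fix is one reasonable choice among several.

import pandas as pd

df = pd.DataFrame({
    "Age": [27, None, 45],
    "Gender": ["F", "male", " FEMALE "],
    "Revenue": [2000, -200, 500],
    "Timestamp": ["2024-01-05", "not a date", "2024-01-07"],
})

# Missing age -> estimate with the median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Wrong gender field -> normalize text values to one convention ('F' / 'M')
df["Gender"] = df["Gender"].str.strip().str.upper().str[0]

# Negative revenue -> flag as a possible refund instead of silently deleting it
df["IsRefund"] = df["Revenue"] < 0

# Timestamp errors -> convert to datetime; unparseable values become NaT
df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
print(df)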


🎨 3. Visualisation: Turning Data into Business Art#

Once you’ve tamed the chaos, it’s time to make your data pretty and persuasive.

This section covers:

  • Histograms that show sales trends 📊

  • Scatter plots revealing marketing ROI 💸

  • Correlation heatmaps for KPIs 🔥

  • Dashboards that make execs say “wow” ✨

Remember: “If it’s not visualized, it didn’t happen.” — Every Data Scientist, ever.


💬 4. Business Analogy: The Data Spa#

Think of your dataset like a customer entering a spa:

| Step | Data Action | Spa Equivalent |
|------|-------------|----------------|
| Loading | Getting checked in | “Welcome, Mr. CSV!” |
| Cleaning | Removing noise & junk | Exfoliation time 🧼 |
| Transformation | Standardizing features | Facial mask & makeover 💅 |
| Visualization | Presenting results | Walking the runway 🕺 |

When your data leaves this spa, it’s ready for the runway — or your next board meeting.


🧩 Practice Corner: “Wrangle This!”#

Here’s a messy dataset in Python. Try cleaning it up using what you’ll learn in this chapter:

import pandas as pd

data = {
    'Customer': ['A1', 'A2', 'A3', 'A4', None],
    'Age': [25, None, 300, 40, 32],
    'Revenue': [2000, 500, -100, None, 1500],
    'Gender': ['M', '?', 'F', 'F', 'unknown']
}

df = pd.DataFrame(data)
print("Original Messy Data:")
print(df)

🧽 Challenge:

  1. Replace None and ? with proper values

  2. Fix negative revenue

  3. Correct impossible ages

  4. Print the clean version
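
If you get stuck, here is one possible cleanup of the DataFrame defined above. Every replacement value is a judgement call made for illustration, not the one “official” answer, so try your own version first.

# One possible cleanup of the df defined above -- all replacement values are judgement calls
df["Customer"] = df["Customer"].fillna("Unknown")
df["Gender"] = df["Gender"].replace({"?": "Unknown", "unknown": "Unknown"})
df["Age"] = df["Age"].where(df["Age"].between(0, 120))   # impossible ages become NaN
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Revenue"] = df["Revenue"].clip(lower=0)              # or flag negatives as refunds instead
df["Revenue"] = df["Revenue"].fillna(df["Revenue"].median())

print("Cleaned Data:")
print(df)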


🧭 5. What’s Coming Up#

| File | Topic | Funny Summary |
|------|-------|---------------|
| data_loading | Loading data from CSV, Excel, SQL & APIs | “The Great Data Buffet” 🍽️ |
| data_cleaning | Cleaning & preprocessing | “Digital Laundry Day” 🧺 |
| handling_missing_outliers | Fixing missing data & outliers | “CSI: Data Edition” 🕵️ |
| feature_encoding | Encoding categories & scaling features | “Teaching Machines English” 🗣️ |
| eda | Exploratory Data Analysis | “Detective Work with Graphs” 🧠 |
| visualisation | Making plots & charts | “Turning KPIs into art” 🎨 |
| business_dashboards | Interactive dashboards | “Your Data’s TED Talk” 🧑‍💼 |


🚀 Summary#

✅ Data wrangling = preparing the battlefield for ML

✅ Visualization = storytelling for business impact

✅ Clean data = clean insights

✅ Dirty data = bad decisions (and maybe a career pivot)

Remember: “Garbage in → Garbage out” — but in business, garbage often comes with formatting errors.


🔜 Next Stop#

👉 Head to Data Loading (CSV, Excel, SQL, APIs) to learn how to bring all your data under one roof — without crying over file formats.


10. Sorting#


  • Description: The process of arranging data records based on the values of one or more columns (variables). This can be done in ascending order (from smallest to largest, or A to Z) or descending order (from largest to smallest, or Z to A).

  • Related Concepts:

    • Single-Level Sorting: Arranging data based on the values of a single column.

    • Multi-Level Sorting: Sorting data based on the values of multiple columns, where the order of precedence of the columns determines the final arrangement. For example, sorting time series data first by year and then by month.

    • Sorting Algorithms (Optional): For a deeper understanding, one could briefly discuss the basic principles behind algorithms like bubble sort (simple comparison-based sorting) or quicksort (more efficient divide-and-conquer approach). While the underlying algorithms are often abstracted away by software, understanding their efficiency can be beneficial for very large datasets.

  • Example: In a time series of product sales, sorting the data by date ensures that the observations are in chronological order, which is essential for identifying trends and patterns over time. Additionally, one might sort by sales volume within a specific month to see the best-selling products for that period.

Sorting is a foundational operation for organizing and examining data. In time series analysis, maintaining the correct temporal order is paramount.

Example Applications in Time Series:#

  • Chronological Ordering: Ensuring time series data is ordered by the time index (e.g., date, timestamp) before any analysis or modeling. If the data is not sorted chronologically, identifying trends, seasonality, or applying time-dependent models will be incorrect.

  • Identifying Extreme Events: After sorting by the value of the time series (e.g., sales, temperature), one can easily identify the periods with the highest or lowest values.

  • Ranking Periods: Sorting by a specific metric (e.g., sales volume in each month) allows for ranking different time periods based on that metric.

import pandas as pd
import numpy as np

# Sample time series data (replace with your actual data)
dates = pd.to_datetime(['2023-01-05', '2023-01-02', '2023-01-10', '2023-01-01', '2023-01-08'])
values = [25, 30, 22, 35, 28]
time_series_data_unsorted = pd.DataFrame({'Date': dates, 'Value': values})

print("Unsorted Time Series Data:")
print(time_series_data_unsorted)
print("\n---")

# 1. Sorting by Date (Chronological Order)
time_series_data_sorted_by_date = time_series_data_unsorted.sort_values(by='Date')
print("Time Series Data Sorted by Date:")
print(time_series_data_sorted_by_date)
print("\n---")

# Sample data with another column for multi-level sorting
category = ['A', 'B', 'A', 'B', 'A']
time_series_data_multi = pd.DataFrame({'Date': dates, 'Category': category, 'Value': values})

print("Unsorted Time Series Data with Category:")
print(time_series_data_multi)
print("\n---")

# 2. Multi-Level Sorting (by Category then by Value in descending order)
time_series_data_multi_sorted = time_series_data_multi.sort_values(by=['Category', 'Value'], ascending=[True, False])
print("Time Series Data Sorted by Category (ascending) then Value (descending):")
print(time_series_data_multi_sorted)
print("\n---")

# 3. Sorting by Value (to identify extreme values)
time_series_data_sorted_by_value_desc = time_series_data_unsorted.sort_values(by='Value', ascending=False)
print("Time Series Data Sorted by Value (Descending - Highest Value First):")
print(time_series_data_sorted_by_value_desc)
Unsorted Time Series Data:
        Date  Value
0 2023-01-05     25
1 2023-01-02     30
2 2023-01-10     22
3 2023-01-01     35
4 2023-01-08     28

---
Time Series Data Sorted by Date:
        Date  Value
3 2023-01-01     35
1 2023-01-02     30
0 2023-01-05     25
4 2023-01-08     28
2 2023-01-10     22

---
Unsorted Time Series Data with Category:
        Date Category  Value
0 2023-01-05        A     25
1 2023-01-02        B     30
2 2023-01-10        A     22
3 2023-01-01        B     35
4 2023-01-08        A     28

---
Time Series Data Sorted by Category (ascending) then Value (descending):
        Date Category  Value
4 2023-01-08        A     28
0 2023-01-05        A     25
2 2023-01-10        A     22
3 2023-01-01        B     35
1 2023-01-02        B     30

---
Time Series Data Sorted by Value (Descending - Highest Value First):
        Date  Value
3 2023-01-01     35
1 2023-01-02     30
4 2023-01-08     28
0 2023-01-05     25
2 2023-01-10     22

11. Filtering Data#


  • Description: The process of selecting specific subsets of data from a larger dataset based on predefined conditions or criteria. This allows analysts to focus on the most relevant portions of the data for their specific questions or analyses.

  • Related Concepts:

    • Basic Filtering: Applying single conditions to select data (e.g., all sales records from a specific month).

    • Advanced Filtering: Using multiple conditions combined with logical operators (AND, OR, NOT) to create more complex selection criteria (e.g., all sales above a certain threshold AND occurring during a specific promotional period).

    • Querying: Similar to using WHERE clauses in SQL, filtering in data analysis tools allows for specifying conditions that the data records must satisfy to be included in the resulting subset.

  • Example: In a time series of website traffic, filtering the data to show only the traffic during weekdays between 9 AM and 5 PM might be useful for analyzing user behavior during business hours. Another example could be filtering stock prices to only include data points after a specific market event.
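
For readers who like the SQL mindset, pandas’ DataFrame.query() expresses these conditions as a WHERE-clause-like string. A small sketch with made-up hourly traffic data (column names and the date range are illustrative):

import pandas as pd
import numpy as np

# Hypothetical hourly website-traffic data for one week
idx = pd.date_range("2023-03-01", periods=24 * 7, freq="h")
traffic = pd.DataFrame({"Visits": np.random.randint(100, 1000, size=len(idx))}, index=idx)
traffic["Hour"] = traffic.index.hour
traffic["Weekday"] = traffic.index.dayofweek   # 0 = Monday ... 6 = Sunday

# Weekday business hours (9 AM - 5 PM), much like a SQL WHERE clause
business_hours = traffic.query("Weekday < 5 and Hour >= 9 and Hour <= 17")
print(business_hours.head())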

Examples in Time Series:#

  • Filtering by Time Period: Selecting data within a specific date range (e.g., sales data for the last quarter, temperature readings for a particular year).

  • Filtering by Value Threshold: Identifying time points where the value of the time series exceeds or falls below a certain level (e.g., days when stock prices were above $150, hours when website traffic was below 500).

  • Filtering by Event: Selecting data points associated with specific events or conditions (if event markers are present in the data).

import pandas as pd
import numpy as np

# Sample time series data (replace with your actual data)
dates = pd.to_datetime(pd.date_range(start='2023-01-01', end='2023-01-10', freq='D'))
sales = np.random.randint(50, 200, size=len(dates))
temperatures = np.random.uniform(15, 25, size=len(dates))
events = ['No Event', 'Promotion', 'No Event', 'Holiday', 'No Event', 'No Event', 'Promotion', 'No Event', 'Holiday', 'No Event']
time_series_df = pd.DataFrame({'Date': dates, 'Sales': sales, 'Temperature': temperatures, 'Event': events})
time_series_df.set_index('Date', inplace=True)

print("Original Time Series Data:")
print(time_series_df)
print("\n---")

# 1. Filtering by Time Period (e.g., sales after 2023-01-05)
after_date = '2023-01-05'
filtered_after_date = time_series_df[time_series_df.index > after_date]
print(f"Sales data after {after_date}:")
print(filtered_after_date)
print("\n---")

# 2. Filtering by Value Threshold (e.g., days with sales above 150)
sales_threshold = 150
high_sales = time_series_df[time_series_df['Sales'] > sales_threshold]
print(f"Days with sales above {sales_threshold}:")
print(high_sales)
print("\n---")

# 3. Filtering by Multiple Conditions (e.g., days with temperature above 20 AND a 'Promotion' event)
filtered_multiple_conditions = time_series_df[(time_series_df['Temperature'] > 20) & (time_series_df['Event'] == 'Promotion')]
print("Days with temperature above 20 and a 'Promotion' event:")
print(filtered_multiple_conditions)
print("\n---")

# 4. Filtering by a list of values (e.g., events that are 'Holiday')
filtered_by_event = time_series_df[time_series_df['Event'].isin(['Holiday'])]
print("Data points where the event is 'Holiday':")
print(filtered_by_event)
Original Time Series Data:
            Sales  Temperature      Event
Date                                     
2023-01-01    193    15.453040   No Event
2023-01-02    134    18.746126  Promotion
2023-01-03     88    21.258599   No Event
2023-01-04    149    20.031363    Holiday
2023-01-05     82    23.564898   No Event
2023-01-06    150    21.586936   No Event
2023-01-07     72    16.629344  Promotion
2023-01-08     59    15.705687   No Event
2023-01-09    118    21.424193    Holiday
2023-01-10    149    15.265113   No Event

---
Sales data after 2023-01-05:
            Sales  Temperature      Event
Date                                     
2023-01-06    150    21.586936   No Event
2023-01-07     72    16.629344  Promotion
2023-01-08     59    15.705687   No Event
2023-01-09    118    21.424193    Holiday
2023-01-10    149    15.265113   No Event

---
Days with sales above 150:
            Sales  Temperature     Event
Date                                    
2023-01-01    193     15.45304  No Event

---
Days with temperature above 20 and a 'Promotion' event:
Empty DataFrame
Columns: [Sales, Temperature, Event]
Index: []

---
Data points where the event is 'Holiday':
            Sales  Temperature    Event
Date                                   
2023-01-04    149    20.031363  Holiday
2023-01-09    118    21.424193  Holiday

Modifying Data using Excel#

Excel is a practical tool for quick transformations:

  • Range Names: Assign names to cells or ranges (e.g., =AVERAGE(Returns))

  • Sorting: Data → Sort (ascending/descending)

  • Filtering: Use filters to view subset of data

  • Basic Formulas: =AVERAGE(A1:A10), =STDEV.P(A1:A10)

  • Charts: Insert → Chart → Time Series Plot
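
For comparison, here is a small pandas sketch of the same quick operations, handy once the workbook outgrows Excel; the Returns values below are made up.

import pandas as pd

# A tiny stand-in for a named range of returns in a spreadsheet
returns = pd.Series([0.01, -0.02, 0.005, 0.03, -0.01], name="Returns")

print(returns.mean())            # =AVERAGE(A1:A10)
print(returns.std(ddof=0))       # =STDEV.P(A1:A10), the population standard deviation
print(returns.sort_values())     # Data -> Sort
print(returns[returns > 0])      # Filtering: keep only the positive returns
returns.plot(title="Returns")    # Insert -> Chart (requires matplotlib)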


12. Creating Distribution from Data#


  • Description: The process of summarizing the values within a time series dataset to understand their frequency or probability of occurrence. This involves grouping the data into intervals (bins) and counting how many observations fall into each bin, or estimating the underlying probability density function.

  • Related Concepts:

    • Histograms: Graphical representations of frequency distributions, where the height of each bar corresponds to the number of observations within a particular bin.

    • Probability Distributions: Mathematical functions that describe the likelihood of different outcomes. Common examples include the normal (Gaussian) distribution, log-normal distribution (often used for asset prices), and others depending on the nature of the time series.

    • Kernel Density Estimation (KDE): A non-parametric method to estimate the probability density function of a random variable. It provides a smoothed representation of the distribution.

  • Example: Creating a histogram of daily stock returns over a year can help visualize the distribution of these returns, allowing one to assess if they roughly follow a normal distribution, are skewed, or have heavy tails (more extreme events than a normal distribution would predict).

  • Mathematical Context: For a normal distribution with mean $\mu$ and standard deviation $\sigma$, the probability density function $f(x)$ is given by:

    $$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

    This bell-shaped curve is often used as a benchmark for understanding the distribution of many natural phenomena and financial data.
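
As a quick sanity check on the formula, you can evaluate it directly and compare it with scipy.stats.norm.pdf; the mean, standard deviation, and evaluation point below are illustrative daily-return values, not from any real dataset.

import numpy as np
from scipy.stats import norm

mu, sigma, x = 0.0005, 0.01, 0.015   # illustrative parameters and evaluation point

# Normal PDF evaluated straight from the formula above
f_manual = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The same density from scipy
f_scipy = norm.pdf(x, loc=mu, scale=sigma)

print(f_manual, f_scipy)   # both approx. 13.9 -- densities can exceed 1 when sigma is small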

A distribution provides a visual and statistical summary of how frequently different values occur in a time series. This understanding is vital for:

  • Identifying Data Skewness: Determining if the distribution is asymmetric, with a longer tail on one side.

  • Understanding Volatility in Returns: Observing the spread or dispersion of returns, which is a measure of risk in financial time series.

Common Tools for Visualizing Distributions:#

  • Histogram: Displays the frequency of data points falling within specified bins.

  • Box Plot: Summarizes the distribution through quartiles, median, and potential outliers, providing insights into spread and skewness.

  • KDE (Kernel Density Estimate): Provides a smooth, continuous estimate of the probability density function of the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sample time series data (replace with your actual data)
dates = pd.to_datetime(pd.date_range(start='2023-01-01', end='2023-12-31', freq='D'))
daily_returns = np.random.normal(0.0005, 0.01, size=len(dates)) # Simulate daily returns
time_series_returns = pd.DataFrame({'Date': dates, 'Daily Return': daily_returns})
time_series_returns.set_index('Date', inplace=True)

print("Sample Time Series of Daily Returns:")
print(time_series_returns.head())
print("\n---")

# 1. Histogram of Daily Returns
plt.figure(figsize=(10, 6))
sns.histplot(time_series_returns['Daily Return'], bins=30, kde=False)
plt.title('Histogram of Daily Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency')
plt.show()
print("\n---")

# 2. Box Plot of Daily Returns
plt.figure(figsize=(8, 6))
sns.boxplot(y=time_series_returns['Daily Return'])
plt.title('Box Plot of Daily Returns')
plt.ylabel('Daily Return')
plt.show()
print("\n---")

# 3. Kernel Density Estimate (KDE) of Daily Returns
plt.figure(figsize=(10, 6))
sns.kdeplot(time_series_returns['Daily Return'], fill=True)
plt.title('Kernel Density Estimate of Daily Returns')
plt.xlabel('Daily Return')
plt.ylabel('Density')
plt.show()
print("\n---")

# 4. Combining Histogram and KDE
plt.figure(figsize=(10, 6))
sns.histplot(time_series_returns['Daily Return'], bins=30, kde=True, color='skyblue')
plt.title('Histogram with KDE of Daily Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency / Density')
plt.show()
Sample Time Series of Daily Returns:
            Daily Return
Date                    
2023-01-01      0.006972
2023-01-02      0.001761
2023-01-03     -0.018474
2023-01-04      0.013305
2023-01-05     -0.005651

---
[Output figures: histogram, box plot, KDE, and combined histogram with KDE of the daily returns]

13. Analyzing Distribution#


  • Description: Quantifying the shape of a distribution by evaluating its asymmetry (skewness) and the heaviness of its tails (kurtosis). These measures provide insights beyond central tendency and dispersion, revealing important characteristics of the data’s distribution.

  • Related Concepts:

    • Skewness: A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. A positive skew indicates a long tail extending towards more positive values (right-skewed), while a negative skew indicates a long tail extending towards more negative values (left-skewed). The formula for sample skewness is:

      $$ \text{Skew} = \frac{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^3}{\sigma^3} $$

      where $x_i$ are the data points, $\mu$ is the sample mean, $\sigma$ is the sample standard deviation, and $n$ is the number of data points.

    • Kurtosis: A measure of the “tailedness” of the probability distribution. High kurtosis indicates that the distribution has heavier tails and a sharper peak around the mean compared to a normal distribution, suggesting more frequent extreme values. The formula for sample kurtosis (often adjusted to excess kurtosis by subtracting 3, since a normal distribution has a kurtosis of 3) is:

      $$ \text{Kurt} = \frac{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^4}{\sigma^4} $$

      Excess kurtosis is then $\text{Kurt}_{\text{excess}} = \text{Kurt} - 3$.

  • Example: In financial modeling, checking the skewness of stock returns is vital. A significant negative skew might indicate a higher probability of large negative returns (crashes) compared to large positive returns. Similarly, examining kurtosis helps assess the risk associated with extreme price movements.
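
The formulas above can be checked directly against scipy.stats.skew and scipy.stats.kurtosis. A small sketch on a simulated right-skewed sample (the lognormal parameters are arbitrary):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)     # a right-skewed sample

mu, sigma = x.mean(), x.std()                           # population-style estimates (ddof=0)
skew_manual = np.mean((x - mu) ** 3) / sigma ** 3
kurt_manual = np.mean((x - mu) ** 4) / sigma ** 4 - 3   # excess kurtosis (subtract 3)

print(skew_manual, skew(x))         # matches scipy's default (biased) estimator
print(kurt_manual, kurtosis(x))     # scipy returns excess kurtosis by default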

Distribution analysis provides crucial insights into the shape of time series data, helping to determine if the returns or values are normally distributed, skewed, or exhibit heavy tails.

Tools for Assessing Distribution Shape:#

  • Histogram: Visual inspection can reveal asymmetry and the relative frequency of values in the tails.

  • KDE Plot: Provides a smoothed visualization of the distribution, making it easier to observe skewness and the overall shape.

  • Skewness Coefficient: A numerical measure of asymmetry.

    $$ \text{Skewness} > 0 \implies \text{Right-tailed (positive skew)} $$
    $$ \text{Skewness} < 0 \implies \text{Left-tailed (negative skew)} $$
    $$ \text{Skewness} \approx 0 \implies \text{Symmetric} $$

  • Kurtosis Coefficient: A numerical measure of the tailedness of the distribution.

    $$ \text{Kurtosis} > 3 \implies \text{Leptokurtic (heavy tails, sharper peak)} $$
    $$ \text{Kurtosis} < 3 \implies \text{Platykurtic (thin tails, flatter peak)} $$
    $$ \text{Kurtosis} \approx 3 \implies \text{Mesokurtic (similar to normal distribution)} $$

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, skewnorm
from scipy.stats import skew, kurtosis

# Generate data points
np.random.seed(42)  # for reproducibility
n_points = 1000

# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=n_points)
skewness_normal = 0

# Right-skewed distribution (using a different distribution for better visual)
right_skewed_data = np.random.exponential(scale=1, size=n_points) - 1 # Shifted exponential
skewness_right = skew(right_skewed_data)

# Left-skewed distribution (mirroring the right-skewed)
left_skewed_data = - (np.random.exponential(scale=1, size=n_points) - 1)
skewness_left = skew(left_skewed_data)

# Plotting histograms
plt.figure(figsize=(14, 6))

plt.subplot(1, 3, 1)
plt.hist(normal_data, bins=30, density=True, alpha=0.7, color='blue', label='Normal')
xmin, xmax = plt.xlim()
x_norm = np.linspace(xmin, xmax, 100)
p_norm = norm.pdf(x_norm, np.mean(normal_data), np.std(normal_data))
plt.plot(x_norm, p_norm, 'k', linewidth=2)
plt.title(f'Normal (Skewness ≈ {skewness_normal:.2f})')
plt.xlabel('Data Value')
plt.ylabel('Probability Density')
plt.legend()

plt.subplot(1, 3, 2)
plt.hist(right_skewed_data, bins=30, density=True, alpha=0.7, color='green', label='Right-Skewed')
xmin_right, xmax_right = plt.xlim()
x_right = np.linspace(xmin_right, xmax_right, 100)
p_right = skewnorm.pdf(x_right, a=skewness_right, loc=np.mean(right_skewed_data), scale=np.std(right_skewed_data)) # Using skewnorm to fit
# Note: Fitting skewnorm might not perfectly match the exponential, but shows the skewed shape
plt.plot(x_right, p_right, 'k', linewidth=2)
plt.title(f'Right-Skewed (Skewness ≈ {skewness_right:.2f})')
plt.xlabel('Data Value')
plt.ylabel('Probability Density')
plt.legend()

plt.subplot(1, 3, 3)
plt.hist(left_skewed_data, bins=30, density=True, alpha=0.7, color='red', label='Left-Skewed')
xmin_left, xmax_left = plt.xlim()
x_left = np.linspace(xmin_left, xmax_left, 100)
p_left = skewnorm.pdf(x_left, a=skewness_left, loc=np.mean(left_skewed_data), scale=np.std(left_skewed_data)) # Using skewnorm to fit
# Note: Fitting skewnorm might not perfectly match the mirrored exponential, but shows the skewed shape
plt.plot(x_left, p_left, 'k', linewidth=2)
plt.title(f'Left-Skewed (Skewness ≈ {skewness_left:.2f})')
plt.xlabel('Data Value')
plt.ylabel('Probability Density')
plt.legend()

plt.tight_layout()
plt.show()
[Output figure: histograms of the normal, right-skewed, and left-skewed samples]
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, kurtosis

# Sample time series data (replace with your actual data)
dates = pd.to_datetime(pd.date_range(start='2023-01-01', end='2023-12-31', freq='D'))
daily_returns_normal = np.random.normal(0.0005, 0.01, size=len(dates))
daily_returns_skewed = np.random.normal(0.0005, 0.01, size=len(dates)) + np.random.exponential(0.005, size=len(dates)) * 0.5
daily_returns_heavy_tailed = np.random.standard_t(df=3, size=len(dates)) * 0.01

df = pd.DataFrame({
    'Date': dates,
    'Normal Returns': daily_returns_normal,
    'Skewed Returns': daily_returns_skewed,
    'Heavy-Tailed Returns': daily_returns_heavy_tailed
})
df.set_index('Date', inplace=True)

print("Sample Time Series of Returns:")
print(df.head())
print("\n---")

# Analyze Normal Returns
print("Analysis of Normal Returns:")
skewness_normal = skew(df['Normal Returns'])
kurtosis_normal = kurtosis(df['Normal Returns'])
print(f"Skewness: {skewness_normal:.4f}")
print(f"Kurtosis (Excess): {kurtosis_normal:.4f}") # scipy.stats.kurtosis returns excess kurtosis

plt.figure(figsize=(10, 6))
sns.histplot(df['Normal Returns'], bins=30, kde=True, color='skyblue')
plt.title('Histogram with KDE of Normal Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency / Density')
plt.show()
print("\n---")

# Analyze Skewed Returns
print("Analysis of Skewed Returns:")
skewness_skewed = skew(df['Skewed Returns'])
kurtosis_skewed = kurtosis(df['Skewed Returns'])
print(f"Skewness: {skewness_skewed:.4f}")
print(f"Kurtosis (Excess): {kurtosis_skewed:.4f}")

plt.figure(figsize=(10, 6))
sns.histplot(df['Skewed Returns'], bins=30, kde=True, color='salmon')
plt.title('Histogram with KDE of Skewed Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency / Density')
plt.show()
print("\n---")

# Analyze Heavy-Tailed Returns
print("Analysis of Heavy-Tailed Returns:")
skewness_heavy_tailed = skew(df['Heavy-Tailed Returns'])
kurtosis_heavy_tailed = kurtosis(df['Heavy-Tailed Returns'])
print(f"Skewness: {skewness_heavy_tailed:.4f}")
print(f"Kurtosis (Excess): {kurtosis_heavy_tailed:.4f}")

plt.figure(figsize=(10, 6))
sns.histplot(df['Heavy-Tailed Returns'], bins=30, kde=True, color='lightgreen')
plt.title('Histogram with KDE of Heavy-Tailed Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency / Density')
plt.show()
Sample Time Series of Returns:
            Normal Returns  Skewed Returns  Heavy-Tailed Returns
Date                                                            
2023-01-01       -0.005567       -0.005480              0.016412
2023-01-02        0.002613       -0.003066             -0.012828
2023-01-03        0.012501       -0.007143              0.003276
2023-01-04       -0.004419        0.013458             -0.021952
2023-01-05       -0.018266       -0.001169              0.010649

---
Analysis of Normal Returns:
Skewness: 0.0289
Kurtosis (Excess): 0.3690
[Figure: Histogram with KDE of Normal Returns]
---
Analysis of Skewed Returns:
Skewness: -0.0920
Kurtosis (Excess): -0.1990
[Figure: Histogram with KDE of Skewed Returns]
---
Analysis of Heavy-Tailed Returns:
Skewness: 0.3461
Kurtosis (Excess): 3.4767
[Figure: Histogram with KDE of Heavy-Tailed Returns]

Creating Distributions from Time Series Data for Forecasting#


In forecasting, distributions of time series data help model uncertainty and predict future values. Key approaches include:

  • Empirical Distributions: Histograms or kernel density estimates of historical values.

  • Residual Distributions: Distributions of errors from a forecasting model.

  • Lag Distributions: Distributions of differences between observations at specific lags.

Example: For a time series $X_t = \{x_1, x_2, \ldots, x_T\}$, the residual distribution is built from the errors $e_t = x_t - \hat{x}_t$, where $\hat{x}_t$ is the predicted value. The probability density of the residuals can be estimated as

$$ f(e) \approx \frac{1}{Th} \sum_{t=1}^{T} K\left(\frac{e - e_t}{h}\right) $$

where $K$ is a kernel function and $h$ is the bandwidth.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Generate synthetic time series
np.random.seed(42)
t = np.arange(100)
trend = 0.1 * t
seasonality = 5 * np.sin(2 * np.pi * t / 12)
noise = np.random.normal(0, 1, 100)
X = trend + seasonality + noise

# Simple moving average forecast
window = 5
forecast = pd.Series(X).rolling(window=window, min_periods=1).mean().shift(1)
residuals = X - forecast

# Lag differences (lag=1)
lag_diff = pd.Series(X).diff(1).dropna()

# Empirical distribution (histogram)
plt.figure(figsize=(12, 8))
plt.subplot(3, 1, 1)
plt.hist(X, bins=30, density=True, alpha=0.7, color='blue', label='Empirical')
plt.title('Empirical Distribution of Time Series')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()

# Residual distribution (kernel density)
kde = gaussian_kde(residuals.dropna())
x_range = np.linspace(min(residuals.dropna()), max(residuals.dropna()), 100)
plt.subplot(3, 1, 2)
plt.plot(x_range, kde(x_range), 'r-', label='Residual KDE')
plt.hist(residuals.dropna(), bins=30, density=True, alpha=0.5, color='gray', label='Residual Histogram')
plt.title('Residual Distribution')
plt.xlabel('Residual')
plt.ylabel('Density')
plt.legend()

# Lag distribution
plt.subplot(3, 1, 3)
plt.hist(lag_diff, bins=30, density=True, alpha=0.7, color='green', label='Lag-1 Differences')
plt.title('Lag-1 Difference Distribution')
plt.xlabel('Difference')
plt.ylabel('Density')
plt.legend()

plt.tight_layout()
plt.savefig('distributions_forecasting.png')
[Output figure: empirical, residual, and lag-1 difference distributions]

17. Data Reduction#


  • Dropping Unnecessary Columns:

    • Remove features with no predictive value (e.g., constant variables, irrelevant metadata).

    • Example: Drop store ID if forecasting total chain sales.

  • Aggregating Data:

    • Convert high-frequency data (e.g., daily sales) to a lower frequency (e.g., monthly) to match Holt-Winters’ seasonal period (e.g., 12 months).

    • Smooths noise while preserving seasonality.

  • Feature Selection:

    • Select features with strong correlations to the target (e.g., sales) or lagged effects (e.g., marketing spend at lag 2, as identified by cross-correlation).

    • Use correlation analysis or variance thresholds to keep relevant predictors.

  • Principal Component Analysis (PCA): summarizing correlated data into a few key patterns

    • Reduce multiple correlated features (e.g., financial indicators, marketing channels) into fewer components that capture most variance.

    • Useful when external regressors (e.g., marketing, promotions) are considered alongside Holt-Winters.

  • Binning (Numerosity Reduction):

    • Group continuous variables (e.g., customer age, sales volume) into categories to reduce noise and simplify analysis.

    • Example: Bin daily sales into weekly averages to stabilize fluctuations.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import warnings

# Suppress warnings for cleaner output (e.g., Holt-Winters convergence)
warnings.filterwarnings("ignore", category=UserWarning)

# 1. Generate synthetic daily retail sales data (2018–2024)
np.random.seed(42)
dates = pd.date_range(start='2018-01-01', end='2024-12-31', freq='D')
n = len(dates)  # ~2557 days
# Sales: trend, annual seasonality (Dec peaks), noise
trend = 0.0001 * np.arange(n)  # ~$10K/year
seasonality = 3 * np.sin(2 * np.pi * (np.arange(n) % 365.25) / 365.25)  # Dec ~$8K
noise = np.random.normal(0, 0.5, n)
sales = 5 + trend + seasonality + noise  # Daily ~$3K–$8K
# Features
marketing = 1 + 0.5 * np.sin(2 * np.pi * (np.arange(n) % 365.25 - 60) / 365.25) + np.random.normal(0, 0.2, n)
traffic = 7 + 0.5 * marketing + np.random.normal(0, 1, n)  # Correlated
discounts = np.random.uniform(0, 20, n)  # Weakly correlated
store_id = ['Store1'] * n  # Constant

# Create daily DataFrame
df_daily = pd.DataFrame({
    'Date': dates, 'Sales': sales, 'Marketing': marketing,
    'Traffic': traffic, 'Discounts': discounts, 'StoreID': store_id
})

# 2. Data Reduction: Drop unnecessary columns
# Identify constant columns
constant_cols = [col for col in df_daily if df_daily[col].nunique() == 1]
if constant_cols:
    df_reduced = df_daily.drop(columns=constant_cols)
    print(f"Dropped constant columns: {constant_cols}")
else:
    df_reduced = df_daily.copy()
    print("No constant columns to drop")

# 3. Data Reduction: Aggregate to monthly (Holt-Winters s=12)
try:
    df_reduced['Date'] = pd.to_datetime(df_reduced['Date'])
    df_monthly = df_reduced.groupby(df_reduced['Date'].dt.to_period('M')).agg({
        'Sales': 'sum',  # Monthly ~$80K–$200K
        'Marketing': 'sum',
        'Traffic': 'sum',
        'Discounts': 'mean'
    }).reset_index()
    df_monthly['Date'] = df_monthly['Date'].dt.to_timestamp()
except Exception as e:
    raise ValueError(f"Error in aggregation: {e}")

# 4. Data Reduction: Feature selection (correlation-based)
X = df_monthly[['Marketing', 'Traffic', 'Discounts']]
y = df_monthly['Sales']
try:
    selector = SelectKBest(f_regression, k=2)
    X_selected = selector.fit_transform(X, y)
    selected_features = X.columns[selector.get_support()].tolist()  # Marketing, Traffic
    df_selected = pd.DataFrame(X_selected, columns=selected_features)
    df_selected['Date'] = df_monthly['Date']
    df_selected['Sales'] = df_monthly['Sales']
except Exception as e:
    raise ValueError(f"Error in feature selection: {e}")

# 5. Data Reduction: PCA on selected features (marketing, traffic)
# PCA reduces external features, NOT sales
try:
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df_selected[['Marketing', 'Traffic']])
    pca = PCA(n_components=0.95)  # Keep 95% variance
    X_pca = pca.fit_transform(X_scaled)
    df_pca = pd.DataFrame({'PC1': X_pca[:, 0], 'Date': df_monthly['Date']})
    explained_variance = pca.explained_variance_ratio_[0]
except Exception as e:
    raise ValueError(f"Error in PCA: {e}")

# 6. Data Reduction: Binning sales for interpretation
# Binning applies to sales (target), NOT related to PCA
try:
    df_monthly['Sales_Bin'] = pd.cut(df_monthly['Sales'],
                                     bins=[0, 100, 150, 300],
                                     labels=['Low', 'Medium', 'High'])
except Exception as e:
    raise ValueError(f"Error in binning: {e}")

# 7. Holt-Winters forecast on monthly sales
try:
    df_monthly.set_index('Date', inplace=True)
    model = ExponentialSmoothing(df_monthly['Sales'],
                                seasonal='mul',
                                seasonal_periods=12,
                                trend='add')
    fit = model.fit()
    forecast = fit.fittedvalues
    # Forecast for 2025
    forecast_2025 = fit.forecast(12)
    forecast_dates = pd.date_range(start='2025-01-01', periods=12, freq='ME')
    df_forecast = pd.DataFrame({'Forecast': forecast_2025}, index=forecast_dates)
except Exception as e:
    raise ValueError(f"Error in Holt-Winters forecasting: {e}")

# 8. Plot results (4 panels for clarity)
plt.figure(figsize=(12, 16))

# Panel 1: Daily vs. Monthly Sales (Aggregation)
plt.subplot(4, 1, 1)
plt.plot(df_daily['Date'], df_daily['Sales'], 'b-', alpha=0.3, label='Daily Sales ($K)')
plt.plot(df_monthly.index, df_monthly['Sales'], 'r-', label='Monthly Sales ($K)')
plt.title('Aggregation: Daily to Monthly Sales')
plt.ylabel('Sales ($K)')
plt.legend()
plt.grid(True)
plt.text(df_monthly.index[10], 200, 'Monthly data reduces noise,\npreserves 12-month seasonality\nfor Holt-Winters',
         color='red', fontsize=10)

# Panel 2: Selected Features
plt.subplot(4, 1, 2)
plt.plot(df_selected['Date'], df_selected['Marketing'], 'g-', label='Marketing ($K)')
plt.plot(df_selected['Date'], df_selected['Traffic'], 'm-', label='Traffic (K visitors)')
plt.title(f'Selected Features: {", ".join(selected_features)}')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.text(df_selected['Date'][10], 50, 'Correlated with sales,\nuse for post-forecast adjustments',
         color='red', fontsize=10)

# Panel 3: PCA on Features
plt.subplot(4, 1, 3)
plt.plot(df_pca['Date'], df_pca['PC1'], 'c-', label=f'PC1 ({explained_variance:.1%} variance)')
plt.title('PCA: Reduced Features (Marketing, Traffic)')
plt.ylabel('PC1 Value')
plt.legend()
plt.grid(True)
plt.text(df_pca['Date'][10], 2, 'PCA combines marketing & traffic\ninto one predictor for adjustments',
         color='red', fontsize=10)

# Panel 4: Holt-Winters Forecast with Binned Sales
plt.subplot(4, 1, 4)
plt.plot(df_monthly.index, df_monthly['Sales'], 'b-', label='Monthly Sales ($K)')
plt.plot(df_monthly.index, forecast, 'r--', label='Holt-Winters Fitted')
plt.plot(df_forecast.index, df_forecast['Forecast'], 'g--', label='2025 Forecast')
plt.twinx()
plt.bar(df_monthly.index, df_monthly['Sales_Bin'].cat.codes, alpha=0.3, color='gray', label='Sales Bin')
plt.yticks([0, 1, 2], ['Low', 'Medium', 'High'])
plt.title('Holt-Winters Forecast and Binned Sales')
plt.ylabel('Sales Bin')
plt.legend(loc='upper left')
plt.grid(True)
plt.text(df_monthly.index[10], 2, 'Binning categorizes sales\nfor stakeholder reports,\nforecast uses raw sales',
         color='red', fontsize=10)

plt.tight_layout()
plt.show()

# Print summary for teaching
print("Data Reduction Summary:")
print(f"- Original: {n} daily records, 5 features")
print(f"- After dropping: Removed {constant_cols}")
print(f"- After aggregation: 84 monthly records")
print(f"- Selected features: {selected_features}")
print(f"- PCA: 1 component (marketing, traffic), {explained_variance:.1%} variance")
print(f"- Binning: Sales categorized as Low (<$100K), Medium ($100K–$150K), High (>$150K)")
print(f"- Holt-Winters: Forecasted 2025 sales, e.g., Dec 2025 ~${df_forecast['Forecast'].iloc[-1]:.1f}K")
Dropped constant columns: ['StoreID']
[Output figure: aggregation, selected features, PCA component, and Holt-Winters forecast panels]
Data Reduction Summary:
- Original: 2557 daily records, 5 features
- After dropping: Removed ['StoreID']
- After aggregation: 84 monthly records
- Selected features: ['Marketing', 'Traffic']
- PCA: 1 component (marketing, traffic), 84.2% variance
- Binning: Sales categorized as Low (<$100K), Medium ($100K–$150K), High (>$150K)
- Holt-Winters: Forecasted 2025 sales, e.g., Dec 2025 ~$nanK
# Your code here