Data Loading, Wrangling & Visualisation#
Welcome to the most honest part of Machine Learning — data wrangling, also known as “90% of the job no one posts about on LinkedIn.” 😅
If math was theory, this chapter is practice with mud. You’ll roll up your sleeves, clean messy data, and make it look like something a CEO would actually want to see in a dashboard.
🧠 Why This Matters#
Machine Learning models are like gourmet chefs — they can only make good predictions if you give them clean ingredients.
Unfortunately, business data often looks like this:
| Customer | Age | Revenue | Gender | Notes |
|---|---|---|---|---|
| A-102 | 27 | $2,000 | F | missing |
| NaN | $500 | ? | typo | |
| C-554 | 45 | -$200 | Male | refund |
| D-999 | 300 | $1,000 | cat | who let this happen |
So before we even think about algorithms, we’ll:
Load data from messy sources.
Clean it like digital laundry.
Transform it into model-ready features.
Visualize it like a storytelling pro.
💾 1. The Data Wrangling Trifecta#
| Step | Name | Business Goal |
|---|---|---|
| Data Loading | Get data into Python | “Where’s my Excel file again?” |
| Data Cleaning | Fix mistakes & missing values | “Why is revenue negative?” |
| Feature Engineering | Add useful variables | “Let’s create a loyalty score!” |
By the end of this chapter, you’ll make data look so clean it could get a job at McKinsey.
📚 Prerequisite: Python Refresher#
If you’re new to Python or Pandas, don’t panic — it’s easier than assembling IKEA furniture. 👉 Check out my other book: 📘 Programming for Business. It covers everything from reading files to basic Python data manipulation.
💡 Tip: You’ll be using libraries like pandas, numpy, and matplotlib. If these look like Pokémon names right now, that book is your Pokédex.
🧩 Practice Corner: “Guess the Data Disaster”#
Match each messy situation with the tool that saves the day:
| Situation | Tool |
|---|---|
| File is 200MB Excel sheet with multiple tabs | |
| Missing values everywhere | |
| Categorical columns like “Yes/No” | |
| Data stored in a SQL database | |
| REST API providing JSON data | |
✅ Pro tip: Pandas is your Swiss Army Knife for data chaos.
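If you want to check your matches, here is one possible answer key sketched in Python (these tool choices are suggestions, not the only valid answers):

```python
# One possible answer key for the matching exercise.
# The tools named here are common choices, not the only correct ones.
answer_key = {
    "200MB Excel sheet with multiple tabs": "pd.read_excel(..., sheet_name=None)",
    "Missing values everywhere": "df.fillna() / df.dropna()",
    "Categorical columns like 'Yes/No'": "Series.map() or pd.get_dummies()",
    "Data stored in a SQL database": "pd.read_sql(query, con=connection)",
    "REST API providing JSON data": "requests.get(...).json() + pd.json_normalize()",
}

for situation, tool in answer_key.items():
    print(f"{situation:45s} -> {tool}")
```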
🔍 2. Why Businesses Love Clean Data#
Messy data → Confused analysts → Wrong dashboards → Angry executives. Clean data → Confident models → Actionable insights → Happy bonuses. 🎉
You’ll soon realize:
Data cleaning is not boring — it’s debugging reality.
For example:
Missing age? → Estimate with median.
Wrong gender field? → Normalize text values.
Negative revenue? → Check for refunds.
Timestamp errors? → Convert to datetime.
You’re not just fixing numbers — you’re restoring business logic.
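A minimal sketch of those four fixes in pandas; the column names and values below are invented for illustration:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Age": [27, np.nan, 45, 38],
    "Gender": ["F", "f ", "Male", "M"],
    "Revenue": [2000, 500, -200, 1000],
    "Timestamp": ["2023-01-05", "2023-01-06", "2023-01-08", "not a date"],
})

# Missing age -> estimate with the median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Messy gender field -> normalize the text values to "F"/"M"
df["Gender"] = df["Gender"].str.strip().str.upper().str[0]

# Negative revenue -> flag as a possible refund instead of silently dropping it
df["IsRefund"] = df["Revenue"] < 0

# Timestamp errors -> convert to datetime, coercing bad values to NaT
df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")

print(df)
```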
🎨 3. Visualisation: Turning Data into Business Art#
Once you’ve tamed the chaos, it’s time to make your data pretty and persuasive.
This section covers:
Histograms that show sales trends 📊
Scatter plots revealing marketing ROI 💸
Correlation heatmaps for KPIs 🔥
Dashboards that make execs say “wow” ✨
Remember: “If it’s not visualized, it didn’t happen.” — Every Data Scientist, ever.
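The visualisation chapters later in this part go much deeper; as a teaser, here is a minimal matplotlib sketch of a sales histogram and a spend-versus-revenue scatter, using simulated numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sales = rng.normal(100, 20, 500)                 # simulated daily sales
marketing = rng.uniform(1, 10, 200)              # simulated marketing spend
revenue = 5 * marketing + rng.normal(0, 5, 200)  # revenue loosely driven by spend

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.hist(sales, bins=30, color="skyblue", edgecolor="white")
ax1.set_title("Distribution of Daily Sales")
ax1.set_xlabel("Sales")
ax1.set_ylabel("Frequency")

ax2.scatter(marketing, revenue, alpha=0.6)
ax2.set_title("Marketing Spend vs Revenue")
ax2.set_xlabel("Marketing Spend")
ax2.set_ylabel("Revenue")

plt.tight_layout()
plt.show()
```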
💬 4. Business Analogy: The Data Spa#
Think of your dataset like a customer entering a spa:
| Step | Data Action | Spa Equivalent |
|---|---|---|
| Loading | Getting checked in | “Welcome, Mr. CSV!” |
| Cleaning | Removing noise & junk | Exfoliation time 🧼 |
| Transformation | Standardizing features | Facial mask & makeover 💅 |
| Visualization | Presenting results | Walking the runway 🕺 |
When your data leaves this spa, it’s ready for the runway — or your next board meeting.
🧩 Practice Corner: “Wrangle This!”#
Here’s a messy dataset in Python. Try cleaning it up using what you’ll learn in this chapter:
import pandas as pd
data = {
'Customer': ['A1', 'A2', 'A3', 'A4', None],
'Age': [25, None, 300, 40, 32],
'Revenue': [2000, 500, -100, None, 1500],
'Gender': ['M', '?', 'F', 'F', 'unknown']
}
df = pd.DataFrame(data)
print("Original Messy Data:")
print(df)
🧽 Challenge:
Replace None and ? with proper values
Fix negative revenue
Correct impossible ages
Print the clean version
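Try it yourself before peeking. Below is one possible cleanup, continuing from the cell above; the replacement rules (median imputation, clipping refunds at zero, treating ages above 100 as errors) are judgment calls, not the only valid answers:

```python
# One possible cleanup -- continues from the cell above, where df is defined.
df_clean = df.copy()

# Replace None and '?' with placeholder or imputed values
df_clean["Customer"] = df_clean["Customer"].fillna("Unknown")
df_clean["Gender"] = df_clean["Gender"].replace({"?": "Unknown", "unknown": "Unknown"})
df_clean["Revenue"] = df_clean["Revenue"].fillna(df_clean["Revenue"].median())

# Fix negative revenue (treated here as a refund and clipped at zero)
df_clean["Revenue"] = df_clean["Revenue"].clip(lower=0)

# Correct impossible ages, then fill the gaps with the median
df_clean.loc[df_clean["Age"] > 100, "Age"] = None
df_clean["Age"] = df_clean["Age"].fillna(df_clean["Age"].median())

print("Cleaned Data:")
print(df_clean)
```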
🧭 5. What’s Coming Up#
| File | Topic | Funny Summary |
|---|---|---|
| data_loading | Loading data from CSV, Excel, SQL & APIs | “The Great Data Buffet” 🍽️ |
| data_cleaning | Cleaning & preprocessing | “Digital Laundry Day” 🧺 |
| handling_missing_outliers | Fixing missing data & outliers | “CSI: Data Edition” 🕵️ |
| feature_encoding | Encoding categories & scaling features | “Teaching Machines English” 🗣️ |
| eda | Exploratory Data Analysis | “Detective Work with Graphs” 🧠 |
| visualisation | Making plots & charts | “Turning KPIs into art” 🎨 |
| business_dashboards | Interactive dashboards | “Your Data’s TED Talk” 🧑‍💼 |
🚀 Summary#
✅ Data wrangling = preparing the battlefield for ML
✅ Visualization = storytelling for business impact
✅ Clean data = clean insights
✅ Dirty data = bad decisions (and maybe a career pivot)
Remember: “Garbage in → Garbage out” — but in business, garbage often comes with formatting errors.
🔜 Next Stop#
👉 Head to Data Loading (CSV, Excel, SQL, APIs) to learn how to bring all your data under one roof — without crying over file formats.
10. Sorting#
Description: The process of arranging data records based on the values of one or more columns (variables). This can be done in ascending order (from smallest to largest, or A to Z) or descending order (from largest to smallest, or Z to A).
Related Concepts:
Single-Level Sorting: Arranging data based on the values of a single column.
Multi-Level Sorting: Sorting data based on the values of multiple columns, where the order of precedence of the columns determines the final arrangement. For example, sorting time series data first by year and then by month.
Sorting Algorithms (Optional): For a deeper understanding, one could briefly discuss the basic principles behind algorithms like bubble sort (simple comparison-based sorting) or quicksort (more efficient divide-and-conquer approach). While the underlying algorithms are often abstracted away by software, understanding their efficiency can be beneficial for very large datasets.
Example: In a time series of product sales, sorting the data by date ensures that the observations are in chronological order, which is essential for identifying trends and patterns over time. Additionally, one might sort by sales volume within a specific month to see the best-selling products for that period.
Sorting is a foundational operation for organizing and examining data. In time series analysis, maintaining the correct temporal order is paramount.
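Purely to illustrate the bubble-sort idea mentioned above, here is a tiny sketch; in practice you would rely on pandas’ sort_values, which uses far more efficient algorithms under the hood:

```python
def bubble_sort(values):
    """Repeatedly swap adjacent out-of-order pairs until the list is sorted."""
    data = list(values)                      # work on a copy
    n = len(data)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):           # the tail is already in place
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]
                swapped = True
        if not swapped:                      # early exit if already sorted
            break
    return data

print(bubble_sort([25, 30, 22, 35, 28]))     # [22, 25, 28, 30, 35]
```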
Example Applications in Time Series:#
Chronological Ordering: Ensuring time series data is ordered by the time index (e.g., date, timestamp) before any analysis or modeling. If the data is not sorted chronologically, identifying trends, seasonality, or applying time-dependent models will be incorrect.
Identifying Extreme Events: After sorting by the value of the time series (e.g., sales, temperature), one can easily identify the periods with the highest or lowest values.
Ranking Periods: Sorting by a specific metric (e.g., sales volume in each month) allows for ranking different time periods based on that metric.
import pandas as pd
import numpy as np
# Sample time series data (replace with your actual data)
dates = pd.to_datetime(['2023-01-05', '2023-01-02', '2023-01-10', '2023-01-01', '2023-01-08'])
values = [25, 30, 22, 35, 28]
time_series_data_unsorted = pd.DataFrame({'Date': dates, 'Value': values})
print("Unsorted Time Series Data:")
print(time_series_data_unsorted)
print("\n---")
# 1. Sorting by Date (Chronological Order)
time_series_data_sorted_by_date = time_series_data_unsorted.sort_values(by='Date')
print("Time Series Data Sorted by Date:")
print(time_series_data_sorted_by_date)
print("\n---")
# Sample data with another column for multi-level sorting
category = ['A', 'B', 'A', 'B', 'A']
time_series_data_multi = pd.DataFrame({'Date': dates, 'Category': category, 'Value': values})
print("Unsorted Time Series Data with Category:")
print(time_series_data_multi)
print("\n---")
# 2. Multi-Level Sorting (by Category then by Value in descending order)
time_series_data_multi_sorted = time_series_data_multi.sort_values(by=['Category', 'Value'], ascending=[True, False])
print("Time Series Data Sorted by Category (ascending) then Value (descending):")
print(time_series_data_multi_sorted)
print("\n---")
# 3. Sorting by Value (to identify extreme values)
time_series_data_sorted_by_value_desc = time_series_data_unsorted.sort_values(by='Value', ascending=False)
print("Time Series Data Sorted by Value (Descending - Highest Value First):")
print(time_series_data_sorted_by_value_desc)
Unsorted Time Series Data:
Date Value
0 2023-01-05 25
1 2023-01-02 30
2 2023-01-10 22
3 2023-01-01 35
4 2023-01-08 28
---
Time Series Data Sorted by Date:
Date Value
3 2023-01-01 35
1 2023-01-02 30
0 2023-01-05 25
4 2023-01-08 28
2 2023-01-10 22
---
Unsorted Time Series Data with Category:
Date Category Value
0 2023-01-05 A 25
1 2023-01-02 B 30
2 2023-01-10 A 22
3 2023-01-01 B 35
4 2023-01-08 A 28
---
Time Series Data Sorted by Category (ascending) then Value (descending):
Date Category Value
4 2023-01-08 A 28
0 2023-01-05 A 25
2 2023-01-10 A 22
3 2023-01-01 B 35
1 2023-01-02 B 30
---
Time Series Data Sorted by Value (Descending - Highest Value First):
Date Value
3 2023-01-01 35
1 2023-01-02 30
4 2023-01-08 28
0 2023-01-05 25
2 2023-01-10 22
11. Filtering Data#
Description: The process of selecting specific subsets of data from a larger dataset based on predefined conditions or criteria. This allows analysts to focus on the most relevant portions of the data for their specific questions or analyses.
Related Concepts:
Basic Filtering: Applying single conditions to select data (e.g., all sales records from a specific month).
Advanced Filtering: Using multiple conditions combined with logical operators (AND, OR, NOT) to create more complex selection criteria (e.g., all sales above a certain threshold AND occurring during a specific promotional period).
Querying: Similar to using WHERE clauses in SQL, filtering in data analysis tools allows for specifying conditions that the data records must satisfy to be included in the resulting subset.
Example: In a time series of website traffic, filtering the data to show only the traffic during weekdays between 9 AM and 5 PM might be useful for analyzing user behavior during business hours. Another example could be filtering stock prices to only include data points after a specific market event.
Examples in Time Series:#
Filtering by Time Period: Selecting data within a specific date range (e.g., sales data for the last quarter, temperature readings for a particular year).
Filtering by Value Threshold: Identifying time points where the value of the time series exceeds or falls below a certain level (e.g., days when stock prices were above $150, hours when website traffic was below 500).
Filtering by Event: Selecting data points associated with specific events or conditions (if event markers are present in the data).
import pandas as pd
import numpy as np
# Sample time series data (replace with your actual data)
dates = pd.to_datetime(pd.date_range(start='2023-01-01', end='2023-01-10', freq='D'))
sales = np.random.randint(50, 200, size=len(dates))
temperatures = np.random.uniform(15, 25, size=len(dates))
events = ['No Event', 'Promotion', 'No Event', 'Holiday', 'No Event', 'No Event', 'Promotion', 'No Event', 'Holiday', 'No Event']
time_series_df = pd.DataFrame({'Date': dates, 'Sales': sales, 'Temperature': temperatures, 'Event': events})
time_series_df.set_index('Date', inplace=True)
print("Original Time Series Data:")
print(time_series_df)
print("\n---")
# 1. Filtering by Time Period (e.g., sales after 2023-01-05)
after_date = '2023-01-05'
filtered_after_date = time_series_df[time_series_df.index > after_date]
print(f"Sales data after {after_date}:")
print(filtered_after_date)
print("\n---")
# 2. Filtering by Value Threshold (e.g., days with sales above 150)
sales_threshold = 150
high_sales = time_series_df[time_series_df['Sales'] > sales_threshold]
print(f"Days with sales above {sales_threshold}:")
print(high_sales)
print("\n---")
# 3. Filtering by Multiple Conditions (e.g., days with temperature above 20 AND a 'Promotion' event)
filtered_multiple_conditions = time_series_df[(time_series_df['Temperature'] > 20) & (time_series_df['Event'] == 'Promotion')]
print("Days with temperature above 20 and a 'Promotion' event:")
print(filtered_multiple_conditions)
print("\n---")
# 4. Filtering by a list of values (e.g., events that are 'Holiday')
filtered_by_event = time_series_df[time_series_df['Event'].isin(['Holiday'])]
print("Data points where the event is 'Holiday':")
print(filtered_by_event)
Original Time Series Data:
Sales Temperature Event
Date
2023-01-01 193 15.453040 No Event
2023-01-02 134 18.746126 Promotion
2023-01-03 88 21.258599 No Event
2023-01-04 149 20.031363 Holiday
2023-01-05 82 23.564898 No Event
2023-01-06 150 21.586936 No Event
2023-01-07 72 16.629344 Promotion
2023-01-08 59 15.705687 No Event
2023-01-09 118 21.424193 Holiday
2023-01-10 149 15.265113 No Event
---
Sales data after 2023-01-05:
Sales Temperature Event
Date
2023-01-06 150 21.586936 No Event
2023-01-07 72 16.629344 Promotion
2023-01-08 59 15.705687 No Event
2023-01-09 118 21.424193 Holiday
2023-01-10 149 15.265113 No Event
---
Days with sales above 150:
Sales Temperature Event
Date
2023-01-01 193 15.45304 No Event
---
Days with temperature above 20 and a 'Promotion' event:
Empty DataFrame
Columns: [Sales, Temperature, Event]
Index: []
---
Data points where the event is 'Holiday':
Sales Temperature Event
Date
2023-01-04 149 20.031363 Holiday
2023-01-09 118 21.424193 Holiday
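The website-traffic example earlier mentioned restricting data to weekdays between 9 AM and 5 PM. Here is a minimal sketch of that kind of filter, using simulated hourly data:

```python
import pandas as pd
import numpy as np

# Simulated hourly website traffic for one week
hours = pd.date_range("2023-01-02", "2023-01-08 23:00", freq="h")
traffic = pd.DataFrame(
    {"Visitors": np.random.randint(100, 1000, size=len(hours))},
    index=hours,
)

# Keep weekdays only (Monday=0 ... Friday=4), then business hours 9:00-17:00
weekday_traffic = traffic[traffic.index.dayofweek < 5]
business_hours = weekday_traffic.between_time("09:00", "17:00")

print(business_hours.head())
```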
Modifying Data using Excel#
Excel is a practical tool for quick transformations:
Range Names: Assign names to cells or ranges so formulas read clearly (e.g., name a column Returns and write =AVERAGE(Returns))
Sorting: Data → Sort (ascending/descending)
Filtering: Use filters to view a subset of the data
Basic Formulas: =AVERAGE(A1:A10), =STDEV.P(A1:A10)
Charts: Insert → Chart → Time Series Plot
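For readers moving between Excel and Python, here are rough pandas equivalents of the operations above; the column name Returns and the simulated values are only for illustration:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({"Returns": np.random.normal(0.001, 0.02, 250)})

# =AVERAGE(Returns)  ->  mean
avg_return = df["Returns"].mean()

# =STDEV.P(Returns)  ->  population standard deviation (ddof=0)
std_return = df["Returns"].std(ddof=0)

# Data -> Sort       ->  sort_values
df_sorted = df.sort_values("Returns", ascending=False)

# Filters            ->  boolean indexing
positive_days = df[df["Returns"] > 0]

# Insert -> Chart    ->  a quick line plot
df["Returns"].plot(title="Daily Returns")
plt.show()

print(f"Average: {avg_return:.4f}, Std (population): {std_return:.4f}, "
      f"Positive days: {len(positive_days)}")
```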
12. Creating Distribution from Data#
Description: The process of summarizing the values within a time series dataset to understand their frequency or probability of occurrence. This involves grouping the data into intervals (bins) and counting how many observations fall into each bin, or estimating the underlying probability density function.
Related Concepts:
Histograms: Graphical representations of frequency distributions, where the height of each bar corresponds to the number of observations within a particular bin.
Probability Distributions: Mathematical functions that describe the likelihood of different outcomes. Common examples include the normal (Gaussian) distribution, log-normal distribution (often used for asset prices), and others depending on the nature of the time series.
Kernel Density Estimation (KDE): A non-parametric method to estimate the probability density function of a random variable. It provides a smoothed representation of the distribution.
Example: Creating a histogram of daily stock returns over a year can help visualize the distribution of these returns, allowing one to assess if they roughly follow a normal distribution, are skewed, or have heavy tails (more extreme events than a normal distribution would predict).
Mathematical Context: For a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), the probability density function \(f(x)\) is given by:

$$
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$

This bell-shaped curve is often used as a benchmark for understanding the distribution of many natural phenomena and financial data.
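As a quick sanity check, the formula can be evaluated directly and compared against scipy.stats.norm; the values of \(\mu\), \(\sigma\), and the grid below are arbitrary:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.linspace(-3, 3, 7)

# Direct evaluation of the density formula above
manual_pdf = (1.0 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Library implementation
scipy_pdf = norm.pdf(x, loc=mu, scale=sigma)

print(np.allclose(manual_pdf, scipy_pdf))   # True: both evaluate the same bell curve
```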
A distribution provides a visual and statistical summary of how frequently different values occur in a time series. This understanding is vital for:
Identifying Data Skewness: Determining if the distribution is asymmetric, with a longer tail on one side.
Understanding Volatility in Returns: Observing the spread or dispersion of returns, which is a measure of risk in financial time series.
Common Tools for Visualizing Distributions:#
Histogram: Displays the frequency of data points falling within specified bins.
Box Plot: Summarizes the distribution through quartiles, median, and potential outliers, providing insights into spread and skewness.
KDE (Kernel Density Estimate): Provides a smooth, continuous estimate of the probability density function of the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Sample time series data (replace with your actual data)
dates = pd.to_datetime(pd.date_range(start='2023-01-01', end='2023-12-31', freq='D'))
daily_returns = np.random.normal(0.0005, 0.01, size=len(dates)) # Simulate daily returns
time_series_returns = pd.DataFrame({'Date': dates, 'Daily Return': daily_returns})
time_series_returns.set_index('Date', inplace=True)
print("Sample Time Series of Daily Returns:")
print(time_series_returns.head())
print("\n---")
# 1. Histogram of Daily Returns
plt.figure(figsize=(10, 6))
sns.histplot(time_series_returns['Daily Return'], bins=30, kde=False)
plt.title('Histogram of Daily Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency')
plt.show()
print("\n---")
# 2. Box Plot of Daily Returns
plt.figure(figsize=(8, 6))
sns.boxplot(y=time_series_returns['Daily Return'])
plt.title('Box Plot of Daily Returns')
plt.ylabel('Daily Return')
plt.show()
print("\n---")
# 3. Kernel Density Estimate (KDE) of Daily Returns
plt.figure(figsize=(10, 6))
sns.kdeplot(time_series_returns['Daily Return'], fill=True)
plt.title('Kernel Density Estimate of Daily Returns')
plt.xlabel('Daily Return')
plt.ylabel('Density')
plt.show()
print("\n---")
# 4. Combining Histogram and KDE
plt.figure(figsize=(10, 6))
sns.histplot(time_series_returns['Daily Return'], bins=30, kde=True, color='skyblue')
plt.title('Histogram with KDE of Daily Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency / Density')
plt.show()
Sample Time Series of Daily Returns:
Daily Return
Date
2023-01-01 0.006972
2023-01-02 0.001761
2023-01-03 -0.018474
2023-01-04 0.013305
2023-01-05 -0.005651
---
---
---
---
13. Analyzing Distribution#
Description: Quantifying the shape of a distribution by evaluating its asymmetry (skewness) and the heaviness of its tails (kurtosis). These measures provide insights beyond central tendency and dispersion, revealing important characteristics of the data’s distribution.
Related Concepts:
Skewness: A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. A positive skew indicates a long tail extending towards more positive values (right-skewed), while a negative skew indicates a long tail extending towards more negative values (left-skewed). The formula for sample skewness is:

$$
\text{Skew} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^3}{\sigma^3}
$$

where \(x_i\) are the data points, \(\mu\) is the sample mean, \(\sigma\) is the sample standard deviation, and \(n\) is the number of data points.
Kurtosis: A measure of the “tailedness” of the probability distribution. High kurtosis indicates that the distribution has heavier tails and a sharper peak around the mean compared to a normal distribution, suggesting more frequent extreme values. The formula for sample kurtosis (often adjusted to excess kurtosis by subtracting 3, since a normal distribution has a kurtosis of 3) is:

$$
\text{Kurt} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^4}{\sigma^4}
$$

Excess kurtosis is then \(\text{Kurt}_{\text{excess}} = \text{Kurt} - 3\).
Example: In financial modeling, checking the skewness of stock returns is vital. A significant negative skew might indicate a higher probability of large negative returns (crashes) compared to large positive returns. Similarly, examining kurtosis helps assess the risk associated with extreme price movements.
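The skewness and kurtosis formulas above can be checked directly against scipy.stats (note that scipy.stats.kurtosis returns excess kurtosis by default); the simulated normal sample below is only for demonstration:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)          # simulated sample, roughly normal

mu = x.mean()
sigma = x.std()                      # population standard deviation (ddof=0), matching the formulas

manual_skew = np.mean((x - mu) ** 3) / sigma ** 3
manual_kurt = np.mean((x - mu) ** 4) / sigma ** 4   # "raw" kurtosis, about 3 for normal data

print(f"Manual skewness:        {manual_skew:.4f}")
print(f"scipy skewness:         {skew(x):.4f}")
print(f"Manual excess kurtosis: {manual_kurt - 3:.4f}")
print(f"scipy excess kurtosis:  {kurtosis(x):.4f}")
```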
Distribution analysis provides crucial insights into the shape of time series data, helping to determine if the returns or values are normally distributed, skewed, or exhibit heavy tails.
Tools for Assessing Distribution Shape:#
Histogram: Visual inspection can reveal asymmetry and the relative frequency of values in the tails.
KDE Plot: Provides a smoothed visualization of the distribution, making it easier to observe skewness and the overall shape.
Skewness Coefficient: A numerical measure of asymmetry.

$$
\text{Skewness} > 0 \implies \text{Right-tailed (positive skew)}
$$
$$
\text{Skewness} < 0 \implies \text{Left-tailed (negative skew)}
$$
$$
\text{Skewness} \approx 0 \implies \text{Symmetric}
$$

Kurtosis Coefficient: A numerical measure of the tailedness of the distribution.

$$
\text{Kurtosis} > 3 \implies \text{Leptokurtic (heavy tails, sharper peak)}
$$
$$
\text{Kurtosis} < 3 \implies \text{Platykurtic (thin tails, flatter peak)}
$$
$$
\text{Kurtosis} \approx 3 \implies \text{Mesokurtic (similar to normal distribution)}
$$
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, skewnorm
from scipy.stats import skew, kurtosis
# Generate data points
np.random.seed(42) # for reproducibility
n_points = 1000
# Normal distribution
normal_data = np.random.normal(loc=0, scale=1, size=n_points)
skewness_normal = skew(normal_data)  # compute from the sample rather than assuming exactly 0
# Right-skewed distribution (using a different distribution for better visual)
right_skewed_data = np.random.exponential(scale=1, size=n_points) - 1 # Shifted exponential
skewness_right = skew(right_skewed_data)
# Left-skewed distribution (mirroring the right-skewed)
left_skewed_data = - (np.random.exponential(scale=1, size=n_points) - 1)
skewness_left = skew(left_skewed_data)
# Plotting histograms
plt.figure(figsize=(14, 6))
plt.subplot(1, 3, 1)
plt.hist(normal_data, bins=30, density=True, alpha=0.7, color='blue', label='Normal')
xmin, xmax = plt.xlim()
x_norm = np.linspace(xmin, xmax, 100)
p_norm = norm.pdf(x_norm, np.mean(normal_data), np.std(normal_data))
plt.plot(x_norm, p_norm, 'k', linewidth=2)
plt.title(f'Normal (Skewness ≈ {skewness_normal:.2f})')
plt.xlabel('Data Value')
plt.ylabel('Probability Density')
plt.legend()
plt.subplot(1, 3, 2)
plt.hist(right_skewed_data, bins=30, density=True, alpha=0.7, color='green', label='Right-Skewed')
xmin_right, xmax_right = plt.xlim()
x_right = np.linspace(xmin_right, xmax_right, 100)
p_right = skewnorm.pdf(x_right, a=skewness_right, loc=np.mean(right_skewed_data), scale=np.std(right_skewed_data)) # Using skewnorm to fit
# Note: Fitting skewnorm might not perfectly match the exponential, but shows the skewed shape
plt.plot(x_right, p_right, 'k', linewidth=2)
plt.title(f'Right-Skewed (Skewness ≈ {skewness_right:.2f})')
plt.xlabel('Data Value')
plt.ylabel('Probability Density')
plt.legend()
plt.subplot(1, 3, 3)
plt.hist(left_skewed_data, bins=30, density=True, alpha=0.7, color='red', label='Left-Skewed')
xmin_left, xmax_left = plt.xlim()
x_left = np.linspace(xmin_left, xmax_left, 100)
p_left = skewnorm.pdf(x_left, a=skewness_left, loc=np.mean(left_skewed_data), scale=np.std(left_skewed_data)) # Using skewnorm to fit
# Note: Fitting skewnorm might not perfectly match the mirrored exponential, but shows the skewed shape
plt.plot(x_left, p_left, 'k', linewidth=2)
plt.title(f'Left-Skewed (Skewness ≈ {skewness_left:.2f})')
plt.xlabel('Data Value')
plt.ylabel('Probability Density')
plt.legend()
plt.tight_layout()
plt.show()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew, kurtosis
# Sample time series data (replace with your actual data)
dates = pd.to_datetime(pd.date_range(start='2023-01-01', end='2023-12-31', freq='D'))
daily_returns_normal = np.random.normal(0.0005, 0.01, size=len(dates))
daily_returns_skewed = np.random.normal(0.0005, 0.01, size=len(dates)) + np.random.exponential(0.005, size=len(dates)) * 0.5
daily_returns_heavy_tailed = np.random.standard_t(df=3, size=len(dates)) * 0.01
df = pd.DataFrame({
'Date': dates,
'Normal Returns': daily_returns_normal,
'Skewed Returns': daily_returns_skewed,
'Heavy-Tailed Returns': daily_returns_heavy_tailed
})
df.set_index('Date', inplace=True)
print("Sample Time Series of Returns:")
print(df.head())
print("\n---")
# Analyze Normal Returns
print("Analysis of Normal Returns:")
skewness_normal = skew(df['Normal Returns'])
kurtosis_normal = kurtosis(df['Normal Returns'])
print(f"Skewness: {skewness_normal:.4f}")
print(f"Kurtosis (Excess): {kurtosis_normal:.4f}") # scipy.stats.kurtosis returns excess kurtosis
plt.figure(figsize=(10, 6))
sns.histplot(df['Normal Returns'], bins=30, kde=True, color='skyblue')
plt.title('Histogram with KDE of Normal Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency / Density')
plt.show()
print("\n---")
# Analyze Skewed Returns
print("Analysis of Skewed Returns:")
skewness_skewed = skew(df['Skewed Returns'])
kurtosis_skewed = kurtosis(df['Skewed Returns'])
print(f"Skewness: {skewness_skewed:.4f}")
print(f"Kurtosis (Excess): {kurtosis_skewed:.4f}")
plt.figure(figsize=(10, 6))
sns.histplot(df['Skewed Returns'], bins=30, kde=True, color='salmon')
plt.title('Histogram with KDE of Skewed Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency / Density')
plt.show()
print("\n---")
# Analyze Heavy-Tailed Returns
print("Analysis of Heavy-Tailed Returns:")
skewness_heavy_tailed = skew(df['Heavy-Tailed Returns'])
kurtosis_heavy_tailed = kurtosis(df['Heavy-Tailed Returns'])
print(f"Skewness: {skewness_heavy_tailed:.4f}")
print(f"Kurtosis (Excess): {kurtosis_heavy_tailed:.4f}")
plt.figure(figsize=(10, 6))
sns.histplot(df['Heavy-Tailed Returns'], bins=30, kde=True, color='lightgreen')
plt.title('Histogram with KDE of Heavy-Tailed Returns')
plt.xlabel('Daily Return')
plt.ylabel('Frequency / Density')
plt.show()
Sample Time Series of Returns:
Normal Returns Skewed Returns Heavy-Tailed Returns
Date
2023-01-01 -0.005567 -0.005480 0.016412
2023-01-02 0.002613 -0.003066 -0.012828
2023-01-03 0.012501 -0.007143 0.003276
2023-01-04 -0.004419 0.013458 -0.021952
2023-01-05 -0.018266 -0.001169 0.010649
---
Analysis of Normal Returns:
Skewness: 0.0289
Kurtosis (Excess): 0.3690
---
Analysis of Skewed Returns:
Skewness: -0.0920
Kurtosis (Excess): -0.1990
---
Analysis of Heavy-Tailed Returns:
Skewness: 0.3461
Kurtosis (Excess): 3.4767
Creating Distributions from Time Series Data for Forecasting#
In forecasting, distributions of time series data help model uncertainty and predict future values. Key approaches include:
Empirical Distributions: Histograms or kernel density estimates of historical values.
Residual Distributions: Distributions of errors from a forecasting model.
Lag Distributions: Distributions of differences between observations at specific lags.
Example: For a time series \(X_t = \{x_1, x_2, \ldots, x_T\}\), the residual distribution is based on the errors \(e_t = x_t - \hat{x}_t\), where \(\hat{x}_t\) is the predicted value. The probability density of the residuals can be estimated as:

$$
f(e) \approx \frac{1}{Th} \sum_{t=1}^{T} K\left(\frac{e - e_t}{h}\right)
$$

where \(K\) is a kernel function and \(h\) is the bandwidth.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Generate synthetic time series
np.random.seed(42)
t = np.arange(100)
trend = 0.1 * t
seasonality = 5 * np.sin(2 * np.pi * t / 12)
noise = np.random.normal(0, 1, 100)
X = trend + seasonality + noise
# Simple moving average forecast
window = 5
forecast = pd.Series(X).rolling(window=window, min_periods=1).mean().shift(1)
residuals = X - forecast
# Lag differences (lag=1)
lag_diff = pd.Series(X).diff(1).dropna()
# Empirical distribution (histogram)
plt.figure(figsize=(12, 8))
plt.subplot(3, 1, 1)
plt.hist(X, bins=30, density=True, alpha=0.7, color='blue', label='Empirical')
plt.title('Empirical Distribution of Time Series')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
# Residual distribution (kernel density)
kde = gaussian_kde(residuals.dropna())
x_range = np.linspace(min(residuals.dropna()), max(residuals.dropna()), 100)
plt.subplot(3, 1, 2)
plt.plot(x_range, kde(x_range), 'r-', label='Residual KDE')
plt.hist(residuals.dropna(), bins=30, density=True, alpha=0.5, color='gray', label='Residual Histogram')
plt.title('Residual Distribution')
plt.xlabel('Residual')
plt.ylabel('Density')
plt.legend()
# Lag distribution
plt.subplot(3, 1, 3)
plt.hist(lag_diff, bins=30, density=True, alpha=0.7, color='green', label='Lag-1 Differences')
plt.title('Lag-1 Difference Distribution')
plt.xlabel('Difference')
plt.ylabel('Density')
plt.legend()
plt.tight_layout()
plt.savefig('distributions_forecasting.png')
17. Data Reduction#
Dropping Unnecessary Columns:
Remove features with no predictive value (e.g., constant variables, irrelevant metadata).
Example: Drop store ID if forecasting total chain sales.
Aggregating Data:
Convert high-frequency data (e.g., daily sales) to a lower frequency (e.g., monthly) to match Holt-Winters’ seasonal period (e.g., 12 months).
Smooths noise while preserving seasonality.
Feature Selection:
Select features with strong correlations to the target (e.g., sales) or lagged effects (e.g., marketing spend at lag 2, as identified by cross-correlation).
Use correlation analysis or variance thresholds to keep relevant predictors.
Principal Component Analysis (PCA): summarizing correlated data into a few key patterns
Reduce multiple correlated features (e.g., financial indicators, marketing channels) into fewer components that capture most variance.
Useful when external regressors (e.g., marketing, promotions) are considered alongside Holt-Winters.
Binning (Numerosity Reduction):
Group continuous variables (e.g., customer age, sales volume) into categories to reduce noise and simplify analysis.
Example: Bin daily sales into weekly averages to stabilize fluctuations.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import warnings
# Suppress warnings for cleaner output (e.g., Holt-Winters convergence)
warnings.filterwarnings("ignore", category=UserWarning)
# 1. Generate synthetic daily retail sales data (2018–2024)
np.random.seed(42)
dates = pd.date_range(start='2018-01-01', end='2024-12-31', freq='D')
n = len(dates) # ~2557 days
# Sales: trend, annual seasonality (Dec peaks), noise
trend = 0.0001 * np.arange(n) # ~$10K/year
seasonality = 3 * np.sin(2 * np.pi * (np.arange(n) % 365.25) / 365.25) # Dec ~$8K
noise = np.random.normal(0, 0.5, n)
sales = 5 + trend + seasonality + noise # Daily ~$3K–$8K
# Features
marketing = 1 + 0.5 * np.sin(2 * np.pi * (np.arange(n) % 365.25 - 60) / 365.25) + np.random.normal(0, 0.2, n)
traffic = 7 + 0.5 * marketing + np.random.normal(0, 1, n) # Correlated
discounts = np.random.uniform(0, 20, n) # Weakly correlated
store_id = ['Store1'] * n # Constant
# Create daily DataFrame
df_daily = pd.DataFrame({
'Date': dates, 'Sales': sales, 'Marketing': marketing,
'Traffic': traffic, 'Discounts': discounts, 'StoreID': store_id
})
# 2. Data Reduction: Drop unnecessary columns
# Identify constant columns
constant_cols = [col for col in df_daily if df_daily[col].nunique() == 1]
if constant_cols:
df_reduced = df_daily.drop(columns=constant_cols)
print(f"Dropped constant columns: {constant_cols}")
else:
df_reduced = df_daily.copy()
print("No constant columns to drop")
# 3. Data Reduction: Aggregate to monthly (Holt-Winters s=12)
try:
df_reduced['Date'] = pd.to_datetime(df_reduced['Date'])
df_monthly = df_reduced.groupby(df_reduced['Date'].dt.to_period('M')).agg({
'Sales': 'sum', # Monthly ~$80K–$200K
'Marketing': 'sum',
'Traffic': 'sum',
'Discounts': 'mean'
}).reset_index()
df_monthly['Date'] = df_monthly['Date'].dt.to_timestamp()
except Exception as e:
raise ValueError(f"Error in aggregation: {e}")
# 4. Data Reduction: Feature selection (correlation-based)
X = df_monthly[['Marketing', 'Traffic', 'Discounts']]
y = df_monthly['Sales']
try:
selector = SelectKBest(f_regression, k=2)
X_selected = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()].tolist() # Marketing, Traffic
df_selected = pd.DataFrame(X_selected, columns=selected_features)
df_selected['Date'] = df_monthly['Date']
df_selected['Sales'] = df_monthly['Sales']
except Exception as e:
raise ValueError(f"Error in feature selection: {e}")
# 5. Data Reduction: PCA on selected features (marketing, traffic)
# PCA reduces external features, NOT sales
try:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_selected[['Marketing', 'Traffic']])
pca = PCA(n_components=0.95) # Keep 95% variance
X_pca = pca.fit_transform(X_scaled)
df_pca = pd.DataFrame({'PC1': X_pca[:, 0], 'Date': df_monthly['Date']})
explained_variance = pca.explained_variance_ratio_[0]
except Exception as e:
raise ValueError(f"Error in PCA: {e}")
# 6. Data Reduction: Binning sales for interpretation
# Binning applies to sales (target), NOT related to PCA
try:
df_monthly['Sales_Bin'] = pd.cut(df_monthly['Sales'],
bins=[0, 100, 150, 300],
labels=['Low', 'Medium', 'High'])
except Exception as e:
raise ValueError(f"Error in binning: {e}")
# 7. Holt-Winters forecast on monthly sales
try:
df_monthly.set_index('Date', inplace=True)
model = ExponentialSmoothing(df_monthly['Sales'],
seasonal='mul',
seasonal_periods=12,
trend='add')
fit = model.fit()
forecast = fit.fittedvalues
# Forecast for 2025
forecast_2025 = fit.forecast(12)
forecast_dates = pd.date_range(start='2025-01-01', periods=12, freq='ME')
df_forecast = pd.DataFrame({'Forecast': forecast_2025}, index=forecast_dates)
except Exception as e:
raise ValueError(f"Error in Holt-Winters forecasting: {e}")
# 8. Plot results (4 panels for clarity)
plt.figure(figsize=(12, 16))
# Panel 1: Daily vs. Monthly Sales (Aggregation)
plt.subplot(4, 1, 1)
plt.plot(df_daily['Date'], df_daily['Sales'], 'b-', alpha=0.3, label='Daily Sales ($K)')
plt.plot(df_monthly.index, df_monthly['Sales'], 'r-', label='Monthly Sales ($K)')
plt.title('Aggregation: Daily to Monthly Sales')
plt.ylabel('Sales ($K)')
plt.legend()
plt.grid(True)
plt.text(df_monthly.index[10], 200, 'Monthly data reduces noise,\npreserves 12-month seasonality\nfor Holt-Winters',
color='red', fontsize=10)
# Panel 2: Selected Features
plt.subplot(4, 1, 2)
plt.plot(df_selected['Date'], df_selected['Marketing'], 'g-', label='Marketing ($K)')
plt.plot(df_selected['Date'], df_selected['Traffic'], 'm-', label='Traffic (K visitors)')
plt.title(f'Selected Features: {", ".join(selected_features)}')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.text(df_selected['Date'][10], 50, 'Correlated with sales,\nuse for post-forecast adjustments',
color='red', fontsize=10)
# Panel 3: PCA on Features
plt.subplot(4, 1, 3)
plt.plot(df_pca['Date'], df_pca['PC1'], 'c-', label=f'PC1 ({explained_variance:.1%} variance)')
plt.title('PCA: Reduced Features (Marketing, Traffic)')
plt.ylabel('PC1 Value')
plt.legend()
plt.grid(True)
plt.text(df_pca['Date'][10], 2, 'PCA combines marketing & traffic\ninto one predictor for adjustments',
color='red', fontsize=10)
# Panel 4: Holt-Winters Forecast with Binned Sales
plt.subplot(4, 1, 4)
plt.plot(df_monthly.index, df_monthly['Sales'], 'b-', label='Monthly Sales ($K)')
plt.plot(df_monthly.index, forecast, 'r--', label='Holt-Winters Fitted')
plt.plot(df_forecast.index, df_forecast['Forecast'], 'g--', label='2025 Forecast')
plt.twinx()
plt.bar(df_monthly.index, df_monthly['Sales_Bin'].cat.codes, alpha=0.3, color='gray', label='Sales Bin')
plt.yticks([0, 1, 2], ['Low', 'Medium', 'High'])
plt.title('Holt-Winters Forecast and Binned Sales')
plt.ylabel('Sales Bin')
plt.legend(loc='upper left')
plt.grid(True)
plt.text(df_monthly.index[10], 2, 'Binning categorizes sales\nfor stakeholder reports,\nforecast uses raw sales',
color='red', fontsize=10)
plt.tight_layout()
plt.show()
# Print summary for teaching
print("Data Reduction Summary:")
print(f"- Original: {n} daily records, 5 features")
print(f"- After dropping: Removed {constant_cols}")
print(f"- After aggregation: 84 monthly records")
print(f"- Selected features: {selected_features}")
print(f"- PCA: 1 component (marketing, traffic), {explained_variance:.1%} variance")
print(f"- Binning: Sales categorized as Low (<$100K), Medium ($100K–$150K), High (>$150K)")
print(f"- Holt-Winters: Forecasted 2025 sales, e.g., Dec 2025 ~${df_forecast['Forecast'].iloc[-1]:.1f}K")
Dropped constant columns: ['StoreID']
Data Reduction Summary:
- Original: 2557 daily records, 5 features
- After dropping: Removed ['StoreID']
- After aggregation: 84 monthly records
- Selected features: ['Marketing', 'Traffic']
- PCA: 1 component (marketing, traffic), 84.2% variance
- Binning: Sales categorized as Low (<$100K), Medium ($100K–$150K), High (>$150K)
- Holt-Winters: Forecasted 2025 sales, e.g., Dec 2025 ~$nanK