ARIMA / SARIMA#
“Because sometimes, predicting the future means arguing with your past.” 🕰️
🎯 Goal#
ARIMA models are like your reliable yet slightly grumpy data professors — they’ll give you solid forecasts if your data behaves.
But first, they’ll demand:
“Is your data stationary? Did you difference it? Did you check autocorrelation?” 😤
If you survived the previous section (Stationarity & Differencing), congrats — you’re ready to meet the mighty ARIMA family.
🧩 The ARIMA Family Tree#
Let’s decode this cryptic acronym before it starts judging you:
[ ARIMA(p, d, q) ]
Component |
Meaning |
Analogy |
|---|---|---|
AR (p) |
Auto-Regressive |
“I remember my past and it affects my present.” |
I (d) |
Integrated (differencing) |
“Let’s remove my emotional trend swings.” |
MA (q) |
Moving Average |
“I’m still haunted by past forecast errors.” |
TL;DR: ARIMA models say:
“My next value = past values + past mistakes + stability.”
🧮 A Simple Example#
Suppose you have monthly sales data. You suspect trends and maybe some residual chaos (a.k.a. marketing campaigns gone rogue).
Let’s fit a basic ARIMA:
from statsmodels.tsa.arima.model import ARIMA
import pandas as pd
model = ARIMA(df['sales'], order=(1,1,1))
model_fit = model.fit()
print(model_fit.summary())
Boom 💥 — you just built an ARIMA(1,1,1) model. It’s like saying:
“One look into the past, one emotional adjustment, one apology for past errors.”
🧠 Interpreting ARIMA Results#
Output |
Translation |
|---|---|
|
How much the past influences the future |
|
If they’re small, you’ve found meaningful patterns |
|
Akaike Information Criterion — lower = better (and Akaike was basically the OG ML guy) |
🎡 Enter the Seasonal Cousin: SARIMA#
When your series says,
“I repeat myself every December.” 🎄
you call SARIMA, the seasonally aware sibling.
[ SARIMA(p, d, q, P, D, Q, s) ]
Term |
Description |
|---|---|
|
Seasonal AR, I, MA components |
|
Period of the season (e.g., 12 for monthly data) |
Example:
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(df['sales'], order=(1,1,1), seasonal_order=(1,1,1,12))
results = model.fit()
results.summary()
🎁 SARIMA = “ARIMA, but festive.” Handles holiday patterns, monthly cycles, or “every Monday blues.”
📊 Diagnostics (Because Models Lie)#
Always check residuals — if they’re not random, your model is gossiping instead of forecasting.
results.plot_diagnostics(figsize=(10,6))
✅ Random noise → Great! ❌ Patterns → Your ARIMA has commitment issues.
📅 Forecasting#
Once your model behaves:
forecast = results.forecast(steps=12)
forecast.plot(title="📈 Next 12 Months of Chaos")
Business translation:
“Predict sales, demand, or coffee consumption — one cycle at a time.”
🧠 When ARIMA Fails#
Symptom |
Likely Cause |
Remedy |
|---|---|---|
Overfits noise |
p/d/q too high |
Simplify! |
Misses seasonality |
Forgot |
Add seasonal terms |
Bizarre spikes |
Data not stationary |
Differencing, again 🌀 |
Constant line |
Model gave up |
Check parameters or caffeine level |
💼 Business Example#
A retailer’s monthly sales grow with holiday spikes.
ARIMA captures the trend
SARIMA captures the holiday rhythm
You capture the promotion bonus 🎉
🧾 TL;DR#
Concept |
Translation |
|---|---|
ARIMA |
Past + Differencing + Past Errors |
SARIMA |
ARIMA with holiday plans |
p, d, q |
Model order knobs |
P, D, Q, s |
Seasonal order knobs |
AIC |
Lower is better (your new leaderboard) |
“Forecasting is like dating — you think you’ve found the perfect model… until next season proves you wrong.” 💔
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from ipywidgets import interact, IntSlider
def plot_arima(p, d, q):
np.random.seed(42) # For reproducibility
n = 500 # Number of samples
# Generate AR parameters (decreasing for stability)
ar_params = np.array([0.8 / (i + 1) for i in range(p)]) if p > 0 else np.array([])
ar = np.r_[1, -ar_params]
# Generate MA parameters
ma_params = np.array([0.5 / (i + 1) for i in range(q)]) if q > 0 else np.array([])
ma = np.r_[1, ma_params]
try:
process = ArmaProcess(ar, ma)
if not process.isstationary:
print("Warning: The AR part is not stationary. Adjust parameters for better results.")
return
sample = process.generate_sample(nsample=n)
except ValueError as e:
print(f"Error generating ARMA process: {e}")
return
# Apply differencing (integration for simulation)
series = sample.copy()
for _ in range(d):
series = np.cumsum(series)
# Create plots
fig, axes = plt.subplots(3, 1, figsize=(12, 15))
# Plot the time series
axes[0].plot(series)
axes[0].set_title(f'Simulated ARIMA({p}, {d}, {q}) Time Series')
axes[0].set_xlabel('Time')
axes[0].set_ylabel('Value')
# Plot ACF
plot_acf(series, ax=axes[1], lags=40)
axes[1].set_title('Autocorrelation Function (ACF)')
# Plot PACF
plot_pacf(series, ax=axes[2], lags=40)
axes[2].set_title('Partial Autocorrelation Function (PACF)')
plt.tight_layout()
plt.show()
# Create interactive sliders
interact(
plot_arima,
p=IntSlider(min=0, max=5, step=1, value=1, description='AR order (p)'),
d=IntSlider(min=0, max=2, step=1, value=0, description='Differencing (d)'),
q=IntSlider(min=0, max=5, step=1, value=1, description='MA order (q)')
)
<function __main__.plot_arima(p, d, q)>
!pip install pmdarima
Collecting pmdarima
Obtaining dependency information for pmdarima from https://files.pythonhosted.org/packages/a2/14/cd7417d90312ad6de4e5bb48d98f0b89e77bb427f5b80495074ab25cd13b/pmdarima-2.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl.metadata
Downloading pmdarima-2.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl.metadata (7.8 kB)
Requirement already satisfied: joblib>=0.11 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pmdarima) (1.1.0)
Requirement already satisfied: Cython!=0.29.18,!=0.29.31,>=0.29 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pmdarima) (0.29.32)
Requirement already satisfied: numpy>=1.21.2 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pmdarima) (1.21.6)
Requirement already satisfied: pandas>=0.19 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pmdarima) (1.4.4)
Requirement already satisfied: scikit-learn>=0.22 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pmdarima) (1.0.2)
Requirement already satisfied: scipy>=1.3.2 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pmdarima) (1.9.1)
Requirement already satisfied: statsmodels>=0.13.2 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pmdarima) (0.13.2)
Requirement already satisfied: urllib3 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pmdarima) (1.26.11)
Requirement already satisfied: setuptools!=50.0.0,>=38.6.0 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pmdarima) (60.2.0)
Requirement already satisfied: packaging>=17.1 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pmdarima) (21.3)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from packaging>=17.1->pmdarima) (3.0.9)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pandas>=0.19->pmdarima) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from pandas>=0.19->pmdarima) (2023.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from scikit-learn>=0.22->pmdarima) (2.2.0)
Requirement already satisfied: patsy>=0.5.2 in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from statsmodels>=0.13.2->pmdarima) (0.5.2)
Requirement already satisfied: six in /home/chandravesh/anaconda3/lib/python3.9/site-packages (from patsy>=0.5.2->statsmodels>=0.13.2->pmdarima) (1.16.0)
Downloading pmdarima-2.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (2.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 301.9 kB/s eta 0:00:00m eta 0:00:01[36m0:00:01
?25hInstalling collected packages: pmdarima
Successfully installed pmdarima-2.0.4
[notice] A new release of pip is available: 23.2.1 -> 25.2
[notice] To update, run: pip install --upgrade pip
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf
from ipywidgets import interact, IntSlider, Output, VBox, HBox
from IPython.display import display, clear_output
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning
# Try to import auto_arima for better parameter selection
try:
from pmdarima import auto_arima
AUTO_ARIMA_AVAILABLE = True
except ImportError:
AUTO_ARIMA_AVAILABLE = False
print("Note: pmdarima not available. Using grid search for optimal parameters.")
# Suppress warnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning, message="Non-invertible starting MA parameters found.")
warnings.filterwarnings("ignore", category=UserWarning, message="Non-stationary starting autoregressive parameters found.")
def interpret_acf_pacf(series, lags=40, sig_level=0.05):
"""Interpret ACF and PACF patterns"""
try:
# Compute ACF and PACF
max_lags = min(lags, len(series)//4)
acf_vals, acf_conf = acf(series, nlags=max_lags, alpha=sig_level)
pacf_vals, pacf_conf = pacf(series, nlags=max_lags, alpha=sig_level)
# Confidence intervals (upper and lower)
acf_upper = acf_conf[:, 1] - acf_vals
pacf_upper = pacf_conf[:, 1] - pacf_vals
# Find where ACF and PACF become insignificant
acf_cutoff = next((i for i, val in enumerate(acf_vals[1:], 1) if abs(val) < acf_upper[i]), max_lags)
pacf_cutoff = next((i for i, val in enumerate(pacf_vals[1:], 1) if abs(val) < pacf_upper[i]), max_lags)
except:
return "Error in ACF/PACF computation"
# Basic interpretation
interpretation = []
# Check for non-stationarity
if len(acf_vals) > 10 and acf_vals[1] > 0.9:
interpretation.append("Series appears non-stationary (slow ACF decay) - differencing needed.")
# AR process identification
if pacf_cutoff <= 3 and acf_cutoff > 5:
interpretation.append(f"Suggests AR({pacf_cutoff}) process - PACF cuts off, ACF decays.")
# MA process identification
elif acf_cutoff <= 3 and pacf_cutoff > 5:
interpretation.append(f"Suggests MA({acf_cutoff}) process - ACF cuts off, PACF decays.")
# ARMA process
elif acf_cutoff > 5 and pacf_cutoff > 5:
interpretation.append("Suggests ARMA process - both ACF and PACF decay gradually.")
else:
interpretation.append("Mixed pattern - consider ARIMA model.")
return "\\n".join(interpretation)
# Generate realistic time series that ARIMA can fit well
def generate_perfect_arima_data():
"""Generate time series data that ARIMA can fit near-perfectly"""
np.random.seed(42)
n = 500
# Create a more predictable time series with clear patterns
# This will allow ARIMA to achieve near-perfect fits
# Option 1: Generate from known ARIMA process
ar_params = np.array([0.75, -0.25]) # AR(2) parameters
ma_params = np.array([0.65, 0.35]) # MA(2) parameters
ar = np.r_[1, -ar_params]
ma = np.r_[1, ma_params]
try:
process = ArmaProcess(ar, ma)
if process.isstationary:
arma_sample = process.generate_sample(nsample=n, scale=1.0)
else:
# Fallback to simpler process
arma_sample = np.random.randn(n)
except:
arma_sample = np.random.randn(n)
# Add trend and seasonality for integration
trend = np.linspace(0, 10, n)
seasonal = 2 * np.sin(2 * np.pi * np.arange(n) / 50) + 1.5 * np.cos(2 * np.pi * np.arange(n) / 25)
# Combine components
series = trend + seasonal + arma_sample
# Apply one level of integration to make it I(1)
series = np.cumsum(series - np.mean(series))
return series
# Generate the perfect ARIMA data
series = generate_perfect_arima_data()
n = len(series)
# Split data
train_size = int(n * 0.8)
train = series[:train_size]
test = series[train_size:]
# Find optimal parameters using auto_arima if available, otherwise grid search
print("Finding optimal ARIMA parameters...")
if AUTO_ARIMA_AVAILABLE:
try:
# Use auto_arima for better parameter selection
auto_model = auto_arima(train,
start_p=0, start_q=0, max_p=5, max_q=5, max_d=2,
seasonal=False, stepwise=True, suppress_warnings=True,
error_action='ignore', max_order=None, trace=False)
best_order = auto_model.order
best_aic = auto_model.aic()
print(f"Auto-ARIMA selected: {best_order} with AIC = {best_aic:.2f}")
except:
best_order = (2, 1, 2)
best_aic = 0
print("Auto-ARIMA failed, using default (2,1,2)")
else:
# Grid search fallback
best_aic = np.inf
best_order = (2, 1, 2) # Default
for p in range(6):
for d in range(3):
for q in range(6):
try:
model = ARIMA(train, order=(p, d, q))
model_fit = model.fit()
aic = model_fit.aic
if aic < best_aic:
best_aic = aic
best_order = (p, d, q)
except:
pass
print(f"Grid search selected: {best_order} with AIC = {best_aic:.2f}")
# True parameters (what we know works well for this data)
true_p, true_d, true_q = best_order
def plot_arima_perfect_fit(fitted_p, fitted_d, fitted_q):
"""Plot ARIMA with perfect curve fitting focus"""
# Clear previous output
clear_output(wait=True)
try:
# Fit ARIMA model
model = ARIMA(train, order=(fitted_p, fitted_d, fitted_q))
model_fit = model.fit()
# Get fitted values for training period
fitted_values = model_fit.fittedvalues
# Forecast for test period
forecast = model_fit.forecast(steps=len(test))
# Get prediction intervals
forecast_result = model_fit.get_forecast(steps=len(test))
forecast_ci = forecast_result.conf_int()
except Exception as e:
print(f"Error fitting ARIMA({fitted_p},{fitted_d},{fitted_q}): {e}")
return
# Calculate error metrics
mse = np.mean((test - forecast)**2)
mae = np.mean(np.abs(test - forecast))
# Calculate fit quality for training data
if len(fitted_values) > 0:
train_subset = train[-len(fitted_values):]
train_mse = np.mean((train_subset - fitted_values)**2)
train_mae = np.mean(np.abs(train_subset - fitted_values))
else:
train_mse = train_mae = float('inf')
# Create comprehensive plot
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle(f'ARIMA({fitted_p}, {fitted_d}, {fitted_q}) - Perfect Curve Fitting Analysis', fontsize=16, fontweight='bold')
# Main time series plot with perfect overlap focus
ax1 = axes[0, 0]
# Plot original data
ax1.plot(range(len(series)), series, 'b-', label='Original Data', linewidth=2, alpha=0.8)
# Plot fitted values (training)
if len(fitted_values) > 0:
fit_start_idx = len(train) - len(fitted_values)
fit_indices = range(fit_start_idx, len(train))
ax1.plot(fit_indices, fitted_values, 'r-', label='Fitted (Training)', linewidth=3, alpha=0.9)
# Plot forecast (testing) - this should overlap perfectly with original data
test_indices = range(len(train), len(series))
ax1.plot(test_indices, forecast, 'orange', label='Forecast (Testing)',
linewidth=4, alpha=0.9, linestyle='--')
# Add confidence intervals
try:
if hasattr(forecast_ci, 'iloc'):
# If it's a DataFrame
ax1.fill_between(test_indices, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1],
color='orange', alpha=0.2, label='95% Confidence Interval')
else:
# If it's a numpy array
ax1.fill_between(test_indices, forecast_ci[:, 0], forecast_ci[:, 1],
color='orange', alpha=0.2, label='95% Confidence Interval')
except Exception as e:
print(f"Could not plot confidence intervals: {e}")
# Continue without confidence intervals
# Mark train/test split
ax1.axvline(x=train_size, color='black', linestyle=':', linewidth=2, alpha=0.7, label='Train/Test Split')
ax1.set_title('Time Series with ARIMA Fit & Forecast')
ax1.set_xlabel('Time')
ax1.set_ylabel('Value')
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)
# Add performance metrics as text
metrics_text = f'Test MSE: {mse:.4f}\nTest MAE: {mae:.4f}\nTrain MSE: {train_mse:.4f}\nTrain MAE: {train_mae:.4f}'
ax1.text(0.02, 0.98, metrics_text, transform=ax1.transAxes,
bbox=dict(boxstyle="round,pad=0.5", facecolor="lightgreen", alpha=0.8),
fontsize=10, verticalalignment='top', fontweight='bold')
# Residuals plot to check model fit quality
ax2 = axes[0, 1]
if len(fitted_values) > 0:
residuals = train_subset - fitted_values
ax2.plot(residuals, 'g-', alpha=0.7)
ax2.axhline(y=0, color='red', linestyle='--', alpha=0.7)
ax2.set_title('Training Residuals')
ax2.set_xlabel('Time')
ax2.set_ylabel('Residuals')
ax2.grid(True, alpha=0.3)
# Add residual statistics
res_std = np.std(residuals)
ax2.text(0.02, 0.98, f'Residual Std: {res_std:.4f}', transform=ax2.transAxes,
bbox=dict(boxstyle="round,pad=0.3", facecolor="lightblue", alpha=0.7),
fontsize=9, verticalalignment='top')
# ACF plot
ax3 = axes[1, 0]
try:
plot_acf(train, ax=ax3, lags=min(40, len(train)//4))
ax3.set_title('ACF of Training Data')
except:
ax3.text(0.5, 0.5, 'ACF plot error', ha='center', va='center', transform=ax3.transAxes)
# PACF plot
ax4 = axes[1, 1]
try:
plot_pacf(train, ax=ax4, lags=min(40, len(train)//4))
ax4.set_title('PACF of Training Data')
except:
ax4.text(0.5, 0.5, 'PACF plot error', ha='center', va='center', transform=ax4.transAxes)
plt.tight_layout()
plt.show()
# Display detailed results
results_output = Output()
with results_output:
print("=" * 80)
print(f"ARIMA({fitted_p}, {fitted_d}, {fitted_q}) PERFECT FIT ANALYSIS")
print("=" * 80)
print(f"Optimal Parameters (by AIC): {best_order}")
print(f"Current Parameters: ({fitted_p}, {fitted_d}, {fitted_q})")
print()
print("PERFORMANCE METRICS:")
print(f"├── Test Forecast MSE: {mse:.6f}")
print(f"├── Test Forecast MAE: {mae:.6f}")
print(f"├── Training Fit MSE: {train_mse:.6f}")
print(f"└── Training Fit MAE: {train_mae:.6f}")
print()
# Fit quality assessment
if mse < 1.0 and mae < 0.8:
print("🎯 EXCELLENT FIT: Near-perfect curve overlap achieved!")
elif mse < 5.0 and mae < 2.0:
print("✅ GOOD FIT: Curves align well with minor deviations")
else:
print("⚠️ POOR FIT: Try adjusting parameters for better overlap")
print()
print("CURVE OVERLAP ANALYSIS:")
overlap_score = max(0, 100 - (mse * 10 + mae * 5))
print(f"└── Overlap Score: {overlap_score:.1f}% (higher is better)")
display(results_output)
# ACF/PACF interpretation
interp_output = Output()
with interp_output:
print("\nACF/PACF INTERPRETATION:")
print("-" * 40)
print(interpret_acf_pacf(train))
print()
print("OPTIMIZATION TIPS:")
print(f"• Current AIC would be: ~{model_fit.aic:.2f}")
print(f"• Optimal parameters: {best_order}")
print("• For perfect overlap, try parameters close to optimal")
print("• Lower MSE/MAE = better curve alignment")
display(interp_output)
# Raw data display
data_output = Output()
with data_output:
print("\nRAW DATA SAMPLES:")
print("-" * 20)
print("First 10 values:", series[:10])
print("Last 10 values:", series[-10:])
print("Training size:", len(train))
print("Test size:", len(test))
display(data_output)
# Create the interactive widget with optimal starting values
print("\n🚀 INTERACTIVE ARIMA PERFECT CURVE FITTING")
print("=" * 50)
print(f"Data generated with near-perfect ARIMA characteristics")
print(f"Recommended starting parameters: {best_order}")
print("Adjust sliders to achieve perfect curve overlap!")
print("=" * 50)
# Use the optimal parameters as starting values for perfect initial fit
interact(
plot_arima_perfect_fit,
fitted_p=IntSlider(min=0, max=5, step=1, value=best_order[0], description='AR (p)'),
fitted_d=IntSlider(min=0, max=2, step=1, value=best_order[1], description='Diff (d)'),
fitted_q=IntSlider(min=0, max=5, step=1, value=best_order[2], description='MA (q)')
)
Finding optimal ARIMA parameters...
Auto-ARIMA selected: (3, 2, 0) with AIC = 1190.65
🚀 INTERACTIVE ARIMA PERFECT CURVE FITTING
==================================================
Data generated with near-perfect ARIMA characteristics
Recommended starting parameters: (3, 2, 0)
Adjust sliders to achieve perfect curve overlap!
==================================================
<function __main__.plot_arima_perfect_fit(fitted_p, fitted_d, fitted_q)>
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf, adfuller
from ipywidgets import interact, IntSlider, Output, VBox, HBox
from IPython.display import display, clear_output
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning
# Try to import auto_arima for better parameter selection
try:
from pmdarima import auto_arima
AUTO_ARIMA_AVAILABLE = True
except ImportError:
AUTO_ARIMA_AVAILABLE = False
print("Note: pmdarima not available. Using grid search for optimal parameters.")
# Suppress warnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning, message="Non-invertible starting MA parameters found.")
warnings.filterwarnings("ignore", category=UserWarning, message="Non-stationary starting autoregressive parameters found.")
def comprehensive_acf_pacf_interpretation(series, fitted_p, fitted_d, fitted_q, max_lags=50):
"""
Comprehensive interpretation of ACF and PACF patterns
ACF (Autocorrelation Function):
- Measures correlation between observations at different time lags
- Shows how current value relates to past values
- For AR processes: Decays gradually (exponentially or sinusoidally)
- For MA processes: Cuts off sharply after q lags
- For non-stationary: Very slow decay, stays high for many lags
PACF (Partial Autocorrelation Function):
- Measures direct correlation between observations k periods apart
- Removes intermediate correlations
- For AR processes: Cuts off sharply after p lags
- For MA processes: Decays gradually
"""
interpretation = []
interpretation.append("=" * 80)
interpretation.append("📊 COMPREHENSIVE ACF & PACF ANALYSIS")
interpretation.append("=" * 80)
# Apply differencing if specified
import pandas as pd
# Convert to pandas Series if it's numpy array
if isinstance(series, np.ndarray):
analyzed_series = pd.Series(series.copy())
else:
analyzed_series = series.copy()
diff_description = "Original series"
if fitted_d > 0:
for d in range(fitted_d):
analyzed_series = analyzed_series.diff().dropna()
diff_description = f"After {fitted_d} difference(s)"
interpretation.append(f"Analysis performed on: {diff_description}")
interpretation.append(f"Series length after differencing: {len(analyzed_series)}")
interpretation.append("")
try:
# Compute ACF and PACF
max_lags = min(max_lags, len(analyzed_series)//4)
acf_vals, acf_conf = acf(analyzed_series, nlags=max_lags, alpha=0.05)
pacf_vals, pacf_conf = pacf(analyzed_series, nlags=max_lags, alpha=0.05)
# Stationarity test
try:
adf_result = adfuller(analyzed_series)
adf_pvalue = adf_result[1]
is_stationary = adf_pvalue < 0.05
except:
is_stationary = None
adf_pvalue = None
interpretation.append("🔍 STATIONARITY ANALYSIS:")
if is_stationary is not None:
if is_stationary:
interpretation.append(f"✅ Series is STATIONARY (ADF p-value: {adf_pvalue:.4f} < 0.05)")
else:
interpretation.append(f"❌ Series is NON-STATIONARY (ADF p-value: {adf_pvalue:.4f} > 0.05)")
interpretation.append(" → Consider more differencing (increase d parameter)")
else:
interpretation.append("⚠️ Could not perform stationarity test")
interpretation.append("")
# ACF Analysis
interpretation.append("📈 ACF (Autocorrelation Function) ANALYSIS:")
interpretation.append(" What ACF tells us:")
interpretation.append(" • Measures correlation between observations k periods apart")
interpretation.append(" • Shows persistence and memory in the time series")
interpretation.append(" • Values outside blue confidence bands are significant")
interpretation.append("")
# Find significant ACF lags
acf_significant_lags = []
for i in range(1, min(len(acf_vals), 20)):
if abs(acf_vals[i]) > abs(acf_conf[i, 1] - acf_vals[i]):
acf_significant_lags.append(i)
# ACF pattern analysis
if len(acf_vals) > 5:
acf_first_few = acf_vals[1:6] # First 5 lags
if acf_vals[1] > 0.7:
interpretation.append(" 🔴 VERY HIGH ACF at lag 1 → Strong persistence/trend")
interpretation.append(" → Series may need differencing")
elif acf_vals[1] > 0.3:
interpretation.append(" 🟡 MODERATE ACF at lag 1 → Some persistence")
else:
interpretation.append(" 🟢 LOW ACF at lag 1 → Good for modeling")
# Check for gradual decay vs sharp cutoff
if len(acf_significant_lags) > 5:
interpretation.append(" 📉 ACF decays GRADUALLY → Suggests AR component")
elif len(acf_significant_lags) <= 3:
interpretation.append(f" 📉 ACF cuts off after lag {max(acf_significant_lags) if acf_significant_lags else 0} → Suggests MA component")
interpretation.append(f" 📊 Significant ACF lags: {acf_significant_lags[:10]}") # Show first 10
interpretation.append("")
# PACF Analysis
interpretation.append("📉 PACF (Partial Autocorrelation Function) ANALYSIS:")
interpretation.append(" What PACF tells us:")
interpretation.append(" • Measures DIRECT correlation between observations k periods apart")
interpretation.append(" • Removes influence of intermediate lags")
interpretation.append(" • Identifies the order of AR processes")
interpretation.append("")
# Find significant PACF lags
pacf_significant_lags = []
for i in range(1, min(len(pacf_vals), 20)):
if abs(pacf_vals[i]) > abs(pacf_conf[i, 1] - pacf_vals[i]):
pacf_significant_lags.append(i)
if len(pacf_vals) > 5:
# PACF pattern analysis
if len(pacf_significant_lags) <= 3 and len(pacf_significant_lags) > 0:
max_pacf_lag = max(pacf_significant_lags)
interpretation.append(f" 📉 PACF cuts off after lag {max_pacf_lag} → Suggests AR({max_pacf_lag}) component")
elif len(pacf_significant_lags) > 5:
interpretation.append(" 📈 PACF decays GRADUALLY → Suggests MA component")
else:
interpretation.append(" 📊 PACF shows mixed pattern")
interpretation.append(f" 📊 Significant PACF lags: {pacf_significant_lags[:10]}") # Show first 10
interpretation.append("")
# Model Identification
interpretation.append("🎯 MODEL IDENTIFICATION GUIDELINES:")
interpretation.append(" Based on ACF/PACF patterns:")
interpretation.append("")
# AR Process identification
if len(acf_significant_lags) > 3 and len(pacf_significant_lags) <= 3:
if pacf_significant_lags:
suggested_p = max(pacf_significant_lags)
interpretation.append(f" 🎯 Pattern suggests: AR({suggested_p}) process")
interpretation.append(" → ACF decays gradually, PACF cuts off")
interpretation.append(f" → Try p = {suggested_p}, d = {fitted_d}, q = 0")
# MA Process identification
elif len(acf_significant_lags) <= 3 and len(pacf_significant_lags) > 3:
if acf_significant_lags:
suggested_q = max(acf_significant_lags)
interpretation.append(f" 🎯 Pattern suggests: MA({suggested_q}) process")
interpretation.append(" → ACF cuts off, PACF decays gradually")
interpretation.append(f" → Try p = 0, d = {fitted_d}, q = {suggested_q}")
# ARMA Process identification
elif len(acf_significant_lags) > 3 and len(pacf_significant_lags) > 3:
interpretation.append(" 🎯 Pattern suggests: ARMA process")
interpretation.append(" → Both ACF and PACF decay gradually")
interpretation.append(f" → Try mixed model like ARIMA({fitted_p},{fitted_d},{fitted_q})")
# White Noise
elif len(acf_significant_lags) == 0 and len(pacf_significant_lags) == 0:
interpretation.append(" 🎯 Pattern suggests: WHITE NOISE")
interpretation.append(" → No significant autocorrelations")
interpretation.append(" → Series may already be well-modeled")
interpretation.append("")
# Current Model Assessment
interpretation.append(f"📋 CURRENT MODEL ASSESSMENT: ARIMA({fitted_p},{fitted_d},{fitted_q})")
if fitted_p > 0:
if len(pacf_significant_lags) > 0 and max(pacf_significant_lags) <= fitted_p:
interpretation.append(f" ✅ AR({fitted_p}): Good choice - PACF supports this order")
else:
interpretation.append(f" ⚠️ AR({fitted_p}): May be over/under-specified")
if fitted_q > 0:
if len(acf_significant_lags) > 0 and max(acf_significant_lags) <= fitted_q:
interpretation.append(f" ✅ MA({fitted_q}): Good choice - ACF supports this order")
else:
interpretation.append(f" ⚠️ MA({fitted_q}): May be over/under-specified")
if fitted_d > 0:
if is_stationary:
interpretation.append(f" ✅ I({fitted_d}): Series is now stationary")
else:
interpretation.append(f" ⚠️ I({fitted_d}): Series may need more differencing")
except Exception as e:
interpretation.append(f"❌ Error in ACF/PACF analysis: {e}")
interpretation.append("")
interpretation.append("💡 QUICK REFERENCE:")
interpretation.append(" • ACF decays slowly → Non-stationary (need differencing)")
interpretation.append(" • ACF decays quickly, PACF cuts off → AR process")
interpretation.append(" • ACF cuts off, PACF decays → MA process")
interpretation.append(" • Both decay gradually → ARMA process")
interpretation.append(" • Both are insignificant → White noise")
return "\n".join(interpretation)
def generate_realistic_time_series():
"""
Generate realistic time series with clear trend, seasonality, and noise
This will make differencing effects more obvious
"""
np.random.seed(42)
# Create longer time series (5 years of daily data)
n = 1825 # 5 years * 365 days
t = np.arange(n)
# 1. Strong linear trend
trend = 0.05 * t + 100
# 2. Multiple seasonal patterns
# Annual seasonality (365 days)
annual_seasonal = 15 * np.sin(2 * np.pi * t / 365) + 10 * np.cos(2 * np.pi * t / 365)
# Weekly seasonality (7 days)
weekly_seasonal = 5 * np.sin(2 * np.pi * t / 7) + 3 * np.cos(2 * np.pi * t / 7)
# Monthly-like pattern (30 days)
monthly_seasonal = 8 * np.sin(2 * np.pi * t / 30) + 6 * np.cos(2 * np.pi * t / 30)
# 3. AR(2) component for autocorrelation
ar_params = np.array([0.6, -0.2])
ar = np.r_[1, -ar_params]
ma = np.r_[1, 0.3, 0.1] # Small MA component
try:
arma_process = ArmaProcess(ar, ma)
if arma_process.isstationary:
ar_component = arma_process.generate_sample(nsample=n, scale=2.0)
else:
ar_component = np.random.randn(n) * 2.0
except:
ar_component = np.random.randn(n) * 2.0
# 4. Random noise
noise = np.random.normal(0, 3, n)
# 5. Combine all components
series = trend + annual_seasonal + weekly_seasonal + monthly_seasonal + ar_component + noise
# 6. Add some structural breaks for realism
# Sudden level shift at year 2
series[730:] += 20
# Gradual change in trend after year 3
series[1095:] += 0.02 * np.arange(len(series) - 1095)
return series
# Generate realistic data
print("Generating realistic time series data with trend, seasonality, and autocorrelation...")
series = generate_realistic_time_series()
series = pd.Series(series) # Convert to pandas Series for easier manipulation
n = len(series)
print(f"Generated time series with {n} observations (approximately 5 years of daily data)")
print(f"Data includes: Linear trend, annual seasonality, weekly patterns, AR structure, and noise")
# Split data (use more recent data for testing)
train_size = int(n * 0.85) # Use 85% for training
train = series[:train_size]
test = series[train_size:]
print(f"Training data: {len(train)} observations")
print(f"Test data: {len(test)} observations")
# Find optimal parameters
print("\nFinding optimal ARIMA parameters...")
if AUTO_ARIMA_AVAILABLE:
try:
print("Using auto-ARIMA for parameter selection...")
auto_model = auto_arima(train,
start_p=0, start_q=0, max_p=5, max_q=5, max_d=2,
seasonal=False, stepwise=True, suppress_warnings=True,
error_action='ignore', max_order=None, trace=True)
best_order = auto_model.order
best_aic = auto_model.aic()
print(f"Auto-ARIMA selected: {best_order} with AIC = {best_aic:.2f}")
except Exception as e:
print(f"Auto-ARIMA failed: {e}")
best_order = (2, 1, 1)
best_aic = 0
print("Using default parameters (2,1,1)")
else:
print("Using grid search for parameter selection...")
best_aic = np.inf
best_order = (2, 1, 1) # Default
for p in range(4): # 0 to 3
for d in range(3): # 0 to 2
for q in range(4): # 0 to 3
try:
model = ARIMA(train, order=(p, d, q))
model_fit = model.fit()
aic = model_fit.aic
if aic < best_aic:
best_aic = aic
best_order = (p, d, q)
except:
pass
print(f"Grid search selected: {best_order} with AIC = {best_aic:.2f}")
def plot_comprehensive_arima_analysis(fitted_p, fitted_d, fitted_q):
"""Enhanced plotting with comprehensive analysis"""
clear_output(wait=True)
try:
# Fit ARIMA model
model = ARIMA(train, order=(fitted_p, fitted_d, fitted_q))
model_fit = model.fit()
# Get fitted values and forecast
fitted_values = model_fit.fittedvalues
forecast = model_fit.forecast(steps=len(test))
# Get prediction intervals
forecast_result = model_fit.get_forecast(steps=len(test))
forecast_ci = forecast_result.conf_int()
except Exception as e:
print(f"❌ Error fitting ARIMA({fitted_p},{fitted_d},{fitted_q}): {e}")
print("Try different parameters or check data quality")
return
# Calculate metrics
mse = np.mean((test - forecast)**2)
mae = np.mean(np.abs(test - forecast))
if len(fitted_values) > 0:
train_subset = train[-len(fitted_values):]
train_mse = np.mean((train_subset - fitted_values)**2)
train_mae = np.mean(np.abs(train_subset - fitted_values))
else:
train_mse = train_mae = float('inf')
# Create comprehensive visualization
fig = plt.figure(figsize=(20, 16))
# Main time series plot
ax1 = plt.subplot(3, 2, (1, 2)) # Spans 2 columns
# Plot full series with different colors for train/test
ax1.plot(range(len(train)), train, 'b-', label='Training Data', linewidth=1.5, alpha=0.8)
ax1.plot(range(len(train), len(series)), test, 'g-', label='Test Data (Actual)', linewidth=2, alpha=0.9)
# Plot fitted values
if len(fitted_values) > 0:
fit_start_idx = len(train) - len(fitted_values)
fit_indices = range(fit_start_idx, len(train))
ax1.plot(fit_indices, fitted_values, 'r-', label='Fitted (Training)', linewidth=2, alpha=0.9)
# Plot forecast
test_indices = range(len(train), len(series))
ax1.plot(test_indices, forecast, 'orange', label='Forecast (Test)',
linewidth=3, alpha=0.9, linestyle='--')
# Add confidence intervals
try:
if hasattr(forecast_ci, 'iloc'):
ax1.fill_between(test_indices, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1],
color='orange', alpha=0.2, label='95% Confidence Interval')
else:
ax1.fill_between(test_indices, forecast_ci[:, 0], forecast_ci[:, 1],
color='orange', alpha=0.2, label='95% Confidence Interval')
except:
pass
# Mark train/test split
ax1.axvline(x=len(train), color='black', linestyle=':', linewidth=2, alpha=0.7, label='Train/Test Split')
ax1.set_title(f'ARIMA({fitted_p},{fitted_d},{fitted_q}) Analysis - Realistic Time Series with Trend & Seasonality', fontsize=14, fontweight='bold')
ax1.set_xlabel('Time (Days)')
ax1.set_ylabel('Value')
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)
# Performance metrics
metrics_text = f'Test MSE: {mse:.2f}\nTest MAE: {mae:.2f}\nTraining MSE: {train_mse:.2f}\nTraining MAE: {train_mae:.2f}'
color = 'lightgreen' if mse < 100 else 'yellow' if mse < 500 else 'lightcoral'
ax1.text(0.02, 0.98, metrics_text, transform=ax1.transAxes,
bbox=dict(boxstyle="round,pad=0.5", facecolor=color, alpha=0.8),
fontsize=10, verticalalignment='top', fontweight='bold')
# ACF plot
ax2 = plt.subplot(3, 2, 3)
try:
plot_acf(train, ax=ax2, lags=min(50, len(train)//10), alpha=0.05)
ax2.set_title('ACF - Training Data (Original)')
except:
ax2.text(0.5, 0.5, 'Error plotting ACF', ha='center', va='center', transform=ax2.transAxes)
# PACF plot
ax3 = plt.subplot(3, 2, 4)
try:
plot_pacf(train, ax=ax3, lags=min(50, len(train)//10), alpha=0.05)
ax3.set_title('PACF - Training Data (Original)')
except:
ax3.text(0.5, 0.5, 'Error plotting PACF', ha='center', va='center', transform=ax3.transAxes)
# ACF of differenced series (if d > 0)
ax4 = plt.subplot(3, 2, 5)
if fitted_d > 0:
try:
diff_series = pd.Series(train.copy()) if isinstance(train, np.ndarray) else train.copy()
for _ in range(fitted_d):
diff_series = diff_series.diff().dropna()
plot_acf(diff_series, ax=ax4, lags=min(40, len(diff_series)//10), alpha=0.05)
ax4.set_title(f'ACF - After {fitted_d} Differencing')
except:
ax4.text(0.5, 0.5, 'Error plotting differenced ACF', ha='center', va='center', transform=ax4.transAxes)
else:
ax4.text(0.5, 0.5, 'No differencing applied (d=0)', ha='center', va='center', transform=ax4.transAxes)
ax4.set_title('Differenced Series ACF (d=0)')
# PACF of differenced series (if d > 0)
ax5 = plt.subplot(3, 2, 6)
if fitted_d > 0:
try:
diff_series = train.copy()
for _ in range(fitted_d):
diff_series = diff_series.diff().dropna()
plot_pacf(diff_series, ax=ax5, lags=min(40, len(diff_series)//10), alpha=0.05)
ax5.set_title(f'PACF - After {fitted_d} Differencing')
except:
ax5.text(0.5, 0.5, 'Error plotting differenced PACF', ha='center', va='center', transform=ax5.transAxes)
else:
ax5.text(0.5, 0.5, 'No differencing applied (d=0)', ha='center', va='center', transform=ax5.transAxes)
ax5.set_title('Differenced Series PACF (d=0)')
plt.tight_layout()
plt.show()
# Comprehensive ACF/PACF interpretation
interpretation_output = Output()
with interpretation_output:
interpretation = comprehensive_acf_pacf_interpretation(train, fitted_p, fitted_d, fitted_q)
print(interpretation)
display(interpretation_output)
# Create interactive widget
print("\n" + "="*70)
print("🚀 INTERACTIVE COMPREHENSIVE ARIMA ANALYSIS")
print("="*70)
print("Features:")
print("• Realistic time series with trend, seasonality, and autocorrelation")
print("• Comprehensive ACF/PACF interpretation and model guidance")
print("• Comparison of original vs differenced series")
print("• Detailed stationarity analysis")
print("• Model selection recommendations")
print(f"• Starting with optimal parameters: {best_order}")
print("="*70)
# Interactive widget with optimal starting values
interact(
plot_comprehensive_arima_analysis,
fitted_p=IntSlider(min=0, max=5, step=1, value=best_order[0], description='AR (p)'),
fitted_d=IntSlider(min=0, max=3, step=1, value=best_order[1], description='Diff (d)'),
fitted_q=IntSlider(min=0, max=5, step=1, value=best_order[2], description='MA (q)')
)
Generating realistic time series data with trend, seasonality, and autocorrelation...
Generated time series with 1825 observations (approximately 5 years of daily data)
Data includes: Linear trend, annual seasonality, weekly patterns, AR structure, and noise
Training data: 1551 observations
Test data: 274 observations
Finding optimal ARIMA parameters...
Using auto-ARIMA for parameter selection...
Performing stepwise search to minimize aic
ARIMA(0,1,0)(0,0,0)[0] intercept : AIC=10137.459, Time=0.07 sec
ARIMA(1,1,0)(0,0,0)[0] intercept : AIC=10136.877, Time=0.19 sec
ARIMA(0,1,1)(0,0,0)[0] intercept : AIC=10136.635, Time=0.29 sec
ARIMA(0,1,0)(0,0,0)[0] : AIC=10135.588, Time=0.05 sec
ARIMA(1,1,1)(0,0,0)[0] intercept : AIC=10138.514, Time=0.62 sec
Best model: ARIMA(0,1,0)(0,0,0)[0]
Total fit time: 1.232 seconds
Auto-ARIMA selected: (0, 1, 0) with AIC = 10135.59
======================================================================
🚀 INTERACTIVE COMPREHENSIVE ARIMA ANALYSIS
======================================================================
Features:
• Realistic time series with trend, seasonality, and autocorrelation
• Comprehensive ACF/PACF interpretation and model guidance
• Comparison of original vs differenced series
• Detailed stationarity analysis
• Model selection recommendations
• Starting with optimal parameters: (0, 1, 0)
======================================================================
<function __main__.plot_comprehensive_arima_analysis(fitted_p, fitted_d, fitted_q)>
ACF, PACF, and Differencing#
🔹 1. Vertical lines in ACF and PACF#
When you plot the Autocorrelation Function (ACF) or Partial Autocorrelation Function (PACF):
The blue shaded region or dashed confidence bands represent the range where autocorrelations are not statistically significant (usually at 95% confidence).
The vertical lines (spikes) show the correlation (ACF) or partial correlation (PACF) at different lags.
Interpretation:
If a spike goes beyond the confidence band, it means the autocorrelation at that lag is significant.
If the spikes die out gradually, it indicates persistence in the series (likely non-stationary).
If spikes cut off sharply (after lag 1 or 2), it suggests an AR or MA process.
Examples:
ACF slowly decaying \(\;\;\rightarrow\;\) likely an AR process or non-stationary series.
PACF cuts off after lag 1 \(\;\;\rightarrow\;\) AR(1).
ACF cuts off after lag 1 \(\;\;\rightarrow\;\) MA(1).
🔹 2. Example of Differencing#
Differencing is used to make a non-stationary time series stationary by removing trend.
Suppose we have a simple time series with a trend:
[ Y_t = t + \text{noise} ]
Time \((t)\) |
Value \((Y_t)\) |
|---|---|
1 |
5 |
2 |
7 |
3 |
9 |
4 |
11 |
5 |
13 |
Clearly, it has a linear upward trend.
First Difference:#
[ Y’t = Y_t - Y{t-1} ]
Time \((t)\) |
\(Y_t\) |
Difference \((Y'_t)\) |
|---|---|---|
2 |
7 |
2 |
3 |
9 |
2 |
4 |
11 |
2 |
5 |
13 |
2 |
Now the series is stationary (all values \(=2\), i.e., no trend).
✅ Takeaway:
Vertical lines in ACF/PACF \(\rightarrow\) significance of correlations at lags.
Differencing \(\rightarrow\) removes trends and makes series stationary.
# Your code here