ARIMA / SARIMA#

“Because sometimes, predicting the future means arguing with your past.” 🕰️


🎯 Goal#

ARIMA models are like your reliable yet slightly grumpy data professors — they’ll give you solid forecasts if your data behaves.

But first, they’ll demand:

“Is your data stationary? Did you difference it? Did you check autocorrelation?” 😤

If you survived the previous section (Stationarity & Differencing), congrats — you’re ready to meet the mighty ARIMA family.
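Not sure your data would pass the professor's quiz? Here is a quick sanity check you could run first, assuming a DataFrame df with a 'sales' column like the examples below (df is a placeholder, swap in your own data):

from statsmodels.tsa.stattools import adfuller

# ADF test: null hypothesis = series has a unit root (non-stationary)
p_value = adfuller(df['sales'].dropna())[1]
print("Stationary ✅" if p_value < 0.05 else "Difference me first! 😤")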


🧩 The ARIMA Family Tree#

Let’s decode this cryptic acronym before it starts judging you:

\[ \text{ARIMA}(p, d, q) \]

| Component | Meaning | Analogy |
|-----------|---------|---------|
| AR (p) | Auto-Regressive | “I remember my past and it affects my present.” |
| I (d) | Integrated (differencing) | “Let’s remove my emotional trend swings.” |
| MA (q) | Moving Average | “I’m still haunted by past forecast errors.” |

TL;DR: ARIMA models say:

“My next value = past values + past mistakes + stability.”
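In symbols, for an ARIMA(p, d, q) model, with \(y'_t\) the series after \(d\) rounds of differencing and \(\varepsilon_t\) the forecast errors:

\[ y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t \]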


🧮 A Simple Example#

Suppose you have monthly sales data. You suspect trends and maybe some residual chaos (a.k.a. marketing campaigns gone rogue).

Let’s fit a basic ARIMA:

from statsmodels.tsa.arima.model import ARIMA
import pandas as pd

# df is assumed to hold your monthly sales in a 'sales' column
model = ARIMA(df['sales'], order=(1, 1, 1))  # order = (p, d, q)
model_fit = model.fit()
print(model_fit.summary())

Boom 💥 — you just built an ARIMA(1,1,1) model. It’s like saying:

“One look into the past, one emotional adjustment, one apology for past errors.”


🧠 Interpreting ARIMA Results#

| Output | Translation |
|--------|-------------|
| coef | How much the past influences the future |
| p-values | If they’re small, you’ve found meaningful patterns |
| AIC | Akaike Information Criterion — lower = better (and Akaike was basically the OG ML guy) |
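To pull those numbers straight off the fitted model from the example above (these are standard attributes of the statsmodels results object):

# Headline numbers from the fitted ARIMA model
print(model_fit.params)    # coefficients (the "coef" column)
print(model_fit.pvalues)   # p-value for each coefficient
print(model_fit.aic)       # AIC: lower = better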


🎡 Enter the Seasonal Cousin: SARIMA#

When your series says,

“I repeat myself every December.” 🎄

you call SARIMA, the seasonally aware sibling.

\[ \text{SARIMA}(p, d, q)(P, D, Q)_s \]

| Term | Description |
|------|-------------|
| P, D, Q | Seasonal AR, I, MA components |
| s | Period of the season (e.g., 12 for monthly data) |

Example:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# seasonal_order = (P, D, Q, s); s=12 for yearly patterns in monthly data
model = SARIMAX(df['sales'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()
results.summary()

🎁 SARIMA = “ARIMA, but festive.” Handles holiday patterns, monthly cycles, or “every Monday blues.”


📊 Diagnostics (Because Models Lie)#

Always check residuals — if they’re not random, your model is gossiping instead of forecasting.

results.plot_diagnostics(figsize=(10,6))

✅ Random noise → Great! ❌ Patterns → Your ARIMA has commitment issues.
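For a formal version of the "is it random noise?" check, the Ljung-Box test from statsmodels can be run on the residuals; a sketch using the SARIMAX results from above:

from statsmodels.stats.diagnostic import acorr_ljungbox

# H0: residuals are white noise; a small p-value means leftover structure
print(acorr_ljungbox(results.resid, lags=[10]))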


📅 Forecasting#

Once your model behaves:

forecast = results.forecast(steps=12)
forecast.plot(title="📈 Next 12 Months of Chaos")

Business translation:

“Predict sales, demand, or coffee consumption — one cycle at a time.”
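If you also want uncertainty bands around those 12 months of chaos, get_forecast returns intervals alongside the point forecast:

# forecast() gives point estimates; get_forecast() adds uncertainty
pred = results.get_forecast(steps=12)
print(pred.predicted_mean)  # the forecast itself
print(pred.conf_int())      # 95% interval by default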


🧠 When ARIMA Fails#

| Symptom | Likely Cause | Remedy |
|---------|--------------|--------|
| Overfits noise | p/d/q too high | Simplify! |
| Misses seasonality | Forgot SARIMA | Add seasonal terms |
| Bizarre spikes | Data not stationary | Differencing, again 🌀 |
| Constant line | Model gave up | Check parameters or caffeine level |
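If twisting the knobs by hand gets old, pmdarima's auto_arima (installed later in this notebook) can search the orders for you. A minimal sketch, again assuming the placeholder df['sales'] series:

from pmdarima import auto_arima

# Stepwise AIC search over (p, d, q) and seasonal (P, D, Q, 12)
auto_model = auto_arima(df['sales'], seasonal=True, m=12,
                        stepwise=True, suppress_warnings=True)
print(auto_model.order, auto_model.seasonal_order)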


💼 Business Example#

A retailer’s monthly sales grow with holiday spikes.

  • ARIMA captures the trend

  • SARIMA captures the holiday rhythm

  • You capture the promotion bonus 🎉


🧾 TL;DR#

| Concept | Translation |
|---------|-------------|
| ARIMA | Past + Differencing + Past Errors |
| SARIMA | ARIMA with holiday plans |
| p, d, q | Model order knobs |
| P, D, Q, s | Seasonal order knobs |
| AIC | Lower is better (your new leaderboard) |


“Forecasting is like dating — you think you’ve found the perfect model… until next season proves you wrong.” 💔

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from ipywidgets import interact, IntSlider

def plot_arima(p, d, q):
    np.random.seed(42)  # For reproducibility
    n = 500  # Number of samples

    # Generate AR parameters (decreasing for stability)
    ar_params = np.array([0.8 / (i + 1) for i in range(p)]) if p > 0 else np.array([])
    ar = np.r_[1, -ar_params]

    # Generate MA parameters
    ma_params = np.array([0.5 / (i + 1) for i in range(q)]) if q > 0 else np.array([])
    ma = np.r_[1, ma_params]

    try:
        process = ArmaProcess(ar, ma)
        if not process.isstationary:
            print("Warning: The AR part is not stationary. Adjust parameters for better results.")
            return
        sample = process.generate_sample(nsample=n)
    except ValueError as e:
        print(f"Error generating ARMA process: {e}")
        return

    # Apply differencing (integration for simulation)
    series = sample.copy()
    for _ in range(d):
        series = np.cumsum(series)

    # Create plots
    fig, axes = plt.subplots(3, 1, figsize=(12, 15))

    # Plot the time series
    axes[0].plot(series)
    axes[0].set_title(f'Simulated ARIMA({p}, {d}, {q}) Time Series')
    axes[0].set_xlabel('Time')
    axes[0].set_ylabel('Value')

    # Plot ACF
    plot_acf(series, ax=axes[1], lags=40)
    axes[1].set_title('Autocorrelation Function (ACF)')

    # Plot PACF
    plot_pacf(series, ax=axes[2], lags=40)
    axes[2].set_title('Partial Autocorrelation Function (PACF)')

    plt.tight_layout()
    plt.show()

# Create interactive sliders
interact(
    plot_arima,
    p=IntSlider(min=0, max=5, step=1, value=1, description='AR order (p)'),
    d=IntSlider(min=0, max=2, step=1, value=0, description='Differencing (d)'),
    q=IntSlider(min=0, max=5, step=1, value=1, description='MA order (q)')
)
!pip install pmdarima
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf
from ipywidgets import interact, IntSlider, Output, VBox, HBox
from IPython.display import display, clear_output
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning

# Try to import auto_arima for better parameter selection
try:
    from pmdarima import auto_arima
    AUTO_ARIMA_AVAILABLE = True
except ImportError:
    AUTO_ARIMA_AVAILABLE = False
    print("Note: pmdarima not available. Using grid search for optimal parameters.")

# Suppress warnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning, message="Non-invertible starting MA parameters found.")
warnings.filterwarnings("ignore", category=UserWarning, message="Non-stationary starting autoregressive parameters found.")

def interpret_acf_pacf(series, lags=40, sig_level=0.05):
    """Interpret ACF and PACF patterns"""
    try:
        # Compute ACF and PACF
        max_lags = min(lags, len(series)//4)
        acf_vals, acf_conf = acf(series, nlags=max_lags, alpha=sig_level)
        pacf_vals, pacf_conf = pacf(series, nlags=max_lags, alpha=sig_level)

        # Confidence intervals (upper and lower)
        acf_upper = acf_conf[:, 1] - acf_vals
        pacf_upper = pacf_conf[:, 1] - pacf_vals

        # Find where ACF and PACF become insignificant
        acf_cutoff = next((i for i, val in enumerate(acf_vals[1:], 1) if abs(val) < acf_upper[i]), max_lags)
        pacf_cutoff = next((i for i, val in enumerate(pacf_vals[1:], 1) if abs(val) < pacf_upper[i]), max_lags)
    except:
        return "Error in ACF/PACF computation"

    # Basic interpretation
    interpretation = []

    # Check for non-stationarity
    if len(acf_vals) > 10 and acf_vals[1] > 0.9:
        interpretation.append("Series appears non-stationary (slow ACF decay) - differencing needed.")

    # AR process identification
    if pacf_cutoff <= 3 and acf_cutoff > 5:
        interpretation.append(f"Suggests AR({pacf_cutoff}) process - PACF cuts off, ACF decays.")
    # MA process identification
    elif acf_cutoff <= 3 and pacf_cutoff > 5:
        interpretation.append(f"Suggests MA({acf_cutoff}) process - ACF cuts off, PACF decays.")
    # ARMA process
    elif acf_cutoff > 5 and pacf_cutoff > 5:
        interpretation.append("Suggests ARMA process - both ACF and PACF decay gradually.")
    else:
        interpretation.append("Mixed pattern - consider ARIMA model.")

    return "\\n".join(interpretation)

# Generate realistic time series that ARIMA can fit well
def generate_perfect_arima_data():
    """Generate time series data that ARIMA can fit near-perfectly"""
    np.random.seed(42)
    n = 500

    # Create a more predictable time series with clear patterns
    # This will allow ARIMA to achieve near-perfect fits

    # Option 1: Generate from known ARIMA process
    ar_params = np.array([0.75, -0.25])  # AR(2) parameters
    ma_params = np.array([0.65, 0.35])   # MA(2) parameters
    ar = np.r_[1, -ar_params]
    ma = np.r_[1, ma_params]

    try:
        process = ArmaProcess(ar, ma)
        if process.isstationary:
            arma_sample = process.generate_sample(nsample=n, scale=1.0)
        else:
            # Fallback to simpler process
            arma_sample = np.random.randn(n)
    except:
        arma_sample = np.random.randn(n)

    # Add trend and seasonality for integration
    trend = np.linspace(0, 10, n)
    seasonal = 2 * np.sin(2 * np.pi * np.arange(n) / 50) + 1.5 * np.cos(2 * np.pi * np.arange(n) / 25)

    # Combine components
    series = trend + seasonal + arma_sample

    # Apply one level of integration to make it I(1)
    series = np.cumsum(series - np.mean(series))

    return series

# Generate the perfect ARIMA data
series = generate_perfect_arima_data()
n = len(series)

# Split data
train_size = int(n * 0.8)
train = series[:train_size]
test = series[train_size:]

# Find optimal parameters using auto_arima if available, otherwise grid search
print("Finding optimal ARIMA parameters...")

if AUTO_ARIMA_AVAILABLE:
    try:
        # Use auto_arima for better parameter selection
        auto_model = auto_arima(train,
                               start_p=0, start_q=0, max_p=5, max_q=5, max_d=2,
                               seasonal=False, stepwise=True, suppress_warnings=True,
                               error_action='ignore', max_order=None, trace=False)
        best_order = auto_model.order
        best_aic = auto_model.aic()
        print(f"Auto-ARIMA selected: {best_order} with AIC = {best_aic:.2f}")
    except:
        best_order = (2, 1, 2)
        best_aic = 0
        print("Auto-ARIMA failed, using default (2,1,2)")
else:
    # Grid search fallback
    best_aic = np.inf
    best_order = (2, 1, 2)  # Default

    for p in range(6):
        for d in range(3):
            for q in range(6):
                try:
                    model = ARIMA(train, order=(p, d, q))
                    model_fit = model.fit()
                    aic = model_fit.aic
                    if aic < best_aic:
                        best_aic = aic
                        best_order = (p, d, q)
                except:
                    pass

    print(f"Grid search selected: {best_order} with AIC = {best_aic:.2f}")

# True parameters (what we know works well for this data)
true_p, true_d, true_q = best_order

def plot_arima_perfect_fit(fitted_p, fitted_d, fitted_q):
    """Plot ARIMA with perfect curve fitting focus"""

    # Clear previous output
    clear_output(wait=True)

    try:
        # Fit ARIMA model
        model = ARIMA(train, order=(fitted_p, fitted_d, fitted_q))
        model_fit = model.fit()

        # Get fitted values for training period
        fitted_values = model_fit.fittedvalues

        # Forecast for test period
        forecast = model_fit.forecast(steps=len(test))

        # Get prediction intervals
        forecast_result = model_fit.get_forecast(steps=len(test))
        forecast_ci = forecast_result.conf_int()

    except Exception as e:
        print(f"Error fitting ARIMA({fitted_p},{fitted_d},{fitted_q}): {e}")
        return

    # Calculate error metrics
    mse = np.mean((test - forecast)**2)
    mae = np.mean(np.abs(test - forecast))

    # Calculate fit quality for training data
    if len(fitted_values) > 0:
        train_subset = train[-len(fitted_values):]
        train_mse = np.mean((train_subset - fitted_values)**2)
        train_mae = np.mean(np.abs(train_subset - fitted_values))
    else:
        train_mse = train_mae = float('inf')

    # Create comprehensive plot
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle(f'ARIMA({fitted_p}, {fitted_d}, {fitted_q}) - Perfect Curve Fitting Analysis', fontsize=16, fontweight='bold')

    # Main time series plot with perfect overlap focus
    ax1 = axes[0, 0]

    # Plot original data
    ax1.plot(range(len(series)), series, 'b-', label='Original Data', linewidth=2, alpha=0.8)

    # Plot fitted values (training)
    if len(fitted_values) > 0:
        fit_start_idx = len(train) - len(fitted_values)
        fit_indices = range(fit_start_idx, len(train))
        ax1.plot(fit_indices, fitted_values, 'r-', label='Fitted (Training)', linewidth=3, alpha=0.9)

    # Plot forecast (testing) - this should overlap perfectly with original data
    test_indices = range(len(train), len(series))
    ax1.plot(test_indices, forecast, 'orange', label='Forecast (Testing)',
             linewidth=4, alpha=0.9, linestyle='--')

    # Add confidence intervals
    try:
        if hasattr(forecast_ci, 'iloc'):
            # If it's a DataFrame
            ax1.fill_between(test_indices, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1],
                             color='orange', alpha=0.2, label='95% Confidence Interval')
        else:
            # If it's a numpy array
            ax1.fill_between(test_indices, forecast_ci[:, 0], forecast_ci[:, 1],
                             color='orange', alpha=0.2, label='95% Confidence Interval')
    except Exception as e:
        print(f"Could not plot confidence intervals: {e}")
        # Continue without confidence intervals

    # Mark train/test split
    ax1.axvline(x=train_size, color='black', linestyle=':', linewidth=2, alpha=0.7, label='Train/Test Split')

    ax1.set_title('Time Series with ARIMA Fit & Forecast')
    ax1.set_xlabel('Time')
    ax1.set_ylabel('Value')
    ax1.legend(loc='upper left')
    ax1.grid(True, alpha=0.3)

    # Add performance metrics as text
    metrics_text = f'Test MSE: {mse:.4f}\nTest MAE: {mae:.4f}\nTrain MSE: {train_mse:.4f}\nTrain MAE: {train_mae:.4f}'
    ax1.text(0.02, 0.98, metrics_text, transform=ax1.transAxes,
             bbox=dict(boxstyle="round,pad=0.5", facecolor="lightgreen", alpha=0.8),
             fontsize=10, verticalalignment='top', fontweight='bold')

    # Residuals plot to check model fit quality
    ax2 = axes[0, 1]
    if len(fitted_values) > 0:
        residuals = train_subset - fitted_values
        ax2.plot(residuals, 'g-', alpha=0.7)
        ax2.axhline(y=0, color='red', linestyle='--', alpha=0.7)
        ax2.set_title('Training Residuals')
        ax2.set_xlabel('Time')
        ax2.set_ylabel('Residuals')
        ax2.grid(True, alpha=0.3)

        # Add residual statistics
        res_std = np.std(residuals)
        ax2.text(0.02, 0.98, f'Residual Std: {res_std:.4f}', transform=ax2.transAxes,
                bbox=dict(boxstyle="round,pad=0.3", facecolor="lightblue", alpha=0.7),
                fontsize=9, verticalalignment='top')

    # ACF plot
    ax3 = axes[1, 0]
    try:
        plot_acf(train, ax=ax3, lags=min(40, len(train)//4))
        ax3.set_title('ACF of Training Data')
    except:
        ax3.text(0.5, 0.5, 'ACF plot error', ha='center', va='center', transform=ax3.transAxes)

    # PACF plot
    ax4 = axes[1, 1]
    try:
        plot_pacf(train, ax=ax4, lags=min(40, len(train)//4))
        ax4.set_title('PACF of Training Data')
    except:
        ax4.text(0.5, 0.5, 'PACF plot error', ha='center', va='center', transform=ax4.transAxes)

    plt.tight_layout()
    plt.show()

    # Display detailed results
    results_output = Output()
    with results_output:
        print("=" * 80)
        print(f"ARIMA({fitted_p}, {fitted_d}, {fitted_q}) PERFECT FIT ANALYSIS")
        print("=" * 80)
        print(f"Optimal Parameters (by AIC): {best_order}")
        print(f"Current Parameters: ({fitted_p}, {fitted_d}, {fitted_q})")
        print()
        print("PERFORMANCE METRICS:")
        print(f"├── Test Forecast MSE: {mse:.6f}")
        print(f"├── Test Forecast MAE: {mae:.6f}")
        print(f"├── Training Fit MSE: {train_mse:.6f}")
        print(f"└── Training Fit MAE: {train_mae:.6f}")
        print()

        # Fit quality assessment
        if mse < 1.0 and mae < 0.8:
            print("🎯 EXCELLENT FIT: Near-perfect curve overlap achieved!")
        elif mse < 5.0 and mae < 2.0:
            print("✅ GOOD FIT: Curves align well with minor deviations")
        else:
            print("⚠️  POOR FIT: Try adjusting parameters for better overlap")

        print()
        print("CURVE OVERLAP ANALYSIS:")
        overlap_score = max(0, 100 - (mse * 10 + mae * 5))
        print(f"└── Overlap Score: {overlap_score:.1f}% (higher is better)")

    display(results_output)

    # ACF/PACF interpretation
    interp_output = Output()
    with interp_output:
        print("\nACF/PACF INTERPRETATION:")
        print("-" * 40)
        print(interpret_acf_pacf(train))
        print()
        print("OPTIMIZATION TIPS:")
        print(f"• Current AIC would be: ~{model_fit.aic:.2f}")
        print(f"• Optimal parameters: {best_order}")
        print("• For perfect overlap, try parameters close to optimal")
        print("• Lower MSE/MAE = better curve alignment")
    display(interp_output)

    # Raw data display
    data_output = Output()
    with data_output:
        print("\nRAW DATA SAMPLES:")
        print("-" * 20)
        print("First 10 values:", series[:10])
        print("Last 10 values:", series[-10:])
        print("Training size:", len(train))
        print("Test size:", len(test))
    display(data_output)

# Create the interactive widget with optimal starting values
print("\n🚀 INTERACTIVE ARIMA PERFECT CURVE FITTING")
print("=" * 50)
print(f"Data generated with near-perfect ARIMA characteristics")
print(f"Recommended starting parameters: {best_order}")
print("Adjust sliders to achieve perfect curve overlap!")
print("=" * 50)

# Use the optimal parameters as starting values for perfect initial fit
interact(
    plot_arima_perfect_fit,
    fitted_p=IntSlider(min=0, max=5, step=1, value=best_order[0], description='AR (p)'),
    fitted_d=IntSlider(min=0, max=2, step=1, value=best_order[1], description='Diff (d)'),
    fitted_q=IntSlider(min=0, max=5, step=1, value=best_order[2], description='MA (q)')
)
Finding optimal ARIMA parameters...
Auto-ARIMA selected: (3, 2, 0) with AIC = 1190.65

🚀 INTERACTIVE ARIMA PERFECT CURVE FITTING
==================================================
Data generated with near-perfect ARIMA characteristics
Recommended starting parameters: (3, 2, 0)
Adjust sliders to achieve perfect curve overlap!
==================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf, adfuller
from ipywidgets import interact, IntSlider, Output, VBox, HBox
from IPython.display import display, clear_output
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning

# Try to import auto_arima for better parameter selection
try:
    from pmdarima import auto_arima
    AUTO_ARIMA_AVAILABLE = True
except ImportError:
    AUTO_ARIMA_AVAILABLE = False
    print("Note: pmdarima not available. Using grid search for optimal parameters.")

# Suppress warnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning, message="Non-invertible starting MA parameters found.")
warnings.filterwarnings("ignore", category=UserWarning, message="Non-stationary starting autoregressive parameters found.")

def comprehensive_acf_pacf_interpretation(series, fitted_p, fitted_d, fitted_q, max_lags=50):
    """
    Comprehensive interpretation of ACF and PACF patterns

    ACF (Autocorrelation Function):
    - Measures correlation between observations at different time lags
    - Shows how current value relates to past values
    - For AR processes: Decays gradually (exponentially or sinusoidally)
    - For MA processes: Cuts off sharply after q lags
    - For non-stationary: Very slow decay, stays high for many lags

    PACF (Partial Autocorrelation Function):
    - Measures direct correlation between observations k periods apart
    - Removes intermediate correlations
    - For AR processes: Cuts off sharply after p lags
    - For MA processes: Decays gradually
    """

    interpretation = []
    interpretation.append("=" * 80)
    interpretation.append("📊 COMPREHENSIVE ACF & PACF ANALYSIS")
    interpretation.append("=" * 80)

    # Apply differencing if specified
    import pandas as pd

    # Convert to pandas Series if it's numpy array
    if isinstance(series, np.ndarray):
        analyzed_series = pd.Series(series.copy())
    else:
        analyzed_series = series.copy()

    diff_description = "Original series"

    if fitted_d > 0:
        for d in range(fitted_d):
            analyzed_series = analyzed_series.diff().dropna()
            diff_description = f"After {fitted_d} difference(s)"

    interpretation.append(f"Analysis performed on: {diff_description}")
    interpretation.append(f"Series length after differencing: {len(analyzed_series)}")
    interpretation.append("")

    try:
        # Compute ACF and PACF
        max_lags = min(max_lags, len(analyzed_series)//4)
        acf_vals, acf_conf = acf(analyzed_series, nlags=max_lags, alpha=0.05)
        pacf_vals, pacf_conf = pacf(analyzed_series, nlags=max_lags, alpha=0.05)

        # Stationarity test
        try:
            adf_result = adfuller(analyzed_series)
            adf_pvalue = adf_result[1]
            is_stationary = adf_pvalue < 0.05
        except:
            is_stationary = None
            adf_pvalue = None

        interpretation.append("🔍 STATIONARITY ANALYSIS:")
        if is_stationary is not None:
            if is_stationary:
                interpretation.append(f"✅ Series is STATIONARY (ADF p-value: {adf_pvalue:.4f} < 0.05)")
            else:
                interpretation.append(f"❌ Series is NON-STATIONARY (ADF p-value: {adf_pvalue:.4f} > 0.05)")
                interpretation.append("   → Consider more differencing (increase d parameter)")
        else:
            interpretation.append("⚠️ Could not perform stationarity test")

        interpretation.append("")

        # ACF Analysis
        interpretation.append("📈 ACF (Autocorrelation Function) ANALYSIS:")
        interpretation.append("   What ACF tells us:")
        interpretation.append("   • Measures correlation between observations k periods apart")
        interpretation.append("   • Shows persistence and memory in the time series")
        interpretation.append("   • Values outside blue confidence bands are significant")
        interpretation.append("")

        # Find significant ACF lags
        acf_significant_lags = []
        for i in range(1, min(len(acf_vals), 20)):
            if abs(acf_vals[i]) > abs(acf_conf[i, 1] - acf_vals[i]):
                acf_significant_lags.append(i)

        # ACF pattern analysis
        if len(acf_vals) > 5:
            acf_first_few = acf_vals[1:6]  # First 5 lags

            if acf_vals[1] > 0.7:
                interpretation.append("   🔴 VERY HIGH ACF at lag 1 → Strong persistence/trend")
                interpretation.append("      → Series may need differencing")
            elif acf_vals[1] > 0.3:
                interpretation.append("   🟡 MODERATE ACF at lag 1 → Some persistence")
            else:
                interpretation.append("   🟢 LOW ACF at lag 1 → Good for modeling")

            # Check for gradual decay vs sharp cutoff
            if len(acf_significant_lags) > 5:
                interpretation.append("   📉 ACF decays GRADUALLY → Suggests AR component")
            elif len(acf_significant_lags) <= 3:
                interpretation.append(f"   📉 ACF cuts off after lag {max(acf_significant_lags) if acf_significant_lags else 0} → Suggests MA component")

            interpretation.append(f"   📊 Significant ACF lags: {acf_significant_lags[:10]}")  # Show first 10

        interpretation.append("")

        # PACF Analysis
        interpretation.append("📉 PACF (Partial Autocorrelation Function) ANALYSIS:")
        interpretation.append("   What PACF tells us:")
        interpretation.append("   • Measures DIRECT correlation between observations k periods apart")
        interpretation.append("   • Removes influence of intermediate lags")
        interpretation.append("   • Identifies the order of AR processes")
        interpretation.append("")

        # Find significant PACF lags
        pacf_significant_lags = []
        for i in range(1, min(len(pacf_vals), 20)):
            if abs(pacf_vals[i]) > abs(pacf_conf[i, 1] - pacf_vals[i]):
                pacf_significant_lags.append(i)

        if len(pacf_vals) > 5:
            # PACF pattern analysis
            if len(pacf_significant_lags) <= 3 and len(pacf_significant_lags) > 0:
                max_pacf_lag = max(pacf_significant_lags)
                interpretation.append(f"   📉 PACF cuts off after lag {max_pacf_lag} → Suggests AR({max_pacf_lag}) component")
            elif len(pacf_significant_lags) > 5:
                interpretation.append("   📈 PACF decays GRADUALLY → Suggests MA component")
            else:
                interpretation.append("   📊 PACF shows mixed pattern")

            interpretation.append(f"   📊 Significant PACF lags: {pacf_significant_lags[:10]}")  # Show first 10

        interpretation.append("")

        # Model Identification
        interpretation.append("🎯 MODEL IDENTIFICATION GUIDELINES:")
        interpretation.append("   Based on ACF/PACF patterns:")
        interpretation.append("")

        # AR Process identification
        if len(acf_significant_lags) > 3 and len(pacf_significant_lags) <= 3:
            if pacf_significant_lags:
                suggested_p = max(pacf_significant_lags)
                interpretation.append(f"   🎯 Pattern suggests: AR({suggested_p}) process")
                interpretation.append("      → ACF decays gradually, PACF cuts off")
                interpretation.append(f"      → Try p = {suggested_p}, d = {fitted_d}, q = 0")

        # MA Process identification
        elif len(acf_significant_lags) <= 3 and len(pacf_significant_lags) > 3:
            if acf_significant_lags:
                suggested_q = max(acf_significant_lags)
                interpretation.append(f"   🎯 Pattern suggests: MA({suggested_q}) process")
                interpretation.append("      → ACF cuts off, PACF decays gradually")
                interpretation.append(f"      → Try p = 0, d = {fitted_d}, q = {suggested_q}")

        # ARMA Process identification
        elif len(acf_significant_lags) > 3 and len(pacf_significant_lags) > 3:
            interpretation.append("   🎯 Pattern suggests: ARMA process")
            interpretation.append("      → Both ACF and PACF decay gradually")
            interpretation.append(f"      → Try mixed model like ARIMA({fitted_p},{fitted_d},{fitted_q})")

        # White Noise
        elif len(acf_significant_lags) == 0 and len(pacf_significant_lags) == 0:
            interpretation.append("   🎯 Pattern suggests: WHITE NOISE")
            interpretation.append("      → No significant autocorrelations")
            interpretation.append("      → Series may already be well-modeled")

        interpretation.append("")

        # Current Model Assessment
        interpretation.append(f"📋 CURRENT MODEL ASSESSMENT: ARIMA({fitted_p},{fitted_d},{fitted_q})")

        if fitted_p > 0:
            if len(pacf_significant_lags) > 0 and max(pacf_significant_lags) <= fitted_p:
                interpretation.append(f"   ✅ AR({fitted_p}): Good choice - PACF supports this order")
            else:
                interpretation.append(f"   ⚠️ AR({fitted_p}): May be over/under-specified")

        if fitted_q > 0:
            if len(acf_significant_lags) > 0 and max(acf_significant_lags) <= fitted_q:
                interpretation.append(f"   ✅ MA({fitted_q}): Good choice - ACF supports this order")
            else:
                interpretation.append(f"   ⚠️ MA({fitted_q}): May be over/under-specified")

        if fitted_d > 0:
            if is_stationary:
                interpretation.append(f"   ✅ I({fitted_d}): Series is now stationary")
            else:
                interpretation.append(f"   ⚠️ I({fitted_d}): Series may need more differencing")

    except Exception as e:
        interpretation.append(f"❌ Error in ACF/PACF analysis: {e}")

    interpretation.append("")
    interpretation.append("💡 QUICK REFERENCE:")
    interpretation.append("   • ACF decays slowly → Non-stationary (need differencing)")
    interpretation.append("   • ACF decays quickly, PACF cuts off → AR process")
    interpretation.append("   • ACF cuts off, PACF decays → MA process")
    interpretation.append("   • Both decay gradually → ARMA process")
    interpretation.append("   • Both are insignificant → White noise")

    return "\n".join(interpretation)

def generate_realistic_time_series():
    """
    Generate realistic time series with clear trend, seasonality, and noise
    This will make differencing effects more obvious
    """
    np.random.seed(42)

    # Create longer time series (5 years of daily data)
    n = 1825  # 5 years * 365 days
    t = np.arange(n)

    # 1. Strong linear trend
    trend = 0.05 * t + 100

    # 2. Multiple seasonal patterns
    # Annual seasonality (365 days)
    annual_seasonal = 15 * np.sin(2 * np.pi * t / 365) + 10 * np.cos(2 * np.pi * t / 365)

    # Weekly seasonality (7 days)
    weekly_seasonal = 5 * np.sin(2 * np.pi * t / 7) + 3 * np.cos(2 * np.pi * t / 7)

    # Monthly-like pattern (30 days)
    monthly_seasonal = 8 * np.sin(2 * np.pi * t / 30) + 6 * np.cos(2 * np.pi * t / 30)

    # 3. AR(2) component for autocorrelation
    ar_params = np.array([0.6, -0.2])
    ar = np.r_[1, -ar_params]
    ma = np.r_[1, 0.3, 0.1]  # Small MA component

    try:
        arma_process = ArmaProcess(ar, ma)
        if arma_process.isstationary:
            ar_component = arma_process.generate_sample(nsample=n, scale=2.0)
        else:
            ar_component = np.random.randn(n) * 2.0
    except:
        ar_component = np.random.randn(n) * 2.0

    # 4. Random noise
    noise = np.random.normal(0, 3, n)

    # 5. Combine all components
    series = trend + annual_seasonal + weekly_seasonal + monthly_seasonal + ar_component + noise

    # 6. Add some structural breaks for realism
    # Sudden level shift at year 2
    series[730:] += 20

    # Gradual change in trend after year 3
    series[1095:] += 0.02 * np.arange(len(series) - 1095)

    return series

# Generate realistic data
print("Generating realistic time series data with trend, seasonality, and autocorrelation...")
series = generate_realistic_time_series()
series = pd.Series(series)  # Convert to pandas Series for easier manipulation
n = len(series)

print(f"Generated time series with {n} observations (approximately 5 years of daily data)")
print(f"Data includes: Linear trend, annual seasonality, weekly patterns, AR structure, and noise")

# Split data (use more recent data for testing)
train_size = int(n * 0.85)  # Use 85% for training
train = series[:train_size]
test = series[train_size:]

print(f"Training data: {len(train)} observations")
print(f"Test data: {len(test)} observations")

# Find optimal parameters
print("\nFinding optimal ARIMA parameters...")

if AUTO_ARIMA_AVAILABLE:
    try:
        print("Using auto-ARIMA for parameter selection...")
        auto_model = auto_arima(train,
                               start_p=0, start_q=0, max_p=5, max_q=5, max_d=2,
                               seasonal=False, stepwise=True, suppress_warnings=True,
                               error_action='ignore', max_order=None, trace=True)
        best_order = auto_model.order
        best_aic = auto_model.aic()
        print(f"Auto-ARIMA selected: {best_order} with AIC = {best_aic:.2f}")
    except Exception as e:
        print(f"Auto-ARIMA failed: {e}")
        best_order = (2, 1, 1)
        best_aic = 0
        print("Using default parameters (2,1,1)")
else:
    print("Using grid search for parameter selection...")
    best_aic = np.inf
    best_order = (2, 1, 1)  # Default

    for p in range(4):  # 0 to 3
        for d in range(3):  # 0 to 2
            for q in range(4):  # 0 to 3
                try:
                    model = ARIMA(train, order=(p, d, q))
                    model_fit = model.fit()
                    aic = model_fit.aic
                    if aic < best_aic:
                        best_aic = aic
                        best_order = (p, d, q)
                except:
                    pass

    print(f"Grid search selected: {best_order} with AIC = {best_aic:.2f}")

def plot_comprehensive_arima_analysis(fitted_p, fitted_d, fitted_q):
    """Enhanced plotting with comprehensive analysis"""

    clear_output(wait=True)

    try:
        # Fit ARIMA model
        model = ARIMA(train, order=(fitted_p, fitted_d, fitted_q))
        model_fit = model.fit()

        # Get fitted values and forecast
        fitted_values = model_fit.fittedvalues
        forecast = model_fit.forecast(steps=len(test))

        # Get prediction intervals
        forecast_result = model_fit.get_forecast(steps=len(test))
        forecast_ci = forecast_result.conf_int()

    except Exception as e:
        print(f"❌ Error fitting ARIMA({fitted_p},{fitted_d},{fitted_q}): {e}")
        print("Try different parameters or check data quality")
        return

    # Calculate metrics
    mse = np.mean((test - forecast)**2)
    mae = np.mean(np.abs(test - forecast))

    if len(fitted_values) > 0:
        train_subset = train[-len(fitted_values):]
        train_mse = np.mean((train_subset - fitted_values)**2)
        train_mae = np.mean(np.abs(train_subset - fitted_values))
    else:
        train_mse = train_mae = float('inf')

    # Create comprehensive visualization
    fig = plt.figure(figsize=(20, 16))

    # Main time series plot
    ax1 = plt.subplot(3, 2, (1, 2))  # Spans 2 columns

    # Plot full series with different colors for train/test
    ax1.plot(range(len(train)), train, 'b-', label='Training Data', linewidth=1.5, alpha=0.8)
    ax1.plot(range(len(train), len(series)), test, 'g-', label='Test Data (Actual)', linewidth=2, alpha=0.9)

    # Plot fitted values
    if len(fitted_values) > 0:
        fit_start_idx = len(train) - len(fitted_values)
        fit_indices = range(fit_start_idx, len(train))
        ax1.plot(fit_indices, fitted_values, 'r-', label='Fitted (Training)', linewidth=2, alpha=0.9)

    # Plot forecast
    test_indices = range(len(train), len(series))
    ax1.plot(test_indices, forecast, 'orange', label='Forecast (Test)',
             linewidth=3, alpha=0.9, linestyle='--')

    # Add confidence intervals
    try:
        if hasattr(forecast_ci, 'iloc'):
            ax1.fill_between(test_indices, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1],
                             color='orange', alpha=0.2, label='95% Confidence Interval')
        else:
            ax1.fill_between(test_indices, forecast_ci[:, 0], forecast_ci[:, 1],
                             color='orange', alpha=0.2, label='95% Confidence Interval')
    except:
        pass

    # Mark train/test split
    ax1.axvline(x=len(train), color='black', linestyle=':', linewidth=2, alpha=0.7, label='Train/Test Split')

    ax1.set_title(f'ARIMA({fitted_p},{fitted_d},{fitted_q}) Analysis - Realistic Time Series with Trend & Seasonality', fontsize=14, fontweight='bold')
    ax1.set_xlabel('Time (Days)')
    ax1.set_ylabel('Value')
    ax1.legend(loc='upper left')
    ax1.grid(True, alpha=0.3)

    # Performance metrics
    metrics_text = f'Test MSE: {mse:.2f}\nTest MAE: {mae:.2f}\nTraining MSE: {train_mse:.2f}\nTraining MAE: {train_mae:.2f}'
    color = 'lightgreen' if mse < 100 else 'yellow' if mse < 500 else 'lightcoral'
    ax1.text(0.02, 0.98, metrics_text, transform=ax1.transAxes,
             bbox=dict(boxstyle="round,pad=0.5", facecolor=color, alpha=0.8),
             fontsize=10, verticalalignment='top', fontweight='bold')

    # ACF plot
    ax2 = plt.subplot(3, 2, 3)
    try:
        plot_acf(train, ax=ax2, lags=min(50, len(train)//10), alpha=0.05)
        ax2.set_title('ACF - Training Data (Original)')
    except:
        ax2.text(0.5, 0.5, 'Error plotting ACF', ha='center', va='center', transform=ax2.transAxes)

    # PACF plot
    ax3 = plt.subplot(3, 2, 4)
    try:
        plot_pacf(train, ax=ax3, lags=min(50, len(train)//10), alpha=0.05)
        ax3.set_title('PACF - Training Data (Original)')
    except:
        ax3.text(0.5, 0.5, 'Error plotting PACF', ha='center', va='center', transform=ax3.transAxes)

    # ACF of differenced series (if d > 0)
    ax4 = plt.subplot(3, 2, 5)
    if fitted_d > 0:
        try:
            diff_series = pd.Series(train.copy()) if isinstance(train, np.ndarray) else train.copy()
            for _ in range(fitted_d):
                diff_series = diff_series.diff().dropna()
            plot_acf(diff_series, ax=ax4, lags=min(40, len(diff_series)//10), alpha=0.05)
            ax4.set_title(f'ACF - After {fitted_d} Differencing')
        except:
            ax4.text(0.5, 0.5, 'Error plotting differenced ACF', ha='center', va='center', transform=ax4.transAxes)
    else:
        ax4.text(0.5, 0.5, 'No differencing applied (d=0)', ha='center', va='center', transform=ax4.transAxes)
        ax4.set_title('Differenced Series ACF (d=0)')

    # PACF of differenced series (if d > 0)
    ax5 = plt.subplot(3, 2, 6)
    if fitted_d > 0:
        try:
            diff_series = train.copy()
            for _ in range(fitted_d):
                diff_series = diff_series.diff().dropna()
            plot_pacf(diff_series, ax=ax5, lags=min(40, len(diff_series)//10), alpha=0.05)
            ax5.set_title(f'PACF - After {fitted_d} Differencing')
        except:
            ax5.text(0.5, 0.5, 'Error plotting differenced PACF', ha='center', va='center', transform=ax5.transAxes)
    else:
        ax5.text(0.5, 0.5, 'No differencing applied (d=0)', ha='center', va='center', transform=ax5.transAxes)
        ax5.set_title('Differenced Series PACF (d=0)')

    plt.tight_layout()
    plt.show()

    # Comprehensive ACF/PACF interpretation
    interpretation_output = Output()
    with interpretation_output:
        interpretation = comprehensive_acf_pacf_interpretation(train, fitted_p, fitted_d, fitted_q)
        print(interpretation)
    display(interpretation_output)

# Create interactive widget
print("\n" + "="*70)
print("🚀 INTERACTIVE COMPREHENSIVE ARIMA ANALYSIS")
print("="*70)
print("Features:")
print("• Realistic time series with trend, seasonality, and autocorrelation")
print("• Comprehensive ACF/PACF interpretation and model guidance")
print("• Comparison of original vs differenced series")
print("• Detailed stationarity analysis")
print("• Model selection recommendations")
print(f"• Starting with optimal parameters: {best_order}")
print("="*70)

# Interactive widget with optimal starting values
interact(
    plot_comprehensive_arima_analysis,
    fitted_p=IntSlider(min=0, max=5, step=1, value=best_order[0], description='AR (p)'),
    fitted_d=IntSlider(min=0, max=3, step=1, value=best_order[1], description='Diff (d)'),
    fitted_q=IntSlider(min=0, max=5, step=1, value=best_order[2], description='MA (q)')
)
Generating realistic time series data with trend, seasonality, and autocorrelation...
Generated time series with 1825 observations (approximately 5 years of daily data)
Data includes: Linear trend, annual seasonality, weekly patterns, AR structure, and noise
Training data: 1551 observations
Test data: 274 observations

Finding optimal ARIMA parameters...
Using auto-ARIMA for parameter selection...
Performing stepwise search to minimize aic
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=10137.459, Time=0.07 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=10136.877, Time=0.19 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=10136.635, Time=0.29 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=10135.588, Time=0.05 sec
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=10138.514, Time=0.62 sec

Best model:  ARIMA(0,1,0)(0,0,0)[0]          
Total fit time: 1.232 seconds
Auto-ARIMA selected: (0, 1, 0) with AIC = 10135.59

======================================================================
🚀 INTERACTIVE COMPREHENSIVE ARIMA ANALYSIS
======================================================================
Features:
• Realistic time series with trend, seasonality, and autocorrelation
• Comprehensive ACF/PACF interpretation and model guidance
• Comparison of original vs differenced series
• Detailed stationarity analysis
• Model selection recommendations
• Starting with optimal parameters: (0, 1, 0)
======================================================================

ACF, PACF, and Differencing#


🔹 1. Vertical lines in ACF and PACF#

When you plot the Autocorrelation Function (ACF) or Partial Autocorrelation Function (PACF):

  • The blue shaded region or dashed confidence bands represent the range where autocorrelations are not statistically significant (usually at 95% confidence).

  • The vertical lines (spikes) show the correlation (ACF) or partial correlation (PACF) at different lags.

Interpretation:

  • If a spike goes beyond the confidence band, it means the autocorrelation at that lag is significant.

  • If the spikes die out gradually, it indicates persistence in the series (likely non-stationary).

  • If spikes cut off sharply (after lag 1 or 2), it suggests an AR or MA process.

Examples:

  • ACF slowly decaying \(\;\;\rightarrow\;\) likely an AR process or non-stationary series.

  • PACF cuts off after lag 1 \(\;\;\rightarrow\;\) AR(1).

  • ACF cuts off after lag 1 \(\;\;\rightarrow\;\) MA(1).
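To see those spikes and confidence bands for yourself, here is a minimal sketch on a toy AR(1) series (the 0.7 coefficient is an arbitrary choice for illustration):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Toy AR(1) series: x_t = 0.7 * x_{t-1} + noise
np.random.seed(0)
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.7 * x[t - 1] + np.random.randn()

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(x, ax=axes[0], lags=30)   # should decay gradually
plot_pacf(x, ax=axes[1], lags=30)  # should cut off after lag 1
plt.show()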


🔹 2. Example of Differencing#

Differencing is used to make a non-stationary time series stationary by removing trend.

Suppose we have a simple time series with a trend:

\[ Y_t = t + \text{noise} \]

| Time \((t)\) | Value \((Y_t)\) |
|--------------|-----------------|
| 1 | 5 |
| 2 | 7 |
| 3 | 9 |
| 4 | 11 |
| 5 | 13 |

Clearly, it has a linear upward trend.

First Difference:#

\[ Y'_t = Y_t - Y_{t-1} \]

| Time \((t)\) | \(Y_t\) | Difference \((Y'_t)\) |
|--------------|---------|------------------------|
| 2 | 7 | 2 |
| 3 | 9 | 2 |
| 4 | 11 | 2 |
| 5 | 13 | 2 |

Now the series is stationary (all values \(=2\), i.e., no trend).
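The same arithmetic in one line, if you want to verify it with numpy:

import numpy as np

y = np.array([5, 7, 9, 11, 13])  # the trending toy series above
print(np.diff(y))                # [2 2 2 2]: trend removed, series constant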


Takeaway:

  • Vertical lines in ACF/PACF \(\rightarrow\) significance of correlations at lags.

  • Differencing \(\rightarrow\) removes trends and makes series stationary.

# Your code here