Ordinary Least Squares (OLS) & Normal Equations - Machine Learning for Business

Because sometimes, math can solve your problem faster than hiking down a loss valley. 😌

🎯 What Is OLS?¶

So far, we’ve made our poor regression model stumble downhill with gradients, slowly minimizing error. But what if we could just jump straight to the bottom of the valley — no hiking boots required? 👟

That’s Ordinary Least Squares (OLS).

It’s the “shortcut” method that says:

“We can find the perfect slope and intercept directly — with pure algebra.”

🧠 The Idea¶

OLS minimizes the same thing Gradient Descent does — Mean Squared Error (MSE):

[ J(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 ]

But instead of learning gradually, we solve for β directly by setting derivatives to zero. Basically, we say:

“We want the slope where error stops changing.”

That gives us the Normal Equation:

[ \hat{\beta} = (X^T X)^{-1} X^T y ]

Where:

( X ): matrix of features (with a column of ones for intercept)
( y ): target variable
( \hat{\beta} ): the optimal coefficients

🧾 Why “Normal” Equation?¶

Because it comes from setting the derivative of the loss function to zero (a “normal” condition for optimality).

Or, as a data scientist might explain to an executive:

“Normal equations make your regression weights perfectly balanced — like all things should be.” 😎

🧮 A Simple Example¶

Suppose we’re modeling:

Sales = β₀ + β₁ × TV_Ad_Spend

import numpy as np

# Example data
X = np.array([[1, 230],
              [1, 44],
              [1, 17],
              [1, 151],
              [1, 180]])  # add column of 1s for intercept

y = np.array([22, 10, 7, 18, 20])

# OLS via Normal Equation
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print("Coefficients:", beta)

This gives:

Coefficients: [7.03, 0.07]

Meaning:

Even with $0 ad spend, you get baseline sales ≈ 7.03 Every additional$ 1 in TV ads adds about $0.07 in sales. 📺💰

🧩 Intuition: A Line that Minimizes Apologies¶

Think of OLS as drawing the “least embarrassing line” through your scatter plot:

For each point, the vertical error (residual) is the model’s mistake.
OLS finds the line that makes the sum of squared mistakes as small as possible.

No gradient descent, no random initialization, no drama. Just straight math, no feelings. 🧘‍♀️

⚙️ OLS in Scikit-Learn¶

In practice, you rarely invert matrices yourself. You can let scikit-learn handle it faster (and with numerical stability):

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X[:, 1].reshape(-1, 1), y)

print("Intercept:", model.intercept_)
print("Slope:", model.coef_)

📊 Visualising the Fit¶

import matplotlib.pyplot as plt

plt.scatter(X[:, 1], y, label="Actual Sales", alpha=0.8)
plt.plot(X[:, 1], model.predict(X[:, 1].reshape(-1, 1)),
         color="red", label="OLS Fit Line", linewidth=2)
plt.xlabel("TV Advertising Spend ($)")
plt.ylabel("Sales ($)")
plt.title("OLS Regression Line – Sales vs TV Spend")
plt.legend()
plt.show()

“When your scatter plot looks like a calm red line — that’s when business harmony is achieved.” 📈☯️

🧮 Matrix Shapes (Quick Reference)¶

Symbol	Meaning	Shape
( X )	Feature matrix	(n_samples, n_features + 1)
( y )	Target vector	(n_samples, 1)
( X^T X )	Square matrix	(n_features + 1, n_features + 1)
( (X^T X)^{-1} )	Inverse	(n_features + 1, n_features + 1)
( \hat{\beta} )	Coefficient vector	(n_features + 1, 1)

“If your matrix dimensions don’t align, your model’s chakras are blocked.” 🧘‍♂️

⚠️ Limitations of OLS¶

Limitation	Description
💻 Computational Cost	Matrix inversion is expensive for large data
💥 Multicollinearity	( X^T X ) can become singular (not invertible)
📏 Assumes Linearity	Works only for linear relationships
📊 Sensitive to Outliers	One crazy data point can tilt your whole model

💡 Business Analogy¶

OLS is like a consultant who instantly gives you the “best-fit” solution — but charges extra if your data is messy or high-dimensional. 💼

Gradient descent, on the other hand, is like an intern who learns slowly but can handle huge data cheaply. 🧑‍💻

📚 Tip for Python Learners¶

If you’re new to Python or NumPy matrix operations, check out my companion book: 👉 Programming for Business It’s like “Python Gym” before you lift machine learning weights. 🏋️‍♂️🐍

🧭 Recap¶

Concept	Description
OLS	Analytical method to minimize squared errors
Normal Equation	Closed-form solution for regression weights
Advantage	No iterative training needed
Disadvantage	Computationally heavy for large datasets
Relation to Gradient Descent	Both minimize same cost — different paths

💬 Final Thought¶

“OLS doesn’t learn — it knows. Like that one kid in class who never studied but still topped the exam.” 😏📚

🔜 Next Up¶

👉 Head to Non-linear & Polynomial Features where we’ll make our linear models curvy and flexible — because business problems rarely run in straight lines. 📈🔀

# Your code here

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def regression_summary_paragraph(model, 
                                alpha: float = 0.05,
                                vif_thresholds: tuple = (5, 10),
                                r2_thresholds: tuple = (0.3, 0.6)) -> str:
    """
    Creates a readable, multi-line summary from a fitted statsmodels OLS model.

    Returns a short, human-readable multi-line string (suitable for printing in
    notebooks or terminals). Keeps the same numeric checks as before but places
    results on separate lines for better readability.
    """
    n = int(model.nobs)
    r2 = model.rsquared
    adj_r2 = model.rsquared_adj
    fval = model.fvalue
    fp = model.f_pvalue

    # R² strength
    if r2 < r2_thresholds[0]:
        r2_qual = "weak"
    elif r2 < r2_thresholds[1]:
        r2_qual = "moderate"
    else:
        r2_qual = "strong"

    overall_sig = "statistically significant" if fp < alpha else "not statistically significant"

    # Coefficients (one line per predictor)
    coef_lines = []
    conf = model.conf_int(alpha).to_numpy()
    for var, coef, pval, ci_low, ci_high in zip(
        model.params.index,
        model.params,
        model.pvalues,
        conf[:, 0],
        conf[:, 1],
    ):
        if var == "const":
            continue
        sig_text = "significant" if pval < alpha else "not significant"
        sign = "+" if coef >= 0 else "-"
        coef_lines.append(
            f"{var}: {coef:+.3f} (p={pval:.4f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]) — {sig_text}"
        )

    # VIF (if applicable)
    vif_line = ""
    if model.df_model > 0:
        X = model.model.exog
        if X.shape[1] > 1:  # more than just intercept
            vif_df = pd.DataFrame({
                "variable": model.model.exog_names,
                "VIF": [variance_inflation_factor(X, i) for i in range(X.shape[1])],
            })
            vif_df = vif_df[vif_df["variable"] != "const"]
            vif_items = []
            for _, row in vif_df.iterrows():
                v = row["VIF"]
                if v < vif_thresholds[0]:
                    q = "low"
                elif v < vif_thresholds[1]:
                    q = "moderate"
                else:
                    q = "high"
                vif_items.append(f"{row['variable']} = {v:.2f} ({q})")
            if vif_items:
                vif_line = "VIFs: " + ", ".join(vif_items)

    # Durbin-Watson
    dw = sm.stats.durbin_watson(model.resid)
    dw_interp = (
        "≈ 2 → no clear autocorrelation"
        if 1.5 <= dw <= 2.5
        else (f"{dw:.2f} → possible positive autocorrelation" if dw < 1.5 else f"{dw:.2f} → possible negative autocorrelation")
    )

    # Assemble multi-line output
    lines = []
    lines.append(f"Observations: {n}  |  R² = {r2:.3f}  (adj R² = {adj_r2:.3f})  —  {r2_qual}")
    lines.append(f"Overall model: {overall_sig}  |  F = {fval:.2f}, p = {fp:.4f}")
    lines.append("")
    lines.append("Coefficients:")
    if coef_lines:
        for cl in coef_lines:
            lines.append(f"  - {cl}")
    else:
        lines.append("  (no predictors besides intercept)")

    if vif_line:
        lines.append("")
        lines.append(vif_line)

    lines.append("")
    lines.append(f"Durbin–Watson: {dw_interp}")

    return "\n".join(lines)

# --- Minimal runnable example using synthetic data
import numpy as np
import pandas as pd

np.random.seed(0)
n = 200
# create a small advertising-style dataset (predictors only)
your_predictors_df = pd.DataFrame({
    "TV": np.random.normal(150, 30, size=n),
    "Radio": np.random.normal(30, 8, size=n),
    "Online": np.random.normal(50, 15, size=n),
})

# generate a target with known coefficients + noise
dependent_series = (
    10.0
    + 0.05 * your_predictors_df["TV"]
    + 0.4 * your_predictors_df["Radio"]
    - 0.02 * your_predictors_df["Online"]
    + np.random.normal(0, 3.0, size=n)
)
dependent_series = pd.Series(dependent_series, name="Sales")

# Fit OLS and show both the statsmodels summary and the human-readable paragraph
X = sm.add_constant(your_predictors_df)
model = sm.OLS(dependent_series, X).fit()

# print(model.summary())
print("\nReadable summary:\n")
print(regression_summary_paragraph(model))


Readable summary:

Observations: 200  |  R² = 0.551  (adj R² = 0.544)  —  moderate
Overall model: statistically significant  |  F = 80.05, p = 0.0000

Coefficients:
  - TV: +0.048 (p=0.0000, 95% CI [0.035, 0.061]) — significant
  - Radio: +0.369 (p=0.0000, 95% CI [0.313, 0.424]) — significant
  - Online: -0.025 (p=0.0657, 95% CI [-0.052, 0.002]) — not significant

VIFs: TV = 1.01 (low), Radio = 1.04 (low), Online = 1.04 (low)

Durbin–Watson: ≈ 2 → no clear autocorrelation

# Run every scenario sequentially (static dashboard view)
for s in SCENARIOS:
    print('\n' + '=' * 80)
    print(f"Running scenario: {s}")
    demo_scenario(s, random_state=0)


================================================================================
Running scenario: high_r2

Observations: 200  |  R² = 0.982  (adj R² = 0.982)  —  strong
Overall model: statistically significant  |  F = 3596.20, p = 0.0000

Coefficients:
  - X1: +2.990 (p=0.0000, 95% CI [2.923, 3.057]) — significant
  - X2: -2.042 (p=0.0000, 95% CI [-2.116, -1.968]) — significant
  - X3: +1.487 (p=0.0000, 95% CI [1.419, 1.554]) — significant

VIFs: X1 = 1.01 (low), X2 = 1.04 (low), X3 = 1.04 (low)

Durbin–Watson: ≈ 2 → no clear autocorrelation

Why this happened:
  - High R²: signal (true coefficients) is large relative to noise.


================================================================================
Running scenario: low_r2

Observations: 200  |  R² = 0.011  (adj R² = -0.004)  —  weak
Overall model: not statistically significant  |  F = 0.74, p = 0.5291

Coefficients:
  - X1: +0.099 (p=0.7725, 95% CI [-0.573, 0.770]) — not significant
  - X2: -0.519 (p=0.1667, 95% CI [-1.257, 0.218]) — not significant
  - X3: -0.082 (p=0.8095, 95% CI [-0.756, 0.591]) — not significant

VIFs: X1 = 1.01 (low), X2 = 1.04 (low), X3 = 1.04 (low)

Durbin–Watson: ≈ 2 → no clear autocorrelation

Why this happened:
  - Low R²: predictors explain little variance because noise dominates.


================================================================================
Running scenario: multicollinearity

Observations: 200  |  R² = 0.908  (adj R² = 0.907)  —  strong
Overall model: statistically significant  |  F = 646.36, p = 0.0000

Coefficients:
  - X1: +3.073 (p=0.0328, 95% CI [0.255, 5.892]) — significant
  - X2: -0.177 (p=0.9058, 95% CI [-3.129, 2.774]) — not significant
  - X3: -0.526 (p=0.0000, 95% CI [-0.661, -0.392]) — significant

VIFs: X1 = 445.32 (high), X2 = 444.90 (high), X3 = 1.04 (low)

Durbin–Watson: ≈ 2 → no clear autocorrelation

Why this happened:
  - High multicollinearity: similar predictors inflate standard errors (high VIF).


================================================================================
Running scenario: insignificant_predictor

Observations: 200  |  R² = 0.661  (adj R² = 0.656)  —  strong
Overall model: statistically significant  |  F = 127.58, p = 0.0000

Coefficients:
  - X1: +1.970 (p=0.0000, 95% CI [1.768, 2.171]) — significant
  - X2: -0.126 (p=0.2637, 95% CI [-0.347, 0.096]) — not significant
  - X3: +0.460 (p=0.0000, 95% CI [0.258, 0.662]) — significant

VIFs: X1 = 1.01 (low), X2 = 1.04 (low), X3 = 1.04 (low)

Durbin–Watson: ≈ 2 → no clear autocorrelation

Why this happened:
  - One predictor has (near) zero coefficient → will be statistically insignificant.


================================================================================
Running scenario: negative_coefficient

Observations: 200  |  R² = 0.876  (adj R² = 0.874)  —  strong
Overall model: statistically significant  |  F = 460.89, p = 0.0000

Coefficients:
  - X1: -2.520 (p=0.0000, 95% CI [-2.655, -2.386]) — significant
  - X2: +0.416 (p=0.0000, 95% CI [0.269, 0.564]) — significant
  - X3: -0.026 (p=0.6986, 95% CI [-0.161, 0.108]) — not significant

VIFs: X1 = 1.01 (low), X2 = 1.04 (low), X3 = 1.04 (low)

Durbin–Watson: ≈ 2 → no clear autocorrelation

Why this happened:
  - Negative coefficient example — sign is determined by the data-generating process.


================================================================================
Running scenario: autocorrelation

Observations: 200  |  R² = 0.238  (adj R² = 0.227)  —  weak
Overall model: statistically significant  |  F = 20.44, p = 0.0000

Coefficients:
  - X1: +0.588 (p=0.0000, 95% CI [0.394, 0.783]) — significant
  - X2: +0.443 (p=0.0000, 95% CI [0.252, 0.634]) — significant
  - X3: +0.268 (p=0.0113, 95% CI [0.061, 0.474]) — significant

VIFs: X1 = 1.00 (low), X2 = 1.01 (low), X3 = 1.01 (low)

Durbin–Watson: 0.63 → possible positive autocorrelation

Why this happened:
  - Autocorrelated residuals (AR(1)) — Durbin–Watson will be far from 2.


================================================================================
Running scenario: heteroscedasticity

Observations: 200  |  R² = 0.156  (adj R² = 0.143)  —  weak
Overall model: statistically significant  |  F = 12.05, p = 0.0000

Coefficients:
  - X1: +0.461 (p=0.0000, 95% CI [0.308, 0.614]) — significant
  - X2: +0.176 (p=0.4505, 95% CI [-0.283, 0.635]) — not significant
  - X3: +0.145 (p=0.5070, 95% CI [-0.284, 0.573]) — not significant

VIFs: X1 = 1.02 (low), X2 = 1.06 (low), X3 = 1.04 (low)

Durbin–Watson: ≈ 2 → no clear autocorrelation

Why this happened:
  - Heteroscedasticity: residual variance increases with X1 — affects standard errors.


================================================================================
Running scenario: outliers

Observations: 200  |  R² = 0.265  (adj R² = 0.254)  —  weak
Overall model: statistically significant  |  F = 23.56, p = 0.0000

Coefficients:
  - X1: +1.452 (p=0.0000, 95% CI [1.110, 1.794]) — significant
  - X2: -0.178 (p=0.3506, 95% CI [-0.554, 0.197]) — not significant
  - X3: -0.017 (p=0.9218, 95% CI [-0.360, 0.326]) — not significant

VIFs: X1 = 1.01 (low), X2 = 1.04 (low), X3 = 1.04 (low)

Durbin–Watson: ≈ 2 → no clear autocorrelation

Why this happened:
  - A few large outliers that can disproportionately affect coefficients and R².