Ordinary Least Squares (OLS) & Normal Equations#

Because sometimes, math can solve your problem faster than hiking down a loss valley. 😌


🎯 What Is OLS?#

So far, we’ve made our poor regression model stumble downhill with gradients, slowly minimizing error. But what if we could just jump straight to the bottom of the valley — no hiking boots required? 👟

That’s Ordinary Least Squares (OLS).

It’s the “shortcut” method that says:

“We can find the perfect slope and intercept directly — with pure algebra.”


🧠 The Idea#

OLS minimizes the same thing Gradient Descent does — Mean Squared Error (MSE):

$$
J(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

But instead of learning gradually, we solve for β directly by setting derivatives to zero. Basically, we say:

“We want the slope where error stops changing.”

That gives us the Normal Equation:

$$
\hat{\beta} = (X^T X)^{-1} X^T y
$$

Where:

- \(X\): matrix of features (with a column of ones for the intercept)
- \(y\): target variable
- \(\hat{\beta}\): the optimal coefficients
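
Where does that formula come from? Write the MSE in matrix form and set its gradient with respect to \(\beta\) to zero:

$$
J(\beta) = \frac{1}{n}\lVert y - X\beta \rVert^2,
\qquad
\nabla_{\beta} J = -\frac{2}{n} X^T (y - X\beta) = 0
\;\Longrightarrow\;
X^T X \hat{\beta} = X^T y
$$

Multiplying both sides by \((X^T X)^{-1}\) (assuming it exists) gives the closed-form solution above.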


🧾 Why “Normal” Equation?#

Because setting the derivative of the loss to zero forces the residual vector \(y - X\hat{\beta}\) to be orthogonal (that is, *normal*) to every column of \(X\). That orthogonality condition is what gives the equation its name.
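
In symbols, the condition reads

$$
X^T (y - X\hat{\beta}) = 0
$$

which is just the normal equation \(X^T X \hat{\beta} = X^T y\) rearranged.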

Or, as a data scientist might explain to an executive:

“Normal equations make your regression weights perfectly balanced — like all things should be.” 😎


🧮 A Simple Example#

Suppose we’re modeling:

Sales = β₀ + β₁ × TV_Ad_Spend

import numpy as np

# Example data
X = np.array([[1, 230],
              [1, 44],
              [1, 17],
              [1, 151],
              [1, 180]])  # add column of 1s for intercept

y = np.array([22, 10, 7, 18, 20])

# OLS via the Normal Equation: beta_hat = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print("Coefficients:", beta)

This gives (rounded):

Coefficients: [6.4966 0.0716]

Meaning:

Even with $0 ad spend, you get baseline sales ≈ 6.5. Every additional $1 in TV ads adds about $0.07 in sales. 📺💰
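
By the way, `np.linalg.inv` works fine here, but forming an explicit inverse is not the most numerically stable route. Here is a small sketch of the same fit using `np.linalg.lstsq`, reusing the `X` and `y` defined above, plus a quick check that the residual vector really is orthogonal to the columns of \(X\):

# Same fit, but solved as a least-squares problem (no explicit inverse)
beta_lstsq, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("Coefficients:", beta_lstsq)                    # matches the Normal Equation result

# The residuals are orthogonal ("normal") to the columns of X
print("X^T residuals:", X.T @ (y - X @ beta_lstsq))   # ~[0, 0] up to floating-point noise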


🧩 Intuition: A Line that Minimizes Apologies#

Think of OLS as drawing the “least embarrassing line” through your scatter plot:

- For each point, the vertical error (residual) is the model’s mistake.
- OLS finds the line that makes the sum of squared mistakes as small as possible.

No gradient descent, no random initialization, no drama. Just straight math, no feelings. 🧘‍♀️
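
If you want to see those “mistakes” explicitly, you can compute the residuals from the `beta` fitted earlier (continuing the NumPy example above):

# Residuals: actual sales minus the fitted line's predictions
y_hat = X @ beta
residuals = y - y_hat
print("Residuals:", np.round(residuals, 2))
print("Sum of residuals:", round(residuals.sum(), 6))        # ~0, because the model has an intercept
print("Sum of squared residuals:", round((residuals**2).sum(), 3))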


⚙️ OLS in Scikit-Learn#

In practice, you rarely invert matrices yourself. scikit-learn’s `LinearRegression` solves the same least-squares problem for you, using a numerically stable solver under the hood instead of an explicit inverse:

from sklearn.linear_model import LinearRegression

model = LinearRegression()                # fit_intercept=True by default
model.fit(X[:, 1].reshape(-1, 1), y)      # pass only the TV-spend column; sklearn adds the intercept itself

print("Intercept:", model.intercept_)
print("Slope:", model.coef_)
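
To convince yourself it really is the same maths, compare against the `beta` we computed by hand (assuming the earlier NumPy variables are still in scope):

# sklearn's intercept and slope should match the Normal Equation solution
print(np.allclose([model.intercept_, model.coef_[0]], beta))   # True, up to floating-point noise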

📊 Visualising the Fit#

import matplotlib.pyplot as plt

plt.scatter(X[:, 1], y, label="Actual Sales", alpha=0.8)
plt.plot(X[:, 1], model.predict(X[:, 1].reshape(-1, 1)),
         color="red", label="OLS Fit Line", linewidth=2)
plt.xlabel("TV Advertising Spend ($)")
plt.ylabel("Sales ($)")
plt.title("OLS Regression Line – Sales vs TV Spend")
plt.legend()
plt.show()

“When your scatter plot looks like a calm red line — that’s when business harmony is achieved.” 📈☯️


🧮 Matrix Shapes (Quick Reference)#

| Symbol | Meaning | Shape |
|---|---|---|
| \(X\) | Feature matrix | (n_samples, n_features + 1) |
| \(y\) | Target vector | (n_samples, 1) |
| \(X^T X\) | Square matrix | (n_features + 1, n_features + 1) |
| \((X^T X)^{-1}\) | Inverse | (n_features + 1, n_features + 1) |
| \(\hat{\beta}\) | Coefficient vector | (n_features + 1, 1) |

“If your matrix dimensions don’t align, your model’s chakras are blocked.” 🧘‍♂️
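
A quick sanity check of those shapes on the toy example above (note that NumPy returns 1-D arrays here, so `y` and `beta` show up as `(5,)` and `(2,)` rather than explicit column vectors):

print(X.shape)          # (5, 2): 5 samples, 1 feature + intercept column
print((X.T @ X).shape)  # (2, 2)
print(beta.shape)       # (2,)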


⚠️ Limitations of OLS#

| Limitation | Description |
|---|---|
| 💻 Computational Cost | Matrix inversion is expensive for large data |
| 💥 Multicollinearity | \(X^T X\) can become singular (not invertible) |
| 📏 Assumes Linearity | Works only for linear relationships |
| 📊 Sensitive to Outliers | One crazy data point can tilt your whole model |
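
When \(X^T X\) is singular (or nearly so), one common escape hatch is the Moore-Penrose pseudo-inverse, available as `np.linalg.pinv`. Here is a minimal sketch with a made-up toy dataset whose third column is exactly twice the second, so a plain `np.linalg.inv` would fail or blow up:

# Perfectly collinear features: X^T X is singular
X_collinear = np.array([[1, 2, 4],
                        [1, 3, 6],
                        [1, 5, 10],
                        [1, 7, 14]], dtype=float)
y_toy = np.array([3, 4, 6, 8], dtype=float)

# pinv returns the minimum-norm least-squares solution instead of crashing
beta_pinv = np.linalg.pinv(X_collinear) @ y_toy
print("Coefficients:", beta_pinv)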


💡 Business Analogy#

OLS is like a consultant who instantly gives you the “best-fit” solution — but charges extra if your data is messy or high-dimensional. 💼

Gradient descent, on the other hand, is like an intern who learns slowly but can handle huge data cheaply. 🧑‍💻


📚 Tip for Python Learners#

If you’re new to Python or NumPy matrix operations, check out my companion book: 👉 Programming for Business. It’s like “Python Gym” before you lift machine learning weights. 🏋️‍♂️🐍


🧭 Recap#

| Concept | Description |
|---|---|
| OLS | Analytical method to minimize squared errors |
| Normal Equation | Closed-form solution for regression weights |
| Advantage | No iterative training needed |
| Disadvantage | Computationally heavy for large datasets |
| Relation to Gradient Descent | Both minimize the same cost — different paths |


💬 Final Thought#

“OLS doesn’t learn — it knows. Like that one kid in class who never studied but still topped the exam.” 😏📚


🔜 Next Up#

👉 Head to Non-linear & Polynomial Features where we’ll make our linear models curvy and flexible — because business problems rarely run in straight lines. 📈🔀


# Your code here