Ordinary Least Squares (OLS) & Normal Equations#
Because sometimes, math can solve your problem faster than hiking down a loss valley. 😌
🎯 What Is OLS?#
So far, we’ve made our poor regression model stumble downhill with gradients, slowly minimizing error. But what if we could just jump straight to the bottom of the valley — no hiking boots required? 👟
That’s Ordinary Least Squares (OLS).
It’s the “shortcut” method that says:
“We can find the perfect slope and intercept directly — with pure algebra.”
🧠 The Idea#
OLS minimizes the same thing Gradient Descent does — Mean Squared Error (MSE):
$$
J(\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
But instead of learning gradually, we solve for β directly by setting derivatives to zero. Basically, we say:
“We want the slope where error stops changing.”
That gives us the Normal Equation:
$$
\hat{\beta} = (X^T X)^{-1} X^T y
$$
Where:
- $X$: the matrix of features (with a column of ones for the intercept)
- $y$: the target vector
- $\hat{\beta}$: the optimal coefficients
🧾 Why “Normal” Equation?#
Because the solution makes the residual vector $y - X\hat{\beta}$ normal (i.e., orthogonal) to the column space of $X$, which is exactly the condition you get by setting the derivative of the loss function to zero.
Or, as a data scientist might explain to an executive:
“Normal equations make your regression weights perfectly balanced — like all things should be.” 😎
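For completeness, here's the one-line derivation behind that claim: differentiate the MSE with respect to $\beta$, set the gradient to zero, and solve (a standard linear-algebra result, sketched here):

$$
\nabla_\beta J(\beta) = -\frac{2}{n} X^T (y - X\beta) = 0
\;\;\Longrightarrow\;\;
X^T X \hat{\beta} = X^T y
\;\;\Longrightarrow\;\;
\hat{\beta} = (X^T X)^{-1} X^T y
$$

The middle expression, $X^T X \hat{\beta} = X^T y$, is the normal equation itself; when $X^T X$ is invertible, solving it gives the closed form above.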
🧮 A Simple Example#
Suppose we’re modeling:
Sales = β₀ + β₁ × TV_Ad_Spend
```python
import numpy as np

# Example data: a column of 1s for the intercept, then TV ad spend
X = np.array([[1, 230],
              [1,  44],
              [1,  17],
              [1, 151],
              [1, 180]])
y = np.array([22, 10, 7, 18, 20])

# OLS via the Normal Equation
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print("Coefficients:", beta)
```
This gives (rounded):
```
Coefficients: [6.4966 0.0716]
```
Meaning:
- Even with \$0 of ad spend, you get baseline sales ≈ 6.5.
- Every additional \$1 in TV ads adds about \$0.07 in sales. 📺💰
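A quick aside before moving on: explicitly inverting $X^T X$ is fine for a toy example, but `np.linalg.lstsq` solves the same least-squares problem without forming the inverse and is numerically more robust. A minimal sketch, reusing the `X` and `y` defined above:

```python
import numpy as np

# Solve the least-squares problem directly (no explicit matrix inverse)
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("Coefficients:", beta_lstsq)  # matches the normal-equation beta
```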
🧩 Intuition: A Line that Minimizes Apologies#
Think of OLS as drawing the “least embarrassing line” through your scatter plot:
- For each point, the vertical error (residual) is the model's mistake.
- OLS finds the line that makes the sum of squared mistakes as small as possible.
No gradient descent, no random initialization, no drama. Just straight math, no feelings. 🧘♀️
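To make "squared mistakes" concrete, here's a quick check of the residuals for the line we just fit, reusing `X`, `y`, and `beta` from the NumPy example above:

```python
# Residuals = actual - predicted for each point
y_hat = X @ beta
residuals = y - y_hat

print("Residuals:", residuals.round(2))
print("Sum of squared errors:", (residuals ** 2).sum().round(2))
```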
⚙️ OLS in Scikit-Learn#
In practice, you rarely invert matrices yourself. Scikit-learn handles it for you, faster and with better numerical stability (it solves the least-squares problem directly rather than inverting $X^T X$):
```python
from sklearn.linear_model import LinearRegression

# Fit on the TV-spend column only; LinearRegression adds the intercept itself
model = LinearRegression()
model.fit(X[:, 1].reshape(-1, 1), y)

print("Intercept:", model.intercept_)
print("Slope:", model.coef_)
```
📊 Visualising the Fit#
```python
import matplotlib.pyplot as plt

plt.scatter(X[:, 1], y, label="Actual Sales", alpha=0.8)
plt.plot(X[:, 1], model.predict(X[:, 1].reshape(-1, 1)),
         color="red", label="OLS Fit Line", linewidth=2)
plt.xlabel("TV Advertising Spend ($)")
plt.ylabel("Sales ($)")
plt.title("OLS Regression Line – Sales vs TV Spend")
plt.legend()
plt.show()
```
“When your scatter plot looks like a calm red line — that’s when business harmony is achieved.” 📈☯️
🧮 Matrix Shapes (Quick Reference)#
| Symbol | Meaning | Shape |
|---|---|---|
| $X$ | Feature matrix | (n_samples, n_features + 1) |
| $y$ | Target vector | (n_samples, 1) |
| $X^T X$ | Square matrix | (n_features + 1, n_features + 1) |
| $(X^T X)^{-1}$ | Inverse | (n_features + 1, n_features + 1) |
| $\hat{\beta}$ | Coefficient vector | (n_features + 1, 1) |
“If your matrix dimensions don’t align, your model’s chakras are blocked.” 🧘♂️
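A quick way to confirm those shapes on the toy example from earlier (because `y` and `beta` were created as 1-D arrays, NumPy reports `(5,)` and `(2,)` rather than `(5, 1)` and `(2, 1)`):

```python
# Shapes for the 5-sample, 1-feature (+ intercept) example above
print("X:", X.shape)              # (5, 2)
print("X^T X:", (X.T @ X).shape)  # (2, 2)
print("y:", y.shape)              # (5,)
print("beta:", beta.shape)        # (2,)
```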
⚠️ Limitations of OLS#
| Limitation | Description |
|---|---|
| 💻 Computational Cost | Matrix inversion is expensive for large datasets |
| 💥 Multicollinearity | $X^T X$ can become singular (not invertible); see the sketch below |
| 📏 Assumes Linearity | Works only for linear relationships |
| 📊 Sensitive to Outliers | One crazy data point can tilt your whole model |
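To see the multicollinearity problem in code, here's a small illustrative sketch (the data below is made up for the demo, not from the TV-spend example): when one column is an exact multiple of another, $X^T X$ is singular, so `np.linalg.inv` fails, while the Moore-Penrose pseudo-inverse `np.linalg.pinv` still returns a (minimum-norm) solution:

```python
import numpy as np

# Perfectly collinear features: the third column is exactly 2x the second,
# so X_bad.T @ X_bad has rank 2 and cannot be inverted.
X_bad = np.array([[1, 2, 4],
                  [1, 3, 6],
                  [1, 4, 8],
                  [1, 5, 10]], dtype=float)
y_bad = np.array([3.0, 5.0, 7.0, 9.0])

# np.linalg.inv(X_bad.T @ X_bad) would raise LinAlgError: Singular matrix.
# The pseudo-inverse returns the minimum-norm least-squares solution instead.
beta_pinv = np.linalg.pinv(X_bad) @ y_bad
print("Pseudo-inverse coefficients:", beta_pinv)
```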
💡 Business Analogy#
OLS is like a consultant who instantly gives you the “best-fit” solution — but charges extra if your data is messy or high-dimensional. 💼
Gradient descent, on the other hand, is like an intern who learns slowly but can handle huge data cheaply. 🧑💻
📚 Tip for Python Learners#
If you're new to Python or NumPy matrix operations, check out my companion book: 👉 Programming for Business. It's like "Python Gym" before you lift machine learning weights. 🏋️♂️🐍
🧭 Recap#
| Concept | Description |
|---|---|
| OLS | Analytical method to minimize squared errors |
| Normal Equation | Closed-form solution for regression weights |
| Advantage | No iterative training needed |
| Disadvantage | Computationally heavy for large datasets |
| Relation to Gradient Descent | Both minimize the same cost, just by different paths |
💬 Final Thought#
“OLS doesn’t learn — it knows. Like that one kid in class who never studied but still topped the exam.” 😏📚
🔜 Next Up#
👉 Head to Non-linear & Polynomial Features where we’ll make our linear models curvy and flexible — because business problems rarely run in straight lines. 📈🔀