Gradient Descent Variants¶

Climbing Down the Loss Surface, Step by Step¶

Every trainable model has one goal: find the parameters that minimise the loss. Gradient descent is the algorithm that gets it there — iteratively following the slope of the loss surface downhill. This notebook covers the update rule from first principles, the three main variants (batch, stochastic, mini-batch), learning-rate effects, and the practical decision of which variant to use when.

Why Gradient Descent Matters in Business¶

Scale is the real reason: the Normal Equations give an exact closed-form solution in one step — but computing $(\mathbf{X}^\top\mathbf{X})^{-1}$ costs $O(p^3)$ in the number of features and requires the entire dataset in memory. A recommendation model with 50 million users and 500,000 features cannot be fitted that way. Gradient descent processes data in small batches, updating parameters incrementally, making it the only practical option at scale.

Scenario	Why GD is necessary
Neural network with millions of weights	Normal Equations don’t exist for non-linear models
Real-time fraud model updated every hour	Incremental (online) updates via SGD
Demand forecast on 10M SKUs	$\mathbf{X}^\top\mathbf{X}$ inversion is infeasible — mini-batch GD scales
Logistic / Poisson regression	No closed form — loss must be minimised iteratively

The Update Rule — From First Principles¶

For a model with parameters $\boldsymbol{\theta}$ and MSE loss:

J(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right)^2

(1)

Taking the partial derivative with respect to $\theta_j$ and setting the update rule:

\theta_j \;\leftarrow\; \theta_j - \color{#ff7f0e}{\alpha} \cdot \underbrace{\frac{\partial J}{\partial \theta_j}}_{\text{gradient}} = \theta_j - \color{#ff7f0e}{\alpha} \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}

(2)

where $\color{#ff7f0e}{\alpha}$ is the learning rate — how large a step to take each iteration.

Key intuition: the gradient points uphill (direction of steepest increase). Subtracting it moves parameters downhill — toward lower loss.

The Three Variants¶

The only difference between variants is how many samples are used to estimate the gradient at each step:

Variant	Samples per update	Cost per step	Noise	Memory
Batch GD	All $m$	$O(m)$	None — exact gradient	Full dataset
Stochastic GD (SGD)	1	$O(1)$	High — single-sample estimate	One row
Mini-Batch GD	$B$ (e.g. 32–256)	$O(B)$	Moderate — default choice	$B$ rows

Batch GD

SGD

Mini-Batch GD

for epoch in range(epochs):
    grad = (1/m) * X.T @ (X @ theta - y)   # full dataset
    theta -= alpha * grad

Slow per epoch on large data. Smooth, stable loss curve. Suitable when the entire dataset fits in RAM and $m$ is modest.

for epoch in range(epochs):
    idx = np.random.permutation(m)
    for i in idx:
        grad = X[i] * (X[i] @ theta - y[i])   # one sample
        theta -= alpha * grad

Very noisy loss curve. Can escape shallow local minima. Used in online/streaming settings.

for epoch in range(epochs):
    for i in range(0, m, batch_size):
        Xb, yb = X[i:i+batch_size], y[i:i+batch_size]
        grad = (1/batch_size) * Xb.T @ (Xb @ theta - yb)
        theta -= alpha * grad

GPU-friendly (vectorised over batch). Loss curve moderately smooth. Standard choice for neural networks and large-scale ML.

Visual Intuition — GD on a Velocity-Error Cost¶

The cell below shows gradient descent minimising a simple 1D cost $J(t) = (2t + 2 - v^*)^2$ — the squared error between a linear velocity model and a target velocity. This is the exact same update rule used inside LinearRegression.fit(), scaled up to many parameters.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Desired velocity
v_star = 10

# Define cost and its gradient
def J(t):
    return (2*t + 2 - v_star)**2

def dJ(t):
    return 4 * (2*t + 2 - v_star)

# Plot J(t)
t_vals = np.linspace(-2, 8, 400)
J_vals = J(t_vals)

plt.figure(figsize=(8, 5))
plt.plot(t_vals, J_vals, label=r'$J(t)=(2t+2-v^*)^2$', linewidth=2)

# Gradient descent path
t = 0.0
eta = 0.1
path = [t]
for _ in range(10):
    t = t - eta * dJ(t)
    path.append(t)

plt.plot(path, J(np.array(path)), 'o-', color='tomato', label='GD steps', linewidth=1.5)
plt.axvline(path[-1], color='green', linestyle=':', label=f'Converged at t={path[-1]:.3f}')
plt.title('Gradient Descent on Velocity-Error Cost')
plt.xlabel('$t$')
plt.ylabel('$J(t)$')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Analytical minimum: t* = {(v_star - 2) / 2:.3f}")
print(f"GD result after 10 steps: t = {path[-1]:.3f}")

Business Cost Function — Gradient Descent on $C(q)$ ¶

The same algorithm applies to any differentiable cost function. Here we minimise a quadratic business cost $C(q) = 5q^2 - 40q + 200$ (total cost as a function of production quantity $q$ ). The gradient points toward increasing cost; subtracting it moves toward the minimum-cost quantity.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Define the cost function and its derivative
def C(q):
    return 5*q**2 - 40*q + 200

def dC(q):
    return 10*q - 40

# Define quantity values
q_vals = np.linspace(0, 8, 400)
cost_vals = C(q_vals)

# Plot the cost function
plt.figure(figsize=(10, 6))
plt.plot(q_vals, cost_vals, label=r'$C(q) = 5q^2 - 40q + 200$', color='blue', linewidth=2)

# Mark the minimum point
q_min = 4
plt.scatter([q_min], [C(q_min)], color='red', s=80, zorder=5, label='Minimum Cost at $q=4$')
plt.axvline(x=q_min, linestyle='--', color='gray', alpha=0.5)

# Tangent line at q = 6
q_tangent = 6
slope = dC(q_tangent)
y_tangent = C(q_tangent)
q_range = np.linspace(q_tangent - 1, q_tangent + 1, 100)
tangent_line = slope * (q_range - q_tangent) + y_tangent
plt.plot(q_range, tangent_line, linestyle='--', color='green', linewidth=1.5, label='Tangent at $q=6$')

# Gradient descent visualization
q = 8.0
eta = 0.01
q_path = [q]
for _ in range(50):
    q = q - eta * dC(q)
    q_path.append(q)

plt.plot(q_path, C(np.array(q_path)), 'o-', label='Gradient Descent Steps', color='orange',
         markersize=4, linewidth=1.2)

plt.title('Business Cost Function and Gradient Descent')
plt.xlabel('Quantity $q$')
plt.ylabel('Cost $C(q)$')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

print(f"Minimum cost quantity: q* = {q_path[-1]:.4f}  (analytical: 4.0)")
print(f"Minimum cost: C(q*) = {C(q_path[-1]):.4f}  (analytical: {C(4):.0f})")

Convergence Animation — GD on $J(\theta) = (\theta - 3)^2$ ¶

The animation below shows each gradient descent step on a simple 1D loss: the dot moves toward the minimum at $\theta^* = 3$ , with each step size proportional to the gradient magnitude. As the dot nears the minimum, the gradient shrinks and steps become smaller — that is the automatic deceleration built into the update rule.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

# Cost function and derivative
def J(theta):
    return (theta - 3) ** 2

def dJ(theta):
    return 2 * (theta - 3)

# Hyperparameters
alpha = 0.1
theta = 0.0
steps = 50

# Record history for animation
thetas = [theta]
costs = [J(theta)]

# Perform gradient descent
for _ in range(steps):
    theta = theta - alpha * dJ(theta)
    thetas.append(theta)
    costs.append(J(theta))

# Setup figure and axis
theta_vals = np.linspace(-1, 7, 400)
cost_vals = J(theta_vals)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(theta_vals, cost_vals, 'b-', label=r'$J(\theta) = (\theta - 3)^2$')
point, = ax.plot([], [], 'ro', label='Current Step')
text = ax.text(0.05, 0.9, '', transform=ax.transAxes)

ax.set_xlim(-1, 7)
ax.set_ylim(0, max(cost_vals) + 1)
ax.set_xlabel(r'$\theta$')
ax.set_ylabel(r'$J(\theta)$')
ax.set_title('Gradient Descent Minimizing Squared Error')
ax.grid(True, alpha=0.3)
ax.legend()

def init():
    point.set_data([], [])
    text.set_text('')
    return point, text

def update(frame):
    x = [thetas[frame]]
    y = [costs[frame]]
    point.set_data(x, y)
    text.set_text(f'Step: {frame}\n\u03b8: {x[0]:.4f}\nJ(\u03b8): {y[0]:.4f}')
    return point, text

ani = FuncAnimation(fig, update, frames=len(thetas), init_func=init, blit=False, interval=300)
ani.save('gradientdescent.gif', writer=PillowWriter(fps=10))
plt.close()

from IPython.display import Image
Image(filename='gradientdescent.gif')

Convergence Animation with Direction Arrow¶

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

# Cost function and derivative
def J(theta):
    return (theta - 3)**2

def dJ(theta):
    return 2 * (theta - 3)

# Hyperparameters
alpha = 0.1
theta = 0.0
steps = 50

thetas = [theta]
costs = [J(theta)]

# Gradient descent
for _ in range(steps):
    theta = theta - alpha * dJ(theta)
    thetas.append(theta)
    costs.append(J(theta))

# Plot setup
theta_vals = np.linspace(-1, 7, 400)
cost_vals = J(theta_vals)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(theta_vals, cost_vals, 'b-', label=r'$J(\theta) = (\theta - 3)^2$')
point, = ax.plot([], [], 'ro', label='Current Step')
arrow = ax.annotate('', xy=(0, 0), xytext=(0, 0),
                    arrowprops=dict(arrowstyle='->', color='red', lw=2))
text = ax.text(0.05, 0.9, '', transform=ax.transAxes)

ax.set_xlim(-1, 7)
ax.set_ylim(0, max(cost_vals) + 1)
ax.set_xlabel(r'$\theta$')
ax.set_ylabel(r'$J(\theta)$')
ax.set_title('Gradient Descent Direction Toward Minimum')
ax.grid(True, alpha=0.3)
ax.legend()

def init():
    point.set_data([], [])
    arrow.set_position((0, 0))
    arrow.xy = (0, 0)
    text.set_text('')
    return point, arrow, text

def update(frame):
    x = thetas[frame]
    y = costs[frame]
    point.set_data([x], [y])
    if frame < len(thetas) - 1:
        next_x = thetas[frame + 1]
        next_y = costs[frame + 1]
        arrow.xy = (next_x, next_y)
        arrow.set_position((x, y))
    else:
        arrow.xy = (x, y)
        arrow.set_position((x, y))
    text.set_text(f'Step: {frame}\n\u03b8: {x:.4f}\nJ(\u03b8): {y:.4f}')
    return point, arrow, text

ani = FuncAnimation(fig, update, frames=len(thetas), init_func=init, blit=False, interval=300)
ani.save('gradientdescentarrow.gif', writer=PillowWriter(fps=10))
plt.close()

from IPython.display import Image
Image(filename='gradientdescentarrow.gif')

Learning Rate Effects¶

The learning rate $\alpha$ is the single most important hyperparameter for gradient descent:

\alpha \text{ too small} \Rightarrow \text{slow convergence, wastes compute} \qquad \alpha \text{ too large} \Rightarrow \text{overshoots, diverges} \qquad \alpha \text{ well-tuned} \Rightarrow \text{fast, stable convergence}

(3)

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(theta):  return (theta - 3)**2
def dJ(theta): return 2 * (theta - 3)

theta_vals = np.linspace(-2, 9, 400)

configs = [
    (0.01,  'Too small (\u03b1=0.01)',  'steelblue'),
    (0.10,  'Well-tuned (\u03b1=0.10)', 'seagreen'),
    (0.99,  'Too large  (\u03b1=0.99)', 'tomato'),
]

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Left: paths on loss surface
axes[0].plot(theta_vals, J(theta_vals), 'k-', linewidth=2, label='Loss surface')
for alpha, label, color in configs:
    t, path = 0.0, [0.0]
    for _ in range(20):
        t = t - alpha * dJ(t)
        path.append(t)
    axes[0].plot(path, J(np.array(path)), 'o-', color=color, label=label,
                 markersize=4, linewidth=1.5, alpha=0.8)
axes[0].set_xlim(-2, 9); axes[0].set_ylim(-1, 40)
axes[0].set_xlabel(r'$\theta$'); axes[0].set_ylabel(r'$J(\theta)$')
axes[0].set_title('Paths on loss surface (20 steps)')
axes[0].legend(fontsize=8); axes[0].grid(True, alpha=0.3)

# Right: loss vs iteration
for alpha, label, color in configs:
    t, losses = 0.0, []
    for _ in range(50):
        t = t - alpha * dJ(t)
        losses.append(J(t))
    axes[1].plot(range(50), losses, color=color, label=label, linewidth=1.8)
axes[1].set_xlabel('Iteration'); axes[1].set_ylabel(r'$J(\theta)$')
axes[1].set_title('Loss curve vs iteration')
axes[1].set_ylim(-1, 50); axes[1].legend(fontsize=8); axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Why GD Scales — Normal Equations vs Gradient Descent¶

Criterion	Normal Equations	Gradient Descent
Works for non-linear models	No	Yes
Memory requirement	Full dataset in RAM	One mini-batch
Time complexity	$O(mp^2 + p^3)$	$O(k \cdot B \cdot p)$ per epoch
Online / streaming updates	No	Yes (SGD/mini-batch)
Hyperparameters	None	Learning rate, batch size, epochs

Try It in the Browser¶

Mini-batch gradient descent on a 1D linear regression — in pure Python. Change the learning rate and batch size and watch convergence speed change.

Guided Practice¶

Which gradient descent variant uses one sample at a time for updates?¶

Batch gradient descentBatch GD uses the entire dataset for each gradient estimate before updating parameters.

Stochastic gradient descent (SGD)Correct. SGD updates parameters after computing the gradient on a single training sample. This makes it fast but noisy.

Mini-batch gradient descentMini-batch uses a small fixed subset (e.g. 32–256 samples), not a single sample.

Normal EquationsNormal Equations are a closed-form solution, not an iterative gradient-based method.

What is the main advantage of mini-batch gradient descent?¶

It guarantees zero training error in one stepNo gradient-based method guarantees zero training error in a single step.

It balances speed and stability better than full-batch or single-sample updates aloneCorrect. Mini-batches provide a noisier but faster gradient estimate than full-batch, while being more stable than single-sample SGD. It is also GPU-friendly via vectorised matrix operations.

It removes the need for a learning rateAll gradient descent variants require a learning rate. Removing it is not possible without an adaptive method like Adam.

It always uses the entire dataset in memory at onceThat describes batch GD. Mini-batching specifically avoids loading the full dataset — only B rows are needed at a time.

If the learning rate is set too large, what happens during gradient descent?¶

Convergence is slow but guaranteedSlow convergence is the symptom of a learning rate that is too small, not too large.

The update overshoots the minimum and the loss may oscillate or divergeCorrect. A large learning rate causes the parameter to jump past the minimum, potentially increasing loss on the next step. In the worst case the algorithm diverges entirely.

The gradient becomes zero immediatelyA large step does not make the gradient zero — it overshoots the point where the gradient is zero.

Batch size automatically increases to compensateBatch size and learning rate are independent hyperparameters. One does not automatically adjust the other.

Why is gradient descent preferred over Normal Equations for training large neural networks?¶

Because Normal Equations always overfit on large datasetsOverfitting is a model-complexity issue, not a property of the Normal Equations solver.

Because Normal Equations require inverting a matrix that does not exist for non-linear models, and inversion is O(p³) — infeasible at scaleCorrect. Neural networks have non-linear losses with no closed-form solution. Even if they did, O(p³) inversion with millions of parameters is computationally infeasible.

Because gradient descent always finds the global minimumGradient descent is not guaranteed to find the global minimum in non-convex settings (e.g. neural networks). It may converge to a local minimum or saddle point.

Because Normal Equations require more hyperparametersNormal Equations have no hyperparameters (no learning rate, no epochs). It is gradient descent that requires hyperparameter tuning.

Exercises¶

Exercise 1 — GD from scratch on a 2D loss surface¶

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# 2D loss: J(t0, t1) = (t0 - 2)^2 + 2*(t1 + 1)^2
# Minimum at (2, -1)

def J2(t0, t1): return (t0 - 2)**2 + 2*(t1 + 1)**2
def dJ_t0(t0, t1): return 2*(t0 - 2)
def dJ_t1(t0, t1): return 4*(t1 + 1)

# TODO:
# 1. Implement gradient descent from t0=0, t1=0 with alpha=0.1, 40 steps
# 2. Record the (t0, t1) path and loss at each step
# 3. Plot the loss surface as a contour plot
# 4. Overlay the GD path on the contour plot
# 5. Mark the starting point and the minimum

t0_grid = np.linspace(-1, 5, 100)
t1_grid = np.linspace(-4, 2, 100)
T0, T1 = np.meshgrid(t0_grid, t1_grid)
Z = J2(T0, T1)

fig, ax = plt.subplots(figsize=(7, 6))
ax.contourf(T0, T1, Z, levels=20, cmap='Blues_r', alpha=0.7)
ax.contour(T0, T1, Z, levels=20, colors='white', linewidths=0.5, alpha=0.4)
ax.scatter([2], [-1], color='gold', s=120, zorder=5, label='Minimum (2, -1)')
ax.set_xlabel(r'$\theta_0$'); ax.set_ylabel(r'$\theta_1$')
ax.set_title('2D Loss Surface — add your GD path')
ax.legend()
plt.tight_layout()
plt.show()

# Uncomment to start:
# t0, t1, alpha = 0.0, 0.0, 0.1
# path_t0, path_t1 = [t0], [t1]
# for _ in range(40):
#     t0 -= alpha * dJ_t0(t0, t1)
#     t1 -= alpha * dJ_t1(t0, t1)
#     path_t0.append(t0); path_t1.append(t1)
# ax.plot(path_t0, path_t1, 'r-o', markersize=3, label='GD path')

Exercise 2 — Compare batch vs mini-batch convergence¶

Generate a dataset with 500 samples and 1 feature. Implement both batch GD and mini-batch GD (batch size 32). Run each for 50 epochs and plot the loss curve side-by-side. Which converges faster? Which is smoother?

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
m = 500
X = np.random.randn(m, 1)
y = 3 * X.ravel() + 2 + np.random.randn(m) * 0.5  # true: theta0=2, theta1=3
Xb = np.hstack([np.ones((m, 1)), X])  # add bias column

# TODO:
# Batch GD:
#   theta = np.zeros(2)
#   for epoch in range(50):
#       grad = (1/m) * Xb.T @ (Xb @ theta - y)
#       theta -= 0.05 * grad
#       record MSE
#
# Mini-batch GD (batch_size=32):
#   theta = np.zeros(2)
#   for epoch in range(50):
#       shuffle indices
#       for each mini-batch: compute grad on batch, update theta
#       record MSE at end of epoch
#
# Plot both loss curves on same axes

print("Implement batch vs mini-batch GD and plot loss curves.")

Exercise 3 — Learning rate grid search¶

For the 1D problem $J(\theta) = (\theta - 3)^2$ , run gradient descent with five learning rates: [0.001, 0.01, 0.1, 0.5, 1.1]. Start each from $\theta = 0$ and run 60 steps. Plot loss vs iteration for all five on the same axes and identify which rates converge, which are slow, and which diverge.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(theta):  return (theta - 3)**2
def dJ(theta): return 2 * (theta - 3)

alphas = [0.001, 0.01, 0.1, 0.5, 1.1]
epochs = 60

# TODO:
# for alpha in alphas:
#     theta = 0.0
#     losses = []
#     for _ in range(epochs):
#         theta -= alpha * dJ(theta)
#         losses.append(J(theta))
#     plt.plot(losses, label=f'alpha={alpha}')
#
# plt.yscale('log')   # log scale makes slow/fast convergence visible
# plt.xlabel('Iteration'); plt.ylabel('Loss (log scale)')
# plt.title('Learning rate comparison'); plt.legend(); plt.grid(True, alpha=0.3)
# plt.show()

print("Implement learning rate comparison and plot.")

Common Pitfalls¶

Summary¶

Concept	Formula / Rule
Update rule	$\theta_j \leftarrow \theta_j - \alpha \cdot \frac{\partial J}{\partial \theta_j}$
MSE gradient	$\frac{1}{m}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\theta} - \mathbf{y})$
Batch GD	Exact gradient, slow on large data, no hyperparams beyond $\alpha$
SGD	One sample per step, fast, noisy, online-capable
Mini-batch GD	$B$ samples per step, GPU-friendly, default for deep learning
Learning rate too small	Slow convergence
Learning rate too large	Overshooting, possible divergence
Why not Normal Equations at scale	$O(p^3)$ inversion, requires full dataset, no non-linear support

Decision rule: for small linear regression ( $p < 10\,000$ , data fits in RAM) use Normal Equations. For everything else — large data, non-linear models, online updates — use mini-batch gradient descent.

Next Up — Advanced Optimizers¶

You can now implement gradient descent from scratch. Next: make it faster and smarter.¶

The next notebook — Advanced Optimizers (Adam, RMSprop, Momentum) — shows that plain gradient descent has a fixed step size for all parameters. Adaptive optimizers maintain per-parameter learning rates, use gradient history to build momentum, and converge faster on the ill-conditioned loss surfaces common in deep learning. Dependencies: the GD update rule and learning-rate intuition from this notebook.