Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Gradient Descent Variants

Climbing Down the Loss Surface, Step by Step

Every trainable model has one goal: find the parameters that minimise the loss. Gradient descent is the algorithm that gets it there — iteratively following the slope of the loss surface downhill. This notebook covers the update rule from first principles, the three main variants (batch, stochastic, mini-batch), learning-rate effects, and the practical decision of which variant to use when.

Why Gradient Descent Matters in Business

Scale is the real reason: the Normal Equations give an exact closed-form solution in one step — but computing $(\mathbf{X}^\top\mathbf{X})^{-1}$ costs $O(p^3)$ in the number of features and requires the entire dataset in memory. A recommendation model with 50 million users and 500,000 features cannot be fitted that way. Gradient descent processes data in small batches, updating parameters incrementally, making it the only practical option at scale.
ScenarioWhy GD is necessary
Neural network with millions of weightsNormal Equations don’t exist for non-linear models
Real-time fraud model updated every hourIncremental (online) updates via SGD
Demand forecast on 10M SKUsXX\mathbf{X}^\top\mathbf{X} inversion is infeasible — mini-batch GD scales
Logistic / Poisson regressionNo closed form — loss must be minimised iteratively

The Update Rule — From First Principles

For a model with parameters θ\boldsymbol{\theta} and MSE loss:

J(θ)=12mi=1m(hθ(x(i))y(i))2J(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right)^2

Taking the partial derivative with respect to θj\theta_j and setting the update rule:

θj    θjαJθjgradient=θjα1mi=1m(hθ(x(i))y(i))xj(i)\theta_j \;\leftarrow\; \theta_j - \color{#ff7f0e}{\alpha} \cdot \underbrace{\frac{\partial J}{\partial \theta_j}}_{\text{gradient}} = \theta_j - \color{#ff7f0e}{\alpha} \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}

where α\color{#ff7f0e}{\alpha} is the learning rate — how large a step to take each iteration.

Key intuition: the gradient points uphill (direction of steepest increase). Subtracting it moves parameters downhill — toward lower loss.

The Three Variants

The only difference between variants is how many samples are used to estimate the gradient at each step:

VariantSamples per updateCost per stepNoiseMemory
Batch GDAll mmO(m)O(m)None — exact gradientFull dataset
Stochastic GD (SGD)1O(1)O(1)High — single-sample estimateOne row
Mini-Batch GDBB (e.g. 32–256)O(B)O(B)Moderate — default choiceBB rows
Batch GD
SGD
Mini-Batch GD
for epoch in range(epochs):
    grad = (1/m) * X.T @ (X @ theta - y)   # full dataset
    theta -= alpha * grad

Slow per epoch on large data. Smooth, stable loss curve. Suitable when the entire dataset fits in RAM and mm is modest.

Visual Intuition — GD on a Velocity-Error Cost

The cell below shows gradient descent minimising a simple 1D cost J(t)=(2t+2v)2J(t) = (2t + 2 - v^*)^2 — the squared error between a linear velocity model and a target velocity. This is the exact same update rule used inside LinearRegression.fit(), scaled up to many parameters.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Desired velocity
v_star = 10

# Define cost and its gradient
def J(t):
    return (2*t + 2 - v_star)**2

def dJ(t):
    return 4 * (2*t + 2 - v_star)

# Plot J(t)
t_vals = np.linspace(-2, 8, 400)
J_vals = J(t_vals)

plt.figure(figsize=(8, 5))
plt.plot(t_vals, J_vals, label=r'$J(t)=(2t+2-v^*)^2$', linewidth=2)

# Gradient descent path
t = 0.0
eta = 0.1
path = [t]
for _ in range(10):
    t = t - eta * dJ(t)
    path.append(t)

plt.plot(path, J(np.array(path)), 'o-', color='tomato', label='GD steps', linewidth=1.5)
plt.axvline(path[-1], color='green', linestyle=':', label=f'Converged at t={path[-1]:.3f}')
plt.title('Gradient Descent on Velocity-Error Cost')
plt.xlabel('$t$')
plt.ylabel('$J(t)$')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Analytical minimum: t* = {(v_star - 2) / 2:.3f}")
print(f"GD result after 10 steps: t = {path[-1]:.3f}")

Business Cost Function — Gradient Descent on C(q)C(q)

The same algorithm applies to any differentiable cost function. Here we minimise a quadratic business cost C(q)=5q240q+200C(q) = 5q^2 - 40q + 200 (total cost as a function of production quantity qq). The gradient points toward increasing cost; subtracting it moves toward the minimum-cost quantity.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Define the cost function and its derivative
def C(q):
    return 5*q**2 - 40*q + 200

def dC(q):
    return 10*q - 40

# Define quantity values
q_vals = np.linspace(0, 8, 400)
cost_vals = C(q_vals)

# Plot the cost function
plt.figure(figsize=(10, 6))
plt.plot(q_vals, cost_vals, label=r'$C(q) = 5q^2 - 40q + 200$', color='blue', linewidth=2)

# Mark the minimum point
q_min = 4
plt.scatter([q_min], [C(q_min)], color='red', s=80, zorder=5, label='Minimum Cost at $q=4$')
plt.axvline(x=q_min, linestyle='--', color='gray', alpha=0.5)

# Tangent line at q = 6
q_tangent = 6
slope = dC(q_tangent)
y_tangent = C(q_tangent)
q_range = np.linspace(q_tangent - 1, q_tangent + 1, 100)
tangent_line = slope * (q_range - q_tangent) + y_tangent
plt.plot(q_range, tangent_line, linestyle='--', color='green', linewidth=1.5, label='Tangent at $q=6$')

# Gradient descent visualization
q = 8.0
eta = 0.01
q_path = [q]
for _ in range(50):
    q = q - eta * dC(q)
    q_path.append(q)

plt.plot(q_path, C(np.array(q_path)), 'o-', label='Gradient Descent Steps', color='orange',
         markersize=4, linewidth=1.2)

plt.title('Business Cost Function and Gradient Descent')
plt.xlabel('Quantity $q$')
plt.ylabel('Cost $C(q)$')
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

print(f"Minimum cost quantity: q* = {q_path[-1]:.4f}  (analytical: 4.0)")
print(f"Minimum cost: C(q*) = {C(q_path[-1]):.4f}  (analytical: {C(4):.0f})")

Convergence Animation — GD on J(θ)=(θ3)2J(\theta) = (\theta - 3)^2

The animation below shows each gradient descent step on a simple 1D loss: the dot moves toward the minimum at θ=3\theta^* = 3, with each step size proportional to the gradient magnitude. As the dot nears the minimum, the gradient shrinks and steps become smaller — that is the automatic deceleration built into the update rule.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

# Cost function and derivative
def J(theta):
    return (theta - 3) ** 2

def dJ(theta):
    return 2 * (theta - 3)

# Hyperparameters
alpha = 0.1
theta = 0.0
steps = 50

# Record history for animation
thetas = [theta]
costs = [J(theta)]

# Perform gradient descent
for _ in range(steps):
    theta = theta - alpha * dJ(theta)
    thetas.append(theta)
    costs.append(J(theta))

# Setup figure and axis
theta_vals = np.linspace(-1, 7, 400)
cost_vals = J(theta_vals)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(theta_vals, cost_vals, 'b-', label=r'$J(\theta) = (\theta - 3)^2$')
point, = ax.plot([], [], 'ro', label='Current Step')
text = ax.text(0.05, 0.9, '', transform=ax.transAxes)

ax.set_xlim(-1, 7)
ax.set_ylim(0, max(cost_vals) + 1)
ax.set_xlabel(r'$\theta$')
ax.set_ylabel(r'$J(\theta)$')
ax.set_title('Gradient Descent Minimizing Squared Error')
ax.grid(True, alpha=0.3)
ax.legend()

def init():
    point.set_data([], [])
    text.set_text('')
    return point, text

def update(frame):
    x = [thetas[frame]]
    y = [costs[frame]]
    point.set_data(x, y)
    text.set_text(f'Step: {frame}\n\u03b8: {x[0]:.4f}\nJ(\u03b8): {y[0]:.4f}')
    return point, text

ani = FuncAnimation(fig, update, frames=len(thetas), init_func=init, blit=False, interval=300)
ani.save('gradientdescent.gif', writer=PillowWriter(fps=10))
plt.close()

from IPython.display import Image
Image(filename='gradientdescent.gif')

Convergence Animation with Direction Arrow

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

# Cost function and derivative
def J(theta):
    return (theta - 3)**2

def dJ(theta):
    return 2 * (theta - 3)

# Hyperparameters
alpha = 0.1
theta = 0.0
steps = 50

thetas = [theta]
costs = [J(theta)]

# Gradient descent
for _ in range(steps):
    theta = theta - alpha * dJ(theta)
    thetas.append(theta)
    costs.append(J(theta))

# Plot setup
theta_vals = np.linspace(-1, 7, 400)
cost_vals = J(theta_vals)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(theta_vals, cost_vals, 'b-', label=r'$J(\theta) = (\theta - 3)^2$')
point, = ax.plot([], [], 'ro', label='Current Step')
arrow = ax.annotate('', xy=(0, 0), xytext=(0, 0),
                    arrowprops=dict(arrowstyle='->', color='red', lw=2))
text = ax.text(0.05, 0.9, '', transform=ax.transAxes)

ax.set_xlim(-1, 7)
ax.set_ylim(0, max(cost_vals) + 1)
ax.set_xlabel(r'$\theta$')
ax.set_ylabel(r'$J(\theta)$')
ax.set_title('Gradient Descent Direction Toward Minimum')
ax.grid(True, alpha=0.3)
ax.legend()

def init():
    point.set_data([], [])
    arrow.set_position((0, 0))
    arrow.xy = (0, 0)
    text.set_text('')
    return point, arrow, text

def update(frame):
    x = thetas[frame]
    y = costs[frame]
    point.set_data([x], [y])
    if frame < len(thetas) - 1:
        next_x = thetas[frame + 1]
        next_y = costs[frame + 1]
        arrow.xy = (next_x, next_y)
        arrow.set_position((x, y))
    else:
        arrow.xy = (x, y)
        arrow.set_position((x, y))
    text.set_text(f'Step: {frame}\n\u03b8: {x:.4f}\nJ(\u03b8): {y:.4f}')
    return point, arrow, text

ani = FuncAnimation(fig, update, frames=len(thetas), init_func=init, blit=False, interval=300)
ani.save('gradientdescentarrow.gif', writer=PillowWriter(fps=10))
plt.close()

from IPython.display import Image
Image(filename='gradientdescentarrow.gif')

Learning Rate Effects

The learning rate α\alpha is the single most important hyperparameter for gradient descent:

α too smallslow convergence, wastes computeα too largeovershoots, divergesα well-tunedfast, stable convergence\alpha \text{ too small} \Rightarrow \text{slow convergence, wastes compute} \qquad \alpha \text{ too large} \Rightarrow \text{overshoots, diverges} \qquad \alpha \text{ well-tuned} \Rightarrow \text{fast, stable convergence}
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(theta):  return (theta - 3)**2
def dJ(theta): return 2 * (theta - 3)

theta_vals = np.linspace(-2, 9, 400)

configs = [
    (0.01,  'Too small (\u03b1=0.01)',  'steelblue'),
    (0.10,  'Well-tuned (\u03b1=0.10)', 'seagreen'),
    (0.99,  'Too large  (\u03b1=0.99)', 'tomato'),
]

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Left: paths on loss surface
axes[0].plot(theta_vals, J(theta_vals), 'k-', linewidth=2, label='Loss surface')
for alpha, label, color in configs:
    t, path = 0.0, [0.0]
    for _ in range(20):
        t = t - alpha * dJ(t)
        path.append(t)
    axes[0].plot(path, J(np.array(path)), 'o-', color=color, label=label,
                 markersize=4, linewidth=1.5, alpha=0.8)
axes[0].set_xlim(-2, 9); axes[0].set_ylim(-1, 40)
axes[0].set_xlabel(r'$\theta$'); axes[0].set_ylabel(r'$J(\theta)$')
axes[0].set_title('Paths on loss surface (20 steps)')
axes[0].legend(fontsize=8); axes[0].grid(True, alpha=0.3)

# Right: loss vs iteration
for alpha, label, color in configs:
    t, losses = 0.0, []
    for _ in range(50):
        t = t - alpha * dJ(t)
        losses.append(J(t))
    axes[1].plot(range(50), losses, color=color, label=label, linewidth=1.8)
axes[1].set_xlabel('Iteration'); axes[1].set_ylabel(r'$J(\theta)$')
axes[1].set_title('Loss curve vs iteration')
axes[1].set_ylim(-1, 50); axes[1].legend(fontsize=8); axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Why GD Scales — Normal Equations vs Gradient Descent

CriterionNormal EquationsGradient Descent
Works for non-linear modelsNoYes
Memory requirementFull dataset in RAMOne mini-batch
Time complexityO(mp2+p3)O(mp^2 + p^3)O(kBp)O(k \cdot B \cdot p) per epoch
Online / streaming updatesNoYes (SGD/mini-batch)
HyperparametersNoneLearning rate, batch size, epochs

Try It in the Browser

Mini-batch gradient descent on a 1D linear regression — in pure Python. Change the learning rate and batch size and watch convergence speed change.

Guided Practice

Which gradient descent variant uses one sample at a time for updates?

Batch gradient descentBatch GD uses the entire dataset for each gradient estimate before updating parameters.
Stochastic gradient descent (SGD)Correct. SGD updates parameters after computing the gradient on a single training sample. This makes it fast but noisy.
Mini-batch gradient descentMini-batch uses a small fixed subset (e.g. 32–256 samples), not a single sample.
Normal EquationsNormal Equations are a closed-form solution, not an iterative gradient-based method.

What is the main advantage of mini-batch gradient descent?

It guarantees zero training error in one stepNo gradient-based method guarantees zero training error in a single step.
It balances speed and stability better than full-batch or single-sample updates aloneCorrect. Mini-batches provide a noisier but faster gradient estimate than full-batch, while being more stable than single-sample SGD. It is also GPU-friendly via vectorised matrix operations.
It removes the need for a learning rateAll gradient descent variants require a learning rate. Removing it is not possible without an adaptive method like Adam.
It always uses the entire dataset in memory at onceThat describes batch GD. Mini-batching specifically avoids loading the full dataset — only B rows are needed at a time.

If the learning rate is set too large, what happens during gradient descent?

Convergence is slow but guaranteedSlow convergence is the symptom of a learning rate that is too small, not too large.
The update overshoots the minimum and the loss may oscillate or divergeCorrect. A large learning rate causes the parameter to jump past the minimum, potentially increasing loss on the next step. In the worst case the algorithm diverges entirely.
The gradient becomes zero immediatelyA large step does not make the gradient zero — it overshoots the point where the gradient is zero.
Batch size automatically increases to compensateBatch size and learning rate are independent hyperparameters. One does not automatically adjust the other.

Why is gradient descent preferred over Normal Equations for training large neural networks?

Because Normal Equations always overfit on large datasetsOverfitting is a model-complexity issue, not a property of the Normal Equations solver.
Because Normal Equations require inverting a matrix that does not exist for non-linear models, and inversion is O(p³) — infeasible at scaleCorrect. Neural networks have non-linear losses with no closed-form solution. Even if they did, O(p³) inversion with millions of parameters is computationally infeasible.
Because gradient descent always finds the global minimumGradient descent is not guaranteed to find the global minimum in non-convex settings (e.g. neural networks). It may converge to a local minimum or saddle point.
Because Normal Equations require more hyperparametersNormal Equations have no hyperparameters (no learning rate, no epochs). It is gradient descent that requires hyperparameter tuning.

Exercises

Exercise 1 — GD from scratch on a 2D loss surface

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# 2D loss: J(t0, t1) = (t0 - 2)^2 + 2*(t1 + 1)^2
# Minimum at (2, -1)

def J2(t0, t1): return (t0 - 2)**2 + 2*(t1 + 1)**2
def dJ_t0(t0, t1): return 2*(t0 - 2)
def dJ_t1(t0, t1): return 4*(t1 + 1)

# TODO:
# 1. Implement gradient descent from t0=0, t1=0 with alpha=0.1, 40 steps
# 2. Record the (t0, t1) path and loss at each step
# 3. Plot the loss surface as a contour plot
# 4. Overlay the GD path on the contour plot
# 5. Mark the starting point and the minimum

t0_grid = np.linspace(-1, 5, 100)
t1_grid = np.linspace(-4, 2, 100)
T0, T1 = np.meshgrid(t0_grid, t1_grid)
Z = J2(T0, T1)

fig, ax = plt.subplots(figsize=(7, 6))
ax.contourf(T0, T1, Z, levels=20, cmap='Blues_r', alpha=0.7)
ax.contour(T0, T1, Z, levels=20, colors='white', linewidths=0.5, alpha=0.4)
ax.scatter([2], [-1], color='gold', s=120, zorder=5, label='Minimum (2, -1)')
ax.set_xlabel(r'$\theta_0$'); ax.set_ylabel(r'$\theta_1$')
ax.set_title('2D Loss Surface — add your GD path')
ax.legend()
plt.tight_layout()
plt.show()

# Uncomment to start:
# t0, t1, alpha = 0.0, 0.0, 0.1
# path_t0, path_t1 = [t0], [t1]
# for _ in range(40):
#     t0 -= alpha * dJ_t0(t0, t1)
#     t1 -= alpha * dJ_t1(t0, t1)
#     path_t0.append(t0); path_t1.append(t1)
# ax.plot(path_t0, path_t1, 'r-o', markersize=3, label='GD path')

Exercise 2 — Compare batch vs mini-batch convergence

Generate a dataset with 500 samples and 1 feature. Implement both batch GD and mini-batch GD (batch size 32). Run each for 50 epochs and plot the loss curve side-by-side. Which converges faster? Which is smoother?

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
m = 500
X = np.random.randn(m, 1)
y = 3 * X.ravel() + 2 + np.random.randn(m) * 0.5  # true: theta0=2, theta1=3
Xb = np.hstack([np.ones((m, 1)), X])  # add bias column

# TODO:
# Batch GD:
#   theta = np.zeros(2)
#   for epoch in range(50):
#       grad = (1/m) * Xb.T @ (Xb @ theta - y)
#       theta -= 0.05 * grad
#       record MSE
#
# Mini-batch GD (batch_size=32):
#   theta = np.zeros(2)
#   for epoch in range(50):
#       shuffle indices
#       for each mini-batch: compute grad on batch, update theta
#       record MSE at end of epoch
#
# Plot both loss curves on same axes

print("Implement batch vs mini-batch GD and plot loss curves.")

For the 1D problem J(θ)=(θ3)2J(\theta) = (\theta - 3)^2, run gradient descent with five learning rates: [0.001, 0.01, 0.1, 0.5, 1.1]. Start each from θ=0\theta = 0 and run 60 steps. Plot loss vs iteration for all five on the same axes and identify which rates converge, which are slow, and which diverge.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(theta):  return (theta - 3)**2
def dJ(theta): return 2 * (theta - 3)

alphas = [0.001, 0.01, 0.1, 0.5, 1.1]
epochs = 60

# TODO:
# for alpha in alphas:
#     theta = 0.0
#     losses = []
#     for _ in range(epochs):
#         theta -= alpha * dJ(theta)
#         losses.append(J(theta))
#     plt.plot(losses, label=f'alpha={alpha}')
#
# plt.yscale('log')   # log scale makes slow/fast convergence visible
# plt.xlabel('Iteration'); plt.ylabel('Loss (log scale)')
# plt.title('Learning rate comparison'); plt.legend(); plt.grid(True, alpha=0.3)
# plt.show()

print("Implement learning rate comparison and plot.")

Common Pitfalls

Summary

Key takeaways
ConceptFormula / Rule
Update ruleθjθjαJθj\theta_j \leftarrow \theta_j - \alpha \cdot \frac{\partial J}{\partial \theta_j}
MSE gradient1mX(Xθy)\frac{1}{m}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\theta} - \mathbf{y})
Batch GDExact gradient, slow on large data, no hyperparams beyond α\alpha
SGDOne sample per step, fast, noisy, online-capable
Mini-batch GDBB samples per step, GPU-friendly, default for deep learning
Learning rate too smallSlow convergence
Learning rate too largeOvershooting, possible divergence
Why not Normal Equations at scaleO(p3)O(p^3) inversion, requires full dataset, no non-linear support

Decision rule: for small linear regression (p<10000p < 10\,000, data fits in RAM) use Normal Equations. For everything else — large data, non-linear models, online updates — use mini-batch gradient descent.

Next Up — Advanced Optimizers

You can now implement gradient descent from scratch. Next: make it faster and smarter.

The next notebook — Advanced Optimizers (Adam, RMSprop, Momentum) — shows that plain gradient descent has a fixed step size for all parameters. Adaptive optimizers maintain per-parameter learning rates, use gradient history to build momentum, and converge faster on the ill-conditioned loss surfaces common in deep learning. Dependencies: the GD update rule and learning-rate intuition from this notebook.