Advanced Optimizers

SGD + Momentum · Adagrad · RMSProp · Adam · AdamW

Chapter 6 — Optimization & Training Practicalities

Why Plain Gradient Descent Isn’t Enough

Business context. A fraud-detection model trains on 50 million transactions. Plain batch GD computes the full gradient over all 50 M rows before taking one step, roughly 40 seconds per update. With 10 000 steps to converge, that is 400 000 seconds, or about 4.6 days of compute time. Mini-batch SGD fixes throughput but introduces noisy, oscillating updates that bounce around the optimum rather than converging cleanly. Advanced optimizers address both problems: they learn fast from mini-batches and converge more stably.

Four failure modes of vanilla GD that motivate adaptive methods:

| Failure mode | Root cause | Adaptive fix |
|---|---|---|
| Slow progress in flat regions | Same $\alpha$ everywhere | Per-parameter scaling (Adagrad, RMSProp) |
| Oscillation in ravines | Gradient direction flips each step | Momentum smooths past directions |
| Manual LR tuning each project | No memory of curvature | Adam’s bias-corrected moments |
| Gradient vanishing in sparse features | Frequent zero gradients dilute updates | Adagrad rewards rare features |

The Optimizer Family — Update Rules

All optimizers share the same skeleton: compute a modified step $\Delta\theta$, then subtract it.

$$\color{#e94560}{\theta_{t+1}} = \color{#1f77b4}{\theta_t} - \color{#ff7f0e}{\Delta\theta_t}$$
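
The skeleton can be sketched as a loop in which only the construction of the step is pluggable. This is a minimal illustration, not code from the notebook: the names `minimise`, `plain_gd_delta`, and `momentum_delta` are hypothetical helpers, and the momentum variant anticipates the SGD + Momentum rule defined in the next subsection.

```python
def minimise(grad_fn, delta_fn, theta0, n_steps):
    """Generic loop: theta_{t+1} = theta_t - delta_theta_t."""
    theta, state = theta0, {}
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        delta, state = delta_fn(g, state, t)   # optimizer-specific part
        theta = theta - delta                  # shared part
    return theta

def plain_gd_delta(g, state, t, alpha=0.1):
    """Vanilla gradient descent: the step is the scaled gradient."""
    return alpha * g, state

def momentum_delta(g, state, t, alpha=0.1, beta=0.9):
    """SGD + Momentum: the step is alpha times an EMA of past gradients."""
    v = beta * state.get("v", 0.0) + (1 - beta) * g
    return alpha * v, {"v": v}

def grad_fn(theta):
    """Gradient of the toy loss (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

print("plain GD :", minimise(grad_fn, plain_gd_delta, 0.0, 200))
print("momentum :", minimise(grad_fn, momentum_delta, 0.0, 200))
```

Keeping the optimizer state in a small dictionary is one possible design; it lets each rule carry whatever memory it needs without changing the loop.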

What differs is how $\Delta\theta$ is constructed:


SGD + Momentum

$$\color{#2ca02c}{v_t} = \beta\,\color{#2ca02c}{v_{t-1}} + (1-\beta)\,g_t \qquad \theta_{t+1} = \theta_t - \alpha\,\color{#2ca02c}{v_t}$$

$v_t$ is an exponential moving average of past gradients. With $\beta = 0.9$, the last gradient contributes 10 %, but directions that have been consistent for many steps build up velocity.
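
To see why consistent directions build velocity while oscillating ones cancel, the sketch below feeds two synthetic gradient sequences through the update rule above. The helper name `ema_velocity` and both sequences are illustrative assumptions, not part of the notebook.

```python
def ema_velocity(grads, beta=0.9):
    """Velocity after each gradient, using v_t = beta * v_{t-1} + (1 - beta) * g_t."""
    v, history = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g
        history.append(v)
    return history

# Gradient always points the same way.
steady = [1.0] * 50
# Gradient sign flips every step, as in a narrow ravine.
flipping = [1.0 if t % 2 == 0 else -1.0 for t in range(50)]

v_steady = ema_velocity(steady)
v_flipping = ema_velocity(flipping)

print(f"steady:   v_1 = {v_steady[0]:.3f}, v_50 = {v_steady[-1]:.3f}")
print(f"flipping: largest |v_t| = {max(abs(v) for v in v_flipping):.3f}")
```

With a steady gradient the velocity climbs from 10 % of the gradient toward its full value; with a flipping gradient it never exceeds that first 10 %.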


Adagrad

$$\color{#9467bd}{G_t} = \color{#9467bd}{G_{t-1}} + g_t^2 \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\color{#9467bd}{G_t}} + \varepsilon}\,g_t$$

$G_t$ accumulates all squared gradients, so dimensions with frequent large gradients get a smaller effective learning rate, while rarely updated parameters keep a larger one. Drawback: $G_t$ only grows, so the effective rate decays toward zero over time.
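
The decay can be made concrete with a constant gradient, for which $G_t = t\,g^2$ and the multiplier $\alpha / (\sqrt{G_t} + \varepsilon)$ shrinks like $1/\sqrt{t}$. The sketch below is illustrative only; the function name `adagrad_effective_lr` and the gradient value 2.0 are assumptions.

```python
import math

def adagrad_effective_lr(grads, alpha=0.5, eps=1e-8):
    """Multiplier alpha / (sqrt(G_t) + eps) applied to each incoming gradient."""
    G, rates = 0.0, []
    for g in grads:
        G += g ** 2                      # accumulator never shrinks
        rates.append(alpha / (math.sqrt(G) + eps))
    return rates

adagrad_rates = adagrad_effective_lr([2.0] * 400)
for t in (1, 4, 100, 400):
    print(f"t = {t:>3}: effective LR = {adagrad_rates[t - 1]:.4f}")
```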


RMSProp

$$\color{#8c564b}{s_t} = \rho\,\color{#8c564b}{s_{t-1}} + (1-\rho)\,g_t^2 \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\color{#8c564b}{s_t}} + \varepsilon}\,g_t$$

Replaces Adagrad’s cumulative sum with a decaying average (decay $\rho \approx 0.9$). This prevents the learning rate from collapsing to zero.
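
Under the same constant-gradient assumption, the decaying average settles at $g^2$, so the multiplier levels off near $\alpha / |g|$ instead of shrinking forever. A minimal sketch, with the hypothetical helper `rmsprop_effective_lr` and an illustrative gradient value of 2.0:

```python
import math

def rmsprop_effective_lr(grads, alpha=0.1, rho=0.9, eps=1e-8):
    """Multiplier alpha / (sqrt(s_t) + eps) applied to each incoming gradient."""
    s, rates = 0.0, []
    for g in grads:
        s = rho * s + (1 - rho) * g ** 2   # old history is forgotten geometrically
        rates.append(alpha / (math.sqrt(s) + eps))
    return rates

rmsprop_rates = rmsprop_effective_lr([2.0] * 400)
print(f"first step : {rmsprop_rates[0]:.4f}")
print(f"step 400   : {rmsprop_rates[-1]:.4f}   (alpha / |g| = {0.1 / 2.0:.4f})")
```

The first few multipliers are larger than the plateau because $s_t$ starts at zero; this is the cold-start bias that Adam corrects explicitly.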


Adam (Adaptive Moment Estimation)

$$\color{#2ca02c}{m_t} = \beta_1\,m_{t-1} + (1-\beta_1)\,g_t \quad\text{(1st moment: momentum)}$$

$$\color{#9467bd}{v_t} = \beta_2\,v_{t-1} + (1-\beta_2)\,g_t^2 \quad\text{(2nd moment: RMSProp)}$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t} \quad\hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad\text{(bias correction for cold start)}$$

$$\theta_{t+1} = \theta_t - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$

Defaults: $\beta_1=0.9$, $\beta_2=0.999$, $\varepsilon=10^{-8}$, $\alpha=10^{-3}$. Adam is the standard starting point for most deep learning and gradient-based ML work.
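
The effect of bias correction is easiest to see on the first few steps. The sketch below runs the update rule with and without the correction on a constant gradient; the function `adam_step_sizes`, its `correct_bias` switch, and the gradient value 4.0 are assumptions made for illustration. With correction, $\hat{m}_t = g$ and $\hat{v}_t = g^2$, so each step has magnitude close to $\alpha$. Without it, the first step is scaled by $(1-\beta_1)/\sqrt{1-\beta_2} \approx 3.16$ under the default settings.

```python
import math

def adam_step_sizes(grads, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8, correct_bias=True):
    """Parameter change produced by each Adam step for a gradient sequence."""
    m = v = 0.0
    deltas = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        if correct_bias:
            m_hat = m / (1 - b1 ** t)
            v_hat = v / (1 - b2 ** t)
        else:
            m_hat, v_hat = m, v            # raw moments, still biased toward zero
        deltas.append(-alpha * m_hat / (math.sqrt(v_hat) + eps))
    return deltas

constant_grads = [4.0] * 5                 # illustrative values
with_corr = adam_step_sizes(constant_grads, correct_bias=True)
without_corr = adam_step_sizes(constant_grads, correct_bias=False)

for t, (a, b) in enumerate(zip(with_corr, without_corr), start=1):
    print(f"t = {t}: corrected = {a:+.6f}   uncorrected = {b:+.6f}")
```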

Optimizer Family Tree

Each node is a modification of the node(s) above it. Adam inherits both the direction memory of Momentum and the per-parameter scaling of RMSProp.

NumPy Implementations from Scratch

The cell below implements all four optimizers on a simple 1D quadratic loss $J(\theta) = (\theta - 3)^2$ so you can see the exact update equations in code before testing on a harder surface.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(theta):   return (theta - 3.0) ** 2
def dJ(theta):  return 2.0 * (theta - 3.0)

def run_sgd_momentum(theta0=0.0, alpha=0.1, beta=0.9, n=60):
    theta, v, path = theta0, 0.0, [theta0]
    for _ in range(n):
        g = dJ(theta)
        v = beta * v + (1 - beta) * g
        theta -= alpha * v
        path.append(theta)
    return path

def run_adagrad(theta0=0.0, alpha=0.5, eps=1e-8, n=60):
    theta, G, path = theta0, 0.0, [theta0]
    for _ in range(n):
        g = dJ(theta)
        G += g ** 2
        theta -= alpha / (np.sqrt(G) + eps) * g
        path.append(theta)
    return path

def run_rmsprop(theta0=0.0, alpha=0.1, rho=0.9, eps=1e-8, n=60):
    theta, s, path = theta0, 0.0, [theta0]
    for _ in range(n):
        g = dJ(theta)
        s = rho * s + (1 - rho) * g ** 2
        theta -= alpha / (np.sqrt(s) + eps) * g
        path.append(theta)
    return path

def run_adam(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, n=60):
    theta, m, v, path = theta0, 0.0, 0.0, [theta0]
    for t in range(1, n + 1):
        g = dJ(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta)
    return path

n_steps = 60
paths = {
    'SGD + Momentum (α=0.10, β=0.9)': run_sgd_momentum(n=n_steps),
    'Adagrad          (α=0.50)':        run_adagrad(n=n_steps),
    'RMSProp          (α=0.10, ρ=0.9)': run_rmsprop(n=n_steps),
    'Adam             (α=0.30)':        run_adam(n=n_steps),
}
colors = ['steelblue', 'darkorange', 'seagreen', 'tomato']

fig, axes = plt.subplots(1, 2, figsize=(13, 4))
steps = np.arange(n_steps + 1)
theta_range = np.linspace(-0.5, 6.5, 300)

# Left: loss vs iteration
ax = axes[0]
for (label, path), color in zip(paths.items(), colors):
    ax.plot(steps, [J(t) for t in path], color=color, lw=2, label=label)
ax.set_xlabel('Iteration')
ax.set_ylabel('J(θ) = (θ − 3)²')
ax.set_title('Convergence on 1D Quadratic')
ax.legend(fontsize=8, loc='upper right')
ax.set_ylim(-0.1, 10)
ax.grid(alpha=0.3)

# Right: parameter trajectory on loss surface
ax2 = axes[1]
ax2.plot(theta_range, J(theta_range), 'k--', lw=1.5, alpha=0.4, label='J(θ)')
for (label, path), color in zip(paths.items(), colors):
    short = label.split('(')[0].strip()
    ax2.plot(path, [J(t) for t in path], '-o', color=color, ms=3, lw=1.5, label=short)
    ax2.plot(path[-1], J(path[-1]), 'D', color=color, ms=7)
ax2.axvline(3.0, color='gray', lw=1, linestyle=':')
ax2.set_xlabel('θ')
ax2.set_ylabel('J(θ)')
ax2.set_title('Parameter Trajectory')
ax2.legend(fontsize=8)
ax2.grid(alpha=0.3)

plt.suptitle('Optimizer Comparison — 1D Quadratic Loss', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

Optimizers on the Ackley Function — Non-Convex Surface

The 1D quadratic is too easy: any optimizer converges. The Ackley function is a standard non-convex benchmark with many local minima and a narrow path to the global minimum at $(0, 0)$:

$$f(x,y) = -20\exp\!\left(-0.2\sqrt{0.5(x^2+y^2)}\right) - \exp\!\left(0.5\bigl(\cos 2\pi x + \cos 2\pi y\bigr)\right) + 20 + e$$
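
Differentiating this expression gives, with $r = \sqrt{0.5(x^2+y^2)}$, the partial derivative $\partial f/\partial x = 2x\,e^{-0.2r}/r + \pi \sin(2\pi x)\exp\bigl(0.5(\cos 2\pi x + \cos 2\pi y)\bigr)$, and the same form in $y$. The sketch below checks that derivation against central finite differences. The names `ackley`, `ackley_grad`, and `numeric_grad` and the three test points are illustrative assumptions.

```python
import math

def ackley(x, y):
    r = math.sqrt(0.5 * (x * x + y * y))
    waves = math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)
    return -20.0 * math.exp(-0.2 * r) - math.exp(0.5 * waves) + 20.0 + math.e

def ackley_grad(x, y):
    """Analytic gradient; not defined at the origin, where the surface has a cusp."""
    r = math.sqrt(0.5 * (x * x + y * y))
    waves = math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)
    radial = 2.0 * math.exp(-0.2 * r) / r          # from the first exponential
    ripple = math.pi * math.exp(0.5 * waves)       # from the cosine exponential
    return (radial * x + ripple * math.sin(2 * math.pi * x),
            radial * y + ripple * math.sin(2 * math.pi * y))

def numeric_grad(f, x, y, h=1e-6):
    """Central finite differences in each coordinate."""
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))

check_points = [(2.0, 2.0), (0.3, -1.7), (-2.5, 0.8)]
for point in check_points:
    analytic = ackley_grad(*point)
    numeric = numeric_grad(ackley, *point)
    err = max(abs(a - n) for a, n in zip(analytic, numeric))
    print(point, f"max abs difference = {err:.2e}")
```

A check like this is cheap insurance whenever a hand-written gradient is paired with a benchmark function.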

The implementations below are preserved from the original notebook — they implement each optimizer correctly and run them from the same starting point so trajectories are directly comparable.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

# ── Ackley function and its gradient ──────────────────────────────────────────
def objective_function(x, y):
    term1 = -20 * np.exp(-0.2 * np.sqrt(0.5 * (x**2 + y**2)))
    term2 = -np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    return term1 + term2 + 20 + np.e

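def gradient(x, y):
    # Analytic gradient of the Ackley function defined above.
    r = np.sqrt(0.5 * (x**2 + y**2))
    r = np.maximum(r, 1e-12)  # the surface has a cusp at the origin; avoid 0/0
    radial = 2.0 * np.exp(-0.2 * r) / r
    ripple = np.pi * np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    dx = radial * x + ripple * np.sin(2 * np.pi * x)
    dy = radial * y + ripple * np.sin(2 * np.pi * y)
    return np.array([dx, dy])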

def clip_point(pt, bounds=(-3, 3)):
    return np.clip(pt, bounds[0], bounds[1])

# ── Optimizer implementations ─────────────────────────────────────────────────
def sgd_momentum(start, lr=0.05, n=200, beta=0.9, tol=0.01):
    pt, vel = np.array(start, float), np.zeros(2)
    hist = [pt.copy()]
    for i in range(n):
        g = gradient(pt[0], pt[1])
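        # Classical (heavy-ball) form: lr is folded into the velocity, so for the
        # same lr the steady-state step is 1 / (1 - beta) times the EMA form's.
        vel = beta * vel - lr * g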
        pt = clip_point(pt + vel)
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

def adagrad(start, lr=0.05, n=200, eps=1e-8, tol=0.01):
    pt = np.array(start, float)
    G = np.zeros(2)
    hist = [pt.copy()]
    for i in range(n):
        g = gradient(pt[0], pt[1])
        G += g ** 2
        pt = clip_point(pt - lr / (np.sqrt(G) + eps) * g)
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

def rmsprop(start, lr=0.05, n=200, rho=0.9, eps=1e-8, tol=0.01):
    pt = np.array(start, float)
    s = np.zeros(2)
    hist = [pt.copy()]
    for i in range(n):
        g = gradient(pt[0], pt[1])
        s = rho * s + (1 - rho) * g ** 2
        pt = clip_point(pt - lr / (np.sqrt(s) + eps) * g)
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

def adam(start, lr=0.05, n=200, b1=0.9, b2=0.999, eps=1e-8, tol=0.01):
    pt = np.array(start, float)
    m, v = np.zeros(2), np.zeros(2)
    hist = [pt.copy()]
    for t in range(1, n + 1):
        g = gradient(pt[0], pt[1])
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        pt = clip_point(pt - lr * m_hat / (np.sqrt(v_hat) + eps))
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

# ── Run all optimizers from the same start ────────────────────────────────────
START = [2.0, 2.0]
paths_2d = {
    'SGD+Momentum': (sgd_momentum(START), 'tomato'),
    'Adagrad':      (adagrad(START),      'steelblue'),
    'RMSProp':      (rmsprop(START),      'seagreen'),
    'Adam':         (adam(START),         'gold'),
}

# ── Plot: contour + 2D trajectories ──────────────────────────────────────────
res = 120
xs = np.linspace(-3, 3, res)
X, Y = np.meshgrid(xs, xs)
Z = objective_function(X, Y)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Contour trajectories
ax = axes[0]
ax.contourf(X, Y, Z, levels=25, cmap='viridis', alpha=0.75)
ax.contour(X, Y, Z, levels=25, colors='white', linewidths=0.3, alpha=0.4)
for name, (path, color) in paths_2d.items():
    ax.plot(path[:, 0], path[:, 1], '-o', color=color, ms=2, lw=2, label=f'{name} ({len(path)-1} steps)')
    ax.plot(path[-1, 0], path[-1, 1], 'D', color=color, ms=8)
ax.plot(0, 0, 'w*', ms=14, label='Global min (0,0)')
ax.plot(*START, 'ws', ms=10, label='Start (2,2)')
ax.set_title('Trajectories on Ackley Function')
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.legend(fontsize=8, loc='upper left')

# Loss vs steps
ax2 = axes[1]
for name, (path, color) in paths_2d.items():
    losses = [objective_function(pt[0], pt[1]) for pt in path]
    ax2.plot(losses, color=color, lw=2, label=name)
ax2.set_xlabel('Step'); ax2.set_ylabel('f(x, y)')
ax2.set_title('Loss vs Steps — Ackley')
ax2.legend(fontsize=9)
ax2.grid(alpha=0.3)

plt.suptitle('Optimizer Trajectories on Non-Convex Ackley Surface', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print("Final positions and function values:")
for name, (path, _) in paths_2d.items():
    final = path[-1]
    fval  = objective_function(final[0], final[1])
    print(f"  {name:<16}: x=({final[0]:+.3f}, {final[1]:+.3f})  f={fval:.4f}  steps={len(path)-1}")

Try It in the Browser — Adam vs SGD on a 1D Loss

Edit $\beta_1$, $\beta_2$, and $\alpha$ in the cell below and observe how quickly each optimizer reaches $\theta^* = 3$.
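
A minimal, non-interactive sketch of such a comparison follows. It counts the steps each optimizer needs before $\theta$ first lands within a tolerance of 3. The function names, the tolerance `tol=1e-3`, and the step cap `max_steps=2000` are assumptions chosen for illustration; "first lands within tolerance" is a simple criterion and does not guarantee the iterate stays there.

```python
import math

def grad(theta):
    """Gradient of J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def steps_to_converge_sgd(alpha, theta0=0.0, tol=1e-3, max_steps=2000):
    theta = theta0
    for step in range(1, max_steps + 1):
        theta -= alpha * grad(theta)
        if abs(theta - 3.0) < tol:
            return step
    return None                            # tolerance not reached within max_steps

def steps_to_converge_adam(alpha, beta1=0.9, beta2=0.999, eps=1e-8,
                           theta0=0.0, tol=1e-3, max_steps=2000):
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, max_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
        if abs(theta - 3.0) < tol:
            return t
    return None

# Edit these three values and re-run.
alpha, beta1, beta2 = 0.1, 0.9, 0.999
print("SGD  steps:", steps_to_converge_sgd(alpha))
print("Adam steps:", steps_to_converge_adam(alpha, beta1, beta2))
```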

Choosing an Optimizer in Practice

| Scenario | Recommended optimizer | Reason |
|---|---|---|
| Deep learning, first attempt | Adam | Robust defaults, fast cold start |
| NLP / text, sparse features | Adagrad or Adam | Rare token gradients need boosted LR |
| Fine-tuning pre-trained model | AdamW | Decoupled weight decay regularises the fine-tuned weights and limits overfitting |
| Large-batch convex sklearn model | SGD + Momentum | Lower memory, comparable convergence on convex losses |
| Scikit-learn SGDClassifier | Plain SGD (`SGDClassifier` has no momentum parameter) | Direct API mapping |
| Tight compute budget | RMSProp | Slightly less memory than Adam (no 1st moment) |

Guided Practice

What problem does the bias-correction step in Adam solve?

- The moment estimates are initialised at zero and are biased toward zero in early steps
  Correct. Dividing by $(1 - \beta^t)$ corrects the initialisation bias so the early moment estimates are not artificially small.
- It prevents the learning rate from growing unboundedly
  The denominator $\sqrt{\hat{v}_t}$ handles that. Bias correction is specifically for the cold-start issue.
- It adds L2 regularisation to the parameters
  That is AdamW's weight-decay term, not bias correction.
- It normalises gradients to unit length
  Gradient clipping normalises magnitude; bias correction normalises moment estimates.

Why does Adagrad's effective learning rate eventually collapse to near zero?

- $G_t$ only ever increases because it accumulates all squared gradients without decay
  Correct. After many steps, $\sqrt{G_t}$ is so large the effective LR $\alpha / \sqrt{G_t}$ is negligible. RMSProp fixes this with an exponential decay.
- Adagrad uses a smaller base learning rate than other optimizers
  The base $\alpha$ can be set to any value; the collapse is structural.
- It applies momentum in the wrong direction
  Adagrad does not use momentum at all.
- It requires gradients to be normalised before use
  No normalisation step is required.

Which component of Adam is inherited from SGD with Momentum?

- The per-parameter learning rate scaling $\hat{v}_t$
  $\hat{v}_t$ is the second moment; it comes from RMSProp.
- The first moment $\hat{m}_t$ (exponential moving average of gradients)
  Correct. $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ is precisely the momentum term used in SGD+Momentum.
- The bias-correction denominators $(1-\beta^t)$
  Bias correction is unique to Adam; it addresses the cold-start problem of initialising moments at zero.
- The $\varepsilon$ numerical stability constant
  $\varepsilon$ prevents division by zero in the denominator; it is not the momentum component.

A scikit-learn SGDClassifier on a sparse NLP feature matrix is converging slowly. Which change is most likely to help?

- Switch to Newton's method (second-order)
  Newton's method requires computing and inverting the Hessian, which is infeasible at NLP scale.
- Remove all regularisation terms
  Removing regularisation risks overfitting; it does not address slow convergence on sparse data.
- Use an adaptive optimizer (Adam / Adagrad) that gives larger updates to infrequent features
  Correct. Sparse NLP features have many near-zero gradients. Adaptive per-parameter rates give rare tokens the larger updates they need to influence the model.
- Increase batch size to reduce gradient noise
  Larger batches reduce noise but do not fix the core problem of uniform LR across dense and sparse features.

Exercises

Exercise 1 — Implement AdamW

AdamW adds decoupled weight decay: instead of folding $\lambda\theta$ into the gradient before computing moments, it subtracts it directly from the parameter after the Adam step:

$$\theta_{t+1} = \theta_t - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon} - \alpha\lambda\theta_t$$

Implement run_adamw below and compare its trajectory to Adam’s on the 1D quadratic with $\lambda = 0.01$.
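
For contrast, the coupled alternative (L2 regularisation folded into the gradient) can be written with the same symbols. This is a sketch for comparison only; the symbol $\tilde{g}_t$ is introduced here for illustration and is not used elsewhere in the chapter.

$$\tilde{g}_t = g_t + \lambda\theta_t \qquad m_t = \beta_1\,m_{t-1} + (1-\beta_1)\,\tilde{g}_t \qquad v_t = \beta_2\,v_{t-1} + (1-\beta_2)\,\tilde{g}_t^2$$

In the coupled form the decay term passes through $m_t$ and $v_t$, so it is rescaled by $1/(\sqrt{\hat{v}_t}+\varepsilon)$ and parameters with a large gradient history are decayed less. In the decoupled rule, the update can be regrouped as $\theta_{t+1} = (1-\alpha\lambda)\,\theta_t - \alpha\,\hat{m}_t/(\sqrt{\hat{v}_t}+\varepsilon)$, so every parameter shrinks by the same factor $(1-\alpha\lambda)$ per step regardless of its gradient history.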

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(theta):  return (theta - 3.0) ** 2
def dJ(theta): return 2.0 * (theta - 3.0)

def run_adamw(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, lam=0.01, n=60):
    theta, m, v = theta0, 0.0, 0.0
    path = [theta]
    for t in range(1, n + 1):
        g = dJ(theta)
        # TODO: compute m, v, bias-corrected m_hat, v_hat
        # TODO: Adam step
        # TODO: weight decay step (subtract alpha * lam * theta)
        path.append(theta)
    return path

# Compare with Adam from the implementations above
def run_adam(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, n=60):
    theta, m, v = theta0, 0.0, 0.0
    path = [theta]
    for t in range(1, n + 1):
        g = dJ(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta)
    return path

adam_path  = run_adam()
adamw_path = run_adamw()

steps = np.arange(61)
plt.figure(figsize=(7, 4))
plt.plot(steps, [J(t) for t in adam_path],  lw=2, label='Adam')
plt.plot(steps, [J(t) for t in adamw_path], lw=2, linestyle='--', label='AdamW (λ=0.01)')
plt.xlabel('Step'); plt.ylabel('J(θ)'); plt.title('Adam vs AdamW')
plt.legend(); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()

Exercise 2 — Momentum Sensitivity

Run SGD+Momentum on $J(\theta) = (\theta-3)^2$ starting at $\theta_0=0$ with $\alpha=0.3$ and vary $\beta \in \{0.0, 0.5, 0.9, 0.99\}$. Plot the loss trajectory for each value. What happens as $\beta \to 1$?

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(t):  return (t - 3.0) ** 2
def dJ(t): return 2.0 * (t - 3.0)

betas  = [0.0, 0.5, 0.9, 0.99]
colors = ['steelblue', 'seagreen', 'darkorange', 'tomato']
alpha  = 0.3
n_steps = 60

plt.figure(figsize=(8, 4))
for beta, color in zip(betas, colors):
    # TODO: implement SGD+Momentum with this beta value
    # hint: v = beta*v + (1-beta)*g  then  theta -= alpha*v
    losses = [J(0.0)]  # replace with actual trajectory
    plt.plot(losses, color=color, lw=2, label=f'β = {beta}')

plt.xlabel('Step'); plt.ylabel('J(θ)')
plt.title('Momentum Sensitivity (α=0.3)')
plt.legend(); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()

Exercise 3 — RMSProp vs Adagrad on a Sparse Gradient Signal

Simulate a sparse gradient sequence: gradients are zero 90 % of the time and equal to 5 on non-zero steps. Run both Adagrad and RMSProp for 200 steps on $J(\theta) = \theta^2$ (minimum at 0). Show how Adagrad’s effective learning rate collapses while RMSProp’s stays active.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 200
# Sparse gradient: 10% of steps have gradient = 5, rest = 0
sparse_grads = np.where(np.random.rand(n) < 0.1, 5.0, 0.0)

alpha = 0.5
eps   = 1e-8
rho   = 0.9

# TODO: run Adagrad and RMSProp using the sparse_grads array
# Record effective LR (alpha / (sqrt(accumulator) + eps), as in the update rules) at each step
adagrad_eff_lr  = np.ones(n)  # replace
rmsprop_eff_lr  = np.ones(n)  # replace

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(sparse_grads, alpha=0.5, label='gradient signal')
axes[0].set_title('Sparse Gradient Signal')
axes[0].set_xlabel('Step'); axes[0].legend()

axes[1].plot(adagrad_eff_lr,  lw=2, label='Adagrad eff LR')
axes[1].plot(rmsprop_eff_lr,  lw=2, label='RMSProp eff LR')
axes[1].set_title('Effective Learning Rate Over Time')
axes[1].set_xlabel('Step'); axes[1].set_ylabel('α / (√acc + ε)')
axes[1].legend(); axes[1].grid(alpha=0.3)

plt.tight_layout(); plt.show()

Common Pitfalls

Summary

Key takeaways
| Optimizer | Update formula (simplified) | Key property |
|---|---|---|
| SGD + Momentum | $\theta \leftarrow \theta - \alpha v_t$ | Smooths oscillation; needs LR tuning |
| Adagrad | $\theta \leftarrow \theta - \frac{\alpha}{\sqrt{G_t}}g_t$ | Great for sparse features; LR collapses |
| RMSProp | $\theta \leftarrow \theta - \frac{\alpha}{\sqrt{s_t}}g_t$ | Fixes Adagrad collapse via decay $\rho$ |
| Adam | $\theta \leftarrow \theta - \alpha\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon}$ | Momentum + RMSProp + bias correction |
| AdamW | Adam + $-\alpha\lambda\theta_t$ | Best for regularised training |

Rule of thumb: start with Adam at $\alpha=10^{-3}$. Switch to AdamW when you need weight decay. Fall back to SGD+Momentum for large-batch convex problems where final test accuracy matters more than convergence speed.

Next Up — Learning Rate Schedules

You now know how each optimizer adapts its step size internally. The next notebook shows how to adapt the base learning rate $\alpha$ itself over training — warm-up, step decay, cosine annealing, and cyclic schedules that squeeze out the last few percent of accuracy.