Advanced Optimizers¶

SGD + Momentum · Adagrad · RMSProp · Adam · AdamW

Chapter 6 — Optimization & Training Practicalities

Why Plain Gradient Descent Isn’t Enough¶

Business context. A fraud-detection model trains on 50 million transactions. Plain batch GD computes the full gradient over all 50 M rows before taking one step, roughly 40 seconds per update. With 10 000 steps to converge, that is 400 000 seconds, or more than four and a half days of compute. Mini-batch SGD fixes throughput but introduces noisy, oscillating updates that bounce around the optimum rather than converging cleanly. Advanced optimizers address both problems: they learn quickly from mini-batches and converge more stably.

Four failure modes of vanilla GD that motivate adaptive methods:

Failure mode	Root cause	Adaptive fix
Slow progress in flat regions	Same $\alpha$ everywhere	Per-parameter scaling (Adagrad, RMSProp)
Oscillation in ravines	Gradient direction flips each step	Momentum smooths past directions
Manual LR tuning each project	No memory of curvature	Adam’s bias-corrected moments
Gradient vanishing in sparse features	Frequent zero gradients dilute updates	Adagrad rewards rare features

The Optimizer Family — Update Rules¶

All optimizers share the same skeleton: compute a modified step $\Delta\theta$, then subtract it.

\color{#e94560}{\theta_{t+1}} = \color{#1f77b4}{\theta_t} - \color{#ff7f0e}{\Delta\theta_t}

(1)

What differs is how $\Delta\theta$ is constructed:

SGD + Momentum¶

\color{#2ca02c}{v_t} = \beta\,\color{#2ca02c}{v_{t-1}} + (1-\beta)\,g_t \qquad \theta_{t+1} = \theta_t - \alpha\,\color{#2ca02c}{v_t}

(2)

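$v_t$ is an exponential moving average of past gradients. With $\beta = 0.9$, the most recent gradient contributes 10 % and a gradient from $k$ steps back contributes $(1-\beta)\beta^k$, so directions that stay consistent over many steps build up velocity while alternating directions largely cancel.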
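
To make the weighting concrete, the sketch below unrolls Eq. (2) for a gradient that keeps the same sign and for one that flips sign every step. The helper name `ema_velocity` and the two toy gradient sequences are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def ema_velocity(grads, beta=0.9):
    """Return the velocity v_t of Eq. (2) after each gradient in `grads`."""
    v, out = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g
        out.append(v)
    return np.array(out)

beta = 0.9
# Weight that Eq. (2) places on the gradient from k steps back: (1 - beta) * beta**k
weights = (1 - beta) * beta ** np.arange(5)
print("weights on the last 5 gradients:", np.round(weights, 4))

consistent = ema_velocity(np.ones(50), beta)             # same direction every step
flipping = ema_velocity((-1.0) ** np.arange(50), beta)   # direction flips every step
print("velocity after 50 consistent steps:", round(float(consistent[-1]), 3))
print("velocity after 50 flipping steps:  ", round(float(flipping[-1]), 3))
```

With a consistent gradient the velocity approaches the gradient itself, while the flipping sequence settles at a magnitude of roughly $(1-\beta)/(1+\beta)$, which is the damping that helps in ravines.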

Adagrad¶

\color{#9467bd}{G_t} = \color{#9467bd}{G_{t-1}} + g_t^2 \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\color{#9467bd}{G_t}} + \varepsilon}\,g_t

(3)

$G_t$ accumulates all squared gradients — dimensions with frequent large gradients get a smaller effective learning rate. Rarely updated parameters keep a larger rate. Drawback: $G_t$ only grows, so the effective rate decays to near zero over time.
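
The decay is easiest to see with a constant gradient, where $G_t = t\,g^2$ and the effective rate becomes $\alpha/(\sqrt{t}\,|g|)$. A minimal sketch, assuming a hypothetical helper `adagrad_effective_lr` and a constant gradient of 2:

```python
import numpy as np

def adagrad_effective_lr(grads, alpha=0.5, eps=1e-8):
    """Effective rate alpha / (sqrt(G_t) + eps) from Eq. (3) at every step."""
    G = np.cumsum(np.asarray(grads, dtype=float) ** 2)  # G_t = running sum of g^2
    return alpha / (np.sqrt(G) + eps)

eff = adagrad_effective_lr(np.full(10_000, 2.0))
for t in (1, 100, 10_000):
    print(f"step {t:>6}: effective LR = {eff[t - 1]:.5f}")
```

Each 100-fold increase in the step count cuts the effective rate by a factor of 10, and nothing in the update ever raises it again.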

RMSProp¶

\color{#8c564b}{s_t} = \rho\,\color{#8c564b}{s_{t-1}} + (1-\rho)\,g_t^2 \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\color{#8c564b}{s_t}} + \varepsilon}\,g_t

(4)

RMSProp replaces Adagrad’s cumulative sum with a decaying average (decay $\rho \approx 0.9$). Because old squared gradients are forgotten, the effective learning rate no longer collapses to zero.
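
With a constant gradient, the decaying average $s_t$ converges to $g^2$ instead of growing, so the effective rate levels off near $\alpha/|g|$. A sketch under stated assumptions (hypothetical helper `rmsprop_effective_lr`, constant gradient of 2):

```python
import numpy as np

def rmsprop_effective_lr(grads, alpha=0.1, rho=0.9, eps=1e-8):
    """Effective rate alpha / (sqrt(s_t) + eps) from Eq. (4) at every step."""
    s, out = 0.0, []
    for g in grads:
        s = rho * s + (1 - rho) * g ** 2
        out.append(alpha / (np.sqrt(s) + eps))
    return np.array(out)

eff = rmsprop_effective_lr(np.full(10_000, 2.0))
for t in (1, 100, 10_000):
    print(f"step {t:>6}: effective LR = {eff[t - 1]:.5f}")
```

The first few values are larger than the plateau because $s_0 = 0$ underestimates the squared gradient; that cold-start effect is what Adam's bias correction targets.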

Adam (Adaptive Moment Estimation)¶

\color{#2ca02c}{m_t} = \beta_1\,m_{t-1} + (1-\beta_1)\,g_t \quad\text{(1st moment — momentum)}

(5)

\color{#9467bd}{v_t} = \beta_2\,v_{t-1} + (1-\beta_2)\,g_t^2 \quad\text{(2nd moment — RMSProp)}

(6)

\hat{m}_t = \frac{m_t}{1-\beta_1^t} \quad\hat{v}_t = \frac{v_t}{1-\beta_2^t} \quad\text{(bias correction for cold start)}

(7)

\theta_{t+1} = \theta_t - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}

(8)

Defaults: $\beta_1=0.9$, $\beta_2=0.999$, $\varepsilon=10^{-8}$, $\alpha=10^{-3}$. Adam is the standard starting point for most deep learning and gradient-based ML work.
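
A quick check of what bias correction does at $t = 1$: with zero-initialised moments, $m_1 = (1-\beta_1)g_1$ and $v_1 = (1-\beta_2)g_1^2$, so the corrected estimates recover $g_1$ and $g_1^2$, and the first step has magnitude close to $\alpha$ whatever the gradient scale. The helper `adam_first_step` is a hypothetical name used only for this illustration.

```python
import numpy as np

def adam_first_step(g, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8, correct=True):
    """Size of the very first Adam update (t = 1) for gradient g, Eqs. (5)-(8)."""
    m = (1 - b1) * g          # m_1 with m_0 = 0
    v = (1 - b2) * g ** 2     # v_1 with v_0 = 0
    if correct:
        m = m / (1 - b1)      # bias-corrected first moment  -> g
        v = v / (1 - b2)      # bias-corrected second moment -> g^2
    return alpha * m / (np.sqrt(v) + eps)

for g in (0.01, 1.0, 100.0):
    with_bc = adam_first_step(g, correct=True)
    without_bc = adam_first_step(g, correct=False)
    print(f"g={g:>6}: corrected step={with_bc:.6f}  uncorrected step={without_bc:.6f}")
```

Without the correction, the first step comes out near $3.16\,\alpha$ with the default betas, because $v_1$ is shrunk toward zero much more strongly than $m_1$.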

Optimizer Family Tree¶

Each node is a modification of the node(s) above it. Adam inherits both the direction memory of Momentum and the per-parameter scaling of RMSProp.

NumPy Implementations from Scratch¶

The cell below implements all four optimizers on a simple 1D quadratic loss $J(\theta) = (\theta - 3)^2$ so you can see the exact update equations in code before testing on a harder surface.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(theta):   return (theta - 3.0) ** 2
def dJ(theta):  return 2.0 * (theta - 3.0)

def run_sgd_momentum(theta0=0.0, alpha=0.1, beta=0.9, n=60):
    theta, v, path = theta0, 0.0, [theta0]
    for _ in range(n):
        g = dJ(theta)
        v = beta * v + (1 - beta) * g
        theta -= alpha * v
        path.append(theta)
    return path

def run_adagrad(theta0=0.0, alpha=0.5, eps=1e-8, n=60):
    theta, G, path = theta0, 0.0, [theta0]
    for _ in range(n):
        g = dJ(theta)
        G += g ** 2
        theta -= alpha / (np.sqrt(G) + eps) * g
        path.append(theta)
    return path

def run_rmsprop(theta0=0.0, alpha=0.1, rho=0.9, eps=1e-8, n=60):
    theta, s, path = theta0, 0.0, [theta0]
    for _ in range(n):
        g = dJ(theta)
        s = rho * s + (1 - rho) * g ** 2
        theta -= alpha / (np.sqrt(s) + eps) * g
        path.append(theta)
    return path

def run_adam(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, n=60):
    theta, m, v, path = theta0, 0.0, 0.0, [theta0]
    for t in range(1, n + 1):
        g = dJ(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta)
    return path

n_steps = 60
paths = {
    'SGD + Momentum (α=0.10, β=0.9)': run_sgd_momentum(n=n_steps),
    'Adagrad          (α=0.50)':        run_adagrad(n=n_steps),
    'RMSProp          (α=0.10, ρ=0.9)': run_rmsprop(n=n_steps),
    'Adam             (α=0.30)':        run_adam(n=n_steps),
}
colors = ['steelblue', 'darkorange', 'seagreen', 'tomato']

fig, axes = plt.subplots(1, 2, figsize=(13, 4))
steps = np.arange(n_steps + 1)
theta_range = np.linspace(-0.5, 6.5, 300)

# Left: loss vs iteration
ax = axes[0]
for (label, path), color in zip(paths.items(), colors):
    ax.plot(steps, [J(t) for t in path], color=color, lw=2, label=label)
ax.set_xlabel('Iteration')
ax.set_ylabel('J(θ) = (θ − 3)²')
ax.set_title('Convergence on 1D Quadratic')
ax.legend(fontsize=8, loc='upper right')
ax.set_ylim(-0.1, 10)
ax.grid(alpha=0.3)

# Right: parameter trajectory on loss surface
ax2 = axes[1]
ax2.plot(theta_range, J(theta_range), 'k--', lw=1.5, alpha=0.4, label='J(θ)')
for (label, path), color in zip(paths.items(), colors):
    short = label.split('(')[0].strip()
    ax2.plot(path, [J(t) for t in path], '-o', color=color, ms=3, lw=1.5, label=short)
    ax2.plot(path[-1], J(path[-1]), 'D', color=color, ms=7)
ax2.axvline(3.0, color='gray', lw=1, linestyle=':')
ax2.set_xlabel('θ')
ax2.set_ylabel('J(θ)')
ax2.set_title('Parameter Trajectory')
ax2.legend(fontsize=8)
ax2.grid(alpha=0.3)

plt.suptitle('Optimizer Comparison — 1D Quadratic Loss', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

Optimizers on the Ackley Function — Non-Convex Surface¶

The 1D quadratic is too easy: any reasonably tuned optimizer converges on it. The Ackley function is a standard non-convex benchmark with many local minima and a narrow funnel leading to the global minimum at $(0, 0)$:

f(x,y) = -20\exp\!\left(-0.2\sqrt{0.5(x^2+y^2)}\right) - \exp\!\left(0.5\bigl(\cos 2\pi x + \cos 2\pi y\bigr)\right) + 20 + e

(9)

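The implementations below run every optimizer from the same starting point, so the trajectories are directly comparable. Note that sgd_momentum uses the classical velocity form $v_t = \beta v_{t-1} - \alpha g_t$, $\theta_{t+1} = \theta_t + v_t$ rather than the moving-average form of Eq. (2); the two are equivalent once the step size in Eq. (2) is rescaled to $\alpha/(1-\beta)$.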

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

# ── Ackley function and its gradient ──────────────────────────────────────────
def objective_function(x, y):
    term1 = -20 * np.exp(-0.2 * np.sqrt(0.5 * (x**2 + y**2)))
    term2 = -np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    return term1 + term2 + 20 + np.e

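def gradient(x, y):
    # Analytic gradient of the Ackley function in objective_function.
    r = np.sqrt(0.5 * (x**2 + y**2))
    radial = 2.0 * np.exp(-0.2 * r) / np.maximum(r, 1e-12)  # guard r = 0 at the origin
    cosine = np.pi * np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    dx = radial * x + cosine * np.sin(2 * np.pi * x)
    dy = radial * y + cosine * np.sin(2 * np.pi * y)
    return np.array([dx, dy])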

def clip_point(pt, bounds=(-3, 3)):
    return np.clip(pt, bounds[0], bounds[1])

# ── Optimizer implementations ─────────────────────────────────────────────────
def sgd_momentum(start, lr=0.05, n=200, beta=0.9, tol=0.01):
    pt, vel = np.array(start, float), np.zeros(2)
    hist = [pt.copy()]
    for i in range(n):
        g = gradient(pt[0], pt[1])
        vel = beta * vel - lr * g
        pt = clip_point(pt + vel)
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

def adagrad(start, lr=0.05, n=200, eps=1e-8, tol=0.01):
    pt = np.array(start, float)
    G = np.zeros(2)
    hist = [pt.copy()]
    for i in range(n):
        g = gradient(pt[0], pt[1])
        G += g ** 2
        pt = clip_point(pt - lr / (np.sqrt(G) + eps) * g)
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

def rmsprop(start, lr=0.05, n=200, rho=0.9, eps=1e-8, tol=0.01):
    pt = np.array(start, float)
    s = np.zeros(2)
    hist = [pt.copy()]
    for i in range(n):
        g = gradient(pt[0], pt[1])
        s = rho * s + (1 - rho) * g ** 2
        pt = clip_point(pt - lr / (np.sqrt(s) + eps) * g)
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

def adam(start, lr=0.05, n=200, b1=0.9, b2=0.999, eps=1e-8, tol=0.01):
    pt = np.array(start, float)
    m, v = np.zeros(2), np.zeros(2)
    hist = [pt.copy()]
    for t in range(1, n + 1):
        g = gradient(pt[0], pt[1])
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        pt = clip_point(pt - lr * m_hat / (np.sqrt(v_hat) + eps))
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

# ── Run all optimizers from the same start ────────────────────────────────────
START = [2.0, 2.0]
paths_2d = {
    'SGD+Momentum': (sgd_momentum(START), 'tomato'),
    'Adagrad':      (adagrad(START),      'steelblue'),
    'RMSProp':      (rmsprop(START),      'seagreen'),
    'Adam':         (adam(START),         'gold'),
}

# ── Plot: contour + 2D trajectories ──────────────────────────────────────────
res = 120
xs = np.linspace(-3, 3, res)
X, Y = np.meshgrid(xs, xs)
Z = objective_function(X, Y)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Contour trajectories
ax = axes[0]
ax.contourf(X, Y, Z, levels=25, cmap='viridis', alpha=0.75)
ax.contour(X, Y, Z, levels=25, colors='white', linewidths=0.3, alpha=0.4)
for name, (path, color) in paths_2d.items():
    ax.plot(path[:, 0], path[:, 1], '-o', color=color, ms=2, lw=2, label=f'{name} ({len(path)-1} steps)')
    ax.plot(path[-1, 0], path[-1, 1], 'D', color=color, ms=8)
ax.plot(0, 0, 'w*', ms=14, label='Global min (0,0)')
ax.plot(*START, 'ws', ms=10, label='Start (2,2)')
ax.set_title('Trajectories on Ackley Function')
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.legend(fontsize=8, loc='upper left')

# Loss vs steps
ax2 = axes[1]
for name, (path, color) in paths_2d.items():
    losses = [objective_function(pt[0], pt[1]) for pt in path]
    ax2.plot(losses, color=color, lw=2, label=name)
ax2.set_xlabel('Step'); ax2.set_ylabel('f(x, y)')
ax2.set_title('Loss vs Steps — Ackley')
ax2.legend(fontsize=9)
ax2.grid(alpha=0.3)

plt.suptitle('Optimizer Trajectories on Non-Convex Ackley Surface', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print("Final positions and function values:")
for name, (path, _) in paths_2d.items():
    final = path[-1]
    fval  = objective_function(final[0], final[1])
    print(f"  {name:<16}: x=({final[0]:+.3f}, {final[1]:+.3f})  f={fval:.4f}  steps={len(path)-1}")
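
Hand-written gradients are easy to get wrong, so it can be worth validating them numerically before trusting a trajectory plot. The sketch below compares an analytic Ackley gradient derived from Eq. (9) with central finite differences. The names `ackley`, `ackley_grad`, and `numerical_grad` are assumptions introduced for this check, and the analytic form is undefined exactly at the origin, hence the small guard.

```python
import numpy as np

def ackley(x, y):
    r = np.sqrt(0.5 * (x**2 + y**2))
    return (-20 * np.exp(-0.2 * r)
            - np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
            + 20 + np.e)

def ackley_grad(x, y):
    r = max(np.sqrt(0.5 * (x**2 + y**2)), 1e-12)   # guard against r = 0
    radial = 2.0 * np.exp(-0.2 * r) / r
    cosine = np.pi * np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    return np.array([radial * x + cosine * np.sin(2 * np.pi * x),
                     radial * y + cosine * np.sin(2 * np.pi * y)])

def numerical_grad(f, x, y, h=1e-6):
    """Central finite-difference approximation of the gradient of f at (x, y)."""
    return np.array([(f(x + h, y) - f(x - h, y)) / (2 * h),
                     (f(x, y + h) - f(x, y - h)) / (2 * h)])

rng = np.random.default_rng(0)
points = rng.uniform(-3, 3, size=(5, 2))
max_err = max(np.abs(ackley_grad(x, y) - numerical_grad(ackley, x, y)).max()
              for x, y in points)
print(f"max |analytic - numerical| over {len(points)} points: {max_err:.2e}")
```

A large mismatch here would indicate that the gradient being followed does not belong to the surface being plotted.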

Try It in the Browser — Adam vs SGD on a 1D Loss¶

Edit $\beta_1$, $\beta_2$, and $\alpha$ in the cell below and observe how quickly each optimizer reaches $\theta^* = 3$.
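
A minimal static version of that comparison, assuming plain SGD without momentum and the same 1D loss $J(\theta) = (\theta-3)^2$. The function names, the tolerance, and the default values are illustrative choices; change `alpha`, `b1`, and `b2` and rerun to see the effect.

```python
import numpy as np

def dJ(theta):
    return 2.0 * (theta - 3.0)

def run_sgd(theta0=0.0, alpha=0.1, n=60):
    theta, path = theta0, [theta0]
    for _ in range(n):
        theta -= alpha * dJ(theta)
        path.append(theta)
    return np.array(path)

def run_adam(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, n=60):
    theta, m, v, path = theta0, 0.0, 0.0, [theta0]
    for t in range(1, n + 1):
        g = dJ(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        theta -= alpha * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
        path.append(theta)
    return np.array(path)

def steps_to_reach(path, target=3.0, tol=1e-2):
    """First step index at which |theta - target| < tol, or None if never reached."""
    hits = np.flatnonzero(np.abs(path - target) < tol)
    return int(hits[0]) if hits.size else None

print("SGD  steps to reach theta* within 0.01:", steps_to_reach(run_sgd()))
print("Adam steps to reach theta* within 0.01:", steps_to_reach(run_adam()))
```

On a well-conditioned quadratic like this one, plain SGD with a good $\alpha$ can be competitive with Adam; the adaptive machinery pays off on badly scaled or noisy problems.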

Choosing an Optimizer in Practice¶

Scenario	Recommended optimizer	Reason
Deep learning, first attempt	Adam	Robust defaults, fast cold start
NLP / text, sparse features	Adagrad or Adam	Rare token gradients need boosted LR
Fine-tuning pre-trained model	AdamW	Decoupled weight decay regularises without interfering with Adam’s adaptive scaling
Large-batch convex problem	SGD + Momentum	Lower memory (one state buffer instead of Adam’s two); comparable convergence on convex losses
Scikit-learn `SGDClassifier`	Plain SGD (no momentum option; tune `learning_rate` and `eta0`)	Direct API mapping
Tight memory budget	RMSProp	Slightly less memory than Adam (no 1st moment)

Guided Practice¶

What problem does the bias-correction step in Adam solve?¶

The moment estimates are initialised at zero and are biased toward zero in early steps
Correct. Dividing by $(1 - \beta^t)$ corrects the initialisation bias so the early moment estimates are not artificially small.

It prevents the learning rate from growing unboundedly
The denominator $\sqrt{\hat{v}_t}$ handles that. Bias correction is specifically for the cold-start issue.

It adds L2 regularisation to the parameters
That is AdamW's weight-decay term, not bias correction.

It normalises gradients to unit length
Gradient clipping normalises magnitude; bias correction normalises moment estimates.

Why does Adagrad's effective learning rate eventually collapse to near zero?¶

$G_t$ only ever increases because it accumulates all squared gradients without decay
Correct. After many steps, $\sqrt{G_t}$ is so large that the effective LR $\alpha / \sqrt{G_t}$ is negligible. RMSProp fixes this with an exponential decay.

Adagrad uses a smaller base learning rate than other optimizers
The base $\alpha$ can be set to any value; the collapse is structural.

It applies momentum in the wrong direction
Adagrad does not use momentum at all.

It requires gradients to be normalised before use
No normalisation step is required.

Which component of Adam is inherited from SGD with Momentum?¶

The per-parameter learning rate scaling $\hat{v}_t$
$\hat{v}_t$ is the second moment; it comes from RMSProp.

The first moment $\hat{m}_t$ (exponential moving average of gradients)
Correct. $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ is precisely the momentum term used in SGD+Momentum.

The bias-correction denominators $(1-\beta^t)$
Bias correction is unique to Adam; it addresses the cold-start problem of initialising moments at zero.

The $\varepsilon$ numerical stability constant
$\varepsilon$ prevents division by zero in the denominator; it is not the momentum component.

A scikit-learn SGDClassifier on a sparse NLP feature matrix is converging slowly. Which change is most likely to help?¶

Switch to Newton's method (second-order)
Newton's method requires computing and inverting the Hessian, which is infeasible at NLP scale.

Remove all regularisation terms
Removing regularisation risks overfitting; it does not address slow convergence on sparse data.

Use an adaptive optimizer (Adam / Adagrad) that gives larger updates to infrequent features
Correct. Sparse NLP features have many near-zero gradients. Adaptive per-parameter rates give rare tokens the larger updates they need to influence the model.

Increase batch size to reduce gradient noise
Larger batches reduce noise but do not fix the core problem of uniform LR across dense and sparse features.

Exercises¶

Exercise 1 — Implement AdamW¶

AdamW adds decoupled weight decay: instead of folding $\lambda\theta$ into the gradient before computing the moments, it subtracts $\alpha\lambda\theta_t$ directly from the parameter in the same update as the Adam step:

\theta_{t+1} = \theta_t - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon} - \alpha\lambda\theta_t

(10)

Implement run_adamw below and compare its trajectory to Adam’s on the 1D quadratic with $\lambda = 0.01$.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(theta):  return (theta - 3.0) ** 2
def dJ(theta): return 2.0 * (theta - 3.0)

def run_adamw(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, lam=0.01, n=60):
    theta, m, v = theta0, 0.0, 0.0
    path = [theta]
    for t in range(1, n + 1):
        g = dJ(theta)
        # TODO: compute m, v, bias-corrected m_hat, v_hat
        # TODO: Adam step
        # TODO: weight decay step (subtract alpha * lam * theta)
        path.append(theta)
    return path

# Compare with Adam from the implementations above
def run_adam(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, n=60):
    theta, m, v = theta0, 0.0, 0.0
    path = [theta]
    for t in range(1, n + 1):
        g = dJ(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta)
    return path

adam_path  = run_adam()
adamw_path = run_adamw()

steps = np.arange(61)
plt.figure(figsize=(7, 4))
plt.plot(steps, [J(t) for t in adam_path],  lw=2, label='Adam')
plt.plot(steps, [J(t) for t in adamw_path], lw=2, linestyle='--', label='AdamW (λ=0.01)')
plt.xlabel('Step'); plt.ylabel('J(θ)'); plt.title('Adam vs AdamW')
plt.legend(); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()
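
To see what "decoupled" changes, the sketch below takes one update from identical state in two ways: L2 regularisation folded into the gradient (so it is rescaled by $\sqrt{\hat{v}_t}$) versus the decoupled decay of Eq. (10). The helper names `adam_l2_step` and `adamw_step` and the chosen values are assumptions for illustration; this is one possible reading of Eq. (10), not a reference solution.

```python
import numpy as np

def _adam_direction(g, m, v, t, b1=0.9, b2=0.999, eps=1e-8):
    """Update the moments with gradient g and return (direction, m, v)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_l2_step(theta, g, m, v, t, alpha=0.3, lam=0.01):
    """Coupled L2: lam * theta is added to the gradient before the moments."""
    direction, m, v = _adam_direction(g + lam * theta, m, v, t)
    return theta - alpha * direction, m, v

def adamw_step(theta, g, m, v, t, alpha=0.3, lam=0.01):
    """Decoupled decay, Eq. (10): the decay term bypasses the moment estimates."""
    direction, m, v = _adam_direction(g, m, v, t)
    return theta - alpha * direction - alpha * lam * theta, m, v

theta, g = 5.0, 2.0 * (5.0 - 3.0)   # gradient of (theta - 3)^2 at theta = 5
print("Adam + L2:", adam_l2_step(theta, g, 0.0, 0.0, t=1)[0])
print("AdamW    :", adamw_step(theta, g, 0.0, 0.0, t=1)[0])
```

At $t = 1$ the coupled version normalises the penalty away (the step is still about $\alpha$), while the decoupled version shrinks $\theta$ by the additional $\alpha\lambda\theta_t$.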

Exercise 2 — Momentum Sensitivity¶

Run SGD+Momentum on $J(\theta) = (\theta-3)^2$ starting at $\theta_0=0$ with $\alpha=0.3$ and vary $\beta \in \{0.0, 0.5, 0.9, 0.99\}$. Plot the loss trajectory for each value. What happens as $\beta \to 1$?

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(t):  return (t - 3.0) ** 2
def dJ(t): return 2.0 * (t - 3.0)

betas  = [0.0, 0.5, 0.9, 0.99]
colors = ['steelblue', 'seagreen', 'darkorange', 'tomato']
alpha  = 0.3
n_steps = 60

plt.figure(figsize=(8, 4))
for beta, color in zip(betas, colors):
    # TODO: implement SGD+Momentum with this beta value
    # hint: v = beta*v + (1-beta)*g  then  theta -= alpha*v
    losses = [J(0.0)]  # replace with actual trajectory
    plt.plot(losses, color=color, lw=2, label=f'β = {beta}')

plt.xlabel('Step'); plt.ylabel('J(θ)')
plt.title('Momentum Sensitivity (α=0.3)')
plt.legend(); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()

Exercise 3 — RMSProp vs Adagrad on a Sparse Gradient Signal¶

Simulate a sparse gradient sequence: gradients are zero 90 % of the time and equal to 5 on non-zero steps. Run both Adagrad and RMSProp for 200 steps on $J(\theta) = \theta^2$ (minimum at 0). Show how Adagrad’s effective learning rate collapses while RMSProp’s stays active.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 200
# Sparse gradient: 10% of steps have gradient = 5, rest = 0
sparse_grads = np.where(np.random.rand(n) < 0.1, 5.0, 0.0)

alpha = 0.5
eps   = 1e-8
rho   = 0.9

# TODO: run Adagrad and RMSProp using the sparse_grads array
# Record effective LR (alpha / (sqrt(accumulator) + eps), as in Eqs. 3 and 4) at each step
adagrad_eff_lr  = np.ones(n)  # replace
rmsprop_eff_lr  = np.ones(n)  # replace

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(sparse_grads, alpha=0.5, label='gradient signal')
axes[0].set_title('Sparse Gradient Signal')
axes[0].set_xlabel('Step'); axes[0].legend()

axes[1].plot(adagrad_eff_lr,  lw=2, label='Adagrad eff LR')
axes[1].plot(rmsprop_eff_lr,  lw=2, label='RMSProp eff LR')
axes[1].set_title('Effective Learning Rate Over Time')
axes[1].set_xlabel('Step'); axes[1].set_ylabel('α / (√acc + ε)')
axes[1].legend(); axes[1].grid(alpha=0.3)

plt.tight_layout(); plt.show()

Common Pitfalls¶

Summary¶

Optimizer	Update formula (simplified)	Key property
SGD + Momentum	$\theta \leftarrow \theta - \alpha v_t$	Smooths oscillation; needs LR tuning
Adagrad	$\theta \leftarrow \theta - \frac{\alpha}{\sqrt{G_t}}g_t$	Great for sparse features; LR collapses
RMSProp	$\theta \leftarrow \theta - \frac{\alpha}{\sqrt{s_t}}g_t$	Fixes Adagrad collapse via decay $\rho$
Adam	$\theta \leftarrow \theta - \alpha\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon}$	Momentum + RMSProp + bias correction
AdamW	Adam step plus decoupled decay $-\alpha\lambda\theta_t$	Decoupled weight decay; suited to regularised training

Rule of thumb: start with Adam at $\alpha=10^{-3}$. Switch to AdamW when you need weight decay. Fall back to SGD+Momentum for large-batch convex problems where final test accuracy matters more than convergence speed.

Next Up — Learning Rate Schedules¶

You now know how each optimizer adapts its step size internally. The next notebook shows how to adapt the base learning rate $\alpha$ itself over training — warm-up, step decay, cosine annealing, and cyclic schedules that squeeze out the last few percent of accuracy.