
Advanced Optimizers¶
SGD + Momentum · Adagrad · RMSProp · Adam · AdamW
Chapter 6 — Optimization & Training Practicalities
Why Plain Gradient Descent Isn’t Enough¶

Four failure modes of vanilla GD that motivate adaptive methods:
| Failure mode | Root cause | Adaptive fix |
|---|---|---|
| Slow progress in flat regions | Same learning rate $\alpha$ everywhere | Per-parameter scaling (Adagrad, RMSProp) |
| Oscillation in ravines | Gradient direction flips each step | Momentum smooths past directions |
| Manual LR tuning each project | No memory of curvature | Adam’s bias-corrected moments |
| Gradient vanishing in sparse features | Frequent zero gradients dilute updates | Adagrad rewards rare features |
The Optimizer Family — Update Rules¶
All optimizers share the same skeleton: compute a modified step $\Delta\theta_t$, then subtract it:
$$\theta_{t+1} = \theta_t - \Delta\theta_t$$
What differs is how $\Delta\theta_t$ is constructed:
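The shared skeleton can be sketched as a loop with a pluggable step rule. This is only an illustration: the names `optimize` and `plain_gd_step` are assumptions introduced here, not part of the notebook.

```python
def optimize(grad_fn, step_fn, theta0, n_steps):
    """Generic loop: each optimizer only changes how step_fn builds the step."""
    theta, state = theta0, {}
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        delta = step_fn(g, state, t)  # optimizer-specific modified step
        theta = theta - delta         # shared part: subtract the step
    return theta


def plain_gd_step(g, state, t, alpha=0.1):
    """Vanilla gradient descent: the step is the gradient scaled by alpha."""
    return alpha * g


# Minimise J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_final = optimize(lambda th: 2.0 * (th - 3.0), plain_gd_step,
                       theta0=0.0, n_steps=100)
print(theta_final)
```

Momentum, Adagrad, RMSProp and Adam would each supply a different `step_fn` and keep their running averages in `state`.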
SGD + Momentum¶
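$$v_t = \beta v_{t-1} + (1-\beta)\, g_t, \qquad \theta_{t+1} = \theta_t - \alpha v_t$$
$v_t$ is an exponential moving average of past gradients $g_t$. With $\beta = 0.9$, the latest gradient contributes 10 %, but directions that have been consistent for many steps build up velocity.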
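To see why an exponential moving average damps oscillation, the sketch below feeds it two made-up gradient sequences: one that always points the same way and one whose sign flips every step. The helper name `ema_velocity` and both sequences are illustrative assumptions.

```python
def ema_velocity(grads, beta=0.9):
    """Apply v = beta*v + (1-beta)*g to each gradient and return the final v."""
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g
    return v


consistent = [1.0] * 50       # gradient always points the same way
flipping = [1.0, -1.0] * 25   # gradient sign flips every step (ravine-like)

v_consistent = ema_velocity(consistent)
v_flipping = ema_velocity(flipping)
print(v_consistent, v_flipping)
```

The consistent sequence drives the velocity close to the full gradient magnitude, while the flipping sequence leaves only a small residual, so the step along the oscillating direction shrinks.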
Adagrad¶
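$$G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t} + \epsilon}\, g_t$$
$G_t$ accumulates all squared gradients, so dimensions with frequent large gradients get a smaller effective learning rate, while rarely updated parameters keep a larger one. Drawback: $G_t$ only grows, so the effective rate decays toward zero over time.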
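The decay is easy to quantify for a constant gradient: the accumulator grows linearly, so the effective rate falls like $1/\sqrt{t}$. A minimal sketch, assuming a made-up constant gradient of 2 and an illustrative helper name:

```python
import math


def adagrad_effective_lr(grads, alpha=0.5, eps=1e-8):
    """Return alpha / (sqrt(G) + eps) after each step, where G sums g**2."""
    G, rates = 0.0, []
    for g in grads:
        G += g ** 2
        rates.append(alpha / (math.sqrt(G) + eps))
    return rates


rates = adagrad_effective_lr([2.0] * 100)
print(rates[0], rates[3], rates[99])
```

Quadrupling the number of steps halves the effective rate, and nothing in the update ever lets it recover.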
RMSProp¶
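$$s_t = \rho s_{t-1} + (1-\rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, g_t$$
RMSProp replaces Adagrad’s cumulative sum with a decaying average (decay $\rho = 0.9$). This prevents the learning rate from collapsing to zero.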
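With a decaying average the accumulator is bounded, so the effective rate settles at a plateau instead of shrinking forever. The sketch below assumes a made-up constant gradient of 2; `rmsprop_effective_lr` is an illustrative name.

```python
import math


def rmsprop_effective_lr(grads, alpha=0.1, rho=0.9, eps=1e-8):
    """Return alpha / (sqrt(s) + eps) after each step, s being a decaying average of g**2."""
    s, rates = 0.0, []
    for g in grads:
        s = rho * s + (1 - rho) * g ** 2
        rates.append(alpha / (math.sqrt(s) + eps))
    return rates


rates = rmsprop_effective_lr([2.0] * 200)
print(rates[0], rates[50], rates[-1])
```

Here `s` converges to $g^2 = 4$, so the rate levels off near $\alpha / |g| = 0.05$. The first few rates are larger because `s` starts at zero and this form applies no bias correction.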
Adam (Adaptive Moment Estimation)¶
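$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$
Defaults: $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Adam is the standard starting point for most deep learning and gradient-based ML work.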
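Because both moment estimates start at zero, they are biased toward zero in the first steps. The sketch below computes only the very first Adam update, with and without bias correction, for a made-up gradient of 5. The function name and the `bias_correction` flag are assumptions for illustration.

```python
import math


def adam_first_step(g, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8,
                    bias_correction=True):
    """Size of Adam's first update (t = 1) for a single scalar gradient g."""
    m = (1 - b1) * g        # m_0 = 0, so m_1 holds only 10 % of g
    v = (1 - b2) * g ** 2   # v_0 = 0, so v_1 holds only 0.1 % of g**2
    if bias_correction:
        m = m / (1 - b1)    # divide by 1 - b1**t with t = 1
        v = v / (1 - b2)    # divide by 1 - b2**t with t = 1
    return alpha * m / (math.sqrt(v) + eps)


with_fix = adam_first_step(5.0)
without_fix = adam_first_step(5.0, bias_correction=False)
print(with_fix, without_fix)
```

With correction the first step has magnitude close to $\alpha$ regardless of the gradient scale. Without it, the mismatch between the two decay rates inflates the first step by roughly $\sqrt{10}$ at these defaults.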
Optimizer Family Tree¶
Each node is a modification of the node(s) above it. Adam inherits both the direction memory of Momentum and the per-parameter scaling of RMSProp.
NumPy Implementations from Scratch¶
The cell below implements all four optimizers on a simple 1D quadratic loss so you can see the exact update equations in code before testing on a harder surface.
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# 1D quadratic loss with its minimum at theta = 3, and its derivative
def J(theta): return (theta - 3.0) ** 2
def dJ(theta): return 2.0 * (theta - 3.0)

def run_sgd_momentum(theta0=0.0, alpha=0.1, beta=0.9, n=60):
    theta, v, path = theta0, 0.0, [theta0]
    for _ in range(n):
        g = dJ(theta)
        v = beta * v + (1 - beta) * g   # exponential moving average of gradients
        theta -= alpha * v
        path.append(theta)
    return path

def run_adagrad(theta0=0.0, alpha=0.5, eps=1e-8, n=60):
    theta, G, path = theta0, 0.0, [theta0]
    for _ in range(n):
        g = dJ(theta)
        G += g ** 2                     # cumulative sum of squared gradients
        theta -= alpha / (np.sqrt(G) + eps) * g
        path.append(theta)
    return path

def run_rmsprop(theta0=0.0, alpha=0.1, rho=0.9, eps=1e-8, n=60):
    theta, s, path = theta0, 0.0, [theta0]
    for _ in range(n):
        g = dJ(theta)
        s = rho * s + (1 - rho) * g ** 2   # decaying average of squared gradients
        theta -= alpha / (np.sqrt(s) + eps) * g
        path.append(theta)
    return path

def run_adam(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, n=60):
    theta, m, v, path = theta0, 0.0, 0.0, [theta0]
    for t in range(1, n + 1):
        g = dJ(theta)
        m = b1 * m + (1 - b1) * g          # first moment (momentum)
        v = b2 * v + (1 - b2) * g ** 2     # second moment (RMSProp-style scaling)
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta)
    return path

n_steps = 60
paths = {
    'SGD + Momentum (α=0.10, β=0.9)': run_sgd_momentum(n=n_steps),
    'Adagrad (α=0.50)': run_adagrad(n=n_steps),
    'RMSProp (α=0.10, ρ=0.9)': run_rmsprop(n=n_steps),
    'Adam (α=0.30)': run_adam(n=n_steps),
}
colors = ['steelblue', 'darkorange', 'seagreen', 'tomato']

fig, axes = plt.subplots(1, 2, figsize=(13, 4))
steps = np.arange(n_steps + 1)
theta_range = np.linspace(-0.5, 6.5, 300)

# Left: loss vs iteration
ax = axes[0]
for (label, path), color in zip(paths.items(), colors):
    ax.plot(steps, [J(t) for t in path], color=color, lw=2, label=label)
ax.set_xlabel('Iteration')
ax.set_ylabel('J(θ) = (θ − 3)²')
ax.set_title('Convergence on 1D Quadratic')
ax.legend(fontsize=8, loc='upper right')
ax.set_ylim(-0.1, 10)
ax.grid(alpha=0.3)

# Right: parameter trajectory on loss surface
ax2 = axes[1]
ax2.plot(theta_range, J(theta_range), 'k--', lw=1.5, alpha=0.4, label='J(θ)')
for (label, path), color in zip(paths.items(), colors):
    short = label.split('(')[0].strip()
    ax2.plot(path, [J(t) for t in path], '-o', color=color, ms=3, lw=1.5, label=short)
    ax2.plot(path[-1], J(path[-1]), 'D', color=color, ms=7)
ax2.axvline(3.0, color='gray', lw=1, linestyle=':')
ax2.set_xlabel('θ')
ax2.set_ylabel('J(θ)')
ax2.set_title('Parameter Trajectory')
ax2.legend(fontsize=8)
ax2.grid(alpha=0.3)

plt.suptitle('Optimizer Comparison — 1D Quadratic Loss', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
```
Optimizers on the Ackley Function — Non-Convex Surface¶
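The 1D quadratic is too easy: any reasonably tuned optimizer converges. The Ackley function is a standard non-convex benchmark with many local minima and a narrow path to the global minimum at $(x, y) = (0, 0)$:
$$f(x, y) = -20 \exp\!\left(-0.2\sqrt{0.5\,(x^2 + y^2)}\right) - \exp\!\left(0.5\,(\cos 2\pi x + \cos 2\pi y)\right) + 20 + e$$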
The implementations below follow the original notebook. Every optimizer starts from the same point, so the trajectories are directly comparable.
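Since the optimizers only ever see the gradient, it is worth confirming that a hand-derived gradient really matches the objective. The sketch below derives the Ackley gradient analytically and compares it with a central finite difference at an arbitrary test point. The names `ackley`, `ackley_grad` and `numeric_grad`, and the test point itself, are assumptions made for this check.

```python
import math


def ackley(x, y):
    r = math.sqrt(0.5 * (x * x + y * y))
    c = 0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y))
    return -20 * math.exp(-0.2 * r) - math.exp(c) + 20 + math.e


def ackley_grad(x, y):
    """Analytic gradient; r is floored to avoid 0/0 exactly at the origin."""
    r = max(math.sqrt(0.5 * (x * x + y * y)), 1e-12)
    c = 0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y))
    radial = 2.0 * math.exp(-0.2 * r) / r      # from the -20*exp(-0.2*r) term
    dx = radial * x + math.pi * math.sin(2 * math.pi * x) * math.exp(c)
    dy = radial * y + math.pi * math.sin(2 * math.pi * y) * math.exp(c)
    return dx, dy


def numeric_grad(f, x, y, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dx, dy


point = (1.3, -0.7)
analytic = ackley_grad(*point)
numeric = numeric_grad(ackley, *point)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, numeric, max_err)
```

A large `max_err` would indicate that the gradient belongs to a different function than the one being plotted, which silently invalidates any trajectory comparison.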
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# ── Ackley function and its gradient ──────────────────────────────────────────
def objective_function(x, y):
    term1 = -20 * np.exp(-0.2 * np.sqrt(0.5 * (x**2 + y**2)))
    term2 = -np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    return term1 + term2 + 20 + np.e

def gradient(x, y):
    # Analytic gradient of objective_function. The earlier expression,
    # 2*x + 3*sin(1.5*x)*cos(1.5*x), is the gradient of x² + sin²(1.5x),
    # a different surface from the Ackley function plotted below.
    r = np.maximum(np.sqrt(0.5 * (x**2 + y**2)), 1e-12)  # avoid 0/0 at the origin
    radial = 2 * np.exp(-0.2 * r) / r
    bumps = np.pi * np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    dx = radial * x + bumps * np.sin(2 * np.pi * x)
    dy = radial * y + bumps * np.sin(2 * np.pi * y)
    return np.array([dx, dy])

def clip_point(pt, bounds=(-3, 3)):
    return np.clip(pt, bounds[0], bounds[1])

# ── Optimizer implementations ─────────────────────────────────────────────────
def sgd_momentum(start, lr=0.05, n=200, beta=0.9, tol=0.01):
    pt, vel = np.array(start, float), np.zeros(2)
    hist = [pt.copy()]
    for _ in range(n):
        g = gradient(pt[0], pt[1])
        # Classical momentum form (lr folded into the velocity); the 1D cell
        # above uses the EMA form v = beta*v + (1-beta)*g instead.
        vel = beta * vel - lr * g
        pt = clip_point(pt + vel)
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

def adagrad(start, lr=0.05, n=200, eps=1e-8, tol=0.01):
    pt = np.array(start, float)
    G = np.zeros(2)
    hist = [pt.copy()]
    for _ in range(n):
        g = gradient(pt[0], pt[1])
        G += g ** 2
        pt = clip_point(pt - lr / (np.sqrt(G) + eps) * g)
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

def rmsprop(start, lr=0.05, n=200, rho=0.9, eps=1e-8, tol=0.01):
    pt = np.array(start, float)
    s = np.zeros(2)
    hist = [pt.copy()]
    for _ in range(n):
        g = gradient(pt[0], pt[1])
        s = rho * s + (1 - rho) * g ** 2
        pt = clip_point(pt - lr / (np.sqrt(s) + eps) * g)
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

def adam(start, lr=0.05, n=200, b1=0.9, b2=0.999, eps=1e-8, tol=0.01):
    pt = np.array(start, float)
    m, v = np.zeros(2), np.zeros(2)
    hist = [pt.copy()]
    for t in range(1, n + 1):
        g = gradient(pt[0], pt[1])
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        pt = clip_point(pt - lr * m_hat / (np.sqrt(v_hat) + eps))
        hist.append(pt.copy())
        if objective_function(pt[0], pt[1]) < tol:
            break
    return np.array(hist)

# ── Run all optimizers from the same start ────────────────────────────────────
START = [2.0, 2.0]
paths_2d = {
    'SGD+Momentum': (sgd_momentum(START), 'tomato'),
    'Adagrad': (adagrad(START), 'steelblue'),
    'RMSProp': (rmsprop(START), 'seagreen'),
    'Adam': (adam(START), 'gold'),
}

# ── Plot: contour + 2D trajectories ──────────────────────────────────────────
res = 120
xs = np.linspace(-3, 3, res)
X, Y = np.meshgrid(xs, xs)
Z = objective_function(X, Y)
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

# Contour trajectories
ax = axes[0]
ax.contourf(X, Y, Z, levels=25, cmap='viridis', alpha=0.75)
ax.contour(X, Y, Z, levels=25, colors='white', linewidths=0.3, alpha=0.4)
for name, (path, color) in paths_2d.items():
    ax.plot(path[:, 0], path[:, 1], '-o', color=color, ms=2, lw=2,
            label=f'{name} ({len(path)-1} steps)')
    ax.plot(path[-1, 0], path[-1, 1], 'D', color=color, ms=8)
ax.plot(0, 0, 'w*', ms=14, label='Global min (0,0)')
ax.plot(*START, 'ws', ms=10, label='Start (2,2)')
ax.set_title('Trajectories on Ackley Function')
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.legend(fontsize=8, loc='upper left')

# Loss vs steps
ax2 = axes[1]
for name, (path, color) in paths_2d.items():
    losses = [objective_function(pt[0], pt[1]) for pt in path]
    ax2.plot(losses, color=color, lw=2, label=name)
ax2.set_xlabel('Step'); ax2.set_ylabel('f(x, y)')
ax2.set_title('Loss vs Steps — Ackley')
ax2.legend(fontsize=9)
ax2.grid(alpha=0.3)

plt.suptitle('Optimizer Trajectories on Non-Convex Ackley Surface', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print("Final positions and function values:")
for name, (path, _) in paths_2d.items():
    final = path[-1]
    fval = objective_function(final[0], final[1])
    print(f" {name:<16}: x=({final[0]:+.3f}, {final[1]:+.3f}) f={fval:.4f} steps={len(path)-1}")
```
Try It in the Browser — Adam vs SGD on a 1D Loss¶
Edit the hyperparameters in the interactive cell and observe how quickly each optimizer reaches the minimum.
Choosing an Optimizer in Practice¶
| Scenario | Recommended optimizer | Reason |
|---|---|---|
| Deep learning, first attempt | Adam | Robust defaults, fast cold start |
| NLP / text, sparse features | Adagrad or Adam | Rare token gradients need boosted LR |
| Fine-tuning pre-trained model | AdamW | Decoupled weight decay regularises without distorting Adam's adaptive scaling |
| Large-batch convex sklearn model | SGD + Momentum | Lower memory, same convergence |
| Scikit-learn SGDClassifier | Plain SGD (no momentum option; averaging via `average=True`) | Direct API mapping |
| Tight compute budget | RMSProp | Slightly less memory than Adam (no 1st moment) |
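The memory argument in the last row can be made concrete by counting the per-parameter state buffers each optimizer keeps in the basic forms used in this notebook. The sketch below is a back-of-the-envelope estimate: the 10-million-parameter model and 4-byte values are assumptions, and library implementations with extra options (for example centred or momentum variants of RMSProp) may hold more state.

```python
# Extra per-parameter buffers kept by each optimizer (basic forms only).
STATE_BUFFERS = {
    "SGD": 0,
    "SGD + Momentum": 1,   # velocity v
    "Adagrad": 1,          # accumulated squared gradients G
    "RMSProp": 1,          # decaying average s
    "Adam": 2,             # first moment m and second moment v
    "AdamW": 2,            # same state as Adam
}


def state_megabytes(optimizer, n_params, bytes_per_value=4):
    """Optimizer state size in MB, excluding parameters and gradients."""
    return STATE_BUFFERS[optimizer] * n_params * bytes_per_value / 1e6


n_params = 10_000_000  # hypothetical model size
for name in STATE_BUFFERS:
    print(f"{name:<15} {state_megabytes(name, n_params):8.1f} MB")
```

Under these assumptions Adam carries twice the optimizer state of RMSProp, which matters mainly when the model itself already fills most of the available memory.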
Guided Practice¶
What problem does the bias-correction step in Adam solve?¶
Why does Adagrad's effective learning rate eventually collapse to near zero?¶
Which component of Adam is inherited from SGD with Momentum?¶
A scikit-learn SGDClassifier on a sparse NLP feature matrix is converging slowly. Which change is most likely to help?¶
Exercises¶
Exercise 1 — Implement AdamW¶
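AdamW adds decoupled weight decay: instead of folding $\lambda\theta$ into the gradient before computing moments, it subtracts $\alpha\lambda\theta$ directly from the parameter after the Adam step:
$$\theta_{t+1} = \theta_t - \alpha\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} - \alpha \lambda \theta_t$$
Implement run_adamw below and compare its trajectory to Adam’s on the 1D quadratic with $\lambda = 0.01$.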
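For contrast, here is a sketch of the coupled variant that AdamW moves away from: Adam with the L2 penalty gradient $\lambda\theta$ folded into $g$ before the moments are computed. The name `run_adam_l2` is hypothetical, and this is deliberately not the exercise solution.

```python
import math


def run_adam_l2(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8,
                lam=0.01, n=60):
    """Adam with L2 regularisation folded into the gradient (coupled, not AdamW)."""
    theta, m, v = theta0, 0.0, 0.0
    path = [theta]
    for t in range(1, n + 1):
        # Penalty gradient lam * theta enters both moment estimates,
        # so it is rescaled by sqrt(v_hat) like the loss gradient.
        g = 2.0 * (theta - 3.0) + lam * theta
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
        path.append(theta)
    return path


path = run_adam_l2()
print(path[1], path[-1])
```

Because the penalty passes through the adaptive scaling here, its effective strength depends on the gradient history. The decoupled form applies the same proportional shrinkage to every parameter.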
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(theta): return (theta - 3.0) ** 2
def dJ(theta): return 2.0 * (theta - 3.0)

def run_adamw(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, lam=0.01, n=60):
    theta, m, v = theta0, 0.0, 0.0
    path = [theta]
    for t in range(1, n + 1):
        g = dJ(theta)
        # TODO: compute m, v, bias-corrected m_hat, v_hat
        # TODO: Adam step
        # TODO: weight decay step (subtract alpha * lam * theta)
        path.append(theta)
    return path

# Compare with Adam from the implementations above
def run_adam(theta0=0.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, n=60):
    theta, m, v = theta0, 0.0, 0.0
    path = [theta]
    for t in range(1, n + 1):
        g = dJ(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
        path.append(theta)
    return path

adam_path = run_adam()
adamw_path = run_adamw()
steps = np.arange(61)

plt.figure(figsize=(7, 4))
plt.plot(steps, [J(t) for t in adam_path], lw=2, label='Adam')
plt.plot(steps, [J(t) for t in adamw_path], lw=2, linestyle='--', label='AdamW (λ=0.01)')
plt.xlabel('Step'); plt.ylabel('J(θ)'); plt.title('Adam vs AdamW')
plt.legend(); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()
```
Exercise 2 — Momentum Sensitivity¶
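Run SGD+Momentum on $J(\theta) = (\theta - 3)^2$ starting at $\theta_0 = 0$ with $\alpha = 0.3$ and vary $\beta \in \{0, 0.5, 0.9, 0.99\}$. Plot the loss trajectory for each value. What happens as $\beta \to 1$?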
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(t): return (t - 3.0) ** 2
def dJ(t): return 2.0 * (t - 3.0)

betas = [0.0, 0.5, 0.9, 0.99]
colors = ['steelblue', 'seagreen', 'darkorange', 'tomato']
alpha = 0.3
n_steps = 60

plt.figure(figsize=(8, 4))
for beta, color in zip(betas, colors):
    # TODO: implement SGD+Momentum with this beta value
    # hint: v = beta*v + (1-beta)*g   then   theta -= alpha*v
    losses = [J(0.0)]  # replace with actual trajectory
    plt.plot(losses, color=color, lw=2, label=f'β = {beta}')
plt.xlabel('Step'); plt.ylabel('J(θ)')
plt.title('Momentum Sensitivity (α=0.3)')
plt.legend(); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()
```
Exercise 3 — RMSProp vs Adagrad on a Sparse Gradient Signal¶
Simulate a sparse gradient sequence: gradients are zero 90 % of the time and equal to 5 on non-zero steps. Run both Adagrad and RMSProp for 200 steps on a 1D loss with its minimum at 0. Show how Adagrad’s effective learning rate collapses while RMSProp’s stays active.
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n = 200
# Sparse gradient: 10% of steps have gradient = 5, rest = 0
sparse_grads = np.where(np.random.rand(n) < 0.1, 5.0, 0.0)
alpha = 0.5
eps = 1e-8
rho = 0.9

# TODO: run Adagrad and RMSProp using the sparse_grads array
# Record effective LR (alpha / sqrt(accumulator + eps)) at each step
adagrad_eff_lr = np.ones(n)  # replace
rmsprop_eff_lr = np.ones(n)  # replace

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(sparse_grads, alpha=0.5, label='gradient signal')
axes[0].set_title('Sparse Gradient Signal')
axes[0].set_xlabel('Step'); axes[0].legend()
axes[1].plot(adagrad_eff_lr, lw=2, label='Adagrad eff LR')
axes[1].plot(rmsprop_eff_lr, lw=2, label='RMSProp eff LR')
axes[1].set_title('Effective Learning Rate Over Time')
axes[1].set_xlabel('Step'); axes[1].set_ylabel('α / √(acc + ε)')
axes[1].legend(); axes[1].grid(alpha=0.3)
plt.tight_layout(); plt.show()
```
Common Pitfalls¶
Summary¶
Key takeaways
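| Optimizer | Update formula (simplified) | Key property |
|---|---|---|
| SGD + Momentum | $\theta \leftarrow \theta - \alpha v_t$, with $v_t = \beta v_{t-1} + (1-\beta) g_t$ | Smooths oscillation; needs LR tuning |
| Adagrad | $\theta \leftarrow \theta - \frac{\alpha}{\sqrt{G_t} + \epsilon} g_t$, with $G_t$ the running sum of $g^2$ | Great for sparse features; LR collapses |
| RMSProp | $\theta \leftarrow \theta - \frac{\alpha}{\sqrt{s_t} + \epsilon} g_t$, with $s_t$ a decaying average of $g^2$ | Fixes Adagrad collapse via decay |
| Adam | $\theta \leftarrow \theta - \alpha \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$ | Momentum + RMSProp + bias correction |
| AdamW | Adam step, then $\theta \leftarrow \theta - \alpha \lambda \theta$ | Best for regularised training |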
Rule of thumb: start with Adam at its default $\alpha = 0.001$. Switch to AdamW when you need weight decay. Fall back to SGD+Momentum for large-batch convex problems where final test accuracy matters more than convergence speed.
Next Up — Learning Rate Schedules¶

You now know how each optimizer adapts its step size internally. The next notebook shows how to adapt the base learning rate $\alpha$ itself over training — warm-up, step decay, cosine annealing, and cyclic schedules that squeeze out the last few percent of accuracy.