
Learning Rate Schedules¶
Step Decay · Exponential Decay · Cosine Annealing · Warm-up · Cyclic LR
Chapter 6 — Optimization & Training Practicalities
Why the Learning Rate Should Change Over Time¶

The intuition: training has two phases with conflicting needs.
| Phase | Stage | Need | Optimal $\alpha$ |
|---|---|---|---|
| Early | Far from optimum | Cover large distances quickly | Large |
| Late | Near optimum | Fine-tune without overshooting | Small |
A fixed learning rate makes a compromise that is suboptimal in both phases. A schedule gives each phase what it needs.
Schedule Formulas¶
Let $\alpha_0$ be the initial learning rate, $t$ the current epoch, and $T$ the total number of epochs.
Step Decay¶
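$$\alpha_t = \alpha_0 \, \gamma^{\lfloor t / s \rfloor}$$
Every $s$ epochs, multiply $\alpha$ by the decay factor $\gamma$ (e.g., $\gamma = 0.5$, $s = 20$, the values used in the plotting code below). The schedule is piecewise constant: the rate stays flat, then drops suddenly. Easy to debug; a common default for CNNs.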
Exponential Decay¶
$$\alpha_t = \alpha_0 \, e^{-\lambda t}$$
Smooth, continuous decay controlled by the rate constant $\lambda$: a small $\lambda$ decays slowly, a large $\lambda$ drops fast. Risk: with a large $\lambda$ the rate becomes negligibly small before training finishes.
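One way to manage that risk is to derive the decay constant from the learning rate you want at the final epoch instead of guessing it. Solving $\alpha_T = \alpha_0 e^{-\lambda T}$ for the constant gives $\lambda = \ln(\alpha_0 / \alpha_T) / T$. The sketch below assumes this approach; the helper name `lam_for_target` and the target value are illustrative and not part of the original notebook.

```python
import numpy as np

def lam_for_target(alpha0, alpha_end, T):
    """Decay constant such that alpha0 * exp(-lam * T) equals alpha_end."""
    return np.log(alpha0 / alpha_end) / T

alpha0, T = 0.1, 100
# Hypothetical target: finish training at 1% of the initial rate
lam = lam_for_target(alpha0, alpha_end=1e-3, T=T)

final_lr = alpha0 * np.exp(-lam * T)
print(f"lambda = {lam:.4f}, LR at epoch {T} = {final_lr:.2e}")
```

Choosing the end point first makes the schedule's behaviour at the end of training explicit, which is the part that exponential decay tends to get wrong when the constant is picked by hand.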
Cosine Annealing¶
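$$\alpha_t = \alpha_{\min} + \tfrac{1}{2}(\alpha_0 - \alpha_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$
Smoothly interpolates from $\alpha_0$ down to $\alpha_{\min}$ along half a cosine curve. Avoids the hard drops of step decay and the premature collapse that exponential decay risks when its rate constant is too large. A widely used default for transformer and ResNet training.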
Linear Warm-up + Decay¶
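$$\alpha_t = \alpha_0 \, \frac{t}{t_{\text{warm}}} \quad \text{for } t \le t_{\text{warm}}$$
Ramp linearly from near zero up to $\alpha_0$ over $t_{\text{warm}}$ warm-up epochs, then apply any decay schedule. Warm-up prevents catastrophically large early updates while the moment estimates (in Adam) are still noisy.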
Cyclic Learning Rate¶
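Oscillates between $\alpha_{\min}$ and $\alpha_{\max}$ in a triangle wave with a fixed cycle length. In the plotting code below, the cycle argument is the length of each rising or falling ramp, so one full oscillation spans 2 × cycle epochs. The periodic high-LR phases help escape sharp local minima; the low-LR phases allow fine-tuning. Reported to improve generalisation on some tasks.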
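The triangle-wave form of cyclic LR can be sketched in a few lines. This is an illustrative version rather than the notebook's own function: the name `triangular_lr` and the `half_cycle` parameter (the number of epochs in each rising or falling ramp) are assumptions, chosen so that the full period, `2 * half_cycle`, is explicit.

```python
import numpy as np

def triangular_lr(t, alpha_min=1e-3, alpha_max=0.1, half_cycle=20):
    """Triangle wave: rises for half_cycle epochs, then falls for half_cycle epochs."""
    phase = (t % (2 * half_cycle)) / half_cycle   # position in the period, in [0, 2)
    return alpha_min + (alpha_max - alpha_min) * (1 - np.abs(phase - 1))

# Works on scalars and on NumPy arrays alike
epochs = np.arange(0, 81, 10)
for t, lr in zip(epochs, triangular_lr(epochs)):
    print(f"epoch {int(t):3d}: lr = {lr:.4f}")
```

The rate starts at the minimum, peaks after `half_cycle` epochs, and returns to the minimum after `2 * half_cycle` epochs.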
Schedule Selection Map¶
Decision tree for schedule selection by problem type. All schedules can be combined with any optimizer.
Visualising All Schedules¶
The cell below implements each schedule as a pure-NumPy function and plots the learning rate curve over 100 epochs.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
T = 100 # total epochs
alpha0 = 0.1 # initial LR
alpha_min = 1e-4 # floor for cosine
epochs = np.arange(T)
def step_decay(t, alpha0=alpha0, gamma=0.5, step_size=20):
    # Multiply by gamma once every step_size epochs
    return alpha0 * (gamma ** (t // step_size))

def exp_decay(t, alpha0=alpha0, lam=0.04):
    return alpha0 * np.exp(-lam * t)

def cosine_anneal(t, alpha0=alpha0, alpha_min=alpha_min, T=T):
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + np.cos(np.pi * t / T))

def warmup_cosine(t, alpha0=alpha0, alpha_min=alpha_min, T=T, t_warm=10):
    # Linear ramp for the first t_warm epochs, cosine decay afterwards
    if t <= t_warm:
        return alpha0 * t / t_warm
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + np.cos(np.pi * (t - t_warm) / (T - t_warm)))

def cyclic_lr(t, alpha_min=1e-3, alpha_max=alpha0, cycle=20):
    # Triangle wave: x falls from 1 to 0 over `cycle` epochs, then rises back to 1
    x = abs(t % (2 * cycle) / cycle - 1)
    return alpha_min + (alpha_max - alpha_min) * max(0, 1 - x)

schedules = {
    'Step Decay (γ=0.5, s=20)': [step_decay(t) for t in epochs],
    'Exponential Decay (λ=0.04)': [exp_decay(t) for t in epochs],
    'Cosine Annealing': [cosine_anneal(t) for t in epochs],
    'Warm-up + Cosine (t_warm=10)': [warmup_cosine(t) for t in epochs],
    'Cyclic LR (cycle=20)': [cyclic_lr(t) for t in epochs],
}
colors = ['steelblue', 'darkorange', 'seagreen', 'tomato', 'purple']
styles = ['-', '--', '-', '-.', ':']
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
# Left: all on same axes
for (label, lrs), color, ls in zip(schedules.items(), colors, styles):
    axes[0].plot(epochs, lrs, color=color, lw=2, linestyle=ls, label=label)
axes[0].set_xlabel('Epoch'); axes[0].set_ylabel('Learning Rate')
axes[0].set_title('All Schedules — Full Range')
axes[0].legend(fontsize=8); axes[0].grid(alpha=0.3)
# Right: log scale to see small-LR behaviour
for (label, lrs), color, ls in zip(schedules.items(), colors, styles):
    axes[1].semilogy(epochs, lrs, color=color, lw=2, linestyle=ls, label=label)
axes[1].set_xlabel('Epoch'); axes[1].set_ylabel('Learning Rate (log scale)')
axes[1].set_title('All Schedules — Log Scale')
axes[1].legend(fontsize=8); axes[1].grid(alpha=0.3, which='both')
plt.suptitle('Learning Rate Schedules over 100 Epochs', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
Effect on Convergence — Simulation on a Noisy Loss Surface¶
The cell below runs vanilla SGD (no momentum) on the 1D loss $J(\theta) = (\theta - 3)^2$ with added gradient noise, comparing two fixed learning rates against cosine annealing, step decay, and warm-up + cosine. The noise simulates the stochastic gradient variance of mini-batch training.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
def J(theta): return (theta - 3.0) ** 2
def dJ(theta): return 2.0 * (theta - 3.0)
T = 100
noise_sd = 0.3 # simulated SGD gradient noise
alpha0 = 0.3
alpha_min = 1e-4
def cosine_anneal(t, a0=alpha0, a_min=alpha_min, T=T):
    return a_min + 0.5*(a0 - a_min)*(1 + np.cos(np.pi * t / T))

configs = [
    ('Fixed α=0.30', lambda t: alpha0),
    ('Fixed α=0.05', lambda t: 0.05),
    ('Cosine Annealing', cosine_anneal),
    ('Step Decay (γ=0.5, s=25)', lambda t: alpha0 * (0.5 ** (t // 25))),
    # Warm-up variant: the cosine peak is scaled up linearly over the first 10 epochs
    ('Warm-up + Cosine', lambda t: cosine_anneal(t, a0=alpha0*t/10) if t<=10 else cosine_anneal(t)),
]
colors = ['tomato', 'steelblue', 'seagreen', 'darkorange', 'purple']
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
for (label, sched), color in zip(configs, colors):
    theta = 0.0
    loss_hist, alpha_hist = [], []
    for t in range(T):
        alpha_t = sched(t)
        g = dJ(theta) + np.random.randn() * noise_sd  # noisy gradient
        theta -= alpha_t * g
        loss_hist.append(J(theta))
        alpha_hist.append(alpha_t)
    axes[0].plot(loss_hist, color=color, lw=1.8, label=f'{label} → J={loss_hist[-1]:.4f}')
    axes[1].plot(alpha_hist, color=color, lw=1.8, label=label)
axes[0].set_xlabel('Epoch'); axes[0].set_ylabel('J(θ) = (θ−3)²')
axes[0].set_title('Loss with Noisy SGD — Schedule Comparison')
axes[0].legend(fontsize=8); axes[0].grid(alpha=0.3)
axes[1].set_xlabel('Epoch'); axes[1].set_ylabel('α')
axes[1].set_title('Schedule Curves Used')
axes[1].legend(fontsize=8); axes[1].grid(alpha=0.3)
plt.suptitle('Schedule Effect on Noisy SGD Convergence', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
Try It in the Browser — Design Your Own Schedule¶
Edit the parameters below to design a custom schedule and see the learning rate curve in text form.
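The interactive cell is not reproduced in this text version. As a rough stand-in, the sketch below prints a schedule as a text bar chart. It assumes a warm-up + cosine shape, and all parameter names and values are hypothetical placeholders to edit.

```python
import math

# Hypothetical parameters: edit these to reshape the schedule
T = 40            # total epochs
alpha0 = 0.1      # peak learning rate
alpha_min = 1e-4  # floor reached at the end of the cosine decay
t_warm = 5        # warm-up epochs
WIDTH = 50        # characters used for the longest bar

def schedule(t):
    """Linear warm-up to alpha0, then cosine decay towards alpha_min."""
    if t < t_warm:
        return alpha0 * (t + 1) / t_warm
    progress = (t - t_warm) / max(1, T - t_warm)
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(math.pi * progress))

lrs = [schedule(t) for t in range(T)]
for t in range(0, T, 2):  # print every second epoch to keep the output short
    bar = '#' * round(WIDTH * lrs[t] / alpha0)
    print(f"{t:3d} {lrs[t]:.5f} {bar}")
```

Each row shows the epoch, the learning rate, and a bar whose length is proportional to the rate, so the ramp and the decay are visible without a plotting library.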
Schedules in scikit-learn and Practical ML¶
scikit-learn’s SGDClassifier / SGDRegressor support schedules via the learning_rate parameter:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
X, y = make_regression(n_samples=500, n_features=10, noise=25.0, random_state=42)
scaler = StandardScaler()
X = scaler.fit_transform(X)
schedule_configs = {
'constant (η=0.01)': dict(learning_rate='constant', eta0=0.01),
'optimal (sklearn)': dict(learning_rate='optimal'),
'invscaling': dict(learning_rate='invscaling', eta0=0.1, power_t=0.5),
'adaptive': dict(learning_rate='adaptive', eta0=0.1),
}
fig, ax = plt.subplots(figsize=(8, 4))
colors = ['tomato', 'steelblue', 'seagreen', 'darkorange']
for (label, kwargs), color in zip(schedule_configs.items(), colors):
    losses = []
    # partial_fit runs one epoch per call and keeps the internal step counter,
    # so time-based schedules keep decaying across epochs (fit() restarts the counter)
    model = SGDRegressor(random_state=42, **kwargs)
    for _ in range(80):
        model.partial_fit(X, y)
        pred = model.predict(X)
        losses.append(np.mean((pred - y) ** 2))
    ax.plot(losses, color=color, lw=2, label=f'{label} → MSE={losses[-1]:.1f}')
ax.set_xlabel('Epoch'); ax.set_ylabel('MSE')
ax.set_title('sklearn SGDRegressor — Built-in LR Schedules')
ax.legend(fontsize=9); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
Guided Practice¶
Why do we use a learning rate schedule instead of keeping the step size fixed forever?¶
What is the typical purpose of decreasing the learning rate later in training?¶
Why is a warm-up phase (linearly increasing LR) useful at the start of training with Adam?¶
A cosine annealing schedule runs for $T=100$ epochs with $\alpha_0=0.1$ and $\alpha_{\min}=0$. What is the learning rate at epoch 50?¶
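For the last question, the cosine formula can be evaluated directly. At $t = T/2$ the cosine term is $\cos(\pi/2) = 0$, so the rate sits exactly halfway between $\alpha_0$ and $\alpha_{\min}$. A quick numerical check, restating the cosine annealing formula with the question's values:

```python
import numpy as np

def cosine_anneal(t, alpha0=0.1, alpha_min=0.0, T=100):
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + np.cos(np.pi * t / T))

lr_50 = cosine_anneal(50)
print(f"LR at epoch 50: {lr_50:.3f}")  # halfway between alpha0 and alpha_min
```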
Exercises¶
Exercise 1 — Implement Cosine Annealing with Warm Restarts (SGDR)¶
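SGDR (Stochastic Gradient Descent with Warm Restarts) resets the cosine schedule every $T_i$ epochs, doubling the period after each restart:
$$\alpha_t = \alpha_{\min} + \tfrac{1}{2}(\alpha_0 - \alpha_{\min})\left(1 + \cos\frac{\pi \, t_{\text{cycle}}}{T_i}\right)$$
where $T_i$ doubles at each restart and $t_{\text{cycle}}$, the position within the current cycle, resets to 0. Implement and plot it for 100 epochs with $T_0 = 10$.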
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
T_TOTAL = 100
T0 = 10 # initial cycle length
alpha0 = 0.1
alpha_min = 1e-4
def sgdr(t, T0=T0, alpha0=alpha0, alpha_min=alpha_min):
    # TODO: compute which cycle we are in and t_cycle (position within cycle)
    #       cycle length doubles after each restart: T0, 2*T0, 4*T0, ...
    #       hint: keep a running sum of cycle lengths to find the current cycle boundary
    return alpha_min  # replace with correct formula
epochs = np.arange(T_TOTAL)
lrs = [sgdr(t) for t in epochs]
plt.figure(figsize=(8, 3))
plt.plot(epochs, lrs, lw=2, color='steelblue')
plt.xlabel('Epoch'); plt.ylabel('Learning Rate')
plt.title('SGDR — Cosine Annealing with Warm Restarts')
plt.grid(alpha=0.3); plt.tight_layout(); plt.show()
Exercise 2 — Schedule Impact on Final Loss¶
Run SGD with gradient noise on $J(\theta) = (\theta - 3)^2$ for 150 steps. Compare five schedule choices and report the final loss for each. Which schedule gets closest to the true minimum? Is there a schedule that consistently beats the others across 10 independent random seeds?
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
def J(t): return (t - 3.0)**2
def dJ(t): return 2.0*(t - 3.0)
# TODO: define at least 5 schedule functions
# TODO: run each across 10 random seeds, collect final J(theta)
# TODO: plot mean ± std of final loss for each schedule
np.random.seed(0)
T = 150
n_seeds = 10
# starter: one schedule implemented
def cosine_anneal(t, alpha0=0.3, alpha_min=1e-4, T=T):
    return alpha_min + 0.5*(alpha0 - alpha_min)*(1 + np.cos(np.pi * t / T))
# Your code here ...
print('Exercise 2 — implement and compare schedules')
Exercise 3 — LR Range Test¶
The LR range test sweeps $\alpha$ from a very small value to a large value over a short number of steps and records the loss at each $\alpha$. The useful LR range is where the loss is still falling. Implement this for $J(\theta) = (\theta - 3)^2$ and identify the useful range.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
def J(t): return (t - 3.0)**2
def dJ(t): return 2.0*(t - 3.0)
# LR range test: sweep alpha from 1e-4 to 2.0 over 50 steps
alphas = np.logspace(-4, np.log10(2.0), 50)
theta0 = 0.0
losses = []
for alpha in alphas:
    # TODO: take one gradient step from theta0 with this alpha
    #       record J(theta_after_step)
    losses.append(J(theta0))  # replace with J after step
plt.figure(figsize=(7, 4))
plt.semilogx(alphas, losses, lw=2, color='steelblue')
plt.xlabel('Learning Rate (log scale)'); plt.ylabel('Loss after 1 step')
plt.title('LR Range Test — identify useful α range')
plt.grid(alpha=0.3, which='both'); plt.tight_layout(); plt.show()
Common Pitfalls¶
Summary¶
Key takeaways
| Schedule | Formula (key) | Best for |
|---|---|---|
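| Step Decay | $\alpha_0 \gamma^{\lfloor t/s \rfloor}$ | Small datasets, CNNs |
| Exponential Decay | $\alpha_0 e^{-\lambda t}$ | Smooth continuous decay |
| Cosine Annealing | $\alpha_{\min} + \tfrac{1}{2}(\alpha_0 - \alpha_{\min})(1 + \cos(\pi t / T))$ | Large models, deep networks |
| Warm-up + Cosine | Ramp → Cosine | Transformers, fine-tuning |
| Cyclic LR | Triangle wave between $\alpha_{\min}$ and $\alpha_{\max}$ | Experimental, escaping minima |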
Rule of thumb: cosine annealing is a safe default for most deep learning workloads. Add a warm-up phase (5–10% of total steps) when using Adam on a transformer or large pre-trained model.
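To make the rule of thumb concrete, the sketch below builds a per-step warm-up + cosine schedule in which the warm-up length is a fraction of the total step count. The function name `warmup_cosine_steps`, the `warmup_frac` parameter, and the 1,000-step budget are assumptions for illustration only.

```python
import numpy as np

def warmup_cosine_steps(total_steps, alpha0=0.1, alpha_min=1e-4, warmup_frac=0.05):
    """Per-step learning rates: linear warm-up, then cosine decay."""
    n_warm = max(1, int(round(warmup_frac * total_steps)))
    steps = np.arange(total_steps)
    # Linear ramp that reaches alpha0 on the last warm-up step
    warm = alpha0 * (steps + 1) / n_warm
    # Cosine decay over the remaining steps (values before n_warm are discarded)
    progress = (steps - n_warm) / max(1, total_steps - n_warm)
    decay = alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + np.cos(np.pi * progress))
    return np.where(steps < n_warm, warm, decay)

lrs = warmup_cosine_steps(total_steps=1000, warmup_frac=0.05)
print(f"peak LR: {lrs.max():.3f}, final LR: {lrs[-1]:.5f}")
```

Expressing warm-up as a fraction rather than a fixed count keeps the shape of the schedule the same when the training budget changes.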
Next Up — Numerical Stability¶

Even a perfectly tuned optimizer with an ideal schedule can fail if floating-point arithmetic explodes or vanishes. The next notebook covers gradient clipping, log-sum-exp tricks, weight initialisation, and how to recognise and fix NaN/Inf values during training.