Learning Rate Schedules

Step Decay · Exponential Decay · Cosine Annealing · Warm-up · Cyclic LR

Chapter 6 — Optimization & Training Practicalities

Why the Learning Rate Should Change Over Time

Business context. A retail demand-forecasting model trains for 100 epochs. With a fixed $\alpha=0.01$, training converges to a loss of 0.042: decent, but not the best achievable. With a cosine annealing schedule (the same $\alpha$ at epoch 1, decayed to near zero by epoch 100), the same model reaches 0.031, a 26 % reduction in MSE with no extra data or architecture change. When it works, a schedule is a nearly free performance gain.

The intuition: training has two phases with conflicting needs.

| Phase | Stage | Need | Optimal $\alpha$ |
|---|---|---|---|
| Early | Far from optimum | Cover large distances quickly | Large |
| Late | Near optimum | Fine-tune without overshooting | Small |

A fixed learning rate makes a compromise that is suboptimal in both phases. A schedule gives each phase what it needs.

Schedule Formulas

Let $\alpha_0$ be the initial learning rate, $t$ the current epoch, and $T$ the total number of epochs.


Step Decay

$$\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t / s \rfloor}$$

Every $s$ epochs, multiply by a decay factor $\gamma < 1$ (e.g., $\gamma=0.5$, $s=10$). The result is piecewise constant: the rate stays flat and then drops suddenly. Easy to debug; a common default for CNNs.
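As a quick check of the formula, the sketch below evaluates step decay at a few epochs using the example values $\gamma=0.5$ and $s=10$ from the text. The starting rate `alpha0 = 0.1` is an assumed value chosen only for illustration.

```python
def step_decay(t, alpha0=0.1, gamma=0.5, s=10):
    """Learning rate at epoch t: alpha0 * gamma ** floor(t / s)."""
    return alpha0 * gamma ** (t // s)

# The rate is flat within each block of s epochs and halves at each boundary.
for t in (0, 9, 10, 19, 20, 35):
    print(f"epoch {t:2d}: lr = {step_decay(t):.5f}")
```

Epochs 0 to 9 share one rate, epochs 10 to 19 share the next, and so on, which is what makes this schedule easy to reason about when debugging.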


Exponential Decay

$$\alpha_t = \alpha_0 \cdot e^{-\lambda t}$$

Smooth continuous decay, controlled by the rate constant $\lambda$: a small $\lambda$ decays slowly, a large $\lambda$ drops fast. Risk: with a large $\lambda$ the rate becomes negligibly small before training finishes.
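One way to reduce that risk is to derive $\lambda$ from the learning rate you want to end with rather than guessing it. Solving $\alpha_T = \alpha_0 e^{-\lambda T}$ for $\lambda$ gives $\lambda = \ln(\alpha_0 / \alpha_T) / T$. The helper name `lam_for_target` and the numbers below are illustrative assumptions, not part of the original notebook.

```python
import math

def exp_decay(t, alpha0, lam):
    """Exponential decay: alpha0 * exp(-lam * t)."""
    return alpha0 * math.exp(-lam * t)

def lam_for_target(alpha0, alpha_end, T):
    """Rate constant that takes alpha0 to alpha_end after exactly T epochs."""
    return math.log(alpha0 / alpha_end) / T

# Example: decay from 0.1 to 0.001 over 100 epochs.
lam = lam_for_target(alpha0=0.1, alpha_end=1e-3, T=100)
print(f"lambda = {lam:.4f}")
print(f"lr at epoch 100 = {exp_decay(100, 0.1, lam):.6f}")
```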


Cosine Annealing

$$\alpha_t = \alpha_{\min} + \tfrac{1}{2}(\alpha_0 - \alpha_{\min})\left(1 + \cos\!\left(\frac{\pi\, t}{T}\right)\right)$$

Smoothly interpolates from $\alpha_0$ down to $\alpha_{\min}$ following a cosine curve. It avoids the abrupt drops of step decay and, unlike exponential decay, reaches $\alpha_{\min}$ exactly at $t = T$ without tuning a rate constant. A widely used default for transformer and ResNet training.
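The shape of the cosine curve is easiest to see from how much the rate falls per epoch. The sketch below assumes $\alpha_0 = 0.1$, $\alpha_{\min} = 0$ and $T = 100$ purely for illustration; `drop` is a hypothetical helper.

```python
import math

def cosine_anneal(t, alpha0=0.1, alpha_min=0.0, T=100):
    """Cosine annealing from alpha0 at t = 0 to alpha_min at t = T."""
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(math.pi * t / T))

def drop(t):
    """Decrease in learning rate between epoch t and epoch t + 1."""
    return cosine_anneal(t) - cosine_anneal(t + 1)

for t in (0, 49, 99):
    print(f"epoch {t:2d} -> {t + 1:3d}: lr falls by {drop(t):.6f}")
```

The per-epoch change is smallest at the two ends and largest around the midpoint, so the rate lingers near $\alpha_0$ early on and near $\alpha_{\min}$ at the end.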


Linear Warm-up + Decay

$$\alpha_t = \begin{cases} \alpha_0 \cdot \dfrac{t}{t_{\text{warm}}} & t \leq t_{\text{warm}} \\ f_{\text{decay}}(t) & t > t_{\text{warm}} \end{cases}$$

Ramp from near-zero up to $\alpha_0$ over $t_{\text{warm}}$ warm-up epochs, then apply any decay schedule. Warm-up prevents catastrophic early gradient updates while moment estimates (in Adam) are still noisy.
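Because the decay part can be any schedule, warm-up can be written as a wrapper around a decay function. The sketch below is one possible arrangement, not a definitive implementation: `with_warmup` and `cosine_after_warmup` are hypothetical names, and it assumes the decay function is indexed by epochs since warm-up ended, so that the curve has no jump at $t_{\text{warm}}$.

```python
import math

def with_warmup(decay_fn, alpha0, t_warm):
    """Wrap a decay schedule with a linear ramp over the first t_warm epochs.

    decay_fn(k) is called with k = t - t_warm (epochs since warm-up ended)
    and should return alpha0 at k = 0 so the two pieces join without a jump.
    """
    def schedule(t):
        if t <= t_warm:
            return alpha0 * t / t_warm
        return decay_fn(t - t_warm)
    return schedule

alpha0, T, t_warm = 0.1, 100, 10   # illustrative values

def cosine_after_warmup(k):
    # Cosine decay to zero over the T - t_warm epochs left after warm-up.
    return 0.5 * alpha0 * (1 + math.cos(math.pi * k / (T - t_warm)))

lr = with_warmup(cosine_after_warmup, alpha0, t_warm)
for t in (0, 5, 10, 11, 55, 100):
    print(f"epoch {t:3d}: lr = {lr(t):.5f}")
```

Swapping `cosine_after_warmup` for a step or exponential decay function leaves the warm-up logic unchanged.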


Cyclic Learning Rate

$$\alpha_t = \alpha_{\min} + (\alpha_{\max} - \alpha_{\min}) \cdot \max\!\left(0,\, 1 - \left|\frac{t \bmod (2\,c)}{c} - 1\right|\right)$$

Oscillates between $\alpha_{\min}$ and $\alpha_{\max}$ as a triangle wave. In this formula $c$ is the half-cycle length: the rate climbs for $c$ epochs and falls for $c$ epochs, so a full cycle takes $2c$ epochs. The periodic high-LR phases can help escape sharp local minima; the low-LR phases allow fine-tuning. Known to improve generalisation on some tasks.
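To make the timing of the triangle wave concrete, the sketch below evaluates the formula as written at a few epochs. The values $\alpha_{\min} = 0.001$, $\alpha_{\max} = 0.1$ and $c = 20$ are assumptions chosen for illustration.

```python
def cyclic_lr(t, alpha_min=0.001, alpha_max=0.1, c=20):
    """Triangle wave: alpha_min at t = 0, 2c, 4c, ...; alpha_max at t = c, 3c, ..."""
    x = abs((t % (2 * c)) / c - 1)   # 1 at a trough, 0 at a peak
    return alpha_min + (alpha_max - alpha_min) * max(0.0, 1 - x)

for t in (0, 10, 20, 30, 40, 60):
    print(f"epoch {t:2d}: lr = {cyclic_lr(t):.4f}")
```

The rate starts at the minimum, peaks after $c$ epochs, and returns to the minimum after $2c$ epochs before repeating.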

Schedule Selection Map

Decision tree for schedule selection by problem type. All schedules can be combined with any optimizer.

Visualising All Schedules

The cell below implements each schedule as a pure-NumPy function and plots the learning rate curve over 100 epochs.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

T       = 100          # total epochs
alpha0  = 0.1          # initial LR
alpha_min = 1e-4       # floor for cosine
epochs  = np.arange(T)

def step_decay(t, alpha0=alpha0, gamma=0.5, step_size=20):
    return alpha0 * (gamma ** (t // step_size))

def exp_decay(t, alpha0=alpha0, lam=0.04):
    return alpha0 * np.exp(-lam * t)

def cosine_anneal(t, alpha0=alpha0, alpha_min=alpha_min, T=T):
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + np.cos(np.pi * t / T))

def warmup_cosine(t, alpha0=alpha0, alpha_min=alpha_min, T=T, t_warm=10):
    if t <= t_warm:
        return alpha0 * t / t_warm
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + np.cos(np.pi * (t - t_warm) / (T - t_warm)))

def cyclic_lr(t, alpha_min=1e-3, alpha_max=alpha0, cycle=20):
    x = abs(t % (2 * cycle) / cycle - 1)
    return alpha_min + (alpha_max - alpha_min) * max(0, 1 - x)

schedules = {
    'Step Decay (γ=0.5, s=20)':       [step_decay(t) for t in epochs],
    'Exponential Decay (λ=0.04)':     [exp_decay(t) for t in epochs],
    'Cosine Annealing':                [cosine_anneal(t) for t in epochs],
    'Warm-up + Cosine (t_warm=10)':   [warmup_cosine(t) for t in epochs],
    'Cyclic LR (cycle=20)':           [cyclic_lr(t) for t in epochs],
}
colors = ['steelblue', 'darkorange', 'seagreen', 'tomato', 'purple']
styles = ['-', '--', '-', '-.', ':']

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

# Left: all on same axes
for (label, lrs), color, ls in zip(schedules.items(), colors, styles):
    axes[0].plot(epochs, lrs, color=color, lw=2, linestyle=ls, label=label)
axes[0].set_xlabel('Epoch'); axes[0].set_ylabel('Learning Rate')
axes[0].set_title('All Schedules — Full Range')
axes[0].legend(fontsize=8); axes[0].grid(alpha=0.3)

# Right: log scale to see small-LR behaviour
for (label, lrs), color, ls in zip(schedules.items(), colors, styles):
    axes[1].semilogy(epochs, lrs, color=color, lw=2, linestyle=ls, label=label)
axes[1].set_xlabel('Epoch'); axes[1].set_ylabel('Learning Rate (log scale)')
axes[1].set_title('All Schedules — Log Scale')
axes[1].legend(fontsize=8); axes[1].grid(alpha=0.3, which='both')

plt.suptitle('Learning Rate Schedules over 100 Epochs', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

Effect on Convergence — Simulation on a Noisy Loss Surface

The cell below runs vanilla SGD (no momentum) on a 1D loss with added gradient noise, comparing fixed LR vs cosine annealing. Noise simulates the stochastic gradient variance of mini-batch training.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

def J(theta):  return (theta - 3.0) ** 2
def dJ(theta): return 2.0 * (theta - 3.0)

T        = 100
noise_sd = 0.3    # simulated SGD gradient noise
alpha0   = 0.3
alpha_min = 1e-4

def cosine_anneal(t, a0=alpha0, a_min=alpha_min, T=T):
    return a_min + 0.5*(a0 - a_min)*(1 + np.cos(np.pi * t / T))

configs = [
    ('Fixed α=0.30',             lambda t: alpha0),
    ('Fixed α=0.05',             lambda t: 0.05),
    ('Cosine Annealing',         cosine_anneal),
    ('Step Decay (γ=0.5, s=25)', lambda t: alpha0 * (0.5 ** (t // 25))),
    ('Warm-up + Cosine',         lambda t: cosine_anneal(t, a0=alpha0*t/10) if t<=10 else cosine_anneal(t)),
]
colors = ['tomato', 'steelblue', 'seagreen', 'darkorange', 'purple']

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

for (label, sched), color in zip(configs, colors):
    theta = 0.0
    loss_hist, alpha_hist = [], []
    for t in range(T):
        alpha_t = sched(t)
        g = dJ(theta) + np.random.randn() * noise_sd
        theta -= alpha_t * g
        loss_hist.append(J(theta))
        alpha_hist.append(alpha_t)
    axes[0].plot(loss_hist, color=color, lw=1.8, label=f'{label} → J={loss_hist[-1]:.4f}')
    axes[1].plot(alpha_hist, color=color, lw=1.8, label=label)

axes[0].set_xlabel('Epoch'); axes[0].set_ylabel('J(θ) = (θ−3)²')
axes[0].set_title('Loss with Noisy SGD — Schedule Comparison')
axes[0].legend(fontsize=8); axes[0].grid(alpha=0.3)

axes[1].set_xlabel('Epoch'); axes[1].set_ylabel('α')
axes[1].set_title('Schedule Curves Used')
axes[1].legend(fontsize=8); axes[1].grid(alpha=0.3)

plt.suptitle('Schedule Effect on Noisy SGD Convergence', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
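A rough way to see why fixed-rate runs plateau above zero: with the update $\theta \leftarrow \theta - \alpha\,(2(\theta-3) + \varepsilon)$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, the error $e = \theta - 3$ follows $e \leftarrow (1-2\alpha)\,e - \alpha\varepsilon$. Its stationary variance, which is also the expected loss, is $\alpha\sigma^2 / (4(1-\alpha))$ for $0 < \alpha < 1$, so the floor shrinks roughly in proportion to $\alpha$. The sketch below checks that estimate by simulation; the function names and the number of chains are illustrative choices, not part of the original notebook.

```python
import numpy as np

def noise_floor(alpha, sigma):
    """Stationary E[(theta - 3)^2] for fixed-rate noisy SGD on J = (theta - 3)^2."""
    return alpha * sigma**2 / (4.0 * (1.0 - alpha))

def simulated_floor(alpha, sigma, steps=400, chains=20000, seed=0):
    """Average final loss over many independent runs started at the optimum."""
    rng = np.random.default_rng(seed)
    theta = np.full(chains, 3.0)
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0) + rng.normal(0.0, sigma, size=chains)
        theta -= alpha * grad
    return np.mean((theta - 3.0) ** 2)

for alpha in (0.30, 0.05):
    print(f"alpha={alpha:.2f}  predicted={noise_floor(alpha, 0.3):.5f}  "
          f"simulated={simulated_floor(alpha, 0.3):.5f}")
```

A decaying schedule lowers this floor as training proceeds, which is why it can finish closer to the minimum than either fixed rate.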

Try It in the Browser — Design Your Own Schedule

Edit the parameters below to design a custom schedule and see the learning rate curve in text form.
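A minimal offline version of such a cell might look like the sketch below. It assumes a warm-up plus cosine shape; `custom_schedule`, `text_curve` and all default values are hypothetical and can be replaced with any schedule that maps an epoch to a rate.

```python
import math

# Parameters to edit (all values here are illustrative defaults).
T = 60            # total epochs
alpha0 = 0.1      # peak learning rate
alpha_min = 0.0   # final learning rate
t_warm = 6        # warm-up epochs (0 disables warm-up)

def custom_schedule(t):
    """Linear warm-up to alpha0, then cosine decay to alpha_min at epoch T - 1."""
    if t_warm > 0 and t < t_warm:
        return alpha0 * t / t_warm
    progress = (t - t_warm) / max(1, T - 1 - t_warm)
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(math.pi * progress))

def text_curve(schedule, T, width=40, every=5):
    """Render one bar of '#' characters per sampled epoch."""
    peak = max(schedule(t) for t in range(T))
    lines = []
    for t in range(0, T, every):
        lr = schedule(t)
        bar = "#" * round(width * lr / peak) if peak > 0 else ""
        lines.append(f"{t:3d} {lr:8.5f} {bar}")
    return "\n".join(lines)

print(text_curve(custom_schedule, T))
```

Each printed row shows the epoch, the learning rate, and a bar whose length is proportional to the rate relative to its peak.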

Schedules in scikit-learn and Practical ML

scikit-learn’s SGDClassifier / SGDRegressor support schedules via the learning_rate parameter:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=10, noise=25.0, random_state=42)
scaler = StandardScaler()
X = scaler.fit_transform(X)

schedule_configs = {
    'constant (η=0.01)':   dict(learning_rate='constant',    eta0=0.01),
    'optimal (sklearn)':   dict(learning_rate='optimal'),
    'invscaling':          dict(learning_rate='invscaling',  eta0=0.1, power_t=0.5),
    'adaptive':            dict(learning_rate='adaptive',    eta0=0.1),
}

fig, ax = plt.subplots(figsize=(8, 4))
colors = ['tomato', 'steelblue', 'seagreen', 'darkorange']

for (label, kwargs), color in zip(schedule_configs.items(), colors):
    losses = []
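    # partial_fit runs one epoch per call and keeps the step counter t_ between
    # calls; fit() resets t_ on every call, which would restart the schedule.
    model = SGDRegressor(random_state=42, **kwargs)
    for _ in range(80):
        model.partial_fit(X, y)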
        pred = model.predict(X)
        losses.append(np.mean((pred - y) ** 2))
    ax.plot(losses, color=color, lw=2, label=f'{label} → MSE={losses[-1]:.1f}')

ax.set_xlabel('Epoch'); ax.set_ylabel('MSE')
ax.set_title('sklearn SGDRegressor — Built-in LR Schedules')
ax.legend(fontsize=9); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
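For reference, the scikit-learn documentation describes `'constant'` as $\eta = \eta_0$ and `'invscaling'` as $\eta = \eta_0 / t^{\text{power\_t}}$, where $t$ counts individual weight updates (one per sample) rather than epochs. The sketch below does not call scikit-learn; it only evaluates the documented `'invscaling'` formula in plain Python with the settings used above, assuming the step counter is carried across epochs. `invscaling_eta` is a hypothetical helper name.

```python
def invscaling_eta(t, eta0=0.1, power_t=0.5):
    """Step size after t weight updates under the documented 'invscaling' rule."""
    return eta0 / t ** power_t

n_samples = 500   # matches the make_regression call above
for epoch in (1, 10, 80):
    t = epoch * n_samples          # updates seen by the end of that epoch
    print(f"end of epoch {epoch:2d}: t = {t:5d}, eta = {invscaling_eta(t):.6f}")
```

Because $t$ advances once per sample, most of the decay happens within the first epoch, and later epochs see only a slowly shrinking step size.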

Guided Practice

Why do we use a learning rate schedule instead of keeping the step size fixed forever?

- Because different stages of training can benefit from different step sizes.
  Correct. Larger early steps and smaller later steps often improve both convergence speed and final loss.
- Because schedules replace the optimizer entirely.
  A schedule sets the base $\alpha$ fed into the optimizer; it does not replace the optimizer.
- Because fixed learning rates are illegal in machine learning.
  Fixed rates are perfectly legal; they just leave performance on the table.
- Because schedules remove the need for validation data.
  Validation is still needed to detect overfitting and choose when to stop.

What is the typical purpose of decreasing the learning rate later in training?

- To make the model forget the training data.
  That is not the purpose of a late-stage LR reduction.
- To take finer steps near a good solution and reduce overshooting.
  Correct. When parameters are already close to the optimum, small steps prevent bouncing around instead of converging.
- To convert the loss into accuracy.
  Learning rate does not change the metric definition.
- To avoid using gradients altogether.
  Gradient-based methods still use gradients regardless of the schedule.

Why is a warm-up phase (linearly increasing LR) useful at the start of training with Adam?

- It prevents the model from learning the first batch of data.
  The model still learns from the first batch; warm-up modulates how aggressively.
- It sets all initial parameters to zero.
  Parameter initialisation is separate from the LR schedule.
- Adam's moment estimates are initialised at zero and unreliable in early steps; a small LR limits the damage from noisy updates.
  Correct. Bias correction partially mitigates the cold-start bias, but the estimates are still noisy. Warm-up keeps early steps small until the moments stabilise.
- Warm-up increases the batch size automatically.
  Batch size and LR schedule are independent settings.

A cosine annealing schedule runs for $T=100$ epochs with $\alpha_0=0.1$ and $\alpha_{\min}=0$. What is the learning rate at epoch 50?

- 0.05: exactly half of the initial rate, because $\cos(\pi/2) = 0$.
  Correct. At $t=50$, $\cos(\pi \times 50/100) = \cos(\pi/2) = 0$, so $\alpha_{50} = 0 + 0.5 \times 0.1 \times (1 + 0) = 0.05$.
- 0.10: unchanged from the initial rate.
  At epoch 0 the cosine term is $\cos(0)=1$, giving $\alpha_0=0.1$. By epoch 50 the rate has decayed.
- 0.00: the schedule reaches zero at the midpoint.
  The schedule reaches zero only at $t=T$, not at $t=T/2$.
- 0.07: a step-decay value.
  Step decay would give a different value; cosine annealing at the midpoint with $\alpha_{\min}=0$ is exactly $\alpha_0/2$.

Exercises

Exercise 1 — Implement Cosine Annealing with Warm Restarts (SGDR)

SGDR (Stochastic Gradient Descent with Warm Restarts) restarts the cosine schedule at the end of each cycle. The first cycle lasts $T_0$ epochs, and the cycle length doubles after each restart:

$$\alpha_t = \alpha_{\min} + \tfrac{1}{2}(\alpha_0 - \alpha_{\min})\left(1 + \cos\!\left(\frac{\pi\, t_{\text{cycle}}}{T_{\text{cur}}}\right)\right)$$

where $T_{\text{cur}}$ is the length of the current cycle (it doubles at each restart) and $t_{\text{cycle}}$ is the number of epochs since the last restart (it resets to 0). Implement and plot it for 100 epochs with $T_0=10$.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

T_TOTAL  = 100
T0       = 10      # initial cycle length
alpha0   = 0.1
alpha_min = 1e-4

def sgdr(t, T0=T0, alpha0=alpha0, alpha_min=alpha_min):
    # TODO: compute which cycle we are in and t_cycle (position within cycle)
    # cycle length doubles after each restart: T0, 2*T0, 4*T0, ...
    # hint: keep a running sum of cycle lengths to find the current cycle boundary
    return alpha_min  # replace with correct formula

epochs = np.arange(T_TOTAL)
lrs = [sgdr(t) for t in epochs]

plt.figure(figsize=(8, 3))
plt.plot(epochs, lrs, lw=2, color='steelblue')
plt.xlabel('Epoch'); plt.ylabel('Learning Rate')
plt.title('SGDR — Cosine Annealing with Warm Restarts')
plt.grid(alpha=0.3); plt.tight_layout(); plt.show()

Exercise 2 — Schedule Impact on Final Loss

Run SGD with gradient noise $\sigma=0.5$ on $J(\theta)=(\theta-3)^2$ for 150 steps. Compare five schedule choices and report the final loss for each. Which schedule gets closest to the true minimum? Is there a schedule that consistently beats the others across 10 independent random seeds?

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(t):  return (t - 3.0)**2
def dJ(t): return 2.0*(t - 3.0)

# TODO: define at least 5 schedule functions
# TODO: run each across 10 random seeds, collect final J(theta)
# TODO: plot mean ± std of final loss for each schedule

np.random.seed(0)
T = 150
n_seeds = 10

# starter: one schedule implemented
def cosine_anneal(t, alpha0=0.3, alpha_min=1e-4, T=T):
    return alpha_min + 0.5*(alpha0 - alpha_min)*(1 + np.cos(np.pi * t / T))

# Your code here ...
print('Exercise 2 — implement and compare schedules')

Exercise 3 — LR Range Test

The LR range test sweeps $\alpha$ from a very small value to a large value over a short number of steps and records the loss at each $\alpha$. The useful LR range is where the loss is still falling. Implement this for $J(\theta)=(\theta-3)^2$ and identify the useful range.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(t):  return (t - 3.0)**2
def dJ(t): return 2.0*(t - 3.0)

# LR range test: sweep alpha from 1e-4 to 2.0 over 50 steps
alphas   = np.logspace(-4, np.log10(2.0), 50)
theta0   = 0.0

losses = []
for alpha in alphas:
    # TODO: take one gradient step from theta0 with this alpha
    # record J(theta_after_step)
    losses.append(J(theta0))  # replace with J after step

plt.figure(figsize=(7, 4))
plt.semilogx(alphas, losses, lw=2, color='steelblue')
plt.xlabel('Learning Rate (log scale)'); plt.ylabel('Loss after 1 step')
plt.title('LR Range Test — identify useful α range')
plt.grid(alpha=0.3, which='both'); plt.tight_layout(); plt.show()

Common Pitfalls

Summary

Key takeaways

| Schedule | Formula (key) | Best for |
|---|---|---|
| Step Decay | $\alpha_0 \cdot \gamma^{\lfloor t/s \rfloor}$ | Small datasets, CNNs |
| Exponential Decay | $\alpha_0 e^{-\lambda t}$ | Smooth continuous decay |
| Cosine Annealing | $\alpha_{\min} + \frac{1}{2}(\alpha_0-\alpha_{\min})(1+\cos\frac{\pi t}{T})$ | Large models, deep networks |
| Warm-up + Cosine | Ramp → Cosine | Transformers, fine-tuning |
| Cyclic LR | Triangle wave between $\alpha_{\min}$ and $\alpha_{\max}$ | Experimental, escaping minima |

Rule of thumb: cosine annealing is a safe default for most deep learning. Add a warm-up phase (5–10 % of total steps) when using Adam on a transformer or large pre-trained model.
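Since the rule of thumb is stated in steps rather than epochs, it helps to convert it into a concrete step count. The sketch below is a minimal illustration; `warmup_steps` is a hypothetical helper, and the dataset size, batch size and epoch count are made-up example values.

```python
import math

def warmup_steps(n_samples, batch_size, epochs, fraction=0.05):
    """Number of optimizer steps to spend warming up, as a fraction of all steps."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    return round(fraction * total_steps), total_steps

for frac in (0.05, 0.10):
    warm, total = warmup_steps(n_samples=50_000, batch_size=128, epochs=30, fraction=frac)
    print(f"{frac:.0%} warm-up: {warm} of {total} steps")
```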

Next Up — Numerical Stability

Even a perfectly tuned optimizer with an ideal schedule can fail if floating-point arithmetic explodes or vanishes. The next notebook covers gradient clipping, log-sum-exp tricks, weight initialisation, and how to recognise and fix NaN/Inf values during training.