Learning Rate Schedules

Step Decay · Exponential Decay · Cosine Annealing · Warm-up · Cyclic LR

Chapter 6 — Optimization & Training Practicalities

Why the Learning Rate Should Change Over Time

Business context. A retail demand-forecasting model trains for 100 epochs. With a fixed $\alpha=0.01$, training converges to a loss of 0.042: decent, but not the best achievable. With a cosine annealing schedule (the same $\alpha$ at epoch 1, decayed to near zero by epoch 100), the same model reaches 0.031, a 26 % reduction in MSE ($(0.042-0.031)/0.042 \approx 0.26$) with no extra data or architecture change. A schedule costs almost nothing to add, which makes it one of the cheapest performance gains available.

The intuition: training has two phases with conflicting needs.

Phase	Position	Need	Optimal $\alpha$
Early	Far from optimum	Cover large distances quickly	Large
Late	Near optimum	Fine-tune without overshooting	Small

A fixed learning rate makes a compromise that is suboptimal in both phases. A schedule gives each phase what it needs.

Schedule Formulas

Let $\alpha_0$ be the initial learning rate, $t$ the current epoch, and $T$ the total epochs.

Step Decay

$$\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t / s \rfloor} \tag{1}$$

Every $s$ epochs, the rate is multiplied by a decay factor $\gamma < 1$ (e.g., $\gamma=0.5$, $s=10$). The schedule is piecewise constant: the rate stays flat, then drops abruptly. It is easy to debug and a common default for CNNs.
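
As a quick check of Eq. (1), the sketch below evaluates step decay at a few epochs using the example values $\gamma=0.5$, $s=10$ and an assumed $\alpha_0=0.1$. The function name `step_decay_lr` is illustrative, not taken from a library.

```python
def step_decay_lr(t, alpha0=0.1, gamma=0.5, s=10):
    """Eq. (1): multiply alpha0 by gamma once per completed block of s epochs."""
    return alpha0 * gamma ** (t // s)

# The rate is flat inside each 10-epoch block and halves at each boundary.
for t in (0, 9, 10, 19, 20, 35):
    print(f"epoch {t:2d}: alpha = {step_decay_lr(t):.4f}")
```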

Exponential Decay

$$\alpha_t = \alpha_0 \cdot e^{-\lambda t} \tag{2}$$

Smooth, continuous decay controlled by the rate constant $\lambda$: a small $\lambda$ decays slowly, a large $\lambda$ drops fast. The risk is that a large $\lambda$ makes the rate negligibly small before training finishes.
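
One way to limit that risk is to derive $\lambda$ from a target final rate instead of guessing it. Solving Eq. (2) for $\lambda$ gives $\lambda = \ln(\alpha_0/\alpha_T)/T$, and the half-life of the rate is $\ln 2/\lambda$. The sketch below assumes $\alpha_0=0.1$ and a target of $10^{-3}$ at $T=100$; the helper names are illustrative.

```python
import math

def lambda_for_target(alpha0, alpha_final, T):
    """Solve alpha_final = alpha0 * exp(-lam * T) for lam."""
    return math.log(alpha0 / alpha_final) / T

def half_life(lam):
    """Number of epochs after which the rate has halved."""
    return math.log(2) / lam

lam = lambda_for_target(alpha0=0.1, alpha_final=1e-3, T=100)
print(f"lambda    = {lam:.4f}")
print(f"half-life = {half_life(lam):.1f} epochs")
print(f"alpha_100 = {0.1 * math.exp(-lam * 100):.6f}")
```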

Cosine Annealing

$$\alpha_t = \alpha_{\min} + \tfrac{1}{2}(\alpha_0 - \alpha_{\min})\left(1 + \cos\!\left(\frac{\pi\, t}{T}\right)\right) \tag{3}$$

Smoothly interpolates from $\alpha_0$ down to $\alpha_{\min}$ along a half cosine curve. It avoids the abrupt drops of step decay and the risk of exponential decay shrinking the rate too early, because the rate reaches $\alpha_{\min}$ exactly at $t=T$. It is a widely used default for transformer and ResNet training.
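
To see the shape of Eq. (3) in numbers, the sketch below evaluates it at the quarter points of training, assuming $\alpha_0=0.1$, $\alpha_{\min}=0$, and $T=100$ (illustrative values; `cosine_lr` is a hypothetical helper name). The rate falls slowly at first, fastest around the midpoint, and slowly again near the end.

```python
import math

def cosine_lr(t, alpha0=0.1, alpha_min=0.0, T=100):
    """Eq. (3): half-cosine interpolation from alpha0 (t=0) to alpha_min (t=T)."""
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(math.pi * t / T))

for t in (0, 25, 50, 75, 100):
    print(f"epoch {t:3d}: alpha = {cosine_lr(t):.4f}")
```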

Linear Warm-up + Decay

$$\alpha_t = \begin{cases} \alpha_0 \cdot \dfrac{t}{t_{\text{warm}}} & t \leq t_{\text{warm}} \\ f_{\text{decay}}(t) & t > t_{\text{warm}} \end{cases} \tag{4}$$

Ramp linearly from zero up to $\alpha_0$ over $t_{\text{warm}}$ warm-up epochs, then apply any decay schedule $f_{\text{decay}}$, chosen so that it starts from $\alpha_0$ at $t_{\text{warm}}$ and the curve stays continuous. Warm-up keeps early updates small while gradient statistics are still unreliable, for example while Adam's moment estimates are noisy.
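
A minimal sketch of Eq. (4) as a wrapper around an arbitrary decay function. The names `with_warmup` and `cosine_tail` are hypothetical, and the values are illustrative. The decay clock is shifted by $t_{\text{warm}}$ so that the decay starts from $\alpha_0$ exactly where the ramp ends.

```python
import math

def with_warmup(decay_fn, alpha0, t_warm):
    """Return a schedule: linear ramp to alpha0, then decay_fn(t - t_warm)."""
    def schedule(t):
        if t <= t_warm:
            return alpha0 * t / t_warm   # linear ramp, 0 at t=0
        return decay_fn(t - t_warm)      # decay clock starts after warm-up
    return schedule

def cosine_tail(alpha0, alpha_min, T_decay):
    """Cosine decay over T_decay epochs, starting at alpha0."""
    def decay(t):
        return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(math.pi * t / T_decay))
    return decay

# 10 warm-up epochs followed by 90 epochs of cosine decay (assumed values).
sched = with_warmup(cosine_tail(0.1, 1e-4, T_decay=90), alpha0=0.1, t_warm=10)
for t in (0, 5, 10, 11, 55, 100):
    print(f"epoch {t:3d}: alpha = {sched(t):.5f}")
```

Because the wrapper only needs a function of elapsed decay time, the same `with_warmup` can be paired with a step or exponential decay instead of the cosine tail.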

Cyclic Learning Rate

$$\alpha_t = \alpha_{\min} + (\alpha_{\max} - \alpha_{\min}) \cdot \max\!\left(0,\, 1 - \left|\frac{t \bmod (2\,c)}{c} - 1\right|\right) \tag{5}$$

Oscillates between $\alpha_{\min}$ and $\alpha_{\max}$ in a triangular wave. Here $c$ is the half-cycle length: the rate climbs for $c$ epochs and falls for the next $c$, so a full cycle takes $2c$ epochs. The periodic high-LR phases help escape sharp local minima; the low-LR phases allow fine-tuning. Known to improve generalisation on some tasks.
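
The sketch below evaluates Eq. (5) at a few epochs to make the period explicit, assuming $\alpha_{\min}=0.001$, $\alpha_{\max}=0.1$, and $c=20$ (illustrative values; `triangular_lr` is a hypothetical name). With these values the rate peaks at $t=c=20$ and is back at $\alpha_{\min}$ at $t=2c=40$.

```python
def triangular_lr(t, alpha_min=1e-3, alpha_max=0.1, c=20):
    """Eq. (5): triangular wave that rises for c epochs, then falls for c epochs."""
    position = (t % (2 * c)) / c               # runs from 0 to 2 within each period
    height = max(0.0, 1 - abs(position - 1))   # 0 at the period ends, 1 at the peak
    return alpha_min + (alpha_max - alpha_min) * height

for t in (0, 10, 20, 30, 40, 60):
    print(f"epoch {t:2d}: alpha = {triangular_lr(t):.4f}")
```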

Schedule Selection Map

[Figure: decision tree for schedule selection by problem type.] All schedules can be combined with any optimizer.

Visualising All Schedules

The cell below implements each schedule as a pure-NumPy function and plots the learning rate curve over 100 epochs.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

T       = 100          # total epochs
alpha0  = 0.1          # initial LR
alpha_min = 1e-4       # floor for cosine
epochs  = np.arange(T)

def step_decay(t, alpha0=alpha0, gamma=0.5, step_size=20):
    return alpha0 * (gamma ** (t // step_size))

def exp_decay(t, alpha0=alpha0, lam=0.04):
    return alpha0 * np.exp(-lam * t)

def cosine_anneal(t, alpha0=alpha0, alpha_min=alpha_min, T=T):
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + np.cos(np.pi * t / T))

def warmup_cosine(t, alpha0=alpha0, alpha_min=alpha_min, T=T, t_warm=10):
    if t <= t_warm:
        return alpha0 * t / t_warm
    return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + np.cos(np.pi * (t - t_warm) / (T - t_warm)))

def cyclic_lr(t, alpha_min=1e-3, alpha_max=alpha0, cycle=20):
    # `cycle` is the half-cycle length c in Eq. (5): one full period is 2 * cycle epochs
    x = abs((t % (2 * cycle)) / cycle - 1)
    return alpha_min + (alpha_max - alpha_min) * max(0, 1 - x)

schedules = {
    'Step Decay (γ=0.5, s=20)':       [step_decay(t) for t in epochs],
    'Exponential Decay (λ=0.04)':     [exp_decay(t) for t in epochs],
    'Cosine Annealing':                [cosine_anneal(t) for t in epochs],
    'Warm-up + Cosine (t_warm=10)':   [warmup_cosine(t) for t in epochs],
    'Cyclic LR (half-cycle=20)':      [cyclic_lr(t) for t in epochs],
}
colors = ['steelblue', 'darkorange', 'seagreen', 'tomato', 'purple']
styles = ['-', '--', '-', '-.', ':']

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

# Left: all on same axes
for (label, lrs), color, ls in zip(schedules.items(), colors, styles):
    axes[0].plot(epochs, lrs, color=color, lw=2, linestyle=ls, label=label)
axes[0].set_xlabel('Epoch'); axes[0].set_ylabel('Learning Rate')
axes[0].set_title('All Schedules — Full Range')
axes[0].legend(fontsize=8); axes[0].grid(alpha=0.3)

# Right: log scale to see small-LR behaviour
for (label, lrs), color, ls in zip(schedules.items(), colors, styles):
    axes[1].semilogy(epochs, lrs, color=color, lw=2, linestyle=ls, label=label)
axes[1].set_xlabel('Epoch'); axes[1].set_ylabel('Learning Rate (log scale)')
axes[1].set_title('All Schedules — Log Scale')
axes[1].legend(fontsize=8); axes[1].grid(alpha=0.3, which='both')

plt.suptitle('Learning Rate Schedules over 100 Epochs', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

Effect on Convergence: Simulation on a Noisy Loss Surface

The cell below runs vanilla SGD (no momentum) on a 1D quadratic loss with added gradient noise. It compares two fixed learning rates with cosine annealing, step decay, and warm-up + cosine. The noise simulates the stochastic gradient variance of mini-batch training.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

def J(theta):  return (theta - 3.0) ** 2
def dJ(theta): return 2.0 * (theta - 3.0)

T        = 100
noise_sd = 0.3    # simulated SGD gradient noise
alpha0   = 0.3
alpha_min = 1e-4

def cosine_anneal(t, a0=alpha0, a_min=alpha_min, T=T):
    return a_min + 0.5*(a0 - a_min)*(1 + np.cos(np.pi * t / T))

configs = [
    ('Fixed α=0.30',             lambda t: alpha0),
    ('Fixed α=0.05',             lambda t: 0.05),
    ('Cosine Annealing',         cosine_anneal),
    ('Step Decay (γ=0.5, s=25)', lambda t: alpha0 * (0.5 ** (t // 25))),
    # Eq. (4): linear ramp for 10 steps, then cosine decay over the remaining T - 10 steps
    ('Warm-up + Cosine',         lambda t: alpha0 * t / 10 if t <= 10 else cosine_anneal(t - 10, T=T - 10)),
]
colors = ['tomato', 'steelblue', 'seagreen', 'darkorange', 'purple']

fig, axes = plt.subplots(1, 2, figsize=(13, 4))

for (label, sched), color in zip(configs, colors):
    theta = 0.0
    loss_hist, alpha_hist = [], []
    for t in range(T):
        alpha_t = sched(t)
        g = dJ(theta) + np.random.randn() * noise_sd
        theta -= alpha_t * g
        loss_hist.append(J(theta))
        alpha_hist.append(alpha_t)
    axes[0].plot(loss_hist, color=color, lw=1.8, label=f'{label} → J={loss_hist[-1]:.4f}')
    axes[1].plot(alpha_hist, color=color, lw=1.8, label=label)

axes[0].set_xlabel('Epoch'); axes[0].set_ylabel('J(θ) = (θ−3)²')
axes[0].set_title('Loss with Noisy SGD — Schedule Comparison')
axes[0].legend(fontsize=8); axes[0].grid(alpha=0.3)

axes[1].set_xlabel('Epoch'); axes[1].set_ylabel('α')
axes[1].set_title('Schedule Curves Used')
axes[1].legend(fontsize=8); axes[1].grid(alpha=0.3)

plt.suptitle('Schedule Effect on Noisy SGD Convergence', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
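
The plateau that fixed rates reach in this simulation can be estimated in closed form. For $J(\theta)=(\theta-3)^2$ with gradient noise of standard deviation $\sigma$, the error $e_t=\theta_t-3$ follows $e_{t+1}=(1-2\alpha)\,e_t-\alpha\,\varepsilon_t$, so the stationary expected loss is $\alpha\sigma^2/\bigl(4(1-\alpha)\bigr)$ for $0<\alpha<1$. The sketch below uses only the standard library, and its helper names are hypothetical. It compares the closed-form value with a long simulated run as a sanity check under these assumptions; it is not part of the original notebook.

```python
import random

def noise_floor(alpha, sigma):
    """Stationary E[(theta - 3)^2] for noisy SGD on J = (theta - 3)^2, 0 < alpha < 1."""
    return alpha * sigma ** 2 / (4 * (1 - alpha))

def simulated_floor(alpha, sigma, steps=50_000, burn_in=1_000, seed=0):
    """Average loss over a long fixed-rate run, ignoring the initial transient."""
    rng = random.Random(seed)
    theta, total = 0.0, 0.0
    for t in range(steps + burn_in):
        grad = 2.0 * (theta - 3.0) + rng.gauss(0.0, sigma)
        theta -= alpha * grad
        if t >= burn_in:
            total += (theta - 3.0) ** 2
    return total / steps

for alpha in (0.30, 0.05):
    print(f"alpha={alpha:.2f}: theory={noise_floor(alpha, 0.3):.5f}, "
          f"simulated={simulated_floor(alpha, 0.3):.5f}")
```

The floor grows roughly in proportion to $\alpha$, which is why shrinking the rate late in training lowers the final loss.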

Try It in the Browser: Design Your Own Schedule

Edit the parameters below to design a custom schedule and see the learning rate curve in text form.
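
The interactive cell is not reproduced in this text. As a stand-in, the following minimal sketch prints a schedule as a text bar chart. The parameter names (`ALPHA0`, `T_WARM`, and so on) and the `render` helper are assumptions for illustration, not the original widget.

```python
import math

# Editable parameters (illustrative defaults)
T, T_WARM = 40, 5
ALPHA0, ALPHA_MIN = 0.1, 1e-3

def custom_schedule(t):
    """Linear warm-up for T_WARM epochs, then cosine decay to ALPHA_MIN at t = T."""
    if t <= T_WARM:
        return ALPHA0 * t / T_WARM
    progress = (t - T_WARM) / (T - T_WARM)
    return ALPHA_MIN + 0.5 * (ALPHA0 - ALPHA_MIN) * (1 + math.cos(math.pi * progress))

def render(schedule, n_epochs, width=40, every=4):
    """Return one text line per sampled epoch, bar length proportional to the rate."""
    peak = max(schedule(t) for t in range(n_epochs + 1))
    lines = []
    for t in range(0, n_epochs + 1, every):
        lr = schedule(t)
        bar = "#" * round(width * lr / peak)
        lines.append(f"{t:3d} | {lr:.4f} | {bar}")
    return lines

print("\n".join(render(custom_schedule, T)))
```

Changing `T_WARM`, `ALPHA0`, or the body of `custom_schedule` and re-running shows how the curve responds.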

Schedules in scikit-learn and Practical ML

scikit-learn’s SGDClassifier and SGDRegressor expose built-in schedules through the learning_rate parameter ('constant', 'optimal', 'invscaling', 'adaptive'):

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=10, noise=25.0, random_state=42)
scaler = StandardScaler()
X = scaler.fit_transform(X)

schedule_configs = {
    'constant (η=0.01)':   dict(learning_rate='constant',    eta0=0.01),
    'optimal (sklearn)':   dict(learning_rate='optimal'),
    'invscaling':          dict(learning_rate='invscaling',  eta0=0.1, power_t=0.5),
    'adaptive':            dict(learning_rate='adaptive',    eta0=0.1),
}

fig, ax = plt.subplots(figsize=(8, 4))
colors = ['tomato', 'steelblue', 'seagreen', 'darkorange']

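for (label, kwargs), color in zip(schedule_configs.items(), colors):
    losses = []
    model = SGDRegressor(random_state=42, **kwargs)
    for _ in range(80):
        # partial_fit runs one epoch per call and keeps scikit-learn's update counter (t_)
        # growing, so decaying schedules carry over between epochs;
        # fit() resets that counter on every call, even with warm_start=True
        model.partial_fit(X, y)
        pred = model.predict(X)
        losses.append(np.mean((pred - y) ** 2))  # training MSE after this epoch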
    ax.plot(losses, color=color, lw=2, label=f'{label} → MSE={losses[-1]:.1f}')

ax.set_xlabel('Epoch'); ax.set_ylabel('MSE')
ax.set_title('sklearn SGDRegressor — Built-in LR Schedules')
ax.legend(fontsize=9); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
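
For reference, the scikit-learn documentation describes the rules behind these options: `constant` uses $\eta=\eta_0$, `invscaling` uses $\eta=\eta_0/t^{\text{power\_t}}$, and `optimal` uses $\eta=1/(\alpha\,(t+t_0))$, where $\alpha$ is the regularisation strength and $t_0$ is set by an internal heuristic. The sketch below re-implements the first two as plain functions for intuition. The function names are hypothetical, and $t$ counts weight updates (samples seen), not epochs.

```python
def eta_constant(t, eta0=0.01):
    """learning_rate='constant': the step size never changes."""
    return eta0

def eta_invscaling(t, eta0=0.1, power_t=0.5):
    """learning_rate='invscaling': eta0 / t**power_t, where t >= 1 is the update count."""
    return eta0 / t ** power_t

# With 500 samples per epoch, t grows by 500 per epoch.
for epoch in (0, 1, 10, 80):
    t = epoch * 500 + 1
    print(f"epoch {epoch:2d} (t={t:5d}): constant={eta_constant(t):.4f}, "
          f"invscaling={eta_invscaling(t):.5f}")
```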

Guided Practice

Why do we use a learning rate schedule instead of keeping the step size fixed forever?

Because different stages of training can benefit from different step sizes
Correct. Larger early steps and smaller later steps often improve both convergence speed and final loss.

Because schedules replace the optimizer entirely
A schedule sets the base $\alpha$ fed into the optimizer; it does not replace the optimizer.

Because fixed learning rates are illegal in machine learning
Fixed rates are perfectly legal; they just leave performance on the table.

Because schedules remove the need for validation data
Validation is still needed to detect overfitting and choose when to stop.

What is the typical purpose of decreasing the learning rate later in training?

To make the model forget the training data
That is not the purpose of a late-stage LR reduction.

To take finer steps near a good solution and reduce overshooting
Correct. When parameters are already close to the optimum, small steps prevent bouncing around instead of converging.

To convert the loss into accuracy
Learning rate does not change the metric definition.

To avoid using gradients altogether
Gradient-based methods still use gradients regardless of the schedule.

Why is a warm-up phase (linearly increasing LR) useful at the start of training with Adam?

It prevents the model from learning the first batch of data
The model still learns from the first batch; warm-up modulates how aggressively.

It sets all initial parameters to zero
Parameter initialisation is separate from the LR schedule.

Adam's moment estimates are initialised at zero and unreliable in early steps; a small LR limits the damage from noisy updates
Correct. Bias correction partially mitigates the cold-start bias, but the estimates are still noisy. Warm-up keeps early steps small until the moments stabilise.

Warm-up increases the batch size automatically
Batch size and LR schedule are independent settings.

A cosine annealing schedule runs for $T=100$ epochs with $\alpha_0=0.1$ and $\alpha_{\min}=0$. What is the learning rate at epoch 50?

0.05, exactly half of the initial rate, because $\cos(\pi/2) = 0$
Correct. At $t=50$, $\cos(\pi \times 50/100) = \cos(\pi/2) = 0$, so $\alpha_{50} = 0 + 0.5 \times 0.1 \times (1 + 0) = 0.05$.

0.10, unchanged from the initial rate
At epoch 0 the cosine term is $\cos(0)=1$, giving $\alpha_0=0.1$. At epoch 50 it has decayed.

0.00, the schedule reaches zero at the midpoint
The schedule reaches zero only at $t=T$, not at $t=T/2$.

0.07, a step-decay value
Step decay would give a different value; with $\alpha_{\min}=0$, cosine annealing at the midpoint is exactly $\alpha_0/2$.

Exercises

Exercise 1: Implement Cosine Annealing with Warm Restarts (SGDR)

SGDR (Stochastic Gradient Descent with Warm Restarts) restarts the cosine schedule at the end of each cycle. In this exercise the first cycle lasts $T_0$ epochs and each later cycle is twice as long as the one before:

$$\alpha_t = \alpha_{\min} + \tfrac{1}{2}(\alpha_0 - \alpha_{\min})\left(1 + \cos\!\left(\frac{\pi\, t_{\text{cycle}}}{T_{\text{cur}}}\right)\right) \tag{6}$$

where $T_{\text{cur}}$ is the length of the current cycle ($T_0$, $2T_0$, $4T_0$, ...) and $t_{\text{cycle}}$ is the number of epochs since the last restart, reset to 0 at each restart. Implement it and plot it for 100 epochs with $T_0=10$.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

T_TOTAL  = 100
T0       = 10      # initial cycle length
alpha0   = 0.1
alpha_min = 1e-4

def sgdr(t, T0=T0, alpha0=alpha0, alpha_min=alpha_min):
    # TODO: compute which cycle we are in and t_cycle (position within cycle)
    # cycle length doubles after each restart: T0, 2*T0, 4*T0, ...
    # hint: keep a running sum of cycle lengths to find the current cycle boundary
    return alpha_min  # replace with correct formula

epochs = np.arange(T_TOTAL)
lrs = [sgdr(t) for t in epochs]

plt.figure(figsize=(8, 3))
plt.plot(epochs, lrs, lw=2, color='steelblue')
plt.xlabel('Epoch'); plt.ylabel('Learning Rate')
plt.title('SGDR — Cosine Annealing with Warm Restarts')
plt.grid(alpha=0.3); plt.tight_layout(); plt.show()

Exercise 2: Schedule Impact on Final Loss

Run SGD with gradient noise $\sigma=0.5$ on $J(\theta)=(\theta-3)^2$ for 150 steps. Compare five schedule choices and report the final loss for each. Which schedule gets closest to the true minimum? Is there a schedule that consistently beats the others across 10 independent random seeds?

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(t):  return (t - 3.0)**2
def dJ(t): return 2.0*(t - 3.0)

# TODO: define at least 5 schedule functions
# TODO: run each across 10 random seeds, collect final J(theta)
# TODO: plot mean ± std of final loss for each schedule

np.random.seed(0)
T = 150
n_seeds = 10

# starter: one schedule implemented
def cosine_anneal(t, alpha0=0.3, alpha_min=1e-4, T=T):
    return alpha_min + 0.5*(alpha0 - alpha_min)*(1 + np.cos(np.pi * t / T))

# Your code here ...
print('Exercise 2 — implement and compare schedules')

Exercise 3: LR Range Test

The LR range test sweeps $\alpha$ from a very small value to a large value over a short number of steps and records the loss at each $\alpha$. The useful LR range is the interval over which the loss is still falling as $\alpha$ increases. Implement this for $J(\theta)=(\theta-3)^2$ and identify the useful range.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def J(t):  return (t - 3.0)**2
def dJ(t): return 2.0*(t - 3.0)

# LR range test: sweep alpha from 1e-4 to 2.0 over 50 steps
alphas   = np.logspace(-4, np.log10(2.0), 50)
theta0   = 0.0

losses = []
for alpha in alphas:
    # TODO: take one gradient step from theta0 with this alpha
    # record J(theta_after_step)
    losses.append(J(theta0))  # replace with J after step

plt.figure(figsize=(7, 4))
plt.semilogx(alphas, losses, lw=2, color='steelblue')
plt.xlabel('Learning Rate (log scale)'); plt.ylabel('Loss after 1 step')
plt.title('LR Range Test — identify useful α range')
plt.grid(alpha=0.3, which='both'); plt.tight_layout(); plt.show()

Common Pitfalls

Summary

Schedule	Formula (key)	Best for
Step Decay	$\alpha_0 \cdot \gamma^{\lfloor t/s \rfloor}$	Small datasets, CNNs
Exponential Decay	$\alpha_0 e^{-\lambda t}$	Smooth continuous decay
Cosine Annealing	$\alpha_{\min} + \frac{1}{2}(\alpha_0-\alpha_{\min})(1+\cos\frac{\pi t}{T})$	Large models, deep networks
Warm-up + Cosine	Ramp → Cosine	Transformers, fine-tuning
Cyclic LR	Triangle wave between $\alpha_{\min}$ and $\alpha_{\max}$	Experimental, escaping minima

Rule of thumb: cosine annealing is a safe default for most deep learning. Add a warm-up phase (5–10 % of total steps) when using Adam on a transformer or large pre-trained model.
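
As a worked example of the rule of thumb, the sketch below converts a warm-up fraction into a step count. The dataset size, batch size, epoch count, and the helper name `warmup_steps` are made-up illustrative values, not recommendations.

```python
import math

def warmup_steps(n_samples, batch_size, n_epochs, warmup_fraction=0.05):
    """Return (total optimizer steps, warm-up steps) for a given warm-up fraction."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    total = steps_per_epoch * n_epochs
    return total, int(total * warmup_fraction)   # truncate to a whole number of steps

total, warm = warmup_steps(n_samples=50_000, batch_size=128, n_epochs=30)
print(f"total steps = {total}, warm-up steps = {warm}")
```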

Next Up: Numerical Stability

Even a perfectly tuned optimizer with an ideal schedule can fail if floating-point arithmetic explodes or vanishes. The next notebook covers gradient clipping, log-sum-exp tricks, weight initialisation, and how to recognise and fix NaN/Inf values during training.