Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Numerical Stability

Overflow · Underflow · Log-Sum-Exp · Gradient Clipping · Stable Softmax

Chapter 6 — Optimization & Training Practicalities

Why Numbers Go Wrong in ML

Business context. A credit-scoring model uses raw annual salary (range 20 000–500 000) alongside age (18–85) and a binary flag. The salary feature has gradient magnitudes ~10 000× larger than the flag. With no normalisation, the first SGD step takes a huge step in the salary dimension, overshooting the optimum and sending gradients into the millions — NaN within 3 epochs. Standardising each feature to mean 0, std 1 equalises gradient magnitudes and the model converges in 30 epochs.

Float32 can represent numbers from roughly 10-38 to 1038. Values outside this range cause:

Failure modeWhat happensWhen it occurs
OverflowResult becomes infexp(1000), large unnormalised logits
UnderflowResult rounds to 0.0exp(-1000), very small probabilities
Division by zeroResult becomes nan or inflog(0), x / 0, 0/0
Catastrophic cancellationPrecision lost when subtracting nearly equal numbers(1 + 1e-16) - 10.0 in float64

Demonstrating the Failures

import numpy as np

print("=== Float32 limits ===")
print(f"  np.finfo(np.float32).max  = {np.finfo(np.float32).max:.2e}")
print(f"  np.finfo(np.float32).tiny = {np.finfo(np.float32).tiny:.2e}")

print("\n=== Overflow ===")
x = np.float32(1000.0)
print(f"  np.exp(1000.0) as float32 = {np.exp(x)}")   # inf
print(f"  np.exp(88.0)  as float32 = {np.exp(np.float32(88.0)):.3e}")  # OK

print("\n=== Underflow ===")
x = np.float32(-1000.0)
print(f"  np.exp(-1000.0) as float32 = {np.exp(x)}")  # 0.0

print("\n=== Division by zero ===")
with np.errstate(divide='ignore', invalid='ignore'):
    print(f"  np.log(0.0)  = {np.log(0.0)}")
    print(f"  0.0 / 0.0    = {np.float64(0.0) / np.float64(0.0)}")
    print(f"  np.log(1e-45) = {np.log(np.float32(1e-45))}")  # -inf due to underflow

print("\n=== Catastrophic cancellation ===")
a = np.float64(1.0) + np.float64(1e-16)
print(f"  (1 + 1e-16) - 1 in float64 = {a - 1.0}")
print(f"  np.expm1(1e-16)             = {np.expm1(np.float64(1e-16)):.2e}  ← stable")

The Log-Sum-Exp Trick — Stable Softmax

The naive softmax computes softmax(xi)=exijexj\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}. When any xix_i is large (e.g., 1000), exie^{x_i} overflows.

Fix: subtract max(x)\max(x) before exponentiating. The result is mathematically identical:

exijexj=exicjexjcwhere c=maxjxj\frac{e^{x_i}}{\sum_j e^{x_j}} = \frac{e^{x_i - c}}{\sum_j e^{x_j - c}} \quad \text{where } c = \max_j x_j

After subtracting the max, all exponent arguments are 0\leq 0, so no overflow, and the largest term is exactly 1, preventing underflow of the dominant class.

import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    x_shift = x - np.max(x)          # subtract max for stability
    e = np.exp(x_shift)
    return e / e.sum()

def log_softmax_stable(x):
    """log-softmax avoids computing softmax then taking log (two precision losses)."""
    c = np.max(x)
    return x - c - np.log(np.sum(np.exp(x - c)))

# Test 1: moderate values
x_ok = np.array([2.0, 1.0, 0.1])
print("Moderate logits [2.0, 1.0, 0.1]:")
print(f"  Naive:  {softmax_naive(x_ok)}")
print(f"  Stable: {softmax_stable(x_ok)}")

# Test 2: large values (naive overflows)
x_big = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over='ignore', invalid='ignore'):
    naive_big = softmax_naive(x_big)
print("\nLarge logits [1000, 1001, 1002]:")
print(f"  Naive:  {naive_big}    ← nan from inf/inf")
print(f"  Stable: {softmax_stable(x_big)}  ← correct")

# Test 3: very negative values (naive underflows)
x_neg = np.array([-1000.0, -999.0, -998.0])
with np.errstate(divide='ignore', invalid='ignore'):
    naive_neg = softmax_naive(x_neg)
print("\nVery negative logits [-1000, -999, -998]:")
print(f"  Naive:  {naive_neg}   ← nan from 0/0")
print(f"  Stable: {softmax_stable(x_neg)}  ← correct")

# Cross-entropy loss: use log_softmax directly instead of log(softmax(x))
true_class = 1
ce_naive  = -np.log(softmax_stable(x_ok)[true_class])
ce_stable = -log_softmax_stable(x_ok)[true_class]
print(f"\nCross-entropy on x_ok, true_class=1:")
print(f"  Via log(softmax): {ce_naive:.8f}")
print(f"  Via log_softmax:  {ce_stable:.8f}  ← more precise")

Gradient Clipping

When gradients become very large (e.g., in deep networks or RNNs), the update step overshoots the minimum. Two clipping strategies:

Clip by value: clamp each gradient component independently to [c,c][-c, c]:

giclip(gi,c,c)g_i \leftarrow \text{clip}(g_i, -c, c)

Clip by norm: scale the entire gradient vector so its L2 norm does not exceed threshold τ\tau:

ggmin(g,τ)g\mathbf{g} \leftarrow \mathbf{g} \cdot \frac{\min(\|\mathbf{g}\|, \tau)}{\|\mathbf{g}\|}

Clip-by-norm is preferred for deep networks because it preserves the gradient direction while bounding the step magnitude.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def clip_by_value(g, c=1.0):
    return np.clip(g, -c, c)

def clip_by_norm(g, tau=1.0):
    norm = np.linalg.norm(g)
    if norm > tau:
        return g * tau / norm
    return g

# Simulate gradient explosion: large random spikes
np.random.seed(42)
n_steps = 100
base_grads = np.random.randn(n_steps)
# Inject spikes at steps 20, 50, 75
for s in [20, 50, 75]:
    base_grads[s] = np.random.choice([-1, 1]) * (15 + np.random.rand() * 5)

clipped_val  = np.array([clip_by_value(g, c=2.0) for g in base_grads])
clipped_norm = np.array([clip_by_norm(np.array([g]), tau=2.0)[0] for g in base_grads])

fig, axes = plt.subplots(3, 1, figsize=(11, 7), sharex=True)
titles  = ['Raw Gradients (with explosion spikes)', 'Clip by Value (c=2)', 'Clip by Norm (τ=2)']
series  = [base_grads, clipped_val, clipped_norm]
colors  = ['tomato', 'steelblue', 'seagreen']

for ax, title, g, color in zip(axes, titles, series, colors):
    ax.plot(g, color=color, lw=1.2, alpha=0.85)
    ax.axhline(0, color='gray', lw=0.8)
    ax.set_title(title, fontsize=10)
    ax.set_ylabel('Gradient')
    ax.grid(alpha=0.3)
    ax.set_ylim(-20, 20)

axes[-1].set_xlabel('Step')
plt.suptitle('Gradient Clipping — Taming Exploding Gradients', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

# Effect on parameter update path
def sgd_path(grads, alpha=0.1, theta0=0.0):
    theta = theta0
    path = [theta]
    for g in grads:
        theta -= alpha * g
        path.append(theta)
    return path

path_raw   = sgd_path(base_grads)
path_val   = sgd_path(clipped_val)
path_norm  = sgd_path(clipped_norm)

plt.figure(figsize=(9, 3))
plt.plot(path_raw,  color='tomato',   lw=1.5, label=f'No clipping  (final={path_raw[-1]:.2f})')
plt.plot(path_val,  color='steelblue',lw=1.5, label=f'Clip value   (final={path_val[-1]:.2f})')
plt.plot(path_norm, color='seagreen', lw=1.5, label=f'Clip norm    (final={path_norm[-1]:.2f})')
plt.xlabel('Step'); plt.ylabel('θ')
plt.title('Parameter Trajectory with and without Gradient Clipping')
plt.legend(fontsize=9); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()

Feature Standardisation and Gradient Conditioning

Without standardisation, features on different scales produce gradients of very different magnitudes. The loss surface becomes an elongated ellipse — gradient descent zig-zags along the narrow direction and makes slow progress along the wide one.

Standard scaler: x=xμσx' = \dfrac{x - \mu}{\sigma} transforms each feature to mean 0, variance 1.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(7)

# Two-feature dataset: salary (20k–500k) and age (18–85)
n = 200
salary = np.random.uniform(20_000, 500_000, n)
age    = np.random.uniform(18, 85, n)
# True weights for a linear model (credit score)
y = 0.0001 * salary + 1.5 * age + np.random.randn(n) * 20

X_raw = np.column_stack([salary, age, np.ones(n)])

# Standardise
mu     = np.array([salary.mean(), age.mean(), 0.0])
sigma  = np.array([salary.std(),  age.std(),  1.0])
X_std  = (X_raw - mu) / sigma
X_std[:, 2] = 1.0  # keep bias column as ones

def mse_grad(X, y, theta):
    resid = X @ theta - y
    return 2 / len(y) * X.T @ resid

def run_gd(X, y, alpha, n_steps=200):
    theta = np.zeros(3)
    losses = []
    for _ in range(n_steps):
        g = mse_grad(X, y, theta)
        if not np.isfinite(g).all():
            losses.append(np.nan)
            break
        theta -= alpha * g
        losses.append(np.mean((X @ theta - y) ** 2))
    return theta, losses

_, losses_raw = run_gd(X_raw, y, alpha=1e-10, n_steps=200)   # tiny LR needed for raw
_, losses_std = run_gd(X_std, y, alpha=0.05,  n_steps=200)   # normal LR for standardised

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].semilogy(losses_raw, color='tomato',   lw=2, label=f'Raw features (α=1e-10)')
axes[0].semilogy(losses_std, color='seagreen', lw=2, label=f'Standardised (α=0.05)')
axes[0].set_xlabel('Step'); axes[0].set_ylabel('MSE (log scale)')
axes[0].set_title('Convergence: Raw vs Standardised')
axes[0].legend(); axes[0].grid(alpha=0.3, which='both')

# Gradient magnitude comparison at step 0
grad_raw = np.abs(mse_grad(X_raw, y, np.zeros(3)))
grad_std = np.abs(mse_grad(X_std, y, np.zeros(3)))
bar_x = np.arange(3)
w = 0.35
axes[1].bar(bar_x - w/2, grad_raw, w, color='tomato',   label='Raw')
axes[1].bar(bar_x + w/2, grad_std, w, color='seagreen', label='Standardised')
axes[1].set_xticks(bar_x); axes[1].set_xticklabels(['salary', 'age', 'bias'])
axes[1].set_ylabel('|gradient|'); axes[1].set_title('Gradient Magnitudes at Step 0')
axes[1].legend(); axes[1].grid(alpha=0.3)
axes[1].set_yscale('log')

plt.suptitle('Feature Standardisation Effect on Gradient Conditioning', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"Raw gradient magnitudes:     salary={grad_raw[0]:.2e}  age={grad_raw[1]:.2e}  ratio={grad_raw[0]/grad_raw[1]:.0f}×")
print(f"Standardised magnitudes:     salary={grad_std[0]:.2e}  age={grad_std[1]:.2e}  ratio={grad_std[0]/grad_std[1]:.1f}×")

Try It in the Browser — Safe vs Unsafe Computations

Edit the logit values and see how naive vs stable softmax differ.

Practical Epsilon Guards and NumPy Helpers

A collection of patterns used throughout ML pipelines:

import numpy as np

EPS = 1e-8

# 1. Safe log (prevent log(0))
probs = np.array([0.0, 0.2, 0.5, 1.0])
safe_log_probs = np.log(np.clip(probs, EPS, 1.0))
print("Safe log of [0, 0.2, 0.5, 1.0]:")
print(f"  {safe_log_probs}")

# 2. np.log1p for small x: log(1 + x) avoids cancellation when x ≈ 0
x_small = 1e-12
print(f"\nlog(1 + 1e-12) naive   : {np.log(1 + x_small):.20f}")
print(f"np.log1p(1e-12) stable : {np.log1p(x_small):.20f}")

# 3. np.expm1 for small x: exp(x) - 1 avoids cancellation
print(f"\nexp(1e-12) - 1 naive   : {np.exp(x_small) - 1:.20f}")
print(f"np.expm1(1e-12) stable : {np.expm1(x_small):.20f}")

# 4. Safe division (prevent 0/0)
numerator   = np.array([1.0, 0.0, 3.0])
denominator = np.array([2.0, 0.0, 4.0])
safe_div    = np.divide(numerator, denominator + EPS)
print(f"\nSafe division [1/2, 0/0, 3/4]: {safe_div}")

# 5. Binary cross-entropy with clipping
def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred_bad  = np.array([0.0, 1.0, 0.9, 0.8])  # 0 and 1 → log(0) disaster without clip
y_pred_good = np.array([0.9, 0.1, 0.9, 0.8])
print(f"\nBCE (clipped, bad probs):  {binary_cross_entropy(y_true, y_pred_bad):.4f}")
print(f"BCE (clipped, good probs): {binary_cross_entropy(y_true, y_pred_good):.4f}")

Guided Practice

Why is numerical stability important in machine learning code?

Unstable computations can overflow, underflow, or produce NaN — silently corrupting trainingCorrect. A single NaN gradient propagates through the entire network, making the model untrainable and the loss unreliable.
It eliminates the need for preprocessingPreprocessing (e.g., standardisation) is often the fix for numerical issues, not something stability removes the need for.
It guarantees perfect model fairnessNumerical stability and algorithmic fairness are orthogonal concerns.
It changes a classifier into a regressorStability techniques do not redefine the task or model family.

Why is vectorization often preferred over explicit Python loops in numerical code?

Because loops cannot compute arithmeticLoops can compute arithmetic — they are just slower and less precise.
NumPy vectorised operations call optimised C/BLAS routines that are faster and use higher-precision intermediate accumulationCorrect. Operations like `np.dot` use BLAS routines with extended precision accumulators, reducing rounding error compared to a naive Python loop.
Vectorization avoids all memory useVectorised arrays still consume memory; the benefit is speed and precision, not memory elimination.
Vectorization removes the need for evaluation metricsMetrics remain necessary regardless of implementation style.

The log-sum-exp trick subtracts $\max(x)$ before exponentiation. Which claim is true?

It changes the softmax output values to prevent the model from being overconfidentThe output is mathematically identical to naive softmax — only numerical behaviour changes.
The softmax values are unchanged but all exponent arguments are now ≤ 0, preventing overflowCorrect. Multiplying numerator and denominator by $e^{-c}$ is algebraically neutral but keeps all values in the safe range $[e^{-\infty}, e^0] = [0, 1]$.
It only works when all logits are positiveThe trick works for any logits — the subtraction is always safe regardless of sign.
It replaces the softmax with a sigmoid functionThe operation remains softmax; only the computation path is stabilised.

You are training a model and see the loss suddenly jump from 0.4 to `nan` at epoch 12. What is the most likely cause and first fix?

The dataset is too small — add more training examplesNaN from a sudden jump is almost always a numerical issue, not a data-size issue.
The model has too many layers — reduce depthDepth can contribute to gradient explosion, but the first fix should be targeted at the actual failure mode.
A gradient spike caused an overflow; first fix: add gradient clipping and check for unnormalised input featuresCorrect. A sudden NaN almost always traces back to an extreme gradient update. Clip gradients and standardise inputs first — these are low-effort, high-impact fixes.
The learning rate is too small — increase it by 10×A too-small LR causes slow convergence, not NaN. NaN suggests the LR or gradients are too large.

Exercises

Exercise 1 — Stable Log-Sum-Exp for Arbitrary Vectors

Implement a general logsumexp(x) function that computes logiexi\log\sum_i e^{x_i} in a numerically stable way. Verify it matches scipy.special.logsumexp (or a reference value) on the test cases below.

import numpy as np

def logsumexp(x):
    # TODO: implement log-sum-exp trick
    # hint: c = max(x); return c + log(sum(exp(x - c)))
    return 0.0  # replace

# Reference values
test_cases = [
    (np.array([0.0, 0.0, 0.0]),     np.log(3)),           # log(1+1+1)
    (np.array([1.0, 2.0, 3.0]),     3.4076059),
    (np.array([1000.0, 1001.0]),    1000 + np.log(1 + np.e)),  # stable reference
    (np.array([-1000.0, -999.0]),   -999 + np.log(1 + 1/np.e)),
]

print(f"{'Input':>30}  {'Your result':>14}  {'Reference':>14}  {'Match?':>8}")
for x, ref in test_cases:
    result = logsumexp(x)
    match  = abs(result - ref) < 1e-5
    print(f"{str(x):>30}  {result:>14.7f}  {ref:>14.7f}  {'✓' if match else '✗':>8}")

Exercise 2 — Gradient Norm Monitoring

Add gradient norm tracking to an SGD loop on J(θ)=(θ3)2J(\theta) = (\theta-3)^2 with injected noise spikes. Plot the gradient norm at each step and mark where clipping activates. Does clipping change the final converged value?

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

def J(t):  return (t - 3.0) ** 2
def dJ(t): return 2.0 * (t - 3.0)

n_steps = 80
alpha   = 0.1
clip_tau = 1.5

# Base gradients with spikes at steps 15, 40, 60
np.random.seed(5)
noise = np.random.randn(n_steps) * 0.3
for s in [15, 40, 60]:
    noise[s] += np.random.choice([-1,1]) * 12

# TODO: implement two SGD runs: without clipping and with clip_by_norm
# Record: theta path, gradient norms, clipping events
print('Exercise 2 — implement gradient norm monitoring')

Exercise 3 — Diagnosing NaN in a Linear Model

The code below has a numerical bug that causes NaN in the loss. Find and fix it without changing the model architecture.

import numpy as np

np.random.seed(0)
n, p = 100, 3

# Features with very different scales (intentional — this is the bug setup)
X = np.column_stack([
    np.random.uniform(0, 1_000_000, n),   # feature 1: revenue
    np.random.uniform(0, 1, n),           # feature 2: rate
    np.ones(n),                           # bias
])
y = 2 * X[:, 0] + 500 * X[:, 1] + np.random.randn(n) * 100

theta = np.zeros(p)
alpha = 0.01  # too large for raw features

for step in range(20):
    grad = 2 / n * X.T @ (X @ theta - y)
    theta -= alpha * grad
    loss = np.mean((X @ theta - y) ** 2)
    print(f"Step {step:2d}: loss = {loss:.4e}")
    if not np.isfinite(loss):
        print(">>> NaN detected — apply your fix here <<<")
        break

# TODO: fix the NaN issue (hint: standardise X and/or reduce alpha)
# Then rerun and confirm the model converges

Common Pitfalls

Summary

Key takeaways
ProblemSymptomFix
Overflow in softmaxnan class probabilitiesLog-sum-exp: subtract max(x)\max(x)
log(0)−inf lossnp.clip(p, ε, 1.0) before log
Exploding gradientsLoss jumps to nanGradient clipping (by norm)
Ill-conditioned featuresZig-zag, slow convergenceStandardScaler (fit on train only)
Catastrophic cancellationPrecision loss near zeronp.log1p, np.expm1
Slow loopsRuntime bottleneckNumPy vectorisation / broadcasting

Checklist when you see NaN: (1) Check feature scales. (2) Reduce LR. (3) Add gradient clipping. (4) Check for log(0) or division by zero in loss computation.

Next Up — Optimization Lab

Optimization Lab
You have now built the complete optimizer stack — gradient descent, adaptive optimizers, learning rate schedules, and numerical stability guards. The lab notebook brings everything together: you will run controlled experiments comparing all variants on a real ML problem, measuring convergence speed, final accuracy, and training stability.