
Numerical Stability¶
Overflow · Underflow · Log-Sum-Exp · Gradient Clipping · Stable Softmax
Chapter 6 — Optimization & Training Practicalities
Why Numbers Go Wrong in ML¶

Float32 can represent numbers from roughly 10-38 to 1038. Values outside this range cause:
| Failure mode | What happens | When it occurs |
|---|---|---|
| Overflow | Result becomes inf | exp(1000), large unnormalised logits |
| Underflow | Result rounds to 0.0 | exp(-1000), very small probabilities |
| Division by zero | Result becomes nan or inf | log(0), x / 0, 0/0 |
| Catastrophic cancellation | Precision lost when subtracting nearly equal numbers | (1 + 1e-16) - 1 → 0.0 in float64 |
Demonstrating the Failures¶
import numpy as np
print("=== Float32 limits ===")
print(f" np.finfo(np.float32).max = {np.finfo(np.float32).max:.2e}")
print(f" np.finfo(np.float32).tiny = {np.finfo(np.float32).tiny:.2e}")
print("\n=== Overflow ===")
x = np.float32(1000.0)
print(f" np.exp(1000.0) as float32 = {np.exp(x)}") # inf
print(f" np.exp(88.0) as float32 = {np.exp(np.float32(88.0)):.3e}") # OK
print("\n=== Underflow ===")
x = np.float32(-1000.0)
print(f" np.exp(-1000.0) as float32 = {np.exp(x)}") # 0.0
print("\n=== Division by zero ===")
with np.errstate(divide='ignore', invalid='ignore'):
print(f" np.log(0.0) = {np.log(0.0)}")
print(f" 0.0 / 0.0 = {np.float64(0.0) / np.float64(0.0)}")
print(f" np.log(1e-45) = {np.log(np.float32(1e-45))}") # -inf due to underflow
print("\n=== Catastrophic cancellation ===")
a = np.float64(1.0) + np.float64(1e-16)
print(f" (1 + 1e-16) - 1 in float64 = {a - 1.0}")
print(f" np.expm1(1e-16) = {np.expm1(np.float64(1e-16)):.2e} ← stable")The Log-Sum-Exp Trick — Stable Softmax¶
The naive softmax computes . When any is large (e.g., 1000), overflows.
Fix: subtract before exponentiating. The result is mathematically identical:
After subtracting the max, all exponent arguments are , so no overflow, and the largest term is exactly 1, preventing underflow of the dominant class.
import numpy as np
def softmax_naive(x):
e = np.exp(x)
return e / e.sum()
def softmax_stable(x):
x_shift = x - np.max(x) # subtract max for stability
e = np.exp(x_shift)
return e / e.sum()
def log_softmax_stable(x):
"""log-softmax avoids computing softmax then taking log (two precision losses)."""
c = np.max(x)
return x - c - np.log(np.sum(np.exp(x - c)))
# Test 1: moderate values
x_ok = np.array([2.0, 1.0, 0.1])
print("Moderate logits [2.0, 1.0, 0.1]:")
print(f" Naive: {softmax_naive(x_ok)}")
print(f" Stable: {softmax_stable(x_ok)}")
# Test 2: large values (naive overflows)
x_big = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over='ignore', invalid='ignore'):
naive_big = softmax_naive(x_big)
print("\nLarge logits [1000, 1001, 1002]:")
print(f" Naive: {naive_big} ← nan from inf/inf")
print(f" Stable: {softmax_stable(x_big)} ← correct")
# Test 3: very negative values (naive underflows)
x_neg = np.array([-1000.0, -999.0, -998.0])
with np.errstate(divide='ignore', invalid='ignore'):
naive_neg = softmax_naive(x_neg)
print("\nVery negative logits [-1000, -999, -998]:")
print(f" Naive: {naive_neg} ← nan from 0/0")
print(f" Stable: {softmax_stable(x_neg)} ← correct")
# Cross-entropy loss: use log_softmax directly instead of log(softmax(x))
true_class = 1
ce_naive = -np.log(softmax_stable(x_ok)[true_class])
ce_stable = -log_softmax_stable(x_ok)[true_class]
print(f"\nCross-entropy on x_ok, true_class=1:")
print(f" Via log(softmax): {ce_naive:.8f}")
print(f" Via log_softmax: {ce_stable:.8f} ← more precise")Gradient Clipping¶
When gradients become very large (e.g., in deep networks or RNNs), the update step overshoots the minimum. Two clipping strategies:
Clip by value: clamp each gradient component independently to :
Clip by norm: scale the entire gradient vector so its L2 norm does not exceed threshold :
Clip-by-norm is preferred for deep networks because it preserves the gradient direction while bounding the step magnitude.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
def clip_by_value(g, c=1.0):
return np.clip(g, -c, c)
def clip_by_norm(g, tau=1.0):
norm = np.linalg.norm(g)
if norm > tau:
return g * tau / norm
return g
# Simulate gradient explosion: large random spikes
np.random.seed(42)
n_steps = 100
base_grads = np.random.randn(n_steps)
# Inject spikes at steps 20, 50, 75
for s in [20, 50, 75]:
base_grads[s] = np.random.choice([-1, 1]) * (15 + np.random.rand() * 5)
clipped_val = np.array([clip_by_value(g, c=2.0) for g in base_grads])
clipped_norm = np.array([clip_by_norm(np.array([g]), tau=2.0)[0] for g in base_grads])
fig, axes = plt.subplots(3, 1, figsize=(11, 7), sharex=True)
titles = ['Raw Gradients (with explosion spikes)', 'Clip by Value (c=2)', 'Clip by Norm (τ=2)']
series = [base_grads, clipped_val, clipped_norm]
colors = ['tomato', 'steelblue', 'seagreen']
for ax, title, g, color in zip(axes, titles, series, colors):
ax.plot(g, color=color, lw=1.2, alpha=0.85)
ax.axhline(0, color='gray', lw=0.8)
ax.set_title(title, fontsize=10)
ax.set_ylabel('Gradient')
ax.grid(alpha=0.3)
ax.set_ylim(-20, 20)
axes[-1].set_xlabel('Step')
plt.suptitle('Gradient Clipping — Taming Exploding Gradients', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
# Effect on parameter update path
def sgd_path(grads, alpha=0.1, theta0=0.0):
theta = theta0
path = [theta]
for g in grads:
theta -= alpha * g
path.append(theta)
return path
path_raw = sgd_path(base_grads)
path_val = sgd_path(clipped_val)
path_norm = sgd_path(clipped_norm)
plt.figure(figsize=(9, 3))
plt.plot(path_raw, color='tomato', lw=1.5, label=f'No clipping (final={path_raw[-1]:.2f})')
plt.plot(path_val, color='steelblue',lw=1.5, label=f'Clip value (final={path_val[-1]:.2f})')
plt.plot(path_norm, color='seagreen', lw=1.5, label=f'Clip norm (final={path_norm[-1]:.2f})')
plt.xlabel('Step'); plt.ylabel('θ')
plt.title('Parameter Trajectory with and without Gradient Clipping')
plt.legend(fontsize=9); plt.grid(alpha=0.3); plt.tight_layout(); plt.show()Feature Standardisation and Gradient Conditioning¶
Without standardisation, features on different scales produce gradients of very different magnitudes. The loss surface becomes an elongated ellipse — gradient descent zig-zags along the narrow direction and makes slow progress along the wide one.
Standard scaler: transforms each feature to mean 0, variance 1.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(7)
# Two-feature dataset: salary (20k–500k) and age (18–85)
n = 200
salary = np.random.uniform(20_000, 500_000, n)
age = np.random.uniform(18, 85, n)
# True weights for a linear model (credit score)
y = 0.0001 * salary + 1.5 * age + np.random.randn(n) * 20
X_raw = np.column_stack([salary, age, np.ones(n)])
# Standardise
mu = np.array([salary.mean(), age.mean(), 0.0])
sigma = np.array([salary.std(), age.std(), 1.0])
X_std = (X_raw - mu) / sigma
X_std[:, 2] = 1.0 # keep bias column as ones
def mse_grad(X, y, theta):
resid = X @ theta - y
return 2 / len(y) * X.T @ resid
def run_gd(X, y, alpha, n_steps=200):
theta = np.zeros(3)
losses = []
for _ in range(n_steps):
g = mse_grad(X, y, theta)
if not np.isfinite(g).all():
losses.append(np.nan)
break
theta -= alpha * g
losses.append(np.mean((X @ theta - y) ** 2))
return theta, losses
_, losses_raw = run_gd(X_raw, y, alpha=1e-10, n_steps=200) # tiny LR needed for raw
_, losses_std = run_gd(X_std, y, alpha=0.05, n_steps=200) # normal LR for standardised
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].semilogy(losses_raw, color='tomato', lw=2, label=f'Raw features (α=1e-10)')
axes[0].semilogy(losses_std, color='seagreen', lw=2, label=f'Standardised (α=0.05)')
axes[0].set_xlabel('Step'); axes[0].set_ylabel('MSE (log scale)')
axes[0].set_title('Convergence: Raw vs Standardised')
axes[0].legend(); axes[0].grid(alpha=0.3, which='both')
# Gradient magnitude comparison at step 0
grad_raw = np.abs(mse_grad(X_raw, y, np.zeros(3)))
grad_std = np.abs(mse_grad(X_std, y, np.zeros(3)))
bar_x = np.arange(3)
w = 0.35
axes[1].bar(bar_x - w/2, grad_raw, w, color='tomato', label='Raw')
axes[1].bar(bar_x + w/2, grad_std, w, color='seagreen', label='Standardised')
axes[1].set_xticks(bar_x); axes[1].set_xticklabels(['salary', 'age', 'bias'])
axes[1].set_ylabel('|gradient|'); axes[1].set_title('Gradient Magnitudes at Step 0')
axes[1].legend(); axes[1].grid(alpha=0.3)
axes[1].set_yscale('log')
plt.suptitle('Feature Standardisation Effect on Gradient Conditioning', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()
print(f"Raw gradient magnitudes: salary={grad_raw[0]:.2e} age={grad_raw[1]:.2e} ratio={grad_raw[0]/grad_raw[1]:.0f}×")
print(f"Standardised magnitudes: salary={grad_std[0]:.2e} age={grad_std[1]:.2e} ratio={grad_std[0]/grad_std[1]:.1f}×")Try It in the Browser — Safe vs Unsafe Computations¶
Edit the logit values and see how naive vs stable softmax differ.
Practical Epsilon Guards and NumPy Helpers¶
A collection of patterns used throughout ML pipelines:
import numpy as np
EPS = 1e-8
# 1. Safe log (prevent log(0))
probs = np.array([0.0, 0.2, 0.5, 1.0])
safe_log_probs = np.log(np.clip(probs, EPS, 1.0))
print("Safe log of [0, 0.2, 0.5, 1.0]:")
print(f" {safe_log_probs}")
# 2. np.log1p for small x: log(1 + x) avoids cancellation when x ≈ 0
x_small = 1e-12
print(f"\nlog(1 + 1e-12) naive : {np.log(1 + x_small):.20f}")
print(f"np.log1p(1e-12) stable : {np.log1p(x_small):.20f}")
# 3. np.expm1 for small x: exp(x) - 1 avoids cancellation
print(f"\nexp(1e-12) - 1 naive : {np.exp(x_small) - 1:.20f}")
print(f"np.expm1(1e-12) stable : {np.expm1(x_small):.20f}")
# 4. Safe division (prevent 0/0)
numerator = np.array([1.0, 0.0, 3.0])
denominator = np.array([2.0, 0.0, 4.0])
safe_div = np.divide(numerator, denominator + EPS)
print(f"\nSafe division [1/2, 0/0, 3/4]: {safe_div}")
# 5. Binary cross-entropy with clipping
def binary_cross_entropy(y_true, y_pred, eps=1e-7):
y_pred = np.clip(y_pred, eps, 1 - eps)
return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
y_true = np.array([1, 0, 1, 1])
y_pred_bad = np.array([0.0, 1.0, 0.9, 0.8]) # 0 and 1 → log(0) disaster without clip
y_pred_good = np.array([0.9, 0.1, 0.9, 0.8])
print(f"\nBCE (clipped, bad probs): {binary_cross_entropy(y_true, y_pred_bad):.4f}")
print(f"BCE (clipped, good probs): {binary_cross_entropy(y_true, y_pred_good):.4f}")Guided Practice¶
Why is numerical stability important in machine learning code?¶
Why is vectorization often preferred over explicit Python loops in numerical code?¶
The log-sum-exp trick subtracts $\max(x)$ before exponentiation. Which claim is true?¶
You are training a model and see the loss suddenly jump from 0.4 to `nan` at epoch 12. What is the most likely cause and first fix?¶
Exercises¶
Exercise 1 — Stable Log-Sum-Exp for Arbitrary Vectors¶
Implement a general logsumexp(x) function that computes in a numerically stable way. Verify it matches scipy.special.logsumexp (or a reference value) on the test cases below.
import numpy as np
def logsumexp(x):
# TODO: implement log-sum-exp trick
# hint: c = max(x); return c + log(sum(exp(x - c)))
return 0.0 # replace
# Reference values
test_cases = [
(np.array([0.0, 0.0, 0.0]), np.log(3)), # log(1+1+1)
(np.array([1.0, 2.0, 3.0]), 3.4076059),
(np.array([1000.0, 1001.0]), 1000 + np.log(1 + np.e)), # stable reference
(np.array([-1000.0, -999.0]), -999 + np.log(1 + 1/np.e)),
]
print(f"{'Input':>30} {'Your result':>14} {'Reference':>14} {'Match?':>8}")
for x, ref in test_cases:
result = logsumexp(x)
match = abs(result - ref) < 1e-5
print(f"{str(x):>30} {result:>14.7f} {ref:>14.7f} {'✓' if match else '✗':>8}")Exercise 2 — Gradient Norm Monitoring¶
Add gradient norm tracking to an SGD loop on with injected noise spikes. Plot the gradient norm at each step and mark where clipping activates. Does clipping change the final converged value?
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
def J(t): return (t - 3.0) ** 2
def dJ(t): return 2.0 * (t - 3.0)
n_steps = 80
alpha = 0.1
clip_tau = 1.5
# Base gradients with spikes at steps 15, 40, 60
np.random.seed(5)
noise = np.random.randn(n_steps) * 0.3
for s in [15, 40, 60]:
noise[s] += np.random.choice([-1,1]) * 12
# TODO: implement two SGD runs: without clipping and with clip_by_norm
# Record: theta path, gradient norms, clipping events
print('Exercise 2 — implement gradient norm monitoring')Exercise 3 — Diagnosing NaN in a Linear Model¶
The code below has a numerical bug that causes NaN in the loss. Find and fix it without changing the model architecture.
import numpy as np
np.random.seed(0)
n, p = 100, 3
# Features with very different scales (intentional — this is the bug setup)
X = np.column_stack([
np.random.uniform(0, 1_000_000, n), # feature 1: revenue
np.random.uniform(0, 1, n), # feature 2: rate
np.ones(n), # bias
])
y = 2 * X[:, 0] + 500 * X[:, 1] + np.random.randn(n) * 100
theta = np.zeros(p)
alpha = 0.01 # too large for raw features
for step in range(20):
grad = 2 / n * X.T @ (X @ theta - y)
theta -= alpha * grad
loss = np.mean((X @ theta - y) ** 2)
print(f"Step {step:2d}: loss = {loss:.4e}")
if not np.isfinite(loss):
print(">>> NaN detected — apply your fix here <<<")
break
# TODO: fix the NaN issue (hint: standardise X and/or reduce alpha)
# Then rerun and confirm the model convergesCommon Pitfalls¶
Summary¶
Key takeaways
| Problem | Symptom | Fix |
|---|---|---|
| Overflow in softmax | nan class probabilities | Log-sum-exp: subtract |
log(0) | −inf loss | np.clip(p, ε, 1.0) before log |
| Exploding gradients | Loss jumps to nan | Gradient clipping (by norm) |
| Ill-conditioned features | Zig-zag, slow convergence | StandardScaler (fit on train only) |
| Catastrophic cancellation | Precision loss near zero | np.log1p, np.expm1 |
| Slow loops | Runtime bottleneck | NumPy vectorisation / broadcasting |
Checklist when you see NaN: (1) Check feature scales. (2) Reduce LR. (3) Add gradient clipping. (4) Check for log(0) or division by zero in loss computation.
Next Up — Optimization Lab¶

You have now built the complete optimizer stack — gradient descent, adaptive optimizers, learning rate schedules, and numerical stability guards. The lab notebook brings everything together: you will run controlled experiments comparing all variants on a real ML problem, measuring convergence speed, final accuracy, and training stability.