
Optimization Lab¶
Comparing GD Variants · Optimizers · Schedules · Stability · Real Data
Chapter 6 — Optimization & Training Practicalities · Capstone
Setup — Shared Utilities¶
Run this cell first. It defines the shared loss, gradient, and all optimizer implementations used throughout the lab.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
np.random.seed(42)
# ── 1D lab problem ────────────────────────────────────────────────────────────
def J(w): return (w - 3.0) ** 2 + 2.0
def dJ(w): return 2.0 * (w - 3.0)
# ── Optimizer implementations (all return loss history + final theta) ─────────
def batch_gd(w0=10.0, alpha=0.1, n=60, noise_sd=0.0):
w, hist = w0, [J(w0)]
for _ in range(n):
g = dJ(w) + np.random.randn() * noise_sd
w -= alpha * g
hist.append(J(w))
return w, hist
def sgd(w0=10.0, alpha=0.1, n=60, noise_sd=1.0):
"""SGD: one sample per step, simulated by noise."""
return batch_gd(w0, alpha, n, noise_sd=noise_sd)
def mini_batch_gd(w0=10.0, alpha=0.1, n=60, noise_sd=0.3):
"""Mini-batch: intermediate noise level."""
return batch_gd(w0, alpha, n, noise_sd=noise_sd)
def momentum_gd(w0=10.0, alpha=0.1, beta=0.9, n=60, noise_sd=0.0):
w, v, hist = w0, 0.0, [J(w0)]
for _ in range(n):
g = dJ(w) + np.random.randn() * noise_sd
v = beta * v + (1 - beta) * g
w -= alpha * v
hist.append(J(w))
return w, hist
def adagrad(w0=10.0, alpha=0.5, eps=1e-8, n=60, noise_sd=0.0):
w, G, hist = w0, 0.0, [J(w0)]
for _ in range(n):
g = dJ(w) + np.random.randn() * noise_sd
G += g ** 2
w -= alpha / (np.sqrt(G) + eps) * g
hist.append(J(w))
return w, hist
def rmsprop(w0=10.0, alpha=0.1, rho=0.9, eps=1e-8, n=60, noise_sd=0.0):
w, s, hist = w0, 0.0, [J(w0)]
for _ in range(n):
g = dJ(w) + np.random.randn() * noise_sd
s = rho * s + (1 - rho) * g ** 2
w -= alpha / (np.sqrt(s) + eps) * g
hist.append(J(w))
return w, hist
def adam(w0=10.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, n=60, noise_sd=0.0):
w, m, v, hist = w0, 0.0, 0.0, [J(w0)]
for t in range(1, n + 1):
g = dJ(w) + np.random.randn() * noise_sd
m = b1 * m + (1 - b1) * g
v = b2 * v + (1 - b2) * g ** 2
m_hat = m / (1 - b1 ** t)
v_hat = v / (1 - b2 ** t)
w -= alpha * m_hat / (np.sqrt(v_hat) + eps)
hist.append(J(w))
return w, hist
print("Setup complete — all optimizer functions loaded.")Experiment 1 — GD Variants: Noise Effect¶
Compare Batch GD (no noise), Mini-batch GD (moderate noise), and SGD (high noise) on the 1D quadratic. All use the same initial point and learning rate .
Hypothesis: batch GD converges smoothly but all reach the same neighbourhood of the minimum.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
N = 60
alpha = 0.1
runs = [
('Batch GD (σ=0)', batch_gd(alpha=alpha, n=N, noise_sd=0.0), 'steelblue', '-'),
('Mini-batch (σ=0.5)', mini_batch_gd(alpha=alpha, n=N, noise_sd=0.5), 'seagreen', '--'),
('SGD (σ=2.0)', sgd(alpha=alpha, n=N, noise_sd=2.0), 'tomato', ':'),
]
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
steps = np.arange(N + 1)
for label, (w_final, hist), color, ls in runs:
axes[0].plot(steps, hist, color=color, lw=2, linestyle=ls,
label=f'{label} → J={hist[-1]:.3f}')
axes[1].plot(steps, hist, color=color, lw=2, linestyle=ls, label=label)
axes[0].set_xlabel('Step'); axes[0].set_ylabel('J(w)')
axes[0].set_title('Loss — Full Scale')
axes[0].axhline(2.0, color='gray', lw=1, linestyle=':', alpha=0.6, label='True min J=2')
axes[0].legend(fontsize=9); axes[0].grid(alpha=0.3)
axes[1].set_yscale('log'); axes[1].set_xlabel('Step')
axes[1].set_ylabel('J(w) − 2 (log scale)')
axes[1].set_title('Excess Loss (log scale)')
# Plot excess loss (J - true_min)
for label, (w_final, hist), color, ls in runs:
excess = np.array(hist) - 2.0
excess = np.where(excess < 1e-9, 1e-9, excess)
axes[1].semilogy(steps, excess, color=color, lw=2, linestyle=ls)
axes[1].legend(fontsize=9); axes[1].grid(alpha=0.3, which='both')
plt.suptitle('Experiment 1 — GD Variant Comparison (noise = gradient stochasticity)', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()
print("Final positions:")
for label, (w_final, hist), *_ in runs:
print(f" {label:<28}: w={w_final:.4f} J={hist[-1]:.4f}")Experiment 2 — Optimizer Race on 1D Quadratic¶
Compare all seven optimizers from chapter 6 starting at , noise = 0 (clean gradients). Which converges fastest? Which reaches closest to the true minimum , ?
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
N = 60
race = [
('Batch GD (α=0.10)', batch_gd(alpha=0.10, n=N), 'navy'),
('SGD (α=0.10)', sgd(alpha=0.10, n=N, noise_sd=0.5), 'cornflowerblue'),
('Mini-batch (α=0.10)', mini_batch_gd(alpha=0.10, n=N, noise_sd=0.2), 'deepskyblue'),
('Momentum β=.9 (α=0.10)', momentum_gd(alpha=0.10, beta=0.9, n=N), 'seagreen'),
('Adagrad (α=0.50)', adagrad(alpha=0.50, n=N), 'darkorange'),
('RMSProp (α=0.10)', rmsprop(alpha=0.10, n=N), 'red'),
('Adam (α=0.30)', adam(alpha=0.30, n=N), 'gold'),
]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
steps = np.arange(N + 1)
for label, (w_final, hist), color in race:
axes[0].plot(steps, hist, color=color, lw=1.8, label=f'{label} J={hist[-1]:.4f}')
excess = np.maximum(np.array(hist) - 2.0, 1e-10)
axes[1].semilogy(steps, excess, color=color, lw=1.8, label=label)
axes[0].axhline(2.0, color='black', lw=1.2, linestyle='--', alpha=0.4, label='True min J=2')
axes[0].set_xlabel('Step'); axes[0].set_ylabel('J(w)')
axes[0].set_title('Optimizer Race — Loss')
axes[0].legend(fontsize=7.5, loc='upper right'); axes[0].grid(alpha=0.3)
axes[1].set_xlabel('Step'); axes[1].set_ylabel('Excess Loss J(w)−2 (log)')
axes[1].set_title('Optimizer Race — Convergence Rate')
axes[1].legend(fontsize=7.5); axes[1].grid(alpha=0.3, which='both')
plt.suptitle('Experiment 2 — All Optimizers on 1D Quadratic', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()
# Summary table
print(f"{'Optimizer':<34} {'Final w':>10} {'Final J':>10} {'Steps to J<2.01':>16}")
for label, (w_final, hist), _ in race:
steps_to = next((i for i, v in enumerate(hist) if v < 2.01), -1)
print(f"{label:<34} {w_final:>10.4f} {hist[-1]:>10.6f} {steps_to:>16}")Experiment 3 — Schedule vs No Schedule Under Noise¶
SGD with gradient noise (simulating mini-batch variance) — compare fixed LR vs cosine annealing vs step decay over 100 steps. The schedule should reduce the noise impact in later epochs.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
N = 100
NOISE = 0.8
ALPHA0 = 0.3
def cosine_anneal(t, T=N, a0=ALPHA0, amin=1e-4):
return amin + 0.5*(a0-amin)*(1 + np.cos(np.pi*t/T))
def step_decay(t, a0=ALPHA0, gamma=0.5, step_size=25):
return a0 * (gamma ** (t // step_size))
schedules = [
('Fixed α=0.30', lambda t: ALPHA0),
('Fixed α=0.05', lambda t: 0.05),
('Cosine Annealing', cosine_anneal),
('Step Decay (γ=0.5)', step_decay),
]
colors = ['tomato', 'steelblue', 'seagreen', 'darkorange']
# Run each schedule across 5 seeds, compute mean ± std trajectory
n_seeds = 5
all_results = {}
for label, sched in schedules:
seed_hists = []
for seed in range(n_seeds):
np.random.seed(seed)
w = 10.0
hist = [J(w)]
for t in range(N):
g = dJ(w) + np.random.randn() * NOISE
w -= sched(t) * g
hist.append(J(w))
seed_hists.append(hist)
all_results[label] = np.array(seed_hists) # shape (n_seeds, N+1)
fig, axes = plt.subplots(1, 2, figsize=(13, 4))
steps = np.arange(N + 1)
for (label, _), color in zip(schedules, colors):
data = all_results[label]
mu = data.mean(axis=0)
sd = data.std(axis=0)
axes[0].plot(steps, mu, color=color, lw=2, label=f'{label} J={mu[-1]:.3f}')
axes[0].fill_between(steps, mu-sd, mu+sd, color=color, alpha=0.15)
axes[1].semilogy(steps, np.maximum(mu-2.0, 1e-10), color=color, lw=2, label=label)
axes[0].axhline(2.0, color='gray', lw=1, linestyle=':', alpha=0.6)
axes[0].set_xlabel('Step'); axes[0].set_ylabel('J(w) — mean ± 1 std')
axes[0].set_title('Schedule Comparison on Noisy SGD (mean ± std over 5 seeds)')
axes[0].legend(fontsize=8); axes[0].grid(alpha=0.3)
axes[1].set_xlabel('Step'); axes[1].set_ylabel('Excess Loss (log)')
axes[1].set_title('Convergence to True Minimum')
axes[1].legend(fontsize=8); axes[1].grid(alpha=0.3, which='both')
plt.suptitle('Experiment 3 — LR Schedules on Noisy SGD (5 seeds)', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()Experiment 4 — Numerical Stability Diagnosis¶
The cell below has a training loop with two embedded bugs. Run it, identify the bugs from the output, and fix them in the follow-up cell.
import numpy as np
np.random.seed(0)
n, p = 100, 2
# Bug setup: unscaled features
X = np.column_stack([
np.random.uniform(0, 1e6, n), # revenue (very large scale)
np.random.uniform(0, 1, n), # ratio (normal scale)
np.ones(n), # bias
])
y = 1e-4 * X[:,0] + 2 * X[:,1] + np.random.randn(n) * 10
theta = np.zeros(3)
alpha = 0.1 # Bug: too large for unscaled features
for step in range(10):
g = 2/n * X.T @ (X @ theta - y)
theta -= alpha * g
loss = np.mean((X @ theta - y)**2)
print(f"Step {step:2d}: loss = {loss:.4e}")
if not np.isfinite(loss):
print("\n>>> NaN detected — investigate and fix <<<")
breakimport numpy as np
np.random.seed(0)
n, p = 100, 2
X_raw = np.column_stack([
np.random.uniform(0, 1e6, n),
np.random.uniform(0, 1, n),
np.ones(n),
])
y = 1e-4 * X_raw[:,0] + 2 * X_raw[:,1] + np.random.randn(n) * 10
# FIX 1: standardise features (but NOT bias column)
mu = np.array([X_raw[:,0].mean(), X_raw[:,1].mean(), 0.0])
sigma = np.array([X_raw[:,0].std(), X_raw[:,1].std(), 1.0])
X = (X_raw - mu) / sigma
X[:,2] = 1.0 # restore bias
# FIX 2: use a reasonable learning rate for standardised features
theta = np.zeros(3)
alpha = 0.05
print("Fixed version:")
for step in range(40):
g = 2/n * X.T @ (X @ theta - y)
theta -= alpha * g
loss = np.mean((X @ theta - y)**2)
if step % 10 == 0 or step == 39:
print(f" Step {step:2d}: loss = {loss:.4e}")
print(f"Converged: {'yes' if np.isfinite(loss) else 'no'}")Experiment 5 — Full Pipeline on Real Data (California Housing)¶
Apply the chapter-6 toolkit end-to-end on the California Housing dataset. Compare plain SGD vs Adam with cosine annealing and gradient clipping.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# ── Load and preprocess ───────────────────────────────────────────────────────
data = fetch_california_housing()
X_all, y_all = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr) # fit on train only
X_te = scaler.transform(X_te)
# Add bias column
X_tr = np.column_stack([X_tr, np.ones(len(X_tr))])
X_te = np.column_stack([X_te, np.ones(len(X_te))])
p = X_tr.shape[1] # 9 features + 1 bias
def mse(X, y, theta): return np.mean((X @ theta - y) ** 2)
def grad_full(X, y, theta): return 2/len(y) * X.T @ (X @ theta - y)
def clip_by_norm(g, tau=5.0):
norm = np.linalg.norm(g)
return g * tau / norm if norm > tau else g
def cosine_anneal(t, T, a0, amin=1e-5):
return amin + 0.5*(a0-amin)*(1 + np.cos(np.pi*t/T))
N_EPOCHS = 150
BATCH = 64
N_BATCHES = len(X_tr) // BATCH
np.random.seed(42)
def run_mini_batch_sgd(alpha0=0.05, schedule=None, clip_tau=None):
theta = np.zeros(p)
train_loss, test_loss = [], []
idx = np.arange(len(X_tr))
total_steps = 0
for epoch in range(N_EPOCHS):
np.random.shuffle(idx)
for b in range(N_BATCHES):
batch_idx = idx[b*BATCH:(b+1)*BATCH]
Xb, yb = X_tr[batch_idx], y_tr[batch_idx]
g = grad_full(Xb, yb, theta)
if clip_tau:
g = clip_by_norm(g, tau=clip_tau)
alpha = schedule(total_steps, N_EPOCHS*N_BATCHES, alpha0) if schedule else alpha0
theta -= alpha * g
total_steps += 1
train_loss.append(mse(X_tr, y_tr, theta))
test_loss.append(mse(X_te, y_te, theta))
return theta, train_loss, test_loss
def run_adam_mini_batch(alpha0=0.01, schedule=None, clip_tau=None):
theta = np.zeros(p)
m, v = np.zeros(p), np.zeros(p)
b1, b2, eps = 0.9, 0.999, 1e-8
train_loss, test_loss = [], []
idx = np.arange(len(X_tr))
t = 0
for epoch in range(N_EPOCHS):
np.random.shuffle(idx)
for b_i in range(N_BATCHES):
t += 1
batch_idx = idx[b_i*BATCH:(b_i+1)*BATCH]
Xb, yb = X_tr[batch_idx], y_tr[batch_idx]
g = grad_full(Xb, yb, theta)
if clip_tau:
g = clip_by_norm(g, tau=clip_tau)
m = b1*m + (1-b1)*g; v = b2*v + (1-b2)*g**2
mh = m/(1-b1**t); vh = v/(1-b2**t)
alpha = schedule(t, N_EPOCHS*N_BATCHES, alpha0) if schedule else alpha0
theta -= alpha * mh / (np.sqrt(vh) + eps)
train_loss.append(mse(X_tr, y_tr, theta))
test_loss.append(mse(X_te, y_te, theta))
return theta, train_loss, test_loss
configs = [
('SGD fixed α', lambda: run_mini_batch_sgd(alpha0=0.05)),
('SGD + cosine', lambda: run_mini_batch_sgd(alpha0=0.1, schedule=cosine_anneal)),
('SGD + cosine + clip', lambda: run_mini_batch_sgd(alpha0=0.1, schedule=cosine_anneal, clip_tau=5.0)),
('Adam fixed α', lambda: run_adam_mini_batch(alpha0=0.01)),
('Adam + cosine + clip', lambda: run_adam_mini_batch(alpha0=0.01, schedule=cosine_anneal, clip_tau=5.0)),
]
colors = ['navy', 'steelblue', 'deepskyblue', 'tomato', 'darkred']
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
epochs_ax = np.arange(1, N_EPOCHS + 1)
print(f"{'Config':<30} {'Train MSE':>10} {'Test MSE':>10}")
for (label, fn), color in zip(configs, colors):
_, tr_loss, te_loss = fn()
axes[0].plot(epochs_ax, tr_loss, color=color, lw=1.8, label=label)
axes[1].plot(epochs_ax, te_loss, color=color, lw=1.8,
label=f'{label} ({te_loss[-1]:.3f})')
print(f" {label:<30} {tr_loss[-1]:>10.4f} {te_loss[-1]:>10.4f}")
axes[0].set_xlabel('Epoch'); axes[0].set_ylabel('MSE')
axes[0].set_title('Training MSE')
axes[0].legend(fontsize=8); axes[0].grid(alpha=0.3)
axes[1].set_xlabel('Epoch'); axes[1].set_ylabel('MSE')
axes[1].set_title('Test MSE')
axes[1].legend(fontsize=8); axes[1].grid(alpha=0.3)
plt.suptitle('Experiment 5 — Full Pipeline on California Housing (standardised, mini-batch, 150 epochs)',
fontsize=11, fontweight='bold')
plt.tight_layout()
plt.show()Guided Practice¶
What is the main purpose of comparing different gradient descent variants in a lab?¶
If one optimizer converges quickly but shows high variance across random seeds, what should you conclude?¶
Open-Ended Challenges¶
Challenge A — Learning Rate Grid Search¶
For each optimizer (SGD, Momentum, Adam), find the best learning rate on the 1D quadratic. Which optimizer is most sensitive to choice?
Challenge B — Non-Convex Extension¶
Modify the 1D problem to use a double-well loss: . This has two local minima at and . Run all optimizers from (on the saddle between wells). Which reach a minimum, and which?
Challenge C — Business Scenario¶
You are optimising a marketing mix model on 5 years of weekly spend data (260 samples, 8 features). Budget is constrained: you can afford at most 30 seconds of compute. Based on Experiment 5, which combination of (optimizer, schedule, batch size) would you recommend and why?
# Your experiments here
# Challenge A: LR grid search
# Challenge B: double-well
# Challenge C: justify your recommendation in a comment
print('Open-ended challenge workspace')Lab Summary¶
Key findings from the experiments
| Experiment | Key finding |
|---|---|
| 1 — GD variants | Batch GD is smooth; mini-batch and SGD add oscillation proportional to noise level |
| 2 — Optimizer race | Adam and RMSProp typically converge in fewest steps; SGD converges smoothly but slower |
| 3 — Schedules | Cosine annealing reduces late-stage oscillation more than fixed LR; step decay reduces it in steps |
| 4 — Stability | Unscaled features + large LR → NaN in <10 steps; standardise first, then tune LR |
| 5 — Real data | Adam + schedule + clipping achieves lower test MSE than vanilla SGD on California Housing |
Chapter 6 full toolkit: GD algorithm → optimizer choice → LR schedule → numerical stability guards → mini-batch pipeline.
Next Chapter — Logistic Regression & Classification¶

Chapter 6 is complete. You can now build and debug any gradient-based training loop from scratch. Chapter 7 applies everything to classification — logistic regression uses the same gradient descent engine, but with a sigmoid output and cross-entropy loss instead of MSE.