Optimization Lab¶

Comparing GD Variants · Optimizers · Schedules · Stability · Real Data

Chapter 6 — Optimization & Training Practicalities · Capstone

Setup — Shared Utilities¶

Run this cell first. It defines the shared loss, gradient, and all optimizer implementations used throughout the lab.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

np.random.seed(42)

# ── 1D lab problem ────────────────────────────────────────────────────────────
def J(w):  return (w - 3.0) ** 2 + 2.0
def dJ(w): return 2.0 * (w - 3.0)

# ── Optimizer implementations (all return loss history + final theta) ─────────
def batch_gd(w0=10.0, alpha=0.1, n=60, noise_sd=0.0):
    w, hist = w0, [J(w0)]
    for _ in range(n):
        g = dJ(w) + np.random.randn() * noise_sd
        w -= alpha * g
        hist.append(J(w))
    return w, hist

def sgd(w0=10.0, alpha=0.1, n=60, noise_sd=1.0):
    """SGD: one sample per step, simulated by noise."""
    return batch_gd(w0, alpha, n, noise_sd=noise_sd)

def mini_batch_gd(w0=10.0, alpha=0.1, n=60, noise_sd=0.3):
    """Mini-batch: intermediate noise level."""
    return batch_gd(w0, alpha, n, noise_sd=noise_sd)

def momentum_gd(w0=10.0, alpha=0.1, beta=0.9, n=60, noise_sd=0.0):
    w, v, hist = w0, 0.0, [J(w0)]
    for _ in range(n):
        g = dJ(w) + np.random.randn() * noise_sd
        v = beta * v + (1 - beta) * g
        w -= alpha * v
        hist.append(J(w))
    return w, hist

def adagrad(w0=10.0, alpha=0.5, eps=1e-8, n=60, noise_sd=0.0):
    w, G, hist = w0, 0.0, [J(w0)]
    for _ in range(n):
        g = dJ(w) + np.random.randn() * noise_sd
        G += g ** 2
        w -= alpha / (np.sqrt(G) + eps) * g
        hist.append(J(w))
    return w, hist

def rmsprop(w0=10.0, alpha=0.1, rho=0.9, eps=1e-8, n=60, noise_sd=0.0):
    w, s, hist = w0, 0.0, [J(w0)]
    for _ in range(n):
        g = dJ(w) + np.random.randn() * noise_sd
        s = rho * s + (1 - rho) * g ** 2
        w -= alpha / (np.sqrt(s) + eps) * g
        hist.append(J(w))
    return w, hist

def adam(w0=10.0, alpha=0.3, b1=0.9, b2=0.999, eps=1e-8, n=60, noise_sd=0.0):
    w, m, v, hist = w0, 0.0, 0.0, [J(w0)]
    for t in range(1, n + 1):
        g = dJ(w) + np.random.randn() * noise_sd
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w -= alpha * m_hat / (np.sqrt(v_hat) + eps)
        hist.append(J(w))
    return w, hist

print("Setup complete — all optimizer functions loaded.")

Experiment 1 — GD Variants: Noise Effect¶

Compare Batch GD (no noise), Mini-batch GD (moderate noise), and SGD (high noise) on the 1D quadratic. All use the same initial point $w_0 = 10$ and learning rate $\alpha = 0.1$ .

Hypothesis: batch GD converges smoothly but all reach the same neighbourhood of the minimum.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
N = 60
alpha = 0.1

runs = [
    ('Batch GD (σ=0)',     batch_gd(alpha=alpha,     n=N, noise_sd=0.0),  'steelblue',  '-'),
    ('Mini-batch (σ=0.5)', mini_batch_gd(alpha=alpha, n=N, noise_sd=0.5), 'seagreen',   '--'),
    ('SGD (σ=2.0)',        sgd(alpha=alpha,           n=N, noise_sd=2.0), 'tomato',     ':'),
]

fig, axes = plt.subplots(1, 2, figsize=(13, 4))
steps = np.arange(N + 1)

for label, (w_final, hist), color, ls in runs:
    axes[0].plot(steps, hist, color=color, lw=2, linestyle=ls,
                 label=f'{label}  →  J={hist[-1]:.3f}')
    axes[1].plot(steps, hist, color=color, lw=2, linestyle=ls, label=label)

axes[0].set_xlabel('Step'); axes[0].set_ylabel('J(w)')
axes[0].set_title('Loss — Full Scale')
axes[0].axhline(2.0, color='gray', lw=1, linestyle=':', alpha=0.6, label='True min J=2')
axes[0].legend(fontsize=9); axes[0].grid(alpha=0.3)

axes[1].set_yscale('log'); axes[1].set_xlabel('Step')
axes[1].set_ylabel('J(w) − 2  (log scale)')
axes[1].set_title('Excess Loss (log scale)')
# Plot excess loss (J - true_min)
for label, (w_final, hist), color, ls in runs:
    excess = np.array(hist) - 2.0
    excess = np.where(excess < 1e-9, 1e-9, excess)
    axes[1].semilogy(steps, excess, color=color, lw=2, linestyle=ls)
axes[1].legend(fontsize=9); axes[1].grid(alpha=0.3, which='both')

plt.suptitle('Experiment 1 — GD Variant Comparison (noise = gradient stochasticity)', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

print("Final positions:")
for label, (w_final, hist), *_ in runs:
    print(f"  {label:<28}: w={w_final:.4f}  J={hist[-1]:.4f}")

Experiment 2 — Optimizer Race on 1D Quadratic¶

Compare all seven optimizers from chapter 6 starting at $w_0 = 10$ , noise = 0 (clean gradients). Which converges fastest? Which reaches closest to the true minimum $w^* = 3$ , $J^* = 2$ ?

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
N = 60

race = [
    ('Batch GD      (α=0.10)', batch_gd(alpha=0.10, n=N),                        'navy'),
    ('SGD           (α=0.10)', sgd(alpha=0.10, n=N, noise_sd=0.5),               'cornflowerblue'),
    ('Mini-batch    (α=0.10)', mini_batch_gd(alpha=0.10, n=N, noise_sd=0.2),     'deepskyblue'),
    ('Momentum β=.9 (α=0.10)', momentum_gd(alpha=0.10, beta=0.9, n=N),           'seagreen'),
    ('Adagrad       (α=0.50)', adagrad(alpha=0.50, n=N),                          'darkorange'),
    ('RMSProp       (α=0.10)', rmsprop(alpha=0.10, n=N),                          'red'),
    ('Adam          (α=0.30)', adam(alpha=0.30, n=N),                             'gold'),
]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
steps = np.arange(N + 1)

for label, (w_final, hist), color in race:
    axes[0].plot(steps, hist, color=color, lw=1.8, label=f'{label}  J={hist[-1]:.4f}')
    excess = np.maximum(np.array(hist) - 2.0, 1e-10)
    axes[1].semilogy(steps, excess, color=color, lw=1.8, label=label)

axes[0].axhline(2.0, color='black', lw=1.2, linestyle='--', alpha=0.4, label='True min J=2')
axes[0].set_xlabel('Step'); axes[0].set_ylabel('J(w)')
axes[0].set_title('Optimizer Race — Loss')
axes[0].legend(fontsize=7.5, loc='upper right'); axes[0].grid(alpha=0.3)

axes[1].set_xlabel('Step'); axes[1].set_ylabel('Excess Loss J(w)−2 (log)')
axes[1].set_title('Optimizer Race — Convergence Rate')
axes[1].legend(fontsize=7.5); axes[1].grid(alpha=0.3, which='both')

plt.suptitle('Experiment 2 — All Optimizers on 1D Quadratic', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

# Summary table
print(f"{'Optimizer':<34}  {'Final w':>10}  {'Final J':>10}  {'Steps to J<2.01':>16}")
for label, (w_final, hist), _ in race:
    steps_to = next((i for i, v in enumerate(hist) if v < 2.01), -1)
    print(f"{label:<34}  {w_final:>10.4f}  {hist[-1]:>10.6f}  {steps_to:>16}")

Experiment 3 — Schedule vs No Schedule Under Noise¶

SGD with gradient noise (simulating mini-batch variance) — compare fixed LR vs cosine annealing vs step decay over 100 steps. The schedule should reduce the noise impact in later epochs.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
N      = 100
NOISE  = 0.8
ALPHA0 = 0.3

def cosine_anneal(t, T=N, a0=ALPHA0, amin=1e-4):
    return amin + 0.5*(a0-amin)*(1 + np.cos(np.pi*t/T))

def step_decay(t, a0=ALPHA0, gamma=0.5, step_size=25):
    return a0 * (gamma ** (t // step_size))

schedules = [
    ('Fixed α=0.30',        lambda t: ALPHA0),
    ('Fixed α=0.05',        lambda t: 0.05),
    ('Cosine Annealing',    cosine_anneal),
    ('Step Decay (γ=0.5)',  step_decay),
]
colors = ['tomato', 'steelblue', 'seagreen', 'darkorange']

# Run each schedule across 5 seeds, compute mean ± std trajectory
n_seeds = 5
all_results = {}

for label, sched in schedules:
    seed_hists = []
    for seed in range(n_seeds):
        np.random.seed(seed)
        w = 10.0
        hist = [J(w)]
        for t in range(N):
            g = dJ(w) + np.random.randn() * NOISE
            w -= sched(t) * g
            hist.append(J(w))
        seed_hists.append(hist)
    all_results[label] = np.array(seed_hists)  # shape (n_seeds, N+1)

fig, axes = plt.subplots(1, 2, figsize=(13, 4))
steps = np.arange(N + 1)

for (label, _), color in zip(schedules, colors):
    data = all_results[label]
    mu   = data.mean(axis=0)
    sd   = data.std(axis=0)
    axes[0].plot(steps, mu, color=color, lw=2, label=f'{label}  J={mu[-1]:.3f}')
    axes[0].fill_between(steps, mu-sd, mu+sd, color=color, alpha=0.15)

    axes[1].semilogy(steps, np.maximum(mu-2.0, 1e-10), color=color, lw=2, label=label)

axes[0].axhline(2.0, color='gray', lw=1, linestyle=':', alpha=0.6)
axes[0].set_xlabel('Step'); axes[0].set_ylabel('J(w) — mean ± 1 std')
axes[0].set_title('Schedule Comparison on Noisy SGD (mean ± std over 5 seeds)')
axes[0].legend(fontsize=8); axes[0].grid(alpha=0.3)

axes[1].set_xlabel('Step'); axes[1].set_ylabel('Excess Loss (log)')
axes[1].set_title('Convergence to True Minimum')
axes[1].legend(fontsize=8); axes[1].grid(alpha=0.3, which='both')

plt.suptitle('Experiment 3 — LR Schedules on Noisy SGD (5 seeds)', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

Experiment 4 — Numerical Stability Diagnosis¶

The cell below has a training loop with two embedded bugs. Run it, identify the bugs from the output, and fix them in the follow-up cell.

import numpy as np

np.random.seed(0)
n, p = 100, 2

# Bug setup: unscaled features
X = np.column_stack([
    np.random.uniform(0, 1e6, n),   # revenue (very large scale)
    np.random.uniform(0, 1,   n),   # ratio   (normal scale)
    np.ones(n),                     # bias
])
y = 1e-4 * X[:,0] + 2 * X[:,1] + np.random.randn(n) * 10

theta = np.zeros(3)
alpha = 0.1   # Bug: too large for unscaled features

for step in range(10):
    g = 2/n * X.T @ (X @ theta - y)
    theta -= alpha * g
    loss = np.mean((X @ theta - y)**2)
    print(f"Step {step:2d}: loss = {loss:.4e}")
    if not np.isfinite(loss):
        print("\n>>> NaN detected — investigate and fix <<<")
        break

import numpy as np

np.random.seed(0)
n, p = 100, 2

X_raw = np.column_stack([
    np.random.uniform(0, 1e6, n),
    np.random.uniform(0, 1,   n),
    np.ones(n),
])
y = 1e-4 * X_raw[:,0] + 2 * X_raw[:,1] + np.random.randn(n) * 10

# FIX 1: standardise features (but NOT bias column)
mu     = np.array([X_raw[:,0].mean(), X_raw[:,1].mean(), 0.0])
sigma  = np.array([X_raw[:,0].std(),  X_raw[:,1].std(),  1.0])
X = (X_raw - mu) / sigma
X[:,2] = 1.0  # restore bias

# FIX 2: use a reasonable learning rate for standardised features
theta = np.zeros(3)
alpha = 0.05

print("Fixed version:")
for step in range(40):
    g = 2/n * X.T @ (X @ theta - y)
    theta -= alpha * g
    loss = np.mean((X @ theta - y)**2)
    if step % 10 == 0 or step == 39:
        print(f"  Step {step:2d}: loss = {loss:.4e}")
print(f"Converged: {'yes' if np.isfinite(loss) else 'no'}")

Experiment 5 — Full Pipeline on Real Data (California Housing)¶

Apply the chapter-6 toolkit end-to-end on the California Housing dataset. Compare plain SGD vs Adam with cosine annealing and gradient clipping.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# ── Load and preprocess ───────────────────────────────────────────────────────
data = fetch_california_housing()
X_all, y_all = data.data, data.target

X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)   # fit on train only
X_te = scaler.transform(X_te)

# Add bias column
X_tr = np.column_stack([X_tr, np.ones(len(X_tr))])
X_te = np.column_stack([X_te, np.ones(len(X_te))])
p = X_tr.shape[1]  # 9 features + 1 bias

def mse(X, y, theta): return np.mean((X @ theta - y) ** 2)
def grad_full(X, y, theta): return 2/len(y) * X.T @ (X @ theta - y)

def clip_by_norm(g, tau=5.0):
    norm = np.linalg.norm(g)
    return g * tau / norm if norm > tau else g

def cosine_anneal(t, T, a0, amin=1e-5):
    return amin + 0.5*(a0-amin)*(1 + np.cos(np.pi*t/T))

N_EPOCHS = 150
BATCH    = 64
N_BATCHES = len(X_tr) // BATCH
np.random.seed(42)

def run_mini_batch_sgd(alpha0=0.05, schedule=None, clip_tau=None):
    theta = np.zeros(p)
    train_loss, test_loss = [], []
    idx = np.arange(len(X_tr))
    total_steps = 0
    for epoch in range(N_EPOCHS):
        np.random.shuffle(idx)
        for b in range(N_BATCHES):
            batch_idx = idx[b*BATCH:(b+1)*BATCH]
            Xb, yb = X_tr[batch_idx], y_tr[batch_idx]
            g = grad_full(Xb, yb, theta)
            if clip_tau:
                g = clip_by_norm(g, tau=clip_tau)
            alpha = schedule(total_steps, N_EPOCHS*N_BATCHES, alpha0) if schedule else alpha0
            theta -= alpha * g
            total_steps += 1
        train_loss.append(mse(X_tr, y_tr, theta))
        test_loss.append(mse(X_te, y_te, theta))
    return theta, train_loss, test_loss

def run_adam_mini_batch(alpha0=0.01, schedule=None, clip_tau=None):
    theta = np.zeros(p)
    m, v  = np.zeros(p), np.zeros(p)
    b1, b2, eps = 0.9, 0.999, 1e-8
    train_loss, test_loss = [], []
    idx = np.arange(len(X_tr))
    t = 0
    for epoch in range(N_EPOCHS):
        np.random.shuffle(idx)
        for b_i in range(N_BATCHES):
            t += 1
            batch_idx = idx[b_i*BATCH:(b_i+1)*BATCH]
            Xb, yb = X_tr[batch_idx], y_tr[batch_idx]
            g = grad_full(Xb, yb, theta)
            if clip_tau:
                g = clip_by_norm(g, tau=clip_tau)
            m = b1*m + (1-b1)*g; v = b2*v + (1-b2)*g**2
            mh = m/(1-b1**t); vh = v/(1-b2**t)
            alpha = schedule(t, N_EPOCHS*N_BATCHES, alpha0) if schedule else alpha0
            theta -= alpha * mh / (np.sqrt(vh) + eps)
        train_loss.append(mse(X_tr, y_tr, theta))
        test_loss.append(mse(X_te, y_te, theta))
    return theta, train_loss, test_loss

configs = [
    ('SGD fixed α',             lambda: run_mini_batch_sgd(alpha0=0.05)),
    ('SGD + cosine',            lambda: run_mini_batch_sgd(alpha0=0.1, schedule=cosine_anneal)),
    ('SGD + cosine + clip',     lambda: run_mini_batch_sgd(alpha0=0.1, schedule=cosine_anneal, clip_tau=5.0)),
    ('Adam fixed α',            lambda: run_adam_mini_batch(alpha0=0.01)),
    ('Adam + cosine + clip',    lambda: run_adam_mini_batch(alpha0=0.01, schedule=cosine_anneal, clip_tau=5.0)),
]
colors = ['navy', 'steelblue', 'deepskyblue', 'tomato', 'darkred']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
epochs_ax = np.arange(1, N_EPOCHS + 1)

print(f"{'Config':<30}  {'Train MSE':>10}  {'Test MSE':>10}")
for (label, fn), color in zip(configs, colors):
    _, tr_loss, te_loss = fn()
    axes[0].plot(epochs_ax, tr_loss, color=color, lw=1.8, label=label)
    axes[1].plot(epochs_ax, te_loss, color=color, lw=1.8,
                 label=f'{label}  ({te_loss[-1]:.3f})')
    print(f"  {label:<30}  {tr_loss[-1]:>10.4f}  {te_loss[-1]:>10.4f}")

axes[0].set_xlabel('Epoch'); axes[0].set_ylabel('MSE')
axes[0].set_title('Training MSE')
axes[0].legend(fontsize=8); axes[0].grid(alpha=0.3)

axes[1].set_xlabel('Epoch'); axes[1].set_ylabel('MSE')
axes[1].set_title('Test MSE')
axes[1].legend(fontsize=8); axes[1].grid(alpha=0.3)

plt.suptitle('Experiment 5 — Full Pipeline on California Housing (standardised, mini-batch, 150 epochs)',
             fontsize=11, fontweight='bold')
plt.tight_layout()
plt.show()

Guided Practice¶

What is the main purpose of comparing different gradient descent variants in a lab?¶

To observe practical tradeoffs in speed, stability, and convergence behaviour under controlled conditionsCorrect. Controlled experiments make optimizer tradeoffs concrete: one variant may converge in fewer steps but with higher variance; another converges smoothly but slowly.

To prove that every optimizer produces the exact same pathDifferent optimizers take different paths — that is precisely what makes the comparison informative.

To avoid plotting any resultsVisualisation is a key part of the lab — loss curves and trajectories reveal behaviour that tables alone cannot.

To remove all randomness from machine learningStochasticity is inherent to mini-batch training; labs help us understand it, not eliminate it.

If one optimizer converges quickly but shows high variance across random seeds, what should you conclude?¶

It is automatically the best choice for productionHigh variance means its performance is sensitive to initialisation or data ordering — a risk in production.

Optimizer selection involves tradeoffs between convergence speed and stability across runsCorrect. A fast but unstable optimizer may require multiple restarts. Pair speed with stability metrics (mean ± std across seeds) for a fair comparison.

The loss function must be wrongHigh variance is a property of the optimizer and its interaction with stochastic gradients, not the loss function.

Training data should be discardedDiscarding data does not address optimizer sensitivity.

Open-Ended Challenges¶

Challenge A — Learning Rate Grid Search¶

For each optimizer (SGD, Momentum, Adam), find the best learning rate $\alpha \in \{0.001, 0.01, 0.05, 0.1, 0.3, 0.5\}$ on the 1D quadratic. Which optimizer is most sensitive to $\alpha$ choice?

Challenge B — Non-Convex Extension¶

Modify the 1D problem to use a double-well loss: $J(w) = (w^2 - 4)^2$ . This has two local minima at $w=-2$ and $w=2$ . Run all optimizers from $w_0=0.1$ (on the saddle between wells). Which reach a minimum, and which?

Challenge C — Business Scenario¶

You are optimising a marketing mix model on 5 years of weekly spend data (260 samples, 8 features). Budget is constrained: you can afford at most 30 seconds of compute. Based on Experiment 5, which combination of (optimizer, schedule, batch size) would you recommend and why?

# Your experiments here
# Challenge A: LR grid search
# Challenge B: double-well
# Challenge C: justify your recommendation in a comment
print('Open-ended challenge workspace')

Lab Summary¶

Experiment	Key finding
1 — GD variants	Batch GD is smooth; mini-batch and SGD add oscillation proportional to noise level
2 — Optimizer race	Adam and RMSProp typically converge in fewest steps; SGD converges smoothly but slower
3 — Schedules	Cosine annealing reduces late-stage oscillation more than fixed LR; step decay reduces it in steps
4 — Stability	Unscaled features + large LR → NaN in <10 steps; standardise first, then tune LR
5 — Real data	Adam + schedule + clipping achieves lower test MSE than vanilla SGD on California Housing

Chapter 6 full toolkit: GD algorithm → optimizer choice → LR schedule → numerical stability guards → mini-batch pipeline.

Next Chapter — Logistic Regression & Classification¶

Logistic Regression & Classification
Chapter 6 is complete. You can now build and debug any gradient-based training loop from scratch. Chapter 7 applies everything to classification — logistic regression uses the same gradient descent engine, but with a sigmoid output and cross-entropy loss instead of MSE.