Lab – Comparing GD Variants#
Welcome to the Optimization Olympics 🏅! Today, we’ll watch different Gradient Descent variants race to find the global minimum — and see which one deserves the gold medal 🥇.
🏁 Objective#
Compare how Batch GD, Stochastic GD, and Mini-Batch GD (and their cooler cousins like Adam) behave in real training scenarios.
You’ll:
- Visualize their learning paths 🎢
- Compare convergence speeds ⏱️
- Observe how hyperparameters change the story 📊
📦 Setup#
Let’s load some necessary libraries (no doping allowed 🚫💉):
import numpy as np
import matplotlib.pyplot as plt
# For reproducibility (no cheating)
np.random.seed(42)
🎯 The Loss Landscape#
We’ll simulate a simple quadratic loss function:

$$
L(w) = (w - 3)^2 + 2
$$

Its minimum is at $w = 3$ (the “finish line”).
def loss(w):
    return (w - 3)**2 + 2

def grad(w):
    return 2 * (w - 3)
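A quick sanity check of these definitions (the expected values follow directly from the formula above):

print(loss(3.0))   # 2.0: the lowest the loss can get
print(grad(3.0))   # 0.0: the gradient vanishes at the finish line
print(grad(10.0))  # 14.0: far from the minimum, the gradient is large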
🏃♀️ Gradient Descent Variants in Action#
1️⃣ Batch Gradient Descent#
Computes the gradient over the entire dataset at every step, so each update is smooth but expensive when the data is large.
w, lr = 10, 0.1
trajectory = []
for i in range(20):
    g = grad(w)              # exact (full-batch) gradient
    w -= lr * g
    trajectory.append(w)
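Because the loss is quadratic, each full-batch step shrinks the distance to the minimum by a constant factor: $w_{t+1} - 3 = (1 - 2\,\mathrm{lr})\,(w_t - 3)$. As a small optional check (not part of the original lab), the loop above should agree with this closed form:

closed_form = 3 + (1 - 2 * lr) ** 20 * (10 - 3)   # 0.8**20 of the initial error remains
print(trajectory[-1], closed_form)                # both values are ≈ 3.08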
2️⃣ Stochastic Gradient Descent#
Updates the weights after each individual sample, which makes the gradients noisy but cheap ⚡. Here we mimic that per-sample noise by adding a little Gaussian jitter to the true gradient.
w, lr = 10, 0.1
trajectory_sgd = []
for i in range(20):
    noise = np.random.randn() * 0.1  # stand-in for single-sample gradient noise
    g = grad(w) + noise
    w -= lr * g
    trajectory_sgd.append(w)
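The loop above only simulates per-sample noise. Since the objective also promised Mini-Batch GD, here is a minimal sketch of what a real mini-batch loop could look like on a tiny synthetic regression problem; the data X and y, the squared-error model, and batch_size are illustrative assumptions, not part of the lab:

# Hypothetical 1-D regression data whose true weight is 3, matching the finish line above
X = np.random.randn(100)
y = 3 * X + 0.1 * np.random.randn(100)

w_mb, lr_mb, batch_size = 10.0, 0.1, 10
trajectory_mb = []
for epoch in range(20):
    idx = np.random.permutation(len(X))       # reshuffle the samples each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        # gradient of the mean squared error 0.5 * (w*x - y)^2 over the mini-batch
        g = np.mean((w_mb * X[batch] - y[batch]) * X[batch])
        w_mb -= lr_mb * g
    trajectory_mb.append(w_mb)                # record once per epoch

Plotting trajectory_mb next to the others typically shows the classic middle ground: noisier than Batch GD, steadier than pure SGD.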
3️⃣ Momentum (GD with Momentum)#
Adds a bit of physics — keeps rolling through small bumps 🏎️.
w, lr, momentum = 10, 0.1, 0.9
v = 0
trajectory_m = []
for i in range(20):
    g = grad(w)
    v = momentum * v - lr * g  # velocity: an exponentially decaying sum of past gradients
    w += v
    trajectory_m.append(w)
📉 Visualizing the Race#
plt.figure(figsize=(8,5))
plt.plot(trajectory, label="Batch GD 🐢")
plt.plot(trajectory_sgd, label="SGD 🎲")
plt.plot(trajectory_m, label="Momentum 🚀")
plt.axhline(3, color='red', linestyle='--', label='True Minimum (w=3)')
plt.xlabel("Iteration")
plt.ylabel("Weight")
plt.title("Optimization Race: Who Reaches the Minimum First?")
plt.legend()
plt.show()
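If you prefer numbers to pictures, a quick optional printout of where each runner ends up after 20 iterations:

print("Batch GD:", round(trajectory[-1], 3))
print("SGD:     ", round(trajectory_sgd[-1], 3))
print("Momentum:", round(trajectory_m[-1], 3))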
🧠 Bonus Round: Enter Adam!#
w, m, v = 10, 0, 0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
trajectory_adam = []
for t in range(1, 21):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g         # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * (g ** 2)  # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1**t)              # bias correction for the early iterations
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    trajectory_adam.append(w)
Add Adam to the race. Since plt.show() above already rendered the first figure, re-run that plotting cell with this extra line included (otherwise Adam will appear on a plot of its own):
plt.plot(trajectory_adam, label="Adam 🧠")
plt.legend()
plt.show()
Don’t be surprised if Adam is not the one sprinting here: on this tiny, well-scaled quadratic its bias-corrected step is roughly lr per iteration, so plain Batch GD can reach the finish line first. Adam’s adaptive scaling pays off on noisier, badly conditioned problems. 🏃♂️💨
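To compare convergence speeds more precisely, one optional addition (not in the original lab) is to plot each optimizer’s distance to the minimum on a log scale:

plt.figure(figsize=(8, 5))
for traj, name in [(trajectory, "Batch GD"), (trajectory_sgd, "SGD"),
                   (trajectory_m, "Momentum"), (trajectory_adam, "Adam")]:
    plt.semilogy(np.abs(np.array(traj) - 3), label=name)  # |w - 3| on a log axis
plt.xlabel("Iteration")
plt.ylabel("|w - 3|")
plt.legend()
plt.show()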
📊 Your Turn#
🧩 Try this:#
- Change the learning rate (lr = 0.01, 0.5, etc.).
- Add momentum to SGD (one possible sketch follows this list).
- Visualize how stability and convergence differ.
- See how your model behaves when it’s “too excited” vs “too sleepy” 😴.
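For the “add momentum to SGD” exercise, one possible starting point is to combine the two loops from above; treat this as a sketch, not a prescribed solution:

w, lr, beta = 10.0, 0.1, 0.9
v = 0.0
trajectory_sgdm = []
for i in range(20):
    g = grad(w) + np.random.randn() * 0.1  # noisy gradient, as in the SGD cell
    v = beta * v - lr * g                  # the velocity averages out some of the noise
    w += v
    trajectory_sgdm.append(w)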
🎯 Key Takeaways#
| Optimizer | Strength | Weakness |
|---|---|---|
| Batch GD | Smooth & stable | Slow for large data |
| SGD | Fast & scalable | Noisy updates |
| Momentum | Smooths oscillations | Needs tuning |
| Adam | Adaptive & fast | Sometimes overfits |
🧩 Business Analogy#
| Optimizer | Business Personality |
|---|---|
| SGD | The hustler – moves fast, breaks things. |
| Batch GD | The analyst – waits for all data, then acts. |
| Momentum | The marathon runner – steady and strong. |
| Adam | The consultant – adapts to everything and charges more. 💼 |
💬 “Optimization is like coffee brewing: get the temperature (learning rate) right, stir well (momentum), and don’t overdo it (overfitting).” ☕
🧰 Continue Exploring#
- Run this notebook on Colab or JupyterLite using the buttons above.
- Modify the loss function to something non-convex and see how the optimizers behave in the wild (one illustrative example follows). 🏞️
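For example, one non-convex loss you could swap in (an illustrative choice, not part of the lab):

def bumpy_loss(w):
    # the same bowl as before, plus sinusoidal bumps that create local minima
    return (w - 3)**2 + 2 + 2 * np.sin(3 * w)

def bumpy_grad(w):
    return 2 * (w - 3) + 6 * np.cos(3 * w)

Re-run the loops above with bumpy_grad in place of grad and watch which optimizers get stuck in a bump.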
🔗 Next Chapter: Supervised Classification – Trees & Friends 🌳 Because predicting “who buys what” is the true business magic. ✨