Lab – Comparing GD Variants#

Welcome to the Optimization Olympics 🏅! Today, we’ll watch different Gradient Descent variants race to find the global minimum — and see which one deserves the gold medal 🥇.


🏁 Objective#

Compare how Batch GD, Stochastic GD, and Mini-Batch GD (and their cooler cousins like Adam) behave during training, on a toy problem small enough to watch every step.

You’ll:

  • Visualize their learning paths 🎢

  • Compare convergence speeds ⏱️

  • Observe how hyperparameters change the story 📊


📦 Setup#

Let’s load some necessary libraries (no doping allowed 🚫💉):

import numpy as np
import matplotlib.pyplot as plt

# For reproducibility (no cheating)
np.random.seed(42)

🎯 The Loss Landscape#

We’ll simulate a simple quadratic loss function:

$$
L(w) = (w - 3)^2 + 2
$$

Its gradient is $L'(w) = 2(w - 3)$, and its minimum sits at w = 3 (the “finish line”).

def loss(w):
    return (w - 3)**2 + 2

def grad(w):
    return 2 * (w - 3)
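
Before the race begins, it helps to actually look at the track. A quick sketch of the landscape (the plotting window of -2 to 8 is just an arbitrary choice):

ws = np.linspace(-2, 8, 200)
plt.plot(ws, loss(ws), label="L(w)")
plt.axvline(3, color='red', linestyle='--', label='Minimum at w = 3')
plt.xlabel("w")
plt.ylabel("Loss")
plt.title("The Loss Landscape")
plt.legend()
plt.show()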

🏃‍♀️ Gradient Descent Variants in Action#

1️⃣ Batch Gradient Descent#

Computes the gradient over the entire dataset at every step; here that simply means using the exact gradient of our toy loss.

w, lr = 10, 0.1
trajectory = []

for i in range(20):
    g = grad(w)          # exact gradient of the full loss
    w -= lr * g          # step downhill
    trajectory.append(w)

2️⃣ Stochastic Gradient Descent#

Updates the weights after each individual sample: noisy but fast ⚡. Our toy loss has no real samples, so we fake the per-sample randomness by jittering the exact gradient.

w, lr = 10, 0.1
trajectory_sgd = []

for i in range(20):
    # Fake per-sample randomness by jittering the exact gradient
    noise = np.random.randn() * 0.1
    g = grad(w) + noise
    w -= lr * g
    trajectory_sgd.append(w)
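
The objective also mentions Mini-Batch GD, which sits between the two. A minimal sketch, reusing the same noise trick to fake per-sample gradients (the batch size of 8 and the noise scale are illustrative assumptions, since our toy loss has no real dataset):

w, lr, batch_size = 10, 0.1, 8
trajectory_mb = []

for i in range(20):
    # Average a few noisy "per-sample" gradients, like a small batch would
    sample_grads = grad(w) + np.random.randn(batch_size) * 0.1
    g = sample_grads.mean()
    w -= lr * g
    trajectory_mb.append(w)

Averaging over the batch tames most of SGD's jitter while staying cheaper than a full pass over the data; feel free to plot trajectory_mb alongside the others below.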

3️⃣ Momentum (GD with Momentum)#

Adds a bit of physics — keeps rolling through small bumps 🏎️.

w, lr, momentum = 10, 0.1, 0.9
v = 0
trajectory_m = []

for i in range(20):
    g = grad(w)
    v = momentum * v - lr * g   # velocity remembers past gradients
    w += v
    trajectory_m.append(w)

📉 Visualizing the Race#

plt.figure(figsize=(8,5))
plt.plot(trajectory, label="Batch GD 🐢")
plt.plot(trajectory_sgd, label="SGD 🎲")
plt.plot(trajectory_m, label="Momentum 🚀")
plt.axhline(3, color='red', linestyle='--', label='True Minimum (w=3)')
plt.xlabel("Iteration")
plt.ylabel("Weight")
plt.title("Optimization Race: Who Reaches the Minimum First?")
plt.legend()
plt.show()

🧠 Bonus Round: Enter Adam!#

Adam keeps a running average of the gradient (like momentum) and of its square, then scales each step by that history:

w, m, v = 10, 0, 0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
trajectory_adam = []

for t in range(1, 21):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * (g ** 2)   # second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    trajectory_adam.append(w)

Add Adam to the race. In a notebook, drop this line into the plotting cell above and rerun it, so all four curves share one figure:

plt.plot(trajectory_adam, label="Adam 🧠")
plt.legend()
plt.show()

With lr = 0.1 on this toy quadratic, Adam takes steady steps of roughly the learning rate per iteration, so after 20 steps it's still warming up while Batch GD has nearly crossed the finish line. Give Adam a larger learning rate or more iterations and watch it close the gap. 🏃‍♂️💨
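
To check the standings numerically, here's a quick sketch (assuming the four trajectory lists above are still in memory) that prints how far each optimizer ends up from the finish line:

for name, traj in [("Batch GD", trajectory),
                   ("SGD", trajectory_sgd),
                   ("Momentum", trajectory_m),
                   ("Adam", trajectory_adam)]:
    # Gap between the final weight and the minimum at w = 3
    print(f"{name:10s} final w = {traj[-1]:6.3f}   gap = {abs(traj[-1] - 3):.3f}")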


📊 Your Turn#

🧩 Try this:#

  1. Change the learning rate (lr = 0.01, 0.5, etc.)

  2. Add momentum to SGD (see the sketch after this list).

  3. Visualize how stability and convergence differ.

See how your model behaves when it’s “too excited” vs “too sleepy” 😴.
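
Here's a minimal sketch covering items 1 and 2 together, reusing the toy gradient and the same noise trick; the two learning rates are just illustrative picks for "too sleepy" and "too excited":

plt.figure(figsize=(8,5))
for lr in (0.01, 0.5):
    w, v, momentum = 10, 0.0, 0.9
    traj = []
    for i in range(20):
        g = grad(w) + np.random.randn() * 0.1   # noisy "stochastic" gradient
        v = momentum * v - lr * g               # momentum smooths the noise
        w += v
        traj.append(w)
    plt.plot(traj, label=f"SGD + Momentum, lr={lr}")

plt.axhline(3, color='red', linestyle='--', label='True Minimum (w=3)')
plt.xlabel("Iteration")
plt.ylabel("Weight")
plt.legend()
plt.show()

Compare how smoothly (or wildly) each curve approaches w = 3.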


🎯 Key Takeaways#

| Optimizer | Strength | Weakness |
|-----------|----------|----------|
| Batch GD | Smooth & stable | Slow for large data |
| SGD | Fast & scalable | Noisy updates |
| Momentum | Smooths oscillations | Needs tuning |
| Adam | Adaptive & fast | Sometimes overfits |


🧩 Business Analogy#

| Optimizer | Business Personality |
|-----------|----------------------|
| SGD | The hustler – moves fast, breaks things. |
| Batch GD | The analyst – waits for all data, then acts. |
| Momentum | The marathon runner – steady and strong. |
| Adam | The consultant – adapts to everything and charges more. 💼 |


💬 “Optimization is like coffee brewing: get the temperature (learning rate) right, stir well (momentum), and don’t overdo it (overfitting).”


🧰 Continue Exploring#

  • Run this notebook on Colab or JupyterLite using the buttons above.

  • Modify the loss function to something non-convex and see how optimizers behave in the wild (one example is sketched below). 🏞️
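
For that last bullet, here's one possible non-convex loss to drop in (this particular wiggly function is just an illustrative pick, not part of the lab):

def loss_wiggly(w):
    # Several local dips plus a gentle bowl pulling toward w = 3
    return np.sin(3 * w) + 0.1 * (w - 3) ** 2

def grad_wiggly(w):
    # Analytical gradient of the wiggly loss
    return 3 * np.cos(3 * w) + 0.2 * (w - 3)

# Swap grad_wiggly into any optimizer loop above and try a few starting
# points (e.g. w = 10, 0, -5) to see which optimizers escape the local dips.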


🔗 Next Chapter: Supervised Classification – Trees & Friends 🌳 Because predicting “who buys what” is the true business magic. ✨

# Your code here