# Advanced Optimizers (Adam, etc.)

If Gradient Descent is the bicycle of optimization — these optimizers are the electric scooters. They go faster, handle bumps (a.k.a. noisy gradients), and make you look cool while doing it 😎.


## 🚀 Why We Need Fancier Optimizers

Plain Gradient Descent works fine… until:

- Your loss curve looks like a roller coaster 🎢
- Your gradients vanish faster than your weekend plans 💨
- Or you get stuck at “almost good enough” local minima 😭

That’s where adaptive optimizers step in — they adjust learning rates, momentum, and direction automatically.


## 🧩 The Optimizer Lineup

| Optimizer | Personality | Superpower |
| --- | --- | --- |
| SGD (with Momentum) | The gym bro of optimizers 🏋️ | Builds momentum to power through small dips in the loss. |
| RMSProp | The caffeine-fueled statistician ☕📊 | Keeps an exponentially decaying average of past squared gradients. |
| Adam | The overachiever who read all the research papers 📚 | Combines momentum + adaptive learning rates. Most popular choice. |
| Adagrad | The generous one 💸 | Gives bigger effective learning rates to rarely updated parameters. |
| AdamW | Adam’s tidy cousin 🧹 | Adds decoupled weight decay to keep models from overfitting. |

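To see how this lineup maps to actual code, here is a minimal PyTorch sketch; the tiny linear model and the hyperparameter values are just illustrative placeholders, not tuned recommendations:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1, 1)  # any nn.Module works; a tiny linear model keeps the sketch self-contained

sgd_momentum = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)   # SGD with momentum
rmsprop      = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9)  # alpha = decay of the squared-gradient average
adam         = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
adagrad      = optim.Adagrad(model.parameters(), lr=0.01)
adamw        = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # decoupled weight decay
```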

## 💡 How Adam Actually Works

Adam = Momentum (β₁) + Adaptive Learning Rate (β₂)

It maintains exponentially decaying moving averages of:

- mₜ (first moment): a running average of the gradients (this is the momentum part)
- vₜ (second moment): a running average of the squared gradients (used to scale the learning rate per parameter)

Then it corrects for the zero-initialization bias and updates the parameters:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

(Don’t worry — it looks scarier than it is. Like your first tax form.)
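If the symbols still feel abstract, here is a minimal NumPy sketch of a single Adam update (variable names are mine, not from any particular library):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta, given the gradient at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: running average of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # second moment: running average of squared gradients
    m_hat = m / (1 - beta1**t)                # bias correction (both averages start at zero)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```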


## 🧪 Quick Practice

Run this small experiment:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Dummy data
x = torch.randn(100, 1)
y = 3 * x + 2 + 0.1 * torch.randn(100, 1)

# Simple linear model
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

print("Final loss:", loss.item())
```

Try swapping:

```python
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```

and watch how the training speed and stability change.
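If you would rather compare them side by side than swap lines by hand, a small loop works; this sketch reuses `x`, `y`, and `criterion` from the snippet above:

```python
import torch.nn as nn
import torch.optim as optim

results = {}
for name, make_opt in [
    ("SGD + momentum", lambda p: optim.SGD(p, lr=0.1, momentum=0.9)),
    ("RMSprop",        lambda p: optim.RMSprop(p, lr=0.1)),
    ("Adam",           lambda p: optim.Adam(p, lr=0.1)),
]:
    model = nn.Linear(1, 1)                    # fresh model per optimizer for a fair comparison
    optimizer = make_opt(model.parameters())
    for epoch in range(100):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    results[name] = loss.item()

print(results)
```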


## 🎯 Key Takeaways

- Adam = RMSProp + Momentum → best all-rounder for most problems.
- Momentum smooths noisy updates and helps push through ravines and small dips in the loss.
- Adaptive learning rates save you from hand-tuning the step size for every parameter.
- Weight decay keeps your model humble (see the AdamW one-liner below).
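For that last point, AdamW is a one-line swap in PyTorch; the `weight_decay` value here is a common starting point, not a recommendation:

```python
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```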


💬 “Remember: choosing an optimizer is like choosing a pizza topping — no universal best, but some combinations just work better.” 🍕


🔗 Next Up: Learning Rate Schedules – where we teach your optimizer when to chill and when to sprint. 🏃‍♂️
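Bonus: the standalone script below races these optimizers on the Ackley function (a classic non-convex test surface) and plots their paths in an interactive Dash + Plotly app. Run it locally, open the local URL Dash prints, tweak the hyperparameters, and hit “Run Optimization”. The update rules are the same ones described above, written out in plain NumPy.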

```python
import numpy as np
import plotly.graph_objects as go
from dash import Dash, dcc, html, Input, Output, State

# Ackley function: complex, non-convex objective function
def objective_function(x, y):
    term1 = -20 * np.exp(-0.2 * np.sqrt(0.5 * (x**2 + y**2)))
    term2 = -np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
    return term1 + term2 + 20 + np.e

# Gradient of the Ackley objective (a small constant avoids division by zero at the origin)
def gradient(x, y):
    r = np.sqrt(0.5 * (x**2 + y**2)) + 1e-12
    exp_term = 2 * np.exp(-0.2 * r) / r
    cos_term = np.pi * np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))

    dx = exp_term * x + cos_term * np.sin(2 * np.pi * x)
    dy = exp_term * y + cos_term * np.sin(2 * np.pi * y)

    return np.array([dx, dy])

# Clip points to stay within bounds
def clip_point(point, bounds=(-3, 3)):
    return np.clip(point, bounds[0], bounds[1])

# Compute average step length
def average_step_length(path):
    if len(path) < 2:
        return 0.0
    steps = np.sqrt(np.sum(np.diff(path, axis=0)**2, axis=1))
    return np.mean(steps) if steps.size > 0 else 0.0

# SGD with Momentum
def sgd_momentum(start, learning_rate, num_iterations, momentum=0.9, tolerance=0.01):
    point = np.array(start, dtype=float)
    velocity = np.zeros_like(point)
    history = [point.copy()]
    steps = 0
    for i in range(num_iterations):
        grad = gradient(point[0], point[1])
        velocity = momentum * velocity - learning_rate * grad
        point = clip_point(point + velocity)
        history.append(point.copy())
        steps += 1
        if objective_function(point[0], point[1]) < tolerance:
            break
    return np.array(history), steps

# AdaGrad
def adagrad(start, learning_rate, num_iterations, epsilon=1e-8, tolerance=0.01):
    point = np.array(start, dtype=float)
    history = [point.copy()]
    squared_grad_sum = np.zeros_like(point)
    steps = 0
    for i in range(num_iterations):
        grad = gradient(point[0], point[1])
        squared_grad_sum += grad**2
        adjusted_lr = learning_rate / (np.sqrt(squared_grad_sum) + epsilon)
        point = clip_point(point - adjusted_lr * grad)
        history.append(point.copy())
        steps += 1
        if objective_function(point[0], point[1]) < tolerance:
            break
    return np.array(history), steps

# RMSProp
def rmsprop(start, learning_rate, num_iterations, decay_rate=0.9, epsilon=1e-8, tolerance=0.01):
    point = np.array(start, dtype=float)
    history = [point.copy()]
    squared_grad_avg = np.zeros_like(point)
    steps = 0
    for i in range(num_iterations):
        grad = gradient(point[0], point[1])
        squared_grad_avg = decay_rate * squared_grad_avg + (1 - decay_rate) * grad**2
        point = clip_point(point - (learning_rate / (np.sqrt(squared_grad_avg) + epsilon)) * grad)
        history.append(point.copy())
        steps += 1
        if objective_function(point[0], point[1]) < tolerance:
            break
    return np.array(history), steps

# Adam
def adam(start, learning_rate, num_iterations, beta1=0.9, beta2=0.999, epsilon=1e-8, tolerance=0.01):
    point = np.array(start, dtype=float)
    history = [point.copy()]
    m = np.zeros_like(point)
    v = np.zeros_like(point)
    t = 0
    steps = 0
    for i in range(num_iterations):
        t += 1
        grad = gradient(point[0], point[1])
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        point = clip_point(point - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon))
        history.append(point.copy())
        steps += 1
        if objective_function(point[0], point[1]) < tolerance:
            break
    return np.array(history), steps

# Create Plotly figure
def create_plotly_figure(sgd_path, adagrad_path, rmsprop_path, adam_path):
    x = np.linspace(-3, 3, 100)
    y = np.linspace(-3, 3, 100)
    X, Y = np.meshgrid(x, y)
    Z = objective_function(X, Y)

    sgd_z = np.array([objective_function(x, y) for x, y in sgd_path])
    adagrad_z = np.array([objective_function(x, y) for x, y in adagrad_path])
    rmsprop_z = np.array([objective_function(x, y) for x, y in rmsprop_path])
    adam_z = np.array([objective_function(x, y) for x, y in adam_path])

    surface = go.Surface(
        x=X, y=Y, z=Z,
        colorscale='Viridis',
        opacity=0.6,
        colorbar=dict(
            title='Function Value',
            x=1.05,  # Position colorbar to the right
            xanchor='left',
            len=0.6
        )
    )

    sgd_trace = go.Scatter3d(
        x=sgd_path[:, 0], y=sgd_path[:, 1], z=sgd_z,
        mode='lines+markers',
        name='SGD (Momentum)',
        line=dict(color='red', width=4),
        marker=dict(size=3)
    )
    adagrad_trace = go.Scatter3d(
        x=adagrad_path[:, 0], y=adagrad_path[:, 1], z=adagrad_z,
        mode='lines+markers',
        name='AdaGrad',
        line=dict(color='blue', width=4),
        marker=dict(size=3)
    )
    rmsprop_trace = go.Scatter3d(
        x=rmsprop_path[:, 0], y=rmsprop_path[:, 1], z=rmsprop_z,
        mode='lines+markers',
        name='RMSProp',
        line=dict(color='green', width=4),
        marker=dict(size=3)
    )
    adam_trace = go.Scatter3d(
        x=adam_path[:, 0], y=adam_path[:, 1], z=adam_z,
        mode='lines+markers',
        name='Adam',
        line=dict(color='yellow', width=4),
        marker=dict(size=3)
    )

    start_point = go.Scatter3d(
        x=[sgd_path[0, 0]], y=[sgd_path[0, 1]], z=[sgd_z[0]],
        mode='markers',
        name='Start',
        marker=dict(size=10, color='black', symbol='circle')
    )
    global_minimum = go.Scatter3d(
        x=[0], y=[0], z=[0],
        mode='markers',
        name='Global Minimum',
        marker=dict(size=10, color='red', symbol='x')
    )

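    # Dropdown buttons toggle trace visibility; each mask entry matches the trace order
    # passed to go.Figure below: surface, SGD, AdaGrad, RMSProp, Adam, start, global minimum.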
    updatemenus = [
        dict(
            buttons=[
                dict(args=[{'visible': [True, True, True, True, True, True, True]}], label='All', method='update'),
                dict(args=[{'visible': [True, True, False, False, False, True, True]}], label='SGD (Momentum)', method='update'),
                dict(args=[{'visible': [True, False, True, False, False, True, True]}], label='AdaGrad', method='update'),
                dict(args=[{'visible': [True, False, False, True, False, True, True]}], label='RMSProp', method='update'),
                dict(args=[{'visible': [True, False, False, False, True, True, True]}], label='Adam', method='update'),
            ],
            direction='down',
            showactive=True,
            x=0.1,
            xanchor='left',
            y=1.2,
            yanchor='top'
        )
    ]

    layout = go.Layout(
        title='',
        scene=dict(
            xaxis_title='x',
            yaxis_title='y',
            zaxis_title='f(x, y)',
            aspectmode='manual',
            aspectratio=dict(x=1, y=1, z=0.5)
        ),
        showlegend=True,
        legend=dict(
            x=0.8,  # Move legend to avoid colorbar
            y=0.9,
            xanchor='left',
            yanchor='top'
        ),
        updatemenus=updatemenus,
        margin=dict(r=150)  # Add right margin to accommodate colorbar
    )

    fig = go.Figure(data=[surface, sgd_trace, adagrad_trace, rmsprop_trace, adam_trace, start_point, global_minimum], layout=layout)
    return fig

# Initialize Dash app
app = Dash(__name__)

# Layout of the Dash app
app.layout = html.Div([
    html.H1("Gradient Descent Optimization on Ackley Function"),
    html.Div([
        html.Label("Learning Rate:"),
        dcc.Input(id='learning-rate', type='number', value=0.05, min=0.001, max=0.1, step=0.001),
    ]),
    html.Div([
        html.Label("Number of Iterations:"),
        dcc.Input(id='num-iterations', type='number', value=1000, min=10, max=5000, step=10),
    ]),
    html.Div([
        html.Label("Momentum (SGD):"),
        dcc.Input(id='momentum', type='number', value=0.9, min=0.0, max=1.0, step=0.01),
    ]),
    html.Div([
        html.Label("Decay Rate (RMSProp):"),
        dcc.Input(id='decay-rate', type='number', value=0.9, min=0.0, max=1.0, step=0.01),
    ]),
    html.Button('Run Optimization', id='run-button', n_clicks=0),
    dcc.Graph(id='optimization-plot'),
    html.Div(id='steps-output')
])

# Callback to update the plot and step summary; parameter values are read as State
# so the optimizers only rerun when the button is clicked
@app.callback(
    [Output('optimization-plot', 'figure'),
     Output('steps-output', 'children')],
    [Input('run-button', 'n_clicks')],
    [State('learning-rate', 'value'),
     State('num-iterations', 'value'),
     State('momentum', 'value'),
     State('decay-rate', 'value')]
)
def update_plot(n_clicks, learning_rate, num_iterations, momentum, decay_rate):
    start_point = [2.0, 2.0]
    tolerance = 0.01

    if learning_rate is None or num_iterations is None or momentum is None or decay_rate is None:
        sgd_path = np.array([start_point])
        adagrad_path = np.array([start_point])
        rmsprop_path = np.array([start_point])
        adam_path = np.array([start_point])
        steps_text = "Please provide valid inputs."
    else:
        sgd_path, sgd_steps = sgd_momentum(start_point, learning_rate, int(num_iterations), momentum, tolerance)
        adagrad_path, adagrad_steps = adagrad(start_point, learning_rate, int(num_iterations), tolerance=tolerance)
        rmsprop_path, rmsprop_steps = rmsprop(start_point, learning_rate, int(num_iterations), decay_rate, tolerance=tolerance)
        adam_path, adam_steps = adam(start_point, learning_rate, int(num_iterations), tolerance=tolerance)

        # Compute final function values and average step lengths
        sgd_final = objective_function(sgd_path[-1, 0], sgd_path[-1, 1])
        adagrad_final = objective_function(adagrad_path[-1, 0], adagrad_path[-1, 1])
        rmsprop_final = objective_function(rmsprop_path[-1, 0], rmsprop_path[-1, 1])
        adam_final = objective_function(adam_path[-1, 0], adam_path[-1, 1])

        sgd_step_length = average_step_length(sgd_path)
        adagrad_step_length = average_step_length(adagrad_path)
        rmsprop_step_length = average_step_length(rmsprop_path)
        adam_step_length = average_step_length(adam_path)

        steps_text = [
            html.P(f"SGD with Momentum: {sgd_steps} steps, Final f(x, y) = {sgd_final:.4f}, Avg Step Length = {sgd_step_length:.4f}"),
            html.P(f"AdaGrad: {adagrad_steps} steps, Final f(x, y) = {adagrad_final:.4f}, Avg Step Length = {adagrad_step_length:.4f}"),
            html.P(f"RMSProp: {rmsprop_steps} steps, Final f(x, y) = {rmsprop_final:.4f}, Avg Step Length = {rmsprop_step_length:.4f}"),
            html.P(f"Adam: {adam_steps} steps, Final f(x, y) = {adam_final:.4f}, Avg Step Length = {adam_step_length:.4f}")
        ]

    fig = create_plotly_figure(sgd_path, adagrad_path, rmsprop_path, adam_path)
    return fig, steps_text

# Run the app
if __name__ == '__main__':
    app.run(debug=True)
```

See also: jiupinjia/Visualize-Optimization-Algorithms.
