Optimization teaches a model how to improve its parameters step by step¶
This chapter explains how training actually happens, why loss decreases over time, and how choices like learning rate, batch size, and optimizer affect the path to a useful model.
Why Optimization Matters¶
In theory, machine learning sounds simple: find the weights that minimize the loss. In practice, a model can overshoot the minimum, oscillate across the loss surface, or converge very slowly. Optimization is the process that turns repeated feedback into better parameter values.
What Happens During Training¶
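Every training step asks a simple question: given the current parameters, which small change would lower the loss? The gradient answers it. It points in the direction in which the loss increases fastest, so the optimizer moves the parameters the opposite way, and the learning rate decides how far. This is why gradients matter: they tell the model which way leads toward lower error.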
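A single update can be sketched as follows. The one-parameter loss `(w - 3)^2`, the starting value, and the learning rate are all assumptions chosen for illustration, not values from this chapter.

```python
# Hypothetical one-parameter loss with its minimum at w = 3 (illustrative only).
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting value
learning_rate = 0.1  # assumed step size

loss_before = loss(w)
w_new = w - learning_rate * gradient(w)  # step against the gradient
loss_after = loss(w_new)

print(f"w: {w:.2f} -> {w_new:.2f}")
print(f"loss: {loss_before:.2f} -> {loss_after:.2f}")
```

The gradient at the starting point is negative, so stepping against it increases `w` and moves it toward the minimum.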
Core Equations¶
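Loss Function¶
A common choice, shown here, is mean squared error:
$$L = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
Parameter Update Rule¶
$$\theta \leftarrow \theta - \eta \, \nabla_\theta L$$
Variable Guide¶
$L$: loss value
$y_i$: actual value
$\hat{y}_i$: prediction
$\theta$: model parameters
$\eta$: learning rate
$\nabla_\theta L$: gradient direction
$n$: number of observations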
These equations formalize the same intuition: measure error, then move the parameters in a direction that reduces that error.
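The measure-then-move loop can be sketched in NumPy. This is a minimal illustration that assumes mean squared error as the loss and a linear model `w * x + b`; the synthetic data, the learning rate of 0.1, and the names `mse`, `grad_w`, and `grad_b` are hypothetical choices, not part of the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 2x + 1 plus small noise.
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=200)

def mse(w, b):
    # Mean squared error between actual values y and predictions w*x + b.
    return np.mean((y - (w * x + b)) ** 2)

w, b = 0.0, 0.0
learning_rate = 0.1
history = []

for step in range(500):
    residual = (w * x + b) - y            # prediction minus actual
    grad_w = 2.0 * np.mean(residual * x)  # d(MSE)/dw
    grad_b = 2.0 * np.mean(residual)      # d(MSE)/db
    w -= learning_rate * grad_w           # move against the gradient
    b -= learning_rate * grad_b
    history.append(mse(w, b))

print(f"w = {w:.3f}, b = {b:.3f}, final MSE = {history[-1]:.4f}")
```

The fitted `w` and `b` should land close to the values used to generate the data, and `history` records how the loss falls along the way.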
Why We Cannot Jump Straight to the Minimum¶
Real loss surfaces are rarely simple. They can contain flat regions, steep regions, noisy directions, and points where different parameters interact in difficult ways. That is why practical optimization relies on repeated updates instead of one perfect jump.
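To make this concrete, here is a small sketch on a hypothetical non-convex function, `f(x) = x^4 - 3x^2 + x`, picked only because it has two valleys of different depth. The function, learning rate, and starting points are assumptions for illustration.

```python
def f(x):
    # Toy non-convex "loss" with two local minima (illustrative only).
    return x**4 - 3 * x**2 + x

def df(x):
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=500):
    # Plain gradient descent from the starting point x.
    for _ in range(steps):
        x = x - learning_rate * df(x)
    return x

x_left = descend(-2.0)   # starts on the left slope
x_right = descend(2.0)   # starts on the right slope

print(f"start -2.0 -> x = {x_left:.3f}, f(x) = {f(x_left):.3f}")
print(f"start  2.0 -> x = {x_right:.3f}, f(x) = {f(x_right):.3f}")
```

The two runs settle in different valleys, and the left one is deeper. Where the parameters start and how they move both shape the result, which is one reason a single perfect jump is not available in general.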
Common Optimization Problems¶
| Problem | Description | Effect on training |
|---|---|---|
| Vanishing gradients | Updates become too small | Training slows down and stalls |
| Exploding gradients | Updates become too large | Loss becomes unstable or diverges |
| Poor initialization | Parameters start in an unhelpful region | More iterations are wasted |
| Bad learning rate | Step size is too high or too low | Training is chaotic or inefficient |
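
The last row of the table can be illustrated with a toy loss `w**2`, whose gradient is `2 * w`. The three learning rates below are assumptions picked to show a step that is too small, one that works, and one that is too large.

```python
def run(learning_rate, steps=50, w=1.0):
    # Gradient descent on the toy loss w**2, whose gradient is 2*w.
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w ** 2  # final loss

final_loss = {lr: run(lr) for lr in (0.001, 0.1, 1.1)}
for lr, value in final_loss.items():
    print(f"learning rate {lr}: final loss {value:.3e}")
```

With the smallest rate the loss has barely moved after 50 steps, with the middle rate it is close to zero, and with the largest rate each step overshoots the minimum by more than the previous one, so the loss grows.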
Business Analogy¶
Optimization is similar to coaching a team through repeated feedback cycles. You evaluate performance, give corrective guidance, decide how aggressively to change behavior, and monitor whether the process is improving results or creating instability.
Main Optimization Families¶
| Type | Description | Typical tradeoff |
|---|---|---|
| Batch Gradient Descent | Uses the full dataset for each update | Stable updates, but each one is expensive on large datasets |
| Stochastic Gradient Descent | Uses one observation per update | Cheap, frequent updates, but a noisy path |
| Mini-Batch Gradient Descent | Uses a small batch of observations per update | Good balance of speed and stability |
| Momentum and Adam | Momentum keeps a running memory of past gradients; Adam also adapts the step size for each parameter | Often faster convergence in practice |
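
The first three rows differ only in how many observations feed each update, which a single function with a `batch_size` argument can show. This is a rough sketch under assumed settings: the helper `minibatch_gd`, the synthetic data, the learning rate, and the epoch count are all hypothetical.

```python
import numpy as np

def minibatch_gd(x, y, batch_size, learning_rate=0.05, epochs=20, seed=0):
    """Fit y ~ w*x + b with MSE; returns (w, b, number_of_updates)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w, b, updates = 0.0, 0.0, 0
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = (w * x[idx] + b) - y[idx]
            w -= learning_rate * 2.0 * np.mean(residual * x[idx])
            b -= learning_rate * 2.0 * np.mean(residual)
            updates += 1
    return w, b, updates

# Synthetic data (assumed for illustration): y = 2x + 1 plus small noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=256)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=256)

for name, size in [("batch", 256), ("mini-batch", 32), ("stochastic", 1)]:
    w, b, updates = minibatch_gd(x, y, batch_size=size)
    print(f"{name:>10}: w={w:.3f}, b={b:.3f}, updates={updates}")
```

Over the same number of epochs, the full-batch run makes only one update per epoch and may still be far from the target, while the smaller batch sizes make many more updates, each one noisier.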
Training Curve Example¶
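Code Variable Guide¶
epochs: epoch numbers from 1 to 100
loss: simulated loss values, an exponential decay with added Gaussian noise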
```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate a loss curve: exponential decay plus small Gaussian noise.
rng = np.random.default_rng(42)  # fixed seed so the plot is reproducible
epochs = np.arange(1, 101)
loss = np.exp(-epochs / 20) + rng.normal(0, 0.02, size=epochs.size)

plt.figure(figsize=(8, 5))
plt.plot(epochs, loss, label='Loss Curve', color='royalblue')
plt.title('Training Loss Over Time')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```

Interpretation: this simulated curve trends downward overall, as a real training curve generally should, but small fluctuations are common because optimization is iterative rather than perfectly smooth.
Chapter Roadmap¶
| Section | What you will study |
|---|---|
| Gradient Descent Variants | Different ways to compute update steps |
| Advanced Optimizers | Adam, momentum, and adaptive methods |
| Learning Rate Schedules | How step size changes during training |
| Numerical Stability & Vectorization | Safer and faster implementation patterns |
| Lab – Comparing GD Variants | Visual and practical comparison |
Practice Warm-Up¶
Why can a high learning rate cause the model to overshoot a good solution?
Why is stochastic gradient descent noisier than batch gradient descent?
In a business setting, what would loss reduction represent: lower error, lower cost, or better decisions?
Continue¶
Next, move to Gradient Descent Variants to compare full-batch, stochastic, and mini-batch updates in detail.