Optimization teaches a model how to improve its parameters step by step

This chapter explains how training actually happens, why loss decreases over time, and how choices like learning rate, batch size, and optimizer affect the path to a useful model.


Why Optimization Matters

In theory, machine learning sounds simple: find the weights that minimize the loss. In practice, a model can overshoot the minimum, oscillate around it, or converge very slowly. Optimization is the process that turns repeated feedback into better parameter values.



What Happens During Training

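Every training step asks a simple question: in which direction should the parameters move to reduce the error? The model makes predictions, the loss measures how wrong they are, and the gradient of the loss shows how each parameter contributes to that error. The optimizer then moves the parameters a small step against the gradient, with the step size set by the learning rate. This is why gradients matter: they tell the model which way leads toward lower error.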


Core Equations

Loss Function

$$L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Parameter Update Rule

$$\theta_{new} = \theta_{old} - \eta \cdot \nabla L(\theta)$$

Variable Guide

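  • $L$: loss value

  • $y_i$: actual value

  • $\hat{y}_i$: prediction

  • $n$: number of observations

  • $\theta$: model parameters

  • $\eta$: learning rate

  • $\nabla L(\theta)$: gradient of the loss with respect to the parameters; it points toward increasing loss, which is why the update subtracts it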

These equations formalize the same intuition: measure error, then move the parameters in a direction that reduces that error.
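The two equations can be combined into a short training loop. The sketch below is one possible illustration, not a reference implementation: the linear model `y_hat = w * x + b`, the synthetic data, and the names `mse_loss` and `gradient_descent` are all assumptions made for this example.

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean squared error: L = (1/n) * sum((y_i - y_hat_i)^2)."""
    return np.mean((y - y_hat) ** 2)

def gradient_descent(x, y, eta=0.5, steps=500):
    """Fit y_hat = w * x + b by repeating theta_new = theta_old - eta * grad L."""
    w, b = 0.0, 0.0  # theta = (w, b), started at zero for simplicity
    n = len(x)
    history = []
    for _ in range(steps):
        y_hat = w * x + b
        history.append(mse_loss(y, y_hat))
        residual = y - y_hat
        # Partial derivatives of the MSE with respect to w and b
        grad_w = -2.0 / n * np.sum(residual * x)
        grad_b = -2.0 / n * np.sum(residual)
        # Parameter update rule: step against the gradient
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b, history

# Hypothetical noiseless data generated from y = 3x + 1
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 1.0
w, b, history = gradient_descent(x, y)
print(f"w={w:.3f}, b={b:.3f}, first loss={history[0]:.4f}, last loss={history[-1]:.6f}")
```

The learning rate and step count here were chosen to suit this small, well-scaled problem; other data would likely need different values.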


Why We Cannot Jump Straight to the Minimum

Real loss surfaces are rarely simple. They can contain flat regions, steep regions, noisy directions, and points where different parameters interact in difficult ways. That is why practical optimization relies on repeated updates instead of one perfect jump.


Common Optimization Problems

| Problem | Description | Business effect |
| --- | --- | --- |
| Vanishing gradients | Updates become too small | Training slows down and stalls |
| Exploding gradients | Updates become too large | Loss becomes unstable or diverges |
| Poor initialization | Parameters start in an unhelpful region | More iterations are wasted |
| Bad learning rate | Step size is too high or too low | Training is chaotic or inefficient |
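
The effect of a bad learning rate can be seen on a toy problem. The sketch below assumes a one-parameter loss, L(theta) = theta^2, chosen only because its gradient (2 * theta) is easy to write down; the function name `run_gradient_descent` and the three learning rates are illustrative choices, not recommended settings.

```python
def run_gradient_descent(eta, theta=1.0, steps=20):
    """Minimize L(theta) = theta^2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - eta * 2.0 * theta
    return theta

# Too low, reasonable, and too high for this particular loss
for eta in (0.001, 0.1, 1.1):
    final_theta = run_gradient_descent(eta)
    print(f"eta={eta}: theta after 20 steps = {final_theta:.4f}, loss = {final_theta ** 2:.4f}")
```

With the smallest rate, theta barely moves from its starting value. With the largest, each step multiplies theta by -1.2, so it overshoots the minimum at zero and moves further away every time.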

Business Analogy

Optimization is similar to coaching a team through repeated feedback cycles. You evaluate performance, give corrective guidance, decide how aggressively to change behavior, and monitor whether the process is improving results or creating instability.


Main Optimization Families

| Type | Description | Typical tradeoff |
| --- | --- | --- |
| Batch Gradient Descent | Uses the full dataset for each update | Stable but slower |
| Stochastic Gradient Descent | Uses one observation at a time | Faster but noisier |
| Mini-Batch Gradient Descent | Uses small batches | Good balance of speed and stability |
| Momentum and Adam | Add memory and adaptive updates | Often faster convergence in practice |
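
The first three rows differ only in how many observations feed each update, so a single loop with a `batch_size` argument can sketch all of them. The example below is a minimal illustration under stated assumptions: a one-weight model `y_hat = w * x`, synthetic data, and a hypothetical `train` function. It does not cover momentum or Adam.

```python
import numpy as np

def train(x, y, batch_size, eta=0.1, epochs=100, seed=0):
    """Fit y_hat = w * x with MSE; batch_size selects the gradient descent variant."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = 0.0
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle observations every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - w * x[idx]
            grad = -2.0 * np.mean(residual * x[idx])  # MSE gradient on this batch only
            w -= eta * grad
    return w

# Hypothetical data: y = 2x plus a little noise
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, 200)
y = 2.0 * x + rng.normal(0.0, 0.1, 200)

# batch_size = n is full-batch, 1 is stochastic, anything between is mini-batch
for name, batch_size in [("batch", len(x)), ("stochastic", 1), ("mini-batch", 32)]:
    print(f"{name:>10}: w = {train(x, y, batch_size):.3f}")
```

All three runs should land near the true slope of 2. The full-batch run makes one update per epoch, while the stochastic run makes 200 noisier ones.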

Training Curve Example

Code Variable Guide

  • epochs: training rounds

  • loss: model error over time

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated training curve (not from a real model): exponential decay plus small noise
epochs = np.arange(1, 101)
loss = np.exp(-epochs / 20) + np.random.normal(0, 0.02, 100)

# Plot loss against epoch
plt.figure(figsize=(8, 5))
plt.plot(epochs, loss, label='Loss Curve', color='royalblue')
plt.title('Training Loss Over Time')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```

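Interpretation: the curve should generally trend downward. Small fluctuations are normal, because each update is often computed from a noisy estimate of the gradient (for example, a mini-batch), so individual steps do not always lower the loss. Here the curve is simulated as exponential decay plus random noise rather than produced by a real model.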
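
One common way to separate the trend from the fluctuations is to smooth the curve with a moving average. The sketch below is one possible approach, not part of the original example: the helper `moving_average`, the window of 10 epochs, and the fixed random seed are all assumptions added for illustration.

```python
import numpy as np

def moving_average(values, window=10):
    """Average each run of `window` consecutive values to damp step-to-step noise."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Same simulated curve as in the plot above, with a fixed seed (an added assumption)
rng = np.random.default_rng(0)
epochs = np.arange(1, 101)
loss = np.exp(-epochs / 20) + rng.normal(0, 0.02, 100)

smoothed = moving_average(loss)

# Count how often the loss rises from one point to the next
raw_increases = int(np.sum(np.diff(loss) > 0))
smoothed_increases = int(np.sum(np.diff(smoothed) > 0))
print(f"epoch-to-epoch increases: raw={raw_increases}, smoothed={smoothed_increases}")
```

A larger window gives a smoother line but reacts more slowly to real changes in the loss, so the window size is a judgment call.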


Chapter Roadmap

| Section | What you will study |
| --- | --- |
| Gradient Descent Variants | Different ways to compute update steps |
| Advanced Optimizers | Adam, momentum, and adaptive methods |
| Learning Rate Schedules | How step size changes during training |
| Numerical Stability & Vectorization | Safer and faster implementation patterns |
| Lab – Comparing GD Variants | Visual and practical comparison |

Practice Warm-Up

  1. Why can a high learning rate cause the model to overshoot a good solution?

  2. Why is stochastic gradient descent noisier than batch gradient descent?

  3. In a business setting, what would loss reduction represent: lower error, lower cost, or better decisions?


Continue

Next, move to Gradient Descent Variants to compare full-batch, stochastic, and mini-batch updates in detail.


Knowledge Check

What is the role of the learning rate in optimization?

  • It controls how large each parameter update step is. Correct: the learning rate determines how aggressively the optimizer moves.

  • It decides how many target classes exist. Incorrect: target classes are part of the data, not the learning rate.

  • It replaces the loss function. Incorrect: optimization still requires a loss function.

  • It removes the need for gradients. Incorrect: gradient-based optimization still relies on gradients.

Why can a learning rate that is too high be harmful?

  • Because the model stops having parameters. Incorrect: parameter existence is not affected by the learning rate.

  • Because updates can overshoot good solutions and make training unstable. Correct: large steps can bounce around or diverge instead of converging smoothly.

  • Because the dataset becomes smaller. Incorrect: dataset size does not change.

  • Because the loss becomes a classification metric. Incorrect: loss type is not redefined by the learning rate.