Optimization & Training Practicalities - Machine Learning for Business

Optimization teaches a model how to improve its parameters step by step¶

This chapter explains how training actually happens, why loss decreases over time, and how choices like learning rate, batch size, and optimizer affect the path to a useful model.

Why Optimization Matters¶

In theory, machine learning sounds simple: find the weights that minimize the loss. In practice, a model can overshoot the minimum, oscillate around it, or converge very slowly. Optimization is the process that turns repeated feedback into better parameter values.

What Happens During Training¶

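Every training step follows the same cycle: the model makes predictions, the loss measures how wrong they are, and the gradient of the loss shows how each parameter contributed to that error. The optimizer then moves the parameters a small step against the gradient, with the step size set by the learning rate. This is why gradients matter: they point toward higher loss, so stepping in the opposite direction leads toward lower error.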

Core Equations¶

Loss Function¶

L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

(1)

Parameter Update Rule¶

\theta_{\text{new}} = \theta_{\text{old}} - \eta \, \nabla L(\theta_{\text{old}})

(2)

Variable Guide¶

$L$ : loss value
$y_i$ : actual value
$\hat{y}_i$ : prediction
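$\theta$ : model parameters (weights)
$\eta$ : learning rate (step size)
$\nabla L(\theta)$ : gradient of the loss with respect to $\theta$; it points toward steepest increase, hence the minus sign in (2)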
$n$ : number of observations

These equations formalize the same intuition: measure error, then move the parameters in a direction that reduces that error.

Why We Cannot Jump Straight to the Minimum¶

Real loss surfaces are rarely simple. They can contain flat regions, steep regions, noisy directions, and points where different parameters interact in difficult ways. That is why practical optimization relies on repeated updates instead of one perfect jump.

Common Optimization Problems¶

Problem	Description	Business effect
Vanishing gradients	Updates become too small	Training slows down or stalls
Exploding gradients	Updates become too large	Loss becomes unstable or diverges
Poor initialization	Parameters start in an unhelpful region	More iterations are wasted
Bad learning rate	Step size is too high or too low	Training is chaotic or inefficient
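The "bad learning rate" row can be illustrated with a small sketch. It assumes a one-parameter toy loss, L(theta) = theta**2, which is not taken from the chapter; the function name `run_gradient_descent` and the three learning rates are hypothetical choices made to show each regime.

```python
def run_gradient_descent(eta, steps=20, theta0=5.0):
    """Minimize the toy loss L(theta) = theta**2, whose gradient is 2 * theta."""
    theta = theta0
    history = [theta]
    for _ in range(steps):
        grad = 2.0 * theta          # gradient of theta**2
        theta = theta - eta * grad  # standard gradient descent update
        history.append(theta)
    return history

too_low = run_gradient_descent(eta=0.01)     # tiny steps: slow progress
reasonable = run_gradient_descent(eta=0.1)   # steady progress toward 0
too_high = run_gradient_descent(eta=1.1)     # each step overshoots and grows

for name, hist in [("too low", too_low), ("reasonable", reasonable), ("too high", too_high)]:
    print(f"eta {name:>10}: final |theta| = {abs(hist[-1]):.4f}")
```

With the learning rate that is too low, the parameter is still far from the minimum at 0 after 20 steps. With the one that is too high, each update flips the sign of the parameter and increases its magnitude, which is the overshooting and divergence described in the table.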

Business Analogy¶

Optimization is similar to coaching a team through repeated feedback cycles. You evaluate performance, give corrective guidance, decide how aggressively to change behavior, and monitor whether the process is improving results or creating instability.

Main Optimization Families¶

Type	Description	Typical tradeoff
Batch Gradient Descent	Uses the full dataset for each update	Stable but slower
Stochastic Gradient Descent	Uses one observation at a time	Cheaper per update but noisier
Mini-Batch Gradient Descent	Uses small batches	Good balance of speed and stability
Momentum and Adam	Add memory and adaptive updates	Often faster convergence in practice
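The first three rows of the table differ only in how many observations feed each update. The sketch below makes that explicit with a single `batch_size` argument. It assumes a linear model with MSE loss and synthetic data; the function name `train_linear_model` and all numeric settings are hypothetical. Momentum and Adam are not covered here.

```python
import numpy as np

def train_linear_model(X, y, batch_size, eta=0.05, epochs=200, seed=0):
    """Gradient descent on MSE for y_hat = X @ theta, using batch_size rows per update."""
    rng = np.random.default_rng(seed)
    n = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - X[idx] @ theta
            grad = (-2.0 / len(idx)) * X[idx].T @ residual  # gradient on this batch only
            theta = theta - eta * grad
    return theta

# Synthetic data (an assumption for this sketch): y = 3 - 1.5x plus small noise.
data_rng = np.random.default_rng(42)
x = data_rng.uniform(-1.0, 1.0, size=64)
X = np.column_stack([np.ones_like(x), x])
y = 3.0 - 1.5 * x + data_rng.normal(0.0, 0.1, size=64)

theta_batch = train_linear_model(X, y, batch_size=len(y))  # full-batch gradient descent
theta_sgd = train_linear_model(X, y, batch_size=1)         # stochastic gradient descent
theta_mini = train_linear_model(X, y, batch_size=16)       # mini-batch gradient descent

print("batch:", theta_batch)
print("sgd:  ", theta_sgd)
print("mini: ", theta_mini)
```

All three runs should land near the intercept 3 and slope -1.5 used to generate the data. The full-batch run makes one update per epoch, while the stochastic run makes 64 smaller, noisier ones.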

Training Curve Example¶

Code Variable Guide¶

epochs: training rounds
loss: model error over time

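import numpy as np
import matplotlib.pyplot as plt

# Simulated training curve (no model is trained here): exponential decay plus small Gaussian noise.
rng = np.random.default_rng(0)  # fixed seed so the plot is reproducible
epochs = np.arange(1, 101)
loss = np.exp(-epochs / 20) + rng.normal(0, 0.02, size=epochs.size)

# Plot loss against epoch to visualize the downward trend.
plt.figure(figsize=(8, 5))
plt.plot(epochs, loss, label='Loss Curve', color='royalblue')
plt.title('Training Loss Over Time')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()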

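Interpretation: the curve should trend downward overall. Small fluctuations are normal. In this example they come from the random noise added to the simulated loss; in real training they typically come from estimating the gradient on a different batch of data at each step.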

Chapter Roadmap¶

Section	What you will study
Gradient Descent Variants	Different ways to compute update steps
Advanced Optimizers	Adam, momentum, and adaptive methods
Learning Rate Schedules	How step size changes during training
Numerical Stability & Vectorization	Safer and faster implementation patterns
Lab – Comparing GD Variants	Visual and practical comparison

Practice Warm-Up¶

Why can a high learning rate cause the model to overshoot a good solution?
Why is stochastic gradient descent noisier than batch gradient descent?
In a business setting, what would loss reduction represent: lower error, lower cost, or better decisions?

Continue¶

Next, move to Gradient Descent Variants to compare full-batch, stochastic, and mini-batch updates in detail.

Knowledge Check¶

What is the role of the learning rate in optimization?¶

It controls how large each parameter update step is
Correct. The learning rate determines how aggressively the optimizer moves.

It decides how many target classes exist
Incorrect. Target classes are part of the data, not the learning rate.

It replaces the loss function
Incorrect. Optimization still requires a loss function.

It removes the need for gradients
Incorrect. Gradient-based optimization still relies on gradients.

Why can a learning rate that is too high be harmful?¶

Because the model stops having parameters
Incorrect. Parameter existence is not affected by the learning rate.

Because updates can overshoot good solutions and make training unstable
Correct. Large steps can bounce around or diverge instead of converging smoothly.

Because the dataset becomes smaller
Incorrect. Dataset size does not change.

Because the loss becomes a classification metric
Incorrect. Loss type is not redefined by the learning rate.