Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Teaching Machines the Art of Regret

“To err is human. To minimize that error — that’s machine learning.” 🤖


🎯 What Is MSE?

Imagine your ML model as a junior analyst making predictions about sales. Every time it’s wrong, you slap it with a gentle mathematical frown 😠 — that’s the error.

Now, instead of yelling at it for each mistake, you:

  1. Measure how far off it was (difference between prediction & truth),

  2. Square it (because negative errors also hurt),

  3. Average them all.

That’s Mean Squared Error (MSE) — the average squared disappointment of your model.

[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 ]

Where:

  • ( y_i ) = actual value

  • ( \hat{y_i} ) = predicted value

  • ( n ) = number of observations


🧮 Example: Predicting Sales

ObservationActual (y)Predicted (ŷ)Error (y−ŷ)Squared Error
1100110-10100
220019010100
330027030900

[ \text{MSE} = \frac{100 + 100 + 900}{3} = 366.67 ]

💬 “On average, the model is about 19 units off — but because errors are squared, big mistakes really sting.” 🥲


📊 Why Square the Errors?

Because math loves drama. Squaring does two things:

  1. Eliminates negatives (so overestimates and underestimates don’t cancel out),

  2. Punishes big errors — the way your boss remembers that one bad forecast from last year.


💼 Business Analogy

Let’s say you’re a coffee chain manager predicting daily sales.

  • Predicting 480 cups when you sell 500 → okay, close enough. ☕

  • Predicting 50 cups when you sell 500 → catastrophic caffeine shortage! 😱

Squaring ensures the big misses matter a lot more.

That’s why MSE = “Measure of how badly your model can mess up, weighted by embarrassment.”


⚙️ In Python

from sklearn.metrics import mean_squared_error

y_true = [100, 200, 300]
y_pred = [110, 190, 270]

mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.2f}")
print(f"RMSE: {mse**0.5:.2f}")  # Root Mean Squared Error

Output:

MSE: 366.67
RMSE: 19.15

💡 RMSE (Root Mean Squared Error) brings the units back — so it’s easier to explain in meetings. “Our model’s off by about 19 sales per day on average.” 📉


📈 Visual Intuition

import matplotlib.pyplot as plt
import numpy as np

y_actual = np.array([100, 200, 300])
y_pred = np.array([110, 190, 270])

plt.scatter(y_actual, y_pred, color='blue', label='Predictions')
plt.plot([0, 300], [0, 300], color='red', linestyle='--', label='Perfect Prediction Line')
for i in range(len(y_actual)):
    plt.plot([y_actual[i], y_actual[i]], [y_actual[i], y_pred[i]], 'gray', linestyle=':')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Error Distances (MSE Visualization)')
plt.legend()
plt.show()

Each gray line = a “mistake.” The longer the line, the bigger the pain (and the bigger the MSE). 💔


🧠 When to Use MSE

MetricBest ForNotes
MSEContinuous outcomesPenalizes large errors heavily
RMSEInterpretabilityEasier to explain (in original units)
MAE (Mean Absolute Error)When you want fairnessDoesn’t punish big errors as much

💬 Pro tip: If your business hates big mistakes (like loan defaults or stock forecasts), go with MSE/RMSE. If you want to treat all errors equally, use MAE.


🤹 Practice Corner: “Forecast Like a Boss”

Try this small challenge 👇

You’re predicting next month’s sales revenue for 4 stores:

StoreActual ($)Predicted ($)
A50,00048,000
B75,00080,000
C60,00065,000
D90,000100,000
  1. Compute MSE and RMSE.

  2. Which store has the largest squared error?

  3. In business terms — what does this tell you?

💡 Hint: One bad forecast can ruin your bonus report 😅


🧩 Common Pitfalls

MistakeExplanation
Ignoring scaleMSE grows with the magnitude of your data
Comparing MSE across models with different unitsDon’t — use RMSE or relative metrics
Forgetting to check residual plotsMSE alone can’t tell if errors are biased

🧭 Recap

ConceptMeaning
MSEAverage squared prediction error
RMSESquare root of MSE (in same units as data)
GoalMake it as small as possible
PunishesLarge errors more heavily
Used inRegression, forecasting, optimization

💬 Final Thought

“MSE is like guilt — it teaches your model to reflect on its mistakes, but too much of it and nobody learns anything.” 😅


🔜 Next Up

👉 Head to Gradients & Partial Derivatives — where your model learns how to apologize properly by adjusting itself step by step. 🧘‍♂️

“Because knowing you’re wrong is one thing — learning from it is where the magic happens.” 🌟


import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation, PillowWriter

# Define the true function
def true_function(x):
    return 2 * x + 1

# Generate some data points
x_data = np.linspace(0, 10, 20)
y_data = true_function(x_data) + np.random.normal(0, 2, 20)

# Define the figure and axes
fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 25)

# Plot the data points
scatter = ax.scatter(x_data, y_data)

# Create an empty line object for the predicted line
line, = ax.plot([], [], 'r-', lw=2)

# Create empty line objects for the error lines
error_lines = []
for _ in range(len(x_data)):
    err_line, = ax.plot([], [], 'g--', lw=1)
    error_lines.append(err_line)

# Function to initialize the animation
def init():
    line.set_data([], [])
    for err_line in error_lines:
        err_line.set_data([], [])
    return (line, *error_lines)

# Function to update the animation frame
def animate(i):
    # Define the predicted function with varying slope and intercept
    def predicted_function(x, slope, intercept):
        return slope * x + intercept

    # Define slope and intercept for the current frame
    slope = 0.5 + i * 0.1
    intercept = 0 + i * 0.2

    # Calculate predicted y values
    y_predicted = predicted_function(x_data, slope, intercept)

    # Update the line data
    line.set_data(x_data, y_predicted)

    # Update error lines
    for j, err_line in enumerate(error_lines):
        err_line.set_data([x_data[j], x_data[j]], [y_data[j], y_predicted[j]])

    return (line, *error_lines)

# Create the animation
ani = FuncAnimation(fig, animate, frames=np.arange(25), init_func=init, interval=1000, blit=True)

# 10. Save as GIF
ani.save('squared_error.gif', writer=PillowWriter(fps=10))

# 11. Display saved GIF inside notebook
from IPython.display import Image
Image(filename="squared_error.gif")
<IPython.core.display.Image object>
<Figure size 640x480 with 1 Axes>
ml

An Objective: Mean Squared Error

We pick θ\theta to minimize the mean squared error (MSE). Slight variants of this objective are also known as the residual sum of squares (RSS) or the sum of squared residuals (SSR).

J(θ)=12ni=1n(y(i)θx(i))2J(\theta)= \frac{1}{2n} \sum_{i=1}^n(y^{(i)}-\theta^\top x^{(i)})^2

In other words, we are looking for the best compromise in θ\theta over all the data points.

Mean Squared Error: Partial Derivatives

Let’s work out the derivatives for 12(fθ(x(i))y(i))2,\frac{1}{2} \left( f_\theta(x^{(i)}) - y^{(i)} \right)^2, the MSE of a linear model fθf_\theta for one training example (x(i),y(i))(x^{(i)}, y^{(i)}), which we denote J(i)(θ)J^{(i)}(\theta).

θjJ(i)(θ)=θj(12(fθ(x(i))y(i))2)=(fθ(x(i))y(i))θj(fθ(x(i))y(i))=(fθ(x(i))y(i))θj(k=0dθkxk(i)y(i))=(fθ(x(i))y(i))xj(i)\begin{align*} \frac{\partial}{\partial \theta_j} J^{(i)}(\theta) & = \frac{\partial}{\partial \theta_j} \left(\frac{1}{2} \left( f_\theta(x^{(i)}) - y^{(i)} \right)^2\right) \\ & = \left( f_\theta(x^{(i)}) - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta_j} \left( f_\theta(x^{(i)}) - y^{(i)} \right) \\ & = \left( f_\theta(x^{(i)}) - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^d \theta_k \cdot x^{(i)}_k - y^{(i)} \right) \\ & = \left( f_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}_j \end{align*}

Mean Squared Error: The Gradient

We can use this derivation to obtain an expression for the gradient of the MSE for a linear model

θJ(i)(θ)=[J(i)(θ)θ0J(i)(θ)θ1J(i)(θ)θd]=[(fθ(x(i))y(i))x0(i)(fθ(x(i))y(i))x1(i)(fθ(x(i))y(i))xd(i)]=(fθ(x(i))y(i))x(i)\begin{align*} \small {\tiny \nabla_\theta J^{(i)} (\theta)} = \begin{bmatrix} \frac{\partial J^{(i)}(\theta)}{\partial \theta_0} \\ \frac{\partial J^{(i)}(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial J^{(i)}(\theta)}{\partial \theta_d} \end{bmatrix} = \begin{bmatrix} \left( f_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}_0 \\ \left( f_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}_1 \\ \vdots \\ \left( f_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}_d \end{bmatrix} = \left( f_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \end{align*}

Note that the MSE over the entire dataset is J(θ)=1ni=1nJ(i)(θ)J(\theta) = \frac{1}{n}\sum_{i=1}^n J^{(i)}(\theta). Therefore:

θJ(θ)=[J(θ)θ0J(θ)θ1J(θ)θd]=1ni=1n[J(i)(θ)θ0J(i)(θ)θ1J(i)(θ)θd]=1ni=1n(fθ(x(i))y(i))x(i)\begin{align*} \nabla_\theta J (\theta) = \begin{bmatrix} \frac{\partial J(\theta)}{\partial \theta_0} \\ \frac{\partial J(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_d} \end{bmatrix} = \frac{1}{n}\sum_{i=1}^n \begin{bmatrix} \frac{\partial J^{(i)}(\theta)}{\partial \theta_0} \\ \frac{\partial J^{(i)}(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial J^{(i)}(\theta)}{\partial \theta_d} \end{bmatrix} = \frac{1}{n} \sum_{i=1}^n \left( f_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)} \end{align*}

Ordinary Least Squares

The gradient θf\nabla_\theta f further extends the derivative to multivariate functions f:RdRf : \mathbb{R}^d \to \mathbb{R}, and is defined at a point θ0\theta_0 as

θf(θ0)=[f(θ0)θ1f(θ0)θ2f(θ0)θd].\nabla_\theta f (\theta_0) = \begin{bmatrix} \frac{\partial f(\theta_0)}{\partial \theta_1} \\ \frac{\partial f(\theta_0)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta_0)}{\partial \theta_d} \end{bmatrix}.

In other words, the jj-th entry of the vector θf(θ0)\nabla_\theta f (\theta_0) is the partial derivative f(θ0)θj\frac{\partial f(\theta_0)}{\partial \theta_j} of ff with respect to the jj-th component of θ\theta.

Notation: Design Matrix

Machine learning algorithms are most easily defined in the language of linear algebra. Therefore, it will be useful to represent the entire dataset as one matrix XRn×dX \in \mathbb{R}^{n \times d}, of the form:

X=[x1(1)x2(1)xd(1)x1(2)x2(2)xd(2)x1(n)x2(n)xd(n)]=[(x(1))(x(2))(x(n))].X = \begin{bmatrix} x^{(1)}_1 & x^{(1)}_2 & \ldots & x^{(1)}_d \\ x^{(2)}_1 & x^{(2)}_2 & \ldots & x^{(2)}_d \\ \vdots \\ x^{(n)}_1 & x^{(n)}_2 & \ldots & x^{(n)}_d \end{bmatrix} = \begin{bmatrix} - & (x^{(1)})^\top & - \\ - & (x^{(2)})^\top & - \\ & \vdots & \\ - & (x^{(n)})^\top & - \\ \end{bmatrix} .

Notation: Target Vector

Similarly, we can vectorize the target variables into a vector yRny \in \mathbb{R}^n of the form

y=[y(1)y(2)y(n)].y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}.

Squared Error in Matrix Form

Recall that we may fit a linear model by choosing θ\theta that minimizes the squared error:

J(θ)=12i=1n(y(i)θx(i))2J(\theta)=\frac{1}{2}\sum_{i=1}^n(y^{(i)}-\theta^\top x^{(i)})^2

We can write this sum in matrix-vector form as:

J(θ)=12(yXθ)(yXθ)=12yXθ2,J(\theta) = \frac{1}{2} (y-X\theta)^\top(y-X\theta) = \frac{1}{2} \|y-X\theta\|^2,

where XX is the design matrix and \|\cdot\| denotes the Euclidean norm.

The Gradient of the Squared Error

We can compute the gradient of the mean squared error as follows.

θJ(θ)=θ12(Xθy)(Xθy)=12θ((Xθ)(Xθ)(Xθ)yy(Xθ)+yy)=12θ(θ(XX)θ2(Xθ)y)=12(2(XX)θ2Xy)=(XX)θXy\begin{align*} \nabla_\theta J(\theta) & = \nabla_\theta \frac{1}{2} (X \theta - y)^\top (X \theta - y) \\ & = \frac{1}{2} \nabla_\theta \left( (X \theta)^\top (X \theta) - (X \theta)^\top y - y^\top (X \theta) + y^\top y \right) \\ & = \frac{1}{2} \nabla_\theta \left( \theta^\top (X^\top X) \theta - 2(X \theta)^\top y \right) \\ & = \frac{1}{2} \left( 2(X^\top X) \theta - 2X^\top y \right) \\ & = (X^\top X) \theta - X^\top y \end{align*}

We used the facts that ab=baa^\top b = b^\top a (line 3), that xbx=b\nabla_x b^\top x = b (line 4), and that xxAx=2Ax\nabla_x x^\top A x = 2 A x for a symmetric matrix AA (line 4).

Normal Equations

Setting the above derivative to zero, we obtain the normal equations:

(XX)θ=Xy.(X^\top X) \theta = X^\top y.

Hence, the value θ\theta^* that minimizes this objective is given by:

θ=(XX)1Xy.\theta^* = (X^\top X)^{-1} X^\top y.

Note that we assumed that the matrix (XX)(X^\top X) is invertible; we will soon see a simple way of dealing with non-invertible matrices.

Why square the difference?
  1. The error will be positive

  2. If you take the absolute function (to cover point 1), the absolute function isn’t differentiable at the origin. Hence, we square the error.

 Why 12m\frac{1}{2m} instead of 1m\frac{1}{m} ?

As we will see later, when we differentiate the squared error, the 12\frac{1}{2} will cancel out. If we don’t do that we’ll be stuck with a 2 in the equation which is useless.

Minimising the cost function

Our objective function is

argminθJ(θ)\displaystyle \operatorname*{argmin}_\theta J(\theta)

Which simply means, find the value of θ\theta that minimises the error function J(θ)J(\theta). In order to do that, we will differentiate our cost function. When we differentiate it, it will give us gradient, which is the direction in which the error will be reduced. Upon having the gradient, we will simply update our θ\theta values to reflect that step (a step in the direction of lower error)

So, the update rule is the following equation

θ=θαθJ(θ)\theta = \theta - \alpha \frac{\partial}{\partial \theta} J(\theta)

Where,

α\alpha = learning rate. Which is the rate at which we will travel to the direction of the lower error.

This process is nothing but Gradient Descent. There are few version of gradient descent, few of them are:

  1. Batch Gradient Descent: Go through all your input samples, compute the gradient once, and then update θ\thetas.

  2. Stochastic Gradient Descent: Go through a single sample, compute gradient, update θ\thetas, repeat mm times

  3. Mini Batch Gradient Descent: Go through a batch of kk samples, compute gradient, update θ\thetas, repeat mk\frac{m}{k} times.

Differentiating the loss function:

In the update rule:

θ=θαθJ(θ)\theta = \theta - \alpha \frac{\partial}{\partial \theta} J(\theta)

The important part is calculating the derivative. Since we have two variables, we will have two derivatives, one for θ0\theta_0 and another for θ1\theta_1.

So the first equation is:

$

θ0=θ0θ0J(θ)=θ0θ0(12mi=0m(hθ(xi)yi)2)=θ022mi=0m(hθ(xi)yi)θ0(hθ(xi)yi)=θ01mi=0m(hθ(xi)yi)θ0(hθ(xi)yi)=θ01mi=0m(hθ(xi)yi)θ0(θ0+θ1xyi)=θ01mi=0m(hθ(xi)yi)(1+00)\begin{align} \notag \theta_0 &= \theta_0 - \frac{\partial}{\partial \theta_0} J(\theta) \\ \notag &= \theta_0 - \frac{\partial}{\partial \theta_0}(\frac{1}{2m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)^2}) \\ \notag &= \theta_0 - \frac{2}{2m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_0 - \frac{1}{m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_0 - \frac{1}{m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(\theta_0 + \theta_1x- y_i)} \\ \notag &= \theta_0 - \frac{1}{m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}(1 + 0 - 0) \end{align}

$

θ0=θ01mi=0m(hθ(xi)yi)\therefore \theta_0= \theta_0 - \frac{1}{m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}
Similarly, for θ1\theta_1

$

θ1=θ1θ1J(θ)=θ1θ1(12mi=0m(hθ(xi)yi)2)=θ122mi=0m(hθ(xi)yi)θ0(hθ(xi)yi)=θ11mi=0m(hθ(xi)yi)θ0(hθ(xi)yi)=θ11mi=0m(hθ(xi)yi)θ0(θ0+θ1xyi)=θ11mi=0m(hθ(xi)yi)(x+00)\begin{align} \notag \theta_1 &= \theta_1 - \frac{\partial}{\partial \theta_1} J(\theta) \\ \notag &= \theta_1 - \frac{\partial}{\partial \theta_1}(\frac{1}{2m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)^2}) \\ \notag &= \theta_1 - \frac{2}{2m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_1 - \frac{1}{m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_1 - \frac{1}{m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(\theta_0 + \theta_1x- y_i)} \\ \notag &= \theta_1 - \frac{1}{m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}(x + 0 - 0) \end{align}

$

θ1=θ11mi=0m(hθ(xi)yi)(x)\therefore \theta_1= \theta_1 - \frac{1}{m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}(x)

We will implement Batch Gradient Descent i.e. we’ll update the gradients after 1 pass through the entire dataset. Our Algorithm hence becomes:

Repeat till convergence:

1.       θ0=θ01mi=0m(hθ(xi)yi)\theta_0= \theta_0 - \frac{1}{m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}
2.       θ1=θ11mi=0m(hθ(xi)yi)(x)\theta_1= \theta_1 - \frac{1}{m}\sum_{i=0}^m{(h_\theta(x_i) - y_i)}(x)
theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * gradient(theta_prev)

In the above algorithm, we stop when θiθi1||\theta_i - \theta_{i-1}|| is small.

%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation, PillowWriter

# 1. Generate synthetic data
np.random.seed(0)
X = np.linspace(0, 10, 100)
true_w = 2.5
y = true_w * X + np.random.normal(0, 2, size=X.shape)

# 2. Define loss function
def compute_loss(w):
    y_pred = w * X
    return np.mean((y - y_pred)**2)

# 3. Precompute loss curve
w_values = np.linspace(0, 5, 200)
loss_values = [compute_loss(w) for w in w_values]

# 4. Gradient of the loss w.r.t w
def compute_gradient(w):
    y_pred = w * X
    grad = -2 * np.mean(X * (y - y_pred))
    return grad

# 5. Initialize
w_current = 0.0
learning_rate = 0.001
w_history = [w_current]
loss_history = [compute_loss(w_current)]

# 6. Simulate Gradient Descent Steps
for _ in range(100):
    grad = compute_gradient(w_current)
    w_current -= learning_rate * grad
    w_history.append(w_current)
    loss_history.append(compute_loss(w_current))

# 7. Plot setup
fig, ax = plt.subplots()
ax.plot(w_values, loss_values, label='Loss Curve')
point, = ax.plot([], [], 'ro', label='Current Weight')
ax.set_xlabel('Weight (w)')
ax.set_ylabel('Loss')
ax.set_title('Gradient Descent on Loss Curve')
ax.legend()

# 8. Animation function
def update(frame):
    point.set_data([w_history[frame]], [loss_history[frame]])
    return point,

# 9. Create animation
ani = FuncAnimation(fig, update, frames=len(w_history), interval=1000, blit=True)

# 10. Save as GIF
ani.save('gradient_descent.gif', writer=PillowWriter(fps=10))

# 11. Display saved GIF inside notebook
from IPython.display import Image
Image(filename="gradient_descent.gif")
<IPython.core.display.Image object>
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation, PillowWriter

# Define the true function
def true_function(x):
    return 2 * x + 1

# Generate some data points
x_data = np.linspace(0, 10, 20)
y_data = true_function(x_data) + np.random.normal(0, 2, 20)

# Define the figure and axes
fig, ax = plt.subplots()
ax.set_xlim(0, 10)
ax.set_ylim(0, 25)

# Plot the data points
scatter = ax.scatter(x_data, y_data)

# Create an empty line object for the predicted line
line, = ax.plot([], [], 'r-', lw=2)

# Create empty line objects for the error lines
error_lines = []
for _ in range(len(x_data)):
    err_line, = ax.plot([], [], 'g--', lw=1)
    error_lines.append(err_line)

# Create a vertical line for minimum loss
min_loss_line, = ax.plot([], [], 'b--', lw=2, label='Min Loss')

# Create a text annotation for loss value
loss_text = ax.text(0.02, 0.95, '', transform=ax.transAxes, verticalalignment='top')

# Store loss values and parameters
losses = []
slopes = []
intercepts = []

# Function to calculate squared loss
def calculate_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Function to initialize the animation
def init():
    line.set_data([], [])
    min_loss_line.set_data([], [])
    for err_line in error_lines:
        err_line.set_data([], [])
    loss_text.set_text('')
    return (line, min_loss_line, *error_lines, loss_text)

# Function to update the animation frame
def animate(i):
    # Define the predicted function with varying slope and intercept
    def predicted_function(x, slope, intercept):
        return slope * x + intercept

    # Define slope and intercept for the current frame
    slope = 0.5 + i * 0.1
    intercept = 0 + i * 0.2

    # Calculate predicted y values
    y_predicted = predicted_function(x_data, slope, intercept)

    # Calculate and store loss
    loss = calculate_loss(y_data, y_predicted)
    losses.append(loss)
    slopes.append(slope)
    intercepts.append(intercept)

    # Update the line data
    line.set_data(x_data, y_predicted)

    # Update error lines
    for j, err_line in enumerate(error_lines):
        err_line.set_data([x_data[j], x_data[j]], [y_data[j], y_predicted[j]])

    # Update minimum loss line
    if len(losses) > 0 and i == len(np.arange(25)) - 1:
        min_loss_idx = np.argmin(losses)
        min_slope = slopes[min_loss_idx]
        min_intercept = intercepts[min_loss_idx]
        y_min = predicted_function(x_data, min_slope, min_intercept)
        min_loss_line.set_data(x_data, y_min)

    # Update loss text
    loss_text.set_text(f'Squared Loss: {loss:.2f}')

    return (line, min_loss_line, *error_lines, loss_text)

# Create the animation
ani = FuncAnimation(fig, animate, frames=np.arange(25), init_func=init, interval=1000, blit=True)

# Add legend
ax.legend()

# Save as GIF
ani.save('squared_error_with_min_loss.gif', writer=PillowWriter(fps=10))

# Display saved GIF inside notebook
from IPython.display import Image
Image(filename="squared_error_with_min_loss.gif")
<IPython.core.display.Image object>
<Figure size 640x480 with 1 Axes>
# Your code here