
Welcome to Logistic Regression, the model that takes the seriousness of math and says:

“Relax. I just want to predict whether they’ll buy the product or ghost you.” 👻

It’s simple, interpretable, and — despite the name — it’s not regression in the traditional sense. It’s classification in disguise!


🎯 Intuition

Imagine you’re a business analyst trying to predict:

Will a customer churn (1) or stay (0)?

A linear regression might predict something like 1.4 or -0.2. But you can’t tell your boss, “There’s a -0.2 chance the customer will leave.” 😬

So we need something that outputs probabilities between 0 and 1 — Enter Logistic Regression, our smooth-talking probability machine 🧠.


⚙️ The Sigmoid Function — Our Secret Sauce

The logistic model converts linear outputs into probabilities using the sigmoid function:

$$P(y=1 \mid x) = \frac{1}{1 + e^{-(w^T x + b)}}$$

This squashes any real number into the 0–1 range:

  • Large positive → close to 1 (very likely to churn)

  • Large negative → close to 0 (probably staying)

Let’s plot it:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 200)
sigmoid = 1 / (1 + np.exp(-x))

plt.plot(x, sigmoid)
plt.title("Sigmoid Function – Turning Scores into Probabilities")
plt.xlabel("Score (w·x + b)")
plt.ylabel("P(y=1)")
plt.grid(True)
plt.show()

That “S” shape? That’s the curve of business prediction love ❤️.


🧮 Training the Model (The Logistic Way)

We don’t use least squares here — we use Maximum Likelihood Estimation (MLE).

The goal is to find parameters $w$ and $b$ that maximize the probability of correctly predicting all outcomes.

In other words:

“Adjust weights until the model lies with confidence — but statistically.” 😅

The loss function is called Binary Cross-Entropy:

$$L = -\frac{1}{N} \sum_i \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
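To make the formula concrete, here's a minimal NumPy sketch (the function and variable names are ours, not from any library) that computes this loss for a handful of predictions:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Clip predictions so log(0) never occurs
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.2])           # confident and mostly correct

print(binary_cross_entropy(y_true, y_prob))       # small loss
print(binary_cross_entropy(y_true, 1 - y_prob))   # confidently wrong: much larger loss
```

Notice the asymmetry: being confidently wrong is punished far more than being mildly unsure.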


💼 Business Example: Churn Probability

Let’s say you have data on customer engagement:

  • calls_per_month

  • monthly_spend

  • support_tickets

and you want to predict if a customer will churn.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simple dataset
data = pd.DataFrame({
    'calls_per_month': [3, 5, 8, 10, 12],
    'monthly_spend': [50, 70, 100, 120, 150],
    'churn': [0, 0, 0, 1, 1]
})

X = data[['calls_per_month', 'monthly_spend']]
y = data['churn']

model = LogisticRegression()
model.fit(X, y)

print("Predicted probabilities:")
# Pass a DataFrame with matching column names to avoid sklearn's feature-name warning
new_customer = pd.DataFrame({'calls_per_month': [7], 'monthly_spend': [90]})
print(model.predict_proba(new_customer))  # [P(stay), P(churn)]

Your model now says something like:

“Customer has a 72% chance of leaving.”

That’s your cue to send a “We miss you 💌” email.


🧠 Model Coefficients = Business Insights

You can interpret the coefficients just like linear regression:

| Feature | Coefficient | Interpretation |
| --- | --- | --- |
| calls_per_month | -0.5 | More calls → lower churn chance 📞 |
| monthly_spend | +0.8 | Higher spend → slightly higher churn risk 💸 |

The signs and magnitudes tell you what drives customer behavior — it’s like X-ray vision for business data 👓.
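To read those numbers off a fitted model yourself, here's a sketch using the toy churn data from earlier — the values fitted on just five rows will differ from the illustrative table above:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    'calls_per_month': [3, 5, 8, 10, 12],
    'monthly_spend': [50, 70, 100, 120, 150],
    'churn': [0, 0, 0, 1, 1]
})
X = data[['calls_per_month', 'monthly_spend']]
model = LogisticRegression().fit(X, data['churn'])

# One coefficient per feature, plus an intercept
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:+.3f}")
print(f"intercept: {model.intercept_[0]:+.3f}")
```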


🧩 Practice Challenge

  1. Create a new feature: discount_received.

  2. Re-train the model and see if discounts reduce churn probability.

  3. Plot your new sigmoid curve.

  4. Interpret the coefficients like a data-driven detective 🕵️.


🎯 TL;DR

| Concept | Meaning |
| --- | --- |
| Sigmoid | Converts scores → probabilities |
| Weights | Show direction & strength of influence |
| Cross-Entropy Loss | Penalizes wrong confidence |
| Output | Probability between 0 and 1 |

💬 “Logistic Regression is like a polite fortune teller — it won’t say you’ll churn, just that you probably will (with 87% confidence).” 🔮


🔗 Next Up: Naive Bayes – Meet the ancient probabilistic wizard who assumes all your features get along… even when they clearly don’t. 😅

  • Binary Classification: Predicts $P(y=1 \mid \mathbf{x})$, the probability that $y=1$ given input features $\mathbf{x}$.

  • Sigmoid Function: Maps linear combinations of inputs to a probability between 0 and 1.


2. The Sigmoid Function

The core of logistic regression is the sigmoid (logistic) function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

  • $z = \mathbf{w}^T \mathbf{x} + b$ (linear combination of weights $\mathbf{w}$ and features $\mathbf{x}$, with bias $b$).

  • Output $\sigma(z) \in (0, 1)$.


3. Hypothesis Representation

The predicted probability $\hat{y} = P(y=1 \mid \mathbf{x})$ is:

$$\hat{y} = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}$$

4. Cost Function (Log Loss)

Logistic regression uses cross-entropy loss:

$$J(\mathbf{w}, b) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$

  • $m$ = number of training examples.

  • $y^{(i)}$ = true label (0 or 1).

  • $\hat{y}^{(i)}$ = predicted probability.

Intuition:

  • If $y=1$, cost $\rightarrow \infty$ when $\hat{y} \rightarrow 0$.

  • If $y=0$, cost $\rightarrow \infty$ when $\hat{y} \rightarrow 1$.


5. Gradient Descent Optimization

Update weights $\mathbf{w}$ and bias $b$ to minimize $J(\mathbf{w}, b)$:

$$w_j := w_j - \alpha \frac{\partial J}{\partial w_j}$$
$$b := b - \alpha \frac{\partial J}{\partial b}$$

Where:

  • $\alpha$ = learning rate.

  • The gradients are:

$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)}) x_j^{(i)}$$
$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})$$

6. Decision Boundary

  • If $\hat{y} \geq 0.5$, predict $y=1$.

  • If $\hat{y} < 0.5$, predict $y=0$.

  • The boundary is where $\mathbf{w}^T \mathbf{x} + b = 0$.
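The thresholding rule is a one-liner in practice. A small sketch with hypothetical weights (the values of `w` and `b` here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical fitted weights and bias for a 2-feature model
w = np.array([1.5, -2.0])
b = 0.25

def predict(x, threshold=0.5):
    z = w @ x + b        # linear score
    p = sigmoid(z)       # probability of class 1
    return int(p >= threshold), p

print(predict(np.array([2.0, 1.0])))   # z = 1.25 > 0  → class 1
print(predict(np.array([0.0, 1.0])))   # z = -1.75 < 0 → class 0
```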


7. Multiclass Logistic Regression (Softmax)

For $K > 2$ classes, use softmax regression:

$$P(y=k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^T \mathbf{x}}}$$
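A minimal NumPy sketch of the softmax formula, given some class scores $\mathbf{w}_k^T \mathbf{x}$ (the score values below are made up for illustration):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability (doesn't change the result)
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Hypothetical class scores for K = 3 classes
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

print(probs)        # highest score gets the highest probability
print(probs.sum())  # probabilities sum to 1
```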

8. Assumptions & Limitations

  • Assumes a linear decision boundary (use feature engineering for non-linear cases).

  • Sensitive to outliers (but less than linear regression).

  • Requires large datasets for stable weight estimates.


Example in Python (Scikit-learn)

from sklearn.linear_model import LogisticRegression

# Assumes X_train, y_train, X_test have already been prepared
# (e.g. via sklearn.model_selection.train_test_split)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Example: Coin Flipping with Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)

Imagine you flip a coin 10 times and get 7 heads. You want to estimate the probability of getting heads, denoted as $\theta$. Is the coin fair ($\theta = 0.5$) or biased? We’ll use MLE and MAP to find $\theta$.


MLE: Trust Only the Data

MLE picks the $\theta$ that makes the observed data (7 heads in 10 flips) most likely.

  • You got 7 heads and 3 tails. The probability of this happening is modeled as:

    $$P(\text{data}) = \theta^7 (1 - \theta)^3$$

  • MLE finds the $\theta$ that maximizes this. Intuitively, it’s like asking: “What $\theta$ best explains 7 heads out of 10 flips?”

  • Without diving into calculus, the answer is the proportion of heads:

    $$\theta_{\text{MLE}} = \frac{\text{number of heads}}{\text{total flips}} = \frac{7}{10} = 0.7$$

  • Interpretation: MLE says the probability of heads is 0.7, based purely on the data. If you have no prior beliefs, this is your best guess.


MAP: Combine Data with a Prior Belief

MAP also uses the data but adds a “prior belief” about $\theta$. Let’s assume you think the coin is probably fair ($\theta \approx 0.5$) but you’re open to it being slightly biased. We model this with a simple prior called a Beta(2, 2) prior, which favors $\theta$ near 0.5 but allows flexibility.

  • The Beta(2, 2) prior is like saying, “Before flipping, I believe the coin is fair, as if I’ve seen 2 heads and 2 tails in the past.”

  • MAP combines:

    • Data: 7 heads, 3 tails from 10 flips.

    • Prior: 2 “imaginary” heads, 2 “imaginary” tails.

  • Add them together:

    • Total heads = 7 (data) + 2 (prior) = 9

    • Total flips = 10 (data) + 4 (prior) = 14

  • MAP estimate:

    $$\theta_{\text{MAP}} = \frac{\text{total heads}}{\text{total flips}} = \frac{9}{14} \approx 0.643$$

  • Interpretation: MAP says the probability of heads is about 0.643. It’s closer to 0.5 than MLE (0.7) because the prior belief in a fair coin “pulls” the estimate toward 0.5.


Key Difference

  • MLE: Only uses the data (7 heads, 10 flips) → $\theta = 0.7$.

  • MAP: Combines data with a prior belief (like 2 heads, 2 tails) → $\theta \approx 0.643$, which is smoothed toward a fair coin.

This smoothing is helpful when you have little data. For example, if you flipped the coin once and got 1 head, MLE would say $\theta = 1.0$ (100% heads), which is extreme. MAP, with the Beta(2, 2) prior, would give $\theta = \frac{1 + 2}{1 + 4} = \frac{3}{5} = 0.6$, a more cautious estimate.
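Both estimates are one-liners to compute. A note on the pseudo-count rule used above: strictly speaking, adding the full Beta(2, 2) pseudo-counts gives the *posterior mean* of the Beta(9, 5) posterior; the posterior *mode* (the textbook MAP point estimate) would be $\frac{7+2-1}{10+4-2} = \frac{8}{12} \approx 0.667$. Both are pulled toward 0.5, which is the point of the example:

```python
heads, flips = 7, 10
prior_heads, prior_tails = 2, 2   # Beta(2, 2) pseudo-counts

# MLE: observed proportion only
theta_mle = heads / flips

# Pseudo-count estimate used in the text (the posterior mean of Beta(9, 5))
theta_smoothed = (heads + prior_heads) / (flips + prior_heads + prior_tails)

print(f"MLE:      {theta_mle:.3f}")        # 0.700
print(f"Smoothed: {theta_smoothed:.3f}")   # 0.643, pulled toward 0.5
```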


Linear Regression Recap

Linear regression models a continuous target value:

$$y = \vec{\theta}^\intercal \vec{x} + b$$
  • The cost function is usually Mean Squared Error (MSE).

  • We use regularization (like L2/Ridge or L1/Lasso) to reduce overfitting.


Motivation for Classification

What if our target $y$ is not a continuous number, but a class label — like “spam” or “not spam”?

  • Linear regression doesn’t work well for classification:

    • It can predict values outside the $[0, 1]$ range.

    • It doesn’t model probabilities naturally.



Logistic Regression

Logistic regression is a binary classification algorithm used to predict the probability that an instance belongs to one of two classes (e.g., 0 or 1). It models the relationship between input features and the probability of a specific outcome using the sigmoid function, which maps any real-valued number to the range $(0, 1)$. The model is trained by maximizing the likelihood of the observed data, typically using gradient-based optimization.

Key Components

  1. Sigmoid Function: Converts a linear combination of features into a probability.

  2. Decision Boundary: Separates the two classes based on a probability threshold (usually 0.5).

  3. Maximum Likelihood Estimation (MLE): Optimizes the model parameters to best fit the training data.


1. Sigmoid Function

The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z = \vec{\theta}^\intercal \vec{x} + b$ is a linear combination of:

  • $\vec{x}$: Input feature vector.

  • $\vec{\theta}$: Weight vector (model parameters).

  • $b$: Bias term.

Properties

  • Range: $\sigma(z) \in (0, 1)$, making it ideal for modeling probabilities.

  • Interpretation: $\sigma(z) = P(y=1|\vec{x})$, the probability of class 1 given input $\vec{x}$.

  • Complement: $P(y=0|\vec{x}) = 1 - \sigma(z)$.

Why Use $e^x$ in the Sigmoid?

The exponential function $e^x$ is chosen because:

  1. Non-linearity: Introduces non-linear behavior, allowing the model to capture complex patterns.

  2. Differentiability: $e^x$ has a simple derivative, facilitating gradient-based optimization.

  3. Convexity: The resulting negative log-likelihood is convex, ensuring a single global minimum during optimization.

Other bases (e.g., $2^x$) could work but lack the mathematical convenience of $e^x$.


%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def grad_sigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))

X = np.arange(-10, 10, 0.1)
fig = plt.figure(figsize=(8, 6))
plt.plot(X, sigmoid(X), label='sigmoid', c='blue', linewidth=3)
plt.plot(X, grad_sigmoid(X), label='derivative of sigmoid', c='green', linewidth=3)
plt.legend(loc='upper left')
plt.xlabel('X')
plt.grid(True)
plt.ylim([-0.5, 1.5])
plt.ylabel('output')
plt.title('Sigmoid')
plt.axhline(y=0, color='k')
plt.axvline(x=0, color='k')
plt.show()

2. Decision Boundary

The decision boundary is the set of points where the predicted probability is 0.5:

$$\sigma(z) = 0.5 \implies z = 0 \implies \vec{\theta}^\intercal \vec{x} + b = 0$$

This equation defines a hyperplane in the feature space:

  • If $z > 0$, then $\sigma(z) > 0.5$, and the instance is classified as class 1.

  • If $z < 0$, then $\sigma(z) < 0.5$, and the instance is classified as class 0.


3. Model Formulation

Logistic regression assumes the class posterior probabilities follow the sigmoid form:

$$P(C_1|\vec{x}) = \sigma(\vec{\theta}^\intercal \vec{x} + b)$$
$$P(C_2|\vec{x}) = 1 - P(C_1|\vec{x}) = 1 - \sigma(\vec{\theta}^\intercal \vec{x} + b)$$

Here, $\vec{x}$ may be raw features or transformed features $\phi(\vec{x})$ (e.g., polynomial features).


4. Training with Maximum Likelihood Estimation (MLE)

The goal is to find the parameters $\vec{\theta}$ and $b$ that maximize the likelihood of the observed data.

Training Data

The training set is:

$$\Psi = \{(\vec{x}_i, t_i)\}_{i=1}^m, \quad t_i \in \{0, 1\}$$

where:

  • $\vec{x}_i$: Feature vector for the $i$-th instance.

  • $t_i$: Target label (0 or 1).

Likelihood Function

The likelihood of the data given the parameters is:

$$P(\Psi|\vec{\theta}, b) = \prod_{i=1}^m P(t_i|\vec{x}_i)$$

Since $t_i \in \{0, 1\}$, we can write:

$$P(t_i|\vec{x}_i) = \sigma(\vec{\theta}^\intercal \vec{x}_i + b)^{t_i} \cdot (1 - \sigma(\vec{\theta}^\intercal \vec{x}_i + b))^{1 - t_i}$$

Thus, the likelihood is:

$$P(\Psi|\vec{\theta}, b) = \prod_{i=1}^m \sigma(\vec{\theta}^\intercal \vec{x}_i + b)^{t_i} \cdot (1 - \sigma(\vec{\theta}^\intercal \vec{x}_i + b))^{1 - t_i}$$

Basic Logarithm Rules

Let $b$ be the base ($b > 0$ and $b \neq 1$), and let $x$ and $y$ be positive numbers.

  1. Definition:

    $$\log_b(x) = y \iff b^y = x$$

  2. Product Rule:

    $$\log_b(xy) = \log_b(x) + \log_b(y)$$

  3. Quotient Rule:

    $$\log_b\left(\frac{x}{y}\right) = \log_b(x) - \log_b(y)$$

  4. Power Rule:

    $$\log_b(x^p) = p \cdot \log_b(x)$$

  5. Change of Base Rule:

    $$\log_b(x) = \frac{\log_c(x)}{\log_c(b)}$$

    where $c$ is any valid base ($c > 0$ and $c \neq 1$). A common form uses the natural logarithm ($\ln$) or the common logarithm ($\log_{10}$):

    $$\log_b(x) = \frac{\ln(x)}{\ln(b)} = \frac{\log_{10}(x)}{\log_{10}(b)}$$

  6. Logarithm of the Base:

    $$\log_b(b) = 1$$

  7. Logarithm of One:

    $$\log_b(1) = 0$$

Rules Involving $e^x$ and $\ln(x)$ (Natural Logarithm)

The natural logarithm ($\ln$) has base $e \approx 2.71828$.

  1. Definition:

    $$\ln(x) = y \iff e^y = x$$

  2. Inverse Relationships:

    $$e^{\ln(x)} = x$$

    $$\ln(e^x) = x$$

These rules are fundamental for working with logarithmic and exponential functions.
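All of these identities can be checked numerically with Python's standard `math` module (the specific values of `x`, `y`, and `b` below are arbitrary test inputs):

```python
import math

x, y, b = 8.0, 2.0, 2.0

# Product, quotient, and power rules
assert math.isclose(math.log(x * y, b), math.log(x, b) + math.log(y, b))
assert math.isclose(math.log(x / y, b), math.log(x, b) - math.log(y, b))
assert math.isclose(math.log(x ** 3, b), 3 * math.log(x, b))

# Change of base: log_b(x) = ln(x) / ln(b)
assert math.isclose(math.log(x, b), math.log(x) / math.log(b))

# Inverse relationships between exp and ln
assert math.isclose(math.exp(math.log(x)), x)
assert math.isclose(math.log(math.exp(y)), y)

print("All logarithm rules check out numerically.")
```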


Examples of Calculating $e^z$

Example 1: $e^0$

  • Calculation: $e^0 = 1$

  • Explanation: Any number raised to the power of 0 is 1.

Example 2: $e^1$

  • Calculation: $e^1 = e \approx 2.71828$

  • Explanation: This is the value of $e$ itself.

Example 3: $e^2$

  • Calculation: $e^2 \approx 2.71828^2 \approx 7.389$

  • Explanation: Square the value of $e$. Using a calculator, $2.71828^2 \approx 7.389$.

Example 4: $e^{-1}$

  • Calculation: $e^{-1} = \frac{1}{e} \approx \frac{1}{2.71828} \approx 0.368$

  • Explanation: A negative exponent means the reciprocal: $e^{-1} = 1/e$.

Example 5: $e^{0.5}$

  • Calculation: $e^{0.5} = \sqrt{e} \approx \sqrt{2.71828} \approx 1.649$

  • Explanation: The exponent 0.5 means the square root of $e$.


Notes

  • Approximations: The results are rounded to three decimal places for simplicity. In practice, you’d use a calculator or software (e.g., Python’s math.exp() or NumPy’s np.exp()) for precise values.

  • Context: These calculations are common in machine learning (e.g., softmax function) and other fields like statistics and physics.

  • How to Compute: If you don’t have a calculator, you can use software like Python:

    import math
    print(math.exp(2))  # Outputs ~7.389


Log-Likelihood

To simplify optimization, we maximize the log-likelihood (since the log is monotonic):

$$\ell(\vec{\theta}, b) = \log P(\Psi|\vec{\theta}, b) = \sum_{i=1}^m \left[ t_i \log \sigma(\vec{\theta}^\intercal \vec{x}_i + b) + (1 - t_i) \log (1 - \sigma(\vec{\theta}^\intercal \vec{x}_i + b)) \right]$$

Optimization

We maximize $\ell(\vec{\theta}, b)$ (or equivalently, minimize $-\ell(\vec{\theta}, b)$) using gradient-based methods (e.g., gradient ascent). The gradients are computed with respect to $\vec{\theta}$ and $b$.


5. Derivation of the Sigmoid Function

The sigmoid function arises naturally from modeling the log-odds (logit) as a linear function.

Step-by-Step Derivation

  1. Odds: The odds of class 1 are:

    $$\text{Odds} = \frac{P(y=1|\vec{x})}{P(y=0|\vec{x})} = \frac{p}{1 - p}$$

  2. Logit: The log-odds (logit) is assumed to be a linear function of the features:

    $$\ln \left( \frac{p}{1 - p} \right) = \vec{\theta}^\intercal \vec{x} + b$$

  3. Exponentiate Both Sides:

    $$\frac{p}{1 - p} = e^{\vec{\theta}^\intercal \vec{x} + b}$$

  4. Solve for $p$:

    Let $z = \vec{\theta}^\intercal \vec{x} + b$. Then:

    $$p = e^z (1 - p)$$
    $$p = e^z - p e^z$$
    $$p + p e^z = e^z$$
    $$p (1 + e^z) = e^z$$
    $$p = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}$$

This is the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$.


6. Gradient of the Sigmoid Function

The derivative of the sigmoid function is used in gradient-based optimization.

Derivation

Let $\sigma(x) = \frac{1}{1 + e^{-x}}$. Compute the derivative:

$$\frac{d}{dx} \sigma(x) = \frac{d}{dx} \left( 1 + e^{-x} \right)^{-1}$$

Using the chain rule:

$$\frac{d}{dx} \left( 1 + e^{-x} \right)^{-1} = - \left( 1 + e^{-x} \right)^{-2} \cdot \frac{d}{dx} \left( 1 + e^{-x} \right)$$

The derivative of the inner function is:

$$\frac{d}{dx} \left( 1 + e^{-x} \right) = 0 + (-1) \cdot e^{-x} = -e^{-x}$$

So:

$$\frac{d}{dx} \sigma(x) = - \left( 1 + e^{-x} \right)^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{\left( 1 + e^{-x} \right)^2}$$

Simplify:

$$\frac{e^{-x}}{\left( 1 + e^{-x} \right)^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}$$

Notice that:

$$\frac{1}{1 + e^{-x}} = \sigma(x)$$

And:

$$\frac{e^{-x}}{1 + e^{-x}} = \frac{(1 + e^{-x}) - 1}{1 + e^{-x}} = 1 - \frac{1}{1 + e^{-x}} = 1 - \sigma(x)$$

Thus:

$$\frac{d}{dx} \sigma(x) = \sigma(x) \cdot (1 - \sigma(x))$$

This derivative is crucial for computing gradients during optimization.

X = np.arange(-5, 5, 0.01)
fig = plt.figure(figsize=(8, 6))
plt.plot(sigmoid(X), -np.log(sigmoid(X)), label='-log(sigmoid(X))', c='blue', linewidth=3)
plt.legend()
plt.xlabel('Sigmoid(X)')
plt.grid(True)
plt.ylabel('-Log(Sigmoid(X))')
plt.title('Cost when y = 1')
plt.axhline(y=0, color='k')
plt.axvline(x=0, color='k')
plt.show()
X = np.arange(-10, 10, 0.01)
fig = plt.figure(figsize=(8, 6))
plt.plot(sigmoid(X), -np.log(1 - sigmoid(X)), label='-log(1 - sigmoid(X))', c='blue', linewidth=3)
plt.legend(loc='upper left')
plt.xlabel('Sigmoid(X)')
plt.grid(True)
plt.ylabel('-Log(1 - Sigmoid(X))')
plt.title('Cost when y = 0')
plt.axhline(y=0, color='k')
plt.axvline(x=0, color='k')
plt.show()

Cost Function and Optimization Objective for Logistic Regression

Logistic regression aims to find parameters that best predict the probability of an instance belonging to one of two classes. To achieve this, we define a cost function (or loss function) that measures how well the model’s predictions match the observed data. The cost function is optimized using gradient-based methods to update the model parameters.

Why a Specific Cost Function?

Unlike linear regression, which uses mean squared error, logistic regression uses a log-loss (or binary cross-entropy) derived from the negative log-likelihood. This choice is motivated by:

  1. Mathematical Convenience: Logarithms convert products to sums, simplifying differentiation.

  2. Smoothness: The function is smooth and differentiable, aiding optimization.

  3. Numerical Stability: Logarithms handle very small probabilities (e.g., $4 \times 10^{-45}$) effectively.

  4. Convexity: The negative log-likelihood is convex, ensuring a single global minimum and no local optima.

  5. Probabilistic Interpretation: It aligns with maximum likelihood estimation (MLE), directly modeling the likelihood of the data.


1. Cost Function

The cost function measures the error for a single prediction. For a given instance with features $\vec{x}$, true label $y \in \{0, 1\}$, and predicted probability $h_\theta(\vec{x}) = \sigma(\vec{\theta}^\intercal \vec{x})$, the cost is:

$$\text{Cost}(h_\theta(\vec{x}), y) = \begin{cases} -\log(h_\theta(\vec{x})), & \text{if } y = 1 \\ -\log(1 - h_\theta(\vec{x})), & \text{if } y = 0 \end{cases}$$

where the sigmoid function is:

$$h_\theta(\vec{x}) = \sigma(\vec{\theta}^\intercal \vec{x}) = \frac{1}{1 + e^{-\vec{\theta}^\intercal \vec{x}}}$$

Intuition

  • If $y = 1$, we want $h_\theta(\vec{x}) \approx 1$. If $h_\theta(\vec{x})$ is close to 1, $-\log(h_\theta(\vec{x}))$ is small (low penalty). If $h_\theta(\vec{x})$ is close to 0, $-\log(h_\theta(\vec{x}))$ is large (high penalty).

  • If $y = 0$, we want $h_\theta(\vec{x}) \approx 0$. If $h_\theta(\vec{x})$ is close to 0, $-\log(1 - h_\theta(\vec{x}))$ is small. If $h_\theta(\vec{x})$ is close to 1, the penalty is large.

Compact Form

The piecewise cost can be combined into a single expression:

$$\text{Cost}(h_\theta(\vec{x}), y) = -y \log(h_\theta(\vec{x})) - (1 - y) \log(1 - h_\theta(\vec{x}))$$

  • When $y = 1$, the equation becomes $-\log(h_\theta(\vec{x}))$.

  • When $y = 0$, it becomes $-\log(1 - h_\theta(\vec{x}))$.

This is the binary cross-entropy loss for a single instance.

Full Cost Function

For a dataset with $m$ instances $\{(\vec{x}^{(i)}, y^{(i)})\}_{i=1}^m$, the total cost function (average loss) is:

$$J(\vec{\theta}) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(\vec{x}^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(\vec{x}^{(i)})) \right]$$

This is the negative log-likelihood averaged over all instances. Minimizing $J(\vec{\theta})$ is equivalent to maximizing the likelihood of the observed data.
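The averaged loss can be written in a few lines of NumPy. A sketch (the bias is folded into $\vec{\theta}$ via a leading column of ones, and the toy data below is made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y, eps=1e-12):
    """Average binary cross-entropy J(theta) for data X (m x n) and labels y."""
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny dataset: a bias column of ones plus one feature
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

print(cost(np.zeros(2), X, y))           # theta = 0 → h = 0.5 everywhere → log(2) ≈ 0.693
print(cost(np.array([0.0, 3.0]), X, y))  # a better fit → lower cost
```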


1. Entropy: Measure of Uncertainty

Entropy comes from information theory. It quantifies the uncertainty or impurity in a probability distribution.

Definition:

For a binary classification problem with a probability $p$ of success (e.g., class 1), the entropy is:

$$H(p) = -p \log_2(p) - (1 - p) \log_2(1 - p)$$

  • Entropy is 0 when the outcome is certain (i.e., $p = 0$ or $p = 1$).

  • Entropy is maximum (1 bit) when $p = 0.5$.

import numpy as np
import matplotlib.pyplot as plt

def entropy(p):
    # Clip to avoid log(0)
    p = np.clip(p, 1e-10, 1 - 1e-10)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Plot entropy for p in [0, 1]
p_vals = np.linspace(0.001, 0.999, 100)
entropy_vals = entropy(p_vals)

plt.plot(p_vals, entropy_vals)
plt.xlabel("Probability (p)")
plt.ylabel("Entropy (bits)")
plt.title("Entropy of a Binary Distribution")
plt.grid(True)
plt.show()

2. Log-Loss / Binary Cross-Entropy

While entropy measures uncertainty of a distribution, log-loss (or binary cross-entropy) measures the error of a predicted probability $\hat{y}$ compared to the true label $y \in \{0, 1\}$.

Formula:

For a single prediction:

$$\text{LogLoss}(y, \hat{y}) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$$

  • If $y = 1$, only the $\log(\hat{y})$ term remains.

  • If $y = 0$, only the $\log(1 - \hat{y})$ term remains.

This has the same form as the entropy formula, but with $y$ as the true class and $\hat{y}$ as the predicted probability.

import numpy as np
import matplotlib.pyplot as plt

def log_loss(y_true, y_pred):
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, 1e-10, 1 - 1e-10)
    return - (y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Example: true label is 1
y_true = 1
y_preds = np.linspace(0.001, 0.999, 100)
losses = [log_loss(y_true, y) for y in y_preds]

plt.plot(y_preds, losses, label=r'True Label = 1')
plt.xlabel(r"Predicted Probability ($\hat{y}$)")
plt.ylabel(r"Log-Loss")
plt.title(r"Binary Cross-Entropy Loss for $y = 1$")
plt.grid(True)
plt.legend()
plt.show()

2. Connection to Maximum Likelihood Estimation (MLE)

Likelihood Function

For a binary classification dataset, the probability of the observed label $y_i$ given features $\vec{x}_i$ is:

$$P(y_i|\vec{x}_i, \vec{\theta}) = h_\theta(\vec{x}_i)^{y_i} \cdot (1 - h_\theta(\vec{x}_i))^{1 - y_i}$$

Assuming independence of instances, the likelihood of the entire dataset is:

$$L(\vec{\theta}) = \prod_{i=1}^m P(y_i|\vec{x}_i, \vec{\theta}) = \prod_{i=1}^m h_\theta(\vec{x}_i)^{y_i} \cdot (1 - h_\theta(\vec{x}_i))^{1 - y_i}$$

Log-Likelihood

To simplify optimization, we take the logarithm:

$$\ell(\vec{\theta}) = \log L(\vec{\theta}) = \sum_{i=1}^m \left[ y_i \log h_\theta(\vec{x}_i) + (1 - y_i) \log (1 - h_\theta(\vec{x}_i)) \right]$$

The cost function $J(\vec{\theta})$ is the negative log-likelihood scaled by $\frac{1}{m}$:

$$J(\vec{\theta}) = -\frac{1}{m} \ell(\vec{\theta})$$

Maximizing $\ell(\vec{\theta})$ (likelihood) is equivalent to minimizing $J(\vec{\theta})$ (loss).


3. Gradient Descent and Parameter Updates

To minimize $J(\vec{\theta})$, we use gradient descent. We need the gradient of $J(\vec{\theta})$ with respect to each parameter $\theta_j$.

Gradient of the Sigmoid

Recall the sigmoid function and its derivative:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \frac{d}{dz} \sigma(z) = \sigma(z) \cdot (1 - \sigma(z))$$

For $h_\theta(\vec{x}) = \sigma(\vec{\theta}^\intercal \vec{x})$, the partial derivative with respect to $\theta_j$ is:

$$\frac{\partial}{\partial \theta_j} h_\theta(\vec{x}) = \sigma(\vec{\theta}^\intercal \vec{x}) \cdot (1 - \sigma(\vec{\theta}^\intercal \vec{x})) \cdot \frac{\partial}{\partial \theta_j} (\vec{\theta}^\intercal \vec{x})$$

Since $\vec{\theta}^\intercal \vec{x} = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$, we have:

$$\frac{\partial}{\partial \theta_j} (\vec{\theta}^\intercal \vec{x}) = x_j$$

Thus:

$$\frac{\partial}{\partial \theta_j} h_\theta(\vec{x}) = h_\theta(\vec{x}) \cdot (1 - h_\theta(\vec{x})) \cdot x_j$$

Gradient of the Cost Function

Now, compute the partial derivative of $J(\vec{\theta})$:

$$J(\vec{\theta}) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(\vec{x}^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(\vec{x}^{(i)})) \right]$$

Differentiate with respect to $\theta_j$:

$$\frac{\partial J(\vec{\theta})}{\partial \theta_j} = -\frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial \theta_j} \left[ y^{(i)} \log(h_\theta(\vec{x}^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(\vec{x}^{(i)})) \right]$$

For a single term:

θj[y(i)log(hθ(x(i)))+(1y(i))log(1hθ(x(i)))]\frac{\partial}{\partial \theta_j} \left[ y^{(i)} \log(h_\theta(\vec{x}^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(\vec{x}^{(i)})) \right]

Apply the chain rule:

  • First term: y(i)log(hθ(x(i)))y^{(i)} \log(h_\theta(\vec{x}^{(i)}))

θj[y(i)log(hθ(x(i)))]=y(i)1hθ(x(i))θjhθ(x(i))\frac{\partial}{\partial \theta_j} \left[ y^{(i)} \log(h_\theta(\vec{x}^{(i)})) \right] = y^{(i)} \cdot \frac{1}{h_\theta(\vec{x}^{(i)})} \cdot \frac{\partial}{\partial \theta_j} h_\theta(\vec{x}^{(i)})
  • Second term: (1y(i))log(1hθ(x(i)))(1 - y^{(i)}) \log(1 - h_\theta(\vec{x}^{(i)}))

θj[(1y(i))log(1hθ(x(i)))]=(1y(i))11hθ(x(i))θj(1hθ(x(i)))\frac{\partial}{\partial \theta_j} \left[ (1 - y^{(i)}) \log(1 - h_\theta(\vec{x}^{(i)})) \right] = (1 - y^{(i)}) \cdot \frac{1}{1 - h_\theta(\vec{x}^{(i)})} \cdot \frac{\partial}{\partial \theta_j} (1 - h_\theta(\vec{x}^{(i)}))

Since θj(1hθ(x(i)))=θjhθ(x(i))\frac{\partial}{\partial \theta_j} (1 - h_\theta(\vec{x}^{(i)})) = -\frac{\partial}{\partial \theta_j} h_\theta(\vec{x}^{(i)}), we get:

=(1y(i))11hθ(x(i))(θjhθ(x(i)))= (1 - y^{(i)}) \cdot \frac{1}{1 - h_\theta(\vec{x}^{(i)})} \cdot \left( -\frac{\partial}{\partial \theta_j} h_\theta(\vec{x}^{(i)}) \right)

Combine:

$$\frac{\partial}{\partial \theta_j} = \frac{y^{(i)}}{h_\theta(\vec{x}^{(i)})} \cdot \frac{\partial h_\theta(\vec{x}^{(i)})}{\partial \theta_j} - \frac{1 - y^{(i)}}{1 - h_\theta(\vec{x}^{(i)})} \cdot \frac{\partial h_\theta(\vec{x}^{(i)})}{\partial \theta_j}$$

Factor out $\frac{\partial h_\theta(\vec{x}^{(i)})}{\partial \theta_j}$:

$$= \left( \frac{y^{(i)}}{h_\theta(\vec{x}^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(\vec{x}^{(i)})} \right) \cdot \frac{\partial h_\theta(\vec{x}^{(i)})}{\partial \theta_j}$$

Simplify the expression inside the parentheses:

$$\frac{y^{(i)}}{h_\theta(\vec{x}^{(i)})} - \frac{1 - y^{(i)}}{1 - h_\theta(\vec{x}^{(i)})} = \frac{y^{(i)} (1 - h_\theta(\vec{x}^{(i)})) - (1 - y^{(i)}) h_\theta(\vec{x}^{(i)})}{h_\theta(\vec{x}^{(i)}) (1 - h_\theta(\vec{x}^{(i)}))}$$

The numerator simplifies:

$$y^{(i)} (1 - h_\theta(\vec{x}^{(i)})) - (1 - y^{(i)}) h_\theta(\vec{x}^{(i)}) = y^{(i)} - y^{(i)} h_\theta(\vec{x}^{(i)}) - h_\theta(\vec{x}^{(i)}) + y^{(i)} h_\theta(\vec{x}^{(i)}) = y^{(i)} - h_\theta(\vec{x}^{(i)})$$

So:

$$\frac{y^{(i)} - h_\theta(\vec{x}^{(i)})}{h_\theta(\vec{x}^{(i)}) (1 - h_\theta(\vec{x}^{(i)}))}$$

Now substitute the sigmoid derivative $\frac{\partial h_\theta(\vec{x}^{(i)})}{\partial \theta_j} = h_\theta(\vec{x}^{(i)}) (1 - h_\theta(\vec{x}^{(i)})) x_j^{(i)}$:

$$\frac{\partial}{\partial \theta_j} = \frac{y^{(i)} - h_\theta(\vec{x}^{(i)})}{h_\theta(\vec{x}^{(i)}) (1 - h_\theta(\vec{x}^{(i)}))} \cdot h_\theta(\vec{x}^{(i)}) (1 - h_\theta(\vec{x}^{(i)})) x_j^{(i)}$$

The denominator cancels, leaving:

$$= (y^{(i)} - h_\theta(\vec{x}^{(i)})) x_j^{(i)}$$

Thus, the full gradient is:

$$\frac{\partial J(\vec{\theta})}{\partial \theta_j} = -\frac{1}{m} \sum_{i=1}^m (y^{(i)} - h_\theta(\vec{x}^{(i)})) x_j^{(i)}$$

Distributing the minus sign gives the equivalent form used in the gradient descent update:

$$\frac{\partial J(\vec{\theta})}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m (h_\theta(\vec{x}^{(i)}) - y^{(i)}) x_j^{(i)}$$
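As a sanity check on this derivation (not part of the original text), the analytic gradient can be compared against central finite differences of the cross-entropy loss on random data; all names here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(theta, X, y):
    # average binary cross-entropy J(theta)
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    # analytic gradient: (1/m) * X^T (h - y)
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

# central finite differences, one coordinate at a time
eps = 1e-6
num = np.array([
    (loss(theta + eps * e, X, y) - loss(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

print(np.max(np.abs(num - grad(theta, X, y))))  # prints a tiny number if the derivation is right
```

If the two gradients disagree, the derivation (or its implementation) has a sign or indexing error, which is exactly the kind of bug this check catches.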

Gradient Descent Update Rule

The update rule for each parameter $\theta_j$ is:

$$\theta_j := \theta_j - \alpha \cdot \frac{\partial J(\vec{\theta})}{\partial \theta_j} = \theta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^m (h_\theta(\vec{x}^{(i)}) - y^{(i)}) x_j^{(i)}$$

where $\alpha$ is the learning rate. All $\theta_j$ are updated simultaneously until convergence.

Similarity to Linear Regression

This update rule has the same form as linear regression’s gradient descent, but here $h_\theta(\vec{x}) = \sigma(\vec{\theta}^\intercal \vec{x})$ is the sigmoid of the linear score, not a linear function.


4. Interpretation of Weights

In logistic regression, the model predicts the log-odds:

$$\ln \left( \frac{P(y=1|\vec{x})}{P(y=0|\vec{x})} \right) = \vec{\theta}^\intercal \vec{x} = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$

The weight $\theta_j$ represents the change in the log-odds for a unit increase in feature $x_j$, holding the other features constant. Exponentiating $\theta_j$ gives the odds ratio:

$$e^{\theta_j} = \text{multiplicative change in the odds for a unit increase in } x_j$$

This is different from linear regression, where $\theta_j$ directly represents the change in the output $y$.
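To make the odds-ratio reading concrete, here is a tiny illustration with a made-up coefficient $\theta_j = 0.7$ and an arbitrary baseline log-odds:

```python
import numpy as np

# Hypothetical fitted weight for one feature (made up for illustration)
theta_j = 0.7

odds_ratio = np.exp(theta_j)
print(f"odds ratio: {odds_ratio:.3f}")  # each extra unit multiplies the odds by ~2.01

# Verify directly: increase x_j by 1 with everything else held fixed
log_odds_before = -1.0                 # theta_0 + the other terms (arbitrary value)
log_odds_after = log_odds_before + theta_j
ratio = np.exp(log_odds_after) / np.exp(log_odds_before)
print(np.isclose(ratio, odds_ratio))   # True
```

The ratio of the new odds to the old odds is exactly $e^{\theta_j}$, regardless of the baseline, which is why the odds ratio is the standard way to read logistic regression weights.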


5. Maximum A Posteriori (MAP) Estimation and Regularization

MAP Estimation

MLE maximizes the likelihood $P(\vec{y}|\vec{X}, \vec{\theta})$. MAP incorporates a prior distribution $P(\vec{\theta})$ and maximizes the posterior:

$$P(\vec{\theta}|\vec{X}, \vec{y}) \propto P(\vec{y}|\vec{X}, \vec{\theta}) \cdot P(\vec{\theta})$$

Taking the logarithm:

$$\vec{\theta}_{\text{MAP}} = \arg\max_{\vec{\theta}} \left[ \log P(\vec{y}|\vec{X}, \vec{\theta}) + \log P(\vec{\theta}) \right]$$

Gaussian Prior and L2 Regularization

Assume a Gaussian prior on $\vec{\theta}$: $P(\vec{\theta}) \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$. The log-prior is:

$$\log P(\vec{\theta}) \propto -\frac{1}{2\sigma^2} \|\vec{\theta}\|_2^2$$

The MAP objective becomes:

$$\vec{\theta}_{\text{MAP}} = \arg\max_{\vec{\theta}} \left[ \ell(\vec{\theta}) - \frac{\lambda}{2} \|\vec{\theta}\|_2^2 \right], \quad \lambda = \frac{1}{\sigma^2}$$

Equivalently, minimize:

$$J(\vec{\theta}) = -\frac{1}{m} \ell(\vec{\theta}) + \frac{\lambda}{2} \|\vec{\theta}\|_2^2$$

This is the cross-entropy loss plus an L2 regularization term, which penalizes large weights to prevent overfitting, similar to ridge regression.


1. MAP Estimation: Incorporating Prior Knowledge

Maximum Likelihood Estimation (MLE) focuses solely on maximizing the likelihood of the data given the model parameters, i.e., $P(\vec{y}|\vec{X}, \vec{\theta})$. However, MLE doesn’t account for prior beliefs about the parameters $\vec{\theta}$. MAP estimation improves on this by incorporating a prior distribution $P(\vec{\theta})$, which reflects what we believe about the parameters before seeing the data.

The posterior probability of the parameters given the data follows from Bayes’ theorem:

$$P(\vec{\theta}|\vec{X}, \vec{y}) = \frac{P(\vec{y}|\vec{X}, \vec{\theta}) \cdot P(\vec{\theta})}{P(\vec{y}|\vec{X})}$$

Since $P(\vec{y}|\vec{X})$ is a constant (it doesn’t depend on $\vec{\theta}$), we can maximize the unnormalized posterior:

$$P(\vec{\theta}|\vec{X}, \vec{y}) \propto P(\vec{y}|\vec{X}, \vec{\theta}) \cdot P(\vec{\theta})$$

To make optimization easier, we take the logarithm (the log is monotonic, so it preserves the maximizer):

$$\vec{\theta}_{\text{MAP}} = \arg\max_{\vec{\theta}} \left[ \log P(\vec{y}|\vec{X}, \vec{\theta}) + \log P(\vec{\theta}) \right]$$

Here:

  • $\log P(\vec{y}|\vec{X}, \vec{\theta})$ is the log-likelihood, the same term as in MLE.

  • $\log P(\vec{\theta})$ is the log-prior, which encodes our prior beliefs about $\vec{\theta}$.

The prior acts as a regularizer, guiding the solution toward parameter values that are more likely based on prior knowledge.

2. Gaussian Prior and L2 Regularization

A common choice for the prior is a Gaussian distribution centered at zero, $P(\vec{\theta}) \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$. This assumes that the parameters $\vec{\theta}$ are likely to be small (close to zero), which encourages simpler models and helps prevent overfitting.

The density of this Gaussian prior satisfies:

$$P(\vec{\theta}) \propto \exp\left( -\frac{1}{2\sigma^2} \|\vec{\theta}\|_2^2 \right)$$

Taking the logarithm:

$$\log P(\vec{\theta}) \propto -\frac{1}{2\sigma^2} \|\vec{\theta}\|_2^2$$

Now, substitute this into the MAP objective:

$$\vec{\theta}_{\text{MAP}} = \arg\max_{\vec{\theta}} \left[ \log P(\vec{y}|\vec{X}, \vec{\theta}) - \frac{1}{2\sigma^2} \|\vec{\theta}\|_2^2 \right]$$

Let’s define $\lambda = \frac{1}{\sigma^2}$, where $\lambda$ controls the strength of the prior (smaller $\sigma^2$ means a stronger belief that $\vec{\theta}$ should be close to zero). The objective becomes:

$$\vec{\theta}_{\text{MAP}} = \arg\max_{\vec{\theta}} \left[ \ell(\vec{\theta}) - \frac{\lambda}{2} \|\vec{\theta}\|_2^2 \right]$$

where $\ell(\vec{\theta}) = \log P(\vec{y}|\vec{X}, \vec{\theta})$ is the log-likelihood.

To turn this into a minimization problem (the usual convention in optimization), we negate the objective and scale the log-likelihood by $\frac{1}{m}$ (where $m$ is the number of data points) so the data term is an average over examples:

$$J(\vec{\theta}) = -\frac{1}{m} \ell(\vec{\theta}) + \frac{\lambda}{2} \|\vec{\theta}\|_2^2$$

This is the loss function (cross-entropy loss for classification) plus an L2 regularization term $\frac{\lambda}{2} \|\vec{\theta}\|_2^2$. The L2 term penalizes large values of $\vec{\theta}$, encouraging smaller weights, which leads to simpler models and reduces overfitting. This is mathematically equivalent to ridge regression in linear regression.

3. Why L2 Regularization Works

  • Preventing Overfitting: Large weights can lead to overly complex models that fit noise in the training data. The L2 penalty discourages large weights, favoring smoother, more generalizable models.

  • Connection to the Gaussian Prior: The L2 term arises naturally from the Gaussian prior, which assumes parameters are likely to be small. The strength of regularization $\lambda$ is inversely proportional to the prior variance $\sigma^2$.

  • Geometric Interpretation: The L2 term pulls the solution toward the origin in parameter space, balancing the trade-off between fitting the data (likelihood) and keeping parameters small (prior).


Small Example: Logistic Regression with MAP and L2 Regularization

Let’s apply MAP estimation with a Gaussian prior to a logistic regression problem.

Problem Setup

Suppose we have a binary classification dataset with $m = 2$ data points:

  • $\vec{X} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$, $\vec{y} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$.

  • Model parameters: $\vec{\theta} = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}$.

  • We assume a Gaussian prior $P(\vec{\theta}) \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$ with $\sigma^2 = 1$, so $\lambda = \frac{1}{\sigma^2} = 1$.

Step 1: Log-Likelihood for Logistic Regression

For logistic regression, the likelihood is:

$$P(\vec{y}|\vec{X}, \vec{\theta}) = \prod_{i=1}^m \sigma(\vec{x}_i^T \vec{\theta})^{y_i} (1 - \sigma(\vec{x}_i^T \vec{\theta}))^{1 - y_i}$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

The log-likelihood is:

$$\ell(\vec{\theta}) = \sum_{i=1}^m \left[ y_i \log \sigma(\vec{x}_i^T \vec{\theta}) + (1 - y_i) \log (1 - \sigma(\vec{x}_i^T \vec{\theta})) \right]$$

For our data:

  • Point 1: $\vec{x}_1 = [1, 2]$, $y_1 = 0$, so $\vec{x}_1^T \vec{\theta} = \theta_1 + 2\theta_2$.

  • Point 2: $\vec{x}_2 = [3, 4]$, $y_2 = 1$, so $\vec{x}_2^T \vec{\theta} = 3\theta_1 + 4\theta_2$.

The log-likelihood is:

$$\ell(\vec{\theta}) = \log (1 - \sigma(\theta_1 + 2\theta_2)) + \log \sigma(3\theta_1 + 4\theta_2)$$

Step 2: Gaussian Prior

The log-prior with $\sigma^2 = 1$ is:

$$\log P(\vec{\theta}) \propto -\frac{1}{2} \|\vec{\theta}\|_2^2 = -\frac{1}{2} (\theta_1^2 + \theta_2^2)$$

Step 3: MAP Objective

The MAP objective is:

$$\vec{\theta}_{\text{MAP}} = \arg\max_{\vec{\theta}} \left[ \ell(\vec{\theta}) - \frac{1}{2} (\theta_1^2 + \theta_2^2) \right]$$

Equivalently, minimize the loss function (with $m = 2$):

$$J(\vec{\theta}) = -\frac{1}{2} \ell(\vec{\theta}) + \frac{1}{2} (\theta_1^2 + \theta_2^2)$$

Step 4: Interpretation

  • The term $-\frac{1}{2} \ell(\vec{\theta})$ is the average negative log-likelihood (cross-entropy loss).

  • The term $\frac{1}{2} (\theta_1^2 + \theta_2^2)$ is the L2 regularization term with $\lambda = 1$.

  • To find $\vec{\theta}_{\text{MAP}}$, we’d typically use gradient descent or another optimization method to minimize $J(\vec{\theta})$.

Step 5: Numerical Intuition

Suppose we test $\vec{\theta} = [0.5, 0.5]$:

  • Compute $\vec{x}_1^T \vec{\theta} = 1 \cdot 0.5 + 2 \cdot 0.5 = 1.5$, so $\sigma(1.5) \approx 0.818$.

  • Compute $\vec{x}_2^T \vec{\theta} = 3 \cdot 0.5 + 4 \cdot 0.5 = 3.5$, so $\sigma(3.5) \approx 0.971$.

  • Log-likelihood: $\ell(\vec{\theta}) \approx \log(1 - 0.818) + \log(0.971) = \log(0.182) + \log(0.971) \approx -1.73$.

  • L2 term: $\frac{1}{2} (0.5^2 + 0.5^2) = \frac{1}{2} \cdot 0.5 = 0.25$.

  • Loss: $J(\vec{\theta}) \approx -\frac{1}{2} \cdot (-1.73) + 0.25 \approx 0.87 + 0.25 = 1.12$.

The optimizer would adjust $\vec{\theta}$ to minimize $J(\vec{\theta})$, balancing the fit to the data (log-likelihood) against the penalty on large weights (L2 term).
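The arithmetic above can be verified in a few lines of NumPy (a standalone sketch, using the same data and $\vec{\theta} = [0.5, 0.5]$):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0.0, 1.0])
theta = np.array([0.5, 0.5])
lam = 1.0

p = sigmoid(X @ theta)                                   # [0.818, 0.971]
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
l2 = 0.5 * lam * np.sum(theta ** 2)
J = -log_lik / len(y) + l2

print(round(log_lik, 3), round(J, 3))  # -1.731 1.116
```

At full precision the loss is about 1.116, matching the hand computation to rounding.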

6. MLE and MAP in Context

Maximum Likelihood Estimation (MLE)

  • Goal: Find $\vec{\theta}$ that maximizes the likelihood $P(\vec{y}|\vec{X}, \vec{\theta})$.

  • Process: Define the likelihood, take the log, and optimize.

  • Example: For a coin with $k$ heads in $n$ flips, the likelihood is $L(p) = p^k (1-p)^{n-k}$. The MLE is $\hat{p} = \frac{k}{n}$.

Maximum A Posteriori (MAP)

  • Goal: Incorporate prior beliefs $P(\vec{\theta})$ to maximize the posterior.

  • Process: Combine log-likelihood and log-prior, then optimize.

  • Example: If you believe a coin is fair (a $\text{Beta}$ prior on $p$), MAP adjusts the estimate based on both the data and the prior.

MLE vs. MAP

  • MLE: Data-driven, no assumptions about $\vec{\theta}$.

  • MAP: Incorporates prior knowledge, leading to regularization in practice.


7. Example: Biased Coin

Consider a dataset of coin flips: $\mathcal{D} = \{H, H, T, H, T\}$ (3 heads, 2 tails).

Model

  • $P_\theta(H) = \theta$

  • $P_\theta(T) = 1 - \theta$

Likelihood

$$L(\theta) = \theta^3 (1 - \theta)^2$$

Log-Likelihood

$$\ell(\theta) = 3 \log \theta + 2 \log (1 - \theta)$$

MLE Solution

Differentiate and set to zero:

$$\frac{\partial \ell}{\partial \theta} = \frac{3}{\theta} - \frac{2}{1 - \theta} = 0$$

$$\frac{3}{\theta} = \frac{2}{1 - \theta} \implies 3 (1 - \theta) = 2\theta \implies 3 - 3\theta = 2\theta \implies 3 = 5\theta \implies \theta = \frac{3}{5} = 0.6$$

The MLE estimate is $\theta = 0.6$, matching the proportion of heads.


8. Conditional Maximum Likelihood

Logistic regression models the conditional probability $P(y|\vec{x}, \vec{\theta})$. The log-likelihood is:

$$\ell(\vec{\theta}) = \sum_{i=1}^m \log P(y^{(i)}|\vec{x}^{(i)}, \vec{\theta})$$

For logistic regression:

$$P(y=1|\vec{x}, \vec{\theta}) = \sigma(\vec{\theta}^\intercal \vec{x}), \quad P(y=0|\vec{x}, \vec{\theta}) = 1 - \sigma(\vec{\theta}^\intercal \vec{x})$$

Maximizing $\ell(\vec{\theta})$ pushes predictions toward 1 for $y=1$ and toward 0 for $y=0$.


import numpy as np
import matplotlib.pyplot as plt

# our dataset is {H, H, T, H, T}; if theta = P(x=H), we get:
coin_likelihood = lambda theta: theta*theta*(1-theta)*theta*(1-theta)

theta_vals = np.linspace(0, 1)
plt.ylabel('Likelihood of the Data'); plt.xlabel(r'Parameter $\theta$')
plt.scatter([0.6], [coin_likelihood(0.6)])  # mark the MLE at theta = 0.6
plt.plot(theta_vals, coin_likelihood(theta_vals));

def f(X, theta):
    """Sigmoid model: predicted probabilities for the design matrix X."""
    return 1 / (1 + np.exp(-X.dot(theta)))

def log_likelihood(theta, X, y):
    """The average log-likelihood, measuring the goodness of fit.

    We added the 1e-6 term in order to avoid overflow (inf and -inf).

    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional design matrix
    y (np.array): n-dimensional vector of targets
    """
    return (y*np.log(f(X, theta) + 1e-6) + (1-y)*np.log(1-f(X, theta) + 1e-6)).mean()

def cross_entropy(yHat, y):
    """Per-example log loss given the true label y (1 or 0)."""
    if y == 1:
        return -np.log(yHat)
    else:
        return -np.log(1 - yHat)

fig_scale = 1  # figure-size multiplier
fig, ax = plt.subplots(figsize=(6*fig_scale, 2*fig_scale))
x = np.linspace(1e-3, 1 - 1e-3, 100)  # stay off 0 and 1 to avoid log(0)

ax.plot(x, cross_entropy(x, 1), lw=2*fig_scale, c='b', label='true label = 1', linestyle='-')
ax.plot(x, cross_entropy(x, 0), lw=2*fig_scale, c='r', label='true label = 0', linestyle='-')
ax.set_xlabel(r"Predicted probability $\hat{y}$")
ax.set_ylabel("Log loss")
plt.grid()
plt.legend();

Logistic Regression from Scratch

Below is a Python implementation of logistic regression using NumPy, including gradient descent optimization and L2 regularization.

Explanation of Code:

  • Sigmoid Function: Implements $\sigma(z) = \frac{1}{1 + e^{-z}}$.
  • Loss Function: Computes binary cross-entropy plus L2 regularization.
  • Gradient Descent: Updates weights and bias using the gradients of the loss.
  • Prediction: Outputs probabilities (via predict_proba) or class labels (via predict).
  • Example: Generates synthetic 2D data, trains the model, and plots the loss curve.

Notes:

  • The implementation includes L2 regularization (controlled by lambda_reg), which corresponds to the Gaussian prior in MAP estimation.
  • A small constant (1e-15) is added inside the logarithms to avoid numerical issues with $\log(0)$.
  • The synthetic dataset is linearly separable for simplicity, but the model can handle non-separable data.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from scipy.optimize import minimize

df = pd.read_csv('ex2data1.csv', header=None)
df.rename(columns={0: 'exam1', 1: 'exam2', 2: 'y'}, inplace=True)
df.head()
fig = plt.figure(figsize=(12, 8))
plt.scatter(df[df['y'] == 0]['exam1'], df[df['y'] == 0]['exam2'],
            label='Not admitted', color='yellow', edgecolor='black')
plt.scatter(df[df['y'] == 1]['exam1'], df[df['y'] == 1]['exam2'],
            label='Admitted', marker='+', color='black')
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend(loc='upper right')
plt.title('Scores indicating admission')
plt.show()
import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000, lambda_reg=0.01):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.lambda_reg = lambda_reg  # Regularization strength
        self.weights = None
        self.bias = None
        self.losses = []

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def compute_loss(self, X, y, y_pred):
        # Binary cross-entropy loss + L2 regularization
        N = len(y)
        cross_entropy = -np.mean(y * np.log(y_pred + 1e-15) + (1 - y) * np.log(1 - y_pred + 1e-15))
        l2_penalty = (self.lambda_reg / (2 * N)) * np.sum(self.weights ** 2)
        return cross_entropy + l2_penalty

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize weights and bias
        self.weights = np.zeros(n_features)
        self.bias = 0

        # Gradient descent
        for _ in range(self.n_iterations):
            # Forward pass
            linear_model = np.dot(X, self.weights) + self.bias
            y_pred = self.sigmoid(linear_model)

            # Compute loss
            loss = self.compute_loss(X, y, y_pred)
            self.losses.append(loss)

            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y)) + (self.lambda_reg / n_samples) * self.weights
            db = (1 / n_samples) * np.sum(y_pred - y)

            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict_proba(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        return self.sigmoid(linear_model)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)

# Example usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(0)
    X = np.random.randn(100, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Train model
    model = LogisticRegression(learning_rate=0.1, n_iterations=1000, lambda_reg=0.1)
    model.fit(X, y)

    # Predict
    y_pred = model.predict(X)
    accuracy = np.mean(y_pred == y)
    print(f"Accuracy: {accuracy:.4f}")

    # Plot loss curve
    import matplotlib.pyplot as plt
    plt.plot(model.losses)
    plt.xlabel("Iteration")
    plt.ylabel("Loss")
    plt.title("Loss Curve")
    plt.show()
Accuracy: 0.9900

Softmax Regression Explained

Softmax regression (also known as multinomial logistic regression) is a multi-class classification algorithm that generalizes logistic regression to handle more than two classes. It predicts the probability that an input belongs to each of $K$ possible classes, making it suitable for problems like classifying images into multiple categories (e.g., digits 0–9 in MNIST).

Key Components

  1. Score Computation: Computes a score for each class based on a linear combination of input features and class-specific parameters.

  2. Softmax Function: Converts scores into probabilities that sum to 1, representing the likelihood of each class.

  3. Optimization: Trains the model by maximizing the likelihood of the observed data using a loss function (negative log-likelihood).

Despite the name “regression,” softmax regression is used for classification, as it outputs class probabilities.


1. Model Formulation

Softmax regression maps an input $\vec{x} \in \mathbb{R}^d$ to a probability distribution over $K$ classes, producing a vector of probabilities:

$$f_\theta(\vec{x}) \in [0, 1]^K, \quad \sum_{k=1}^K f_\theta(\vec{x})_k = 1$$

Parameters

  • Parameters: $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$, where each $\theta_k \in \mathbb{R}^d$ is the parameter vector for class $k$.

  • Parameter Space: $\Theta = \mathbb{R}^{K \times d}$, a matrix where each row $\theta_k^\top$ corresponds to a class.

  • Input: $\vec{x} \in \mathbb{R}^d$, the feature vector (often including a bias term by appending a 1).

Step 1: Score Computation

For each class $k = 1, \ldots, K$, compute a score (or logit):

$$z_k = \theta_k^\top \vec{x} = \theta_{k1} x_1 + \theta_{k2} x_2 + \cdots + \theta_{kd} x_d$$

The vector of scores is:

$$\vec{z} = (z_1, z_2, \ldots, z_K), \quad z_k = \theta_k^\top \vec{x}$$

Step 2: Softmax Function

The scores are converted to probabilities using the softmax function, which ensures the outputs are positive and sum to 1:

$$\vec{\sigma}(\vec{z})_k = \frac{e^{z_k}}{\sum_{l=1}^K e^{z_l}}$$

For a given input $\vec{x}$, the probability of class $k$ is:

$$P(y = k | \vec{x}, \theta) = f_\theta(\vec{x})_k = \frac{e^{\theta_k^\top \vec{x}}}{\sum_{l=1}^K e^{\theta_l^\top \vec{x}}}$$

Intuition

  • Exponentiation: $e^{z_k}$ ensures all scores are positive, emphasizing larger scores.

  • Normalization: Dividing by the sum $\sum_{l=1}^K e^{z_l}$ makes the probabilities sum to 1.

  • Competition: Classes with higher scores $z_k$ get higher probabilities, but the softmax spreads a smooth distribution across all classes.

Identifiability Note

The model is over-parametrized. Adding a constant vector $\vec{c}$ to every $\theta_k$ (i.e., $\theta_k \to \theta_k + \vec{c}$) doesn’t change the probabilities because:

$$e^{(\theta_k + \vec{c})^\top \vec{x}} = e^{\theta_k^\top \vec{x}} \cdot e^{\vec{c}^\top \vec{x}}$$

The common factor $e^{\vec{c}^\top \vec{x}}$ cancels between the numerator and denominator of the softmax. To address this, we can fix one parameter vector (e.g., $\theta_1 = \vec{0}$) without loss of generality, reducing the parameter space.


Deeper Explanation of the Softmax Function

1. What is the Softmax Function?

The softmax function is a mathematical operation used to transform a vector of real-valued scores (often called logits) into a probability distribution. It is widely used in multi-class classification problems, where the goal is to assign an input to one of $K$ possible classes.

Given a vector of scores $\vec{z} = [z_1, z_2, \dots, z_K] \in \mathbb{R}^K$, the softmax function computes the probability of each class $k$ as follows:

$$\vec{\sigma}(\vec{z})_k = \frac{e^{z_k}}{\sum_{l=1}^K e^{z_l}}$$

Here:

  • $e^{z_k}$: The exponential function ensures that the output for each class is positive, since $e^{z_k} > 0$ for any real $z_k$.

  • $\sum_{l=1}^K e^{z_l}$: The denominator normalizes the outputs by summing the exponentials of all scores, ensuring that the resulting probabilities sum to 1:

$$\sum_{k=1}^K \vec{\sigma}(\vec{z})_k = 1$$

The output $\vec{\sigma}(\vec{z})_k$ represents the probability of class $k$, and the vector $\vec{\sigma}(\vec{z}) = [\vec{\sigma}(\vec{z})_1, \vec{\sigma}(\vec{z})_2, \dots, \vec{\sigma}(\vec{z})_K]$ is a valid probability distribution.

2. Why Use the Softmax Function?

The softmax function is ideal for multi-class classification because:

  • Positivity: The exponential ensures all outputs are positive, which is necessary for probabilities.

  • Normalization: The denominator ensures the outputs sum to 1, making them interpretable as probabilities.

  • Amplifies Differences: The exponential function exaggerates differences between scores. A larger $z_k$ leads to a much larger $e^{z_k}$, assigning higher probability to the corresponding class.

  • Differentiability: The softmax function is smooth and differentiable, which is crucial for optimization in machine learning (e.g., gradient-based methods like gradient descent).

3. Softmax in Multi-Class Classification

In a multi-class classification model (e.g., logistic regression for multiple classes or a neural network), the model computes a score for each class based on the input $\vec{x}$ and model parameters $\theta$. For class $k$, the score (or logit) is often computed as a linear combination:

$$z_k = \theta_k^\top \vec{x}$$

where:

  • $\vec{x} \in \mathbb{R}^d$ is the input feature vector.

  • $\theta_k \in \mathbb{R}^d$ is the parameter vector (e.g., weights) for class $k$.

  • $\theta = [\theta_1, \theta_2, \dots, \theta_K]$ is the collection of parameters for all $K$ classes.

The scores $\vec{z} = [z_1, z_2, \dots, z_K]$ are then passed through the softmax function to obtain the probability of each class:

$$P(y = k | \vec{x}, \theta) = f_\theta(\vec{x})_k = \frac{e^{\theta_k^\top \vec{x}}}{\sum_{l=1}^K e^{\theta_l^\top \vec{x}}}$$

This probability represents the model’s confidence that the input $\vec{x}$ belongs to class $k$. The predicted class is typically the one with the highest probability:

$$\hat{y} = \arg\max_k P(y = k | \vec{x}, \theta)$$

4. Properties of Softmax

  • Invariance to Constant Shifts: Adding a constant $c$ to all scores $z_k$ doesn’t change the output probabilities, because:

$$\frac{e^{z_k + c}}{\sum_{l=1}^K e^{z_l + c}} = \frac{e^{z_k} \cdot e^{c}}{\sum_{l=1}^K e^{z_l} \cdot e^{c}} = \frac{e^{z_k}}{\sum_{l=1}^K e^{z_l}}$$

This property is useful numerically: subtracting the maximum score $z_{\text{max}}$ from all scores prevents overflow when computing the exponentials:

$$\vec{\sigma}(\vec{z})_k = \frac{e^{z_k - z_{\text{max}}}}{\sum_{l=1}^K e^{z_l - z_{\text{max}}}}$$

  • Sensitivity to Score Differences: The exponential amplifies differences between scores. If $z_k$ is much larger than the others, $\vec{\sigma}(\vec{z})_k$ will be close to 1, and the others will be close to 0.


Small Example: Applying the Softmax Function

Let’s walk through a concrete example to illustrate how the softmax function works in a multi-class classification setting.

Problem Setup

Suppose we have a classification problem with $K = 3$ classes (e.g., classifying an image as a cat, dog, or bird). For a given input $\vec{x}$, a model computes scores for each class based on parameters $\theta$. Let’s assume:

  • Input: $\vec{x} = [1, 2]$ (e.g., features extracted from an image).

  • Parameters for each class:

    • Class 1 (cat): $\theta_1 = [0.5, 1.0]$

    • Class 2 (dog): $\theta_2 = [1.0, 0.5]$

    • Class 3 (bird): $\theta_3 = [-0.5, 0.5]$

Step 1: Compute Scores (Logits)

Calculate the score for each class using $z_k = \theta_k^\top \vec{x}$:

  • Class 1: $z_1 = \theta_1^\top \vec{x} = 0.5 \cdot 1 + 1.0 \cdot 2 = 0.5 + 2.0 = 2.5$

  • Class 2: $z_2 = \theta_2^\top \vec{x} = 1.0 \cdot 1 + 0.5 \cdot 2 = 1.0 + 1.0 = 2.0$

  • Class 3: $z_3 = \theta_3^\top \vec{x} = -0.5 \cdot 1 + 0.5 \cdot 2 = -0.5 + 1.0 = 0.5$

So, the score vector is:

$$\vec{z} = [2.5, 2.0, 0.5]$$

Step 2: Apply the Softmax Function

Compute the softmax probabilities:

$$\vec{\sigma}(\vec{z})_k = \frac{e^{z_k}}{\sum_{l=1}^3 e^{z_l}}$$

First, calculate the exponentials:

  • $e^{z_1} = e^{2.5} \approx 12.182$

  • $e^{z_2} = e^{2.0} \approx 7.389$

  • $e^{z_3} = e^{0.5} \approx 1.649$

Sum the exponentials:

$$\sum_{l=1}^3 e^{z_l} \approx 12.182 + 7.389 + 1.649 = 21.220$$

Now compute the probabilities:

  • Class 1: $\vec{\sigma}(\vec{z})_1 = \frac{e^{2.5}}{21.220} \approx \frac{12.182}{21.220} \approx 0.574$

  • Class 2: $\vec{\sigma}(\vec{z})_2 = \frac{e^{2.0}}{21.220} \approx \frac{7.389}{21.220} \approx 0.348$

  • Class 3: $\vec{\sigma}(\vec{z})_3 = \frac{e^{0.5}}{21.220} \approx \frac{1.649}{21.220} \approx 0.078$

The probability distribution is:

$$P(y = \text{cat} | \vec{x}, \theta) \approx 0.574, \quad P(y = \text{dog} | \vec{x}, \theta) \approx 0.348, \quad P(y = \text{bird} | \vec{x}, \theta) \approx 0.078$$

Step 3: Interpretation

  • The model assigns the highest probability (57.4%) to the “cat” class, so the predicted class is “cat”:

$$\hat{y} = \arg\max_k P(y = k | \vec{x}, \theta) = \text{cat}$$

  • The probabilities sum to 1: $0.574 + 0.348 + 0.078 \approx 1.000$.

  • The exponential in the softmax amplifies the effect of the highest score ($z_1 = 2.5$), giving the “cat” class a significantly higher probability than the others.

Step 4: Numerical Stability (Optional)

To avoid numerical overflow (e.g., if the scores are very large), we can subtract the maximum score ($z_{\text{max}} = 2.5$) from all scores:

  • Adjusted scores: $[2.5 - 2.5, 2.0 - 2.5, 0.5 - 2.5] = [0, -0.5, -2.0]$

  • Exponentials: $e^{0} = 1$, $e^{-0.5} \approx 0.607$, $e^{-2.0} \approx 0.135$

  • Sum: $1 + 0.607 + 0.135 \approx 1.742$

  • Probabilities:

    • Class 1: $\frac{1}{1.742} \approx 0.574$

    • Class 2: $\frac{0.607}{1.742} \approx 0.348$

    • Class 3: $\frac{0.135}{1.742} \approx 0.078$

The results are identical, but the computation is more stable.
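The worked example above can be reproduced in a few lines of NumPy; this sketch (the function name `softmax` is just illustrative) uses the max-subtraction trick:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability (invariance to constant shifts)
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.5, 2.0, 0.5])   # scores for cat, dog, bird
p = softmax(z)
print(np.round(p, 3))            # [0.574 0.348 0.078]
print(round(p.sum(), 6))         # 1.0

# The shifted and unshifted inputs agree even for huge scores:
print(np.allclose(softmax(z), softmax(z + 1000)))  # True
```

Without the max subtraction, `np.exp(1000)` would overflow to infinity; with it, the same probabilities come out for any constant shift of the scores.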

2. Probabilistic Interpretation

Softmax regression models the conditional probability of the class label $y \in \{1, 2, \ldots, K\}$ given the input $\vec{x}$:

$$P(y = k | \vec{x}, \theta) = \frac{e^{\theta_k^\top \vec{x}}}{\sum_{l=1}^K e^{\theta_l^\top \vec{x}}}$$

This defines a categorical distribution over the $K$ classes. The model assigns higher probabilities to classes whose parameter vectors $\theta_k$ are more aligned with the input $\vec{x}$ (i.e., larger $\theta_k^\top \vec{x}$).

Decision Rule

To classify an input, choose the class with the highest probability:

$$\hat{y} = \arg\max_{k} P(y = k | \vec{x}, \theta) = \arg\max_{k} \theta_k^\top \vec{x}$$

Since the softmax is monotonically increasing in each score, this is equivalent to picking the class with the highest score $z_k$.


3. Learning Objective: Maximum Likelihood Estimation (MLE)

The goal is to find the parameters $\theta$ that maximize the likelihood of the observed data.

Dataset

Consider a dataset:

$$\mathcal{D} = \{(\vec{x}^{(i)}, y^{(i)})\}_{i=1}^n$$

where:

  • $\vec{x}^{(i)} \in \mathbb{R}^d$: Feature vector for the $i$-th instance.

  • $y^{(i)} \in \{1, 2, \ldots, K\}$: True class label.

Likelihood Function

The probability of observing label $y^{(i)}$ given $\vec{x}^{(i)}$ is:

$$P(y^{(i)} | \vec{x}^{(i)}, \theta) = \frac{e^{\theta_{y^{(i)}}^\top \vec{x}^{(i)}}}{\sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}}}$$

Assuming the instances are independent, the likelihood is:

$$L(\theta) = \prod_{i=1}^n P(y^{(i)} | \vec{x}^{(i)}, \theta) = \prod_{i=1}^n \frac{e^{\theta_{y^{(i)}}^\top \vec{x}^{(i)}}}{\sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}}}$$

Log-Likelihood

To simplify optimization, take the logarithm:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log P(y^{(i)} | \vec{x}^{(i)}, \theta)$$

Substitute the probability:

$$\ell(\theta) = \sum_{i=1}^n \left[ \log e^{\theta_{y^{(i)}}^\top \vec{x}^{(i)}} - \log \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}} \right] = \sum_{i=1}^n \left[ \theta_{y^{(i)}}^\top \vec{x}^{(i)} - \log \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}} \right]$$

Cost Function

In practice, we minimize the negative log-likelihood (cross-entropy loss), averaged over the dataset:

$$J(\theta) = -\frac{1}{n} \ell(\theta) = -\frac{1}{n} \sum_{i=1}^n \left[ \theta_{y^{(i)}}^\top \vec{x}^{(i)} - \log \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}} \right]$$

This is the multinomial cross-entropy loss. Minimizing $J(\theta)$ is equivalent to maximizing the likelihood.
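This loss is straightforward to compute with the log-sum-exp trick. The sketch below (the names and the tiny dataset are made up for illustration) evaluates $J(\theta)$ for all-zero parameters, where every class gets probability $1/K$ and the loss is therefore $\log K$:

```python
import numpy as np

def log_softmax(Z):
    # log of the softmax, stabilized by subtracting the row-wise max
    Z = Z - Z.max(axis=1, keepdims=True)
    return Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))

def cross_entropy_loss(Theta, X, y):
    # J(Theta) = -(1/n) sum_i [ theta_{y_i}^T x_i - log sum_l exp(theta_l^T x_i) ]
    Z = X @ Theta.T                      # (n, K) matrix of scores z_k = theta_k^T x
    logp = log_softmax(Z)                # (n, K) log-probabilities
    n = len(y)
    return -logp[np.arange(n), y].mean()

# Tiny made-up dataset: n = 4 points, d = 2 features, K = 3 classes
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 1.0]])
y = np.array([0, 2, 1, 0])
Theta = np.zeros((3, 2))                 # uniform predictions: p = 1/3 for each class

print(cross_entropy_loss(Theta, X, y))  # log(3) ≈ 1.0986
```

Indexing `logp[np.arange(n), y]` picks out the log-probability of each instance's true class, which is exactly the bracketed term in the cost above.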


4. Gradient of the Cost Function

To optimize J(θ)J(\theta) using gradient descent, we need the gradient with respect to θk\theta_k for each class kk.

Gradient Derivation

The cost function is:

$$J(\theta) = -\frac{1}{n} \sum_{i=1}^n \left[ \theta_{y^{(i)}}^\top \vec{x}^{(i)} - \log \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}} \right]$$

Define the loss for a single instance $i$:

$$L^{(i)} = \theta_{y^{(i)}}^\top \vec{x}^{(i)} - \log \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}}$$

We compute the partial derivative with respect to $\theta_m$ (the parameters for class $m$):

$$\frac{\partial J(\theta)}{\partial \theta_m} = -\frac{1}{n} \sum_{i=1}^n \frac{\partial L^{(i)}}{\partial \theta_m}$$

For a single instance, differentiate:

$$\frac{\partial L^{(i)}}{\partial \theta_m} = \frac{\partial}{\partial \theta_m} \left[ \theta_{y^{(i)}}^\top \vec{x}^{(i)} - \log \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}} \right]$$

  • First term: $\theta_{y^{(i)}}^\top \vec{x}^{(i)}$

This term depends on $\theta_m$ only when $m = y^{(i)}$, since only $\theta_{y^{(i)}}$ appears in it:

$$\frac{\partial}{\partial \theta_m} \left( \theta_{y^{(i)}}^\top \vec{x}^{(i)} \right) = \begin{cases} \vec{x}^{(i)}, & \text{if } m = y^{(i)} \\ 0, & \text{otherwise} \end{cases}$$

Using an indicator function, this is:

$$\mathbb{1}\{m = y^{(i)}\} \cdot \vec{x}^{(i)}$$
  • Second term: $-\log \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}}$

Let $Z = \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}}$. Then:

$$\frac{\partial}{\partial \theta_m} \log Z = \frac{1}{Z} \cdot \frac{\partial Z}{\partial \theta_m}$$

Compute:

$$\frac{\partial Z}{\partial \theta_m} = \frac{\partial}{\partial \theta_m} \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}} = e^{\theta_m^\top \vec{x}^{(i)}} \cdot \vec{x}^{(i)}$$

So:

$$\frac{\partial}{\partial \theta_m} \log Z = \frac{e^{\theta_m^\top \vec{x}^{(i)}} \cdot \vec{x}^{(i)}}{Z} = P(y = m | \vec{x}^{(i)}, \theta) \cdot \vec{x}^{(i)}$$

Since $P(y = m | \vec{x}^{(i)}, \theta) = \frac{e^{\theta_m^\top \vec{x}^{(i)}}}{Z}$, the derivative of the second term is:

$$-\frac{\partial}{\partial \theta_m} \log Z = -P(y = m | \vec{x}^{(i)}, \theta) \cdot \vec{x}^{(i)}$$

Combine:

$$\frac{\partial L^{(i)}}{\partial \theta_m} = \mathbb{1}\{m = y^{(i)}\} \cdot \vec{x}^{(i)} - P(y = m | \vec{x}^{(i)}, \theta) \cdot \vec{x}^{(i)}$$

The full gradient is:

$$\frac{\partial J(\theta)}{\partial \theta_m} = -\frac{1}{n} \sum_{i=1}^n \left[ \mathbb{1}\{m = y^{(i)}\} - P(y = m | \vec{x}^{(i)}, \theta) \right] \vec{x}^{(i)}$$

Intuition

  • If $m = y^{(i)}$, then $\mathbb{1}\{m = y^{(i)}\} = 1$, and the gradient pushes $P(y = m | \vec{x}^{(i)}, \theta)$ toward 1.

  • If $m \neq y^{(i)}$, the gradient pushes $P(y = m | \vec{x}^{(i)}, \theta)$ toward 0.

  • The difference $\mathbb{1}\{m = y^{(i)}\} - P(y = m | \vec{x}^{(i)}, \theta)$ measures the error in the predicted probability.

Gradient Descent Update

Update each $\theta_m$ using:

$$\theta_m := \theta_m - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_m} = \theta_m + \alpha \cdot \frac{1}{n} \sum_{i=1}^n \left[ \mathbb{1}\{m = y^{(i)}\} - P(y = m | \vec{x}^{(i)}, \theta) \right] \vec{x}^{(i)}$$

where $\alpha$ is the learning rate.
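The update rule can be sketched in a few lines of NumPy. This is a minimal illustration, not a production trainer; the synthetic data, learning rate, and step count are arbitrary choices, and labels are one-hot encoded so the indicator term becomes a matrix:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gradient_step(Theta, X, y, lr=0.1):
    """One update: theta_m += lr * mean[(1{m=y} - P(y=m|x)) x] for every class m."""
    n, K = X.shape[0], Theta.shape[1]
    P = softmax(X @ Theta)            # (n, K) predicted probabilities
    Y = np.eye(K)[y]                  # one-hot rows encode the indicator 1{m = y^(i)}
    grad = -(X.T @ (Y - P)) / n       # dJ/dTheta, shape (d, K)
    return Theta - lr * grad

# toy linearly separable 2-class problem through the origin
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Theta = np.zeros((2, 2))
for _ in range(200):
    Theta = gradient_step(Theta, X, y)
acc = (softmax(X @ Theta).argmax(axis=1) == y).mean()
print(f"training accuracy: {acc:.2f}")
```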


5. Connection to Logistic Regression

Softmax regression generalizes logistic regression:

  • For $K = 2$, softmax regression reduces to logistic regression.

  • Let class 1 have probability $p = \frac{\exp(\theta_1^\top \vec{x})}{\exp(\theta_1^\top \vec{x}) + \exp(\theta_2^\top \vec{x})}$. Set $\theta = \theta_1 - \theta_2$; then:

$$p = \frac{\exp(\theta_1^\top \vec{x})}{\exp(\theta_1^\top \vec{x}) + \exp(\theta_2^\top \vec{x})} = \frac{1}{1 + \exp(-(\theta_1^\top \vec{x} - \theta_2^\top \vec{x}))} = \sigma(\theta^\top \vec{x})$$

This matches the logistic regression sigmoid function.
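The equivalence is easy to confirm numerically. A short sketch (the parameter vectors and input are arbitrary illustrative values):

```python
import numpy as np

theta1 = np.array([0.5, 1.0])   # illustrative parameter vectors
theta2 = np.array([1.0, 0.5])
x = np.array([2.0, -1.0])

s1, s2 = theta1 @ x, theta2 @ x
p_softmax = np.exp(s1) / (np.exp(s1) + np.exp(s2))          # two-class softmax
p_sigmoid = 1.0 / (1.0 + np.exp(-((theta1 - theta2) @ x)))  # sigmoid of the score difference
print(p_softmax, p_sigmoid)   # identical up to floating point
```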


Note: $e^\theta$ and $\exp(\theta)$ denote the same thing.

Starting with:

$$p = \frac{e^{\theta_1^\top \vec{x}}}{e^{\theta_1^\top \vec{x}} + e^{\theta_2^\top \vec{x}}}$$

the derivation proceeds through the following algebraic manipulations:

  1. Divide numerator and denominator by $e^{\theta_1^\top \vec{x}}$:

    $$p = \frac{1}{1 + e^{\theta_2^\top \vec{x} - \theta_1^\top \vec{x}}}$$

  2. Introduce $\theta = \theta_1 - \theta_2$: notice that $\theta_2^\top \vec{x} - \theta_1^\top \vec{x} = -(\theta_1^\top \vec{x} - \theta_2^\top \vec{x}) = -\theta^\top \vec{x}$. Substituting gives:

    $$p = \frac{1}{1 + e^{-\theta^\top \vec{x}}}$$

  3. Recognize the sigmoid function: this is exactly $\sigma(z) = \frac{1}{1 + e^{-z}}$ with $z = \theta^\top \vec{x}$.

So, with $z = \theta^\top \vec{x}$, the probability $p$ can be expressed as:

$$p = \frac{1}{1 + e^{-z}}$$

Alternatively, to make the numerator explicitly $e^z$, multiply the numerator and denominator of the original expression by $e^{-\theta_2^\top \vec{x}}$:

$$p = \frac{e^{\theta_1^\top \vec{x}} \cdot e^{-\theta_2^\top \vec{x}}}{\left( e^{\theta_1^\top \vec{x}} + e^{\theta_2^\top \vec{x}} \right) \cdot e^{-\theta_2^\top \vec{x}}} = \frac{e^{(\theta_1 - \theta_2)^\top \vec{x}}}{e^{(\theta_1 - \theta_2)^\top \vec{x}} + e^0} = \frac{e^{\theta^\top \vec{x}}}{e^{\theta^\top \vec{x}} + 1}$$

With $z = \theta^\top \vec{x}$, this form is:

$$p = \frac{e^z}{e^z + 1}$$

Both $\frac{1}{1 + e^{-z}}$ and $\frac{e^z}{e^z + 1}$ are equivalent forms of the sigmoid function and express the probability $p$ in terms of $e^z$.


6. Implementation

Softmax regression can be implemented using scikit-learn’s LogisticRegression. In scikit-learn ≥ 1.5 the multinomial formulation is the default for multi-class data, and the `multi_class` argument is deprecated, so it should be left unset:

from sklearn.linear_model import LogisticRegression

# multinomial softmax is the default for multi-class targets in scikit-learn >= 1.5
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
  • solver='lbfgs': Uses a quasi-Newton (second-order) optimization method for efficiency.

  • The model automatically handles the softmax function and optimizes the cross-entropy loss.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 1. Generate synthetic business-style 3-class data (e.g., low/medium/high value customers)
X, Y = make_classification(
    n_samples=500,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    n_classes=3,
    class_sep=2.0,
    random_state=42
)

# 2. Scale features (for stability and visualization)
scaler = StandardScaler()
X = scaler.fit_transform(X)

# 3. Train multinomial logistic regression (softmax)
logreg = LogisticRegression(C=1e5, solver='lbfgs')  # multinomial is the default for multi-class
logreg.fit(X, Y)

# 4. Meshgrid for decision boundary plot
xx, yy = np.meshgrid(
    np.arange(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 0.02),
    np.arange(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 0.02)
)

Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# 5. Plot
plt.figure(figsize=(10, 6))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired, shading='auto')
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired, s=50)
plt.xlabel('Customer Feature 1 (scaled)')
plt.ylabel('Customer Feature 2 (scaled)')
plt.title('Multinomial Logistic Regression (3-Class Business Example)')
plt.grid(True)
plt.show()
import numpy as np
from scipy.special import softmax
from sklearn.preprocessing import LabelBinarizer
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

class SoftmaxRegression:
    def __init__(self, learning_rate=0.01, max_iter=1000, tol=1e-4, random_state=None):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.random_state = random_state
        self.weights = None
        self.classes_ = None
        self.bias = None  # Separate bias term

    def _initialize_weights(self, n_features, n_classes):
        if self.random_state is not None:
            np.random.seed(self.random_state)
        # Initialize weights and bias separately
        self.weights = np.random.randn(n_features, n_classes) * 0.01
        self.bias = np.zeros(n_classes)

    def _one_hot_encode(self, y):
        self.label_binarizer = LabelBinarizer()
        return self.label_binarizer.fit_transform(y)

    def _gradient(self, X, y_true, y_pred):
        # Gradient of cross-entropy loss
        error = y_pred - y_true
        grad_weights = X.T @ error / X.shape[0]
        grad_bias = np.mean(error, axis=0)
        return grad_weights, grad_bias

    def fit(self, X, y):
        # Convert labels to one-hot encoding
        y_one_hot = self._one_hot_encode(y)
        self.classes_ = self.label_binarizer.classes_
        n_samples, n_features = X.shape
        n_classes = y_one_hot.shape[1]

        # Initialize weights and bias
        self._initialize_weights(n_features, n_classes)

        prev_loss = float('inf')

        # Gradient descent
        for i in range(self.max_iter):
            # Forward pass: compute predictions
            logits = X @ self.weights + self.bias  # Add bias here
            y_pred = softmax(logits, axis=1)

            # Compute loss (cross-entropy)
            loss = -np.mean(np.sum(y_one_hot * np.log(y_pred + 1e-15), axis=1))

            # Check for convergence
            if np.abs(prev_loss - loss) < self.tol:
                print(f"Converged at iteration {i}")
                break
            prev_loss = loss

            # Backward pass: compute gradients and update parameters
            grad_weights, grad_bias = self._gradient(X, y_one_hot, y_pred)
            self.weights -= self.learning_rate * grad_weights
            self.bias -= self.learning_rate * grad_bias

            if i % 100 == 0:
                print(f"Iteration {i}, Loss: {loss:.4f}")

    def predict_proba(self, X):
        logits = X @ self.weights + self.bias
        return softmax(logits, axis=1)

    def predict(self, X):
        proba = self.predict_proba(X)
        return self.classes_[np.argmax(proba, axis=1)]

# Example usage
if __name__ == "__main__":
    # Load and prepare data
    iris = load_iris()
    X = iris.data
    y = iris.target

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Standardize features
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std

    # Train model
    model = SoftmaxRegression(learning_rate=0.1, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"\nTest Accuracy: {accuracy:.4f}")

    # Show predictions
    print("\nSample predictions:")
    print("True:", y_test[:5])
    print("Pred:", y_pred[:5])
Iteration 0, Loss: 1.1053
Iteration 100, Loss: 0.3389
Iteration 200, Loss: 0.2699
Iteration 300, Loss: 0.2291
Iteration 400, Loss: 0.2013
Iteration 500, Loss: 0.1810
Iteration 600, Loss: 0.1657
Iteration 700, Loss: 0.1536
Converged at iteration 737

Test Accuracy: 1.0000

Sample predictions:
True: [1 0 2 1 1]
Pred: [1 0 2 1 1]
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np

def plot_decision_regions(X, y, model, resolution=0.02):
    # Setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))

    # Predict for each point in meshgrid
    Z = model.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)

    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # Plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.6,
                    c=cmap(idx),
                    edgecolor='black',
                    marker=markers[idx],
                    label=cl)

# Select two features for visualization (sepal length and width)
X_vis = X_train[:, :2]  # Using first two features for visualization

# Retrain model on just these two features
model_vis = SoftmaxRegression(learning_rate=0.01, max_iter=10000, random_state=42)
model_vis.fit(X_vis, y_train)

# Create plot
plt.figure(figsize=(10, 6))
plot_decision_regions(X_vis, y_train, model_vis)
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Sepal width (standardized)')
plt.title('Softmax Regression Decision Boundaries on Iris Dataset')
plt.legend(loc='upper left')
plt.show()

# Also plot the actual test results
plt.figure(figsize=(10, 6))
plot_decision_regions(X_test[:, :2], y_test, model_vis)
plt.xlabel('Sepal length (standardized)')
plt.ylabel('Sepal width (standardized)')
plt.title('Test Set Classification Results')
plt.legend(loc='upper left')
plt.show()
Iteration 0, Loss: 1.0934
Iteration 100, Loss: 0.8577
Iteration 200, Loss: 0.7380
Iteration 300, Loss: 0.6699
Iteration 400, Loss: 0.6268
Iteration 500, Loss: 0.5973
Iteration 600, Loss: 0.5759
Iteration 700, Loss: 0.5595
Iteration 800, Loss: 0.5466
Converged at iteration 876

Information-Theoretic View: Entropy & KL Divergence in Classification

In the context of classification, an information-theoretic perspective provides a framework to understand and quantify uncertainty, information content, and divergence between probability distributions. Two key concepts in this view are entropy and Kullback-Leibler (KL) divergence. Below, we explore these concepts and their relevance to classification tasks.


1. Entropy in Classification

Entropy is a measure of uncertainty or randomness in a probability distribution. In classification, it quantifies the uncertainty associated with predicting the class label of a data point.

  • Definition: For a discrete random variable $Y$ representing class labels with probability distribution $P(Y)$, the entropy $H(Y)$ is defined as:

    $$H(Y) = - \sum_{i=1}^C P(y_i) \log P(y_i)$$

    where:

    • $C$ is the number of classes.

    • $P(y_i)$ is the probability of class $y_i$.

    • The logarithm is typically base 2 (for bits) or base $e$ (for nats).

  • Interpretation:

    • High entropy: The class distribution is uniform (e.g., $P(y_i) = \frac{1}{C}$), indicating maximum uncertainty. For example, in a binary classification problem with $P(y_1) = 0.5$ and $P(y_2) = 0.5$, the entropy is $H(Y) = 1$ bit.

    • Low entropy: The distribution is skewed toward one class (e.g., $P(y_1) = 0.99, P(y_2) = 0.01$), indicating low uncertainty.

    • Zero entropy: The outcome is certain (e.g., $P(y_1) = 1, P(y_2) = 0$).

  • Role in Classification:

    • Decision Trees: Entropy is used in algorithms like ID3 or C4.5 to measure the impurity of a node. The goal is to split the data to minimize entropy (i.e., reduce uncertainty about class labels).

    • Model Evaluation: Entropy can help assess the uncertainty in a model’s predictions. For example, a model outputting high-entropy (uniform) probability distributions for test samples may indicate poor confidence.

    • Cross-Entropy Loss: In training classifiers (e.g., neural networks), the cross-entropy loss measures the difference between the true label distribution and the predicted distribution, effectively penalizing high uncertainty in incorrect predictions.


2. KL Divergence in Classification

Kullback-Leibler (KL) divergence measures how much one probability distribution differs from another. In classification, it is often used to compare the predicted probability distribution to the true distribution.

  • Definition: For two probability distributions $P$ (true distribution) and $Q$ (approximated/predicted distribution) over the same set of class labels, the KL divergence is:

    $$D_{KL}(P || Q) = \sum_{i=1}^C P(y_i) \log \frac{P(y_i)}{Q(y_i)}$$

    • $D_{KL}(P || Q) \geq 0$, with equality if and only if $P = Q$.

    • It is not symmetric: $D_{KL}(P || Q) \neq D_{KL}(Q || P)$.

  • Interpretation:

    • KL divergence quantifies the “extra information” (in bits or nats) needed to encode samples from $P$ using a code optimized for $Q$.

    • A small KL divergence indicates that $Q$ is a good approximation of $P$.

    • A large KL divergence suggests that the predicted distribution $Q$ deviates significantly from the true distribution $P$.

  • Role in Classification:

    • Cross-Entropy Loss and KL Divergence: The cross-entropy loss used in classification can be decomposed as:

      $$H(P, Q) = H(P) + D_{KL}(P || Q)$$

      where:

      • $H(P, Q) = - \sum_{i=1}^C P(y_i) \log Q(y_i)$ is the cross-entropy.

      • $H(P)$ is the entropy of the true distribution.

      • $D_{KL}(P || Q)$ is the KL divergence. Since $H(P)$ is constant for a given true distribution, minimizing the cross-entropy loss is equivalent to minimizing the KL divergence between the true and predicted distributions.
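The decomposition is straightforward to verify numerically. A small sketch (the two three-class distributions are arbitrary illustrative values, in bits):

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])   # true class distribution (illustrative)
Q = np.array([0.4, 0.4, 0.2])     # model's predicted distribution (illustrative)

H_P  = -(P * np.log2(P)).sum()       # entropy H(P)
H_PQ = -(P * np.log2(Q)).sum()       # cross-entropy H(P, Q)
KL   =  (P * np.log2(P / Q)).sum()   # D_KL(P || Q)
print(H_P, H_PQ, KL)                 # H(P, Q) equals H(P) + KL
```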

    • Model Calibration: KL divergence can be used to evaluate how well a model’s predicted probabilities align with the true class distribution. A high KL divergence may indicate miscalibration.

    • Regularization: In some models, KL divergence is used as a regularization term. For example, in variational inference or Bayesian neural networks, KL divergence measures the difference between the learned parameter distribution and a prior distribution.

    • Domain Adaptation: KL divergence can quantify the difference between source and target domain distributions, helping to align feature distributions in transfer learning or domain adaptation tasks.


3. Practical Implications in Classification

  • Entropy and Decision Making:

    • Entropy guides feature selection in decision trees by identifying splits that reduce class uncertainty.

    • In ensemble methods (e.g., random forests), entropy can help assess the diversity of predictions across trees.

    • In active learning, entropy is used to select uncertain samples for labeling, maximizing information gain.

  • KL Divergence and Model Optimization:

    • During training, minimizing the cross-entropy loss implicitly minimizes the KL divergence, aligning the predicted probabilities with the true labels.

    • In generative models (e.g., GANs or VAEs) used for classification, KL divergence may appear in the objective function to ensure the generated distribution matches the true data distribution.

    • In knowledge distillation, KL divergence is used to transfer knowledge from a teacher model (with a “soft” probability distribution) to a student model.

  • Evaluation Metrics:

    • While entropy and KL divergence are not typically used directly as evaluation metrics, they underpin metrics like log-loss (cross-entropy) and can inform diagnostic tools for model performance.

    • For example, a model with high cross-entropy loss may have a large KL divergence, indicating poor alignment with the true distribution.


4. Example: Binary Classification

Consider a binary classification problem with true labels $P(y=1) = 0.7, P(y=0) = 0.3$, and a model’s predicted probabilities $Q(y=1) = 0.6, Q(y=0) = 0.4$.

  • Entropy of True Distribution:

    $$H(P) = - [0.7 \log_2 0.7 + 0.3 \log_2 0.3] \approx 0.881 \text{ bits}$$

  • Cross-Entropy:

    $$H(P, Q) = - [0.7 \log_2 0.6 + 0.3 \log_2 0.4] \approx 0.913 \text{ bits}$$

  • KL Divergence:

    $$D_{KL}(P || Q) = H(P, Q) - H(P) \approx 0.913 - 0.881 = 0.032 \text{ bits}$$

    The KL divergence is small, indicating that the predicted distribution $Q$ is close to the true distribution $P$.
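These numbers can be reproduced without intermediate rounding (a minimal sketch using the same $P$ and $Q$ as above):

```python
import numpy as np

P = np.array([0.7, 0.3])   # true distribution from the example
Q = np.array([0.6, 0.4])   # predicted distribution from the example

H_P  = -(P * np.log2(P)).sum()   # entropy of P, in bits
H_PQ = -(P * np.log2(Q)).sum()   # cross-entropy H(P, Q)
KL   = H_PQ - H_P                # D_KL(P || Q) via the decomposition
print(f"H(P)={H_P:.3f} bits, H(P,Q)={H_PQ:.3f} bits, KL={KL:.3f} bits")
```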


5. Limitations and Considerations

  • Entropy:

    • Entropy assumes a discrete distribution. For continuous variables, differential entropy is used, but it has different properties (e.g., it can be negative).

    • Entropy is sensitive to the number of classes; more classes generally increase entropy unless the distribution is highly skewed.

  • KL Divergence:

    • KL divergence is asymmetric and not a true distance metric.

    • It can be undefined if $Q(y_i) = 0$ for any $y_i$ where $P(y_i) > 0$. Smoothing techniques (e.g., adding a small $\epsilon$ to probabilities) are often used to avoid this.

    • KL divergence is sensitive to small differences in low-probability events, which may or may not be desirable depending on the application.

  • Computational Considerations:

    • Estimating entropy and KL divergence requires reliable probability estimates, which can be challenging with limited data or poorly calibrated models.

    • In high-dimensional settings, approximating these quantities may require techniques like Monte Carlo sampling or kernel density estimation.
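The zero-probability issue mentioned above, and the usual additive-smoothing workaround, can be sketched as follows (the distributions, the helper name `kl`, and the value of `eps` are all illustrative choices):

```python
import numpy as np

P = np.array([0.5, 0.5, 0.0])
Q = np.array([0.6, 0.4, 0.0])      # Q is zero only where P is zero: KL is finite
Q_bad = np.array([1.0, 0.0, 0.0])  # Q = 0 where P > 0: KL blows up without smoothing

def kl(p, q, eps=0.0):
    """D_KL(p || q) in bits, with optional additive smoothing by eps."""
    p = (p + eps) / (p + eps).sum()    # smooth, then renormalize
    q = (q + eps) / (q + eps).sum()
    mask = p > 0                       # 0 * log(0/q) is taken as 0
    return (p[mask] * np.log2(p[mask] / q[mask])).sum()

print(kl(P, Q))                 # finite, small
print(kl(P, Q_bad, eps=1e-9))   # finite (and large) only thanks to smoothing
```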


6. Broader Context

  • Mutual Information: Entropy is closely related to mutual information, which measures the reduction in uncertainty about one variable (e.g., class labels) given another (e.g., features). Mutual information can guide feature selection in classification.

  • Information Bottleneck: This framework uses entropy and KL divergence to balance the trade-off between compressing input features and preserving information about the class labels.

  • Robustness and Uncertainty: Information-theoretic measures can help quantify model robustness to adversarial attacks or distributional shifts by analyzing changes in entropy or KL divergence under perturbations.


Summary

  • Entropy quantifies the uncertainty in a class distribution, guiding decision-making in algorithms like decision trees and informing loss functions like cross-entropy.

  • KL Divergence measures the difference between true and predicted distributions, playing a central role in optimizing classifiers and evaluating model calibration.

  • Together, these concepts provide a principled way to understand, train, and evaluate classification models, ensuring that predictions align with the true underlying distribution while minimizing uncertainty.


1. Multinomial Cross-Entropy Loss (Multi-Class)

The provided cost function is the multinomial cross-entropy loss, used for multi-class classification with $K$ classes. It is derived from the negative log-likelihood of the data under a multinomial logistic regression model (softmax classifier).

Formula

For a dataset with $n$ examples, the cost function is:

$$J(\theta) = -\frac{1}{n} \ell(\theta) = -\frac{1}{n} \sum_{i=1}^n \left[ \theta_{y^{(i)}}^\top \vec{x}^{(i)} - \log \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}} \right]$$

Explanation

  • Input: For the $i$-th example, $\vec{x}^{(i)}$ is the feature vector, $y^{(i)} \in \{1, 2, \dots, K\}$ is the true class label, and $\theta_l$ is the parameter vector for class $l$.

  • Probability: The model predicts the probability of class $k$ using the softmax function:

$$P(y = k | \vec{x}^{(i)}, \theta) = \frac{e^{\theta_k^\top \vec{x}^{(i)}}}{\sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}}}$$

  • Log-Likelihood: The log-likelihood for the $i$-th example is the log of the probability of the true class $y^{(i)}$:

$$\log P(y = y^{(i)} | \vec{x}^{(i)}, \theta) = \theta_{y^{(i)}}^\top \vec{x}^{(i)} - \log \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}}$$

  • Cost Function: The negative average log-likelihood over $n$ examples gives $J(\theta)$, which we minimize to find the optimal parameters $\theta$.

Key Features

  • Applies to $K \geq 2$ classes.

  • Uses the softmax function to compute probabilities across all classes.

  • The loss penalizes the model when the predicted probability for the true class is low.


2. Binary Cross-Entropy Loss (Two Classes)

For binary classification (2 classes, typically labeled $y \in \{0, 1\}$), the binary cross-entropy loss (also called log loss) is used. It’s a special case of the multinomial cross-entropy loss when $K = 2$, but it’s often written in a simpler form using the sigmoid function.

Formula

For a dataset with $n$ examples, the binary cross-entropy loss is:

$$J(\theta) = -\frac{1}{n} \sum_{i=1}^n \left[ y^{(i)} \log \hat{p}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{p}^{(i)}) \right]$$

where:

  • $\hat{p}^{(i)} = \sigma(\theta^\top \vec{x}^{(i)})$ is the predicted probability of class 1, and $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.

  • $y^{(i)} \in \{0, 1\}$ is the true label for the $i$-th example.

  • $\theta$ is the parameter vector (a single set of weights, unlike the multi-class case).

Explanation

  • Probability: The model predicts the probability of class 1 as:

$$P(y = 1 | \vec{x}^{(i)}, \theta) = \sigma(\theta^\top \vec{x}^{(i)}) = \hat{p}^{(i)}$$

  • The probability of class 0 is:

$$P(y = 0 | \vec{x}^{(i)}, \theta) = 1 - \hat{p}^{(i)}$$

  • Log-Likelihood: The log-likelihood for the $i$-th example is:

$$\log P(y = y^{(i)} | \vec{x}^{(i)}, \theta) = y^{(i)} \log \hat{p}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{p}^{(i)})$$

  • Cost Function: The negative average log-likelihood gives $J(\theta)$, which is minimized to optimize $\theta$.

Key Features

  • Applies to exactly 2 classes ($y = 0$ or $y = 1$).

  • Uses the sigmoid function to compute the probability of class 1.

  • The loss penalizes incorrect predictions by assigning higher loss when the predicted probability $\hat{p}^{(i)}$ is far from the true label $y^{(i)}$.


3. Comparison

| Aspect | Multinomial Cross-Entropy (Multi-Class) | Binary Cross-Entropy (Two Classes) |
| --- | --- | --- |
| Number of Classes | $K \geq 2$ (general case, including 2 or more classes) | Exactly 2 classes ($y \in \{0, 1\}$) |
| Probability Model | Softmax: $P(y = k \mid \vec{x}, \theta) = \frac{e^{\theta_k^\top \vec{x}}}{\sum_{l=1}^K e^{\theta_l^\top \vec{x}}}$ | Sigmoid: $P(y = 1 \mid \vec{x}, \theta) = \sigma(\theta^\top \vec{x})$ |
| Parameters | $K$ parameter vectors: $\theta_1, \theta_2, \dots, \theta_K$ (one per class) | Single parameter vector: $\theta$ |
| Loss Formula | $J(\theta) = -\frac{1}{n} \sum_{i=1}^n \left[ \theta_{y^{(i)}}^\top \vec{x}^{(i)} - \log \sum_{l=1}^K e^{\theta_l^\top \vec{x}^{(i)}} \right]$ | $J(\theta) = -\frac{1}{n} \sum_{i=1}^n \left[ y^{(i)} \log \hat{p}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{p}^{(i)}) \right]$ |
| Special Case | Reduces to binary cross-entropy when $K = 2$ (see below) | Is a special case of multinomial cross-entropy when $K = 2$ |
| Interpretation | Measures how well the predicted probability distribution matches the true class across $K$ classes | Measures how well the predicted probability for class 1 (or 0) matches the true binary label |

Key Insight: Binary Case as a Special Case

When $K = 2$ in the multinomial case, the softmax model can be shown to be equivalent to the sigmoid-based binary model. Let’s derive this briefly:

  • For $K = 2$, classes are $y \in \{1, 2\}$ (or relabeled as $\{0, 1\}$). The softmax probabilities are:

$$P(y = 1 | \vec{x}, \theta) = \frac{e^{\theta_1^\top \vec{x}}}{e^{\theta_1^\top \vec{x}} + e^{\theta_2^\top \vec{x}}}$$

$$P(y = 2 | \vec{x}, \theta) = \frac{e^{\theta_2^\top \vec{x}}}{e^{\theta_1^\top \vec{x}} + e^{\theta_2^\top \vec{x}}}$$

  • Divide numerator and denominator by $e^{\theta_2^\top \vec{x}}$:

$$P(y = 1 | \vec{x}, \theta) = \frac{e^{\theta_1^\top \vec{x} - \theta_2^\top \vec{x}}}{1 + e^{\theta_1^\top \vec{x} - \theta_2^\top \vec{x}}} = \sigma((\theta_1 - \theta_2)^\top \vec{x})$$

  • Let $\theta = \theta_1 - \theta_2$. This matches the sigmoid form: $P(y = 1) = \sigma(\theta^\top \vec{x})$.

  • The multinomial loss for $K = 2$ simplifies to the binary cross-entropy loss, confirming they are equivalent up to a reparameterization.


4. Small Examples

Example 1: Binary Cross-Entropy (2 Classes)

Setup:

  • Dataset: $n = 2$ examples.

  • Example 1: $\vec{x}^{(1)} = [1, 2]$, $y^{(1)} = 1$ (class 1).

  • Example 2: $\vec{x}^{(2)} = [3, 4]$, $y^{(2)} = 0$ (class 0).

  • Parameters: $\theta = [0.5, 0.5]$.

Calculation:

  • For example 1: $\theta^\top \vec{x}^{(1)} = 0.5 \cdot 1 + 0.5 \cdot 2 = 1.5$, $\hat{p}^{(1)} = \sigma(1.5) = \frac{1}{1 + e^{-1.5}} \approx 0.818$.

  • Loss term: $y^{(1)} \log \hat{p}^{(1)} + (1 - y^{(1)}) \log (1 - \hat{p}^{(1)}) = 1 \cdot \log(0.818) + 0 \cdot \log(1 - 0.818) \approx -0.201$.

  • For example 2: $\theta^\top \vec{x}^{(2)} = 0.5 \cdot 3 + 0.5 \cdot 4 = 3.5$, $\hat{p}^{(2)} = \sigma(3.5) \approx 0.971$.

  • Loss term: $0 \cdot \log(0.971) + 1 \cdot \log(1 - 0.971) \approx \log(0.029) \approx -3.530$.

  • Total loss: $J(\theta) = -\frac{1}{2} [(-0.201) + (-3.530)] \approx \frac{3.731}{2} \approx 1.866$.
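The hand calculation can be checked in a few lines, computed without intermediate rounding (the arrays mirror the setup above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, 0.5])
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 0.0])

p = sigmoid(X @ theta)   # predicted probabilities of class 1 for both examples
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(f"{loss:.4f}")     # ~1.8655; hand rounding of p-hat gives slightly different digits
```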

Example 2: Multinomial Cross-Entropy (3 Classes)

Setup:

  • Dataset: $n = 2$ examples, $K = 3$ classes.

  • Example 1: $\vec{x}^{(1)} = [1, 2]$, $y^{(1)} = 2$ (class 2).

  • Example 2: $\vec{x}^{(2)} = [3, 4]$, $y^{(2)} = 1$ (class 1).

  • Parameters: $\theta_1 = [0.5, 1.0]$, $\theta_2 = [1.0, 0.5]$, $\theta_3 = [-0.5, 0.5]$.

Calculation:

  • For example 1: Scores are $\theta_1^\top \vec{x}^{(1)} = 0.5 \cdot 1 + 1.0 \cdot 2 = 2.5$, $\theta_2^\top \vec{x}^{(1)} = 1.0 \cdot 1 + 0.5 \cdot 2 = 2.0$, $\theta_3^\top \vec{x}^{(1)} = -0.5 \cdot 1 + 0.5 \cdot 2 = 0.5$.

  • Sum of exponentials: $e^{2.5} \approx 12.182$, $e^{2.0} \approx 7.389$, $e^{0.5} \approx 1.649$, total $\approx 21.220$.

  • Loss term: $\theta_2^\top \vec{x}^{(1)} - \log \sum_{l=1}^3 e^{\theta_l^\top \vec{x}^{(1)}} = 2.0 - \log(21.220) \approx 2.0 - 3.055 \approx -1.055$.

  • For example 2: Scores are $\theta_1^\top \vec{x}^{(2)} = 0.5 \cdot 3 + 1.0 \cdot 4 = 5.5$, $\theta_2^\top \vec{x}^{(2)} = 1.0 \cdot 3 + 0.5 \cdot 4 = 5.0$, $\theta_3^\top \vec{x}^{(2)} = -0.5 \cdot 3 + 0.5 \cdot 4 = 0.5$.

  • Sum of exponentials: $e^{5.5} \approx 244.692$, $e^{5.0} \approx 148.413$, $e^{0.5} \approx 1.649$, total $\approx 394.754$.

  • Loss term: $\theta_1^\top \vec{x}^{(2)} - \log \sum_{l=1}^3 e^{\theta_l^\top \vec{x}^{(2)}} = 5.5 - \log(394.754) \approx 5.5 - 5.978 \approx -0.478$.

  • Total loss: $J(\theta) = -\frac{1}{2} [(-1.055) + (-0.478)] \approx \frac{1.533}{2} \approx 0.767$.
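The same multinomial calculation, vectorized with NumPy and free of rounding (arrays mirror the setup above):

```python
import numpy as np

Theta = np.array([[0.5, 1.0],    # theta_1
                  [1.0, 0.5],    # theta_2
                  [-0.5, 0.5]])  # theta_3
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1, 0])             # 0-based class indices: example 1 -> class 2, example 2 -> class 1

scores = X @ Theta.T                          # (2, 3) matrix of theta_l^T x^(i)
log_Z = np.log(np.exp(scores).sum(axis=1))    # log of the exponential sums
loss = -np.mean(scores[np.arange(2), y] - log_Z)
print(f"{loss:.3f}")
```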


# Your code here