Max-Margin Intuition¶

Learning objectives

By the end of this notebook you will be able to:

Explain the maximum-margin classification principle geometrically.
Define support vectors, margin, and the decision boundary hyperplane.
State the hard-margin SVM primal optimisation problem.
Derive why maximum margin corresponds to minimising $\|\boldsymbol{w}\|^2$ .
Visualise the decision boundary and margin bands for a linear SVM.
Understand why larger margin generalises better (structural risk minimisation).
Train a linear SVM with sklearn and identify the support vectors.

Business hook — The safest boundary¶

Two loan officers each draw a line separating high-risk from low-risk applicants. Both lines correctly classify all training examples. But Officer A’s line runs through the middle with no gap; Officer B’s line keeps the maximum possible distance from both groups.

When a new ambiguous applicant arrives, Officer B’s model is more confident — because being far from both classes means being robust to measurement noise and slight distributional shift.

This is the core intuition of Support Vector Machines: not just any separating hyperplane, but the one with maximum margin.

1. Maximum Margin — The Geometry¶

A hyperplane in $\mathbb{R}^p$ is defined by:

\boldsymbol{w}^\top \mathbf{x} + b = 0

(1)

The signed distance from a point $\mathbf{x}^{(i)}$ to the hyperplane is:

d^{(i)} = \frac{\boldsymbol{w}^\top \mathbf{x}^{(i)} + b}{\|\boldsymbol{w}\|}

(2)

For a classifier to label $y^{(i)} \in \{-1, +1\}$ correctly with confidence at least $\gamma$ :

y^{(i)} \cdot (\boldsymbol{w}^\top \mathbf{x}^{(i)} + b) \geq \gamma \|\boldsymbol{w}\|

(3)

Setting $\gamma = 1$ (the canonical form), the margin is:

\text{margin} = \frac{2}{\|\boldsymbol{w}\|}

(4)

Hard-margin SVM primal problem:

\min_{\boldsymbol{w}, b} \frac{1}{2}\|\boldsymbol{w}\|^2 \quad \text{subject to} \quad y^{(i)}(\boldsymbol{w}^\top \mathbf{x}^{(i)} + b) \geq 1 \quad \forall i

(5)

Support vectors are the training points that lie exactly on the margin boundaries:

y^{(i)}(\boldsymbol{w}^\top \mathbf{x}^{(i)} + b) = 1

(6)

The decision boundary is completely determined by these support vectors — all other training points have no influence.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# Original user code: linear SVM on simple data
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear', C=1000)
clf.fit(X, y)

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolors='k', s=100, zorder=10)

xlim = [0, 10]
ylim = [0, 10]
xx = np.linspace(xlim[0], xlim[1], 50)
yy = np.linspace(ylim[0], ylim[1], 50)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)

ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1],
           linestyles=['--', '-', '--'], linewidths=[1, 2, 1])
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
           s=200, facecolors='none', edgecolors='black', linewidths=2,
           label='Support vectors')
ax.set_xlim(xlim); ax.set_ylim(ylim)
ax.set_xlabel('Feature 1'); ax.set_ylabel('Feature 2')
ax.set_title('Linear SVM: Decision Boundary and Margin Bands')
ax.legend()
ax.grid(True, alpha=0.3)

# Annotate margin
w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)
print(f'Support vectors:\n{clf.support_vectors_}')
print(f'Weight vector w: {w.round(4)}')
print(f'Bias b: {clf.intercept_[0]:.4f}')
print(f'Margin (2/||w||): {margin:.4f}')
plt.show()

2. Why Maximum Margin Generalises Better¶

Structural Risk Minimisation (SRM) provides the theoretical justification. The generalisation error bound for an SVM is:

\text{Error}_{\text{test}} \leq \text{Error}_{\text{train}} + O\!\left(\sqrt{\frac{h \log(n/h) + \log(1/\delta)}{n}}\right)

(7)

where $h$ is the VC dimension (measure of hypothesis complexity). For a linear classifier in $\mathbb{R}^p$ the VC dimension is at most $p + 1$ . But for an SVM with margin $\rho$ , the effective VC dimension is bounded by $R^2/\rho^2$ where $R$ is the data radius.

Larger margin $\rho$ → lower VC dimension → tighter generalisation bound.

This means SVMs with large margins are theoretically justified to generalise — not just as a heuristic, but as a principled complexity-control mechanism (analogous to regularisation in regression).

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_circles

# Original user code: preview of kernel trick
X, y = make_circles(100, factor=0.3, noise=0.1, random_state=42)

# Linear SVM fails
clf_lin = svm.SVC(kernel='linear', C=1.0)
clf_lin.fit(X, y)

# RBF SVM succeeds (preview)
clf_rbf = svm.SVC(kernel='rbf', gamma='scale', C=1.0)
clf_rbf.fit(X, y)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, clf, title in [
    (axes[0], clf_lin, f'Linear SVM (accuracy={clf_lin.score(X,y):.2f})'),
    (axes[1], clf_rbf, f'RBF SVM (accuracy={clf_rbf.score(X,y):.2f})')
]:
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolors='k', s=50)
    xlim = ax.get_xlim() if ax.get_xlim() != (0, 1) else (-1.5, 1.5)
    ylim = ax.get_ylim() if ax.get_ylim() != (0, 1) else (-1.5, 1.5)
    ax.set_xlim(-1.5, 1.5); ax.set_ylim(-1.5, 1.5)
    xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, levels=20, cmap='bwr', alpha=0.2)
    ax.contour(xx, yy, Z, colors='k', levels=[0], linewidths=2)
    ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
               s=100, facecolors='none', edgecolors='k', linewidths=1.5)
    ax.set_title(title)
    ax.grid(True, alpha=0.3)

plt.suptitle('Linear SVM fails on non-linear data — RBF kernel solves it (preview)', fontsize=10)
plt.tight_layout()
plt.show()

3. Support Vector Regression (SVR)¶

SVMs extend naturally to regression via the epsilon-insensitive tube:

\min_{\boldsymbol{w}, b} \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_{i=1}^n (\xi_i + \xi_i^*)

(8)

subject to $|y^{(i)} - (\boldsymbol{w}^\top \mathbf{x}^{(i)} + b)| \leq \varepsilon + \xi_i$ .

Points inside the tube incur no loss. Points outside contribute to the objective via slack variables $\xi_i, \xi_i^* \geq 0$ .

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR

# Original user code: SVR with epsilon tube
np.random.seed(42)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1)
svr.fit(X, y)

X_test = np.linspace(0, 5, 100)[:, np.newaxis]
y_pred = svr.predict(X_test)

plt.figure(figsize=(8, 4))
plt.scatter(X, y, color='blue', label='Data', s=40)
plt.plot(X_test, y_pred, color='red', linewidth=2, label='SVR Prediction')
plt.fill_between(X_test.ravel(), y_pred - svr.epsilon, y_pred + svr.epsilon,
                 color='gray', alpha=0.3, label=f'Epsilon tube (eps={svr.epsilon})')
plt.title('Support Vector Regression (SVR)')
plt.xlabel('X'); plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()

4. Try It in the Browser¶

Compute the margin for a 2D linear SVM by hand.

import math

# Hard-margin SVM: w*x + b = 0 separates two classes
# w = [1, 1], b = -5  =>  line: x1 + x2 = 5
w = [1.0, 1.0]
b = -5.0

# Margin = 2 / ||w||
w_norm = math.sqrt(sum(wi**2 for wi in w))
margin = 2 / w_norm
print(f'Weight vector: {w}')
print(f'||w|| = {w_norm:.4f}')
print(f'Margin = 2/||w|| = {margin:.4f}')

# Check a positive and negative example
examples = [(4, 3, +1, 'positive'), (1, 1, -1, 'negative')]
print()
for x1, x2, label, name in examples:
    score = w[0]*x1 + w[1]*x2 + b
    decision = 'correct' if score * label >= 1 else 'VIOLATION'
    print(f'{name} ({x1},{x2}): score={score:.1f}, y*score={label*score:.1f} [{decision}]')

Knowledge Check¶

In the hard-margin SVM, what are support vectors?¶

Training points that lie exactly on the margin boundaryCorrect. Only these points determine the decision boundary; all others could be removed without changing it.

All training points used to fit the modelOnly the subset on the margin boundary are support vectors.

Points that are misclassified by the modelIn the hard-margin case, no points are misclassified.

Points farthest from the decision boundarySupport vectors are closest to the boundary, not farthest.

Why does maximising the margin correspond to minimising ||w||²?¶

Because ||w||² equals the accuracy of the model||w||² is a geometric quantity, not accuracy.

Because margin = 2/||w||, so maximising the margin is equivalent to minimising ||w||Correct. Squaring gives a differentiable convex objective.

Because ||w||² is the number of support vectorsThe two quantities are not directly equal.

Because ||w|| measures the bias termThe bias b is separate from w.

Exercises¶

Exercise 1 — Margin Sensitivity to C¶

Train linear SVMs with C = [0.001, 0.1, 1, 10, 1000] on a 2D synthetic dataset. For each, print the margin (2/||w||) and number of support vectors. What happens as C grows?

%matplotlib inline
# Exercise 1: margin sensitivity to C
# Your code here

Common Pitfalls¶

What’s Next?¶

Linear SVMs are limited to linearly separable data. In svm_kernels.ipynb we introduce the kernel trick — mapping data to a high-dimensional feature space implicitly via a kernel function — allowing SVMs to learn non-linear boundaries without the computational cost of explicit feature expansion.

Kernel SVMs — the kernel trick, RBF and polynomial kernels
Soft Margin & Regularisation — the C parameter and hinge loss