Kernel SVMs — RBF, Polynomial, and Custom Kernels¶

What you will learn: the kernel trick, Mercer’s theorem, how RBF/polynomial/sigmoid kernels map data to high-dimensional feature spaces without computing that mapping explicitly, and how to implement and compare custom kernels with scikit-learn’s SVC.

Why Kernels Matter for Business¶

A credit-card fraud model trained on linearly-separable features was 78% accurate. After switching to an RBF-kernel SVM, accuracy rose to 94% — the kernel found non-linear fraud patterns invisible to a linear classifier.

The key insight: we never explicitly compute the high-dimensional mapping — only inner products, via the kernel function.

1. The Kernel Trick¶

The SVM dual objective is:

\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle

(1)

The only data dependency is the inner product $\langle x_i, x_j \rangle$ . Replace it with any positive semi-definite kernel $K(x_i, x_j)$ :

K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)

(2)

where $\phi: \mathbb{R}^p \to \mathcal{H}$ is an implicit (possibly infinite-dimensional) feature map.

Mercer’s Theorem: A symmetric function $K$ is a valid kernel iff the Gram matrix $K_{ij} = K(x_i, x_j)$ is PSD for all datasets. This guarantees a $\phi$ exists.

Kernel	Formula	Use case
Linear	$x_i^\top x_j$	Linearly separable, text
Polynomial	$(\gamma x_i^\top x_j + r)^d$	Degree- $d$ interactions
RBF/Gaussian	$\exp(-\gamma \|x_i - x_j\|^2)$	General non-linear
Sigmoid	$\tanh(\gamma x_i^\top x_j + r)$	Neural network analogy

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances


def linear_kernel(x1, x2):
    return np.dot(x1, x2.T)


def polynomial_kernel(x1, x2, degree=3, gamma=1.0, coef0=1.0):
    x1_ = np.atleast_2d(x1)
    x2_ = np.atleast_2d(x2)
    return (gamma * np.dot(x1_, x2_.T) + coef0) ** degree


def rbf_kernel(x1, x2, gamma=1.0):
    x1_ = np.atleast_2d(x1)
    x2_ = np.atleast_2d(x2)
    sq_distances = euclidean_distances(x1_, x2_, squared=True)
    return np.exp(-gamma * sq_distances)


# Verify kernel values
vec1 = np.array([[1.0, 2.0]])
vec2 = np.array([[3.0, 4.0]])
print(f"Linear   K(v1,v2) = {linear_kernel(vec1, vec2)[0,0]:.4f}  (expect 11.0)")
print(f"Poly d=3 K(v1,v2) = {polynomial_kernel(vec1, vec2)[0,0]:.4f}")
print(f"RBF g=1  K(v1,v2) = {rbf_kernel(vec1, vec2)[0,0]:.6f}")
print(f"RBF g=1  K(v1,v1) = {rbf_kernel(vec1, vec1)[0,0]:.6f}  (expect 1.0 — self-similarity)")

# Visualise RBF kernel as function of distance
dists = np.linspace(0, 4, 200)
gammas = [0.1, 0.5, 1.0, 2.0]
fig, ax = plt.subplots(figsize=(8, 4))
for g in gammas:
    ax.plot(dists, np.exp(-g * dists**2), label=f'γ={g}')
ax.set_xlabel('||xi − xj||')
ax.set_ylabel('K(xi, xj)')
ax.set_title('RBF Kernel Similarity vs Distance')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

2. SVM Primal — Gradient Descent on Hinge Loss¶

The primal SVM minimises:

\min_{w,b} \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_i \max(0,\, 1 - y_i(w^\top x_i - b))

(3)

This is convex and differentiable almost everywhere. Gradient descent works directly on the hinge + regularisation objective.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC


class SVMPrimal:
    def __init__(self, learning_rate=0.01, lambda_param=0.01, n_iters=500):
        self.lr = learning_rate
        self.lambda_param = lambda_param
        self.n_iters = n_iters
        self.w = None
        self.b = None

    def fit(self, X, y):
        y = np.where(y <= 0, -1, 1)
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0
        Xy = X * y[:, np.newaxis]
        for _ in range(self.n_iters):
            linear_model = np.dot(X, self.w) - self.b
            margins = y * linear_model
            mask = margins < 1
            dw = 2 * self.lambda_param * self.w
            if np.any(mask):
                dw -= np.mean(Xy[mask], axis=0)
                db = -np.mean(y[mask])
            else:
                db = 0
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        return np.sign(np.dot(X, self.w) - self.b)


# Linearly separable dataset
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1, random_state=42)
y_pm = np.where(y == 0, -1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_pm, test_size=0.2, random_state=42)

primal = SVMPrimal(learning_rate=0.001, lambda_param=0.01, n_iters=1000)
primal.fit(X_tr, y_tr)
sklearn_svc = SVC(kernel='linear', C=50)
sklearn_svc.fit(X_tr, y_tr)

acc_primal = accuracy_score(y_te, primal.predict(X_te))
acc_sklearn = accuracy_score(y_te, sklearn_svc.predict(X_te))
print(f"SVMPrimal accuracy : {acc_primal:.4f}")
print(f"sklearn SVC linear : {acc_sklearn:.4f}")

# Plot decision boundaries side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, (model, name) in zip(axes, [(primal, 'SVMPrimal'), (sklearn_svc, 'sklearn SVC')]):
    h = 0.05
    xx, yy = np.meshgrid(np.linspace(X[:,0].min()-1, X[:,0].max()+1, 200),
                          np.linspace(X[:,1].min()-1, X[:,1].max()+1, 200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.25)
    ax.scatter(X_te[:,0], X_te[:,1], c=y_te, cmap=plt.cm.coolwarm, edgecolors='k', s=25)
    ax.set_title(f'{name}  acc={accuracy_score(y_te, model.predict(X_te)):.3f}')
    ax.set_xlabel('x1'); ax.set_ylabel('x2')
plt.tight_layout()
plt.show()

3. Sequential Minimal Optimisation (SMO)¶

The SVM dual is a quadratic program with $n$ variables. SMO solves it by iteratively picking two Lagrange multipliers $\alpha_i, \alpha_j$ and solving the 2-variable sub-problem analytically:

\alpha_j^{\text{new}} = \alpha_j^{\text{old}} - \frac{y_j(E_i - E_j)}{\eta}, \quad \eta = 2K_{ij} - K_{ii} - K_{jj}

(4)

where $E_i = f(x_i) - y_i$ is the prediction error. After clipping to $[L, H]$ and updating $\alpha_i$ , the bias $b$ is updated to maintain KKT conditions.

The full SMO below supports linear, RBF, and polynomial kernels via a precomputed kernel matrix.

%matplotlib inline
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt


class SMO:
    def __init__(self, C=1.0, tol=1e-3, max_iter=1000, kernel='linear', gamma=1.0, degree=3):
        self.C = C
        self.tol = tol
        self.max_iter = max_iter
        self.kernel = kernel
        self.gamma = gamma
        self.degree = degree
        self.alphas = None
        self.b = 0
        self.X = None
        self.y = None
        self.kernel_matrix = None
        self.support_vectors = None
        self.support_vector_alphas = None
        self.support_vector_y = None

    def _kernel(self, X1, X2=None):
        if X2 is None:
            X2 = X1
        if self.kernel == 'linear':
            return np.dot(X1, X2.T)
        elif self.kernel == 'rbf':
            return np.exp(-self.gamma * cdist(X1, X2, 'sqeuclidean'))
        elif self.kernel == 'poly':
            return (self.gamma * np.dot(X1, X2.T) + 1.0) ** self.degree
        return np.dot(X1, X2.T)

    def _compute_error_full(self, i):
        return np.sum(self.alphas * self.y * self.kernel_matrix[i]) + self.b - self.y[i]

    def _compute_error(self, i):
        if 0 < self.alphas[i] < self.C:
            return self.errors[i]
        return self._compute_error_full(i)

    def _select_second_alpha(self, i, error_i):
        valid = np.where((self.alphas > 0) & (self.alphas < self.C))[0]
        if len(valid) > 1:
            j = valid[np.argmax(np.abs(error_i - self.errors[valid]))]
            if j != i:
                return j
        j = i
        while j == i:
            j = np.random.randint(0, len(self.alphas))
        return j

    def _take_step(self, i, j):
        if i == j:
            return 0
        a_i, a_j = self.alphas[i], self.alphas[j]
        y_i, y_j = self.y[i], self.y[j]
        L = max(0, a_j - a_i) if y_i != y_j else max(0, a_i + a_j - self.C)
        H = min(self.C, self.C + a_j - a_i) if y_i != y_j else min(self.C, a_i + a_j)
        if L == H:
            return 0
        eta = 2*self.kernel_matrix[i,j] - self.kernel_matrix[i,i] - self.kernel_matrix[j,j]
        if eta >= 0:
            return 0
        ei = self._compute_error(i); ej = self._compute_error(j)
        self.alphas[j] = min(H, max(L, a_j - y_j*(ei - ej)/eta))
        if abs(self.alphas[j] - a_j) < 1e-5:
            return 0
        self.alphas[i] += y_i*y_j*(a_j - self.alphas[j])
        b1 = self.b - ei - y_i*(self.alphas[i]-a_i)*self.kernel_matrix[i,i] - y_j*(self.alphas[j]-a_j)*self.kernel_matrix[i,j]
        b2 = self.b - ej - y_i*(self.alphas[i]-a_i)*self.kernel_matrix[i,j] - y_j*(self.alphas[j]-a_j)*self.kernel_matrix[j,j]
        self.b = b1 if 0 < self.alphas[i] < self.C else (b2 if 0 < self.alphas[j] < self.C else (b1+b2)/2)
        for k in np.where((self.alphas > 0) & (self.alphas < self.C))[0]:
            self.errors[k] = self._compute_error_full(k)
        return 1

    def _examine_example(self, i):
        r_i = self._compute_error(i) * self.y[i]
        if (r_i < -self.tol and self.alphas[i] < self.C) or (r_i > self.tol and self.alphas[i] > 0):
            ei = self._compute_error(i)
            j = self._select_second_alpha(i, ei)
            if self._take_step(i, j):
                return 1
            for j in np.where((self.alphas > 0) & (self.alphas < self.C))[0]:
                if self._take_step(i, j):
                    return 1
            for j in np.random.permutation(len(self.alphas)):
                if self._take_step(i, j):
                    return 1
        return 0

    def fit(self, X, y):
        self.X = X
        self.y = np.where(y <= 0, -1, 1)
        n = X.shape[0]
        self.alphas = np.zeros(n)
        self.b = 0
        self.kernel_matrix = self._kernel(X)
        self.errors = np.array([self._compute_error_full(i) for i in range(n)])
        changed = 0; examine_all = True; it = 0
        while it < self.max_iter and (changed > 0 or examine_all):
            changed = sum(self._examine_example(i) for i in (range(n) if examine_all else np.where((self.alphas > 0) & (self.alphas < self.C))[0]))
            examine_all = not examine_all if examine_all else (changed == 0)
            it += 1
        sv_idx = np.where(self.alphas > self.tol)[0]
        self.support_vectors = X[sv_idx]
        self.support_vector_alphas = self.alphas[sv_idx]
        self.support_vector_y = self.y[sv_idx]
        return self

    def decision_function(self, X):
        K = self._kernel(X, self.support_vectors)
        return np.dot(K, self.support_vector_alphas * self.support_vector_y) + self.b

    def predict(self, X):
        return np.sign(self.decision_function(X))


# Compare linear/RBF/poly on make_moons
X, y = make_moons(n_samples=400, noise=0.2, random_state=42)
y_pm = np.where(y == 0, -1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_pm, test_size=0.2, random_state=42)

models = {
    'SMO  linear': SMO(kernel='linear', C=1.0, max_iter=500),
    'SMO  RBF':    SMO(kernel='rbf',    gamma=0.5, C=1.0, max_iter=500),
    'SMO  poly':   SMO(kernel='poly',   degree=3, gamma=0.1, C=1.0, max_iter=500),
    'sklearn RBF': SVC(kernel='rbf',    gamma=0.5, C=1.0),
}
results = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, m.predict(X_te))
    print(f"{name:18s}: acc={results[name]:.4f}")

4. Custom Kernels with scikit-learn SVC¶

sklearn’s SVC accepts a callable kernel: kernel=my_func where my_func(X, Y) returns the Gram matrix $K \in \mathbb{R}^{n \times m}$ .

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import euclidean_distances


def custom_rbf(X, Y, gamma=1.0):
    sq_dist = euclidean_distances(X, Y, squared=True)
    return np.exp(-gamma * sq_dist)


X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

from functools import partial
svm_custom = SVC(kernel=partial(custom_rbf, gamma=1.0), C=1.0)
svm_builtin = SVC(kernel='rbf', gamma=1.0, C=1.0)
svm_custom.fit(X_tr_s, y_tr)
svm_builtin.fit(X_tr_s, y_tr)
print(f"Custom RBF kernel : {svm_custom.score(X_te_s, y_te):.4f}")
print(f"Built-in RBF     : {svm_builtin.score(X_te_s, y_te):.4f}")

# Decision boundary comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, (m, label) in zip(axes, [(svm_custom, 'Custom RBF'), (svm_builtin, 'Built-in RBF')]):
    xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
    Z = m.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.25)
    ax.scatter(X_te_s[:,0], X_te_s[:,1], c=y_te, cmap=plt.cm.coolwarm, edgecolors='k', s=30)
    ax.set_title(f'{label}  acc={m.score(X_te_s, y_te):.3f}')
    ax.set_xlabel('x1'); ax.set_ylabel('x2')
plt.tight_layout()
plt.show()

5. Try It in the Browser¶

Compute kernel values by hand for small vectors.

import math

def rbf_kernel_scalar(xi, xj, gamma=1.0):
    sq_dist = sum((a - b)**2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

def poly_kernel_scalar(xi, xj, degree=3, gamma=1.0, coef0=1.0):
    dot = sum(a*b for a, b in zip(xi, xj))
    return (gamma * dot + coef0) ** degree

xi = [1.0, 0.5]
xj = [0.5, 1.5]
gamma = 0.5
print(f"RBF  K(xi,xj) gamma={gamma}: {rbf_kernel_scalar(xi, xj, gamma):.6f}")
print(f"Poly K(xi,xj) d=3:          {poly_kernel_scalar(xi, xj):.6f}")
print(f"Linear K(xi,xj):            {sum(a*b for a,b in zip(xi,xj)):.6f}")

Knowledge Check¶

The RBF kernel $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ satisfies Mercer's theorem because it is:¶

[ ] A) Always zero when $x_i \ne x_j$ [ ] B) Positive semi-definite for any finite dataset [ ] C) Only valid for 2-class problems [ ] D) Equivalent to a dot product in the original space

Check

When you pass a callable to sklearn's `SVC(kernel=my_func)`, `my_func(X, Y)` must return:¶

[ ] A) A 1-D array of kernel values for a single pair [ ] B) An $n \times m$ Gram matrix where $n = |X|, m = |Y|$ [ ] C) The feature-space coordinates $\phi(X)$ [ ] D) The pairwise Euclidean distances

Check

Exercises¶

Exercise 1 — Kernel Comparison on Moons¶

Train sklearn SVCs with kernel='linear', 'rbf', and 'poly' on make_moons(n_samples=500, noise=0.25). For each, report accuracy and number of support vectors. Which kernel uses the fewest support vectors?

Exercise 2 — Gamma Sensitivity¶

For the RBF kernel, train on make_moons with gamma ∈ {0.01, 0.1, 1.0, 10.0, 100.0}. Plot train vs test accuracy. At what gamma does overfitting begin?

%matplotlib inline
# Exercise 1 and 2: your code here
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)
# Your code here

Common Pitfalls¶

What’s Next — Soft-Margin SVMs¶

Real data has noise and overlap. In the next notebook we introduce slack variables $\xi_i \geq 0$ and the soft-margin objective:

\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(w^\top x_i + b) \geq 1 - \xi_i

(5)

The hyperparameter $C$ controls the bias-variance trade-off: small $C$ tolerates misclassification for a wider margin; large $C$ penalises violations heavily. Proceed to svm_softmargin.ipynb.