
Kernel SVMs — RBF, Polynomial, and Custom Kernels¶
What you will learn: the kernel trick, Mercer’s theorem, how RBF/polynomial/sigmoid kernels map data to high-dimensional feature spaces without computing that mapping explicitly, and how to implement and compare custom kernels with scikit-learn’s SVC.

Why Kernels Matter for Business¶
A credit-card fraud model trained on linearly-separable features was 78% accurate. After switching to an RBF-kernel SVM, accuracy rose to 94% — the kernel found non-linear fraud patterns invisible to a linear classifier.
The key insight: we never explicitly compute the high-dimensional mapping — only inner products, via the kernel function.
1. The Kernel Trick¶
The SVM dual objective is:
The only data dependency is the inner product . Replace it with any positive semi-definite kernel :
where is an implicit (possibly infinite-dimensional) feature map.
Mercer’s Theorem: A symmetric function is a valid kernel iff the Gram matrix is PSD for all datasets. This guarantees a exists.
| Kernel | Formula | Use case |
|---|---|---|
| Linear | Linearly separable, text | |
| Polynomial | Degree- interactions | |
| RBF/Gaussian | General non-linear | |
| Sigmoid | Neural network analogy |
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
def linear_kernel(x1, x2):
return np.dot(x1, x2.T)
def polynomial_kernel(x1, x2, degree=3, gamma=1.0, coef0=1.0):
x1_ = np.atleast_2d(x1)
x2_ = np.atleast_2d(x2)
return (gamma * np.dot(x1_, x2_.T) + coef0) ** degree
def rbf_kernel(x1, x2, gamma=1.0):
x1_ = np.atleast_2d(x1)
x2_ = np.atleast_2d(x2)
sq_distances = euclidean_distances(x1_, x2_, squared=True)
return np.exp(-gamma * sq_distances)
# Verify kernel values
vec1 = np.array([[1.0, 2.0]])
vec2 = np.array([[3.0, 4.0]])
print(f"Linear K(v1,v2) = {linear_kernel(vec1, vec2)[0,0]:.4f} (expect 11.0)")
print(f"Poly d=3 K(v1,v2) = {polynomial_kernel(vec1, vec2)[0,0]:.4f}")
print(f"RBF g=1 K(v1,v2) = {rbf_kernel(vec1, vec2)[0,0]:.6f}")
print(f"RBF g=1 K(v1,v1) = {rbf_kernel(vec1, vec1)[0,0]:.6f} (expect 1.0 — self-similarity)")
# Visualise RBF kernel as function of distance
dists = np.linspace(0, 4, 200)
gammas = [0.1, 0.5, 1.0, 2.0]
fig, ax = plt.subplots(figsize=(8, 4))
for g in gammas:
ax.plot(dists, np.exp(-g * dists**2), label=f'γ={g}')
ax.set_xlabel('||xi − xj||')
ax.set_ylabel('K(xi, xj)')
ax.set_title('RBF Kernel Similarity vs Distance')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
2. SVM Primal — Gradient Descent on Hinge Loss¶
The primal SVM minimises:
This is convex and differentiable almost everywhere. Gradient descent works directly on the hinge + regularisation objective.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
class SVMPrimal:
def __init__(self, learning_rate=0.01, lambda_param=0.01, n_iters=500):
self.lr = learning_rate
self.lambda_param = lambda_param
self.n_iters = n_iters
self.w = None
self.b = None
def fit(self, X, y):
y = np.where(y <= 0, -1, 1)
n_samples, n_features = X.shape
self.w = np.zeros(n_features)
self.b = 0
Xy = X * y[:, np.newaxis]
for _ in range(self.n_iters):
linear_model = np.dot(X, self.w) - self.b
margins = y * linear_model
mask = margins < 1
dw = 2 * self.lambda_param * self.w
if np.any(mask):
dw -= np.mean(Xy[mask], axis=0)
db = -np.mean(y[mask])
else:
db = 0
self.w -= self.lr * dw
self.b -= self.lr * db
def predict(self, X):
return np.sign(np.dot(X, self.w) - self.b)
# Linearly separable dataset
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
n_redundant=0, n_clusters_per_class=1, random_state=42)
y_pm = np.where(y == 0, -1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_pm, test_size=0.2, random_state=42)
primal = SVMPrimal(learning_rate=0.001, lambda_param=0.01, n_iters=1000)
primal.fit(X_tr, y_tr)
sklearn_svc = SVC(kernel='linear', C=50)
sklearn_svc.fit(X_tr, y_tr)
acc_primal = accuracy_score(y_te, primal.predict(X_te))
acc_sklearn = accuracy_score(y_te, sklearn_svc.predict(X_te))
print(f"SVMPrimal accuracy : {acc_primal:.4f}")
print(f"sklearn SVC linear : {acc_sklearn:.4f}")
# Plot decision boundaries side by side
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, (model, name) in zip(axes, [(primal, 'SVMPrimal'), (sklearn_svc, 'sklearn SVC')]):
h = 0.05
xx, yy = np.meshgrid(np.linspace(X[:,0].min()-1, X[:,0].max()+1, 200),
np.linspace(X[:,1].min()-1, X[:,1].max()+1, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.25)
ax.scatter(X_te[:,0], X_te[:,1], c=y_te, cmap=plt.cm.coolwarm, edgecolors='k', s=25)
ax.set_title(f'{name} acc={accuracy_score(y_te, model.predict(X_te)):.3f}')
ax.set_xlabel('x1'); ax.set_ylabel('x2')
plt.tight_layout()
plt.show()
3. Sequential Minimal Optimisation (SMO)¶
The SVM dual is a quadratic program with variables. SMO solves it by iteratively picking two Lagrange multipliers and solving the 2-variable sub-problem analytically:
where is the prediction error. After clipping to and updating , the bias is updated to maintain KKT conditions.
The full SMO below supports linear, RBF, and polynomial kernels via a precomputed kernel matrix.
%matplotlib inline
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
class SMO:
def __init__(self, C=1.0, tol=1e-3, max_iter=1000, kernel='linear', gamma=1.0, degree=3):
self.C = C
self.tol = tol
self.max_iter = max_iter
self.kernel = kernel
self.gamma = gamma
self.degree = degree
self.alphas = None
self.b = 0
self.X = None
self.y = None
self.kernel_matrix = None
self.support_vectors = None
self.support_vector_alphas = None
self.support_vector_y = None
def _kernel(self, X1, X2=None):
if X2 is None:
X2 = X1
if self.kernel == 'linear':
return np.dot(X1, X2.T)
elif self.kernel == 'rbf':
return np.exp(-self.gamma * cdist(X1, X2, 'sqeuclidean'))
elif self.kernel == 'poly':
return (self.gamma * np.dot(X1, X2.T) + 1.0) ** self.degree
return np.dot(X1, X2.T)
def _compute_error_full(self, i):
return np.sum(self.alphas * self.y * self.kernel_matrix[i]) + self.b - self.y[i]
def _compute_error(self, i):
if 0 < self.alphas[i] < self.C:
return self.errors[i]
return self._compute_error_full(i)
def _select_second_alpha(self, i, error_i):
valid = np.where((self.alphas > 0) & (self.alphas < self.C))[0]
if len(valid) > 1:
j = valid[np.argmax(np.abs(error_i - self.errors[valid]))]
if j != i:
return j
j = i
while j == i:
j = np.random.randint(0, len(self.alphas))
return j
def _take_step(self, i, j):
if i == j:
return 0
a_i, a_j = self.alphas[i], self.alphas[j]
y_i, y_j = self.y[i], self.y[j]
L = max(0, a_j - a_i) if y_i != y_j else max(0, a_i + a_j - self.C)
H = min(self.C, self.C + a_j - a_i) if y_i != y_j else min(self.C, a_i + a_j)
if L == H:
return 0
eta = 2*self.kernel_matrix[i,j] - self.kernel_matrix[i,i] - self.kernel_matrix[j,j]
if eta >= 0:
return 0
ei = self._compute_error(i); ej = self._compute_error(j)
self.alphas[j] = min(H, max(L, a_j - y_j*(ei - ej)/eta))
if abs(self.alphas[j] - a_j) < 1e-5:
return 0
self.alphas[i] += y_i*y_j*(a_j - self.alphas[j])
b1 = self.b - ei - y_i*(self.alphas[i]-a_i)*self.kernel_matrix[i,i] - y_j*(self.alphas[j]-a_j)*self.kernel_matrix[i,j]
b2 = self.b - ej - y_i*(self.alphas[i]-a_i)*self.kernel_matrix[i,j] - y_j*(self.alphas[j]-a_j)*self.kernel_matrix[j,j]
self.b = b1 if 0 < self.alphas[i] < self.C else (b2 if 0 < self.alphas[j] < self.C else (b1+b2)/2)
for k in np.where((self.alphas > 0) & (self.alphas < self.C))[0]:
self.errors[k] = self._compute_error_full(k)
return 1
def _examine_example(self, i):
r_i = self._compute_error(i) * self.y[i]
if (r_i < -self.tol and self.alphas[i] < self.C) or (r_i > self.tol and self.alphas[i] > 0):
ei = self._compute_error(i)
j = self._select_second_alpha(i, ei)
if self._take_step(i, j):
return 1
for j in np.where((self.alphas > 0) & (self.alphas < self.C))[0]:
if self._take_step(i, j):
return 1
for j in np.random.permutation(len(self.alphas)):
if self._take_step(i, j):
return 1
return 0
def fit(self, X, y):
self.X = X
self.y = np.where(y <= 0, -1, 1)
n = X.shape[0]
self.alphas = np.zeros(n)
self.b = 0
self.kernel_matrix = self._kernel(X)
self.errors = np.array([self._compute_error_full(i) for i in range(n)])
changed = 0; examine_all = True; it = 0
while it < self.max_iter and (changed > 0 or examine_all):
changed = sum(self._examine_example(i) for i in (range(n) if examine_all else np.where((self.alphas > 0) & (self.alphas < self.C))[0]))
examine_all = not examine_all if examine_all else (changed == 0)
it += 1
sv_idx = np.where(self.alphas > self.tol)[0]
self.support_vectors = X[sv_idx]
self.support_vector_alphas = self.alphas[sv_idx]
self.support_vector_y = self.y[sv_idx]
return self
def decision_function(self, X):
K = self._kernel(X, self.support_vectors)
return np.dot(K, self.support_vector_alphas * self.support_vector_y) + self.b
def predict(self, X):
return np.sign(self.decision_function(X))
# Compare linear/RBF/poly on make_moons
X, y = make_moons(n_samples=400, noise=0.2, random_state=42)
y_pm = np.where(y == 0, -1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_pm, test_size=0.2, random_state=42)
models = {
'SMO linear': SMO(kernel='linear', C=1.0, max_iter=500),
'SMO RBF': SMO(kernel='rbf', gamma=0.5, C=1.0, max_iter=500),
'SMO poly': SMO(kernel='poly', degree=3, gamma=0.1, C=1.0, max_iter=500),
'sklearn RBF': SVC(kernel='rbf', gamma=0.5, C=1.0),
}
results = {}
for name, m in models.items():
m.fit(X_tr, y_tr)
results[name] = accuracy_score(y_te, m.predict(X_te))
print(f"{name:18s}: acc={results[name]:.4f}")
4. Custom Kernels with scikit-learn SVC¶
sklearn’s SVC accepts a callable kernel: kernel=my_func where my_func(X, Y) returns the Gram matrix .
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import euclidean_distances
def custom_rbf(X, Y, gamma=1.0):
sq_dist = euclidean_distances(X, Y, squared=True)
return np.exp(-gamma * sq_dist)
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)
from functools import partial
svm_custom = SVC(kernel=partial(custom_rbf, gamma=1.0), C=1.0)
svm_builtin = SVC(kernel='rbf', gamma=1.0, C=1.0)
svm_custom.fit(X_tr_s, y_tr)
svm_builtin.fit(X_tr_s, y_tr)
print(f"Custom RBF kernel : {svm_custom.score(X_te_s, y_te):.4f}")
print(f"Built-in RBF : {svm_builtin.score(X_te_s, y_te):.4f}")
# Decision boundary comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, (m, label) in zip(axes, [(svm_custom, 'Custom RBF'), (svm_builtin, 'Built-in RBF')]):
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
Z = m.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.25)
ax.scatter(X_te_s[:,0], X_te_s[:,1], c=y_te, cmap=plt.cm.coolwarm, edgecolors='k', s=30)
ax.set_title(f'{label} acc={m.score(X_te_s, y_te):.3f}')
ax.set_xlabel('x1'); ax.set_ylabel('x2')
plt.tight_layout()
plt.show()
5. Try It in the Browser¶
Compute kernel values by hand for small vectors.
import math
def rbf_kernel_scalar(xi, xj, gamma=1.0):
sq_dist = sum((a - b)**2 for a, b in zip(xi, xj))
return math.exp(-gamma * sq_dist)
def poly_kernel_scalar(xi, xj, degree=3, gamma=1.0, coef0=1.0):
dot = sum(a*b for a, b in zip(xi, xj))
return (gamma * dot + coef0) ** degree
xi = [1.0, 0.5]
xj = [0.5, 1.5]
gamma = 0.5
print(f"RBF K(xi,xj) gamma={gamma}: {rbf_kernel_scalar(xi, xj, gamma):.6f}")
print(f"Poly K(xi,xj) d=3: {poly_kernel_scalar(xi, xj):.6f}")
print(f"Linear K(xi,xj): {sum(a*b for a,b in zip(xi,xj)):.6f}")Knowledge Check¶
The RBF kernel $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ satisfies Mercer's theorem because it is:¶
CheckWhen you pass a callable to sklearn's `SVC(kernel=my_func)`, `my_func(X, Y)` must return:¶
CheckExercises¶
Exercise 1 — Kernel Comparison on Moons¶
Train sklearn SVCs with kernel='linear', 'rbf', and 'poly' on make_moons(n_samples=500, noise=0.25). For each, report accuracy and number of support vectors. Which kernel uses the fewest support vectors?
Exercise 2 — Gamma Sensitivity¶
For the RBF kernel, train on make_moons with gamma ∈ {0.01, 0.1, 1.0, 10.0, 100.0}. Plot train vs test accuracy. At what gamma does overfitting begin?
%matplotlib inline
# Exercise 1 and 2: your code here
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)
# Your code here
Common Pitfalls¶
Summary
The kernel trick replaces dot products in the SVM dual with , enabling non-linear boundaries at no extra inference cost.
Mercer’s theorem: a valid kernel must produce a PSD Gram matrix.
Common kernels: linear (), polynomial (), RBF ().
SMO solves the dual analytically two variables at a time; the full implementation works with any Mercer kernel via a precomputed kernel matrix.
sklearn custom kernels: pass a callable
f(X,Y) -> n×m matrixtoSVC(kernel=f).Always scale features before kernel SVMs and tune , , with cross-validation.

What’s Next — Soft-Margin SVMs¶
Real data has noise and overlap. In the next notebook we introduce slack variables and the soft-margin objective:
The hyperparameter controls the bias-variance trade-off: small tolerates misclassification for a wider margin; large penalises violations heavily. Proceed to svm_softmargin.ipynb.