Soft-Margin SVMs — Slack Variables, Hinge Loss, and the C Trade-off

What you will learn: how slack variables $\xi_i$ relax the hard-margin constraint, the primal and dual soft-margin formulations, hinge loss, the effect of $C$ on bias-variance, and how to tune $C$ with cross-validation.

Business hook

Why Soft Margins Save Real Models

A fraud-detection team trained a hard-margin SVM and got an error: “No feasible solution: data is not linearly separable.” Most real datasets have overlap and noise, so no hyperplane separates the classes perfectly. The soft-margin SVM handles this by allowing margin violations and penalising them, with the strength of the penalty set by the regularisation parameter $C$.

$C$ is usually the first hyperparameter to tune in an SVM (together with $\gamma$ when an RBF kernel is used).

1. Slack Variables and the Soft-Margin Primal

Hard-margin SVM requires every point to satisfy:

$$y_i(w^\top x_i + b) \geq 1$$

Soft-margin SVM introduces slack $\xi_i \geq 0$:

$$y_i(w^\top x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

The primal objective becomes:

$$\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i$$

  • $\xi_i = 0$: point correctly classified with margin $\geq 1$ (no penalty)

  • $0 < \xi_i \leq 1$: point inside the margin (small penalty)

  • $\xi_i > 1$: point misclassified (large penalty)

At the optimum, each slack equals the hinge loss of its point: $\xi_i = \max(0, 1 - y_i f(x_i))$, where $f(x) = w^\top x + b$.
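As a rough illustration of the primal objective, the sketch below computes the smallest feasible slack for each point and the resulting objective value for a fixed, hand-picked $(w, b)$. The helper name `slack_and_objective` and the toy data are assumptions made for this example; nothing here is optimised by a solver.

```python
import numpy as np

def slack_and_objective(w, b, X, y, C):
    """Return per-point slack and the soft-margin primal value for a fixed (w, b)."""
    scores = X @ w + b                       # f(x_i) = w^T x_i + b
    xi = np.maximum(0.0, 1.0 - y * scores)   # smallest slack satisfying the constraints
    objective = 0.5 * float(w @ w) + C * float(xi.sum())
    return xi, objective

# Hypothetical toy data and a hand-picked separator (not the SVM solution)
X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-2.0, 0.0]])
y = np.array([1, 1, 1, -1])
w = np.array([1.0, 0.0])
b = 0.0

xi, obj = slack_and_objective(w, b, X, y, C=1.0)
print("slack     :", xi)    # slacks 0, 0.5, 1.5, 0
print("objective :", obj)   # 0.5 * 1 + 1.0 * 2.0 = 2.5
```

The third point has $\xi_i = 1.5 > 1$, so it is misclassified by this separator; the second has $\xi_i = 0.5$, so it sits inside the margin on the correct side.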

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

margin_scores = np.linspace(-2, 3, 300)  # y_i * f(x_i)
hinge = np.maximum(0, 1 - margin_scores)
logistic_loss = np.log1p(np.exp(-margin_scores))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(margin_scores, hinge, 'b-', lw=2, label='Hinge loss (SVM)')
ax.plot(margin_scores, logistic_loss, 'r--', lw=2, label='Logistic loss (LR)')
ax.axvline(0, color='k', lw=0.8, linestyle=':')
ax.axvline(1, color='gray', lw=0.8, linestyle=':', label='Margin boundary (score=1)')
ax.fill_betweenx([0, 2.5], -2, 0, alpha=0.05, color='red', label='Misclassified region')
ax.fill_betweenx([0, 2.5],  0, 1, alpha=0.05, color='orange', label='Inside margin')
ax.set_xlabel('Margin score  $y_i f(x_i)$')
ax.set_ylabel('Loss')
ax.set_ylim(-0.1, 2.5)
ax.set_title('Hinge loss vs Logistic loss')
ax.legend(fontsize=9)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("Hinge at score=0.5 :", max(0, 1 - 0.5))   # = 0.5
print("Hinge at score=1.0 :", max(0, 1 - 1.0))   # = 0.0 (on margin)
print("Hinge at score=1.5 :", max(0, 1 - 1.5))   # = 0.0 (correctly classified, no penalty)

2. Soft-Margin Dual

Forming the Lagrangian of the primal and applying the KKT conditions yields the dual:

$$\max_{\alpha}\; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

$$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0$$

Compare with the hard-margin dual: the only difference is the box constraint $\alpha_i \leq C$ (hard margin had $\alpha_i \geq 0$ only).
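A short sketch of where the upper bound comes from, written for the linear case. The multipliers $\mu_i \geq 0$ for the constraints $\xi_i \geq 0$ are notation introduced here for the derivation and are not used elsewhere in this article.

```latex
\begin{aligned}
L(w,b,\xi,\alpha,\mu) &= \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i
  - \sum_i \alpha_i \bigl[ y_i(w^\top x_i + b) - 1 + \xi_i \bigr]
  - \sum_i \mu_i \xi_i \\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i \\
\frac{\partial L}{\partial b} = 0 &\;\Rightarrow\; \sum_i \alpha_i y_i = 0 \\
\frac{\partial L}{\partial \xi_i} = 0 &\;\Rightarrow\; C - \alpha_i - \mu_i = 0
\end{aligned}
```

Because $\mu_i \geq 0$, the last condition gives $\alpha_i = C - \mu_i \leq C$. Substituting the first two conditions back into $L$ removes $w$, $b$, and $\xi$ and leaves the dual objective stated above.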

| $\alpha_i$ | Point type |
|---|---|
| $\alpha_i = 0$ | Non-support vector, correctly classified outside margin |
| $0 < \alpha_i < C$ | Support vector, on the margin ($\xi_i = 0$) |
| $\alpha_i = C$ | Support vector, inside margin or misclassified ($\xi_i > 0$) |

The decision function and support vector extraction are identical to the hard-margin case; only the bound on $\alpha_i$ changes.
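The three cases can be inspected on a fitted model. The sketch below assumes scikit-learn's `SVC`, whose `dual_coef_` attribute stores $y_i \alpha_i$ for the support vectors only; the dataset, the tolerance `tol`, and the variable names are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

C = 1.0
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=0.8, random_state=0)
X = StandardScaler().fit_transform(X)
clf = SVC(kernel="linear", C=C).fit(X, y)

signed_alpha = clf.dual_coef_[0]   # y_i * alpha_i, one entry per support vector
alpha = np.abs(signed_alpha)       # recover alpha_i (non-support vectors have alpha_i = 0)

tol = 1e-6                         # illustrative tolerance for "alpha_i == C"
bounded = alpha >= C - tol         # alpha_i = C: inside margin or misclassified
free = ~bounded                    # 0 < alpha_i < C: on the margin

print("non-support vectors (alpha = 0):", X.shape[0] - alpha.size)
print("free SVs (0 < alpha < C)       :", int(free.sum()))
print("bounded SVs (alpha = C)        :", int(bounded.sum()))

# Check the dual constraints and rebuild w from the dual variables
w_from_dual = signed_alpha @ clf.support_vectors_
print("sum_i alpha_i y_i              :", signed_alpha.sum())
print("w rebuilt from dual matches    :", np.allclose(w_from_dual, clf.coef_[0]))
```

On overlapping data most support vectors typically end up at the bound $\alpha_i = C$, and only a few sit exactly on the margin.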

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Dataset with some overlap
np.random.seed(42)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1,
                            class_sep=0.8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
margins, n_svs, train_accs, test_accs = [], [], [], []

for C in C_values:
    m = SVC(kernel='linear', C=C)
    m.fit(X_tr_s, y_tr)
    w = m.coef_[0]
    margin = 2 / np.linalg.norm(w)
    margins.append(margin)
    n_svs.append(m.n_support_.sum())
    train_accs.append(accuracy_score(y_tr, m.predict(X_tr_s)))
    test_accs.append(accuracy_score(y_te, m.predict(X_te_s)))

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].semilogx(C_values, margins, 'b-o')
axes[0].set_xlabel('C'); axes[0].set_ylabel('Margin width')
axes[0].set_title('Margin vs C'); axes[0].grid(alpha=0.3)

axes[1].semilogx(C_values, n_svs, 'r-o')
axes[1].set_xlabel('C'); axes[1].set_ylabel('# Support vectors')
axes[1].set_title('Support Vectors vs C'); axes[1].grid(alpha=0.3)

axes[2].semilogx(C_values, train_accs, 'g-o', label='Train')
axes[2].semilogx(C_values, test_accs, 'b-s', label='Test')
axes[2].set_xlabel('C'); axes[2].set_ylabel('Accuracy')
axes[2].set_title('Accuracy vs C'); axes[2].legend(); axes[2].grid(alpha=0.3)

plt.suptitle('Effect of C on Soft-Margin SVM (linear kernel)', fontsize=11)
plt.tight_layout()
plt.show()

for C, m_w, sv, tr, te in zip(C_values, margins, n_svs, train_accs, test_accs):
    print(f"C={C:7.3f}: margin={m_w:.3f}  SVs={sv:3d}  train={tr:.3f}  test={te:.3f}")
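To see why the margin tends to shrink as $C$ grows, it can help to compare the primal objective of two candidate separators by hand. The sketch below uses a hypothetical 1-D dataset and two hand-picked $(w, b)$ pairs (neither is solver output); `primal_objective` is a helper defined only for this example.

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """Soft-margin primal value 0.5*||w||^2 + C * sum(hinge) for a fixed (w, b)."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * float(w @ w) + C * float(xi.sum())

# Hypothetical 1-D toy data; the point at x = -0.2 is a "noisy" positive
X = np.array([[3.0], [1.0], [-0.2], [-1.0], [-3.0]])
y = np.array([1, 1, 1, -1, -1])

# Two hand-picked candidates
wide = (np.array([0.5]), 0.0)     # margin width 2/0.5 = 4.0, accepts some violations
narrow = (np.array([2.5]), 1.5)   # margin width 2/2.5 = 0.8, no violations

for C in (0.1, 10.0):
    obj_wide = primal_objective(*wide, X, y, C)
    obj_narrow = primal_objective(*narrow, X, y, C)
    winner = "wide" if obj_wide < obj_narrow else "narrow"
    print(f"C={C:5.1f}  wide={obj_wide:.3f}  narrow={obj_narrow:.3f}  -> {winner}")
```

With $C = 0.1$ the wide-margin candidate has the lower objective because its violations are cheap; with $C = 10$ the same violations dominate and the narrow-margin candidate wins.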

3. Choosing C with Cross-Validation

$C$ controls a bias-variance trade-off. A small $C$ gives a wide margin that tolerates more violations (higher bias); a large $C$ gives a narrow margin that penalises violations heavily and tries to classify every training point correctly (higher variance). Use GridSearchCV or cross_val_score to pick the value with the best cross-validated score.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                            n_redundant=0, class_sep=0.9, random_state=0)

C_range = np.logspace(-3, 3, 13)
cv_scores = []
cv_stds = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for C in C_range:
    pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC(kernel='rbf', C=C, gamma='scale'))])
    sc = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
    cv_scores.append(sc.mean())
    cv_stds.append(sc.std())

cv_scores = np.array(cv_scores)
cv_stds = np.array(cv_stds)
best_C = C_range[np.argmax(cv_scores)]

fig, ax = plt.subplots(figsize=(8, 4))
ax.semilogx(C_range, cv_scores, 'b-o', label='CV accuracy mean')
ax.fill_between(C_range, cv_scores - cv_stds, cv_scores + cv_stds, alpha=0.2)
ax.axvline(best_C, color='r', linestyle='--', label=f'Best C={best_C:.3f}')
ax.set_xlabel('C'); ax.set_ylabel('CV Accuracy')
ax.set_title('5-fold CV accuracy vs C  (RBF kernel)')
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Best C = {best_C:.4f}  mean CV acc = {cv_scores.max():.4f}")
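The manual loop above can also be expressed with GridSearchCV. This is a minimal sketch under the same assumptions as the loop (same synthetic data, RBF kernel, 5-fold stratified CV); the grid values are an arbitrary choice for illustration, and `svm__C` follows scikit-learn's `step__parameter` naming for pipeline parameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.9, random_state=0)

# Scaling lives inside the pipeline so it is re-fit on each training fold
pipe = Pipeline([("scaler", StandardScaler()),
                 ("svm", SVC(kernel="rbf", gamma="scale"))])

param_grid = {"svm__C": [0.01, 0.1, 1.0, 10.0, 100.0]}  # illustrative grid
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters :", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 4))
```

By default GridSearchCV refits the best pipeline on all of the data, so `search.best_estimator_` can be used for prediction directly.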

4. Try It in the Browser

Compute soft-margin hinge loss and slack values for individual points.

def hinge_loss(y_true, score):
    return max(0, 1 - y_true * score)

def total_loss(C, points):
    # points: list of (y_true, score)
    slack = sum(hinge_loss(y, s) for y, s in points)
    # w^2 component omitted; just showing C * sum(xi)
    return C * slack

# Example: 5 points with various margin scores
points = [
    (+1,  1.5),   # correctly classified outside margin — no loss
    (+1,  0.5),   # inside margin — small loss
    (+1, -0.3),   # misclassified — large loss
    (-1, -1.2),   # correctly classified outside margin — no loss
    (-1,  0.2),   # misclassified — large loss
]
print("C = 1.0  total hinge penalty:", total_loss(1.0, points))
print("C = 10.0 total hinge penalty:", total_loss(10.0, points))
print()
for y, s in points:
    xi = hinge_loss(y, s)
    status = "OK" if xi == 0 else ("inside margin" if xi <= 1 else "MISCLASSIFIED")
    print(f"  y={y:+d}  score={s:+.1f}  xi={xi:.1f}  [{status}]")

Knowledge Check

In the soft-margin SVM dual, the constraint $\alpha_i \leq C$ (instead of just $\alpha_i \geq 0$) arises because:

[ ] A) The kernel trick requires bounded multipliers
[ ] B) The slack penalty introduces an upper bound on $\alpha_i$ through KKT complementarity
[ ] C) The dataset is always linearly separable
[ ] D) $C$ is the learning rate of gradient descent

If you increase $C$ from 0.01 to 1000 on a noisy dataset, you would expect:

[ ] A) Wider margin and fewer support vectors
[ ] B) Narrower margin, fewer tolerated margin violations, potential overfitting
[ ] C) The model to ignore noisy points entirely
[ ] D) No change if the kernel is RBF

Exercises

Exercise 1 — Hard vs Soft Margin

Generate a dataset with 5% overlapping points (make_classification(..., class_sep=0.5)). Train SVC(kernel='linear', C=1e6) (near-hard margin) and SVC(kernel='linear', C=0.01) (soft). Compare: number of support vectors, train accuracy, and test accuracy. Which generalises better?

Exercise 2

On make_moons(n_samples=600, noise=0.3), perform GridSearchCV over C ∈ {0.1, 1, 10, 100} and gamma ∈ {0.01, 0.1, 1, 10} with kernel='rbf'. Report the best parameters and the cross-validated score.

%matplotlib inline
# Exercise 1 and 2: your code here
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons, make_classification
from sklearn.preprocessing import StandardScaler
# Your code here

Common Pitfalls

Summary
  • Slack variables $\xi_i \geq 0$ relax the margin constraint to $y_i f(x_i) \geq 1 - \xi_i$.

  • The hinge loss $\max(0, 1 - y_i f(x_i))$ is the slack at each point.

  • The soft-margin primal minimises $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$.

  • The dual is identical to hard-margin except $\alpha_i \in [0, C]$ instead of $[0, \infty)$.

  • $C$ controls bias-variance: small $C$ = wide margin + more violations; large $C$ = narrow margin + fewer violations.

  • Always choose $C$ (and $\gamma$ for RBF) via cross-validation.

Next steps

What’s Next — SVM Lab

You now have the full SVM toolkit: max-margin geometry, kernel trick, SMO algorithm, and soft-margin regularisation. The next notebook puts it all together in an end-to-end NLP application: sentiment classification of customer reviews using linear and kernel SVMs, TF-IDF features, and full evaluation with precision/recall/F1. Proceed to svm_lab.ipynb.