
Soft-Margin SVMs: Slack Variables, Hinge Loss, and the C Trade-off
What you will learn: how slack variables relax the hard-margin constraint, the primal and dual soft-margin formulations, hinge loss, the effect of $C$ on the bias-variance trade-off, and how to tune $C$ with cross-validation.

Why Soft Margins Save Real Models
A fraud-detection team trained a hard-margin SVM and got an error: “No feasible solution: data is not linearly separable.” Most real datasets contain overlap and noise, so no hyperplane separates the classes perfectly. The soft-margin SVM handles this by allowing margin violations at a cost, with the total penalty weighted by the regularisation parameter $C$.
The parameter $C$ is one of the most important hyperparameters to tune for an SVM.
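1. Slack Variables and the Soft-Margin Primal
Hard-margin SVM requires every point to satisfy:
$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1$$
Soft-margin SVM introduces a slack variable $\xi_i \geq 0$ for each point:
$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i$$
The primal objective becomes:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
$\xi_i = 0$: point correctly classified with margin (no penalty)
$0 < \xi_i \leq 1$: point inside the margin but not on the wrong side (small penalty)
$\xi_i > 1$: point misclassified (large penalty)
At the optimum the slack is exactly the hinge loss: $\xi_i = \max(0,\, 1 - y_i f(\mathbf{x}_i))$, where $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$.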
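To make the slack definition concrete, here is a minimal sketch that evaluates $\xi_i$ and the primal objective for a fixed classifier. The weights, bias, and four toy points are hypothetical values chosen for easy arithmetic; nothing is being fitted.

```python
import numpy as np

# Hypothetical linear classifier f(x) = w.x + b (illustrative values, not fitted)
w = np.array([1.0, -1.0])
b = 0.0
C = 1.0

# Four toy points with labels in {-1, +1}
X = np.array([
    [2.0, 0.0],    # f = 2.0
    [0.5, 0.0],    # f = 0.5
    [-0.5, 0.0],   # f = -0.5
    [0.0, 3.0],    # f = -3.0
])
y = np.array([1, 1, 1, -1])

scores = X @ w + b                           # f(x_i)
slack = np.maximum(0.0, 1.0 - y * scores)    # xi_i = max(0, 1 - y_i f(x_i))
objective = 0.5 * w @ w + C * slack.sum()    # 0.5 * ||w||^2 + C * sum(xi_i)

print("slack:", slack)
print("primal objective:", objective)
```

The slacks come out as 0, 0.5, 1.5 and 0: the second point sits inside the margin, the third is misclassified, and the other two pay no penalty. With $\|\mathbf{w}\|^2 = 2$ the objective is $1 + 1 \cdot 2 = 3$.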
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
margin_scores = np.linspace(-2, 3, 300) # y_i * f(x_i)
hinge = np.maximum(0, 1 - margin_scores)
logistic_loss = np.log1p(np.exp(-margin_scores))
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(margin_scores, hinge, 'b-', lw=2, label='Hinge loss (SVM)')
ax.plot(margin_scores, logistic_loss, 'r--', lw=2, label='Logistic loss (LR)')
ax.axvline(0, color='k', lw=0.8, linestyle=':')
ax.axvline(1, color='gray', lw=0.8, linestyle=':', label='Margin boundary (score=1)')
ax.fill_betweenx([0, 2.5], -2, 0, alpha=0.05, color='red', label='Misclassified region')
ax.fill_betweenx([0, 2.5], 0, 1, alpha=0.05, color='orange', label='Inside margin')
ax.set_xlabel('Margin score $y_i f(x_i)$')
ax.set_ylabel('Loss')
ax.set_ylim(-0.1, 2.5)
ax.set_title('Hinge loss vs Logistic loss')
ax.legend(fontsize=9)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("Hinge at score=0.5 :", max(0, 1 - 0.5)) # = 0.5
print("Hinge at score=1.0 :", max(0, 1 - 1.0)) # = 0.0 (on margin)
print("Hinge at score=1.5 :", max(0, 1 - 1.5)) # = 0.0 (correctly classified, no penalty)
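2. Soft-Margin Dual
The KKT conditions yield the dual:
$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
Compare with the hard-margin dual: the only difference is the box constraint $0 \leq \alpha_i \leq C$ (hard margin had only $\alpha_i \geq 0$).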
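The origin of the box constraint can be seen in a short derivation. This is a sketch of the standard Lagrangian argument; the multiplier $\mu_i$ for the constraint $\xi_i \geq 0$ is a symbol introduced here and does not appear elsewhere in this notebook.

$$
\mathcal{L} = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i}\xi_i - \sum_{i}\alpha_i\left[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i\right] - \sum_{i}\mu_i\xi_i, \qquad \alpha_i \geq 0,\; \mu_i \geq 0
$$

$$
\frac{\partial\mathcal{L}}{\partial\mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i}\alpha_i y_i \mathbf{x}_i, \qquad
\frac{\partial\mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_{i}\alpha_i y_i = 0, \qquad
\frac{\partial\mathcal{L}}{\partial\xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \mu_i = 0
$$

Because $\mu_i \geq 0$, the last condition gives $\alpha_i = C - \mu_i \leq C$. Substituting the three conditions back into $\mathcal{L}$ removes $\mathbf{w}$, $b$, $\xi_i$ and $\mu_i$, so $C$ survives only as the upper bound on each $\alpha_i$.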
| $\alpha_i$ | Point type |
|---|---|
| $\alpha_i = 0$ | Non-support vector, correctly classified outside margin |
| $0 < \alpha_i < C$ | Support vector, on the margin ($\xi_i = 0$) |
| $\alpha_i = C$ | Support vector, inside margin or misclassified ($\xi_i > 0$) |
The decision function and support vector extraction are identical to the hard-margin case; only the upper bound on $\alpha_i$ changes.
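The three cases can be checked on a fitted model. The sketch below relies on scikit-learn's `dual_coef_` attribute, which stores $y_i \alpha_i$ for the support vectors only; the dataset and the choice $C = 1$ are arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Arbitrary overlapping 2-D dataset (illustrative settings)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=0.8, random_state=0)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

# dual_coef_ holds y_i * alpha_i for support vectors, so |dual_coef_| = alpha_i
alpha = np.abs(clf.dual_coef_[0])

at_bound = np.isclose(alpha, C)   # alpha_i = C: inside the margin or misclassified
on_margin = ~at_bound             # 0 < alpha_i < C: exactly on the margin

print("support vectors      :", alpha.size)
print("  on the margin      :", int(on_margin.sum()))
print("  at the bound (= C) :", int(at_bound.sum()))
print("sum_i alpha_i * y_i  :", clf.dual_coef_.sum())
```

Points with $\alpha_i = 0$ do not appear in `dual_coef_` at all, which is why only two of the three cases show up. The last printed value should be zero up to floating-point error, reflecting the equality constraint of the dual.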
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
# Dataset with some overlap
np.random.seed(42)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=0.8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)
C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
margins, n_svs, train_accs, test_accs = [], [], [], []
for C in C_values:
    m = SVC(kernel='linear', C=C)
    m.fit(X_tr_s, y_tr)
    w = m.coef_[0]
    margin = 2 / np.linalg.norm(w)  # margin width = 2 / ||w||
    margins.append(margin)
    n_svs.append(m.n_support_.sum())  # support vectors summed over both classes
    train_accs.append(accuracy_score(y_tr, m.predict(X_tr_s)))
    test_accs.append(accuracy_score(y_te, m.predict(X_te_s)))
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].semilogx(C_values, margins, 'b-o')
axes[0].set_xlabel('C'); axes[0].set_ylabel('Margin width')
axes[0].set_title('Margin vs C'); axes[0].grid(alpha=0.3)
axes[1].semilogx(C_values, n_svs, 'r-o')
axes[1].set_xlabel('C'); axes[1].set_ylabel('# Support vectors')
axes[1].set_title('Support Vectors vs C'); axes[1].grid(alpha=0.3)
axes[2].semilogx(C_values, train_accs, 'g-o', label='Train')
axes[2].semilogx(C_values, test_accs, 'b-s', label='Test')
axes[2].set_xlabel('C'); axes[2].set_ylabel('Accuracy')
axes[2].set_title('Accuracy vs C'); axes[2].legend(); axes[2].grid(alpha=0.3)
plt.suptitle('Effect of C on Soft-Margin SVM (linear kernel)', fontsize=11)
plt.tight_layout()
plt.show()
for C, m_w, sv, tr, te in zip(C_values, margins, n_svs, train_accs, test_accs):
    print(f"C={C:7.3f}: margin={m_w:.3f} SVs={sv:3d} train={tr:.3f} test={te:.3f}")
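The trend in these plots can also be verified by hand on a problem small enough to solve by brute force. The sketch below is a 1-D toy with the bias fixed at zero, so the primal objective depends on a single weight and its minimiser can be read off a grid. The seven data points and the grid resolution are assumptions made for illustration.

```python
import numpy as np

# Toy 1-D data; the last point is a deliberately mislabelled outlier
x = np.array([2.0, 1.0, 0.5, -0.5, -1.0, -2.0, -0.5])
y = np.array([1, 1, 1, -1, -1, -1, 1])

def primal_objective(w, C):
    """0.5 * w^2 + C * sum of hinge losses for f(x) = w * x (bias fixed at 0)."""
    slack = np.maximum(0.0, 1.0 - y * w * x)
    return 0.5 * w**2 + C * slack.sum()

w_grid = np.linspace(0.0, 3.0, 3001)  # brute-force search with step 0.001

results = {}
for C in [0.01, 1.0, 100.0]:
    values = np.array([primal_objective(w, C) for w in w_grid])
    w_best = w_grid[np.argmin(values)]
    # Count points with non-zero slack (small tolerance for points exactly on the margin)
    n_violations = int(np.sum(y * w_best * x < 1.0 - 1e-9))
    results[C] = (w_best, n_violations)
    print(f"C={C:6.2f}  w*={w_best:.3f}  margin width={2 / w_best:.2f}  violations={n_violations}")
```

The minimiser moves from about 0.065 at $C = 0.01$ to 1.0 at $C = 1$ and 2.0 at $C = 100$, while the number of points with non-zero slack drops from 7 to 3 to 1. Only the mislabelled outlier remains in violation at the largest $C$, and the margin has shrunk to accommodate everything else.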
3. Choosing C with Cross-Validation
$C$ controls the bias-variance trade-off: a small $C$ gives a wide margin but tolerates more violations (higher bias); a large $C$ gives a narrow margin and penalises every violation heavily, so the boundary bends to fit individual training points (higher variance). Use GridSearchCV or cross_val_score to select $C$.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.9, random_state=0)
C_range = np.logspace(-3, 3, 13)
cv_scores = []
cv_stds = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for C in C_range:
    # Scaling lives inside the pipeline so it is re-fitted on each training fold
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('svm', SVC(kernel='rbf', C=C, gamma='scale'))])
    sc = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
    cv_scores.append(sc.mean())
    cv_stds.append(sc.std())
cv_scores = np.array(cv_scores)
cv_stds = np.array(cv_stds)
best_C = C_range[np.argmax(cv_scores)]
fig, ax = plt.subplots(figsize=(8, 4))
ax.semilogx(C_range, cv_scores, 'b-o', label='CV accuracy mean')
ax.fill_between(C_range, cv_scores - cv_stds, cv_scores + cv_stds, alpha=0.2)
ax.axvline(best_C, color='r', linestyle='--', label=f'Best C={best_C:.3f}')
ax.set_xlabel('C'); ax.set_ylabel('CV Accuracy')
ax.set_title('5-fold CV accuracy vs C (RBF kernel)')
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Best C = {best_C:.4f} mean CV acc = {cv_scores.max():.4f}")
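The same search can be written more compactly with GridSearchCV, which also refits the best pipeline on the full data. This is a sketch under the same assumptions as the loop above (same synthetic dataset, same grid of $C$ values, same fold settings); the step name `svm` in the pipeline is what makes the parameter key `svm__C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.9, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("svm", SVC(kernel="rbf", gamma="scale"))])
param_grid = {"svm__C": np.logspace(-3, 3, 13)}   # <step name>__<parameter>
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("best C:", search.best_params_["svm__C"])
print("mean CV accuracy:", round(search.best_score_, 4))
```

After fitting, `search.best_estimator_` is the whole pipeline (scaler plus SVM) trained on all of `X`, so it can be used directly for prediction.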
4. Try It in the Browser
Compute soft-margin hinge loss and slack values for individual points.
def hinge_loss(y_true, score):
    # Hinge loss for one point: zero once y * f(x) reaches the margin at 1
    return max(0, 1 - y_true * score)

def total_loss(C, points):
    # points: list of (y_true, score)
    slack = sum(hinge_loss(y, s) for y, s in points)
    # 0.5 * ||w||^2 term omitted; just showing C * sum(xi)
    return C * slack

# Example: 5 points with various margin scores
points = [
    (+1, 1.5),   # correctly classified outside margin: no loss
    (+1, 0.5),   # inside margin: small loss
    (+1, -0.3),  # misclassified: large loss
    (-1, -1.2),  # correctly classified outside margin: no loss
    (-1, 0.2),   # misclassified: large loss
]
print("C = 1.0 total hinge penalty:", total_loss(1.0, points))
print("C = 10.0 total hinge penalty:", total_loss(10.0, points))
print()
for y, s in points:
    xi = hinge_loss(y, s)
    status = "OK" if xi == 0 else ("inside margin" if xi <= 1 else "MISCLASSIFIED")
    print(f" y={y:+d} score={s:+.1f} xi={xi:.1f} [{status}]")
Knowledge Check
In the soft-margin SVM dual, the constraint $\alpha_i \leq C$ (instead of just $\alpha_i \geq 0$) arises because:
If you increase $C$ from 0.01 to 1000 on a noisy dataset, you would expect:
Exercises
Exercise 1: Hard vs Soft Margin
Generate a dataset with overlapping classes (for example make_classification(..., class_sep=0.5)). Train SVC(kernel='linear', C=1e6) (near-hard margin) and SVC(kernel='linear', C=0.01) (soft). Compare the number of support vectors, train accuracy, and test accuracy. Which generalises better?
Exercise 2: C and Gamma Grid Search
On make_moons(n_samples=600, noise=0.3), perform GridSearchCV over C ∈ [0.1, 1, 10, 100] and gamma ∈ [0.01, 0.1, 1, 10] with kernel='rbf'. Report the best parameters and cross-validated score.
%matplotlib inline
# Exercise 1 and 2: your code here
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons, make_classification
from sklearn.preprocessing import StandardScaler
# Your code here
Common Pitfalls
Summary
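Slack variables relax the margin constraint to $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i$ with $\xi_i \geq 0$.
The hinge loss $\max(0,\, 1 - y_i f(\mathbf{x}_i))$ is the slack at each point.
The soft-margin primal minimises $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i$.
The dual is identical to hard-margin except $0 \leq \alpha_i \leq C$ instead of $\alpha_i \geq 0$.
$C$ controls bias-variance: small $C$ = wide margin + more violations; large $C$ = narrow margin + fewer violations.
Always choose $C$ (and $\gamma$ for RBF) via cross-validation.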

What’s Next: SVM Lab
You now have the full SVM toolkit: max-margin geometry, kernel trick, SMO algorithm, and soft-margin regularisation. The next notebook puts it all together in an end-to-end NLP application: sentiment classification of customer reviews using linear and kernel SVMs, TF-IDF features, and full evaluation with precision/recall/F1. Proceed to svm_lab.ipynb.