Soft-Margin SVMs: Slack Variables, Hinge Loss, and the C Trade-off

What you will learn: how slack variables $\xi_i$ relax the hard-margin constraint, the primal and dual soft-margin formulations, hinge loss, the effect of $C$ on bias-variance, and how to tune $C$ with cross-validation.

Why Soft Margins Save Real Models

A fraud-detection team trained a hard-margin SVM and got an error: “No feasible solution: data is not linearly separable.” Most real datasets contain overlapping classes and label noise, so no hyperplane separates them perfectly. The soft-margin SVM handles this by allowing margin violations and penalising their total size, with the penalty weight set by the regularisation parameter $C$.

$C$ is usually the first hyperparameter to tune for an SVM; with an RBF kernel it should be tuned jointly with gamma.

1. Slack Variables and the Soft-Margin Primal

Hard-margin SVM requires every point to satisfy:

$$y_i(w^\top x_i + b) \geq 1 \tag{1}$$

The soft-margin SVM introduces a slack variable $\xi_i \geq 0$ for each training point:

$$y_i(w^\top x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \tag{2}$$

The primal objective becomes:

$$\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \tag{3}$$

$\xi_i = 0$: point correctly classified, on or outside the margin (no penalty)
$0 < \xi_i \leq 1$: point inside the margin but not on the wrong side of the decision boundary (penalty $C\xi_i \leq C$)
$\xi_i > 1$: point misclassified (penalty $C\xi_i > C$)

At the optimum, the slack $\xi_i$ equals the hinge loss of point $i$: $\xi_i = \max(0, 1 - y_i f(x_i))$, where $f(x) = w^\top x + b$.
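To make objective (3) concrete, the following sketch evaluates it for a fixed, hand-picked $(w, b)$ on a toy dataset, taking each slack to be the hinge loss. The helper name `primal_objective` and the data are illustrative assumptions, not part of any library, and the chosen $(w, b)$ is not claimed to be the optimum.

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """Return (objective, slacks) for the soft-margin primal at a given (w, b)."""
    scores = X @ w + b                            # f(x_i) = w^T x_i + b
    slacks = np.maximum(0.0, 1.0 - y * scores)    # smallest feasible xi_i = hinge loss
    objective = 0.5 * np.dot(w, w) + C * slacks.sum()
    return objective, slacks

# Toy 2-D data, made up for illustration; labels are in {-1, +1}
X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

obj, xi = primal_objective(w, b, X, y, C=1.0)
print("slacks   :", xi)    # one point inside the margin, one misclassified
print("objective:", obj)   # 0.5 * ||w||^2 + C * sum(xi)
```

Raising `C` in the call scales only the slack term, which is why a large $C$ pushes the optimiser toward solutions with smaller violations even at the cost of a larger $\|w\|$.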

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

margin_scores = np.linspace(-2, 3, 300)  # y_i * f(x_i)
hinge = np.maximum(0, 1 - margin_scores)
logistic_loss = np.log1p(np.exp(-margin_scores))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(margin_scores, hinge, 'b-', lw=2, label='Hinge loss (SVM)')
ax.plot(margin_scores, logistic_loss, 'r--', lw=2, label='Logistic loss (LR)')
ax.axvline(0, color='k', lw=0.8, linestyle=':')
ax.axvline(1, color='gray', lw=0.8, linestyle=':', label='Margin boundary (score=1)')
ax.fill_betweenx([0, 2.5], -2, 0, alpha=0.05, color='red', label='Misclassified region')
ax.fill_betweenx([0, 2.5],  0, 1, alpha=0.05, color='orange', label='Inside margin')
ax.set_xlabel('Margin score  $y_i f(x_i)$')
ax.set_ylabel('Loss')
ax.set_ylim(-0.1, 2.5)
ax.set_title('Hinge loss vs Logistic loss')
ax.legend(fontsize=9)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("Hinge at score=0.5 :", max(0.0, 1 - 0.5))   # 0.5 (inside margin)
print("Hinge at score=1.0 :", max(0.0, 1 - 1.0))   # 0.0 (on margin)
print("Hinge at score=1.5 :", max(0.0, 1 - 1.5))   # 0.0 (correctly classified, no penalty)

2. Soft-Margin Dual

Forming the Lagrangian of (3) and eliminating $w$, $b$ and $\xi$ through the KKT stationarity conditions yields the dual:

$$\max_{\alpha}\; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i,x_j) \tag{4}$$

$$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0 \tag{5}$$

Compare with the hard-margin dual: the only difference is the box constraint $\alpha_i \leq C$ (hard margin had $\alpha_i \geq 0$ only).
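
To see where the box constraint comes from, a short derivation sketch may help. The symbol $\mu_i \geq 0$ is introduced here as an assumption (it does not appear elsewhere in this notebook) for the multiplier of the constraint $\xi_i \geq 0$, and the linear kernel $K(x_i, x_j) = x_i^\top x_j$ is assumed.

$$
\begin{aligned}
\mathcal{L}(w,b,\xi,\alpha,\mu) &= \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\left[y_i(w^\top x_i + b) - 1 + \xi_i\right] - \sum_{i=1}^n \mu_i \xi_i \\
\frac{\partial \mathcal{L}}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^n \alpha_i y_i x_i \\
\frac{\partial \mathcal{L}}{\partial b} = 0 &\;\Rightarrow\; \sum_{i=1}^n \alpha_i y_i = 0 \\
\frac{\partial \mathcal{L}}{\partial \xi_i} = 0 &\;\Rightarrow\; C - \alpha_i - \mu_i = 0
\end{aligned}
$$

Since $\mu_i \geq 0$, the last line gives $\alpha_i = C - \mu_i \leq C$, which together with $\alpha_i \geq 0$ is the box constraint in (5). Substituting $w$ back into the Lagrangian cancels every $\xi_i$ term (each has coefficient $C - \alpha_i - \mu_i = 0$) and leaves objective (4).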

| $\alpha_i$ | Point type |
|---|---|
| $\alpha_i = 0$ | Non-support vector, correctly classified outside margin |
| $0 < \alpha_i < C$ | Support vector, on the margin ($\xi_i = 0$) |
| $\alpha_i = C$ | Support vector, inside margin or misclassified ($\xi_i > 0$) |
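
The table above can be turned into a small lookup, sketched below. The helper `categorise_alpha`, its tolerance `tol`, and the example values are hypothetical; a tolerance is assumed because numerical solvers rarely return exactly 0 or exactly $C$. For a fitted binary scikit-learn `SVC`, `np.abs(model.dual_coef_)` should give the $\alpha_i$ of the support vectors only, since that attribute stores $y_i \alpha_i$.

```python
def categorise_alpha(alpha, C, tol=1e-8):
    """Map a dual coefficient alpha_i to the point type in the table."""
    if alpha <= tol:
        return "non-support vector"
    if alpha >= C - tol:
        return "bound support vector (inside margin or misclassified)"
    return "free support vector (on the margin)"

C = 1.0
alphas = [0.0, 0.25, 0.6, 1.0]  # hypothetical solver output, not real results
for a in alphas:
    print(f"alpha={a:.2f} -> {categorise_alpha(a, C)}")
```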

The decision function and support vector extraction are identical to the hard-margin case — only the bound on $\alpha_i$ changes.
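
As a minimal sketch of that decision function, $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$, the code below evaluates it from dual coefficients with NumPy. The function names and the two-point toy problem are assumptions made for illustration. The intercept is recovered from the free support vectors ($0 < \alpha_i < C$), because those lie exactly on the margin and satisfy $y_i f(x_i) = 1$.

```python
import numpy as np

def linear_kernel(A, B):
    """K(a, b) = a^T b for every pair of rows in A and B."""
    return A @ B.T

def intercept_from_free_svs(X_sv, y_sv, alpha_sv, C, kernel=linear_kernel, tol=1e-8):
    """Average b = y_i - sum_j alpha_j y_j K(x_j, x_i) over free support vectors."""
    free = (alpha_sv > tol) & (alpha_sv < C - tol)
    K = kernel(X_sv[free], X_sv)
    return float(np.mean(y_sv[free] - K @ (alpha_sv * y_sv)))

def decision_function(X_new, X_sv, y_sv, alpha_sv, b, kernel=linear_kernel):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b for each row of X_new."""
    return kernel(X_new, X_sv) @ (alpha_sv * y_sv) + b

# Toy problem: two points on the x-axis; alpha = 0.5 for both maximises the dual
X_sv = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_sv = np.array([1.0, -1.0])
alpha_sv = np.array([0.5, 0.5])
C = 1.0

b = intercept_from_free_svs(X_sv, y_sv, alpha_sv, C)
scores = decision_function(np.array([[2.0, 0.0], [-0.5, 0.0]]), X_sv, y_sv, alpha_sv, b)
print("b =", b)
print("scores =", scores)
```

Swapping `linear_kernel` for another kernel function changes nothing else in the sketch, which is the practical meaning of the decision function depending on the data only through $K$.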

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Dataset with some overlap
np.random.seed(42)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1,
                            class_sep=0.8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

C_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
margins, n_svs, train_accs, test_accs = [], [], [], []

for C in C_values:
    m = SVC(kernel='linear', C=C)
    m.fit(X_tr_s, y_tr)
    w = m.coef_[0]
    margin = 2 / np.linalg.norm(w)
    margins.append(margin)
    n_svs.append(m.n_support_.sum())
    train_accs.append(accuracy_score(y_tr, m.predict(X_tr_s)))
    test_accs.append(accuracy_score(y_te, m.predict(X_te_s)))

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
axes[0].semilogx(C_values, margins, 'b-o')
axes[0].set_xlabel('C'); axes[0].set_ylabel('Margin width')
axes[0].set_title('Margin vs C'); axes[0].grid(alpha=0.3)

axes[1].semilogx(C_values, n_svs, 'r-o')
axes[1].set_xlabel('C'); axes[1].set_ylabel('# Support vectors')
axes[1].set_title('Support Vectors vs C'); axes[1].grid(alpha=0.3)

axes[2].semilogx(C_values, train_accs, 'g-o', label='Train')
axes[2].semilogx(C_values, test_accs, 'b-s', label='Test')
axes[2].set_xlabel('C'); axes[2].set_ylabel('Accuracy')
axes[2].set_title('Accuracy vs C'); axes[2].legend(); axes[2].grid(alpha=0.3)

plt.suptitle('Effect of C on Soft-Margin SVM (linear kernel)', fontsize=11)
plt.tight_layout()
plt.show()

for C, m_w, sv, tr, te in zip(C_values, margins, n_svs, train_accs, test_accs):
    print(f"C={C:7.3f}: margin={m_w:.3f}  SVs={sv:3d}  train={tr:.3f}  test={te:.3f}")

3. Choosing C with Cross-Validation

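$C$ controls the bias-variance trade-off: a small $C$ gives a wide margin that tolerates many violations (higher bias, lower variance); a large $C$ gives a narrow margin that penalises every violation heavily and can chase noisy points (lower bias, higher variance). Use GridSearchCV or cross_val_score to pick the $C$ with the best cross-validated performance.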

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                            n_redundant=0, class_sep=0.9, random_state=0)

C_range = np.logspace(-3, 3, 13)
cv_scores = []
cv_stds = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for C in C_range:
    pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC(kernel='rbf', C=C, gamma='scale'))])
    sc = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
    cv_scores.append(sc.mean())
    cv_stds.append(sc.std())

cv_scores = np.array(cv_scores)
cv_stds = np.array(cv_stds)
best_C = C_range[np.argmax(cv_scores)]

fig, ax = plt.subplots(figsize=(8, 4))
ax.semilogx(C_range, cv_scores, 'b-o', label='CV accuracy mean')
ax.fill_between(C_range, cv_scores - cv_stds, cv_scores + cv_stds, alpha=0.2)
ax.axvline(best_C, color='r', linestyle='--', label=f'Best C={best_C:.3f}')
ax.set_xlabel('C'); ax.set_ylabel('CV Accuracy')
ax.set_title('5-fold CV accuracy vs C  (RBF kernel)')
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Best C = {best_C:.4f}  mean CV acc = {cv_scores.max():.4f}")
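
The same search can also be written with `GridSearchCV`, as sketched below on the same synthetic data and grid as the loop above. The step name `svm` in the pipeline is what makes the parameter key `svm__C`; the variable names are otherwise arbitrary choices for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.9, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training split
pipe = Pipeline([("scaler", StandardScaler()),
                 ("svm", SVC(kernel="rbf", gamma="scale"))])
param_grid = {"svm__C": np.logspace(-3, 3, 13)}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best C:", search.best_params_["svm__C"])
print("Best mean CV accuracy:", round(search.best_score_, 4))
```

By default `GridSearchCV` refits the best configuration on all of the data, so `search.best_estimator_` is ready for prediction. Because `best_score_` was used to select $C$, it tends to be an optimistic estimate; a held-out test set gives a fairer number.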

4. Try It in the Browser

Compute soft-margin hinge loss and slack values for individual points.

def hinge_loss(y_true, score):
    return max(0, 1 - y_true * score)

def total_loss(C, points):
    # points: list of (y_true, score)
    slack = sum(hinge_loss(y, s) for y, s in points)
    # (1/2)*||w||^2 term omitted; this returns only the penalty C * sum(xi)
    return C * slack

# Example: 5 points with various margin scores
points = [
    (+1,  1.5),   # correctly classified outside margin — no loss
    (+1,  0.5),   # inside margin — small loss
    (+1, -0.3),   # misclassified — large loss
    (-1, -1.2),   # correctly classified outside margin — no loss
    (-1,  0.2),   # misclassified — large loss
]
print("C = 1.0  total hinge penalty:", total_loss(1.0, points))
print("C = 10.0 total hinge penalty:", total_loss(10.0, points))
print()
for y, s in points:
    xi = hinge_loss(y, s)
    status = "OK" if xi == 0 else ("inside margin" if xi <= 1 else "MISCLASSIFIED")
    print(f"  y={y:+d}  score={s:+.1f}  xi={xi:.1f}  [{status}]")

Knowledge Check

In the soft-margin SVM dual, the constraint $\alpha_i \leq C$ (instead of just $\alpha_i \geq 0$) arises because:

[ ] A) The kernel trick requires bounded multipliers
[ ] B) The slack penalty introduces an upper bound on $\alpha_i$ through the KKT stationarity condition for $\xi_i$
[ ] C) The dataset is always linearly separable
[ ] D) $C$ is the learning rate of gradient descent


If you increase $C$ from 0.01 to 1000 on a noisy dataset, you would expect:

[ ] A) Wider margin and fewer support vectors
[ ] B) Narrower margin, fewer tolerated margin violations, potential overfitting
[ ] C) The model to ignore noisy points entirely
[ ] D) No change if the kernel is RBF


Exercises

Exercise 1: Hard vs Soft Margin

Generate a dataset with overlapping classes (for example make_classification(..., class_sep=0.5); adding flip_y=0.05 randomly reassigns the labels of about 5% of samples). Train SVC(kernel='linear', C=1e6) (near-hard margin) and SVC(kernel='linear', C=0.01) (soft). Compare the number of support vectors, train accuracy, and test accuracy. Which generalises better?

Exercise 2: C and Gamma Grid Search

On make_moons(n_samples=600, noise=0.3), perform GridSearchCV over C ∈ [0.1, 1, 10, 100] and gamma ∈ [0.01, 0.1, 1, 10] with kernel='rbf'. Report the best parameters and cross-validated score.

%matplotlib inline
# Exercise 1 and 2: your code here
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons, make_classification
from sklearn.preprocessing import StandardScaler
# Your code here

Common Pitfalls

What’s Next: SVM Lab

You now have the full SVM toolkit: max-margin geometry, kernel trick, SMO algorithm, and soft-margin regularisation. The next notebook puts it all together in an end-to-end NLP application: sentiment classification of customer reviews using linear and kernel SVMs, TF-IDF features, and full evaluation with precision/recall/F1. Proceed to svm_lab.ipynb.