
Cross-Validation Strategies¶
Because One Train-Test Split Is Never Enough¶
A single holdout split gives a noisy estimate of generalisation error: you might get lucky or unlucky with which samples land in the test set. Cross-validation reduces that luck by rotating the test set across the entire dataset and averaging the results, giving a more stable estimate of how the model will behave on new data.
Why One Split Is Not Enough¶
Suppose you split 100 samples 80/20. Whether you get a test MSE of 1.2 or 2.4 may depend entirely on which 20 samples fell in the test set — not on your model. Cross-validation averages this out:
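$$\mathrm{CV}_k = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\big(\hat{f}^{(-i)}, D_i\big)$$
where $\hat{f}^{(-i)}$ is the model trained on all folds except $i$, $D_i$ is the held-out fold $i$, and $\mathcal{L}$ is the evaluation loss (for example MSE).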
| Approach | Bias | Variance | Data efficiency |
|---|---|---|---|
| Single holdout | Low (if large test) | High | Wastes test fraction |
| $k$-fold CV | Low | Lower | Uses all data for training and testing |
| LOOCV | Lowest | Highest (noisy) | Maximum data use, slow |
Visual Flow: $k$-Fold CV¶
$k$-Fold Cross-Validation¶
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
model = make_pipeline(StandardScaler(), LinearRegression())
# Compare k=3, 5, 10
fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharey=True)
for ax, k in zip(axes, [3, 5, 10]):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf, scoring='r2')
    ax.bar(range(1, k+1), scores, color='steelblue', alpha=0.8)
    ax.axhline(scores.mean(), color='tomato', linewidth=2, linestyle='--',
               label=f'Mean={scores.mean():.3f}')
    ax.fill_between([0, k+1],
                    scores.mean() - scores.std(),
                    scores.mean() + scores.std(),
                    color='tomato', alpha=0.1, label=f'±1 std={scores.std():.3f}')
    ax.set_title(f'{k}-Fold CV')
    ax.set_xlabel('Fold')
    ax.set_ylabel('R²')
    ax.legend(fontsize=8)
plt.suptitle('K-Fold R² on the diabetes dataset', y=1.02)
plt.tight_layout()
plt.show()
# Numeric summary
print(f"{'k':>3} {'Mean R²':>9} {'Std':>7}")
for k in [3, 5, 10]:
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    sc = cross_val_score(model, X, y, cv=kf, scoring='r2')
    print(f"{k:>3} {sc.mean():>9.4f} {sc.std():>7.4f}")
Stratified $k$-Fold (Classification)¶
Standard $k$-fold might put almost all positives in one fold by chance. Stratified CV preserves the class ratio in every fold.
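To make the idea of preserving the class ratio concrete, here is a hand-rolled sketch in pure Python. It is not scikit-learn's implementation: the helper `stratified_folds` is hypothetical and simply shuffles each class separately, then deals its members out to the folds round-robin.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)                 # randomise order within the class
        for pos, i in enumerate(members):
            folds[pos % k].append(i)         # spread the class evenly over folds
    return folds

labels = [1] * 10 + [0] * 90                 # toy labels: 10% positives
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)                    # [2, 2, 2, 2, 2]
```

Every fold ends up with 2 positives out of 20 samples, matching the 10% rate of the full label list. A plain shuffled split gives no such guarantee.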
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer
X_c, y_c = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=0))
kf = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
sc_kf = cross_val_score(clf, X_c, y_c, cv=kf, scoring='roc_auc')
sc_skf = cross_val_score(clf, X_c, y_c, cv=skf, scoring='roc_auc')
print(f"KFold AUC: {sc_kf.mean():.4f} ± {sc_kf.std():.4f}")
print(f"Stratified AUC: {sc_skf.mean():.4f} ± {sc_skf.std():.4f}")
# Show class balance in each fold
print("\nClass 1 proportion per fold (Stratified):")
for i, (_, test_idx) in enumerate(skf.split(X_c, y_c), 1):
    prop = y_c[test_idx].mean()
    print(f" Fold {i}: {prop:.3f} ({y_c[test_idx].sum()} positives / {len(test_idx)} samples)")
Leave-One-Out CV (LOOCV)¶
Each sample is the test set exactly once. This gives the most data to training but requires $n$ model fits, which is expensive for large $n$.
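One way to see LOOCV is as the limiting case of $k$-fold with $k = n$. The short check below, on a toy array that is assumed purely for illustration, compares the test indices produced by `LeaveOneOut` with those of an unshuffled `KFold` whose number of splits equals the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(12).reshape(6, 2)   # toy data: 6 samples, 2 features

loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X_demo)]
kf_tests = [test.tolist()
            for _, test in KFold(n_splits=len(X_demo)).split(X_demo)]

print(loo_tests)               # [[0], [1], [2], [3], [4], [5]]
print(loo_tests == kf_tests)   # True
```

Each test set holds a single sample, so the number of fits grows linearly with the dataset size.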
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Small dataset — where LOOCV makes sense
rng = np.random.default_rng(0)
n = 30
X_small = rng.normal(0, 1, (n, 3))
y_small = X_small @ np.array([2, -1, 0.5]) + rng.normal(0, 0.5, n)
model = make_pipeline(StandardScaler(), LinearRegression())
loo = LeaveOneOut()
sc_loo = cross_val_score(model, X_small, y_small, cv=loo,
                         scoring='neg_mean_squared_error')
sc_5f = cross_val_score(model, X_small, y_small, cv=5,
                        scoring='neg_mean_squared_error')
print(f"LOOCV MSE: {-sc_loo.mean():.4f} (n={n} fits)")
print(f"5-Fold MSE: {-sc_5f.mean():.4f} ± {sc_5f.std():.4f}")
print(f"Note: LOOCV ran {len(sc_loo)} training fits vs 5 for k-fold.")
Time Series Split¶
For sequential data (sales, stock prices, demand forecasting) you must never use future data to predict the past. TimeSeriesSplit enforces this: each test fold is strictly after all training data.
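Before applying this to a model, it can help to look at the raw indices. The sketch below runs `TimeSeriesSplit` on 12 time-ordered toy samples (an assumed example, not data from this notebook) and prints which positions land in each training and test window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)   # toy data: 12 time-ordered samples

splits = list(TimeSeriesSplit(n_splits=3).split(X_demo))
for fold, (train_idx, test_idx) in enumerate(splits, 1):
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

With the default settings the training window grows from fold to fold, and every test index is later than every training index in the same fold.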
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
rng = np.random.default_rng(5)
n = 120
t = np.arange(n)
# Simulated monthly sales with trend + seasonality + noise
y_ts = 100 + 0.5*t + 20*np.sin(2*np.pi*t/12) + rng.normal(0, 5, n)
# Features in column order: lag-2, lag-1, month index
X_ts = np.column_stack([y_ts[:-2], y_ts[1:-1], t[2:]])
y_target = y_ts[2:]
tscv = TimeSeriesSplit(n_splits=5)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(t[2:], y_target, 'k-', alpha=0.4, linewidth=1, label='True sales')
colors = plt.cm.viridis(np.linspace(0.2, 0.9, 5))
mse_list = []
for i, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    model.fit(X_ts[train_idx], y_target[train_idx])
    preds = model.predict(X_ts[test_idx])
    mse = np.mean((y_target[test_idx] - preds)**2)
    mse_list.append(mse)
    ax.plot(t[2:][test_idx], preds, color=colors[i], linewidth=2,
            label=f'Fold {i+1} (MSE={mse:.1f})')
ax.set_xlabel('Time (months)')
ax.set_ylabel('Sales')
ax.set_title('TimeSeriesSplit — test folds always in the future')
ax.legend(fontsize=8)
plt.tight_layout()
plt.show()
print(f"Mean CV MSE: {np.mean(mse_list):.2f} ± {np.std(mse_list):.2f}")
CV for Hyperparameter Tuning¶
The most common use of CV: finding the best hyperparameters without touching the final test set.
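Conceptually, a CV-based search repeats the fold loop once per candidate value and keeps the value with the best average score. The numpy sketch below illustrates that loop under stated assumptions: the data are synthetic, ridge regression is solved in closed form, and the helpers `ridge_fit_predict` and `cv_mse` are hypothetical names, not part of any library.

```python
import numpy as np

def ridge_fit_predict(X_tr, y_tr, X_te, alpha):
    """Closed-form ridge on standardised features; the intercept is unpenalised."""
    mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
    Z_tr, Z_te = (X_tr - mu) / sigma, (X_te - mu) / sigma  # training-fold stats only
    y_mean = y_tr.mean()
    A = Z_tr.T @ Z_tr + alpha * np.eye(Z_tr.shape[1])
    w = np.linalg.solve(A, Z_tr.T @ (y_tr - y_mean))
    return Z_te @ w + y_mean

def cv_mse(X, y, alpha, k=5, seed=0):
    """Mean k-fold MSE for one candidate alpha (same folds for every alpha)."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    errors = []
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        preds = ridge_fit_predict(X[train], y[train], X[folds[i]], alpha)
        errors.append(np.mean((y[folds[i]] - preds) ** 2))
    return float(np.mean(errors))

# Synthetic regression data (assumed for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(80, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(0, 1.0, 80)

alphas = [0.01, 0.1, 1, 10, 100, 1000]
results = {alpha: cv_mse(X_demo, y_demo, alpha) for alpha in alphas}
best_alpha = min(results, key=results.get)
print(f"Best alpha by 5-fold CV: {best_alpha}")
```

Using the same fold assignment for every candidate keeps the comparison fair, since each alpha is then scored on identical train/test partitions.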
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {'ridge__alpha': [0.01, 0.1, 1, 10, 100, 1000]}
gs = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error',
return_train_score=True)
gs.fit(X_tr, y_tr)
alphas = [p['ridge__alpha'] for p in gs.cv_results_['params']]
cv_mse = -gs.cv_results_['mean_test_score']
cv_std = gs.cv_results_['std_test_score']
tr_mse = -gs.cv_results_['mean_train_score']
fig, ax = plt.subplots(figsize=(9, 5))
ax.semilogx(alphas, tr_mse, 'b-o', linewidth=2, label='Train MSE')
ax.semilogx(alphas, cv_mse, 'r-o', linewidth=2, label='CV MSE')
ax.fill_between(alphas, cv_mse - cv_std, cv_mse + cv_std, alpha=0.15, color='red')
ax.axvline(gs.best_params_['ridge__alpha'], color='green', linestyle='--',
label=f"Best alpha={gs.best_params_['ridge__alpha']}")
ax.set_xlabel(r'alpha ($\lambda$)')
ax.set_ylabel('MSE')
ax.set_title('GridSearchCV: 5-fold CV for Ridge alpha')
ax.legend()
plt.tight_layout()
plt.show()
final_mse = np.mean((y_te - gs.predict(X_te))**2)
print(f"Best alpha: {gs.best_params_['ridge__alpha']}")
print(f"Best CV MSE: {-gs.best_score_:.2f}")
print(f"Final test MSE: {final_mse:.2f}")
$k$-Fold CV From Scratch¶
Understanding what sklearn is doing internally:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_diabetes
X_d, y_d = load_diabetes(return_X_y=True)
n = len(y_d)
k = 5
rng = np.random.default_rng(42)
indices = rng.permutation(n) # shuffle
folds = np.array_split(indices, k) # split into k chunks
scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    X_tr, X_te = X_d[train_idx], X_d[test_idx]
    y_tr, y_te = y_d[train_idx], y_d[test_idx]
    sc = StandardScaler().fit(X_tr)  # fit the scaler on the training folds only
    X_tr_s, X_te_s = sc.transform(X_tr), sc.transform(X_te)
    model = LinearRegression().fit(X_tr_s, y_tr)
    y_hat = model.predict(X_te_s)
    mse = np.mean((y_te - y_hat)**2)
    r2 = 1 - np.sum((y_te - y_hat)**2) / np.sum((y_te - y_te.mean())**2)
    scores.append(r2)
    print(f"Fold {i+1}: R²={r2:.4f} MSE={mse:.2f}")
print(f"\nMean R²: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
The Data Leakage Trap¶
Choosing the Right Strategy¶
| Problem type | Recommended strategy | Reason |
|---|---|---|
| General regression / classification | 5-fold or 10-fold | Good bias-variance balance |
| Imbalanced classification | Stratified $k$-fold | Preserves class ratio per fold |
| Time series / sequential | TimeSeriesSplit | Prevents future leakage |
| Very small dataset | LOOCV or 10-fold | Maximum use of training data |
| Large dataset | 3-fold or 5-fold | Speed; variance is low anyway |
| Hyperparameter tuning | GridSearchCV / RandomizedSearchCV | Automated CV over param grid |
Try It in the Browser¶
Manual 3-fold CV from scratch using pure Python.
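A minimal sketch of what such a manual 3-fold loop might look like, using only the standard library. The synthetic data ($y = 3x + 1$ plus noise) and the helper names `fit_line` and `manual_kfold_mse` are assumptions made for this illustration.

```python
import random

def fit_line(xs, ys):
    """Least-squares line through (xs, ys): returns (slope, intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def manual_kfold_mse(xs, ys, k=3, seed=0):
    """Shuffle indices, cut them into k folds, and return one test MSE per fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # every k-th shuffled index
    fold_mses = []
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        slope, intercept = fit_line([xs[j] for j in train],
                                    [ys[j] for j in train])
        errors = [(ys[j] - (slope * xs[j] + intercept)) ** 2 for j in test]
        fold_mses.append(sum(errors) / len(errors))
    return fold_mses

# Synthetic data (assumed for illustration): 30 samples of y = 3x + 1 + noise
gen = random.Random(42)
xs = [gen.uniform(0, 5) for _ in range(30)]
ys = [3 * x + 1 + gen.gauss(0, 0.5) for x in xs]

fold_mses = manual_kfold_mse(xs, ys, k=3)
for i, m in enumerate(fold_mses, 1):
    print(f"Fold {i}: MSE={m:.3f}")
print(f"Mean CV MSE: {sum(fold_mses) / len(fold_mses):.3f}")
```

Each sample appears in exactly one test fold, and the model is refitted from scratch for every fold so that no test point influences the line it is scored against.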
Guided Practice¶
Why does cross-validation give a more reliable performance estimate than a single train-test split?¶
You have a binary classification dataset where 5% of samples are positives. Which CV strategy is most appropriate?¶
You fit a StandardScaler on the full dataset before running 5-fold CV. What problem does this cause?¶
After running GridSearchCV, you should fit the final model on:¶
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_wine
X_w, y_w = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# TODO: for k in [3, 5, 10, 20]:
# run KFold CV, compute mean R² and std
# Plot mean ± std as a bar chart with error bars
# Describe how stability changes with k
Exercise 2 — Hyperparameter search with RandomizedSearchCV¶
Use RandomizedSearchCV on a Ridge pipeline over the diabetes dataset. Search over alpha (log-uniform 1e-3 to 1e3) and polynomial degree (1–3) using 5-fold CV. Print the best parameters and final test MSE.
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
from scipy.stats import loguniform
X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=0)
pipe = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())
param_dist = {
    'polynomialfeatures__degree': [1, 2, 3],
    'ridge__alpha': loguniform(1e-3, 1e3),
}
# TODO: run RandomizedSearchCV with n_iter=20, cv=5
# print best_params_, best CV MSE, and final test MSE
Exercise 3 — Detect data leakage¶
Run 5-fold CV on the diabetes dataset twice: once with scaling inside the pipeline (correct), and once with scaling before the split (leaky). Compare the resulting MSE scores and explain the difference.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
X_d, y_d = load_diabetes(return_X_y=True)
# CORRECT: scaler inside pipeline
correct_pipe = make_pipeline(StandardScaler(), LinearRegression())
correct_mse = -cross_val_score(correct_pipe, X_d, y_d, cv=5,
scoring='neg_mean_squared_error').mean()
# LEAKY: scale all data before CV
X_scaled_all = StandardScaler().fit_transform(X_d) # leakage!
leaky_mse = -cross_val_score(LinearRegression(), X_scaled_all, y_d, cv=5,
scoring='neg_mean_squared_error').mean()
print(f"Correct CV MSE : {correct_mse:.2f}")
print(f"Leaky CV MSE   : {leaky_mse:.2f}")
print(f"Difference     : {correct_mse - leaky_mse:.2f}")
print()
print("For plain least squares the two scores are essentially identical: OLS with an")
print("intercept is invariant to per-feature rescaling. Leakage becomes visible with")
print("scale-sensitive steps such as Ridge, PCA, or feature selection fitted on all rows.")
Common Pitfalls¶
Summary¶
Key takeaways
| Concept | One-line meaning |
|---|---|
| CV purpose | Reliable generalisation estimate without a fixed holdout |
| $k$-fold | Split into $k$ folds, rotate the test fold $k$ times, average scores |
| Stratified | Preserves class ratio per fold — essential for imbalanced classification |
| LOOCV | Maximum training data; high variance estimate; slow for large $n$ |
| TimeSeriesSplit | Test fold always after training — prevents future leakage |
| Leakage | Never fit preprocessing on the full dataset before splitting |
| After CV | Refit final model on all training data with best params |
| GridSearchCV | Automates CV-based hyperparameter search |
Next Up — Nested CV and Model Comparison¶

You can now evaluate models reliably. Next: compare them fairly.¶
The next notebook — Nested CV and Model Comparison — shows that using the same CV loop for both hyperparameter tuning and performance estimation leads to an optimistic bias. Nested CV separates the two: an inner loop for tuning, an outer loop for evaluation. It also covers statistical tests for comparing models.
Dependencies: $k$-fold CV, GridSearchCV, and the concept that the test set must never inform any modelling decision.