Cross-Validation Strategies

Because One Train-Test Split Is Never Enough

A single holdout split gives a noisy estimate of generalisation error: you might get lucky or unlucky with which samples land in the test set. Cross-validation reduces that luck by rotating the test set across the entire dataset and averaging the results, which gives a more stable estimate of how the model will behave on new data.

Why One Split Is Not Enough

Suppose you split 100 samples 80/20. Whether you get a test MSE of 1.2 or 2.4 may depend entirely on which 20 samples fell in the test set — not on your model. Cross-validation averages this out:

$$
\text{CV score} = \frac{1}{k}\sum_{i=1}^{k} \text{metric}(\hat{f}_{-i}, D_i)
$$

where $\hat{f}_{-i}$ is the model trained on all folds except fold $i$, and $D_i$ is the held-out fold $i$.
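To make the split-to-split noise concrete, here is a minimal sketch that repeats a single 80/20 holdout under different seeds and compares the spread with one 5-fold CV score. The diabetes dataset, the plain linear model, and the seed range are illustrative assumptions, not part of the argument above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)

# Repeat a single 80/20 holdout under 20 different seeds (arbitrary choice).
holdout_mse = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    fitted = LinearRegression().fit(X_tr, y_tr)
    holdout_mse.append(mean_squared_error(y_te, fitted.predict(X_te)))
holdout_mse = np.array(holdout_mse)

# One 5-fold CV run: the CV score is the mean of the k held-out fold metrics.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = -cross_val_score(LinearRegression(), X, y, cv=cv,
                            scoring='neg_mean_squared_error')
cv_score = fold_mse.mean()

print(f"Holdout MSE over 20 seeds: min={holdout_mse.min():.0f}, "
      f"max={holdout_mse.max():.0f}")
print(f"5-fold CV MSE: {cv_score:.0f}")
```

The model is identical in every run; only the test samples change, so the gap between the smallest and largest holdout MSE comes from the split alone.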

| Approach | Bias | Variance | Data efficiency |
|---|---|---|---|
| Single holdout | Low (if large test) | High | Wastes test fraction |
| $k$-fold CV | Low | Lower | Uses all data for training and testing |
| LOOCV | Lowest | Highest (noisy) | Maximum data use, slow |

Visual Flow: $k$-Fold CV

$k$-Fold Cross-Validation

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
model = make_pipeline(StandardScaler(), LinearRegression())

# Compare k=3, 5, 10
fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharey=True)
for ax, k in zip(axes, [3, 5, 10]):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf, scoring='r2')
    ax.bar(range(1, k+1), scores, color='steelblue', alpha=0.8)
    ax.axhline(scores.mean(), color='tomato', linewidth=2, linestyle='--',
               label=f'Mean={scores.mean():.3f}')
    ax.fill_between([0, k+1],
                    scores.mean() - scores.std(),
                    scores.mean() + scores.std(),
                    color='tomato', alpha=0.1, label=f'±1 std={scores.std():.3f}')
    ax.set_title(f'{k}-Fold CV')
    ax.set_xlabel('Fold')
    ax.set_ylabel('R²')
    ax.legend(fontsize=8)

plt.suptitle('K-Fold R² on the diabetes dataset', y=1.02)
plt.tight_layout()
plt.show()

# Numeric summary
print(f"{'k':>3}  {'Mean R²':>9}  {'Std':>7}")
for k in [3, 5, 10]:
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    sc = cross_val_score(model, X, y, cv=kf, scoring='r2')
    print(f"{k:>3}  {sc.mean():>9.4f}  {sc.std():>7.4f}")

Stratified $k$-Fold (Classification)

Standard $k$-fold can, by chance, leave one fold with almost all the positives and the others with very few. Stratified CV preserves the class ratio in every fold.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer

X_c, y_c = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=0))

kf  = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

sc_kf  = cross_val_score(clf, X_c, y_c, cv=kf,  scoring='roc_auc')
sc_skf = cross_val_score(clf, X_c, y_c, cv=skf, scoring='roc_auc')

print(f"KFold     AUC: {sc_kf.mean():.4f} ± {sc_kf.std():.4f}")
print(f"Stratified AUC: {sc_skf.mean():.4f} ± {sc_skf.std():.4f}")

# Show class balance in each fold
print("\nClass 1 proportion per fold (Stratified):")
for i, (_, test_idx) in enumerate(skf.split(X_c, y_c), 1):
    prop = y_c[test_idx].mean()
    print(f"  Fold {i}: {prop:.3f}  ({y_c[test_idx].sum()} positives / {len(test_idx)} samples)")
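To see how stratification can work under the hood, here is a minimal NumPy sketch: shuffle the indices of each class separately and deal them out across the folds. The helper `stratified_folds` and the toy labels are hypothetical, and this is a simplification, not the allocation scheme `StratifiedKFold` actually uses.

```python
import numpy as np

def stratified_folds(y, k, seed=0):
    """Split indices into k test folds, keeping each class's share per fold."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        # Shuffle the indices of this class, then cut them into k chunks.
        cls_idx = rng.permutation(np.flatnonzero(y == cls))
        for j, chunk in enumerate(np.array_split(cls_idx, k)):
            folds[j].extend(chunk.tolist())
    return [np.array(sorted(f)) for f in folds]

# Toy labels: 10 positives out of 200 samples (5% positive rate).
y_imb = np.array([1] * 10 + [0] * 190)
folds = stratified_folds(y_imb, k=5)

for i, test_idx in enumerate(folds, 1):
    print(f"Fold {i}: {len(test_idx)} samples, "
          f"{int(y_imb[test_idx].sum())} positives "
          f"({y_imb[test_idx].mean():.3f})")
```

Because each class is split on its own, every fold receives the same share of positives whenever the class counts divide evenly by k.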

Leave-One-Out CV (LOOCV)

Each sample serves as the test set exactly once, so every model is trained on $n - 1$ samples. This gives the most data to training but requires $n$ fits, which is expensive for large $n$.

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Small dataset — where LOOCV makes sense
rng = np.random.default_rng(0)
n = 30
X_small = rng.normal(0, 1, (n, 3))
y_small = X_small @ np.array([2, -1, 0.5]) + rng.normal(0, 0.5, n)

model = make_pipeline(StandardScaler(), LinearRegression())

loo = LeaveOneOut()
sc_loo = cross_val_score(model, X_small, y_small, cv=loo,
                          scoring='neg_mean_squared_error')
sc_5f  = cross_val_score(model, X_small, y_small, cv=5,
                          scoring='neg_mean_squared_error')

print(f"LOOCV MSE: {-sc_loo.mean():.4f} (n={n} fits)")
print(f"5-Fold MSE: {-sc_5f.mean():.4f} ± {sc_5f.std():.4f}")
print(f"Note: LOOCV ran {len(sc_loo)} training fits vs 5 for k-fold.")
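LOOCV can also be read as $k$-fold CV with $k = n$. The short check below compares the splits produced by `LeaveOneOut` with those of an unshuffled `KFold` whose `n_splits` equals the number of samples; the 6-row array is a toy assumption.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(12).reshape(6, 2)  # 6 samples, 2 features (toy data)

# Collect (train, test) index pairs from both splitters.
loo_splits = [(tr.tolist(), te.tolist())
              for tr, te in LeaveOneOut().split(X_toy)]
kf_splits = [(tr.tolist(), te.tolist())
             for tr, te in KFold(n_splits=len(X_toy)).split(X_toy)]

same_splits = loo_splits == kf_splits
print(f"Number of LOOCV splits: {len(loo_splits)}")
print(f"Identical to KFold(n_splits=n): {same_splits}")
```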

Time Series Split

For sequential data (sales, stock prices, demand forecasting) you must never use future data to predict the past. TimeSeriesSplit enforces this: each test fold is strictly after all training data.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
n = 120
t = np.arange(n)
# Simulated monthly sales with trend + seasonality + noise
y_ts = 100 + 0.5*t + 20*np.sin(2*np.pi*t/12) + rng.normal(0, 5, n)
# Features: lag-2, lag-1, month index (column order matches the stack below)
X_ts = np.column_stack([y_ts[:-2], y_ts[1:-1], t[2:]])
y_target = y_ts[2:]

tscv = TimeSeriesSplit(n_splits=5)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(t[2:], y_target, 'k-', alpha=0.4, linewidth=1, label='True sales')

colors = plt.cm.viridis(np.linspace(0.2, 0.9, 5))
mse_list = []
for i, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    model.fit(X_ts[train_idx], y_target[train_idx])
    preds = model.predict(X_ts[test_idx])
    mse = np.mean((y_target[test_idx] - preds)**2)
    mse_list.append(mse)
    ax.plot(t[2:][test_idx], preds, color=colors[i], linewidth=2,
            label=f'Fold {i+1} (MSE={mse:.1f})')

ax.set_xlabel('Time (months)')
ax.set_ylabel('Sales')
ax.set_title('TimeSeriesSplit — test folds always in the future')
ax.legend(fontsize=8)
plt.tight_layout()
plt.show()

print(f"Mean CV MSE: {np.mean(mse_list):.2f} ± {np.std(mse_list):.2f}")
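To inspect the fold layout directly, the sketch below prints the index ranges that `TimeSeriesSplit` produces on a toy series of 12 time-ordered rows (the data and `n_splits=3` are assumptions for illustration). The training window grows with each fold and always ends before the test window begins.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # 12 time-ordered rows (toy data)

splits = list(TimeSeriesSplit(n_splits=3).split(X_demo))
for i, (train_idx, test_idx) in enumerate(splits, 1):
    print(f"Fold {i}: train={train_idx.tolist()}  test={test_idx.tolist()}")

# Every test index lies strictly after every training index.
no_future_leak = all(tr.max() < te.min() for tr, te in splits)
print(f"Test folds always after training data: {no_future_leak}")
```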

CV for Hyperparameter Tuning

The most common use of CV: finding the best hyperparameters without touching the final test set.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=0)

pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {'ridge__alpha': [0.01, 0.1, 1, 10, 100, 1000]}

gs = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error',
                  return_train_score=True)
gs.fit(X_tr, y_tr)

alphas = [p['ridge__alpha'] for p in gs.cv_results_['params']]
cv_mse   = -gs.cv_results_['mean_test_score']
cv_std   =  gs.cv_results_['std_test_score']
tr_mse   = -gs.cv_results_['mean_train_score']

fig, ax = plt.subplots(figsize=(9, 5))
ax.semilogx(alphas, tr_mse,  'b-o', linewidth=2, label='Train MSE')
ax.semilogx(alphas, cv_mse,  'r-o', linewidth=2, label='CV MSE')
ax.fill_between(alphas, cv_mse - cv_std, cv_mse + cv_std, alpha=0.15, color='red')
ax.axvline(gs.best_params_['ridge__alpha'], color='green', linestyle='--',
           label=f"Best alpha={gs.best_params_['ridge__alpha']}")
ax.set_xlabel(r'alpha ($\lambda$)')
ax.set_ylabel('MSE')
ax.set_title('GridSearchCV: 5-fold CV for Ridge alpha')
ax.legend()
plt.tight_layout()
plt.show()

final_mse = np.mean((y_te - gs.predict(X_te))**2)
print(f"Best alpha: {gs.best_params_['ridge__alpha']}")
print(f"Best CV MSE: {-gs.best_score_:.2f}")
print(f"Final test MSE: {final_mse:.2f}")
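With its default `refit=True`, `GridSearchCV` refits the best configuration on all the data passed to `fit`, which is why calling `predict` on the search object works. The sketch below checks this against a manual refit; the smaller alpha grid and the variable names are assumptions chosen for brevity.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# refit=True is the default: the best pipeline is refit on all of X_tr.
search = GridSearchCV(make_pipeline(StandardScaler(), Ridge()),
                      {'ridge__alpha': [0.1, 1, 10]}, cv=5,
                      scoring='neg_mean_squared_error')
search.fit(X_tr, y_tr)

# Manual refit: same pipeline, best alpha, full training set.
best_alpha = search.best_params_['ridge__alpha']
manual = make_pipeline(StandardScaler(), Ridge(alpha=best_alpha)).fit(X_tr, y_tr)

same_predictions = np.allclose(search.predict(X_te), manual.predict(X_te))
print(f"Best alpha: {best_alpha}")
print(f"Search object matches manual refit: {same_predictions}")
```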

$k$-Fold CV From Scratch

Understanding what sklearn is doing internally:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_diabetes

X_d, y_d = load_diabetes(return_X_y=True)
n = len(y_d)
k = 5
rng = np.random.default_rng(42)
indices = rng.permutation(n)          # shuffle
folds   = np.array_split(indices, k)  # split into k chunks

scores = []
for i in range(k):
    test_idx  = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    X_tr, X_te = X_d[train_idx], X_d[test_idx]
    y_tr, y_te = y_d[train_idx], y_d[test_idx]

    sc = StandardScaler().fit(X_tr)
    X_tr_s, X_te_s = sc.transform(X_tr), sc.transform(X_te)

    model = LinearRegression().fit(X_tr_s, y_tr)
    y_hat = model.predict(X_te_s)
    mse   = np.mean((y_te - y_hat)**2)
    r2    = 1 - np.sum((y_te - y_hat)**2) / np.sum((y_te - y_te.mean())**2)
    scores.append(r2)
    print(f"Fold {i+1}: R²={r2:.4f}  MSE={mse:.2f}")

print(f"\nMean R²: {np.mean(scores):.4f} ± {np.std(scores):.4f}")

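The Data Leakage Trap

Fitting a preprocessing step (such as a StandardScaler) on the full dataset before splitting lets statistics from the test folds reach the training step, which typically inflates CV scores. Keep preprocessing inside the pipeline so it is refit on each training fold.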
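Leakage is easiest to see when the leaking step uses the labels. The sketch below swaps scaling for feature selection as the preprocessing step and runs it on synthetic pure-noise data, where no model can honestly beat chance. The data sizes, `SelectKBest` with `k=20`, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))  # pure noise: no feature is informative
y_noise = rng.integers(0, 2, size=100)  # random labels: true accuracy is ~0.5

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# LEAKY: select the 20 "best" features using every label, test folds included.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000),
                            X_leaky, y_noise, cv=cv).mean()

# CORRECT: selection lives in the pipeline and is refit on each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
honest_acc = cross_val_score(pipe, X_noise, y_noise, cv=cv).mean()

print(f"Leaky CV accuracy : {leaky_acc:.3f}")
print(f"Honest CV accuracy: {honest_acc:.3f}")
```

The leaky score looks impressive on data that contains no signal at all, while the pipeline version stays near chance.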

Choosing the Right Strategy

| Problem type | Recommended strategy | Reason |
|---|---|---|
| General regression / classification | 5-fold or 10-fold | Good bias-variance balance |
| Imbalanced classification | Stratified $k$-fold | Preserves class ratio per fold |
| Time series / sequential | TimeSeriesSplit | Prevents future leakage |
| Very small dataset ($n < 50$) | LOOCV or 10-fold | Maximum use of training data |
| Large dataset ($n > 10\,000$) | 3-fold or 5-fold | Speed; variance is low anyway |
| Hyperparameter tuning | GridSearchCV / RandomizedSearchCV | Automated CV over param grid |

Try It in the Browser

Manual 3-fold CV from scratch using pure Python.
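A minimal sketch of what such a cell could look like, using only the standard library. The toy data, the one-feature least-squares line, and the helper names `fit_line` and `mse` are assumptions for illustration.

```python
import random

# Toy data: y is roughly 2*x + 1 plus Gaussian noise (hypothetical example).
random.seed(0)
xs = [float(i) for i in range(12)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1) for x in xs]

def fit_line(x, y):
    """Least-squares slope and intercept for a single feature."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(x, y, slope, intercept):
    """Mean squared error of the line on the given points."""
    return sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y)) / len(x)

k = 3
indices = list(range(len(xs)))
random.shuffle(indices)                     # shuffle once
folds = [indices[i::k] for i in range(k)]   # deal indices into k folds

fold_mse = []
for i in range(k):
    test = folds[i]
    train = [j for f in folds[:i] + folds[i + 1:] for j in f]
    slope, intercept = fit_line([xs[j] for j in train], [ys[j] for j in train])
    fold_mse.append(mse([xs[j] for j in test], [ys[j] for j in test],
                        slope, intercept))
    print(f"Fold {i + 1}: train={len(train)}  test={len(test)}  "
          f"MSE={fold_mse[-1]:.3f}")

cv_mse = sum(fold_mse) / k
print(f"Mean CV MSE: {cv_mse:.3f}")
```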

Guided Practice

Why does cross-validation give a more reliable performance estimate than a single train-test split?

- It averages over multiple test sets, reducing the variance of the estimate.
  Correct. Any single split might be lucky or unlucky. Averaging across k splits reduces that noise.
- It trains the model on more epochs.
  CV is about evaluation strategy, not training duration.
- It prevents the model from seeing any training data.
  Each fold still has a training portion; CV does not eliminate training data.
- It automatically tunes hyperparameters.
  Plain CV only evaluates performance. GridSearchCV uses CV for tuning, but they are separate steps.

You have a binary classification dataset where 5% of samples are positives. Which CV strategy is most appropriate?

- Plain KFold with k=5.
  Without stratification, some folds might contain very few or zero positives.
- StratifiedKFold with k=5.
  Correct. Stratified CV preserves the 5% class ratio in every fold, ensuring a fair evaluation.
- TimeSeriesSplit.
  TimeSeriesSplit is for sequential/temporal data, not class imbalance.
- LOOCV.
  LOOCV would work but is very slow for most dataset sizes.

You fit a StandardScaler on the full dataset before running 5-fold CV. What problem does this cause?

- The model trains faster.
  Speed is not the issue here.
- Data leakage: the scaler uses test-fold statistics during training.
  Correct. The scaler learned the mean and std of the full dataset, including the test folds, giving the model indirect access to test data.
- The CV score becomes negative.
  Leakage typically inflates scores; it does not make them negative.
- The number of folds automatically changes.
  Scaling before CV does not change the number of folds.

After running GridSearchCV, you should fit the final model on:

- Only the best fold's training data.
  That would waste most of your data.
- The full training dataset with the best hyperparameters found by CV.
  Correct. CV is for evaluation and selection; the final model uses all available training data.
- The test set.
  The test set must never be used for training.
- A randomly sampled 20% of training data.
  There is no reason to discard 80% of your training data for the final model.

Exercises

Exercise 1 — Compare CV strategies on the wine dataset

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_wine

X_w, y_w = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

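# Note: load_wine is a classification dataset; here the class label (0, 1, 2)
# is treated as a numeric target. Rows are ordered by class, so shuffle the folds.
# TODO: for k in [3, 5, 10, 20]:
#   run KFold(n_splits=k, shuffle=True, random_state=42) CV, compute mean R² and std
# Plot mean ± std as a bar chart with error bars
# Describe how stability changes with k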

Exercise 2 — Hyperparameter search with RandomizedSearchCV

Use RandomizedSearchCV on a Ridge pipeline over the diabetes dataset. Search over alpha (log-uniform from 1e-3 to 1e3) and polynomial degree (1 to 3, matching the grid in the starter code) using 5-fold CV. Print the best parameters and the final test MSE.

import numpy as np
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
from scipy.stats import loguniform

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=0)

pipe = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())
param_dist = {
    'polynomialfeatures__degree': [1, 2, 3],
    'ridge__alpha': loguniform(1e-3, 1e3),
}

# TODO: run RandomizedSearchCV with n_iter=20, cv=5
# print best_params_, best CV MSE, and final test MSE

Exercise 3 — Detect data leakage

Run 5-fold CV on the diabetes dataset twice: once with scaling inside the pipeline (correct), and once with scaling before the split (leaky). Compare the resulting MSE scores and explain the difference.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes

X_d, y_d = load_diabetes(return_X_y=True)

# CORRECT: scaler inside pipeline
correct_pipe = make_pipeline(StandardScaler(), LinearRegression())
correct_mse = -cross_val_score(correct_pipe, X_d, y_d, cv=5,
                                scoring='neg_mean_squared_error').mean()

# LEAKY: scale all data before CV
X_scaled_all = StandardScaler().fit_transform(X_d)   # leakage!
leaky_mse = -cross_val_score(LinearRegression(), X_scaled_all, y_d, cv=5,
                              scoring='neg_mean_squared_error').mean()

print(f"Correct CV MSE : {correct_mse:.2f}")
print(f"Leaky CV MSE   : {leaky_mse:.2f}")
print(f"Difference     : {correct_mse - leaky_mse:.2f}")
print()
print("For plain LinearRegression the two scores are essentially identical: least-squares")
print("predictions do not change when features are shifted and rescaled. The leak shows up")
print("with scale-sensitive steps such as Ridge, k-NN, or PCA.")

Common Pitfalls

Summary

Key takeaways
| Concept | One-line meaning |
|---|---|
| CV purpose | Reliable generalisation estimate without a fixed holdout |
| $k$-fold | Split into $k$ folds, rotate test fold $k$ times, average scores |
| Stratified | Preserves class ratio per fold; essential for imbalanced classification |
| LOOCV | Maximum training data; high variance estimate; slow for large $n$ |
| TimeSeriesSplit | Test fold always after training; prevents future leakage |
| Leakage | Never fit preprocessing on the full dataset before splitting |
| After CV | Refit final model on all training data with best params |
| GridSearchCV | Automates CV-based hyperparameter search |

Next Up — Nested CV and Model Comparison

You can now evaluate models reliably. Next: compare them fairly.

The next notebook — Nested CV and Model Comparison — shows that using the same CV loop for both hyperparameter tuning and performance estimation leads to an optimistic bias. Nested CV separates the two: an inner loop for tuning, an outer loop for evaluation. It also covers statistical tests for comparing models.

Dependencies: $k$-fold CV, GridSearchCV, and the concept that the test set must never inform any modelling decision.
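As a preview of the inner/outer structure described above, here is a minimal sketch of nested CV: a `GridSearchCV` (inner loop, tuning) is passed as the estimator to `cross_val_score` (outer loop, evaluation). The fold counts, the alpha grid, and the seeds are assumptions; the next notebook may set things up differently.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

# Inner loop: alpha is chosen using only the outer training portion.
tuned = GridSearchCV(make_pipeline(StandardScaler(), Ridge()),
                     {'ridge__alpha': [0.1, 1, 10, 100]},
                     cv=inner_cv, scoring='neg_mean_squared_error')

# Outer loop: each outer test fold is never seen during tuning.
outer_mse = -cross_val_score(tuned, X, y, cv=outer_cv,
                             scoring='neg_mean_squared_error')

print(f"Nested CV MSE: {outer_mse.mean():.2f} ± {outer_mse.std():.2f}")
```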