Cross-Validation Strategies¶

Because One Train-Test Split Is Never Enough¶

A single holdout split gives a noisy estimate of generalisation error: you might get lucky or unlucky with which samples land in the test set. Cross-validation reduces that dependence on luck by rotating the test set across the entire dataset and averaging the results, giving a more stable estimate of how the model will behave on new data.

Why One Split Is Not Enough¶

Suppose you split 100 samples 80/20. Whether you get a test MSE of 1.2 or 2.4 may depend entirely on which 20 samples fell in the test set — not on your model. Cross-validation averages this out:

$$
\text{CV score} = \frac{1}{k}\sum_{i=1}^{k} \text{metric}\left(\hat{f}_{-i}, D_i\right) \tag{1}
$$

where $\hat{f}_{-i}$ is the model trained on all folds except $i$ , and $D_i$ is the held-out fold $i$ .
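The claim that a single split depends heavily on which samples land in the test set can be checked empirically. The sketch below is only an illustration under assumptions not in the text above: it uses the diabetes dataset and a plain `LinearRegression`, and repeats both procedures over 20 arbitrary random seeds. It compares the spread of the single-holdout MSE with the spread of the 5-fold CV mean from equation (1).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)

# Single 80/20 holdout, repeated with 20 different random splits
holdout_mse = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    fitted = LinearRegression().fit(X_tr, y_tr)
    holdout_mse.append(np.mean((y_te - fitted.predict(X_te)) ** 2))
holdout_mse = np.array(holdout_mse)

# 5-fold CV mean (equation 1), repeated with 20 different shuffles
cv_mse = []
for seed in range(20):
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    fold_scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                                  scoring='neg_mean_squared_error')
    cv_mse.append(-fold_scores.mean())
cv_mse = np.array(cv_mse)

print(f"Holdout MSE across seeds : mean={holdout_mse.mean():.1f}, std={holdout_mse.std():.1f}")
print(f"5-fold CV MSE across seeds: mean={cv_mse.mean():.1f}, std={cv_mse.std():.1f}")
```

The two means should be similar, while the standard deviation across seeds is typically much smaller for the CV estimate.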

Approach	Bias	Variance	Data efficiency
Single holdout	Higher (model trains on only part of the data)	High	Wastes test fraction
$k$ -fold CV	Low	Lower	Uses all data for training and testing
LOOCV	Lowest	Highest (noisy)	Maximum data use, slow

Visual Flow — $k$ -Fold CV¶

$k$ -Fold Cross-Validation¶

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
model = make_pipeline(StandardScaler(), LinearRegression())

# Compare k=3, 5, 10
fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharey=True)
for ax, k in zip(axes, [3, 5, 10]):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf, scoring='r2')
    ax.bar(range(1, k+1), scores, color='steelblue', alpha=0.8)
    ax.axhline(scores.mean(), color='tomato', linewidth=2, linestyle='--',
               label=f'Mean={scores.mean():.3f}')
    ax.fill_between([0, k+1],
                    scores.mean() - scores.std(),
                    scores.mean() + scores.std(),
                    color='tomato', alpha=0.1, label=f'±1 std={scores.std():.3f}')
    ax.set_title(f'{k}-Fold CV')
    ax.set_xlabel('Fold')
    ax.set_ylabel('R²')
    ax.legend(fontsize=8)

plt.suptitle('K-Fold R² on the diabetes dataset', y=1.02)
plt.tight_layout()
plt.show()

# Numeric summary
print(f"{'k':>3}  {'Mean R²':>9}  {'Std':>7}")
for k in [3, 5, 10]:
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    sc = cross_val_score(model, X, y, cv=kf, scoring='r2')
    print(f"{k:>3}  {sc.mean():>9.4f}  {sc.std():>7.4f}")

Choosing $k$

$k$	Bias of estimate	Variance of estimate	Cost
3	Higher (less training data per fold)	Lower	Fast
5	Balanced	Balanced	Standard
10	Lower	Higher	2× slower than 5-fold
$n$ (LOOCV)	Lowest	Highest	Very slow

Rule of thumb: use $k=5$ or $k=10$ . Smaller datasets favour larger $k$ ; large datasets can use $k=3$ or $k=5$ for speed.
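The cost and training-size columns of the table can be made concrete with a few lines of arithmetic. This is a minimal sketch, assuming a dataset of 442 rows (the size of the diabetes dataset used above); the `dummy` array is a stand-in, since `KFold` only needs the number of rows.

```python
import numpy as np
from sklearn.model_selection import KFold

n = 442                   # assumed dataset size (diabetes dataset)
dummy = np.zeros((n, 1))  # KFold only looks at the number of rows

summary = {}
for k in [3, 5, 10, n]:
    train_sizes = [len(train_idx) for train_idx, _ in KFold(n_splits=k).split(dummy)]
    smallest = min(train_sizes)
    summary[k] = smallest
    label = "LOOCV" if k == n else f"k={k}"
    print(f"{label:>6}: {k:>3} fits, at least {smallest} training samples per fit "
          f"({smallest / n:.1%} of the data)")
```

Larger $k$ means each model sees more of the data (less pessimistic bias) at the price of more fits.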

Stratified $k$ -Fold (Classification)¶

Standard $k$ -fold might put almost all positives in one fold by chance. Stratified CV preserves the class ratio in every fold.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer

X_c, y_c = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=0))

kf  = KFold(n_splits=5, shuffle=True, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

sc_kf  = cross_val_score(clf, X_c, y_c, cv=kf,  scoring='roc_auc')
sc_skf = cross_val_score(clf, X_c, y_c, cv=skf, scoring='roc_auc')

print(f"KFold     AUC: {sc_kf.mean():.4f} ± {sc_kf.std():.4f}")
print(f"Stratified AUC: {sc_skf.mean():.4f} ± {sc_skf.std():.4f}")

# Show class balance in each fold
print("\nClass 1 proportion per fold (Stratified):")
for i, (_, test_idx) in enumerate(skf.split(X_c, y_c), 1):
    prop = y_c[test_idx].mean()
    print(f"  Fold {i}: {prop:.3f}  ({y_c[test_idx].sum()} positives / {len(test_idx)} samples)")
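The breast cancer dataset is only mildly imbalanced, so the difference between the two splitters is small there. The following sketch uses a synthetic label vector with 5% positives (an assumption made purely for illustration) to show how the positives are distributed across test folds by each splitter.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
y_imb = np.zeros(200, dtype=int)
y_imb[:10] = 1                 # 10 positives out of 200 = 5%
rng.shuffle(y_imb)
X_imb = np.zeros((200, 1))     # features are irrelevant for the split itself

kf = KFold(n_splits=5, shuffle=True, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count positives in each test fold
kf_counts = [int(y_imb[test].sum()) for _, test in kf.split(X_imb)]
skf_counts = [int(y_imb[test].sum()) for _, test in skf.split(X_imb, y_imb)]

print("Positives per test fold (KFold)          :", kf_counts)
print("Positives per test fold (StratifiedKFold):", skf_counts)
```

With stratification every test fold receives the same number of positives; with plain `KFold` the counts vary with the shuffle.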

Leave-One-Out CV (LOOCV)¶

Each sample is the test set exactly once. Gives the most data to training but is expensive for large $n$ .

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Small dataset — where LOOCV makes sense
rng = np.random.default_rng(0)
n = 30
X_small = rng.normal(0, 1, (n, 3))
y_small = X_small @ np.array([2, -1, 0.5]) + rng.normal(0, 0.5, n)

model = make_pipeline(StandardScaler(), LinearRegression())

loo = LeaveOneOut()
sc_loo = cross_val_score(model, X_small, y_small, cv=loo,
                          scoring='neg_mean_squared_error')
sc_5f  = cross_val_score(model, X_small, y_small, cv=5,
                          scoring='neg_mean_squared_error')

print(f"LOOCV MSE: {-sc_loo.mean():.4f} (n={n} fits)")
print(f"5-Fold MSE: {-sc_5f.mean():.4f} ± {sc_5f.std():.4f}")
print(f"Note: LOOCV ran {len(sc_loo)} training fits vs 5 for k-fold.")
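LOOCV is the limiting case of $k$-fold with $k = n$. A small sketch on a six-row toy array (hypothetical data, chosen only to keep the output readable) confirms that `LeaveOneOut` and an unshuffled `KFold(n_splits=n)` produce the same splits.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(12).reshape(6, 2)   # 6 samples, 2 features

loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_demo)]
kf_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=6).split(X_demo)]

for tr, te in loo_splits:
    print(f"train={tr}  test={te}")

same = loo_splits == kf_splits
print("LeaveOneOut matches KFold(n_splits=n):", same)
```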

Time Series Split¶

For sequential data (sales, stock prices, demand forecasting) you must never use future data to predict the past. TimeSeriesSplit enforces this: each test fold is strictly after all training data.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
n = 120
t = np.arange(n)
# Simulated monthly sales with trend + seasonality + noise
y_ts = 100 + 0.5*t + 20*np.sin(2*np.pi*t/12) + rng.normal(0, 5, n)
# Features: lag-2, lag-1, time index (column order matches the stack below)
X_ts = np.column_stack([y_ts[:-2], y_ts[1:-1], t[2:]])
y_target = y_ts[2:]

tscv = TimeSeriesSplit(n_splits=5)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(t[2:], y_target, 'k-', alpha=0.4, linewidth=1, label='True sales')

colors = plt.cm.viridis(np.linspace(0.2, 0.9, 5))
mse_list = []
for i, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    model.fit(X_ts[train_idx], y_target[train_idx])
    preds = model.predict(X_ts[test_idx])
    mse = np.mean((y_target[test_idx] - preds)**2)
    mse_list.append(mse)
    ax.plot(t[2:][test_idx], preds, color=colors[i], linewidth=2,
            label=f'Fold {i+1} (MSE={mse:.1f})')

ax.set_xlabel('Time (months)')
ax.set_ylabel('Sales')
ax.set_title('TimeSeriesSplit — test folds always in the future')
ax.legend(fontsize=8)
plt.tight_layout()
plt.show()

print(f"Mean CV MSE: {np.mean(mse_list):.2f} ± {np.std(mse_list):.2f}")
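To see the splitting rule itself rather than its effect on a model, it helps to print the indices. This is a minimal sketch on eight time-ordered dummy samples (an arbitrary size chosen for readability), using the default expanding-window behaviour of `TimeSeriesSplit`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(8).reshape(-1, 1)   # 8 time-ordered samples

splits = [(tr.tolist(), te.tolist())
          for tr, te in TimeSeriesSplit(n_splits=3).split(X_demo)]

for i, (tr, te) in enumerate(splits, 1):
    print(f"Fold {i}: train={tr}  test={te}")

# Every training index precedes every test index in the same fold
ordered = all(max(tr) < min(te) for tr, te in splits)
print("Training always precedes testing:", ordered)
```

The training window grows from fold to fold, and the earliest samples are never used for testing.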

CV for Hyperparameter Tuning¶

The most common use of CV: finding the best hyperparameters without touching the final test set.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=0)

pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {'ridge__alpha': [0.01, 0.1, 1, 10, 100, 1000]}

gs = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error',
                  return_train_score=True)
gs.fit(X_tr, y_tr)

alphas = [p['ridge__alpha'] for p in gs.cv_results_['params']]
cv_mse   = -gs.cv_results_['mean_test_score']
cv_std   =  gs.cv_results_['std_test_score']
tr_mse   = -gs.cv_results_['mean_train_score']

fig, ax = plt.subplots(figsize=(9, 5))
ax.semilogx(alphas, tr_mse,  'b-o', linewidth=2, label='Train MSE')
ax.semilogx(alphas, cv_mse,  'r-o', linewidth=2, label='CV MSE')
ax.fill_between(alphas, cv_mse - cv_std, cv_mse + cv_std, alpha=0.15, color='red')
ax.axvline(gs.best_params_['ridge__alpha'], color='green', linestyle='--',
           label=f"Best alpha={gs.best_params_['ridge__alpha']}")
ax.set_xlabel(r'alpha ($\lambda$)')
ax.set_ylabel('MSE')
ax.set_title('GridSearchCV: 5-fold CV for Ridge alpha')
ax.legend()
plt.tight_layout()
plt.show()

final_mse = np.mean((y_te - gs.predict(X_te))**2)
print(f"Best alpha: {gs.best_params_['ridge__alpha']}")
print(f"Best CV MSE: {-gs.best_score_:.2f}")
print(f"Final test MSE: {final_mse:.2f}")
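One detail worth checking is what `gs.predict` actually uses. With the default `refit=True`, `GridSearchCV` refits the best configuration on the whole training set once the search is done. The sketch below, which uses a shorter alpha grid than the example above purely to keep it quick, compares the search object's predictions with a pipeline refit by hand using the selected alpha.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=0)

search = GridSearchCV(make_pipeline(StandardScaler(), Ridge()),
                      {'ridge__alpha': [0.1, 1, 10, 100]},
                      cv=5, scoring='neg_mean_squared_error')  # refit=True by default
search.fit(X_tr, y_tr)

# Refit by hand on the full training set with the selected alpha
best_alpha = search.best_params_['ridge__alpha']
manual = make_pipeline(StandardScaler(), Ridge(alpha=best_alpha)).fit(X_tr, y_tr)

same_predictions = np.allclose(search.predict(X_te), manual.predict(X_te))
print(f"Selected alpha: {best_alpha}")
print("search.predict matches a manual refit on all training data:", same_predictions)
```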

$k$ -Fold CV From Scratch¶

Understanding what sklearn is doing internally:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_diabetes

X_d, y_d = load_diabetes(return_X_y=True)
n = len(y_d)
k = 5
rng = np.random.default_rng(42)
indices = rng.permutation(n)          # shuffle
folds   = np.array_split(indices, k)  # split into k chunks

scores = []
for i in range(k):
    test_idx  = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    X_tr, X_te = X_d[train_idx], X_d[test_idx]
    y_tr, y_te = y_d[train_idx], y_d[test_idx]

    sc = StandardScaler().fit(X_tr)
    X_tr_s, X_te_s = sc.transform(X_tr), sc.transform(X_te)

    model = LinearRegression().fit(X_tr_s, y_tr)
    y_hat = model.predict(X_te_s)
    mse   = np.mean((y_te - y_hat)**2)
    r2    = 1 - np.sum((y_te - y_hat)**2) / np.sum((y_te - y_te.mean())**2)
    scores.append(r2)
    print(f"Fold {i+1}: R²={r2:.4f}  MSE={mse:.2f}")

print(f"\nMean R²: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
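A quick way to confirm that the manual loop mirrors what sklearn does is to hand the same folds to `cross_val_score`, whose `cv` argument also accepts an iterable of `(train, test)` index pairs. The sketch below rebuilds the folds with the same seed as above and compares the two sets of R² scores; the variable names `splits`, `manual_r2`, and `sk_r2` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_d, y_d = load_diabetes(return_X_y=True)
k = 5
rng = np.random.default_rng(42)
folds = np.array_split(rng.permutation(len(y_d)), k)
splits = [(np.concatenate([folds[j] for j in range(k) if j != i]), folds[i])
          for i in range(k)]

# Manual loop: scale on the training fold only, then fit and score
manual_r2 = []
for train_idx, test_idx in splits:
    scaler = StandardScaler().fit(X_d[train_idx])
    reg = LinearRegression().fit(scaler.transform(X_d[train_idx]), y_d[train_idx])
    y_hat = reg.predict(scaler.transform(X_d[test_idx]))
    y_true = y_d[test_idx]
    manual_r2.append(1 - np.sum((y_true - y_hat) ** 2)
                     / np.sum((y_true - y_true.mean()) ** 2))

# Same folds passed to sklearn as explicit (train, test) index pairs
sk_r2 = cross_val_score(make_pipeline(StandardScaler(), LinearRegression()),
                        X_d, y_d, cv=splits, scoring='r2')

print("Manual :", np.round(manual_r2, 4))
print("sklearn:", np.round(sk_r2, 4))
```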

The Data Leakage Trap¶

Warning

Never fit preprocessing on the full dataset before CV. If you call StandardScaler().fit(X_all) and then split, the test-fold statistics have contaminated the scaler, so the model has indirectly seen the test set. This can inflate performance estimates; the effect is usually small for scaling but can be large for steps such as feature selection.

Always use a Pipeline so that fitting (.fit()) happens only on the training fold.

# WRONG — leakage
X_scaled = StandardScaler().fit_transform(X)   # sees all data including test
cross_val_score(model, X_scaled, y, cv=5)

# CORRECT — no leakage
pipe = make_pipeline(StandardScaler(), model)
cross_val_score(pipe, X, y, cv=5)              # scaler fits only on train fold
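Leakage from a scaler is often too small to notice, so the sketch below swaps in a different preprocessing step, supervised feature selection, where the effect is dramatic. Everything here is an assumption for illustration: the data are pure noise (100 samples, 2000 random features, random labels), so any honest estimate of accuracy should sit near 0.5.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))   # features carry no signal
y_noise = rng.integers(0, 2, size=100)   # labels are coin flips

# WRONG: select features using all rows (including future test folds), then run CV
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000),
                            X_selected, y_noise, cv=5).mean()

# CORRECT: selection lives inside the pipeline and is refit on each training fold
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_acc = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print(f"Leaky accuracy : {leaky_acc:.3f}")
print(f"Honest accuracy: {honest_acc:.3f}")
```

The leaky estimate looks far better than chance even though there is nothing to learn, which is exactly the failure a Pipeline prevents.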

Choosing the Right Strategy¶

Problem type	Recommended strategy	Reason
General regression / classification	5-fold or 10-fold	Good bias-variance balance
Imbalanced classification	Stratified $k$ -fold	Preserves class ratio per fold
Time series / sequential	`TimeSeriesSplit`	Prevents future leakage
Very small dataset ( $n < 50$ )	LOOCV or 10-fold	Maximum use of training data
Large dataset ( $n > 10 000$ )	3-fold or 5-fold	Speed; variance is low anyway
Hyperparameter tuning	`GridSearchCV` / `RandomizedSearchCV`	Automated CV over param grid

Try It in the Browser¶

Manual 3-fold CV from scratch using pure Python.
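The interactive cell is not reproduced here. As a stand-in, the following is a minimal sketch of what a manual 3-fold CV can look like using only the Python standard library. The toy data (a line with a small alternating offset), the helper `fit_line`, and the fold assignment by index modulo 3 are all assumptions made for this illustration, not the original exercise.

```python
# Toy data: y = 2x + 1 plus a small alternating offset (hypothetical values)
xs = [float(i) for i in range(12)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]


def fit_line(x, y):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


k = 3
fold_mse = []
for fold in range(k):
    # Sample i belongs to test fold (i mod k); the rest form the training set
    test_idx = [i for i in range(len(xs)) if i % k == fold]
    train_idx = [i for i in range(len(xs)) if i % k != fold]

    slope, intercept = fit_line([xs[i] for i in train_idx],
                                [ys[i] for i in train_idx])
    errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
    fold_mse.append(sum(errors) / len(errors))
    print(f"Fold {fold + 1}: slope={slope:.3f}  intercept={intercept:.3f}  "
          f"MSE={fold_mse[-1]:.3f}")

cv_mse = sum(fold_mse) / k
print(f"Mean CV MSE: {cv_mse:.3f}")
```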

Guided Practice¶

Why does cross-validation give a more reliable performance estimate than a single train-test split?¶

It averages over multiple test sets, reducing the variance of the estimate
Correct. Any single split might be lucky or unlucky. Averaging across k splits reduces that noise.

It trains the model on more epochs
CV is about evaluation strategy, not training duration.

It prevents the model from seeing any training data
Each fold still has a training portion; CV does not eliminate training data.

It automatically tunes hyperparameters
Plain CV only evaluates performance. GridSearchCV uses CV for tuning, but they are separate steps.

You have a binary classification dataset where 5% of samples are positives. Which CV strategy is most appropriate?¶

Plain KFold with k=5
Without stratification, some folds might contain very few or zero positives.

StratifiedKFold with k=5
Correct. Stratified CV preserves the 5% class ratio in every fold, ensuring a fair evaluation.

TimeSeriesSplit
TimeSeriesSplit is for sequential/temporal data, not class imbalance.

LOOCV
LOOCV would work but is very slow for most dataset sizes.

You fit a StandardScaler on the full dataset before running 5-fold CV. What problem does this cause?¶

The model trains faster
Speed is not the issue here.

Data leakage: the scaler uses test fold statistics during training
Correct. The scaler learned the mean and std of the full dataset including test folds, giving the model indirect access to test data.

The CV score becomes negative
Leakage typically inflates scores rather than making them negative.

The number of folds automatically changes
Scaling before CV does not change the number of folds.

After running GridSearchCV, you should fit the final model on:¶

Only the best fold's training data
That would waste most of your data.

The full training dataset with the best hyperparameters found by CV
Correct. CV is for evaluation and selection; the final model uses all available training data.

The test set
The test set must never be used for training.

A randomly sampled 20% of training data
There is no reason to discard 80% of your training data for the final model.

Exercises¶

Exercise 1 — Compare CV strategies on the wine dataset¶

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_wine

X_w, y_w = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# TODO: for k in [3, 5, 10, 20]:
#   run KFold CV (use shuffle=True: the wine samples are ordered by class),
#   compute mean R² and std
# Plot mean ± std as a bar chart with error bars
# Describe how stability changes with k

Exercise 2 — Hyperparameter search with RandomizedSearchCV¶

Use RandomizedSearchCV on a Ridge pipeline over the diabetes dataset. Search over alpha (log-uniform 1e-3 to 1e3) and polynomial degree (1–3) using 5-fold CV. Print the best parameters and final test MSE.

import numpy as np
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
from scipy.stats import loguniform

X_d, y_d = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=0)

pipe = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())
param_dist = {
    'polynomialfeatures__degree': [1, 2, 3],
    'ridge__alpha': loguniform(1e-3, 1e3),
}

# TODO: run RandomizedSearchCV with n_iter=20, cv=5
# print best_params_, best CV MSE, and final test MSE

Exercise 3 — Detect data leakage¶

Run 5-fold CV on the diabetes dataset twice: once with scaling inside the pipeline (correct), and once with scaling before the split (leaky). Compare the resulting MSE scores and explain the difference.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes

X_d, y_d = load_diabetes(return_X_y=True)

# CORRECT: scaler inside pipeline
correct_pipe = make_pipeline(StandardScaler(), LinearRegression())
correct_mse = -cross_val_score(correct_pipe, X_d, y_d, cv=5,
                                scoring='neg_mean_squared_error').mean()

# LEAKY: scale all data before CV
X_scaled_all = StandardScaler().fit_transform(X_d)   # leakage!
leaky_mse = -cross_val_score(LinearRegression(), X_scaled_all, y_d, cv=5,
                              scoring='neg_mean_squared_error').mean()

print(f"Correct CV MSE : {correct_mse:.2f}")
print(f"Leaky CV MSE   : {leaky_mse:.2f}")
print(f"Difference     : {correct_mse - leaky_mse:.2f}")
print()
print("With plain LinearRegression the two values are essentially identical, because")
print("least squares with an intercept is invariant to feature scaling. Leakage from")
print("the scaler only changes the score for scale-sensitive models such as Ridge.")

Common Pitfalls¶

Summary¶

Concept	One-line meaning
CV purpose	Reliable generalisation estimate without a fixed holdout
$k$ -fold	Split into $k$ folds, rotate test fold $k$ times, average scores
Stratified	Preserves class ratio per fold — essential for imbalanced classification
LOOCV	Maximum training data; high variance estimate; slow for large $n$
TimeSeriesSplit	Test fold always after training — prevents future leakage
Leakage	Never fit preprocessing on the full dataset before splitting
After CV	Refit final model on all training data with best params
GridSearchCV	Automates CV-based hyperparameter search

Next Up — Nested CV and Model Comparison¶

You can now evaluate models reliably. Next: compare them fairly.¶

The next notebook — Nested CV and Model Comparison — shows that using the same CV loop for both hyperparameter tuning and performance estimation leads to an optimistic bias. Nested CV separates the two: an inner loop for tuning, an outer loop for evaluation. It also covers statistical tests for comparing models.
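As a preview only, the inner/outer structure described above can be sketched by passing a `GridSearchCV` object to `cross_val_score`. The dataset, alpha grid, fold counts, and seeds below are assumptions chosen for illustration, not the setup of the next notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes alpha
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

tuned_ridge = GridSearchCV(make_pipeline(StandardScaler(), Ridge()),
                           {'ridge__alpha': [0.1, 1, 10, 100]},
                           cv=inner_cv, scoring='neg_mean_squared_error')

# Each outer fold runs its own inner search on that fold's training portion only
outer_scores = cross_val_score(tuned_ridge, X, y, cv=outer_cv,
                               scoring='neg_mean_squared_error')

print(f"Nested CV MSE: {-outer_scores.mean():.2f} ± {outer_scores.std():.2f}")
```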

Dependencies: $k$-fold CV, GridSearchCV, and the concept that the test set must never inform any modelling decision.
