Nested CV & Model Comparison

Fair Evaluation When You Also Tune Hyperparameters

Standard cross-validation gives a nearly unbiased estimate of generalisation performance, but only until you use the same CV loop to select hyperparameters. Every time you pick the best alpha or the best kernel based on CV scores, you introduce an optimistic bias into the score you report. Nested CV separates tuning from evaluation with two loops: the inner loop tunes on the outer training data only, and the outer loop measures performance on data the tuning never saw.

The Problem with Non-Nested CV

Consider this workflow:

  1. Try Ridge with alpha in {0.01, 0.1, 1, 10}.

  2. Pick the alpha with the best 5-fold CV score.

  3. Report that best CV score as your model’s performance.

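The bias: step 2 selected the alpha that happened to do best on these particular folds. The reported score is therefore the maximum over four noisy estimates, not the expected score of a freshly trained model, and the maximum of several noisy estimates tends to exceed their true values. The more candidates you search, the larger the expected inflation.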
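The inflation can be illustrated with a small simulation. The sketch below assumes every candidate has the same true score and that each CV estimate carries independent Gaussian noise; `TRUE_SCORE`, `NOISE_STD`, and `N_REPEATS` are illustrative values, not measurements. Real grid candidates share folds and are correlated, so the effect in practice is typically smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_SCORE = 0.50   # assumed: every candidate has the same true R²
NOISE_STD = 0.05    # assumed noise in each candidate's CV estimate
N_REPEATS = 20_000  # number of simulated searches

def mean_best_score(n_candidates):
    """Average of the best observed CV score over many simulated searches."""
    noise = NOISE_STD * rng.standard_normal((N_REPEATS, n_candidates))
    observed = TRUE_SCORE + noise
    return observed.max(axis=1).mean()

# Inflation = (average reported best score) - (true score of any candidate)
inflation = {k: mean_best_score(k) - TRUE_SCORE for k in (1, 4, 20, 100)}
for k, gap in inflation.items():
    print(f"{k:>3} candidates: reported score exceeds truth by {gap:+.4f}")
```

With a single candidate there is nothing to select, so the gap stays near zero; it grows as the number of candidates increases.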

Demonstrating the Bias Empirically

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, 3, 20)
pipe   = make_pipeline(StandardScaler(), Ridge())

# Non-nested: pick best alpha by 5-fold, then report that same CV score
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
gs = GridSearchCV(pipe, {'ridge__alpha': alphas}, cv=inner_cv,
                  scoring='r2', return_train_score=False)
gs.fit(X, y)
non_nested_score = gs.best_score_

# Nested: outer 5-fold wraps inner 3-fold GridSearch
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
gs_nested = GridSearchCV(pipe, {'ridge__alpha': alphas},
                         cv=KFold(n_splits=3, shuffle=True, random_state=42),
                         scoring='r2')
nested_scores = cross_val_score(gs_nested, X, y, cv=outer_cv, scoring='r2')
nested_score  = nested_scores.mean()

print(f"Non-nested CV R²  : {non_nested_score:.4f}  (optimistically biased)")
print(f"Nested CV R²      : {nested_score:.4f} ± {nested_scores.std():.4f}  (unbiased)")
print(f"Optimism gap      : {non_nested_score - nested_score:.4f}")

fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(['Non-nested\n(biased)', 'Nested\n(unbiased)'],
       [non_nested_score, nested_score],
       color=['tomato', 'steelblue'], width=0.4)
ax.errorbar([1], [nested_score], yerr=[nested_scores.std()],
            fmt='none', color='black', capsize=6, linewidth=2)
ax.set_ylabel('R²'); ax.set_title('Non-nested vs Nested CV on the diabetes dataset')
ax.set_ylim(0.3, 0.6)
plt.tight_layout(); plt.show()
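A single pair of splits can make the gap look larger or smaller than it typically is. As a hedged sketch, the loop below repeats the comparison over a few seeds and uses the same 5-fold inner splitter for both estimates, so only the nesting differs. The number of repeats and the `seed + 100` offset are arbitrary choices. Because Ridge has one smooth hyperparameter, the gap may be small or even negative for some seeds.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
grid = {'ridge__alpha': np.logspace(-3, 3, 10)}
pipe = make_pipeline(StandardScaler(), Ridge())

gaps = []
for seed in range(5):  # 5 repeats: arbitrary, keeps the run short
    inner_cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    outer_cv = KFold(n_splits=5, shuffle=True, random_state=seed + 100)

    # Non-nested: the score used to pick alpha is also the score reported
    non_nested = GridSearchCV(pipe, grid, cv=inner_cv, scoring='r2').fit(X, y).best_score_

    # Nested: the same search runs inside each outer training set
    search = GridSearchCV(pipe, grid, cv=inner_cv, scoring='r2')
    nested = cross_val_score(search, X, y, cv=outer_cv, scoring='r2').mean()

    gaps.append(non_nested - nested)
    print(f"seed {seed}: non-nested={non_nested:.4f}  nested={nested:.4f}  gap={gaps[-1]:+.4f}")

print(f"Mean gap over {len(gaps)} seeds: {np.mean(gaps):+.4f}")
```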

How Nested CV Works — Step by Step

For outer $k_1$-fold and inner $k_2$-fold:

  1. Split data into $k_1$ outer folds.

  2. For each outer fold $i$:

    • Hold out fold $i$ as the outer test set.

    • On the remaining $k_1 - 1$ folds, run inner $k_2$-fold GridSearch to find the best hyperparameters.

    • Refit the model with those best hyperparameters on all $k_1 - 1$ training folds.

    • Evaluate on the held-out outer fold $i$ → get score $s_i$.

  3. Report $\bar{s} \pm \text{std}(s)$.

Total fits = $k_1 \times k_2 \times |\text{param\_grid}|$. For $k_1=5$, $k_2=3$, 10 alphas = 150 fits.
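The arithmetic can be checked with a few lines of Python. The helper below is a hypothetical convenience function, not part of scikit-learn; it also counts the one refit per outer fold from step 2, which the product formula leaves out.

```python
def nested_cv_fit_count(k_outer, k_inner, n_candidates, refit=True):
    """Count model fits in nested CV with an exhaustive grid search inside."""
    # Every outer fold runs a full k_inner-fold search over all candidates
    search_fits = k_outer * k_inner * n_candidates
    # Optionally, the best candidate is refitted once per outer fold
    refits = k_outer if refit else 0
    return {"search_fits": search_fits, "refits": refits, "total": search_fits + refits}

counts = nested_cv_fit_count(k_outer=5, k_inner=3, n_candidates=10)
print(counts)  # {'search_fits': 150, 'refits': 5, 'total': 155}
```

The refits are a small fraction of the cost, which is why the search term dominates any estimate of runtime.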

Nested CV From Scratch

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-2, 3, 10)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

outer_scores = []
best_alphas  = []

for i, (outer_tr, outer_te) in enumerate(outer_cv.split(X)):
    X_out_tr, X_out_te = X[outer_tr], X[outer_te]
    y_out_tr, y_out_te = y[outer_tr], y[outer_te]

    # Inner loop: find best alpha on outer_tr
    best_score, best_alpha = -np.inf, None
    for alpha in alphas:
        pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
        inner_sc = cross_val_score(pipe, X_out_tr, y_out_tr,
                                   cv=inner_cv, scoring='r2')
        if inner_sc.mean() > best_score:
            best_score = inner_sc.mean()
            best_alpha = alpha

    # Refit best model on full outer train, evaluate on outer test
    pipe_best = make_pipeline(StandardScaler(), Ridge(alpha=best_alpha))
    pipe_best.fit(X_out_tr, y_out_tr)
    y_hat = pipe_best.predict(X_out_te)
    r2 = 1 - np.sum((y_out_te - y_hat)**2) / np.sum((y_out_te - y_out_te.mean())**2)
    outer_scores.append(r2)
    best_alphas.append(best_alpha)
    print(f"Outer fold {i+1}: best_alpha={best_alpha:.3g}  R²={r2:.4f}")

print(f"\nNested CV R²: {np.mean(outer_scores):.4f} ± {np.std(outer_scores):.4f}")
print(f"Best alphas per fold: {[f'{a:.3g}' for a in best_alphas]}")
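For comparison, the same procedure can be written with scikit-learn's own tools. This is a sketch under the assumption that the grid and both splitters match the manual loop; in that case the per-fold scores and selected alphas are expected to agree, though that is an expectation rather than a recorded output. `cross_validate` with `return_estimator=True` is used instead of `cross_val_score` so the fitted searches can be inspected.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-2, 3, 10)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(pipe, {'ridge__alpha': alphas}, cv=inner_cv, scoring='r2')

# return_estimator=True keeps the fitted GridSearchCV from each outer fold
out = cross_validate(search, X, y, cv=outer_cv, scoring='r2', return_estimator=True)

sk_scores = out['test_score']
sk_alphas = [est.best_params_['ridge__alpha'] for est in out['estimator']]

for i, (score, alpha) in enumerate(zip(sk_scores, sk_alphas), start=1):
    print(f"Outer fold {i}: best_alpha={alpha:.3g}  R²={score:.4f}")
print(f"\nNested CV R²: {sk_scores.mean():.4f} ± {sk_scores.std():.4f}")
```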

Fair Model Comparison with Nested CV

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, 3, 7)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)

def nested_cv_score(estimator_class, param_grid):
    pipe = make_pipeline(StandardScaler(), estimator_class())
    # Map param names with pipeline prefix
    prefix = estimator_class.__name__.lower()
    prefixed_grid = {f'{prefix}__{k}': v for k, v in param_grid.items()}
    gs = GridSearchCV(pipe, prefixed_grid, cv=inner_cv, scoring='r2')
    return cross_val_score(gs, X, y, cv=outer_cv, scoring='r2')

results = {}
results['Ridge']         = nested_cv_score(Ridge, {'alpha': alphas})
results['Lasso']         = nested_cv_score(Lasso, {'alpha': alphas})

# Bar chart with error bars
names  = list(results.keys())
means  = [results[n].mean() for n in names]
stds   = [results[n].std()  for n in names]
colors = ['steelblue', 'tomato']

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(names, means, color=colors, alpha=0.8, width=0.5)
ax.errorbar(names, means, yerr=stds, fmt='none', color='black',
            capsize=6, linewidth=2)
ax.set_ylabel('Nested CV R²')
ax.set_title('Fair model comparison via nested CV\n(outer 5-fold / inner 3-fold)')
ax.set_ylim(0.3, 0.6)

for bar, m, s, n in zip(bars, means, stds, names):
    ax.text(bar.get_x() + bar.get_width()/2, m + s + 0.005,
            f'{m:.3f}±{s:.3f}', ha='center', fontsize=9)

plt.tight_layout(); plt.show()

print("\nFull results:")
for name in names:
    sc = results[name]
    print(f"  {name:<20} R² = {sc.mean():.4f} ± {sc.std():.4f}  per-fold: {np.round(sc, 3)}")

Statistical Testing — Is Model A Actually Better?

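A difference in mean R² might just be noise. A paired t-test on the per-fold scores tests whether the mean difference is statistically distinguishable from zero. Treat the p-value as a rough guide: five outer folds give the test little power, and because the folds' training sets overlap, the per-fold scores are not independent, which makes the standard test somewhat overconfident.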

import numpy as np
from scipy import stats

# Use scores from the comparison above
ridge_sc = results['Ridge']
lasso_sc = results['Lasso']

diff = ridge_sc - lasso_sc
t_stat, p_value = stats.ttest_rel(ridge_sc, lasso_sc)

print(f"Ridge mean R²  : {ridge_sc.mean():.4f}")
print(f"Lasso mean R²  : {lasso_sc.mean():.4f}")
print(f"Mean difference: {diff.mean():.4f} (Ridge - Lasso)")
print(f"Paired t-stat  : {t_stat:.4f}")
print(f"p-value        : {p_value:.4f}")
print()
if p_value < 0.05:
    winner = 'Ridge' if diff.mean() > 0 else 'Lasso'
    print(f"Conclusion: {winner} is significantly better (p < 0.05).")
else:
    print("Conclusion: No statistically significant difference between the two models (p >= 0.05).")
    print("The observed gap may be noise from the CV folds.")
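To make the test statistic concrete, the sketch below computes the paired t-statistic by hand and checks it against `scipy.stats.ttest_rel`. It then applies the Nadeau-Bengio variance correction, a different technique from the one used above, which is sometimes used because CV training sets overlap. The fold scores are hypothetical numbers made up for illustration, and the correction term $1/(k-1)$ assumes $k$ equal-sized folds.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold R² values, made up for illustration only
scores_a = np.array([0.46, 0.51, 0.48, 0.43, 0.52])
scores_b = np.array([0.45, 0.49, 0.48, 0.41, 0.50])

d = scores_a - scores_b
k = len(d)
mean_d = d.mean()
var_d = d.var(ddof=1)  # sample variance of the per-fold differences

# Standard paired t-test: variance of the mean difference is var_d / k
t_plain = mean_d / np.sqrt(var_d / k)
p_plain = 2 * stats.t.sf(abs(t_plain), df=k - 1)

# Nadeau-Bengio correction: add n_test / n_train to the 1/k factor,
# which is 1 / (k - 1) for k-fold CV with equal-sized folds (assumption)
t_corr = mean_d / np.sqrt((1 / k + 1 / (k - 1)) * var_d)
p_corr = 2 * stats.t.sf(abs(t_corr), df=k - 1)

t_ref, p_ref = stats.ttest_rel(scores_a, scores_b)
print(f"manual    t={t_plain:.4f}  p={p_plain:.4f}")
print(f"scipy     t={t_ref:.4f}  p={p_ref:.4f}")
print(f"corrected t={t_corr:.4f}  p={p_corr:.4f}")
```

The corrected statistic is always smaller in magnitude than the plain one, so it is the more conservative of the two.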

When to Use Nested CV

| Situation | Use nested CV? | Reason |
|---|---|---|
| Final model benchmarking for a paper or deployment decision | Yes | Keeps evaluation honest |
| Comparing two or more model families | Yes | Prevents biased selection |
| Quick prototyping during development | No | Too slow; use simple CV |
| Very large dataset ($n > 100\,000$) | Rarely | Simple holdout usually sufficient |
| Small dataset ($n < 200$) | Yes | Extra folds extract more signal |
| Reporting to regulators or executives | Yes | Credibility requires rigour |

Try It in the Browser

Manual 2-outer-fold, 3-inner-fold nested CV in pure Python.
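A minimal sketch of what such a cell could look like, using only the standard library. The one-feature ridge model without intercept, the synthetic data, the alpha grid, and all helper names are illustrative assumptions, not the contents of the original cell. Scores are mean squared errors, so lower is better.

```python
import random

def make_folds(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]

def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge slope for a single feature and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_mse(xs, ys, indices, k, alpha):
    """Mean validation MSE of one alpha over k folds of `indices`."""
    errors = []
    for val in make_folds(indices, k):
        train = [i for i in indices if i not in val]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        errors.append(mse(w, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(errors) / k

# Synthetic data (assumed): y = 2x + Gaussian noise
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(24)]
ys = [2 * x + rng.gauss(0, 0.3) for x in xs]
alphas = [0.01, 0.1, 1.0, 10.0]

indices = list(range(len(xs)))
rng.shuffle(indices)

outer_scores, chosen = [], []
for outer_test in make_folds(indices, 2):          # 2 outer folds
    outer_train = [i for i in indices if i not in outer_test]
    # Inner loop: 3-fold CV on the outer training indices only
    best_alpha = min(alphas, key=lambda a: cv_mse(xs, ys, outer_train, 3, a))
    # Refit on the full outer training set, then score on the outer test fold
    w = fit_ridge_1d([xs[i] for i in outer_train], [ys[i] for i in outer_train], best_alpha)
    outer_scores.append(mse(w, [xs[i] for i in outer_test], [ys[i] for i in outer_test]))
    chosen.append(best_alpha)

print("Chosen alphas:", chosen)
print("Outer-fold MSE:", [round(s, 4) for s in outer_scores])
print("Nested CV MSE:", round(sum(outer_scores) / len(outer_scores), 4))
```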

Guided Practice

Why does non-nested CV give an optimistically biased performance estimate when hyperparameters are tuned?

  • The best hyperparameter is selected to maximise the CV score on those same folds, inflating the reported score
    Correct. Every selection over a set of candidates uses the CV folds as a test set; the maximum is biased upward by chance variation.

  • The model trains on too little data per fold
    Data size per fold is a bias-variance concern, not the source of the selection bias described here.

  • The inner and outer folds overlap
    In standard GridSearchCV the folds do not overlap; the bias comes from using CV scores both to select and to report.

  • Ridge always overfits when alpha is small
    Overfitting is a different issue from the selection bias in non-nested CV.

In nested CV, what does the inner loop do?

  • Evaluates the final model on unseen data
    Evaluation is the job of the outer loop.

  • Selects the best hyperparameters on the outer training fold only
    Correct. The inner loop runs GridSearch on the data left after the outer fold is held out.

  • Computes the final production predictions
    Nested CV is for evaluation; a separate final fit on all data produces production predictions.

  • Removes outliers from the training data
    Data cleaning is not part of the nested CV loop.

A paired t-test on nested CV fold scores returns p = 0.4. What should you conclude?

  • Model A is significantly better than Model B
    p = 0.4 is well above the conventional 0.05 threshold for significance.

  • There is no statistically significant difference; the gap may be noise
    Correct. A high p-value means we cannot reject the null that both models perform equally.

  • Model B is better because the test failed
    Failing to reject the null does not favour either model.

  • You need to use a different dataset
    The p-value does not imply the dataset is wrong.

With outer k=5, inner k=3, and a grid of 10 alpha values, how many model fits does nested CV run?

  • 15
    That would be outer × inner only, ignoring the grid size.

  • 50
    That would be outer × grid, ignoring the inner folds.

  • 150 (5 × 3 × 10)
    Correct. Each of 5 outer folds runs a 3-fold grid search over 10 candidates = 150 fits, not counting the refits.

  • 300
    Incorrect. Refitting the best model on the full outer training set adds only one fit per outer fold (5 in total), giving 155 rather than 300.

Exercises

Exercise 1 — Quantify the bias gap on multiple datasets

import numpy as np
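# load_boston was removed in scikit-learn 1.2; fetch_california_housing is one
# possible second regression dataset (it downloads the data on first use)
from sklearn.datasets import load_diabetes, fetch_california_housing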
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

alphas = np.logspace(-3, 3, 10)
pipe   = make_pipeline(StandardScaler(), Ridge())

X, y = load_diabetes(return_X_y=True)

# TODO: compute non_nested_score and nested_score for Ridge on diabetes
# Print the gap and interpret: is it large enough to matter for deployment decisions?

Exercise 2 — Three-model comparison with bar chart

Compare Ridge, Lasso, and a LinearRegression baseline using nested CV on the diabetes dataset. Plot a bar chart with error bars. Run a paired t-test between the two best models.

%matplotlib inline
import numpy as np, matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from scipy import stats

X, y = load_diabetes(return_X_y=True)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
alphas   = np.logspace(-3, 3, 7)

# TODO:
# 1. Nested CV score for LinearRegression (no inner grid needed — use plain cross_val_score)
# 2. Nested CV score for Ridge (GridSearchCV inner + cross_val_score outer)
# 3. Nested CV score for Lasso (same)
# 4. Bar chart with error bars
# 5. Paired t-test between the two best models

Exercise 3 — Speed up with n_jobs=-1

Re-run the model comparison from the notebook cells above but pass n_jobs=-1 to both GridSearchCV and cross_val_score. Time both versions with %%timeit or time.time() and report the speedup factor.

import time
# Your code here

Common Pitfalls

Summary

Key takeaways
| Concept | One-line meaning |
|---|---|
| Selection bias | Picking the best CV score from a grid inflates the reported performance |
| Nested CV structure | Outer loop evaluates; inner loop tunes and never sees the outer test fold |
| sklearn pattern | `cross_val_score(GridSearchCV(pipe, grid, cv=inner_cv), X, y, cv=outer_cv)` |
| Cost | $k_1 \times k_2 \times \lvert\text{param\_grid}\rvert$ fits |
| After nested CV | Re-run the hyperparameter search on all data and keep the refitted model; nested CV was for evaluation only |
| Statistical comparison | Paired t-test on outer fold scores; $p < 0.05$ to claim significance |
| When to use | Final benchmarks, comparing models, regulatory reporting |
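The "After nested CV" row can be sketched as follows: once the nested estimate has been reported, one ordinary grid search on all the data selects the hyperparameters and refits the model that would be deployed. The grid, the 5-fold splitter, and the variable names are illustrative assumptions; the nested CV score remains the performance estimate, not `best_score_` from this final search.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold

X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), Ridge())
grid = {'ridge__alpha': np.logspace(-3, 3, 7)}

# One plain search on ALL data; refit=True (the default) retrains the
# winning pipeline on the full dataset after the search finishes
final_search = GridSearchCV(pipe, grid,
                            cv=KFold(n_splits=5, shuffle=True, random_state=0),
                            scoring='r2')
final_search.fit(X, y)

final_model = final_search.best_estimator_   # pipeline fitted on all data
preds = final_model.predict(X[:3])           # example predictions

print("Selected alpha:", final_search.best_params_['ridge__alpha'])
print("Example predictions:", np.round(preds, 1))
```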

Next Up — Hyperparameter Tuning

You can now evaluate models fairly. Next: search for the best hyperparameters efficiently.

The next notebook — Hyperparameter Tuning — goes deeper into search strategies: grid search, random search, Bayesian optimisation, and early stopping. It shows how to set up a tuning pipeline that avoids leakage, scales to many parameters, and terminates efficiently.

Dependencies: nested CV structure, GridSearchCV / RandomizedSearchCV, and the principle that hyperparameter selection must not see the test fold.