
Nested CV & Model Comparison¶
Fair Evaluation When You Also Tune Hyperparameters¶
Standard cross-validation gives an unbiased generalisation estimate — until you use the same CV loop to also select hyperparameters. Every time you pick the best alpha or the best kernel based on CV scores, you introduce an optimistic bias. Nested CV separates tuning from evaluation with two independent loops: the inner loop tunes, the outer loop measures.
The Problem with Non-Nested CV¶
Consider this workflow:
1. Try Ridge with alpha in {0.01, 0.1, 1, 10}.
2. Pick the alpha with the best 5-fold CV score.
3. Report that best CV score as your model’s performance.
The bias: step 2 selected the alpha that happened to do best on these folds. The reported score is the maximum over four candidates, not the expected score of a fresh model. The more hyperparameters you search, the more this inflates.
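To see the size of this effect in isolation, here is a minimal simulation sketch. It assumes every candidate has the same true score (0.50) and that each CV estimate carries independent Gaussian noise (sd 0.05). Both numbers and the helper name `mean_best_score` are illustrative assumptions, not values from the diabetes experiment.

```python
import random
import statistics

random.seed(0)
TRUE_SCORE = 0.50   # assumed: all candidates are equally good
NOISE_SD = 0.05     # assumed noise of a single CV estimate
N_REPEATS = 5000    # number of simulated "grid searches"

def mean_best_score(n_candidates):
    """Average of the best noisy score when picking among n_candidates."""
    best = [
        max(random.gauss(TRUE_SCORE, NOISE_SD) for _ in range(n_candidates))
        for _ in range(N_REPEATS)
    ]
    return statistics.mean(best)

# Gap between the reported "best" score and the true score, per grid size
inflation = {k: mean_best_score(k) - TRUE_SCORE for k in (1, 4, 20)}
for k, gap in inflation.items():
    print(f"{k:>2} candidates: best score overstates the truth by {gap:+.4f}")
```

With one candidate there is nothing to select, so the gap stays near zero; it grows as the grid grows. In a real grid search the scores of neighbouring alphas are strongly correlated, so the inflation is usually smaller than under this independence assumption.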
Demonstrating the Bias Empirically¶
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, 3, 20)
pipe = make_pipeline(StandardScaler(), Ridge())
# Non-nested: pick best alpha by 5-fold, then report that same CV score
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
gs = GridSearchCV(pipe, {'ridge__alpha': alphas}, cv=inner_cv,
                  scoring='r2', return_train_score=False)
gs.fit(X, y)
non_nested_score = gs.best_score_
# Nested: outer 5-fold wraps inner 3-fold GridSearch
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
gs_nested = GridSearchCV(pipe, {'ridge__alpha': alphas},
                         cv=KFold(n_splits=3, shuffle=True, random_state=42),
                         scoring='r2')
nested_scores = cross_val_score(gs_nested, X, y, cv=outer_cv, scoring='r2')
nested_score = nested_scores.mean()
print(f"Non-nested CV R² : {non_nested_score:.4f} (optimistically biased)")
print(f"Nested CV R² : {nested_score:.4f} ± {nested_scores.std():.4f} (unbiased)")
print(f"Optimism gap : {non_nested_score - nested_score:.4f}")
fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(['Non-nested\n(biased)', 'Nested\n(unbiased)'],
       [non_nested_score, nested_score],
       color=['tomato', 'steelblue'], width=0.4)
ax.errorbar([1], [nested_score], yerr=[nested_scores.std()],
            fmt='none', color='black', capsize=6, linewidth=2)
ax.set_ylabel('R²')
ax.set_title('Non-nested vs Nested CV on the diabetes dataset')
ax.set_ylim(0.3, 0.6)
plt.tight_layout()
plt.show()
How Nested CV Works: Step by Step¶
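For outer $k_1$-fold and inner $k_2$-fold:
1. Split data into $k_1$ outer folds.
2. For each outer fold $i$:
   - Hold out fold $i$ as the outer test set.
   - On the remaining $k_1 - 1$ folds, run inner $k_2$-fold GridSearch to find the best hyperparameters.
   - Refit the model with those best hyperparameters on all training folds.
   - Evaluate on the held-out outer fold → get score $s_i$.
3. Report the mean outer score $\bar{s} = \frac{1}{k_1} \sum_{i=1}^{k_1} s_i$.
Total fits = $k_1 \times k_2 \times \lvert\text{grid}\rvert$, where $\lvert\text{grid}\rvert$ is the number of hyperparameter candidates. For $k_1 = 5$, $k_2 = 3$, 10 alphas = 150 fits.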
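The fit count above is simple arithmetic, sketched here as a small helper. The function name `nested_cv_fit_count` and the `count_refits` flag are hypothetical; the flag adds the one refit per outer fold from the refit step, which the 150 figure leaves out.

```python
def nested_cv_fit_count(k_outer, k_inner, n_candidates, count_refits=False):
    """Number of model fits for nested CV with an exhaustive grid search."""
    # Every outer fold runs a full inner grid search
    search_fits = k_outer * k_inner * n_candidates
    # Optionally add the refit of the winning candidate on each outer training set
    refits = k_outer if count_refits else 0
    return search_fits + refits

print(nested_cv_fit_count(5, 3, 10))                     # inner-search fits only
print(nested_cv_fit_count(5, 3, 10, count_refits=True))  # including refits
```

Because the cost is a product, doubling the grid or either fold count doubles the run time, which is why nested CV is usually reserved for final evaluation.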
Nested CV From Scratch¶
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, cross_val_score
X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-2, 3, 10)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
outer_scores = []
best_alphas = []
for i, (outer_tr, outer_te) in enumerate(outer_cv.split(X)):
    X_out_tr, X_out_te = X[outer_tr], X[outer_te]
    y_out_tr, y_out_te = y[outer_tr], y[outer_te]
    # Inner loop: find best alpha on outer_tr
    best_score, best_alpha = -np.inf, None
    for alpha in alphas:
        pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
        inner_sc = cross_val_score(pipe, X_out_tr, y_out_tr,
                                   cv=inner_cv, scoring='r2')
        if inner_sc.mean() > best_score:
            best_score = inner_sc.mean()
            best_alpha = alpha
    # Refit best model on full outer train, evaluate on outer test
    pipe_best = make_pipeline(StandardScaler(), Ridge(alpha=best_alpha))
    pipe_best.fit(X_out_tr, y_out_tr)
    y_hat = pipe_best.predict(X_out_te)
    r2 = 1 - np.sum((y_out_te - y_hat)**2) / np.sum((y_out_te - y_out_te.mean())**2)
    outer_scores.append(r2)
    best_alphas.append(best_alpha)
    print(f"Outer fold {i+1}: best_alpha={best_alpha:.3g} R²={r2:.4f}")
print(f"\nNested CV R²: {np.mean(outer_scores):.4f} ± {np.std(outer_scores):.4f}")
print(f"Best alphas per fold: {[f'{a:.3g}' for a in best_alphas]}")
Fair Model Comparison with Nested CV¶
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, 3, 7)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
def nested_cv_score(estimator_class, param_grid):
    pipe = make_pipeline(StandardScaler(), estimator_class())
    # Map param names with pipeline prefix (make_pipeline names each step
    # after its lowercased class name, e.g. 'ridge' or 'lasso')
    prefix = estimator_class.__name__.lower()
    prefixed_grid = {f'{prefix}__{k}': v for k, v in param_grid.items()}
    gs = GridSearchCV(pipe, prefixed_grid, cv=inner_cv, scoring='r2')
    return cross_val_score(gs, X, y, cv=outer_cv, scoring='r2')
results = {}
results['Ridge'] = nested_cv_score(Ridge, {'alpha': alphas})
results['Lasso'] = nested_cv_score(Lasso, {'alpha': alphas})
# Bar chart with error bars
names = list(results.keys())
means = [results[n].mean() for n in names]
stds = [results[n].std() for n in names]
colors = ['steelblue', 'tomato']
fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(names, means, color=colors, alpha=0.8, width=0.5)
ax.errorbar(names, means, yerr=stds, fmt='none', color='black',
capsize=6, linewidth=2)
ax.set_ylabel('Nested CV R²')
ax.set_title('Fair model comparison via nested CV\n(outer 5-fold / inner 3-fold)')
ax.set_ylim(0.3, 0.6)
for bar, m, s, n in zip(bars, means, stds, names):
    ax.text(bar.get_x() + bar.get_width()/2, m + s + 0.005,
            f'{m:.3f}±{s:.3f}', ha='center', fontsize=9)
plt.tight_layout(); plt.show()
print("\nFull results:")
for name in names:
    sc = results[name]
    print(f" {name:<20} R² = {sc.mean():.4f} ± {sc.std():.4f} per-fold: {np.round(sc, 3)}")
Statistical Testing: Is Model A Actually Better?¶
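A difference in mean R² might just be noise. A paired t-test on the per-fold scores tests whether the difference is statistically significant. Treat the p-value as a rough guide: the folds share training data, so their scores are not fully independent, and five folds give the test little power.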
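As a sketch of what the paired test computes, the statistic is the mean per-fold difference divided by its standard error, $t = \bar{d} / (s_d / \sqrt{k})$, compared against a t distribution with $k - 1$ degrees of freedom. The per-fold scores below are made-up numbers for illustration only, not results from the diabetes dataset.

```python
import math
import statistics

# Hypothetical per-fold R² scores for two models on the same 5 outer folds
model_a = [0.50, 0.48, 0.52, 0.47, 0.51]
model_b = [0.49, 0.47, 0.50, 0.47, 0.49]

# Pairing: compare the two models fold by fold
diffs = [a - b for a, b in zip(model_a, model_b)]
k = len(diffs)

mean_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(k)  # sample sd (n - 1 denominator)
t_stat = mean_diff / se_diff                      # same statistic as a paired t-test

print(f"mean difference = {mean_diff:.4f}, t = {t_stat:.3f}, df = {k - 1}")
```

Pairing matters because both models see the same folds: a fold that is hard for one model tends to be hard for the other, and taking differences removes that shared variation.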
import numpy as np
from scipy import stats
# Use scores from the comparison above
ridge_sc = results['Ridge']
lasso_sc = results['Lasso']
diff = ridge_sc - lasso_sc
t_stat, p_value = stats.ttest_rel(ridge_sc, lasso_sc)
print(f"Ridge mean R² : {ridge_sc.mean():.4f}")
print(f"Lasso mean R² : {lasso_sc.mean():.4f}")
print(f"Mean difference: {diff.mean():.4f} (Ridge - Lasso)")
print(f"Paired t-stat : {t_stat:.4f}")
print(f"p-value : {p_value:.4f}")
print()
if p_value < 0.05:
    winner = 'Ridge' if diff.mean() > 0 else 'Lasso'
    print(f"Conclusion: {winner} is significantly better (p < 0.05).")
else:
    print("Conclusion: No statistically significant difference between the two models (p >= 0.05).")
    print("The observed gap may be noise from the CV folds.")
When to Use Nested CV¶
| Situation | Use nested CV? | Reason |
|---|---|---|
| Final model benchmarking for a paper or deployment decision | Yes | Keeps evaluation honest |
| Comparing two or more model families | Yes | Prevents biased selection |
| Quick prototyping during development | No | Too slow; use simple CV |
| Very large dataset | Rarely | Simple holdout usually sufficient |
| Small dataset | Yes | Extra folds extract more signal |
| Reporting to regulators or executives | Yes | Credibility requires rigour |
Try It in the Browser¶
Manual 2-outer-fold, 3-inner-fold nested CV in pure Python.
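A minimal standard-library sketch of that setup follows. Everything in it is an assumption made for illustration: the toy data (y ≈ 2x plus noise), the one-feature ridge without intercept, the helper names, and the use of mean squared error instead of R² as the score.

```python
import random

# Toy data (assumed for illustration): one feature, y = 2x + noise
random.seed(0)
xs = [i / 10 for i in range(1, 25)]                  # 24 points
ys = [2.0 * x + random.gauss(0, 0.3) for x in xs]

def make_folds(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]

def fit_ridge(idx, alpha):
    """1-D ridge without intercept: w = sum(x*y) / (sum(x*x) + alpha)."""
    sxy = sum(xs[i] * ys[i] for i in idx)
    sxx = sum(xs[i] ** 2 for i in idx)
    return sxy / (sxx + alpha)

def mse(w, idx):
    """Mean squared error of slope w on the rows in idx."""
    return sum((ys[i] - w * xs[i]) ** 2 for i in idx) / len(idx)

def cv_mse(train_idx, alpha, k):
    """Inner loop: mean validation MSE of one alpha over k folds of train_idx."""
    folds = make_folds(train_idx, k)
    errors = []
    for j, val_idx in enumerate(folds):
        fit_idx = [i for m, f in enumerate(folds) if m != j for i in f]
        errors.append(mse(fit_ridge(fit_idx, alpha), val_idx))
    return sum(errors) / k

alphas = [0.01, 0.1, 1, 10]
indices = list(range(len(xs)))
random.shuffle(indices)

outer_scores, chosen = [], []
outer_folds = make_folds(indices, 2)                 # 2 outer folds
for o, test_idx in enumerate(outer_folds):
    train_idx = [i for m, f in enumerate(outer_folds) if m != o for i in f]
    # Tune on the outer training rows only (3 inner folds)
    best_alpha = min(alphas, key=lambda a: cv_mse(train_idx, a, 3))
    # Refit on all outer training rows, then score on the untouched test rows
    w = fit_ridge(train_idx, best_alpha)
    outer_scores.append(mse(w, test_idx))
    chosen.append(best_alpha)
    print(f"Outer fold {o + 1}: best alpha={best_alpha} test MSE={outer_scores[-1]:.4f}")

print(f"Nested CV MSE: {sum(outer_scores) / len(outer_scores):.4f}")
```

The key property to check is that `test_idx` never appears inside `cv_mse`: the inner loop only ever sees `train_idx`, so the outer score is not influenced by the choice of alpha.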
Guided Practice¶
Why does non-nested CV give an optimistically biased performance estimate when hyperparameters are tuned?¶
In nested CV, what does the inner loop do?¶
A paired t-test on nested CV fold scores returns p = 0.4. What should you conclude?¶
With outer k=5, inner k=3, and a grid of 10 alpha values, how many model fits does nested CV run?¶
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
alphas = np.logspace(-3, 3, 10)
pipe = make_pipeline(StandardScaler(), Ridge())
X, y = load_diabetes(return_X_y=True)
# TODO: compute non_nested_score and nested_score for Ridge on diabetes
# Print the gap and interpret: is it large enough to matter for deployment decisions?
Exercise 2 — Three-model comparison with bar chart¶
Compare Ridge, Lasso, and a LinearRegression baseline using nested CV on the diabetes dataset. Plot a bar chart with error bars. Run a paired t-test between the two best models.
%matplotlib inline
import numpy as np, matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from scipy import stats
X, y = load_diabetes(return_X_y=True)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
alphas = np.logspace(-3, 3, 7)
# TODO:
# 1. Nested CV score for LinearRegression (no inner grid needed — use plain cross_val_score)
# 2. Nested CV score for Ridge (GridSearchCV inner + cross_val_score outer)
# 3. Nested CV score for Lasso (same)
# 4. Bar chart with error bars
# 5. Paired t-test between the two regularised models
Exercise 3 — Speed up with n_jobs=-1¶
Re-run the model comparison from the notebook cells above but pass n_jobs=-1 to both GridSearchCV and cross_val_score. Time both versions with %%timeit or time.time() and report the speedup factor.
import time
# Your code here
Common Pitfalls¶
Summary¶
Key takeaways
| Concept | One-line meaning |
|---|---|
| Selection bias | Picking the best CV score from a grid inflates the reported performance |
| Nested CV structure | Outer loop evaluates; inner loop tunes — they never share folds |
| sklearn pattern | cross_val_score(GridSearchCV(pipe, grid, cv=inner_cv), X, y, cv=outer_cv) |
| Cost | $k_1 \times k_2 \times \lvert\text{grid}\rvert$ model fits (for example, 5 × 3 × 10 = 150) |
| After nested CV | Refit final model on all data with best params — nested CV was for evaluation only |
| Statistical comparison | Run a paired t-test on the outer fold scores before claiming a significant difference |
| When to use | Final benchmarks, comparing models, regulatory reporting |
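
The "After nested CV" row can be sketched as follows, assuming the same Ridge pipeline and diabetes data used earlier in this notebook. The grid and fold settings here are illustrative choices, not a prescribed configuration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
alphas = np.logspace(-3, 3, 7)

# One ordinary grid search on ALL the data, used only to build the final model
search = GridSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    {"ridge__alpha": alphas},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="r2",
)
search.fit(X, y)  # refit=True (the default) retrains the winner on every row

final_model = search.best_estimator_
print(f"Chosen alpha: {search.best_params_['ridge__alpha']:.3g}")
# search.best_score_ is the optimistic non-nested number;
# the figure to report is the nested CV estimate computed separately.
```

The nested CV score describes the whole procedure (tune, then fit), so it remains the honest estimate for `final_model` even though that model was trained on all rows.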
Next Up — Hyperparameter Tuning¶

You can now evaluate models fairly. Next: search for the best hyperparameters efficiently.¶
The next notebook — Hyperparameter Tuning — goes deeper into search strategies: grid search, random search, Bayesian optimisation, and early stopping. It shows how to set up a tuning pipeline that avoids leakage, scales to many parameters, and terminates efficiently.
Dependencies: nested CV structure, GridSearchCV / RandomizedSearchCV, and the principle that hyperparameter selection must not see the test fold.