Hyperparameter Tuning

Finding the Right Settings Before Training Begins

Model parameters are learned from data. Hyperparameters are chosen by you — before training starts. The difference between default settings and well-tuned ones can be the difference between a mediocre model and a deployable one. This notebook covers the full ladder: manual intuition, grid search, random search, Bayesian optimisation, and successive halving.

Why Hyperparameter Tuning Matters in Business

A real cost example: a fraud detection model with default settings might flag 60% of true fraud cases. After tuning the decision threshold and regularisation strength, the same algorithm catches 82% — not because the data changed, but because the knobs were set correctly. For a bank processing a million transactions a day, that gap is worth millions.

Hyperparameter tuning bridges the gap between a working prototype and a production-grade model. The three business levers are:

| Business goal | What to tune | Why |
| --- | --- | --- |
| Reduce false positives (spam, fraud alerts) | Decision threshold, regularisation | Controls precision–recall trade-off |
| Maximise revenue from recommendations | Learning rate, tree depth | Drives ranking quality |
| Minimise compute cost | n_estimators, early stopping | Fewer fits, same accuracy |
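
To make the first row concrete, here is a minimal sketch of a decision-threshold sweep. The synthetic imbalanced dataset, the logistic-regression model, and the threshold values are all assumptions chosen for illustration; this is not the fraud model described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for a fraud dataset (assumption: ~10% positives)
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_va)[:, 1]  # estimated probability of the positive class

thresholds = [0.1, 0.3, 0.5, 0.7]      # illustrative values, not recommendations
precisions, recalls = [], []
for t in thresholds:
    pred = (proba >= t).astype(int)    # flag a case when its probability reaches t
    precisions.append(precision_score(y_va, pred, zero_division=0))
    recalls.append(recall_score(y_va, pred))
    print(f"threshold={t:.1f}  precision={precisions[-1]:.3f}  recall={recalls[-1]:.3f}")
```

Lowering the threshold can only keep or raise recall, usually at the expense of precision. The threshold is itself a hyperparameter, so it should be chosen on validation data rather than on the final test set.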

Parameters vs Hyperparameters

| | Parameters | Hyperparameters |
| --- | --- | --- |
| Set by | Optimiser during training | You, before training |
| Examples | Ridge coefficients $\boldsymbol{\theta}$, neural network weights | $\lambda$ (regularisation), tree depth, learning rate |
| How to find best | Minimise loss (gradient descent / Normal Equations) | Search + cross-validation |
| Stored in | model.coef_, model.intercept_ | model.get_params() |

$$
\underbrace{\color{#1f77b4}{\boldsymbol{\theta}^*}}_{\text{learned by optimiser}} = \arg\min_{\boldsymbol{\theta}} \; J(\boldsymbol{\theta};\; \color{#ff7f0e}{\lambda}) \qquad \color{#ff7f0e}{\lambda} = \text{hyperparameter, chosen by search}
$$
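
The distinction is visible directly on a fitted estimator. The short sketch below fits a Ridge model on the diabetes data with an arbitrarily chosen alpha=1.0 (an assumption for illustration) and prints where each kind of value lives.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

model = Ridge(alpha=1.0)      # hyperparameter: chosen by you, before fitting
hyper = model.get_params()    # readable even before fit() is called
model.fit(X, y)               # parameters (coef_, intercept_) exist only after fit()

print("hyperparameter alpha:", hyper["alpha"])
print("learned coefficients:", model.coef_.round(1))
print("learned intercept   :", round(float(model.intercept_), 1))
```

Changing alpha and refitting gives different coefficients, which is exactly why alpha has to be chosen by an outer search rather than by the optimiser itself.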

The Tuning Strategy Ladder

Grid search evaluates every combination of hyperparameter values you specify. It is exhaustive, reproducible, and simple to interpret — but cost grows multiplicatively with each new dimension.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
import time

X, y = load_diabetes(return_X_y=True)

pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {
    'ridge__alpha': np.logspace(-3, 3, 13),
}

t0 = time.time()
gs = GridSearchCV(pipe, param_grid, cv=5, scoring='r2',
                  return_train_score=True, n_jobs=-1)
gs.fit(X, y)
t_grid = time.time() - t0

print(f"Best alpha : {gs.best_params_['ridge__alpha']:.4g}")
print(f"Best CV R\u00b2 : {gs.best_score_:.4f}")
print(f"Grid size  : {len(param_grid['ridge__alpha'])} points")
print(f"Wall time  : {t_grid:.2f}s")

# Plot alpha vs CV score
alphas   = [p['ridge__alpha'] for p in gs.cv_results_['params']]
cv_mean  = gs.cv_results_['mean_test_score']
cv_std   = gs.cv_results_['std_test_score']
tr_mean  = gs.cv_results_['mean_train_score']

fig, ax = plt.subplots(figsize=(8, 4))
ax.semilogx(alphas, cv_mean,  'r-o', linewidth=2, label='CV R\u00b2')
ax.fill_between(alphas, cv_mean - cv_std, cv_mean + cv_std, alpha=0.15, color='red')
ax.semilogx(alphas, tr_mean, 'b--o', linewidth=1.5, label='Train R\u00b2', alpha=0.6)
ax.axvline(gs.best_params_['ridge__alpha'], color='green', linestyle=':',
           label=f"Best alpha={gs.best_params_['ridge__alpha']:.3g}")
ax.set_xlabel(r'alpha ($\lambda$)')
ax.set_ylabel('R\u00b2')
ax.set_title('GridSearchCV: Ridge alpha sweep')
ax.legend()
plt.tight_layout()
plt.show()

Random Search — Faster for High-Dimensional Grids

The key insight: if only a few hyperparameters matter, random search wastes far fewer evaluations than grid search on the unimportant ones. With a 10-point grid over 5 parameters, grid search evaluates $10^5 = 100\,000$ combinations (each one multiplied by the number of CV folds); random search with n_iter=100 evaluates just 100.
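
The arithmetic can be checked with sklearn's ParameterGrid and ParameterSampler helpers. The five parameter names below are hypothetical placeholders; only the counts matter.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Five hypothetical hyperparameters with ten candidate values each
values = list(np.logspace(-3, 3, 10))
space = {f"param_{i}": values for i in range(5)}

n_grid = len(ParameterGrid(space))   # every combination in the grid
sampled = list(ParameterSampler(space, n_iter=100, random_state=0))

print(f"grid combinations   : {n_grid}")
print(f"random-search draws : {len(sampled)}")
```

Each count is the number of candidate settings; with k-fold cross-validation the number of model fits is k times larger in both cases.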

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
from scipy.stats import loguniform
import time

X, y = load_diabetes(return_X_y=True)

pipe = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())

# Grid search over degree × alpha
grid = {'polynomialfeatures__degree': [1, 2, 3],
        'ridge__alpha': np.logspace(-3, 3, 10)}

# Random search over same space — but alpha is continuous
rand_dist = {'polynomialfeatures__degree': [1, 2, 3],
             'ridge__alpha': loguniform(1e-3, 1e3)}

t0 = time.time()
gs = GridSearchCV(pipe, grid, cv=3, scoring='r2', n_jobs=-1)
gs.fit(X, y)
t_grid = time.time() - t0

t0 = time.time()
rs = RandomizedSearchCV(pipe, rand_dist, n_iter=30, cv=3,
                        scoring='r2', random_state=42, n_jobs=-1)
rs.fit(X, y)
t_rand = time.time() - t0

print(f"Grid Search   — best R\u00b2={gs.best_score_:.4f}  fits={len(gs.cv_results_['params'])*3}  time={t_grid:.2f}s")
print(f"  best params: {gs.best_params_}")
print(f"Random Search — best R\u00b2={rs.best_score_:.4f}  fits={30*3}  time={t_rand:.2f}s")
print(f"  best params: {rs.best_params_}")

# Compare sampled alpha distributions
gs_alphas = [p['ridge__alpha'] for p in gs.cv_results_['params']]
gs_scores = gs.cv_results_['mean_test_score']
rs_alphas = [p['ridge__alpha'] for p in rs.cv_results_['params']]
rs_scores = rs.cv_results_['mean_test_score']

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(gs_alphas, gs_scores, c='steelblue', s=40, alpha=0.7, label='Grid')
axes[0].set_xscale('log'); axes[0].set_xlabel(r'alpha ($\lambda$)'); axes[0].set_ylabel('CV R\u00b2')
axes[0].set_title(f'Grid Search ({len(gs_alphas)} combos)')
axes[1].scatter(rs_alphas, rs_scores, c='tomato', s=40, alpha=0.7, label='Random')
axes[1].set_xscale('log'); axes[1].set_xlabel(r'alpha ($\lambda$)'); axes[1].set_ylabel('CV R\u00b2')
axes[1].set_title(f'Random Search (30 combos)')
for ax in axes:
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Visualising the Search Surface — 2D Heatmap

When tuning two hyperparameters simultaneously, a heatmap of CV scores reveals the interaction between them — which combinations are good, and whether there is a clear optimum or a plateau.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
pipe  = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())
grid  = {'polynomialfeatures__degree': [1, 2, 3],
         'ridge__alpha': np.logspace(-3, 3, 8)}

gs = GridSearchCV(pipe, grid, cv=5, scoring='r2', n_jobs=-1)
gs.fit(X, y)

results = pd.DataFrame(gs.cv_results_)
pivot = results.pivot_table(
    index='param_polynomialfeatures__degree',
    columns='param_ridge__alpha',
    values='mean_test_score'
)

fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(pivot.values, aspect='auto', cmap='RdYlGn',
               vmin=pivot.values.min(), vmax=pivot.values.max())
plt.colorbar(im, ax=ax, label='CV R\u00b2')

ax.set_xticks(range(len(pivot.columns)))
ax.set_xticklabels([f'{a:.2g}' for a in pivot.columns], rotation=45, ha='right')
ax.set_yticks(range(len(pivot.index)))
ax.set_yticklabels(pivot.index)
ax.set_xlabel(r'Ridge alpha ($\lambda$)')
ax.set_ylabel('Polynomial degree')
ax.set_title('Hyperparameter search surface — 5-fold CV R\u00b2')

# Mark best
best_r = pivot.values.argmax() // pivot.shape[1]
best_c = pivot.values.argmax() %  pivot.shape[1]
ax.plot(best_c, best_r, 'w*', markersize=14, label='Best')
ax.legend()
plt.tight_layout()
plt.show()

print(f"Best: degree={gs.best_params_['polynomialfeatures__degree']}  alpha={gs.best_params_['ridge__alpha']:.3g}  R\u00b2={gs.best_score_:.4f}")

Bayesian Optimisation

Grid and random search are memoryless: each trial is chosen independently of the results so far. Bayesian optimisation builds a surrogate model of the objective function (a Gaussian Process or a Tree-structured Parzen Estimator) and uses it to decide which hyperparameters to try next, concentrating trials where improvement is most likely.

Available libraries: scikit-optimize (skopt), optuna, hyperopt, bayesian-optimization.

The cell below uses scikit-optimize if it is installed; otherwise it plots a synthetic illustration (randomly generated curves, not real search results).

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

try:
    from skopt import gp_minimize
    from skopt.space import Real
    from skopt.utils import use_named_args

    space = [Real(1e-4, 1e3, prior='log-uniform', name='alpha')]
    scores_by_trial = []

    @use_named_args(space)
    def objective(alpha):
        pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
        sc   = cross_val_score(pipe, X, y, cv=3, scoring='r2').mean()
        scores_by_trial.append((alpha, sc))
        return -sc  # minimise negative R²

    result = gp_minimize(objective, space, n_calls=25, random_state=42, verbose=False)
    best_alpha = result.x[0]
    best_r2    = -result.fun
    print(f"Bayesian (skopt): best alpha={best_alpha:.4g}  R\u00b2={best_r2:.4f}  in 25 trials")

    trials_a = [t[0] for t in scores_by_trial]
    trials_s = [t[1] for t in scores_by_trial]
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.scatter(range(len(trials_a)), trials_s, c=trials_a, cmap='viridis', s=50)
    ax.axhline(best_r2, color='tomato', linestyle='--', label=f'Best R\u00b2={best_r2:.3f}')
    ax.set_xlabel('Trial number')
    ax.set_ylabel('CV R\u00b2')
    ax.set_title('Bayesian optimisation — R\u00b2 per trial (colour = alpha value)')
    ax.legend()
    plt.tight_layout()
    plt.show()

except ImportError:
    print("scikit-optimize not installed. Showing manual illustration instead.")
    np.random.seed(42)
    n = 25
    rand_best  = np.maximum.accumulate(np.random.uniform(0.45, 0.50, n))
    bayes_best = np.maximum.accumulate(np.random.uniform(0.45, 0.50, n) +
                                       np.linspace(0, 0.02, n))
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(rand_best,  'r-o', markersize=4, label='Random search (best so far)')
    ax.plot(bayes_best, 'b-o', markersize=4, label='Bayesian optimisation (best so far)')
    ax.set_xlabel('Trial number')
    ax.set_ylabel('Best CV R\u00b2 so far')
    ax.set_title('Bayesian vs random search convergence (illustration)')
    ax.legend()
    plt.tight_layout()
    plt.show()
    print("Install scikit-optimize (`pip install scikit-optimize`) for real Bayesian tuning.")

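Successive Halving

sklearn’s HalvingGridSearchCV starts every candidate on a small resource budget (by default, a small number of training samples). After each round it keeps roughly the best 1/factor of the candidates and multiplies the budget for the survivors by factor. On large grids this can be 10–100× faster than standard GridSearchCV, though on a small dataset such as diabetes the speedup may be modest or absent.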
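
The round structure is easiest to see by working through the schedule by hand. The sketch below is a simplified model of successive halving, not sklearn's implementation: the function name halving_schedule is hypothetical, and the inputs (30 candidates, factor 3, 10 starting samples, 442 samples available) are assumptions picked to resemble the diabetes example. sklearn's own defaults and rounding may differ.

```python
import math

def halving_schedule(n_candidates, factor, min_resources, max_resources):
    """Return (round, candidates, samples per candidate) for a simplified halving run."""
    schedule = []
    n, r, i = n_candidates, min_resources, 0
    while r <= max_resources:
        schedule.append((i, n, r))
        if n <= 1:
            break                      # a single survivor: nothing left to eliminate
        n = math.ceil(n / factor)      # keep roughly the top 1/factor
        r *= factor                    # give the survivors factor times more data
        i += 1
    return schedule

schedule = halving_schedule(n_candidates=30, factor=3,
                            min_resources=10, max_resources=442)
for rnd, n, r in schedule:
    print(f"round {rnd}: {n:2d} candidates x {r:3d} samples")

# Rough cost proxy: candidates times samples, summed over rounds
halving_cost = sum(n * r for _, n, r in schedule)
full_cost = 30 * 442                   # every candidate trained on all samples
print(f"halving cost proxy: {halving_cost}   full grid cost proxy: {full_cost}")
```

Most candidates are discarded after seeing only a handful of samples, which is where the saving comes from. The same property is the main risk: a candidate that only performs well with more data can be eliminated too early.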

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
pipe  = make_pipeline(StandardScaler(), Ridge())
grid  = {'ridge__alpha': np.logspace(-4, 4, 30)}  # 30 candidates

t0 = time.time()
gs = GridSearchCV(pipe, grid, cv=5, scoring='r2', n_jobs=-1)
gs.fit(X, y)
t_gs = time.time() - t0

t0 = time.time()
hgs = HalvingGridSearchCV(pipe, grid, cv=5, scoring='r2',
                           factor=3, random_state=42, n_jobs=-1)
hgs.fit(X, y)
t_hgs = time.time() - t0

print(f"GridSearchCV        best R\u00b2={gs.best_score_:.4f}  alpha={gs.best_params_['ridge__alpha']:.3g}  time={t_gs:.2f}s")
print(f"HalvingGridSearchCV best R\u00b2={hgs.best_score_:.4f}  alpha={hgs.best_params_['ridge__alpha']:.3g}  time={t_hgs:.2f}s")
print(f"Speedup: {t_gs/max(t_hgs, 0.01):.1f}x")

# Visualise rounds
halving_results = hgs.cv_results_
rounds = halving_results['iter']
alphas_h = [p['ridge__alpha'] for p in halving_results['params']]
scores_h = halving_results['mean_test_score']

fig, ax = plt.subplots(figsize=(8, 4))
colors = {r: plt.cm.Blues(0.4 + 0.2*r) for r in set(rounds)}
for r in sorted(set(rounds)):
    mask = [ri == r for ri in rounds]
    ax.scatter([a for a, m in zip(alphas_h, mask) if m],
               [s for s, m in zip(scores_h, mask) if m],
               color=colors[r], s=60, label=f'Round {r}')
ax.set_xscale('log')
ax.set_xlabel(r'Ridge alpha ($\lambda$)')
ax.set_ylabel('CV R\u00b2')
ax.set_title('Successive halving: candidates remaining per round')
ax.legend()
plt.tight_layout()
plt.show()

Warm-Starting — Reusing Previous Fits

Some estimators support warm_start=True, which tells the model to reuse the solution from the previous fit as the starting point for the next. This avoids retraining from scratch when you are scanning along a regularisation path.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)
scaler = StandardScaler()
Xs = scaler.fit_transform(X)

alphas = np.logspace(-1, 2, 60)[::-1]  # decreasing: warm-start reuses coefs

# Cold start: fresh fit each time
t0 = time.time()
coefs_cold = []
for a in alphas:
    m = Lasso(alpha=a, max_iter=5000)
    m.fit(Xs, y)
    coefs_cold.append(m.coef_.copy())
t_cold = time.time() - t0

# Warm start: reuse previous solution
t0 = time.time()
coefs_warm = []
m_warm = Lasso(alpha=alphas[0], warm_start=True, max_iter=5000)
for a in alphas:
    m_warm.set_params(alpha=a)
    m_warm.fit(Xs, y)
    coefs_warm.append(m_warm.coef_.copy())
t_warm = time.time() - t0

print(f"Cold start: {t_cold:.3f}s")
print(f"Warm start: {t_warm:.3f}s")
print(f"Speedup: {t_cold/max(t_warm,0.001):.1f}x")

# Plot coefficient path
coefs_warm = np.array(coefs_warm)
fig, ax = plt.subplots(figsize=(9, 4))
for j in range(coefs_warm.shape[1]):
    ax.plot(np.log10(alphas), coefs_warm[:, j], linewidth=1.2)
ax.set_xlabel(r'log$_{10}$(alpha)')
ax.set_ylabel('Coefficient value')
ax.set_title('Lasso coefficient path (warm start) — features zeroed as alpha increases')
ax.axhline(0, color='k', linewidth=0.5)
plt.tight_layout()
plt.show()

Practical Tuning Workflow

| Rule | Reason |
| --- | --- |
| Use log-scale for learning rate, alpha | These parameters span orders of magnitude |
| Always use a Pipeline | Prevents data leakage through preprocessing |
| Save all CV results to a DataFrame | Enables surface plots and retrospective analysis |
| Use n_jobs=-1 | Parallelises across CPU cores at no cost |
| Don’t over-tune | More tuning trials = more risk of overfitting the validation set |
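
The first rule can be illustrated with a small count. The sketch below compares a linear grid and a log-scale grid over the same alpha range; the range (1e-3 to 1e3) and the 13 points are assumptions chosen to match the earlier Ridge sweep.

```python
import numpy as np

lin = np.linspace(1e-3, 1e3, 13)   # evenly spaced in alpha
log = np.logspace(-3, 3, 13)       # evenly spaced in log10(alpha)

upper = 1.0 * (1 + 1e-9)           # small tolerance for floating-point rounding
n_small_lin = int(np.sum(lin <= upper))
n_small_log = int(np.sum(log <= upper))

print(f"linear grid points with alpha <= 1: {n_small_lin} of {len(lin)}")
print(f"log grid points with alpha <= 1   : {n_small_log} of {len(log)}")
```

The linear grid places almost all of its points in the top decade and barely samples the small values of alpha, where a regularised model often changes most.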

Full Tuning Example — Coarse-to-Fine on Diabetes

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, train_test_split
from scipy.stats import loguniform

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = make_pipeline(StandardScaler(), Ridge())

# Step 1: coarse random search
coarse = RandomizedSearchCV(
    pipe, {'ridge__alpha': loguniform(1e-4, 1e4)},
    n_iter=40, cv=5, scoring='r2', random_state=42, n_jobs=-1,
    return_train_score=True
)
coarse.fit(X_tr, y_tr)
best_a = coarse.best_params_['ridge__alpha']
print(f"Coarse search best alpha: {best_a:.4g}  CV R\u00b2={coarse.best_score_:.4f}")

# Step 2: fine grid around best region
fine_alphas = np.logspace(np.log10(best_a/10), np.log10(best_a*10), 20)
fine = GridSearchCV(pipe, {'ridge__alpha': fine_alphas},
                    cv=5, scoring='r2', n_jobs=-1)
fine.fit(X_tr, y_tr)
print(f"Fine search   best alpha: {fine.best_params_['ridge__alpha']:.4g}  CV R\u00b2={fine.best_score_:.4f}")

# Step 3: final evaluation on held-out test set
final_mse = np.mean((y_te - fine.predict(X_te))**2)
final_r2  = 1 - np.sum((y_te - fine.predict(X_te))**2) / np.sum((y_te - y_te.mean())**2)
print(f"Final test set: MSE={final_mse:.2f}  R\u00b2={final_r2:.4f}")

# Plot coarse search results
alphas_tried = [p['ridge__alpha'] for p in coarse.cv_results_['params']]
scores_tried = coarse.cv_results_['mean_test_score']

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(alphas_tried, scores_tried, alpha=0.7, color='steelblue', s=40, label='Random trials')
ax.axvline(best_a, color='tomato', linestyle='--', label=f'Best (coarse) alpha={best_a:.3g}')
ax.set_xscale('log')
ax.set_xlabel(r'alpha ($\lambda$)')
ax.set_ylabel('CV R\u00b2')
ax.set_title('Coarse random search: all 40 trials')
ax.legend()
plt.tight_layout()
plt.show()

Try It in the Browser

Manual 1D grid search in pure Python — watch the score curve emerge.
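
A minimal stand-in for the interactive cell, using only the standard library. The synthetic data (y = 2x plus noise), the single hold-out split, and the no-intercept one-feature ridge formula are all assumptions made to keep the sketch short; they are not taken from the original widget.

```python
import random

random.seed(0)

# Synthetic 1-D data: y = 2x + Gaussian noise (made-up for illustration)
xs = [random.uniform(-1, 1) for _ in range(60)]
ys = [2.0 * x + random.gauss(0, 0.5) for x in xs]
x_tr, y_tr = xs[:40], ys[:40]      # training split
x_va, y_va = xs[40:], ys[40:]      # validation split

def fit_ridge_1d(x, y, lam):
    """Closed-form one-feature ridge without intercept: sum(xy) / (sum(x^2) + lam)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def mse(x, y, theta):
    """Mean squared error of the prediction theta * x."""
    return sum((b - theta * a) ** 2 for a, b in zip(x, y)) / len(x)

grid = [10 ** k for k in range(-3, 4)]   # 0.001, 0.01, ..., 1000
scores = []
for lam in grid:
    theta = fit_ridge_1d(x_tr, y_tr, lam)   # parameter learned on training data
    scores.append(mse(x_va, y_va, theta))   # hyperparameter judged on validation data
    print(f"lambda={lam:<8g} theta={theta:6.3f}  validation MSE={scores[-1]:.4f}")

best_lam = grid[scores.index(min(scores))]
print("best lambda on this grid:", best_lam)
```

This is everything GridSearchCV does in miniature: loop over candidate values, fit on one part of the data, score on another, and keep the best. The library version adds cross-validation, parallelism, and bookkeeping.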

Guided Practice

What distinguishes a hyperparameter from a model parameter?

Option: Hyperparameters are set before training; model parameters are learned from data. Feedback: Correct. Ridge coefficients are model parameters; the regularisation strength alpha is a hyperparameter.
Option: Hyperparameters are always integers; model parameters are always floats. Feedback: Both can be either type; the distinction is who sets them, not their dtype.
Option: Hyperparameters only exist in neural networks. Feedback: Ridge alpha, KNN k, and tree depth are all hyperparameters in classical models.
Option: Model parameters are set by the user; hyperparameters are learned. Feedback: This is the reverse of the correct definition.

Why does random search often outperform grid search on large hyperparameter spaces?

Option: Because random search always uses more trials. Feedback: Random search uses n_iter trials; you choose that number, and it is typically much smaller than a full grid.
Option: Because if only a few parameters matter, random search covers more of the important dimensions with fewer evaluations. Feedback: Correct. Grid search wastes evaluations on unimportant parameters; random search samples them independently and covers important ones more densely.
Option: Because random search uses a surrogate model to guide search. Feedback: That is Bayesian optimisation, not random search; random search has no memory between trials.
Option: Because random search automatically standardises features. Feedback: Feature scaling is unrelated to the search strategy.

What does Bayesian optimisation do that grid and random search do not?

Option: It runs more trials in parallel. Feedback: Parallelism is a separate concern; Bayesian optimisation is about using past results to choose future trials.
Option: It fits a surrogate model of the objective to predict where good hyperparameters are likely to be. Feedback: Correct. The surrogate (GP or TPE) directs trials toward promising regions, reducing the total number of evaluations needed.
Option: It requires no cross-validation. Feedback: Bayesian optimisation still uses CV to evaluate each candidate; it just chooses candidates more intelligently.
Option: It exhaustively tests all possible parameter combinations. Feedback: That is grid search. Bayesian optimisation is adaptive and uses far fewer evaluations.

You tune Ridge alpha with 5-fold CV and report the best CV score as your model's expected performance. What is the problem?

Option: Ridge is not appropriate for the diabetes dataset. Feedback: Ridge is a perfectly valid choice; the issue is in the evaluation protocol.
Option: The best CV score is optimistically biased because alpha was selected to maximise it on those same folds. Feedback: Correct. This is the selection bias addressed by nested CV in the previous notebook.
Option: 5 folds is too few to get a reliable estimate. Feedback: 5-fold CV is a standard and reliable choice; the bias comes from using it for both tuning and reporting, not from the fold count.
Option: The pipeline was missing a StandardScaler. Feedback: Scaling affects absolute scores but not the existence of selection bias.

Exercises

Exercise 1 — Grid vs random on a large space

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import loguniform

X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))

# TODO:
# 1. Grid search over alpha = logspace(-4, 2, 20)
grid = {'lasso__alpha': np.logspace(-4, 2, 20)}
# gs = GridSearchCV(pipe, grid, cv=5, scoring='r2', n_jobs=-1)
# gs.fit(X, y)

# 2. Random search over loguniform(1e-4, 1e2) with n_iter=20
rand_dist = {'lasso__alpha': loguniform(1e-4, 1e2)}
# rs = RandomizedSearchCV(pipe, rand_dist, n_iter=20, cv=5, scoring='r2', random_state=42, n_jobs=-1)
# rs.fit(X, y)

# 3. Time both and compare best CV R² and best alpha
# print(f"Grid:   best alpha={gs.best_params_['lasso__alpha']:.4g}  R²={gs.best_score_:.4f}")
# print(f"Random: best alpha={rs.best_params_['lasso__alpha']:.4g}  R²={rs.best_score_:.4f}")

# 4. Plot alpha vs CV R² from grid search
# alphas = [p['lasso__alpha'] for p in gs.cv_results_['params']]
# scores = gs.cv_results_['mean_test_score']
# ...

print("Uncomment and run the lines above to complete this exercise.")

Exercise 2 — 2D heatmap: polynomial degree × regularisation

Generate a dataset with make_regression. Tune both the polynomial degree (1–3) and Ridge alpha (log-scale). Produce a heatmap of CV R² across both dimensions and identify the best combination.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=3, noise=20, random_state=42)
pipe = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())

grid = {
    'polynomialfeatures__degree': [1, 2, 3],
    'ridge__alpha': np.logspace(-2, 3, 8)
}

# TODO: run GridSearchCV and build the heatmap pivot table
# gs = GridSearchCV(pipe, grid, cv=5, scoring='r2', n_jobs=-1)
# gs.fit(X, y)
# results = pd.DataFrame(gs.cv_results_)
# pivot = results.pivot_table(...)
# Plot with imshow ...

print("Uncomment and complete the grid search and heatmap plotting.")

Exercise 3: Coarse-to-fine search

Implement a two-stage coarse-to-fine search for Ridge alpha:

  1. Coarse: random search over log-uniform(1e-4, 1e4) with 15 trials.

  2. Fine: grid search over a 20-point log-scale range centred on the coarse best alpha (±1 decade).

Compare the fine-tuned R² to a single-stage random search with 35 trials.

%matplotlib inline
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import loguniform

X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), Ridge())

# Stage 1 — coarse random search (15 trials)
# coarse = RandomizedSearchCV(pipe, {'ridge__alpha': loguniform(1e-4, 1e4)},
#                              n_iter=15, cv=5, scoring='r2', random_state=0, n_jobs=-1)
# coarse.fit(X, y)
# best_coarse = coarse.best_params_['ridge__alpha']

# Stage 2 — fine grid (±1 decade around best_coarse)
# fine_alphas = np.logspace(np.log10(best_coarse/10), np.log10(best_coarse*10), 20)
# fine = GridSearchCV(pipe, {'ridge__alpha': fine_alphas}, cv=5, scoring='r2', n_jobs=-1)
# fine.fit(X, y)

# Baseline — single-stage random search (35 trials)
# baseline = RandomizedSearchCV(pipe, {'ridge__alpha': loguniform(1e-4, 1e4)},
#                                n_iter=35, cv=5, scoring='r2', random_state=0, n_jobs=-1)
# baseline.fit(X, y)

# Compare
# print(f"Coarse-to-fine R²: {fine.best_score_:.4f}  alpha={fine.best_params_['ridge__alpha']:.4g}")
# print(f"Single-stage R²:   {baseline.best_score_:.4f}  alpha={baseline.best_params_['ridge__alpha']:.4g}")

print("Uncomment the stages above to run the coarse-to-fine comparison.")

Common Pitfalls

Summary

Key takeaways
| Strategy | When to use | Cost | sklearn class |
| --- | --- | --- | --- |
| Manual | First pass, domain intuition | Minimal | |
| Grid search | Small grids (≤3 params × ≤10 values) | $O(\text{grid size})$ | GridSearchCV |
| Random search | Default choice; large spaces | $O(n\_iter)$ | RandomizedSearchCV |
| Successive halving | Same as grid but faster | $O(\log(\text{grid size}))$ | HalvingGridSearchCV |
| Bayesian optimisation | High-cost evaluations; after coarse search | $O(n\_calls)$ | skopt, optuna |
| Warm start | Scanning regularisation paths sequentially | Reduces iterations | warm_start=True |

Workflow: coarse random search → narrow range → fine grid or Bayesian → nested CV for unbiased estimate → refit on all training data → evaluate once on test set.
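
The nested CV step in this workflow can be sketched by wrapping the search object in an outer cross-validation loop. The grid, the fold counts, and the random seeds below are assumptions for illustration; the pattern of passing a GridSearchCV to cross_val_score is the point.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), Ridge())
grid = {"ridge__alpha": np.logspace(-3, 3, 13)}

inner = KFold(n_splits=5, shuffle=True, random_state=0)   # selects alpha
outer = KFold(n_splits=5, shuffle=True, random_state=1)   # estimates performance

search = GridSearchCV(pipe, grid, cv=inner, scoring="r2")

# Nested: alpha is re-tuned inside each outer training fold,
# so each outer test fold is never used for selection
nested_scores = cross_val_score(search, X, y, cv=outer, scoring="r2")

# Non-nested: the best score is reported on the same folds used to pick alpha
search.fit(X, y)
print(f"Non-nested best CV R2: {search.best_score_:.4f}")
print(f"Nested CV R2         : {nested_scores.mean():.4f} +/- {nested_scores.std():.4f}")
```

With a single hyperparameter and a simple model the two numbers are often close; the gap tends to grow as the search space and the number of trials grow.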

Next Up — Business-Aware Metrics

You can now find the best model settings. Next: make sure you are optimising the right metric.

The next notebook — Business-Aware Metrics — shows that the metric you optimise (MSE, accuracy, AUC) is often not the metric that matters to the business (revenue impact, cost of false negatives, customer lifetime value). It covers custom scoring functions, asymmetric cost matrices, and how to connect model performance to business KPIs.

Dependencies: scoring= parameter in GridSearchCV, cross-validation, and MSE / R² as baselines.