
Hyperparameter Tuning¶
Finding the Right Settings Before Training Begins¶
Model parameters are learned from data. Hyperparameters are chosen by you — before training starts. The difference between default settings and well-tuned ones can be the difference between a mediocre model and a deployable one. This notebook covers the full ladder: manual intuition, grid search, random search, Bayesian optimisation, and successive halving.
Why Hyperparameter Tuning Matters in Business¶

Hyperparameter tuning bridges the gap between a working prototype and a production-grade model. The three business levers are:
| Business goal | What to tune | Why |
|---|---|---|
| Reduce false positives (spam, fraud alerts) | Decision threshold, regularisation | Controls precision–recall trade-off |
| Maximise revenue from recommendations | Learning rate, tree depth | Drives ranking quality |
| Minimise compute cost | n_estimators, early stopping | Less training work for similar accuracy |
Parameters vs Hyperparameters¶
| | Parameters | Hyperparameters |
|---|---|---|
| Set by | Optimiser during training | You, before training |
| Examples | Ridge coefficients, neural network weights | alpha ($\lambda$, regularisation strength), tree depth, learning rate |
| How to find best | Minimise loss (gradient descent / Normal Equations) | Search + cross-validation |
| Stored in | model.coef_, model.intercept_ | model.get_params() |
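The last row of the table can be checked directly. The following is a minimal sketch on a small synthetic dataset (the data and the `alpha=1.0` value are illustrative assumptions, not part of the notebook): hyperparameters are readable before `fit`, while learned parameters only exist afterwards.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Tiny synthetic regression problem (illustrative data only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = Ridge(alpha=1.0)                       # alpha is chosen by you, before training
alpha_before_fit = model.get_params()["alpha"]  # hyperparameters are available immediately
had_coef_before_fit = hasattr(model, "coef_")   # learned parameters do not exist yet

model.fit(X_demo, y_demo)                       # the optimiser now sets coef_ and intercept_
print("alpha (hyperparameter):", alpha_before_fit)
print("coef_ existed before fit:", had_coef_before_fit)
print("coef_ (parameters):", model.coef_)
print("intercept_ (parameter):", model.intercept_)
```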
The Tuning Strategy Ladder¶
Grid Search¶
Grid search evaluates every combination of hyperparameter values you specify. It is exhaustive, reproducible, and simple to interpret — but cost grows multiplicatively with each new dimension.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
import time
X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {
'ridge__alpha': np.logspace(-3, 3, 13),
}
t0 = time.time()
gs = GridSearchCV(pipe, param_grid, cv=5, scoring='r2',
return_train_score=True, n_jobs=-1)
gs.fit(X, y)
t_grid = time.time() - t0
print(f"Best alpha : {gs.best_params_['ridge__alpha']:.4g}")
print(f"Best CV R\u00b2 : {gs.best_score_:.4f}")
print(f"Grid size : {len(param_grid['ridge__alpha'])} points")
print(f"Wall time : {t_grid:.2f}s")
# Plot alpha vs CV score
alphas = [p['ridge__alpha'] for p in gs.cv_results_['params']]
cv_mean = gs.cv_results_['mean_test_score']
cv_std = gs.cv_results_['std_test_score']
tr_mean = gs.cv_results_['mean_train_score']
fig, ax = plt.subplots(figsize=(8, 4))
ax.semilogx(alphas, cv_mean, 'r-o', linewidth=2, label='CV R\u00b2')
ax.fill_between(alphas, cv_mean - cv_std, cv_mean + cv_std, alpha=0.15, color='red')
ax.semilogx(alphas, tr_mean, 'b--o', linewidth=1.5, label='Train R\u00b2', alpha=0.6)
ax.axvline(gs.best_params_['ridge__alpha'], color='green', linestyle=':',
label=f"Best alpha={gs.best_params_['ridge__alpha']:.3g}")
ax.set_xlabel(r'alpha ($\lambda$)')
ax.set_ylabel('R\u00b2')
ax.set_title('GridSearchCV: Ridge alpha sweep')
ax.legend()
plt.tight_layout()
plt.show()
Random Search — Faster for High-Dimensional Grids¶
The key insight: if only a few hyperparameters matter, random search wastes far fewer evaluations than grid search on the unimportant ones. With a 10-point grid over 5 parameters, grid search evaluates $10^5 = 100{,}000$ combinations (each multiplied by the number of CV folds); random search with n_iter=100 evaluates just 100.
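The arithmetic behind this comparison can be sketched in a few lines of plain Python. The 5 × 10 search space matches the example in the text; the "top 5% of the space" target and the assumption that random draws are uniform over the space are illustrative simplifications.

```python
from itertools import product

# Search space from the text: 5 hyperparameters, 10 candidate values each
n_params, n_values = 5, 10
value_lists = [range(n_values)] * n_params
n_grid = sum(1 for _ in product(*value_lists))  # enumerate every grid combination
print(f"grid combinations: {n_grid}")            # 10**5

def p_hit(n_iter, top_fraction=0.05):
    """Probability that at least one of n_iter uniform random draws
    lands in the best `top_fraction` of the search space."""
    return 1 - (1 - top_fraction) ** n_iter

for n_iter in (30, 60, 100):
    print(f"n_iter={n_iter:3d}  P(hit top 5%) = {p_hit(n_iter):.3f}")
```

Under these assumptions the hit probability depends only on the number of draws, not on the number of dimensions, which is why random search scales better than a full grid.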
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
from scipy.stats import loguniform
import time
X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())
# Grid search over degree × alpha
grid = {'polynomialfeatures__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 10)}
# Random search over same space — but alpha is continuous
rand_dist = {'polynomialfeatures__degree': [1, 2, 3],
'ridge__alpha': loguniform(1e-3, 1e3)}
t0 = time.time()
gs = GridSearchCV(pipe, grid, cv=3, scoring='r2', n_jobs=-1)
gs.fit(X, y)
t_grid = time.time() - t0
t0 = time.time()
rs = RandomizedSearchCV(pipe, rand_dist, n_iter=30, cv=3,
scoring='r2', random_state=42, n_jobs=-1)
rs.fit(X, y)
t_rand = time.time() - t0
print(f"Grid Search — best R\u00b2={gs.best_score_:.4f} fits={len(gs.cv_results_['params'])*3} time={t_grid:.2f}s")
print(f" best params: {gs.best_params_}")
print(f"Random Search — best R\u00b2={rs.best_score_:.4f} fits={30*3} time={t_rand:.2f}s")
print(f" best params: {rs.best_params_}")
# Compare sampled alpha distributions
gs_alphas = [p['ridge__alpha'] for p in gs.cv_results_['params']]
gs_scores = gs.cv_results_['mean_test_score']
rs_alphas = [p['ridge__alpha'] for p in rs.cv_results_['params']]
rs_scores = rs.cv_results_['mean_test_score']
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(gs_alphas, gs_scores, c='steelblue', s=40, alpha=0.7, label='Grid')
axes[0].set_xscale('log'); axes[0].set_xlabel(r'alpha ($\lambda$)'); axes[0].set_ylabel('CV R\u00b2')
axes[0].set_title(f'Grid Search ({len(gs_alphas)} combos)')
axes[1].scatter(rs_alphas, rs_scores, c='tomato', s=40, alpha=0.7, label='Random')
axes[1].set_xscale('log'); axes[1].set_xlabel(r'alpha ($\lambda$)'); axes[1].set_ylabel('CV R\u00b2')
axes[1].set_title(f'Random Search (30 combos)')
for ax in axes:
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Visualising the Search Surface — 2D Heatmap¶
When tuning two hyperparameters simultaneously, a heatmap of CV scores reveals the interaction between them — which combinations are good, and whether there is a clear optimum or a plateau.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())
grid = {'polynomialfeatures__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-3, 3, 8)}
gs = GridSearchCV(pipe, grid, cv=5, scoring='r2', n_jobs=-1)
gs.fit(X, y)
results = pd.DataFrame(gs.cv_results_)
pivot = results.pivot_table(
index='param_polynomialfeatures__degree',
columns='param_ridge__alpha',
values='mean_test_score'
)
fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(pivot.values, aspect='auto', cmap='RdYlGn',
vmin=pivot.values.min(), vmax=pivot.values.max())
plt.colorbar(im, ax=ax, label='CV R\u00b2')
ax.set_xticks(range(len(pivot.columns)))
ax.set_xticklabels([f'{a:.2g}' for a in pivot.columns], rotation=45, ha='right')
ax.set_yticks(range(len(pivot.index)))
ax.set_yticklabels(pivot.index)
ax.set_xlabel(r'Ridge alpha ($\lambda$)')
ax.set_ylabel('Polynomial degree')
ax.set_title('Hyperparameter search surface — 5-fold CV R\u00b2')
# Mark best
best_r = pivot.values.argmax() // pivot.shape[1]
best_c = pivot.values.argmax() % pivot.shape[1]
ax.plot(best_c, best_r, 'w*', markersize=14, label='Best')
ax.legend()
plt.tight_layout()
plt.show()
print(f"Best: degree={gs.best_params_['polynomialfeatures__degree']} alpha={gs.best_params_['ridge__alpha']:.3g} R\u00b2={gs.best_score_:.4f}")
Bayesian Optimisation¶
Grid and random search are memoryless: no trial uses the results of earlier trials. Bayesian optimisation builds a surrogate model of the objective function (for example a Gaussian Process or a Tree-structured Parzen Estimator) and uses it to decide which hyperparameters to try next, concentrating trials where improvement looks most likely.
Available libraries: scikit-optimize (skopt), optuna, hyperopt, bayesian-optimization.
The cell below uses scikit-optimize if it is installed; otherwise it falls back to a synthetic illustration drawn from random numbers, not real tuning results.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
X, y = load_diabetes(return_X_y=True)
try:
    from skopt import gp_minimize
    from skopt.space import Real
    from skopt.utils import use_named_args
    space = [Real(1e-4, 1e3, prior='log-uniform', name='alpha')]
    scores_by_trial = []
    @use_named_args(space)
    def objective(alpha):
        pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
        sc = cross_val_score(pipe, X, y, cv=3, scoring='r2').mean()
        scores_by_trial.append((alpha, sc))
        return -sc # minimise negative R²
    result = gp_minimize(objective, space, n_calls=25, random_state=42, verbose=False)
    best_alpha = result.x[0]
    best_r2 = -result.fun
    print(f"Bayesian (skopt): best alpha={best_alpha:.4g} R\u00b2={best_r2:.4f} in 25 trials")
    trials_a = [t[0] for t in scores_by_trial]
    trials_s = [t[1] for t in scores_by_trial]
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.scatter(range(len(trials_a)), trials_s, c=trials_a, cmap='viridis', s=50)
    ax.axhline(best_r2, color='tomato', linestyle='--', label=f'Best R\u00b2={best_r2:.3f}')
    ax.set_xlabel('Trial number')
    ax.set_ylabel('CV R\u00b2')
    ax.set_title('Bayesian optimisation — R\u00b2 per trial (colour = alpha value)')
    ax.legend()
    plt.tight_layout()
    plt.show()
except ImportError:
    print("scikit-optimize not installed. Showing manual illustration instead.")
    np.random.seed(42)
    n = 25
    rand_best = np.maximum.accumulate(np.random.uniform(0.45, 0.50, n))
    bayes_best = np.maximum.accumulate(np.random.uniform(0.45, 0.50, n) +
                                       np.linspace(0, 0.02, n))
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(rand_best, 'r-o', markersize=4, label='Random search (best so far)')
    ax.plot(bayes_best, 'b-o', markersize=4, label='Bayesian optimisation (best so far)')
    ax.set_xlabel('Trial number')
    ax.set_ylabel('Best CV R\u00b2 so far')
    ax.set_title('Bayesian vs random search convergence (illustration)')
    ax.legend()
    plt.tight_layout()
    plt.show()
    print("Install scikit-optimize (`pip install scikit-optimize`) for real Bayesian tuning.")
Successive Halving — Faster Grid Search¶
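sklearn’s HalvingGridSearchCV starts all candidates on a small resource budget (by default, a subset of the training samples). After each round it keeps roughly the best 1/factor of the candidates and multiplies the budget by factor. For large grids with expensive fits this can be much faster than standard GridSearchCV; on a small dataset such as this one, the per-round overhead can cancel the gain.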
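To make the round structure concrete, here is a simplified sketch of a halving schedule in plain Python. It is not scikit-learn's exact budgeting logic: the helper `halving_schedule` and the starting budget `min_resources=20` are hypothetical, while 30 candidates, `factor=3`, and 442 samples (the size of the diabetes dataset) match the cell that follows.

```python
import math

def halving_schedule(n_candidates, min_resources, max_resources, factor=3):
    """Return a list of (n_candidates, n_samples) per round (simplified model)."""
    schedule = []
    resources = min_resources
    while True:
        resources = min(resources, max_resources)         # never exceed the dataset size
        schedule.append((n_candidates, resources))
        if n_candidates <= 1 or resources >= max_resources:
            break
        n_candidates = math.ceil(n_candidates / factor)   # keep roughly the top 1/factor
        resources *= factor                               # survivors get more data
    return schedule

schedule = halving_schedule(n_candidates=30, min_resources=20, max_resources=442)
for i, (n_c, n_r) in enumerate(schedule):
    print(f"round {i}: {n_c:2d} candidates x {n_r:3d} samples")

# Rough cost proxy: total samples seen across all candidate fits (per CV fold)
halving_cost = sum(n_c * n_r for n_c, n_r in schedule)
grid_cost = 30 * 442
print(f"halving: {halving_cost}  full grid: {grid_cost}")
```

Counting samples seen is only a proxy for wall time; fixed per-fit overhead is ignored, which is why measured speedups on small problems can be much lower.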
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.experimental import enable_halving_search_cv # noqa
from sklearn.model_selection import HalvingGridSearchCV, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), Ridge())
grid = {'ridge__alpha': np.logspace(-4, 4, 30)} # 30 candidates
t0 = time.time()
gs = GridSearchCV(pipe, grid, cv=5, scoring='r2', n_jobs=-1)
gs.fit(X, y)
t_gs = time.time() - t0
t0 = time.time()
hgs = HalvingGridSearchCV(pipe, grid, cv=5, scoring='r2',
factor=3, random_state=42, n_jobs=-1)
hgs.fit(X, y)
t_hgs = time.time() - t0
print(f"GridSearchCV best R\u00b2={gs.best_score_:.4f} alpha={gs.best_params_['ridge__alpha']:.3g} time={t_gs:.2f}s")
print(f"HalvingGridSearchCV best R\u00b2={hgs.best_score_:.4f} alpha={hgs.best_params_['ridge__alpha']:.3g} time={t_hgs:.2f}s")
print(f"Speedup: {t_gs/max(t_hgs, 0.01):.1f}x")
# Visualise rounds
halving_results = hgs.cv_results_
rounds = halving_results['iter']
alphas_h = [p['ridge__alpha'] for p in halving_results['params']]
scores_h = halving_results['mean_test_score']
fig, ax = plt.subplots(figsize=(8, 4))
colors = {r: plt.cm.Blues(0.4 + 0.2*r) for r in set(rounds)}
for r in sorted(set(rounds)):
    mask = [ri == r for ri in rounds]
    ax.scatter([a for a, m in zip(alphas_h, mask) if m],
               [s for s, m in zip(scores_h, mask) if m],
               color=colors[r], s=60, label=f'Round {r}')
ax.set_xscale('log')
ax.set_xlabel(r'Ridge alpha ($\lambda$)')
ax.set_ylabel('CV R\u00b2')
ax.set_title('Successive halving: candidates remaining per round')
ax.legend()
plt.tight_layout()
plt.show()
Warm-Starting — Reusing Previous Fits¶
Some estimators support warm_start=True, which tells the model to reuse the solution from the previous fit as the starting point for the next. This avoids retraining from scratch when you are scanning along a regularisation path.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
X, y = load_diabetes(return_X_y=True)
scaler = StandardScaler()
Xs = scaler.fit_transform(X)
alphas = np.logspace(-1, 2, 60)[::-1] # decreasing: warm-start reuses coefs
# Cold start: fresh fit each time
t0 = time.time()
coefs_cold = []
for a in alphas:
    m = Lasso(alpha=a, max_iter=5000)
    m.fit(Xs, y)
    coefs_cold.append(m.coef_.copy())
t_cold = time.time() - t0
# Warm start: reuse previous solution
t0 = time.time()
coefs_warm = []
m_warm = Lasso(alpha=alphas[0], warm_start=True, max_iter=5000)
for a in alphas:
    m_warm.set_params(alpha=a)
    m_warm.fit(Xs, y)
    coefs_warm.append(m_warm.coef_.copy())
t_warm = time.time() - t0
print(f"Cold start: {t_cold:.3f}s")
print(f"Warm start: {t_warm:.3f}s")
print(f"Speedup: {t_cold/max(t_warm,0.001):.1f}x")
# Plot coefficient path
coefs_warm = np.array(coefs_warm)
fig, ax = plt.subplots(figsize=(9, 4))
for j in range(coefs_warm.shape[1]):
    ax.plot(np.log10(alphas), coefs_warm[:, j], linewidth=1.2)
ax.set_xlabel(r'log$_{10}$(alpha)')
ax.set_ylabel('Coefficient value')
ax.set_title('Lasso coefficient path (warm start) — features zeroed as alpha increases')
ax.axhline(0, color='k', linewidth=0.5)
plt.tight_layout()
plt.show()
Practical Tuning Workflow¶
| Rule | Reason |
|---|---|
| Use log-scale for learning rate, alpha | These parameters span orders of magnitude |
| Always use a Pipeline | Prevents data leakage through preprocessing |
| Save all CV results to a DataFrame | Enables surface plots and retrospective analysis |
| Use n_jobs=-1 | Parallelises fits across CPU cores (at the price of higher memory use) |
| Don’t over-tune | More tuning trials = more risk of overfitting the validation set |
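The first rule in the table is easy to demonstrate. The sketch below, in plain Python, compares sampling a hyperparameter uniformly on the raw scale with sampling uniformly on the log10 scale; the range 1e-4 to 1e4 and the sample count are illustrative assumptions.

```python
import random

random.seed(0)
low, high = 1e-4, 1e4
n = 10_000

# Uniform on the raw scale: almost every draw is a large value
linear = [random.uniform(low, high) for _ in range(n)]
# Uniform on the log10 scale: each decade receives about the same share of draws
log_scale = [10 ** random.uniform(-4, 4) for _ in range(n)]

frac_small_linear = sum(v < 1 for v in linear) / n
frac_small_log = sum(v < 1 for v in log_scale) / n
print(f"share of draws below 1 (linear scale): {frac_small_linear:.4f}")
print(f"share of draws below 1 (log scale)   : {frac_small_log:.4f}")
```

With linear sampling the four decades below 1 are almost never explored, whereas log-scale sampling gives them about half of the trials. `scipy.stats.loguniform`, used elsewhere in this notebook, provides the same behaviour for RandomizedSearchCV.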
Full Tuning Example — Coarse-to-Fine on Diabetes¶
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, train_test_split
from scipy.stats import loguniform
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge())
# Step 1: coarse random search
coarse = RandomizedSearchCV(
pipe, {'ridge__alpha': loguniform(1e-4, 1e4)},
n_iter=40, cv=5, scoring='r2', random_state=42, n_jobs=-1,
return_train_score=True
)
coarse.fit(X_tr, y_tr)
best_a = coarse.best_params_['ridge__alpha']
print(f"Coarse search best alpha: {best_a:.4g} CV R\u00b2={coarse.best_score_:.4f}")
# Step 2: fine grid around best region
fine_alphas = np.logspace(np.log10(best_a/10), np.log10(best_a*10), 20)
fine = GridSearchCV(pipe, {'ridge__alpha': fine_alphas},
cv=5, scoring='r2', n_jobs=-1)
fine.fit(X_tr, y_tr)
print(f"Fine search best alpha: {fine.best_params_['ridge__alpha']:.4g} CV R\u00b2={fine.best_score_:.4f}")
# Step 3: final evaluation on held-out test set
final_mse = np.mean((y_te - fine.predict(X_te))**2)
final_r2 = 1 - np.sum((y_te - fine.predict(X_te))**2) / np.sum((y_te - y_te.mean())**2)
print(f"Final test set: MSE={final_mse:.2f} R\u00b2={final_r2:.4f}")
# Plot coarse search results
alphas_tried = [p['ridge__alpha'] for p in coarse.cv_results_['params']]
scores_tried = coarse.cv_results_['mean_test_score']
fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(alphas_tried, scores_tried, alpha=0.7, color='steelblue', s=40, label='Random trials')
ax.axvline(best_a, color='tomato', linestyle='--', label=f'Best (coarse) alpha={best_a:.3g}')
ax.set_xscale('log')
ax.set_xlabel(r'alpha ($\lambda$)')
ax.set_ylabel('CV R\u00b2')
ax.set_title('Coarse random search: all 40 trials')
ax.legend()
plt.tight_layout()
plt.show()
Try It in the Browser¶
Manual 1D grid search in pure Python — watch the score curve emerge.
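A minimal sketch of such a manual search, using only the standard library, might look like the following. The one-feature dataset, the hold-out split, and the helper names `fit_ridge_1d` and `mse` are hypothetical; the model is ridge regression with a single slope and no intercept, for which the closed form is w = Σxy / (Σx² + alpha).

```python
import random

random.seed(0)
# Hypothetical 1D data: y = 3x + Gaussian noise
xs = [random.uniform(-1, 1) for _ in range(60)]
ys = [3 * x + random.gauss(0, 0.5) for x in xs]
x_tr, y_tr, x_va, y_va = xs[:40], ys[:40], xs[40:], ys[40:]

def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge slope for one feature and no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)

def mse(x, y, w):
    """Mean squared error of the prediction w * x."""
    return sum((b - w * a) ** 2 for a, b in zip(x, y)) / len(x)

results = []
for exponent in range(-3, 4):                 # alpha = 1e-3 ... 1e3 on a log grid
    alpha = 10.0 ** exponent
    w = fit_ridge_1d(x_tr, y_tr, alpha)       # fit on the training part
    score = mse(x_va, y_va, w)                # score on the held-out part
    results.append((alpha, score))
    print(f"alpha={alpha:8.3f}  val MSE={score:7.3f}  " + "#" * round(score * 10))

best_alpha, best_mse = min(results, key=lambda r: r[1])
print(f"best alpha = {best_alpha:g}")
```

The text bars give a rough picture of the score curve: error stays low while alpha is small and climbs once the penalty shrinks the slope towards zero.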
Guided Practice¶
What distinguishes a hyperparameter from a model parameter?¶
Why does random search often outperform grid search on large hyperparameter spaces?¶
What makes Bayesian optimisation more efficient than random search?¶
You tune Ridge alpha with 5-fold CV and report the best CV score as your model's expected performance. What is the problem?¶
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import loguniform
X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
# TODO:
# 1. Grid search over alpha = logspace(-4, 2, 20)
grid = {'lasso__alpha': np.logspace(-4, 2, 20)}
# gs = GridSearchCV(pipe, grid, cv=5, scoring='r2', n_jobs=-1)
# gs.fit(X, y)
# 2. Random search over loguniform(1e-4, 1e2) with n_iter=20
rand_dist = {'lasso__alpha': loguniform(1e-4, 1e2)}
# rs = RandomizedSearchCV(pipe, rand_dist, n_iter=20, cv=5, scoring='r2', random_state=42, n_jobs=-1)
# rs.fit(X, y)
# 3. Time both and compare best CV R² and best alpha
# print(f"Grid: best alpha={gs.best_params_['lasso__alpha']:.4g} R²={gs.best_score_:.4f}")
# print(f"Random: best alpha={rs.best_params_['lasso__alpha']:.4g} R²={rs.best_score_:.4f}")
# 4. Plot alpha vs CV R² from grid search
# alphas = [p['lasso__alpha'] for p in gs.cv_results_['params']]
# scores = gs.cv_results_['mean_test_score']
# ...
print("Uncomment and run the lines above to complete this exercise.")
Exercise 2 — 2D heatmap: polynomial degree × regularisation¶
Generate a dataset with make_regression. Tune both the polynomial degree (1–3) and Ridge alpha (log-scale). Produce a heatmap of CV R² across both dimensions and identify the best combination.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
X, y = make_regression(n_samples=300, n_features=3, noise=20, random_state=42)
pipe = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())
grid = {
'polynomialfeatures__degree': [1, 2, 3],
'ridge__alpha': np.logspace(-2, 3, 8)
}
# TODO: run GridSearchCV and build the heatmap pivot table
# gs = GridSearchCV(pipe, grid, cv=5, scoring='r2', n_jobs=-1)
# gs.fit(X, y)
# results = pd.DataFrame(gs.cv_results_)
# pivot = results.pivot_table(...)
# Plot with imshow ...
print("Uncomment and complete the grid search and heatmap plotting.")
Exercise 3 — Coarse-to-fine search¶
Implement a two-stage coarse-to-fine search for Ridge alpha:
Coarse: random search over log-uniform(1e-4, 1e4) with 15 trials.
Fine: grid search over a 20-point log-scale range centred on the coarse best alpha (±1 decade).
Compare the fine-tuned R² to a single-stage random search with 35 trials.
%matplotlib inline
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from scipy.stats import loguniform
X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), Ridge())
# Stage 1 — coarse random search (15 trials)
# coarse = RandomizedSearchCV(pipe, {'ridge__alpha': loguniform(1e-4, 1e4)},
# n_iter=15, cv=5, scoring='r2', random_state=0, n_jobs=-1)
# coarse.fit(X, y)
# best_coarse = coarse.best_params_['ridge__alpha']
# Stage 2 — fine grid (±1 decade around best_coarse)
# fine_alphas = np.logspace(np.log10(best_coarse/10), np.log10(best_coarse*10), 20)
# fine = GridSearchCV(pipe, {'ridge__alpha': fine_alphas}, cv=5, scoring='r2', n_jobs=-1)
# fine.fit(X, y)
# Baseline — single-stage random search (35 trials)
# baseline = RandomizedSearchCV(pipe, {'ridge__alpha': loguniform(1e-4, 1e4)},
# n_iter=35, cv=5, scoring='r2', random_state=0, n_jobs=-1)
# baseline.fit(X, y)
# Compare
# print(f"Coarse-to-fine R²: {fine.best_score_:.4f} alpha={fine.best_params_['ridge__alpha']:.4g}")
# print(f"Single-stage R²: {baseline.best_score_:.4f} alpha={baseline.best_params_['ridge__alpha']:.4g}")
print("Uncomment the stages above to run the coarse-to-fine comparison.")
Common Pitfalls¶
Summary¶
Key takeaways
| Strategy | When to use | Cost | sklearn class |
|---|---|---|---|
| Manual | First pass, domain intuition | Minimal | — |
| Grid search | Small grids (≤3 params × ≤10 values) | $O(\lvert\text{grid}\rvert)$ | GridSearchCV |
| Random search | Default choice; large spaces | $O(\text{n\_iter})$ | RandomizedSearchCV |
| Successive halving | Same as grid but faster | $O(\log(\lvert\text{grid}\rvert))$ rounds | HalvingGridSearchCV |
| Bayesian optimisation | High-cost evaluations; after coarse search | | skopt, optuna |
| Warm start | Scanning regularisation paths sequentially | Reduces iterations | warm_start=True |
Workflow: coarse random search → narrow range → fine grid or Bayesian → nested CV for unbiased estimate → refit on all training data → evaluate once on test set.
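The nested CV step in this workflow is not shown elsewhere in the notebook, so here is a minimal sketch of one common way to do it with scikit-learn: wrap the search object in `cross_val_score`, so that an inner loop selects alpha and an outer loop scores the whole tuning procedure. The fold counts, seeds, and alpha grid are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
pipe = make_pipeline(StandardScaler(), Ridge())
grid = {"ridge__alpha": np.logspace(-3, 3, 13)}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # selects alpha
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

search = GridSearchCV(pipe, grid, cv=inner_cv, scoring="r2")

# Each outer fold reruns the full search on its training part only,
# then scores the refitted best model on data the search never saw.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")

search.fit(X, y)  # non-nested: best inner-CV score, selected on the same folds it reports
print(f"Non-nested best CV R2: {search.best_score_:.4f}")
print(f"Nested CV R2         : {nested_scores.mean():.4f} +/- {nested_scores.std():.4f}")
```

The nested figure is the one to quote as expected performance, because the non-nested score was chosen as the maximum over all candidates and is therefore optimistically biased.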
Next Up — Business-Aware Metrics¶

You can now find the best model settings. Next: make sure you are optimising the right metric.¶
The next notebook — Business-Aware Metrics — shows that the metric you optimise (MSE, accuracy, AUC) is often not the metric that matters to the business (revenue impact, cost of false negatives, customer lifetime value). It covers custom scoring functions, asymmetric cost matrices, and how to connect model performance to business KPIs.
Dependencies: scoring= parameter in GridSearchCV, cross-validation, and MSE / R² as baselines.