
KNN Basics — Distance, Voting, and Lazy Learning¶
What you will learn: how K-Nearest Neighbours works, the core distance metrics (Euclidean, Manhattan, Minkowski), the choice of , KNN for regression vs classification, the curse of dimensionality, and how to use sklearn’s KNeighborsClassifier and KNeighborsRegressor.

Why KNN Matters for Business¶
A streaming service needs to recommend films to a new user. Without any model training — just by finding the 10 most similar users (based on past ratings) and averaging their preferences — it achieves 74% recall on holdout ratings. KNN’s zero-training-time property makes it uniquely suited for cold-start recommendation and anomaly detection.
No model to train. The entire training set is the model. Prediction cost is per query.
1. The KNN Algorithm¶
Given training set and query point :
Compute for all
Sort: find the smallest distances → indices
Classification:
Regression:
Distance Metrics¶
The Minkowski distance unifies the most common metrics:
| Name | Sensitivity | |
|---|---|---|
| 1 | Manhattan (L1) | Robust to outliers in individual features |
| 2 | Euclidean (L2) | Standard geometric distance — most common |
| ∞ | Chebyshev | Maximum single-coordinate difference |
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
# -------------------------------------------------------
# KNN from scratch
# -------------------------------------------------------
def euclidean_distance(a, b):
return np.sqrt(np.sum((a - b) ** 2))
def knn_predict_class(X_train, y_train, x_query, k):
distances = [(euclidean_distance(x_query, x_i), y_i) for x_i, y_i in zip(X_train, y_train)]
distances.sort(key=lambda t: t[0])
k_nearest_labels = [label for _, label in distances[:k]]
return max(set(k_nearest_labels), key=k_nearest_labels.count)
def knn_predict_reg(X_train, y_train, x_query, k):
distances = [(euclidean_distance(x_query, x_i), y_i) for x_i, y_i in zip(X_train, y_train)]
distances.sort(key=lambda t: t[0])
return np.mean([v for _, v in distances[:k]])
# -------------------------------------------------------
# Visualise KNN decision boundary for k=1,3,9,15
# -------------------------------------------------------
np.random.seed(42)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
n_redundant=0, n_clusters_per_class=1,
class_sep=1.2, random_state=42)
scaler = StandardScaler()
X_s = scaler.fit_transform(X)
K_vals = [1, 3, 9, 15]
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
h = 0.04
xx, yy = np.meshgrid(np.arange(-3, 3, h), np.arange(-3, 3, h))
grid = np.c_[xx.ravel(), yy.ravel()]
for ax, k in zip(axes, K_vals):
# Vectorised prediction on grid (using from-scratch for pedagogy)
dists = np.linalg.norm(X_s[:, None, :] - grid[None, :, :], axis=2) # (n_train, n_grid)
nn_idx = np.argsort(dists, axis=0)[:k] # (k, n_grid)
votes = y[nn_idx] # (k, n_grid)
Z = np.array([np.bincount(votes[:, i], minlength=2).argmax() for i in range(len(grid))])
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.25)
ax.scatter(X_s[:,0], X_s[:,1], c=y, cmap=plt.cm.coolwarm, edgecolors='k', s=20)
ax.set_title(f'k = {k}')
ax.set_xlabel('x1'); ax.set_ylabel('x2')
ax.set_xlim(-3, 3); ax.set_ylim(-3, 3)
plt.suptitle('KNN Decision Boundary vs K (from scratch, Euclidean)', fontsize=11)
plt.tight_layout()
plt.show()
print("k=1 → highly irregular boundary (low bias, high variance)")
print("k=15 → smoother boundary (higher bias, lower variance)")
2. Distance Metric Comparison¶
Different metrics give different neighbourhoods. The unit ball (set of points within distance 1 from origin) varies by metric:
L1 (Manhattan): diamond shape
L2 (Euclidean): circle
L∞ (Chebyshev): square
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
theta = np.linspace(0, 2*np.pi, 300)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
titles = ['L1 (Manhattan)', 'L2 (Euclidean)', 'L∞ (Chebyshev)']
# L1 unit ball: |x| + |y| = 1 → diamond
angles = np.array([0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi])
x_l1 = np.cos(angles) # This is just a shortcut for the corners
# Proper L1 unit ball
t = np.linspace(0, 1, 100)
x_l1 = np.concatenate([t, 1-t, -t, -(1-t)])
y_l1 = np.concatenate([1-t, -t, -(1-t), t])
# L2 unit ball: circle
x_l2 = np.cos(theta)
y_l2 = np.sin(theta)
# L∞ unit ball: square
x_linf = np.array([-1, 1, 1, -1, -1])
y_linf = np.array([-1, -1, 1, 1, -1])
for ax, (xs, ys), title in zip(axes, [(x_l1, y_l1), (x_l2, y_l2), (x_linf, y_linf)], titles):
ax.fill(xs, ys, alpha=0.3, color='steelblue')
ax.plot(xs, ys, 'b-', lw=2)
ax.axhline(0, color='k', lw=0.5); ax.axvline(0, color='k', lw=0.5)
ax.set_xlim(-1.4, 1.4); ax.set_ylim(-1.4, 1.4)
ax.set_aspect('equal'); ax.set_title(title); ax.grid(alpha=0.2)
plt.suptitle('Unit ball shapes by distance metric', fontsize=11)
plt.tight_layout()
plt.show()
# Numerical comparison
p1 = np.array([1.0, 0.0])
p2 = np.array([0.0, 1.0])
p3 = np.array([0.7, 0.7])
print("Point A=[1,0] vs B=[0,1] vs C=[0.7,0.7]:")
for name, p, q in [('A vs B', p1, p2), ('A vs C', p1, p3)]:
l1 = np.sum(np.abs(p - q))
l2 = np.linalg.norm(p - q)
linf = np.max(np.abs(p - q))
print(f" {name}: L1={l1:.3f} L2={l2:.3f} L∞={linf:.3f}")
3. KNN for Regression¶
For regression, the prediction is the mean of the nearest target values. Weighted KNN uses inverse-distance weights:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
np.random.seed(0)
X_raw = np.sort(np.random.uniform(0, 2*np.pi, 80))
y = np.sin(X_raw) + np.random.normal(0, 0.2, 80)
X = X_raw.reshape(-1, 1)
X_test = np.linspace(0, 2*np.pi, 300).reshape(-1, 1)
y_true = np.sin(X_test.ravel())
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
k_vals = [1, 5, 20]
for ax, k in zip(axes, k_vals):
m = KNeighborsRegressor(n_neighbors=k)
m.fit(X, y)
y_pred = m.predict(X_test)
ax.scatter(X_raw, y, s=20, alpha=0.5, label='Training data')
ax.plot(X_test, y_true, 'g--', lw=1.5, label='True sin(x)')
ax.plot(X_test, y_pred, 'r-', lw=2, label=f'KNN k={k}')
ax.set_title(f'KNN Regression k={k}')
ax.set_xlabel('x'); ax.set_ylabel('y')
ax.legend(fontsize=8); ax.grid(alpha=0.3)
plt.suptitle('KNN Regression: k=1 overfits; k=20 underfits', fontsize=11)
plt.tight_layout()
plt.show()
4. Choosing K — Bias-Variance Trade-off¶
Small (e.g. 1): very flexible boundary, fits every training point, low bias / high variance → overfits
Large (e.g. ): very smooth, ignores local structure, high bias / low variance → underfits
Optimal : found via cross-validation. Rule of thumb: start with (odd for binary classification).
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
np.random.seed(42)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
n_redundant=0, class_sep=1.0, random_state=42)
k_range = range(1, 31)
cv_scores = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for k in k_range:
pipe = Pipeline([('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=k))])
sc = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
cv_scores.append(sc.mean())
best_k = k_range[np.argmax(cv_scores)]
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(list(k_range), cv_scores, 'b-o', markersize=4, label='CV accuracy')
ax.axvline(best_k, color='r', linestyle='--', label=f'Best k={best_k}')
ax.axvline(int(np.sqrt(400)), color='gray', linestyle=':', label=f'√n = {int(np.sqrt(400))}')
ax.set_xlabel('K'); ax.set_ylabel('5-fold CV Accuracy')
ax.set_title('Choosing K by Cross-Validation')
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Best k = {best_k} CV accuracy = {max(cv_scores):.4f}")
print(f"Rule-of-thumb k = √400 = {int(np.sqrt(400))}")
5. The Curse of Dimensionality¶
As (features) increases, all points become approximately equidistant — the ratio of max/min distance → 1 and the nearest-neighbour concept breaks down.
The expected distance to the nearest neighbour in the unit hypercube scales as:
In practice: for points in 100 dimensions, almost all points are equidistant.
Mitigation strategies:
Dimensionality reduction (PCA, UMAP) before KNN
Feature selection to remove noise dimensions
Approximate nearest-neighbour search (→
efficient_search.ipynb)Use alternative models (SVM, trees) when
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
n_samples = 1000
dims = [2, 5, 10, 20, 50, 100, 200]
mean_ratios = []
for p in dims:
X = np.random.uniform(0, 1, (n_samples, p))
query = np.random.uniform(0, 1, p)
dists = np.linalg.norm(X - query, axis=1)
ratio = dists.max() / dists.min()
mean_ratios.append(ratio)
print(f"p={p:4d}: min_dist={dists.min():.4f} max_dist={dists.max():.4f} ratio={ratio:.2f}")
fig, ax = plt.subplots(figsize=(8, 4))
ax.semilogy(dims, mean_ratios, 'r-o')
ax.axhline(1.0, color='k', linestyle='--', lw=1, label='ratio=1 (equidistant)')
ax.set_xlabel('Number of dimensions p')
ax.set_ylabel('max_dist / min_dist (log scale)')
ax.set_title('Curse of Dimensionality — Distance Ratio Collapses')
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
6. KNN Classification Pipeline with sklearn¶
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
test_size=0.2, random_state=42, stratify=iris.target)
pipe = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=5, metric='euclidean', weights='uniform'))
])
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
print("=== KNN Pipeline — Iris Dataset ===")
print(classification_report(y_te, y_pred, target_names=iris.target_names))
fig, ax = plt.subplots(figsize=(5, 4))
ConfusionMatrixDisplay.from_predictions(y_te, y_pred,
display_labels=iris.target_names, colorbar=False, ax=ax)
ax.set_title('KNN (k=5) Confusion Matrix — Iris')
plt.tight_layout()
plt.show()
# Distance-weighted KNN
pipe_w = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=5, weights='distance'))
])
pipe_w.fit(X_tr, y_tr)
print(f"Uniform weights accuracy : {pipe.score(X_te, y_te):.4f}")
print(f"Distance weights accuracy: {pipe_w.score(X_te, y_te):.4f}")
7. Try It in the Browser¶
Compute KNN classification from scratch for a small 2D dataset.
import math
def euclidean(a, b):
return math.sqrt(sum((x - y)**2 for x, y in zip(a, b)))
def knn_classify(X_train, y_train, x_query, k):
dists = sorted([(euclidean(x_query, xi), yi) for xi, yi in zip(X_train, y_train)])
labels = [yi for _, yi in dists[:k]]
return max(set(labels), key=labels.count), dists[:k]
# Training data (2 features, 2 classes)
X_train = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y_train = [0, 0, 0, 1, 1, 1]
query = [4, 4]
for k in [1, 3, 5]:
label, neighbors = knn_classify(X_train, y_train, query, k)
print(f"k={k}: predicted class={label}")
for d, lbl in neighbors:
print(f" distance={d:.3f} class={lbl}")
print()Knowledge Check¶
KNN with k=1 always achieves 100% accuracy on the training set. Why is this a problem?¶
CheckWhy must you apply StandardScaler before KNN but NOT before a Decision Tree?¶
CheckExercises¶
Exercise 1 — K Sweep on Breast Cancer¶
Load sklearn.datasets.load_breast_cancer(). Build a Pipeline([StandardScaler, KNeighborsClassifier]). Sweep k from 1 to 30 using 5-fold CV. Plot CV accuracy vs k. What is the optimal k?
Exercise 2 — Metric Comparison¶
On the same breast cancer dataset, fix k=7 and compare metric ∈ {'euclidean', 'manhattan', 'chebyshev'} using 5-fold CV accuracy. Which metric wins? Why?
Exercise 3 — Weighted vs Uniform KNN¶
On make_moons(n_samples=500, noise=0.2), compare weights='uniform' vs weights='distance' for k=5. Which generalises better and why?
%matplotlib inline
# Exercises 1, 2, 3 — your code here
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt
# Your code here
Common Pitfalls¶
Summary
KNN is a non-parametric, instance-based algorithm: no training, prediction = vote/average of nearest neighbours.
Distance metrics: Euclidean (), Manhattan (), Minkowski () — always scale features first.
k trade-off: small = low bias / high variance; large = high bias / low variance. Tune via cross-validation.
Regression KNN: predicts the mean (or inverse-distance-weighted mean) of nearest targets.
Curse of dimensionality: in high dimensions, all distances converge — KNN degrades. Reduce dimensions first.
Production: brute-force KNN is per query. For large , use KD-trees or Ball Trees (→ next notebook).

What’s Next — Efficient Search Structures¶
The bottleneck in KNN is the scan over all training points. For large datasets, this is prohibitive. The next notebook introduces:
KD-Trees: partition feature space into axis-aligned hyperrectangles — sub-linear average search time in low dimensions
Ball Trees: partition by enclosing hyperspheres — better in moderate dimensions
Approximate methods: LSH and FAISS for million-scale nearest-neighbour search
Proceed to efficient_search.ipynb.