Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Hero image

KNN Basics — Distance, Voting, and Lazy Learning

What you will learn: how K-Nearest Neighbours works, the core distance metrics (Euclidean, Manhattan, Minkowski), the choice of kk, KNN for regression vs classification, the curse of dimensionality, and how to use sklearn’s KNeighborsClassifier and KNeighborsRegressor.

Business hook

Why KNN Matters for Business

A streaming service needs to recommend films to a new user. Without any model training — just by finding the 10 most similar users (based on past ratings) and averaging their preferences — it achieves 74% recall on holdout ratings. KNN’s zero-training-time property makes it uniquely suited for cold-start recommendation and anomaly detection.

No model to train. The entire training set is the model. Prediction cost is O(np)O(n \cdot p) per query.

1. The KNN Algorithm

Given training set {(xi,yi)}i=1n\{(x_i, y_i)\}_{i=1}^n and query point xqx_q:

  1. Compute d(xq,xi)d(x_q, x_i) for all ii

  2. Sort: find the KK smallest distances → indices NK(xq)\mathcal{N}_K(x_q)

  3. Classification: y^=mode{yi:iNK}\hat{y} = \text{mode}\{y_i : i \in \mathcal{N}_K\}
    Regression: y^=1KiNKyi\hat{y} = \frac{1}{K} \sum_{i \in \mathcal{N}_K} y_i

Distance Metrics

The Minkowski distance unifies the most common metrics:

dp(x,z)=(j=1pxjzjp)1/pd_p(x, z) = \left(\sum_{j=1}^{p} |x_j - z_j|^p\right)^{1/p}
ppNameSensitivity
1Manhattan (L1)Robust to outliers in individual features
2Euclidean (L2)Standard geometric distance — most common
ChebyshevMaximum single-coordinate difference
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# -------------------------------------------------------
# KNN from scratch
# -------------------------------------------------------
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict_class(X_train, y_train, x_query, k):
    distances = [(euclidean_distance(x_query, x_i), y_i) for x_i, y_i in zip(X_train, y_train)]
    distances.sort(key=lambda t: t[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    return max(set(k_nearest_labels), key=k_nearest_labels.count)

def knn_predict_reg(X_train, y_train, x_query, k):
    distances = [(euclidean_distance(x_query, x_i), y_i) for x_i, y_i in zip(X_train, y_train)]
    distances.sort(key=lambda t: t[0])
    return np.mean([v for _, v in distances[:k]])

# -------------------------------------------------------
# Visualise KNN decision boundary for k=1,3,9,15
# -------------------------------------------------------
np.random.seed(42)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1,
                            class_sep=1.2, random_state=42)
scaler = StandardScaler()
X_s = scaler.fit_transform(X)

K_vals = [1, 3, 9, 15]
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
h = 0.04
xx, yy = np.meshgrid(np.arange(-3, 3, h), np.arange(-3, 3, h))
grid = np.c_[xx.ravel(), yy.ravel()]

for ax, k in zip(axes, K_vals):
    # Vectorised prediction on grid (using from-scratch for pedagogy)
    dists = np.linalg.norm(X_s[:, None, :] - grid[None, :, :], axis=2)  # (n_train, n_grid)
    nn_idx = np.argsort(dists, axis=0)[:k]                               # (k, n_grid)
    votes = y[nn_idx]                                                      # (k, n_grid)
    Z = np.array([np.bincount(votes[:, i], minlength=2).argmax() for i in range(len(grid))])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.25)
    ax.scatter(X_s[:,0], X_s[:,1], c=y, cmap=plt.cm.coolwarm, edgecolors='k', s=20)
    ax.set_title(f'k = {k}')
    ax.set_xlabel('x1'); ax.set_ylabel('x2')
    ax.set_xlim(-3, 3); ax.set_ylim(-3, 3)

plt.suptitle('KNN Decision Boundary vs K (from scratch, Euclidean)', fontsize=11)
plt.tight_layout()
plt.show()
print("k=1 → highly irregular boundary (low bias, high variance)")
print("k=15 → smoother boundary (higher bias, lower variance)")

2. Distance Metric Comparison

Different metrics give different neighbourhoods. The unit ball (set of points within distance 1 from origin) varies by metric:

  • L1 (Manhattan): diamond shape

  • L2 (Euclidean): circle

  • L∞ (Chebyshev): square

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2*np.pi, 300)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
titles = ['L1 (Manhattan)', 'L2 (Euclidean)', 'L∞ (Chebyshev)']

# L1 unit ball: |x| + |y| = 1 → diamond
angles = np.array([0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi])
x_l1 = np.cos(angles)  # This is just a shortcut for the corners
# Proper L1 unit ball
t = np.linspace(0, 1, 100)
x_l1 = np.concatenate([t, 1-t, -t, -(1-t)])
y_l1 = np.concatenate([1-t, -t, -(1-t), t])

# L2 unit ball: circle
x_l2 = np.cos(theta)
y_l2 = np.sin(theta)

# L∞ unit ball: square
x_linf = np.array([-1, 1, 1, -1, -1])
y_linf = np.array([-1, -1, 1, 1, -1])

for ax, (xs, ys), title in zip(axes, [(x_l1, y_l1), (x_l2, y_l2), (x_linf, y_linf)], titles):
    ax.fill(xs, ys, alpha=0.3, color='steelblue')
    ax.plot(xs, ys, 'b-', lw=2)
    ax.axhline(0, color='k', lw=0.5); ax.axvline(0, color='k', lw=0.5)
    ax.set_xlim(-1.4, 1.4); ax.set_ylim(-1.4, 1.4)
    ax.set_aspect('equal'); ax.set_title(title); ax.grid(alpha=0.2)

plt.suptitle('Unit ball shapes by distance metric', fontsize=11)
plt.tight_layout()
plt.show()

# Numerical comparison
p1 = np.array([1.0, 0.0])
p2 = np.array([0.0, 1.0])
p3 = np.array([0.7, 0.7])
print("Point A=[1,0]  vs  B=[0,1]  vs  C=[0.7,0.7]:")
for name, p, q in [('A vs B', p1, p2), ('A vs C', p1, p3)]:
    l1 = np.sum(np.abs(p - q))
    l2 = np.linalg.norm(p - q)
    linf = np.max(np.abs(p - q))
    print(f"  {name}: L1={l1:.3f}  L2={l2:.3f}  L∞={linf:.3f}")

3. KNN for Regression

For regression, the prediction is the mean of the KK nearest target values. Weighted KNN uses inverse-distance weights:

y^=iNKwiyiiNKwi,wi=1d(xq,xi)+ε\hat{y} = \frac{\sum_{i \in \mathcal{N}_K} w_i y_i}{\sum_{i \in \mathcal{N}_K} w_i}, \quad w_i = \frac{1}{d(x_q, x_i) + \varepsilon}
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
X_raw = np.sort(np.random.uniform(0, 2*np.pi, 80))
y = np.sin(X_raw) + np.random.normal(0, 0.2, 80)
X = X_raw.reshape(-1, 1)

X_test = np.linspace(0, 2*np.pi, 300).reshape(-1, 1)
y_true = np.sin(X_test.ravel())

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
k_vals = [1, 5, 20]
for ax, k in zip(axes, k_vals):
    m = KNeighborsRegressor(n_neighbors=k)
    m.fit(X, y)
    y_pred = m.predict(X_test)
    ax.scatter(X_raw, y, s=20, alpha=0.5, label='Training data')
    ax.plot(X_test, y_true, 'g--', lw=1.5, label='True sin(x)')
    ax.plot(X_test, y_pred, 'r-', lw=2, label=f'KNN k={k}')
    ax.set_title(f'KNN Regression  k={k}')
    ax.set_xlabel('x'); ax.set_ylabel('y')
    ax.legend(fontsize=8); ax.grid(alpha=0.3)

plt.suptitle('KNN Regression: k=1 overfits; k=20 underfits', fontsize=11)
plt.tight_layout()
plt.show()

4. Choosing K — Bias-Variance Trade-off

  • Small kk (e.g. 1): very flexible boundary, fits every training point, low bias / high variance → overfits

  • Large kk (e.g. nn): very smooth, ignores local structure, high bias / low variance → underfits

  • Optimal kk: found via cross-validation. Rule of thumb: start with k=nk = \sqrt{n} (odd for binary classification).

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

np.random.seed(42)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                            n_redundant=0, class_sep=1.0, random_state=42)

k_range = range(1, 31)
cv_scores = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for k in k_range:
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier(n_neighbors=k))])
    sc = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
    cv_scores.append(sc.mean())

best_k = k_range[np.argmax(cv_scores)]
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(list(k_range), cv_scores, 'b-o', markersize=4, label='CV accuracy')
ax.axvline(best_k, color='r', linestyle='--', label=f'Best k={best_k}')
ax.axvline(int(np.sqrt(400)), color='gray', linestyle=':', label=f'√n = {int(np.sqrt(400))}')
ax.set_xlabel('K'); ax.set_ylabel('5-fold CV Accuracy')
ax.set_title('Choosing K by Cross-Validation')
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Best k = {best_k}  CV accuracy = {max(cv_scores):.4f}")
print(f"Rule-of-thumb k = √400 = {int(np.sqrt(400))}")

5. The Curse of Dimensionality

As pp (features) increases, all points become approximately equidistant — the ratio of max/min distance → 1 and the nearest-neighbour concept breaks down.

The expected distance to the nearest neighbour in the unit hypercube scales as:

E[dmin](Γ(1+p/2)nπp/2)1/p\mathbb{E}[d_{\min}] \approx \left(\frac{\Gamma(1 + p/2)}{n \cdot \pi^{p/2}}\right)^{1/p}

In practice: for n=1000n = 1000 points in 100 dimensions, almost all points are equidistant.

Mitigation strategies:

  • Dimensionality reduction (PCA, UMAP) before KNN

  • Feature selection to remove noise dimensions

  • Approximate nearest-neighbour search (→ efficient_search.ipynb)

  • Use alternative models (SVM, trees) when pnp \gg n

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n_samples = 1000
dims = [2, 5, 10, 20, 50, 100, 200]
mean_ratios = []

for p in dims:
    X = np.random.uniform(0, 1, (n_samples, p))
    query = np.random.uniform(0, 1, p)
    dists = np.linalg.norm(X - query, axis=1)
    ratio = dists.max() / dists.min()
    mean_ratios.append(ratio)
    print(f"p={p:4d}: min_dist={dists.min():.4f}  max_dist={dists.max():.4f}  ratio={ratio:.2f}")

fig, ax = plt.subplots(figsize=(8, 4))
ax.semilogy(dims, mean_ratios, 'r-o')
ax.axhline(1.0, color='k', linestyle='--', lw=1, label='ratio=1 (equidistant)')
ax.set_xlabel('Number of dimensions p')
ax.set_ylabel('max_dist / min_dist (log scale)')
ax.set_title('Curse of Dimensionality — Distance Ratio Collapses')
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

6. KNN Classification Pipeline with sklearn

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                            test_size=0.2, random_state=42, stratify=iris.target)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5, metric='euclidean', weights='uniform'))
])
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)

print("=== KNN Pipeline — Iris Dataset ===")
print(classification_report(y_te, y_pred, target_names=iris.target_names))

fig, ax = plt.subplots(figsize=(5, 4))
ConfusionMatrixDisplay.from_predictions(y_te, y_pred,
    display_labels=iris.target_names, colorbar=False, ax=ax)
ax.set_title('KNN (k=5) Confusion Matrix — Iris')
plt.tight_layout()
plt.show()

# Distance-weighted KNN
pipe_w = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5, weights='distance'))
])
pipe_w.fit(X_tr, y_tr)
print(f"Uniform weights accuracy : {pipe.score(X_te, y_te):.4f}")
print(f"Distance weights accuracy: {pipe_w.score(X_te, y_te):.4f}")

7. Try It in the Browser

Compute KNN classification from scratch for a small 2D dataset.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y)**2 for x, y in zip(a, b)))

def knn_classify(X_train, y_train, x_query, k):
    dists = sorted([(euclidean(x_query, xi), yi) for xi, yi in zip(X_train, y_train)])
    labels = [yi for _, yi in dists[:k]]
    return max(set(labels), key=labels.count), dists[:k]

# Training data (2 features, 2 classes)
X_train = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y_train = [0, 0, 0, 1, 1, 1]
query = [4, 4]

for k in [1, 3, 5]:
    label, neighbors = knn_classify(X_train, y_train, query, k)
    print(f"k={k}: predicted class={label}")
    for d, lbl in neighbors:
        print(f"  distance={d:.3f}  class={lbl}")
    print()

Knowledge Check

KNN with k=1 always achieves 100% accuracy on the training set. Why is this a problem?

[ ] A) It violates the Euclidean distance assumption [ ] B) The model has memorised the training data (zero bias) and will overfit — test accuracy will be much lower [ ] C) k=1 is not a valid KNN configuration [ ] D) The training accuracy metric is computed incorrectly for k=1
Check

Why must you apply StandardScaler before KNN but NOT before a Decision Tree?

[ ] A) Decision trees are faster than KNN [ ] B) KNN uses raw feature magnitudes in distance computations; decision trees split on thresholds and are scale-invariant [ ] C) StandardScaler changes the class labels [ ] D) Decision trees do not support floating-point features
Check

Exercises

Exercise 1 — K Sweep on Breast Cancer

Load sklearn.datasets.load_breast_cancer(). Build a Pipeline([StandardScaler, KNeighborsClassifier]). Sweep k from 1 to 30 using 5-fold CV. Plot CV accuracy vs k. What is the optimal k?

Exercise 2 — Metric Comparison

On the same breast cancer dataset, fix k=7 and compare metric ∈ {'euclidean', 'manhattan', 'chebyshev'} using 5-fold CV accuracy. Which metric wins? Why?

Exercise 3 — Weighted vs Uniform KNN

On make_moons(n_samples=500, noise=0.2), compare weights='uniform' vs weights='distance' for k=5. Which generalises better and why?

%matplotlib inline
# Exercises 1, 2, 3 — your code here
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt
# Your code here

Common Pitfalls

Summary
  • KNN is a non-parametric, instance-based algorithm: no training, prediction = vote/average of KK nearest neighbours.

  • Distance metrics: Euclidean (p=2p=2), Manhattan (p=1p=1), Minkowski (p=qp=q) — always scale features first.

  • k trade-off: small kk = low bias / high variance; large kk = high bias / low variance. Tune via cross-validation.

  • Regression KNN: predicts the mean (or inverse-distance-weighted mean) of KK nearest targets.

  • Curse of dimensionality: in high dimensions, all distances converge — KNN degrades. Reduce dimensions first.

  • Production: brute-force KNN is O(np)O(n \cdot p) per query. For large nn, use KD-trees or Ball Trees (→ next notebook).

Next steps

What’s Next — Efficient Search Structures

The bottleneck in KNN is the O(np)O(n \cdot p) scan over all training points. For large datasets, this is prohibitive. The next notebook introduces:

  • KD-Trees: partition feature space into axis-aligned hyperrectangles — sub-linear average search time in low dimensions

  • Ball Trees: partition by enclosing hyperspheres — better in moderate dimensions

  • Approximate methods: LSH and FAISS for million-scale nearest-neighbour search

Proceed to efficient_search.ipynb.