KNN Basics — Distance, Voting, and Lazy Learning¶

What you will learn: how K-Nearest Neighbours works, the core distance metrics (Euclidean, Manhattan, Minkowski), the choice of $k$ , KNN for regression vs classification, the curse of dimensionality, and how to use sklearn’s KNeighborsClassifier and KNeighborsRegressor.

Why KNN Matters for Business¶

A streaming service needs to recommend films to a new user. Without any model training — just by finding the 10 most similar users (based on past ratings) and averaging their preferences — it achieves 74% recall on holdout ratings. KNN’s zero-training-time property makes it uniquely suited for cold-start recommendation and anomaly detection.

No model to train. The entire training set is the model. Prediction cost is $O(n \cdot p)$ per query.

1. The KNN Algorithm¶

Given training set $\{(x_i, y_i)\}_{i=1}^n$ and query point $x_q$ :

Compute $d(x_q, x_i)$ for all $i$
Sort: find the $K$ smallest distances → indices $\mathcal{N}_K(x_q)$
Classification: $\hat{y} = \text{mode}\{y_i : i \in \mathcal{N}_K\}$
Regression: $\hat{y} = \frac{1}{K} \sum_{i \in \mathcal{N}_K} y_i$

Distance Metrics¶

The Minkowski distance unifies the most common metrics:

d_p(x, z) = \left(\sum_{j=1}^{p} |x_j - z_j|^p\right)^{1/p}

(1)

$p$	Name	Sensitivity
1	Manhattan (L1)	Robust to outliers in individual features
2	Euclidean (L2)	Standard geometric distance — most common
∞	Chebyshev	Maximum single-coordinate difference

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# -------------------------------------------------------
# KNN from scratch
# -------------------------------------------------------
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict_class(X_train, y_train, x_query, k):
    distances = [(euclidean_distance(x_query, x_i), y_i) for x_i, y_i in zip(X_train, y_train)]
    distances.sort(key=lambda t: t[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    return max(set(k_nearest_labels), key=k_nearest_labels.count)

def knn_predict_reg(X_train, y_train, x_query, k):
    distances = [(euclidean_distance(x_query, x_i), y_i) for x_i, y_i in zip(X_train, y_train)]
    distances.sort(key=lambda t: t[0])
    return np.mean([v for _, v in distances[:k]])

# -------------------------------------------------------
# Visualise KNN decision boundary for k=1,3,9,15
# -------------------------------------------------------
np.random.seed(42)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                            n_redundant=0, n_clusters_per_class=1,
                            class_sep=1.2, random_state=42)
scaler = StandardScaler()
X_s = scaler.fit_transform(X)

K_vals = [1, 3, 9, 15]
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
h = 0.04
xx, yy = np.meshgrid(np.arange(-3, 3, h), np.arange(-3, 3, h))
grid = np.c_[xx.ravel(), yy.ravel()]

for ax, k in zip(axes, K_vals):
    # Vectorised prediction on grid (using from-scratch for pedagogy)
    dists = np.linalg.norm(X_s[:, None, :] - grid[None, :, :], axis=2)  # (n_train, n_grid)
    nn_idx = np.argsort(dists, axis=0)[:k]                               # (k, n_grid)
    votes = y[nn_idx]                                                      # (k, n_grid)
    Z = np.array([np.bincount(votes[:, i], minlength=2).argmax() for i in range(len(grid))])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.25)
    ax.scatter(X_s[:,0], X_s[:,1], c=y, cmap=plt.cm.coolwarm, edgecolors='k', s=20)
    ax.set_title(f'k = {k}')
    ax.set_xlabel('x1'); ax.set_ylabel('x2')
    ax.set_xlim(-3, 3); ax.set_ylim(-3, 3)

plt.suptitle('KNN Decision Boundary vs K (from scratch, Euclidean)', fontsize=11)
plt.tight_layout()
plt.show()
print("k=1 → highly irregular boundary (low bias, high variance)")
print("k=15 → smoother boundary (higher bias, lower variance)")

2. Distance Metric Comparison¶

Different metrics give different neighbourhoods. The unit ball (set of points within distance 1 from origin) varies by metric:

L1 (Manhattan): diamond shape
L2 (Euclidean): circle
L∞ (Chebyshev): square

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2*np.pi, 300)
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
titles = ['L1 (Manhattan)', 'L2 (Euclidean)', 'L∞ (Chebyshev)']

# L1 unit ball: |x| + |y| = 1 → diamond
angles = np.array([0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi])
x_l1 = np.cos(angles)  # This is just a shortcut for the corners
# Proper L1 unit ball
t = np.linspace(0, 1, 100)
x_l1 = np.concatenate([t, 1-t, -t, -(1-t)])
y_l1 = np.concatenate([1-t, -t, -(1-t), t])

# L2 unit ball: circle
x_l2 = np.cos(theta)
y_l2 = np.sin(theta)

# L∞ unit ball: square
x_linf = np.array([-1, 1, 1, -1, -1])
y_linf = np.array([-1, -1, 1, 1, -1])

for ax, (xs, ys), title in zip(axes, [(x_l1, y_l1), (x_l2, y_l2), (x_linf, y_linf)], titles):
    ax.fill(xs, ys, alpha=0.3, color='steelblue')
    ax.plot(xs, ys, 'b-', lw=2)
    ax.axhline(0, color='k', lw=0.5); ax.axvline(0, color='k', lw=0.5)
    ax.set_xlim(-1.4, 1.4); ax.set_ylim(-1.4, 1.4)
    ax.set_aspect('equal'); ax.set_title(title); ax.grid(alpha=0.2)

plt.suptitle('Unit ball shapes by distance metric', fontsize=11)
plt.tight_layout()
plt.show()

# Numerical comparison
p1 = np.array([1.0, 0.0])
p2 = np.array([0.0, 1.0])
p3 = np.array([0.7, 0.7])
print("Point A=[1,0]  vs  B=[0,1]  vs  C=[0.7,0.7]:")
for name, p, q in [('A vs B', p1, p2), ('A vs C', p1, p3)]:
    l1 = np.sum(np.abs(p - q))
    l2 = np.linalg.norm(p - q)
    linf = np.max(np.abs(p - q))
    print(f"  {name}: L1={l1:.3f}  L2={l2:.3f}  L∞={linf:.3f}")

3. KNN for Regression¶

For regression, the prediction is the mean of the $K$ nearest target values. Weighted KNN uses inverse-distance weights:

\hat{y} = \frac{\sum_{i \in \mathcal{N}_K} w_i y_i}{\sum_{i \in \mathcal{N}_K} w_i}, \quad w_i = \frac{1}{d(x_q, x_i) + \varepsilon}

(2)

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
X_raw = np.sort(np.random.uniform(0, 2*np.pi, 80))
y = np.sin(X_raw) + np.random.normal(0, 0.2, 80)
X = X_raw.reshape(-1, 1)

X_test = np.linspace(0, 2*np.pi, 300).reshape(-1, 1)
y_true = np.sin(X_test.ravel())

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
k_vals = [1, 5, 20]
for ax, k in zip(axes, k_vals):
    m = KNeighborsRegressor(n_neighbors=k)
    m.fit(X, y)
    y_pred = m.predict(X_test)
    ax.scatter(X_raw, y, s=20, alpha=0.5, label='Training data')
    ax.plot(X_test, y_true, 'g--', lw=1.5, label='True sin(x)')
    ax.plot(X_test, y_pred, 'r-', lw=2, label=f'KNN k={k}')
    ax.set_title(f'KNN Regression  k={k}')
    ax.set_xlabel('x'); ax.set_ylabel('y')
    ax.legend(fontsize=8); ax.grid(alpha=0.3)

plt.suptitle('KNN Regression: k=1 overfits; k=20 underfits', fontsize=11)
plt.tight_layout()
plt.show()

4. Choosing K — Bias-Variance Trade-off¶

Small $k$ (e.g. 1): very flexible boundary, fits every training point, low bias / high variance → overfits
Large $k$ (e.g. $n$ ): very smooth, ignores local structure, high bias / low variance → underfits
Optimal $k$ : found via cross-validation. Rule of thumb: start with $k = \sqrt{n}$ (odd for binary classification).

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

np.random.seed(42)
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                            n_redundant=0, class_sep=1.0, random_state=42)

k_range = range(1, 31)
cv_scores = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for k in k_range:
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier(n_neighbors=k))])
    sc = cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')
    cv_scores.append(sc.mean())

best_k = k_range[np.argmax(cv_scores)]
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(list(k_range), cv_scores, 'b-o', markersize=4, label='CV accuracy')
ax.axvline(best_k, color='r', linestyle='--', label=f'Best k={best_k}')
ax.axvline(int(np.sqrt(400)), color='gray', linestyle=':', label=f'√n = {int(np.sqrt(400))}')
ax.set_xlabel('K'); ax.set_ylabel('5-fold CV Accuracy')
ax.set_title('Choosing K by Cross-Validation')
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Best k = {best_k}  CV accuracy = {max(cv_scores):.4f}")
print(f"Rule-of-thumb k = √400 = {int(np.sqrt(400))}")

5. The Curse of Dimensionality¶

As $p$ (features) increases, all points become approximately equidistant — the ratio of max/min distance → 1 and the nearest-neighbour concept breaks down.

The expected distance to the nearest neighbour in the unit hypercube scales as:

\mathbb{E}[d_{\min}] \approx \left(\frac{\Gamma(1 + p/2)}{n \cdot \pi^{p/2}}\right)^{1/p}

(3)

In practice: for $n = 1000$ points in 100 dimensions, almost all points are equidistant.

Mitigation strategies:

Dimensionality reduction (PCA, UMAP) before KNN
Feature selection to remove noise dimensions
Approximate nearest-neighbour search (→ efficient_search.ipynb)
Use alternative models (SVM, trees) when $p \gg n$

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n_samples = 1000
dims = [2, 5, 10, 20, 50, 100, 200]
mean_ratios = []

for p in dims:
    X = np.random.uniform(0, 1, (n_samples, p))
    query = np.random.uniform(0, 1, p)
    dists = np.linalg.norm(X - query, axis=1)
    ratio = dists.max() / dists.min()
    mean_ratios.append(ratio)
    print(f"p={p:4d}: min_dist={dists.min():.4f}  max_dist={dists.max():.4f}  ratio={ratio:.2f}")

fig, ax = plt.subplots(figsize=(8, 4))
ax.semilogy(dims, mean_ratios, 'r-o')
ax.axhline(1.0, color='k', linestyle='--', lw=1, label='ratio=1 (equidistant)')
ax.set_xlabel('Number of dimensions p')
ax.set_ylabel('max_dist / min_dist (log scale)')
ax.set_title('Curse of Dimensionality — Distance Ratio Collapses')
ax.legend(); ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

6. KNN Classification Pipeline with sklearn¶

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target,
                                            test_size=0.2, random_state=42, stratify=iris.target)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5, metric='euclidean', weights='uniform'))
])
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)

print("=== KNN Pipeline — Iris Dataset ===")
print(classification_report(y_te, y_pred, target_names=iris.target_names))

fig, ax = plt.subplots(figsize=(5, 4))
ConfusionMatrixDisplay.from_predictions(y_te, y_pred,
    display_labels=iris.target_names, colorbar=False, ax=ax)
ax.set_title('KNN (k=5) Confusion Matrix — Iris')
plt.tight_layout()
plt.show()

# Distance-weighted KNN
pipe_w = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5, weights='distance'))
])
pipe_w.fit(X_tr, y_tr)
print(f"Uniform weights accuracy : {pipe.score(X_te, y_te):.4f}")
print(f"Distance weights accuracy: {pipe_w.score(X_te, y_te):.4f}")

7. Try It in the Browser¶

Compute KNN classification from scratch for a small 2D dataset.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y)**2 for x, y in zip(a, b)))

def knn_classify(X_train, y_train, x_query, k):
    dists = sorted([(euclidean(x_query, xi), yi) for xi, yi in zip(X_train, y_train)])
    labels = [yi for _, yi in dists[:k]]
    return max(set(labels), key=labels.count), dists[:k]

# Training data (2 features, 2 classes)
X_train = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y_train = [0, 0, 0, 1, 1, 1]
query = [4, 4]

for k in [1, 3, 5]:
    label, neighbors = knn_classify(X_train, y_train, query, k)
    print(f"k={k}: predicted class={label}")
    for d, lbl in neighbors:
        print(f"  distance={d:.3f}  class={lbl}")
    print()

Knowledge Check¶

KNN with k=1 always achieves 100% accuracy on the training set. Why is this a problem?¶

[ ] A) It violates the Euclidean distance assumption [ ] B) The model has memorised the training data (zero bias) and will overfit — test accuracy will be much lower [ ] C) k=1 is not a valid KNN configuration [ ] D) The training accuracy metric is computed incorrectly for k=1

Check

Why must you apply StandardScaler before KNN but NOT before a Decision Tree?¶

[ ] A) Decision trees are faster than KNN [ ] B) KNN uses raw feature magnitudes in distance computations; decision trees split on thresholds and are scale-invariant [ ] C) StandardScaler changes the class labels [ ] D) Decision trees do not support floating-point features

Check

Exercises¶

Exercise 1 — K Sweep on Breast Cancer¶

Load sklearn.datasets.load_breast_cancer(). Build a Pipeline([StandardScaler, KNeighborsClassifier]). Sweep k from 1 to 30 using 5-fold CV. Plot CV accuracy vs k. What is the optimal k?

Exercise 2 — Metric Comparison¶

On the same breast cancer dataset, fix k=7 and compare metric ∈ {'euclidean', 'manhattan', 'chebyshev'} using 5-fold CV accuracy. Which metric wins? Why?

Exercise 3 — Weighted vs Uniform KNN¶

On make_moons(n_samples=500, noise=0.2), compare weights='uniform' vs weights='distance' for k=5. Which generalises better and why?

%matplotlib inline
# Exercises 1, 2, 3 — your code here
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt
# Your code here

Common Pitfalls¶

What’s Next — Efficient Search Structures¶

The bottleneck in KNN is the $O(n \cdot p)$ scan over all training points. For large datasets, this is prohibitive. The next notebook introduces:

KD-Trees: partition feature space into axis-aligned hyperrectangles — sub-linear average search time in low dimensions
Ball Trees: partition by enclosing hyperspheres — better in moderate dimensions
Approximate methods: LSH and FAISS for million-scale nearest-neighbour search

Proceed to efficient_search.ipynb.