
Welcome to the Neighborhood! 🏡 This chapter is all about the algorithms that don’t learn until you ask them to — the lazy but surprisingly effective geniuses of machine learning.

Meet K-Nearest Neighbors (KNN) — the algorithm that says:

“Why train a model when I can just look at my neighbors and copy their answers?” 😎


💡 What Are Distance-Based Methods?

Unlike algorithms that build explicit mathematical models or learn optimal weights, distance-based (also called instance-based) methods like KNN:

  • Store the entire dataset 🗃️

  • Wait for a new query 😴

  • Then measure distance to known examples 🚶‍♂️

  • And predict based on the closest neighbors 🧑‍🤝‍🧑

They’re like that one student who never studies but always borrows notes from their friends. 📒


🧮 The Core Idea

When a new data point arrives:

  1. Compute its distance to every other data point.

  2. Pick the K closest ones (its “neighbors”).

  3. For classification → take a majority vote.

  4. For regression → take the average value.

That’s it! No gradient descent, no backpropagation — just pure neighborhood wisdom. 🧠
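To make those four steps concrete, here is a minimal from-scratch sketch in NumPy for the classification case. The toy data and the choice of K are made up purely for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Compute Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Pick the indices of the K closest points
    nearest = np.argsort(distances)[:k]
    # 3. Classification: majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features, two well-separated classes
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))   # -> 0
print(knn_predict(X_train, y_train, np.array([8, 7])))   # -> 1

The regression variant simply replaces the vote with an average, as covered later in this chapter.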


🧊 Distance Metrics

Different distance measures give KNN different personalities:

| Distance Metric | Formula | When to Use |
| --- | --- | --- |
| Euclidean | $\sqrt{\sum_i (x_i - y_i)^2}$ | Continuous features (default) |
| Manhattan | $\sum_i \lvert x_i - y_i \rvert$ | Sparse or grid-like features; less sensitive to outliers |
| Minkowski | Generalized distance: $\left(\sum_i \lvert x_i - y_i \rvert^p\right)^{1/p}$ | When you can’t decide 😅 |
| Cosine | $1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$ | High-dimensional text data |

“Your metric defines your neighborhood’s vibe.” — KNN Philosophy, Vol. 1
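As a quick sanity check of these formulas, scipy.spatial.distance implements all of them. The vectors below are made up for illustration:

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(x, y))        # sqrt(sum((x_i - y_i)^2))
print(distance.cityblock(x, y))        # Manhattan: sum(|x_i - y_i|)
print(distance.minkowski(x, y, p=3))   # Minkowski with p = 3
print(distance.cosine(x, y))           # 1 - cosine similarity -> 0.0 (same direction)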


🧠 Example Intuition

Imagine you’re a store manager predicting if a new customer will buy premium coffee ☕ based on age and income.

KNN looks for the K most similar customers and checks if they bought premium coffee. If most of them did → predicts “Yes”. If not → “No”.

It’s basically peer pressure, but mathematical. 😅
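A tiny sketch of that intuition with scikit-learn; the ages, incomes, and purchase labels are invented for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Made-up customers: [age, income], 1 = bought premium coffee, 0 = didn't
X = np.array([[25, 30000], [30, 40000], [35, 90000], [40, 95000], [45, 100000], [22, 28000]])
y = np.array([0, 0, 1, 1, 1, 0])

# Scale features so income doesn't dominate the distance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)

new_customer = scaler.transform([[38, 92000]])
print(knn.predict(new_customer))  # likely [1]: this customer's closest "peers" bought premium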


🧰 K Is for “Kool” (and “Kinda Important”)

Choosing K is crucial:

| K Value | Behavior | Analogy |
| --- | --- | --- |
| Small (K=1) | Overfits noise | “Believes every rumor.” 🤷 |
| Large (K=15) | Smooth, general | “Listens to the community.” 🏘️ |

The sweet spot depends on your dataset size and noise level — you’ll experiment with that in the lab. 🔬
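One common way to search for that sweet spot is to cross-validate over a range of K values. A sketch, with an arbitrary synthetic dataset and K range:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Try odd K values (odd avoids ties in binary voting) and keep the best one
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 30, 2)}
best_k = max(scores, key=scores.get)
print("Best K:", best_k, "CV accuracy:", round(scores[best_k], 3))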


🧱 Strengths & Weaknesses

| 👍 Pros | 👎 Cons |
| --- | --- |
| Simple, intuitive | Slow for large datasets |
| No training phase | Needs efficient search |
| Works for any data type | Sensitive to irrelevant features |
| Performs well with good scaling | Distance can be misleading in high dimensions |

⚙️ Libraries You’ll Use

  • scikit-learn: KNeighborsClassifier, KNeighborsRegressor

  • numpy, pandas, matplotlib

  • Optional: scipy.spatial for faster nearest-neighbor searches


📊 Quick Demo

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Create sample data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

🧩 Visualization

plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, cmap='coolwarm', s=40)
plt.title("KNN Classification – Neighborhood Watch 👀")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

Each test point’s color is decided by its neighbors — because in KNN, it’s not who you are, it’s who you hang out with. 😎


💼 Business Use Cases

  • Customer Segmentation 🛒 Group customers by behavioral similarity.

  • Recommendation Systems 🎬 Suggest movies/products based on similar users.

  • Credit Risk Scoring 💳 Predict risk by comparing with similar borrowers.

  • Anomaly Detection 🚨 Spot weird patterns in fraud or network data.


🧪 What’s Inside This Chapter

| Section | Focus | Tagline |
| --- | --- | --- |
| KNN Basics | How KNN thinks | “Lazy but effective.” |
| Efficient Search Structures | Speed up neighborhood lookup | “Because searching everyone’s house is slow.” 🏃‍♂️ |
| Lab – Customer Segmentation | Apply KNN to business data | “Find your customer tribe.” 🧑‍🤝‍🧑 |

💡 KNN doesn’t predict the future — it just copies what similar people did in the past.


🔗 Next Up: KNN Basics Let’s meet our lazy genius and learn how to pick good neighbors.

Nearest Neighbour Algorithm

  1. Distance Metrics

The core idea: classify/regress a point based on the distance to nearby points.

Common Metrics:

  • Euclidean Distance (L2 norm):

$$d(x, x') = \sqrt{\sum_{i=1}^n (x_i - x_i')^2}$$

  • Manhattan Distance (L1 norm):

$$d(x, x') = \sum_{i=1}^n |x_i - x_i'|$$

  • Minkowski Distance (general form):

$$d(x, x') = \left( \sum_{i=1}^n |x_i - x_i'|^p \right)^{1/p}$$

  • Cosine Similarity (for high-dimensional text data):

$$\text{sim}(x, x') = \frac{x \cdot x'}{\lVert x \rVert \, \lVert x' \rVert}$$

  2. k-Nearest Neighbors (k-NN) for Classification

Given a new point $x$:

  1. Find the $k$ closest points in the training set.
  2. Assign the majority class among those neighbors.

Decision rule:

$$\hat{y} = \text{mode}\{\, y^{(i)} : x^{(i)} \in \mathcal{N}_k(x) \,\}$$

  3. k-NN for Regression

Instead of voting, take the average of the $k$ nearest neighbors’ target values:

$$\hat{y} = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} y^{(i)}$$
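A minimal NumPy sketch of this averaging rule, using toy one-feature data assumed purely for illustration:

import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    # Distances to every training point, then average the k nearest targets
    distances = np.abs(X_train - x_new)   # one-dimensional features, so |x - x'|
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

X_train = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_train = np.array([1.1, 2.0, 2.9, 10.2, 11.1, 11.9])
print(knn_regress(X_train, y_train, 2.5))   # average of the 3 closest target values -> 2.0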

  4. Curse of Dimensionality

As dimensionality increases:

  • Distances between points become less meaningful
  • All points become nearly equidistant in high-dimensional space
  • k-NN performance drops unless feature selection or dimensionality reduction is applied

Example:

In 100 dimensions, the difference between nearest and farthest neighbor distances may be negligible.
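You can see this effect with a small random experiment; the dimensions and sample size below are chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 100):
    points = rng.random((1000, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # The farthest-to-nearest distance ratio shrinks toward 1 as dimensionality grows
    print(f"dim={dim:>3}: nearest={dists.min():.2f}, farthest={dists.max():.2f}, "
          f"ratio={dists.max() / dists.min():.1f}")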

  5. Efficient Search with KD-Trees / Ball Trees

  • KD-Trees: Efficient for low-dimensional numerical data
  • Ball Trees: Better for high-dimensional or non-Euclidean metrics

These reduce search time from $O(n)$ to roughly $O(\log n)$ per query (in low dimensions).
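In scikit-learn you can either query sklearn.neighbors.KDTree or BallTree directly, or ask KNeighborsClassifier to use one via its algorithm parameter. A brief sketch with random data chosen for illustration:

import numpy as np
from sklearn.neighbors import KDTree, KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.random((10_000, 3))

# Direct tree query: the 5 nearest neighbors of one query point
tree = KDTree(X)                      # use BallTree(X) for higher dimensions or other metrics
dist, idx = tree.query(X[:1], k=5)
print(idx)

# Or let the classifier build the tree internally
y = (X[:, 0] > 0.5).astype(int)
knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn.fit(X, y)
print(knn.predict(X[:3]))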

  6. Python: k-NN using scikit-learn

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import classification_report, mean_squared_error
import numpy as np

# Load classification dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# k-NN Classification
knn_cls = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_cls.fit(X_train, y_train)
y_pred_cls = knn_cls.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred_cls))

# k-NN Regression (simulate continuous target)
y_reg = X[:, 0] + np.random.normal(0, 0.2, size=X.shape[0])  # Fake regression target
X_train, X_test, y_train, y_test = train_test_split(X, y_reg, test_size=0.3, random_state=42)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
y_pred_reg = knn_reg.predict(X_test)
print("Regression MSE:", mean_squared_error(y_test, y_pred_reg))

# Your code here