Distance-Based & Instance Methods#

Welcome to the Neighborhood! 🏡 This chapter is all about the algorithms that don’t learn until you ask them to — the lazy but surprisingly effective geniuses of machine learning.

Meet K-Nearest Neighbors (KNN) — the algorithm that says:

“Why train a model when I can just look at my neighbors and copy their answers?” 😎


💡 What Are Distance-Based Methods?#

Unlike other algorithms that build mathematical models or find optimal weights, instance-based methods like KNN:

  • Store the entire dataset 🗃️

  • Wait for a new query 😴

  • Then measure distance to known examples 🚶‍♂️

  • And predict based on the closest neighbors 🧑‍🤝‍🧑

They’re like that one student who never studies but always borrows notes from their friends. 📒


🧮 The Core Idea#

When a new data point arrives:

  1. Compute its distance to every other data point.

  2. Pick the K closest ones (its “neighbors”).

  3. For classification → take a majority vote.

  4. For regression → take the average value.

That’s it! No gradient descent, no backpropagation — just pure neighborhood wisdom. 🧠
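To make those four steps concrete, here is a minimal from-scratch sketch in plain NumPy. The helper name `knn_predict` and the toy arrays are illustrative only, not a library API:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5, task="classification"):
    """Illustrative k-NN prediction for a single query point."""
    # 1. Compute Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # 2. Pick the indices of the K closest points
    nearest = np.argsort(distances)[:k]
    # 3. Classification: majority vote among the neighbors
    if task == "classification":
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # 4. Regression: average of the neighbors' target values
    return y_train[nearest].mean()

# Toy usage: two tight clusters, query near the first one
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> 0
```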


🧊 Distance Metrics#

Different distance measures give KNN different personalities:

| Distance Metric | Formula | When to Use |
|---|---|---|
| Euclidean | \( \sqrt{\sum_i (x_i - y_i)^2} \) | Continuous features (default) |
| Manhattan | \( \sum_i \lvert x_i - y_i \rvert \) | Grid-like or sparse features |
| Minkowski | \( \left( \sum_i \lvert x_i - y_i \rvert^p \right)^{1/p} \) (generalized distance) | When you can’t decide 😅 |
| Cosine | \( 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \) | High-dimensional / text data |

“Your metric defines your neighborhood’s vibe.” — KNN Philosophy, Vol. 1
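In scikit-learn, that “vibe” is just a constructor argument. A quick sketch of how the metrics above map onto KNeighborsClassifier (variable names are illustrative; all four estimators share the same fit/predict API):

```python
from sklearn.neighbors import KNeighborsClassifier

# Same estimator, different neighborhood "vibes" via the metric argument
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric="manhattan")
knn_minkowski = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=3)  # p=2 -> Euclidean, p=1 -> Manhattan
knn_cosine    = KNeighborsClassifier(n_neighbors=5, metric="cosine")          # cosine distance, handled with brute-force search
```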


🧠 Example Intuition#

Imagine you’re a store manager predicting if a new customer will buy premium coffee ☕ based on age and income.

KNN looks for the K most similar customers and checks if they bought premium coffee. If most of them did → predicts “Yes”. If not → “No”.

It’s basically peer pressure, but mathematical. 😅
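If you want to see that intuition in code, here is a tiny sketch; the (age, income) values and the “bought premium coffee” labels are completely made up for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, annual income in $1000s]; label: bought premium coffee?
X = np.array([[25, 40], [32, 60], [47, 90], [51, 120], [23, 35], [60, 150]])
y = np.array([0, 0, 1, 1, 0, 1])

# Scale first so income doesn't dominate the distance calculation
scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)

new_customer = np.array([[45, 95]])  # also made up
print(knn.predict(scaler.transform(new_customer)))  # vote of the 3 most similar customers
```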


🧰 K Is for “Kool” (and “Kinda Important”)#

Choosing K is crucial:

| K Value | Behavior | Analogy |
|---|---|---|
| Small (K=1) | Overfits noise | “Believes every rumor.” 🤷 |
| Large (K=15) | Smooth, general | “Listens to the community.” 🏘️ |

The sweet spot depends on your dataset size and noise level — you’ll experiment with that in the lab. 🔬
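One straightforward way to hunt for that sweet spot is cross-validation over a range of candidate K values. A minimal sketch on synthetic data (the loop bounds and dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# Try odd K values (odd avoids ties in binary voting) and keep the best CV score
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"Best K by 5-fold CV: {best_k} (accuracy ~ {scores[best_k]:.3f})")
```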


🧱 Strengths & Weaknesses#

| 👍 Pros | 👎 Cons |
|---|---|
| Simple, intuitive | Slow for large datasets |
| No training phase | Needs efficient search structures |
| Works with any data type given a suitable distance metric | Sensitive to irrelevant features |
| Performs well with good feature scaling | Distances can be misleading in high dimensions |
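The “good scaling” pro and the “irrelevant features” con both point to the same practical fix: standardize features before measuring distance. A minimal sketch using a scikit-learn pipeline on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize features so no single feature dominates the distance
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```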


⚙️ Libraries You’ll Use#

  • scikit-learn: KNeighborsClassifier, KNeighborsRegressor

  • numpy, pandas, matplotlib

  • Optional: scipy.spatial for faster nearest-neighbor searches


📊 Quick Demo#

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Create sample data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

🧩 Visualization#

```python
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, cmap='coolwarm', s=40)
plt.title("KNN Classification – Neighborhood Watch 👀")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```

Each test point’s color is decided by its neighbors — because in KNN, it’s not who you are, it’s who you hang out with. 😎


💼 Business Use Cases#

  • Customer Segmentation 🛒 Group customers by behavioral similarity.

  • Recommendation Systems 🎬 Suggest movies/products based on similar users.

  • Credit Risk Scoring 💳 Predict risk by comparing with similar borrowers.

  • Anomaly Detection 🚨 Spot weird patterns in fraud or network data.
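As a concrete illustration of the anomaly-detection idea, the distance to a point’s k-th nearest neighbor can serve as an outlier score (isolated points sit far from everyone). This is a toy sketch on synthetic data, not a production fraud model:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))     # dense cluster of "normal" activity
outliers = rng.uniform(6, 8, size=(5, 2))    # a few isolated points
X = np.vstack([normal, outliers])

# Distance to the 5th nearest neighbor as an anomaly score
nn = NearestNeighbors(n_neighbors=5).fit(X)
distances, _ = nn.kneighbors(X)
scores = distances[:, -1]

print("Indices of the 5 most anomalous points:", np.argsort(scores)[-5:])
```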


🧪 What’s Inside This Chapter#

| Section | Focus | Tagline |
|---|---|---|
| KNN Basics | How KNN thinks | “Lazy but effective.” |
| Efficient Search Structures | Speed up neighborhood lookup | “Because searching everyone’s house is slow.” 🏃‍♂️ |
| Lab – Customer Segmentation | Apply KNN to business data | “Find your customer tribe.” 🧑‍🤝‍🧑 |


💡 KNN doesn’t predict the future — it just copies what similar people did in the past.


🔗 Next Up: KNN Basics. Let’s meet our lazy genius and learn how to pick good neighbors.

Nearest Neighbour Algorithm

  1. Distance Metrics

The core idea: classify/regress a point based on the distance to nearby points.

Common Metrics:

• Euclidean Distance (L2 norm):

\[ d(x, x') = \sqrt{\sum_{i=1}^n (x_i - x_i')^2} \]

• Manhattan Distance (L1 norm):

\[ d(x, x') = \sum_{i=1}^n |x_i - x_i'| \]

• Minkowski Distance (general form):

\[ d(x, x') = \left( \sum_{i=1}^n |x_i - x_i'|^p \right)^{1/p} \]

• Cosine Similarity (for high-dimensional text data):

\[ \text{sim}(x, x') = \frac{x \cdot x'}{\lVert x \rVert \, \lVert x' \rVert} \]
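A direct NumPy translation of these formulas (a sketch; scipy.spatial.distance offers the same metrics as euclidean, cityblock, minkowski, and cosine, where cosine returns the distance 1 − similarity):

```python
import numpy as np

def euclidean(x, xp):
    return np.sqrt(np.sum((x - xp) ** 2))

def manhattan(x, xp):
    return np.sum(np.abs(x - xp))

def minkowski(x, xp, p=3):
    return np.sum(np.abs(x - xp) ** p) ** (1 / p)

def cosine_similarity(x, xp):
    return np.dot(x, xp) / (np.linalg.norm(x) * np.linalg.norm(xp))

x, xp = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(euclidean(x, xp), manhattan(x, xp), minkowski(x, xp, p=2), cosine_similarity(x, xp))
# minkowski with p=2 matches euclidean; parallel vectors give cosine similarity 1.0
```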

  2. k-Nearest Neighbors (k-NN) for Classification

Given a new point \(x\):

  1. Find the \(k\) closest points in the training set.

  2. Assign the majority class among those neighbors.

Decision rule:

\[ \hat{y} = \text{mode}\{\, y^{(i)} : x^{(i)} \in \mathcal{N}_k(x) \,\} \]

  3. k-NN for Regression

Instead of voting, take the average of the \(k\) nearest neighbors’ target values:

\[ \hat{y} = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} y^{(i)} \]
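A tiny worked example of both decision rules, with invented neighbor labels and targets:

```python
import numpy as np
from collections import Counter

# Suppose the k = 3 nearest neighbors of x carry these labels / targets (made up)
neighbor_labels = ["buy", "buy", "skip"]
neighbor_targets = np.array([2.0, 3.0, 4.0])

y_hat_class = Counter(neighbor_labels).most_common(1)[0][0]  # mode -> "buy"
y_hat_reg = neighbor_targets.mean()                          # average -> 3.0
print(y_hat_class, y_hat_reg)
```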

  4. Curse of Dimensionality

As dimensionality increases:

• Distances between points become less meaningful

• All points tend to become equidistant in high-dimensional space

• k-NN performance drops if feature selection or dimensionality reduction isn’t applied

Example:

In 100 dimensions, the difference between nearest and farthest neighbor distances may be negligible.
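A quick simulation of that effect (illustrative only): sample uniform random points and watch the relative gap between the nearest and farthest neighbor shrink as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(42)
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(1000, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative gap between farthest and nearest neighbor shrinks as d grows
    print(f"d={d:4d}  (max - min) / min = {(dists.max() - dists.min()) / dists.min():.2f}")
```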

  5. Efficient Search with KD-Trees / Ball Trees

• KD-Trees: Efficient for low-dimensional numerical data

• Ball Trees: Better for high-dimensional or non-Euclidean metrics

These reduce search time from \(O(n)\) to roughly \(O(\log n)\) per query (in low dimensions).
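A minimal sketch of both options: scikit-learn exposes the index structure through the algorithm parameter, and SciPy’s cKDTree can be built and queried directly (the data here is synthetic):

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))   # low-dimensional data suits a KD-tree
y = (X[:, 0] > 0).astype(int)

# scikit-learn: choose the index structure explicitly ('kd_tree', 'ball_tree', 'brute', or 'auto')
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)

# SciPy: build the tree once, then answer nearest-neighbor queries quickly
tree = cKDTree(X)
dist, idx = tree.query(rng.normal(size=(1, 3)), k=5)
print(dist.shape, idx.shape)  # (1, 5) each
```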

  6. Python: k-NN using scikit-learn

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import classification_report, mean_squared_error
import numpy as np

# Load classification dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# k-NN Classification
knn_cls = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_cls.fit(X_train, y_train)
y_pred_cls = knn_cls.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred_cls))

# k-NN Regression (simulate continuous target)
y_reg = X[:, 0] + np.random.normal(0, 0.2, size=X.shape[0])  # Fake regression target
X_train, X_test, y_train, y_test = train_test_split(X, y_reg, test_size=0.3, random_state=42)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
y_pred_reg = knn_reg.predict(X_test)
print("Regression MSE:", mean_squared_error(y_test, y_pred_reg))
```

```python
# Your code here
```