Distance-Based & Instance Methods#
Welcome to the Neighborhood! 🏡 This chapter is all about the algorithms that don’t learn until you ask them to — the lazy but surprisingly effective geniuses of machine learning.
Meet K-Nearest Neighbors (KNN) — the algorithm that says:
“Why train a model when I can just look at my neighbors and copy their answers?” 😎
💡 What Are Distance-Based Methods?#
Unlike other algorithms that build mathematical models or find optimal weights, instance-based methods like KNN:
Store the entire dataset 🗃️
Wait for a new query 😴
Then measure distance to known examples 🚶♂️
And predict based on the closest neighbors 🧑🤝🧑
They’re like that one student who never studies but always borrows notes from their friends. 📒
🧮 The Core Idea#
When a new data point arrives:
Compute its distance to every other data point.
Pick the K closest ones (its “neighbors”).
For classification → take a majority vote.
For regression → take the average value.
That’s it! No gradient descent, no backpropagation — just pure neighborhood wisdom. 🧠
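To make the recipe concrete, here is a minimal from-scratch sketch of that loop (a toy illustration, not how scikit-learn implements it; the function name `knn_predict` and the sample points are made up for this example):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Toy KNN classifier: measure distances, pick the K closest, majority vote."""
    # 1. Distance from x_new to every stored training point (Euclidean)
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Indices of the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among their labels
    return np.bincount(y_train[nearest]).argmax()

# Two tiny made-up clusters
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))  # -> 0, its neighbors are the first cluster
print(knn_predict(X_train, y_train, np.array([8, 7])))  # -> 1, its neighbors are the second cluster
```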
🧊 Distance Metrics#
Different distance measures give KNN different personalities:
| Distance Metric | Formula | When to Use |
|---|---|---|
| Euclidean | \( \sqrt{\sum_i (x_i - y_i)^2} \) | Continuous features (default) |
| Manhattan | \( \sum_i \lvert x_i - y_i \rvert \) | Sparse or grid-like data |
| Minkowski | \( \left( \sum_i \lvert x_i - y_i \rvert^p \right)^{1/p} \) (generalized distance) | When you can’t decide 😅 |
| Cosine | \( 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \) | High-dimensional or text data |
“Your metric defines your neighborhood’s vibe.” — KNN Philosophy, Vol. 1
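As a quick sanity check on these formulas, here is a small sketch using `scipy.spatial.distance` (listed as an optional library below); the two vectors are made up:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print("Euclidean:", distance.euclidean(x, y))            # sqrt of summed squared differences
print("Manhattan:", distance.cityblock(x, y))            # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Cosine distance:", distance.cosine(x, y))         # 1 - cosine similarity (0 here: same direction)
```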
🧠 Example Intuition#
Imagine you’re a store manager predicting if a new customer will buy premium coffee ☕ based on age and income.
KNN looks for the K most similar customers and checks if they bought premium coffee. If most of them did → predicts “Yes”. If not → “No”.
It’s basically peer pressure, but mathematical. 😅
🧰 K Is for “Kool” (and “Kinda Important”)#
Choosing K is crucial:
| K Value | Behavior | Analogy |
|---|---|---|
| Small (K=1) | Overfits noise | “Believes every rumor.” 🤷 |
| Large (K=15) | Smooth, general | “Listens to the community.” 🏘️ |
The sweet spot depends on your dataset size and noise level — you’ll experiment with that in the lab. 🔬
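One common way to find that sweet spot is to cross-validate over a range of K values. A minimal sketch with scikit-learn’s `cross_val_score` (the dataset and the candidate K range are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Evaluate odd K values (odd avoids tied votes in binary classification)
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 22, 2)}
best_k = max(scores, key=scores.get)
print("Best K:", best_k, "mean CV accuracy:", round(scores[best_k], 3))
```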
🧱 Strengths & Weaknesses#
| 👍 Pros | 👎 Cons |
|---|---|
| Simple, intuitive | Slow for large datasets |
| No training phase | Needs efficient search structures at prediction time |
| Works with any data type, given a suitable distance metric | Sensitive to irrelevant features |
| Performs well with good feature scaling | Distance can be misleading in high dimensions |
⚙️ Libraries You’ll Use#
- `scikit-learn` → `KNeighborsClassifier`, `KNeighborsRegressor`
- `numpy`, `pandas`, `matplotlib`
- Optional: `scipy.spatial` for faster nearest-neighbor searches
📊 Quick Demo#
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Create sample data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
🧩 Visualization#
```python
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, cmap='coolwarm', s=40)
plt.title("KNN Classification – Neighborhood Watch 👀")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```
Each test point’s color is decided by its neighbors — because in KNN, it’s not who you are, it’s who you hang out with. 😎
💼 Business Use Cases#
Customer Segmentation 🛒 Group customers by behavioral similarity.
Recommendation Systems 🎬 Suggest movies/products based on similar users.
Credit Risk Scoring 💳 Predict risk by comparing with similar borrowers.
Anomaly Detection 🚨 Spot weird patterns in fraud or network data.
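For the anomaly detection case above, one simple instance-based trick is to score each point by how far away its nearest neighbors are; a minimal sketch with scikit-learn’s `NearestNeighbors` on made-up data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # "normal" behaviour
               [[8.0, 8.0]]])                     # one obvious outlier

# kneighbors(X) includes each point itself, so dist[:, -1] is the distance
# to its 4th-nearest *other* point; large values suggest anomalies
nn = NearestNeighbors(n_neighbors=5).fit(X)
dist, _ = nn.kneighbors(X)
print("Most anomalous point:", X[dist[:, -1].argmax()])   # should be near [8, 8]
```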
🧪 What’s Inside This Chapter#
| Section | Focus | Tagline |
|---|---|---|
| KNN Basics | How KNN thinks | “Lazy but effective.” |
| KD-Trees & Ball Trees | Speed up neighborhood lookup | “Because searching everyone’s house is slow.” 🏃♂️ |
| Business Applications | Apply KNN to business data | “Find your customer tribe.” 🧑🤝🧑 |
💡 KNN doesn’t predict the future — it just copies what similar people did in the past.
🔗 Next Up: KNN Basics Let’s meet our lazy genius and learn how to pick good neighbors.
⸻
Nearest Neighbour Algorithm
⸻
Distance Metrics
The core idea: classify/regress a point based on the distance to nearby points.
Common Metrics:
- Euclidean Distance (L2 norm): \( d(x, y) = \sqrt{\sum_i (x_i - y_i)^2} \)
- Manhattan Distance (L1 norm): \( d(x, y) = \sum_i \lvert x_i - y_i \rvert \)
- Minkowski Distance (general form): \( d(x, y) = \left( \sum_i \lvert x_i - y_i \rvert^p \right)^{1/p} \)
- Cosine Similarity (for high-dimensional text data): \( \text{sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \), used as the distance \( 1 - \text{sim}(x, y) \)
⸻
k-Nearest Neighbors (k-NN) for Classification
Given a new point \(x\):
1. Find the \(k\) closest points in the training set, \(N_k(x)\).
2. Assign the majority class among those neighbors.

Decision rule:

\[
\hat{y} = \arg\max_{c} \sum_{i \in N_k(x)} \mathbb{1}(y_i = c)
\]
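In code, the vote itself is a one-liner with `np.bincount`, assuming integer class labels (the neighbor labels below are made up):

```python
import numpy as np

neighbor_labels = np.array([1, 0, 1, 1, 0])    # classes of the k nearest neighbors (made up)
print(np.bincount(neighbor_labels).argmax())   # majority vote -> 1
```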
⸻
k-NN for Regression
Instead of voting, take the average of the \(k\) nearest neighbors’ target values:

\[
\hat{y} = \frac{1}{k} \sum_{i \in N_k(x)} y_i
\]
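And the regression version is just an average of the neighbors’ targets (the values below are made up):

```python
import numpy as np

neighbor_targets = np.array([3.1, 2.8, 3.4, 3.0, 2.9])  # targets of the k nearest neighbors (made up)
print(neighbor_targets.mean())                           # (1/k) * sum -> 3.04
```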
⸻
Curse of Dimensionality
As dimensionality increases:
- Distances between points become less meaningful
- All points become nearly equidistant in high-dimensional space
- k-NN performance drops if feature selection or dimensionality reduction isn’t applied
Example:
In 100 dimensions, the difference between nearest and farthest neighbor distances may be negligible.
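A quick simulation makes this tangible: sample random points in the unit cube and compare the nearest and farthest distances from a query point as the dimension grows (the sample size and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((1000, d))        # 1000 random points in the unit cube
    q = rng.random(d)                # one query point
    dists = np.linalg.norm(X - q, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={d:>4}  relative gap between farthest and nearest: {contrast:.2f}")
```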
⸻
Efficient Search with KD-Trees / Ball Trees
- KD-Trees: Efficient for low-dimensional numerical data
- Ball Trees: Better for high-dimensional or non-Euclidean metrics
These reduce search time from \(O(n)\) to roughly \(O(\log n)\) per query (in low dimensions).
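In scikit-learn you opt into these structures through the `algorithm` parameter of `NearestNeighbors` (or `KNeighborsClassifier`); a minimal sketch on random low-dimensional data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(1).random((10_000, 3))   # low-dimensional data suits a KD-tree

kd = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X)
ball = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)

dist_kd, idx_kd = kd.kneighbors(X[:3])
dist_ball, idx_ball = ball.kneighbors(X[:3])
print(np.array_equal(idx_kd, idx_ball))   # same exact neighbors, just found faster
```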
⸻
Python: k-NN using scikit-learn
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import classification_report, mean_squared_error
import numpy as np

# Load classification dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# k-NN classification
knn_cls = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_cls.fit(X_train, y_train)
y_pred_cls = knn_cls.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred_cls))

# k-NN regression (simulate continuous target)
y_reg = X[:, 0] + np.random.normal(0, 0.2, size=X.shape[0])  # Fake regression target
X_train, X_test, y_train, y_test = train_test_split(X, y_reg, test_size=0.3, random_state=42)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
y_pred_reg = knn_reg.predict(X_test)
print("Regression MSE:", mean_squared_error(y_test, y_pred_reg))
```
⸻
```python
# Your code here
```