Welcome to the Neighborhood! 🏡 This chapter is all about the algorithms that don’t learn until you ask them to — the lazy but surprisingly effective geniuses of machine learning.
Meet K-Nearest Neighbors (KNN) — the algorithm that says:
“Why train a model when I can just look at my neighbors and copy their answers?” 😎
💡 What Are Distance-Based Methods?
Unlike other algorithms that build mathematical models or find optimal weights, instance-based methods like KNN:
Store the entire dataset 🗃️
Wait for a new query 😴
Then measure distance to known examples 🚶‍♂️
And predict based on the closest neighbors 🧑‍🤝‍🧑
They’re like that one student who never studies but always borrows notes from their friends. 📒
🧮 The Core Idea
When a new data point arrives:
Compute its distance to every stored training point.
Pick the K closest ones (its “neighbors”).
For classification → take a majority vote.
For regression → take the average value.
That’s it! No gradient descent, no backpropagation — just pure neighborhood wisdom. 🧠
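The steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how scikit-learn implements it; the helper name `knn_predict` and the tiny dataset are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point with a brute-force k-NN search (hypothetical helper)."""
    # Step 1: Euclidean distance from the query to every stored training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances (the "neighbors")
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Step 3a: majority vote among the neighbors' labels
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Step 3b: average of the neighbors' target values
    return float(neighbor_targets.mean())

# Made-up data: two small clusters in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 50.0, 52.0, 48.0])

query = np.array([1.1, 1.0])
label = knn_predict(X_train, y_class, query, k=3)
value = knn_predict(X_train, y_value, query, k=3, task="regression")
print("Predicted class:", label)   # 0
print("Predicted value:", value)   # 11.0
```

Note that nothing happens before the query arrives: all the work is in the distance computation at prediction time, which is why KNN is called lazy.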
🧊 Distance Metrics
Different distance measures give KNN different personalities:
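| Distance Metric | Formula | When to Use |
|---|---|---|
| Euclidean | \( \sqrt{\sum_i (x_i - y_i)^2} \) | Continuous features (default) |
| Manhattan | \( \sum_i \lvert x_i - y_i \rvert \) | Grid-like layouts; less dominated by a single large difference |
| Minkowski | Generalized distance: \( \left( \sum_i \lvert x_i - y_i \rvert^p \right)^{1/p} \) (p = 1 is Manhattan, p = 2 is Euclidean) | When you can’t decide 😅 |
| Cosine | \( 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \) | High-dimensional text data |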
“Your metric defines your neighborhood’s vibe.” — KNN Philosophy, Vol. 1
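To make the metrics concrete, here is a small sketch that computes each one by hand with NumPy and cross-checks it against `scipy.spatial.distance`. The two vectors and the choice `p = 3` for Minkowski are arbitrary values picked for illustration.

```python
import numpy as np
from scipy.spatial import distance

# Two arbitrary example vectors
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Hand-written versions of the four metrics
euclidean = np.sqrt(np.sum((x - y) ** 2))            # 5.0
manhattan = np.sum(np.abs(x - y))                    # 7.0
p = 3                                                # assumed order for the Minkowski example
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)
cosine = 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Cross-check against SciPy's implementations
print(np.isclose(euclidean, distance.euclidean(x, y)))
print(np.isclose(manhattan, distance.cityblock(x, y)))
print(np.isclose(minkowski, distance.minkowski(x, y, p=p)))
print(np.isclose(cosine, distance.cosine(x, y)))
```

The same pair of points is 5.0 apart under Euclidean distance and 7.0 apart under Manhattan distance, so the metric really can change who counts as a close neighbor.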
🧠 Example Intuition
Imagine you’re a store manager predicting if a new customer will buy premium coffee ☕ based on age and income.
KNN looks for the K most similar customers and checks if they bought premium coffee. If most of them did → predicts “Yes”. If not → “No”.
It’s basically peer pressure, but mathematical. 😅
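Here is a rough sketch of the coffee example. The customer data is entirely hypothetical (age in years, income in thousands), invented only to show the mechanics of looking up neighbors and counting their votes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical customers: [age, annual income in $1000s]
customers = np.array([[22, 25], [25, 30], [28, 32], [30, 35],
                      [45, 90], [50, 95], [52, 110], [48, 100]])
# 1 = bought premium coffee, 0 = did not (made-up labels)
bought_premium = np.array([0, 0, 0, 0, 1, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(customers, bought_premium)

new_customer = np.array([[47, 98]])

# Who are the 3 most similar known customers?
distances, indices = knn.kneighbors(new_customer)
print("Neighbor rows:", indices[0])
print("Their purchases:", bought_premium[indices[0]])

# The prediction is the majority vote of those neighbors
prediction = knn.predict(new_customer)[0]
print("Will buy premium?", "Yes" if prediction == 1 else "No")
```

Age and income happen to be on similar numeric ranges here. With real data (say, income in dollars), you would normally scale the features first so that one of them does not dominate the distance.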
🧰 K Is for “Kool” (and “Kinda Important”)
Choosing K is crucial:
| K Value | Behavior | Analogy |
|---|---|---|
| Small (K=1) | Overfits noise | “Believes every rumor.” 🤷 |
| Large (K=15) | Smooth, general | “Listens to the community.” 🏘️ |
The sweet spot depends on your dataset size and noise level — you’ll experiment with that in the lab. 🔬
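One common way to look for that sweet spot is cross-validation over a handful of K values. The sketch below uses a synthetic dataset with some label noise (`flip_y=0.1`); the candidate list of K values is an arbitrary choice, and the best K will differ on your own data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with roughly 10% noisy labels (assumed setup for illustration)
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

k_values = [1, 3, 5, 7, 9, 11, 15, 21]
mean_scores = {}
for k in k_values:
    # 5-fold cross-validated accuracy for this K
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"K={k:>2}  mean CV accuracy={score:.3f}")
print("Best K on this data:", best_k)
```

Odd values of K are a convenient choice for binary problems because they avoid tied votes.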
🧱 Strengths & Weaknesses
| 👍 Pros | 👎 Cons |
|---|---|
| Simple, intuitive | Slow for large datasets |
| No training phase | Needs efficient search |
| Works with any data for which a distance can be defined | Sensitive to irrelevant features |
| Performs well with good scaling | Distance can be misleading in high dimensions |
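The scaling point in the table is easy to check yourself. The sketch below compares KNN with and without `StandardScaler` on scikit-learn's built-in wine dataset, whose features live on very different numeric ranges. The dataset and `n_neighbors=5` are choices made for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Same classifier, once on raw features and once behind a scaler
raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_score = cross_val_score(raw_knn, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled_knn, X, y, cv=5).mean()

print(f"Without scaling: {raw_score:.3f}")
print(f"With scaling:    {scaled_score:.3f}")
```

Putting the scaler inside a pipeline means it is re-fit on each training fold, so no information from the validation fold leaks into the scaling.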
⚙️ Libraries You’ll Use
scikit-learn → KNeighborsClassifier, KNeighborsRegressor
numpy, pandas, matplotlib
Optional: scipy.spatial for faster nearest-neighbor searches
📊 Quick Demo
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Create sample data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# "Train" KNN (fit only stores the training data)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
🧩 Visualization
```python
# Color each test point by its predicted class
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, cmap='coolwarm', s=40)
plt.title("KNN Classification – Neighborhood Watch 👀")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```
Each test point’s color is its predicted class, decided by its nearest training neighbors, because in KNN, it’s not who you are, it’s who you hang out with. 😎
💼 Business Use Cases
Customer Segmentation 🛒: Group customers by behavioral similarity.
Recommendation Systems 🎬: Suggest movies/products based on similar users.
Credit Risk Scoring 💳: Predict risk by comparing with similar borrowers.
Anomaly Detection 🚨: Spot weird patterns in fraud or network data.
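As a taste of the anomaly-detection use case, one common neighbor-based heuristic is to score each point by the distance to its k-th nearest neighbor: points with no close neighbors look suspicious. The sketch below uses synthetic data with two planted outliers, and `k = 5` is an arbitrary choice. It is one possible approach, not the only one.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # "ordinary" behavior
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])            # planted anomalies (rows 200, 201)
X = np.vstack([normal, outliers])

k = 5
# Ask for k + 1 neighbors because each point's closest match in its own dataset is itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Anomaly score: distance to the k-th real neighbor (last column)
scores = distances[:, -1]
most_suspicious = np.argsort(scores)[-2:]
print("Most isolated rows:", sorted(most_suspicious.tolist()))
```

In practice you would pick a score threshold (or a top-N cutoff) based on how many alerts the business can afford to review.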
🧪 What’s Inside This Chapter
| Section | Focus | Tagline |
|---|---|---|
| KNN Basics | How KNN thinks | “Lazy but effective.” |
| Efficient Search Structures | Speed up neighborhood lookup | “Because searching everyone’s house is slow.” 🏃‍♂️ |
| Lab – Customer Segmentation | Apply KNN to business data | “Find your customer tribe.” 🧑‍🤝‍🧑 |
💡 KNN doesn’t predict the future — it just copies what similar people did in the past.
🔗 Next Up: KNN Basics. Let’s meet our lazy genius and learn how to pick good neighbors.
⸻
Nearest Neighbour Algorithm
⸻
Distance Metrics
The core idea: classify/regress a point based on the distance to nearby points.
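Common Metrics:
• Euclidean Distance (L2 norm): \( d(x, y) = \sqrt{\sum_i (x_i - y_i)^2} \)
• Manhattan Distance (L1 norm): \( d(x, y) = \sum_i \lvert x_i - y_i \rvert \)
• Minkowski Distance (general form): \( d(x, y) = \left( \sum_i \lvert x_i - y_i \rvert^p \right)^{1/p} \)
• Cosine Similarity (for high-dimensional text data): \( \cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \)
⸻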
k-Nearest Neighbors (k-NN) for Classification
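Given a new point \(x\):
1. Find the \(k\) closest points in the training set.
2. Assign the majority class among those neighbors.
Decision rule: \( \hat{y} = \arg\max_{c} \sum_{i \in N_k(x)} \mathbb{1}[y_i = c] \), where \(N_k(x)\) is the set of the \(k\) training points nearest to \(x\).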
⸻
k-NN for Regression
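Instead of voting, take the average of the nearest neighbors’ target values: \( \hat{y} = \frac{1}{k} \sum_{i \in N_k(x)} y_i \), where \(N_k(x)\) is the set of the \(k\) training points nearest to \(x\).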
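A quick sanity check of the averaging rule: with default (uniform) weights, the prediction from scikit-learn's `KNeighborsRegressor` should match the plain mean of the targets of the neighbors it reports. The synthetic data and `k = 4` below are assumptions for the illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=50)   # noisy linear target (synthetic)

k = 4
reg = KNeighborsRegressor(n_neighbors=k).fit(X, y)

x_new = np.array([[5.0]])
_, idx = reg.kneighbors(x_new)          # indices of the k nearest training points
manual = y[idx[0]].mean()               # average their targets by hand
predicted = reg.predict(x_new)[0]       # what the regressor returns

print("Manual average:", manual)
print("predict():     ", predicted)
print("Match:", np.isclose(manual, predicted))
```

Passing `weights="distance"` to the regressor replaces the plain mean with a distance-weighted one, so closer neighbors count for more.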
⸻
Curse of Dimensionality
As dimensionality increases:
• Distances between points become less meaningful
• Points tend to become nearly equidistant in high-dimensional space
• k-NN performance drops if feature selection/dimensionality reduction isn’t applied
Example: In 100 dimensions, the difference between nearest and farthest neighbor distances may be negligible relative to the distances themselves.
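This effect can be observed with a small simulation. The sketch below draws uniform random points and measures the relative contrast, (farthest − nearest) / nearest, from one random query. The point count and the list of dimensions are arbitrary choices, and exact numbers will vary with the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

contrast = {}
for dim in [2, 10, 100, 1000]:
    points = rng.uniform(size=(n_points, dim))   # random points in the unit hypercube
    query = rng.uniform(size=dim)
    d = np.linalg.norm(points - query, axis=1)   # Euclidean distances to the query
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast[dim] = (d.max() - d.min()) / d.min()
    print(f"dim={dim:>4}  nearest={d.min():.3f}  farthest={d.max():.3f}  "
          f"relative contrast={contrast[dim]:.3f}")
```

As the dimension grows, the relative contrast shrinks, which is the numerical face of "nearest" and "farthest" losing their meaning.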
⸻
Efficient Search with KD-Trees / Ball Trees
• KD-Trees: Efficient for low-dimensional numerical data
• Ball Trees: Better for higher-dimensional or non-Euclidean metrics
These reduce search time from \(O(n)\) to roughly \(O(\log n)\) per query (in low dimensions).
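In scikit-learn, the search structure is selected with the `algorithm` parameter. The sketch below, on synthetic 3-D data, checks that brute force, a KD-tree, and a ball tree return the same neighbors; only the lookup strategy differs. The dataset size and `n_neighbors=5` are arbitrary.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 3))        # low-dimensional synthetic data
queries = rng.normal(size=(5, 3))

results = {}
for algorithm in ["brute", "kd_tree", "ball_tree"]:
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    dist, idx = nn.kneighbors(queries)
    results[algorithm] = (dist, idx)

# All three strategies should agree on who the neighbors are
same_as_brute = {
    name: np.array_equal(results[name][1], results["brute"][1])
    for name in ["kd_tree", "ball_tree"]
}
print(same_as_brute)
```

The default `algorithm="auto"` lets scikit-learn pick a strategy based on the data, which is usually a reasonable starting point.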
⸻
Python: k-NN using scikit-learn
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import classification_report, mean_squared_error

# Load classification dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# k-NN Classification
knn_cls = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn_cls.fit(X_train, y_train)
y_pred_cls = knn_cls.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred_cls))

# k-NN Regression (simulate continuous target)
rng = np.random.default_rng(42)  # seeded so the fake target is reproducible
y_reg = X[:, 0] + rng.normal(0, 0.2, size=X.shape[0])  # Fake regression target
X_train, X_test, y_train, y_test = train_test_split(X, y_reg, test_size=0.3, random_state=42)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
y_pred_reg = knn_reg.predict(X_test)
print("Regression MSE:", mean_squared_error(y_test, y_pred_reg))
```
⸻
# Your code here