🌿 Decision Trees#
Welcome to the Decision Tree dojo, where data gets sliced, diced, and neatly organized into if–then–else statements — basically, the algorithmic version of your mom deciding what to cook:
“If it’s raining → make pakoras ☔ Else if it’s sunny → ice cream 🍦 Else → leftovers 😅”
That, my friend, is a Decision Tree.
🌱 The Core Idea#
A Decision Tree works by asking a series of binary questions that split your data into smaller and smaller groups — until each group is so pure it could join a yoga retreat.
Example:
“Is income > ₹60,000?” “Yes? → Go right 🌳” “No? → Go left 🍂”
Each split reduces uncertainty — kind of like narrowing down who ate the last slice of pizza at the office.
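If you squint, a trained tree is just a stack of nested if/else statements. Here's a minimal hand-written sketch of that idea (the thresholds and labels are made up for illustration, not learned from data):
# A tiny hand-written "tree": every node asks one yes/no question.
# Thresholds and labels are illustrative only, not learned from data.
def predict_churn(income: float, age: float) -> str:
    if income > 60_000:           # root question
        return "No churn"         # right branch is pure enough, so stop here
    if age < 30:                  # left branch asks a follow-up question
        return "Churn"
    return "No churn"

print(predict_churn(income=45_000, age=25))   # -> "Churn"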
🎯 The Goal: Minimize Impurity#
Decision Trees are obsessed with purity — not moral, but mathematical purity. They use measures like:
| Metric | Meaning |
|---|---|
| Gini Impurity | “How mixed-up is this node?” (0 = perfectly pure) |
| Entropy | Borrowed from physics — aka “How chaotic is this node?” |
When splitting data, the tree looks for the feature and threshold that bring the biggest drop in impurity — because fewer mixed decisions = more confident predictions.
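To make those two metrics concrete, here's a small sketch that computes Gini impurity and entropy for one node by hand (the class labels are made up; only NumPy is assumed):
import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 - sum(p_k^2). 0 means the node holds a single class.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def shannon_entropy(labels):
    # Shannon entropy: -sum(p_k * log2(p_k)). 0 means the node is perfectly pure.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = np.array([1, 1, 1, 0, 0, 0, 0, 0])        # a mixed node: 3 "Yes", 5 "No"
print(f"Gini:    {gini_impurity(node):.3f}")      # ~0.469 (fairly mixed)
print(f"Entropy: {shannon_entropy(node):.3f}")    # ~0.954 bits (fairly chaotic)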
🧮 A Quick Example#
Say we have customer data for a telecom company:
| Age | Income | Churned |
|---|---|---|
| 23 | 30K | Yes |
| 42 | 90K | No |
| 35 | 40K | Yes |
| 50 | 100K | No |
A Decision Tree might start with:
“Is Income > 60K?” If yes, most people didn’t churn → go right. If no, they probably churned → go left.
Boom 💥 — you’ve just made your first data-driven business policy.
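For the curious, here's a quick sketch of how a tree would score that 60K split on the four rows above, using Gini impurity (encoding Churned as 1/0 is an assumption for the example):
import numpy as np

income  = np.array([30, 90, 40, 100])    # in thousands, from the table above
churned = np.array([1, 0, 1, 0])         # 1 = Yes, 0 = No

def gini_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = gini_impurity(churned)                     # 0.5: a 2-vs-2 mix, maximally impure
left, right = churned[income <= 60], churned[income > 60]
children = (len(left) / len(churned)) * gini_impurity(left) \
         + (len(right) / len(churned)) * gini_impurity(right)
print(f"Impurity drop: {parent - children:.2f}")    # 0.50, a perfect split on this toy data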
🧠 Overfitting: The Tree That Knew Too Much#
Left unchecked, trees love to memorize the entire dataset — like that one intern who remembers every client’s birthday but forgets to send invoices.
This is called overfitting, and it happens when your tree becomes too deep, too specific, and too useless on new data.
So we prune it — ✂️ because in both gardening and machine learning, pruning keeps things healthy.
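In scikit-learn you can keep a tree in check two ways: pre-pruning (cap the growth up front with max_depth or min_samples_leaf) or post-pruning (grow it fully, then cut weak branches with ccp_alpha). A hedged sketch, assuming X_train and y_train already exist:
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growth early with depth / leaf-size limits.
shallow = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20, random_state=42)
shallow.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path, then refit with a chosen ccp_alpha.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # crude pick; in practice, cross-validate
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)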
⚙️ In Python#
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(
criterion="gini",
max_depth=4,
random_state=42
)
tree.fit(X_train, y_train)
You can visualize it with:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plot_tree(tree, filled=True, feature_names=feature_names)  # feature_names: your column names
plt.show()
And voilà — a tree diagram that looks suspiciously like your thought process at 3 AM before deadlines.
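Before trusting the picture, check how the tree behaves on data it hasn't seen. A quick sketch, assuming X_test and y_test come from the same train/test split as X_train and y_train, and that tree is the fitted classifier from above:
from sklearn.metrics import accuracy_score

y_pred = tree.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.2f}")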
🧩 Practice Time#
Try building a tree on your own:
Load a small dataset (e.g., Titanic survivors).
Train a DecisionTreeClassifier.
Visualize the tree.
Find out:
Which feature was split first?
How many leaves does your tree have?
Can you explain one decision path in plain English?
💡 Hint: The first split tells you what your model thinks is most important — like income, age, or whether the customer clicked “unsubscribe” three times this week.
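If you want a nudge to get started, here's one possible sketch using seaborn's built-in Titanic dataset (the chosen columns and the blunt dropna are assumptions; preprocess however you like):
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

# Keep a small numeric slice of the Titanic data for a first tree.
titanic = sns.load_dataset("titanic")[["survived", "pclass", "age", "fare", "sibsp"]].dropna()
X, y = titanic.drop(columns="survived"), titanic["survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

print(export_text(clf, feature_names=list(X.columns)))   # the first line shows the first split
print("Number of leaves:", clf.get_n_leaves())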
🌳 Coming Up Next#
Up next: we assemble an entire forest — because if one tree is good, a hundred are tree-mendous 🌲😎
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML
from collections import Counter
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
# Entropy for classification
def entropy(y):
hist = np.bincount(y)
ps = hist / len(y)
return -np.sum([p * np.log2(p) for p in ps if p > 0])
# Variance for regression
def variance(y):
return np.var(y) if len(y) > 0 else 0
# Information Gain for classification
def information_gain(X, y, feature_idx, threshold):
parent_entropy = entropy(y)
left_mask = X[:, feature_idx] <= threshold
right_mask = ~left_mask
if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
return 0
n = len(y)
n_left, n_right = np.sum(left_mask), np.sum(right_mask)
child_entropy = (n_left / n) * entropy(y[left_mask]) + (n_right / n) * entropy(y[right_mask])
return parent_entropy - child_entropy
# Variance reduction for regression
def variance_reduction(X, y, feature_idx, threshold):
parent_var = variance(y)
left_mask = X[:, feature_idx] <= threshold
right_mask = ~left_mask
if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
return 0
n = len(y)
n_left, n_right = np.sum(left_mask), np.sum(right_mask)
child_var = (n_left / n) * variance(y[left_mask]) + (n_right / n) * variance(y[right_mask])
return parent_var - child_var
# Decision Tree Node
class Node:
def __init__(self, feature_idx=None, threshold=None, left=None, right=None, value=None):
self.feature_idx = feature_idx
self.threshold = threshold
self.left = left
self.right = right
self.value = value
# Decision Tree with path tracking
class DecisionTree:
def __init__(self, max_depth=3, min_samples_split=2, criterion='entropy'):
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.criterion = criterion
self.root = None
self.boundaries = []
def fit(self, X, y):
self.root = self._grow_tree(X, y, depth=0)
def _grow_tree(self, X, y, depth):
n_samples, n_features = X.shape
if depth >= self.max_depth or n_samples < self.min_samples_split:
return Node(value=self._leaf_value(y))
best_gain = -1
best_idx, best_threshold = None, None
for feature_idx in range(n_features):
thresholds = np.unique(X[:, feature_idx])
for threshold in thresholds:
gain = (information_gain(X, y, feature_idx, threshold) if self.criterion in ['entropy']
else variance_reduction(X, y, feature_idx, threshold))
if gain > best_gain:
best_gain = gain
best_idx = feature_idx
best_threshold = threshold
if best_gain == 0:
return Node(value=self._leaf_value(y))
self.boundaries.append((best_idx, best_threshold, depth))
left_mask = X[:, best_idx] <= best_threshold
right_mask = ~left_mask
left = self._grow_tree(X[left_mask], y[left_mask], depth + 1)
right = self._grow_tree(X[right_mask], y[right_mask], depth + 1)
return Node(best_idx, best_threshold, left, right)
def _leaf_value(self, y):
return Counter(y).most_common(1)[0][0] if self.criterion in ['entropy'] else np.mean(y)
def predict(self, X):
return np.array([self._predict(x, self.root) for x in X])
def _predict(self, x, node):
if node.value is not None:
return node.value
if x[node.feature_idx] <= node.threshold:
return self._predict(x, node.left)
return self._predict(x, node.right)
def get_prediction_path(self, x):
path = []
node = self.root
while node.value is None:
path.append((node.feature_idx, node.threshold))
if x[node.feature_idx] <= node.threshold:
node = node.left
else:
node = node.right
path.append(('leaf', node.value))
return path
Classification and Regression Algorithms#
Classification (ID3-like)#
The DecisionTree class performs classification when criterion='entropy'. (Gini impurity would work the same way, but this from-scratch version only implements entropy.) It uses Information Gain to select the best feature and threshold for splitting. The tree grows recursively until it hits a stopping condition (e.g., max_depth or min_samples_split). Leaf nodes return the majority class.
Example:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train and predict
clf = DecisionTree(max_depth=3, criterion='entropy')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"Classification Accuracy: {accuracy_score(y_test, y_pred):.2f}")
Classification Accuracy: 0.97
Explanation:
Splitting: The algorithm selects splits that maximize Information Gain, reducing entropy in child nodes.
Prediction: For a new sample, the tree is traversed from root to leaf based on feature thresholds, and the majority class at the leaf is returned.
Overfitting Control: Parameters like max_depth=3 and min_samples_split=2 prevent the tree from growing too complex, reducing overfitting.
Performance: The accuracy score evaluates how well the tree generalizes to unseen data.
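Since the from-scratch DecisionTree records every split it makes in self.boundaries, you can also peek at what it learned; for example, the root split is the first entry (feature names here come from iris.feature_names):
feat_idx, threshold, depth = clf.boundaries[0]   # first split made while growing the tree
print(f"Root split: {iris.feature_names[feat_idx]} <= {threshold:.2f} (depth {depth})")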
Regression (CART-like)#
For regression, set criterion='variance'. The tree uses variance reduction to choose splits, and leaf nodes return the mean of the target values in that region.
Example:
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
# Generate regression data
X, y = make_regression(n_samples=100, n_features=4, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train and predict
reg = DecisionTree(max_depth=3, criterion='variance')
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print(f"Regression MSE: {mean_squared_error(y_test, y_pred):.2f}")
Regression MSE: 2358.02
Explanation:
Splitting: Splits are chosen to maximize variance reduction, ensuring child nodes have more similar target values.
Prediction: The tree traverses to a leaf, returning the mean target value of the training samples in that leaf.
Overfitting Control: Limiting max_depth and setting min_samples_split ensures the tree doesn't fit noise in the data.
Performance: Mean Squared Error (MSE) measures the average squared difference between predicted and actual values, indicating regression quality.
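As a sanity check, you can compare the from-scratch tree against scikit-learn's DecisionTreeRegressor on the same split; with the same depth limit the errors should land in a similar ballpark (an expectation to verify, not a guarantee):
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

sk_reg = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)
print(f"scikit-learn Regression MSE: {mean_squared_error(y_test, sk_reg.predict(X_test)):.2f}")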
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML
from collections import Counter
from sklearn.datasets import make_moons, make_friedman1
from sklearn.model_selection import train_test_split
# [Keep all existing functions and classes identical until the animation function]
def animate_decision_tree(X, y, tree, test_point, task='classification'):
fig, ax = plt.subplots(figsize=(12, 8))
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
np.linspace(y_min, y_max, 300))
grid = np.c_[xx.ravel(), yy.ravel()]
# Predict for the grid
Z = tree.predict(grid)
Z = Z.reshape(xx.shape)
# Plot setup
title = f'Decision Tree {"Classifier" if task == "classification" else "Regressor"}'
cmap = plt.cm.RdYlBu if task == 'classification' else plt.cm.viridis
if task == 'classification':
cont = ax.contourf(xx, yy, Z, cmap=cmap, alpha=0.4, levels=20)
sc = ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolors='k', s=60)
else:
cont = ax.contourf(xx, yy, Z, cmap=cmap, alpha=0.4, levels=20)
sc = ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolors='k', s=60)
plt.colorbar(sc, label='Target Value', ax=ax)
test_scat = ax.scatter(test_point[0], test_point[1], c='yellow', s=200,
marker='*', edgecolor='black', linewidth=1.5, label='Test Point')
ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
ax.legend(fontsize=10)
ax.set_title(title, fontsize=14, pad=20)
# Get prediction path
path = tree.get_prediction_path(test_point)
frame_descriptions = []
# Generate descriptions for each frame
for i, (feat_idx, thresh) in enumerate(path[:-1]):
direction = "left" if test_point[feat_idx] <= thresh else "right"
desc = f"Split {i+1}: x[{feat_idx}] ≤ {thresh:.2f}? ({direction} branch)"
frame_descriptions.append(desc)
frame_descriptions.append(f"Final prediction: {path[-1][1]:.2f}")
def update(frame):
ax.clear()
current_title = title + "\n" + frame_descriptions[min(frame, len(frame_descriptions)-1)]
ax.set_title(current_title, fontsize=14, pad=20)
# Recreate main elements
ax.contourf(xx, yy, Z, cmap=cmap, alpha=0.4, levels=20)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap, edgecolors='k', s=60)
ax.scatter(test_point[0], test_point[1], c='yellow', s=200,
marker='*', edgecolor='black', linewidth=1.5, label='Test Point')
ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
ax.legend(fontsize=10)
# Plot boundaries up to current frame
for i, (feat_idx, thresh, depth) in enumerate(tree.boundaries[:frame+1]):
alpha = 0.8 - depth * 0.15
lw = 2 - depth * 0.3
if feat_idx == 0:
line = ax.axvline(thresh, color='navy', linestyle='--', alpha=alpha, linewidth=lw)
ax.text(thresh, y_max - depth*0.4 - 0.1*i, f'x[{feat_idx}] ≤ {thresh:.2f}',
fontsize=10, backgroundcolor='white',
verticalalignment='top', alpha=alpha,
bbox=dict(facecolor='white', alpha=0.7, edgecolor='none'))
else:
line = ax.axhline(thresh, color='navy', linestyle='--', alpha=alpha, linewidth=lw)
ax.text(x_max - depth*0.4 - 0.1*i, thresh, f'x[{feat_idx}] ≤ {thresh:.2f}',
fontsize=10, backgroundcolor='white',
horizontalalignment='right', alpha=alpha,
bbox=dict(facecolor='white', alpha=0.7, edgecolor='none'))
# Highlight current path step
if frame < len(path):
feat_idx, thresh = path[frame]
if feat_idx != 'leaf':
decision = "Yes" if test_point[feat_idx] <= thresh else "No"
color = 'limegreen' if decision == "Yes" else 'crimson'
if feat_idx == 0:
line = ax.axvline(thresh, color=color, linewidth=3, alpha=0.9)
ax.text(thresh, np.mean([y_min, y_max]),
f'Test: x[{feat_idx}]={test_point[feat_idx]:.2f}\n≤ {thresh:.2f}? {decision}',
fontsize=11, color='black', backgroundcolor='white',
verticalalignment='center', horizontalalignment='center',
bbox=dict(facecolor=color, alpha=0.3, edgecolor='none'))
else:
line = ax.axhline(thresh, color=color, linewidth=3, alpha=0.9)
ax.text(np.mean([x_min, x_max]), thresh,
f'Test: x[{feat_idx}]={test_point[feat_idx]:.2f}\n≤ {thresh:.2f}? {decision}',
fontsize=11, color='black', backgroundcolor='white',
verticalalignment='center', horizontalalignment='center',
bbox=dict(facecolor=color, alpha=0.3, edgecolor='none'))
else:
pred_value = thresh
if task == 'classification':
pred_text = f'Predicted Class: {int(pred_value)}'
else:
pred_text = f'Predicted Value: {pred_value:.2f}'
ax.text(0.5, 0.95, pred_text,
transform=ax.transAxes, fontsize=14,
color='white', backgroundcolor='green',
horizontalalignment='center', verticalalignment='center',
bbox=dict(facecolor='green', alpha=0.7, edgecolor='none'))
return ax,
ani = FuncAnimation(fig, update, frames=len(path) + 2, interval=2000, blit=False)
plt.close()
return HTML(ani.to_html5_video())
# Create more advanced datasets
# Classification - Moons dataset
X_moons, y_moons = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X_moons, y_moons, test_size=0.2, random_state=42)
clf_tree = DecisionTree(max_depth=4, criterion='entropy')
clf_tree.fit(X_train_clf, y_train_clf)
test_point_clf = np.array([0.5, -0.3]) # Interesting point near decision boundary
# Regression - Non-linear dataset
np.random.seed(42)
X_reg = np.random.rand(300, 2) * 4 - 2
y_reg = np.sin(X_reg[:, 0] * 2) + np.cos(X_reg[:, 1] * 2) + np.random.normal(0, 0.2, 300)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
reg_tree = DecisionTree(max_depth=4, criterion='variance')
reg_tree.fit(X_train_reg, y_train_reg)
test_point_reg = np.array([-0.5, 1.2]) # Point in an interesting region
print("Classification Animation (Moons Dataset)")
display(animate_decision_tree(X_train_clf, y_train_clf, clf_tree, test_point_clf, task='classification'))
print("\nRegression Animation (Non-linear Dataset)")
display(animate_decision_tree(X_train_reg, y_train_reg, reg_tree, test_point_reg, task='regression'))
Classification Animation (Moons Dataset)
Regression Animation (Non-linear Dataset)
# Your code here