Bagging, Random Forests, and XGBoost¶

What you will learn: why single trees are unstable, how bagging reduces variance, how random forests add feature randomness, and how boosting methods reduce bias through sequential learning.

Why Ensembles Matter in Business¶

A single decision tree is easy to explain but often unstable. In fraud detection, credit scoring, and churn models, this instability creates business risk.

Ensembles combine many weak or unstable learners so predictions become more robust.

Continuity from `decision_trees.ipynb`¶

In the previous notebook, you saw that deeper trees can overfit and single trees are unstable. This chapter fixes that by combining trees intelligently.

1. Core Ideas¶

Bagging (Bootstrap Aggregating)¶

Train many base models on bootstrap-resampled datasets.
Aggregate predictions:
- classification: majority vote
- regression: average
Main effect: reduces variance.

Random Forest¶

Bagging + random subset of features at each split.
De-correlates trees and usually improves test performance.

Boosting¶

Train weak learners sequentially.
Each new learner focuses on mistakes of prior learners.
Main effect: reduces bias.

Method	Training style	Primary gain	Typical risk
Bagging	Parallel	Variance reduction	More compute
Random Forest	Parallel + feature randomness	Strong tabular baseline	Less interpretable than one tree
Boosting	Sequential	Bias reduction	Overfitting if under-regularised

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value


class DecisionTree:
    """Simple entropy-based tree used as a base learner for the custom RF."""

    def __init__(self, max_depth=3, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None

    def fit(self, X, y, feature_indices):
        self.feature_indices = feature_indices
        self.root = self._grow_tree(X, y, depth=0)

    def _grow_tree(self, X, y, depth):
        n_samples, _ = X.shape
        if depth >= self.max_depth or n_samples < self.min_samples_split or len(np.unique(y)) == 1:
            return Node(value=Counter(y).most_common(1)[0][0])

        best_feature, best_threshold = self._best_split(X, y)
        if best_feature is None:
            return Node(value=Counter(y).most_common(1)[0][0])

        left_idx = X[:, best_feature] <= best_threshold
        right_idx = ~left_idx
        left = self._grow_tree(X[left_idx], y[left_idx], depth + 1)
        right = self._grow_tree(X[right_idx], y[right_idx], depth + 1)
        return Node(best_feature, best_threshold, left, right)

    def _best_split(self, X, y):
        best_gain = -1
        best_feature, best_threshold = None, None

        for feature in self.feature_indices:
            for threshold in np.unique(X[:, feature]):
                left_idx = X[:, feature] <= threshold
                right_idx = ~left_idx
                if np.sum(left_idx) == 0 or np.sum(right_idx) == 0:
                    continue

                gain = self._information_gain(y, left_idx, right_idx)
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold

    def _information_gain(self, y, left_idx, right_idx):
        n = len(y)
        n_l, n_r = np.sum(left_idx), np.sum(right_idx)
        parent_entropy = self._entropy(y)
        child_entropy = (n_l / n) * self._entropy(y[left_idx]) + (n_r / n) * self._entropy(y[right_idx])
        return parent_entropy - child_entropy

    @staticmethod
    def _entropy(y):
        hist = np.bincount(y)
        ps = hist / len(y)
        return -np.sum([p * np.log2(p) for p in ps if p > 0])

    def predict(self, X):
        return np.array([self._traverse_tree(x, self.root) for x in X])

    def _traverse_tree(self, x, node):
        if node.value is not None:
            return node.value
        if x[node.feature] <= node.threshold:
            return self._traverse_tree(x, node.left)
        return self._traverse_tree(x, node.right)


class RandomForest:
    """Custom RF: bootstrap + random feature subsets + majority vote."""

    def __init__(self, n_trees=25, max_depth=3, min_samples_split=2, n_features=None, random_state=42):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.n_features = n_features
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def fit(self, X, y):
        self.trees = []
        n_samples, n_features_total = X.shape
        n_features = self.n_features or int(np.sqrt(n_features_total))

        for _ in range(self.n_trees):
            idx = self.rng.choice(n_samples, n_samples, replace=True)
            X_sample, y_sample = X[idx], y[idx]
            feature_indices = self.rng.choice(n_features_total, n_features, replace=False)
            tree = DecisionTree(self.max_depth, self.min_samples_split)
            tree.fit(X_sample, y_sample, feature_indices)
            self.trees.append(tree)

    def predict(self, X):
        predictions = np.array([tree.predict(X) for tree in self.trees])
        return np.apply_along_axis(lambda x: Counter(x).most_common(1)[0][0], axis=0, arr=predictions)


def plot_decision_boundary(X, y, model, title):
    x_min, x_max = X[:, 0].min() - 0.8, X[:, 0].max() + 0.8
    y_min, y_max = X[:, 1].min() - 0.8, X[:, 1].max() + 0.8
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05), np.arange(y_min, y_max, 0.05))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.figure(figsize=(6, 4.5))
    plt.contourf(xx, yy, Z, alpha=0.25, cmap=ListedColormap(['#ffb3ba', '#bae1ff']))
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap=ListedColormap(['#e74c3c', '#3498db']), s=25)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(alpha=0.2)
    plt.tight_layout()
    plt.show()


np.random.seed(42)
X = np.vstack([
    np.random.normal([2, 2], 0.55, (120, 2)),
    np.random.normal([4, 4], 0.55, (120, 2))
])
y = np.array([0] * 120 + [1] * 120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf_custom = RandomForest(n_trees=25, max_depth=4, n_features=2, random_state=42)
rf_custom.fit(X_train, y_train)
y_pred_custom = rf_custom.predict(X_test)

print(f"Custom RandomForest test accuracy: {accuracy_score(y_test, y_pred_custom):.4f}")
plot_decision_boundary(X_test, y_test, rf_custom, 'Custom RandomForest Decision Boundary')

Custom RandomForest test accuracy: 0.9833

2. Practical Benchmark: Single Tree vs Bagging vs Random Forest vs Boosting¶

We compare generalisation, not training accuracy.

same split and noisy dataset (make_moons)
models: single tree, bagging, random forest, gradient boosting
optional XGBoost with fallback if package is unavailable

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier


X, y = make_moons(n_samples=800, noise=0.28, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    'Single Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Bagging (50 trees)': BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=5, random_state=42),
        n_estimators=50,
        random_state=42
    ),
    'Random Forest (200 trees)': RandomForestClassifier(
        n_estimators=200,
        max_depth=5,
        random_state=42
    ),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=2,
        random_state=42
    )
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

try:
    from xgboost import XGBClassifier

    xgb = XGBClassifier(
        n_estimators=250,
        learning_rate=0.05,
        max_depth=3,
        subsample=0.9,
        colsample_bytree=0.9,
        random_state=42,
        eval_metric='logloss'
    )
    xgb.fit(X_train, y_train)
    scores['XGBoost (optional)'] = accuracy_score(y_test, xgb.predict(X_test))
except Exception as exc:
    print(f"XGBoost skipped: {type(exc).__name__}: {exc}")

print('Test accuracy comparison:')
for k, v in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {k:<28} {v:.4f}")

plt.figure(figsize=(9, 4.5))
names = list(scores.keys())
vals = [scores[n] for n in names]
bars = plt.bar(names, vals, color=['#8ecae6', '#219ebc', '#023047', '#fb8500', '#90be6d'][:len(names)])
plt.ylim(max(0.75, min(vals) - 0.03), min(1.0, max(vals) + 0.03))
plt.ylabel('Test accuracy')
plt.title('Ensemble Methods vs Single Decision Tree')
plt.xticks(rotation=20, ha='right')
for b, v in zip(bars, vals):
    plt.text(b.get_x() + b.get_width() / 2, v + 0.002, f"{v:.3f}", ha='center', fontsize=9)
plt.grid(axis='y', alpha=0.25)
plt.tight_layout()
plt.show()

Test accuracy comparison:
  Bagging (50 trees)           0.9400
  Random Forest (200 trees)    0.9350
  Gradient Boosting            0.9300
  XGBoost (optional)           0.9300
  Single Tree                  0.9200

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class AdaBoostFromScratch:
    """Binary AdaBoost with decision stumps and weighted errors."""

    def __init__(self, n_estimators=40):
        self.n_estimators = n_estimators
        self.alphas = []
        self.models = []

    def fit(self, X, y_pm1):
        n_samples = X.shape[0]
        w = np.ones(n_samples) / n_samples

        self.alphas, self.models = [], []

        for _ in range(self.n_estimators):
            stump = DecisionTreeClassifier(max_depth=1, random_state=42)
            stump.fit(X, y_pm1, sample_weight=w)
            pred = stump.predict(X)

            err = np.sum(w * (pred != y_pm1))
            err = np.clip(err, 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)

            w *= np.exp(-alpha * y_pm1 * pred)
            w /= np.sum(w)

            self.models.append(stump)
            self.alphas.append(alpha)

    def predict(self, X):
        scores = np.zeros(X.shape[0])
        for alpha, model in zip(self.alphas, self.models):
            scores += alpha * model.predict(X)
        return np.sign(scores)


X, y = make_classification(
    n_samples=900,
    n_features=12,
    n_informative=6,
    n_redundant=2,
    class_sep=1.1,
    random_state=42
)

y_pm1 = 2 * y - 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y_pm1, test_size=0.25, random_state=42, stratify=y_pm1
)

ada = AdaBoostFromScratch(n_estimators=50)
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)

print(f"AdaBoost (from scratch) test accuracy: {accuracy_score(y_test, y_pred):.4f}")

acc_curve = []
for m in range(1, len(ada.models) + 1):
    scores = np.zeros(X_test.shape[0])
    for alpha, model in zip(ada.alphas[:m], ada.models[:m]):
        scores += alpha * model.predict(X_test)
    acc_curve.append(accuracy_score(y_test, np.sign(scores)))

plt.figure(figsize=(8, 4))
plt.plot(range(1, len(acc_curve) + 1), acc_curve, color='#d62828', lw=2)
plt.xlabel('Number of weak learners')
plt.ylabel('Test accuracy')
plt.title('AdaBoost Learning Curve (From Scratch)')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

AdaBoost (from scratch) test accuracy: 0.8356

3. Try It in the Browser¶

Build a mini bagging vote from hand-written model predictions.

from collections import Counter

weak_model_preds = [
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 0],
]

n_customers = len(weak_model_preds[0])
ensemble_vote = []
for i in range(n_customers):
    votes = [m[i] for m in weak_model_preds]
    ensemble_vote.append(Counter(votes).most_common(1)[0][0])

print('Weak model predictions (rows)')
for row in weak_model_preds:
    print(row)

print('\nEnsemble majority-vote prediction:')
print(ensemble_vote)

Knowledge Check¶

What is the main reason Random Forest usually beats a single deep tree?¶

[ ] A) It always uses deeper trees [ ] B) It averages many decorrelated trees, reducing variance [ ] C) It scales all features automatically [ ] D) It only works on balanced data

Check

Which statement best describes boosting?¶

[ ] A) Weak learners are trained independently and then averaged [ ] B) A single deep tree is repeatedly pruned [ ] C) Learners are trained sequentially to correct prior errors [ ] D) It is identical to bagging but with more trees

Check

Common Pitfalls¶

Exercises¶

Exercise 1 - Variance reduction check¶

Train 20 independent DecisionTreeClassifier(max_depth=None) models with different random seeds and measure test accuracy variance. Then compare with RandomForestClassifier.

Exercise 2 - Boosting tuning¶

Tune n_estimators and learning_rate for Gradient Boosting on a noisy binary dataset. Plot validation accuracy vs estimator count.

Exercise 3 - Explainability trade-off¶

Compare one shallow tree and one random forest on the same data. Write a short note on interpretability vs performance.

What’s Next - Feature Importance¶

Now that you can train strong tree ensembles, the next notebook focuses on feature importance: which variables drive predictions, and how to avoid misleading interpretations when features are correlated.

Proceed to feature_importance.ipynb.

%matplotlib inline
# Exercise starter cell
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Your code here

Bagging, Random Forests, and XGBoost¶

Why Ensembles Matter in Business¶

Continuity from decision_trees.ipynb¶

1. Core Ideas¶

Bagging (Bootstrap Aggregating)¶

Random Forest¶

Boosting¶

2. Practical Benchmark: Single Tree vs Bagging vs Random Forest vs Boosting¶

3. Try It in the Browser¶

Knowledge Check¶

What is the main reason Random Forest usually beats a single deep tree?¶

Which statement best describes boosting?¶

Common Pitfalls¶

Exercises¶

Exercise 1 - Variance reduction check¶

Exercise 2 - Boosting tuning¶

Exercise 3 - Explainability trade-off¶

What’s Next - Feature Importance¶

Continuity from `decision_trees.ipynb`¶