Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Hero image

Bagging, Random Forests, and XGBoost

What you will learn: why single trees are unstable, how bagging reduces variance, how random forests add feature randomness, and how boosting methods reduce bias through sequential learning.

Why Ensembles Matter in Business

A single decision tree is easy to explain but often unstable. In fraud detection, credit scoring, and churn models, this instability creates business risk.

Ensembles combine many weak or unstable learners so predictions become more robust.

Continuity from decision_trees.ipynb

In the previous notebook, you saw that deeper trees can overfit and single trees are unstable. This chapter fixes that by combining trees intelligently.

1. Core Ideas

Bagging (Bootstrap Aggregating)

  • Train many base models on bootstrap-resampled datasets.

  • Aggregate predictions:

    • classification: majority vote

    • regression: average

  • Main effect: reduces variance.

Random Forest

  • Bagging + random subset of features at each split.

  • De-correlates trees and usually improves test performance.

Boosting

  • Train weak learners sequentially.

  • Each new learner focuses on mistakes of prior learners.

  • Main effect: reduces bias.

MethodTraining stylePrimary gainTypical risk
BaggingParallelVariance reductionMore compute
Random ForestParallel + feature randomnessStrong tabular baselineLess interpretable than one tree
BoostingSequentialBias reductionOverfitting if under-regularised
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value


class DecisionTree:
    """Simple entropy-based tree used as a base learner for the custom RF."""

    def __init__(self, max_depth=3, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None

    def fit(self, X, y, feature_indices):
        self.feature_indices = feature_indices
        self.root = self._grow_tree(X, y, depth=0)

    def _grow_tree(self, X, y, depth):
        n_samples, _ = X.shape
        if depth >= self.max_depth or n_samples < self.min_samples_split or len(np.unique(y)) == 1:
            return Node(value=Counter(y).most_common(1)[0][0])

        best_feature, best_threshold = self._best_split(X, y)
        if best_feature is None:
            return Node(value=Counter(y).most_common(1)[0][0])

        left_idx = X[:, best_feature] <= best_threshold
        right_idx = ~left_idx
        left = self._grow_tree(X[left_idx], y[left_idx], depth + 1)
        right = self._grow_tree(X[right_idx], y[right_idx], depth + 1)
        return Node(best_feature, best_threshold, left, right)

    def _best_split(self, X, y):
        best_gain = -1
        best_feature, best_threshold = None, None

        for feature in self.feature_indices:
            for threshold in np.unique(X[:, feature]):
                left_idx = X[:, feature] <= threshold
                right_idx = ~left_idx
                if np.sum(left_idx) == 0 or np.sum(right_idx) == 0:
                    continue

                gain = self._information_gain(y, left_idx, right_idx)
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold

    def _information_gain(self, y, left_idx, right_idx):
        n = len(y)
        n_l, n_r = np.sum(left_idx), np.sum(right_idx)
        parent_entropy = self._entropy(y)
        child_entropy = (n_l / n) * self._entropy(y[left_idx]) + (n_r / n) * self._entropy(y[right_idx])
        return parent_entropy - child_entropy

    @staticmethod
    def _entropy(y):
        hist = np.bincount(y)
        ps = hist / len(y)
        return -np.sum([p * np.log2(p) for p in ps if p > 0])

    def predict(self, X):
        return np.array([self._traverse_tree(x, self.root) for x in X])

    def _traverse_tree(self, x, node):
        if node.value is not None:
            return node.value
        if x[node.feature] <= node.threshold:
            return self._traverse_tree(x, node.left)
        return self._traverse_tree(x, node.right)


class RandomForest:
    """Custom RF: bootstrap + random feature subsets + majority vote."""

    def __init__(self, n_trees=25, max_depth=3, min_samples_split=2, n_features=None, random_state=42):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.n_features = n_features
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def fit(self, X, y):
        self.trees = []
        n_samples, n_features_total = X.shape
        n_features = self.n_features or int(np.sqrt(n_features_total))

        for _ in range(self.n_trees):
            idx = self.rng.choice(n_samples, n_samples, replace=True)
            X_sample, y_sample = X[idx], y[idx]
            feature_indices = self.rng.choice(n_features_total, n_features, replace=False)
            tree = DecisionTree(self.max_depth, self.min_samples_split)
            tree.fit(X_sample, y_sample, feature_indices)
            self.trees.append(tree)

    def predict(self, X):
        predictions = np.array([tree.predict(X) for tree in self.trees])
        return np.apply_along_axis(lambda x: Counter(x).most_common(1)[0][0], axis=0, arr=predictions)


def plot_decision_boundary(X, y, model, title):
    x_min, x_max = X[:, 0].min() - 0.8, X[:, 0].max() + 0.8
    y_min, y_max = X[:, 1].min() - 0.8, X[:, 1].max() + 0.8
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05), np.arange(y_min, y_max, 0.05))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.figure(figsize=(6, 4.5))
    plt.contourf(xx, yy, Z, alpha=0.25, cmap=ListedColormap(['#ffb3ba', '#bae1ff']))
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap=ListedColormap(['#e74c3c', '#3498db']), s=25)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.grid(alpha=0.2)
    plt.tight_layout()
    plt.show()


np.random.seed(42)
X = np.vstack([
    np.random.normal([2, 2], 0.55, (120, 2)),
    np.random.normal([4, 4], 0.55, (120, 2))
])
y = np.array([0] * 120 + [1] * 120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf_custom = RandomForest(n_trees=25, max_depth=4, n_features=2, random_state=42)
rf_custom.fit(X_train, y_train)
y_pred_custom = rf_custom.predict(X_test)

print(f"Custom RandomForest test accuracy: {accuracy_score(y_test, y_pred_custom):.4f}")
plot_decision_boundary(X_test, y_test, rf_custom, 'Custom RandomForest Decision Boundary')
Custom RandomForest test accuracy: 0.9833
<Figure size 600x450 with 1 Axes>

2. Practical Benchmark: Single Tree vs Bagging vs Random Forest vs Boosting

We compare generalisation, not training accuracy.

  • same split and noisy dataset (make_moons)

  • models: single tree, bagging, random forest, gradient boosting

  • optional XGBoost with fallback if package is unavailable

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier


X, y = make_moons(n_samples=800, noise=0.28, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    'Single Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Bagging (50 trees)': BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=5, random_state=42),
        n_estimators=50,
        random_state=42
    ),
    'Random Forest (200 trees)': RandomForestClassifier(
        n_estimators=200,
        max_depth=5,
        random_state=42
    ),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=2,
        random_state=42
    )
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

try:
    from xgboost import XGBClassifier

    xgb = XGBClassifier(
        n_estimators=250,
        learning_rate=0.05,
        max_depth=3,
        subsample=0.9,
        colsample_bytree=0.9,
        random_state=42,
        eval_metric='logloss'
    )
    xgb.fit(X_train, y_train)
    scores['XGBoost (optional)'] = accuracy_score(y_test, xgb.predict(X_test))
except Exception as exc:
    print(f"XGBoost skipped: {type(exc).__name__}: {exc}")

print('Test accuracy comparison:')
for k, v in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {k:<28} {v:.4f}")

plt.figure(figsize=(9, 4.5))
names = list(scores.keys())
vals = [scores[n] for n in names]
bars = plt.bar(names, vals, color=['#8ecae6', '#219ebc', '#023047', '#fb8500', '#90be6d'][:len(names)])
plt.ylim(max(0.75, min(vals) - 0.03), min(1.0, max(vals) + 0.03))
plt.ylabel('Test accuracy')
plt.title('Ensemble Methods vs Single Decision Tree')
plt.xticks(rotation=20, ha='right')
for b, v in zip(bars, vals):
    plt.text(b.get_x() + b.get_width() / 2, v + 0.002, f"{v:.3f}", ha='center', fontsize=9)
plt.grid(axis='y', alpha=0.25)
plt.tight_layout()
plt.show()
Test accuracy comparison:
  Bagging (50 trees)           0.9400
  Random Forest (200 trees)    0.9350
  Gradient Boosting            0.9300
  XGBoost (optional)           0.9300
  Single Tree                  0.9200
<Figure size 900x450 with 1 Axes>
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class AdaBoostFromScratch:
    """Binary AdaBoost with decision stumps and weighted errors."""

    def __init__(self, n_estimators=40):
        self.n_estimators = n_estimators
        self.alphas = []
        self.models = []

    def fit(self, X, y_pm1):
        n_samples = X.shape[0]
        w = np.ones(n_samples) / n_samples

        self.alphas, self.models = [], []

        for _ in range(self.n_estimators):
            stump = DecisionTreeClassifier(max_depth=1, random_state=42)
            stump.fit(X, y_pm1, sample_weight=w)
            pred = stump.predict(X)

            err = np.sum(w * (pred != y_pm1))
            err = np.clip(err, 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)

            w *= np.exp(-alpha * y_pm1 * pred)
            w /= np.sum(w)

            self.models.append(stump)
            self.alphas.append(alpha)

    def predict(self, X):
        scores = np.zeros(X.shape[0])
        for alpha, model in zip(self.alphas, self.models):
            scores += alpha * model.predict(X)
        return np.sign(scores)


X, y = make_classification(
    n_samples=900,
    n_features=12,
    n_informative=6,
    n_redundant=2,
    class_sep=1.1,
    random_state=42
)

y_pm1 = 2 * y - 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y_pm1, test_size=0.25, random_state=42, stratify=y_pm1
)

ada = AdaBoostFromScratch(n_estimators=50)
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)

print(f"AdaBoost (from scratch) test accuracy: {accuracy_score(y_test, y_pred):.4f}")

acc_curve = []
for m in range(1, len(ada.models) + 1):
    scores = np.zeros(X_test.shape[0])
    for alpha, model in zip(ada.alphas[:m], ada.models[:m]):
        scores += alpha * model.predict(X_test)
    acc_curve.append(accuracy_score(y_test, np.sign(scores)))

plt.figure(figsize=(8, 4))
plt.plot(range(1, len(acc_curve) + 1), acc_curve, color='#d62828', lw=2)
plt.xlabel('Number of weak learners')
plt.ylabel('Test accuracy')
plt.title('AdaBoost Learning Curve (From Scratch)')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
AdaBoost (from scratch) test accuracy: 0.8356
<Figure size 800x400 with 1 Axes>

3. Try It in the Browser

Build a mini bagging vote from hand-written model predictions.

from collections import Counter

weak_model_preds = [
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 0],
]

n_customers = len(weak_model_preds[0])
ensemble_vote = []
for i in range(n_customers):
    votes = [m[i] for m in weak_model_preds]
    ensemble_vote.append(Counter(votes).most_common(1)[0][0])

print('Weak model predictions (rows)')
for row in weak_model_preds:
    print(row)

print('\nEnsemble majority-vote prediction:')
print(ensemble_vote)

Knowledge Check

What is the main reason Random Forest usually beats a single deep tree?

[ ] A) It always uses deeper trees [ ] B) It averages many decorrelated trees, reducing variance [ ] C) It scales all features automatically [ ] D) It only works on balanced data
Check

Which statement best describes boosting?

[ ] A) Weak learners are trained independently and then averaged [ ] B) A single deep tree is repeatedly pruned [ ] C) Learners are trained sequentially to correct prior errors [ ] D) It is identical to bagging but with more trees
Check

Common Pitfalls

Exercises

Exercise 1 - Variance reduction check

Train 20 independent DecisionTreeClassifier(max_depth=None) models with different random seeds and measure test accuracy variance. Then compare with RandomForestClassifier.

Exercise 2 - Boosting tuning

Tune n_estimators and learning_rate for Gradient Boosting on a noisy binary dataset. Plot validation accuracy vs estimator count.

Exercise 3 - Explainability trade-off

Compare one shallow tree and one random forest on the same data. Write a short note on interpretability vs performance.

Summary
  • Bagging averages many models to reduce variance.

  • Random Forest adds feature randomness to de-correlate trees and improve robustness.

  • Boosting learns sequentially from errors, often improving bias.

  • Ensemble models are usually stronger than a single tree on tabular business datasets.

Next steps

What’s Next - Feature Importance

Now that you can train strong tree ensembles, the next notebook focuses on feature importance: which variables drive predictions, and how to avoid misleading interpretations when features are correlated.

Proceed to feature_importance.ipynb.

%matplotlib inline
# Exercise starter cell
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Your code here