In Finance and Economics, Prediction is Everywhere
We rely on various models to make critical decisions.
The Core Question: A Single Model vs. The Crowd
Any single model can have flaws, biases, or errors.
The question this chapter explores is:
If we combine many ‘pretty good’ models, can we get a ‘very powerful’ super-model?
Spoiler alert: the answer is yes. This method is called Ensemble Learning.
Today’s Learning Objectives
Understand the Theory: Use Hoeffding’s Inequality to mathematically grasp why there is ‘strength in numbers’.
Master the Mechanisms: Differentiate and master the three core ensemble mechanisms: Bagging, Boosting, and Stacking.
Build the Foundation: Deeply understand the most common building block: the Decision Tree.
Apply in Practice: Use Python to model real credit card default data and witness the power of ensembles firsthand.
Our Learning Roadmap
Part 1: The Theoretical Foundation
Why can we trust the ‘wisdom of the crowd’?
Hoeffding’s Inequality Provides the Proof
Hoeffding’s Inequality gives us a probabilistic guarantee:
The average of many independent random variables will converge to its true expected value with extremely high probability.
In other words, if you have enough samples, the sample average is a very good estimate of the true average.
This sounds abstract, so let’s use a classic coin toss experiment to build intuition.
An Intuitive Analogy: The Coin Toss Experiment
Problem: We have a potentially biased coin. The true probability of heads is p (unknown). How can we estimate p?
Method: Toss it n times and calculate the frequency of heads, h (the sample mean).
Intuition: The larger n is, the closer h should be to the true p.
Hoeffding’s Inequality precisely quantifies the degree of this ‘closeness’.
Hoeffding’s Inequality in Mathematical Terms
It states that the probability of the sample mean h being far from the true mean p by more than any small amount ε decreases exponentially as the sample size n increases.
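In its standard two-sided form for this coin-toss setting, the bound reads: \[
\large{P\left(\lvert h - p \rvert \ge \epsilon\right) \le 2\exp\left(-2 n \epsilon^{2}\right)}
\] for any margin of error ε > 0.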
This negative exponential term is the key. It means that for every additional observation (coin toss), the probability of making a large error shrinks dramatically.
The Leap from Coin Tosses to Ensemble Learning
The Ensemble’s Error Rate Decreases Exponentially
Based on this analogy, a corollary of Hoeffding’s Inequality tells us:
If we have T independent binary classifiers, each with an error rate of ε < 0.5 (i.e., better than random guessing), the ensemble model H(x) formed by simple voting will have an error probability that decreases exponentially with T:
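One commonly cited form of this bound, obtained by applying Hoeffding’s Inequality to the number of incorrect votes, is: \[
\large{P\left(H(x) \ne f(x)\right) \le \exp\left(-\frac{T}{2}\,(1 - 2\epsilon)^{2}\right)}
\] where f(x) denotes the true label. The exact constant in the exponent depends on the precise statement, but the essential point is the exponential decay in T.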
Ensemble learning works because it relies on two key assumptions:
Independence: Each ‘weak learner’ needs to be different. If they all make the same mistakes, the ensemble provides no benefit.
Better than Random: Each ‘weak learner’ must have an accuracy slightly better than guessing (for binary classification, > 50%).
As long as these two conditions are met, we can almost always build a powerful learner by creating an ensemble.
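A quick numerical check makes this concrete. The sketch below (assuming scipy is available; the error rate of 0.4 and the values of T are purely illustrative) computes the exact probability that a simple majority vote of T independent classifiers is wrong:
# Numerical check of 'strength in numbers'.
# Assumes T independent binary classifiers, each wrong with probability eps = 0.4.
# For odd T, the majority vote errs only if more than half are wrong:
# P(Binomial(T, eps) >= (T + 1) / 2).
from scipy.stats import binom

eps = 0.4  # individual error rate, slightly better than random guessing
for T in [1, 5, 25, 101, 501]:
    p_wrong = binom.sf(T // 2, T, eps)  # P(more than T // 2 classifiers are wrong)
    print(f'T = {T:4d}   ensemble error = {p_wrong:.6f}')
The ensemble error falls rapidly toward zero as T grows, exactly as the bound predicts.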
Part 2: The Three Core Mechanisms
In practice, how do we create a group of ‘weak learners’ that satisfy those two conditions?
The Three Schools of Ensemble Learning
Mechanism 1: Bagging (Bootstrap Aggregating)
The workflow for Bagging is very intuitive: Bootstrap + Aggregating (a minimal code sketch follows the steps below).
Bootstrap: From the original training set D, create T new training sets D_1, D_2, ..., D_T of the same size by sampling with replacement.
Train: On each new training set D_t, independently and in parallel, train a base learner h_t.
Aggregate:
Classification: Simple voting.
Regression: Simple averaging.
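Here is a minimal from-scratch sketch of this workflow (the synthetic data and all hyperparameters are illustrative; in practice scikit-learn’s BaggingClassifier or RandomForestClassifier does this for you):
# Bagging from scratch: bootstrap sampling + majority voting over T trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

T = 25
learners = []
for t in range(T):
    # Bootstrap: sample the training set with replacement, same size as the original
    X_boot, y_boot = resample(X_tr, y_tr, replace=True, random_state=t)
    learners.append(DecisionTreeClassifier(random_state=t).fit(X_boot, y_boot))

# Aggregate: simple majority vote across the T trees
votes = np.stack([h.predict(X_te) for h in learners])   # shape (T, n_test)
y_vote = (votes.mean(axis=0) >= 0.5).astype(int)
print('Bagged accuracy:', round((y_vote == y_te).mean(), 4))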
The Bagging Workflow
Bagging’s Magic: Reducing Variance
Variance: The degree to which a model’s predictions fluctuate on different training sets. High variance means the model is too sensitive to the training data and is prone to overfitting.
Why Bagging Works: Each base learner sees only a subset of the data, so their individual overfitting directions may differ. By averaging or voting out these different errors, the overall volatility is smoothed out, thus reducing variance.
Most Successful Application: Random Forest.
Mechanism 2: Boosting
Boosting is a family of algorithms that ‘boosts’ weak learners into strong ones using a sequential, iterative approach.
Initialize: Assign equal weights to all training samples.
Iterative Training (t=1 to T):
Train a weak learner h_t on the currently weighted sample set.
Increase the weights of samples that h_t misclassified.
Decrease the weights of samples that h_t classified correctly.
Final Combination: The final strong learner is a weighted combination of all the weak learners.
The Boosting Workflow
Boosting’s Core: Reducing Bias
Bias: The systematic gap between a model’s predictions and the true values. High bias means the model is underfitting and hasn’t learned the data’s fundamental patterns.
Why Boosting Works: Each new learner is forced to focus on the ‘difficult’ samples that previous learners got wrong. This process continuously corrects the model’s systematic errors, gradually reducing bias.
Famous Algorithms: AdaBoost, Gradient Boosting Machines (GBM), XGBoost.
Mechanism 3: Stacking
Stacking is a more sophisticated combination strategy that tries to learn how to ‘intelligently’ combine the predictions of base learners, rather than simply voting or averaging.
Layer 0: Train several different base learners. Use their predictions as new features.
Layer 1: Train a ‘Meta-Learner’ whose input is the predictions from the Layer 0 models and whose output is the final prediction.
The Stacking Workflow
Stacking’s Advantage: Model Fusion
Core Idea: Stacking doesn’t just combine predictions; it trains a meta-model to learn when to trust which base model more.
For example: The meta-model might learn: ‘If Model A and Model B’s predictions are close, but Model C’s is very different, then the final result should lean towards the average of A and B’.
Use Case: Very popular in data science competitions (like Kaggle) for squeezing out the last bit of performance by blending the strengths of multiple high-performing models.
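For reference, scikit-learn’s StackingClassifier implements exactly this two-layer scheme. A minimal sketch, with illustrative base learners and a logistic-regression meta-learner:
# Stacking sketch: diverse Layer-0 models, logistic regression as the Layer-1 meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[                                # Layer 0: diverse base learners
        ('tree', DecisionTreeClassifier(max_depth=5, random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
        ('knn', KNeighborsClassifier(n_neighbors=15)),
    ],
    final_estimator=LogisticRegression(),       # Layer 1: the meta-learner
    cv=5,                                       # out-of-fold predictions become meta-features
)
print('Stacked CV accuracy:', cross_val_score(stack, X, y, cv=3).mean().round(4))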
Part 3: The Favorite Building Block—Decision Trees
Why spend so much time on decision trees? Because they are by far the most common and successful base learners for ensemble methods.
Decision Trees: A Natural Fit for Ensembles
Pros:
Non-linear, capable of capturing complex relationships.
Interpretable (a single tree).
Relatively fast to train.
Cons:
Very prone to overfitting. A single decision tree’s performance is often unstable (high variance).
A Perfect Match: The high variance of decision trees is exactly what Bagging (like Random Forest) is designed to combat through averaging! And the ‘weakness’ of a tree (by limiting its depth) makes it the perfect ‘raw material’ for Boosting.
How Does a Decision Tree Make Decisions?
A decision tree continuously splits a complex dataset into purer subsets by asking a series of ‘yes/no’ questions (a small worked example follows the list below).
Root Node: Represents the entire dataset.
Internal Node: Represents a test on a feature (a question).
Branch: Represents the outcome of the test (the answer).
Leaf Node: Represents the final decision class or predicted value.
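To make these terms concrete, the short sketch below fits a depth-2 tree on the classic Iris data and prints its structure: each ‘feature <= threshold’ line is an internal node (a question), each indentation level is a branch (an answer), and each ‘class: …’ line is a leaf.
# Fit a small decision tree and print its structure as text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))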
Anatomy of a Decision Tree
The Key Question: How to Choose the ‘Best’ Split?
The key to growing a decision tree is to select the optimal feature to split the data at each step.
‘Optimal’ here means: after the split, the resulting subsets are as ‘pure’ as possible.
Higher ‘purity’ means less uncertainty and a clearer classification.
An Intuitive Look at ‘Purity’
How Do We Quantify ‘Purity’?
We use two main metrics to measure ‘impurity’ or ‘disorder’:
Information Entropy (used in ID3, C4.5 algorithms)
Gini Impurity (used in the CART algorithm)
The goal is the same: choose a split that results in the minimum weighted impurity in the child nodes.
Purity Metric 1: Information Entropy
Information Entropy H(D) measures the uncertainty or disorder of a dataset D.
The higher the entropy, the more chaotic the dataset (more mixed classes).
The lower the entropy, the purer the dataset (most samples belong to one class).
For a dataset D with K classes, its information entropy is defined as: \[
\large{H(D) = - \sum_{k=1}^{K} p_k \log_2(p_k)}
\] where p_k is the proportion of samples belonging to class k.
Numerical Properties of Entropy
Imagine a dataset with two classes: Positive (+) and Negative (-).
| Scenario       | p_+ | p_- | Entropy H(D)                            | Purity  |
|----------------|-----|-----|------------------------------------------|---------|
| Perfectly Pure | 1.0 | 0.0 | -1*log2(1) - 0 = 0                       | Highest |
| Mixed          | 0.8 | 0.2 | -0.8*log2(0.8) - 0.2*log2(0.2) ≈ 0.72    | Lower   |
| Most Chaotic   | 0.5 | 0.5 | -0.5*log2(0.5) - 0.5*log2(0.5) = 1       | Lowest  |
When positive and negative cases are equally likely, uncertainty is at its maximum, and entropy is 1.
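The entropy values in the table are easy to verify numerically; a minimal helper (the function is ours, not a library routine):
# Reproduce the entropy values from the table above.
import numpy as np

def entropy(p):
    """Information entropy H(D) in bits, for a vector of class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # convention: 0 * log2(0) = 0
    return float(-np.sum(p * np.log2(p)))

for label, p in [('Perfectly Pure', [1.0, 0.0]),
                 ('Mixed',          [0.8, 0.2]),
                 ('Most Chaotic',   [0.5, 0.5])]:
    print(f'{label:15s}  H(D) = {entropy(p):.2f}')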
Splitting Criterion 1: Information Gain
The ID3 algorithm uses Information Gain as its splitting criterion.
Idea: Calculate how much the system’s uncertainty (entropy) decreases after splitting dataset D by attribute A. The larger the decrease, the better A is for classification.
Formula: \[
\large{\text{Gain}(D, A) = H(D) - H(D|A)}
\] where H(D) is the entropy before the split, and H(D|A) is the weighted average of the entropy of the subsets after the split (called conditional entropy).
Decision: Choose the attribute A that maximizes Gain(D, A) as the splitting node.
Purity Metric 2: Gini Impurity
The CART (Classification and Regression Tree) algorithm uses the Gini Index to select the splitting attribute.
Gini Impurity Gini(D): The probability of misclassifying a randomly chosen element from dataset D if it were randomly labeled according to the distribution of labels in D.
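For a dataset D with K classes and class proportions p_k, it is computed as: \[
\large{\text{Gini}(D) = \sum_{k=1}^{K} \sum_{k' \ne k} p_k\, p_{k'} = 1 - \sum_{k=1}^{K} p_k^{2}}
\]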
The smaller the Gini index, the higher the purity of the dataset.
Idea: Choose an attribute A and a split point that results in the minimum weighted Gini index after the split.
Formula: For a split on attribute A into V subsets: \[
\large{\text{GiniIndex}(D|A) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \text{Gini}(D^v)}
\]
Decision: Choose the attribute A and split that minimizes GiniIndex(D|A).
Advantage: Compared to entropy, Gini index calculation does not involve logarithms, making it computationally more efficient.
A Realistic Problem with Decision Trees: Overfitting
If a tree is allowed to grow without limits, it will continue to split until each leaf node contains only one sample. The training error will be zero, but the model will be extremely complex and generalize poorly to new data.
The Solution: Pruning
To prevent overfitting, we need to ‘prune’ the decision tree. There are two main strategies (a short code sketch follows the list below).
Pre-pruning: During the tree’s growth, if a split does not improve generalization performance (e.g., performance on a validation set decreases), stop splitting early.
Pros: Faster, produces smaller trees.
Cons: Can be ‘short-sighted’, missing good split combinations.
Post-pruning: First, grow a full decision tree. Then, from the bottom up, examine nodes. If removing a subtree improves generalization performance, prune it.
Pros: Usually more effective, less likely to miss good structures.
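As a concrete example of post-pruning, scikit-learn’s decision trees support minimal cost-complexity pruning through the ccp_alpha parameter. The sketch below (synthetic data, a deliberately simple tuning loop) grows a full tree, extracts its pruning path, and keeps the alpha that does best on a validation set:
# Post-pruning via minimal cost-complexity pruning (ccp_alpha).
# Larger alpha prunes more aggressively; we keep the alpha with the best validation score.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate alphas come from the pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    score = pruned.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f'Best ccp_alpha = {best_alpha:.5f}, validation accuracy = {best_score:.3f}')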
Bagging + Decision Trees = Random Forest
Random Forest is a highly successful extension of Bagging. It adds an extra layer of randomness on top of Bagging: feature randomness.
Construction Process:
Perform T rounds of Bootstrap sampling to get T training subsets.
For each subset, train a decision tree. When splitting each node of this tree:
Do not select the best feature from all d features.
Instead, randomly select l features (l < d), and then choose the best one from that smaller set (see the sketch after this list).
Combine the T trees through voting or averaging.
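In scikit-learn terms, the only thing separating plain Bagging of trees from a Random Forest is this per-split feature subsampling, exposed through the max_features parameter. A brief illustrative comparison on synthetic data:
# Bagging of trees vs. Random Forest: the difference is per-split feature randomness.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=40, n_informative=10, random_state=0)

# Bagging: each tree sees a bootstrap sample but considers all d features at every split
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Random Forest: bootstrap samples AND only sqrt(d) randomly chosen features per split
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)

for name, model in [('Bagging (all features per split)', bagging),
                    ('Random Forest (sqrt(d) per split)', forest)]:
    print(f'{name}: CV accuracy = {cross_val_score(model, X, y, cv=5).mean():.4f}')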
The Dual Randomness of Random Forest
Why is Random Forest More Powerful? ‘Diversity’
Role of Feature Randomness: It de-correlates the trees in the forest.
Why De-correlation Matters: Without feature randomness, every tree in the forest would likely choose the same strongest feature to split on at the root node. This would lead to very similar tree structures, diminishing the benefit of ensembling.
By forcing each tree to consider only a subset of features at each split, Random Forest ensures that each tree learns from a different ‘perspective’. This makes them ‘specialized’ in different ways. When combined, they form a more powerful and complementary team, further reducing the overall variance.
Boosting + Decision Trees = AdaBoost
AdaBoost (Adaptive Boosting) is the classic algorithm of the Boosting family.
The Core Iterative Loop:
Train a weak learner h_t.
Evaluate h_t’s performance and assign it a weight α_t (better models get higher weights).
Update the training sample weights w based on h_t’s predictions (misclassified samples get higher weights).
Repeat.
The final model is a weighted combination of all weak learners.
AdaBoost Algorithm Explained (2/4): Training & Evaluation
Loop for t = 1 to T:
Train Weak Learner: Train a weak learner h_t(x) using the current weighted training set, minimizing the weighted error.
Calculate Weighted Error Rate ε_t: The sum of weights for the samples misclassified by h_t. \[
\large{\epsilon_t = \sum_{n=1}^{N} w_{t,n} I(h_t(x_n) \ne y_n)}
\]
Calculate Learner’s Weight α_t: \[
\large{\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)}
\] The lower the error rate ε_t, the larger the weight α_t (its ‘say’ in the final vote).
AdaBoost Algorithm Explained (3/4): Updating Sample Weights
Update Sample Weights w: This is the core adaptive step. \[
\large{w_{t+1, n} = \frac{w_{t,n} \exp(-\alpha_t y_n h_t(x_n))}{Z_t}}
\] (Z_t is a normalization factor to ensure the new weights sum to 1)
Intuitive Explanation (a small numerical check follows this list):
If sample n is classified correctly (\(y_n h_t(x_n) = 1\)), the exponent is negative, and w decreases.
If sample n is misclassified (\(y_n h_t(x_n) = -1\)), the exponent is positive, and w increases.
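To see these updates in numbers, here is a tiny illustrative walk-through of one round: three samples with uniform initial weights w_1 = 1/3, and a hypothetical stump h_1 that misclassifies only the third sample.
# One AdaBoost round by hand (values are illustrative).
import numpy as np

w = np.array([1/3, 1/3, 1/3])                 # uniform initial sample weights
correct = np.array([True, True, False])       # h_1 misclassifies only sample 3

eps_1 = w[~correct].sum()                     # weighted error = 1/3
alpha_1 = 0.5 * np.log((1 - eps_1) / eps_1)   # = 0.5 * ln(2) ≈ 0.347

margin = np.where(correct, 1.0, -1.0)         # y_n * h_1(x_n): +1 if correct, -1 if not
w_new = w * np.exp(-alpha_1 * margin)
w_new /= w_new.sum()                          # normalize by Z_1

print(f'eps_1 = {eps_1:.3f}, alpha_1 = {alpha_1:.3f}, new weights = {np.round(w_new, 3)}')
After normalization, the misclassified sample’s weight doubles from 1/3 to 1/2, while each correctly classified sample’s weight shrinks to 1/4, so the next weak learner is pushed to concentrate on the hard case.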
AdaBoost Algorithm Explained (4/4): Final Combination
Final Output: Combine all T weak learners via a weighted vote, using their respective weights α_t, to form the final strong classifier.
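In symbols, the final strong classifier is: \[
\large{H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)}
\]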
Enough theory. Let’s test these models on real data.
Task: Predict whether a credit card client will default next month.
Data: ‘Default of Credit Card Clients’ dataset (from UCI).
Tools: Python, pandas, scikit-learn.
Model Comparison:
Single Decision Tree (Baseline)
Random Forest (Bagging)
AdaBoost (Boosting)
Step 1: Loading and Preparing the Data
We’ll use the ucimlrepo library to fetch the data directly and then split it into training and testing sets.
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from ucimlrepo import fetch_ucirepo

# --- Fetch data from the UCI repository ---
# This is a standardized way to load data, ensuring reproducibility.
credit_default = fetch_ucirepo(id=350)
X = credit_default.data.features
y = credit_default.data.targets.squeeze()  # Convert to a Pandas Series

# --- Split the data ---
# Split into a training set (70%) and a testing set (30%)
# random_state=42 ensures the split is the same every time
# stratify=y ensures the proportion of defaults is the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(f'Training set dimensions: {X_train.shape}')
print(f'Testing set dimensions: {X_test.shape}')
Training set dimensions: (21000, 23)
Testing set dimensions: (9000, 23)
Step 2: Training the Baseline - A Single Decision Tree
First, we’ll train a single decision tree as our performance baseline. To prevent severe overfitting, we’ll limit its maximum depth to 5.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# --- Train the model ---
# max_depth=5 limits the tree's depth to prevent overfitting
dt_clf = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_clf.fit(X_train, y_train)

# --- Evaluate the model ---
y_pred_dt = dt_clf.predict(X_test)
y_prob_dt = dt_clf.predict_proba(X_test)[:, 1]  # Get probabilities for the positive class

acc_dt = accuracy_score(y_test, y_pred_dt)
auc_dt = roc_auc_score(y_test, y_prob_dt)

print(f'Single Decision Tree (max_depth=5):')
print(f'  Accuracy: {acc_dt:.4f}')
print(f'  AUC: {auc_dt:.4f}')
Single Decision Tree (max_depth=5):
Accuracy: 0.8164
AUC: 0.7427
Step 3: Training a Bagging Model - Random Forest
Now, let’s see how a ‘forest’ of 100 decision trees performs. n_estimators is the number of base learners T.
from sklearn.ensemble import RandomForestClassifier

# --- Train the model ---
# n_estimators=100: Build 100 trees
# n_jobs=-1: Use all available CPU cores for parallel computation
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=5,
                                random_state=42, n_jobs=-1)
rf_clf.fit(X_train, y_train)

# --- Evaluate the model ---
y_pred_rf = rf_clf.predict(X_test)
y_prob_rf = rf_clf.predict_proba(X_test)[:, 1]

acc_rf = accuracy_score(y_test, y_pred_rf)
auc_rf = roc_auc_score(y_test, y_prob_rf)

print(f'Random Forest (100 trees, max_depth=5):')
print(f'  Accuracy: {acc_rf:.4f}')
print(f'  AUC: {auc_rf:.4f}')
Random Forest (100 trees, max_depth=5):
Accuracy: 0.8118
AUC: 0.7675
Observation: The Random Forest’s AUC is clearly higher than the single decision tree’s (0.7675 vs 0.7427), even though its plain accuracy is essentially unchanged; as discussed in Step 5, AUC is the more informative metric for this task.
Step 4: Training a Boosting Model - AdaBoost
Finally, let’s try AdaBoost. It also uses decision trees but employs a sequential strategy focused on correcting errors.
from sklearn.ensemble import AdaBoostClassifier

# --- Train the model ---
# AdaBoost often uses very shallow trees ('stumps'), here with max_depth=1
base_estimator = DecisionTreeClassifier(max_depth=1)
ada_clf = AdaBoostClassifier(
    estimator=base_estimator,
    n_estimators=100,
    random_state=42
)
ada_clf.fit(X_train, y_train)

# --- Evaluate the model ---
y_pred_ada = ada_clf.predict(X_test)
y_prob_ada = ada_clf.predict_proba(X_test)[:, 1]

acc_ada = accuracy_score(y_test, y_pred_ada)
auc_ada = roc_auc_score(y_test, y_prob_ada)

print(f'AdaBoost (100 stumps):')
print(f'  Accuracy: {acc_ada:.4f}')
print(f'  AUC: {auc_ada:.4f}')
Observation: AdaBoost also clearly outperforms the single decision tree in terms of AUC.
Step 5: Comparing and Visualizing the Results
Let’s visualize the results. The AUC (Area Under the Curve) is a more robust metric than accuracy for classification, as it measures the model’s ability to distinguish between positive and negative classes across all thresholds.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Set the plotting style
sns.set_theme(style='whitegrid', context='talk')
fig, ax = plt.subplots(figsize=(10, 6), dpi=100)

# Data
results = pd.DataFrame({
    'Model': ['Single Decision Tree', 'Random Forest', 'AdaBoost'],
    'AUC': [auc_dt, auc_rf, auc_ada]
}).sort_values('AUC', ascending=True)
colors = ['#A4AAAB', '#00A4E6', '#EC232A']

# Plotting
bars = ax.barh(results['Model'], results['AUC'], color=colors, height=0.6)
ax.set_xlim(0.70, 0.80)  # wide enough that all three bars, including the single tree's, are visible
ax.set_xlabel('ROC AUC Score', fontsize=14, labelpad=10)
ax.set_title('Model Performance Comparison: Credit Card Default',
             fontsize=18, pad=20, weight='bold')

# Displaying values on the bars
for bar in bars:
    width = bar.get_width()
    ax.text(width + 0.0005, bar.get_y() + bar.get_height() / 2, f'{width:.4f}',
            ha='left', va='center', fontsize=14, weight='bold')

# Beautifying the chart
ax.spines[['top', 'right', 'bottom']].set_visible(False)
ax.xaxis.grid(True, linestyle='--', which='major', color='grey', alpha=0.5)
ax.yaxis.grid(False)
ax.tick_params(axis='y', labelsize=14, length=0)
ax.tick_params(axis='x', labelsize=12)

plt.tight_layout()
plt.show()
Figure 1: AUC Performance Comparison of Three Models on the Credit Card Default Prediction Task
Ensemble models (Random Forest and AdaBoost) significantly outperform the single decision tree.
This provides strong empirical proof for our theoretical foundation: combining multiple ‘decent’ learners (decision trees) with systematic methods (Bagging, Boosting) indeed creates a more powerful model.
On this specific task, Random Forest and AdaBoost performed similarly well, both achieving excellent results.
Another Major Advantage of Ensembles: Interpretability
Ensemble models, especially tree-based ones, have another huge benefit: they can tell us which input features are most important for making the final decision.
This is crucial in economics and finance. We don’t just want to predict; we want to understand the drivers behind the prediction.
Visualizing Feature Importance
Let’s see what factors the Random Forest model considered most important for predicting credit card default.
# Get feature importances from the trained Random Forest
importances = rf_clf.feature_importances_
feature_names = X.columns
df_importance = pd.DataFrame({'feature': feature_names, 'importance': importances})
df_importance = df_importance.sort_values('importance', ascending=False).head(15)

# Set plotting style
fig, ax = plt.subplots(figsize=(10, 7), dpi=100)

# Plotting
sns.barplot(
    x='importance', y='feature', data=df_importance,
    palette='viridis', ax=ax
)

# Beautifying the chart
ax.set_title('Feature Importance Analysis (from Random Forest)',
             fontsize=18, pad=20, weight='bold')
ax.set_xlabel('Relative Importance (Mean Decrease in Impurity)', fontsize=14, labelpad=10)
ax.set_ylabel('')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(axis='both', which='major', labelsize=12)

plt.tight_layout()
plt.show()
Figure 2: Feature Importance Ranking from the Random Forest Model
Business Insights: What Drives Default Risk?
From Figure 2, we can derive very clear business insights:
Historical payment status variables (PAY_0, PAY_2, …) are by far the most important predictors. This aligns perfectly with financial common sense: a client’s recent payment history is the strongest signal of their future creditworthiness.
Bill amount (BILL_AMT1) and credit limit (LIMIT_BAL) are also very important.
Demographic information (like AGE) plays a role, but it’s far less important than the client’s recent behavioral data.
This type of analysis is invaluable for banks in setting credit policies and risk management strategies.
Chapter Summary
Theoretical Foundation: The ‘wisdom of the crowd’ has a solid mathematical basis (Hoeffding’s Inequality). Combining independent learners that are better than random can exponentially reduce the error rate.
Two Main Paths:
Bagging (e.g., Random Forest): Parallel training, uses voting/averaging to reduce variance and make the model more stable.
Boosting (e.g., AdaBoost): Sequential training, iteratively focuses on errors, uses a weighted combination to reduce bias and make the model more accurate.
Core Component: Decision trees are the ideal ‘building blocks’ for ensembles. Their inherent high variance or controllable ‘weakness’ makes them a perfect match for both Bagging and Boosting.
Practical Value: Ensemble learning not only delivers significant improvements in predictive performance but also provides deep business insights, such as feature importance.