In Finance and Economics, Prediction is Everywhere
We rely on various models to make critical decisions.
The Core Question: A Single Model vs. The Crowd
Any single model can have flaws, biases, or errors.
The question this chapter explores is:
If we combine many ‘pretty good’ models, can we get a ‘very powerful’ super-model?
Spoiler alert: the answer is yes. This method is called Ensemble Learning.
Today’s Learning Objectives
Understand the Theory: Use Hoeffding’s Inequality to mathematically grasp why there is ‘strength in numbers’.
Master the Mechanisms: Differentiate and master the three core ensemble mechanisms: Bagging, Boosting, and Stacking.
Build the Foundation: Deeply understand the most common building block: the Decision Tree.
Apply in Practice: Use Python to model real credit card default data and witness the power of ensembles firsthand.
Our Learning Roadmap
Part 1: The Theoretical Foundation
Why can we trust the ‘wisdom of the crowd’?
Hoeffding’s Inequality Provides the Proof
Hoeffding’s Inequality gives us a probabilistic guarantee:
The average of many independent random variables will converge to its true expected value with extremely high probability.
In other words, if you have enough samples, the sample average is a very good estimate of the true average.
This sounds abstract, so let’s use a classic coin toss experiment to build intuition.
An Intuitive Analogy: The Coin Toss Experiment
Problem: We have a potentially biased coin. The true probability of heads is p (unknown). How can we estimate p?
Method: Toss it n times and calculate the frequency of heads, h (the sample mean).
Intuition: The larger n is, the closer h should be to the true p.
Hoeffding’s Inequality precisely quantifies the degree of this ‘closeness’.
Hoeffding’s Inequality in Mathematical Terms
It states that the probability of the sample mean h being far from the true mean p by more than any small amount ε decreases exponentially as the sample size n increases.
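In its standard two-sided form for this coin-toss setting, the bound reads: \[
\large{P\left(\lvert h - p \rvert \ge \epsilon\right) \le 2\exp\left(-2 n \epsilon^{2}\right)}
\] for any margin of error ε > 0.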
This negative exponential term is the key. It means that for every additional observation (coin toss), the probability of making a large error shrinks dramatically.
The Leap from Coin Tosses to Ensemble Learning
The Ensemble’s Error Rate Decreases Exponentially
Based on this analogy, a corollary of Hoeffding’s Inequality tells us:
If we have T independent binary classifiers, each with an error rate of ε < 0.5 (i.e., better than random guessing), the ensemble model H(x) formed by simple voting will have an error probability that decreases exponentially with T:
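One commonly cited form of this bound, obtained by applying Hoeffding’s Inequality to the number of incorrect votes, is: \[
\large{P\left(H(x) \ne f(x)\right) \le \exp\left(-\frac{T}{2}\,(1 - 2\epsilon)^{2}\right)}
\] where f(x) denotes the true label. The exact constant in the exponent depends on the precise statement, but the essential point is the exponential decay in T.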
Ensemble learning works because it relies on two key assumptions:
Independence: Each ‘weak learner’ needs to be different. If they all make the same mistakes, the ensemble provides no benefit.
Better than Random: Each ‘weak learner’ must have an accuracy slightly better than guessing (for binary classification, > 50%).
As long as these two conditions are met, we can almost always build a powerful learner by creating an ensemble.
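A quick numerical check makes this concrete. The sketch below (assuming scipy is available; the error rate of 0.4 and the values of T are purely illustrative) computes the exact probability that a simple majority vote of T independent classifiers is wrong:
# Numerical check of 'strength in numbers'.
# Assumes T independent binary classifiers, each wrong with probability eps = 0.4.
# For odd T, the majority vote errs only if more than half are wrong:
# P(Binomial(T, eps) >= (T + 1) / 2).
from scipy.stats import binom

eps = 0.4  # individual error rate, slightly better than random guessing
for T in [1, 5, 25, 101, 501]:
    p_wrong = binom.sf(T // 2, T, eps)  # P(more than T // 2 classifiers are wrong)
    print(f'T = {T:4d}   ensemble error = {p_wrong:.6f}')
The ensemble error falls rapidly toward zero as T grows, exactly as the bound predicts.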
Part 2: The Three Core Mechanisms
In practice, how do we create a group of ‘weak learners’ that satisfy those two conditions?
The Three Schools of Ensemble Learning
Mechanism 1: Bagging (Bootstrap Aggregating)
The workflow for Bagging is very intuitive: Bootstrap + Aggregating (a minimal code sketch follows the steps below).
Bootstrap: From the original training set D, create T new training sets D_1, D_2, ..., D_T of the same size by sampling with replacement.
Train: On each new training set D_t, independently and in parallel, train a base learner h_t.
Aggregate:
Classification: Simple voting.
Regression: Simple averaging.
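Here is a minimal from-scratch sketch of this workflow (the synthetic data and all hyperparameters are illustrative; in practice scikit-learn’s BaggingClassifier or RandomForestClassifier does this for you):
# Bagging from scratch: bootstrap sampling + majority voting over T trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

T = 25
learners = []
for t in range(T):
    # Bootstrap: sample the training set with replacement, same size as the original
    X_boot, y_boot = resample(X_tr, y_tr, replace=True, random_state=t)
    learners.append(DecisionTreeClassifier(random_state=t).fit(X_boot, y_boot))

# Aggregate: simple majority vote across the T trees
votes = np.stack([h.predict(X_te) for h in learners])   # shape (T, n_test)
y_vote = (votes.mean(axis=0) >= 0.5).astype(int)
print('Bagged accuracy:', round((y_vote == y_te).mean(), 4))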
The Bagging Workflow
Bagging’s Magic: Reducing Variance
Variance: The degree to which a model’s predictions fluctuate on different training sets. High variance means the model is too sensitive to the training data and is prone to overfitting.
Why Bagging Works: Each base learner sees only a subset of the data, so their individual overfitting directions may differ. By averaging or voting out these different errors, the overall volatility is smoothed out, thus reducing variance.
Most Successful Application: Random Forest.
Mechanism 2: Boosting
Boosting is a family of algorithms that ‘boosts’ weak learners into strong ones using a sequential, iterative approach.
Initialize: Assign equal weights to all training samples.
Iterative Training (t=1 to T):
Train a weak learner h_t on the currently weighted sample set.
Increase the weights of samples that h_t misclassified.
Decrease the weights of samples that h_t classified correctly.
Final Combination: The final strong learner is a weighted combination of all the weak learners.
The Boosting Workflow
Boosting’s Core: Reducing Bias
Bias: The systematic gap between a model’s predictions and the true values. High bias means the model is underfitting and hasn’t learned the data’s fundamental patterns.
Why Boosting Works: Each new learner is forced to focus on the ‘difficult’ samples that previous learners got wrong. This process continuously corrects the model’s systematic errors, gradually reducing bias.
Famous Algorithms: AdaBoost, Gradient Boosting Machines (GBM), XGBoost.
Mechanism 3: Stacking
Stacking is a more sophisticated combination strategy that tries to learn how to ‘intelligently’ combine the predictions of base learners, rather than simply voting or averaging.
Layer 0: Train several different base learners. Use their predictions as new features.
Layer 1: Train a ‘Meta-Learner’ whose input is the predictions from the Layer 0 models and whose output is the final prediction.
The Stacking Workflow
Stacking’s Advantage: Model Fusion
Core Idea: Stacking doesn’t just combine predictions; it trains a meta-model to learn when to trust which base model more.
For example: The meta-model might learn: ‘If Model A and Model B’s predictions are close, but Model C’s is very different, then the final result should lean towards the average of A and B’.
Use Case: Very popular in data science competitions (like Kaggle) for squeezing out the last bit of performance by blending the strengths of multiple high-performing models.
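For reference, scikit-learn’s StackingClassifier implements exactly this two-layer scheme. A minimal sketch, with illustrative base learners and a logistic-regression meta-learner:
# Stacking sketch: diverse Layer-0 models, logistic regression as the Layer-1 meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[                                # Layer 0: diverse base learners
        ('tree', DecisionTreeClassifier(max_depth=5, random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
        ('knn', KNeighborsClassifier(n_neighbors=15)),
    ],
    final_estimator=LogisticRegression(),       # Layer 1: the meta-learner
    cv=5,                                       # out-of-fold predictions become meta-features
)
print('Stacked CV accuracy:', cross_val_score(stack, X, y, cv=3).mean().round(4))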
Part 3: The Favorite Building Block—Decision Trees
Why spend so much time on decision trees? Because they are by far the most common and successful base learners for ensemble methods.
Decision Trees: A Natural Fit for Ensembles
Pros:
Non-linear, capable of capturing complex relationships.
Interpretable (a single tree).
Relatively fast to train.
Cons:
Very prone to overfitting. A single decision tree’s performance is often unstable (high variance).
A Perfect Match: The high variance of decision trees is exactly what Bagging (like Random Forest) is designed to combat through averaging! And the ‘weakness’ of a tree (by limiting its depth) makes it the perfect ‘raw material’ for Boosting.
How Does a Decision Tree Make Decisions?
A decision tree continuously splits a complex dataset into purer subsets by asking a series of ‘yes/no’ questions (a small worked example follows the list below).
Root Node: Represents the entire dataset.
Internal Node: Represents a test on a feature (a question).
Branch: Represents the outcome of the test (the answer).
Leaf Node: Represents the final decision class or predicted value.
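To make these terms concrete, the short sketch below fits a depth-2 tree on the classic Iris data and prints its structure: each ‘feature <= threshold’ line is an internal node (a question), each indentation level is a branch (an answer), and each ‘class: …’ line is a leaf.
# Fit a small decision tree and print its structure as text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))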
Anatomy of a Decision Tree
The Key Question: How to Choose the ‘Best’ Split?
The key to growing a decision tree is to select the optimal feature to split the data at each step.
‘Optimal’ here means: after the split, the resulting subsets are as ‘pure’ as possible.
Higher ‘purity’ means less uncertainty and a clearer classification.
An Intuitive Look at ‘Purity’
How Do We Quantify ‘Purity’?
We use two main metrics to measure ‘impurity’ or ‘disorder’:
Information Entropy (used in ID3, C4.5 algorithms)
Gini Impurity (used in the CART algorithm)
The goal is the same: choose a split that results in the minimum weighted impurity in the child nodes.
Purity Metric 1: Information Entropy
Information Entropy H(D) measures the uncertainty or disorder of a dataset D.
The higher the entropy, the more chaotic the dataset (more mixed classes).
The lower the entropy, the purer the dataset (most samples belong to one class).
For a dataset D with K classes, its information entropy is defined as: \[
\large{H(D) = - \sum_{k=1}^{K} p_k \log_2(p_k)}
\] where p_k is the proportion of samples belonging to class k.
Numerical Properties of Entropy
Imagine a dataset with two classes: Positive (+) and Negative (-).
| Scenario       | p_+ | p_- | Entropy H(D)                            | Purity  |
|----------------|-----|-----|------------------------------------------|---------|
| Perfectly Pure | 1.0 | 0.0 | -1*log2(1) - 0 = 0                       | Highest |
| Mixed          | 0.8 | 0.2 | -0.8*log2(0.8) - 0.2*log2(0.2) ≈ 0.72    | Lower   |
| Most Chaotic   | 0.5 | 0.5 | -0.5*log2(0.5) - 0.5*log2(0.5) = 1       | Lowest  |
When positive and negative cases are equally likely, uncertainty is at its maximum, and entropy is 1.
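The entropy values in the table are easy to verify numerically; a minimal helper (the function is ours, not a library routine):
# Reproduce the entropy values from the table above.
import numpy as np

def entropy(p):
    """Information entropy H(D) in bits, for a vector of class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # convention: 0 * log2(0) = 0
    return float(-np.sum(p * np.log2(p)))

for label, p in [('Perfectly Pure', [1.0, 0.0]),
                 ('Mixed',          [0.8, 0.2]),
                 ('Most Chaotic',   [0.5, 0.5])]:
    print(f'{label:15s}  H(D) = {entropy(p):.2f}')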
Splitting Criterion 1: Information Gain
The ID3 algorithm uses Information Gain as its splitting criterion.
Idea: Calculate how much the system’s uncertainty (entropy) decreases after splitting dataset D by attribute A. The larger the decrease, the better A is for classification.
Formula: \[
\large{\text{Gain}(D, A) = H(D) - H(D|A)}
\] where H(D) is the entropy before the split, and H(D|A) is the weighted average of the entropy of the subsets after the split (called conditional entropy).
Decision: Choose the attribute A that maximizes Gain(D, A) as the splitting node.
Purity Metric 2: Gini Impurity
The CART (Classification and Regression Tree) algorithm uses the Gini Index to select the splitting attribute.
Gini Impurity Gini(D): The probability of misclassifying a randomly chosen element from dataset D if it were randomly labeled according to the distribution of labels in D.
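For a dataset D with K classes and class proportions p_k, it is computed as: \[
\large{\text{Gini}(D) = \sum_{k=1}^{K} \sum_{k' \ne k} p_k\, p_{k'} = 1 - \sum_{k=1}^{K} p_k^{2}}
\]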
The smaller the Gini index, the higher the purity of the dataset.
Idea: Choose an attribute A and a split point that results in the minimum weighted Gini index after the split.
Formula: For a split on attribute A into V subsets: \[
\large{\text{GiniIndex}(D|A) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \text{Gini}(D^v)}
\]
Decision: Choose the attribute A and split that minimizes GiniIndex(D|A).
Advantage: Compared to entropy, Gini index calculation does not involve logarithms, making it computationally more efficient.
A Realistic Problem with Decision Trees: Overfitting
If a tree is allowed to grow without limits, it will continue to split until each leaf node contains only one sample. The training error will be zero, but the model will be extremely complex and generalize poorly to new data.
The Solution: Pruning
To prevent overfitting, we need to ‘prune’ the decision tree. There are two main strategies (a short code sketch follows the list below).
Pre-pruning: During the tree’s growth, if a split does not improve generalization performance (e.g., performance on a validation set decreases), stop splitting early.
Pros: Faster, produces smaller trees.
Cons: Can be ‘short-sighted’, missing good split combinations.
Post-pruning: First, grow a full decision tree. Then, from the bottom up, examine nodes. If removing a subtree improves generalization performance, prune it.
Pros: Usually more effective, less likely to miss good structures.
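As a concrete example of post-pruning, scikit-learn’s decision trees support minimal cost-complexity pruning through the ccp_alpha parameter. The sketch below (synthetic data, a deliberately simple tuning loop) grows a full tree, extracts its pruning path, and keeps the alpha that does best on a validation set:
# Post-pruning via minimal cost-complexity pruning (ccp_alpha).
# Larger alpha prunes more aggressively; we keep the alpha with the best validation score.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate alphas come from the pruning path of the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    score = pruned.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f'Best ccp_alpha = {best_alpha:.5f}, validation accuracy = {best_score:.3f}')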
Bagging + Decision Trees = Random Forest
Random Forest is a highly successful extension of Bagging. It adds an extra layer of randomness on top of Bagging: feature randomness.
Construction Process:
Perform T rounds of Bootstrap sampling to get T training subsets.
For each subset, train a decision tree. When splitting each node of this tree:
Do not select the best feature from all d features.
Instead, randomly select l features (l < d), and then choose the best one from that smaller set (see the sketch after this list).
Combine the T trees through voting or averaging.
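In scikit-learn terms, the only thing separating plain Bagging of trees from a Random Forest is this per-split feature subsampling, exposed through the max_features parameter. A brief illustrative comparison on synthetic data:
# Bagging of trees vs. Random Forest: the difference is per-split feature randomness.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=40, n_informative=10, random_state=0)

# Bagging: each tree sees a bootstrap sample but considers all d features at every split
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Random Forest: bootstrap samples AND only sqrt(d) randomly chosen features per split
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)

for name, model in [('Bagging (all features per split)', bagging),
                    ('Random Forest (sqrt(d) per split)', forest)]:
    print(f'{name}: CV accuracy = {cross_val_score(model, X, y, cv=5).mean():.4f}')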
The Dual Randomness of Random Forest
Why is Random Forest More Powerful? ‘Diversity’
Role of Feature Randomness: It de-correlates the trees in the forest.
Why De-correlation Matters: Without feature randomness, every tree in the forest would likely choose the same strongest feature to split on at the root node. This would lead to very similar tree structures, diminishing the benefit of ensembling.
By forcing each tree to consider only a subset of features at each split, Random Forest ensures that each tree learns from a different ‘perspective’. This makes them ‘specialized’ in different ways. When combined, they form a more powerful and complementary team, further reducing the overall variance.
Boosting + Decision Trees = AdaBoost
AdaBoost (Adaptive Boosting) is the classic algorithm of the Boosting family.
The Core Iterative Loop:
Train a weak learner h_t.
Evaluate h_t’s performance and assign it a weight α_t (better models get higher weights).
Update the training sample weights w based on h_t’s predictions (misclassified samples get higher weights).
Repeat.
The final model is a weighted combination of all weak learners.
AdaBoost Algorithm Explained (2/4): Training & Evaluation
Loop for t = 1 to T:
Train Weak Learner: Train a weak learner h_t(x) using the current weighted training set, minimizing the weighted error.
Calculate Weighted Error Rate ε_t: The sum of weights for the samples misclassified by h_t. \[
\large{\epsilon_t = \sum_{n=1}^{N} w_{t,n} I(h_t(x_n) \ne y_n)}
\]
Calculate Learner’s Weight α_t: \[
\large{\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)}
\] The lower the error rate ε_t, the larger the weight α_t (its ‘say’ in the final vote).
AdaBoost Algorithm Explained (3/4): Updating Sample Weights
Update Sample Weights w: This is the core adaptive step. \[
\large{w_{t+1, n} = \frac{w_{t,n} \exp(-\alpha_t y_n h_t(x_n))}{Z_t}}
\] (Z_t is a normalization factor to ensure the new weights sum to 1)
Intuitive Explanation (a small numerical check follows this list):
If sample n is classified correctly (\(y_n h_t(x_n) = 1\)), the exponent is negative, and w decreases.
If sample n is misclassified (\(y_n h_t(x_n) = -1\)), the exponent is positive, and w increases.
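To see these updates in numbers, here is a tiny illustrative walk-through of one round: three samples with uniform initial weights w_1 = 1/3, and a hypothetical stump h_1 that misclassifies only the third sample.
# One AdaBoost round by hand (values are illustrative).
import numpy as np

w = np.array([1/3, 1/3, 1/3])                 # uniform initial sample weights
correct = np.array([True, True, False])       # h_1 misclassifies only sample 3

eps_1 = w[~correct].sum()                     # weighted error = 1/3
alpha_1 = 0.5 * np.log((1 - eps_1) / eps_1)   # = 0.5 * ln(2) ≈ 0.347

margin = np.where(correct, 1.0, -1.0)         # y_n * h_1(x_n): +1 if correct, -1 if not
w_new = w * np.exp(-alpha_1 * margin)
w_new /= w_new.sum()                          # normalize by Z_1

print(f'eps_1 = {eps_1:.3f}, alpha_1 = {alpha_1:.3f}, new weights = {np.round(w_new, 3)}')
After normalization, the misclassified sample’s weight doubles from 1/3 to 1/2, while each correctly classified sample’s weight shrinks to 1/4, so the next weak learner is pushed to concentrate on the hard case.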
AdaBoost Algorithm Explained (4/4): Final Combination
Final Output: Combine all T weak learners via a weighted vote, using their respective weights α_t, to form the final strong classifier.
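In symbols, the final strong classifier is: \[
\large{H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)}
\]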
Enough theory. Let’s test these models on real data.
Task: Predict whether a credit card client will default next month.
Data: ‘Default of Credit Card Clients’ dataset (from UCI).
Tools: Python, pandas, scikit-learn.
Model Comparison:
Single Decision Tree (Baseline)
Random Forest (Bagging)
AdaBoost (Boosting)
Step 1: Loading and Preparing the Data
We’ll use the ucimlrepo library to fetch the data directly and then split it into training and testing sets.
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from ucimlrepo import fetch_ucirepo

# --- Fetch data from the UCI repository ---
# This is a standardized way to load data, ensuring reproducibility.
credit_default = fetch_ucirepo(id=350)
X = credit_default.data.features
y = credit_default.data.targets.squeeze()  # Convert to a Pandas Series

# --- Split the data ---
# Split into a training set (70%) and a testing set (30%)
# random_state=42 ensures the split is the same every time
# stratify=y ensures the proportion of defaults is the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(f'Training set dimensions: {X_train.shape}')
print(f'Testing set dimensions: {X_test.shape}')
Training set dimensions: (21000, 23)
Testing set dimensions: (9000, 23)
Step 2: Training the Baseline - A Single Decision Tree
First, we’ll train a single decision tree as our performance baseline. To prevent severe overfitting, we’ll limit its maximum depth to 5.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# --- Train the model ---
# max_depth=5 limits the tree's depth to prevent overfitting
dt_clf = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_clf.fit(X_train, y_train)

# --- Evaluate the model ---
y_pred_dt = dt_clf.predict(X_test)
y_prob_dt = dt_clf.predict_proba(X_test)[:, 1]  # Get probabilities for the positive class

acc_dt = accuracy_score(y_test, y_pred_dt)
auc_dt = roc_auc_score(y_test, y_prob_dt)

print(f'Single Decision Tree (max_depth=5):')
print(f'  Accuracy: {acc_dt:.4f}')
print(f'  AUC: {auc_dt:.4f}')
Single Decision Tree (max_depth=5):
Accuracy: 0.8164
AUC: 0.7427
Step 3: Training a Bagging Model - Random Forest
Now, let’s see how a ‘forest’ of 100 decision trees performs. n_estimators is the number of base learners T.
from sklearn.ensemble import RandomForestClassifier

# --- Train the model ---
# n_estimators=100: Build 100 trees
# n_jobs=-1: Use all available CPU cores for parallel computation
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=5,
                                random_state=42, n_jobs=-1)
rf_clf.fit(X_train, y_train)

# --- Evaluate the model ---
y_pred_rf = rf_clf.predict(X_test)
y_prob_rf = rf_clf.predict_proba(X_test)[:, 1]

acc_rf = accuracy_score(y_test, y_pred_rf)
auc_rf = roc_auc_score(y_test, y_prob_rf)

print(f'Random Forest (100 trees, max_depth=5):')
print(f'  Accuracy: {acc_rf:.4f}')
print(f'  AUC: {auc_rf:.4f}')
Random Forest (100 trees, max_depth=5):
Accuracy: 0.8118
AUC: 0.7675
Observation: The Random Forest’s AUC is clearly higher than the single decision tree’s (0.7675 vs 0.7427), even though its plain accuracy is essentially unchanged; as discussed in Step 5, AUC is the more informative metric for this task.
Step 4: Training a Boosting Model - AdaBoost
Finally, let’s try AdaBoost. It also uses decision trees but employs a sequential strategy focused on correcting errors.
from sklearn.ensemble import AdaBoostClassifier

# --- Train the model ---
# AdaBoost often uses very shallow trees ('stumps'), here with max_depth=1
base_estimator = DecisionTreeClassifier(max_depth=1)
ada_clf = AdaBoostClassifier(
    estimator=base_estimator,
    n_estimators=100,
    random_state=42
)
ada_clf.fit(X_train, y_train)

# --- Evaluate the model ---
y_pred_ada = ada_clf.predict(X_test)
y_prob_ada = ada_clf.predict_proba(X_test)[:, 1]

acc_ada = accuracy_score(y_test, y_pred_ada)
auc_ada = roc_auc_score(y_test, y_prob_ada)

print(f'AdaBoost (100 stumps):')
print(f'  Accuracy: {acc_ada:.4f}')
print(f'  AUC: {auc_ada:.4f}')
Observation: AdaBoost also clearly outperforms the single decision tree in terms of AUC.
Step 5: Comparing and Visualizing the Results
Let’s visualize the results. The AUC (Area Under the Curve) is a more robust metric than accuracy for classification, as it measures the model’s ability to distinguish between positive and negative classes across all thresholds.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Set the plotting style
sns.set_theme(style='whitegrid', context='talk')
fig, ax = plt.subplots(figsize=(10, 6), dpi=100)

# Data
results = pd.DataFrame({
    'Model': ['Single Decision Tree', 'Random Forest', 'AdaBoost'],
    'AUC': [auc_dt, auc_rf, auc_ada]
}).sort_values('AUC', ascending=True)
colors = ['#A4AAAB', '#00A4E6', '#EC232A']

# Plotting
bars = ax.barh(results['Model'], results['AUC'], color=colors, height=0.6)
ax.set_xlim(0.70, 0.80)  # wide enough that all three bars, including the single tree's, are visible
ax.set_xlabel('ROC AUC Score', fontsize=14, labelpad=10)
ax.set_title('Model Performance Comparison: Credit Card Default',
             fontsize=18, pad=20, weight='bold')

# Displaying values on the bars
for bar in bars:
    width = bar.get_width()
    ax.text(width + 0.0005, bar.get_y() + bar.get_height() / 2, f'{width:.4f}',
            ha='left', va='center', fontsize=14, weight='bold')

# Beautifying the chart
ax.spines[['top', 'right', 'bottom']].set_visible(False)
ax.xaxis.grid(True, linestyle='--', which='major', color='grey', alpha=0.5)
ax.yaxis.grid(False)
ax.tick_params(axis='y', labelsize=14, length=0)
ax.tick_params(axis='x', labelsize=12)

plt.tight_layout()
plt.show()
Figure 1: AUC Performance Comparison of Three Models on the Credit Card Default Prediction Task
Ensemble models (Random Forest and AdaBoost) significantly outperform the single decision tree.
This provides strong empirical proof for our theoretical foundation: combining multiple ‘decent’ learners (decision trees) with systematic methods (Bagging, Boosting) indeed creates a more powerful model.
On this specific task, Random Forest and AdaBoost performed similarly well, both achieving excellent results.
Another Major Advantage of Ensembles: Interpretability
Ensemble models, especially tree-based ones, have another huge benefit: they can tell us which input features are most important for making the final decision.
This is crucial in economics and finance. We don’t just want to predict; we want to understand the drivers behind the prediction.
Visualizing Feature Importance
Let’s see what factors the Random Forest model considered most important for predicting credit card default.
# Get feature importances from the trained Random Forest
importances = rf_clf.feature_importances_
feature_names = X.columns
df_importance = pd.DataFrame({'feature': feature_names, 'importance': importances})
df_importance = df_importance.sort_values('importance', ascending=False).head(15)

# Set plotting style
fig, ax = plt.subplots(figsize=(10, 7), dpi=100)

# Plotting
sns.barplot(
    x='importance', y='feature', data=df_importance,
    palette='viridis', ax=ax
)

# Beautifying the chart
ax.set_title('Feature Importance Analysis (from Random Forest)',
             fontsize=18, pad=20, weight='bold')
ax.set_xlabel('Relative Importance (Mean Decrease in Impurity)', fontsize=14, labelpad=10)
ax.set_ylabel('')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.tick_params(axis='both', which='major', labelsize=12)

plt.tight_layout()
plt.show()
Figure 2: Feature Importance Ranking from the Random Forest Model
Business Insights: What Drives Default Risk?
From Figure 2, we can derive very clear business insights:
Historical payment status variables (PAY_0, PAY_2, …) are by far the most important predictors. This aligns perfectly with financial common sense: a client’s recent payment history is the strongest signal of their future creditworthiness.
Bill amount (BILL_AMT1) and credit limit (LIMIT_BAL) are also very important.
Demographic information (like AGE) plays a role, but it’s far less important than the client’s recent behavioral data.
This type of analysis is invaluable for banks in setting credit policies and risk management strategies.
Chapter Summary
Theoretical Foundation: The ‘wisdom of the crowd’ has a solid mathematical basis (Hoeffding’s Inequality). Combining independent learners that are better than random can exponentially reduce the error rate.
Two Main Paths:
Bagging (e.g., Random Forest): Parallel training, uses voting/averaging to reduce variance and make the model more stable.
Boosting (e.g., AdaBoost): Sequential training, iteratively focuses on errors, uses a weighted combination to reduce bias and make the model more accurate.
Core Component: Decision trees are the ideal ‘building blocks’ for ensembles. Their inherent high variance or controllable ‘weakness’ makes them a perfect match for both Bagging and Boosting.
Practical Value: Ensemble learning not only delivers significant improvements in predictive performance but also provides deep business insights, such as feature importance.