12: Tree-Based Machine Learning Methods

From Parametric to Algorithmic Models

Parametric models (OLS, Logistic, Poisson) assume a specific functional form: \(Y = f(X; \beta) + \epsilon\).

Tree-based models take a fundamentally different approach:

  • No assumed functional form — the data decides the structure
  • Partition the feature space into rectangular regions via axis-parallel splits
  • Each region gets its own simple prediction (mean or mode)

Why trees matter in finance:

  • Naturally capture nonlinear relationships and interactions
  • Handle mixed data types (continuous + categorical) without preprocessing
  • Provide interpretable if-then rules (“If ROA < 2% AND Debt > 60%, flag as high risk”)

Decision Trees: Recursive Binary Partitioning

A decision tree recursively splits the feature space using axis-parallel cuts:

[Figure: Decision Tree for Credit Risk. A simple binary decision tree with ROA at the root splitting to Debt Ratio, producing three leaf nodes for credit risk assessment:]

  • ROA < 2%? No → ✓ Healthy (91%)
  • ROA < 2%? Yes → Debt Ratio > 60%?
    • Yes → ✗ High Risk (62%)
    • No → ~ Moderate (28%)

The tree asks: “Which single variable and cutoff best separates the outcome?” — then repeats recursively.

Splitting Criteria: How Trees Choose the Best Cut

Classification trees — Measuring impurity:

| Criterion | Formula | Properties |
|---|---|---|
| Gini Impurity | \(G = 1 - \sum_{k=1}^K p_k^2\) | Range [0, 0.5] for K = 2 (generally [0, 1 − 1/K]); 0 = pure; computationally faster |
| Entropy | \(H = -\sum_{k=1}^K p_k \log_2 p_k\) | Range [0, 1] for K = 2 (generally [0, \(\log_2 K\)]); information-theoretic |

Binary case (K=2): Gini = \(2p(1-p)\), Entropy = \(-p\log_2 p - (1-p)\log_2(1-p)\)
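The binary-case formulas are small enough to check directly. A minimal sketch in plain Python (function names `gini` and `entropy` are mine, for illustration):

```python
import math

def gini(p):
    """Gini impurity of a binary node with positive-class share p: 2p(1-p)."""
    return 2 * p * (1 - p)

def entropy(p):
    """Shannon entropy (bits) of a binary node; pure nodes score exactly 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Both criteria peak at p = 0.5 (maximum impurity) and vanish at pure nodes.
print(gini(0.5), entropy(0.5))   # 0.5 1.0
print(gini(0.9), entropy(0.9))   # near-pure node: both close to 0
```

Note how the two criteria agree on the ordering of nodes, which is why they rarely disagree on the chosen split in practice.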

Regression trees — Minimize within-node variance:

\[\large{ MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2 }\]

where \(\bar{y}\) is the mean response within the node.

Information Gain = Parent impurity − Weighted average of children’s impurity

The algorithm evaluates every possible split point for every feature and picks the one with maximum Information Gain.
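The exhaustive search over one feature can be sketched in a few lines (assumes NumPy; `best_split` and the toy data are mine, for illustration):

```python
import numpy as np

def gini_impurity(y):
    """Gini = 1 - sum_k p_k^2 over the class shares in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Try every cut point on one feature; return (threshold, information gain)."""
    parent = gini_impurity(y)
    best = (None, 0.0)
    for t in np.unique(x)[:-1]:          # candidate cuts at each value but the largest
        left, right = y[x <= t], y[x > t]
        child = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(y)
        gain = parent - child            # Information Gain
        if gain > best[1]:
            best = (t, gain)
    return best

# Perfectly separable toy data: the search recovers the true boundary.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # best cut at 3.0 with gain 0.5
```

A real tree repeats this search across every feature at every node, which is why training cost grows with both depth and feature count.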

Cost-Complexity Pruning: Fighting Overfitting

An unrestricted tree will grow until every leaf is pure — memorizing the training data.

Cost-complexity pruning adds a penalty for tree size:

\[\large{ C_\alpha(T) = \underbrace{\sum_{m=1}^{|T|} \sum_{x_i \in R_m} L(y_i, c_m)}_{\text{Training loss}} + \underbrace{\alpha |T|}_{\text{Complexity penalty}} }\]

  • \(|T|\) = number of terminal nodes (leaves)
  • \(\alpha\) = complexity parameter (tuned via cross-validation)
  • Small \(\alpha\) → large tree (low bias, high variance)
  • Large \(\alpha\) → small tree (high bias, low variance)

In practice: Grow a maximum tree, then prune back using cross-validated \(\alpha\).
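The grow-then-prune workflow maps directly onto scikit-learn, where the complexity parameter \(\alpha\) is exposed as `ccp_alpha`. A minimal sketch on synthetic data (assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=4, random_state=0)

# 1) Grow a maximal tree and extract the candidate alphas along the pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# 2) Cross-validate each alpha and refit with the best one.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean() for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned.get_n_leaves())
```

Larger `ccp_alpha` values on the path correspond to progressively smaller subtrees, matching the bias-variance trade-off described above.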

Case Study: Credit Risk with Decision Trees

Data: 1,862 YRD listed companies (2023), 72 flagged as ST (Special Treatment) — 3.87%

| Feature | Description |
|---|---|
| ROA (%) | Return on Assets — profitability |
| Debt Ratio (%) | Total debt / Total assets — leverage |
| Current Ratio | Current assets / Current liabilities — liquidity |
| ln(Total Assets) | Firm size (log-transformed) |

Single Decision Tree Results (optimal depth from CV = 2):

  • Training accuracy: 96.17% | Test accuracy: 80.32%
  • Test AUC: 0.7872
  • Feature importance: ROA = 0.8952 (dominates), Debt Ratio = 0.0657
  • The tree essentially became: “Is ROA low?” — a one-rule model

Dirty Work: Structural Instability — The Butterfly Effect

The problem: Remove just 5% of training data, and the entire tree structure can change completely.

Why: The greedy splitting algorithm makes each decision based on a single threshold. Small changes in data near that threshold → different split → different subtrees → completely different model.

Implication: Never trust a single tree for production deployment.

| Perturbation | Tree Structure Change | Risk |
|---|---|---|
| Remove 5% of data | Root split variable may change | Model instability |
| Add one outlier | Entire branch restructured | Sensitivity to noise |
| Different random seed | Different train/test split | Unreproducible results |

The solution: Don’t use one tree — use hundreds. This is the motivation for ensemble methods.
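The instability is easy to observe directly. A sketch (assumes scikit-learn; the synthetic data and the 20-repetition design are mine): refit a tree on twenty different 95% subsamples and record which feature each tree puts at the root.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=4, n_informative=4,
                           n_redundant=0, random_state=1)
rng = np.random.default_rng(0)

roots = set()
for _ in range(20):
    keep = rng.choice(len(X), size=int(0.95 * len(X)), replace=False)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[keep], y[keep])
    roots.add(int(tree.tree_.feature[0]))   # feature index used at the root split
print(roots)   # often more than one root variable across 5% perturbations
```

When the competing features have similar split gains, even this mild perturbation can flip the root, and with it the entire tree below.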

Dirty Work: The XOR Problem — When Trees Need Depth

Axis-parallel splits can’t always capture interactions efficiently:

| \(X_1\) Location | \(X_2\) Location | Outcome |
|---|---|---|
| Left half | Bottom half | Class 0 |
| Left half | Top half | Class 1 |
| Right half | Bottom half | Class 1 |
| Right half | Top half | Class 0 |

This is the XOR (exclusive-or) pattern. No single split helps — accuracy at depth 1 is 0.484 (worse than random!). But at depth 2, accuracy jumps to 1.000.

Lesson: A tree must be deep enough to capture interactions. For an interaction of order \(k\) (involving \(k\) variables), you need at least \(k\) levels of splits.

Analogy: This is why neural networks need multiple layers — single-layer networks also cannot learn XOR.

Random Forest: Wisdom of the Crowd

Key idea: Grow many decorrelated trees and average their predictions.

Two sources of randomness:

  1. Bootstrap aggregation (bagging): Each tree trained on a bootstrap sample (sampling with replacement)
  2. Feature randomization: At each split, consider only \(m\) random features
    • Classification: \(m = \lfloor\sqrt{p}\rfloor\)
    • Regression: \(m = \lfloor p/3 \rfloor\)

Why it works — the variance reduction formula:

\[\large{ \text{Var}(\bar{f}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2 }\]

  • \(\rho\) = average pairwise correlation between trees
  • \(B\) = number of trees
  • Feature randomization reduces \(\rho\) → lower ensemble variance
  • More trees shrink the second term → always safe to grow more trees

Random Forest: Results on ST Prediction

500 trees, \(m = \sqrt{4} = 2\) features per split:

| Metric | Decision Tree | Random Forest | Improvement |
|---|---|---|---|
| Test Accuracy | 80.32% | 90.52% | +10.2 ppt |
| Test AUC | 0.7872 | 0.8414 | +0.054 |
| OOB Score | — | 0.8887 | Built-in validation |

Feature importance (redistribution effect):

| Feature | Single Tree | Random Forest |
|---|---|---|
| ROA | 0.895 | 0.495 |
| Debt Ratio | 0.066 | 0.180 |
| ln(Assets) | 0.027 | 0.190 |
| Current Ratio | 0.012 | 0.135 |

Critical insight: Random Forest spreads importance more evenly — revealing predictive power that a single tree’s greedy algorithm missed.
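Both the out-of-bag score and the importances fall out of one fit in scikit-learn. A minimal sketch on synthetic data (assumes scikit-learn; feature names `f0`–`f3` are placeholders, not the case-study features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# oob_score=True scores each point with the trees that never saw it (bagging
# leaves out ~37% of rows per tree), giving validation for free.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)
print(dict(zip(["f0", "f1", "f2", "f3"], rf.feature_importances_.round(3))))
```

`max_features="sqrt"` is the classification default \(m = \lfloor\sqrt{p}\rfloor\); the importances are normalized to sum to one, which is what makes the single-tree vs. forest redistribution comparable.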

Gradient Boosting and XGBoost

Boosting philosophy: Build trees sequentially, each one correcting the errors of the ensemble so far.

XGBoost objective (regularized):

\[\large{ \mathcal{L}^{(t)} = \sum_{i=1}^n L\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 }\]

  • \(f_t\) = new tree added at step \(t\)
  • \(\gamma T\) = penalty for number of leaves (structural regularization)
  • \(\frac{1}{2}\lambda\sum w_j^2\) = L2 penalty on leaf weights

Second-order Taylor approximation makes this efficient:

\[\large{ \mathcal{L}^{(t)} \approx \sum_{i} \left[g_i f_t(x_i) + \frac{1}{2}h_i f_t^2(x_i)\right] + \Omega(f_t) }\]

where \(g_i = \partial L / \partial \hat{y}_i^{(t-1)}\) (gradient) and \(h_i = \partial^2 L / \partial (\hat{y}_i^{(t-1)})^2\) (Hessian), both evaluated at the previous round's prediction.
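Grouping the sums by leaf, with \(G_j = \sum_{i \in \text{leaf } j} g_i\) and \(H_j = \sum_{i \in \text{leaf } j} h_i\), minimizing over the leaf weights gives the closed-form solution and split score from the standard XGBoost derivation:

\[\large{ w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathcal{L}^{(t)} \approx -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T }\]

Each candidate split is scored by how much it increases \(\sum_j G_j^2/(H_j + \lambda)\), net of the \(\gamma\) charged for the extra leaf — this is where the structural regularization bites.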

XGBoost: Best Model Performance

Configuration: Learning rate 0.1, max depth 3, 80% subsample, regularization λ=1, γ=0.1

| Metric | Decision Tree | Random Forest | XGBoost |
|---|---|---|---|
| Test Accuracy | 80.32% | 90.52% | 87.48% |
| Test AUC | 0.7872 | 0.8414 | 0.8532 |
| Optimal rounds | — | 500 trees | 8 rounds |

XGBoost wins on AUC (the metric that matters for ranking) while needing only 8 boosting rounds — extremely efficient.

Key hyperparameters to tune:

  • max_depth: Usually 3–6 (shallow trees prevent overfitting)
  • learning_rate (η): Smaller → more rounds needed but better generalization
  • n_estimators: Use early stopping on validation set
  • subsample: Row sampling ratio (0.8 typical)
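These knobs can be sketched with scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (the `xgboost` package exposes the analogous `max_depth`, `learning_rate`, `n_estimators`, and `subsample` parameters); the data here is synthetic, for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    max_depth=3,            # shallow trees to limit overfitting
    learning_rate=0.1,
    n_estimators=500,       # upper bound; early stopping picks the real count
    subsample=0.8,          # row sampling ratio per boosting round
    n_iter_no_change=10,    # early stopping on an internal validation split
    random_state=0,
).fit(X_tr, y_tr)
print(gbm.n_estimators_, gbm.score(X_te, y_te))
```

`n_estimators_` reports how many rounds were actually fitted before early stopping kicked in — the same mechanism behind the "8 rounds" result above.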

SHAP: Making Black Boxes Interpretable

SHAP (SHapley Additive exPlanations) assigns each feature a contribution to each prediction:

\[\large{ \phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!(p-|S|-1)!}{p!}\left[f(S \cup \{j\}) - f(S)\right] }\]

This is the Shapley value from cooperative game theory — the only attribution method satisfying:

  1. Efficiency: Feature contributions sum to the prediction
  2. Symmetry: Equal features get equal credit
  3. Linearity: Additive across models
  4. Null player: Irrelevant features get zero

Key finding: ROA is the dominant predictor — low ROA pushes SHAP values strongly positive (toward ST prediction).
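The formula can be evaluated exactly for a small toy model, which makes the axioms concrete (pure Python; `shapley` and the toy value function are mine — real tree workflows would use the `shap` package's `TreeExplainer`, which computes the same quantity efficiently):

```python
from itertools import combinations
from math import factorial

def shapley(f, p):
    """Exact Shapley values phi_j for value function f(S), S a frozenset of features."""
    N = frozenset(range(p))
    phi = []
    for j in range(p):
        total = 0.0
        others = N - {j}
        for r in range(p):
            for S in combinations(others, r):
                S = frozenset(S)
                # combinatorial weight |S|! (p - |S| - 1)! / p!
                w = factorial(len(S)) * factorial(p - len(S) - 1) / factorial(p)
                total += w * (f(S | {j}) - f(S))
        phi.append(total)
    return phi

# Toy additive model: feature 0 contributes 3, feature 1 contributes 1,
# feature 2 is irrelevant (the "null player").
f = lambda S: 3.0 * (0 in S) + 1.0 * (1 in S)
print(shapley(f, 3))   # ≈ [3.0, 1.0, 0.0]; sums to f(N) - f(∅) = 4 (efficiency)
```

The exact sum is exponential in \(p\); the point of TreeSHAP is a polynomial-time algorithm for this same value on tree ensembles.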

Heuristic 1: The XOR Trap — Why Depth Matters

Experiment: Generate 500 points on a 2D grid with XOR pattern:

\[\large{ Y = \begin{cases} 1 & \text{if } (X_1 > 0 \text{ AND } X_2 > 0) \text{ OR } (X_1 < 0 \text{ AND } X_2 < 0) \\ 0 & \text{otherwise} \end{cases} }\]

| Tree Depth | Training Accuracy | Interpretation |
|---|---|---|
| Depth = 1 | 0.484 | Worse than random! Single cut is useless. |
| Depth = 2 | 1.000 | Perfect — two cuts capture the interaction. |

The lesson: Setting max_depth=1 (decision stumps) is dangerous when interactions exist between features. Always consider interaction depth when tuning.

Financial parallel: A stock’s risk depends not just on leverage or profitability alone, but on their combination. A high-debt firm with high ROA may be fine; a high-debt firm with low ROA is in trouble. This is an interaction that depth=1 cannot see.
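The flavor of this experiment is easy to reproduce (a sketch assuming scikit-learn; on randomly scattered points the exact accuracies vary with the sample, since every root cut on balanced XOR data has near-zero gain and the greedy choice is essentially arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = ((X[:, 0] > 0) == (X[:, 1] > 0)).astype(int)   # same signs -> class 1

accs = []
for depth in (1, 2, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    accs.append(tree.score(X, y))   # training accuracy
print(accs)
# Depth 1 hovers near chance; added depth recovers the quadrant structure,
# and an unrestricted tree separates the training points completely.
```

One caveat the experiment surfaces: because the root split is uninformative for XOR, greedy trees need not just enough depth but also a lucky (or well-placed) first cut, which is another argument for ensembles.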

Heuristic 2: Fitting Noise — The Overfitting Disaster

Experiment: Generate 300 observations where Y is pure random (no relationship with X at all):

| Max Depth | Train Accuracy | Test Accuracy | Gap |
|---|---|---|---|
| Depth = 2 | 57.14% | 51.67% | 5.5 ppt |
| Depth = 5 | 81.43% | 48.33% | 33.1 ppt |
| Unlimited | 100.00% | 56.67% | 43.3 ppt |

The tree memorized pure noise with 100% training accuracy.

Key diagnostic: The gap between training and test accuracy. A gap exceeding 10 percentage points is a strong overfitting signal.

The antidote:

  1. Cross-validation for depth/pruning parameter selection
  2. Ensemble methods (RF, XGBoost) that average away the noise
  3. Regularization (min samples per leaf, max features)
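A quick version of the noise experiment (a sketch assuming scikit-learn; exact test accuracies depend on the random labels, but the pattern in the gap is robust):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 2, size=300)      # labels carry no signal at all
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gaps = {}
for depth in (2, 5, None):
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train, test = t.score(X_tr, y_tr), t.score(X_te, y_te)
    gaps[depth] = train - test
    print(depth, round(train, 3), round(test, 3), round(train - test, 3))
# The unlimited tree hits 100% train accuracy on pure noise while test
# accuracy stays near 50% — the widening gap is the overfitting signal.
```

Watching this gap as you loosen depth or leaf-size constraints is the cheapest overfitting diagnostic available.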

Model Comparison: The Full Picture

| Model | Train AUC | Test AUC | Key Strength | Key Weakness |
|---|---|---|---|---|
| Decision Tree | ~1.0 | 0.787 | Interpretable rules | Unstable, overfits |
| Random Forest | ~1.0 | 0.841 | Robust, parallel | Less interpretable |
| XGBoost | ~1.0 | 0.853 | Best accuracy, fast | Requires tuning |
| Logistic (Ch. 11) | — | 0.840 | Coefficients meaningful | Assumes linearity |

Practical guidelines:

  • Start simple: Logistic regression as baseline — interpretable and often competitive
  • Scale up: Random Forest if you need robustness with minimal tuning
  • Optimize: XGBoost for maximum predictive performance (competitions, production)
  • Always: Use SHAP for post-hoc interpretability regardless of model choice

Summary: Tree-Based Methods Toolkit

| Topic | Key Takeaway |
|---|---|
| Decision Trees | Recursive binary partitioning; greedy algorithm; axis-parallel splits |
| Gini vs. Entropy | Nearly identical in practice; Gini is the default |
| Pruning | Cost-complexity pruning via CV prevents overfitting |
| Instability | Never trust a single tree — small data changes → completely different model |
| XOR Problem | Need sufficient depth to capture variable interactions |
| Random Forest | Bagging + feature randomization → decorrelated trees → lower variance |
| XGBoost | Sequential boosting + regularization → best predictive performance |
| SHAP | Shapley values → fair, consistent, interpretable feature attributions |
| Overfitting | Unlimited trees memorize noise; always check the train-test gap |

The trajectory: OLS → GLM → Trees → Ensembles. Each step gains flexibility but loses interpretability; SHAP helps recover it.