12: Tree-Based Machine Learning Methods

From Parametric to Algorithmic Models

Parametric models (OLS, Logistic, Poisson) assume a specific functional form: \(Y = f(X; \beta) + \epsilon\).

Tree-based models take a fundamentally different approach:

  • No assumed functional form — the data decides the structure
  • Partition the feature space into rectangular regions via axis-parallel splits
  • Each region gets its own simple prediction (mean or mode)

Why trees matter in finance:

  • Naturally capture nonlinear relationships and interactions
  • Handle mixed data types (continuous + categorical) without preprocessing
  • Provide interpretable if-then rules (“If ROA < 2% AND Debt > 60%, flag as high risk”)

Decision Trees: Recursive Binary Partitioning

A decision tree recursively splits the feature space using axis-parallel cuts:

[Figure: Decision Tree for Credit Risk. A simple binary decision tree with ROA at the root splitting to Debt Ratio, producing three leaf nodes for credit risk assessment:]

  • ROA < 2%? No → ✓ Healthy (91%)
  • ROA < 2%? Yes → Debt Ratio > 60%?
    • Yes → ✗ High Risk (62%)
    • No → ~ Moderate (28%)

The tree asks: “Which single variable and cutoff best separates the outcome?” — then repeats recursively.

Splitting Criteria: How Trees Choose the Best Cut

Classification trees — Measuring impurity:

| Criterion | Formula | Properties |
|---|---|---|
| Gini Impurity | \(G = 1 - \sum_{k=1}^K p_k^2\) | Range [0, 0.5] for K = 2 (generally [0, 1 − 1/K]); 0 = pure; computationally faster |
| Entropy | \(H = -\sum_{k=1}^K p_k \log_2 p_k\) | Range [0, 1] for K = 2 (generally [0, \(\log_2 K\)]); information-theoretic |

Binary case (K=2): Gini = \(2p(1-p)\), Entropy = \(-p\log_2 p - (1-p)\log_2(1-p)\)
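The binary-case formulas are small enough to check directly. A minimal sketch in plain Python (function names `gini` and `entropy` are mine, for illustration):

```python
import math

def gini(p):
    """Gini impurity of a binary node with positive-class share p: 2p(1-p)."""
    return 2 * p * (1 - p)

def entropy(p):
    """Shannon entropy (bits) of a binary node; pure nodes score exactly 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Both criteria peak at p = 0.5 (maximum impurity) and vanish at pure nodes.
print(gini(0.5), entropy(0.5))   # 0.5 1.0
print(gini(0.9), entropy(0.9))   # near-pure node: both close to 0
```

Note how the two criteria agree on the ordering of nodes, which is why they rarely disagree on the chosen split in practice.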

Regression trees — Minimize within-node variance:

\[\large{ MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2 }\]

where \(\bar{y}\) is the mean response within the node.

Information Gain = Parent impurity − Weighted average of children’s impurity

The algorithm evaluates every possible split point for every feature and picks the one with maximum Information Gain.
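The exhaustive search over one feature can be sketched in a few lines (assumes NumPy; `best_split` and the toy data are mine, for illustration):

```python
import numpy as np

def gini_impurity(y):
    """Gini = 1 - sum_k p_k^2 over the class shares in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Try every cut point on one feature; return (threshold, information gain)."""
    parent = gini_impurity(y)
    best = (None, 0.0)
    for t in np.unique(x)[:-1]:          # candidate cuts at each value but the largest
        left, right = y[x <= t], y[x > t]
        child = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(y)
        gain = parent - child            # Information Gain
        if gain > best[1]:
            best = (t, gain)
    return best

# Perfectly separable toy data: the search recovers the true boundary.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))   # best cut at 3.0 with gain 0.5
```

A real tree repeats this search across every feature at every node, which is why training cost grows with both depth and feature count.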

Cost-Complexity Pruning: Fighting Overfitting

An unrestricted tree will grow until every leaf is pure — memorizing the training data.

Cost-complexity pruning adds a penalty for tree size:

\[\large{ C_\alpha(T) = \underbrace{\sum_{m=1}^{|T|} \sum_{x_i \in R_m} L(y_i, c_m)}_{\text{Training loss}} + \underbrace{\alpha |T|}_{\text{Complexity penalty}} }\]

  • \(|T|\) = number of terminal nodes (leaves)
  • \(\alpha\) = complexity parameter (tuned via cross-validation)
  • Small \(\alpha\) → large tree (low bias, high variance)
  • Large \(\alpha\) → small tree (high bias, low variance)

In practice: Grow a maximum tree, then prune back using cross-validated \(\alpha\).
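The grow-then-prune workflow maps directly onto scikit-learn, where the complexity parameter \(\alpha\) is exposed as `ccp_alpha`. A minimal sketch on synthetic data (assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=4, random_state=0)

# 1) Grow a maximal tree and extract the candidate alphas along the pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# 2) Cross-validate each alpha and refit with the best one.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=5).mean() for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned.get_n_leaves())
```

Larger `ccp_alpha` values on the path correspond to progressively smaller subtrees, matching the bias-variance trade-off described above.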

Case Study: Credit Risk with Decision Trees

Data: 1,862 YRD listed companies (2023), 72 flagged as ST (Special Treatment) — 3.87%

| Feature | Description |
|---|---|
| ROA (%) | Return on Assets — profitability |
| Debt Ratio (%) | Total debt / Total assets — leverage |
| Current Ratio | Current assets / Current liabilities — liquidity |
| ln(Total Assets) | Firm size (log-transformed) |

Single Decision Tree Results (optimal depth from CV = 2):

  • Training accuracy: 96.17% | Test accuracy: 80.32%
  • Test AUC: 0.7872
  • Feature importance: ROA = 0.8952 (dominates), Debt Ratio = 0.0657
  • The tree essentially became: “Is ROA low?” — a one-rule model

Dirty Work: Structural Instability — The Butterfly Effect

The problem: Remove just 5% of training data, and the entire tree structure can change completely.

Why: The greedy splitting algorithm makes each decision based on a single threshold. Small changes in data near that threshold → different split → different subtrees → completely different model.

Implication: Never trust a single tree for production deployment.

| Perturbation | Tree Structure Change | Risk |
|---|---|---|
| Remove 5% of data | Root split variable may change | Model instability |
| Add one outlier | Entire branch restructured | Sensitivity to noise |
| Different random seed | Different train/test split | Unreproducible results |

The solution: Don’t use one tree — use hundreds. This is the motivation for ensemble methods.
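The instability is easy to observe directly. A sketch (assumes scikit-learn; the synthetic data and the 20-repetition design are mine): refit a tree on twenty different 95% subsamples and record which feature each tree puts at the root.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=4, n_informative=4,
                           n_redundant=0, random_state=1)
rng = np.random.default_rng(0)

roots = set()
for _ in range(20):
    keep = rng.choice(len(X), size=int(0.95 * len(X)), replace=False)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[keep], y[keep])
    roots.add(int(tree.tree_.feature[0]))   # feature index used at the root split
print(roots)   # often more than one root variable across 5% perturbations
```

When the competing features have similar split gains, even this mild perturbation can flip the root, and with it the entire tree below.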

Dirty Work: The XOR Problem — When Trees Need Depth

Axis-parallel splits can’t always capture interactions efficiently:

| \(X_1\) Location | \(X_2\) Location | Outcome |
|---|---|---|
| Left half | Bottom half | Class 0 |
| Left half | Top half | Class 1 |
| Right half | Bottom half | Class 1 |
| Right half | Top half | Class 0 |

This is the XOR (exclusive-or) pattern. No single split helps — accuracy at depth 1 is 0.484 (worse than random!). But at depth 2, accuracy jumps to 1.000.

Lesson: A tree must be deep enough to capture interactions. For an interaction of order \(k\) (involving \(k\) variables), you need at least \(k\) levels of splits.

Analogy: This is why neural networks need multiple layers — single-layer networks also cannot learn XOR.

Random Forest: Wisdom of the Crowd

Key idea: Grow many decorrelated trees and average their predictions.

Two sources of randomness:

  1. Bootstrap aggregation (bagging): Each tree trained on a bootstrap sample (sampling with replacement)
  2. Feature randomization: At each split, consider only \(m\) random features
    • Classification: \(m = \lfloor\sqrt{p}\rfloor\)
    • Regression: \(m = \lfloor p/3 \rfloor\)

Why it works — the variance reduction formula:

\[\large{ \text{Var}(\bar{f}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2 }\]

  • \(\rho\) = average pairwise correlation between trees
  • \(B\) = number of trees
  • Feature randomization reduces \(\rho\) → lower ensemble variance
  • More trees shrink the second term → always safe to grow more trees

Random Forest: Results on ST Prediction

500 trees, \(m = \sqrt{4} = 2\) features per split:

| Metric | Decision Tree | Random Forest | Improvement |
|---|---|---|---|
| Test Accuracy | 80.32% | 90.52% | +10.2 ppt |
| Test AUC | 0.7872 | 0.8414 | +0.054 |
| OOB Score | — | 0.8887 | Built-in validation |

Feature importance (redistribution effect):

| Feature | Single Tree | Random Forest |
|---|---|---|
| ROA | 0.895 | 0.495 |
| Debt Ratio | 0.066 | 0.180 |
| ln(Assets) | 0.027 | 0.190 |
| Current Ratio | 0.012 | 0.135 |

Critical insight: Random Forest spreads importance more evenly — revealing predictive power that a single tree’s greedy algorithm missed.
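Both the out-of-bag score and the importances fall out of one fit in scikit-learn. A minimal sketch on synthetic data (assumes scikit-learn; feature names `f0`–`f3` are placeholders, not the case-study features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# oob_score=True scores each point with the trees that never saw it (bagging
# leaves out ~37% of rows per tree), giving validation for free.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)
print(rf.oob_score_)
print(dict(zip(["f0", "f1", "f2", "f3"], rf.feature_importances_.round(3))))
```

`max_features="sqrt"` is the classification default \(m = \lfloor\sqrt{p}\rfloor\); the importances are normalized to sum to one, which is what makes the single-tree vs. forest redistribution comparable.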

Gradient Boosting and XGBoost

Boosting philosophy: Build trees sequentially, each one correcting the errors of the ensemble so far.

XGBoost objective (regularized):

\[\large{ \mathcal{L}^{(t)} = \sum_{i=1}^n L\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 }\]

  • \(f_t\) = new tree added at step \(t\)
  • \(\gamma T\) = penalty for number of leaves (structural regularization)
  • \(\frac{1}{2}\lambda\sum w_j^2\) = L2 penalty on leaf weights

Second-order Taylor approximation makes this efficient:

\[\large{ \mathcal{L}^{(t)} \approx \sum_{i} \left[g_i f_t(x_i) + \frac{1}{2}h_i f_t^2(x_i)\right] + \Omega(f_t) }\]

where \(g_i = \partial L / \partial \hat{y}_i^{(t-1)}\) (gradient) and \(h_i = \partial^2 L / \partial (\hat{y}_i^{(t-1)})^2\) (Hessian), both evaluated at the previous round's prediction.
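Grouping the sums by leaf, with \(G_j = \sum_{i \in \text{leaf } j} g_i\) and \(H_j = \sum_{i \in \text{leaf } j} h_i\), minimizing over the leaf weights gives the closed-form solution and split score from the standard XGBoost derivation:

\[\large{ w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \mathcal{L}^{(t)} \approx -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T }\]

Each candidate split is scored by how much it increases \(\sum_j G_j^2/(H_j + \lambda)\), net of the \(\gamma\) charged for the extra leaf — this is where the structural regularization bites.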

XGBoost: Best Model Performance

Configuration: Learning rate 0.1, max depth 3, 80% subsample, regularization λ=1, γ=0.1

| Metric | Decision Tree | Random Forest | XGBoost |
|---|---|---|---|
| Test Accuracy | 80.32% | 90.52% | 87.48% |
| Test AUC | 0.7872 | 0.8414 | 0.8532 |
| Optimal rounds | — | 500 trees | 8 rounds |

XGBoost wins on AUC (the metric that matters for ranking) while needing only 8 boosting rounds — extremely efficient.

Key hyperparameters to tune:

  • max_depth: Usually 3–6 (shallow trees prevent overfitting)
  • learning_rate (η): Smaller → more rounds needed but better generalization
  • n_estimators: Use early stopping on validation set
  • subsample: Row sampling ratio (0.8 typical)
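These knobs can be sketched with scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (the `xgboost` package exposes the analogous `max_depth`, `learning_rate`, `n_estimators`, and `subsample` parameters); the data here is synthetic, for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    max_depth=3,            # shallow trees to limit overfitting
    learning_rate=0.1,
    n_estimators=500,       # upper bound; early stopping picks the real count
    subsample=0.8,          # row sampling ratio per boosting round
    n_iter_no_change=10,    # early stopping on an internal validation split
    random_state=0,
).fit(X_tr, y_tr)
print(gbm.n_estimators_, gbm.score(X_te, y_te))
```

`n_estimators_` reports how many rounds were actually fitted before early stopping kicked in — the same mechanism behind the "8 rounds" result above.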

SHAP: Making Black Boxes Interpretable

SHAP (SHapley Additive exPlanations) assigns each feature a contribution to each prediction:

\[\large{ \phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!(p-|S|-1)!}{p!}\left[f(S \cup \{j\}) - f(S)\right] }\]

This is the Shapley value from cooperative game theory — the only attribution method satisfying:

  1. Efficiency: Feature contributions sum to the prediction
  2. Symmetry: Equal features get equal credit
  3. Linearity: Additive across models
  4. Null player: Irrelevant features get zero

Key finding: ROA is the dominant predictor — low ROA pushes SHAP values strongly positive (toward ST prediction).
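The formula can be evaluated exactly for a small toy model, which makes the axioms concrete (pure Python; `shapley` and the toy value function are mine — real tree workflows would use the `shap` package's `TreeExplainer`, which computes the same quantity efficiently):

```python
from itertools import combinations
from math import factorial

def shapley(f, p):
    """Exact Shapley values phi_j for value function f(S), S a frozenset of features."""
    N = frozenset(range(p))
    phi = []
    for j in range(p):
        total = 0.0
        others = N - {j}
        for r in range(p):
            for S in combinations(others, r):
                S = frozenset(S)
                # combinatorial weight |S|! (p - |S| - 1)! / p!
                w = factorial(len(S)) * factorial(p - len(S) - 1) / factorial(p)
                total += w * (f(S | {j}) - f(S))
        phi.append(total)
    return phi

# Toy additive model: feature 0 contributes 3, feature 1 contributes 1,
# feature 2 is irrelevant (the "null player").
f = lambda S: 3.0 * (0 in S) + 1.0 * (1 in S)
print(shapley(f, 3))   # ≈ [3.0, 1.0, 0.0]; sums to f(N) - f(∅) = 4 (efficiency)
```

The exact sum is exponential in \(p\); the point of TreeSHAP is a polynomial-time algorithm for this same value on tree ensembles.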

Heuristic 1: The XOR Trap — Why Depth Matters

Experiment: Generate 500 points on a 2D grid with XOR pattern:

\[\large{ Y = \begin{cases} 1 & \text{if } (X_1 > 0 \text{ AND } X_2 > 0) \text{ OR } (X_1 < 0 \text{ AND } X_2 < 0) \\ 0 & \text{otherwise} \end{cases} }\]

| Tree Depth | Training Accuracy | Interpretation |
|---|---|---|
| Depth = 1 | 0.484 | Worse than random! Single cut is useless. |
| Depth = 2 | 1.000 | Perfect — two cuts capture the interaction. |

The lesson: Setting max_depth=1 (decision stumps) is dangerous when interactions exist between features. Always consider interaction depth when tuning.

Financial parallel: A stock’s risk depends not just on leverage or profitability alone, but on their combination. A high-debt firm with high ROA may be fine; a high-debt firm with low ROA is in trouble. This is an interaction that depth=1 cannot see.
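The flavor of this experiment is easy to reproduce (a sketch assuming scikit-learn; on randomly scattered points the exact accuracies vary with the sample, since every root cut on balanced XOR data has near-zero gain and the greedy choice is essentially arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = ((X[:, 0] > 0) == (X[:, 1] > 0)).astype(int)   # same signs -> class 1

accs = []
for depth in (1, 2, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    accs.append(tree.score(X, y))   # training accuracy
print(accs)
# Depth 1 hovers near chance; added depth recovers the quadrant structure,
# and an unrestricted tree separates the training points completely.
```

One caveat the experiment surfaces: because the root split is uninformative for XOR, greedy trees need not just enough depth but also a lucky (or well-placed) first cut, which is another argument for ensembles.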

Heuristic 2: Fitting Noise — The Overfitting Disaster

Experiment: Generate 300 observations where Y is pure random (no relationship with X at all):

| Max Depth | Train Accuracy | Test Accuracy | Gap |
|---|---|---|---|
| Depth = 2 | 57.14% | 51.67% | 5.5 ppt |
| Depth = 5 | 81.43% | 48.33% | 33.1 ppt |
| Unlimited | 100.00% | 56.67% | 43.3 ppt |

The tree memorized pure noise with 100% training accuracy.

Key diagnostic: The gap between training and test accuracy. A gap exceeding 10 percentage points is a strong overfitting signal.

The antidote:

  1. Cross-validation for depth/pruning parameter selection
  2. Ensemble methods (RF, XGBoost) that average away the noise
  3. Regularization (min samples per leaf, max features)
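A quick version of the noise experiment (a sketch assuming scikit-learn; exact test accuracies depend on the random labels, but the pattern in the gap is robust):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 2, size=300)      # labels carry no signal at all
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gaps = {}
for depth in (2, 5, None):
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train, test = t.score(X_tr, y_tr), t.score(X_te, y_te)
    gaps[depth] = train - test
    print(depth, round(train, 3), round(test, 3), round(train - test, 3))
# The unlimited tree hits 100% train accuracy on pure noise while test
# accuracy stays near 50% — the widening gap is the overfitting signal.
```

Watching this gap as you loosen depth or leaf-size constraints is the cheapest overfitting diagnostic available.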

Model Comparison: The Full Picture

| Model | Train AUC | Test AUC | Key Strength | Key Weakness |
|---|---|---|---|---|
| Decision Tree | ~1.0 | 0.787 | Interpretable rules | Unstable, overfits |
| Random Forest | ~1.0 | 0.841 | Robust, parallel | Less interpretable |
| XGBoost | ~1.0 | 0.853 | Best accuracy, fast | Requires tuning |
| Logistic (Ch. 11) | — | 0.840 | Coefficients meaningful | Assumes linearity |

Practical guidelines:

  • Start simple: Logistic regression as baseline — interpretable and often competitive
  • Scale up: Random Forest if you need robustness with minimal tuning
  • Optimize: XGBoost for maximum predictive performance (competitions, production)
  • Always: Use SHAP for post-hoc interpretability regardless of model choice

Summary: Tree-Based Methods Toolkit

| Topic | Key Takeaway |
|---|---|
| Decision Trees | Recursive binary partitioning; greedy algorithm; axis-parallel splits |
| Gini vs. Entropy | Nearly identical in practice; Gini is the default |
| Pruning | Cost-complexity pruning via CV prevents overfitting |
| Instability | Never trust a single tree — small data changes → completely different model |
| XOR Problem | Need sufficient depth to capture variable interactions |
| Random Forest | Bagging + feature randomization → decorrelated trees → lower variance |
| XGBoost | Sequential boosting + regularization → best predictive performance |
| SHAP | Shapley values → fair, consistent, interpretable feature attributions |
| Overfitting | Unlimited trees memorize noise; always check the train-test gap |

The trajectory: OLS → GLM → Trees → Ensembles. Each step gains flexibility but loses interpretability; SHAP helps recover it.