Parametric models (OLS, Logistic, Poisson) assume a specific functional form: \(Y = f(X; \beta) + \epsilon\).
Tree-based models take a fundamentally different approach: rather than assuming a functional form, they learn a data-driven partition of the feature space.
Why trees matter in finance:
A decision tree recursively splits the feature space using axis-parallel cuts:
The tree asks: “Which single variable and cutoff best separates the outcome?” — then repeats recursively.
Classification trees — Measuring impurity:
| Criterion | Formula | Properties |
|---|---|---|
| Gini Impurity | \(G = 1 - \sum_{k=1}^K p_k^2\) | Range [0, 1 − 1/K] (so [0, 0.5] for binary); 0 = pure; computationally faster |
| Entropy | \(H = -\sum_{k=1}^K p_k \log_2 p_k\) | Range [0, \(\log_2 K\)] (so [0, 1] for binary); information-theoretic |
Binary case (K=2): Gini = \(2p(1-p)\), Entropy = \(-p\log_2 p - (1-p)\log_2(1-p)\)
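As a quick sanity check, the binary-case formulas can be evaluated directly (a standalone sketch using only the standard library):

```python
import math

def gini_binary(p):
    """Gini impurity of a binary node with positive-class share p."""
    return 2 * p * (1 - p)

def entropy_binary(p):
    """Entropy in bits of a binary node; defined as 0 for pure nodes."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Both peak at p = 0.5 (maximum class mixing) and vanish at purity.
print(gini_binary(0.5), entropy_binary(0.5))   # 0.5 1.0
```

A 50/50 node hits the maxima in the table above: Gini 0.5 and entropy 1.0 bit.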
Regression trees — Minimize within-node variance:
\[\large{ MSE_m = \frac{1}{n_m}\sum_{i \in R_m} (y_i - \bar{y}_m)^2 }\]
where \(\bar{y}_m\) is the mean response of the \(n_m\) observations in node \(R_m\).
Information Gain = Parent impurity − Weighted average of children’s impurity
The algorithm evaluates every possible split point for every feature and picks the one with maximum Information Gain.
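A toy version of this exhaustive search for a single feature, using Gini gain and midpoints between consecutive sorted values as candidate cutoffs (illustrative pure Python, not the library implementation):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Return (threshold, information_gain) of the best cut on feature x."""
    parent = gini(y)
    best_t, best_gain = None, 0.0
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):          # candidate cutoffs: midpoints
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - weighted > best_gain:
            best_t, best_gain = t, parent - weighted
    return best_t, best_gain

# Perfectly separable toy data: the cut lands between the two clusters.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # (6.5, 0.5)
```

The gain of 0.5 equals the full parent impurity, since both children are pure.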
An unrestricted tree will grow until every leaf is pure — memorizing the training data.
Cost-complexity pruning adds a penalty for tree size:
\[\large{ C_\alpha(T) = \underbrace{\sum_{m=1}^{|T|} \sum_{x_i \in R_m} L(y_i, c_m)}_{\text{Training loss}} + \underbrace{\alpha |T|}_{\text{Complexity penalty}} }\]
In practice: Grow a maximum tree, then prune back using cross-validated \(\alpha\).
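The trade-off in \(C_\alpha(T)\) is easy to see with made-up numbers (both trees and their losses are hypothetical):

```python
def cost_complexity(train_loss, n_leaves, alpha):
    """C_alpha(T) = total training loss + alpha * number of leaves |T|."""
    return train_loss + alpha * n_leaves

# Hypothetical: a 20-leaf tree fits training data well; a pruned 4-leaf
# tree fits worse but is far simpler.
big = cost_complexity(train_loss=2.0, n_leaves=20, alpha=0.5)    # 12.0
small = cost_complexity(train_loss=6.0, n_leaves=4, alpha=0.5)   # 8.0
# At alpha = 0.5 the pruned tree wins; at alpha = 0 the big tree would,
# which is exactly why alpha is chosen by cross-validation.
```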
Data: 1,862 YRD listed companies (2023), 72 flagged as ST (3.87%)
| Feature | Description |
|---|---|
| ROA (%) | Return on Assets — profitability |
| Debt Ratio (%) | Total debt / Total assets — leverage |
| Current Ratio | Current assets / Current liabilities — liquidity |
| ln(Total Assets) | Firm size (log-transformed) |
Single Decision Tree Results (optimal depth from CV = 2):
The problem: Remove just 5% of training data, and the entire tree structure can change completely.
Why: The greedy splitting algorithm makes each decision based on a single threshold. Small changes in data near that threshold → different split → different subtrees → completely different model.
Implication: Never trust a single tree for production deployment.
| Perturbation | Tree Structure Change | Risk |
|---|---|---|
| Remove 5% data | Root split variable may change | Model instability |
| Add one outlier | Entire branch restructured | Sensitivity to noise |
| Different random seed | Different train/test split | Unreproducible results |
The solution: Don’t use one tree — use hundreds. This is the motivation for ensemble methods.
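This sensitivity can be reproduced with a toy greedy root-split chooser (pure Python, Gini gain over midpoint cutoffs; the dataset is invented). Adding one mislabeled outlier is enough to change which feature the root split uses:

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def root_split_feature(rows, y):
    """Greedy choice: feature whose best midpoint cutoff maximizes Gini gain."""
    parent, best_f, best_gain = gini(y), None, -1.0
    for f in range(len(rows[0])):
        vals = sorted(set(r[f] for r in rows))
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2
            left = [yi for r, yi in zip(rows, y) if r[f] <= t]
            right = [yi for r, yi in zip(rows, y) if r[f] > t]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if gain > best_gain:
                best_f, best_gain = f, gain
    return best_f

rows = [(i, 10 * i) for i in range(1, 9)]   # two perfectly aligned features
y = [0, 0, 0, 0, 1, 1, 1, 1]
before = root_split_feature(rows, y)                       # feature 0
after = root_split_feature(rows + [(2.5, 90)], y + [1])    # outlier: feature 1
```

One point with a contradictory feature-0 value flips the root to feature 1, and every subtree below it changes with it.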
Axis-parallel splits can’t always capture interactions efficiently:
| \(X_1\) Location | \(X_2\) Location | Outcome |
|---|---|---|
| Left half | Bottom half | Class 0 |
| Left half | Top half | Class 1 |
| Right half | Bottom half | Class 1 |
| Right half | Top half | Class 0 |
This is the XOR (exclusive-or) pattern. No single split helps — accuracy at depth 1 is 0.484 (worse than random!). But at depth 2, accuracy jumps to 1.000.
Lesson: A tree must be deep enough to capture interactions. For an interaction of order \(k\) (involving \(k\) variables), you need at least \(k\) levels of splits.
Analogy: This is why neural networks need multiple layers — a single-layer perceptron likewise cannot represent XOR.
Key idea: Grow many decorrelated trees and average their predictions.
Two sources of randomness:
Why it works — the variance reduction formula:
\[\large{ \text{Var}(\bar{f}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2 }\]
where \(\sigma^2\) is the variance of a single tree, \(\rho\) the average pairwise correlation between trees, and \(B\) the number of trees. As \(B \to \infty\) only the \(\rho\sigma^2\) floor remains, so feature randomization (which lowers \(\rho\)) is essential.
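Plugging illustrative numbers into this formula (the values of ρ, σ², and B here are assumptions, not estimates from the chapter's data):

```python
def ensemble_variance(rho, sigma2, B):
    """Variance of an average of B trees with pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) / B * sigma2

single = ensemble_variance(rho=0.1, sigma2=1.0, B=1)     # full tree variance
forest = ensemble_variance(rho=0.1, sigma2=1.0, B=500)
# forest is roughly 0.102: the (1 - rho)/B term has almost vanished,
# leaving the rho * sigma2 floor of 0.1. Only lowering rho itself
# (feature randomization) pushes below that floor.
```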
500 trees; with \(p = 4\) features, \(m = \sqrt{p} = 2\) features considered at each split:
| Metric | Decision Tree | Random Forest | Improvement |
|---|---|---|---|
| Test Accuracy | 80.32% | 90.52% | +10.2 ppt |
| Test AUC | 0.7872 | 0.8414 | +0.054 |
| OOB Score | — | 0.8887 | Built-in validation |
Feature importance (redistribution effect):
| Feature | Single Tree | Random Forest |
|---|---|---|
| ROA | 0.895 | 0.495 |
| Debt Ratio | 0.066 | 0.180 |
| ln(Assets) | 0.027 | 0.190 |
| Current Ratio | 0.012 | 0.135 |
Critical insight: Random Forest spreads importance more evenly — revealing predictive power that a single tree’s greedy algorithm missed.
Boosting philosophy: Build trees sequentially, each one correcting the errors of the ensemble so far.
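A minimal illustration of that philosophy: regression boosting with depth-1 stumps on made-up 1-D data. Each round fits the current residuals, and a learning rate shrinks every contribution (a sketch of the sequential idea, not XGBoost itself):

```python
def fit_stump(x, r):
    """Best single-threshold stump minimizing squared error on residuals r."""
    best = None
    for t in sorted(set(x))[:-1]:           # split at each value but the max
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        c_l, c_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - c_l) ** 2 for ri in left)
               + sum((ri - c_r) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, c_l, c_r)
    return best[1:]                          # (threshold, left value, right value)

def boost(x, y, rounds=3, lr=0.5):
    """Sequentially add shrunken stumps fitted to the remaining residuals."""
    pred = [0.0] * len(y)
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        t, c_l, c_r = fit_stump(x, resid)
        pred = [pi + lr * (c_l if xi <= t else c_r)
                for xi, pi in zip(x, pred)]
    return pred

print(boost([0, 1, 2, 3], [0, 0, 1, 1]))   # [0.0, 0.0, 0.875, 0.875]
```

Each round halves the remaining error of the positive class: 0.5, then 0.75, then 0.875, creeping toward the targets rather than jumping to them.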
XGBoost objective (regularized):
\[\large{ \mathcal{L}^{(t)} = \sum_{i=1}^n L\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 }\]
Second-order Taylor approximation makes this efficient:
\[\large{ \mathcal{L}^{(t)} \approx \sum_{i} \left[g_i f_t(x_i) + \frac{1}{2}h_i f_t^2(x_i)\right] + \Omega(f_t) }\]
where \(g_i = \partial L(y_i, \hat{y}_i^{(t-1)}) / \partial \hat{y}_i^{(t-1)}\) (gradient) and \(h_i = \partial^2 L(y_i, \hat{y}_i^{(t-1)}) / \partial (\hat{y}_i^{(t-1)})^2\) (Hessian), both evaluated at the previous round's prediction.
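For binary classification with logistic loss \(L = -[y\log p + (1-y)\log(1-p)]\) and \(p = \sigma(\hat{y})\), these derivatives have simple closed forms, \(g = p - y\) and \(h = p(1-p)\):

```python
import math

def grad_hess_logloss(y, margin):
    """Gradient and Hessian of logistic loss w.r.t. the raw margin score."""
    p = 1.0 / (1.0 + math.exp(-margin))   # sigmoid: margin -> probability
    return p - y, p * (1.0 - p)

g, h = grad_hess_logloss(y=1, margin=0.0)
# At margin 0 the model says p = 0.5: g = -0.5 (push the score up),
# h = 0.25 (maximum curvature, i.e. the most informative region).
```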
Configuration: Learning rate 0.1, max depth 3, 80% subsample, regularization λ=1, γ=0.1
| Metric | Decision Tree | Random Forest | XGBoost |
|---|---|---|---|
| Test Accuracy | 80.32% | 90.52% | 87.48% |
| Test AUC | 0.7872 | 0.8414 | 0.8532 |
| Ensemble size | — | 500 trees | 8 rounds |
XGBoost wins on AUC (the metric that matters for ranking) while needing only 8 boosting rounds — extremely efficient.
Key hyperparameters to tune:
- `max_depth`: Usually 3–6 (shallow trees prevent overfitting)
- `learning_rate` (η): Smaller → more rounds needed but better generalization
- `n_estimators`: Use early stopping on a validation set
- `subsample`: Row sampling ratio (0.8 typical)

SHAP (SHapley Additive exPlanations) assigns each feature a contribution to each prediction:
\[\large{ \phi_j = \sum_{S \subseteq N \setminus \{j\}} \frac{|S|!(p-|S|-1)!}{p!}\left[f(S \cup \{j\}) - f(S)\right] }\]
This is the Shapley value from cooperative game theory — the only attribution method satisfying local accuracy, missingness, and consistency.
Key finding: ROA is the dominant predictor — low ROA pushes SHAP values strongly positive (toward ST prediction).
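The formula can be evaluated directly for toy games with a handful of players; the exponential sum over subsets is why SHAP libraries use TreeSHAP for tree models. The value function below is invented for illustration:

```python
from itertools import combinations
from math import factorial

def shapley(value, players):
    """Exact Shapley values: weighted marginal contributions over all subsets."""
    p = len(players)
    phi = {}
    for j in players:
        others = [q for q in players if q != j]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                weight = factorial(r) * factorial(p - r - 1) / factorial(p)
                total += weight * (value(set(S) | {j}) - value(set(S)))
        phi[j] = total
    return phi

# Additive toy game: each "feature" contributes a fixed amount, so the
# Shapley values recover exactly those amounts. Names are illustrative.
contrib = {"ROA": 2.0, "DebtRatio": 1.0}
phi = shapley(lambda S: sum(contrib[q] for q in S), list(contrib))
```

For an additive game the attribution is trivially fair; the interesting cases are interacting value functions, where the subset weighting is what keeps the attributions consistent.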
Experiment: Generate 500 points on a 2D grid with XOR pattern:
\[\large{ Y = \begin{cases} 1 & \text{if } (X_1 > 0 \text{ AND } X_2 > 0) \text{ OR } (X_1 < 0 \text{ AND } X_2 < 0) \\ 0 & \text{otherwise} \end{cases} }\]
| Tree Depth | Training Accuracy | Interpretation |
|---|---|---|
| Depth = 1 | 0.484 | Worse than random! Single cut is useless. |
| Depth = 2 | 1.000 | Perfect — two cuts capture the interaction. |
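A deterministic variant of this experiment on a symmetric grid (pure Python; exact accuracies differ slightly from the 500 randomly sampled points above):

```python
# XOR labels on a symmetric grid around the origin (axes excluded).
def xor_label(x1, x2):
    return 1 if (x1 > 0) == (x2 > 0) else 0

grid = [(i / 10, j / 10) for i in range(-10, 11) for j in range(-10, 11)
        if i != 0 and j != 0]
labels = [xor_label(a, b) for a, b in grid]

def stump_accuracy(axis, t):
    """Best accuracy of one axis-parallel cut (either class orientation)."""
    preds = [1 if pt[axis] > t else 0 for pt in grid]
    acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return max(acc, 1 - acc)

# Depth 1: try every cutoff on both axes.
depth1 = max(stump_accuracy(ax, t / 10) for ax in (0, 1) for t in range(-10, 11))
# Depth 2: the sign-agreement rule that two stacked cuts can express.
depth2 = sum((1 if (a > 0) == (b > 0) else 0) == y
             for (a, b), y in zip(grid, labels)) / len(labels)
# depth1 == 0.5: every single cut is a coin flip; depth2 == 1.0.
```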
The lesson: Setting max_depth=1 (decision stumps) is dangerous when interactions exist between features. Always consider interaction depth when tuning.
Financial parallel: A stock’s risk depends not just on leverage or profitability alone, but on their combination. A high-debt firm with high ROA may be fine; a high-debt firm with low ROA is in trouble. This is an interaction that depth=1 cannot see.
Experiment: Generate 300 observations where Y is pure random (no relationship with X at all):
| Max Depth | Train Accuracy | Test Accuracy | Gap |
|---|---|---|---|
| Depth = 2 | 57.14% | 51.67% | 5.5 ppt |
| Depth = 5 | 81.43% | 48.33% | 33.1 ppt |
| Unlimited | 100.00% | 56.67% | 43.3 ppt |
The tree memorized pure noise with 100% training accuracy.
Key diagnostic: The gap between training and test accuracy. A gap exceeding 10 percentage points is a strong overfitting signal.
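The diagnostic is trivial to automate (the 10-point threshold is the rule of thumb above):

```python
def overfitting_gap(train_acc, test_acc, threshold=0.10):
    """Return (gap, flag): flag is True when train - test exceeds threshold."""
    gap = train_acc - test_acc
    return gap, gap > threshold

# Values from the noise experiment above:
shallow = overfitting_gap(0.5714, 0.5167)   # small gap: acceptable
deep = overfitting_gap(1.0, 0.5667)         # huge gap: memorized noise
```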
The antidote: limit tree depth, prune with cross-validated \(\alpha\), and prefer ensembles to any single deep tree.
| Model | Train AUC | Test AUC | Key Strength | Key Weakness |
|---|---|---|---|---|
| Decision Tree | ~1.0 | 0.787 | Interpretable rules | Unstable, overfits |
| Random Forest | ~1.0 | 0.841 | Robust, parallel | Less interpretable |
| XGBoost | ~1.0 | 0.853 | Best accuracy, fast | Requires tuning |
| Logistic (Ch.11) | — | 0.840 | Coefficients meaningful | Assumes linearity |
Practical guidelines:
| Topic | Key Takeaway |
|---|---|
| Decision Trees | Recursive binary partitioning; greedy algorithm; axis-parallel splits |
| Gini vs. Entropy | Nearly identical in practice; Gini is default |
| Pruning | Cost-complexity pruning via CV prevents overfitting |
| Instability | Never trust a single tree — small data changes → completely different model |
| XOR Problem | Need sufficient depth to capture variable interactions |
| Random Forest | Bagging + feature randomization → decorrelated trees → lower variance |
| XGBoost | Sequential boosting + regularization → best predictive performance |
| SHAP | Shapley values → fair, consistent, interpretable feature attributions |
| Overfitting | Unlimited trees memorize noise; always check train-test gap |
The trajectory: OLS → GLM → Trees → Ensembles. Each step gains flexibility but loses interpretability; SHAP helps recover it.