11: Generalized Linear Models & Nonlinear Methods

Why Go Beyond OLS? The Nonlinear Reality

OLS regression assumes a continuous response with normally distributed errors. But many real-world outcomes violate this:

| Outcome Type | Example | OLS Problem |
|---|---|---|
| Binary (0/1) | Default vs. no default | Predicted \(\hat{P}\) can exceed 1 or fall below 0 |
| Count (0, 1, 2, …) | Number of analyst reports | OLS predicts negative counts |
| Skewed positive | Insurance claim amounts | Normal residuals are impossible |

The solution: Generalized Linear Models (GLM) — a unified framework that extends OLS to handle all these cases.

The Exponential Family: A Unified Distribution Framework

GLMs require the response distribution to belong to the exponential family:

\[\large{ f(y|\theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\} }\]

| Distribution | Natural Parameter \(\theta\) | \(b(\theta)\) | Mean \(\mu\) | Variance |
|---|---|---|---|---|
| Normal | \(\mu\) | \(\theta^2/2\) | \(\theta\) | \(\sigma^2\) |
| Binomial (n = 1) | \(\ln\frac{p}{1-p}\) | \(\ln(1+e^\theta)\) | \(\frac{e^\theta}{1+e^\theta}\) | \(p(1-p)\) |
| Poisson | \(\ln\lambda\) | \(e^\theta\) | \(e^\theta\) | \(\lambda\) |

Key property: \(E(Y) = b'(\theta)\), \(\text{Var}(Y) = a(\phi) \cdot b''(\theta)\).
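The mean identity is easy to check numerically. A minimal sketch for the Poisson case, where \(b(\theta) = e^\theta\) and therefore \(b'(\theta) = e^\theta = \lambda\):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0
theta = np.log(lam)              # natural parameter: theta = ln(lambda)
b_prime = np.exp(theta)          # b'(theta) = e^theta, which should equal lambda
sample_mean = rng.poisson(lam, size=200_000).mean()
# b_prime and sample_mean both sit near 3.0, confirming E(Y) = b'(theta)
```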

The Three Components of a GLM

Every GLM consists of exactly three components:

1. Random Component — \(Y\) follows an exponential family distribution with mean \(\mu = E(Y)\)

2. Systematic Component — The linear predictor:

\[\large{ \eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p }\]

3. Link Function — Connects \(\mu\) to \(\eta\):

\[\large{ g(\mu) = \eta \quad \Longleftrightarrow \quad \mu = g^{-1}(\eta) }\]

| Distribution | Link \(g(\mu)\) | Range of \(\mu\) | Range of \(\eta\) |
|---|---|---|---|
| Normal | Identity: \(\mu\) | \((-\infty, +\infty)\) | \((-\infty, +\infty)\) |
| Binomial | Logit: \(\ln\frac{\mu}{1-\mu}\) | \((0, 1)\) | \((-\infty, +\infty)\) |
| Poisson | Log: \(\ln\mu\) | \((0, +\infty)\) | \((-\infty, +\infty)\) |
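Each inverse link maps the unrestricted linear predictor back into the valid range for \(\mu\); a quick sketch:

```python
import numpy as np

eta = np.array([-5.0, 0.0, 5.0])        # linear predictor: any real value

mu_identity = eta                        # Normal: identity link, range unchanged
mu_logit = 1 / (1 + np.exp(-eta))        # Binomial: inverse logit maps into (0, 1)
mu_log = np.exp(eta)                     # Poisson: inverse log maps into (0, +inf)
```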

Logistic Regression: The Model

For binary outcomes \(Y \in \{0, 1\}\), the logistic regression model is:

\[\large{ P(Y=1|\mathbf{X}) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}} }\]

The logit transformation linearizes the relationship:

\[\large{ \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p }\]

Interpreting coefficients — the Odds Ratio:

  • Odds = \(\frac{p}{1-p}\) (e.g., odds of 3:1 means \(p = 0.75\))
  • Odds Ratio = \(e^{\beta_j}\) — multiplicative change in odds per unit increase in \(X_j\)
  • Example: \(\beta_1 = 0.5 \Rightarrow OR = e^{0.5} = 1.65\) — odds increase by 65%
  • Warning: This is change in odds, not change in probability!
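The arithmetic behind these bullets in two lines:

```python
import math

odds_ratio = math.exp(0.5)      # beta_1 = 0.5  ->  OR ~ 1.65 (odds rise by ~65%)
p_from_odds = 3 / (1 + 3)       # odds of 3:1  ->  p = 3/(1+3) = 0.75
```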

Case Study: Predicting Financial Distress (ST Status)

Context: In China’s A-share market, companies in financial trouble receive an “ST” (Special Treatment) designation — a natural binary classification target.

Data: 2023 annual reports from financial_statement.h5

  • Sample: 1,868 YRD non-financial listed companies
  • ST companies: 74 (3.96% — severe class imbalance)

| Predictor | Coefficient | Odds Ratio | p-value | Interpretation |
|---|---|---|---|---|
| ROA (%) | −0.0790 | 0.924 | 0.000 | Each 1 ppt ROA increase → 7.6% lower odds of ST |
| Debt Ratio (%) | +0.0305 | 1.031 | 0.000 | Each 1 ppt increase → 3.1% higher odds |
| Current Ratio | +0.0883 | 1.092 | 0.415 | Not significant (CI includes 1) |
| ln(Assets) | −0.4843 | 0.616 | 0.000 | Larger firms → much lower odds |

Pseudo \(R^2\) = 0.1642 | Model LLR \(p\) = 3.2×10⁻²¹

Model Evaluation: ROC Curve and AUC

After training with class_weight='balanced' and 70/30 stratified split:

  • AUC = 0.8400 — strong discriminatory power
  • Overall accuracy: 82%
  • ST recall: 0.68 (catches 68% of actual distressed firms)
  • ST precision: 0.14 (many false alarms — cost of class imbalance)

The precision-recall tradeoff: With only 4% positive rate, even a good model produces many false positives. In credit risk, this tradeoff is managed by:

  1. Adjusting the classification threshold
  2. Using cost-sensitive learning
  3. Combining model scores with expert judgment
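A sketch of this evaluation workflow on simulated imbalanced data (scikit-learn, not the ST sample):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, recall_score
from sklearn.model_selection import train_test_split

# Roughly 4% positives, mimicking a rare-event problem like ST prediction
X, y = make_classification(n_samples=2000, weights=[0.96, 0.04], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, proba)
recall_default = recall_score(y_te, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_te, (proba >= 0.3).astype(int))  # lower threshold -> recall rises
```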

Poisson Regression: Modeling Count Data

When the outcome is a non-negative integer (count), use Poisson regression:

\[\large{ \ln(\lambda) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p }\]

Equivalently: \(\lambda = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}\)

Incidence Rate Ratio (IRR): \(IRR_j = e^{\beta_j}\) — multiplicative change in expected count

Critical assumption: \(E(Y) = \text{Var}(Y) = \lambda\) (equidispersion)

Case: Revenue units (¥ billions) for 5,396 listed companies:

| Predictor | Coefficient | IRR | Interpretation |
|---|---|---|---|
| ln(Assets) | 0.4964 | 1.643 | Each unit of ln(Assets) → 64.3% more expected revenue |
| ROA (%) | 0.0492 | 1.050 | Each 1 ppt ROA → 5.0% more revenue |
| Debt Ratio (%) | 0.0063 | 1.006 | Minimal effect |

Overdispersion: When Poisson’s Assumption Fails

The Pearson dispersion parameter: \(\phi = \frac{1}{n-p}\sum\frac{(y_i - \hat{\lambda}_i)^2}{\hat{\lambda}_i}\)

For our revenue model: \(\phi\) = 8.78 \(\gg\) 1.0 — severe overdispersion

Consequences of ignoring overdispersion:

  • Standard errors are underestimated (by factor of \(\approx\sqrt{8.78} \approx 3\))
  • p-values are too small → false confidence in significance
  • Confidence intervals are too narrow

Remedies:

| Method | Variance Structure | When to Use |
|---|---|---|
| Quasi-Poisson | \(\text{Var}(Y) = \phi\lambda\) | Moderate overdispersion |
| Negative Binomial | \(\text{Var}(Y) = \lambda + \lambda^2/\theta\) | Heavy overdispersion |

Dirty Work: Perfect Separation — When MLE Explodes

The trap: If one variable perfectly predicts the outcome (e.g., all firms with ROA < −5% are ST, all with ROA > −5% are not), then MLE tries to push \(\beta \to \infty\).

Symptoms: Astronomical coefficients, infinite standard errors, p-values near 1, convergence warnings

Why it happens: The likelihood keeps increasing as \(|\beta|\) grows — there is no finite maximum.

Solutions:

  1. Firth’s Penalized Likelihood — adds a small penalty that prevents infinite estimates
  2. Regularization (Ridge/Lasso) — shrinks coefficients toward zero
  3. Reduce model complexity — remove the problematic variable or combine categories

Rule of thumb: If any coefficient exceeds 10 in absolute value or any SE exceeds 100, suspect separation.
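That rule of thumb is easy to automate as a post-fit check (the thresholds are the heuristic cutoffs above, not universal constants):

```python
import numpy as np

def suspect_separation(params, bse, coef_cut=10.0, se_cut=100.0):
    """Flag coefficients whose magnitude or standard error suggests separation."""
    params, bse = np.asarray(params), np.asarray(bse)
    return (np.abs(params) > coef_cut) | (bse > se_cut)

# A sane coefficient passes; an exploding one is flagged
flags = suspect_separation([0.4, 5000.0], [0.1, 2.3e7])
```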

Dirty Work: Overfitting & Occam’s Razor

The temptation: A 10th-degree polynomial fits 11 points with \(R^2 = 1.0\).

The truth: It will fail catastrophically on new data.

Figure: Bias-variance tradeoff. Training error decreases monotonically with model complexity, while test error is U-shaped; underfitting lies left of the optimum, overfitting to its right.

Occam’s Razor: Entia non sunt multiplicanda praeter necessitatem (entities should not be multiplied beyond necessity). A simple logistic regression often outperforms deep learning — more robust, more interpretable, easier to deploy.

Regularization: Ridge, Lasso, and Elastic Net

When you have many predictors, regularization prevents overfitting by penalizing large coefficients:

| Method | Penalty | Key Property |
|---|---|---|
| Ridge (L2) | \(\lambda\sum_{j=1}^p \beta_j^2\) | Shrinks all coefficients toward zero; never exactly zero |
| Lasso (L1) | \(\lambda\sum_{j=1}^p |\beta_j|\) | Can set coefficients exactly to zero → variable selection |
| Elastic Net | \(\lambda_1\sum|\beta_j| + \lambda_2\sum\beta_j^2\) | Combines both: sparsity + stability |

Ridge closed-form solution:

\[\large{ \hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{Y} }\]

Note: Adding \(\lambda\mathbf{I}\) guarantees invertibility — this solves multicollinearity!
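The closed form translates directly into NumPy (using `solve` rather than an explicit inverse for numerical stability):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator: (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

b_ols = ridge(X, y, 0.0)      # lam = 0 recovers OLS
b_ridge = ridge(X, y, 50.0)   # larger lam shrinks the coefficient norm
```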

Ridge vs. Lasso: Coefficient Path Comparison

Experiment: 300 YRD companies, 3 financial features → current ratio

Figure: Ridge vs. Lasso coefficient paths for ROA, Debt Ratio, and ln(Assets). Ridge coefficients shrink smoothly as log(λ) grows, while Lasso coefficients hit exactly zero at different λ values.

Cross-validation results:

| Method | Optimal λ | MSE | Key Difference |
|---|---|---|---|
| OLS (baseline) | — | 0.1022 | All coefficients unrestricted |
| RidgeCV | 104.81 | 0.0955 | All coefficients shrunk, none zero |
| LassoCV | 0.3728 | 0.0580 | Some coefficients driven to zero |
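The cross-validation step looks like this in scikit-learn; the data below are simulated, with one deliberately uninformative feature, not the YRD sample:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, 0.0, -0.8]) + rng.normal(scale=1.0, size=300)  # feature 1 is pure noise

ridge_fit = RidgeCV(alphas=np.logspace(-2, 3, 50)).fit(X, y)
lasso_fit = LassoCV(cv=5, random_state=0).fit(X, y)
# Lasso pushes the uninformative coefficient toward exactly zero; Ridge only shrinks it
```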

Heuristic 1: The Linear Probability Model Trap

What happens if you use OLS for binary outcomes?

In a simulation with 300 firms, fitting \(P(\text{Default}) = \beta_0 + \beta_1 X\) via OLS:

  • 128 out of 300 predictions (42.7%) fall outside \([0, 1]\)
  • Predicted probabilities range from approximately −0.2 to 1.3

The absurdity: A probability of 1.3 or −0.2 has no meaning.

Why logistic regression fixes this: The sigmoid function \(\frac{1}{1+e^{-\eta}}\) is bounded between 0 and 1 by mathematical construction — no matter how extreme the inputs.

Practical note: Despite this flaw, the Linear Probability Model (LPM) is still widely used in economics. Why? Because its coefficients are directly interpretable as marginal effects. But for prediction and risk scoring, logistic regression is always preferred.
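The out-of-range pathology is easy to reproduce; a sketch on simulated data (not the case-study sample):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.normal(0, 2, n)
y = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x)))    # true model is logistic

# Linear probability model: plain OLS of a 0/1 outcome on x
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
p_hat = X @ beta
out_of_range = np.mean((p_hat < 0) | (p_hat > 1))  # sizable share of invalid "probabilities"
```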

Heuristic 2: Perfect Separation — The MLE Destroyer

Setup: Construct data where \(X \leq 5 \Rightarrow Y = 0\) and \(X > 5 \Rightarrow Y = 1\) — zero overlap.

Result: Logistic regression coefficient → ~5,000; SE → astronomically large; p-value → 1.0

Figure: Perfect separation vs. normal overlap. In the left panel a gap between the classes drives β → ∞ and MLE fails; in the right panel overlapping classes yield a smooth sigmoid fit.
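The blow-up, and its regularized fix, can be reproduced with scikit-learn, whose logistic regression is ridge-penalized by default (a huge `C` effectively removes the penalty):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = (x.ravel() > 5).astype(int)            # perfect separation: zero overlap at x = 5

# Nearly unpenalized: the coefficient runs off toward infinity
big = LogisticRegression(C=1e10, max_iter=10_000).fit(x, y)
# Default ridge penalty (C=1.0): the estimate stays finite and stable
reg = LogisticRegression(C=1.0).fit(x, y)
```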

Summary: The GLM Toolkit

| Topic | Key Takeaway |
|---|---|
| Exponential Family | Unified distribution framework: Normal, Binomial, Poisson, Gamma |
| GLM Components | Random + Systematic + Link function |
| Logistic Regression | Binary outcomes; interpret via odds ratios (\(e^{\beta}\)) |
| ST Prediction Case | ROA strongest predictor (OR = 0.924); AUC = 0.84 |
| Poisson Regression | Count data; interpret via IRR (\(e^{\beta}\)) |
| Overdispersion | Always test \(\phi\); use Quasi-Poisson or Negative Binomial |
| Perfect Separation | MLE fails → use Firth’s penalization or regularization |
| Ridge (L2) | Shrinks all coefficients; fixes multicollinearity |
| Lasso (L1) | Automatic variable selection (exact zeros) |
| Elastic Net | Best of both: sparsity + stability |

The meta-lesson: Always match your model to your data’s distribution. OLS is just one member of a much larger family.