OLS regression assumes the response \(Y\) is continuous with normally distributed errors. But many real-world outcomes violate this:
| Outcome Type | Example | OLS Problem |
|---|---|---|
| Binary (0/1) | Default vs. No Default | Predicted \(\hat{P}\) can exceed 1 or go below 0 |
| Count (0,1,2,…) | Number of analyst reports | OLS predicts negative counts |
| Skewed positive | Insurance claim amounts | Normal residuals impossible |
The solution: Generalized Linear Models (GLM) — a unified framework that extends OLS to handle all these cases.
GLMs require the response distribution to belong to the exponential family:
\[\large{ f(y|\theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\} }\]
| Distribution | Natural Parameter \(\theta\) | \(b(\theta)\) | Mean \(\mu\) | Variance |
|---|---|---|---|---|
| Normal | \(\mu\) | \(\theta^2/2\) | \(\theta\) | \(\sigma^2\) |
| Binomial (\(n=1\)) | \(\ln\frac{p}{1-p}\) | \(\ln(1+e^\theta)\) | \(\frac{e^\theta}{1+e^\theta}\) | \(p(1-p)\) |
| Poisson | \(\ln\lambda\) | \(e^\theta\) | \(e^\theta\) | \(\lambda\) |
Key property: \(E(Y) = b'(\theta)\), \(\text{Var}(Y) = a(\phi) \cdot b''(\theta)\).
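This mean-variance property can be checked numerically. The sketch below takes the Poisson case from the table, where \(b(\theta) = e^\theta\) and \(a(\phi) = 1\), and confirms via finite differences that \(b'(\theta)\) and \(b''(\theta)\) both recover \(\lambda\) (the value of \(\lambda\) is arbitrary):

```python
import math

# Poisson in exponential-family form: theta = ln(lambda), b(theta) = e^theta.
# Check E(Y) = b'(theta) and Var(Y) = a(phi) * b''(theta) with a(phi) = 1.
lam = 3.5                      # arbitrary rate for illustration
theta = math.log(lam)
b = math.exp                   # b(theta) = e^theta, so b' = b'' = e^theta

h = 1e-4                       # step for numerical differentiation
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)               # central difference
b_double = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2  # second difference

print(f"b'(theta)  = {b_prime:.4f}  (expect lambda = {lam})")
print(f"b''(theta) = {b_double:.4f}  (expect lambda = {lam})")
```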
Every GLM consists of exactly three components:
1. Random Component — \(Y\) follows an exponential family distribution with mean \(\mu = E(Y)\)
2. Systematic Component — The linear predictor:
\[\large{ \eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p }\]
3. Link Function — Connects \(\mu\) to \(\eta\):
\[\large{ g(\mu) = \eta \quad \Longleftrightarrow \quad \mu = g^{-1}(\eta) }\]
| Distribution | Link \(g(\mu)\) | Range of \(\mu\) | Range of \(\eta\) |
|---|---|---|---|
| Normal | Identity: \(\mu\) | \((-\infty, +\infty)\) | \((-\infty, +\infty)\) |
| Binomial | Logit: \(\ln\frac{\mu}{1-\mu}\) | \((0, 1)\) | \((-\infty, +\infty)\) |
| Poisson | Log: \(\ln\mu\) | \((0, +\infty)\) | \((-\infty, +\infty)\) |
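The table's key point is that the link maps a bounded mean onto the unbounded range of \(\eta\), and the inverse link maps back. A minimal round-trip sketch for the logit link:

```python
import math

def logit(mu):
    # g(mu): maps mu in (0, 1) onto the whole real line
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    # g^{-1}(eta): maps any real eta back into (0, 1)
    return 1 / (1 + math.exp(-eta))

# An unbounded linear predictor always yields a valid mean, and the
# link recovers the predictor exactly:
for eta in (-4.0, 0.0, 4.0):
    mu = inv_logit(eta)
    print(f"eta = {eta:+.1f} -> mu = {mu:.4f} -> logit(mu) = {logit(mu):+.4f}")
```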
For binary outcomes \(Y \in \{0, 1\}\), the logistic regression model is:
\[\large{ P(Y=1|\mathbf{X}) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}} }\]
The logit transformation linearizes the relationship:
\[\large{ \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p }\]
Interpreting coefficients — the Odds Ratio: \(OR_j = e^{\beta_j}\). A one-unit increase in \(X_j\) multiplies the odds of \(Y=1\) by \(e^{\beta_j}\), holding the other predictors fixed.
Context: In China’s A-share market, companies in financial trouble receive an “ST” (Special Treatment) designation — a natural binary classification target.
Data: 2023 annual reports from financial_statement.h5
| Predictor | Coefficient | Odds Ratio | p-value | Interpretation |
|---|---|---|---|---|
| ROA (%) | −0.0790 | 0.924 | 0.000 | Each 1ppt ROA increase → 7.6% lower odds of ST |
| Debt Ratio (%) | +0.0305 | 1.031 | 0.000 | Each 1ppt increase → 3.1% higher odds |
| Current Ratio | +0.0883 | 1.092 | 0.415 | Not significant (CI includes 1) |
| ln(Assets) | −0.4843 | 0.616 | 0.000 | Larger firms → much lower odds |
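The odds-ratio column above is just \(e^{\beta}\) applied to the coefficient column, which is easy to verify:

```python
import math

# Coefficients from the ST logistic regression table above
coefs = {
    "ROA (%)":        -0.0790,
    "Debt Ratio (%)":  0.0305,
    "Current Ratio":   0.0883,
    "ln(Assets)":     -0.4843,
}

for name, beta in coefs.items():
    odds_ratio = math.exp(beta)                 # OR = e^beta
    pct = (odds_ratio - 1) * 100                # percent change in odds per unit
    print(f"{name:>15}: OR = {odds_ratio:.3f}  ({pct:+.1f}% odds per unit increase)")
```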
Pseudo \(R^2\) = 0.1642 | Model LLR \(p\) = 3.2×10⁻²¹
After training with class_weight='balanced' and a 70/30 stratified train-test split, the model reaches an out-of-sample AUC of 0.84.
The precision-recall tradeoff: with only a 4% positive rate, even a good model produces many false positives. In credit risk, this tradeoff is typically managed by tuning the classification threshold to reflect the relative cost of each error type, rather than accepting the default 0.5 cutoff.
When the outcome is a non-negative integer (count), use Poisson regression:
\[\large{ \ln(\lambda) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p }\]
Equivalently: \(\lambda = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}\)
Incidence Rate Ratio (IRR): \(IRR_j = e^{\beta_j}\) — multiplicative change in expected count
Critical assumption: \(E(Y) = \text{Var}(Y) = \lambda\) (equidispersion)
Case: Revenue units (¥ billions) for 5,396 listed companies:
| Predictor | Coefficient | IRR | Interpretation |
|---|---|---|---|
| ln(Assets) | 0.4964 | 1.643 | Each log-unit of assets → 64.3% more revenue |
| ROA (%) | 0.0492 | 1.050 | Each 1ppt ROA → 5.0% more revenue |
| Debt Ratio (%) | 0.0063 | 1.006 | Minimal effect |
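As with odds ratios, the IRR column is \(e^{\beta}\) applied to the coefficients in the table:

```python
import math

# Coefficients from the Poisson revenue model above
coefs = {
    "ln(Assets)":     0.4964,
    "ROA (%)":        0.0492,
    "Debt Ratio (%)": 0.0063,
}

for name, beta in coefs.items():
    irr = math.exp(beta)                        # IRR = e^beta
    pct = (irr - 1) * 100                       # percent change in expected count
    print(f"{name:>15}: IRR = {irr:.3f} -> {pct:+.1f}% expected revenue units")
```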
The Pearson dispersion parameter: \(\phi = \frac{1}{n-p}\sum\frac{(y_i - \hat{\lambda}_i)^2}{\hat{\lambda}_i}\)
For our revenue model: \(\phi\) = 8.78 \(\gg\) 1.0 — severe overdispersion
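The dispersion statistic itself is a one-liner. Here is a minimal sketch on made-up counts and fitted means (not the revenue data), showing how \(\phi \gg 1\) flags overdispersion:

```python
# Pearson dispersion: phi = (1 / (n - p)) * sum((y_i - lam_i)^2 / lam_i)
def pearson_dispersion(y, lam_hat, n_params):
    n = len(y)
    pearson_chi2 = sum((yi - li) ** 2 / li for yi, li in zip(y, lam_hat))
    return pearson_chi2 / (n - n_params)

# Toy counts and fitted Poisson means (illustrative only), p = 2 parameters
y       = [0, 1, 3, 7, 12, 2]
lam_hat = [1.0, 1.5, 2.0, 3.0, 4.0, 2.5]

phi = pearson_dispersion(y, lam_hat, n_params=2)
print(f"phi = {phi:.2f}")   # well above 1.0 -> overdispersed
```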
Consequences of ignoring overdispersion: standard errors are understated, so p-values are too small and predictors look more significant than they really are.
Remedies:
| Method | Variance Structure | When to Use |
|---|---|---|
| Quasi-Poisson | \(\text{Var}(Y) = \phi\lambda\) | Moderate overdispersion |
| Negative Binomial | \(\text{Var}(Y) = \lambda + \lambda^2/\theta\) | Heavy overdispersion |
The trap: If one variable perfectly predicts the outcome (e.g., all firms with ROA < −5% are ST, all with ROA > −5% are not), then MLE tries to push \(\beta \to \infty\).
Symptoms: Astronomical coefficients, infinite standard errors, p-values near 1, convergence warnings
Why it happens: The likelihood keeps increasing as \(|\beta|\) grows — there is no finite maximum.
Solutions: Firth’s penalized likelihood, L1/L2 regularization, or removing (or coarsening) the perfectly predictive variable.
Rule of thumb: If any coefficient exceeds 10 in absolute value or any SE exceeds 100, suspect separation.
The temptation: A 10th-degree polynomial fits 11 points with \(R^2 = 1.0\).
The truth: It will fail catastrophically on new data.
Occam’s Razor: “Entia non sunt multiplicanda praeter necessitatem” (entities should not be multiplied beyond necessity). A simple logistic regression often outperforms deep learning: it is more robust, more interpretable, and easier to deploy.
When you have many predictors, regularization prevents overfitting by penalizing large coefficients:
| Method | Penalty | Key Property |
|---|---|---|
| Ridge (L2) | \(\lambda\sum_{j=1}^p \beta_j^2\) | Shrinks all coefficients toward zero; never exactly zero |
| Lasso (L1) | \(\lambda\sum_{j=1}^p \lvert\beta_j\rvert\) | Can set coefficients exactly to zero → variable selection |
| Elastic Net | \(\lambda_1\sum\lvert\beta_j\rvert + \lambda_2\sum\beta_j^2\) | Combines both: sparsity + stability |
Ridge closed-form solution:
\[\large{ \hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{Y} }\]
Note: Adding \(\lambda\mathbf{I}\) guarantees invertibility — this solves multicollinearity!
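The closed form is easiest to see in the single-predictor case, where \(\mathbf{X}'\mathbf{X}\) and \(\mathbf{X}'\mathbf{Y}\) are scalars and the estimator reduces to \(\hat{\beta} = X'Y / (X'X + \lambda)\). A sketch on made-up data (a centered predictor with roughly \(y = 2x\)), showing how growing \(\lambda\) shrinks the coefficient toward zero without ever reaching it:

```python
# Ridge closed form, scalar case: beta = X'Y / (X'X + lambda)
def ridge_1d(x, y, lam):
    xtx = sum(xi * xi for xi in x)                       # X'X
    xty = sum(xi * yi for xi, yi in zip(x, y))           # X'Y
    return xty / (xtx + lam)                             # always finite for lam > 0

x = [-2.0, -1.0, 0.0, 1.0, 2.0]      # centered predictor (toy data)
y = [-3.9, -2.1, 0.2, 1.8, 4.0]      # roughly y = 2x plus noise

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda = {lam:6.1f} -> beta = {ridge_1d(x, y, lam):.4f}")
```

At \(\lambda = 0\) this reproduces the OLS estimate; as \(\lambda\) grows, the coefficient shrinks monotonically, mirroring the RidgeCV behavior in the table below.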
Experiment: 300 YRD companies, 3 financial features → current ratio
Cross-validation results:
| Method | Optimal λ | R² | Key Difference |
|---|---|---|---|
| OLS (baseline) | — | 0.1022 | All coefficients unrestricted |
| RidgeCV | 104.81 | 0.0955 | All coefficients shrunk, none zero |
| LassoCV | 0.3728 | 0.0580 | Some coefficients driven to zero |
What happens if you use OLS for binary outcomes?
In a simulation with 200 firms, fitting \(P(\text{Default}) = \beta_0 + \beta_1 X\) via OLS produces fitted “probabilities” outside \([0, 1]\).
The absurdity: A probability of 1.3 or −0.2 has no meaning.
Why logistic regression fixes this: The sigmoid function \(\frac{1}{1+e^{-\eta}}\) is bounded between 0 and 1 by mathematical construction — no matter how extreme the inputs.
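This boundedness is easy to demonstrate. The sketch below uses a numerically stable form of the sigmoid (the naive \(1/(1+e^{-\eta})\) overflows for very large negative \(\eta\)):

```python
import math

def sigmoid(eta):
    # Numerically stable logistic function: avoids overflow for large |eta|
    if eta >= 0:
        return 1 / (1 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1 + z)

# No matter how extreme the linear predictor, the output stays in [0, 1]
for eta in (-1000, -10, 0, 10, 1000):
    print(f"eta = {eta:>6} -> p = {sigmoid(eta):.6f}")
```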
Practical note: Despite this flaw, the Linear Probability Model (LPM) is still widely used in economics. Why? Because its coefficients are directly interpretable as marginal effects. But for prediction and risk scoring, logistic regression is generally preferred.
Setup: Construct data where \(X \leq 5 \Rightarrow Y = 0\) and \(X > 5 \Rightarrow Y = 1\) — zero overlap.
Result: Logistic regression coefficient → ~5,000; SE → astronomically large; p-value → 1.0
| Topic | Key Takeaway |
|---|---|
| Exponential Family | Unified distribution framework: Normal, Binomial, Poisson, Gamma |
| GLM Components | Random + Systematic + Link function |
| Logistic Regression | Binary outcomes; interpret via Odds Ratios (\(e^{\beta}\)) |
| ST Prediction Case | ROA strongest predictor (OR = 0.924); AUC = 0.84 |
| Poisson Regression | Count data; interpret via IRR (\(e^{\beta}\)) |
| Overdispersion | Always test \(\phi\); use Quasi-Poisson or Negative Binomial |
| Perfect Separation | MLE fails → use Firth’s or regularization |
| Ridge (L2) | Shrinks all coefficients; fixes multicollinearity |
| Lasso (L1) | Automatic variable selection (exact zeros) |
| Elastic Net | Best of both: sparsity + stability |
The meta-lesson: Always match your model to your data’s distribution. OLS is just one member of a much larger family.