OLS regression assumes the response \(Y\) is continuous with normally distributed errors. But many real-world outcomes violate this:
| Outcome Type | Example | OLS Problem |
|---|---|---|
| Binary (0/1) | Default vs. No Default | Predicted \(\hat{P}\) can exceed 1 or go below 0 |
| Count (0,1,2,…) | Number of analyst reports | OLS predicts negative counts |
| Skewed positive | Insurance claim amounts | Normal residuals impossible |
The solution: Generalized Linear Models (GLM) — a unified framework that extends OLS to handle all these cases.
GLMs require the response distribution to belong to the exponential family:
\[\large{ f(y|\theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\} }\]
| Distribution | Natural Parameter \(\theta\) | \(b(\theta)\) | Mean \(\mu\) | Variance |
|---|---|---|---|---|
| Normal | \(\mu\) | \(\theta^2/2\) | \(\theta\) | \(\sigma^2\) |
| Binomial (\(n=1\)) | \(\ln\frac{p}{1-p}\) | \(\ln(1+e^\theta)\) | \(\frac{e^\theta}{1+e^\theta}\) | \(p(1-p)\) |
| Poisson | \(\ln\lambda\) | \(e^\theta\) | \(e^\theta\) | \(\lambda\) |
Key property: \(E(Y) = b'(\theta)\), \(\text{Var}(Y) = a(\phi) \cdot b''(\theta)\).
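This mean-variance property can be checked numerically. The sketch below takes the Poisson case from the table, where \(b(\theta) = e^\theta\) and \(a(\phi) = 1\), and confirms via finite differences that \(b'(\theta)\) and \(b''(\theta)\) both recover \(\lambda\) (the value of \(\lambda\) is arbitrary):

```python
import math

# Poisson in exponential-family form: theta = ln(lambda), b(theta) = e^theta.
# Check E(Y) = b'(theta) and Var(Y) = a(phi) * b''(theta) with a(phi) = 1.
lam = 3.5                      # arbitrary rate for illustration
theta = math.log(lam)
b = math.exp                   # b(theta) = e^theta, so b' = b'' = e^theta

h = 1e-4                       # step for numerical differentiation
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)               # central difference
b_double = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2  # second difference

print(f"b'(theta)  = {b_prime:.4f}  (expect lambda = {lam})")
print(f"b''(theta) = {b_double:.4f}  (expect lambda = {lam})")
```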
Every GLM consists of exactly three components:
1. Random Component — \(Y\) follows an exponential family distribution with mean \(\mu = E(Y)\)
2. Systematic Component — The linear predictor:
\[\large{ \eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p }\]
3. Link Function — Connects \(\mu\) to \(\eta\):
\[\large{ g(\mu) = \eta \quad \Longleftrightarrow \quad \mu = g^{-1}(\eta) }\]
| Distribution | Link \(g(\mu)\) | Range of \(\mu\) | Range of \(\eta\) |
|---|---|---|---|
| Normal | Identity: \(\mu\) | \((-\infty, +\infty)\) | \((-\infty, +\infty)\) |
| Binomial | Logit: \(\ln\frac{\mu}{1-\mu}\) | \((0, 1)\) | \((-\infty, +\infty)\) |
| Poisson | Log: \(\ln\mu\) | \((0, +\infty)\) | \((-\infty, +\infty)\) |
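The table's key point is that the link maps a bounded mean onto the unbounded range of \(\eta\), and the inverse link maps back. A minimal round-trip sketch for the logit link:

```python
import math

def logit(mu):
    # g(mu): maps mu in (0, 1) onto the whole real line
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    # g^{-1}(eta): maps any real eta back into (0, 1)
    return 1 / (1 + math.exp(-eta))

# An unbounded linear predictor always yields a valid mean, and the
# link recovers the predictor exactly:
for eta in (-4.0, 0.0, 4.0):
    mu = inv_logit(eta)
    print(f"eta = {eta:+.1f} -> mu = {mu:.4f} -> logit(mu) = {logit(mu):+.4f}")
```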
For binary outcomes \(Y \in \{0, 1\}\), the logistic regression model is:
\[\large{ P(Y=1|\mathbf{X}) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}} }\]
The logit transformation linearizes the relationship:
\[\large{ \text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p }\]
Interpreting coefficients — the Odds Ratio: \(OR_j = e^{\beta_j}\). A one-unit increase in \(X_j\) multiplies the odds of \(Y=1\) by \(e^{\beta_j}\), holding the other predictors fixed.
Context: In China’s A-share market, companies in financial trouble receive an “ST” (Special Treatment) designation — a natural binary classification target.
Data: 2023 annual reports from financial_statement.h5
| Predictor | Coefficient | Odds Ratio | p-value | Interpretation |
|---|---|---|---|---|
| ROA (%) | −0.0790 | 0.924 | 0.000 | Each 1ppt ROA increase → 7.6% lower odds of ST |
| Debt Ratio (%) | +0.0305 | 1.031 | 0.000 | Each 1ppt increase → 3.1% higher odds |
| Current Ratio | +0.0883 | 1.092 | 0.415 | Not significant (CI includes 1) |
| ln(Assets) | −0.4843 | 0.616 | 0.000 | Larger firms → much lower odds |
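The odds-ratio column above is just \(e^{\beta}\) applied to the coefficient column, which is easy to verify:

```python
import math

# Coefficients from the ST logistic regression table above
coefs = {
    "ROA (%)":        -0.0790,
    "Debt Ratio (%)":  0.0305,
    "Current Ratio":   0.0883,
    "ln(Assets)":     -0.4843,
}

for name, beta in coefs.items():
    odds_ratio = math.exp(beta)                 # OR = e^beta
    pct = (odds_ratio - 1) * 100                # percent change in odds per unit
    print(f"{name:>15}: OR = {odds_ratio:.3f}  ({pct:+.1f}% odds per unit increase)")
```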
Pseudo \(R^2\) = 0.1642 | Model LLR \(p\) = 3.2×10⁻²¹
After training with class_weight='balanced' and a 70/30 stratified train-test split, the model reaches an out-of-sample AUC of 0.84.
The precision-recall tradeoff: with only a 4% positive rate, even a good model produces many false positives. In credit risk, this tradeoff is typically managed by tuning the classification threshold to reflect the relative cost of each error type, rather than accepting the default 0.5 cutoff.
When the outcome is a non-negative integer (count), use Poisson regression:
\[\large{ \ln(\lambda) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p }\]
Equivalently: \(\lambda = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}\)
Incidence Rate Ratio (IRR): \(IRR_j = e^{\beta_j}\) — multiplicative change in expected count
Critical assumption: \(E(Y) = \text{Var}(Y) = \lambda\) (equidispersion)
Case: Revenue units (¥ billions) for 5,396 listed companies:
| Predictor | Coefficient | IRR | Interpretation |
|---|---|---|---|
| ln(Assets) | 0.4964 | 1.643 | Each log-unit of assets → 64.3% more revenue |
| ROA (%) | 0.0492 | 1.050 | Each 1ppt ROA → 5.0% more revenue |
| Debt Ratio (%) | 0.0063 | 1.006 | Minimal effect |
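As with odds ratios, the IRR column is \(e^{\beta}\) applied to the coefficients in the table:

```python
import math

# Coefficients from the Poisson revenue model above
coefs = {
    "ln(Assets)":     0.4964,
    "ROA (%)":        0.0492,
    "Debt Ratio (%)": 0.0063,
}

for name, beta in coefs.items():
    irr = math.exp(beta)                        # IRR = e^beta
    pct = (irr - 1) * 100                       # percent change in expected count
    print(f"{name:>15}: IRR = {irr:.3f} -> {pct:+.1f}% expected revenue units")
```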
The Pearson dispersion parameter: \(\phi = \frac{1}{n-p}\sum\frac{(y_i - \hat{\lambda}_i)^2}{\hat{\lambda}_i}\)
For our revenue model: \(\phi\) = 8.78 \(\gg\) 1.0 — severe overdispersion
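The dispersion statistic itself is a one-liner. Here is a minimal sketch on made-up counts and fitted means (not the revenue data), showing how \(\phi \gg 1\) flags overdispersion:

```python
# Pearson dispersion: phi = (1 / (n - p)) * sum((y_i - lam_i)^2 / lam_i)
def pearson_dispersion(y, lam_hat, n_params):
    n = len(y)
    pearson_chi2 = sum((yi - li) ** 2 / li for yi, li in zip(y, lam_hat))
    return pearson_chi2 / (n - n_params)

# Toy counts and fitted Poisson means (illustrative only), p = 2 parameters
y       = [0, 1, 3, 7, 12, 2]
lam_hat = [1.0, 1.5, 2.0, 3.0, 4.0, 2.5]

phi = pearson_dispersion(y, lam_hat, n_params=2)
print(f"phi = {phi:.2f}")   # well above 1.0 -> overdispersed
```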
Consequences of ignoring overdispersion: standard errors are understated, so p-values are too small and predictors look more significant than they really are.
Remedies:
| Method | Variance Structure | When to Use |
|---|---|---|
| Quasi-Poisson | \(\text{Var}(Y) = \phi\lambda\) | Moderate overdispersion |
| Negative Binomial | \(\text{Var}(Y) = \lambda + \lambda^2/\theta\) | Heavy overdispersion |
The trap: If one variable perfectly predicts the outcome (e.g., all firms with ROA < −5% are ST, all with ROA > −5% are not), then MLE tries to push \(\beta \to \infty\).
Symptoms: Astronomical coefficients, infinite standard errors, p-values near 1, convergence warnings
Why it happens: The likelihood keeps increasing as \(|\beta|\) grows — there is no finite maximum.
Solutions: Firth’s penalized likelihood, L1/L2 regularization, or removing (or coarsening) the perfectly predictive variable.
Rule of thumb: If any coefficient exceeds 10 in absolute value or any SE exceeds 100, suspect separation.
The temptation: A 10th-degree polynomial fits 11 points with \(R^2 = 1.0\).
The truth: It will fail catastrophically on new data.
Occam’s Razor: “Entia non sunt multiplicanda praeter necessitatem” (entities should not be multiplied beyond necessity). A simple logistic regression often outperforms deep learning: it is more robust, more interpretable, and easier to deploy.
When you have many predictors, regularization prevents overfitting by penalizing large coefficients:
| Method | Penalty | Key Property |
|---|---|---|
| Ridge (L2) | \(\lambda\sum_{j=1}^p \beta_j^2\) | Shrinks all coefficients toward zero; never exactly zero |
| Lasso (L1) | \(\lambda\sum_{j=1}^p \lvert\beta_j\rvert\) | Can set coefficients exactly to zero → variable selection |
| Elastic Net | \(\lambda_1\sum\lvert\beta_j\rvert + \lambda_2\sum\beta_j^2\) | Combines both: sparsity + stability |
Ridge closed-form solution:
\[\large{ \hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{Y} }\]
Note: Adding \(\lambda\mathbf{I}\) guarantees invertibility — this solves multicollinearity!
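The closed form is easiest to see in the single-predictor case, where \(\mathbf{X}'\mathbf{X}\) and \(\mathbf{X}'\mathbf{Y}\) are scalars and the estimator reduces to \(\hat{\beta} = X'Y / (X'X + \lambda)\). A sketch on made-up data (a centered predictor with roughly \(y = 2x\)), showing how growing \(\lambda\) shrinks the coefficient toward zero without ever reaching it:

```python
# Ridge closed form, scalar case: beta = X'Y / (X'X + lambda)
def ridge_1d(x, y, lam):
    xtx = sum(xi * xi for xi in x)                       # X'X
    xty = sum(xi * yi for xi, yi in zip(x, y))           # X'Y
    return xty / (xtx + lam)                             # always finite for lam > 0

x = [-2.0, -1.0, 0.0, 1.0, 2.0]      # centered predictor (toy data)
y = [-3.9, -2.1, 0.2, 1.8, 4.0]      # roughly y = 2x plus noise

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda = {lam:6.1f} -> beta = {ridge_1d(x, y, lam):.4f}")
```

At \(\lambda = 0\) this reproduces the OLS estimate; as \(\lambda\) grows, the coefficient shrinks monotonically, mirroring the RidgeCV behavior in the table below.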
Experiment: 300 YRD companies, 3 financial features → current ratio
Cross-validation results:
| Method | Optimal λ | R² | Key Difference |
|---|---|---|---|
| OLS (baseline) | — | 0.1022 | All coefficients unrestricted |
| RidgeCV | 104.81 | 0.0955 | All coefficients shrunk, none zero |
| LassoCV | 0.3728 | 0.0580 | Some coefficients driven to zero |
What happens if you use OLS for binary outcomes?
In a simulation with 200 firms, fitting \(P(\text{Default}) = \beta_0 + \beta_1 X\) via OLS produces fitted “probabilities” outside \([0, 1]\).
The absurdity: A probability of 1.3 or −0.2 has no meaning.
Why logistic regression fixes this: The sigmoid function \(\frac{1}{1+e^{-\eta}}\) is bounded between 0 and 1 by mathematical construction — no matter how extreme the inputs.
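This boundedness is easy to demonstrate. The sketch below uses a numerically stable form of the sigmoid (the naive \(1/(1+e^{-\eta})\) overflows for very large negative \(\eta\)):

```python
import math

def sigmoid(eta):
    # Numerically stable logistic function: avoids overflow for large |eta|
    if eta >= 0:
        return 1 / (1 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1 + z)

# No matter how extreme the linear predictor, the output stays in [0, 1]
for eta in (-1000, -10, 0, 10, 1000):
    print(f"eta = {eta:>6} -> p = {sigmoid(eta):.6f}")
```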
Practical note: Despite this flaw, the Linear Probability Model (LPM) is still widely used in economics. Why? Because its coefficients are directly interpretable as marginal effects. But for prediction and risk scoring, logistic regression is generally preferred.
Setup: Construct data where \(X \leq 5 \Rightarrow Y = 0\) and \(X > 5 \Rightarrow Y = 1\) — zero overlap.
Result: Logistic regression coefficient → ~5,000; SE → astronomically large; p-value → 1.0
| Topic | Key Takeaway |
|---|---|
| Exponential Family | Unified distribution framework: Normal, Binomial, Poisson, Gamma |
| GLM Components | Random + Systematic + Link function |
| Logistic Regression | Binary outcomes; interpret via Odds Ratios (\(e^{\beta}\)) |
| ST Prediction Case | ROA strongest predictor (OR = 0.924); AUC = 0.84 |
| Poisson Regression | Count data; interpret via IRR (\(e^{\beta}\)) |
| Overdispersion | Always test \(\phi\); use Quasi-Poisson or Negative Binomial |
| Perfect Separation | MLE fails → use Firth’s or regularization |
| Ridge (L2) | Shrinks all coefficients; fixes multicollinearity |
| Lasso (L1) | Automatic variable selection (exact zeros) |
| Elastic Net | Best of both: sparsity + stability |
The meta-lesson: Always match your model to your data’s distribution. OLS is just one member of a much larger family.