10: Multivariate Regression Models

From Simple to Multiple Regression

Simple: \(Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i\) (one predictor)

Multiple: \(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i\)

Why go multiple?

  1. Omitted variable bias: If a variable affects \(Y\) and correlates with \(X_1\), omitting it biases \(\hat{\beta}_1\).
  2. Higher explanatory power: Multiple factors drive most real-world outcomes.
  3. Ceteris paribus: Each \(\hat{\beta}_j\) measures the effect of \(X_j\) holding all other variables constant.

The partial regression interpretation is the key advantage.
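The partial-regression reading can be made concrete via the Frisch–Waugh–Lovell theorem: the multiple-regression coefficient on \(X_1\) equals the slope from regressing the \(Y\)-residuals on the \(X_1\)-residuals, after both have been purged of the other predictors. A minimal NumPy sketch on simulated data (all variable names and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)           # x1 is correlated with x2
y = 2.0 + 1.5 * x1 - 1.0 * x2 + rng.normal(size=n)

# Full multiple regression: y on [1, x1, x2]
X = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Frisch-Waugh-Lovell: residualize y and x1 on the remaining regressors,
# then regress residual on residual; the slope matches the full model.
Z = np.column_stack([np.ones(n), x2])
ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
rx = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
beta_partial = (rx @ ry) / (rx @ rx)

print(beta_full[1], beta_partial)   # identical up to floating-point error
```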

Matrix Formulation

\[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]

where \(\mathbf{Y}_{n \times 1}\), \(\mathbf{X}_{n \times (p+1)}\), \(\boldsymbol{\beta}_{(p+1) \times 1}\), \(\boldsymbol{\varepsilon}_{n \times 1}\)

OLS solution: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)

Variance-covariance: \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\)

The hat matrix: \(\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}\) where \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\)

The diagonal elements \(h_{ii}\) measure leverage — how much observation \(i\) influences its own fitted value.
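These matrix formulas translate directly into code. The sketch below uses simulated data and forms \((\mathbf{X}'\mathbf{X})^{-1}\) explicitly only for exposition; in practice `lstsq` or a QR decomposition is numerically safer:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y        # OLS: (X'X)^{-1} X'y
H = X @ XtX_inv @ X.T               # hat matrix: y_hat = H y
leverage = np.diag(H)               # h_ii, leverage of each observation

# Two properties worth verifying: H is idempotent (H @ H == H),
# and the leverages sum to the number of coefficients, p + 1.
print(np.allclose(H @ H, H), leverage.sum())
```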

Classical Assumptions (CLRM)

| # | Assumption | Violation Consequence | Test |
|---|------------|-----------------------|------|
| 1 | Linearity | Biased estimates | Residual plots |
| 2 | Random sampling | Invalid inference | Design check |
| 3 | No perfect collinearity | \((\mathbf{X}'\mathbf{X})\) not invertible | VIF |
| 4 | \(E(\varepsilon \mid \mathbf{X}) = 0\) | Biased \(\hat{\beta}\) | Omitted variable test |
| 5 | Homoscedasticity | Invalid SE | Breusch-Pagan |
| 6 | \(\varepsilon \sim N(0,\sigma^2)\) | Invalid small-sample tests | Shapiro-Wilk |

Under Assumptions 1–5: OLS is BLUE (Gauss-Markov Theorem).

Adding Assumption 6: OLS is the MLE and fully efficient.

Model Assessment: R² vs. Adjusted R²

\[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum \hat{\varepsilon}_i^2}{\sum(Y_i - \bar{Y})^2}\]

\[\bar{R}^2 = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - (1-R^2)\frac{n-1}{n-p-1}\]

Critical difference:

  • \(R^2\) always increases when adding variables (even noise)
  • \(\bar{R}^2\) penalizes adding useless variables — can decrease

Rule of thumb: a variable improves the model when its coefficient has \(|t| > 1\); in fact, adding a variable increases Adjusted \(R^2\) exactly when the new coefficient's \(|t| > 1\).
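The contrast between the two measures is easy to demonstrate: fit a model, append a pure-noise column, and recompute both. A small sketch on simulated data (seed and sample size are arbitrary):

```python
import numpy as np

def r2_and_adj(X, y):
    n, k = X.shape                  # k = p + 1 (includes the intercept column)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = ((y - X @ beta) ** 2).sum()
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - sse / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k)
    return r2, adj

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

X_base = np.column_stack([np.ones(n), x])
X_plus = np.column_stack([X_base, rng.normal(size=n)])   # pure-noise column

r2_a, adj_a = r2_and_adj(X_base, y)
r2_b, adj_b = r2_and_adj(X_plus, y)
print(r2_b >= r2_a)      # R^2 never decreases when a column is added
print(adj_b - adj_a)     # the adjusted version can move either way
```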

Overall Significance: The F-Test

\[F = \frac{SSR/p}{SSE/(n-p-1)} = \frac{R^2/p}{(1-R^2)/(n-p-1)} \sim F_{p, \; n-p-1}\]

Testing: \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) (all slopes are zero)

Individual coefficients: \(t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim t_{n-p-1}\)

Important nuance:

  • F-test significant + all t-tests insignificant → multicollinearity
  • This happens when predictors are highly correlated with each other
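The two forms of the F statistic given above (in terms of sums of squares and in terms of \(R^2\)) are algebraically identical, which is easy to confirm numerically on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.3, 0.8]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
sse = ((y - X @ beta) ** 2).sum()          # residual sum of squares
sst = ((y - y.mean()) ** 2).sum()          # total sum of squares
ssr = sst - sse                            # explained sum of squares
r2 = 1 - sse / sst

F_from_ss = (ssr / p) / (sse / (n - p - 1))
F_from_r2 = (r2 / p) / ((1 - r2) / (n - p - 1))
print(np.isclose(F_from_ss, F_from_r2))    # the two forms agree
```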

Information Criteria: AIC and BIC

\[\text{AIC} = n \ln(SSE/n) + 2(p+1)\]

\[\text{BIC} = n \ln(SSE/n) + \ln(n)(p+1)\]

| Criterion | Penalty | Property |
|-----------|---------|----------|
| AIC | \(2k\) | Tends to select larger models |
| BIC | \(\ln(n) \cdot k\) | Heavier penalty; more parsimonious |

Here \(k = p + 1\), the number of estimated coefficients, matching the formulas above.

Model selection rule: Lower is better. Choose the model with the minimum AIC or BIC.

When AIC and BIC disagree: BIC is generally preferred for explanation (finding the “true” model), AIC for prediction.
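The definitions above are one-liners in code. A sketch on simulated data (illustrative seed and coefficients) that compares a true model against one padded with a noise column:

```python
import numpy as np

def aic_bic(sse, n, k):
    """Information criteria as defined above; k = p + 1 estimated coefficients."""
    aic = n * np.log(sse / n) + 2 * k
    bic = n * np.log(sse / n) + np.log(n) * k
    return aic, bic

def model_sse(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return ((y - X @ beta) ** 2).sum()

rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([X1, rng.normal(size=n)])   # adds a pure-noise column

aic1, bic1 = aic_bic(model_sse(X1, y), n, 2)
aic2, bic2 = aic_bic(model_sse(X2, y), n, 3)
print((aic1, bic1), (aic2, bic2))   # lower is better for both criteria
```

Note that BIC's per-parameter penalty \(\ln(n)\) exceeds AIC's penalty of 2 whenever \(n > e^2 \approx 7.4\), which is why BIC is the more conservative criterion in any realistic sample.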

Case: What Drives ROE in YRD Companies?

Data: 1,938 YRD listed companies.

Model: \(\text{ROE} = \beta_0 + \beta_1\ln(\text{Assets}) + \beta_2\text{DebtRatio} + \beta_3\text{AssetTurnover} + \varepsilon\)

| Variable | Coeff. | Robust SE | \(t\) | \(p\) |
|----------|--------|-----------|-------|-------|
| \(\ln(\text{Assets})\) | 1.809 | 0.527 | 3.43 | 0.001 |
| Debt Ratio | −0.149 | 0.014 | −10.61 | < 0.001 |
| Asset Turnover | 7.803 | 1.267 | 6.16 | < 0.001 |
  • \(R^2 = 0.1235\), \(F(3, 1934) = 90.85\), \(p \approx 0\)
  • Breusch-Pagan \(\text{LM} = 197.47\) → Severe heteroscedasticity

Multicollinearity Diagnosis: VIF

Variance Inflation Factor:

\[\text{VIF}_j = \frac{1}{1 - R_j^2}\]

where \(R_j^2\) is from regressing \(X_j\) on all other predictors.

| Variable | VIF | Concern? |
|----------|-----|----------|
| \(\ln(\text{Assets})\) | 6.80 | Moderate ⚠️ |
| Debt Ratio | 6.70 | Moderate ⚠️ |
| Asset Turnover | 2.88 | Low ✓ |

Rules of thumb: VIF > 5 → concern; VIF > 10 → serious problem.

Why VIF matters: High VIF → inflated standard errors → unstable coefficients.

Fix: Drop one correlated variable, or use regularization (Ridge/Lasso).
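The VIF definition maps directly to code: regress each predictor on the others and apply \(1/(1-R_j^2)\). A self-contained NumPy sketch on simulated data, with one deliberately near-collinear pair (names and coefficients are illustrative):

```python
import numpy as np

def vif(X):
    """VIF per column of X (no intercept column in X); each X_j is
    regressed on the remaining columns plus an intercept."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ beta
        r2 = 1 - (resid @ resid) / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
n = 400
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.3 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                     # independent of both

print(vif(np.column_stack([x1, x2, x3])))   # first two VIFs large, third near 1
```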

Model Comparison: Which Specification Wins?

| Model | Variables | \(R^2\) | Adj. \(R^2\) | AIC | BIC |
|-------|-----------|---------|--------------|-----|-----|
| M1 | ln(Assets) only | 0.0197 | 0.0192 | 14535 | 14546 |
| M2 | +DebtRatio | 0.0741 | 0.0731 | 14427 | 14444 |
| M3 (Full) | +AssetTurnover | 0.1235 | 0.1221 | 14312 | 14334 |

Full model wins across all criteria. Each variable contributes meaningfully.

Partial F-test: Testing whether AssetTurnover significantly improves M2:

\(F = \frac{(SSE_{M2} - SSE_{M3})/1}{SSE_{M3}/(n-4)} = 109.2\), \(p \approx 0\) → Yes, AssetTurnover significantly improves M2.
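The partial F statistic compares the residual sums of squares of a restricted and a full model. A sketch on simulated data (the variables here are stand-ins, not the YRD data):

```python
import numpy as np

def model_sse(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return ((y - X @ beta) ** 2).sum()

rng = np.random.default_rng(7)
n = 300
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 0.5 * x1 + 0.5 * x2 + 0.8 * x3 + rng.normal(size=n)

restricted = np.column_stack([np.ones(n), x1, x2])   # analogue of M2
full = np.column_stack([restricted, x3])             # analogue of M3
q = 1                                                # number of restrictions

sse_r, sse_f = model_sse(restricted, y), model_sse(full, y)
F = ((sse_r - sse_f) / q) / (sse_f / (n - full.shape[1]))

# With one restriction and classical (non-robust) standard errors,
# this partial F equals the squared t statistic of the added variable.
print(F)
```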

Dirty Work: The Stepwise Regression Trap

What it is: Automatically adding/removing variables based on p-values.

Why it’s dangerous:

  1. Inflated R²: Sequential testing exploits chance patterns
  2. Biased coefficients: Selected variables have “Winner’s Curse” — their effects are overstated
  3. Invalid p-values: Multiple testing without correction → actual \(\alpha \gg 0.05\)
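The third point is easy to see by simulation: regress pure noise on many pure-noise candidates, keep the "best" one, and count how often it clears the nominal 5% threshold. The parameters below (100 observations, 20 candidates, 500 simulations) are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_candidates, n_sims = 100, 20, 500
false_hits = 0
for _ in range(n_sims):
    y = rng.normal(size=n)
    X = rng.normal(size=(n, n_candidates))   # predictors unrelated to y
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_candidates)])
    t = np.abs(r) * np.sqrt((n - 2) / (1 - r ** 2))   # bivariate t statistics
    if t.max() > 1.96:                       # "best" candidate looks significant
        false_hits += 1

rate = false_hits / n_sims
print(rate)   # far above the nominal 5%
```

With 20 independent candidates the chance that at least one clears 5% is roughly \(1 - 0.95^{20} \approx 0.64\), which is what selection-by-p-value silently exploits.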

Better alternatives:

| Approach | Advantage |
|----------|-----------|
| Domain knowledge | Theory-driven variable selection |
| Lasso (L1) | Automatic selection with shrinkage |
| Ridge (L2) | Handles collinearity without selection |

Rule: Use theory first, regularization second, stepwise never.

Dirty Work: The Table 2 Fallacy

Setup: You estimate \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\)

Common mistake: Interpreting every coefficient as a causal effect.

Reality:

  • Only \(\beta_1\) (your variable of interest) has a causal design
  • \(\beta_2, \ldots, \beta_p\) are controls — not all are causal
  • Some controls are confounders (needed for causal identification)
  • Others are mediators (should NOT be controlled for)

The name “Table 2 Fallacy” comes from epidemiology (Westreich & Greenland, 2013), where Table 1 of a paper reports descriptive statistics and Table 2 reports the adjusted regression results — researchers often discuss all the coefficients in Table 2 as if they were equally interpretable.

Heuristic: Simpson’s Paradox in Regression

Setup: Estimate \(Y = \beta_0 + \beta_1 X_1\), then add a group indicator \(G\).

| Model | \(\hat{\beta}_1\) | Interpretation |
|-------|-------------------|----------------|
| Without \(G\) | +1.77 | \(X_1\) appears to increase \(Y\) |
| With \(G\) | −0.56 | \(X_1\) actually decreases \(Y\) |

The coefficient reversed sign!

Why? The group variable \(G\) was a confounder — correlated with both \(X_1\) and \(Y\). Omitting \(G\) caused omitted variable bias.

Lesson: Regression coefficients can change completely — even reverse — when you add or remove a variable. This is why theory must guide your model specification.
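A sign reversal of this kind can be reproduced in a few lines. In the simulation below (all numbers illustrative), the group indicator shifts both \(X_1\) and \(Y\) upward while the true within-group effect of \(X_1\) is negative:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1000
g = rng.integers(0, 2, size=n)                # binary group, the confounder
x = 3.0 * g + rng.normal(size=n)              # group pushes x up
y = -0.5 * x + 4.0 * g + rng.normal(size=n)   # true effect of x is negative

X_short = np.column_stack([np.ones(n), x])     # omits G
X_long = np.column_stack([np.ones(n), x, g])   # controls for G
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0][1]
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0][1]

print(b_short, b_long)   # positive without G, negative with G
```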

Heuristic: The Suppressor Variable

Setup: \(Z\) is uncorrelated with \(Y\) (\(r_{ZY} = 0.07\)), but correlated with \(X\) (\(r_{ZX} = 0.63\)).

| Model | \(R^2\) | What happened? |
|-------|---------|----------------|
| \(Y \sim X\) | 0.51 | Baseline |
| \(Y \sim X + Z\) | 0.80 | \(R^2\) jumped by 0.29! |

Paradox: Adding a variable with no correlation to Y dramatically improves the model.

Mechanism: \(Z\) “suppresses” irrelevant variance in \(X\), purifying \(X\)’s signal.

Lesson: A variable’s usefulness in regression cannot be determined by its bivariate correlation with \(Y\) alone.
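Suppression can be simulated by letting \(X\) carry both signal and nuisance variance, with \(Z\) tracking only the nuisance (the exact correlations below differ from the table above; the construction is illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2000
signal = rng.normal(size=n)
nuisance = rng.normal(size=n)              # irrelevant variance shared by x and z
x = signal + nuisance
z = nuisance + 0.3 * rng.normal(size=n)    # correlated with x, ~uncorrelated with y
y = signal + 0.5 * rng.normal(size=n)

def r2(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = ((y - X @ beta) ** 2).sum()
    return 1 - sse / ((y - y.mean()) ** 2).sum()

base = r2(np.column_stack([np.ones(n), x]), y)
with_z = r2(np.column_stack([np.ones(n), x, z]), y)
print(base, with_z)   # R^2 jumps although corr(z, y) is near zero
```

Including \(Z\) lets the model subtract the nuisance component out of \(X\), so the remaining variation in \(X\) tracks \(Y\) far more closely.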

Chapter Summary

| Concept | Key Takeaway |
|---------|--------------|
| Multiple regression | Ceteris paribus interpretation; controls for confounders |
| CLRM assumptions | Zero conditional mean (\(E(\varepsilon \mid X)=0\)) is the most critical |
| \(R^2\) vs. Adj. \(R^2\) | Always use Adjusted \(R^2\) for model comparison |
| F-test vs. t-tests | F significant + all t’s insignificant → multicollinearity |
| VIF | > 5 warrants concern; > 10 is a serious problem |
| AIC/BIC | Lower is better; BIC more conservative |
| Stepwise | Never use — inflated R², biased coefficients, invalid p-values |
| Table 2 Fallacy | Not all coefficients in a regression are causal |
| Suppressor | A variable’s value depends on the full model, not bivariate \(r\) |