Simple: \(Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i\) (one predictor)
Multiple: \(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i\)
Why go multiple? The key advantage is the partial regression interpretation: each \(\hat{\beta}_j\) estimates the association between \(X_j\) and \(Y\) holding the other predictors fixed (ceteris paribus), which is what lets regression control for observed confounders.
\[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
where \(\mathbf{Y}_{n \times 1}\), \(\mathbf{X}_{n \times (p+1)}\), \(\boldsymbol{\beta}_{(p+1) \times 1}\), \(\boldsymbol{\varepsilon}_{n \times 1}\)
OLS solution: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
Variance-covariance: \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\)
The hat matrix: \(\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}\) where \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\)
The diagonal elements \(h_{ii}\) measure leverage — how much observation \(i\) influences its own fitted value.
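The matrix formulas above are easy to verify numerically. A minimal numpy sketch on simulated data (the sample size, coefficients, and noise level are illustrative assumptions, not taken from any example in these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y          # OLS solution (X'X)^{-1} X'y
H = X @ XtX_inv @ X.T                 # hat matrix
leverage = np.diag(H)                 # h_ii: leverage of each observation

# The leverages sum to the number of estimated parameters, trace(H) = p + 1
print(round(leverage.sum(), 6))       # → 3.0
```

A useful consequence: since the \(h_{ii}\) sum to \(p+1\), the average leverage is \((p+1)/n\), and observations far above that average deserve a closer look.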
| # | Assumption | Violation Consequence | Test |
|---|---|---|---|
| 1 | Linearity | Biased estimates | Residual plots |
| 2 | Random sampling | Invalid inference | Design check |
| 3 | No perfect collinearity | \((\mathbf{X}'\mathbf{X})\) not invertible | VIF |
| 4 | \(E(\varepsilon \| \mathbf{X}) = 0\) | Biased \(\hat{\beta}\) | Omitted variable test |
| 5 | Homoscedasticity | Invalid SE | Breusch-Pagan |
| 6 | \(\varepsilon \sim N(0,\sigma^2)\) | Invalid small-sample tests | Shapiro-Wilk |
Under Assumptions 1–5: OLS is BLUE (Gauss-Markov Theorem).
Adding Assumption 6: OLS is the MLE and fully efficient.
\[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum \hat{\varepsilon}_i^2}{\sum(Y_i - \bar{Y})^2}\]
\[\bar{R}^2 = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - (1-R^2)\frac{n-1}{n-p-1}\]
Critical difference: \(R^2\) never decreases when a predictor is added, while Adjusted \(R^2\) can decrease. Only the adjusted version is suitable for comparing models with different numbers of predictors.
Rule of thumb: adding a variable increases Adjusted \(R^2\) precisely when its coefficient has \(|t| > 1\).
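The contrast can be seen directly by fitting a model with and without a pure-noise predictor; this numpy sketch uses assumed synthetic data:

```python
import numpy as np

def r2_and_adj(y, X):
    """R^2 and adjusted R^2 for an OLS fit; X includes the intercept column."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    n, k = X.shape                      # k = p + 1 parameters
    r2 = 1 - sse / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k)
    return r2, adj

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
noise = rng.normal(size=n)              # predictor unrelated to y
y = 2 + 3 * x1 + rng.normal(size=n)

X_base = np.column_stack([np.ones(n), x1])
X_plus = np.column_stack([X_base, noise])
r2_b, adj_b = r2_and_adj(y, X_base)
r2_p, adj_p = r2_and_adj(y, X_plus)
# R^2 can only go up when a column is added; adjusted R^2 need not
print(r2_p >= r2_b)                     # → True
```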
\[F = \frac{SSR/p}{SSE/(n-p-1)} = \frac{R^2/p}{(1-R^2)/(n-p-1)} \sim F_{p, \; n-p-1}\]
Testing: \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) (all slopes are zero)
Individual coefficients: \(t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim t_{n-p-1}\)
Important nuance: the overall F-test can be highly significant even when no individual t-test is. This is a classic symptom of multicollinearity: correlated predictors are jointly informative but individually hard to separate.
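That tension between the joint F-test and the individual t-tests can be reproduced with two nearly collinear predictors; a numpy sketch on assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 2
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
y = 1 + x1 + x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
sse = resid @ resid
sst = ((y - y.mean()) ** 2).sum()
r2 = 1 - sse / sst
F = (r2 / p) / ((1 - r2) / (n - p - 1))    # overall F from R^2

sigma2 = sse / (n - p - 1)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t = beta / se
# Compare the overall F with the slope t-statistics: F is large while the
# collinearity inflates both standard errors
print(round(F, 1), np.round(t[1:], 2))
```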
\[\text{AIC} = n \ln(SSE/n) + 2(p+1)\]
\[\text{BIC} = n \ln(SSE/n) + \ln(n)(p+1)\]
| Criterion | Penalty | Property |
|---|---|---|
| AIC | \(2k\) | Tends to select larger models |
| BIC | \(\ln(n) \cdot k\) | Heavier penalty; more parsimonious |
Model selection rule: Lower is better. Choose the model with the minimum AIC or BIC.
When AIC and BIC disagree: BIC is generally preferred for explanation (finding the “true” model), AIC for prediction.
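The two SSE-based formulas above differ only in the per-parameter penalty (\(2\) vs. \(\ln n\)), so for \(n > e^2 \approx 7.4\) BIC always punishes an extra parameter harder. A numpy sketch on assumed synthetic data:

```python
import numpy as np

def aic_bic(y, X):
    """AIC and BIC from the SSE form: n*ln(SSE/n) + penalty * (p + 1)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = ((y - X @ beta) ** 2).sum()
    n, k = X.shape
    aic = n * np.log(sse / n) + 2 * k
    bic = n * np.log(sse / n) + np.log(n) * k
    return aic, bic

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
z = rng.normal(size=n)                      # irrelevant predictor
y = 2 + 3 * x + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([X1, z])
aic1, bic1 = aic_bic(y, X1)
aic2, bic2 = aic_bic(y, X2)
# BIC's penalty for the extra parameter exceeds AIC's by ln(n) - 2
print((bic2 - bic1) > (aic2 - aic1))        # → True
```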
Data: 1,938 YRD listed companies.
Model: \(\text{ROE} = \beta_0 + \beta_1\ln(\text{Assets}) + \beta_2\text{DebtRatio} + \beta_3\text{AssetTurnover} + \varepsilon\)
| Variable | Coeff. | Robust SE | \(t\) | \(p\) |
|---|---|---|---|---|
| \(\ln(\text{Assets})\) | 1.809 | 0.527 | 3.43 | 0.001 |
| Debt Ratio | −0.149 | 0.014 | −10.61 | < 0.001 |
| Asset Turnover | 7.803 | 1.267 | 6.16 | < 0.001 |
Variance Inflation Factor:
\[\text{VIF}_j = \frac{1}{1 - R_j^2}\]
where \(R_j^2\) is from regressing \(X_j\) on all other predictors.
| Variable | VIF | Concern? |
|---|---|---|
| \(\ln(\text{Assets})\) | 6.80 | Moderate ⚠️ |
| Debt Ratio | 6.70 | Moderate ⚠️ |
| Asset Turnover | 2.88 | Low ✓ |
Rules of thumb: VIF > 5 → concern; VIF > 10 → serious problem.
Why VIF matters: High VIF → inflated standard errors → unstable coefficients.
Fix: Drop one correlated variable, or use regularization (Ridge/Lasso).
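The VIF definition translates directly into code: regress each predictor on the others and invert \(1 - R_j^2\). A numpy sketch with one deliberately correlated pair (the data are assumed, not the YRD sample):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), from regressing column j on the other columns.
    X holds predictor columns only (no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        Xj = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(Xj, X[:, j], rcond=None)[0]
        resid = X[:, j] - Xj @ beta
        sst = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        r2j = 1 - (resid @ resid) / sst
        out[j] = 1 / (1 - r2j)
    return out

rng = np.random.default_rng(4)
n = 300
a = rng.normal(size=n)
b = a + rng.normal(scale=0.5, size=n)   # strongly correlated with a
c = rng.normal(size=n)                  # independent predictor
print(np.round(vif(np.column_stack([a, b, c])), 2))
```

The correlated pair (`a`, `b`) should come back with VIFs around 5, the independent column near 1, matching the rules of thumb above.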
| Model | Variables | \(R^2\) | Adj. \(R^2\) | AIC | BIC |
|---|---|---|---|---|---|
| M1 | ln(Assets) only | 0.0197 | 0.0192 | 14535 | 14546 |
| M2 | +DebtRatio | 0.0741 | 0.0731 | 14427 | 14444 |
| M3 (Full) | +AssetTurnover | 0.1235 | 0.1221 | 14312 | 14334 |
Full model wins across all criteria. Each variable contributes meaningfully.
Partial F-test: Testing whether AssetTurnover significantly improves M2:
\(F = \frac{(SSE_{M2} - SSE_{M3})/1}{SSE_{M3}/(n-4)} = 109.2\), \(p \approx 0\) → Yes, significant improvement
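The partial F-test above compares the SSE of the restricted and full models; a numpy sketch of the same computation on assumed synthetic data (one added predictor, so \(q = 1\) restriction):

```python
import numpy as np

def sse(y, X):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(5)
n = 400
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 0.5 * x1 + 0.5 * x2 + 0.8 * x3 + rng.normal(size=n)

X_restricted = np.column_stack([np.ones(n), x1, x2])   # analogous to M2
X_full = np.column_stack([X_restricted, x3])           # analogous to M3
q = 1                                                  # restrictions tested
F = ((sse(y, X_restricted) - sse(y, X_full)) / q) / (sse(y, X_full) / (n - 4))
print(F > 10)       # the added predictor clearly improves the fit
```

With \(q = 1\), this partial F equals the square of the added coefficient's (non-robust) t-statistic in the full model.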
Stepwise selection — what it is: automatically adding or removing variables based on p-values.
Why it’s dangerous: the reported \(R^2\) is inflated, the coefficients are biased, and the p-values are invalid, because the final statistics ignore the data-driven search that produced the model.
Better alternatives:
| Approach | Advantage |
|---|---|
| Domain knowledge | Theory-driven variable selection |
| Lasso (L1) | Automatic selection with shrinkage |
| Ridge (L2) | Handles collinearity without selection |
Rule: Use theory first, regularization second, stepwise never.
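As a sense of what the regularization route looks like, here is a minimal ridge sketch using the closed form \(\hat{\boldsymbol{\beta}}_\lambda = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{Y}\) on centered data (a simplified illustration with assumed synthetic data, not a full library implementation; in practice predictors are also standardized before penalizing):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge slopes via (Xc'Xc + lam*I)^{-1} Xc'yc on centered data,
    leaving the intercept unpenalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

rng = np.random.default_rng(6)
n = 200
a = rng.normal(size=n)
b = a + rng.normal(scale=0.05, size=n)   # nearly collinear pair
y = 1 + a + b + rng.normal(size=n)
X = np.column_stack([a, b])

_, beta_ols = ridge(X, y, 0.0)           # lam = 0 reduces to OLS
_, beta_ridge = ridge(X, y, 10.0)
# The penalty shrinks the coefficient vector, stabilizing the collinear pair
print(np.linalg.norm(beta_ridge) <= np.linalg.norm(beta_ols))   # → True
```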
Setup: You estimate \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\)
Common mistake: Interpreting every coefficient as a causal effect.
Reality: only the coefficient on the focal exposure (the variable whose confounders the model was specified to control) supports a causal reading; the coefficients on the control variables are generally not causal effects.
The name “Table 2 Fallacy” comes from epidemiology papers, where Table 1 gives descriptive statistics and Table 2 reports the adjusted regression results; researchers often discuss all coefficients as if they were equally interpretable.
Setup: Estimate \(Y = \beta_0 + \beta_1 X_1\), then add a group indicator \(G\).
| Model | \(\hat{\beta}_1\) | Interpretation |
|---|---|---|
| Without \(G\) | +1.77 | \(X_1\) appears to increase \(Y\) |
| With \(G\) | −0.56 | \(X_1\) actually decreases \(Y\) |
The coefficient reversed sign!
Why? The group variable \(G\) was a confounder — correlated with both \(X_1\) and \(Y\). Omitting \(G\) caused omitted variable bias.
Lesson: Regression coefficients can change completely — even reverse — when you add or remove a variable. This is why theory must guide your model specification.
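A sign reversal of this kind is easy to reproduce: let a binary confounder drive both \(X_1\) and \(Y\), and compare the short and long regressions. A numpy sketch on assumed synthetic data (the coefficients here are illustrative, not the +1.77/−0.56 values above):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
g = rng.integers(0, 2, size=n).astype(float)     # binary confounder
x1 = 2 * g + rng.normal(size=n)                  # x1 depends on g
y = -0.5 * x1 + 3 * g + rng.normal(size=n)       # true effect of x1 is negative

def slope_of_x1(cols):
    X = np.column_stack([np.ones(n)] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_short = slope_of_x1([x1])       # omits the confounder: biased upward
b_long = slope_of_x1([x1, g])     # controls for it: recovers the sign
print(b_short > 0, b_long < 0)    # → True True: the sign flips
```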
Setup: \(Z\) is uncorrelated with \(Y\) (\(r_{ZY} = 0.07\)), but correlated with \(X\) (\(r_{ZX} = 0.63\)).
| Model | \(R^2\) | What happened? |
|---|---|---|
| \(Y \sim X\) | 0.51 | Baseline |
| \(Y \sim X + Z\) | 0.80 | \(R^2\) jumped by 0.29! |
Paradox: Adding a variable with essentially no correlation with \(Y\) dramatically improves the model.
Mechanism: \(Z\) “suppresses” irrelevant variance in \(X\), purifying \(X\)’s signal.
Lesson: A variable’s usefulness in regression cannot be determined by its bivariate correlation with \(Y\) alone.
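Classical suppression can be simulated by building \(X\) as signal plus nuisance variance, where the nuisance \(Z\) never enters \(Y\). A numpy sketch with assumed synthetic data (the correlations and \(R^2\) values will differ from the 0.07/0.63 and 0.51/0.80 figures above):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5000
t = rng.normal(size=n)              # signal shared by X and Y
z = rng.normal(size=n)              # nuisance variance: in X, absent from Y
x = t + z
y = t + 0.3 * rng.normal(size=n)

def r2(cols):
    X = np.column_stack([np.ones(n)] + cols)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

print(round(abs(np.corrcoef(z, y)[0, 1]), 2))   # z is ~uncorrelated with y
print(r2([x]) < r2([x, z]))                     # → True: R^2 jumps anyway
```

The mechanism is visible in the design: given both `x` and `z`, the regression can recover the pure signal `t = x - z`, which it cannot do from `x` alone.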
| Concept | Key Takeaway |
|---|---|
| Multiple regression | Ceteris paribus interpretation; controls for confounders |
| CLRM assumptions | Zero conditional mean (\(E(\varepsilon\|X)=0\)) is the most critical |
| \(R^2\) vs. Adj. \(R^2\) | Always use Adjusted \(R^2\) for model comparison |
| F-test vs. t-tests | F significant + all t’s insignificant → multicollinearity |
| VIF | > 5 warrants concern; > 10 is a serious problem |
| AIC/BIC | Lower is better; BIC more conservative |
| Stepwise | Never use — inflated R², biased coefficients, invalid p-values |
| Table 2 Fallacy | Not all coefficients in a regression are causal |
| Suppressor | A variable’s value depends on the full model, not bivariate \(r\) |