Simple: \(Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i\) (one predictor)
Multiple: \(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i\)
Why go multiple? The key advantage is the partial regression interpretation: each \(\hat{\beta}_j\) estimates the association between \(X_j\) and \(Y\) holding the other predictors fixed (ceteris paribus), which is what lets regression control for observed confounders.
\[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
where \(\mathbf{Y}_{n \times 1}\), \(\mathbf{X}_{n \times (p+1)}\), \(\boldsymbol{\beta}_{(p+1) \times 1}\), \(\boldsymbol{\varepsilon}_{n \times 1}\)
OLS solution: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
Variance-covariance: \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\)
The hat matrix: \(\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}\) where \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\)
The diagonal elements \(h_{ii}\) measure leverage — how much observation \(i\) influences its own fitted value.
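The matrix formulas above are easy to verify numerically. A minimal numpy sketch on simulated data (the sample size, coefficients, and noise level are illustrative assumptions, not taken from any example in these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y          # OLS solution (X'X)^{-1} X'y
H = X @ XtX_inv @ X.T                 # hat matrix
leverage = np.diag(H)                 # h_ii: leverage of each observation

# The leverages sum to the number of estimated parameters, trace(H) = p + 1
print(round(leverage.sum(), 6))       # → 3.0
```

A useful consequence: since the \(h_{ii}\) sum to \(p+1\), the average leverage is \((p+1)/n\), and observations far above that average deserve a closer look.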
| # | Assumption | Violation Consequence | Test |
|---|---|---|---|
| 1 | Linearity | Biased estimates | Residual plots |
| 2 | Random sampling | Invalid inference | Design check |
| 3 | No perfect collinearity | \((\mathbf{X}'\mathbf{X})\) not invertible | VIF |
| 4 | \(E(\varepsilon \| \mathbf{X}) = 0\) | Biased \(\hat{\beta}\) | Omitted variable test |
| 5 | Homoscedasticity | Invalid SE | Breusch-Pagan |
| 6 | \(\varepsilon \sim N(0,\sigma^2)\) | Invalid small-sample tests | Shapiro-Wilk |
Under Assumptions 1–5: OLS is BLUE (Gauss-Markov Theorem).
Adding Assumption 6: OLS is the MLE and fully efficient.
\[R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum \hat{\varepsilon}_i^2}{\sum(Y_i - \bar{Y})^2}\]
\[\bar{R}^2 = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - (1-R^2)\frac{n-1}{n-p-1}\]
Critical difference: \(R^2\) never decreases when a predictor is added, while Adjusted \(R^2\) can decrease. Only the adjusted version is suitable for comparing models with different numbers of predictors.
Rule of thumb: adding a variable increases Adjusted \(R^2\) precisely when its coefficient has \(|t| > 1\).
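The contrast can be seen directly by fitting a model with and without a pure-noise predictor; this numpy sketch uses assumed synthetic data:

```python
import numpy as np

def r2_and_adj(y, X):
    """R^2 and adjusted R^2 for an OLS fit; X includes the intercept column."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    n, k = X.shape                      # k = p + 1 parameters
    r2 = 1 - sse / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k)
    return r2, adj

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
noise = rng.normal(size=n)              # predictor unrelated to y
y = 2 + 3 * x1 + rng.normal(size=n)

X_base = np.column_stack([np.ones(n), x1])
X_plus = np.column_stack([X_base, noise])
r2_b, adj_b = r2_and_adj(y, X_base)
r2_p, adj_p = r2_and_adj(y, X_plus)
# R^2 can only go up when a column is added; adjusted R^2 need not
print(r2_p >= r2_b)                     # → True
```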
\[F = \frac{SSR/p}{SSE/(n-p-1)} = \frac{R^2/p}{(1-R^2)/(n-p-1)} \sim F_{p, \; n-p-1}\]
Testing: \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) (all slopes are zero)
Individual coefficients: \(t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim t_{n-p-1}\)
Important nuance: the overall F-test can be highly significant even when no individual t-test is. This is a classic symptom of multicollinearity: correlated predictors are jointly informative but individually hard to separate.
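That tension between the joint F-test and the individual t-tests can be reproduced with two nearly collinear predictors; a numpy sketch on assumed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 2
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
y = 1 + x1 + x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
sse = resid @ resid
sst = ((y - y.mean()) ** 2).sum()
r2 = 1 - sse / sst
F = (r2 / p) / ((1 - r2) / (n - p - 1))    # overall F from R^2

sigma2 = sse / (n - p - 1)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t = beta / se
# Compare the overall F with the slope t-statistics: F is large while the
# collinearity inflates both standard errors
print(round(F, 1), np.round(t[1:], 2))
```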
\[\text{AIC} = n \ln(SSE/n) + 2(p+1)\]
\[\text{BIC} = n \ln(SSE/n) + \ln(n)(p+1)\]
| Criterion | Penalty | Property |
|---|---|---|
| AIC | \(2k\) | Tends to select larger models |
| BIC | \(\ln(n) \cdot k\) | Heavier penalty; more parsimonious |
Model selection rule: Lower is better. Choose the model with the minimum AIC or BIC.
When AIC and BIC disagree: BIC is generally preferred for explanation (finding the “true” model), AIC for prediction.
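The two SSE-based formulas above differ only in the per-parameter penalty (\(2\) vs. \(\ln n\)), so for \(n > e^2 \approx 7.4\) BIC always punishes an extra parameter harder. A numpy sketch on assumed synthetic data:

```python
import numpy as np

def aic_bic(y, X):
    """AIC and BIC from the SSE form: n*ln(SSE/n) + penalty * (p + 1)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = ((y - X @ beta) ** 2).sum()
    n, k = X.shape
    aic = n * np.log(sse / n) + 2 * k
    bic = n * np.log(sse / n) + np.log(n) * k
    return aic, bic

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
z = rng.normal(size=n)                      # irrelevant predictor
y = 2 + 3 * x + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([X1, z])
aic1, bic1 = aic_bic(y, X1)
aic2, bic2 = aic_bic(y, X2)
# BIC's penalty for the extra parameter exceeds AIC's by ln(n) - 2
print((bic2 - bic1) > (aic2 - aic1))        # → True
```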
Data: 1,938 YRD listed companies.
Model: \(\text{ROE} = \beta_0 + \beta_1\ln(\text{Assets}) + \beta_2\text{DebtRatio} + \beta_3\text{AssetTurnover} + \varepsilon\)
| Variable | Coeff. | Robust SE | \(t\) | \(p\) |
|---|---|---|---|---|
| \(\ln(\text{Assets})\) | 1.809 | 0.527 | 3.43 | 0.001 |
| Debt Ratio | −0.149 | 0.014 | −10.61 | < 0.001 |
| Asset Turnover | 7.803 | 1.267 | 6.16 | < 0.001 |
Variance Inflation Factor:
\[\text{VIF}_j = \frac{1}{1 - R_j^2}\]
where \(R_j^2\) is from regressing \(X_j\) on all other predictors.
| Variable | VIF | Concern? |
|---|---|---|
| \(\ln(\text{Assets})\) | 6.80 | Moderate ⚠️ |
| Debt Ratio | 6.70 | Moderate ⚠️ |
| Asset Turnover | 2.88 | Low ✓ |
Rules of thumb: VIF > 5 → concern; VIF > 10 → serious problem.
Why VIF matters: High VIF → inflated standard errors → unstable coefficients.
Fix: Drop one correlated variable, or use regularization (Ridge/Lasso).
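The VIF definition translates directly into code: regress each predictor on the others and invert \(1 - R_j^2\). A numpy sketch with one deliberately correlated pair (the data are assumed, not the YRD sample):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), from regressing column j on the other columns.
    X holds predictor columns only (no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        Xj = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(Xj, X[:, j], rcond=None)[0]
        resid = X[:, j] - Xj @ beta
        sst = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        r2j = 1 - (resid @ resid) / sst
        out[j] = 1 / (1 - r2j)
    return out

rng = np.random.default_rng(4)
n = 300
a = rng.normal(size=n)
b = a + rng.normal(scale=0.5, size=n)   # strongly correlated with a
c = rng.normal(size=n)                  # independent predictor
print(np.round(vif(np.column_stack([a, b, c])), 2))
```

The correlated pair (`a`, `b`) should come back with VIFs around 5, the independent column near 1, matching the rules of thumb above.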
| Model | Variables | \(R^2\) | Adj. \(R^2\) | AIC | BIC |
|---|---|---|---|---|---|
| M1 | ln(Assets) only | 0.0197 | 0.0192 | 14535 | 14546 |
| M2 | +DebtRatio | 0.0741 | 0.0731 | 14427 | 14444 |
| M3 (Full) | +AssetTurnover | 0.1235 | 0.1221 | 14312 | 14334 |
Full model wins across all criteria. Each variable contributes meaningfully.
Partial F-test: Testing whether AssetTurnover significantly improves M2:
\(F = \frac{(SSE_{M2} - SSE_{M3})/1}{SSE_{M3}/(n-4)} = 109.2\), \(p \approx 0\) → Yes, significant improvement
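The partial F-test above compares the SSE of the restricted and full models; a numpy sketch of the same computation on assumed synthetic data (one added predictor, so \(q = 1\) restriction):

```python
import numpy as np

def sse(y, X):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(5)
n = 400
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 0.5 * x1 + 0.5 * x2 + 0.8 * x3 + rng.normal(size=n)

X_restricted = np.column_stack([np.ones(n), x1, x2])   # analogous to M2
X_full = np.column_stack([X_restricted, x3])           # analogous to M3
q = 1                                                  # restrictions tested
F = ((sse(y, X_restricted) - sse(y, X_full)) / q) / (sse(y, X_full) / (n - 4))
print(F > 10)       # the added predictor clearly improves the fit
```

With \(q = 1\), this partial F equals the square of the added coefficient's (non-robust) t-statistic in the full model.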
Stepwise selection — what it is: automatically adding or removing variables based on p-values.
Why it’s dangerous: the reported \(R^2\) is inflated, the coefficients are biased, and the p-values are invalid, because the final statistics ignore the data-driven search that produced the model.
Better alternatives:
| Approach | Advantage |
|---|---|
| Domain knowledge | Theory-driven variable selection |
| Lasso (L1) | Automatic selection with shrinkage |
| Ridge (L2) | Handles collinearity without selection |
Rule: Use theory first, regularization second, stepwise never.
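As a sense of what the regularization route looks like, here is a minimal ridge sketch using the closed form \(\hat{\boldsymbol{\beta}}_\lambda = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{Y}\) on centered data (a simplified illustration with assumed synthetic data, not a full library implementation; in practice predictors are also standardized before penalizing):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge slopes via (Xc'Xc + lam*I)^{-1} Xc'yc on centered data,
    leaving the intercept unpenalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

rng = np.random.default_rng(6)
n = 200
a = rng.normal(size=n)
b = a + rng.normal(scale=0.05, size=n)   # nearly collinear pair
y = 1 + a + b + rng.normal(size=n)
X = np.column_stack([a, b])

_, beta_ols = ridge(X, y, 0.0)           # lam = 0 reduces to OLS
_, beta_ridge = ridge(X, y, 10.0)
# The penalty shrinks the coefficient vector, stabilizing the collinear pair
print(np.linalg.norm(beta_ridge) <= np.linalg.norm(beta_ols))   # → True
```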
Setup: You estimate \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\)
Common mistake: Interpreting every coefficient as a causal effect.
Reality: only the coefficient on the focal exposure (the variable whose confounders the model was specified to control) supports a causal reading; the coefficients on the control variables are generally not causal effects.
The name “Table 2 Fallacy” comes from epidemiology papers, where Table 1 gives descriptive statistics and Table 2 reports the adjusted regression results; researchers often discuss all coefficients as if they were equally interpretable.
Setup: Estimate \(Y = \beta_0 + \beta_1 X_1\), then add a group indicator \(G\).
| Model | \(\hat{\beta}_1\) | Interpretation |
|---|---|---|
| Without \(G\) | +1.77 | \(X_1\) appears to increase \(Y\) |
| With \(G\) | −0.56 | \(X_1\) actually decreases \(Y\) |
The coefficient reversed sign!
Why? The group variable \(G\) was a confounder — correlated with both \(X_1\) and \(Y\). Omitting \(G\) caused omitted variable bias.
Lesson: Regression coefficients can change completely — even reverse — when you add or remove a variable. This is why theory must guide your model specification.
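A sign reversal of this kind is easy to reproduce: let a binary confounder drive both \(X_1\) and \(Y\), and compare the short and long regressions. A numpy sketch on assumed synthetic data (the coefficients here are illustrative, not the +1.77/−0.56 values above):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
g = rng.integers(0, 2, size=n).astype(float)     # binary confounder
x1 = 2 * g + rng.normal(size=n)                  # x1 depends on g
y = -0.5 * x1 + 3 * g + rng.normal(size=n)       # true effect of x1 is negative

def slope_of_x1(cols):
    X = np.column_stack([np.ones(n)] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_short = slope_of_x1([x1])       # omits the confounder: biased upward
b_long = slope_of_x1([x1, g])     # controls for it: recovers the sign
print(b_short > 0, b_long < 0)    # → True True: the sign flips
```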
Setup: \(Z\) is uncorrelated with \(Y\) (\(r_{ZY} = 0.07\)), but correlated with \(X\) (\(r_{ZX} = 0.63\)).
| Model | \(R^2\) | What happened? |
|---|---|---|
| \(Y \sim X\) | 0.51 | Baseline |
| \(Y \sim X + Z\) | 0.80 | \(R^2\) jumped by 0.29! |
Paradox: Adding a variable with essentially no correlation with \(Y\) dramatically improves the model.
Mechanism: \(Z\) “suppresses” irrelevant variance in \(X\), purifying \(X\)’s signal.
Lesson: A variable’s usefulness in regression cannot be determined by its bivariate correlation with \(Y\) alone.
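Classical suppression can be simulated by building \(X\) as signal plus nuisance variance, where the nuisance \(Z\) never enters \(Y\). A numpy sketch with assumed synthetic data (the correlations and \(R^2\) values will differ from the 0.07/0.63 and 0.51/0.80 figures above):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5000
t = rng.normal(size=n)              # signal shared by X and Y
z = rng.normal(size=n)              # nuisance variance: in X, absent from Y
x = t + z
y = t + 0.3 * rng.normal(size=n)

def r2(cols):
    X = np.column_stack([np.ones(n)] + cols)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

print(round(abs(np.corrcoef(z, y)[0, 1]), 2))   # z is ~uncorrelated with y
print(r2([x]) < r2([x, z]))                     # → True: R^2 jumps anyway
```

The mechanism is visible in the design: given both `x` and `z`, the regression can recover the pure signal `t = x - z`, which it cannot do from `x` alone.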
| Concept | Key Takeaway |
|---|---|
| Multiple regression | Ceteris paribus interpretation; controls for confounders |
| CLRM assumptions | Zero conditional mean (\(E(\varepsilon\|X)=0\)) is the most critical |
| \(R^2\) vs. Adj. \(R^2\) | Always use Adjusted \(R^2\) for model comparison |
| F-test vs. t-tests | F significant + all t’s insignificant → multicollinearity |
| VIF | > 5 warrants concern; > 10 is a serious problem |
| AIC/BIC | Lower is better; BIC more conservative |
| Stepwise | Never use — inflated R², biased coefficients, invalid p-values |
| Table 2 Fallacy | Not all coefficients in a regression are causal |
| Suppressor | A variable’s value depends on the full model, not bivariate \(r\) |