09: Analysis of Variance (ANOVA)
Why Not Just Run Multiple t-Tests?
Problem: Comparing means across \(k = 5\) groups requires \(\binom{5}{2} = 10\) pairwise t-tests.
Family-wise error rate (FWER):
\[\text{FWER} = 1 - (1 - \alpha)^m\]
| Tests \(m\) | FWER (\(\alpha = 0.05\)) |
|---|---|
| 3 | 14.3% |
| 10 | 40.1% |
| 45 | 90.1% |
At 10 groups, 45 tests give a 90% chance of at least one false positive!
Solution: ANOVA uses a single global F-test first, then targeted post-hoc comparisons.
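The FWER formula can be checked by Monte Carlo: run \(m\) t-tests on data drawn from a single population and count how often at least one is "significant". A minimal sketch (group size, seed, and replication count are arbitrary choices, not from the source):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_fwer(m, alpha=0.05, n=30, reps=2000):
    """Fraction of replications with >= 1 false positive among m null t-tests."""
    hits = 0
    for _ in range(reps):
        any_sig = False
        for _ in range(m):
            # Both samples come from the SAME population, so every
            # rejection is a false positive.
            a = rng.normal(size=n)
            b = rng.normal(size=n)
            if stats.ttest_ind(a, b).pvalue < alpha:
                any_sig = True
                break
        hits += any_sig
    return hits / reps

analytic = 1 - 0.95**10        # formula value for m = 10: ~0.401
print(round(analytic, 3), round(simulated_fwer(10), 3))
```

The simulated rate lands close to the analytic \(1 - 0.95^{10} \approx 40\%\) from the table.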
SS Decomposition: The Core Logic
\[\underbrace{\sum_{ij}(Y_{ij} - \bar{Y})^2}_{SST} = \underbrace{\sum_i n_i(\bar{Y}_i - \bar{Y})^2}_{SSB} + \underbrace{\sum_{ij}(Y_{ij} - \bar{Y}_i)^2}_{SSW}\]
Total = Between + Within
| Source | SS | df | MS | F |
|---|---|---|---|---|
| Between | \(SSB\) | \(k-1\) | \(MSB = \frac{SSB}{k-1}\) | \(\frac{MSB}{MSW}\) |
| Within | \(SSW\) | \(N-k\) | \(MSW = \frac{SSW}{N-k}\) | |
| Total | \(SST\) | \(N-1\) | | |
Logic: If group means truly differ, \(MSB \gg MSW\), so \(F \gg 1\).
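The decomposition is an algebraic identity, so it can be verified numerically on any dataset. A quick sketch on toy data (group means and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(loc=m, scale=2.0, size=25) for m in (0.0, 1.0, 3.0)]

all_y = np.concatenate(groups)
grand = all_y.mean()

sst = ((all_y - grand) ** 2).sum()                          # total SS
ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) # between SS
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)      # within SS

print(np.isclose(sst, ssb + ssw))  # identity holds to floating-point precision
```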
The F-Test: Decision Rule
\[F = \frac{MSB}{MSW} \sim F_{k-1, \; N-k}\]
Decision: Reject \(H_0\) if \(F > F_{\alpha, \; k-1, \; N-k}\)
Key property: \(E(MSW) = \sigma^2\) always. For a balanced design with \(n\) observations per group and group effects \(\alpha_i = \mu_i - \mu\),
\[E(MSB) = \sigma^2 + \frac{n \sum_i \alpha_i^2}{k-1}\]
So \(MSB\) estimates \(\sigma^2\) only when \(H_0\) is true (all \(\alpha_i = 0\)); under \(H_a\), \(E(MSB) > E(MSW)\).
Important: A significant F-test only tells you that at least one mean differs. It does not tell you which ones.
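Assembling \(F\) from the ANOVA table by hand and comparing it against `scipy.stats.f_oneway` makes the decision rule concrete. A sketch on simulated data (group means are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [rng.normal(loc=m, size=30) for m in (0.0, 0.5, 1.5)]

k = len(groups)
N = sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()

ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
msb, msw = ssb / (k - 1), ssw / (N - k)
F = msb / msw
p = stats.f.sf(F, k - 1, N - k)        # right tail of F_{k-1, N-k}

F_ref, p_ref = stats.f_oneway(*groups)  # library result for comparison
print(np.isclose(F, F_ref), np.isclose(p, p_ref))
```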
Effect Sizes: Beyond Statistical Significance
| Measure | Formula | Interpretation |
|---|---|---|
| \(\eta^2\) (Eta-squared) | \(\frac{SSB}{SST}\) | % of variance explained (biased upward) |
| \(\omega^2\) (Omega-squared) | \(\frac{SSB - (k-1)MSW}{SST + MSW}\) | Bias-corrected \(\eta^2\) |
| Cohen’s \(f\) | \(\sqrt{\frac{\eta^2}{1-\eta^2}}\) | Small: 0.10, Medium: 0.25, Large: 0.40 |
Why report effect sizes?
With large \(n\), even trivial differences are “significant.” Effect sizes tell you whether the difference matters in practice.
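All three effect sizes fall out of the same sums of squares. A sketch that applies the table's formulas to raw group data (the groups themselves are simulated for illustration):

```python
import numpy as np

def effect_sizes(groups):
    """eta^2, omega^2, and Cohen's f, per the formulas in the table."""
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = np.concatenate(groups).mean()
    ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    sst = ssb + ssw
    msw = ssw / (N - k)
    eta2 = ssb / sst
    omega2 = (ssb - (k - 1) * msw) / (sst + msw)   # bias-corrected
    cohen_f = np.sqrt(eta2 / (1 - eta2))
    return eta2, omega2, cohen_f

rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, size=40) for m in (0.0, 0.3, 0.6)]
eta2, omega2, f = effect_sizes(groups)
print(omega2 < eta2)  # omega^2 corrects eta^2's upward bias
```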
Case: ROE Across Five YRD Industries
Data: 593 YRD companies across 5 industries.
| Industry | \(n\) | Mean ROE (%) | SD |
|---|---|---|---|
| Banking | 18 | 10.63 | 3.12 |
| Real Estate | 78 | 3.64 | 11.84 |
| Pharmaceutical | 166 | 4.56 | 11.81 |
| IT | 105 | 0.82 | 19.86 |
| Manufacturing | 226 | 5.95 | 14.61 |
- \(F(4, 588) = 6.904\), \(p = 2 \times 10^{-5}\) → Highly significant
- \(\eta^2 = 0.0466\) → Small effect
- Levene’s test: \(W = 3.29\), \(p = 0.011\) → Unequal variances!
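The analysis pipeline (global F-test plus Levene's variance check) can be sketched in a few lines. The real 593-firm ROE dataset is not reproduced here, so the data below are simulated to match the table's group sizes, means, and SDs; the resulting statistics will only approximate the reported ones:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
spec = {  # industry: (n, mean ROE %, SD) -- taken from the table above
    "Banking":        (18, 10.63,  3.12),
    "Real Estate":    (78,  3.64, 11.84),
    "Pharmaceutical": (166, 4.56, 11.81),
    "IT":             (105, 0.82, 19.86),
    "Manufacturing":  (226, 5.95, 14.61),
}
groups = {name: rng.normal(m, s, n) for name, (n, m, s) in spec.items()}

F, p = stats.f_oneway(*groups.values())    # global F-test
W, p_lev = stats.levene(*groups.values())  # homogeneity-of-variance check
print(f"F = {F:.2f} (p = {p:.2g}), Levene W = {W:.2f} (p = {p_lev:.3f})")
```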
Post-Hoc Comparisons: Tukey HSD
\[q = \frac{\bar{Y}_i - \bar{Y}_j}{\sqrt{MSW \cdot \frac{1}{2}(\frac{1}{n_i} + \frac{1}{n_j})}}\]
Tukey HSD results (three significant pairs at \(\alpha = 0.05\), one marginal):

| Pair | Mean difference | Adjusted \(p\) |
|---|---|---|
| Banking − IT | +9.81% | 0.002 |
| Banking − Real Estate | +6.99% | 0.038 |
| Manufacturing − IT | +5.13% | 0.028 |
| Banking − Pharma | +6.07% | 0.066 |
Key insight: Banking’s ROE is significantly higher than that of IT and Real Estate, and marginally higher than Pharma’s. The IT sector has the lowest ROE on average.
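In practice the Tukey-Kramer comparisons are usually run with a library rather than by hand. A sketch using statsmodels' `pairwise_tukeyhsd` on three toy groups (the labels echo the case study, but the values are simulated with well-separated means, not the real ROE data):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(5)
values = np.concatenate([
    rng.normal(10.6, 3.0, 18),    # "Banking"-like group
    rng.normal(0.8, 3.0, 105),    # "IT"-like group
    rng.normal(6.0, 3.0, 226),    # "Manufacturing"-like group
])
labels = ["Banking"] * 18 + ["IT"] * 105 + ["Manufacturing"] * 226

# pairwise_tukeyhsd uses the Tukey-Kramer form, so unequal group sizes are fine
res = pairwise_tukeyhsd(values, labels, alpha=0.05)
print(res.summary())   # mean differences, adjusted p-values, reject flags
```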
When Assumptions Fail: Welch’s ANOVA
Problem: Unequal variances (Levene’s \(p = 0.011\)).
Welch’s ANOVA:
\[F_W = \frac{\frac{1}{k-1}\sum w_i(\bar{Y}_i - \tilde{Y})^2}{1 + \frac{2(k-2)}{k^2-1}\sum\frac{(1-w_i/\sum w_i)^2}{n_i-1}}\]
where weights \(w_i = \frac{n_i}{s_i^2}\)
Advantages:
- Does not require equal variances
- More robust to unequal group sizes
- Recommended as default by many statisticians
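SciPy has no built-in Welch ANOVA, but the formula above translates directly into code. A sketch assembling \(F_W\) by hand (the test data are arbitrary groups with deliberately unequal variances):

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's F from the formula above; returns (F_W, df2, p)."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    means = np.array([g.mean() for g in groups])
    var = np.array([g.var(ddof=1) for g in groups])
    w = n / var                                     # w_i = n_i / s_i^2
    y_tilde = (w * means).sum() / w.sum()           # weighted grand mean
    num = (w * (means - y_tilde) ** 2).sum() / (k - 1)
    lam = ((1 - w / w.sum()) ** 2 / (n - 1)).sum()
    den = 1 + 2 * (k - 2) / (k**2 - 1) * lam
    F_w = num / den
    df2 = (k**2 - 1) / (3 * lam)                    # Welch's denominator df
    p = stats.f.sf(F_w, k - 1, df2)
    return F_w, df2, p

rng = np.random.default_rng(6)
groups = [rng.normal(m, s, n) for m, s, n in
          [(0.0, 1.0, 20), (0.5, 3.0, 40), (2.0, 5.0, 15)]]
F_w, df2, p = welch_anova(*groups)
print(f"F_W = {F_w:.2f}, df = ({len(groups)-1}, {df2:.1f}), p = {p:.4f}")
```

Sanity check: for \(k = 2\), Welch's \(F_W\) reduces to the square of Welch's t-statistic with matching p-value, which gives a direct cross-check against `scipy.stats.ttest_ind(equal_var=False)`.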
Two-Way ANOVA: Interaction Effects
\[Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}\]
Case: Industry (3 levels) × Province (4 levels), 303 YRD companies.
| Source | df | F | p | Significant? |
|---|---|---|---|---|
| Industry | 2 | 6.34 | 0.002 | Yes |
| Province | 3 | 0.35 | 0.791 | No |
| Interaction | 6 | 2.39 | 0.038 | Yes |
Interpretation: Industry matters, province alone doesn’t, but the combination does — certain industries perform differently in certain provinces.
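A two-way model with an interaction term is usually fit via a formula interface. The sketch below uses statsmodels on simulated data with an industry effect and an industry-by-province interaction built in; the factor levels and effect sizes are illustrative assumptions, not the real 303-firm panel:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(9)
industries = rng.choice(["Pharma", "IT", "Manufacturing"], size=303)
provinces = rng.choice(["SH", "JS", "ZJ", "AH"], size=303)

# Build in an industry main effect plus one industry-by-province interaction cell
ind_eff = {"Pharma": 1.0, "IT": -2.0, "Manufacturing": 0.5}
roe = np.array([ind_eff[i] + (2.0 if (i == "IT" and p == "SH") else 0.0)
                for i, p in zip(industries, provinces)]) + rng.normal(0, 3, 303)

df = pd.DataFrame({"roe": roe, "industry": industries, "province": provinces})
model = smf.ols("roe ~ C(industry) * C(province)", data=df).fit()
table = anova_lm(model, typ=2)   # F and p for each main effect + interaction
print(table)
```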
Dirty Work: Post-Hoc Hacking
The trap: Run all pairwise comparisons, then report only the significant ones as if they were pre-planned hypotheses.
Simulation: 20 random pairwise tests →
- Expected false positives: \(20 \times 0.05 = 1\)
- Probability of \(\geq 1\) false positive: \(1 - 0.95^{20} = \mathbf{64.2\%}\)
Common corrections:
| Method | Rule | Trade-off |
|---|---|---|
| Bonferroni | \(\alpha^* = \alpha/m\) | Conservative; low power |
| Holm | Sequential; reject while \(p_{(j)} < \alpha/(m-j+1)\) | Less conservative |
| Benjamini-Hochberg | Controls FDR instead of FWER | More powerful |
Dirty Work: Assumption Breakdown
Homoscedasticity is ANOVA’s Achilles’ heel.
| Assumption | Check | Remedy |
|---|---|---|
| Normality | Shapiro-Wilk | Robust with \(n > 30\) per group (CLT) |
| Equal variances | Levene’s test | Use Welch’s ANOVA |
| Independence | Design check | No easy fix — invalid inference |
Practical recommendation:
Use Welch’s ANOVA as your default. It works well under both equal and unequal variance conditions, with minimal power loss.
Heuristic: Multiple Comparison Trap
Setup: 10 groups drawn from the same population (\(\mu = 0, \sigma = 1\)).
- \(\binom{10}{2} = 45\) pairwise tests
- True FWER: \(1 - 0.95^{45} = \mathbf{90.1\%}\)
Heuristic: One Outlier Can Destroy ANOVA
Setup: Three groups, each \(n = 20\), clearly different means.
| Scenario | Outlier | F | p | Significant? |
|---|---|---|---|---|
| Clean data | None | 40.98 | \(10^{-13}\) | Yes |
| One outlier | Value = 100 | 0.43 | 0.65 | No |
Effect: A single observation in one group can inflate \(MSW\) so dramatically that a genuinely significant difference becomes invisible.
Defense: Always check residual plots and consider robust ANOVA or trimmed means.
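The heuristic is easy to reproduce: run the F-test on three well-separated groups, inject one wild value, and rerun. The sketch below uses its own simulated means and SDs (assumptions for illustration), so the exact F and p values will differ from the table above, but the collapse in significance is the same phenomenon:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
g1 = rng.normal(0.0, 1.0, 20)
g2 = rng.normal(1.5, 1.0, 20)
g3 = rng.normal(3.0, 1.0, 20)

F_clean, p_clean = stats.f_oneway(g1, g2, g3)

g3_dirty = g3.copy()
g3_dirty[0] = 100.0                  # one extreme observation in one group
F_dirty, p_dirty = stats.f_oneway(g1, g2, g3_dirty)

print(f"clean: F = {F_clean:.1f}, p = {p_clean:.2g}")
print(f"dirty: F = {F_dirty:.1f}, p = {p_dirty:.2g}")
```

The outlier inflates \(MSW\) far more than \(MSB\), so \(F\) collapses even though two of the three groups are untouched.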
Chapter Summary
| Topic | Takeaway |
|---|---|
| Why ANOVA | Multiple t-tests rapidly inflate the family-wise false-positive rate |
| F-test | Tests whether any group mean differs (not which ones) |
| Effect sizes | \(\eta^2\), \(\omega^2\), Cohen’s \(f\) — report alongside \(p\)-values |
| Post-hoc | Tukey HSD, Bonferroni, or BH — correct for multiple comparisons |
| Two-way | The interaction term is often the most interesting finding |
| Welch’s ANOVA | Recommended default — robust to unequal variances |
| Post-hoc hacking | 20 tests at \(\alpha = 0.05\) → 64% FWER → must correct |
| Outlier fragility | One extreme value can nullify a real effect |