09: Analysis of Variance (ANOVA)

Why Not Just Run Multiple t-Tests?

Problem: Comparing means across \(k = 5\) groups requires \(\binom{5}{2} = 10\) pairwise t-tests.

Family-wise error rate (FWER):

\[\text{FWER} = 1 - (1 - \alpha)^m\]

| Comparisons (\(m\)) | FWER |
|---|---|
| 3 | 14.3% |
| 10 | 40.1% |
| 45 | 90.1% |

At 10 groups, 45 tests give a 90% chance of at least one false positive!
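
The FWER formula is easy to check numerically; a minimal sketch (the helper name `fwer` is ours):

```python
# Family-wise error rate for m independent tests at per-test alpha.
def fwer(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in (3, 10, 45):
    print(f"m = {m:2d}: FWER = {fwer(m):.1%}")
```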

Solution: ANOVA uses a single global F-test first, then targeted post-hoc comparisons.

One-Way ANOVA: The Model

\[Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)\]

  • \(Y_{ij}\): observation \(j\) in group \(i\)
  • \(\mu\): grand mean
  • \(\alpha_i\): effect of group \(i\) (with \(\sum \alpha_i = 0\))
  • \(\varepsilon_{ij}\): random error

The question: Are all \(\alpha_i = 0\)? (All group means equal?)

Assumptions:

  1. Independence across observations
  2. Normality within each group
  3. Equal variances (homoscedasticity) across groups

SS Decomposition: The Core Logic

\[\underbrace{\sum_{ij}(Y_{ij} - \bar{Y})^2}_{SST} = \underbrace{\sum_i n_i(\bar{Y}_i - \bar{Y})^2}_{SSB} + \underbrace{\sum_{ij}(Y_{ij} - \bar{Y}_i)^2}_{SSW}\]

Total = Between + Within

| Source | SS | df | MS | \(F\) |
|---|---|---|---|---|
| Between | \(SSB\) | \(k-1\) | \(MSB = \frac{SSB}{k-1}\) | \(\frac{MSB}{MSW}\) |
| Within | \(SSW\) | \(N-k\) | \(MSW = \frac{SSW}{N-k}\) | |
| Total | \(SST\) | \(N-1\) | | |

Logic: If group means truly differ, \(MSB \gg MSW\), so \(F \gg 1\).
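
The decomposition can be verified directly; a minimal sketch on invented three-group data:

```python
# Sum-of-squares decomposition: SST = SSB + SSW (hypothetical data).
groups = {
    "A": [4.1, 5.0, 4.6, 5.3],
    "B": [6.2, 6.8, 5.9, 6.5],
    "C": [4.8, 5.1, 5.4, 4.9],
}
all_obs = [y for g in groups.values() for y in g]
grand = sum(all_obs) / len(all_obs)

sst = sum((y - grand) ** 2 for y in all_obs)
ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
ssw = sum((y - sum(g) / len(g)) ** 2 for g in groups.values() for y in g)

print(f"SST = {sst:.4f}, SSB + SSW = {ssb + ssw:.4f}")  # equal up to rounding
```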

The F-Test: Decision Rule

\[F = \frac{MSB}{MSW} \overset{H_0}{\sim} F_{k-1, \; N-k}\]

Decision: Reject \(H_0\) if \(F > F_{\alpha, \; k-1, \; N-k}\)

Key property: \(E(MSW) = \sigma^2\) always, but for a balanced design with \(n\) observations per group,

\[E(MSB) = \sigma^2 + \frac{n \sum_i \alpha_i^2}{k-1}\]

So \(MSB\) estimates \(\sigma^2\) only when \(H_0\) is true. Under \(H_a\), \(MSB > MSW\) in expectation.

Important: A significant F-test only tells you that at least one mean differs. It does not tell you which ones.
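
The full decision rule can be sketched end to end; the data are invented, and the critical value \(F_{0.05,\,2,\,9} \approx 4.26\) is looked up from a standard F table:

```python
# One-way ANOVA F-test by hand on hypothetical data (k = 3 groups, n = 4 each).
groups = [[4.1, 5.0, 4.6, 5.3], [6.2, 6.8, 5.9, 6.5], [4.8, 5.1, 5.4, 4.9]]
k = len(groups)
N = sum(len(g) for g in groups)
grand = sum(y for g in groups for y in g) / N

msb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups) / (k - 1)
msw = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g) / (N - k)
F = msb / msw

F_crit = 4.26  # F_{0.05, 2, 9} from an F table
print(f"F({k - 1}, {N - k}) = {F:.2f} -> reject H0: {F > F_crit}")
```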

Effect Sizes: Beyond Statistical Significance

| Measure | Formula | Interpretation |
|---|---|---|
| \(\eta^2\) (eta-squared) | \(\frac{SSB}{SST}\) | % of variance explained (biased upward) |
| \(\omega^2\) (omega-squared) | \(\frac{SSB - (k-1)MSW}{SST + MSW}\) | Bias-corrected \(\eta^2\) |
| Cohen’s \(f\) | \(\sqrt{\frac{\eta^2}{1-\eta^2}}\) | Small: 0.10, Medium: 0.25, Large: 0.40 |

Why report effect sizes?

With large \(n\), even trivial differences are “significant.” Effect sizes tell you whether the difference matters in practice.
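
Given the ANOVA table quantities, all three measures are one-liners; a sketch with invented sums of squares:

```python
import math

# Effect sizes from hypothetical ANOVA quantities (k groups, N observations).
ssb, ssw, k, N = 5.79, 1.47, 3, 12
sst = ssb + ssw
msw = ssw / (N - k)

eta2 = ssb / sst                              # biased upward
omega2 = (ssb - (k - 1) * msw) / (sst + msw)  # bias-corrected
cohens_f = math.sqrt(eta2 / (1 - eta2))

print(f"eta^2 = {eta2:.3f}, omega^2 = {omega2:.3f}, f = {cohens_f:.2f}")
```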

Case: ROE Across Five YRD Industries

Data: 593 YRD companies across 5 industries.

| Industry | \(n\) | Mean ROE (%) | SD (%) |
|---|---|---|---|
| Banking | 18 | 10.63 | 3.12 |
| Real Estate | 78 | 3.64 | 11.84 |
| Pharmaceutical | 166 | 4.56 | 11.81 |
| IT | 105 | 0.82 | 19.86 |
| Manufacturing | 226 | 5.95 | 14.61 |

  • \(F(4, 588) = 6.904\), \(p = 2 \times 10^{-5}\) → Highly significant
  • \(\eta^2 = 0.0466\) → Small effect
  • Levene’s test: \(W = 3.29\), \(p = 0.011\) → Unequal variances!

Post-Hoc Comparisons: Tukey HSD

\[q = \frac{\bar{Y}_i - \bar{Y}_j}{\sqrt{MSW \cdot \frac{1}{2}(\frac{1}{n_i} + \frac{1}{n_j})}}\]
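
Plugging group summaries into the statistic above; the group sizes echo the case, but the MSW value here is invented for illustration:

```python
import math

# Tukey-Kramer q statistic for one pair of groups (hypothetical summaries).
mean_i, mean_j = 10.63, 0.82   # two group means
n_i, n_j = 18, 105             # unequal group sizes
msw = 210.0                    # pooled within-group mean square (invented)

se = math.sqrt(msw * 0.5 * (1 / n_i + 1 / n_j))
q = (mean_i - mean_j) / se
print(f"q = {q:.2f}")  # compare to the studentized-range critical value
```

In practice, `statsmodels.stats.multicomp.pairwise_tukeyhsd` automates this across all pairs and returns adjusted p-values.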

Tukey HSD results (three pairs significant at \(\alpha = 0.05\), one marginal):

| Comparison | Mean Diff | \(p\)-adj |
|---|---|---|
| Banking − IT | +9.81% | 0.002 |
| Banking − Real Estate | +6.99% | 0.038 |
| Manufacturing − IT | +5.13% | 0.028 |
| Banking − Pharma | +6.07% | 0.066 |

Key insight: Banking stands apart from most other industries (its gap with Pharma is only marginal, \(p = 0.066\)). The IT sector has the lowest ROE on average.

When Assumptions Fail: Welch’s ANOVA

Problem: Unequal variances (Levene’s \(p = 0.011\)).

Welch’s ANOVA:

\[F_W = \frac{\frac{1}{k-1}\sum w_i(\bar{Y}_i - \tilde{Y})^2}{1 + \frac{2(k-2)}{k^2-1}\sum\frac{(1-w_i/\sum w_i)^2}{n_i-1}}\]

where the weights are \(w_i = \frac{n_i}{s_i^2}\) and \(\tilde{Y} = \frac{\sum w_i \bar{Y}_i}{\sum w_i}\) is the variance-weighted grand mean.

Advantages:

  • Does not require equal variances
  • More robust to unequal group sizes
  • Recommended as default by many statisticians
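
The formula translates directly into code; a sketch on invented group summaries (means, variances, and sizes are all made up):

```python
import math

# Welch's ANOVA F statistic from group summaries (hypothetical values).
means = [4.75, 6.35, 5.05]
variances = [0.27, 0.15, 0.07]   # sample variances s_i^2
ns = [4, 4, 4]
k = len(means)

w = [n / s2 for n, s2 in zip(ns, variances)]          # weights w_i = n_i / s_i^2
W = sum(w)
y_tilde = sum(wi * m for wi, m in zip(w, means)) / W  # weighted grand mean

num = sum(wi * (m - y_tilde) ** 2 for wi, m in zip(w, means)) / (k - 1)
den = 1 + (2 * (k - 2) / (k ** 2 - 1)) * sum(
    (1 - wi / W) ** 2 / (n - 1) for wi, n in zip(w, ns)
)
F_w = num / den
print(f"Welch's F = {F_w:.2f}")
```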

Two-Way ANOVA: Interaction Effects

\[Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}\]

Case: Industry (3 levels) × Province (4 levels), 303 YRD companies.

| Source | df | \(F\) | \(p\) | Significant? |
|---|---|---|---|---|
| Industry | 2 | 6.34 | 0.002 | Yes |
| Province | 3 | 0.35 | 0.791 | No |
| Interaction | 6 | 2.39 | 0.038 | Yes |

Interpretation: Industry matters, province alone doesn’t, but the combination does — certain industries perform differently in certain provinces.
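
The interaction pattern can be read off the cell means alone. In a sketch with invented numbers, the row main effects are flat while the interaction terms \((\alpha\beta)_{ij}\) are large:

```python
# Interaction effects from a 2x2 table of cell means (hypothetical values).
cells = [[8.0, 2.0],   # factor-A level 1 across two levels of factor B
         [3.0, 7.0]]   # factor-A level 2

grand = sum(sum(row) for row in cells) / 4
row_means = [sum(row) / 2 for row in cells]
col_means = [(cells[0][j] + cells[1][j]) / 2 for j in range(2)]

# (alpha*beta)_ij = cell mean - row mean - column mean + grand mean
inter = [
    [cells[i][j] - row_means[i] - col_means[j] + grand for j in range(2)]
    for i in range(2)
]
print(inter)
```

Here both row means equal the grand mean, so factor A has no main effect at all, yet the cells clearly differ: exactly the "the combination matters" pattern.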

Dirty Work: Post-Hoc Hacking

The trap: Run all pairwise comparisons, then report only the significant ones as if they were pre-planned hypotheses.

Simulation: 20 random pairwise tests →

  • Expected false positives: \(20 \times 0.05 = 1\)
  • Probability of \(\geq 1\) false positive: \(1 - 0.95^{20} = \mathbf{64.2\%}\)

Common corrections:

| Method | Rule | Trade-off |
|---|---|---|
| Bonferroni | \(\alpha^* = \alpha/m\) | Conservative; low power |
| Holm | Sequential: reject while \(p_{(j)} < \alpha/(m-j+1)\) | Less conservative |
| Benjamini-Hochberg | Controls FDR instead of FWER | More powerful |
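
The Holm rule can be sketched in a few lines (the p-values are invented):

```python
# Holm step-down correction at alpha = 0.05 (hypothetical p-values).
pvals = [0.001, 0.008, 0.012, 0.034, 0.21]
alpha, m = 0.05, len(pvals)

order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
reject = [False] * m
for rank, idx in enumerate(order):                # rank is 0-based
    if pvals[idx] < alpha / (m - rank):           # alpha / (m - j + 1) for 1-based j
        reject[idx] = True
    else:
        break                                     # stop at the first failure

print(reject)
```

For comparison, plain Bonferroni would apply the fixed threshold \(\alpha/m = 0.01\) throughout and reject only the first two of these p-values.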

Dirty Work: Assumption Breakdown

Homoscedasticity is ANOVA’s Achilles’ heel.

| Assumption | Test | What to Do When Violated |
|---|---|---|
| Normality | Shapiro-Wilk | Robust with \(n > 30\) per group (CLT) |
| Equal variances | Levene’s test | Use Welch’s ANOVA |
| Independence | Design check | No easy fix: inference is invalid |

Practical recommendation:

Use Welch’s ANOVA as your default. It works well under both equal and unequal variance conditions, with minimal power loss.

Heuristic: Multiple Comparison Trap

Setup: 10 groups drawn from the same population (\(\mu = 0, \sigma = 1\)).

  • \(\binom{10}{2} = 45\) pairwise tests
  • True FWER: \(1 - 0.95^{45} = \mathbf{90.1\%}\)
[Figure: Family-wise error rate vs. number of comparisons. The curve climbs from 5% at one test past the 80% danger zone to 90.1% at 45 tests.]

Heuristic: One Outlier Can Destroy ANOVA

Setup: Three groups, each \(n = 20\), clearly different means.

| Scenario | Outlier | \(F\) | \(p\) | Significant? |
|---|---|---|---|---|
| Clean data | None | 40.98 | \(10^{-13}\) | Yes |
| One outlier | Value = 100 | 0.43 | 0.65 | No |

Effect: A single observation in one group can inflate \(MSW\) so dramatically that a genuinely significant difference becomes invisible.

Defense: Always check residual plots and consider robust ANOVA or trimmed means.
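
The fragility is easy to reproduce; a sketch with simulated groups (the seed, means, and outlier value are our own, not the table's exact setup):

```python
import random

# One wild observation inflating MSW enough to mask a real difference.
def f_stat(groups):
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = sum(y for g in groups for y in g) / N
    msb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups) / (k - 1)
    msw = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g) / (N - k)
    return msb / msw

random.seed(42)
clean = [[random.gauss(mu, 1.0) for _ in range(20)] for mu in (0.0, 1.0, 2.0)]
dirty = [list(g) for g in clean]
dirty[0][0] = 100.0   # a single wild observation

print(f"clean F = {f_stat(clean):.2f}, with outlier F = {f_stat(dirty):.2f}")
```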

Chapter Summary

| Concept | Key Takeaway |
|---|---|
| Why ANOVA | Multiple t-tests inflate the false-positive rate exponentially |
| F-test | Tests whether any group mean differs (not which ones) |
| Effect sizes | \(\eta^2\), \(\omega^2\), Cohen’s \(f\): report alongside \(p\)-values |
| Post-hoc | Tukey HSD, Bonferroni, or BH: correct for multiple comparisons |
| Two-way | The interaction term is often the most interesting finding |
| Welch’s ANOVA | Recommended default: robust to unequal variances |
| Post-hoc hacking | 20 tests at \(\alpha = 0.05\) → 64% FWER, so corrections are mandatory |
| Outlier fragility | One extreme value can nullify a real effect |