09: Analysis of Variance (ANOVA)

Why Not Just Run Multiple t-Tests?

Problem: Comparing means across \(k = 5\) groups requires \(\binom{5}{2} = 10\) pairwise t-tests.

Family-wise error rate (FWER):

\[\text{FWER} = 1 - (1 - \alpha)^m\]

| Comparisons (\(m\)) | FWER |
|---|---|
| 3 | 14.3% |
| 10 | 40.1% |
| 45 | 90.1% |

At 10 groups, 45 tests give a 90% chance of at least one false positive!
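
The FWER formula is easy to check numerically; a minimal sketch (the helper name `fwer` is ours):

```python
# Family-wise error rate for m independent tests at per-test alpha.
def fwer(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in (3, 10, 45):
    print(f"m = {m:2d}: FWER = {fwer(m):.1%}")
```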

Solution: ANOVA uses a single global F-test first, then targeted post-hoc comparisons.

One-Way ANOVA: The Model

\[Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)\]

  • \(Y_{ij}\): observation \(j\) in group \(i\)
  • \(\mu\): grand mean
  • \(\alpha_i\): effect of group \(i\) (with \(\sum \alpha_i = 0\))
  • \(\varepsilon_{ij}\): random error

The question: Are all \(\alpha_i = 0\)? (All group means equal?)

Assumptions:

  1. Independence across observations
  2. Normality within each group
  3. Equal variances (homoscedasticity) across groups

SS Decomposition: The Core Logic

\[\underbrace{\sum_{ij}(Y_{ij} - \bar{Y})^2}_{SST} = \underbrace{\sum_i n_i(\bar{Y}_i - \bar{Y})^2}_{SSB} + \underbrace{\sum_{ij}(Y_{ij} - \bar{Y}_i)^2}_{SSW}\]

Total = Between + Within

| Source | SS | df | MS | \(F\) |
|---|---|---|---|---|
| Between | \(SSB\) | \(k-1\) | \(MSB = \frac{SSB}{k-1}\) | \(\frac{MSB}{MSW}\) |
| Within | \(SSW\) | \(N-k\) | \(MSW = \frac{SSW}{N-k}\) | |
| Total | \(SST\) | \(N-1\) | | |

Logic: If group means truly differ, \(MSB \gg MSW\), so \(F \gg 1\).
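
The decomposition can be verified directly; a minimal sketch on invented three-group data:

```python
# Sum-of-squares decomposition: SST = SSB + SSW (hypothetical data).
groups = {
    "A": [4.1, 5.0, 4.6, 5.3],
    "B": [6.2, 6.8, 5.9, 6.5],
    "C": [4.8, 5.1, 5.4, 4.9],
}
all_obs = [y for g in groups.values() for y in g]
grand = sum(all_obs) / len(all_obs)

sst = sum((y - grand) ** 2 for y in all_obs)
ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
ssw = sum((y - sum(g) / len(g)) ** 2 for g in groups.values() for y in g)

print(f"SST = {sst:.4f}, SSB + SSW = {ssb + ssw:.4f}")  # equal up to rounding
```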

The F-Test: Decision Rule

\[F = \frac{MSB}{MSW} \overset{H_0}{\sim} F_{k-1, \; N-k}\]

Decision: Reject \(H_0\) if \(F > F_{\alpha, \; k-1, \; N-k}\)

Key property: \(E(MSW) = \sigma^2\) always, but for a balanced design with \(n\) observations per group,

\[E(MSB) = \sigma^2 + \frac{n \sum_i \alpha_i^2}{k-1}\]

So \(MSB\) estimates \(\sigma^2\) only when \(H_0\) is true. Under \(H_a\), \(MSB > MSW\) in expectation.

Important: A significant F-test only tells you that at least one mean differs. It does not tell you which ones.
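
The full decision rule can be sketched end to end; the data are invented, and the critical value \(F_{0.05,\,2,\,9} \approx 4.26\) is looked up from a standard F table:

```python
# One-way ANOVA F-test by hand on hypothetical data (k = 3 groups, n = 4 each).
groups = [[4.1, 5.0, 4.6, 5.3], [6.2, 6.8, 5.9, 6.5], [4.8, 5.1, 5.4, 4.9]]
k = len(groups)
N = sum(len(g) for g in groups)
grand = sum(y for g in groups for y in g) / N

msb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups) / (k - 1)
msw = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g) / (N - k)
F = msb / msw

F_crit = 4.26  # F_{0.05, 2, 9} from an F table
print(f"F({k - 1}, {N - k}) = {F:.2f} -> reject H0: {F > F_crit}")
```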

Effect Sizes: Beyond Statistical Significance

| Measure | Formula | Interpretation |
|---|---|---|
| \(\eta^2\) (eta-squared) | \(\frac{SSB}{SST}\) | % of variance explained (biased upward) |
| \(\omega^2\) (omega-squared) | \(\frac{SSB - (k-1)MSW}{SST + MSW}\) | Bias-corrected \(\eta^2\) |
| Cohen’s \(f\) | \(\sqrt{\frac{\eta^2}{1-\eta^2}}\) | Small: 0.10, Medium: 0.25, Large: 0.40 |

Why report effect sizes?

With large \(n\), even trivial differences are “significant.” Effect sizes tell you whether the difference matters in practice.
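
Given the ANOVA table quantities, all three measures are one-liners; a sketch with invented sums of squares:

```python
import math

# Effect sizes from hypothetical ANOVA quantities (k groups, N observations).
ssb, ssw, k, N = 5.79, 1.47, 3, 12
sst = ssb + ssw
msw = ssw / (N - k)

eta2 = ssb / sst                              # biased upward
omega2 = (ssb - (k - 1) * msw) / (sst + msw)  # bias-corrected
cohens_f = math.sqrt(eta2 / (1 - eta2))

print(f"eta^2 = {eta2:.3f}, omega^2 = {omega2:.3f}, f = {cohens_f:.2f}")
```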

Case: ROE Across Five YRD Industries

Data: 593 YRD companies across 5 industries.

| Industry | \(n\) | Mean ROE (%) | SD (%) |
|---|---|---|---|
| Banking | 18 | 10.63 | 3.12 |
| Real Estate | 78 | 3.64 | 11.84 |
| Pharmaceutical | 166 | 4.56 | 11.81 |
| IT | 105 | 0.82 | 19.86 |
| Manufacturing | 226 | 5.95 | 14.61 |

  • \(F(4, 588) = 6.904\), \(p = 2 \times 10^{-5}\) → Highly significant
  • \(\eta^2 = 0.0466\) → Small effect
  • Levene’s test: \(W = 3.29\), \(p = 0.011\) → Unequal variances!

Post-Hoc Comparisons: Tukey HSD

\[q = \frac{\bar{Y}_i - \bar{Y}_j}{\sqrt{MSW \cdot \frac{1}{2}(\frac{1}{n_i} + \frac{1}{n_j})}}\]
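
Plugging group summaries into the statistic above; the group sizes echo the case, but the MSW value here is invented for illustration:

```python
import math

# Tukey-Kramer q statistic for one pair of groups (hypothetical summaries).
mean_i, mean_j = 10.63, 0.82   # two group means
n_i, n_j = 18, 105             # unequal group sizes
msw = 210.0                    # pooled within-group mean square (invented)

se = math.sqrt(msw * 0.5 * (1 / n_i + 1 / n_j))
q = (mean_i - mean_j) / se
print(f"q = {q:.2f}")  # compare to the studentized-range critical value
```

In practice, `statsmodels.stats.multicomp.pairwise_tukeyhsd` automates this across all pairs and returns adjusted p-values.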

Tukey HSD results (three pairs significant at \(\alpha = 0.05\), one marginal):

| Comparison | Mean Diff | \(p\)-adj |
|---|---|---|
| Banking − IT | +9.81% | 0.002 |
| Banking − Real Estate | +6.99% | 0.038 |
| Manufacturing − IT | +5.13% | 0.028 |
| Banking − Pharma | +6.07% | 0.066 |

Key insight: Banking stands apart from most other industries (its gap with Pharma is only marginal, \(p = 0.066\)). The IT sector has the lowest ROE on average.

When Assumptions Fail: Welch’s ANOVA

Problem: Unequal variances (Levene’s \(p = 0.011\)).

Welch’s ANOVA:

\[F_W = \frac{\frac{1}{k-1}\sum w_i(\bar{Y}_i - \tilde{Y})^2}{1 + \frac{2(k-2)}{k^2-1}\sum\frac{(1-w_i/\sum w_i)^2}{n_i-1}}\]

where the weights are \(w_i = \frac{n_i}{s_i^2}\) and \(\tilde{Y} = \frac{\sum w_i \bar{Y}_i}{\sum w_i}\) is the variance-weighted grand mean.

Advantages:

  • Does not require equal variances
  • More robust to unequal group sizes
  • Recommended as default by many statisticians
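
The formula translates directly into code; a sketch on invented group summaries (means, variances, and sizes are all made up):

```python
import math

# Welch's ANOVA F statistic from group summaries (hypothetical values).
means = [4.75, 6.35, 5.05]
variances = [0.27, 0.15, 0.07]   # sample variances s_i^2
ns = [4, 4, 4]
k = len(means)

w = [n / s2 for n, s2 in zip(ns, variances)]          # weights w_i = n_i / s_i^2
W = sum(w)
y_tilde = sum(wi * m for wi, m in zip(w, means)) / W  # weighted grand mean

num = sum(wi * (m - y_tilde) ** 2 for wi, m in zip(w, means)) / (k - 1)
den = 1 + (2 * (k - 2) / (k ** 2 - 1)) * sum(
    (1 - wi / W) ** 2 / (n - 1) for wi, n in zip(w, ns)
)
F_w = num / den
print(f"Welch's F = {F_w:.2f}")
```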

Two-Way ANOVA: Interaction Effects

\[Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}\]

Case: Industry (3 levels) × Province (4 levels), 303 YRD companies.

| Source | df | \(F\) | \(p\) | Significant? |
|---|---|---|---|---|
| Industry | 2 | 6.34 | 0.002 | Yes |
| Province | 3 | 0.35 | 0.791 | No |
| Interaction | 6 | 2.39 | 0.038 | Yes |

Interpretation: Industry matters, province alone doesn’t, but the combination does — certain industries perform differently in certain provinces.
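
The interaction pattern can be read off the cell means alone. In a sketch with invented numbers, the row main effects are flat while the interaction terms \((\alpha\beta)_{ij}\) are large:

```python
# Interaction effects from a 2x2 table of cell means (hypothetical values).
cells = [[8.0, 2.0],   # factor-A level 1 across two levels of factor B
         [3.0, 7.0]]   # factor-A level 2

grand = sum(sum(row) for row in cells) / 4
row_means = [sum(row) / 2 for row in cells]
col_means = [(cells[0][j] + cells[1][j]) / 2 for j in range(2)]

# (alpha*beta)_ij = cell mean - row mean - column mean + grand mean
inter = [
    [cells[i][j] - row_means[i] - col_means[j] + grand for j in range(2)]
    for i in range(2)
]
print(inter)
```

Here both row means equal the grand mean, so factor A has no main effect at all, yet the cells clearly differ: exactly the "the combination matters" pattern.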

Dirty Work: Post-Hoc Hacking

The trap: Run all pairwise comparisons, then report only the significant ones as if they were pre-planned hypotheses.

Simulation: 20 random pairwise tests →

  • Expected false positives: \(20 \times 0.05 = 1\)
  • Probability of \(\geq 1\) false positive: \(1 - 0.95^{20} = \mathbf{64.2\%}\)

Common corrections:

| Method | Rule | Trade-off |
|---|---|---|
| Bonferroni | \(\alpha^* = \alpha/m\) | Conservative; low power |
| Holm | Sequential: reject while \(p_{(j)} < \alpha/(m-j+1)\) | Less conservative |
| Benjamini-Hochberg | Controls FDR instead of FWER | More powerful |
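
The Holm rule can be sketched in a few lines (the p-values are invented):

```python
# Holm step-down correction at alpha = 0.05 (hypothetical p-values).
pvals = [0.001, 0.008, 0.012, 0.034, 0.21]
alpha, m = 0.05, len(pvals)

order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
reject = [False] * m
for rank, idx in enumerate(order):                # rank is 0-based
    if pvals[idx] < alpha / (m - rank):           # alpha / (m - j + 1) for 1-based j
        reject[idx] = True
    else:
        break                                     # stop at the first failure

print(reject)
```

For comparison, plain Bonferroni would apply the fixed threshold \(\alpha/m = 0.01\) throughout and reject only the first two of these p-values.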

Dirty Work: Assumption Breakdown

Homoscedasticity is ANOVA’s Achilles’ heel.

| Assumption | Test | What to Do When Violated |
|---|---|---|
| Normality | Shapiro-Wilk | Robust with \(n > 30\) per group (CLT) |
| Equal variances | Levene’s test | Use Welch’s ANOVA |
| Independence | Design check | No easy fix: inference is invalid |

Practical recommendation:

Use Welch’s ANOVA as your default. It works well under both equal and unequal variance conditions, with minimal power loss.

Heuristic: Multiple Comparison Trap

Setup: 10 groups drawn from the same population (\(\mu = 0, \sigma = 1\)).

  • \(\binom{10}{2} = 45\) pairwise tests
  • True FWER: \(1 - 0.95^{45} = \mathbf{90.1\%}\)
[Figure: Family-wise error rate vs. number of comparisons. The curve climbs from 5% at one test past the 80% danger zone to 90.1% at 45 tests.]

Heuristic: One Outlier Can Destroy ANOVA

Setup: Three groups, each \(n = 20\), clearly different means.

| Scenario | Outlier | \(F\) | \(p\) | Significant? |
|---|---|---|---|---|
| Clean data | None | 40.98 | \(10^{-13}\) | Yes |
| One outlier | Value = 100 | 0.43 | 0.65 | No |

Effect: A single observation in one group can inflate \(MSW\) so dramatically that a genuinely significant difference becomes invisible.

Defense: Always check residual plots and consider robust ANOVA or trimmed means.
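
The fragility is easy to reproduce; a sketch with simulated groups (the seed, means, and outlier value are our own, not the table's exact setup):

```python
import random

# One wild observation inflating MSW enough to mask a real difference.
def f_stat(groups):
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = sum(y for g in groups for y in g) / N
    msb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups) / (k - 1)
    msw = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g) / (N - k)
    return msb / msw

random.seed(42)
clean = [[random.gauss(mu, 1.0) for _ in range(20)] for mu in (0.0, 1.0, 2.0)]
dirty = [list(g) for g in clean]
dirty[0][0] = 100.0   # a single wild observation

print(f"clean F = {f_stat(clean):.2f}, with outlier F = {f_stat(dirty):.2f}")
```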

Chapter Summary

| Concept | Key Takeaway |
|---|---|
| Why ANOVA | Multiple t-tests inflate the false-positive rate exponentially |
| F-test | Tests whether any group mean differs (not which ones) |
| Effect sizes | \(\eta^2\), \(\omega^2\), Cohen’s \(f\): report alongside \(p\)-values |
| Post-hoc | Tukey HSD, Bonferroni, or BH: correct for multiple comparisons |
| Two-way | The interaction term is often the most interesting finding |
| Welch’s ANOVA | Recommended default: robust to unequal variances |
| Post-hoc hacking | 20 tests at \(\alpha = 0.05\) → 64% FWER, so corrections are mandatory |
| Outlier fragility | One extreme value can nullify a real effect |