Intuition: If the variables are independent, knowing the row shouldn’t help predict the column.
Case: Industry vs. Region in the YRD
Question: Is the industry distribution independent of province (Shanghai, Jiangsu, Zhejiang)?
Testing the top 4 industries across 3 provinces:
\(\chi^2\) statistic computed from \(4 \times 3\) table
\(df = (4-1)(3-1) = 6\)
Result: Cannot reject independence (\(p > 0.05\))
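Interpretation: The data show no detectable difference in industry structure across the three YRD provinces: the tech and manufacturing mix is similar in Shanghai, Jiangsu, and Zhejiang. Failing to reject independence is an absence of evidence against homogeneity, not proof of it.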
Cramér’s V confirms: the association strength is negligible.
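The calculation behind this case can be sketched in standard-library Python. The counts below are hypothetical placeholders, since the underlying \(4 \times 3\) table is not listed here, and the helper names `chi2_sf_even_df` and `chi2_independence` are illustrative. The p-value uses the closed form that holds for even degrees of freedom, which covers \(df = 6\).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df % 2 != 0:
        raise ValueError("this closed form requires even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # builds (x/2)^k / k!
        total += term
    return math.exp(-half) * total

def chi2_independence(table):
    """Return (chi2, df, cramers_v) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    r, c = len(table), len(table[0])
    df = (r - 1) * (c - 1)
    cramers_v = math.sqrt(chi2 / (n * min(r - 1, c - 1)))
    return chi2, df, cramers_v

# Hypothetical counts: 4 industries (rows) x 3 provinces (columns).
counts = [
    [120, 150, 130],
    [90, 110, 100],
    [60, 80, 70],
    [30, 40, 35],
]
chi2, df, v = chi2_independence(counts)
p_value = chi2_sf_even_df(chi2, df)
print(f"chi2={chi2:.3f}, df={df}, p={p_value:.3f}, Cramer's V={v:.3f}")
```

With near-proportional rows like these, the statistic is small, the p-value is large, and Cramér’s V sits close to zero, which is the pattern the case describes.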
Homogeneity vs. Independence: What’s the Difference?
| Aspect | Independence Test | Homogeneity Test |
|---|---|---|
| Design | One sample, two variables | Multiple samples, one variable |
| Question | Are X and Y independent? | Do groups have the same distribution? |
| Example | Is industry independent of profitability? | Do Shanghai, Jiangsu, Zhejiang have the same industry mix? |
| Math | Identical \(\chi^2\) formula | Identical \(\chi^2\) formula |
They use the same test statistic but answer conceptually different questions.
McNemar Test: Paired Categorical Data
For before/after or matched-pair binary outcomes:
| | After: Yes | After: No |
|---|---|---|
| Before: Yes | \(a\) | \(b\) |
| Before: No | \(c\) | \(d\) |
Only the discordant pairs (\(b\) and \(c\)) carry information:
\[\chi^2 = \frac{(b - c)^2}{b + c}\]
Use case: Did a training program change employees’ views? (Before vs. after survey)
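A minimal sketch of the McNemar statistic, using the formula above without a continuity correction. The survey counts are hypothetical, and `mcnemar` is an illustrative helper name. The p-value relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
import math

def mcnemar(b, c):
    """McNemar chi-square (no continuity correction) and its df = 1 p-value."""
    if b + c == 0:
        raise ValueError("no discordant pairs: the test is undefined")
    chi2 = (b - c) ** 2 / (b + c)
    # For 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical survey: 25 employees switched Yes -> No (b),
# 10 switched No -> Yes (c). Concordant pairs (a, d) are not needed.
chi2, p_value = mcnemar(b=25, c=10)
print(f"chi2={chi2:.3f}, p={p_value:.4f}")
```

With few discordant pairs, an exact binomial version of the test is the usual alternative to this large-sample approximation.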
Fisher’s Exact Test: When Chi-Square Fails
Problem: The chi-square approximation is only reliable when every expected count satisfies roughly \(E_{ij} \geq 5\). With small samples, this rule of thumb fails.
Solution: Fisher’s exact test computes the exact probability using the hypergeometric distribution.
Case: Risk Control Model Comparison (Securities firm, 50 flagged transactions)
| | Model A Correct | Model A Wrong | Total |
|---|---|---|---|
| Model B Correct | 20 | 5 | 25 |
| Model B Wrong | 15 | 10 | 25 |
| Total | 35 | 15 | 50 |
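Odds ratio = \((20 \times 10)/(5 \times 15) \approx 2.67\)
Fisher exact \(p \approx 0.22\) (two-sided)
Conclusion: With these counts, the association between the two models’ correctness is not significant at \(\alpha = 0.05\). Because both models score the same 50 transactions, a direct accuracy comparison rests on the discordant cells (15 vs. 5), as in the McNemar test.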
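The hypergeometric calculation can be sketched with `math.comb`. This is a minimal illustration, not a reference implementation: `fisher_exact_two_sided` is a hypothetical helper name, and the two-sided rule used here (sum every table no more probable than the observed one) is one common convention among several. Running it on the counts in the table recomputes the odds ratio and p-value directly from the data.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Sample odds ratio and two-sided Fisher exact p-value for [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    p_value = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)

# Counts from the table: rows are Model B (correct, wrong),
# columns are Model A (correct, wrong).
odds_ratio, p_value = fisher_exact_two_sided(20, 5, 15, 10)
print(f"odds ratio={odds_ratio:.2f}, two-sided p={p_value:.3f}")
```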
Heuristic: Is the Lottery Random?
Test: Simulate 100 draws of China’s “Double Color Ball” lottery and test whether the 33 red-ball (红球) numbers appear with equal frequency.
Under \(H_0\): each number has probability \(1/33\)
Apply chi-square goodness-of-fit test
Result: \(p = 0.265\) → Cannot reject randomness
Key insight: Even with a fair lottery, some numbers WILL appear more often than others due to sampling variability. The test tells us whether the deviation is beyond what chance alone would produce.
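A simulation along these lines might look like the sketch below. It assumes 6 distinct red balls per draw (not stated above), uses an arbitrary seed, and treats the 600 drawn numbers as independent, which is only approximate because balls within a draw cannot repeat. The helper names are illustrative, and the resulting p-value depends on the seed, so it will not reproduce the figure quoted above.

```python
import math
import random

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df % 2 != 0:
        raise ValueError("this closed form requires even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def simulate_red_ball_counts(n_draws, rng):
    """Count how often each of the 33 red numbers appears over n_draws draws."""
    counts = [0] * 33
    for _ in range(n_draws):
        # Assumption: each draw picks 6 distinct red numbers from 1..33.
        for number in rng.sample(range(1, 34), 6):
            counts[number - 1] += 1
    return counts

rng = random.Random(2024)  # arbitrary seed, only for reproducibility
counts = simulate_red_ball_counts(100, rng)
expected = sum(counts) / 33  # 600 / 33, about 18.2 appearances per number
chi2 = sum((obs - expected) ** 2 / expected for obs in counts)
p_value = chi2_sf_even_df(chi2, df=32)  # 33 categories -> 32 degrees of freedom
print(f"min={min(counts)}, max={max(counts)}, chi2={chi2:.2f}, p={p_value:.3f}")
```

The printed minimum and maximum make the key insight concrete: a fair simulation still produces visibly uneven counts.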
Heuristic: Are Zodiac Signs and Success Related?
Test: Among 100 wealthy individuals, test whether their zodiac sign distribution differs from population proportions (weighted by days per sign).
\(H_0\): Zodiac distribution matches calendar proportions
\(\chi^2\) test with 11 df
Result: \(p = 0.536\) → No evidence of association
P-Hacking warning: If you tested 12 individual signs, one might appear “significant” at \(\alpha = 0.05\) by pure chance (\(12 \times 0.05 = 0.6\) expected false positives).
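The arithmetic behind this warning can be made explicit. The sketch below assumes the 12 per-sign tests are independent, which is only approximately true since the sign counts share one sample. The Bonferroni threshold is shown as one common remedy, not something prescribed above.

```python
alpha = 0.05
n_tests = 12

# Expected number of false positives when every null hypothesis is true.
expected_false_positives = n_tests * alpha

# Probability of at least one false positive, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni-adjusted per-test threshold that keeps the overall rate near alpha.
bonferroni_alpha = alpha / n_tests

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one false positive): {p_at_least_one:.3f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
```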
Chapter Summary
| Test | When to Use | Key Condition |
|---|---|---|
| Goodness-of-fit | Does data match a theoretical distribution? | \(E_i \geq 5\) |
| Independence | Are two categorical variables independent? | \(E_{ij} \geq 5\) |
| Homogeneity | Do groups have the same distribution? | Same as independence |
| McNemar | Paired before/after binary data | Sufficient discordant pairs |
| Fisher exact | Small sample sizes (\(E_{ij} < 5\)) | Any sample size |
Always report Cramér’s V alongside \(\chi^2\) to guard against the curse of large \(N\).