06: Goodness-of-Fit and Contingency Tables

From Continuous to Categorical: A Different Kind of Test

Everything so far has focused on continuous data (means, variances).

But many business questions involve categorical data:

  • Does customer preference differ across regions?
  • Is the industry distribution of firms uniform?
  • Are two classification variables independent?

The chi-square (\(\chi^2\)) test family provides the tools for these questions.

The Chi-Square Distribution

If \(Z_1, Z_2, \ldots, Z_k\) are independent standard normal random variables, then:

\[X = \sum_{i=1}^k Z_i^2 \sim \chi^2_k\]

Key properties:

| Property | Value |
| Mean | \(E[X] = k\) |
| Variance | \(\text{Var}(X) = 2k\) |
| Shape | Right-skewed, approaches Normal as \(k \to \infty\) |
| Support | \([0, \infty)\) — always non-negative |

The parameter \(k\) (degrees of freedom) controls the distribution’s center and spread.
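The properties above are easy to check by simulation; a minimal sketch, assuming NumPy is available:

```python
import numpy as np

# Check the definition by simulation: the sum of k squared standard
# normals should have mean ~ k and variance ~ 2k.
rng = np.random.default_rng(42)
k = 5
n = 200_000

Z = rng.standard_normal((n, k))
X = (Z ** 2).sum(axis=1)      # each row: one chi-square(k) draw

print(round(X.mean(), 1))     # ~ 5.0  (E[X] = k)
print(round(X.var(), 1))      # ~ 10.0 (Var(X) = 2k)
```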

Degrees of Freedom: Intuition

Degrees of freedom (df) = number of values free to vary given constraints.

Example: If 5 category frequencies must sum to \(n = 100\):

  • You can freely choose the first 4 frequencies
  • The 5th is determined: \(f_5 = 100 - f_1 - f_2 - f_3 - f_4\)
  • df = \(5 - 1 = 4\)

General formulas:

| Test Type | df |
| Goodness-of-fit (\(k\) categories, \(m\) estimated parameters) | \(k - 1 - m\) |
| Independence (\(r \times c\) table) | \((r-1)(c-1)\) |

Chi-Square Goodness-of-Fit Test

Question: Does an observed frequency distribution match a hypothesized one?

\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]

where \(O_i\) = observed frequency, \(E_i\) = expected frequency under \(H_0\).

Mathematical essence: Each standardized deviation \((O_i - E_i)/\sqrt{E_i}\) is approximately normal, so the sum of their squares is approximately \(\chi^2\) (one degree of freedom is lost to the constraint \(\sum_i O_i = n\)).

Critical condition: All \(E_i \geq 5\) (otherwise the normal approximation breaks down).

Six-Step Testing Procedure

| Step | Action |
| 1. Hypotheses | \(H_0\): Data follows the specified distribution |
| 2. Significance level | Choose \(\alpha\) (typically 0.05) |
| 3. Expected frequencies | Compute \(E_i\) under \(H_0\) |
| 4. Test statistic | \(\chi^2 = \sum (O_i - E_i)^2 / E_i\) |
| 5. Critical value / p-value | Compare to \(\chi^2_{k-1-m}\) |
| 6. Decision | Reject \(H_0\) if \(\chi^2 > \chi^2_{\text{critical}}\) |

Always verify: All \(E_i \geq 5\). If not, merge adjacent categories.
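A hand-rolled illustration of steps 3–6, using invented counts from 120 die rolls (SciPy assumed available for the critical value):

```python
from scipy.stats import chi2

observed = [18, 22, 16, 25, 19, 20]   # invented counts from 120 die rolls
n = sum(observed)                     # 120
expected = [n / 6] * 6                # H0: fair die -> E_i = 20, all >= 5

# Step 4: test statistic chi^2 = sum (O_i - E_i)^2 / E_i
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Step 5: critical value at alpha = 0.05 with df = k - 1 = 5
critical = chi2.ppf(0.95, df=5)

# Step 6: decision
print(stat)                  # 2.5
print(round(critical, 2))    # 11.07
print(stat > critical)       # False -> cannot reject H0
```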

Case: Industry Distribution in the YRD

Question: Are the top 5 industries equally represented among YRD listed companies?

| Industry | Observed | Expected (uniform) |
| Computer & Communication | 287 | 203.8 |
| Machinery | 242 | 203.8 |
| Chemical | 212 | 203.8 |
| Electrical Equipment | 147 | 203.8 |
| Pharmaceutical | 131 | 203.8 |

\(\chi^2 = 83.3\), \(p < 0.001\)

Conclusion: The distribution is far from uniform — Computer & Communication is significantly over-represented.
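The statistic can be reproduced from the observed counts with scipy.stats.chisquare (SciPy assumed available), which defaults to equal expected frequencies:

```python
from scipy.stats import chisquare

# Observed top-5 industry counts; n = 1019, so E_i = 203.8 under uniformity
observed = [287, 242, 212, 147, 131]

stat, p = chisquare(observed)   # equal expected frequencies by default

print(round(stat, 1))   # 83.3
print(p < 0.001)        # True
```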

Dirty Work: The Art of Binning

Binning (how you define categories) directly affects results.

The problem: Different binning schemes → different p-values → potential manipulation.

Three golden rules:

| Rule | Principle |
| Theory first | Let domain knowledge guide category boundaries |
| Equal-frequency | When no theory exists, aim for equal expected counts |
| Merge sparse | Combine categories until all \(E_i \geq 5\) |

Defense against binning hacking: Report results under multiple binning schemes (Multiverse Analysis).

Dirty Work: The Curse of Large N

With large samples, everything becomes statistically significant.

The paradox:

  • \(\chi^2\) statistic scales linearly with \(n\)
  • Even trivial deviations from the null become “significant”
  • A perfectly functioning roulette wheel will “fail” the uniformity test with enough spins

Solution: Always report effect sizes

| Measure | Formula | Interpretation |
| Cramér’s V | \(V = \sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}\) | \(V < 0.1\): negligible |
| Phi coefficient | \(\phi = \sqrt{\chi^2 / n}\) | For \(2 \times 2\) tables |
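A small helper makes the large-\(n\) point concrete (the numbers are invented):

```python
import math

def cramers_v(chi2_stat, n, r, c):
    """Cramér's V for an r x c contingency table."""
    return math.sqrt(chi2_stat / (n * min(r - 1, c - 1)))

# The same chi-square value means very different things at different n:
print(round(cramers_v(20, 100, 2, 2), 3))      # 0.447 -> moderate
print(round(cramers_v(20, 10_000, 2, 2), 3))   # 0.045 -> negligible
```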

Contingency Tables and Independence

A contingency table cross-tabulates two categorical variables.

Expected frequency under independence:

\[E_{ij} = \frac{R_i \times C_j}{n}\]

where \(R_i\) = row total, \(C_j\) = column total, \(n\) = grand total.

Test statistic:

\[\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \quad df = (r-1)(c-1)\]

Intuition: If the variables are independent, knowing the row shouldn’t help predict the column.
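scipy.stats.chi2_contingency performs the whole computation, returning the statistic, p-value, df, and expected table; the 2×3 counts below are invented:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented 2 x 3 table: rows = preference (yes/no), columns = region
table = np.array([[30, 45, 25],
                  [20, 30, 50]])

stat, p, df, expected = chi2_contingency(table)

# E_ij = R_i * C_j / n, e.g. E_11 = 100 * 50 / 200 = 25
print(expected[0, 0])   # 25.0
print(df)               # 2, i.e. (2-1) * (3-1)
```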

Case: Industry vs. Region in the YRD

Question: Is the industry distribution independent of province (Shanghai, Jiangsu, Zhejiang)?

Testing the top 4 industries across 3 provinces:

  • \(\chi^2\) statistic computed from \(4 \times 3\) table
  • \(df = (4-1)(3-1) = 6\)
  • Result: Cannot reject independence (\(p > 0.05\))

Interpretation: The three YRD provinces have surprisingly homogeneous industry structures — the tech and manufacturing mix is similar across Shanghai, Jiangsu, and Zhejiang.

Cramér’s V confirms: the association strength is negligible.

Homogeneity vs. Independence: What’s the Difference?

| Aspect | Independence Test | Homogeneity Test |
| Design | One sample, two variables | Multiple samples, one variable |
| Question | Are X and Y independent? | Do groups have the same distribution? |
| Example | Is industry independent of profitability? | Do Shanghai, Jiangsu, Zhejiang have the same industry mix? |
| Math | Identical \(\chi^2\) formula | Identical \(\chi^2\) formula |

They use the same test statistic but answer conceptually different questions.

McNemar Test: Paired Categorical Data

For before/after or matched-pair binary outcomes:

| | After: Yes | After: No |
| Before: Yes | \(a\) | \(b\) |
| Before: No | \(c\) | \(d\) |

Only the discordant pairs (\(b\) and \(c\)) carry information:

\[\chi^2 = \frac{(b - c)^2}{b + c}\]

Use case: Did a training program change employees’ views? (Before vs. after survey)
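A stdlib-only sketch with invented counts; for one degree of freedom the tail probability has the closed form \(P(\chi^2_1 > x) = \text{erfc}(\sqrt{x/2})\):

```python
import math

# Invented paired survey: b = Yes -> No switchers, c = No -> Yes switchers
b, c = 6, 18

# Only the discordant pairs enter the statistic
stat = (b - c) ** 2 / (b + c)        # 144 / 24 = 6.0

# df = 1 tail probability via the complementary error function
p = math.erfc(math.sqrt(stat / 2))   # ~ 0.014

print(stat > 3.84)   # True -> reject H0 at alpha = 0.05
```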

Fisher’s Exact Test: When Chi-Square Fails

Problem: Chi-square requires \(E_i \geq 5\). With small samples, this fails.

Solution: Fisher’s exact test computes the exact probability using the hypergeometric distribution.

Case: Risk Control Model Comparison (Securities firm, 50 flagged transactions)

| | Model A Correct | Model A Wrong | Total |
| Model B Correct | 20 | 5 | 25 |
| Model B Wrong | 10 | 15 | 25 |
| Total | 30 | 20 | 50 |

  • Odds Ratio = 6.0
  • Fisher exact \(p = 0.009\)
  • Conclusion: Model A significantly outperforms Model B
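Reproducing the reported numbers with scipy.stats.fisher_exact (SciPy assumed available); the counts are arranged so that the odds ratio is \((20 \times 15)/(5 \times 10) = 6\):

```python
from scipy.stats import fisher_exact

# Rows: Model B correct / wrong; columns: Model A correct / wrong
table = [[20, 5],
         [10, 15]]

odds_ratio, p = fisher_exact(table, alternative="two-sided")

print(odds_ratio)     # 6.0
print(round(p, 3))    # 0.009
```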

Heuristic: Is the Lottery Random?

Test: Simulate 100 draws of the red balls (红球) in China’s “Double Color Ball” lottery and test whether the 33 red-ball numbers appear with equal frequency.

  • Under \(H_0\): each number has probability \(1/33\)
  • Apply chi-square goodness-of-fit test
  • Result: \(p = 0.265\) → Cannot reject randomness

Key insight: Even with a fair lottery, some numbers WILL appear more often than others due to sampling variability. The test tells us whether the deviation is beyond what chance alone would produce.
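A simulation sketch (NumPy and SciPy assumed available); the p-value varies with the seed, so the 0.265 above is just one realization:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)

# 100 draws; each draw selects 6 distinct red balls from the 33 numbers
draws = [rng.choice(33, size=6, replace=False) for _ in range(100)]
counts = np.bincount(np.concatenate(draws), minlength=33)   # 600 balls total

# H0: all 33 numbers equally likely (E_i = 600 / 33 ~ 18.2, all >= 5)
stat, p = chisquare(counts)

print(counts.sum())   # 600
print(round(p, 3))    # varies with the seed; reject uniformity only if p < alpha
```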

Chapter Summary

| Test | When to Use | Key Condition |
| Goodness-of-fit | Does data match a theoretical distribution? | \(E_i \geq 5\) |
| Independence | Are two categorical variables independent? | \(E_{ij} \geq 5\) |
| Homogeneity | Do groups have the same distribution? | Same as independence |
| McNemar | Paired before/after binary data | Sufficient discordant pairs |
| Fisher exact | Small samples (expected counts \(< 5\)) | Valid for any sample size |

Always report Cramér’s V alongside \(\chi^2\) to guard against the curse of large \(N\).