06: Goodness-of-Fit and Contingency Tables

From Continuous to Categorical: A Different Kind of Test

Everything so far has focused on continuous data (means, variances).

But many business questions involve categorical data:

  • Does customer preference differ across regions?
  • Is the industry distribution of firms uniform?
  • Are two classification variables independent?

The chi-square (\(\chi^2\)) test family provides the tools for these questions.

The Chi-Square Distribution

If \(Z_1, Z_2, \ldots, Z_k\) are independent standard normal random variables, then:

\[X = \sum_{i=1}^k Z_i^2 \sim \chi^2_k\]

Key properties:

| Property | Value |
| Mean | \(E[X] = k\) |
| Variance | \(\text{Var}(X) = 2k\) |
| Shape | Right-skewed, approaches Normal as \(k \to \infty\) |
| Support | \([0, \infty)\) — always non-negative |

The parameter \(k\) (degrees of freedom) controls the distribution’s center and spread.
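The properties above are easy to check by simulation; a minimal sketch, assuming NumPy is available:

```python
import numpy as np

# Check the definition by simulation: the sum of k squared standard
# normals should have mean ~ k and variance ~ 2k.
rng = np.random.default_rng(42)
k = 5
n = 200_000

Z = rng.standard_normal((n, k))
X = (Z ** 2).sum(axis=1)      # each row: one chi-square(k) draw

print(round(X.mean(), 1))     # ~ 5.0  (E[X] = k)
print(round(X.var(), 1))      # ~ 10.0 (Var(X) = 2k)
```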

Degrees of Freedom: Intuition

Degrees of freedom (df) = number of values free to vary given constraints.

Example: If 5 category frequencies must sum to \(n = 100\):

  • You can freely choose the first 4 frequencies
  • The 5th is determined: \(f_5 = 100 - f_1 - f_2 - f_3 - f_4\)
  • df = \(5 - 1 = 4\)

General formulas:

| Test Type | df |
| Goodness-of-fit (\(k\) categories, \(m\) estimated parameters) | \(k - 1 - m\) |
| Independence (\(r \times c\) table) | \((r-1)(c-1)\) |

Chi-Square Goodness-of-Fit Test

Question: Does an observed frequency distribution match a hypothesized one?

\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}\]

where \(O_i\) = observed frequency, \(E_i\) = expected frequency under \(H_0\).

Mathematical essence: Each standardized deviation \((O_i - E_i)/\sqrt{E_i}\) is approximately normal, so the sum of their squares is approximately \(\chi^2\) (one degree of freedom is lost to the constraint \(\sum_i O_i = n\)).

Critical condition: All \(E_i \geq 5\) (otherwise the normal approximation breaks down).

Six-Step Testing Procedure

| Step | Action |
| 1. Hypotheses | \(H_0\): Data follows the specified distribution |
| 2. Significance level | Choose \(\alpha\) (typically 0.05) |
| 3. Expected frequencies | Compute \(E_i\) under \(H_0\) |
| 4. Test statistic | \(\chi^2 = \sum (O_i - E_i)^2 / E_i\) |
| 5. Critical value / p-value | Compare to \(\chi^2_{k-1-m}\) |
| 6. Decision | Reject \(H_0\) if \(\chi^2 > \chi^2_{\text{critical}}\) |

Always verify: All \(E_i \geq 5\). If not, merge adjacent categories.
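A hand-rolled illustration of steps 3–6, using invented counts from 120 die rolls (SciPy assumed available for the critical value):

```python
from scipy.stats import chi2

observed = [18, 22, 16, 25, 19, 20]   # invented counts from 120 die rolls
n = sum(observed)                     # 120
expected = [n / 6] * 6                # H0: fair die -> E_i = 20, all >= 5

# Step 4: test statistic chi^2 = sum (O_i - E_i)^2 / E_i
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Step 5: critical value at alpha = 0.05 with df = k - 1 = 5
critical = chi2.ppf(0.95, df=5)

# Step 6: decision
print(stat)                  # 2.5
print(round(critical, 2))    # 11.07
print(stat > critical)       # False -> cannot reject H0
```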

Case: Industry Distribution in the YRD

Question: Are the top 5 industries equally represented among YRD listed companies?

| Industry | Observed | Expected (uniform) |
| Computer & Communication | 287 | 203.8 |
| Machinery | 242 | 203.8 |
| Chemical | 212 | 203.8 |
| Electrical Equipment | 147 | 203.8 |
| Pharmaceutical | 131 | 203.8 |

\(\chi^2 = 83.3\), \(p < 0.001\)

Conclusion: The distribution is far from uniform — Computer & Communication is significantly over-represented.
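The statistic can be reproduced from the observed counts with scipy.stats.chisquare (SciPy assumed available), which defaults to equal expected frequencies:

```python
from scipy.stats import chisquare

# Observed top-5 industry counts; n = 1019, so E_i = 203.8 under uniformity
observed = [287, 242, 212, 147, 131]

stat, p = chisquare(observed)   # equal expected frequencies by default

print(round(stat, 1))   # 83.3
print(p < 0.001)        # True
```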

Dirty Work: The Art of Binning

Binning (how you define categories) directly affects results.

The problem: Different binning schemes → different p-values → potential manipulation.

Three golden rules:

| Rule | Principle |
| Theory first | Let domain knowledge guide category boundaries |
| Equal-frequency | When no theory exists, aim for equal expected counts |
| Merge sparse | Combine categories until all \(E_i \geq 5\) |

Defense against binning hacking: Report results under multiple binning schemes (Multiverse Analysis).

Dirty Work: The Curse of Large N

With large samples, everything becomes statistically significant.

The paradox:

  • \(\chi^2\) statistic scales linearly with \(n\)
  • Even trivial deviations from the null become “significant”
  • A perfectly functioning roulette wheel will “fail” the uniformity test with enough spins

Solution: Always report effect sizes

| Measure | Formula | Interpretation |
| Cramér’s V | \(V = \sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}\) | \(V < 0.1\): negligible |
| Phi coefficient | \(\phi = \sqrt{\chi^2 / n}\) | For \(2 \times 2\) tables |
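A small helper makes the large-\(n\) point concrete (the numbers are invented):

```python
import math

def cramers_v(chi2_stat, n, r, c):
    """Cramér's V for an r x c contingency table."""
    return math.sqrt(chi2_stat / (n * min(r - 1, c - 1)))

# The same chi-square value means very different things at different n:
print(round(cramers_v(20, 100, 2, 2), 3))      # 0.447 -> moderate
print(round(cramers_v(20, 10_000, 2, 2), 3))   # 0.045 -> negligible
```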

Contingency Tables and Independence

A contingency table cross-tabulates two categorical variables.

Expected frequency under independence:

\[E_{ij} = \frac{R_i \times C_j}{n}\]

where \(R_i\) = row total, \(C_j\) = column total, \(n\) = grand total.

Test statistic:

\[\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \quad df = (r-1)(c-1)\]

Intuition: If the variables are independent, knowing the row shouldn’t help predict the column.
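scipy.stats.chi2_contingency performs the whole computation, returning the statistic, p-value, df, and expected table; the 2×3 counts below are invented:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented 2 x 3 table: rows = preference (yes/no), columns = region
table = np.array([[30, 45, 25],
                  [20, 30, 50]])

stat, p, df, expected = chi2_contingency(table)

# E_ij = R_i * C_j / n, e.g. E_11 = 100 * 50 / 200 = 25
print(expected[0, 0])   # 25.0
print(df)               # 2, i.e. (2-1) * (3-1)
```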

Case: Industry vs. Region in the YRD

Question: Is the industry distribution independent of province (Shanghai, Jiangsu, Zhejiang)?

Testing the top 4 industries across 3 provinces:

  • \(\chi^2\) statistic computed from \(4 \times 3\) table
  • \(df = (4-1)(3-1) = 6\)
  • Result: Cannot reject independence (\(p > 0.05\))

Interpretation: The three YRD provinces have surprisingly homogeneous industry structures — the tech and manufacturing mix is similar across Shanghai, Jiangsu, and Zhejiang.

Cramér’s V confirms: the association strength is negligible.

Homogeneity vs. Independence: What’s the Difference?

| Aspect | Independence Test | Homogeneity Test |
| Design | One sample, two variables | Multiple samples, one variable |
| Question | Are X and Y independent? | Do groups have the same distribution? |
| Example | Is industry independent of profitability? | Do Shanghai, Jiangsu, Zhejiang have the same industry mix? |
| Math | Identical \(\chi^2\) formula | Identical \(\chi^2\) formula |

They use the same test statistic but answer conceptually different questions.

McNemar Test: Paired Categorical Data

For before/after or matched-pair binary outcomes:

| | After: Yes | After: No |
| Before: Yes | \(a\) | \(b\) |
| Before: No | \(c\) | \(d\) |

Only the discordant pairs (\(b\) and \(c\)) carry information:

\[\chi^2 = \frac{(b - c)^2}{b + c}\]

Use case: Did a training program change employees’ views? (Before vs. after survey)
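A stdlib-only sketch with invented counts; for one degree of freedom the tail probability has the closed form \(P(\chi^2_1 > x) = \text{erfc}(\sqrt{x/2})\):

```python
import math

# Invented paired survey: b = Yes -> No switchers, c = No -> Yes switchers
b, c = 6, 18

# Only the discordant pairs enter the statistic
stat = (b - c) ** 2 / (b + c)        # 144 / 24 = 6.0

# df = 1 tail probability via the complementary error function
p = math.erfc(math.sqrt(stat / 2))   # ~ 0.014

print(stat > 3.84)   # True -> reject H0 at alpha = 0.05
```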

Fisher’s Exact Test: When Chi-Square Fails

Problem: Chi-square requires \(E_i \geq 5\). With small samples, this fails.

Solution: Fisher’s exact test computes the exact probability using the hypergeometric distribution.

Case: Risk Control Model Comparison (Securities firm, 50 flagged transactions)

| | Model A Correct | Model A Wrong | Total |
| Model B Correct | 20 | 5 | 25 |
| Model B Wrong | 10 | 15 | 25 |
| Total | 30 | 20 | 50 |

  • Odds Ratio = 6.0
  • Fisher exact \(p = 0.009\)
  • Conclusion: Model A significantly outperforms Model B
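Reproducing the reported numbers with scipy.stats.fisher_exact (SciPy assumed available); the counts are arranged so that the odds ratio is \((20 \times 15)/(5 \times 10) = 6\):

```python
from scipy.stats import fisher_exact

# Rows: Model B correct / wrong; columns: Model A correct / wrong
table = [[20, 5],
         [10, 15]]

odds_ratio, p = fisher_exact(table, alternative="two-sided")

print(odds_ratio)     # 6.0
print(round(p, 3))    # 0.009
```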

Heuristic: Is the Lottery Random?

Test: Simulate 100 draws of the red balls (红球) in China’s “Double Color Ball” lottery and test whether the 33 red-ball numbers appear with equal frequency.

  • Under \(H_0\): each number has probability \(1/33\)
  • Apply chi-square goodness-of-fit test
  • Result: \(p = 0.265\) → Cannot reject randomness

Key insight: Even with a fair lottery, some numbers WILL appear more often than others due to sampling variability. The test tells us whether the deviation is beyond what chance alone would produce.
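A simulation sketch (NumPy and SciPy assumed available); the p-value varies with the seed, so the 0.265 above is just one realization:

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)

# 100 draws; each draw selects 6 distinct red balls from the 33 numbers
draws = [rng.choice(33, size=6, replace=False) for _ in range(100)]
counts = np.bincount(np.concatenate(draws), minlength=33)   # 600 balls total

# H0: all 33 numbers equally likely (E_i = 600 / 33 ~ 18.2, all >= 5)
stat, p = chisquare(counts)

print(counts.sum())   # 600
print(round(p, 3))    # varies with the seed; reject uniformity only if p < alpha
```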

Chapter Summary

| Test | When to Use | Key Condition |
| Goodness-of-fit | Does data match a theoretical distribution? | \(E_i \geq 5\) |
| Independence | Are two categorical variables independent? | \(E_{ij} \geq 5\) |
| Homogeneity | Do groups have the same distribution? | Same as independence |
| McNemar | Paired before/after binary data | Sufficient discordant pairs |
| Fisher exact | Small samples (expected counts \(< 5\)) | Valid for any sample size |

Always report Cramér’s V alongside \(\chi^2\) to guard against the curse of large \(N\).