07: Inference for Means

Three Flavors of Mean Comparison

| Test | Design | Question |
| --- | --- | --- |
| One-sample t | Single group vs. benchmark | Is the mean different from \(\mu_0\)? |
| Two-sample t | Two independent groups | Do two populations have the same mean? |
| Paired t | Same subjects, two conditions | Did the treatment change the mean? |

Choosing the wrong test for your data design is one of the most common errors in applied statistics.

This chapter provides a systematic framework for selecting and applying the correct test.

One-Sample t-Test: Theory

Question: Is the population mean \(\mu\) different from a benchmark value \(\mu_0\)?

\[t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \sim t_{n-1}\]

Why t, not z? Because we estimate \(\sigma\) with \(s\), introducing additional uncertainty. The t-distribution has heavier tails than the normal, reflecting this extra source of randomness.

Geometric interpretation: Both the numerator (\(\bar{X}\)) and denominator (\(s\)) are random, creating a ratio distribution that is more spread out than the normal.

Case: Bank Profit Margins vs. Industry Benchmark

Setup: 43 A-share banks, net profit margin vs. 30% industry benchmark.

| Metric | Value |
| --- | --- |
| Sample mean | 37.80% |
| Sample std dev | 9.04% |
| \(t\)-statistic | 5.65 |
| \(p\)-value | 0.000001 |
| Cohen’s d | 0.863 (large) |

Decision: Reject \(H_0\). Bank profit margins are significantly and substantially above the 30% benchmark.

Key: Cohen’s d = 0.863 confirms the effect is practically meaningful, not just statistically detectable.
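The reported numbers can be reproduced from the summary statistics alone; a minimal sketch using scipy (the underlying bank data is not shown, so we work from the reported mean and standard deviation):

```python
from math import sqrt

from scipy import stats

# Reported summary statistics: 43 banks, benchmark mu0 = 30%
n, xbar, s, mu0 = 43, 37.80, 9.04, 30.0

t_stat = (xbar - mu0) / (s / sqrt(n))          # t = (xbar - mu0) / (s / sqrt(n))
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-sided p-value, df = n - 1
cohens_d = (xbar - mu0) / s                    # standardized effect size

# t_stat and cohens_d match the reported 5.65 and 0.863 up to rounding
```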

Two Independent Samples: The Setup

Question: Do two populations have different means?

\[t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{SE_{\bar{X}_1 - \bar{X}_2}}\]

Critical first step: Test for variance equality using Levene’s test.

| Levene’s Result | Use This Test | Why |
| --- | --- | --- |
| \(p > 0.05\) (equal variance) | Student’s t | Pooled SE is more efficient |
| \(p \leq 0.05\) (unequal variance) | Welch’s t | Separate SE estimates |

Modern recommendation: Always use Welch’s t. It loses < 1% efficiency when variances are equal, but protects you when they’re not.
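Levene’s test is available directly in scipy; a quick sketch on synthetic data (the two groups below are invented for illustration, with deliberately unequal spreads):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group1 = rng.normal(loc=0.0, scale=1.0, size=50)  # hypothetical group, sd = 1
group2 = rng.normal(loc=0.0, scale=3.0, size=50)  # hypothetical group, sd = 3

stat, p = stats.levene(group1, group2)  # H0: the two variances are equal
# Small p -> variances differ -> Welch's t, per the recommendation above
```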

Student’s t vs. Welch’s t

Student’s t (equal variance assumed):

\[t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{1/n_1 + 1/n_2}}, \quad S_p = \sqrt{\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}}\]

Welch’s t (no variance assumption):

\[t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}\]

with Satterthwaite approximation for degrees of freedom.

Key difference: Student’s pools the variances; Welch’s keeps them separate.
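scipy’s `ttest_ind_from_stats` computes both variants from summary statistics; the group numbers below are invented to show how the two tests diverge when variances and sample sizes differ:

```python
from scipy import stats

# Hypothetical summary statistics: group 2 has a much larger variance
m1, s1, n1 = 10.0, 2.0, 30
m2, s2, n2 = 11.0, 6.0, 90

t_student, p_student = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2,
                                                  equal_var=True)   # pooled SE
t_welch, p_welch = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2,
                                              equal_var=False)      # separate SEs

# The two t-statistics differ noticeably because the pooled SE
# is dominated by the larger (and more variable) group
```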

Case: Shanghai vs. Guangdong Daily Returns

Data: Daily stock returns for all listed firms in each region.

| Region | Mean Return | Std Dev | \(n\) |
| --- | --- | --- | --- |
| Shanghai | 0.0235% | — | ~400+ firms |
| Guangdong | 0.0351% | — | ~500+ firms |
  • Welch’s t: \(p = 0.248\)
  • Decision: Cannot reject equal means

Interpretation: Despite a seemingly large number of firms, the day-to-day return difference of 0.012 percentage points is well within sampling noise. This is consistent with the Efficient Market Hypothesis — no systematic regional advantage in daily returns.

Dirty Work: Independence Violation

The assumption: Observations within each group must be independent.

Three ways this fails in finance:

| Violation | Mechanism | Consequence |
| --- | --- | --- |
| Clustering | Firms in same industry share common factors | Effective \(n\) < nominal \(n\) |
| Serial correlation | Time series data — today’s return depends on yesterday’s | \(s\) underestimates true uncertainty |
| Common shocks | All firms react to same macro events | Cross-sectional correlation |

Result: Standard errors are too small → t-statistics are inflated → false rejections.

Dirty Work: The “N = 30” Myth

The myth: “If \(n \geq 30\), the CLT ensures the t-test is valid.”

The reality:

| Distribution Shape | \(n\) Required for Valid t-Test |
| --- | --- |
| Symmetric | ~15-20 |
| Moderate skew | ~30-50 |
| Heavy skew | ~100-160 |
| Bimodal | Even \(n = 1000\) may not suffice |

The N = 30 rule originated from Fisher-era table lookup convenience, not from any theoretical result.

Modern advice: Always use Welch’s t (robust to unequal variances) and examine the distribution shape before trusting your results.
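The table’s warning can be checked by simulation; a sketch that repeatedly draws heavily skewed lognormal samples of size 30 under a true null and counts false rejections (nominal rate: 5%):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 30, 5000
true_mean = float(np.exp(0.5))  # exact mean of a lognormal(0, 1) distribution

false_rejections = 0
for _ in range(reps):
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    _, p = stats.ttest_1samp(sample, true_mean)  # H0 is true by construction
    false_rejections += p < 0.05

rate = false_rejections / reps  # noticeably above the nominal 5% despite n = 30
```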

Paired t-Test: Theory

Design: Same subjects measured twice (before/after, method A/method B).

Step 1: Compute differences \(d_i = X_{i,\text{after}} - X_{i,\text{before}}\)

Step 2: Apply one-sample t-test to the differences:

\[t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} \sim t_{n-1}\]

Why pairing? Each subject serves as its own control, removing between-subject variability and dramatically increasing power.
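The two-step recipe can be verified numerically: a paired t-test is exactly a one-sample t-test on the differences. A sketch with invented before/after data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=10.0, scale=2.0, size=25)         # hypothetical baseline
after = before + rng.normal(loc=0.5, scale=1.0, size=25)  # treatment shifts the mean

# Steps 1-2: compute differences, then a one-sample t-test against mu_d = 0
t_diff, p_diff = stats.ttest_1samp(after - before, 0.0)

# scipy's built-in paired test gives identical results
t_pair, p_pair = stats.ttest_rel(after, before)
assert np.isclose(t_diff, t_pair) and np.isclose(p_diff, p_pair)
```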

Case: Bank Stocks 2022 vs. 2023 Returns

Data: 43 bank stocks, annual returns in both years.

| Metric | Value |
| --- | --- |
| Mean difference (\(\bar{d}\)) | 4.08 percentage points |
| Standard deviation of differences (\(s_d\)) | 21.86 pp |
| \(t\)-statistic | 1.22 |
| \(p\)-value | 0.228 |
| Cohen’s d | 0.187 (small) |

Decision: The 4.08 pp improvement is not statistically significant. The large variance in individual bank performance (21.86 pp) overwhelms the modest average improvement.

Lesson: Even a seemingly meaningful average difference can be non-significant when individual variation is high.

Statistical Power: The Fourth Probability

| | \(H_0\) True | \(H_0\) False |
| --- | --- | --- |
| Reject \(H_0\) | Type I Error (\(\alpha\)) | Power (\(1-\beta\)) ✓ |
| Fail to reject | Correct (\(1-\alpha\)) | Type II Error (\(\beta\)) |

Power = probability of detecting a real effect when it exists.

Courtroom analogy:

  • \(\alpha\) = convicting an innocent person (controlled at 5%)
  • \(\beta\) = letting a guilty person go free
  • Power = correctly convicting the guilty

Target: Power \(\geq 80\%\) (convention).

Factors Affecting Power

[Figure: Factors Affecting Statistical Power — power (\(1-\beta\)) increases with effect size (\(\Delta\)), sample size (\(n\)), and significance level (\(\alpha\)), and decreases with variance (\(\sigma^2\)).]

Increase power by: larger effect, larger \(n\), higher \(\alpha\), or lower \(\sigma^2\).

In practice, only sample size is under the researcher’s control.
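These relationships can be made concrete with the normal-approximation power formula for a two-sided two-sample test, \(1-\beta \approx \Phi\!\left(\Delta/(\sigma\sqrt{2/n}) - z_{1-\alpha/2}\right)\); a sketch with illustrative parameter values:

```python
from math import sqrt

from scipy.stats import norm

def power_two_sample(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with n per group
    and a common sigma; the negligible lower-tail term is dropped."""
    se = sigma * sqrt(2.0 / n)
    return norm.sf(norm.ppf(1 - alpha / 2) - delta / se)

# Power rises with n, effect size, and alpha, and falls with sigma
assert power_two_sample(2, 10, 400) > power_two_sample(2, 10, 100)
assert power_two_sample(4, 10, 100) > power_two_sample(2, 10, 100)
assert power_two_sample(2, 10, 100, alpha=0.10) > power_two_sample(2, 10, 100)
assert power_two_sample(2, 5, 100) > power_two_sample(2, 10, 100)
```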

Sample Size Formula for Two-Sample Test

\[n = \frac{(\sigma_1^2 + \sigma_2^2)(z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2} \quad \text{(per group)}\]

where \(\Delta = \mu_1 - \mu_2\) is the minimum detectable difference.
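A direct translation of the standard per-group formula \(n = (\sigma_1^2 + \sigma_2^2)(z_{1-\alpha/2} + z_{1-\beta})^2 / \Delta^2\), with illustrative \(\sigma\) and \(\Delta\) values (`ceil` rounds up to a whole subject per group):

```python
from math import ceil

from scipy.stats import norm

def n_per_group(sigma1, sigma2, delta, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided two-sample test of means."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil((sigma1**2 + sigma2**2) * z**2 / delta**2)

# Illustrative: detect a 2-unit difference when both groups have sigma = 10
n = n_per_group(sigma1=10, sigma2=10, delta=2)
```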

A/B Testing Example:

| Parameter | Value |
| --- | --- |
| Baseline conversion rate | 5% |
| Minimum detectable effect | 1 pp (to 6%) |
| Power | 80% |
| Required \(n\) per group | 3,550 |
| Total required | 7,100 |

Heuristic: The Noise Trader Simulation

Setup: 10,000 individuals make purely random trades (50/50 coin flip).

After 1 year:

  • Some traders have impressive track records (by pure luck)
  • Media profiles the “top performers”
  • Their historical returns show \(p < 0.05\) in backtests

But their future performance reverts to average — because skill was never present.

Lesson: Survivorship bias + selection bias can create the illusion of significance. Always ask: was the hypothesis formed before or after seeing the data?
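The thought experiment is easy to simulate; a sketch in which every trader flips a fair coin daily, so any “significant” track record is pure luck:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_traders, n_days = 10_000, 252  # one trading year

# Daily P&L: +1 or -1 with equal probability -> zero true skill
pnl = rng.choice([-1.0, 1.0], size=(n_traders, n_days))

# Backtest every trader: is mean daily P&L different from 0?
_, p_vals = stats.ttest_1samp(pnl, 0.0, axis=1)

lucky = np.mean(p_vals < 0.05)  # roughly 5% look "significant" with no skill at all
```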

The Complete Decision Framework

[Figure: T-Test Decision Framework — one group → one-sample t-test; two independent groups → Welch’s t-test (always preferred over Student’s); same subjects measured twice → paired t-test. Always report: p-value + effect size (Cohen’s d or Hedges’ g).]

Chapter Summary

| Topic | Key Takeaway |
| --- | --- |
| One-sample t | Compare mean to a fixed benchmark (\(\mu_0\)) |
| Two-sample t | Always use Welch’s t — nearly as efficient as Student’s when variances are equal |
| Paired t | Compute differences first; dramatically increases power |
| Effect size | Cohen’s d (small: 0.2, medium: 0.5, large: 0.8) |
| Independence | Clustering and serial correlation inflate t-statistics |
| N = 30 myth | Based on table convenience, not theory |
| Power analysis | Plan sample size before collecting data |

The golden rule: A well-designed study with proper power analysis is worth more than any post-hoc statistical fix.