07: Inference for Means

Three Flavors of Mean Comparison

| Test | Design | Question |
| --- | --- | --- |
| One-sample t | Single group vs. benchmark | Is the mean different from \(\mu_0\)? |
| Two-sample t | Two independent groups | Do two populations have the same mean? |
| Paired t | Same subjects, two conditions | Did the treatment change the mean? |

Choosing the wrong test for your data design is one of the most common errors in applied statistics.

This chapter provides a systematic framework for selecting and applying the correct test.

One-Sample t-Test: Theory

Question: Is the population mean \(\mu\) different from a benchmark value \(\mu_0\)?

\[t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \sim t_{n-1}\]

Why t, not z? Because we estimate \(\sigma\) with \(s\), introducing additional uncertainty. The t-distribution has heavier tails than the normal, reflecting this extra source of randomness.

Geometric interpretation: Both the numerator (\(\bar{X}\)) and denominator (\(s\)) are random, creating a ratio distribution that is more spread out than the normal.

Case: Bank Profit Margins vs. Industry Benchmark

Setup: 43 A-share banks, net profit margin vs. 30% industry benchmark.

| Metric | Value |
| --- | --- |
| Sample mean | 37.80% |
| Sample std dev | 9.04% |
| \(t\)-statistic | 5.65 |
| \(p\)-value | 0.000001 |
| Cohen’s d | 0.863 (large) |

Decision: Reject \(H_0\). Bank profit margins are significantly and substantially above the 30% benchmark.

Key: Cohen’s d = 0.863 confirms the effect is practically meaningful, not just statistically detectable.
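The reported numbers can be reproduced from the summary statistics alone; a minimal sketch using scipy (the underlying bank data is not shown, so we work from the reported mean and standard deviation):

```python
from math import sqrt

from scipy import stats

# Reported summary statistics: 43 banks, benchmark mu0 = 30%
n, xbar, s, mu0 = 43, 37.80, 9.04, 30.0

t_stat = (xbar - mu0) / (s / sqrt(n))          # t = (xbar - mu0) / (s / sqrt(n))
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-sided p-value, df = n - 1
cohens_d = (xbar - mu0) / s                    # standardized effect size

# t_stat and cohens_d match the reported 5.65 and 0.863 up to rounding
```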

Two Independent Samples: The Setup

Question: Do two populations have different means?

\[t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{SE_{\bar{X}_1 - \bar{X}_2}}\]

Critical first step: Test for variance equality using Levene’s test.

| Levene’s Result | Use This Test | Why |
| --- | --- | --- |
| \(p > 0.05\) (equal variance) | Student’s t | Pooled SE is more efficient |
| \(p \leq 0.05\) (unequal variance) | Welch’s t | Separate SE estimates |

Modern recommendation: Always use Welch’s t. It loses < 1% efficiency when variances are equal, but protects you when they’re not.
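Levene’s test is available directly in scipy; a quick sketch on synthetic data (the two groups below are invented for illustration, with deliberately unequal spreads):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group1 = rng.normal(loc=0.0, scale=1.0, size=50)  # hypothetical group, sd = 1
group2 = rng.normal(loc=0.0, scale=3.0, size=50)  # hypothetical group, sd = 3

stat, p = stats.levene(group1, group2)  # H0: the two variances are equal
# Small p -> variances differ -> Welch's t, per the recommendation above
```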

Student’s t vs. Welch’s t

Student’s t (equal variance assumed):

\[t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{1/n_1 + 1/n_2}}, \quad S_p = \sqrt{\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}}\]

Welch’s t (no variance assumption):

\[t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}\]

with Satterthwaite approximation for degrees of freedom.

Key difference: Student’s pools the variances; Welch’s keeps them separate.
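scipy’s `ttest_ind_from_stats` computes both variants from summary statistics; the group numbers below are invented to show how the two tests diverge when variances and sample sizes differ:

```python
from scipy import stats

# Hypothetical summary statistics: group 2 has a much larger variance
m1, s1, n1 = 10.0, 2.0, 30
m2, s2, n2 = 11.0, 6.0, 90

t_student, p_student = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2,
                                                  equal_var=True)   # pooled SE
t_welch, p_welch = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2,
                                              equal_var=False)      # separate SEs

# The two t-statistics differ noticeably because the pooled SE
# is dominated by the larger (and more variable) group
```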

Case: Shanghai vs. Guangdong Daily Returns

Data: Daily stock returns for all listed firms in each region.

| Region | Mean Return | Std Dev | \(n\) |
| --- | --- | --- | --- |
| Shanghai | 0.0235% | — | ~400+ firms |
| Guangdong | 0.0351% | — | ~500+ firms |
  • Welch’s t: \(p = 0.248\)
  • Decision: Cannot reject equal means

Interpretation: Despite a seemingly large number of firms, the day-to-day return difference of 0.012 percentage points is well within sampling noise. This is consistent with the Efficient Market Hypothesis — no systematic regional advantage in daily returns.

Dirty Work: Independence Violation

The assumption: Observations within each group must be independent.

Three ways this fails in finance:

| Violation | Mechanism | Consequence |
| --- | --- | --- |
| Clustering | Firms in same industry share common factors | Effective \(n\) < nominal \(n\) |
| Serial correlation | Time series data — today’s return depends on yesterday’s | \(s\) underestimates true uncertainty |
| Common shocks | All firms react to same macro events | Cross-sectional correlation |

Result: Standard errors are too small → t-statistics are inflated → false rejections.

Dirty Work: The “N = 30” Myth

The myth: “If \(n \geq 30\), the CLT ensures the t-test is valid.”

The reality:

| Distribution Shape | \(n\) Required for Valid t-Test |
| --- | --- |
| Symmetric | ~15-20 |
| Moderate skew | ~30-50 |
| Heavy skew | ~100-160 |
| Bimodal | Even \(n = 1000\) may not suffice |

The N = 30 rule originated from Fisher-era table lookup convenience, not from any theoretical result.

Modern advice: Always use Welch’s t (robust to unequal variances) and examine the distribution shape before trusting your results.
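The table’s warning can be checked by simulation; a sketch that repeatedly draws heavily skewed lognormal samples of size 30 under a true null and counts false rejections (nominal rate: 5%):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 30, 5000
true_mean = float(np.exp(0.5))  # exact mean of a lognormal(0, 1) distribution

false_rejections = 0
for _ in range(reps):
    sample = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    _, p = stats.ttest_1samp(sample, true_mean)  # H0 is true by construction
    false_rejections += p < 0.05

rate = false_rejections / reps  # noticeably above the nominal 5% despite n = 30
```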

Paired t-Test: Theory

Design: Same subjects measured twice (before/after, method A/method B).

Step 1: Compute differences \(d_i = X_{i,\text{after}} - X_{i,\text{before}}\)

Step 2: Apply one-sample t-test to the differences:

\[t = \frac{\bar{d} - \mu_d}{s_d / \sqrt{n}} \sim t_{n-1}\]

Why pairing? Each subject serves as its own control, removing between-subject variability and dramatically increasing power.
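The two-step recipe can be verified numerically: a paired t-test is exactly a one-sample t-test on the differences. A sketch with invented before/after data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=10.0, scale=2.0, size=25)         # hypothetical baseline
after = before + rng.normal(loc=0.5, scale=1.0, size=25)  # treatment shifts the mean

# Steps 1-2: compute differences, then a one-sample t-test against mu_d = 0
t_diff, p_diff = stats.ttest_1samp(after - before, 0.0)

# scipy's built-in paired test gives identical results
t_pair, p_pair = stats.ttest_rel(after, before)
assert np.isclose(t_diff, t_pair) and np.isclose(p_diff, p_pair)
```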

Case: Bank Stocks 2022 vs. 2023 Returns

Data: 43 bank stocks, annual returns in both years.

| Metric | Value |
| --- | --- |
| Mean difference (\(\bar{d}\)) | 4.08 percentage points |
| Standard deviation of differences (\(s_d\)) | 21.86 pp |
| \(t\)-statistic | 1.22 |
| \(p\)-value | 0.228 |
| Cohen’s d | 0.187 (small) |

Decision: The 4.08 pp improvement is not statistically significant. The large variance in individual bank performance (21.86 pp) overwhelms the modest average improvement.

Lesson: Even a seemingly meaningful average difference can be non-significant when individual variation is high.

Statistical Power: The Fourth Probability

| | \(H_0\) True | \(H_0\) False |
| --- | --- | --- |
| Reject \(H_0\) | Type I Error (\(\alpha\)) | Power (\(1-\beta\)) ✓ |
| Fail to reject | Correct (\(1-\alpha\)) | Type II Error (\(\beta\)) |

Power = probability of detecting a real effect when it exists.

Courtroom analogy:

  • \(\alpha\) = convicting an innocent person (controlled at 5%)
  • \(\beta\) = letting a guilty person go free
  • Power = correctly convicting the guilty

Target: Power \(\geq 80\%\) (convention).

Factors Affecting Power

[Figure: Factors Affecting Statistical Power — power (\(1-\beta\)) increases with effect size (\(\Delta\)), sample size (\(n\)), and significance level (\(\alpha\)), and decreases with variance (\(\sigma^2\)).]

Increase power by: larger effect, larger \(n\), higher \(\alpha\), or lower \(\sigma^2\).

In practice, only sample size is under the researcher’s control.
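These relationships can be made concrete with the normal-approximation power formula for a two-sided two-sample test, \(1-\beta \approx \Phi\!\left(\Delta/(\sigma\sqrt{2/n}) - z_{1-\alpha/2}\right)\); a sketch with illustrative parameter values:

```python
from math import sqrt

from scipy.stats import norm

def power_two_sample(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with n per group
    and a common sigma; the negligible lower-tail term is dropped."""
    se = sigma * sqrt(2.0 / n)
    return norm.sf(norm.ppf(1 - alpha / 2) - delta / se)

# Power rises with n, effect size, and alpha, and falls with sigma
assert power_two_sample(2, 10, 400) > power_two_sample(2, 10, 100)
assert power_two_sample(4, 10, 100) > power_two_sample(2, 10, 100)
assert power_two_sample(2, 10, 100, alpha=0.10) > power_two_sample(2, 10, 100)
assert power_two_sample(2, 5, 100) > power_two_sample(2, 10, 100)
```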

Sample Size Formula for Two-Sample Test

\[n = \frac{(\sigma_1^2 + \sigma_2^2)(z_{1-\alpha/2} + z_{1-\beta})^2}{\Delta^2} \quad \text{(per group)}\]

where \(\Delta = \mu_1 - \mu_2\) is the minimum detectable difference.
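A direct translation of the standard per-group formula \(n = (\sigma_1^2 + \sigma_2^2)(z_{1-\alpha/2} + z_{1-\beta})^2 / \Delta^2\), with illustrative \(\sigma\) and \(\Delta\) values (`ceil` rounds up to a whole subject per group):

```python
from math import ceil

from scipy.stats import norm

def n_per_group(sigma1, sigma2, delta, alpha=0.05, power=0.80):
    """Sample size per group for a two-sided two-sample test of means."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil((sigma1**2 + sigma2**2) * z**2 / delta**2)

# Illustrative: detect a 2-unit difference when both groups have sigma = 10
n = n_per_group(sigma1=10, sigma2=10, delta=2)
```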

A/B Testing Example:

| Parameter | Value |
| --- | --- |
| Baseline conversion rate | 5% |
| Minimum detectable effect | 1 pp (to 6%) |
| Power | 80% |
| Required \(n\) per group | 3,550 |
| Total required | 7,100 |

Heuristic: The Noise Trader Simulation

Setup: 10,000 individuals make purely random trades (50/50 coin flip).

After 1 year:

  • Some traders have impressive track records (by pure luck)
  • Media profiles the “top performers”
  • Their historical returns show \(p < 0.05\) in backtests

But their future performance reverts to average — because skill was never present.

Lesson: Survivorship bias + selection bias can create the illusion of significance. Always ask: was the hypothesis formed before or after seeing the data?
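The thought experiment is easy to simulate; a sketch in which every trader flips a fair coin daily, so any “significant” track record is pure luck:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_traders, n_days = 10_000, 252  # one trading year

# Daily P&L: +1 or -1 with equal probability -> zero true skill
pnl = rng.choice([-1.0, 1.0], size=(n_traders, n_days))

# Backtest every trader: is mean daily P&L different from 0?
_, p_vals = stats.ttest_1samp(pnl, 0.0, axis=1)

lucky = np.mean(p_vals < 0.05)  # roughly 5% look "significant" with no skill at all
```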

The Complete Decision Framework

[Figure: T-Test Decision Framework — one group → one-sample t-test; two independent groups → Welch’s t-test (always preferred over Student’s); same subjects measured twice → paired t-test. Always report: p-value + effect size (Cohen’s d or Hedges’ g).]

Chapter Summary

| Topic | Key Takeaway |
| --- | --- |
| One-sample t | Compare mean to a fixed benchmark (\(\mu_0\)) |
| Two-sample t | Always use Welch’s t — nearly as efficient as Student’s when variances are equal |
| Paired t | Compute differences first; dramatically increases power |
| Effect size | Cohen’s d (small: 0.2, medium: 0.5, large: 0.8) |
| Independence | Clustering and serial correlation inflate t-statistics |
| N = 30 myth | Based on table convenience, not theory |
| Power analysis | Plan sample size before collecting data |

The golden rule: A well-designed study with proper power analysis is worth more than any post-hoc statistical fix.