Why t, not z? Because we estimate \(\sigma\) with \(s\), introducing additional uncertainty. The t-distribution has heavier tails than the normal, reflecting this extra source of randomness.
Geometric interpretation: Both the numerator (\(\bar{X} - \mu_0\)) and the denominator (\(s\)) of the t-statistic are random, creating a ratio distribution that is more spread out than the normal.
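The heavier tails are easy to verify numerically; a minimal sketch using SciPy (assumed available): the 97.5% critical value of the t-distribution exceeds the normal's 1.96 for every finite df and shrinks toward it as df grows.

```python
from scipy.stats import t, norm

# 97.5% critical values: t exceeds z for any finite df,
# converging to the normal value as df grows
z_crit = norm.ppf(0.975)  # approximately 1.960
for df in (5, 10, 30, 100):
    t_crit = t.ppf(0.975, df)
    print(f"df={df:>3}: t_crit={t_crit:.3f}  vs  z_crit={z_crit:.3f}")
```

With \(s\) estimated from few observations, the penalty is large (df = 5 gives 2.571); by df = 100 the two are nearly indistinguishable.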
Case: Bank Profit Margins vs. Industry Benchmark
Setup: 43 A-share banks, net profit margin vs. 30% industry benchmark.
| Metric | Value |
|---|---|
| Sample mean | 37.80% |
| Sample std dev | 9.04% |
| \(t\)-statistic | 5.65 |
| \(p\)-value | 0.000001 |
| Cohen's d | 0.863 (large) |
Decision: Reject \(H_0\). Bank profit margins are significantly and substantially above the 30% benchmark.
Key: Cohen’s d = 0.863 confirms the effect is practically meaningful, not just statistically detectable.
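The table's figures can be reproduced from the summary statistics alone; a minimal sketch using SciPy (the tiny gap from the reported t = 5.65 is rounding):

```python
import math
from scipy.stats import t

n, xbar, s, mu0 = 43, 37.80, 9.04, 30.0    # summary stats from the case

se = s / math.sqrt(n)                       # standard error of the mean
t_stat = (xbar - mu0) / se                  # one-sample t-statistic
p_val = 2 * t.sf(t_stat, df=n - 1)          # two-sided p-value
cohens_d = (xbar - mu0) / s                 # effect size in sd units

print(f"t = {t_stat:.2f}, p = {p_val:.1e}, d = {cohens_d:.3f}")
```

Note that Cohen's d divides the raw difference by \(s\), not by the standard error, which is why it stays informative even when \(n\) is huge.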
Two Independent Samples: The Setup
Question: Do two populations have different means?
Welch's t-statistic:

\[
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\]

with the Satterthwaite approximation for the degrees of freedom.
Key difference: Student's t pools the variances; Welch's t keeps them separate.
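A minimal sketch of Welch's statistic and the Satterthwaite df, checked against SciPy's implementation (the data here are synthetic, for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, size=80)           # group 1
b = rng.normal(0.3, 2.0, size=120)          # group 2, unequal variance

v1, v2 = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
t_manual = (a.mean() - b.mean()) / np.sqrt(v1 + v2)
# Satterthwaite (Welch) degrees of freedom
df = (v1 + v2) ** 2 / (v1**2 / (len(a) - 1) + v2**2 / (len(b) - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy : t={t_scipy:.4f}, p={p_scipy:.4f}")
```

`equal_var=False` is what makes `ttest_ind` use Welch's version; the default `equal_var=True` gives Student's pooled test.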
Case: Shanghai vs. Guangdong Daily Returns
Data: Daily stock returns for all listed firms in each region.
| Region | Mean Return | Std Dev | \(n\) |
|---|---|---|---|
| Shanghai | 0.0235% | — | ~400+ firms |
| Guangdong | 0.0351% | — | ~500+ firms |
Welch's t: \(p = 0.248\)
Decision: Fail to reject equal means.
Interpretation: Despite a seemingly large number of firms, the day-to-day return difference of 0.012 percentage points is well within sampling noise. This is consistent with the Efficient Market Hypothesis — no systematic regional advantage in daily returns.
Dirty Work: Independence Violation
The assumption: Observations within each group must be independent.
Three ways this fails in finance:
| Violation | Mechanism | Consequence |
|---|---|---|
| Clustering | Firms in same industry share common factors | Effective \(n\) < nominal \(n\) |
| Serial correlation | Time series data: today's return depends on yesterday's | \(s\) underestimates true uncertainty |
| Common shocks | All firms react to same macro events | Cross-sectional correlation |
Result: Standard errors are too small → t-statistics are inflated → false rejections.
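A small simulation makes the serial-correlation row concrete (my own sketch, with synthetic AR(1) data; the null hypothesis is true by construction, yet the ordinary t-test rejects far more often than the nominal 5%):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ar1_sample(n, phi=0.5):
    """AR(1) series with mean 0: x_t = phi * x_{t-1} + eps_t."""
    x = np.empty(n)
    eps = rng.normal(size=n)
    x[0] = eps[0]
    for i in range(1, n):
        x[i] = phi * x[i - 1] + eps[i]
    return x

n_sims, rejections = 1000, 0
for _ in range(n_sims):
    x = ar1_sample(100)
    # H0: mean = 0 is TRUE, so a valid 5% test should reject ~5% of the time
    if stats.ttest_1samp(x, 0.0).pvalue < 0.05:
        rejections += 1

print(f"false-rejection rate: {rejections / n_sims:.3f} (nominal: 0.050)")
```

With \(\phi = 0.5\) the sample standard deviation badly understates the variability of \(\bar{X}\), so the rejection rate runs several times the nominal level.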
Dirty Work: The “N = 30” Myth
The myth: “If \(n \geq 30\), the CLT ensures the t-test is valid.”
The reality:
| Distribution Shape | \(n\) Required for Valid t-Test |
|---|---|
| Symmetric | ~15-20 |
| Moderate skew | ~30-50 |
| Heavy skew | ~100-160 |
| Bimodal | Even \(n = 1000\) may not suffice |
The N = 30 rule originated from Fisher-era table lookup convenience, not from any theoretical result.
Modern advice: Always use Welch’s t (robust to unequal variances) and examine the distribution shape before trusting your results.
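A quick check of the heavy-skew row (my own sketch, using lognormal data so the true population mean is known exactly): at \(n = 30\) the test's actual Type I error rate sits well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean = np.exp(0.5)        # exact mean of a lognormal(0, 1) distribution

n, n_sims, rejections = 30, 2000, 0
for _ in range(n_sims):
    x = rng.lognormal(0.0, 1.0, size=n)     # heavily right-skewed
    # H0 is TRUE: the population mean really is exp(0.5)
    if stats.ttest_1samp(x, true_mean).pvalue < 0.05:
        rejections += 1

print(f"n = {n}: rejection rate = {rejections / n_sims:.3f} (nominal: 0.050)")
```

Raising `n` toward the table's ~100-160 range brings the rate back near 5%, which is the point of the "examine the shape first" advice.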
Paired t-Test: Theory
Design: Same subjects measured twice (before/after, method A/method B).
Why pairing? Each subject serves as its own control, removing between-subject variability and dramatically increasing power.
Case: Bank Stocks 2022 vs. 2023 Returns
Data: 43 bank stocks, annual returns in both years.
| Metric | Value |
|---|---|
| Mean difference (\(\bar{d}\)) | 4.08 percentage points |
| Standard deviation of differences (\(s_d\)) | 21.86 pp |
| \(t\)-statistic | 1.22 |
| \(p\)-value | 0.228 |
| Cohen's d | 0.187 (small) |
Decision: The 4.08 pp improvement is not statistically significant. The large variance in individual bank performance (21.86 pp) overwhelms the modest average improvement.
Lesson: Even a seemingly meaningful average difference can be non-significant when individual variation is high.
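The table's numbers can be reproduced from the summary statistics alone (a sketch: a paired t-test is just a one-sample t-test applied to the differences):

```python
import math
from scipy.stats import t

n, d_bar, s_d = 43, 4.08, 21.86     # paired differences, in percentage points

se = s_d / math.sqrt(n)             # standard error of the mean difference
t_stat = d_bar / se                 # paired t = one-sample t on the differences
p_val = 2 * t.sf(abs(t_stat), df=n - 1)
cohens_d = d_bar / s_d

print(f"t = {t_stat:.2f}, p = {p_val:.3f}, d = {cohens_d:.3f}")
```

With raw data, `scipy.stats.ttest_rel(year_2023, year_2022)` computes the same test directly.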
Statistical Power: The Fourth Probability
| | \(H_0\) True | \(H_0\) False |
|---|---|---|
| Reject \(H_0\) | Type I Error (\(\alpha\)) | Power (\(1-\beta\)) ✓ |
| Fail to reject | Correct (\(1-\alpha\)) | Type II Error (\(\beta\)) |
Power = probability of detecting a real effect when it exists.
Courtroom analogy:
- \(\alpha\) = convicting an innocent person (controlled at 5%)
- \(\beta\) = letting a guilty person go free
- Power = correctly convicting the guilty
Target: Power \(\geq 80\%\) (convention).
Factors Affecting Power
Increase power by: larger effect, larger \(n\), higher \(\alpha\), or lower \(\sigma^2\).
In practice, only sample size is under the researcher’s control.
For a two-sample comparison of means, the required sample size per group is approximately

\[
n \approx \frac{2\,(z_{\alpha/2} + z_{\beta})^2\,\sigma^2}{\Delta^2},
\]

where \(\Delta = \mu_1 - \mu_2\) is the minimum detectable difference.
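A sketch of this calculation in code (my own illustration; the normal approximation slightly understates the exact t-based answer): for a medium standardized effect, \(\Delta/\sigma = 0.5\), it lands near the textbook figure of about 64 per group.

```python
import math
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sample test of means."""
    z_a = norm.ppf(1 - alpha / 2)   # approximately 1.96 for alpha = 0.05
    z_b = norm.ppf(power)           # approximately 0.84 for 80% power
    return math.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)

# medium effect (Cohen's d = 0.5) at alpha = 0.05, power = 80%
print(n_per_group(delta=0.5, sigma=1.0))
```

Halving \(\Delta\) quadruples the required \(n\), which is why small minimum detectable effects are so expensive.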
A/B Testing Example:
| Parameter | Value |
|---|---|
| Baseline conversion rate | 5% |
| Minimum detectable effect | 1 pp (to 6%) |
| Power | 80% |
| Required \(n\) per group | 3,550 |
| Total required | 7,100 |
Heuristic: The Noise Trader Simulation
Setup: 10,000 individuals make purely random trades (50/50 coin flip).
After 1 year:
- Some traders have impressive track records (by pure luck)
- Media profiles the "top performers"
- Their historical returns show \(p < 0.05\) in backtests
But their future performance reverts to average — because skill was never present.
Lesson: Survivorship bias + selection bias can create the illusion of significance. Always ask: was the hypothesis formed before or after seeing the data?
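A compact version of this simulation (my own sketch, assuming ±1% daily bets over 252 trading days): the top 1% of coin-flippers post impressive year-1 returns, then revert to roughly zero in year 2.

```python
import numpy as np

rng = np.random.default_rng(7)
n_traders, n_days = 10_000, 252

# each trader's daily P&L: +1% or -1% with equal probability, for two years
year1 = np.where(rng.random((n_traders, n_days)) < 0.5, 0.01, -0.01).sum(axis=1)
year2 = np.where(rng.random((n_traders, n_days)) < 0.5, 0.01, -0.01).sum(axis=1)

winners = year1 >= np.quantile(year1, 0.99)          # the "top performers"
print(f"best year-1 return : {year1.max():+.1%}")
print(f"winners, year 1    : {year1[winners].mean():+.1%}")
print(f"same group, year 2 : {year2[winners].mean():+.1%}")  # reverts to ~0
```

Selecting on year-1 performance and then testing that same performance is exactly the hypothesis-after-data trap the lesson warns about.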
The Complete Decision Framework
Chapter Summary
| Topic | Key Takeaway |
|---|---|
| One-sample t | Compare mean to a fixed benchmark (\(\mu_0\)) |
| Two-sample t | Always use Welch's t; nearly as efficient as Student's when variances are equal |
| Paired t | Compute differences first; dramatically increases power |
| Effect size | Cohen's d (small: 0.2, medium: 0.5, large: 0.8) |
| Independence | Clustering and serial correlation inflate t-statistics |
| N = 30 myth | Based on table convenience, not theory |
| Power analysis | Plan sample size before collecting data |
The golden rule: A well-designed study with proper power analysis is worth more than any post-hoc statistical fix.