05: Inferential Statistics

Why Inference? From Sample to Population

In practice, we never observe the entire population.

  • A bank cannot survey every potential customer
  • An asset manager cannot wait for infinite return data
  • A regulator cannot audit every transaction

Inferential statistics provides the mathematical machinery to draw rigorous conclusions about a population from a limited sample — and to quantify our uncertainty.

The Three Pillars of Inference

| Pillar | Question It Answers | Key Output |
| --- | --- | --- |
| Point Estimation | What is our best single guess for \(\theta\)? | \(\hat{\theta}\) |
| Interval Estimation | What range plausibly contains \(\theta\)? | Confidence interval |
| Hypothesis Testing | Is a specific claim about \(\theta\) supported by the data? | p-value, decision |

These three tools form a complete inferential toolkit for data-driven decision making in finance and business.

Point Estimation: Three Desirable Properties

A good estimator \(\hat{\theta}\) should satisfy:

1. Unbiasedness: On average, it hits the target.

\[E[\hat{\theta}] = \theta\]

2. Efficiency: Among all unbiased estimators, it has the smallest variance (Cramér-Rao Lower Bound).

3. Consistency: As \(n \to \infty\), the estimator converges to the true value.

\[\hat{\theta}_n \xrightarrow{p} \theta\]

Maximum Likelihood Estimation (MLE)

MLE answers: What parameter value makes the observed data most probable?

Given observations \(x_1, x_2, \ldots, x_n\), the likelihood function is:

\[L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)\]

The MLE maximizes this (or equivalently, the log-likelihood):

\[\hat{\theta}_{MLE} = \arg\max_\theta \sum_{i=1}^{n} \ln f(x_i \mid \theta)\]

Key property: MLE is asymptotically efficient — it achieves the Cramér-Rao bound as \(n \to \infty\).

MLE in Action: YRD Profitability Rate

Case: Among 1,978 listed companies in the Yangtze River Delta (2023 Q3), what fraction is profitable?

  • Model: Each company’s profitability is Bernoulli(\(p\))
  • Data: 1,678 companies reported positive net income
  • MLE: \(\hat{p} = 1678 / 1978 = 84.83\%\)

Caution: MLE for variance is \(\frac{1}{n}\sum(x_i - \bar{x})^2\), which is biased. The unbiased version divides by \(n-1\).

This illustrates a general lesson: MLE is not always unbiased, but under standard regularity conditions it is consistent.
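As a sanity check, the closed-form estimate \(\hat{p} = k/n\) can be recovered by maximizing the log-likelihood numerically. A minimal sketch using `scipy.optimize`, with the Bernoulli counts from the case above:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# YRD profitability case: 1,678 profitable firms out of 1,978 (Bernoulli model)
n, k = 1978, 1678

def neg_log_likelihood(p):
    # Bernoulli log-likelihood: k*ln(p) + (n-k)*ln(1-p), negated for minimization
    return -(k * np.log(p) + (n - k) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9), method="bounded")
p_mle = res.x
print(f"Numeric MLE: {p_mle:.4f}, closed form k/n: {k / n:.4f}")
```

The numeric optimum agrees with the closed form, as the theory predicts.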

Confidence Intervals: The Concept

A 95% confidence interval does NOT mean:

“There is a 95% probability that \(\mu\) lies in this interval.”

It means:

“If we repeated this sampling procedure infinitely many times, 95% of the constructed intervals would contain the true \(\mu\).”

The randomness is in the interval, not in \(\mu\). The population parameter \(\mu\) is fixed but unknown.
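The repeated-sampling interpretation can be demonstrated directly: simulate many samples from a known population and count how many of the resulting 95% intervals contain the true mean. A minimal sketch (the population parameters and simulation counts are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma, n, n_sims = 10.0, 3.0, 30, 2000
t_crit = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(n_sims):
    sample = rng.normal(mu, sigma, n)
    margin = t_crit * sample.std(ddof=1) / np.sqrt(n)
    if sample.mean() - margin <= mu <= sample.mean() + margin:
        covered += 1

print(f"Empirical coverage: {covered / n_sims:.3f}")  # close to 0.95
```

Each interval either contains \(\mu\) or it does not; the 95% refers to the procedure's long-run hit rate.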

CI for the Mean: Two Cases

Case 1: \(\sigma\) known (rare in practice)

\[\bar{X} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\]

Case 2: \(\sigma\) unknown (the standard case)

\[\bar{X} \pm t_{\alpha/2, \, n-1} \cdot \frac{s}{\sqrt{n}}\]

CI for a proportion:

\[\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

The margin of error shrinks at rate \(1/\sqrt{n}\) — to halve the margin, you need 4× the sample size.

CI Case: Average ROE of YRD Electronics Firms

Data: 190 electronic industry firms in YRD, 2023 Q3.

  • Sample mean ROE: \(\bar{X} = 2.10\%\)
  • Sample std dev: \(s = 12.38\%\)

| Confidence Level | \(t\) Critical Value | CI |
| --- | --- | --- |
| 90% | 1.653 | [0.62%, 3.59%] |
| 95% | 1.973 | [0.33%, 3.87%] |
| 99% | 2.602 | [−0.24%, 4.44%] |

Interpretation: We are 95% confident that the true average ROE for YRD electronics firms lies between 0.33% and 3.87%.
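The intervals above can be reproduced from the reported summary statistics alone. A sketch using `scipy.stats`:

```python
import numpy as np
from scipy import stats

# Summary statistics from the YRD electronics case (ROE, in percent)
n, xbar, s = 190, 2.10, 12.38

def t_interval(conf):
    # t-based CI for the mean when sigma is unknown
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
    margin = t_crit * s / np.sqrt(n)
    return xbar - margin, xbar + margin

for conf in (0.90, 0.95, 0.99):
    lo, hi = t_interval(conf)
    print(f"{conf:.0%} CI: [{lo:.2f}%, {hi:.2f}%]")
```

Note how the interval widens as the confidence level rises: more confidence costs precision.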

Sample Size Planning

How large a sample do we need?

For estimating a mean with margin of error \(E\):

\[n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2\]

For estimating a proportion:

\[n = \frac{z_{\alpha/2}^2 \cdot p(1-p)}{E^2}\]

Example: Zhejiang textile industry revenue survey (\(\sigma = 50\) million CNY):

| Target Margin | Required \(n\) |
| --- | --- |
| ±10 million CNY | 97 |
| ±5 million CNY | 385 |
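The sample-size formula for a mean translates directly to code; a minimal sketch (z-value from `scipy`, n rounded up since sample sizes are integers):

```python
import math
from scipy import stats

def sample_size_for_mean(sigma, margin, conf=0.95):
    """Required n to estimate a mean within +/- margin at the given confidence."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma / margin) ** 2)

# Zhejiang textile revenue survey, sigma = 50 million CNY
print(sample_size_for_mean(50, 10))  # 97
print(sample_size_for_mean(50, 5))   # 385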

Dirty Work: P-Hacking

P-hacking = testing many hypotheses until you find a “significant” one.

Simulation: Test 100 completely random features against a random target at \(\alpha = 0.05\).

  • Expected false positives: \(100 \times 0.05 = 5\)
  • Observed: ~5 features pass the significance test purely by chance

The lesson: With enough tests, you will find “significant” results even when nothing is real.

Defense: Pre-registration, multiple testing corrections (Bonferroni, FDR), and replication.
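The simulation described above can be sketched as follows (the 200-observation sample size is an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_features, alpha = 200, 100, 0.05

target = rng.normal(size=n_obs)            # random target: no real signal anywhere
false_positives = 0
for _ in range(n_features):
    feature = rng.normal(size=n_obs)       # random feature, independent of target
    _, p = stats.pearsonr(feature, target)
    if p < alpha:
        false_positives += 1

print(f"'Significant' features out of {n_features}: {false_positives}")
```

Every feature is pure noise, yet roughly \(\alpha \times 100 = 5\) of them clear the significance bar.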

Dirty Work: The File Drawer Problem

Publication bias creates a distorted view of reality:

  • Studies with \(p < 0.05\) get published
  • Studies with \(p > 0.05\) stay in the “file drawer”
  • Published literature systematically overestimates effect sizes
[Figure: file drawer illustration. Published studies (\(p < 0.05\)): 3 of 20 (15% publication rate); file drawer (\(p \ge 0.05\)): 17 of 20 (85% unpublished). Result: the literature overestimates true effects.]

Hypothesis Testing: The Logic of Proof by Contradiction

Hypothesis testing follows the logic of proof by contradiction:

  1. Assume the null hypothesis \(H_0\) is true
  2. Compute how surprising the observed data would be under \(H_0\)
  3. If very surprising (small p-value), reject \(H_0\)

| | \(H_0\) True | \(H_0\) False |
| --- | --- | --- |
| Reject \(H_0\) | Type I Error (\(\alpha\)) | Correct (Power = \(1-\beta\)) |
| Fail to reject | Correct | Type II Error (\(\beta\)) |

Convention: \(\alpha = 0.05\) (willing to accept 5% false positive rate).

The p-Value: What It Actually Means

\[p\text{-value} = P(\text{data this extreme or more} \mid H_0 \text{ is true})\]

Three critical misconceptions:

| Misconception | Reality |
| --- | --- |
| “p = probability \(H_0\) is true” | p measures data surprise, not hypothesis probability |
| “\(p < 0.05\) means the effect is large” | Statistical significance ≠ practical importance |
| “\(p > 0.05\) means no effect exists” | Absence of evidence ≠ evidence of absence |

The ASA Statement (2016): “A p-value does not measure the probability that the studied hypothesis is true.”
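A useful way to internalize the definition: when \(H_0\) is true, p-values are uniformly distributed on [0, 1], so a fraction \(\alpha\) of tests reject purely by chance. A quick simulation (sample size and repetition count are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Draw 2000 samples from a population where H0 (mu = 0) is exactly true
p_values = [stats.ttest_1samp(rng.normal(0, 1, 50), 0).pvalue for _ in range(2000)]
share_below_05 = np.mean(np.array(p_values) < 0.05)
print(f"Share of p < 0.05 under a true null: {share_below_05:.3f}")  # close to 0.05
```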

One-Sample t-Test

Question: Does the population mean equal a hypothesized value \(\mu_0\)?

\[t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \sim t_{n-1}\]

Case: YRD Bank ROE vs. 2.5% Benchmark

  • 18 YRD-region banks, mean ROE = 8.54%
  • \(t = 8.13\), \(p < 0.001\)
  • Cohen’s d = 1.92 (very large effect)

Conclusion: YRD banks’ ROE is significantly and substantially above the 2.5% benchmark.
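The test statistic and effect size can be recomputed from summary statistics. The sample standard deviation is not reported in the case above, so `s = 3.15` (%) below is an assumed value chosen to be consistent with the reported Cohen's d:

```python
import numpy as np
from scipy import stats

# YRD bank case summary stats; s is NOT reported in the text and is an
# assumed value (consistent with the reported Cohen's d of 1.92)
n, xbar, mu0, s = 18, 8.54, 2.5, 3.15

t_stat = (xbar - mu0) / (s / np.sqrt(n))       # one-sample t statistic
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)  # two-sided p-value
cohens_d = (xbar - mu0) / s                    # standardized effect size

print(f"t = {t_stat:.2f}, p = {p_value:.2e}, d = {cohens_d:.2f}")
```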

Two-Sample t-Test

Question: Do two populations have the same mean?

\[t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{1/n_1 + 1/n_2}}\]

where the pooled standard deviation is:

\[S_p = \sqrt{\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1 + n_2 - 2}}\]

Case: Shanghai (425 firms) vs. Anhui (168 firms) ROE

  • Shanghai: \(\bar{X}_1 = 2.19\%\), Anhui: \(\bar{X}_2 = 0.32\%\)
  • \(p = 0.054\) → Fail to reject at 5% level
  • Borderline result — more data might tip the balance
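The pooled formula can be checked against `scipy.stats.ttest_ind_from_stats`, which implements the same computation. The standard deviations below are illustrative placeholders, since the case reports only means and sample sizes:

```python
import numpy as np
from scipy import stats

def pooled_t(xbar1, s1, n1, xbar2, s2, n2):
    """Two-sample t statistic with pooled standard deviation (equal variances)."""
    sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (xbar1 - xbar2) / (sp * np.sqrt(1 / n1 + 1 / n2))

# Means and sample sizes from the Shanghai vs. Anhui case;
# the standard deviations (10.0, 11.0) are ASSUMED for illustration
t_manual = pooled_t(2.19, 10.0, 425, 0.32, 11.0, 168)
t_scipy, p = stats.ttest_ind_from_stats(2.19, 10.0, 425, 0.32, 11.0, 168,
                                        equal_var=True)
print(f"manual t = {t_manual:.3f}, scipy t = {t_scipy:.3f}, p = {p:.3f}")
```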

Statistical Significance vs. Practical Importance

A result can be:

| | Practically Important | Practically Trivial |
| --- | --- | --- |
| Statistically Significant | Ideal finding | Large \(n\) trap |
| Not Significant | Underpowered study | True null |

Example: With a large enough sample (on the order of millions of users per group), even a conversion rate difference of 0.05 percentage points (5.00% vs. 5.05%) becomes statistically significant, while remaining practically trivial.

Always report effect sizes alongside p-values:

  • Cohen’s d for means: small (0.2), medium (0.5), large (0.8)
  • Odds ratio for proportions

Heuristic: The Hot Hand Fallacy

Scenario: A basketball player makes 8 consecutive shots. Is she “hot”?

Statistical reality:

  • In a sequence of Bernoulli trials (50% success rate), runs of 8 occur more often than intuition suggests
  • Our brains are pattern-seeking machines — we see streaks where randomness exists
  • The same fallacy applies to fund manager “hot streaks”

Lesson: Before attributing performance to skill, always test against the null hypothesis of pure randomness.
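This is easy to check by simulation: generate many 200-shot "seasons" for a 50% shooter and count how often a streak of 8 or more makes appears (the season length is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(7)

def longest_run(shots):
    """Length of the longest run of consecutive makes (1s)."""
    best = run = 0
    for s in shots:
        run = run + 1 if s else 0
        best = max(best, run)
    return best

# 50% shooter, 200 shots per season: how often does an 8+ streak appear?
n_seasons = 5000
hits = sum(longest_run(rng.integers(0, 2, 200)) >= 8 for _ in range(n_seasons))
print(f"Seasons with an 8+ streak: {hits / n_seasons:.1%}")
```

Long streaks are a routine feature of pure chance, which is exactly why they feel so persuasive.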

Heuristic: Regression to the Mean

Observation: Extreme values in one period tend to be less extreme in the next.

  • The top-performing fund this year will likely underperform next year
  • A company with an exceptionally high ROE will likely see it decline
  • Students who score highest on one exam tend to score lower on the next

This is NOT mysterious — it’s a direct mathematical consequence of imperfect correlation between successive measurements.

Implication: Don’t confuse regression to the mean with actual causal deterioration.
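The mechanism can be seen in a two-period simulation with imperfect correlation (\(\rho = 0.5\) is an arbitrary choice): the top decile in period 1 is, on average, much closer to the mean in period 2:

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 100_000, 0.5   # imperfect correlation between the two periods

period1 = rng.normal(size=n)
period2 = rho * period1 + np.sqrt(1 - rho**2) * rng.normal(size=n)

top = period1 > np.quantile(period1, 0.9)   # top decile in period 1
print(f"Top-decile mean, period 1: {period1[top].mean():.2f}")
print(f"Same units, period 2:      {period2[top].mean():.2f}")  # pulled toward 0
```

Nothing "deteriorated" between periods; selection on an extreme plus imperfect correlation guarantees the pull toward the mean.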

Chapter Summary

| Concept | Key Takeaway |
| --- | --- |
| Point Estimation | MLE is consistent and asymptotically efficient |
| Confidence Intervals | Width \(\propto 1/\sqrt{n}\); 4× data for half the margin |
| Hypothesis Testing | Proof by contradiction; control Type I error at \(\alpha\) |
| p-Value | Measures data surprise, NOT probability of \(H_0\) |
| Effect Size | Always report alongside the p-value |
| P-Hacking | Multiple testing inflates the false positive rate |
| File Drawer | Publication bias overestimates effects |

The golden rule: Statistical significance without practical significance is meaningless.