05: Inferential Statistics
Why Inference? From Sample to Population
In practice, we never observe the entire population.
A bank cannot survey every potential customer
An asset manager cannot wait for infinite return data
A regulator cannot audit every transaction
Inferential statistics provides the mathematical machinery to draw rigorous conclusions about a population from a limited sample, and to quantify our uncertainty.
This chapter bridges descriptive statistics and decision-making. Everything we have computed so far (means, variances, correlations) was a sample statistic. The big question is: what can these numbers tell us about the true, unknown population parameters?
The Three Pillars of Inference
Point Estimation
What is our best single guess for \(\theta\) ?
\(\hat{\theta}\)
Interval Estimation
What range plausibly contains \(\theta\) ?
Confidence Interval
Hypothesis Testing
Is a specific claim about \(\theta\) supported by data?
p-value, decision
These three tools form a complete inferential toolkit for data-driven decision making in finance and business.
Think of it this way: point estimation gives you a single number, interval estimation gives you a range with a confidence level, and hypothesis testing gives you a yes/no decision framework. We need all three.
Point Estimation: Three Desirable Properties
A good estimator \(\hat{\theta}\) should satisfy:
1. Unbiasedness: On average, it hits the target.
\[E[\hat{\theta}] = \theta\]
2. Efficiency: Among all unbiased estimators, it has the smallest variance (Cramér-Rao Lower Bound).
3. Consistency: As \(n \to \infty\) , the estimator converges to the true value.
\[\hat{\theta}_n \xrightarrow{p} \theta\]
Unbiasedness means no systematic error. Efficiency means minimum random error. Consistency means that with enough data, you will eventually get the right answer. Note that even the sample variance uses \(n-1\) (Bessel’s correction) precisely to achieve unbiasedness.
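A quick simulation makes the bias of the divide-by-\(n\) estimator visible. This is a minimal sketch assuming normally distributed data with known true variance; the sample size, trial count, and seed are arbitrary choices:

```python
import random

random.seed(42)

true_var = 4.0          # variance of N(0, 2^2)
n, trials = 5, 20000

biased, unbiased = 0.0, 0.0
for _ in range(trials):
    x = [random.gauss(0, 2) for _ in range(n)]
    m = sum(x) / n
    ss = sum((xi - m) ** 2 for xi in x)
    biased += ss / n          # MLE-style estimator: divides by n
    unbiased += ss / (n - 1)  # Bessel-corrected sample variance

print(round(biased / trials, 2))    # systematically below 4.0
print(round(unbiased / trials, 2))  # close to 4.0
```

The divide-by-\(n\) average lands near \(\sigma^2 (n-1)/n\), exactly the shortfall Bessel's correction removes.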
Maximum Likelihood Estimation (MLE)
MLE answers: What parameter value makes the observed data most probable?
Given observations \(x_1, x_2, \ldots, x_n\) , the likelihood function is:
\[L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)\]
The MLE maximizes this (or equivalently, the log-likelihood):
\[\hat{\theta}_{MLE} = \arg\max_\theta \sum_{i=1}^{n} \ln f(x_i \mid \theta)\]
Key property: MLE is asymptotically efficient — it achieves the Cramér-Rao bound as \(n \to \infty\) .
MLE is the workhorse of modern statistics. For the Bernoulli case, if we observe \(k\) successes in \(n\) trials, the MLE for \(p\) is simply \(k/n\). For normal data, the MLE for \(\mu\) is the sample mean. The beauty of MLE is its generality — it works for any parametric model.
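The Bernoulli claim can be checked numerically. This sketch maximizes the log-likelihood over a simple grid rather than solving the first-order condition analytically; the grid resolution is an arbitrary choice:

```python
import math

def bernoulli_loglik(p, k, n):
    """Log-likelihood of k successes in n Bernoulli(p) trials."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

k, n = 7, 10
# Grid search over p in (0, 1); the maximizer should land at k/n = 0.7.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_loglik(p, k, n))
print(p_hat)  # 0.7
```

The same grid-search pattern works for any one-parameter likelihood, which is the generality the text refers to.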
MLE in Action: YRD Profitability Rate
Case: Among 1,978 listed companies in the Yangtze River Delta (2023 Q3), what fraction is profitable?
Model: Each company’s profitability is Bernoulli(\(p\) )
Data: 1,678 companies reported positive net income
MLE: \(\hat{p} = 1678 / 1978 = 84.83\%\)
Caution: the MLE for the variance is \(\frac{1}{n}\sum(x_i - \bar{x})^2\), which is biased. The unbiased version divides by \(n-1\).
This illustrates a general lesson: MLE is not always unbiased, but it is always consistent.
This is real data from the Yangtze River Delta. The MLE for a proportion is beautifully simple — just the sample proportion. But remember, for the variance parameter, MLE gives a biased estimate. That’s why we always use n-1 in practice for sample variance.
Confidence Intervals: The Concept
A 95% confidence interval does NOT mean:
“There is a 95% probability that \(\mu\) lies in this interval.”
It means:
“If we repeated this sampling procedure infinitely many times, 95% of the constructed intervals would contain the true \(\mu\) .”
The randomness is in the interval, not in \(\mu\). The population parameter \(\mu\) is fixed but unknown.
This is one of the most commonly misunderstood concepts in statistics. The frequentist interpretation says \(\mu\) is a fixed constant — it’s the interval that’s random. Each time you draw a new sample, you get a different interval, and 95% of those intervals will capture the true \(\mu\).
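The repeated-sampling interpretation can be demonstrated directly. This sketch assumes normal data and uses the approximate critical value \(t_{0.025,\,29} \approx 2.045\); it builds 5,000 intervals and counts how many capture the true mean:

```python
import random
import statistics

random.seed(0)

mu, sigma, n, trials = 10.0, 2.0, 30, 5000
t_crit = 2.045  # approx t_{0.025, 29}

covered = 0
for _ in range(trials):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.mean(x)
    s = statistics.stdev(x)
    half = t_crit * s / n ** 0.5
    if xbar - half <= mu <= xbar + half:
        covered += 1

print(covered / trials)  # close to 0.95
```

About 95% of the intervals contain \(\mu\), even though any single interval either contains it or does not.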
CI for the Mean: Two Cases
Case 1: \(\sigma\) known (rare in practice)
\[\bar{X} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\]
Case 2: \(\sigma\) unknown (the standard case)
\[\bar{X} \pm t_{\alpha/2, \, n-1} \cdot \frac{s}{\sqrt{n}}\]
CI for a proportion:
\[\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
The margin of error shrinks at rate \(1/\sqrt{n}\): to halve the margin, you need 4× the sample size.
The key insight is that margin of error decreases with the square root of n, not linearly. This has enormous practical implications for sample size planning. Doubling your sample size only reduces the margin by about 30%. To cut it in half, you need four times as many observations.
CI Case: Average ROE of YRD Electronics Firms
Data: 190 electronic industry firms in YRD, 2023 Q3.
Sample mean ROE: \(\bar{X} = 2.10\%\)
Sample std dev: \(s = 12.38\%\)
90% confidence, \(t = 1.653\): [0.62%, 3.59%]
95% confidence, \(t = 1.973\): [0.33%, 3.87%]
99% confidence, \(t = 2.602\): [−0.24%, 4.44%]
Interpretation: We are 95% confident that the true average ROE for YRD electronics firms lies between 0.33% and 3.87%.
Notice how the interval widens as confidence level increases. At 99% confidence, the interval actually includes zero — meaning we can’t even be sure the average ROE is positive! This illustrates the fundamental trade-off between confidence and precision.
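The intervals can be reproduced from the summary statistics alone. This sketch hardcodes the critical values reported on the slide; tiny differences in the last decimal against the slide can arise from rounding:

```python
import math

n, xbar, s = 190, 2.10, 12.38  # percent units, from the slide
se = s / math.sqrt(n)

intervals = {}
for conf, t_crit in [(90, 1.653), (95, 1.973), (99, 2.602)]:
    half = t_crit * se
    intervals[conf] = (xbar - half, xbar + half)
    print(f"{conf}%: [{xbar - half:.2f}%, {xbar + half:.2f}%]")
```

Note how the 99% interval straddles zero while the narrower ones do not, matching the widening pattern described above.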
Sample Size Planning
How large a sample do we need?
For estimating a mean with margin of error \(E\) :
\[n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2\]
For estimating a proportion:
\[n = \frac{z_{\alpha/2}^2 \cdot p(1-p)}{E^2}\]
Example: Zhejiang textile industry revenue survey (\(\sigma = 50\) million CNY):
Margin ±10 million CNY: \(n = 97\)
Margin ±5 million CNY: \(n = 385\)
Sample size planning should happen BEFORE data collection. Notice the quadratic relationship — halving the margin of error from 10 million to 5 million requires roughly 4 times the sample size (from 97 to 385). In corporate research, budget constraints often determine the achievable precision.
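The mean-case formula takes only a few lines. This is a sketch using the 95% value \(z_{0.025} = 1.96\) and the slide's \(\sigma = 50\) million CNY:

```python
import math

def sample_size_mean(z, sigma, E):
    """Required n to estimate a mean within margin E at critical value z."""
    return math.ceil((z * sigma / E) ** 2)

z95, sigma = 1.96, 50  # million CNY, from the slide
print(sample_size_mean(z95, sigma, 10))  # 97
print(sample_size_mean(z95, sigma, 5))   # 385
```

Rounding up with `ceil` is deliberate: rounding down would leave the margin of error slightly larger than the target \(E\).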
Dirty Work: P-Hacking
P-hacking = testing many hypotheses until you find a “significant” one.
Simulation: Test 100 completely random features against a random target at \(\alpha = 0.05\) .
Expected false positives: \(100 \times 0.05 = 5\)
Observed: ~5 features pass the significance test purely by chance
The lesson: With enough tests, you will find “significant” results even when nothing is real.
Defense: Pre-registration, multiple testing corrections (Bonferroni, FDR), and replication.
This is one of the most important warnings in modern statistics. If you test 100 hypotheses at the 5% level, you expect about 5 false positives. Some researchers — consciously or not — keep testing different variables, subgroups, or model specifications until they find a significant p-value. This is p-hacking, and it’s a major contributor to the replication crisis.
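The simulation described above can be sketched as 100 one-sample t-tests on pure noise, so the null hypothesis is true by construction. The approximate critical value \(t_{0.025,\,49} \approx 2.01\), the sample size, and the seed are arbitrary choices:

```python
import random
import statistics

random.seed(1)

n, n_tests, t_crit = 50, 100, 2.01  # approx t_{0.025, 49}
false_positives = 0
for _ in range(n_tests):
    x = [random.gauss(0, 1) for _ in range(n)]  # null is TRUE: mean is 0
    t = statistics.mean(x) / (statistics.stdev(x) / n ** 0.5)
    if abs(t) > t_crit:
        false_positives += 1

print(false_positives)  # typically around 5, purely by chance
```

Reporting only the handful of "significant" tests from a run like this, while hiding the rest, is exactly what p-hacking amounts to.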
Dirty Work: The File Drawer Problem
Publication bias creates a distorted view of reality:
Studies with \(p < 0.05\) get published
Studies with \(p > 0.05\) stay in the “file drawer”
Published literature systematically overestimates effect sizes
File Drawer Problem (illustration): of 20 hypothetical studies on the same question, the 3 with \(p < 0.05\) get published (a 15% publication rate), while the other 17 with \(p \geq 0.05\) stay hidden in the file drawer (85% unpublished). Result: the literature overestimates true effects.
The file drawer problem means that what we see in published journals is not representative of all research conducted. If 20 teams independently study the same question and only the 3 that find significant results publish, the literature will suggest a strong effect where there may be none. This is why replication studies and meta-analyses are so important.
The p-Value: What It Actually Means
\[p\text{-value} = P(\text{data this extreme or more} \mid H_0 \text{ is true})\]
Three critical misconceptions:
“p = probability \(H_0\) is true”
p measures data surprise, not hypothesis probability
“\(p < 0.05\) means the effect is large”
Statistical significance ≠ practical importance
“\(p > 0.05\) means no effect exists”
Absence of evidence ≠ evidence of absence
The ASA Statement (2016): “A p-value does not measure the probability that the studied hypothesis is true.”
The p-value is perhaps the most misunderstood quantity in all of science. It is NOT the probability that the null hypothesis is true. It is the probability of observing data as extreme as what we got, assuming the null IS true. This is a subtle but crucial distinction. Also, remember that a small p-value with a tiny effect size may be statistically significant but practically meaningless.
One-Sample t-Test
Question: Does the population mean equal a hypothesized value \(\mu_0\) ?
\[t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \sim t_{n-1}\]
Case: YRD Bank ROE vs. 2.5% Benchmark
18 YRD-region banks, mean ROE = 8.54%
\(t = 8.13\) , \(p \approx 0.000\)
Cohen’s d = 1.92 (very large effect)
Conclusion: YRD banks’ ROE is significantly and substantially above the 2.5% benchmark.
The t-test is beautifully simple: take the difference between the sample mean and the hypothesized value, divide by the standard error. If this ratio is large (far from zero), the data are inconsistent with the null. In this bank case, the t-statistic of 8.13 is enormous — the p-value is essentially zero. But notice we also report Cohen’s d, which shows the effect is practically large, not just statistically significant.
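The arithmetic can be verified from summary statistics. The slide does not report the sample standard deviation; \(s = 3.15\%\) below is back-derived from the reported \(t\) and Cohen's d, so treat it as an assumption rather than the actual data:

```python
def one_sample_t(xbar, s, n, mu0):
    """One-sample t statistic for H0: mu = mu0, from summary statistics."""
    return (xbar - mu0) / (s / n ** 0.5)

# Slide's YRD bank case: n = 18, mean ROE 8.54%, benchmark 2.5%.
# s = 3.15% is an implied value, not reported on the slide.
t = one_sample_t(8.54, 3.15, 18, 2.5)
d = (8.54 - 2.5) / 3.15  # Cohen's d: mean difference in std-dev units
print(round(t, 2))  # close to the reported 8.13
print(round(d, 2))  # close to the reported 1.92
```

Note that \(d\) divides by \(s\) while \(t\) divides by the standard error \(s/\sqrt{n}\); that is why \(t\) grows with \(n\) but \(d\) does not.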
Two-Sample t-Test
Question: Do two populations have the same mean?
\[t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{1/n_1 + 1/n_2}}\]
where the pooled standard deviation is:
\[S_p = \sqrt{\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1 + n_2 - 2}}\]
Case: Shanghai (425 firms) vs. Anhui (168 firms) ROE
Shanghai: \(\bar{X}_1 = 2.19\%\) , Anhui: \(\bar{X}_2 = 0.32\%\)
\(p = 0.054\) → Fail to reject at 5% level
Borderline result — more data might tip the balance
The two-sample test compares means across two independent groups. The pooled standard deviation combines information from both samples. In this Shanghai vs Anhui comparison, the p-value of 0.054 is tantalizingly close to the 0.05 threshold. This is exactly the kind of result that should make you uncomfortable with rigid cutoffs — the difference is 1.87 percentage points, which might be economically meaningful even if not statistically significant at the conventional level.
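The pooled statistic can be computed from summary numbers alone. The slide does not report the two groups' standard deviations, so the values \(s_1 = s_2 = 11.0\) below are purely illustrative placeholders, not the actual data:

```python
import math

def pooled_two_sample_t(x1, s1, n1, x2, s2, n2):
    """Pooled two-sample t statistic (equal-variance assumption)."""
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (x1 - x2) / (sp * math.sqrt(1 / n1 + 1 / n2))

# Means and group sizes from the slide; std devs are placeholders.
t = pooled_two_sample_t(2.19, 11.0, 425, 0.32, 11.0, 168)
print(round(t, 2))
```

With these placeholder spreads the statistic lands a little below the two-sided 5% cutoff, consistent with the borderline \(p = 0.054\) the slide reports.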
Statistical Significance vs. Practical Importance
A result can fall into one of four cells:
Significant and practically important: the ideal finding
Significant but practically trivial: the large-\(n\) trap
Not significant but practically important: an underpowered study
Not significant and unimportant: a true null
Example: a conversion-rate difference of 0.05 percentage points (5.00% vs 5.05%) is far too small to detect with 100,000 users per group, yet with a couple of million users per group it becomes statistically significant (\(p \approx 0.02\)).
Always report effect sizes alongside p-values:
Cohen’s d for means: small (0.2), medium (0.5), large (0.8)
Odds ratio for proportions
This is arguably the most important slide in this chapter. With enough data, you can detect arbitrarily small differences. A conversion rate improvement of 0.05 percentage points might be statistically detectable with 100,000 users, but is it worth redesigning your website for? Always pair statistical significance with a measure of practical importance.
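The large-\(n\) trap can be illustrated with a two-proportion z-test on a fixed 0.05-percentage-point lift at growing sample sizes. This is a sketch using the normal approximation with equal group sizes (so the pooled proportion is just the average):

```python
from statistics import NormalDist

def two_prop_z_pvalue(p1, p2, n):
    """Two-sided two-proportion z-test p-value, equal group sizes n."""
    pbar = (p1 + p2) / 2                      # pooled proportion (equal n)
    se = (2 * pbar * (1 - pbar) / n) ** 0.5
    z = (p2 - p1) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same tiny effect, increasingly large samples:
for n in [100_000, 500_000, 2_000_000]:
    print(n, round(two_prop_z_pvalue(0.0500, 0.0505, n), 3))
```

The effect never changes; only the sample size does, and the p-value marches from clearly non-significant to significant. That is exactly why significance alone cannot certify practical importance.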
Heuristic: The Hot Hand Fallacy
Scenario: A basketball player makes 8 consecutive shots. Is she “hot”?
Statistical reality:
In a sequence of Bernoulli trials (50% success rate), runs of 8 occur more often than intuition suggests
Our brains are pattern-seeking machines — we see streaks where randomness exists
The same fallacy applies to fund manager “hot streaks”
Lesson: Before attributing performance to skill, always test against the null hypothesis of pure randomness.
Humans are terrible at recognizing randomness. We see patterns in coin flips, stock charts, and sports statistics that aren’t really there. When a fund manager has three great years in a row, we assume skill. But with thousands of fund managers, some will have great runs purely by chance. This connects directly to our hypothesis testing framework — always ask: could this have happened by chance?
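The streak claim is easy to check by simulation: generate fair 0/1 shot sequences and count how often a run of 8 makes appears. The sequence length (200 shots) and seed are arbitrary choices:

```python
import random

random.seed(7)

def has_run(shots, length=8):
    """True if the 0/1 sequence contains a success run of `length`."""
    streak = 0
    for s in shots:
        streak = streak + 1 if s else 0
        if streak >= length:
            return True
    return False

trials, n_shots = 10_000, 200
hits = sum(has_run([random.random() < 0.5 for _ in range(n_shots)])
           for _ in range(trials))
print(hits / trials)  # roughly 3 in 10 sequences contain such a run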
Heuristic: Regression to the Mean
Observation: Extreme values in one period tend to be less extreme in the next.
The top-performing fund this year will likely underperform next year
A company with an exceptionally high ROE will likely see it decline
Students who score highest on one exam tend to score lower on the next
This is NOT mysterious — it’s a direct mathematical consequence of imperfect correlation between successive measurements.
Implication: Don’t confuse regression to the mean with actual causal deterioration.
Francis Galton discovered this phenomenon studying the heights of parents and children. It’s purely statistical: if your first measurement is extreme (far from the mean), your second measurement — which is imperfectly correlated — will tend to be closer to the mean. This has enormous implications in finance: the “best” stocks or funds in one period rarely stay the best. This is regression to the mean, not necessarily a change in fundamental quality.
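A small simulation shows the effect. The two periods share a common component so that their correlation is \(\rho = 0.5\) (an assumed value); the top 5% of performers in period 1 are then tracked into period 2:

```python
import random

random.seed(3)

rho, n = 0.5, 50_000  # rho: assumed correlation between the two periods
pairs = []
for _ in range(n):
    shared = random.gauss(0, 1)
    # Each period = shared component + independent noise, so that
    # corr(x1, x2) = rho while each period is standard normal.
    x1 = rho ** 0.5 * shared + (1 - rho) ** 0.5 * random.gauss(0, 1)
    x2 = rho ** 0.5 * shared + (1 - rho) ** 0.5 * random.gauss(0, 1)
    pairs.append((x1, x2))

# Top 5% in period 1, then the same entities' average in period 2.
pairs.sort(key=lambda p: p[0], reverse=True)
top = pairs[: n // 20]
m1 = sum(p[0] for p in top) / len(top)
m2 = sum(p[1] for p in top) / len(top)
print(round(m1, 2), round(m2, 2))  # period-2 mean is pulled toward 0
```

In expectation the period-2 mean of the top group is \(\rho\) times its period-1 mean: nothing about the entities changed, only the selection.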
Chapter Summary
Point Estimation
MLE is consistent and asymptotically efficient
Confidence Intervals
Width \(\propto 1/\sqrt{n}\) ; 4× data for half the margin
Hypothesis Testing
Proof by contradiction; control Type I error at \(\alpha\)
p-Value
Measures data surprise, NOT probability of \(H_0\)
Effect Size
Always report alongside p-value
P-Hacking
Multiple testing inflates false positive rate
File Drawer
Publication bias overestimates effects
The golden rule: Statistical significance without practical significance is meaningless.
Let me leave you with one thought. The machinery of inference — estimation, confidence intervals, hypothesis tests — is powerful but dangerous. These tools tell you about statistical patterns, but they don’t automatically tell you about real-world importance. Always combine statistical results with domain knowledge and common sense.