Chapter 1: Introduction to Statistics
Why Statistics? The Engine Behind Modern Business Decisions
In the era of big data, statistics is the core methodology for extracting insights from data.
- Every business decision involves uncertainty
- Statistics provides a scientific framework to quantify and manage uncertainty
- From risk management to marketing optimization, statistics is everywhere
Four Real-World Applications in Finance & Business
Statistics powers critical decisions across the financial industry:
| Application | Key Methods | Chapters |
|---|---|---|
| Portfolio Risk Management | Mean-Variance, Correlation | Ch. 2, 8 |
| Financial Quality Assessment | Descriptive Statistics, Testing | Ch. 2, 5 |
| Quantitative Factor Investing | Regression, Machine Learning | Ch. 8–13 |
| Macroeconomic Policy Analysis | Time Series, Inference | Ch. 5–7 |
Application 1: Portfolio Risk — Diversification Through Correlation
The Markowitz Mean-Variance framework uses correlation to build optimal portfolios.
Key insight: combining assets with low or negative correlation reduces portfolio risk.
\[ \large{ \sigma_p^2 = \sum_i \sum_j w_i w_j \sigma_{ij} } \]
- \(w_i\): weight of asset \(i\) in the portfolio
- \(\sigma_{ij}\): covariance between asset \(i\) and asset \(j\)
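The double sum above is just the quadratic form \(w^\top \Sigma w\). As a minimal sketch, here it is in NumPy; the weights and covariance matrix are illustrative assumptions, not figures from the text:

```python
import numpy as np

# Hypothetical two-asset portfolio (weights and covariances are
# made-up illustration values, not data from the chapter).
w = np.array([0.6, 0.4])            # portfolio weights, sum to 1
cov = np.array([[0.04, -0.01],      # sigma_11, sigma_12
                [-0.01, 0.09]])     # sigma_21, sigma_22 (negative covariance)

# sigma_p^2 = sum_i sum_j w_i w_j sigma_ij  ==  w' @ cov @ w
var_p = w @ cov @ w
print(f"Portfolio variance: {var_p:.4f}")

# Baseline with the same variances but zero correlation, to show
# how the negative covariance term lowers total risk.
var_uncorr = w @ np.diag(np.diag(cov)) @ w
print(f"Variance if uncorrelated: {var_uncorr:.4f}")
```

With these numbers the cross terms \(2 w_1 w_2 \sigma_{12}\) are negative, so the diversified portfolio's variance comes out below the zero-correlation baseline.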
Application 2: A/B Testing — Is the New Strategy Really Better?
A securities firm tests a new multi-factor stock selection model:
| Group | Sample Size | Success Rate |
|---|---|---|
| Treatment (New Model) | 10,000 | 25% |
| Control (Old Model) | 10,000 | 22% |
The core question: Is the 3 percentage point difference statistically significant, or just random noise?
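One standard way to answer this is a pooled two-proportion z-test (the course's own treatment of testing comes later; this is a stdlib-only sketch using the table's figures):

```python
import math

# Two-proportion z-test for the A/B test above.
n_t, n_c = 10_000, 10_000     # sample sizes per group
p_t, p_c = 0.25, 0.22         # observed success rates

# Pooled rate under H0: both groups share one true rate.
p_pool = (p_t * n_t + p_c * n_c) / (n_t + n_c)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
z = (p_t - p_c) / se

# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
print(f"z = {z:.2f}, p-value = {p_value:.1e}")
```

With 10,000 observations per arm, a 3-point gap is roughly five standard errors wide, so it is very unlikely to be random noise; with only 100 observations per arm the same gap would not be significant.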
The Law of Large Numbers: Why More Data = More Certainty
As sample size \(n\) increases, the sample mean converges to the true population mean:
\[ \large{ \bar{X}_n \xrightarrow{P} \mu \quad \text{as } n \to \infty } \]
- With 10 coin flips, getting 70% heads is plausible
- With 10,000 coin flips, getting 70% heads is virtually impossible
- This is why A/B tests need large sample sizes
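The coin-flip bullets above are easy to simulate; this sketch just repeats a fair-coin experiment at three sample sizes:

```python
import random

random.seed(42)  # fixed seed for a reproducible illustration

# Proportion of heads in n fair-coin flips, for growing n:
# small samples wander far from 0.5, large samples hug it.
for n in (10, 100, 10_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"n = {n:>6}: proportion of heads = {heads / n:.3f}")
```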
Data Fundamentals: The Structure of a Dataset
Most datasets are organized as a rectangular table (a DataFrame, in pandas terms):
- Rows = Observations (individual units, e.g., companies, transactions)
- Columns = Variables (attributes measured for each unit)
| Stock Code | Company | Sector | Market Cap (B CNY) | ROE (%) |
|---|---|---|---|---|
| 002415 | Hikvision | Technology | 280.5 | 21.3 |
| 002142 | Bank of Ningbo | Banking | 185.2 | 16.8 |
| 600585 | Conch Cement | Materials | 142.7 | 15.2 |
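As a sketch, the same table as a pandas DataFrame; the two numeric column labels (`market_cap_bn`, `roe_pct`) are assumptions for illustration:

```python
import pandas as pd

# Rows = observations (companies), columns = variables (attributes).
# Note the stock codes are kept as strings, not numbers.
df = pd.DataFrame({
    "stock_code":    ["002415", "002142", "600585"],
    "company":       ["Hikvision", "Bank of Ningbo", "Conch Cement"],
    "sector":        ["Technology", "Banking", "Materials"],
    "market_cap_bn": [280.5, 185.2, 142.7],   # assumed label
    "roe_pct":       [21.3, 16.8, 15.2],      # assumed label
})

print(df.shape)   # (3, 5): 3 observations, 5 variables
print(df.dtypes)  # object vs. float64 mirrors categorical vs. numerical
```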
Variable Types: The Complete Taxonomy
Numerical vs. Categorical: The Key Distinction
Numerical variables have meaningful arithmetic operations:
- Discrete: countable values (number of employees, number of trades)
- Continuous: any value in a range (stock price, revenue, ROE)
Categorical variables represent group membership:
- Nominal: no natural ordering (industry sector, province)
- Ordinal: meaningful ordering (credit rating: AAA > AA > A)
Critical trap: Just because data looks like a number doesn’t make it numerical!
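Stock codes are the classic case: store them as strings, because casting the label to a number silently destroys information:

```python
# A stock code looks numeric, but it is a categorical label:
# arithmetic on it is meaningless, and an int cast loses the zeros.
code = "002415"                 # Hikvision's stock code
print(int(code))                # 2415 -- leading zeros are gone
print(str(int(code)) == code)   # False: the original label is unrecoverable
```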
Three Types of Variable Relationships
Real data from Chinese A-share companies reveals three fundamental patterns:
| Pattern | Example | Interpretation |
|---|---|---|
| Negative correlation | Stock price vs. trading volume | Higher prices → lower volume |
| Positive correlation | Revenue vs. net profit | Growth drives profitability |
| No association | Province vs. ROE | Location doesn’t predict ROE |
Confounding Variables: The Hidden Threat
A confounding variable influences both the supposed cause and effect.
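A simulation makes this concrete. In the hypothetical sketch below, `z` (think of company size) drives both `x` and `y`, while `x` has no effect on `y` at all; the two still come out strongly correlated:

```python
import random

random.seed(0)
n = 5_000

# Confounder z influences both x and y; x does NOT cause y.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

def corr(a, b):
    """Pearson correlation, computed from first principles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)
    sa = (sum((ai - ma) ** 2 for ai in a) / len(a)) ** 0.5
    sb = (sum((bi - mb) ** 2 for bi in b) / len(b)) ** 0.5
    return cov / (sa * sb)

print(f"corr(x, y) = {corr(x, y):.2f}")  # clearly positive despite no causal link
```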
Data Collection: Observational vs. Experimental Studies
| Aspect | Observational Study | Experiment |
|---|---|---|
| Researcher control | None — observe as-is | Active manipulation |
| Causation | Cannot establish | Can establish |
| Cost | Lower | Higher |
| Types | Cross-sectional, Retrospective, Prospective | Randomized controlled |
Gold standard: Randomized Controlled Experiment with:
- Control group — baseline comparison
- Randomization — eliminates confounders
- Replication — adequate sample size
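The randomization step is mechanically simple. A minimal sketch, assuming hypothetical account IDs and an even 50/50 split:

```python
import random

random.seed(1)

# Randomly assign 20 hypothetical accounts to treatment vs. control.
# Shuffling before splitting is what breaks any link between group
# membership and confounding traits.
accounts = [f"ACC{i:03d}" for i in range(20)]
random.shuffle(accounts)                      # the randomization step
treatment, control = accounts[:10], accounts[10:]

print("Treatment:", treatment)
print("Control:  ", control)
```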
Sampling: From Population to Sample
Key concepts:
- Population (\(N\)): All units of interest (e.g., all 5,000+ A-share companies)
- Sample (\(n\)): A subset selected for study (e.g., 100 randomly chosen companies)
- Parameter (\(\mu\)): True population value (usually unknown)
- Statistic (\(\bar{x}\)): Calculated from sample data (our best estimate)
Why sample? Because studying the entire population is usually:
- Too expensive
- Too time-consuming
- Sometimes physically impossible
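The parameter/statistic distinction can be seen directly by simulating a population where the parameter is known. The population figures below are invented for illustration:

```python
import random

random.seed(7)

# Hypothetical population of 5,000 "companies" with ROE in percent.
population = [random.gauss(10, 4) for _ in range(5_000)]
mu = sum(population) / len(population)        # parameter (normally unknown)

# Draw a simple random sample of n = 100 and compute its mean.
sample = random.sample(population, 100)
x_bar = sum(sample) / len(sample)             # statistic (our estimate)

print(f"parameter mu = {mu:.2f}, statistic x_bar = {x_bar:.2f}")
```

In practice we only ever see `x_bar`; the point of inference (Chapters 5–7) is quantifying how far it can plausibly sit from `mu`.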
Four Types of Sampling Bias
| Bias Type | Mechanism | Example |
|---|---|---|
| Selection bias | Systematic exclusion | Only sampling large-cap stocks |
| Voluntary response | Self-selected participants | Online satisfaction surveys |
| Nonresponse bias | Non-respondents differ | High-net-worth clients ignore surveys |
| Survivorship bias | Only ‘survivors’ observed | Backtesting on active stocks only |
Survivorship Bias: The Hidden Danger in Financial Analysis
Abraham Wald’s WWII insight: armor the areas with no bullet holes — those planes never came back.
Financial parallel:
- Backtesting on currently active stocks only → ignoring bankrupt companies
- Result: systematically inflated returns
Evidence from the Chinese A-share market:
- Total historical listings: ~5,000+
- Currently active: ~4,500
- Delisted: ~200+
- Ignoring delisted companies = upward bias
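The size of the bias is easy to demonstrate with a toy market. In this sketch the return distribution and the delisting cutoff are assumptions chosen only to mimic the mechanism:

```python
import random

random.seed(3)

# Hypothetical market: firm returns are noisy, and firms with very
# poor returns get delisted. Averaging over survivors only inflates
# the apparent performance.
returns = [random.gauss(0.05, 0.20) for _ in range(5_000)]
survivors = [r for r in returns if r > -0.30]   # assumed delisting rule

full_mean = sum(returns) / len(returns)
surv_mean = sum(survivors) / len(survivors)
print(f"all firms:      mean return = {full_mean:+.3f}")
print(f"survivors only: mean return = {surv_mean:+.3f}  <- upward bias")
```

Because the delisting rule removes only the worst outcomes, the survivor-only mean is mechanically higher than the true full-population mean, which is exactly what a backtest on currently active stocks does.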
Course Roadmap: Four Parts, Thirteen Chapters
Chapter 1 Summary: Key Takeaways
Core Concepts:
- Statistics is the science of learning from data under uncertainty
- Variable types (numerical vs. categorical) determine applicable methods
- Association ≠ Causation — always check for confounders
Data Collection:
- Observational studies cannot establish causation
- Experiments require control, randomization, and replication
- Sampling bias (especially survivorship bias) invalidates results
Statistical Thinking:
- Embrace uncertainty — quantify it, don’t ignore it
- Data quality matters — Garbage In, Garbage Out
- Statistics is a way of thinking, not just a set of formulas