01: Introduction to Statistics

Why Statistics? The Engine Behind Modern Business Decisions

In the era of big data, statistics is the core methodology for extracting insights from data.

  • Every business decision involves uncertainty
  • Statistics provides a scientific framework to quantify and manage uncertainty
  • From risk management to marketing optimization, statistics is everywhere

Four Real-World Applications in Finance & Business

Statistics powers critical decisions across the financial industry:

| Application | Statistical Tool | Chapter |
|---|---|---|
| Portfolio Risk Management | Mean-Variance, Correlation | Ch. 2, 8 |
| Financial Quality Assessment | Descriptive Statistics, Testing | Ch. 2, 5 |
| Quantitative Factor Investing | Regression, Machine Learning | Ch. 8–13 |
| Macroeconomic Policy Analysis | Time Series, Inference | Ch. 5–7 |

Application 1: Portfolio Risk — Diversification Through Correlation

The Markowitz Mean-Variance framework uses correlation to build optimal portfolios.

Key insight: combining assets with low or negative correlation reduces portfolio risk.

\[ \large{ \sigma_p^2 = \sum_i \sum_j w_i w_j \sigma_{ij} } \]

  • \(w_i\): weight of asset \(i\) in the portfolio
  • \(\sigma_{ij}\): covariance between asset \(i\) and asset \(j\)
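The variance formula above can be sketched directly in code. This is a minimal illustration with a hypothetical two-asset case; the weights, volatilities, and correlation are made-up numbers, not data from the text.

```python
# Sketch of the portfolio-variance formula: sigma_p^2 = sum_i sum_j w_i w_j sigma_ij
import math

def portfolio_variance(weights, cov):
    """Double sum over the covariance matrix, weighted by portfolio weights."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))

# Hypothetical two-asset example: equal volatility, correlation rho = -0.5
sigma = 0.20                           # each asset's return std. dev.
rho = -0.5
cov = [[sigma**2, rho * sigma**2],
       [rho * sigma**2, sigma**2]]
w = [0.5, 0.5]                         # equal weights

var_p = portfolio_variance(w, cov)
print(math.sqrt(var_p))  # 0.1 — half the individual risk: diversification at work
```

With negatively correlated assets the off-diagonal terms subtract from the sum, which is exactly the diversification effect the key insight describes.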

Application 2: A/B Testing — Is the New Strategy Really Better?

A securities firm tests a new multi-factor stock selection model:

| Group | Sample Size | Win Rate |
|---|---|---|
| Treatment (New Model) | 10,000 | 25% |
| Control (Old Model) | 10,000 | 22% |

The core question: Is the 3 percentage point difference statistically significant, or just random noise?
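One standard way to answer this question (previewing the testing machinery of Ch. 5) is a two-proportion z-test. The sketch below uses only the standard library; the normal-CDF-via-`erf` trick stands in for a stats package.

```python
# Two-proportion z-test on the A/B numbers above (25% vs 22%, n = 10,000 each)
import math

def two_prop_z(p1, n1, p2, n2):
    """z statistic and two-sided p-value under the pooled-proportion null."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # standard normal CDF via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_prop_z(0.25, 10_000, 0.22, 10_000)
print(round(z, 2), p)  # z is about 5: far too large to be random noise
```

At these sample sizes a 3-point gap is roughly five standard errors wide, so the difference is statistically significant at any conventional level.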

The Law of Large Numbers: Why More Data = More Certainty

As sample size \(n\) increases, the sample mean converges to the true population mean:

\[ \large{ \bar{X}_n \xrightarrow{P} \mu \quad \text{as } n \to \infty } \]

  • With 10 coin flips, getting 70% heads is plausible
  • With 10,000 coin flips, getting 70% heads is virtually impossible
  • This is why A/B tests need large sample sizes
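The coin-flip claim is easy to check by simulation; a minimal sketch with the standard library:

```python
# Law of Large Numbers: the fraction of heads tightens around 0.5 as n grows
import random

random.seed(0)  # fixed seed so the run is reproducible

def heads_fraction(n):
    """Flip a fair coin n times and return the fraction of heads."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

for n in (10, 100, 10_000, 100_000):
    print(n, heads_fraction(n))
# Small n: big swings away from 0.5 are common.
# Large n: the fraction hugs 0.5, as the LLN guarantees.
```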

Data Fundamentals: The Structure of a Dataset

Every dataset is organized as a rectangular table (DataFrame):

  • Rows = Observations (individual units, e.g., companies, transactions)
  • Columns = Variables (attributes measured for each unit)
| Stock Code | Company Name | Industry | Market Cap (B) | ROE (%) |
|---|---|---|---|---|
| 002415 | Hikvision | Technology | 280.5 | 21.3 |
| 002142 | Bank of Ningbo | Banking | 185.2 | 16.8 |
| 600585 | Conch Cement | Materials | 142.7 | 15.2 |
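The rows-as-observations, columns-as-variables structure can be shown with plain Python records (using the three companies from the table above); a DataFrame library like pandas formalizes exactly this layout.

```python
# A tiny rectangular dataset: each dict is a row (observation),
# each key is a column (variable)
dataset = [
    {"code": "002415", "name": "Hikvision", "industry": "Technology",
     "market_cap_b": 280.5, "roe_pct": 21.3},
    {"code": "002142", "name": "Bank of Ningbo", "industry": "Banking",
     "market_cap_b": 185.2, "roe_pct": 16.8},
    {"code": "600585", "name": "Conch Cement", "industry": "Materials",
     "market_cap_b": 142.7, "roe_pct": 15.2},
]

row = dataset[0]                                # one observation (a company)
roe_column = [r["roe_pct"] for r in dataset]    # one variable across all rows

print(row["name"], roe_column)
```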

Variable Types: The Complete Taxonomy

[Figure: tree diagram of variable types. Variables split into Numerical (Discrete: e.g., number of employees, trades; Continuous: e.g., stock price, revenue, ROE) and Categorical (Nominal: e.g., industry, province, gender; Ordinal: e.g., credit rating, AAA > AA > A). Warning: numbers ≠ numerical variables! Zip codes, stock codes, and ID numbers are categorical.]
Figure 1: Variable type classification taxonomy

Numerical vs. Categorical: The Key Distinction

Numerical variables have meaningful arithmetic operations:

  • Discrete: countable values (number of employees, number of trades)
  • Continuous: any value in a range (stock price, revenue, ROE)

Categorical variables represent group membership:

  • Nominal: no natural ordering (industry sector, province)
  • Ordinal: meaningful ordering (credit rating: AAA > AA > A)

Critical trap: Just because data looks like a number doesn’t make it numerical!
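The trap made concrete: stock codes look numeric, but arithmetic on them is meaningless, so they belong in string columns (codes and ROE values from the earlier table).

```python
# Categorical vs. numerical: same-looking data, very different operations
codes = ["002415", "002142", "600585"]   # categorical: labels, not quantities
roe = [21.3, 16.8, 15.2]                 # numerical: the mean is meaningful

mean_roe = sum(roe) / len(roe)
print(round(mean_roe, 2))                # a sensible summary of ROE

# Converting a code with int("002415") would also silently drop the
# leading zeros — a second reason to keep identifiers as strings.
```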

Three Types of Variable Relationships

Real data from Chinese A-share companies reveals three fundamental patterns:

| Pattern | Example | Implication |
|---|---|---|
| Negative correlation | Stock price vs. trading volume | Higher-priced stocks tend to trade on lower volume |
| Positive correlation | Revenue vs. net profit | Revenue growth goes hand in hand with profitability |
| No association | Province vs. ROE | Location doesn't predict ROE |
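These patterns are measured with the correlation coefficient (treated fully in Ch. 2 and 8). A minimal Pearson implementation, applied to small illustrative revenue and profit figures (made-up numbers, not the A-share data):

```python
# Pearson correlation from its definition: covariance over the product of std. devs.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

revenue = [10, 20, 30, 40, 50]        # hypothetical, in billions
profit = [1.2, 2.1, 2.9, 4.2, 4.8]    # rises with revenue

print(round(pearson(revenue, profit), 3))  # close to +1: positive correlation
```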

Critical Warning: Association ≠ Causation

Observing a statistical association does NOT prove causation.

Classic examples of spurious associations:

  • Air conditioner sales ↑ and beer sales ↑ in summer → common cause: temperature
  • CEO education and firm performance correlate → confounders: family, industry

Three conditions required for causal inference:

  1. Temporal precedence — cause precedes effect
  2. Strong association — statistically significant relationship
  3. No confounders — all alternative explanations eliminated

Confounding Variables: The Hidden Threat

A confounding variable influences both the supposed cause and effect.

[Diagram: a confounder Z points to both variable X and variable Y; the resulting X–Y link is a spurious association.]
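The diagram can be simulated: below, Z drives both X and Y, X has no effect on Y at all, yet X and Y come out strongly correlated. The numbers are illustrative.

```python
# Spurious association from a confounder: X <- Z -> Y, no direct X -> Y link
import math
import random

random.seed(1)

n = 5_000
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]   # X depends only on Z
y = [zi + random.gauss(0, 0.5) for zi in z]   # Y depends only on Z

def pearson(a, b):
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

print(round(pearson(x, y), 2))  # strongly positive, despite no causal link
```

The theoretical correlation here is 0.8, entirely manufactured by the shared dependence on Z.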

Data Collection: Observational vs. Experimental Studies

| Feature | Observational Study | Experimental Study |
|---|---|---|
| Researcher control | None — observe as-is | Active manipulation |
| Causation | Cannot establish | Can establish |
| Cost | Lower | Higher |
| Types | Cross-sectional, retrospective, prospective | Randomized controlled |

Gold standard: Randomized Controlled Experiment with:

  • Control group — baseline comparison
  • Randomization — eliminates confounders
  • Replication — adequate sample size

Sampling: From Population to Sample

Key concepts:

  • Population (\(N\)): All units of interest (e.g., all 5,000+ A-share companies)
  • Sample (\(n\)): A subset selected for study (e.g., 100 randomly chosen companies)
  • Parameter (\(\mu\)): True population value (usually unknown)
  • Statistic (\(\bar{x}\)): Calculated from sample data (our best estimate)
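The parameter/statistic distinction in code: \(\mu\) is computed from every unit, \(\bar{x}\) from a random subset. A sketch with a synthetic "population" of ROE values (the distribution parameters are assumptions for illustration):

```python
# Population parameter mu vs. sample statistic x-bar
import random

random.seed(42)

# Synthetic population: 5,000 firms with ROE ~ Normal(12, 5) (illustrative)
population = [random.gauss(12.0, 5.0) for _ in range(5_000)]
mu = sum(population) / len(population)       # parameter: usually unknowable

sample = random.sample(population, 100)      # simple random sample, n = 100
x_bar = sum(sample) / len(sample)            # statistic: our best estimate

print(round(mu, 2), round(x_bar, 2))         # x_bar lands close to mu
```

In practice we never see `mu`; the whole point of inference (Part 3 of the course) is to say how far `x_bar` is likely to be from it.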

Why sample? Because studying the entire population is usually:

  • Too expensive
  • Too time-consuming
  • Sometimes physically impossible

Four Types of Sampling Bias

| Bias Type | Definition | Finance Example |
|---|---|---|
| Selection bias | Systematic exclusion of some units | Only sampling large-cap stocks |
| Voluntary response bias | Participants self-select | Online satisfaction surveys |
| Nonresponse bias | Non-respondents differ from respondents | High-net-worth clients ignore surveys |
| Survivorship bias | Only 'survivors' are observed | Backtesting on active stocks only |

Survivorship Bias: The Hidden Danger in Financial Analysis

Abraham Wald’s WWII insight: armor the areas with no bullet holes — those planes never came back.

Financial parallel:

  • Backtesting on currently active stocks only → ignoring bankrupt companies
  • Result: systematically inflated returns

Evidence from the Chinese A-share market:

  • Total historical listings: ~5,000+
  • Currently active: ~4,500
  • Delisted: ~200+
  • Ignoring delisted companies = upward bias
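The inflation is easy to reproduce in a toy backtest. All return figures below are made-up assumptions (survivors averaging +8%, delisted names averaging -40%), chosen only to show the direction of the bias:

```python
# Survivorship bias: averaging only the survivors inflates the backtest
import random

random.seed(7)

# Hypothetical annual returns for a 1,000-stock universe
survivors = [random.gauss(0.08, 0.20) for _ in range(950)]   # still listed
delisted = [random.gauss(-0.40, 0.20) for _ in range(50)]    # dropped out

biased = sum(survivors) / len(survivors)                     # survivors only
unbiased = sum(survivors + delisted) / 1_000                 # full universe

print(round(biased, 3), round(unbiased, 3))  # biased mean is clearly higher
```

Even with only 5% of the universe delisted, the survivor-only average overstates the true mean return by a couple of percentage points, exactly the upward bias the text warns about.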

The ‘Dirty Work’: Data Quality is Non-Negotiable

Real-world data is messy. Before any analysis, you must address:

  1. Missing values — companies that don’t report certain metrics
  2. Outliers — extreme values that distort statistical summaries
  3. Data inconsistencies — incorrect entries, duplicate records
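The three cleanup steps can be sketched on a small messy column. The raw values and the 3-times-the-median outlier rule are both illustrative assumptions, not a recommended production recipe:

```python
# Minimal data cleaning: missing values, outliers, duplicates
roe = [21.3, None, 16.8, 15.2, 999.0, 16.8, 16.8]  # hypothetical raw column

# 1. Drop missing values
clean = [v for v in roe if v is not None]

# 2. Flag outliers: here, anything more than 3x the median away (crude rule)
median = sorted(clean)[len(clean) // 2]
clean = [v for v in clean if abs(v - median) <= 3 * median]

# 3. Remove duplicate records (order-preserving)
seen, deduped = set(), []
for v in clean:
    if v not in seen:
        seen.add(v)
        deduped.append(v)

print(deduped)  # the 999.0 entry and the repeated 16.8s are gone
```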

The golden rule:

Garbage In, Garbage Out — No statistical method compensates for bad data.

Course Roadmap: Four Parts, Thirteen Chapters

[Figure: vertical flowchart of the course roadmap. Part 1: Descriptive Statistics (Ch. 2: center, spread, shape) answers "What does the data look like?"; Part 2: Probability Foundations (Ch. 3–4: rules, distributions, CLT) answers "How does randomness work?"; Part 3: Inferential Statistics (Ch. 5–9: CI, testing, ANOVA, regression) answers "What can we conclude?"; Part 4: Advanced Methods (Ch. 10–13: MLR, GLM, trees, clustering) answers "How do we predict?"]
Figure 2: Course structure: from describing data to predicting outcomes

Chapter 1 Summary: Key Takeaways

Core Concepts:

  • Statistics is the science of learning from data under uncertainty
  • Variable types (numerical vs. categorical) determine applicable methods
  • Association ≠ Causation — always check for confounders

Data Collection:

  • Observational studies cannot establish causation
  • Experiments require control, randomization, and replication
  • Sampling bias (especially survivorship bias) invalidates results

Statistical Thinking:

  • Embrace uncertainty — quantify it, don’t ignore it
  • Data quality matters — Garbage In, Garbage Out
  • Statistics is a way of thinking, not just a set of formulas