01: Introduction to Statistics

Why Statistics? The Engine Behind Modern Business Decisions

In the era of big data, statistics is the core methodology for extracting insights from data.

  • Every business decision involves uncertainty
  • Statistics provides a scientific framework to quantify and manage uncertainty
  • From risk management to marketing optimization, statistics is everywhere

Four Real-World Applications in Finance & Business

Statistics powers critical decisions across the financial industry:

| Application | Statistical Tool | Chapter |
|---|---|---|
| Portfolio Risk Management | Mean-Variance, Correlation | Ch. 2, 8 |
| Financial Quality Assessment | Descriptive Statistics, Testing | Ch. 2, 5 |
| Quantitative Factor Investing | Regression, Machine Learning | Ch. 8–13 |
| Macroeconomic Policy Analysis | Time Series, Inference | Ch. 5–7 |

Application 1: Portfolio Risk — Diversification Through Correlation

The Markowitz Mean-Variance framework uses correlation to build optimal portfolios.

Key insight: combining assets with low or negative correlation reduces portfolio risk.

\[ \large{ \sigma_p^2 = \sum_i \sum_j w_i w_j \sigma_{ij} } \]

  • \(w_i\): weight of asset \(i\) in the portfolio
  • \(\sigma_{ij}\): covariance between asset \(i\) and asset \(j\)
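The variance formula above can be sketched directly in code. This is a minimal illustration with a hypothetical two-asset case; the weights, volatilities, and correlation are made-up numbers, not data from the text.

```python
# Sketch of the portfolio-variance formula: sigma_p^2 = sum_i sum_j w_i w_j sigma_ij
import math

def portfolio_variance(weights, cov):
    """Double sum over the covariance matrix, weighted by portfolio weights."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))

# Hypothetical two-asset example: equal volatility, correlation rho = -0.5
sigma = 0.20                           # each asset's return std. dev.
rho = -0.5
cov = [[sigma**2, rho * sigma**2],
       [rho * sigma**2, sigma**2]]
w = [0.5, 0.5]                         # equal weights

var_p = portfolio_variance(w, cov)
print(math.sqrt(var_p))  # 0.1 — half the individual risk: diversification at work
```

With negatively correlated assets the off-diagonal terms subtract from the sum, which is exactly the diversification effect the key insight describes.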

Application 2: A/B Testing — Is the New Strategy Really Better?

A securities firm tests a new multi-factor stock selection model:

| Group | Sample Size | Win Rate |
|---|---|---|
| Treatment (New Model) | 10,000 | 25% |
| Control (Old Model) | 10,000 | 22% |

The core question: Is the 3 percentage point difference statistically significant, or just random noise?
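One standard way to answer this question (previewing the testing machinery of Ch. 5) is a two-proportion z-test. The sketch below uses only the standard library; the normal-CDF-via-`erf` trick stands in for a stats package.

```python
# Two-proportion z-test on the A/B numbers above (25% vs 22%, n = 10,000 each)
import math

def two_prop_z(p1, n1, p2, n2):
    """z statistic and two-sided p-value under the pooled-proportion null."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # standard normal CDF via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_prop_z(0.25, 10_000, 0.22, 10_000)
print(round(z, 2), p)  # z is about 5: far too large to be random noise
```

At these sample sizes a 3-point gap is roughly five standard errors wide, so the difference is statistically significant at any conventional level.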

The Law of Large Numbers: Why More Data = More Certainty

As sample size \(n\) increases, the sample mean converges to the true population mean:

\[ \large{ \bar{X}_n \xrightarrow{P} \mu \quad \text{as } n \to \infty } \]

  • With 10 coin flips, getting 70% heads is plausible
  • With 10,000 coin flips, getting 70% heads is virtually impossible
  • This is why A/B tests need large sample sizes
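The coin-flip claim is easy to check by simulation; a minimal sketch with the standard library:

```python
# Law of Large Numbers: the fraction of heads tightens around 0.5 as n grows
import random

random.seed(0)  # fixed seed so the run is reproducible

def heads_fraction(n):
    """Flip a fair coin n times and return the fraction of heads."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

for n in (10, 100, 10_000, 100_000):
    print(n, heads_fraction(n))
# Small n: big swings away from 0.5 are common.
# Large n: the fraction hugs 0.5, as the LLN guarantees.
```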

Data Fundamentals: The Structure of a Dataset

Every dataset is organized as a rectangular table (DataFrame):

  • Rows = Observations (individual units, e.g., companies, transactions)
  • Columns = Variables (attributes measured for each unit)
| Stock Code | Company Name | Industry | Market Cap (B) | ROE (%) |
|---|---|---|---|---|
| 002415 | Hikvision | Technology | 280.5 | 21.3 |
| 002142 | Bank of Ningbo | Banking | 185.2 | 16.8 |
| 600585 | Conch Cement | Materials | 142.7 | 15.2 |
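The rows-as-observations, columns-as-variables structure can be shown with plain Python records (using the three companies from the table above); a DataFrame library like pandas formalizes exactly this layout.

```python
# A tiny rectangular dataset: each dict is a row (observation),
# each key is a column (variable)
dataset = [
    {"code": "002415", "name": "Hikvision", "industry": "Technology",
     "market_cap_b": 280.5, "roe_pct": 21.3},
    {"code": "002142", "name": "Bank of Ningbo", "industry": "Banking",
     "market_cap_b": 185.2, "roe_pct": 16.8},
    {"code": "600585", "name": "Conch Cement", "industry": "Materials",
     "market_cap_b": 142.7, "roe_pct": 15.2},
]

row = dataset[0]                                # one observation (a company)
roe_column = [r["roe_pct"] for r in dataset]    # one variable across all rows

print(row["name"], roe_column)
```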

Variable Types: The Complete Taxonomy

[Figure: tree diagram of variable types. Variables split into Numerical (Discrete: e.g., number of employees, trades; Continuous: e.g., stock price, revenue, ROE) and Categorical (Nominal: e.g., industry, province, gender; Ordinal: e.g., credit rating, AAA > AA > A). Warning: numbers ≠ numerical variables! Zip codes, stock codes, and ID numbers are categorical.]
Figure 1: Variable type classification taxonomy

Numerical vs. Categorical: The Key Distinction

Numerical variables have meaningful arithmetic operations:

  • Discrete: countable values (number of employees, number of trades)
  • Continuous: any value in a range (stock price, revenue, ROE)

Categorical variables represent group membership:

  • Nominal: no natural ordering (industry sector, province)
  • Ordinal: meaningful ordering (credit rating: AAA > AA > A)

Critical trap: Just because data looks like a number doesn’t make it numerical!
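The trap made concrete: stock codes look numeric, but arithmetic on them is meaningless, so they belong in string columns (codes and ROE values from the earlier table).

```python
# Categorical vs. numerical: same-looking data, very different operations
codes = ["002415", "002142", "600585"]   # categorical: labels, not quantities
roe = [21.3, 16.8, 15.2]                 # numerical: the mean is meaningful

mean_roe = sum(roe) / len(roe)
print(round(mean_roe, 2))                # a sensible summary of ROE

# Converting a code with int("002415") would also silently drop the
# leading zeros — a second reason to keep identifiers as strings.
```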

Three Types of Variable Relationships

Real data from Chinese A-share companies reveals three fundamental patterns:

| Pattern | Example | Implication |
|---|---|---|
| Negative correlation | Stock price vs. trading volume | Higher-priced stocks tend to trade on lower volume |
| Positive correlation | Revenue vs. net profit | Revenue growth goes hand in hand with profitability |
| No association | Province vs. ROE | Location doesn't predict ROE |
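These patterns are measured with the correlation coefficient (treated fully in Ch. 2 and 8). A minimal Pearson implementation, applied to small illustrative revenue and profit figures (made-up numbers, not the A-share data):

```python
# Pearson correlation from its definition: covariance over the product of std. devs.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

revenue = [10, 20, 30, 40, 50]        # hypothetical, in billions
profit = [1.2, 2.1, 2.9, 4.2, 4.8]    # rises with revenue

print(round(pearson(revenue, profit), 3))  # close to +1: positive correlation
```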

Critical Warning: Association ≠ Causation

Observing a statistical association does NOT prove causation.

Classic examples of spurious associations:

  • Air conditioner sales ↑ and beer sales ↑ in summer → common cause: temperature
  • CEO education and firm performance correlate → confounders: family, industry

Three conditions required for causal inference:

  1. Temporal precedence — cause precedes effect
  2. Strong association — statistically significant relationship
  3. No confounders — all alternative explanations eliminated

Confounding Variables: The Hidden Threat

A confounding variable influences both the supposed cause and effect.

[Diagram: a confounder Z points to both variable X and variable Y; the resulting X–Y link is a spurious association.]
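The diagram can be simulated: below, Z drives both X and Y, X has no effect on Y at all, yet X and Y come out strongly correlated. The numbers are illustrative.

```python
# Spurious association from a confounder: X <- Z -> Y, no direct X -> Y link
import math
import random

random.seed(1)

n = 5_000
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]   # X depends only on Z
y = [zi + random.gauss(0, 0.5) for zi in z]   # Y depends only on Z

def pearson(a, b):
    m = len(a)
    ma, mb = sum(a) / m, sum(b) / m
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

print(round(pearson(x, y), 2))  # strongly positive, despite no causal link
```

The theoretical correlation here is 0.8, entirely manufactured by the shared dependence on Z.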

Data Collection: Observational vs. Experimental Studies

| Feature | Observational Study | Experimental Study |
|---|---|---|
| Researcher control | None — observe as-is | Active manipulation |
| Causation | Cannot establish | Can establish |
| Cost | Lower | Higher |
| Types | Cross-sectional, retrospective, prospective | Randomized controlled |

Gold standard: Randomized Controlled Experiment with:

  • Control group — baseline comparison
  • Randomization — eliminates confounders
  • Replication — adequate sample size

Sampling: From Population to Sample

Key concepts:

  • Population (\(N\)): All units of interest (e.g., all 5,000+ A-share companies)
  • Sample (\(n\)): A subset selected for study (e.g., 100 randomly chosen companies)
  • Parameter (\(\mu\)): True population value (usually unknown)
  • Statistic (\(\bar{x}\)): Calculated from sample data (our best estimate)
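The parameter/statistic distinction in code: \(\mu\) is computed from every unit, \(\bar{x}\) from a random subset. A sketch with a synthetic "population" of ROE values (the distribution parameters are assumptions for illustration):

```python
# Population parameter mu vs. sample statistic x-bar
import random

random.seed(42)

# Synthetic population: 5,000 firms with ROE ~ Normal(12, 5) (illustrative)
population = [random.gauss(12.0, 5.0) for _ in range(5_000)]
mu = sum(population) / len(population)       # parameter: usually unknowable

sample = random.sample(population, 100)      # simple random sample, n = 100
x_bar = sum(sample) / len(sample)            # statistic: our best estimate

print(round(mu, 2), round(x_bar, 2))         # x_bar lands close to mu
```

In practice we never see `mu`; the whole point of inference (Part 3 of the course) is to say how far `x_bar` is likely to be from it.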

Why sample? Because studying the entire population is usually:

  • Too expensive
  • Too time-consuming
  • Sometimes physically impossible

Four Types of Sampling Bias

| Bias Type | Definition | Finance Example |
|---|---|---|
| Selection bias | Systematic exclusion of some units | Only sampling large-cap stocks |
| Voluntary response bias | Participants self-select | Online satisfaction surveys |
| Nonresponse bias | Non-respondents differ from respondents | High-net-worth clients ignore surveys |
| Survivorship bias | Only 'survivors' are observed | Backtesting on active stocks only |

Survivorship Bias: The Hidden Danger in Financial Analysis

Abraham Wald’s WWII insight: armor the areas with no bullet holes — those planes never came back.

Financial parallel:

  • Backtesting on currently active stocks only → ignoring bankrupt companies
  • Result: systematically inflated returns

Evidence from the Chinese A-share market:

  • Total historical listings: ~5,000+
  • Currently active: ~4,500
  • Delisted: ~200+
  • Ignoring delisted companies = upward bias
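The inflation is easy to reproduce in a toy backtest. All return figures below are made-up assumptions (survivors averaging +8%, delisted names averaging -40%), chosen only to show the direction of the bias:

```python
# Survivorship bias: averaging only the survivors inflates the backtest
import random

random.seed(7)

# Hypothetical annual returns for a 1,000-stock universe
survivors = [random.gauss(0.08, 0.20) for _ in range(950)]   # still listed
delisted = [random.gauss(-0.40, 0.20) for _ in range(50)]    # dropped out

biased = sum(survivors) / len(survivors)                     # survivors only
unbiased = sum(survivors + delisted) / 1_000                 # full universe

print(round(biased, 3), round(unbiased, 3))  # biased mean is clearly higher
```

Even with only 5% of the universe delisted, the survivor-only average overstates the true mean return by a couple of percentage points, exactly the upward bias the text warns about.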

The ‘Dirty Work’: Data Quality is Non-Negotiable

Real-world data is messy. Before any analysis, you must address:

  1. Missing values — companies that don’t report certain metrics
  2. Outliers — extreme values that distort statistical summaries
  3. Data inconsistencies — incorrect entries, duplicate records
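The three cleanup steps can be sketched on a small messy column. The raw values and the 3-times-the-median outlier rule are both illustrative assumptions, not a recommended production recipe:

```python
# Minimal data cleaning: missing values, outliers, duplicates
roe = [21.3, None, 16.8, 15.2, 999.0, 16.8, 16.8]  # hypothetical raw column

# 1. Drop missing values
clean = [v for v in roe if v is not None]

# 2. Flag outliers: here, anything more than 3x the median away (crude rule)
median = sorted(clean)[len(clean) // 2]
clean = [v for v in clean if abs(v - median) <= 3 * median]

# 3. Remove duplicate records (order-preserving)
seen, deduped = set(), []
for v in clean:
    if v not in seen:
        seen.add(v)
        deduped.append(v)

print(deduped)  # the 999.0 entry and the repeated 16.8s are gone
```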

The golden rule:

Garbage In, Garbage Out — No statistical method compensates for bad data.

Course Roadmap: Four Parts, Thirteen Chapters

[Figure: vertical flowchart of the course roadmap. Part 1: Descriptive Statistics (Ch. 2: center, spread, shape) answers "What does the data look like?"; Part 2: Probability Foundations (Ch. 3–4: rules, distributions, CLT) answers "How does randomness work?"; Part 3: Inferential Statistics (Ch. 5–9: CI, testing, ANOVA, regression) answers "What can we conclude?"; Part 4: Advanced Methods (Ch. 10–13: MLR, GLM, trees, clustering) answers "How do we predict?"]
Figure 2: Course structure: from describing data to predicting outcomes

Chapter 1 Summary: Key Takeaways

Core Concepts:

  • Statistics is the science of learning from data under uncertainty
  • Variable types (numerical vs. categorical) determine applicable methods
  • Association ≠ Causation — always check for confounders

Data Collection:

  • Observational studies cannot establish causation
  • Experiments require control, randomization, and replication
  • Sampling bias (especially survivorship bias) invalidates results

Statistical Thinking:

  • Embrace uncertainty — quantify it, don’t ignore it
  • Data quality matters — Garbage In, Garbage Out
  • Statistics is a way of thinking, not just a set of formulas