04: Probability Distributions

Why Probability Distributions Matter

A probability distribution is a mathematical model that describes how outcomes are spread across possible values.

| What It Does | Business Application |
|---|---|
| Models frequency of events | Customer arrivals per hour (Poisson) |
| Quantifies risk | Stock returns (Normal / t-distribution) |
| Enables prediction | Default probability (Binomial) |
| Justifies inference | Sample mean behavior (CLT → Normal) |

Key idea: Once you identify the right distribution, you unlock a full toolkit of probabilities, expectations, and confidence intervals.

Random Variables: From Outcomes to Numbers

A random variable \(X\) maps sample space outcomes to real numbers.

| Type | Example | Values |
|---|---|---|
| Discrete | Number of defaults in portfolio | 0, 1, 2, … |
| Continuous | Daily stock return | Any real number |

Notation:

  • \(P(X = x)\) — probability mass (discrete)
  • \(f(x)\) — probability density (continuous)
  • \(F(x) = P(X \leq x)\) — cumulative distribution function (CDF)

Key distinction: For continuous \(X\), \(P(X = x) = 0\) for any single point. We only talk about intervals: \(P(a < X < b) = \int_a^b f(x)\,dx\).
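A minimal Python sketch makes the interval idea concrete: \(P(a < X < b)\) is the integral of the density over \((a, b)\). The Uniform(0, 10) density here is a hypothetical example.

```python
# P(a < X < b) = integral of the density f over (a, b).
# Hypothetical example: X ~ Uniform(0, 10), so f(x) = 0.1 on [0, 10].
def f(x):
    return 0.1 if 0 <= x <= 10 else 0.0

# Midpoint-rule numerical integration over (2, 5):
a, b, steps = 2.0, 5.0, 10_000
dx = (b - a) / steps
prob = sum(f(a + (i + 0.5) * dx) for i in range(steps)) * dx  # ≈ 0.3
```

Note that \(P(X = 2)\) itself is 0; only intervals carry probability.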

Discrete Distributions: The Probability Mass Function (PMF)

For a discrete random variable \(X\):

\[ \large{ P(X = x_i) = p_i, \quad \sum_i p_i = 1 } \]

Expected Value:

\[ \large{ E[X] = \sum_i x_i \cdot p_i } \]

Variance:

\[ \large{ \text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2 } \]
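In code, these definitions are one-liners; a sketch using a hypothetical four-point PMF:

```python
# Expected value and variance from a PMF (hypothetical four-point example).
pmf = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.3}  # values -> probabilities; sums to 1

mean = sum(x * p for x, p in pmf.items())              # E[X] ≈ 2.9
second_moment = sum(x**2 * p for x, p in pmf.items())  # E[X^2]
variance = second_moment - mean**2                     # E[X^2] - (E[X])^2 ≈ 0.89
```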

The Binomial Distribution: Counting Successes

Setting: \(n\) independent trials, each with success probability \(p\).

\[ \large{ P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}, \quad k = 0, 1, \ldots, n } \]

Moments:

  • \(E[X] = np\)
  • \(\text{Var}(X) = np(1-p)\)

Financial Example: Portfolio of 100 bonds, each with 5% default probability:

  • Expected defaults: \(E[X] = 100 \times 0.05 = 5\)
  • Std dev: \(\sqrt{100 \times 0.05 \times 0.95} \approx 2.18\)
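A sketch of the bond example in Python, using the PMF formula above (`math.comb` is the standard-library binomial coefficient):

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.05                 # 100 bonds, 5% default probability each
mean = n * p                     # expected defaults: 5
sd = math.sqrt(n * p * (1 - p))  # standard deviation: ≈ 2.18
prob_at_most_2 = sum(binom_pmf(k, n, p) for k in range(3))  # P(X <= 2)
```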

Case: Converting Website Visitors to Customers

Setup: 1,000 visitors/day, conversion rate = 3%.

  • Number of conversions \(X \sim \text{Binomial}(n=1000, p=0.03)\)
  • \(E[X] = 30\) conversions
  • \(\text{SD}(X) = \sqrt{1000 \times 0.03 \times 0.97} \approx 5.39\)

Business question: What is \(P(X < 20)\)?

Using the Normal approximation:

\[ Z = \frac{20 - 30}{5.39} \approx -1.86 \implies P(X < 20) \approx 3.1\% \]

Fewer than 20 conversions is a rare event — investigate if it occurs!
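The approximation is easy to reproduce; the standard library exposes the error function, from which the standard normal CDF follows. A sketch with the case's numbers:

```python
import math

def normal_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu = 1000 * 0.03                       # 30 expected conversions
sigma = math.sqrt(1000 * 0.03 * 0.97)  # ≈ 5.39
z = (20 - mu) / sigma                  # standardize 20 conversions
prob = normal_cdf(z)                   # ≈ 0.03, i.e. about 3%
```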

The Poisson Distribution: Counting Rare Events

Setting: Events occur randomly at an average rate \(\lambda\) per time interval.

\[ \large{ P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \quad k = 0, 1, 2, \ldots } \]

Remarkable property: \(E[X] = \text{Var}(X) = \lambda\)

Financial Applications:

| Event | \(\lambda\) (per period) |
|---|---|
| Customer complaints per day | 4.2 |
| Trading system failures per month | 1.5 |
| Credit defaults per quarter | 2.8 |
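A quick numerical check of the remarkable property, using the complaints rate above (a sketch; the sums are truncated at \(k = 100\), where the Poisson tail is negligible):

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 4.2        # customer complaints per day
ks = range(100)  # tail beyond k = 100 is negligible for lam = 4.2
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum(k**2 * poisson_pmf(k, lam) for k in ks) - mean**2
# Both come out equal to lam, as claimed.
```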

Poisson as the Limit of Binomial

When \(n\) is large, \(p\) is small, and \(\lambda = np\) is moderate:

\[ \large{ \binom{n}{k}p^k(1-p)^{n-k} \xrightarrow{n\to\infty} \frac{e^{-\lambda}\lambda^k}{k!} } \]

The derivation sketch:

  1. \(\binom{n}{k} \approx \frac{n^k}{k!}\) for large \(n\)
  2. \(p^k = \left(\frac{\lambda}{n}\right)^k\)
  3. \((1-p)^{n-k} \approx e^{-\lambda}\)

Rule of thumb: Use Poisson when \(n > 100\) and \(p < 0.01\).
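The limit can also be seen numerically; a sketch comparing the two PMFs for a hypothetical large-\(n\), small-\(p\) setting with \(\lambda = np = 5\):

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

# Hypothetical setting: n large, p small, lambda = np = 5.
n, p = 1000, 0.005
lam = n * p
max_gap = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
              for k in range(30))
# The two PMFs agree to within a few thousandths at every k.
```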

Continuous Distributions: The Normal Distribution

\[ \large{ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) } \]

The 68-95-99.7 Rule:

| Range | Probability |
|---|---|
| \(\mu \pm 1\sigma\) | 68.3% |
| \(\mu \pm 2\sigma\) | 95.4% |
| \(\mu \pm 3\sigma\) | 99.7% |

Standardization: \(Z = \frac{X - \mu}{\sigma} \sim N(0,1)\)
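The rule follows directly from the standard normal CDF; a sketch:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# After standardizing, P(mu - k*sigma < X < mu + k*sigma) depends only on k:
within = {k: normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)}
# within[1] ≈ 0.683, within[2] ≈ 0.954, within[3] ≈ 0.997
```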

Why the Normal Distribution Dominates Statistics

Four reasons:

  1. The Central Limit Theorem (later in this chapter) — sample means are approximately normal
  2. Mathematical convenience — closed-form likelihood, conjugate prior
  3. Maximum entropy — among all distributions with given mean and variance, the normal has maximum entropy (least assumptions)
  4. Historical momentum — Gauss used it; it became the default

But beware: Stock returns are NOT truly normal (fat tails, skewness). Using normal models for risk management led to massive underestimation of tail risk in 2008.

The Exponential Distribution: Time Between Events

\[ \large{ f(x) = \lambda e^{-\lambda x}, \quad x \geq 0 } \]

Moments: \(E[X] = \frac{1}{\lambda}\), \(\text{Var}(X) = \frac{1}{\lambda^2}\)

Memoryless Property:

\[ \large{ P(X > s + t \mid X > s) = P(X > t) } \]

“Given you’ve already waited \(s\) minutes, the probability of waiting at least \(t\) more is the same as if you just started.”

Application: Time between customer arrivals, system failures, or transactions.
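The memoryless property can be verified from the survival function \(P(X > x) = e^{-\lambda x}\); a sketch with a hypothetical arrival rate:

```python
import math

lam = 0.5  # hypothetical rate: 0.5 arrivals per minute

def survival(x):
    """P(X > x) for X ~ Exponential(lam)."""
    return math.exp(-lam * x)

s, t = 3.0, 2.0
conditional = survival(s + t) / survival(s)  # P(X > s + t | X > s)
unconditional = survival(t)                  # P(X > t)
# The two agree: the exponential "forgets" the s minutes already waited.
```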

The Central Limit Theorem: The Most Important Theorem in Statistics

Statement: If \(X_1, X_2, \ldots, X_n\) are i.i.d. with mean \(\mu\) and variance \(\sigma^2\), then as \(n \to \infty\):

\[ \large{ \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0,1) } \]

In words: The sample mean is approximately normal for large \(n\), regardless of the original distribution.

This is revolutionary: You don’t need to know the population distribution to do inference!

CLT: Visual Intuition

[Figure: CLT demonstration. Three rows show sample means from a Uniform, an Exponential, and a Bimodal source distribution at n = 5 and n = 30; all converge to the bell curve as n increases.]
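The same convergence can be checked by simulation; a sketch drawing sample means from a heavily skewed exponential source (seed and sizes are arbitrary choices):

```python
import math
import random

random.seed(42)  # arbitrary seed, for reproducibility

# Sample means from a heavily skewed source: Exponential with mean 1.
n, reps = 30, 20_000
means = [sum(random.expovariate(1.0) for _ in range(n)) / n
         for _ in range(reps)]

grand_mean = sum(means) / reps
se = math.sqrt(sum((m - grand_mean) ** 2 for m in means) / reps)
# CLT predictions: grand_mean ≈ mu = 1, se ≈ sigma/sqrt(n) = 1/sqrt(30) ≈ 0.183
```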

How Large Must \(n\) Be? Practical Guidelines

| Source Population Shape | Minimum \(n\) for CLT |
|---|---|
| Symmetric (e.g., uniform) | ≥ 15 |
| Moderately skewed | ≥ 30 |
| Heavily skewed / outliers | ≥ 50–100 |

Financial data warning: Stock returns have fat tails, so:

  • Daily returns → \(n \geq 50\) recommended
  • Monthly returns → \(n \geq 30\) usually sufficient
  • For extreme quantiles (VaR) → CLT is inadequate; use Bootstrap

‘Dirty Work’: Mediocristan vs. Extremistan

Nassim Taleb’s classification of random phenomena:

| Characteristic | Mediocristan | Extremistan |
|---|---|---|
| Tail behavior | Thin (exponential decay) | Fat (power law) |
| Extreme events | Negligible impact | Dominate the total |
| CLT applies? | Yes | No (or slowly) |
| Example | Height, weight, IQ | Wealth, city size, book sales |
| Financial analog | Interest on savings | Venture capital returns |

The 80/20 Rule: In Extremistan, 20% of causes produce 80% of effects. The top 1% of stocks drive a disproportionate share of index returns.

Sampling Distribution of the Sample Mean

If \(X_1, \ldots, X_n \overset{iid}{\sim} (\mu, \sigma^2)\), then the sample mean:

\[ \large{ \bar{X} \sim \left(\mu, \frac{\sigma^2}{n}\right) } \]

Key implications:

  1. Unbiased: \(E[\bar{X}] = \mu\) — the sample mean targets the population mean
  2. Precision increases with \(n\): \(\text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}\)
  3. To halve the standard error, you need 4× the sample size
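A tiny sketch of implications 2 and 3 (the population \(\sigma\) is a hypothetical value):

```python
import math

sigma = 12.0  # hypothetical population standard deviation

def standard_error(n):
    """SE of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

se_100 = standard_error(100)  # 1.2
se_400 = standard_error(400)  # 0.6: quadrupling n halves the SE
```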

The Chi-Square Distribution

If \(Z_1, \ldots, Z_k \overset{iid}{\sim} N(0,1)\):

\[ \large{ \chi^2_k = Z_1^2 + Z_2^2 + \cdots + Z_k^2 } \]

Key properties:

  • \(E[\chi^2_k] = k\), \(\text{Var}(\chi^2_k) = 2k\)
  • Right-skewed, but approaches normality as \(k\) increases
  • Application: testing variance, goodness-of-fit tests

Connection to variance: for a sample of \(n\) observations from a normal population,

\[ \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1} \]
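The definition and moments can be checked by simulation (a sketch; seed and replication count are arbitrary):

```python
import random

random.seed(7)  # arbitrary seed

k, reps = 5, 50_000
# Each draw is a sum of k squared standard normals, i.e. one chi^2_k sample.
draws = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))
         for _ in range(reps)]

mean = sum(draws) / reps                          # should be ≈ k = 5
var = sum((d - mean) ** 2 for d in draws) / reps  # should be ≈ 2k = 10
```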

The t-Distribution: For Small Samples with Unknown Variance

\[ \large{ t_\nu = \frac{Z}{\sqrt{\chi^2_\nu / \nu}}, \quad Z \sim N(0,1), \quad \chi^2_\nu \text{ independent} } \]

Compared to the Normal:

| Feature | Normal | t-distribution |
|---|---|---|
| Tails | Thin | Heavier |
| Shape parameter | None | \(\nu\) (degrees of freedom) |
| As \(\nu \to \infty\) | | Converges to Normal |
| Use when | \(\sigma\) known or \(n\) large | \(\sigma\) unknown and \(n\) small |

Rule of thumb: Use \(t\) when \(n < 30\) and population variance is unknown (most real situations).

The F-Distribution: Comparing Two Variances

\[ \large{ F_{\nu_1,\nu_2} = \frac{\chi^2_{\nu_1}/\nu_1}{\chi^2_{\nu_2}/\nu_2} } \]

Key application: Testing whether two normal populations have equal variance. With \(\nu_1 = n_1 - 1\) and \(\nu_2 = n_2 - 1\), under \(H_0: \sigma_1^2 = \sigma_2^2\):

\[ F = \frac{s_1^2}{s_2^2} \sim F_{\nu_1, \nu_2} \]

The F-distribution also appears in:

  • ANOVA F-test (Chapter 9)
  • Regression overall significance test (Chapter 8)
  • Any hypothesis testing that compares variance ratios

The St. Petersburg Paradox: When Expected Value Breaks Down

The game: Flip a coin until heads. If heads appears on flip \(n\), you win \(2^n\) dollars.

Expected payoff:

\[ E[X] = \sum_{n=1}^{\infty}\frac{1}{2^n}\cdot 2^n = \sum_{n=1}^{\infty}1 = \infty \]

Yet nobody would pay more than ~$20 to play!

Resolution (Daniel Bernoulli, 1738): Use logarithmic utility \(U(x) = \ln(x)\):

\[ E[U(X)] = \sum_{n=1}^{\infty}\frac{1}{2^n}\cdot \ln(2^n) = \ln 2 \sum_{n=1}^{\infty}\frac{n}{2^n} = 2\ln 2 \approx 1.39 \]

The certainty equivalent is \(e^{2\ln 2} = \$4\): a finite, modest fair price, which resolves the paradox.
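Both computations are easy to reproduce; a sketch that truncates the infinite sums where the tails vanish:

```python
import math

# The expected payoff grows by 1 per term, so truncating at N terms gives N:
expected_payoff_40 = sum((1 / 2**n) * 2**n for n in range(1, 41))  # = 40.0

# Bernoulli's expected log-utility converges:
expected_utility = sum((1 / 2**n) * math.log(2**n) for n in range(1, 200))
# expected_utility ≈ 2 ln 2 ≈ 1.386
certainty_equivalent = math.exp(expected_utility)  # ≈ 4 dollars
```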

Benford’s Law: A Tool for Fraud Detection

Claim: In many natural datasets, the leading digit \(d\) follows:

\[ \large{ P(\text{first digit} = d) = \log_{10}\left(1 + \frac{1}{d}\right) } \]

| Digit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Prob | 30.1% | 17.6% | 12.5% | 9.7% | 7.9% | 6.7% | 5.8% | 5.1% | 4.6% |

Application: If financial statements deviate significantly from Benford’s Law, it may indicate data fabrication. Auditors use this as a screening tool.
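A sketch of the formula, plus an empirical check on a classic Benford-conforming dataset, the leading digits of powers of 2:

```python
import math

def benford(d):
    """Benford probability that the leading digit is d (d = 1, ..., 9)."""
    return math.log10(1 + 1 / d)

# The nine probabilities telescope and sum to exactly 1:
total = sum(benford(d) for d in range(1, 10))

# Empirical check: leading digits of 2^1, ..., 2^1000.
digits = [int(str(2**n)[0]) for n in range(1, 1001)]
freq_1 = digits.count(1) / 1000  # ≈ 0.301, matching benford(1)
```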

Chapter 4 Summary

Discrete Distributions:

  • Binomial (fixed trials, success counting) and Poisson (rare events, no fixed \(n\))

Continuous Distributions:

  • Normal (ubiquitous, CLT foundation), Exponential (waiting times)

The Central Limit Theorem:

  • Sample means → Normal, regardless of source — the foundation of all inference

Sampling Distributions:

  • \(\chi^2\) (variance testing), \(t\) (mean testing, small \(n\)), \(F\) (variance comparison, ANOVA)

Key Warning: CLT may not apply in Extremistan — fat-tailed data requires special methods.