Before building models, understand your data.
Descriptive statistics answer three fundamental questions:
| Question | Measure | Key Metrics |
|---|---|---|
| Where is the center? | Central tendency | Mean, Median, Mode |
| How spread out? | Dispersion | Variance, SD, IQR, CV |
| What shape? | Distribution shape | Skewness, Kurtosis |
Using financial statement data from A-share companies:
For each metric, we compute center, spread, and shape to build a statistical portrait of the company.
This is the foundation of fundamental analysis in finance.
The sample mean is defined as:
\[ \large{ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i } \]
Key property: the mean is the value that minimizes the sum of squared deviations.
\[ \large{ \bar{x} = \arg\min_c \sum_{i=1}^n (x_i - c)^2 } \]
CEO Salary Example:
| Employee | Salary |
|---|---|
| Employee 1–5 | ¥8K, ¥9K, ¥10K, ¥11K, ¥12K |
| CEO | ¥100K |
The mean is pulled toward extreme values: here it is ¥25K, more than double any non-CEO salary. A single outlier can destroy its representativeness.
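To make the effect concrete, the following minimal sketch recomputes the center of the salary table with and without the CEO. It uses only the six salaries listed above (in thousands of yuan); the variable names are illustrative.

```python
from statistics import mean, median

# Salaries in thousands of yuan, taken from the table above.
salaries = [8, 9, 10, 11, 12, 100]
staff_only = salaries[:-1]  # drop the CEO

with_ceo_mean = mean(salaries)      # 150 / 6
with_ceo_median = median(salaries)  # average of the two middle values
staff_mean = mean(staff_only)       # 50 / 5
staff_median = median(staff_only)

print(f"with CEO:    mean={with_ceo_mean}, median={with_ceo_median}")
print(f"without CEO: mean={staff_mean}, median={staff_median}")
```

Adding one salary moves the mean from 10 to 25, while the median moves only from 10 to 10.5.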
When we analyze revenue data from A-share companies in the Yangtze River Delta:
| Metric | Value |
|---|---|
| Mean Revenue | ¥231.72 billion |
| Median Revenue | ¥61.88 billion |
| Ratio (Mean / Median) | 3.7× |
Interpretation: The mean is 3.7× the median, a clear sign of a right-skewed distribution. A few mega-corporations (like SAIC Motor) pull the mean far above the typical company.
\[ \large{ \text{Median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)}, & n \text{ odd} \\ \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}, & n \text{ even} \end{cases} } \]
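Why use the median? It minimizes the sum of absolute deviations, so distant values pull on it far less than they pull on the mean: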
\[ \large{ \text{Median} = \arg\min_c \sum_{i=1}^n |x_i - c| } \]
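The two optimization properties (the mean minimizes squared deviations, the median minimizes absolute deviations) can be checked numerically. The sketch below is a brute-force grid search rather than a proof. It reuses the six salaries from the CEO example, and the grid of candidate centers is an arbitrary choice made for illustration.

```python
from statistics import mean, median

data = [8, 9, 10, 11, 12, 100]  # CEO salary example, thousands of yuan

def sum_sq(c):
    """Sum of squared deviations of the data from a candidate center c."""
    return sum((x - c) ** 2 for x in data)

def sum_abs(c):
    """Sum of absolute deviations of the data from a candidate center c."""
    return sum(abs(x - c) for x in data)

# Candidate centers 0.0, 0.5, 1.0, ..., 100.0 (an illustrative grid).
candidates = [i / 2 for i in range(201)]

best_sq = min(candidates, key=sum_sq)    # grid point with the smallest squared loss
best_abs = min(candidates, key=sum_abs)  # first grid point with the smallest absolute loss

print(best_sq, mean(data))
print(best_abs, median(data), sum_abs(best_abs), sum_abs(median(data)))
```

With an even number of observations, every value between the two middle observations gives the same absolute loss, so the grid search may return a different minimizer than the conventional midpoint median while achieving the same loss.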
Rule of thumb: If Mean ≈ Median → symmetric; if Mean >> Median → right-skewed.
The mode is the most frequently occurring value.
When to use:
Limitation: May not exist (uniform distribution) or may not be unique (multimodal).
| Criterion | Mean | Median | Mode |
|---|---|---|---|
| Data type | Numerical | Numerical | Any |
| Sensitive to outliers | Yes | No | No |
| Skewed data | Biased | Preferred | — |
| Mathematical properties | Best | Good | Limited |
| Typical use case | Symmetric data | Income, prices | Categories |
Golden rule: Always report both mean and median. If they differ substantially, investigate the data shape.
Sample Variance (with Bessel’s correction):
\[ \large{ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 } \]
Sample Standard Deviation:
\[ \large{ s = \sqrt{s^2} } \]
Why divide by \(n-1\) (not \(n\))?
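When we measure deviations from \(\bar{x}\) instead of \(\mu\), the sum of squared deviations is systematically too small, because \(\bar{x}\) is exactly the value that minimizes it. Dividing by \(n\) would therefore underestimate the variance; dividing by \(n-1\) corrects this bias: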
\[ \large{ E\left[\sum_{i=1}^n (X_i - \bar{X})^2\right] = (n-1)\sigma^2 } \]
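A small simulation can illustrate the bias. This is a sketch under stated assumptions: samples of size 5 drawn from a normal distribution with \(\sigma^2 = 4\), 20,000 repetitions, and a fixed seed. None of these settings come from the original material.

```python
import random
from statistics import fmean

random.seed(0)            # fixed seed so the run is reproducible
TRUE_VAR = 4.0            # sigma^2 of the assumed N(0, 2^2) population
n, trials = 5, 20_000     # illustrative sample size and repetition count

divide_by_n, divide_by_n_minus_1 = [], []
for _ in range(trials):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    xbar = fmean(sample)
    ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations
    divide_by_n.append(ss / n)
    divide_by_n_minus_1.append(ss / (n - 1))

avg_biased = fmean(divide_by_n)            # should land near (n-1)/n * sigma^2 = 3.2
avg_unbiased = fmean(divide_by_n_minus_1)  # should land near sigma^2 = 4.0

print(f"divide by n:   {avg_biased:.3f}")
print(f"divide by n-1: {avg_unbiased:.3f}")
```

The average of the \(n\)-denominator estimates should settle near \(\frac{n-1}{n}\sigma^2\), while the \(n-1\) version should settle near \(\sigma^2\), matching the expectation formula above.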
Interquartile Range (IQR):
\[ \large{ \text{IQR} = Q_3 - Q_1 } \]
Coefficient of Variation (CV):
\[ \large{ CV = \frac{s}{\bar{x}} \times 100\% } \]
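The four dispersion measures can be computed with Python's standard library, as in the sketch below. The return series is hypothetical and exists only to exercise the formulas. Note that `statistics.quantiles` uses one of several common quartile conventions, so other tools may report slightly different \(Q_1\) and \(Q_3\).

```python
from statistics import mean, quantiles, stdev, variance

# Hypothetical daily returns in percent (not real market data).
returns = [1.2, -0.8, 0.5, 2.1, -1.5, 0.9, 0.3, -0.4]

s2 = variance(returns)               # sample variance, divides by n - 1
s = stdev(returns)                   # sample standard deviation, sqrt(s2)
q1, _, q3 = quantiles(returns, n=4)  # quartiles (default "exclusive" method)
iqr = q3 - q1
cv = s / mean(returns) * 100         # coefficient of variation in percent

print(f"s^2={s2:.3f}  s={s:.3f}  IQR={iqr:.3f}  CV={cv:.0f}%")
```

Because CV divides by the mean, it becomes very large and unstable when the mean is close to zero. Daily returns usually have a mean near zero, which is why CV values in the thousands of percent are plausible for such data.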
Daily return statistics for three industries using A-share data:
| Industry | Representative Stocks | Mean Return | Std Dev | CV |
|---|---|---|---|---|
| Banking | Bank of Ningbo, SPD Bank | Low | Low | ~1,345% |
| Technology | Hikvision, iFlytek | Medium | High | ~9,126% |
| Utilities | Shanghai Electric Power | Low | Low | ~2,800% |
Key insight: CV reveals that tech stocks are ~7× more volatile per unit of return than banking stocks (9,126% / 1,345% ≈ 6.8).
\[ \large{ \text{Skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{s^3} } \]
| Value | Interpretation | Financial Example |
|---|---|---|
| Skewness = 0 | Symmetric | Rare in practice |
| Skewness > 0 | Right-skewed (long right tail) | Revenue, income |
| Skewness < 0 | Left-skewed (long left tail) | Stock returns |
For investors: Negative skew means crash risk — extreme losses are more likely than extreme gains.
\[ \large{ \text{Excess Kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{s^4} - 3 } \]
| Value | Interpretation | Implication |
|---|---|---|
| = 0 | Mesokurtic (normal) | Baseline |
| > 0 | Leptokurtic (fat tails) | More ‘black swans’ |
| < 0 | Platykurtic (thin tails) | Fewer extremes |
Common misconception: Kurtosis does NOT measure “peakedness” — it measures tail thickness.
Finance fact: Stock returns typically have excess kurtosis well above 0, meaning extreme events occur far more often than the normal distribution predicts.
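The skewness and excess kurtosis formulas above translate directly into code. The sketch below follows them literally: the moment in the numerator is divided by \(n\), and \(s\) is taken to be the sample standard deviation (the \(n-1\) version defined earlier), which is an assumption about how the formulas are meant to be read. Statistical libraries often use slightly different normalizations, so their output may not match exactly. The test data are made up.

```python
from statistics import fmean, stdev

def skewness(xs):
    """Third central moment (divided by n) over s^3, with s the sample SD."""
    xbar, s, n = fmean(xs), stdev(xs), len(xs)
    m3 = sum((x - xbar) ** 3 for x in xs) / n
    return m3 / s ** 3

def excess_kurtosis(xs):
    """Fourth central moment (divided by n) over s^4, minus 3."""
    xbar, s, n = fmean(xs), stdev(xs), len(xs)
    m4 = sum((x - xbar) ** 4 for x in xs) / n
    return m4 / s ** 4 - 3

symmetric = [-2, -1, 0, 1, 2]         # made-up symmetric data
right_tail = [1, 1, 1, 2, 10]         # made-up data with a long right tail
fat_tails = [0] * 20 + [10, -10]      # mostly zeros plus two extreme values

print(skewness(symmetric), skewness(right_tail))
print(excess_kurtosis(symmetric), excess_kurtosis(fat_tails))
```

The symmetric series has zero skewness and negative excess kurtosis, the right-tailed series has positive skewness, and the series with two extreme values has strongly positive excess kurtosis.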
Empirical analysis of Hikvision (002415.SZ) daily returns:
| Statistic | Value | Interpretation |
|---|---|---|
| Mean | 0.04%/day | Slight positive drift |
| Std Dev | 2.33% | Moderate volatility |
| Skewness | 0.13 | Nearly symmetric |
| Excess Kurtosis | 1.79 | Fat tails confirmed |
Key finding: An excess kurtosis of 1.79 is clearly above 0, so extreme moves (beyond 3σ) are more likely than normal theory predicts. This has critical implications for VaR models: a model that assumes normality will understate tail risk.
Two standard methods to identify outliers:
Method 1: Z-Score
\[ \large{ Z_i = \frac{x_i - \bar{x}}{s}, \quad \text{flag if } |Z_i| > 3 } \]
Method 2: IQR Fences
The Z-score rule is parametric (it assumes approximate normality); the IQR rule is nonparametric (it makes no distributional assumption).
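Both outlier rules can be sketched in a few lines. The fence multiplier is not stated here, so the code assumes the conventional Tukey choice of 1.5 (flag values below \(Q_1 - 1.5\,\text{IQR}\) or above \(Q_3 + 1.5\,\text{IQR}\)). The function names are illustrative, and the quartiles follow the default convention of `statistics.quantiles`. The data are the six salaries from the CEO example.

```python
from statistics import fmean, quantiles, stdev

def zscore_outliers(xs, threshold=3.0):
    """Values whose |Z| exceeds the threshold, using the sample mean and SD."""
    xbar, s = fmean(xs), stdev(xs)
    return [x for x in xs if abs((x - xbar) / s) > threshold]

def iqr_outliers(xs, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumed convention."""
    q1, _, q3 = quantiles(xs, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lower or x > upper]

salaries = [8, 9, 10, 11, 12, 100]  # thousands of yuan, from the CEO example

print("Z-score flags:", zscore_outliers(salaries))
print("IQR flags:    ", iqr_outliers(salaries))
```

On this tiny sample the Z-score rule flags nothing, because the CEO's salary inflates both the mean and the standard deviation that the rule relies on (its Z-score is only about 2). The IQR rule does flag it. In small or heavily contaminated samples the quartile-based rule tends to be the more reliable of the two.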
Winsorization replaces values beyond a chosen percentile with the boundary value.
Process:
Impact on financial data (PE ratios):
| Metric | Before | After | Change |
|---|---|---|---|
| Mean | Inflated | Reduced | −18% |
| Std Dev | Large | Smaller | −62% |
Winsorization preserves data (unlike deletion) while reducing outlier influence.
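A minimal winsorization sketch is shown below. The 5th and 95th percentile cutoffs, the function name, and the PE ratios are all assumptions for illustration; they do not reproduce the figures in the table above.

```python
from statistics import fmean, quantiles, stdev

def winsorize(xs, lower_pct=5, upper_pct=95):
    """Clip values below/above the given percentiles to the percentile values."""
    # 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(xs, n=100, method="inclusive")
    lower, upper = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(x, lower), upper) for x in xs]

# Hypothetical PE ratios: 19 ordinary values plus one extreme value.
pe = list(range(10, 29)) + [500]
pe_w = winsorize(pe)

print(f"mean: {fmean(pe):.1f} -> {fmean(pe_w):.1f}")
print(f"sd:   {stdev(pe):.1f} -> {stdev(pe_w):.1f}")
print(f"n:    {len(pe)} -> {len(pe_w)}")
```

The number of observations is unchanged; only the extreme values are pulled in to the chosen percentile boundaries, which shrinks both the mean and the standard deviation.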
The histogram reveals the shape of a distribution.
Key design choices:
Always overlay reference lines:
For categorical variables, we summarize using:
When to use which:
| Chart Type | Best For | Avoid When |
|---|---|---|
| Bar chart | Comparing categories | Too many categories (>10) |
| Pie chart | Showing proportions | Categories are similar in size |
| Stacked bar | Comparing compositions | More than 5 sub-categories |
The Pearson correlation coefficient measures linear association:
\[ \large{ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} } \]
| Value of \(r\) | Interpretation |
|---|---|
| \(r = +1\) | Perfect positive linear relationship |
| \(r = -1\) | Perfect negative linear relationship |
| \(r = 0\) | No linear relationship (but nonlinear may exist!) |
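The correlation formula can be implemented directly, as in the sketch below. The three small data sets are made up to hit the three rows of the table, including a case where a perfect nonlinear relationship still gives \(r = 0\).

```python
from math import sqrt
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation sum over the product of deviation norms."""
    xbar, ybar = fmean(xs), fmean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / (sqrt(sxx) * sqrt(syy))

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2 * v + 1 for v in x])     # exact increasing line
r_neg = pearson_r(x, [-v for v in x])            # exact decreasing line
r_quad = pearson_r(x, [(v - 3) ** 2 for v in x]) # perfect U-shape, no linear trend

print(r_pos, r_neg, r_quad)
```

The U-shaped case shows why \(r = 0\) should be read as "no linear relationship" rather than "no relationship".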
Spurious correlations in finance:
Always ask three questions:
A common technique to exaggerate trends:
| Honest Chart | Deceptive Chart |
|---|---|
| Y-axis starts at 0 | Y-axis starts near minimum |
| Changes look proportional | Small changes look dramatic |
| Viewer gets accurate impression | Viewer overreacts |
Example: Quarterly sales of [100, 102, 103, 105]:
Always check the axis scale when reading a chart.
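The distortion can be quantified without drawing anything. The sketch below compares how tall the last bar is drawn relative to the first bar under two y-axis baselines. The truncated baseline of 99 is a hypothetical choice; the sales figures are the ones from the example.

```python
sales = [100, 102, 103, 105]  # quarterly sales from the example

def drawn_height_ratio(values, axis_min):
    """Drawn height of the last bar divided by that of the first bar."""
    return (values[-1] - axis_min) / (values[0] - axis_min)

actual_growth = sales[-1] / sales[0]          # true ratio of the underlying numbers
honest = drawn_height_ratio(sales, 0)         # y-axis starts at 0
truncated = drawn_height_ratio(sales, 99)     # y-axis starts at 99 (hypothetical)

print(f"actual ratio:        {actual_growth:.2f}")
print(f"drawn, axis from 0:  {honest:.2f}")
print(f"drawn, axis from 99: {truncated:.2f}")
```

With the axis at zero, the last bar is drawn 1.05 times as tall as the first, matching the real 5% increase. With the axis starting at 99, the same data produce a bar six times as tall.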
Simpson's paradox: a trend that appears in every subgroup can reverse when the groups are combined.
Classic example: Treatment A has a higher survival rate in both severe and mild cases, but Treatment B has a higher overall survival rate, because B treated more mild cases.
Business implication: Always disaggregate data before drawing conclusions.
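The reversal is easy to reproduce with concrete counts. The numbers below are hypothetical, chosen so that Treatment A handles mostly severe cases and Treatment B mostly mild ones. They are not taken from any real study.

```python
# Hypothetical (survived, treated) counts per treatment and severity.
counts = {
    "A": {"severe": (40, 80), "mild": (18, 20)},
    "B": {"severe": (8, 20), "mild": (68, 80)},
}

def subgroup_rate(treatment, severity):
    """Survival rate within one severity group."""
    survived, treated = counts[treatment][severity]
    return survived / treated

def overall_rate(treatment):
    """Survival rate after pooling both severity groups."""
    survived = sum(s for s, _ in counts[treatment].values())
    treated = sum(n for _, n in counts[treatment].values())
    return survived / treated

for severity in ("severe", "mild"):
    print(severity, subgroup_rate("A", severity), subgroup_rate("B", severity))
print("overall", overall_rate("A"), overall_rate("B"))
```

A wins within each severity group (50% vs 40% for severe, 90% vs 85% for mild), yet B wins overall (76% vs 58%) because 80 of B's 100 patients were mild cases, compared with 20 of A's.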
Measures of Center:
Measures of Spread:
Distribution Shape:
Data Quality:
Visualization: