02: Descriptive Data Analysis

Descriptive Statistics: The First Step in Every Analysis

Before building models, understand your data.

Descriptive statistics answer three fundamental questions:

Question	Measure	Key Metrics
Where is the center?	Central tendency	Mean, Median, Mode
How spread out?	Dispersion	Variance, SD, IQR, CV
What shape?	Distribution shape	Skewness, Kurtosis

Application: Statistical Profiling of Listed Companies

Using financial statement data from A-share companies:

ROE (Return on Equity): profitability measure
Asset Turnover Ratio: efficiency measure
Debt-to-Asset Ratio: leverage measure

For each metric, we compute center, spread, and shape to build a statistical portrait of the company.

This is the foundation of fundamental analysis in finance.

The Arithmetic Mean: Intuitive but Fragile

The sample mean is defined as:

\[ \large{ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i } \]

Key Properties:

Linearity: \(E[aX + b] = aE[X] + b\)
Unbiasedness: \(E[\bar{X}] = \mu\) (on average, the sample mean equals the population mean)
Optimization view: The mean minimizes the sum of squared deviations:

\[ \large{ \bar{x} = \arg\min_c \sum_{i=1}^n (x_i - c)^2 } \]
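The optimization view follows from a short calculus argument, sketched here: differentiate the sum of squares with respect to \(c\) and set the derivative to zero.

\[ \large{ \frac{d}{dc}\sum_{i=1}^{n}(x_i - c)^2 = -2\sum_{i=1}^{n}(x_i - c) = 0 \;\Longrightarrow\; c = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x} } \]

The second derivative is \(2n > 0\), so this stationary point is a minimum.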

The Mean’s Fatal Flaw: Sensitivity to Outliers

CEO Salary Example:

Employee	Salary
Employee 1–5	¥8K, ¥9K, ¥10K, ¥11K, ¥12K
CEO	¥100K

Without CEO: Mean = ¥10K (representative)
With CEO: Mean = ¥25K (misleading!)

The mean is pulled toward extreme values — a single outlier can destroy its representativeness.
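The salary example can be reproduced in a few lines. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative, and the figures are the ones from the table (in thousands of yuan).

```python
from statistics import mean

# Salaries in thousands of yuan, taken from the salary table.
staff = [8, 9, 10, 11, 12]
ceo = 100

mean_without = mean(staff)          # average of the five employees
mean_with = mean(staff + [ceo])     # average once the CEO is included

print(f"Without CEO: {mean_without}K")
print(f"With CEO:    {mean_with}K")
```

One added observation shifts the mean by a factor of 2.5, even though five of the six salaries are unchanged.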

Case: Revenue Distribution of YRD Listed Companies

When we analyze revenue data from A-share companies in the Yangtze River Delta:

Metric	Value
Mean Revenue	¥231.72 billion
Median Revenue	¥61.88 billion
Ratio (Mean / Median)	3.7×

Interpretation: The mean is about 3.7× the median (231.72 / 61.88 ≈ 3.7), a clear sign of a right-skewed distribution. A few mega-corporations (like SAIC Motor) pull the mean far above the typical company.

The Median: Robust Alternative to the Mean

\[ \large{ \text{Median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)}, & n \text{ odd} \\ \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}, & n \text{ even} \end{cases} } \]

Why use the median?

Robust: Largely unaffected by extreme values (it changes only when observations cross the middle of the ordered data)
Optimization view: Minimizes the sum of absolute deviations:

\[ \large{ \text{Median} = \arg\min_c \sum_{i=1}^n |x_i - c| } \]

Rule of thumb: If Mean ≈ Median → symmetric; if Mean >> Median → right-skewed.
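Both claims about the median can be checked numerically. The sketch below reuses the six salaries from the CEO example (an assumption about which data to use) and compares the sum of absolute deviations at the median and at the mean; `sum_abs_dev` is a hypothetical helper.

```python
from statistics import mean, median

salaries = [8, 9, 10, 11, 12, 100]   # thousands of yuan, CEO included

def sum_abs_dev(values, c):
    """Sum of absolute deviations of values from a candidate center c."""
    return sum(abs(v - c) for v in values)

center_mean = mean(salaries)      # pulled up by the CEO
center_median = median(salaries)  # average of the two middle values

# The median should give the smaller sum of absolute deviations.
loss_at_median = sum_abs_dev(salaries, center_median)
loss_at_mean = sum_abs_dev(salaries, center_mean)

print(center_mean, center_median, loss_at_median, loss_at_mean)
```

Here the median stays close to the typical salary while the mean does not, which is the Mean >> Median pattern of a right-skewed sample.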

Mode: Best for Categorical and Multimodal Data

The mode is the most frequently occurring value.

When to use:

Nominal data: “What is the most common industry?” → Mode
Multimodal distributions: Two peaks in customer spending → bimodal
Discrete data: Most common number of transactions per day

Limitation: May not exist (uniform distribution) or may not be unique (multimodal).
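The three situations above (a single mode, two modes, no meaningful mode) can be illustrated with `statistics.multimode`, which returns every value tied for the highest count. The data below are hypothetical.

```python
from statistics import multimode

# Hypothetical data for illustration only.
industries = ["Banking", "Technology", "Utilities",
              "Technology", "Banking", "Technology"]
spending = [20, 20, 20, 35, 80, 80, 80]   # two equally common values

top_industry = multimode(industries)   # single most common category
peaks = multimode(spending)            # bimodal: two values tie
no_repeats = multimode([1, 2, 3])      # all values tie, so all are returned

print(top_industry, peaks, no_repeats)
```

When every value ties, the result lists them all, which is a practical signal that the mode is not informative for that variable.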

Mean vs. Median vs. Mode: When to Use Each

Criterion	Mean	Median	Mode
Data type	Numerical	Numerical	Any
Sensitive to outliers	Yes	No	No
Skewed data	Pulled toward the tail	Preferred	—
Mathematical properties	Best	Good	Limited
Typical use case	Symmetric data	Income, prices	Categories

Golden rule: Always report both mean and median. If they differ substantially, investigate the data shape.

Variance and Standard Deviation: Measuring Spread

Sample Variance (with Bessel’s correction):

\[ \large{ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 } \]

Sample Standard Deviation:

\[ \large{ s = \sqrt{s^2} } \]

Why divide by \(n-1\) (not \(n\))?

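Because \(\bar{x}\) is the value that minimizes the sum of squared deviations, deviations measured from \(\bar{x}\) are on average smaller than deviations measured from the true \(\mu\). Dividing that sum by \(n\) would therefore systematically underestimate \(\sigma^2\). Dividing by \(n-1\) corrects this bias: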

\[ \large{ E\left[\sum_{i=1}^n (X_i - \bar{X})^2\right] = (n-1)\sigma^2 } \]
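As a small illustration of the two divisors, the standard library exposes both: `statistics.pvariance` divides by \(n\) and `statistics.variance` divides by \(n-1\). The sample below is illustrative.

```python
from statistics import pvariance, variance

sample = [8, 9, 10, 11, 12]   # illustrative sample, mean = 10

divide_by_n = pvariance(sample)          # sum of squared deviations / n
divide_by_n_minus_1 = variance(sample)   # sum of squared deviations / (n - 1)

print(divide_by_n, divide_by_n_minus_1)
```

The two estimates differ by the factor \(n/(n-1)\), so the correction matters most for small samples and fades as \(n\) grows.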

IQR and Coefficient of Variation

Interquartile Range (IQR):

\[ \large{ \text{IQR} = Q_3 - Q_1 } \]

Contains the middle 50% of data
Robust to outliers (unlike range or variance)

Coefficient of Variation (CV):

\[ \large{ CV = \frac{s}{\bar{x}} \times 100\% } \]

Dimensionless — enables comparison across different scales
Example: Which industry's returns are more volatile relative to their mean, banking (CV ≈ 1,345%) or tech (CV ≈ 9,126%)?
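A minimal sketch of both measures using only the standard library. The helper names `iqr` and `cv_percent` are hypothetical, and the quartiles follow the default convention of `statistics.quantiles`; other software may use a slightly different quartile rule.

```python
from statistics import mean, quantiles, stdev

def iqr(values):
    """Interquartile range Q3 - Q1 (quartiles from statistics.quantiles)."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

def cv_percent(values):
    """Coefficient of variation s / mean, expressed in percent."""
    return stdev(values) / mean(values) * 100

base = [8, 9, 10, 11, 12]        # illustrative sample
with_outlier = base + [100]      # same sample plus one extreme value

print(iqr(base), cv_percent(base))
print(iqr(with_outlier), cv_percent(with_outlier))
```

Because the CV divides by the mean, it becomes very large and unstable when the mean is close to zero, which is one reason CVs of daily returns can run into the thousands of percent.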

Case: Risk Comparison Across Industries (2023)

Daily return statistics for three industries using A-share data:

Industry	Representative Stocks	Mean Return	Std Dev	CV
Banking	Bank of Ningbo, SPD Bank	Low	Low	~1,345%
Technology	Hikvision, iFlytek	Medium	High	~9,126%
Utilities	Shanghai Electric Power	Low	Low	~2,800%

Key insight: CV indicates that tech stocks are roughly 7× more volatile per unit of return than banking stocks (9,126 / 1,345 ≈ 6.8).

Skewness: Measuring Asymmetry

\[ \large{ \text{Skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{s^3} } \]

Value	Interpretation	Financial Example
Skewness = 0	Symmetric	Rare in practice
Skewness > 0	Right-skewed (long right tail)	Revenue, income
Skewness < 0	Left-skewed (long left tail)	Stock returns

For investors: Negative skew means crash risk — extreme losses are more likely than extreme gains.
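The skewness formula can be implemented directly. This sketch follows the definition as written (third central moment with a \(1/n\) divisor, divided by the sample standard deviation cubed); statistical packages often apply slightly different small-sample corrections, so values may not match them exactly. The data are hypothetical.

```python
from statistics import mean, stdev

def skewness(values):
    """Third central moment (1/n divisor) divided by s cubed."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)   # sample standard deviation (n - 1 divisor)
    third_moment = sum((x - x_bar) ** 3 for x in values) / n
    return third_moment / s ** 3

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 2, 2, 20]       # hypothetical: one large value
left_tail = [-20, -2, -2, -1, -1]   # mirror image of right_tail

print(skewness(symmetric), skewness(right_tail), skewness(left_tail))
```

The sign matches the table: zero for the symmetric sample, positive for the long right tail, negative for the long left tail.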

Kurtosis: Measuring Tail Thickness

\[ \large{ \text{Excess Kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{s^4} - 3 } \]

Value	Interpretation	Implication
= 0	Mesokurtic (normal)	Baseline
> 0	Leptokurtic (fat tails)	More ‘black swans’
< 0	Platykurtic (thin tails)	Fewer extremes

Common misconception: Kurtosis does NOT measure “peakedness” — it measures tail thickness.

Finance fact: Stock returns typically have excess kurtosis well above 0, meaning extreme events occur far more often than the normal distribution predicts.
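A direct implementation of the excess kurtosis formula, again as a sketch that follows the definition as written (fourth central moment with a \(1/n\) divisor over \(s^4\), minus 3). Both samples are hypothetical.

```python
from statistics import mean, stdev

def excess_kurtosis(values):
    """Fourth central moment (1/n divisor) divided by s^4, minus 3."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)   # sample standard deviation (n - 1 divisor)
    fourth_moment = sum((x - x_bar) ** 4 for x in values) / n
    return fourth_moment / s ** 4 - 3

flat = list(range(1, 11))             # evenly spread values, thin tails
heavy = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]   # mostly quiet, two extremes

print(excess_kurtosis(flat), excess_kurtosis(heavy))
```

The evenly spread sample gives a negative value (platykurtic) and the sample with two extremes gives a positive value (leptokurtic).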

Case: Hikvision Daily Return Distribution (2020–2023)

Empirical analysis of Hikvision (002415.SZ) daily returns:

Statistic	Value	Interpretation
Mean	0.04%/day	Slight positive drift
Std Dev	2.33%	Moderate volatility
Skewness	0.13	Nearly symmetric
Excess Kurtosis	1.79	Fat tails confirmed

Key finding: Excess kurtosis = 1.79 > 0 means the probability of extreme moves (>3σ) is higher than the normal distribution predicts. This has critical implications for VaR models.
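To make "higher than normal theory predicts" concrete, the normal benchmark for a move beyond 3σ can be computed and compared with the observed share in a return series. This is a sketch: `empirical_tail_fraction` is a hypothetical helper and the series below is made up, not Hikvision data.

```python
from statistics import NormalDist, mean, stdev

# Two-sided probability of a move beyond 3 standard deviations under normality.
normal_tail_prob = 2 * (1 - NormalDist().cdf(3))

def empirical_tail_fraction(returns, k=3):
    """Share of observations more than k sample standard deviations from the mean."""
    x_bar, s = mean(returns), stdev(returns)
    return sum(abs(r - x_bar) > k * s for r in returns) / len(returns)

# Hypothetical series: mostly quiet days plus two extreme moves.
toy_returns = [0.0] * 98 + [10.0, -10.0]

print(f"Normal benchmark: {normal_tail_prob:.4%}")
print(f"Toy series:       {empirical_tail_fraction(toy_returns):.4%}")
```

Under normality a 3σ day should occur on roughly 0.27% of days; a fat-tailed series shows a noticeably larger share.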

The ‘Dirty Work’: Outlier Detection

Two standard methods to identify outliers:

Method 1: Z-Score

\[ \large{ Z_i = \frac{x_i - \bar{x}}{s}, \quad \text{flag if } |Z_i| > 3 } \]

Method 2: IQR Fences

Lower fence: \(Q_1 - 1.5 \times \text{IQR}\)
Upper fence: \(Q_3 + 1.5 \times \text{IQR}\)

Z-Score is parametric (assumes approximate normality); IQR is nonparametric (works for any distribution).
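Both rules are short to implement. The sketch below uses hypothetical helper names and the default quartile convention of `statistics.quantiles`, applied to the six salaries from the CEO example.

```python
from statistics import mean, quantiles, stdev

def zscore_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    x_bar, s = mean(values), stdev(values)
    return [v for v in values if abs(v - x_bar) / s > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]

salaries = [8, 9, 10, 11, 12, 100]   # thousands of yuan

print("Z-score flags:", zscore_outliers(salaries))
print("IQR flags:    ", iqr_outliers(salaries))
```

On this tiny sample the IQR fences flag the CEO salary but the z-score rule flags nothing: the outlier inflates \(s\) so much that its own z-score is only about 2. This masking effect is a reason to prefer IQR fences for small or heavily contaminated samples.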

Winsorization: Taming Extreme Values

Winsorization replaces values beyond a chosen percentile with the boundary value.

Process:

Set boundaries at 1st and 99th percentiles
Replace values below P1 with P1; above P99 with P99

Impact on financial data (PE ratios):

Metric	Before	After	Change
Mean	Inflated	Reduced	−18%
Std Dev	Large	Smaller	−62%

Winsorization preserves data (unlike deletion) while reducing outlier influence.
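The two-step process can be sketched with the standard library. The function name `winsorize`, the "inclusive" percentile convention, and the PE ratios are all assumptions made for illustration; percentile definitions vary between tools, so boundary values may differ slightly elsewhere.

```python
from statistics import quantiles

def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the [lower_pct, upper_pct] percentile range."""
    # quantiles(..., n=100) returns the 99 cut points P1..P99.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical PE ratios: 1..100 plus one extreme value.
pe_ratios = list(range(1, 101)) + [10_000]
clipped = winsorize(pe_ratios)

print(max(pe_ratios), "->", max(clipped))
print(len(pe_ratios), "->", len(clipped))
```

The sample size is unchanged; only the values beyond P1 and P99 are pulled in to the boundaries.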

Data Visualization: The Histogram

The histogram reveals the shape of a distribution.

Key design choices:

Number of bins: Too few → lose detail; Too many → noisy
Sturges’ rule: number of bins \(k = 1 + \log_2(n)\) (simple, but assumes normality)
Freedman-Diaconis rule: bin width \(h = 2 \cdot \text{IQR} \cdot n^{-1/3}\) (adaptive to spread; the number of bins is then the data range divided by \(h\))

Always overlay reference lines:

Mean (dashed) and Median (solid) → reveals asymmetry
±1σ bands → shows spread
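The two bin rules can be compared on the same data. In this sketch the function names are hypothetical, rounding up to a whole number of bins is an assumption, and quartiles use the default convention of `statistics.quantiles`.

```python
import math
from statistics import quantiles

def sturges_bins(n):
    """Number of bins k = 1 + log2(n), rounded up."""
    return math.ceil(1 + math.log2(n))

def freedman_diaconis_width(values):
    """Bin width h = 2 * IQR * n^(-1/3)."""
    q1, _, q3 = quantiles(values, n=4)
    return 2 * (q3 - q1) * len(values) ** (-1 / 3)

def freedman_diaconis_bins(values):
    """Number of bins implied by the width: data range / h, rounded up."""
    h = freedman_diaconis_width(values)
    return math.ceil((max(values) - min(values)) / h)

data = list(range(1, 101))   # illustrative sample of 100 values

print(sturges_bins(len(data)), freedman_diaconis_bins(data))
```

Sturges depends only on the sample size, while Freedman-Diaconis also reacts to how spread out the data are, so the two can disagree substantially on skewed data.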

The Box Plot: Five-Number Summary in One Picture
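The five-number summary behind a box plot consists of the minimum, first quartile, median, third quartile, and maximum. A minimal sketch for computing it, assuming the default quartile convention of `statistics.quantiles` and reusing the salary data from the CEO example:

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

salaries = [8, 9, 10, 11, 12, 100]   # thousands of yuan
summary = five_number_summary(salaries)

print(summary)
```

In a box plot, the box spans Q1 to Q3 with a line at the median, which makes the gap between the bulk of the data and an extreme maximum immediately visible.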

Categorical Data Analysis: Frequency Tables & Charts

For categorical variables, we summarize using:

Frequency table: count and relative frequency for each category
Bar chart: shows comparison across categories
Pie chart: shows proportion of whole (use sparingly)
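A frequency table with counts and relative frequencies can be built with `collections.Counter`. The industry labels below are hypothetical.

```python
from collections import Counter

# Hypothetical industry labels for six companies.
industries = ["Technology", "Banking", "Technology",
              "Utilities", "Banking", "Technology"]

counts = Counter(industries)
total = sum(counts.values())

# One row per category: (category, count, relative frequency), most common first.
freq_table = [(cat, n, n / total) for cat, n in counts.most_common()]

for category, count, share in freq_table:
    print(f"{category:<12}{count:>3}{share:>8.1%}")
```

The same rows feed a bar chart directly: categories on one axis, counts or shares on the other.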

When to use which:

Chart Type	Best For	Avoid When
Bar chart	Comparing categories	Too many categories (>10)
Pie chart	Showing proportions	Categories are similar in size
Stacked bar	Comparing compositions	More than 5 sub-categories

Bivariate Relationships: Pearson Correlation

The Pearson correlation coefficient measures linear association:

\[ \large{ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} } \]

Value of \(r\)	Interpretation
\(r = +1\)	Perfect positive linear relationship
\(r = -1\)	Perfect negative linear relationship
\(r = 0\)	No linear relationship (but nonlinear may exist!)
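The formula translates directly into code. The sketch below (with a hypothetical function name) also illustrates the last row of the table: a perfect but nonlinear relationship can have \(r = 0\).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed term by term from the formula."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross / (math.sqrt(ss_x) * math.sqrt(ss_y))

xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]    # exact linear relationship
parabola = [x ** 2 for x in xs]     # exact but nonlinear relationship

print(pearson_r(xs, linear), pearson_r(xs, parabola))
```

The parabola is fully determined by \(x\), yet its correlation with \(x\) is zero, so a scatter plot is worth checking before concluding that two variables are unrelated.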

Critical Reminder: Correlation ≠ Causation

Correlations that are not directly causal:

Stock market index ↑ and GDP ↑ → common business cycle driver
Stock price ↑ and employee count ↑ → reverse causation (growth drives hiring)
Ice cream sales ↑ and drowning deaths ↑ → confounding variable (temperature)

Always ask three questions:

Is there a plausible causal mechanism?
Could a confounding variable explain the association?
Does the direction of causation make sense?

Misleading Graphs: The Truncated Y-Axis Trap

A common technique to exaggerate trends:

Honest Chart	Deceptive Chart
Y-axis starts at 0	Y-axis starts near minimum
Changes look proportional	Small changes look dramatic
Viewer gets accurate impression	Viewer overreacts

Example: Quarterly sales of [100, 102, 103, 105]:

Y from 0: looks almost flat (accurate: total growth is only 5%)
Y from 98: looks like explosive growth (misleading!)

Always check the axis scale when reading a chart.
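The size of the distortion in the quarterly sales example can be quantified by comparing drawn bar heights. In this sketch, `apparent_growth` is a hypothetical helper that returns the ratio of the last bar's height to the first bar's height for a given axis start.

```python
def apparent_growth(values, axis_start):
    """Ratio of last to first bar height when the y-axis starts at axis_start."""
    return (values[-1] - axis_start) / (values[0] - axis_start)

sales = [100, 102, 103, 105]

honest = apparent_growth(sales, 0)       # bars drawn from zero
truncated = apparent_growth(sales, 98)   # bars drawn from 98

print(f"From 0:  last bar is {honest:.2f}x the first")
print(f"From 98: last bar is {truncated:.2f}x the first")
```

With the axis starting at 98, a 5% increase is drawn as a bar 3.5 times as tall as the first one.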

Simpson’s Paradox: When Aggregation Reverses the Truth

A trend that appears in subgroups can reverse when groups are combined.

Classic example: Treatment A has a higher survival rate in both severe and mild cases, but Treatment B has a higher overall survival rate, because B was given mostly to mild cases.

Business implication: Always disaggregate data before drawing conclusions.
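The reversal is easiest to see with numbers. The counts below are hypothetical, chosen only to reproduce the pattern described: A does better within each severity group, B does better overall.

```python
# Hypothetical counts: (survivors, patients) per treatment and severity.
data = {
    "A": {"mild": (19, 20), "severe": (56, 80)},
    "B": {"mild": (72, 80), "severe": (12, 20)},
}

def rate(survived, total):
    """Survival rate within one group."""
    return survived / total

def overall_rate(groups):
    """Survival rate after pooling all severity groups of one treatment."""
    survived = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return survived / total

for name, groups in data.items():
    by_group = {sev: f"{rate(*counts):.0%}" for sev, counts in groups.items()}
    print(name, by_group, f"overall {overall_rate(groups):.0%}")
```

Treatment A is mostly given to severe cases and B mostly to mild cases, so the pooled rates mainly reflect case mix rather than treatment quality.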

Chapter 2 Summary

Measures of Center:

Mean (best for symmetric), Median (best for skewed), Mode (best for categorical)

Measures of Spread:

Variance/SD (standard), IQR (robust), CV (comparable across scales)

Distribution Shape:

Skewness (asymmetry) and Kurtosis (tail thickness, NOT peakedness)

Data Quality:

Detect outliers (Z-score, IQR fences), treat with winsorization

Visualization:

Histograms, box plots, scatter plots; beware of misleading graphs