02: Descriptive Data Analysis

Descriptive Statistics: The First Step in Every Analysis

Before building models, understand your data.

Descriptive statistics answer three fundamental questions:

Question | Measure | Key Metrics
Where is the center? | Central tendency | Mean, Median, Mode
How spread out? | Dispersion | Variance, SD, IQR, CV
What shape? | Distribution shape | Skewness, Kurtosis

Application: Statistical Profiling of Listed Companies

Using financial statement data from A-share companies:

  • ROE (Return on Equity): profitability measure
  • Asset Turnover Ratio: efficiency measure
  • Debt-to-Asset Ratio: leverage measure

For each metric, we compute center, spread, and shape to build a statistical portrait of each company.

This is the foundation of fundamental analysis in finance.

The Arithmetic Mean: Intuitive but Fragile

The sample mean is defined as:

\[ \large{ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i } \]

Key Properties:

  • Linearity: \(E[aX + b] = aE[X] + b\)
  • Unbiasedness: \(E[\bar{X}] = \mu\) (on average, the sample mean equals the population mean)
  • Optimization view: The mean minimizes the sum of squared deviations:

\[ \large{ \bar{x} = \arg\min_c \sum_{i=1}^n (x_i - c)^2 } \]

The Mean’s Fatal Flaw: Sensitivity to Outliers

CEO Salary Example:

Employee | Salary
Employees 1–5 | ¥8K, ¥9K, ¥10K, ¥11K, ¥12K
CEO | ¥100K

  • Without CEO: Mean = ¥10K (representative)
  • With CEO: Mean = ¥25K (misleading!)

The mean is pulled toward extreme values — a single outlier can destroy its representativeness.
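
As a quick check of the salary example, the following Python sketch recomputes the mean with and without the CEO and adds the median for comparison. It uses only the standard library, and the variable names are illustrative.

```python
from statistics import mean, median

# Salaries in thousands of yuan, taken from the CEO example above.
staff = [8, 9, 10, 11, 12]
with_ceo = staff + [100]

mean_without = mean(staff)        # typical employee salary
mean_with = mean(with_ceo)        # pulled upward by a single value
median_without = median(staff)
median_with = median(with_ceo)    # barely moves

print(f"Mean:   {mean_without} -> {mean_with}")
print(f"Median: {median_without} -> {median_with}")
```

The mean jumps from 10 to 25, while the median only moves from 10 to 10.5.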

Case: Revenue Distribution of YRD Listed Companies

When we analyze revenue data from A-share companies in the Yangtze River Delta:

Metric | Value
Mean Revenue | ¥231.72 billion
Median Revenue | ¥61.88 billion
Ratio (Mean / Median) | 3.7×

Interpretation: The mean is 3.7× the median, a clear sign of a right-skewed distribution. A few mega-corporations (like SAIC Motor) pull the mean far above the revenue of a typical company.

The Median: Robust Alternative to the Mean

\[ \large{ \text{Median} = \begin{cases} x_{\left(\frac{n+1}{2}\right)}, & n \text{ odd} \\ \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}, & n \text{ even} \end{cases} } \]

Why use the median?

  • Robust: Unaffected by extreme values
  • Optimization view: Minimizes the sum of absolute deviations:

\[ \large{ \text{Median} = \arg\min_c \sum_{i=1}^n |x_i - c| } \]

Rule of thumb: If Mean ≈ Median → symmetric; if Mean >> Median → right-skewed.
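
The two optimization views (the mean minimizes squared deviations, the median minimizes absolute deviations) can be checked numerically. The sketch below does a brute-force search over a grid of candidate centers. The data values and the 0.1 grid step are made up for illustration.

```python
from statistics import mean, median

data = [2, 3, 5, 8, 40]  # made-up values with one large observation

# Candidate centers c = 0.0, 0.1, ..., 40.0
candidates = [i / 10 for i in range(401)]

def sum_squared(c):
    """Sum of squared deviations of the data from c."""
    return sum((x - c) ** 2 for x in data)

def sum_absolute(c):
    """Sum of absolute deviations of the data from c."""
    return sum(abs(x - c) for x in data)

best_squared = min(candidates, key=sum_squared)
best_absolute = min(candidates, key=sum_absolute)

print("argmin of squared deviations: ", best_squared, "| mean:  ", mean(data))
print("argmin of absolute deviations:", best_absolute, "| median:", median(data))
```

The squared loss penalizes the large observation heavily, which is why its minimizer is dragged toward 40 while the absolute-loss minimizer stays at the middle value.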

Mode: Best for Categorical and Multimodal Data

The mode is the most frequently occurring value.

When to use:

  • Nominal data: “What is the most common industry?” → Mode
  • Multimodal distributions: Two peaks in customer spending → bimodal
  • Discrete data: Most common number of transactions per day

Limitation: May not exist (uniform distribution) or may not be unique (multimodal).
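
A minimal sketch of the three situations above, using `statistics.multimode` from the Python standard library, which returns every value tied for the highest count. The sample lists are invented for illustration.

```python
from statistics import multimode

# Nominal data: the mode is the most common category.
industries = ["Banking", "Tech", "Tech", "Utilities", "Banking", "Tech"]
industry_mode = multimode(industries)

# Bimodal data: two values share the highest frequency.
spending = [20, 20, 20, 35, 80, 80, 80]
spending_modes = multimode(spending)

# No repeated value: every observation ties, so the mode is uninformative.
all_distinct = multimode([1, 2, 3, 4])

print(industry_mode)
print(spending_modes)
print(all_distinct)
```

Returning a list rather than a single value makes the "may not be unique" limitation visible instead of hiding it.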

Mean vs. Median vs. Mode: When to Use Each

Criterion | Mean | Median | Mode
Data type | Numerical | Numerical | Any
Sensitive to outliers | Yes | No | No
Skewed data | Biased | Preferred |
Mathematical properties | Best | Good | Limited
Typical use case | Symmetric data | Income, prices | Categories

Golden rule: Always report both mean and median. If they differ substantially, investigate the data shape.

Variance and Standard Deviation: Measuring Spread

Sample Variance (with Bessel’s correction):

\[ \large{ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 } \]

Sample Standard Deviation:

\[ \large{ s = \sqrt{s^2} } \]

Why divide by \(n-1\) (not \(n\))?

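If we divide by \(n\) while using \(\bar{x}\) in place of \(\mu\), we systematically underestimate the variance, because \(\bar{x}\) is by construction the point with the smallest sum of squared deviations from the sample. Dividing by \(n-1\) corrects this bias: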

\[ \large{ E\left[\sum_{i=1}^n (X_i - \bar{X})^2\right] = (n-1)\sigma^2 } \]
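
The bias can be seen in a small simulation. The sketch below draws many small samples from a normal population with known variance and averages both estimators. The seed, sample size, number of trials, and the choice of a normal population are all assumptions made for illustration.

```python
import random

random.seed(0)       # fixed seed so the run is repeatable

n = 5                # small samples make the bias easy to see
trials = 20000
sigma2 = 4.0         # true variance of the simulated population

total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(trials):
    sample = [random.gauss(0, sigma2 ** 0.5) for _ in range(n)]
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)   # sum of squared deviations
    total_div_n += ss / n
    total_div_n_minus_1 += ss / (n - 1)

avg_div_n = total_div_n / trials
avg_div_n_minus_1 = total_div_n_minus_1 / trials

print(f"Average of SS/n:     {avg_div_n:.3f}")
print(f"Average of SS/(n-1): {avg_div_n_minus_1:.3f}")
```

With \(n = 5\), the divide-by-\(n\) average should land near \(\frac{n-1}{n}\sigma^2 = 3.2\), while the divide-by-\((n-1)\) average should land near the true value of 4.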

IQR and Coefficient of Variation

Interquartile Range (IQR):

\[ \large{ \text{IQR} = Q_3 - Q_1 } \]

  • Contains the middle 50% of data
  • Robust to outliers (unlike range or variance)

Coefficient of Variation (CV):

\[ \large{ CV = \frac{s}{\bar{x}} \times 100\% } \]

  • Dimensionless — enables comparison across different scales
  • Example: Which industry's returns are more volatile relative to their mean, banking (CV ≈ 1,345%) or tech (CV ≈ 9,126%)?
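
A minimal sketch of both measures using the Python standard library. The helper names `iqr` and `cv_percent` are illustrative. Quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is one of several quartile conventions, so other tools may give slightly different values.

```python
from statistics import mean, stdev, quantiles

def iqr(xs):
    """Interquartile range Q3 - Q1 (inclusive quartile method)."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    return q3 - q1

def cv_percent(xs):
    """Coefficient of variation: sample SD divided by the mean, in percent."""
    return stdev(xs) / mean(xs) * 100

salaries = [8, 9, 10, 11, 12]          # thousands of yuan
with_outlier = salaries + [100]

# IQR barely reacts to the outlier, while the range explodes.
print("IQR:  ", iqr(salaries), "->", iqr(with_outlier))
print("Range:", max(salaries) - min(salaries), "->",
      max(with_outlier) - min(with_outlier))

# CV is unchanged when the unit changes (thousands of yuan vs. yuan).
in_yuan = [x * 1000 for x in salaries]
print(f"CV: {cv_percent(salaries):.2f}% vs {cv_percent(in_yuan):.2f}%")
```

Because the mean sits in the denominator, CV becomes very large when the mean is close to zero, which is typical for daily returns.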

Case: Risk Comparison Across Industries (2023)

Daily return statistics for three industries using A-share data:

Industry | Representative Stocks | Mean Return | Std Dev | CV
Banking | Bank of Ningbo, SPD Bank | Low | Low | ~1,345%
Technology | Hikvision, iFlytek | Medium | High | ~9,126%
Utilities | Shanghai Electric Power | Low | Low | ~2,800%

Key insight: CV reveals that tech stocks carry roughly 7× as much volatility per unit of mean return as banking stocks (9,126 / 1,345 ≈ 6.8).

Skewness: Measuring Asymmetry

\[ \large{ \text{Skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{s^3} } \]

Value | Interpretation | Financial Example
Skewness = 0 | Symmetric | Rare in practice
Skewness > 0 | Right-skewed (long right tail) | Revenue, income
Skewness < 0 | Left-skewed (long left tail) | Stock returns

For investors: Negative skew means crash risk — extreme losses are more likely than extreme gains.

Kurtosis: Measuring Tail Thickness

\[ \large{ \text{Excess Kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{s^4} - 3 } \]

Value | Interpretation | Implication
= 0 | Mesokurtic (normal) | Baseline
> 0 | Leptokurtic (fat tails) | More ‘black swans’
< 0 | Platykurtic (thin tails) | Fewer extremes

Common misconception: Kurtosis does NOT measure “peakedness” — it measures tail thickness.

Finance fact: Stock returns typically have excess kurtosis well above 0, meaning extreme events occur far more often than the normal distribution predicts.
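
The skewness and excess kurtosis formulas in this chapter can be implemented directly. The sketch below follows them term by term (a \(\frac{1}{n}\) average in the numerator, the sample standard deviation \(s\) in the denominator). Library routines may use slightly different normalizations, so small numerical differences are to be expected. The test data are made up.

```python
from statistics import mean, stdev

def skewness(xs):
    """Third central moment (1/n average) divided by s**3."""
    m, s, n = mean(xs), stdev(xs), len(xs)
    return sum((x - m) ** 3 for x in xs) / n / s ** 3

def excess_kurtosis(xs):
    """Fourth central moment (1/n average) divided by s**4, minus 3."""
    m, s, n = mean(xs), stdev(xs), len(xs)
    return sum((x - m) ** 4 for x in xs) / n / s ** 4 - 3

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 2, 2, 3, 10]          # one large value on the right
left_tail = [-x for x in right_tail]      # mirror image

heavy_tails = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]   # rare, large moves

print(f"skew symmetric:  {skewness(symmetric):+.3f}")
print(f"skew right tail: {skewness(right_tail):+.3f}")
print(f"skew left tail:  {skewness(left_tail):+.3f}")
print(f"excess kurtosis, evenly spread: {excess_kurtosis(symmetric):+.3f}")
print(f"excess kurtosis, heavy tails:   {excess_kurtosis(heavy_tails):+.3f}")
```

Mirroring the data flips the sign of skewness but leaves kurtosis unchanged, since the fourth power ignores the direction of a deviation.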

Case: Hikvision Daily Return Distribution (2020–2023)

Empirical analysis of Hikvision (002415.SZ) daily returns:

Statistic | Value | Interpretation
Mean | 0.04%/day | Slight positive drift
Std Dev | 2.33% | Moderate volatility
Skewness | 0.13 | Nearly symmetric
Excess Kurtosis | 1.79 | Fat tails confirmed

Key finding: Excess kurtosis = 1.79 > 0 means the probability of extreme moves (>3σ) is much higher than normal theory predicts. This has critical implications for VaR models.

The ‘Dirty Work’: Outlier Detection

Two standard methods to identify outliers:

Method 1: Z-Score

\[ \large{ Z_i = \frac{x_i - \bar{x}}{s}, \quad \text{flag if } |Z_i| > 3 } \]

Method 2: IQR Fences

  • Lower fence: \(Q_1 - 1.5 \times \text{IQR}\)
  • Upper fence: \(Q_3 + 1.5 \times \text{IQR}\)

Z-Score is parametric (assumes approximate normality); IQR is nonparametric (works for any distribution).
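
Both methods take only a few lines of Python. The sketch below applies them to the salary data from the CEO example. The function names are illustrative, and the quartiles use the `inclusive` method of `statistics.quantiles`, which is one of several conventions.

```python
from statistics import mean, stdev, quantiles

def zscore_outliers(xs, threshold=3.0):
    """Values whose |z| exceeds the threshold."""
    m, s = mean(xs), stdev(xs)
    return [x for x in xs if abs(x - m) / s > threshold]

def iqr_outliers(xs, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    margin = k * (q3 - q1)
    return [x for x in xs if x < q1 - margin or x > q3 + margin]

salaries = [8, 9, 10, 11, 12, 100]   # thousands of yuan

z_flagged = zscore_outliers(salaries)
iqr_flagged = iqr_outliers(salaries)

print("Z-score flags:", z_flagged)
print("IQR flags:    ", iqr_flagged)
```

On this small sample the Z-score flags nothing: the CEO salary inflates both the mean and the standard deviation, so its own z-score stays near 2. The IQR fences, built from quartiles that the outlier barely affects, flag it.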

Winsorization: Taming Extreme Values

Winsorization replaces values beyond a chosen percentile with the boundary value.

Process:

  1. Set boundaries at 1st and 99th percentiles
  2. Replace values below P1 with P1; above P99 with P99

Impact on financial data (PE ratios):

Metric | Before | After | Change
Mean | Inflated | Reduced | −18%
Std Dev | Large | Smaller | −62%

Winsorization preserves data (unlike deletion) while reducing outlier influence.
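
The two-step process can be sketched as follows. The `winsorize` helper and the PE values are hypothetical (they are not the data behind the table above), and the percentiles come from `statistics.quantiles` with the `inclusive` method, which is one of several percentile conventions.

```python
from statistics import mean, stdev, quantiles

def winsorize(xs, lower_pct=1, upper_pct=99):
    """Clamp values below the lower percentile or above the upper percentile."""
    cuts = quantiles(xs, n=100, method="inclusive")  # cuts[k-1] is the k-th percentile
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(x, lo), hi) for x in xs]

# Hypothetical PE ratios: 99 ordinary values plus one extreme value.
pe = list(range(10, 109)) + [5000]
pe_w = winsorize(pe)

print(f"n:    {len(pe)} -> {len(pe_w)}")          # no observation is dropped
print(f"max:  {max(pe)} -> {max(pe_w):.2f}")
print(f"mean: {mean(pe):.2f} -> {mean(pe_w):.2f}")
print(f"sd:   {stdev(pe):.2f} -> {stdev(pe_w):.2f}")
```

The sample size is unchanged, which is the practical difference from deleting outliers.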

Data Visualization: The Histogram

The histogram reveals the shape of a distribution.

Key design choices:

  • Number of bins: Too few → lose detail; Too many → noisy
  • Sturges’ rule: number of bins \(k = 1 + \log_2(n)\); simple but assumes normality
  • Freedman-Diaconis rule: bin width \(h = 2 \cdot \text{IQR} \cdot n^{-1/3}\); adapts to spread

Always overlay reference lines:

  • Mean (dashed) and Median (solid) → reveals asymmetry
  • ±1σ bands → shows spread
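
A small sketch of the two bin rules. Sturges' rule gives a bin count directly, while Freedman-Diaconis gives a bin width that is converted into a count by dividing the data range by it. Rounding up to a whole number of bins and the `inclusive` quartile method are choices made here for illustration.

```python
import math
from statistics import quantiles

def sturges_bins(n):
    """Number of bins k = 1 + log2(n), rounded up."""
    return math.ceil(1 + math.log2(n))

def fd_width(xs):
    """Freedman-Diaconis bin width h = 2 * IQR * n**(-1/3)."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    return 2 * (q3 - q1) * len(xs) ** (-1 / 3)

def fd_bins(xs):
    """Number of bins implied by the Freedman-Diaconis width."""
    return math.ceil((max(xs) - min(xs)) / fd_width(xs))

data = list(range(1, 101))   # made-up data: the integers 1..100

print("Sturges bins:", sturges_bins(len(data)))
print(f"FD width:     {fd_width(data):.2f}")
print("FD bins:     ", fd_bins(data))
```

Sturges' rule depends only on \(n\), whereas the Freedman-Diaconis width also shrinks or grows with the IQR of the data.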

The Box Plot: Five-Number Summary in One Picture

[Figure: Anatomy of a box plot, showing the whiskers, Q1 (25th percentile), median (50th percentile), Q3 (75th percentile), IQR = Q3 − Q1, lower fence Q1 − 1.5×IQR, upper fence Q3 + 1.5×IQR, and outliers beyond the fences]

Categorical Data Analysis: Frequency Tables & Charts

For categorical variables, we summarize using:

  • Frequency table: count and relative frequency for each category
  • Bar chart: shows comparison across categories
  • Pie chart: shows proportion of whole (use sparingly)

When to use which:

Chart Type | Best For | Avoid When
Bar chart | Comparing categories | Too many categories (>10)
Pie chart | Showing proportions | Categories are similar in size
Stacked bar | Comparing compositions | More than 5 sub-categories
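
A frequency table with counts and relative frequencies can be built with `collections.Counter` from the Python standard library. The industry labels below are invented for illustration.

```python
from collections import Counter

# Hypothetical industry labels for six companies.
industries = ["Banking", "Tech", "Tech", "Utilities", "Banking", "Tech"]

counts = Counter(industries)
total = sum(counts.values())

# One row per category: (category, count, relative frequency),
# ordered from most to least frequent.
table = [(cat, c, c / total) for cat, c in counts.most_common()]

for cat, c, share in table:
    print(f"{cat:<10} {c:>3} {share:>7.1%}")
```

The same rows can feed a bar chart directly, with the category on one axis and the count or share on the other.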

Bivariate Relationships: Pearson Correlation

The Pearson correlation coefficient measures linear association:

\[ \large{ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} } \]

Value of \(r\) | Interpretation
\(r = +1\) | Perfect positive linear relationship
\(r = -1\) | Perfect negative linear relationship
\(r = 0\) | No linear relationship (but nonlinear may exist!)
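
The formula for \(r\) translates directly into code. The sketch below implements it term by term and checks the three cases in the table, including a perfect nonlinear relationship (\(y = x^2\)) for which \(r\) is exactly zero. The function name `pearson_r` and the data are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation sum over the product of root sums of squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cross / (sqrt(ss_x) * sqrt(ss_y))

x = [-2, -1, 0, 1, 2]

r_positive = pearson_r(x, [2 * v + 1 for v in x])   # y = 2x + 1
r_negative = pearson_r(x, [-v for v in x])          # y = -x
r_parabola = pearson_r(x, [v ** 2 for v in x])      # y = x^2

print(f"{r_positive:+.3f} {r_negative:+.3f} {r_parabola:+.3f}")
```

In the parabola case \(y\) is fully determined by \(x\), yet \(r = 0\), because the positive and negative co-deviations cancel.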

Critical Reminder: Correlation ≠ Causation

Correlations that do not imply direct causation, in finance and beyond:

  • Stock market index ↑ and GDP ↑ → common business cycle driver
  • Stock price ↑ and employee count ↑ → reverse causation (growth drives hiring)
  • Ice cream sales ↑ and drowning deaths ↑ → confounding variable (temperature)

Always ask three questions:

  1. Is there a plausible causal mechanism?
  2. Could a confounding variable explain the association?
  3. Does the direction of causation make sense?

Misleading Graphs: The Truncated Y-Axis Trap

A common technique to exaggerate trends:

Honest Chart | Deceptive Chart
Y-axis starts at 0 | Y-axis starts near minimum
Changes look proportional | Small changes look dramatic
Viewer gets accurate impression | Viewer overreacts

Example: Quarterly sales of [100, 102, 103, 105]:

  • Y from 0: looks nearly flat (accurate: total growth is only 5%)
  • Y from 98: looks like explosive growth (misleading!)

Always check the axis scale when reading a chart.
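
The size of the distortion can be quantified by comparing drawn bar heights, which are measured from wherever the axis starts. The `apparent_growth` helper below is a hypothetical measure introduced for illustration, applied to the quarterly sales example.

```python
sales = [100, 102, 103, 105]

def apparent_growth(values, axis_start):
    """Relative change in drawn bar height from the first bar to the last."""
    first_height = values[0] - axis_start
    last_height = values[-1] - axis_start
    return (last_height - first_height) / first_height

honest = apparent_growth(sales, axis_start=0)
truncated = apparent_growth(sales, axis_start=98)

print(f"Axis from 0:  bars grow by {honest:.0%}")
print(f"Axis from 98: bars grow by {truncated:.0%}")
print(f"Exaggeration factor: {truncated / honest:.0f}x")
```

The data show 5% growth, but with the axis starting at 98 the last bar is drawn 3.5 times as tall as the first.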

Simpson’s Paradox: When Aggregation Reverses the Truth

A trend that appears in subgroups can reverse when groups are combined.

Classic example: Treatment A has higher survival rate in both severe and mild cases, but Treatment B has higher overall survival rate — because B treated more mild cases.

Business implication: Always disaggregate data before drawing conclusions.
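
A worked numeric version of the treatment example makes the reversal concrete. The counts below are hypothetical, chosen only so that Treatment A wins within each severity group while Treatment B wins overall.

```python
# Hypothetical (survived, treated) counts, invented for illustration only.
outcomes = {
    "A": {"mild": (18, 20), "severe": (48, 80)},
    "B": {"mild": (68, 80), "severe": (10, 20)},
}

def rate(survived, treated):
    """Survival rate within one group."""
    return survived / treated

def overall_rate(groups):
    """Survival rate after pooling all severity groups."""
    survived = sum(s for s, _ in groups.values())
    treated = sum(t for _, t in groups.values())
    return survived / treated

for name, groups in outcomes.items():
    mild = rate(*groups["mild"])
    severe = rate(*groups["severe"])
    print(f"Treatment {name}: mild {mild:.0%}, severe {severe:.0%}, "
          f"overall {overall_rate(groups):.0%}")
```

A treats mostly severe cases (80 of 100) and B mostly mild ones (80 of 100), so the pooled rates mainly reflect case mix rather than treatment quality.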

Chapter 2 Summary

Measures of Center:

  • Mean (best for symmetric), Median (best for skewed), Mode (best for categorical)

Measures of Spread:

  • Variance/SD (standard), IQR (robust), CV (comparable across scales)

Distribution Shape:

  • Skewness (asymmetry) and Kurtosis (tail thickness, NOT peakedness)

Data Quality:

  • Detect outliers (Z-score, IQR fences), treat with winsorization

Visualization:

  • Histograms, box plots, scatter plots; beware of misleading graphs