08: Correlation and Simple Regression

From Association to Prediction

Two complementary questions:

| Tool | Question | Output |
|---|---|---|
| Correlation | How strongly are X and Y associated? | A number \(r \in [-1, +1]\) |
| Regression | How does Y change when X changes? | A line \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\) |

They are linked: \(\hat{\beta}_1 = r \cdot \frac{s_Y}{s_X}\)

Financial application: CAPM — \(R_{i,t} - R_{f,t} = \alpha_i + \beta_i(R_{m,t} - R_{f,t}) + \varepsilon_{i,t}\)

The slope \(\beta_i\) measures the stock’s systematic risk.
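As a sketch of how \(\beta_i\) is estimated in practice, the following fits the CAPM regression by OLS on simulated excess returns (all numbers below are synthetic, not from any real stock):

```python
import numpy as np

# Sketch: estimating CAPM alpha and beta by OLS on simulated
# excess returns (synthetic data, 250 trading days).
rng = np.random.default_rng(0)
mkt_excess = rng.normal(0.0005, 0.01, 250)            # market excess returns
true_alpha, true_beta = 0.0001, 1.2
stock_excess = true_alpha + true_beta * mkt_excess + rng.normal(0, 0.005, 250)

# Closed-form OLS: slope = Cov(stock, mkt) / Var(mkt)
beta_hat = np.cov(stock_excess, mkt_excess, ddof=1)[0, 1] / np.var(mkt_excess, ddof=1)
alpha_hat = stock_excess.mean() - beta_hat * mkt_excess.mean()
print(f"beta_hat = {beta_hat:.3f}, alpha_hat = {alpha_hat:.5f}")
```

With 250 observations the estimated beta lands close to the true systematic risk of 1.2.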

Pearson Correlation: Definition

\[r_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}\]

Equivalently: \(r_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}\)

Properties:

  • \(r \in [-1, +1]\)
  • \(r = +1\): perfect positive linear relationship
  • \(r = 0\): no linear relationship (could still be nonlinear!)
  • \(|r|\) measures linear association only
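A minimal sketch of the definition above, computed directly from the sums and checked against NumPy's built-in (synthetic data):

```python
import numpy as np

# Pearson r from the definition, verified against np.corrcoef.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

r = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2)))
print(round(r, 4), round(np.corrcoef(x, y)[0, 1], 4))
```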

Testing Correlation Significance

\(H_0: \rho = 0\) (no linear association in the population)

\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} \quad \text{under } H_0\]

Confidence interval via Fisher’s z-transformation:

\[z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right), \quad \text{SE}(z) = \frac{1}{\sqrt{n-3}}\]

Why Fisher’s z? The sampling distribution of \(r\) is skewed near \(\pm 1\). The z-transformation makes it approximately normal, enabling valid confidence intervals.
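The transform-then-back-transform recipe can be sketched as follows (the values \(r = 0.30\), \(n = 100\) are illustrative; the inverse of the z map is \(\tanh\)):

```python
import numpy as np

# Sketch: 95% CI for a correlation via Fisher's z-transformation.
r, n = 0.30, 100
z = 0.5 * np.log((1 + r) / (1 - r))   # Fisher z (= arctanh(r))
se = 1 / np.sqrt(n - 3)               # SE on the z scale
z_lo, z_hi = z - 1.96 * se, z + 1.96 * se
ci = (np.tanh(z_lo), np.tanh(z_hi))   # back-transform to the r scale
print(f"95% CI for rho: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

Note the interval is asymmetric around \(r\) on the original scale, exactly because the sampling distribution of \(r\) is skewed.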

Beyond Pearson: Rank-Based Correlations

| Measure | Formula | Best For |
|---|---|---|
| Spearman \(\rho_s\) | \(1 - \frac{6\sum d_i^2}{n(n^2-1)}\) | Monotonic (not just linear) relationships |
| Kendall \(\tau\) | \(\frac{C - D}{\binom{n}{2}}\) | Robust to outliers; small samples |

Spearman = Pearson on ranks. Detects any monotonic pattern.

Kendall counts concordant vs. discordant pairs. More interpretable and robust.
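The contrast shows up clearly on a monotonic but nonlinear relationship; here is a sketch using `scipy.stats` on synthetic data, with \(y = e^x\) so the rank-based measures hit 1 while Pearson does not:

```python
import numpy as np
from scipy import stats

# Monotonic but nonlinear: Spearman and Kendall reach 1,
# Pearson is pulled below 1 by the curvature.
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, 200)
y = np.exp(x)                          # perfectly monotonic, far from linear

r_p, _ = stats.pearsonr(x, y)
r_s, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(f"Pearson {r_p:.3f}, Spearman {r_s:.3f}, Kendall {tau:.3f}")
```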

Case: Hikvision Price-Volume Correlation

Data: Hikvision (002415.XSHE), 241 trading days in 2023.

| Measure | Value | \(p\)-value | Significant? |
|---|---|---|---|
| Pearson \(r\) | 0.1197 | 0.064 | No (at 5%) |
| Spearman \(\rho_s\) | 0.2300 | 0.0003 | Yes (at 1%) |

Key insight: Pearson sees a weak, non-significant linear relationship. Spearman picks up a stronger monotonic association between returns and volume changes.

Lesson: The choice of correlation measure can change your conclusions. Always match the measure to your data’s characteristics.

Simple Linear Regression: The Model

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\]

Assumptions (Gauss-Markov):

  1. Linearity: \(E(Y|X) = \beta_0 + \beta_1 X\)
  2. Random sampling
  3. \(E(\varepsilon | X) = 0\) (strict exogeneity)
  4. \(\text{Var}(\varepsilon | X) = \sigma^2\) (homoscedasticity)

Under these conditions, OLS is BLUE (Best Linear Unbiased Estimator).

OLS Estimation: The Formulas

Minimizing \(\sum(Y_i - \hat{Y}_i)^2\) yields:

\[\hat{\beta}_1 = \frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{\sum(X_i-\bar{X})^2}, \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}\]

In matrix notation: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)

Fundamental link: \(\hat{\beta}_1 = r \cdot \frac{s_Y}{s_X}\)

This means the regression slope is the correlation rescaled by the ratio of standard deviations.
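A quick numerical check of this identity on synthetic data:

```python
import numpy as np

# Verify beta1_hat = r * s_Y / s_X on simulated data.
rng = np.random.default_rng(3)
x = rng.normal(size=150)
y = 1.0 + 0.5 * x + rng.normal(scale=0.8, size=150)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)
print(round(beta1, 6), round(slope_from_r, 6))
```

The two computations agree to floating-point precision, since the identity is exact algebraically.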

Model Assessment: R² and Standard Error

\[R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\]

  • \(R^2 = 0\): model explains nothing
  • \(R^2 = 1\): model explains everything

Caution: \(R^2\) never decreases when you add variables (even pure random noise), so a high \(R^2\) alone does not validate a model.

Standard Error of Regression:

\[s_e = \sqrt{\frac{SSE}{n-2}}\]

measures the typical size of the residuals in the original units of Y.

Testing the Slope: Is \(\beta_1\) Significant?

\[t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \sim t_{n-2}\]

where \(SE(\hat{\beta}_1) = \frac{s_e}{\sqrt{\sum(X_i-\bar{X})^2}}\)

Confidence interval: \(\hat{\beta}_1 \pm t_{\alpha/2, n-2} \cdot SE(\hat{\beta}_1)\)

Interpretation: If the confidence interval for \(\beta_1\) excludes zero, we reject \(H_0: \beta_1 = 0\) — there is a statistically significant linear relationship.
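The test and interval can be computed from these formulas directly; a sketch on synthetic data (true slope 1.0, \(n = 50\)):

```python
import numpy as np
from scipy import stats

# Sketch: t-test and 95% CI for the slope of a simple regression.
rng = np.random.default_rng(4)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.0 * x + rng.normal(size=n)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
resid = y - (beta0 + beta1 * x)
s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))           # standard error of regression
se_b1 = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))    # SE of the slope
t = beta1 / se_b1
p = 2 * stats.t.sf(abs(t), df=n - 2)
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (beta1 - t_crit * se_b1, beta1 + t_crit * se_b1)
print(f"t = {t:.2f}, p = {p:.4f}, CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```

The interval excludes zero, so the slope is significant at the 5% level, matching the interpretation above.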

Case: YRD Company Size and Revenue

Data: 1,833 YRD listed companies.

| Metric | Value |
|---|---|
| Model | Revenue = 28.97 + 0.1848 × TotalAssets |
| \(\hat{\beta}_1\) | 0.1848 (SE = 0.0066, \(t\) = 27.81, \(p \approx 0\)) |
| \(R^2\) | 0.2969 |
| \(s_e\) | 91.01 billion CNY |
| HC3 Robust SE | 0.0153 (\(t\) = 12.10) |

Prediction: 100B assets → Revenue = 47.45B (95% CI: [44.82, 50.09])

Key: Standard SE (0.0066) vs. Robust SE (0.0153) — the robust SE is 2.3× larger due to heteroscedasticity.

Dirty Work: Spurious Regression

The trap: Two independent random walks can produce \(R^2\) of 0.6 to 0.9!

Why? Non-stationary time series (unit root processes) both trend over time, creating illusory correlation.

Financial example: Regressing stock price levels on each other → high but meaningless \(R^2\).

Solution:

  • Use returns (first differences) instead of price levels
  • Test for stationarity (ADF test)
  • If both series are I(1), test for cointegration before regression
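The trap is easy to reproduce by simulation; the following sketch regresses one independent random walk on another many times and compares the average \(R^2\) on levels versus on first differences:

```python
import numpy as np

# Sketch: spurious regression of independent random walks.
rng = np.random.default_rng(7)

def r_squared(x, y):
    """R^2 of a simple OLS regression of y on x."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    resid = y - (y.mean() + b1 * (x - x.mean()))
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

levels, diffs = [], []
for _ in range(200):
    x = np.cumsum(rng.normal(size=300))   # independent I(1) random walks
    y = np.cumsum(rng.normal(size=300))
    levels.append(r_squared(x, y))                      # spuriously large
    diffs.append(r_squared(np.diff(x), np.diff(y)))     # near zero
print(f"mean R^2 on levels: {np.mean(levels):.3f}, on differences: {np.mean(diffs):.3f}")
```

Differencing (i.e., working with returns) collapses the illusory fit, as the solution list above prescribes.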

Dirty Work: Heteroscedasticity

Problem: Residual variance changes with \(X\).

Consequences:

  • OLS coefficients are still unbiased
  • But standard errors are invalid
  • Confidence intervals and p-values are wrong

Diagnosis: Plot residuals vs. fitted values — look for a fan shape.

Fix: Use HC3 robust standard errors (cov_type='HC3').

YRD Example: Standard SE(\(\hat{\beta}_1\)) = 0.0066 vs. HC3 SE = 0.0153 (2.3× larger!)

Heuristic: Anscombe’s Quartet

Anscombe's Quartet: four datasets with identical summary statistics (\(r = 0.816\), \(R^2 = 0.667\), fitted line \(\hat{Y} = 3.00 + 0.50X\)) but vastly different scatter plots: I is genuinely linear, II is curved, III is linear with one outlier, and IV is driven by a single leverage point.

Always visualize your data before fitting models! Summary statistics alone can be completely misleading.

Heuristic: Kitchen Sink Regression

Setup: \(n = 50\), Y and all X’s are pure random noise (no true relationship).

| Variables Added | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|
| 1 | 0.002 | ≈ 0 |
| 10 | 0.198 | ≈ 0 |
| 30 | 0.636 | ≈ 0 |
| 48 | 1.000 | ≈ 0 |

Shocking: With 48 predictors and 50 observations, \(R^2 = 1.0\) — a perfect fit to pure noise!

Why? With \(p \approx n\) variables, the model has enough degrees of freedom to memorize every data point.

Lesson: Use Adjusted \(R^2\), which correctly stays near zero throughout.
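The experiment is easy to rerun; a sketch with \(n = 50\) and pure-noise predictors (a fresh random draw, so the exact numbers differ from the table but the pattern is the same):

```python
import numpy as np

# Sketch: R^2 climbs toward 1 as noise predictors are added,
# while adjusted R^2 stays near zero.
rng = np.random.default_rng(9)
n = 50
y = rng.normal(size=n)                  # response is pure noise
X_all = rng.normal(size=(n, 48))        # 48 pure-noise predictors

def fit_r2(X, y):
    """Return (R^2, adjusted R^2) of an OLS fit with intercept."""
    Xc = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    p = X.shape[1]
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj

results = {p: fit_r2(X_all[:, :p], y) for p in (1, 10, 30, 48)}
for p, (r2, adj) in results.items():
    print(f"p = {p:2d}: R^2 = {r2:.3f}, adj R^2 = {adj:.3f}")
```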

Heuristic: Collider Bias

Setup: 5,000 people have beauty and talent — two independent traits.

| Sample | \(r\) | \(p\)-value | Relationship |
|---|---|---|---|
| Full population | 0.002 | 0.881 | None |
| Top 10% selected | −0.727 | ≈ 0 | Strong negative! |

The “Hollywood paradox”: Among celebrities, beautiful people appear less talented, and talented people appear less beautiful.

Mechanism: Selection on the sum of two variables (beauty + talent) creates a spurious negative correlation in the selected subsample.

Lesson: Analyzing only successful firms, published papers, or star performers can produce completely misleading correlations.
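The mechanism can be reproduced in a few lines; a sketch matching the setup above (independent standard-normal beauty and talent, selection on their sum):

```python
import numpy as np

# Sketch: collider bias via selection on the sum of two
# independent traits (n = 5000).
rng = np.random.default_rng(10)
n = 5000
beauty = rng.normal(size=n)
talent = rng.normal(size=n)

r_full = np.corrcoef(beauty, talent)[0, 1]     # ~0 in the full sample

success = beauty + talent                      # the collider
top = success >= np.quantile(success, 0.9)     # select the top 10%
r_top = np.corrcoef(beauty[top], talent[top])[0, 1]
print(f"full sample r = {r_full:.3f}, top-10% r = {r_top:.3f}")
```

The strong negative correlation in the selected subsample appears despite zero correlation in the population, exactly the "Hollywood paradox" described above.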

Chapter Summary

| Concept | Key Takeaway |
|---|---|
| Pearson \(r\) | Measures linear association only |
| Spearman/Kendall | Capture monotonic (possibly nonlinear) patterns |
| Simple regression | \(\hat{\beta}_1 = r \cdot s_Y/s_X\); OLS is BLUE under Gauss-Markov |
| \(R^2\) | Can never decrease with more variables; use Adjusted \(R^2\) |
| Robust SE | Use HC3 when heteroscedasticity is present |
| Spurious regression | Never regress non-stationary levels; use returns or differences |
| Anscombe's Quartet | Always visualize before fitting |
| Collider bias | Selection on an outcome creates false correlations |