08: Correlation and Simple Regression
From Association to Prediction
Two complementary questions:
Correlation
How strongly are X and Y associated?
A number \(r \in [-1, +1]\)
Regression
How does Y change when X changes?
A line \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\)
They are linked: \(\hat{\beta}_1 = r \cdot \frac{s_Y}{s_X}\)
Financial application: CAPM — \(R_{i,t} - R_{f,t} = \alpha_i + \beta_i(R_{m,t} - R_{f,t}) + \varepsilon_{i,t}\)
The slope \(\beta_i\) measures the stock’s systematic risk.
This chapter covers the two most fundamental tools in statistics — correlation and regression. Correlation tells you the strength of association. Regression tells you the functional relationship and allows prediction. These are the building blocks of essentially all quantitative finance, from CAPM to factor models.
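The CAPM regression above can be sketched in a few lines. This is a minimal illustration with simulated excess returns, not real market data; the parameter values (`true_alpha`, `true_beta`) and sample size are arbitrary assumptions.

```python
import numpy as np

# Hypothetical CAPM example: estimate alpha and beta from simulated
# excess returns. Real data would come from a price feed; all values
# here are illustrative.
rng = np.random.default_rng(42)
n = 252                                    # one year of daily returns
mkt_excess = rng.normal(0.0004, 0.01, n)   # market excess return R_m - R_f
true_alpha, true_beta = 0.0001, 1.2
stock_excess = true_alpha + true_beta * mkt_excess + rng.normal(0, 0.008, n)

# OLS of stock excess returns on [1, market excess returns]
X = np.column_stack([np.ones(n), mkt_excess])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, stock_excess, rcond=None)
print(f"alpha = {alpha_hat:.5f}, beta = {beta_hat:.3f}")
```

With a year of daily data, the slope estimate lands close to the true beta of 1.2.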
Pearson Correlation: Definition
\[r_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}\]
Equivalently: \(r_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}\)
Properties:
\(r \in [-1, +1]\)
\(r = +1\): perfect positive linear relationship
\(r = -1\): perfect negative linear relationship
\(r = 0\): no linear relationship (could still be nonlinear!)
\(|r|\) measures the strength of linear association only
Pearson’s r is the standardized covariance. By dividing covariance by the product of standard deviations, we get a dimensionless measure that always falls between negative one and positive one. But remember — it only measures linear relationships. A perfect U-shaped relationship can have r close to zero.
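A quick sketch makes both points concrete: the definitional formula matches `np.corrcoef`, and a perfect U-shape yields \(r = 0\). The data values are made up for illustration.

```python
import numpy as np

# Pearson's r from the definition, checked against np.corrcoef.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def pearson_r(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

r = pearson_r(x, y)
print(r, np.corrcoef(x, y)[0, 1])   # identical values

# Perfect U-shape: y = x^2 on symmetric x gives r = 0 exactly,
# even though y is a deterministic function of x.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(pearson_r(xs, xs**2))   # 0.0
```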
Testing Correlation Significance
\(H_0: \rho = 0\) (no linear association in the population)
\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}\]
Confidence interval via Fisher’s z-transformation:
\[z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right), \quad \text{SE}(z) = \frac{1}{\sqrt{n-3}}\]
Why Fisher’s z? The sampling distribution of \(r\) is skewed near \(\pm 1\). The z-transformation makes it approximately normal, enabling valid confidence intervals.
Testing whether a correlation is zero uses a t-test. But building confidence intervals requires Fisher’s ingenious z-transformation. The problem is that r has a highly skewed sampling distribution when the true correlation is far from zero — for example, if the true rho is 0.9, r can’t go above 1 but has lots of room below. The z-transformation symmetrizes this distribution.
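Both procedures fit in a few lines. The values of `r` and `n` below are assumed for illustration; note that `np.arctanh` is exactly Fisher's \(\frac{1}{2}\ln\frac{1+r}{1-r}\).

```python
import numpy as np
from scipy import stats

# Sketch: 95% CI for rho via Fisher's z (r and n are assumed values).
r, n = 0.45, 100
z = np.arctanh(r)                  # Fisher's z-transformation
se = 1.0 / np.sqrt(n - 3)
zcrit = stats.norm.ppf(0.975)
lo, hi = np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)
print(f"95% CI for rho: [{lo:.3f}, {hi:.3f}]")

# Companion t-test of H0: rho = 0
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, p = {p:.4f}")
```

Transforming to z, building a symmetric normal interval, then mapping back with `tanh` is what produces the (properly asymmetric) interval for \(\rho\).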
Beyond Pearson: Rank-Based Correlations
Spearman \(\rho_s\)
\(\rho_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}\), where \(d_i\) is the difference between the ranks of \(X_i\) and \(Y_i\)
Monotonic (not just linear) relationships
Kendall \(\tau\)
\(\tau = \frac{C - D}{\binom{n}{2}}\), where \(C\) and \(D\) count concordant and discordant pairs
Robust to outliers; works well in small samples
Spearman = Pearson on ranks. Detects any monotonic pattern.
Kendall counts concordant vs. discordant pairs. More interpretable and robust.
When the relationship is monotonic but not linear — for instance, diminishing marginal returns — Pearson’s r underestimates the true association. Spearman’s rho fixes this by converting data to ranks first. Kendall’s tau is even more robust and works well with small samples and tied observations. In financial data with outliers and non-normal distributions, rank-based correlations are often more reliable.
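The diminishing-returns case is easy to demonstrate with `scipy.stats`. Here the monotonic but concave relationship \(y = \log x\) is assumed purely for illustration: the rank-based measures score it as a perfect association while Pearson's r does not.

```python
import numpy as np
from scipy import stats

# Monotonic but nonlinear relationship: Spearman and Kendall see a
# perfect association, Pearson does not.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1, 100, 50))
y = np.log(x)                      # strictly increasing, concave

r_p, _ = stats.pearsonr(x, y)
r_s, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(f"Pearson  r  = {r_p:.3f}")
print(f"Spearman rho = {r_s:.3f}")   # exactly 1: ranks agree perfectly
print(f"Kendall  tau = {tau:.3f}")   # exactly 1: all pairs concordant
```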
Case: Hikvision Price-Volume Correlation
Data: Hikvision (002415.XSHE), 241 trading days in 2023.
Pearson \(r\): 0.1197 (p = 0.064), not significant at 5%
Spearman \(\rho_s\): 0.2300 (p = 0.0003), significant at 1%
Key insight: Pearson sees a weak, non-significant linear relationship. Spearman picks up a stronger monotonic association between returns and volume changes.
Lesson: The choice of correlation measure can change your conclusions. Always match the measure to your data’s characteristics.
This Hikvision example perfectly illustrates why you need multiple correlation measures. The price-volume relationship is real but not perfectly linear — large volume spikes tend to accompany large price moves, but the relationship is nonlinear. Pearson misses part of this because it only looks for linear patterns. Spearman, working with ranks, captures the full monotonic association.
Model Assessment: R² and Standard Error
\[R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\]
\(R^2 = 0\): model explains nothing
\(R^2 = 1\): model explains everything
Misleading! \(R^2\) never decreases as you add variables (even random noise).
Standard Error of Regression:
\[s_e = \sqrt{\frac{SSE}{n-2}}\]
measures the average prediction error in the original units of Y.
R-squared is the proportion of total variation in Y that is explained by the model. It’s the most commonly reported fit statistic, but it has a serious flaw — it can never decrease when you add variables, even useless ones. We’ll see this dramatically in the Kitchen Sink Regression heuristic later. The standard error of regression is often more useful because it tells you, in practical terms, how far off your predictions typically are.
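Both quantities are straightforward to compute by hand, and in simple regression \(R^2\) equals the squared Pearson correlation, tying this slide back to the last one. The data below are simulated purely for illustration.

```python
import numpy as np

# Computing R^2 and the regression standard error s_e directly
# (synthetic data; any (x, y) sample works the same way).
rng = np.random.default_rng(1)
n = 60
x = rng.normal(0, 2, n)
y = 3.0 + 0.5 * x + rng.normal(0, 1, n)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

sse = np.sum(resid**2)
sst = np.sum((y - y.mean())**2)
r2 = 1 - sse / sst
s_e = np.sqrt(sse / (n - 2))       # typical prediction error, in units of y
print(f"R^2 = {r2:.3f}, s_e = {s_e:.3f}")

# In simple regression, R^2 is exactly the squared correlation
print(np.corrcoef(x, y)[0, 1] ** 2)
```

Since the noise standard deviation here is 1, \(s_e\) comes out near 1, directly recovering the error scale in the original units.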
Testing the Slope: Is \(\beta_1\) Significant?
\[t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \sim t_{n-2}\]
where \(SE(\hat{\beta}_1) = \frac{s_e}{\sqrt{\sum(X_i-\bar{X})^2}}\)
Confidence interval: \(\hat{\beta}_1 \pm t_{\alpha/2, n-2} \cdot SE(\hat{\beta}_1)\)
Interpretation: If the confidence interval for \(\beta_1\) excludes zero, we reject \(H_0: \beta_1 = 0\) — there is a statistically significant linear relationship.
The t-test for the slope asks whether the observed relationship could have arisen by chance. The standard error of the slope depends on two things: how noisy the data is (measured by the standard error of regression) and how spread out the X values are. More spread in X gives more information about the slope, leading to smaller standard errors and more powerful tests.
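The slope test can be assembled directly from the formulas above. The simulated data (true slope 0.8, noise sd 2) are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Slope standard error, t statistic, and 95% CI for a simple regression
# (synthetic data; the true slope is 0.8 by construction).
rng = np.random.default_rng(7)
n = 40
x = rng.normal(10, 3, n)
y = 2.0 + 0.8 * x + rng.normal(0, 2, n)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_e = np.sqrt(np.sum(resid**2) / (n - 2))

# SE(b1) shrinks as the spread of x grows
se_b1 = s_e / np.sqrt(np.sum((x - x.mean())**2))
t = b1 / se_b1
p = 2 * stats.t.sf(abs(t), df=n - 2)
tcrit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)
print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, t = {t:.2f}, p = {p:.2g}")
print(f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

The denominator \(\sqrt{\sum(x_i-\bar{x})^2}\) is where the "more spread in X" intuition enters: doubling the spread of x halves the slope's standard error.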
Case: YRD Company Size and Revenue
Data: 1,833 YRD listed companies.
Model: Revenue = 28.97 + 0.1848 × TotalAssets
\(\hat{\beta}_1\) = 0.1848 (SE = 0.0066, \(t\) = 27.81, \(p \approx 0\))
\(R^2\) = 0.2969
\(s_e\) = 91.01 billion CNY
HC3 Robust SE = 0.0153 (\(t\) = 12.10)
Prediction: 100B assets → Revenue = 47.45B (95% CI: [44.82, 50.09])
Key: Standard SE (0.0066) vs. Robust SE (0.0153) — the robust SE is 2.3× larger due to heteroscedasticity.
This case study uses real data from 1,833 Yangtze River Delta companies. Each additional 1 billion yuan in total assets is associated with 0.18 billion yuan in additional revenue. The R-squared of 0.297 means asset size explains about 30% of revenue variation. But notice the difference between standard and robust standard errors — the robust SE is more than double. This tells us heteroscedasticity is a serious issue, and we’d be overconfident if we used the standard SE.
Dirty Work: Spurious Regression
The trap: Two independent random walks can produce \(R^2\) of 0.6 to 0.9!
Why? Non-stationary time series (unit root processes) both trend over time, creating illusory correlation.
Financial example: Regressing stock price levels on each other → high but meaningless \(R^2\).
Solution:
Use returns (first differences) instead of price levels
Test for stationarity (ADF test)
If both series are I(1), test for cointegration before regression
Spurious regression is one of the most dangerous traps in financial time series analysis. If you regress one stock’s price on another’s, you’ll almost certainly get a highly significant relationship with R-squared above 0.5 — even if the stocks are completely unrelated. The reason is that both price series are non-stationary: each accumulates shocks into a stochastic trend, so any two such series appear related over a sample. The solution is simple: use returns instead of prices. Returns are approximately stationary, so an association between them is far less likely to be spurious.
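The mechanism can be demonstrated with a small Monte Carlo sketch (numpy only; the seed, sample size, and number of trials are arbitrary choices):

```python
import numpy as np

# Average R^2 from regressing one random walk on an independent one,
# versus regressing their first differences (the "returns").
rng = np.random.default_rng(3)
n, trials = 500, 200

def avg_r2(use_levels):
    vals = []
    for _ in range(trials):
        a = np.cumsum(rng.normal(size=n))   # independent random walks
        b = np.cumsum(rng.normal(size=n))
        if not use_levels:
            a, b = np.diff(a), np.diff(b)   # difference away the stochastic trend
        vals.append(np.corrcoef(a, b)[0, 1] ** 2)
    return float(np.mean(vals))

r2_levels, r2_returns = avg_r2(True), avg_r2(False)
print("average R^2, levels :", r2_levels)   # sizeable despite independence
print("average R^2, returns:", r2_returns)  # near zero
```

Averaging over many simulated pairs shows the effect is systematic, not a fluke of one seed: the level regressions pick up shared stochastic trends that differencing removes.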
Dirty Work: Heteroscedasticity
Problem: Residual variance changes with \(X\).
Consequences:
OLS coefficients are still unbiased ✓
But standard errors are invalid ✗
Confidence intervals and p-values are wrong ✗
Diagnosis: Plot residuals vs. fitted values — look for a fan shape.
Fix: Use HC3 robust standard errors (cov_type='HC3').
YRD Example: Standard SE(\(\hat{\beta}_1\)) = 0.0066 vs. HC3 SE = 0.0153 (2.3× larger!)
Heteroscedasticity means the spread of residuals is not constant — it’s often larger for bigger companies. The good news is that OLS coefficients are still unbiased. The bad news is that the standard errors are wrong, which means your p-values and confidence intervals are unreliable. In the YRD regression, the robust standard error is 2.3 times larger than the ordinary one. Using the ordinary SE would lead us to think we know the slope much more precisely than we actually do.
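A minimal numpy sketch of the HC3 estimator, using simulated data whose error variance grows with x (an assumed setup mimicking the YRD pattern). In practice, statsmodels' `fit(cov_type='HC3')` computes this same estimator.

```python
import numpy as np

# HC3 robust standard errors computed directly. Synthetic data where
# the error standard deviation is proportional to x.
rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * x)   # heteroscedastic errors

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # leverage values h_ii

# Classical OLS covariance (assumes constant variance)
s2 = resid @ resid / (n - 2)
se_classic = np.sqrt(np.diag(s2 * XtX_inv))

# HC3: weight each squared residual by 1/(1 - h_ii)^2
omega = resid**2 / (1 - h)**2
cov_hc3 = XtX_inv @ (X.T * omega) @ X @ XtX_inv
se_hc3 = np.sqrt(np.diag(cov_hc3))
print(f"slope SE: classic = {se_classic[1]:.4f}, HC3 = {se_hc3[1]:.4f}")
```

With variance rising in x, the robust SE for the slope comes out larger than the classical one, just as in the YRD regression.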
Heuristic: Anscombe’s Quartet
[Figure: Anscombe’s Quartet. Four scatter plots with identical summary statistics (\(r = 0.816\), \(R^2 = 0.667\), \(\hat{Y} = 3.00 + 0.50X\)) but vastly different visual patterns: I linear, II curved, III outlier, IV leverage point.]
Always visualize your data before fitting models!
Summary statistics alone can be completely misleading.
Anscombe’s Quartet is perhaps the most famous demonstration in all of statistics. Four completely different datasets produce identical summary statistics — same mean, same variance, same correlation, same regression line. But visually they’re totally different: one is a normal linear relationship, one is curved, one has an outlier, and one has a single leverage point driving the entire regression. The lesson is simple and profound: always plot your data.
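The claim of identical statistics is easy to verify from Anscombe's published data values:

```python
import numpy as np

# Verifying Anscombe's numbers directly from the published quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {}
for name, (x, y) in quartet.items():
    x, y = np.array(x, float), np.array(y, float)
    r = np.corrcoef(x, y)[0, 1]
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    results[name] = (r, b0, b1)
    print(f"{name:>3}: r = {r:.3f}, fit y = {b0:.2f} + {b1:.2f} x")
```

All four panels print \(r \approx 0.816\) and fit \(y \approx 3.00 + 0.50x\) despite their wildly different shapes; only a plot reveals the difference.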
Heuristic: Kitchen Sink Regression
Setup: \(n = 50\); Y and all X’s are pure random noise (no true relationship).
\(k = 1\) predictor: \(R^2 = 0.002\), Adjusted \(R^2 \approx 0\)
\(k = 10\): \(R^2 = 0.198\), Adjusted \(R^2 \approx 0\)
\(k = 30\): \(R^2 = 0.636\), Adjusted \(R^2 \approx 0\)
\(k = 48\): \(R^2 = 1.000\), Adjusted \(R^2 \approx 0\)
Shocking: With 48 predictors and 50 observations, \(R^2 = 1.0\) — a perfect fit to pure noise!
Why? With \(p \approx n\) variables, the model has enough degrees of freedom to memorize every data point.
Lesson: Use Adjusted \(R^2\) , which correctly stays near zero throughout.
This is a profound and disturbing demonstration. With 50 observations and 48 random predictors, R-squared is literally 1.0 — a perfect fit to pure noise. The model has essentially memorized the entire dataset. But Adjusted R-squared sees through this illusion and stays near zero, correctly signaling no real explanatory power. This is why we never judge a model by R-squared alone, and why adding more variables without theoretical justification is dangerous.
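The experiment is easy to replicate. This sketch regresses pure noise on k pure-noise predictors; the exact numbers depend on the random draw and will differ from the table's, but the pattern of \(R^2\) climbing toward 1 while adjusted \(R^2\) pays the degrees-of-freedom penalty is the same.

```python
import numpy as np

# Kitchen-sink regression: y and all predictors are independent noise,
# yet R^2 climbs toward 1 as k approaches n.
rng = np.random.default_rng(11)
n = 50
y = rng.normal(size=n)

def fit_r2(k):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - np.sum((y - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # degrees-of-freedom penalty
    return r2, adj

results = {k: fit_r2(k) for k in (1, 10, 30, 48)}
for k, (r2, adj) in results.items():
    print(f"k = {k:2d}: R^2 = {r2:.3f}, adjusted R^2 = {adj:+.3f}")
```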
Heuristic: Collider Bias
Setup: 5,000 people have beauty and talent — two independent traits.
Full population: \(r = 0.002\) (p = 0.881), no association
Top 10% selected: \(r = -0.727\) (\(p \approx 0\)), strong negative!
The “Hollywood paradox”: Among celebrities, beautiful people appear less talented, and talented people appear less beautiful.
Mechanism: Selection on the sum of two variables (beauty + talent) creates a spurious negative correlation in the selected subsample.
Lesson: Analyzing only successful firms, published papers, or star performers can produce completely misleading correlations.
This is collider bias in action. In the full population, beauty and talent are independent — their correlation is essentially zero. But if you select only people who became famous — which requires either great beauty OR great talent OR some of both — you create a spurious negative correlation. Among the selected, someone with high beauty ‘needed’ less talent to get in, and vice versa. This exact phenomenon occurs when you study only listed companies, only published studies, or only successful funds.
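The whole mechanism fits in a short simulation (seed and selection threshold are arbitrary choices; exact correlations vary by draw but the sign flip is systematic):

```python
import numpy as np

# Collider-bias simulation: beauty and talent are independent in the
# full population, negatively correlated after selecting on their sum.
rng = np.random.default_rng(8)
n = 5000
beauty = rng.normal(size=n)
talent = rng.normal(size=n)

r_full = np.corrcoef(beauty, talent)[0, 1]

fame = beauty + talent                 # selection is on the sum
cut = np.quantile(fame, 0.9)           # top 10% become "celebrities"
sel = fame >= cut
r_sel = np.corrcoef(beauty[sel], talent[sel])[0, 1]

print(f"full population : r = {r_full:+.3f}")   # essentially zero
print(f"top 10% selected: r = {r_sel:+.3f}")    # clearly negative
```

Nothing causal links the two traits; conditioning on the collider (fame) alone manufactures the negative correlation.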
Chapter Summary
Pearson \(r\): measures linear association only
Spearman/Kendall: capture monotonic and nonlinear patterns
Simple regression: \(\hat{\beta}_1 = r \cdot s_Y/s_X\); OLS is BLUE under the Gauss-Markov assumptions
\(R^2\): can never decrease with more variables — use Adjusted \(R^2\)
Robust SE: always use HC3 when heteroscedasticity is present
Spurious regression: never regress non-stationary levels — use returns
Anscombe’s Quartet: always visualize before fitting
Collider bias: selection on the outcome creates false correlations
The key messages from this chapter: correlation and regression are two sides of the same coin, linked by the formula beta equals r times the ratio of standard deviations. Always go beyond Pearson with rank-based measures. Always visualize your data, as Anscombe’s Quartet teaches. Always use robust standard errors. And beware of the many traps — spurious regression from non-stationarity, kitchen sink overfitting, and collider bias from sample selection.