08: Correlation and Simple Regression
From Association to Prediction
Two complementary questions:
Correlation
How strongly are X and Y associated?
A number \(r \in [-1, +1]\)
Regression
How does Y change when X changes?
A line \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X\)
They are linked: \(\hat{\beta}_1 = r \cdot \frac{s_Y}{s_X}\)
Financial application: CAPM — \(R_{i,t} - R_{f,t} = \alpha_i + \beta_i(R_{m,t} - R_{f,t}) + \varepsilon_{i,t}\)
The slope \(\beta_i\) measures the stock’s systematic risk.
This chapter covers the two most fundamental tools in statistics — correlation and regression. Correlation tells you the strength of association. Regression tells you the functional relationship and allows prediction. These are the building blocks of essentially all quantitative finance, from CAPM to factor models.
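The CAPM regression above can be sketched in a few lines. This is a minimal illustration with simulated excess returns, not real market data; the parameter values (`true_alpha`, `true_beta`) and sample size are arbitrary assumptions.

```python
import numpy as np

# Hypothetical CAPM example: estimate alpha and beta from simulated
# excess returns. Real data would come from a price feed; all values
# here are illustrative.
rng = np.random.default_rng(42)
n = 252                                    # one year of daily returns
mkt_excess = rng.normal(0.0004, 0.01, n)   # market excess return R_m - R_f
true_alpha, true_beta = 0.0001, 1.2
stock_excess = true_alpha + true_beta * mkt_excess + rng.normal(0, 0.008, n)

# OLS of stock excess returns on [1, market excess returns]
X = np.column_stack([np.ones(n), mkt_excess])
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(X, stock_excess, rcond=None)
print(f"alpha = {alpha_hat:.5f}, beta = {beta_hat:.3f}")
```

With a year of daily data, the slope estimate lands close to the true beta of 1.2.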
Pearson Correlation: Definition
\[r_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}\]
Equivalently: \(r_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}\)
Properties:
\(r \in [-1, +1]\)
\(r = +1\): perfect positive linear relationship
\(r = -1\): perfect negative linear relationship
\(r = 0\): no linear relationship (could still be nonlinear!)
\(|r|\) measures the strength of linear association only
Pearson’s r is the standardized covariance. By dividing covariance by the product of standard deviations, we get a dimensionless measure that always falls between negative one and positive one. But remember — it only measures linear relationships. A perfect U-shaped relationship can have r close to zero.
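A quick sketch makes both points concrete: the definitional formula matches `np.corrcoef`, and a perfect U-shape yields \(r = 0\). The data values are made up for illustration.

```python
import numpy as np

# Pearson's r from the definition, checked against np.corrcoef.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def pearson_r(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

r = pearson_r(x, y)
print(r, np.corrcoef(x, y)[0, 1])   # identical values

# Perfect U-shape: y = x^2 on symmetric x gives r = 0 exactly,
# even though y is a deterministic function of x.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(pearson_r(xs, xs**2))   # 0.0
```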
Testing Correlation Significance
\(H_0: \rho = 0\) (no linear association in the population)
\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}\]
Confidence interval via Fisher’s z-transformation:
\[z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right), \quad \text{SE}(z) = \frac{1}{\sqrt{n-3}}\]
Why Fisher’s z? The sampling distribution of \(r\) is skewed near \(\pm 1\). The z-transformation makes it approximately normal, enabling valid confidence intervals.
Testing whether a correlation is zero uses a t-test. But building confidence intervals requires Fisher’s ingenious z-transformation. The problem is that r has a highly skewed sampling distribution when the true correlation is far from zero — for example, if the true rho is 0.9, r can’t go above 1 but has lots of room below. The z-transformation symmetrizes this distribution.
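Both procedures fit in a few lines. The values of `r` and `n` below are assumed for illustration; note that `np.arctanh` is exactly Fisher's \(\frac{1}{2}\ln\frac{1+r}{1-r}\).

```python
import numpy as np
from scipy import stats

# Sketch: 95% CI for rho via Fisher's z (r and n are assumed values).
r, n = 0.45, 100
z = np.arctanh(r)                  # Fisher's z-transformation
se = 1.0 / np.sqrt(n - 3)
zcrit = stats.norm.ppf(0.975)
lo, hi = np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)
print(f"95% CI for rho: [{lo:.3f}, {hi:.3f}]")

# Companion t-test of H0: rho = 0
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, p = {p:.4f}")
```

Transforming to z, building a symmetric normal interval, then mapping back with `tanh` is what produces the (properly asymmetric) interval for \(\rho\).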
Beyond Pearson: Rank-Based Correlations
Spearman \(\rho_s\)
\(\rho_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}\), where \(d_i\) is the difference between the ranks of \(X_i\) and \(Y_i\)
Monotonic (not just linear) relationships
Kendall \(\tau\)
\(\tau = \frac{C - D}{\binom{n}{2}}\), where \(C\) and \(D\) count concordant and discordant pairs
Robust to outliers; works well in small samples
Spearman = Pearson on ranks. Detects any monotonic pattern.
Kendall counts concordant vs. discordant pairs. More interpretable and robust.
When the relationship is monotonic but not linear — for instance, diminishing marginal returns — Pearson’s r underestimates the true association. Spearman’s rho fixes this by converting data to ranks first. Kendall’s tau is even more robust and works well with small samples and tied observations. In financial data with outliers and non-normal distributions, rank-based correlations are often more reliable.
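The diminishing-returns case is easy to demonstrate with `scipy.stats`. Here the monotonic but concave relationship \(y = \log x\) is assumed purely for illustration: the rank-based measures score it as a perfect association while Pearson's r does not.

```python
import numpy as np
from scipy import stats

# Monotonic but nonlinear relationship: Spearman and Kendall see a
# perfect association, Pearson does not.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1, 100, 50))
y = np.log(x)                      # strictly increasing, concave

r_p, _ = stats.pearsonr(x, y)
r_s, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(f"Pearson  r  = {r_p:.3f}")
print(f"Spearman rho = {r_s:.3f}")   # exactly 1: ranks agree perfectly
print(f"Kendall  tau = {tau:.3f}")   # exactly 1: all pairs concordant
```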
Case: Hikvision Price-Volume Correlation
Data: Hikvision (002415.XSHE), 241 trading days in 2023.
Pearson \(r\): 0.1197 (p = 0.064), not significant at 5%
Spearman \(\rho_s\): 0.2300 (p = 0.0003), significant at 1%
Key insight: Pearson sees a weak, non-significant linear relationship. Spearman picks up a stronger monotonic association between returns and volume changes.
Lesson: The choice of correlation measure can change your conclusions. Always match the measure to your data’s characteristics.
This Hikvision example perfectly illustrates why you need multiple correlation measures. The price-volume relationship is real but not perfectly linear — large volume spikes tend to accompany large price moves, but the relationship is nonlinear. Pearson misses part of this because it only looks for linear patterns. Spearman, working with ranks, captures the full monotonic association.
Model Assessment: R² and Standard Error
\[R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\]
\(R^2 = 0\): model explains nothing
\(R^2 = 1\): model explains everything
Misleading! \(R^2\) never decreases as you add variables (even random noise).
Standard Error of Regression:
\[s_e = \sqrt{\frac{SSE}{n-2}}\]
measures the average prediction error in the original units of Y.
R-squared is the proportion of total variation in Y that is explained by the model. It’s the most commonly reported fit statistic, but it has a serious flaw — it can never decrease when you add variables, even useless ones. We’ll see this dramatically in the Kitchen Sink Regression heuristic later. The standard error of regression is often more useful because it tells you, in practical terms, how far off your predictions typically are.
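Both quantities are straightforward to compute by hand, and in simple regression \(R^2\) equals the squared Pearson correlation, tying this slide back to the last one. The data below are simulated purely for illustration.

```python
import numpy as np

# Computing R^2 and the regression standard error s_e directly
# (synthetic data; any (x, y) sample works the same way).
rng = np.random.default_rng(1)
n = 60
x = rng.normal(0, 2, n)
y = 3.0 + 0.5 * x + rng.normal(0, 1, n)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

sse = np.sum(resid**2)
sst = np.sum((y - y.mean())**2)
r2 = 1 - sse / sst
s_e = np.sqrt(sse / (n - 2))       # typical prediction error, in units of y
print(f"R^2 = {r2:.3f}, s_e = {s_e:.3f}")

# In simple regression, R^2 is exactly the squared correlation
print(np.corrcoef(x, y)[0, 1] ** 2)
```

Since the noise standard deviation here is 1, \(s_e\) comes out near 1, directly recovering the error scale in the original units.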
Testing the Slope: Is \(\beta_1\) Significant?
\[t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \sim t_{n-2}\]
where \(SE(\hat{\beta}_1) = \frac{s_e}{\sqrt{\sum(X_i-\bar{X})^2}}\)
Confidence interval: \(\hat{\beta}_1 \pm t_{\alpha/2, n-2} \cdot SE(\hat{\beta}_1)\)
Interpretation: If the confidence interval for \(\beta_1\) excludes zero, we reject \(H_0: \beta_1 = 0\) — there is a statistically significant linear relationship.
The t-test for the slope asks whether the observed relationship could have arisen by chance. The standard error of the slope depends on two things: how noisy the data is (measured by the standard error of regression) and how spread out the X values are. More spread in X gives more information about the slope, leading to smaller standard errors and more powerful tests.
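The slope test can be assembled directly from the formulas above. The simulated data (true slope 0.8, noise sd 2) are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Slope standard error, t statistic, and 95% CI for a simple regression
# (synthetic data; the true slope is 0.8 by construction).
rng = np.random.default_rng(7)
n = 40
x = rng.normal(10, 3, n)
y = 2.0 + 0.8 * x + rng.normal(0, 2, n)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
s_e = np.sqrt(np.sum(resid**2) / (n - 2))

# SE(b1) shrinks as the spread of x grows
se_b1 = s_e / np.sqrt(np.sum((x - x.mean())**2))
t = b1 / se_b1
p = 2 * stats.t.sf(abs(t), df=n - 2)
tcrit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)
print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, t = {t:.2f}, p = {p:.2g}")
print(f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
```

The denominator \(\sqrt{\sum(x_i-\bar{x})^2}\) is where the "more spread in X" intuition enters: doubling the spread of x halves the slope's standard error.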
Case: YRD Company Size and Revenue
Data: 1,833 YRD listed companies.
Model: Revenue = 28.97 + 0.1848 × TotalAssets
\(\hat{\beta}_1\) = 0.1848 (SE = 0.0066, \(t\) = 27.81, \(p \approx 0\))
\(R^2\) = 0.2969
\(s_e\) = 91.01 billion CNY
HC3 Robust SE = 0.0153 (\(t\) = 12.10)
Prediction: 100B assets → Revenue = 47.45B (95% CI: [44.82, 50.09])
Key: Standard SE (0.0066) vs. Robust SE (0.0153) — the robust SE is 2.3× larger due to heteroscedasticity.
This case study uses real data from 1,833 Yangtze River Delta companies. Each additional 1 billion yuan in total assets is associated with 0.18 billion yuan in additional revenue. The R-squared of 0.297 means asset size explains about 30% of revenue variation. But notice the difference between standard and robust standard errors — the robust SE is more than double. This tells us heteroscedasticity is a serious issue, and we’d be overconfident if we used the standard SE.
Dirty Work: Spurious Regression
The trap: Two independent random walks can produce \(R^2\) of 0.6 to 0.9!
Why? Non-stationary time series (unit root processes) both trend over time, creating illusory correlation.
Financial example: Regressing stock price levels on each other → high but meaningless \(R^2\).
Solution:
Use returns (first differences) instead of price levels
Test for stationarity (ADF test)
If both series are I(1), test for cointegration before regression
Spurious regression is one of the most dangerous traps in financial time series analysis. If you regress one stock’s price on another’s, you’ll almost certainly get a highly significant relationship with R-squared above 0.5 — even if the stocks are completely unrelated. The reason is that both price series are non-stationary: each accumulates shocks into a stochastic trend, so any two such series appear related over a sample. The solution is simple: use returns instead of prices. Returns are approximately stationary, so an association between them is far less likely to be spurious.
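The mechanism can be demonstrated with a small Monte Carlo sketch (numpy only; the seed, sample size, and number of trials are arbitrary choices):

```python
import numpy as np

# Average R^2 from regressing one random walk on an independent one,
# versus regressing their first differences (the "returns").
rng = np.random.default_rng(3)
n, trials = 500, 200

def avg_r2(use_levels):
    vals = []
    for _ in range(trials):
        a = np.cumsum(rng.normal(size=n))   # independent random walks
        b = np.cumsum(rng.normal(size=n))
        if not use_levels:
            a, b = np.diff(a), np.diff(b)   # difference away the stochastic trend
        vals.append(np.corrcoef(a, b)[0, 1] ** 2)
    return float(np.mean(vals))

r2_levels, r2_returns = avg_r2(True), avg_r2(False)
print("average R^2, levels :", r2_levels)   # sizeable despite independence
print("average R^2, returns:", r2_returns)  # near zero
```

Averaging over many simulated pairs shows the effect is systematic, not a fluke of one seed: the level regressions pick up shared stochastic trends that differencing removes.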
Dirty Work: Heteroscedasticity
Problem: Residual variance changes with \(X\).
Consequences:
OLS coefficients are still unbiased ✓
But standard errors are invalid ✗
Confidence intervals and p-values are wrong ✗
Diagnosis: Plot residuals vs. fitted values — look for a fan shape.
Fix: Use HC3 robust standard errors (cov_type='HC3').
YRD Example: Standard SE(\(\hat{\beta}_1\)) = 0.0066 vs. HC3 SE = 0.0153 (2.3× larger!)
Heteroscedasticity means the spread of residuals is not constant — it’s often larger for bigger companies. The good news is that OLS coefficients are still unbiased. The bad news is that the standard errors are wrong, which means your p-values and confidence intervals are unreliable. In the YRD regression, the robust standard error is 2.3 times larger than the ordinary one. Using the ordinary SE would lead us to think we know the slope much more precisely than we actually do.
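A minimal numpy sketch of the HC3 estimator, using simulated data whose error variance grows with x (an assumed setup mimicking the YRD pattern). In practice, statsmodels' `fit(cov_type='HC3')` computes this same estimator.

```python
import numpy as np

# HC3 robust standard errors computed directly. Synthetic data where
# the error standard deviation is proportional to x.
rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3 * x)   # heteroscedastic errors

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # leverage values h_ii

# Classical OLS covariance (assumes constant variance)
s2 = resid @ resid / (n - 2)
se_classic = np.sqrt(np.diag(s2 * XtX_inv))

# HC3: weight each squared residual by 1/(1 - h_ii)^2
omega = resid**2 / (1 - h)**2
cov_hc3 = XtX_inv @ (X.T * omega) @ X @ XtX_inv
se_hc3 = np.sqrt(np.diag(cov_hc3))
print(f"slope SE: classic = {se_classic[1]:.4f}, HC3 = {se_hc3[1]:.4f}")
```

With variance rising in x, the robust SE for the slope comes out larger than the classical one, just as in the YRD regression.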
Heuristic: Anscombe’s Quartet
[Figure: Anscombe’s Quartet. Four scatter plots with identical summary statistics (\(r = 0.816\), \(R^2 = 0.667\), \(\hat{Y} = 3.00 + 0.50X\)) but vastly different visual patterns: I linear, II curved, III outlier, IV leverage point.]
Always visualize your data before fitting models!
Summary statistics alone can be completely misleading.
Anscombe’s Quartet is perhaps the most famous demonstration in all of statistics. Four completely different datasets produce identical summary statistics — same mean, same variance, same correlation, same regression line. But visually they’re totally different: one is a normal linear relationship, one is curved, one has an outlier, and one has a single leverage point driving the entire regression. The lesson is simple and profound: always plot your data.
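The claim of identical statistics is easy to verify from Anscombe's published data values:

```python
import numpy as np

# Verifying Anscombe's numbers directly from the published quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {}
for name, (x, y) in quartet.items():
    x, y = np.array(x, float), np.array(y, float)
    r = np.corrcoef(x, y)[0, 1]
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    results[name] = (r, b0, b1)
    print(f"{name:>3}: r = {r:.3f}, fit y = {b0:.2f} + {b1:.2f} x")
```

All four panels print \(r \approx 0.816\) and fit \(y \approx 3.00 + 0.50x\) despite their wildly different shapes; only a plot reveals the difference.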
Heuristic: Kitchen Sink Regression
Setup: \(n = 50\); Y and all X’s are pure random noise (no true relationship).
\(k = 1\) predictor: \(R^2 = 0.002\), Adjusted \(R^2 \approx 0\)
\(k = 10\): \(R^2 = 0.198\), Adjusted \(R^2 \approx 0\)
\(k = 30\): \(R^2 = 0.636\), Adjusted \(R^2 \approx 0\)
\(k = 48\): \(R^2 = 1.000\), Adjusted \(R^2 \approx 0\)
Shocking: With 48 predictors and 50 observations, \(R^2 = 1.0\) — a perfect fit to pure noise!
Why? With \(p \approx n\) variables, the model has enough degrees of freedom to memorize every data point.
Lesson: Use Adjusted \(R^2\) , which correctly stays near zero throughout.
This is a profound and disturbing demonstration. With 50 observations and 48 random predictors, R-squared is literally 1.0 — a perfect fit to pure noise. The model has essentially memorized the entire dataset. But Adjusted R-squared sees through this illusion and stays near zero, correctly signaling no real explanatory power. This is why we never judge a model by R-squared alone, and why adding more variables without theoretical justification is dangerous.
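The experiment is easy to replicate. This sketch regresses pure noise on k pure-noise predictors; the exact numbers depend on the random draw and will differ from the table's, but the pattern of \(R^2\) climbing toward 1 while adjusted \(R^2\) pays the degrees-of-freedom penalty is the same.

```python
import numpy as np

# Kitchen-sink regression: y and all predictors are independent noise,
# yet R^2 climbs toward 1 as k approaches n.
rng = np.random.default_rng(11)
n = 50
y = rng.normal(size=n)

def fit_r2(k):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - np.sum((y - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # degrees-of-freedom penalty
    return r2, adj

results = {k: fit_r2(k) for k in (1, 10, 30, 48)}
for k, (r2, adj) in results.items():
    print(f"k = {k:2d}: R^2 = {r2:.3f}, adjusted R^2 = {adj:+.3f}")
```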
Heuristic: Collider Bias
Setup: 5,000 people have beauty and talent — two independent traits.
Full population: \(r = 0.002\) (p = 0.881), no association
Top 10% selected: \(r = -0.727\) (\(p \approx 0\)), strong negative!
The “Hollywood paradox”: Among celebrities, beautiful people appear less talented, and talented people appear less beautiful.
Mechanism: Selection on the sum of two variables (beauty + talent) creates a spurious negative correlation in the selected subsample.
Lesson: Analyzing only successful firms, published papers, or star performers can produce completely misleading correlations.
This is collider bias in action. In the full population, beauty and talent are independent — their correlation is essentially zero. But if you select only people who became famous — which requires either great beauty OR great talent OR some of both — you create a spurious negative correlation. Among the selected, someone with high beauty ‘needed’ less talent to get in, and vice versa. This exact phenomenon occurs when you study only listed companies, only published studies, or only successful funds.
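The whole mechanism fits in a short simulation (seed and selection threshold are arbitrary choices; exact correlations vary by draw but the sign flip is systematic):

```python
import numpy as np

# Collider-bias simulation: beauty and talent are independent in the
# full population, negatively correlated after selecting on their sum.
rng = np.random.default_rng(8)
n = 5000
beauty = rng.normal(size=n)
talent = rng.normal(size=n)

r_full = np.corrcoef(beauty, talent)[0, 1]

fame = beauty + talent                 # selection is on the sum
cut = np.quantile(fame, 0.9)           # top 10% become "celebrities"
sel = fame >= cut
r_sel = np.corrcoef(beauty[sel], talent[sel])[0, 1]

print(f"full population : r = {r_full:+.3f}")   # essentially zero
print(f"top 10% selected: r = {r_sel:+.3f}")    # clearly negative
```

Nothing causal links the two traits; conditioning on the collider (fame) alone manufactures the negative correlation.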
Chapter Summary
Pearson \(r\): measures linear association only
Spearman/Kendall: capture monotonic and nonlinear patterns
Simple regression: \(\hat{\beta}_1 = r \cdot s_Y/s_X\); OLS is BLUE under the Gauss-Markov assumptions
\(R^2\): can never decrease with more variables — use Adjusted \(R^2\)
Robust SE: always use HC3 when heteroscedasticity is present
Spurious regression: never regress non-stationary levels — use returns
Anscombe’s Quartet: always visualize before fitting
Collider bias: selection on the outcome creates false correlations
The key messages from this chapter: correlation and regression are two sides of the same coin, linked by the formula beta equals r times the ratio of standard deviations. Always go beyond Pearson with rank-based measures. Always visualize your data, as Anscombe’s Quartet teaches. Always use robust standard errors. And beware of the many traps — spurious regression from non-stationarity, kitchen sink overfitting, and collider bias from sample selection.