5. Correlation and Regression

5.1 Correlation

We assume for this section that $X$ and $Y$ are two random variables with finite first and second moments. We recap the notion of correlation.

Recap (Correlation): The correlation between $X$ and $Y$ is defined as $$\Cor{X}{Y} = \frac{\Cov{X}{Y}}{\sqrt{\Var{X}\,\Var{Y}}}$$
Note: We use the notation $\rho_{X,Y} = \Cor{X}{Y}$.
Proposition 5.1 (Properties of correlation): Some properties of the correlation are:
  • $\Cor{X}{Y} \in [-1,1]$
  • $\abs{\Cor{X}{Y}} = 1$ if and only if $Y = a + bX$ for some $a,b \in \R$, $b \neq 0$
  • If $X$ and $Y$ are independent, then $\Cor{X}{Y} = 0$
Note: If $\rho_{X,Y} = 0$, we call $X$ and $Y$ uncorrelated.

We find the correlation of two random variables $X$ and $Y$ by fitting the SLMI model $Y = a_1 + a_2 X + \epsilon$ via $$\dvec a^{\star} = \argmin_{a_1,a_2 \in \R} \E{(Y - (a_1 + a_2 X))^2}$$ Then the slope is $a^{\star}_2 = \frac{\Cov{X}{Y}}{\Var{X}} = \rho_{X,Y} \frac{\sigma_Y}{\sigma_X}$. Note that if we fit $X = b_1 + b_2 Y + \epsilon$ in the same fashion, then $\abs{\rho_{X,Y}} = \sqrt{a^{\star}_2 b^{\star}_2}$.
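The relation between the two slopes and the correlation can be checked numerically. The following Python/NumPy snippet (an illustrative addition, not part of the original notes; the simulated model and seed are arbitrary) verifies that the identity $\abs{\rho_{X,Y}} = \sqrt{a^{\star}_2 b^{\star}_2}$ holds exactly for the empirical analogues:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # arbitrary linear-plus-noise model

cov = np.cov(x, y)[0, 1]            # empirical covariance (ddof=1)
a2 = cov / np.var(x, ddof=1)        # empirical slope of y regressed on x
b2 = cov / np.var(y, ddof=1)        # empirical slope of x regressed on y
rho = np.corrcoef(x, y)[0, 1]       # empirical correlation

# |rho| = sqrt(a2 * b2): exact for the empirical quantities (up to float error)
print(abs(rho), np.sqrt(a2 * b2))
```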

Note: If we normalize $\tilde{Y} = \frac{Y - \mu_Y}{\sigma_Y}$ and $\tilde{X} = \frac{X - \mu_X}{\sigma_X}$ and fit the SLM $\tilde{Y} = a\tilde{X}$, then $a^{\star} = \rho_{X,Y}$.
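The empirical counterpart of this note is easy to check: standardizing both samples and fitting the no-intercept model recovers the empirical correlation as the least-squares slope. A minimal sketch (illustrative Python, arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)   # arbitrary example data

# standardize both samples
xt = (x - x.mean()) / x.std(ddof=1)
yt = (y - y.mean()) / y.std(ddof=1)

# least-squares slope of the no-intercept model  y~ = a * x~
a_star = (xt @ yt) / (xt @ xt)
print(a_star, np.corrcoef(x, y)[0, 1])   # identical up to float error
```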

5.2 Empirical Correlation

Definition 5.2 (Empirical correlation): The empirical correlation of an iid sample $x_1,\ldots,x_n$ of $X$ and $y_1,\ldots,y_n$ of $Y$ is $$\cor{\dvec x,\dvec y} = \frac{\cov{\dvec x,\dvec y}}{\sqrt{\var{\dvec x}\,\var{\dvec y}}}$$
Note: We use the notation $\hat{\rho}_{X,Y} = \cor{\dvec x,\dvec y}$, which is called Pearson's correlation coefficient.
Proposition 5.3 (Empirical correlation is biased): The empirical correlation is biased, with $\E{\cor{\rvec X,\rvec Y}} \approx \rho_{X,Y} - \frac{\rho_{X,Y}(1 - \rho_{X,Y}^2)}{2n}$.
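The bias approximation of Proposition 5.3 can be probed by Monte Carlo simulation. The sketch below (illustrative Python/NumPy, not part of the notes; $\rho = 0.5$ and $n = 20$ are arbitrary choices) averages $\hat{\rho}$ over many small bivariate Gaussian samples:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n, reps = 0.5, 20, 100_000   # arbitrary true correlation and sample size

# bivariate normal samples with correlation rho, shape (reps, n)
x = rng.normal(size=(reps, n))
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=(reps, n))

# empirical correlation of each replicate, computed row-wise
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
rhohat = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

# first-order bias approximation from Proposition 5.3
approx = rho - rho * (1 - rho**2) / (2 * n)
print(rhohat.mean(), approx)   # the Monte Carlo mean sits below rho, near approx
```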
Proposition 5.4 (Properties of the empirical correlation): Some properties of the empirical correlation are:
  • $\cor{\dvec x,\dvec y} \in [-1,1]$
  • $\abs{\cor{\dvec x,\dvec y}} = 1$ if and only if $y_i = a + bx_i$ for all $i$, for some $a,b \in \R$, $b \neq 0$
  • If $X$ and $Y$ are uncorrelated, then $\E{\cor{\rvec X,\rvec Y}} = 0$

We find the empirical correlation of two random variables $X$ and $Y$ by fitting the OLS estimator of the SLMI model $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$, where $Y_i \simiid Y$ and the $x_i$ are realizations of $X_i \simiid X$. Then the slope is $\betacheck_2 = \frac{\cov{\dvec x,\dvec y}}{\var{\dvec x}} = \hat{\rho}_{X,Y} \frac{\sigmahat_Y}{\sigmahat_X}$.

Proposition 5.5 (Fisher Z transformation): Let $X$ and $Y$ be jointly Gaussian and let $Z = \tanh^{-1}(\hat{\rho}_{X,Y}) = \frac{1}{2} \log\frac{1 + \hat{\rho}_{X,Y}}{1 - \hat{\rho}_{X,Y}}$. Then we can approximate the distribution of $Z$ via $$Z \sima \lawN\of{\tanh^{-1}(\rho_{X,Y}), \frac{1}{n-3}}$$
Note: The approximation holds reasonably well for $n \geq 10$.
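As a sketch of how the proposition is used in practice (illustrative Python, not from the notes; the simulated data and the 1.96 normal quantile for a 95% level are assumptions), one can back-transform a normal confidence interval for $\tanh^{-1}(\rho_{X,Y})$ and test $\H_0 : \rho_{X,Y} = 0$:

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)   # arbitrary correlated example data

r = np.corrcoef(x, y)[0, 1]
z = np.arctanh(r)                  # Fisher Z transform of the estimate
se = 1.0 / math.sqrt(n - 3)        # approximate standard deviation of Z

# approximate 95% confidence interval for rho, back-transformed via tanh
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)

# two-sided test of H0: rho = 0 via the normal approximation
# P(|N(0,1)| > t) = erfc(t / sqrt(2))
p_value = math.erfc(abs(z) / (se * math.sqrt(2)))
print((lo, hi), p_value)
```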

If we want to test for uncorrelatedness, i.e. $\H_0 : \rho_{X,Y} = 0$, we thus have three tests available:

  • Looking at a confidence limit diagram for $\hat{\rho}_{X,Y}$
  • The t-test or F-test of $\H_0: \beta_2 = 0$
  • Using the Fisher Z transformation and testing $\H_0 : Z = 0$
Note: Since correlation measures only linear dependence, very different dependence patterns may lead to the same value.
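The second and third tests in the list can be sketched side by side (illustrative Python using SciPy, not part of the notes; the simulated data are arbitrary). The t-statistic of the slope can be expressed through $\hat{\rho}_{X,Y}$ alone, which is why the slope test doubles as a correlation test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 40
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)   # arbitrary example data

res = stats.linregress(x, y)       # OLS fit of y = b1 + b2 * x
r = res.rvalue                     # equals the Pearson correlation

# t-test of H0: beta_2 = 0, expressed through r, with n - 2 degrees of freedom
t = r * np.sqrt((n - 2) / (1 - r**2))
p_t = 2 * stats.t.sf(abs(t), n - 2)

# Fisher Z test of H0: rho = 0
z = np.arctanh(r)
p_z = 2 * stats.norm.sf(abs(z) * np.sqrt(n - 3))

print(p_t, p_z)   # two approximate answers to the same question
```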

5.3 Partial Correlation

Definition 5.6 (Partial correlation): Let $X_1, X_2, Y$ be random variables. Then the partial correlation between $Y$ and $X_2$ given $X_1$ is $$\rho_{Y,X_2 \mid X_1} = \frac{\rho_{Y,X_2} - \rho_{Y,X_1}\,\rho_{X_2,X_1}}{\sqrt{(1-\rho_{Y,X_1}^2)(1-\rho_{X_2,X_1}^2)}}$$
Note: The estimated partial correlation is defined analogously.

The partial correlation measures the linear dependence between $Y$ and $X_2$ after accounting for the linear dependence of $Y$ and $X_2$ on $X_1$. The empirical partial correlation $\hat{\rho}_{Y,X_2 \mid X_1}$ can be computed in a Frisch-Waugh-Lovell way:

  1. Regress $Y$ on $X_1$ with intercept to obtain the residuals $\vepsilon_{\neg 2}$.
  2. Regress $X_2$ on $X_1$ with intercept to obtain the residuals $\dvec r_2$.
  3. The empirical correlation between $\vepsilon_{\neg 2}$ and $\dvec r_2$ is exactly the empirical partial correlation $\hat{\rho}_{Y,X_2 \mid X_1}$.
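The three steps can be sketched numerically. The check below (illustrative Python/NumPy, not part of the notes; the simulated data are arbitrary) confirms that the residual route agrees exactly with the closed-form expression of Definition 5.6:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)            # x2 depends on x1
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

def residuals(target, regressor):
    """Residuals of an OLS fit of target on [1, regressor] (with intercept)."""
    X = np.column_stack([np.ones_like(regressor), regressor])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta

def rho(a, b):
    return np.corrcoef(a, b)[0, 1]

# route 1: Frisch-Waugh-Lovell, correlation of the two residual vectors
partial_fwl = rho(residuals(y, x1), residuals(x2, x1))

# route 2: closed-form expression with empirical correlations plugged in
r_y2, r_y1, r_21 = rho(y, x2), rho(y, x1), rho(x2, x1)
partial_formula = (r_y2 - r_y1 * r_21) / np.sqrt((1 - r_y1**2) * (1 - r_21**2))

print(partial_fwl, partial_formula)   # agree up to float error
```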

Thus, in OLS regression with an intercept, one can relate partial correlations to the estimated parameters via $\hat{\rho}_{Y,X_j \mid \dmat X_{[:\neg j]}} \propto \betacheck_j$.

5.4 Rank Correlations

Since Pearson's correlation is not robust to outliers, some sort of rank correlation is often used. There are two types.

TODO