5. Correlation and Regression

5.1 Correlation

We assume for this section that $X$ and $Y$ are two random variables with finite first and second moments. We recap the notion of correlation.

Recap (Correlation): The correlation between $X$ and $Y$ is defined as $$\Cor{X}{Y} = \frac{\Cov{X}{Y}}{\sqrt{\Var{X}\,\Var{Y}}}$$
Note: We use the notation $\rho_{X,Y} = \Cor{X}{Y}$.
Proposition 5.1 (Properties of correlation): Some properties of the correlation are:
  • $\Cor{X}{Y} \in [-1,1]$
  • $\abs{\Cor{X}{Y}} = 1$ if and only if $Y = a + bX$ for some $a,b \in \R$, $b \neq 0$
  • If $X$ and $Y$ are independent, then $\Cor{X}{Y} = 0$
Note: If $\rho_{X,Y} = 0$, we call $X$ and $Y$ uncorrelated.

We find the correlation of two random variables $X$ and $Y$ by fitting the SLMI model $Y = a_1 + a_2 X + \epsilon$ via $$\dvec a^{\star} = \argmin_{a_1,a_2 \in \R} \E{(Y - (a_1 + a_2 X))^2}$$ Then the slope is $a^{\star}_2 = \frac{\Cov{X}{Y}}{\Var{X}} = \rho_{X,Y} \frac{\sigma_Y}{\sigma_X}$. Note that if we fit $X = b_1 + b_2 Y + \epsilon$ in the same fashion, then $\abs{\rho_{X,Y}} = \sqrt{a^{\star}_2 b^{\star}_2}$.
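The relation between the two slopes and the correlation can be checked numerically. The following Python/NumPy snippet (an illustrative addition, not part of the original notes; the simulated model and seed are arbitrary) verifies that the identity $\abs{\rho_{X,Y}} = \sqrt{a^{\star}_2 b^{\star}_2}$ holds exactly for the empirical analogues:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # arbitrary linear-plus-noise model

cov = np.cov(x, y)[0, 1]            # empirical covariance (ddof=1)
a2 = cov / np.var(x, ddof=1)        # empirical slope of y regressed on x
b2 = cov / np.var(y, ddof=1)        # empirical slope of x regressed on y
rho = np.corrcoef(x, y)[0, 1]       # empirical correlation

# |rho| = sqrt(a2 * b2): exact for the empirical quantities (up to float error)
print(abs(rho), np.sqrt(a2 * b2))
```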

Note: If we normalize $\tilde{Y} = \frac{Y - \mu_Y}{\sigma_Y}$ and $\tilde{X} = \frac{X - \mu_X}{\sigma_X}$ and fit the SLM $\tilde{Y} = a\tilde{X}$, then $a^{\star} = \rho_{X,Y}$.
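The empirical counterpart of this note is easy to check: standardizing both samples and fitting the no-intercept model recovers the empirical correlation as the least-squares slope. A minimal sketch (illustrative Python, arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)   # arbitrary example data

# standardize both samples
xt = (x - x.mean()) / x.std(ddof=1)
yt = (y - y.mean()) / y.std(ddof=1)

# least-squares slope of the no-intercept model  y~ = a * x~
a_star = (xt @ yt) / (xt @ xt)
print(a_star, np.corrcoef(x, y)[0, 1])   # identical up to float error
```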

5.2 Empirical Correlation

Definition 5.2 (Empirical correlation): The empirical correlation of an iid sample $x_1,\ldots,x_n$ of $X$ and $y_1,\ldots,y_n$ of $Y$ is $$\cor{\dvec x,\dvec y} = \frac{\cov{\dvec x,\dvec y}}{\sqrt{\var{\dvec x}\,\var{\dvec y}}}$$
Note: We use the notation $\hat{\rho}_{X,Y} = \cor{\dvec x,\dvec y}$, which is called Pearson's correlation coefficient.
Proposition 5.3 (Empirical correlation is biased): The empirical correlation is biased, with $\E{\cor{\rvec X,\rvec Y}} \approx \rho_{X,Y} - \frac{\rho_{X,Y}(1 - \rho_{X,Y}^2)}{2n}$.
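The bias approximation of Proposition 5.3 can be probed by Monte Carlo simulation. The sketch below (illustrative Python/NumPy, not part of the notes; $\rho = 0.5$ and $n = 20$ are arbitrary choices) averages $\hat{\rho}$ over many small bivariate Gaussian samples:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n, reps = 0.5, 20, 100_000   # arbitrary true correlation and sample size

# bivariate normal samples with correlation rho, shape (reps, n)
x = rng.normal(size=(reps, n))
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=(reps, n))

# empirical correlation of each replicate, computed row-wise
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
rhohat = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

# first-order bias approximation from Proposition 5.3
approx = rho - rho * (1 - rho**2) / (2 * n)
print(rhohat.mean(), approx)   # the Monte Carlo mean sits below rho, near approx
```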
Proposition 5.4 (Properties of the empirical correlation): Some properties of the empirical correlation are:
  • $\cor{\dvec x,\dvec y} \in [-1,1]$
  • $\abs{\cor{\dvec x,\dvec y}} = 1$ if and only if $y_i = a + bx_i$ for all $i$, for some $a,b \in \R$, $b \neq 0$
  • If $X$ and $Y$ are uncorrelated, then $\E{\cor{\rvec X,\rvec Y}} = 0$

We find the empirical correlation of two random variables $X$ and $Y$ by fitting the OLS estimator of the SLMI model $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$, where $Y_i \simiid Y$ and the $x_i$ are realizations of $X_i \simiid X$. Then the slope is $\betacheck_2 = \frac{\cov{\dvec x,\dvec y}}{\var{\dvec x}} = \hat{\rho}_{X,Y} \frac{\sigmahat_Y}{\sigmahat_X}$.

Proposition 5.5 (Fisher Z transformation): Let $X$ and $Y$ be jointly Gaussian and let $Z = \tanh^{-1}(\hat{\rho}_{X,Y}) = \frac{1}{2} \log\frac{1 + \hat{\rho}_{X,Y}}{1 - \hat{\rho}_{X,Y}}$. Then we can approximate the distribution of $Z$ via $$Z \sima \lawN\of{\tanh^{-1}(\rho_{X,Y}), \frac{1}{n-3}}$$
Note: The approximation holds reasonably well for $n \geq 10$.
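As a sketch of how the proposition is used in practice (illustrative Python, not from the notes; the simulated data and the 1.96 normal quantile for a 95% level are assumptions), one can back-transform a normal confidence interval for $\tanh^{-1}(\rho_{X,Y})$ and test $\H_0 : \rho_{X,Y} = 0$:

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)   # arbitrary correlated example data

r = np.corrcoef(x, y)[0, 1]
z = np.arctanh(r)                  # Fisher Z transform of the estimate
se = 1.0 / math.sqrt(n - 3)        # approximate standard deviation of Z

# approximate 95% confidence interval for rho, back-transformed via tanh
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)

# two-sided test of H0: rho = 0 via the normal approximation
# P(|N(0,1)| > t) = erfc(t / sqrt(2))
p_value = math.erfc(abs(z) / (se * math.sqrt(2)))
print((lo, hi), p_value)
```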

If we want to test for uncorrelatedness, i.e. $\H_0 : \rho_{X,Y} = 0$, we thus have three tests available:

  • Looking at a confidence limit diagram for $\hat{\rho}_{X,Y}$
  • The t-test or F-test of $\H_0: \beta_2 = 0$
  • Using the Fisher Z transformation and testing $\H_0 : Z = 0$
Note: Since correlation measures only linear dependence, very different dependence patterns may lead to the same value.
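The second and third tests in the list can be sketched side by side (illustrative Python using SciPy, not part of the notes; the simulated data are arbitrary). The t-statistic of the slope can be expressed through $\hat{\rho}_{X,Y}$ alone, which is why the slope test doubles as a correlation test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 40
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)   # arbitrary example data

res = stats.linregress(x, y)       # OLS fit of y = b1 + b2 * x
r = res.rvalue                     # equals the Pearson correlation

# t-test of H0: beta_2 = 0, expressed through r, with n - 2 degrees of freedom
t = r * np.sqrt((n - 2) / (1 - r**2))
p_t = 2 * stats.t.sf(abs(t), n - 2)

# Fisher Z test of H0: rho = 0
z = np.arctanh(r)
p_z = 2 * stats.norm.sf(abs(z) * np.sqrt(n - 3))

print(p_t, p_z)   # two approximate answers to the same question
```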

5.3 Partial Correlation

Definition 5.6 (Partial correlation): Let $X_1, X_2, Y$ be random variables. Then the partial correlation between $Y$ and $X_2$ given $X_1$ is $$\rho_{Y,X_2 \mid X_1} = \frac{\rho_{Y,X_2} - \rho_{Y,X_1}\,\rho_{X_2,X_1}}{\sqrt{(1-\rho_{Y,X_1}^2)(1-\rho_{X_2,X_1}^2)}}$$
Note: The estimated partial correlation is defined analogously.

The partial correlation measures the linear dependence between $Y$ and $X_2$ after accounting for the linear dependence of $Y$ and $X_2$ on $X_1$. The empirical partial correlation $\hat{\rho}_{Y,X_2 \mid X_1}$ can be computed in a Frisch-Waugh-Lovell way:

  1. Regress $Y$ on $X_1$ with intercept to obtain the residuals $\vepsilon_{\neg 2}$.
  2. Regress $X_2$ on $X_1$ with intercept to obtain the residuals $\dvec r_2$.
  3. The empirical correlation between $\vepsilon_{\neg 2}$ and $\dvec r_2$ is exactly the empirical partial correlation $\hat{\rho}_{Y,X_2 \mid X_1}$.
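The three steps can be sketched numerically. The check below (illustrative Python/NumPy, not part of the notes; the simulated data are arbitrary) confirms that the residual route agrees exactly with the closed-form expression of Definition 5.6:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)            # x2 depends on x1
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

def residuals(target, regressor):
    """Residuals of an OLS fit of target on [1, regressor] (with intercept)."""
    X = np.column_stack([np.ones_like(regressor), regressor])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta

def rho(a, b):
    return np.corrcoef(a, b)[0, 1]

# route 1: Frisch-Waugh-Lovell, correlation of the two residual vectors
partial_fwl = rho(residuals(y, x1), residuals(x2, x1))

# route 2: closed-form expression with empirical correlations plugged in
r_y2, r_y1, r_21 = rho(y, x2), rho(y, x1), rho(x2, x1)
partial_formula = (r_y2 - r_y1 * r_21) / np.sqrt((1 - r_y1**2) * (1 - r_21**2))

print(partial_fwl, partial_formula)   # agree up to float error
```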

Thus, in OLS regression with an intercept, one can relate partial correlations to the estimated parameters via $\hat{\rho}_{Y,X_j \mid \dmat X_{[:\neg j]}} \propto \betacheck_j$.

5.4 Rank Correlations

Since Pearson's correlation is not robust to outliers, some sort of rank correlation is often used. There are two types.

TODO