1. Multiple Linear Regression

Linear regression is a widely used statistical model with applications in a broad variety of fields. It is also one of the simplest settings in which to demonstrate important aspects of statistical modelling.

1.1 The Linear Model

Definition 1.1 (Linear model): For each observation i \in \set{1, \ldots, n}, let Y_i be the response variable and x_{i,1}, \ldots, x_{i,p} be the predictors. In the linear model the response variable is a linear function of the predictors up to some error \epsilon_i: Y_i = \sum_{j=1}^p \beta_j x_{i,j} + \epsilon_i where \forall i : \E{\epsilon_i} = 0.
Note: We usually, but not necessarily always, assume:
  • \forall i : \Var{\epsilon_i} = \sigma^2
  • \epsilon_1, \ldots, \epsilon_n are \iid

We call n the sample size and p the number of predictors. The goal is to estimate the parameters \beta_1, \ldots, \beta_p, to study their relevance and to estimate the error variance. The parameters \beta_j and the distribution parameters of \epsilon_i, e.g. \sigma^2, are unknown. The errors \epsilon_i are unobservable, while the response variables Y_i and the predictors x_{i,j} are given. We can rewrite the model using the vector notation \rvec* Y n = \dmat* X n p \cdot \gvec* \beta p + \gvec* \epsilon n where \rvec Y \in \R^n is the random vector of response variables, \dmat X \in \R^{n\times p} is the matrix of predictors, \gvec \beta \in \R^{p} is the vector of unknown parameters and \gvec \epsilon \in \R^n is the random vector of errors. We typically assume that the sample size is larger than the number of predictors, i.e. n > p, and that the matrix \dmat X has full rank p.

Note: We use the notation \dvec* x p _i = \begin{bmatrix} x_{i,1} \\ \vdots \\ x_{i,p} \end{bmatrix} = \dmat* X n p _{[i,:]} for a single observation i \in \set{1, \ldots, n} of all the p predictors and \dvec* x n _{:,j} = \begin{bmatrix} x_{1,j} \\ \vdots \\ x_{n,j} \end{bmatrix} = \dmat* X n p _{[:,j]} for all n observations of a single predictor j \in \set{1, \ldots, p}.
Note: To model an intercept, we set the first predictor variable to be a constant, i.e. \dvec x _{:,1} = \dvec 1, resulting in the model Y_i = \beta_1 + \sum_{j=2}^p \beta_j x_{i,j} + \epsilon_i.
Note (On stochastic models): The linear model involves some stochastic components: the error terms \epsilon_i are random variables, and hence so are the response variables Y_i. The predictor variables x_{i,j} are assumed to be non-random. However, in some applications, which are not discussed in this course, it is more appropriate to treat the predictor variables as random. The stochastic nature of the error terms \epsilon_i can be attributed to various sources, e.g. measurement errors or the inability to capture all underlying non-systematic effects.
Example (Regression through the origin): Y_i = \beta x_i + \epsilon_i
Example (Simple linear regression): Y_i = \beta_1 + \beta_2 x_i + \epsilon_i
Example (Transformed predictors): Y_i = \beta_1 + \beta_2 \log{x_{i,2}} + \beta_3 \sin{x_{i,3}} + \epsilon_i
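Note (Numerical sketch): A minimal Python sketch of such a model, assuming simulated data and numpy; all concrete values and variable names are illustrative. It builds a design matrix with an intercept column and transformed predictors, as in the examples above, and generates responses from the linear model.

```python
import numpy as np

# Simulated linear model with an intercept and transformed predictors,
# mirroring the examples above (all concrete values are illustrative).
rng = np.random.default_rng(0)
n = 100
x2 = rng.uniform(1.0, 10.0, size=n)        # raw predictor entering through log
x3 = rng.uniform(0.0, 2 * np.pi, size=n)   # raw predictor entering through sin

# Design matrix: constant column for the intercept, then transformed predictors.
X = np.column_stack([np.ones(n), np.log(x2), np.sin(x3)])   # shape (n, p), p = 3
beta = np.array([1.0, 2.0, -0.5])                           # true parameters (unknown in practice)
eps = rng.normal(loc=0.0, scale=0.5, size=n)                # errors with mean 0
Y = X @ beta + eps                                          # response vector
```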

1.2 Least Squares

We assume the linear model \rvec Y = \dmat X \cdot \gvec \beta + \gvec \epsilon and want to find a good estimate of \gvec \beta. We define the ordinary least squares (OLS) estimator.

Definition 1.2 (OLS estimator): The least squares estimator \gvec \betahat is defined as \gvec* \betahat p = \argmin_{\gvec \beta \in \R^p} \normbig{\rvec* Y n - \dmat* X n p \cdot \gvec* \beta p}^2 Assuming that \dmat X has rank p, the minimizer can be computed explicitly by \gvec* \betahat p = \pabig{\dmattr* X n p \dmat* X n p}^{-1} \dmattr* X n p \cdot \rvec* Y n
Note: We distinguish between the least squares estimator, \gvec \betahat, which is a random vector, and the resulting least squares estimate, \dvec \bhat, which is deterministic and computed from a specific data sample \dvec y.
Proof: As \norm{\rvec Y - \dmat X \cdot \gvec \beta}^2 is convex, we find the minimizer by setting its gradient to \dvec 0: \nabla \norm{\rvec Y - \dmat X \cdot \gvec \beta}^2 = -2 \dmattr X (\rvec Y - \dmat X \cdot \gvec \beta) \stackrel{!}{=} \dvec 0 This yields the normal equations \dmattr X \dmat X \cdot \gvec \betahat = \dmattr X \cdot \rvec Y Under the assumption that \dmat X has rank p, the matrix \dmattr X \dmat X \in \R^{p \times p} has full rank and is invertible, thus \gvec \betahat = \pa{\dmattr X \dmat X}^{-1} \dmattr X \cdot \rvec Y
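Note (Numerical sketch): A minimal Python sketch, on simulated data, of computing the least squares estimate via the normal equations; the concrete numbers are illustrative. In practice one would typically rely on a factorization-based routine such as numpy's lstsq, which solves the same minimization problem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # full-rank design with intercept
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Least squares estimate from the normal equations X^T X b = X^T y.
# Solving the linear system is numerically preferable to forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent route via a factorization of X (more stable for ill-conditioned designs).
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_hat_lstsq)
```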
Definition 1.3 (Residuals): We define the residuals as R_i = Y_i - \sum_{j=1}^p \betahat_j x_{i,j}
Note: The equivalent vector notation is \rvec R = \rvec Y - \dmat X \cdot \gvec \betahat.

1.2.1 Assumptions

We need some assumptions so that fitting a linear model by least squares is reasonable and so that tests and confidence intervals are approximately valid. We distinguish between the OLS assumptions, which are needed for the estimation guarantees, and the additional normality of error assumption, which is needed for inference.

Proposition 1.4 (OLS assumptions):
  1. The linear model is correct, i.e. \rvec Y = \dmat X \cdot \gvec \beta + \gvec \epsilon with \E{\gvec \epsilon} = \dvec 0.
  2. All \dvec x_i are exact, i.e. they can be observed perfectly.
  3. The errors are homoscedastic, i.e. \forall i: \Var{\epsilon_i} = \sigma^2.
  4. The errors are uncorrelated, i.e. \forall i \neq j : \Cov{\epsilon_i}{\epsilon_j} = 0.
Proposition 1.5 (Normality of error assumption):
  5. The errors are jointly normal, i.e. \gvec \epsilon \sim \lawN.
Note:
  • 1., 3., 4. and 5. imply \gvec \epsilon \sim \lawN(\dvec 0, \sigma^2 \dmat I) and \epsilon_i \simiid \lawN(0, \sigma^2).
  • If 2. does not hold, we need corrections from errors in variables methods.
  • If 3. does not hold, we can use weighted least squares.
  • If 4. does not hold, we can use generalized least squares.
  • If 5. does not hold, we can still use least squares but, as OLS is sensitive to outliers, we might prefer robust methods instead.

We emphasize that we do not make any assumptions on the predictor variables, except that the matrix \dmat X has full rank p. This ensures that there is no perfect multicollinearity among the predictors, i.e. the columns \dvec x _{:,j} are linearly independent.

1.2.2 Geometrical Interpretation

The vector of response variables \rvec Y is a random vector in \R^n and \setX = \set{\dmat X \cdot \gvec \beta : \gvec \beta \in \R^p} describes the p-dimensional subspace \setX \subset \R^n spanned by the columns \dvec x _{:,j}. The least squares estimator \gvec \betahat is then such that \dmat X \cdot \gvec \betahat is closest to \rvec Y with respect to the Euclidean distance. We denote the vector of fitted values by \rvec \Yhat = \dmat X \cdot \gvec \betahat.

Recap (Orthogonal projection matrix): A square matrix \dmat P is called an orthogonal projection matrix if \dmat P^2 = \dmat P = \dmattr P.
Proposition 1.6 (Interpretation of fitted values): The vector of fitted values \rvec \Yhat is the orthogonal projection of \rvec Y onto the p-dimensional subspace \setX.
Proof: Since \rvec \Yhat = \dmat X \cdot \gvec \betahat = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X \cdot \rvec Y, the map \rvec Y \mapsto \rvec \Yhat can be represented by the matrix \dmat* P n n = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X It is evident that \dmattr P = \dmat P. It remains to check \dmat P^2 = \dmat P: \dmat P^2 = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X

The vector of residuals is then defined by \rvec R = \rvec Y - \rvec \Yhat. Geometrically, the residuals are orthogonal to \setX because \rvec \Yhat is the orthogonal projection of \rvec Y onto \setX. This means that \forall j: \rvectr R \dvec x _{:,j} = 0 or, equivalently, \dmattr X \rvec R = \dvec 0.

Proposition 1.7 (Interpretation of residuals): The vector of residuals \rvec R is the orthogonal projection of \rvec Y onto the (n{-}p)-dimensional orthogonal complement \setX^{\bot} of \setX in \R^n.
Proof: Since \rvec R = \rvec Y - \rvec \Yhat = \rvec Y - \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X \cdot \rvec Y, the map \rvec Y \mapsto \rvec R can be represented by the matrix \dmat* I n n - \dmat* P n n = \dmat I - \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X which is an orthogonal projection matrix since (\dmat I - \dmat P)^\tr = \dmat I - \dmat P and (\dmat I - \dmat P)^2 = \dmat I^2 - 2 \dmat I \dmat P + \dmat P^2 = \dmat I - \dmat P
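Note (Numerical sketch): A small Python check, on simulated data, of the projection properties derived above: the matrix P is symmetric and idempotent, the residuals are orthogonal to the columns of X, and the trace of P equals p. The concrete numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

# Hat matrix P = X (X^T X)^{-1} X^T and the residual projection I - P.
P = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = P @ y                       # fitted values: projection of y onto the column space of X
r = (np.eye(n) - P) @ y             # residuals: projection onto the orthogonal complement

assert np.allclose(P @ P, P) and np.allclose(P, P.T)   # orthogonal projection matrix
assert np.allclose(X.T @ r, 0.0)                       # residuals orthogonal to the columns of X
assert np.isclose(np.trace(P), p)                      # trace(P) equals p
```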

1.2.3 Distributional Properties

We point out that the least squares estimator \gvec \betahat is a random vector: if we drew new samples \rvec Y from the same data-generating mechanism, the data \dvec y would look different every time, and hence so would the least squares estimate \dvec \bhat. We analyze the first two moments of the random vectors and variables.

Proposition 1.8 (First moments of least square estimator):
  1. \E{\gvec \betahat} = \gvec \beta, i.e. \gvec \betahat is an unbiased estimator.
  2. \Cov{\gvec \betahat}{\gvec \betahat} = \sigma^2 (\dmattr X \dmat X)^{-1}.
Proof: \begin{align*} \E{\gvec \betahat} &= \pa{\dmattr X \dmat X}^{-1} \dmattr X \cdot \E{\rvec Y} \\ &= \pa{\dmattr X \dmat X}^{-1} \dmattr X \cdot \E{\dmat X \cdot \gvec \beta + \gvec \epsilon} \\ & = \gvec \beta + \pa{\dmattr X \dmat X}^{-1} \dmattr X \cdot \E{\gvec \epsilon} \\ & = \gvec \beta \end{align*} and

\begin{align*} \Cov{\gvec \betahat}{\gvec \betahat} &= \E{\gvec \betahat \gvectr \betahat} - \E{\gvec \betahat} \E{ \gvec \betahat}^\tr\\ &= \pa{\dmattr X \dmat X}^{-1} \dmattr X \E{(\dmat X \cdot \gvec \beta + \gvec \epsilon)\pa{\gvectr \beta \cdot \dmattr X + \gvectr \epsilon}} \cdot \dmat X \pa{\dmattr X \dmat X}^{-1} - \gvec \beta \gvectr \beta \\ &= \pa{\dmattr X \dmat X}^{-1} \dmattr X \cdot \Cov{\gvec \epsilon}{\gvec \epsilon} \cdot \dmat X \pa{\dmattr X \dmat X}^{-1} \\ &= \sigma^2 \pa{\dmattr X \dmat X}^{-1} \dmattr X \dmat X \pa{\dmattr X \dmat X}^{-1} \\ &= \sigma^2 \pa{\dmattr X \dmat X}^{-1} \end{align*}

Note: We write \sigma^2_j = \sigma^2 (\dmattr X \dmat X)^{-1}_{[j,j]} for the variance and \sigma_j = \sqrt{\sigma^2_j} for the standard deviation of \betahat_j.
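Note (Numerical sketch): A Monte Carlo illustration in Python of Proposition 1.8, assuming a fixed simulated design and repeated error draws; the concrete numbers are arbitrary.

```python
import numpy as np

# Monte Carlo check of E[beta_hat] = beta and Cov(beta_hat) = sigma^2 (X^T X)^{-1}
# for a fixed design and repeated error draws (illustrative numbers).
rng = np.random.default_rng(3)
n, sigma = 40, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

estimates = np.array([
    XtX_inv @ X.T @ (X @ beta + rng.normal(scale=sigma, size=n))
    for _ in range(20_000)
])

print(estimates.mean(axis=0))             # approximately beta
print(np.cov(estimates, rowvar=False))    # approximately sigma^2 * XtX_inv
print(sigma**2 * XtX_inv)
```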
Proposition 1.9 (First moments of fitted values):
  1. \E{\rvec \Yhat} = \dmat X \gvec \beta.
  2. \Cov{\rvec \Yhat}{\rvec \Yhat} = \sigma^2 \dmat P.
Proof: \E{\rvec \Yhat} = \E{\dmat X \gvec \betahat} = \dmat X \gvec \beta and

\begin{align*} \Cov{\rvec \Yhat}{\rvec \Yhat} &= \E{\rvec \Yhat \rvectr \Yhat} - \E{\rvec \Yhat} \E{ \rvec \Yhat}^\tr \\ &= \dmat X \cdot \E{\gvec \betahat \gvectr \betahat} \cdot \dmattr X - \dmat X \gvec \beta \gvectr \beta \dmattr X \\ &= \dmat X \cdot \pa{\Cov{\gvec \betahat}{\gvec \betahat} + \gvec \beta \gvectr \beta} \cdot \dmattr X - \dmat X \gvec \beta \gvectr \beta \dmattr X \\ &= \sigma^2 \dmat X \pa{\dmattr X \dmat X}^{-1} \dmattr X \\ &= \sigma^2 \dmat P \end{align*}

Note: The first two moments of the vector of response variables are \E{\rvec Y} = \dmat X \gvec \beta and \Cov{\rvec Y}{\rvec Y} = \sigma^2 \dmat I. Furthermore, we have \Cov{\rvec Y}{\rvec \Yhat} = \sigma^2 \dmat P.
Proposition 1.10 (First moments of residuals):
  1. \E{\rvec R} = \dvec 0.
  2. \Cov{\rvec R}{\rvec R} = \sigma^2 (\dmat I - \dmat P).
Proof: The first statement follows from \E{\rvec R} = \E{\rvec Y} - \E{\rvec \Yhat} = \dmat X \gvec \beta - \dmat X \gvec \beta = \dvec 0. Further

\begin{align*} \Cov{\rvec R}{\rvec R} &= \E{\rvec R \rvectr R} \\ &= \E{(\rvec Y - \rvec \Yhat) (\rvec Y - \rvec \Yhat)^\tr} \\ &= \E{\rvec Y \rvectr Y - \rvec Y \rvectr \Yhat - \rvec \Yhat \rvectr Y + \rvec \Yhat \rvectr \Yhat} \\ &= \E{\rvec Y \rvectr Y} - 2 \Cov{\rvec Y}{\rvec \Yhat} - 2 \E{\rvec Y} \E{ \rvec \Yhat}^\tr + \E{\rvec \Yhat \rvectr \Yhat} \\ &= \E{\rvec Y \rvectr Y} - 2 \Cov{\rvec Y}{\rvec \Yhat} - 2 \dmat X \gvec \beta \gvectr \beta \dmattr X + \E{\rvec \Yhat \rvectr \Yhat} \\ &= \Cov{\rvec Y}{\rvec Y} - 2\Cov{\rvec Y}{\rvec \Yhat} + \Cov{\rvec \Yhat}{\rvec \Yhat} \\ &= \sigma^2 \dmat I - 2 \sigma^2 \dmat P + \sigma^2 \dmat P \\ &= \sigma^2 (\dmat I - \dmat P) \end{align*}

For the derivation of the moments we did not need to assume any specific distribution of the errors \gvec \epsilon. Using the normality of error assumption \gvec \epsilon \sim \lawN, we can derive the specific distributions of the random vectors and variables.

Proposition 1.11 (Normal distributions):
  • \gvec \betahat \sim \overdim \lawN p \of{\gvec \beta, \sigma^2 (\dmattr X \dmat X)^{-1}}
  • \rvec Y \sim \overdim \lawN n \of{\dmat X \gvec \beta, \sigma^2 \dmat I}
  • \rvec \Yhat \sim \overdim \lawN n \of{\dmat X \gvec \beta, \sigma^2 \dmat P}
  • \rvec R \sim \overdim \lawN n \of{\dvec 0, \sigma^2 (\dmat I - \dmat P)}
Proof: The distributions follow from \gvec \epsilon \sim \lawN and from the fact that any affine transformation of a multivariate normal is again normally distributed.
Note: The normality of error assumption is often not fulfilled in practice. We can then rely on the central limit theorem, which implies that for large sample size n the distributions are still approximately normal. However, in the case of non-Gaussian errors it is often much better to use robust methods, which are not discussed here.

1.2.4 Error Variance Estimator

We introduce an estimator for the variance of the errors.

Definition 1.12 (Error variance estimator): The estimator for the error variance \sigma^2 is \sigmahat^2 = \frac{1}{n-p} \sum_{i=1}^n R_i^2 in the least squares setting.
Proposition 1.13 (Estimator is unbiased): The error variance estimator \sigmahat^2 is unbiased, i.e. \E{\sigmahat^2} = \sigma^2.
Proof: \begin{align*} \E{\sigmahat^2} &= \E{\frac{1}{n-p} \sum_{i=1}^n R_i^2} \\ &= \frac{1}{n-p} \sum_{i=1}^n \Var{R_i} \\ &= \frac{\sigma^2}{n-p} \sum_{i=1}^n (1 - \dmat P _{[i,i]}) \\ &= \frac{\sigma^2}{n-p} (n - \trace{\dmat P}) \\ &= \frac{\sigma^2}{n-p} (n - p) \\ &= \sigma^2 \end{align*} where we used \E{R_i} = 0, \Var{R_i} = \sigma^2 (1 - \dmat P _{[i,i]}) and \trace{\dmat P} = \trace{\pabig{\dmattr X \dmat X}^{-1} \dmattr X \dmat X} = \trace{\dmat* I p p} = p.
Proposition 1.14 (Distribution): The distribution of the error variance estimator is \sigmahat^2 \sim \frac{\sigma^2}{n-p} \lawChi{n-p}
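Note (Numerical sketch): A short Python illustration, on simulated data, of the error variance estimator; the estimate should lie in the vicinity of the true \sigma^2. The concrete numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 60, 3, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
r = y - X @ beta_hat

# Unbiased error variance estimator: residual sum of squares divided by n - p.
sigma2_hat = (r @ r) / (n - p)
print(sigma2_hat, sigma**2)   # estimate vs. true value 0.49
```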

1.3 Tests and Confidence Intervals

1.3.1 Test of Regression Coefficients

We impose the OLS and normality of error assumptions. As we have seen before, the estimator \gvec \betahat is normally distributed. If we are interested in whether the j-th predictor variable is relevant, we can test the null-hypothesis H_{0,j} : \beta_j = 0 against the alternative H_{A,j} : \beta_j \neq 0.

Recap (\statT-distribution): Let Z \sim \lawN(0,1) and V \sim \lawChi m be independent. The random variable T = \frac{Z}{\sqrt{m^{-1} V}} follows the \statT-distribution with m degrees of freedom, i.e. T \sim \lawT m.
Definition 1.15 (\statT-statistic): Under the null-hypothesis H_{0,j} : \beta_j = 0 we have T_j = \frac{\betahat_j}{\sqrt{\sigmahat^2 \pabig{\dmattr X \dmat X}^{-1}_{[j,j]}}} = \frac{\betahat_j}{\sigmahat_j} \sim \lawT{n-p} We call T_j the \statT-statistic of the j-th predictor variable.
Proof: We know \betahat_j \sim \lawN\pa{\beta_j, \sigma^2 \pabig{\dmattr X \dmat X}^{-1}_{[j,j]}} and \sigmahat^2 \sim \frac{\sigma^2}{n-p} \lawChi{n-p} thus, under H_{0,j}, we have \frac{\frac{\betahat_j}{\sqrt{\sigma^2 \pabig{\dmattr X \dmat X}^{-1}_{[j,j]}}}}{\sqrt{(n-p)^{-1} \frac{n-p}{\sigma^2} \sigmahat^2}} = \frac{\betahat_j}{\sqrt{\sigmahat^2 \pabig{\dmattr X \dmat X}^{-1}_{[j,j]}}} \sim \lawT{n-p}

The test corresponding to the \statT-statistic is called the \statT-test. In practice, we can thus quantify the relevance of the individual predictor variables by looking at the size of the test statistics T_1, \ldots, T_p or at the corresponding \p-values, which may be more informative.

Note: Given that we observe the \statT-statistic t_j for some sample of response variables \dvec y, the \p-value of the \statT-test is \p_{t_j} = 2 \P{\abs{t_j} \leq T} where T \sim \lawT{n-p}.
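Note (Numerical sketch): A Python sketch, on simulated data, of computing the \statT-statistics and two-sided \p-values; it assumes scipy.stats for the \statT-distribution, and the concrete numbers are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p, sigma = 60, 3, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=sigma, size=n)   # last coefficient truly zero

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)

se = np.sqrt(sigma2_hat * np.diag(XtX_inv))            # estimated standard deviations sigma_hat_j
t_stats = beta_hat / se                                # t-statistics T_j
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)   # two-sided p-values
print(t_stats, p_values)
```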

The problem with looking at individual tests H_{0,j} is, besides the multiple testing problem in general, that it can happen that none of the individual tests rejects its null-hypothesis even though some predictor variables do have a significant effect. This can occur because of correlation among the predictor variables. We clarify how to interpret the results of the \statT-tests.

Note (Interpretation of \statT-tests): An individual \statT-test for H_{0,j} should be interpreted as quantifying the effect of the j-th predictor variable after having subtracted the linear effect of all other predictor variables k \neq j on \rvec Y.

Similarly to the \statT-tests, one can derive confidence intervals for the unknown parameter \beta_j.

Definition 1.16 (Confidence interval): The interval given by \betahat_j \pm \sigmahat_j \cdot f_{\lawT{n-p}}\of{1-\frac{\alpha}{2}} is the two-sided confidence interval \setC_{j} which covers the true \beta_j with probability 1 - \alpha, i.e. \P{\beta_j \in \setC_{j}} = 1 - \alpha.
Proof: Recall that \sigmahat_j = \sqrt{\sigmahat^2 \pabig{\dmattr X \dmat X}^{-1}_{[j,j]}} Given \beta_j we have \begin{align*} & \frac{\betahat_j - \beta_j}{\sigmahat_j} \sim \lawT{n-p} \\ \implies \quad & \P{-f_{\lawT{n-p}}\of{1-\frac{\alpha}{2}} < \frac{\betahat_j - \beta_j}{\sigmahat_j} < f_{\lawT{n-p}}\of{1-\frac{\alpha}{2}}} = 1 - \alpha \\ \implies \quad & \P{\betahat_j - \sigmahat_j \cdot f_{\lawT{n-p}}\of{1-\frac{\alpha}{2}} < \beta_j < \betahat_j + \sigmahat_j \cdot f_{\lawT{n-p}}\of{1-\frac{\alpha}{2}}} = 1 - \alpha \end{align*}
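Note (Numerical sketch): A Python sketch, on simulated data, of the two-sided confidence intervals of Definition 1.16, using the \statT-quantile from scipy.stats; the concrete numbers are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p, sigma, alpha = 60, 3, 0.7, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# Two-sided (1 - alpha) confidence intervals: beta_hat_j +/- se_j * t_{n-p, 1-alpha/2}.
q = stats.t.ppf(1 - alpha / 2, df=n - p)
print(np.column_stack([beta_hat - q * se, beta_hat + q * se]))
```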

1.3.2 Test of Global Null-Hypothesis

To test whether there exists any effect from the predictor variables, we can look at the global null-hypothesis H_0 : \beta_2 = \ldots = \beta_p = 0 versus the alternative hypothesis H_A : \pabig{\exists j \in \set{2,\ldots,p} : \beta_j \neq 0}. Such a test can be developed with an analysis of variance decomposition which takes a simple form for this special case.

Note: We denote with \rvec \Yline the vectorized global arithmetic mean, i.e. \rvec \Yline = \Yline \cdot \dvec 1 where \Yline = n^{-1} \sum_{i=1}^n Y_i.
Definition 1.17 (Anova decomposition): The total variation \norm{\rvec Y - \rvec \Yline}^2 of Y_i around the mean \Yline can be decomposed as \norm{\rvec Y - \rvec \Yline}^2 = \normbig{\rvec \Yhat - \rvec \Yline}^2 + \normbig{\rvec Y - \rvec \Yhat}^2 where \normbig{\rvec \Yhat - \rvec \Yline}^2 is the variation of the fitted values \Yhat_i around the mean \Yline, and \normbig{\rvec Y - \rvec \Yhat}^2 = \norm{\rvec R}^2 is the variation of the residuals.
Proof: Recall that our linear model includes the intercept, i.e. \dmat X_{[:,1]} = \dvec 1, and recall \setX = \set{\dmat X \cdot \gvec \beta : \gvec \beta \in \R^p}. Let \gvec{\tilde{\beta}} with \tilde{\beta}_1 = \Yline and \forall j \neq 1: \tilde{\beta}_j = 0, then \dmat X \cdot \gvec{\tilde{\beta}} = \sum_{j=1}^p \tilde{\beta}_j \cdot \dmat X_{[:,j]} = \tilde{\beta}_1 \cdot \dmat X_{[:,1]} = \Yline \cdot \dvec 1 hence \rvec \Yline \in \setX and \rvec \Yhat - \rvec \Yline \in \setX. This implies \rvec \Yhat - \rvec \Yline \perp \rvec R as \rvec R \in \setX^{\bot}. Writing \rvec Y - \rvec \Yline = (\rvec \Yhat - \rvec \Yline) + \rvec R, the decomposition of the total variation \norm{\rvec Y - \rvec \Yline}^2 then follows from Pythagoras.
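Note (Numerical sketch): A small Python check, on simulated data, that the Anova decomposition holds numerically when the design contains an intercept column; the concrete numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # first column is the intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.7, size=n)

y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
y_bar = y.mean()

total = np.sum((y - y_bar) ** 2)          # total variation
explained = np.sum((y_hat - y_bar) ** 2)  # variation of fitted values around the mean
residual = np.sum((y - y_hat) ** 2)       # residual variation
assert np.isclose(total, explained + residual)   # decomposition holds (intercept included)
```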

The idea now is to look at the ratio between the variation of the fitted values \Yhat_i around the mean \Yline and the variance of the residuals \sigmahat^2. This is a scale-free quantity.

Definition 1.18 (\statF-statistic): Under the global null-hypothesis H_0 : \beta_2 = \ldots = \beta_p = 0 we have the distribution F = \frac{\normbig{\rvec \Yhat - \rvec \Yline}^2}{\normbig{\rvec Y - \rvec \Yhat}^2} \cdot \frac{n-p}{p-1} = \frac{\normbig{\rvec \Yhat - \rvec \Yline}^2}{(p-1) \cdot \sigmahat^2} \sim \lawF{p-1}{n-p} We call F the \statF-statistic.

The corresponding test is called the \statF-test.

Note: Given that we observe the \statF-statistic f for some sample of response variables \dvec y, the \p-value of the \statF-test is \p_{f} = \P{f \leq F} where F \sim \lawF{p-1}{n-p}.

Besides performing a global \statF-test to quantify the statistical significance of the predictor variables, we often want to describe the goodness of fit of the linear model for explaining the data. A meaningful quantity is the coefficient of determination.

Definition 1.19 (Coefficient of determination): The proportion of the total variation of \rvec Y around \rvec \Yline explained by the regression is R^2 = \frac{\normbig{\rvec \Yhat - \rvec \Yline}^2}{\normbig{\rvec Y - \rvec \Yline}^2} We call R^2 the coefficient of determination.
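Note (Numerical sketch): A Python sketch, on simulated data, of the \statF-statistic, its \p-value and the coefficient of determination; it assumes scipy.stats for the \statF-distribution, and the concrete numbers are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.7, size=n)

y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)
y_bar = y.mean()
explained = np.sum((y_hat - y_bar) ** 2)
rss = np.sum((y - y_hat) ** 2)

# Global F-test of H0: beta_2 = ... = beta_p = 0, and coefficient of determination.
f_stat = (explained / (p - 1)) / (rss / (n - p))
p_value = stats.f.sf(f_stat, dfn=p - 1, dfd=n - p)
r_squared = explained / np.sum((y - y_bar) ** 2)
print(f_stat, p_value, r_squared)
```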

1.4 Check of Model Assumptions

The residuals R_i = Y_i - \Yhat_i can serve as an approximation of the unobservable errors \epsilon_i and can be used to check whether the linear model is appropriate.

1.4.1 The Tukey-Anscombe Plot

...

Because the sample correlation between the residuals and the fitted values is zero by construction under OLS, any pattern in the plot (such as a curve or a fan shape) cannot be a simple linear trend. It must instead be evidence of a model violation, for example non-linearity of the regression function (a curve) or non-constant error variance, i.e. heteroscedasticity (a fan shape).
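Note (Numerical sketch): A minimal Python sketch, on simulated data, of drawing a Tukey-Anscombe plot with matplotlib; for a well-specified model the points should scatter around the horizontal zero line without visible structure. The concrete numbers are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0.0, 3.0, size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
residuals = y - y_hat

# Tukey-Anscombe plot: residuals against fitted values.
plt.scatter(y_hat, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Tukey-Anscombe plot")
plt.show()
```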