1. Multiple Linear Regression

Linear regression is a statistical model that is widely used in a broad variety of applications, and it is one of the simplest settings in which to demonstrate important aspects of statistical modelling.

1.1 The Linear Model

A sample is a collection of observations. In this text, $n$ denotes the sample size. For each observation $i \in \set{1, \ldots, n}$ let $Y_i$ be the response variable and $x_{i,1}, \ldots, x_{i,p}$ be the predictors.

Assumption 1.1 (Linear model in fixed design): The linear model is the assumption that for each observation $i \in \set{1, \ldots, n}$ the response variable is a linear function of the predictors up to some error: $Y_i = \sum_{j=1}^p \beta_j x_{i,j} + \epsilon_i$

The linear model is a theoretical framework assumption of how the response is generated from the predictors. We call $n$ the sample size and $p$ the number of predictors. Assuming correctness of the linear model, the goal is to estimate the parameters $\beta_1, \ldots, \beta_p$, to study their relevance and to estimate the error variance. The parameters $\beta_j$ and distribution parameters of $\epsilon_i$, e.g. the error variance $\Var{\epsilon_i} = \sigma_i^2$, are unknown. The errors $\epsilon_i$ are unobservable, while the response variables $Y_i$ and the predictors $x_{i,j}$ are given. We can rewrite the model using the vector notation $\rvec* Y n = \dmat* X n p \cdot \gvec* \beta p + \gvec* \epsilon n$ where $\rvec Y \in \R^n$ is the random vector of response variables, $\dmat X \in \R^{n\times p}$ is the matrix of predictors, $\gvec \beta \in \R^{p}$ is the vector of unknown parameters and $\gvec \epsilon \in \R^n$ is the random vector of errors. We typically assume that the sample size is larger than the number of predictors, i.e. $n > p$, and that the matrix $\dmat X$ has full rank $p$.

Note: We use the notation $\dvec* x p _i = \begin{bmatrix} x_{i,1} \\ \vdots \\ x_{i,p} \end{bmatrix} = \dmat* X n p _{[i,:]}$ for a single observation $i \in \set{1, \ldots, n}$ of all the $p$ predictors and $\dvec* x n _{:,j} = \begin{bmatrix} x_{1,j} \\ \vdots \\ x_{n,j} \end{bmatrix} = \dmat* X n p _{[:,j]}$ for all $n$ observations of a single predictor $j \in \set{1, \ldots, p}$.
Note: To model an intercept, we set the first predictor variable to be a constant, i.e. $\dvec x _{:,1} = \dvec 1$, resulting in the model $Y_i = \beta_1 + \sum_{j=2}^p \beta_j x_{i,j} + \epsilon_i$.
Example (Regression through the origin): $Y_i = \beta x_i + \epsilon_i$
Example (Simple linear regression): $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$
Example (Transformed predictors): $Y_i = \beta_1 + \beta_2 \log{x_{i,2}} + \beta_3 \sin{x_{i,3}} + \epsilon_i$
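
As a small illustration of the fixed-design setting, the following Python sketch simulates data from the simple linear regression example above. It is not part of the original notes; NumPy is assumed, and all concrete values ($n$, $\gvec \beta$, $\sigma$) as well as the variable names are arbitrary illustrative choices.

```python
# Minimal simulation sketch (illustrative only): data from the simple linear
# regression model Y_i = beta_1 + beta_2 * x_i + eps_i in fixed design.
import numpy as np

rng = np.random.default_rng(0)

n = 50                                # sample size
beta = np.array([1.0, 2.0])           # true parameters (intercept, slope)
sigma = 0.5                           # error standard deviation

x = np.linspace(0.0, 10.0, n)         # fixed design: predetermined predictor values
X = np.column_stack([np.ones(n), x])  # design matrix with intercept column
eps = rng.normal(0.0, sigma, size=n)  # unobservable errors
Y = X @ beta + eps                    # observed responses
```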

1.1.1 On Stochastic Components

Standard linear regression involves some stochastic components: the error terms $\epsilon$ are random variables, and hence the response $Y$ is as well. The stochastic nature of the error terms $\epsilon$ can be attributed to various sources, e.g. measurement errors or the inability to capture all underlying non-systematic effects.

Definition 1.2 (Fixed design): A fixed design describes a study where the values of the predictors $x_{i,j}$ are pre-determined and controlled. These values are treated as non-stochastic, and the analysis is concerned with the response $Y$ at these specific levels.
Note: In fixed design, the data is considered a sample from the probability distribution of $Y$.
Definition 1.3 (Random design): A random design describes a study, typically observational, where the values of the predictors are not controlled. Instead, both the predictors $X_j$ and responses $Y$ are treated as random.
Note: In random design, the data is considered a sample from the joint probability distribution of $(X_1, \ldots, X_p, Y)$.

In standard linear regression we assume that the predictors are non-stochastic. This can either be because the data was observed in a fixed design study and the predictors are truly non-stochastic, or because the analysis is performed conditionally on the observed values, i.e.: given the specific values $x_{1,j}, \ldots, x_{n,j}$ we happen to observe from $X_j$, what is the relationship between $\rvec X$ and $Y$? In some settings, i.e. random design studies, it may be of value to consider the predictors as stochastic. We will consider this later.

1.2 Ordinary Least Squares

We assume the linear model $\rvec Y = \dmat X \cdot \gvec \beta + \gvec \epsilon$ as well as $\E{\gvec \epsilon} = \gvec 0$ and $\Cov{\gvec \epsilon}{\gvec \epsilon} = \sigma^2 \dmat I$ for the errors; we will formalize the full set of assumptions later. We want to find a good estimate of $\gvec \beta$ and therefore define the Ordinary Least Squares (OLS) estimator.

Definition 1.4 (Mean squared error): The mean squared error $\mse*{\gvec \zeta}$ w.r.t. the true value $\gvec \zeta$ is the function $\mse{\zeta}{\xi} = \E{(\zeta - \xi)^2}$
Definition 1.5 (Empirical mean squared error): Given that we have $n$ observations $\gvec \xi \in \R^n$, the empirical mean squared error $\msehat*{\gvec \zeta}$ w.r.t. the vector $\gvec \zeta \in \R^n$ of true values is the function $\msehat{\gvec \zeta}{\gvec \xi} = \frac{1}{n}\sum_{i=1}^n (\zeta_i - \xi_i)^2 = \frac{\norm{\gvec \zeta - \gvec \xi}^2}{n}$
Definition 1.6 (OLS estimator): The ordinary least squares estimator $\gvec \betahat$ is defined as $\gvec* \betahat p = \argmin_{\gvec \betatilde \in \R^p} \normbig{\rvec* Y n - \dmat* X n p \cdot \gvec* \betatilde p}^2 = \argmin_{\gvec \betatilde \in \R^p} \msehatbig{\rvec Y}{\dmat X \cdot \gvec \betatilde}$
Note: We distinguish between the least squares estimator $\gvec \betahat = \argmin \msehat*{\rvec Y}$, which is stochastic, and the resulting least squares estimate $\gvec \betacheck = \argmin \msehat*{\dvec y}$, which is deterministic and computed from a specific data sample $\dvec y$.
Proposition 1.7 (OLS closed form): Assuming that $\dmat X$ has rank $p$, the OLS estimator can be computed explicitly by $\gvec* \betahat p = \pabig{\dmattr* X n p \dmat* X n p}^{-1} \dmattr* X n p \cdot \rvec* Y n$
Proof: As $\normbig{\rvec Y - \dmat X \cdot \gvec \betatilde}^2$ is convex, we find the minimizer by setting its gradient to $\dvec 0$: $\nabla \normbig{\rvec Y - \dmat X \cdot \gvec \betatilde}^2 = -2 \dmattr X (\rvec Y - \dmat X \cdot \gvec \betatilde) \stackrel{!}{=} \dvec 0$. This yields the normal equations $\dmattr X \dmat X \cdot \gvec \betahat = \dmattr X \cdot \rvec Y$. Under the assumption that $\dmat X$ has rank $p$ the matrix $\dmattr X \dmat X \in \R^{p \times p}$ has full rank and is invertible, thus $\gvec \betahat = \pa{\dmattr X \dmat X}^{-1} \dmattr X \cdot \rvec Y$
Definition 1.8 (Residuals): We define the residuals as $R_i = Y_i - \sum_{j=1}^p \betahat_j x_{i,j}$
Note: The equivalent vector notation is $\rvec R = \rvec Y - \dmat X \cdot \gvec \betahat$.
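
The closed form of Proposition 1.7 translates directly into code. The following Python/NumPy sketch is not part of the original notes; the function name `ols_fit` and the toy data are illustrative. It computes the least squares estimate via the normal equations together with the corresponding residuals.

```python
# Sketch of the OLS closed form: beta_hat = (X^T X)^{-1} X^T Y and R = Y - X beta_hat.
import numpy as np

def ols_fit(X: np.ndarray, Y: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return the OLS estimate and the residuals for a full-rank design X."""
    # Solve the normal equations X^T X beta = X^T Y; numerically, np.linalg.lstsq
    # (QR/SVD based) is usually preferred over forming X^T X explicitly.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    residuals = Y - X @ beta_hat
    return beta_hat, residuals

# Toy example with a random full-rank design (all values are placeholders):
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.5, size=n)
beta_hat, R = ols_fit(X, Y)
```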

1.2.1 Assumptions for OLS

We list a set of assumptions under which fitting a linear model by least squares is reasonable and under which tests and confidence intervals are approximately valid. We distinguish between

  • the “OLS assumptions” which are needed for the Gauss-Markov theorem
  • the optional “normality of errors assumption” which facilitates inference
Assumption 1.9 (OLS assumptions in fixed design):
  1. Linear model: The linear model assumption is correct, i.e. $\rvec Y = \dmat X \cdot \gvec \beta + \gvec \epsilon$.
  2. Strict exogeneity: For non-stochastic predictors this reduces to $\E{\gvec \epsilon} = \gvec 0$.
  3. Homoscedasticity of errors: The errors have constant variance, i.e. $\forall i: \Var{\epsilon_i} = \sigma^2$.
  4. Independence of errors: The errors are uncorrelated, i.e. $\forall i \neq j : \Cov{\epsilon_i}{\epsilon_j} = 0$.
  5. No multicollinearity: The predictors must all be linearly independent, i.e. $\rank{\dmat X} = p$.
Note:
  • If 2. does not hold, we need corrections from errors-in-variables methods.
  • If 3. does not hold, we can use weighted least squares.
  • If 4. does not hold, we can use generalized least squares.
  • If 5. does not hold, the OLS estimator is not unique.

We emphasize that, except for the “no multicollinearity” assumption, the OLS assumptions do not impose any conditions on the predictors.

Definition 1.10 (Linear estimator): An estimator $\gvec \thetahat$ is called a linear estimator if it is a linear combination of the responses, i.e. $\thetahat_j = \sum_{i = 1}^{n} a_{i,j} Y_i$ or equivalently $\gvec \thetahat = \dmat A \rvec Y$
Definition 1.11 (Best linear unbiased estimator): Let $\setS$ be the set of all unbiased linear estimators. The best linear unbiased estimator (BLUE) is the estimator $\gvec \thetahat \in \setS$ which satisfies $\forall \dmat X \in \R^{n \times p} : \gvec \thetahat = \argmin_{\gvec{\tilde{\theta}} \in \setS} \msebig{\rvec Y}{\dmat X \gvec{\tilde{\theta}}}$
Theorem 1.12 (Gauss-Markov theorem): Under the OLS assumptions the BLUE is the OLS estimator $\gvec \betahat$.
Assumption 1.13 (Normality of errors): The errors are jointly normal, i.e. $\gvec \epsilon \sim \lawN$.
Note: OLS assumptions 2., 3., 4. and the normality of errors assumption imply $\gvec \epsilon \sim \lawN(\dvec 0, \sigma^2 \dmat I)$ and $\epsilon_i \simiid \lawN(0, \sigma^2)$.

This is an additional assumption that is not needed for OLS, but it is helpful for making statements about distributions. If normality of the errors does not hold, we can still use least squares, but we might prefer robust methods that are less sensitive to outliers. Note that the normal distribution is just one of the possible distributions that can be assumed for the errors.

1.2.2 Geometrical Interpretation

The vector of response variables $\rvec Y$ is a random vector in $\R^n$ and $\setX = \setmid{\dmat X \cdot \gvec \beta}{\gvec \beta \in \R^p}$ describes the $p$-dimensional subspace $\setX \subset \R^n$ spanned by the predictor columns $\dvec x _{:,j}$. The least squares estimator $\gvec \betahat$ is then such that $\dmat X \cdot \gvec \betahat$ is closest to $\rvec Y$ with respect to the Euclidean distance. We denote the vector of fitted values by $\rvec \Yhat = \dmat X \cdot \gvec \betahat$.

Recap (Orthogonal projection matrix): A square matrix $\dmat P$ is called an orthogonal projection matrix if $\dmat P ^2 = \dmat P = \dmattr P$.
Proposition 1.14 (Interpretation of fitted values): The vector of fitted values $\rvec \Yhat$ is the orthogonal projection of $\rvec Y$ onto the $p$-dimensional subspace $\setX$.
Proof: Since $\rvec \Yhat = \dmat X \cdot \gvec \betahat = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X \cdot \rvec Y$, the map $\rvec Y \mapsto \rvec \Yhat$ can be represented by the matrix $\dmat* P n n = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X$. It is evident that $\dmattr P = \dmat P$. It remains to check $\dmat P ^2 = \dmat P$: $\dmat P ^2 = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X$

The vector of residuals is then defined by $\rvec R = \rvec Y - \rvec \Yhat$. Geometrically, the residuals are orthogonal to $\setX$ because $\rvec \Yhat$ is the orthogonal projection of $\rvec Y$ onto $\setX$. This means that $\forall j: \rvectr R \dvec x _{:,j} = 0$ or equivalently $\dmattr X \rvec R = \dvec 0$.

Proposition 1.15 (Interpretation of residuals): The vector of residuals $\rvec R$ is the orthogonal projection of $\rvec Y$ onto the $(n{-}p)$-dimensional orthogonal complement $\setX^{\bot}$ of $\setX$ in $\R^n$.
Proof: Since $\rvec R = \rvec Y - \rvec \Yhat = \rvec Y - \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X \cdot \rvec Y$, the map $\rvec Y \mapsto \rvec R$ can be represented by the matrix $\dmat* I n n - \dmat* P n n = \dmat I - \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X$ which is an orthogonal projection matrix since $(\dmat I - \dmat P)^\tr = \dmat I - \dmat P$ and $(\dmat I - \dmat P)^2 = \dmat I ^2 - 2 \dmat I \dmat P + \dmat P ^2 = \dmat I - \dmat P$
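
The projection point of view can be checked numerically. The sketch below is not part of the notes; NumPy is assumed and the data is arbitrary. It verifies that the hat matrix $\dmat P = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X$ is symmetric and idempotent and that the residuals are orthogonal to every column of $\dmat X$.

```python
# Numerical check (illustrative) of Propositions 1.14 and 1.15.
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # orthogonal projection onto span of X
Y_hat = P @ Y                          # fitted values
R = Y - Y_hat                          # residuals, i.e. (I - P) Y

assert np.allclose(P, P.T)             # symmetry
assert np.allclose(P @ P, P)           # idempotence
assert np.allclose(X.T @ R, 0.0)       # residuals orthogonal to the columns of X
```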

1.2.3 Distributional Properties

We point out that the least squares estimator $\gvec \betahat$ is a random vector: for new samples from the same data-generating mechanism for $\rvec Y$, the data $\dvec y$ would look different every time, and hence so would the least squares estimate $\gvec \betacheck$. We analyze the first two moments of the random vectors and variables.

Proposition 1.16 (First moments of the least squares estimator):
  1. $\E{\gvec \betahat} = \gvec \beta$, i.e. $\gvec \betahat$ is an unbiased estimator.
  2. $\Cov{\gvec \betahat}{\gvec \betahat} = \sigma^2 (\dmattr X \dmat X)^{-1}$.
Proof: $$\begin{align*} \E{\gvec \betahat} &= \pa{\dmattr X \dmat X}^{-1} \dmattr X \cdot \E{\rvec Y} \\ &= \pa{\dmattr X \dmat X}^{-1} \dmattr X \cdot \E{\dmat X \cdot \gvec \beta + \gvec \epsilon} \\ &= \gvec \beta + \pa{\dmattr X \dmat X}^{-1} \dmattr X \cdot \E{\gvec \epsilon} \\ &= \gvec \beta \end{align*}$$ and $$\begin{align*} \Cov{\gvec \betahat}{\gvec \betahat} &= \E{\gvec \betahat \gvectr \betahat} - \E{\gvec \betahat} \E{\gvec \betahat}^\tr \\ &= \pa{\dmattr X \dmat X}^{-1} \dmattr X \E{(\dmat X \cdot \gvec \beta + \gvec \epsilon)\pa{\gvectr \beta \cdot \dmattr X + \gvectr \epsilon}} \cdot \dmat X \pa{\dmattr X \dmat X}^{-1} - \gvec \beta \gvectr \beta \\ &= \pa{\dmattr X \dmat X}^{-1} \dmattr X \cdot \Cov{\gvec \epsilon}{\gvec \epsilon} \cdot \dmat X \pa{\dmattr X \dmat X}^{-1} \\ &= \sigma^2 \pa{\dmattr X \dmat X}^{-1} \dmattr X \dmat X \pa{\dmattr X \dmat X}^{-1} \\ &= \sigma^2 \pa{\dmattr X \dmat X}^{-1} \end{align*}$$
Note: We write $\sigma^2_j = \sigma^2 (\dmattr X \dmat X)^{-1}_{[j,j]}$ for the variance and $\sigma_j = \sqrt{\sigma^2_j}$ for the standard deviation of $\betahat_j$.
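
A quick Monte Carlo experiment makes Proposition 1.16 tangible: repeatedly simulating responses for one fixed design and averaging the resulting estimates should approximately recover $\gvec \beta$ and the covariance $\sigma^2 (\dmattr X \dmat X)^{-1}$. This Python sketch is illustrative only; all numerical choices are arbitrary.

```python
# Monte Carlo sketch (illustrative) of unbiasedness and covariance of the OLS estimator.
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 40, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed design
beta = np.array([2.0, -1.0, 0.5])

n_sim = 20_000
estimates = np.empty((n_sim, p))
for s in range(n_sim):
    Y = X @ beta + rng.normal(0.0, sigma, size=n)     # new errors, same design
    estimates[s] = np.linalg.solve(X.T @ X, X.T @ Y)  # OLS estimate for this sample

print(estimates.mean(axis=0))             # close to beta (unbiasedness)
print(np.cov(estimates, rowvar=False))    # close to sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))  # theoretical covariance for comparison
```
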
Proposition 1.17 (First moments of fitted values):
  1. $\E{\rvec \Yhat} = \dmat X \gvec \beta$.
  2. $\Cov{\rvec \Yhat}{\rvec \Yhat} = \sigma^2 \dmat P$.
Proof: $\E{\rvec \Yhat} = \E{\dmat X \gvec \betahat} = \dmat X \gvec \beta$ and $$\begin{align*} \Cov{\rvec \Yhat}{\rvec \Yhat} &= \E{\rvec \Yhat \rvectr \Yhat} - \E{\rvec \Yhat} \E{\rvec \Yhat}^\tr \\ &= \dmat X \cdot \E{\gvec \betahat \gvectr \betahat} \cdot \dmattr X - \dmat X \gvec \beta \gvectr \beta \dmattr X \\ &= \dmat X \cdot \pa{\Cov{\gvec \betahat}{\gvec \betahat} + \gvec \beta \gvectr \beta} \cdot \dmattr X - \dmat X \gvec \beta \gvectr \beta \dmattr X \\ &= \sigma^2 \dmat X \pa{\dmattr X \dmat X}^{-1} \dmattr X \\ &= \sigma^2 \dmat P \end{align*}$$
Note: The first two moments of the vector of response variables are $\E{\rvec Y} = \dmat X \gvec \beta$ and $\Cov{\rvec Y}{\rvec Y} = \sigma^2 \dmat I$. Furthermore we have $\Cov{\rvec Y}{\rvec \Yhat} = \sigma^2 \dmat P$.
Proposition 1.18 (First moments of residuals):
  1. $\E{\rvec R} = \dvec 0$.
  2. $\Cov{\rvec R}{\rvec R} = \sigma^2 (\dmat I - \dmat P)$.
Proof: From the first moments of $\rvec Y$ and $\rvec \Yhat$ we get $\E{\rvec R} = \E{\rvec Y} - \E{\rvec \Yhat} = \dmat X \gvec \beta - \dmat X \gvec \beta = \dvec 0$. Further $$\begin{align*} \Cov{\rvec R}{\rvec R} &= \E{\rvec R \rvectr R} \\ &= \E{(\rvec Y - \rvec \Yhat) (\rvec Y - \rvec \Yhat)^\tr} \\ &= \E{\rvec Y \rvectr Y - \rvec Y \rvectr \Yhat - \rvec \Yhat \rvectr Y + \rvec \Yhat \rvectr \Yhat} \\ &= \E{\rvec Y \rvectr Y} - 2 \Cov{\rvec Y}{\rvec \Yhat} - 2 \E{\rvec Y} \E{\rvec \Yhat}^\tr + \E{\rvec \Yhat \rvectr \Yhat} \\ &= \E{\rvec Y \rvectr Y} - 2 \Cov{\rvec Y}{\rvec \Yhat} - 2 \dmat X \gvec \beta \gvectr \beta \dmattr X + \E{\rvec \Yhat \rvectr \Yhat} \\ &= \Cov{\rvec Y}{\rvec Y} - 2\Cov{\rvec Y}{\rvec \Yhat} + \Cov{\rvec \Yhat}{\rvec \Yhat} \\ &= \sigma^2 \dmat I - 2 \sigma^2 \dmat P + \sigma^2 \dmat P \\ &= \sigma^2 (\dmat I - \dmat P) \end{align*}$$

For the derivation of the moments we did not need to assume any specific distribution of the errors $\gvec \epsilon$. Using the normality of errors assumption $\gvec \epsilon \sim \lawN$, we can derive the specific distributions of the random vectors and variables.

Proposition 1.19 (Normal distributions):
  • $\gvec \betahat \sim \overdim \lawN p \of{\gvec \beta, \sigma^2 (\dmattr X \dmat X)^{-1}}$
  • $\rvec Y \sim \overdim \lawN n \of{\dmat X \gvec \beta, \sigma^2 \dmat I}$
  • $\rvec \Yhat \sim \overdim \lawN n \of{\dmat X \gvec \beta, \sigma^2 \dmat P}$
  • $\rvec R \sim \overdim \lawN n \of{\dvec 0, \sigma^2 (\dmat I - \dmat P)}$
Proof: The distributions follow from $\gvec \epsilon \sim \lawN$ and from the fact that any affine transformation of a multivariate normal random vector is normally distributed as well.
Note: The normality of errors assumption is often not fulfilled in practice. We can then rely on the central limit theorem, which implies that for large sample size $n$ the distributions are still approximately normal. However, it is often much better to use robust methods in case of non-Gaussian errors, which are not discussed here.

1.2.4 Error Variance Estimator

We introduce an estimator for the variance of the errors.

Definition 1.20 (Error variance estimator): The estimator for the error variance $\sigma^2$ is $\sigmahat^2 = \frac{1}{n-p} \sum_{i=1}^n R_i^2$ in the least squares setting.
Proposition 1.21 (Estimator is unbiased): The error variance estimator $\sigmahat^2$ is unbiased, i.e. $\E{\sigmahat^2} = \sigma^2$.
Proof: $$\begin{align*} \E{\sigmahat^2} &= \E{\frac{1}{n-p} \sum_{i=1}^n R_i^2} \\ &= \frac{1}{n-p} \sum_{i=1}^n \Var{R_i} \\ &= \frac{\sigma^2}{n-p} \sum_{i=1}^n (1 - \dmat P _{[i,i]}) \\ &= \frac{\sigma^2}{n-p} (n - \trace{\dmat P}) \\ &= \frac{\sigma^2}{n-p} (n - p) \\ &= \sigma^2 \end{align*}$$
Proposition 1.22 (Distribution): The distribution of the error variance estimator is $\sigmahat^2 \sim \frac{\sigma^2}{n-p} \lawChi{n-p}$
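
In code, the error variance estimator is just the residual sum of squares divided by $n-p$. The following sketch is illustrative and not from the notes; it assumes a full-rank design matrix `X` and response vector `Y` as in the earlier snippets.

```python
# Sketch of the error variance estimator sigma_hat^2 = ||R||^2 / (n - p).
import numpy as np

def error_variance(X: np.ndarray, Y: np.ndarray) -> float:
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    R = Y - X @ beta_hat
    return float(R @ R / (n - p))  # unbiased for sigma^2 by Proposition 1.21
```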

1.3 Tests and Confidence Intervals

1.3.1 Test of Regression Coefficients

We impose the OLS and normality of errors assumptions. As we have seen before, the estimator $\gvec \betahat$ is normally distributed. If we are interested in whether the $j$-th predictor variable is relevant, we can test the null-hypothesis $H_{0,j} : \beta_j = 0$ against the alternative $H_{A,j} : \beta_j \neq 0$.

Recap ($\statT$-distribution): Let $Z \sim \lawN(0,1)$ and $V \sim \lawChi m$. The random variable $T = \frac{Z}{\sqrt{m^{-1} V}}$ follows the $\statT$-distribution with $m$ degrees of freedom, i.e. $T \sim \lawT m$.
Definition 1.23 ($\statT$-statistic): Under the null-hypothesis $H_{0,j} : \beta_j = 0$ we have $T_j = \frac{\betahat_j}{\sqrt{\sigmahat^2 \pabig{\dmattr X \dmat X}^{-1}_{[j,j]}}} = \frac{\betahat_j}{\sigmahat_j} \sim \lawT{n-p}$. We call $T_j$ the $\statT$-statistic of the $j$-th predictor variable.
Proof: We know $\betahat_j \sim \lawN\pa{\beta_j, \sigma^2 \pabig{\dmattr X \dmat X}^{-1}_{[j,j]}}$, that $\sigmahat^2 \sim \frac{\sigma^2}{n-p} \lawChi{n-p}$ and that $\betahat_j$ and $\sigmahat^2$ are independent; thus, under $H_{0,j}$, we have $\frac{\frac{\betahat_j}{\sqrt{\sigma^2 \pabig{\dmattr X \dmat X}^{-1}_{[j,j]}}}}{\sqrt{(n-p)^{-1} \frac{n-p}{\sigma^2} \sigmahat^2}} = \frac{\betahat_j}{\sqrt{\sigmahat^2 \pabig{\dmattr X \dmat X}^{-1}_{[j,j]}}} \sim \lawT{n-p}$

The test corresponding to the $\statT$-statistic is called the $\statT$-test. In practice, we can thus quantify the relevance of individual predictor variables by looking at the size of the test-statistics $T_1, \ldots, T_p$ or at the corresponding $\p$-values, which may be more informative.

Note: Given we observe the $\statT$-statistic $t_j$ for some sample of response variables $\dvec y$, the $\p$-value of the $\statT$-test is $\p_{t_j} = 2 \probP{T \geq \abs{t_j}}$ where $T \sim \lawT{n-p}$.
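
The $\statT$-statistics and their two-sided $\p$-values can be computed directly from the formulas above. The sketch below is illustrative; it assumes NumPy and SciPy, with `X` a full-rank design matrix and `Y` the responses, and mirrors what a standard regression summary reports.

```python
# Sketch of the t-statistics T_j and two-sided p-values of Definition 1.23.
import numpy as np
from scipy import stats

def t_tests(X: np.ndarray, Y: np.ndarray):
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    R = Y - X @ beta_hat
    sigma2_hat = R @ R / (n - p)                 # error variance estimate
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))  # sigma_hat_j for each coefficient
    t_stats = beta_hat / se                      # T_j, distributed t_{n-p} under H_{0,j}
    p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)
    return t_stats, p_values
```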

The problem with looking at individual tests $H_{0,j}$ is, besides the multiple testing problem in general, that all individual tests may fail to reject their null-hypotheses even though some predictor variables do have a significant effect. This can occur because of correlation among the predictor variables. We clarify how to interpret the results of the $\statT$-tests.

Note (Interpretation of $\statT$-tests): An individual $\statT$-test for $H_{0,j}$ should be interpreted as quantifying the effect of the $j$-th predictor variable after having subtracted the linear effect of all other predictor variables $k \neq j$ on $\rvec Y$.

Similarly to the $\statT$-tests, one can derive confidence intervals for the unknown parameter $\beta_j$.

Definition 1.24 (Confidence interval): The interval given by $\betahat_j \pm \sigmahat_j \cdot f_{\lawT{n-p}}\of{1-\frac{\alpha}{2}}$, where $f_{\lawT{n-p}}$ denotes the quantile function of the $\lawT{n-p}$ distribution, is the two-sided confidence interval $\setC_{j}$ which covers the true $\beta_j$ with probability $1 - \alpha$, i.e. $\probP{\beta_j \in \setC_{j}} = 1 - \alpha$.
Proof: Recall that $\sigmahat_j = \sqrt{\sigmahat^2 \pabig{\dmattr X \dmat X}^{-1}_{[j,j]}}$. Given $\beta_j$ we have $$\begin{align*} & \frac{\betahat_j - \beta_j}{\sigmahat_j} \sim \lawT{n-p} \\ \implies \quad & \probP{-f_{\lawT{n-p}}\of{1-\frac{\alpha}{2}} < \frac{\betahat_j - \beta_j}{\sigmahat_j} < f_{\lawT{n-p}}\of{1-\frac{\alpha}{2}}} = 1 - \alpha \\ \implies \quad & \probP{\betahat_j - \sigmahat_j \cdot f_{\lawT{n-p}}\of{1-\frac{\alpha}{2}} < \beta_j < \betahat_j + \sigmahat_j \cdot f_{\lawT{n-p}}\of{1-\frac{\alpha}{2}}} = 1 - \alpha \end{align*}$$
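
A corresponding sketch for the two-sided confidence intervals (illustrative only; the helper name and the default level are arbitrary choices) uses the $1-\frac{\alpha}{2}$ quantile of the $\lawT{n-p}$ distribution from SciPy.

```python
# Sketch of the confidence intervals beta_hat_j +/- sigma_hat_j * t_{n-p, 1-alpha/2}.
import numpy as np
from scipy import stats

def confidence_intervals(X: np.ndarray, Y: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    R = Y - X @ beta_hat
    se = np.sqrt((R @ R / (n - p)) * np.diag(XtX_inv))  # sigma_hat_j
    q = stats.t.ppf(1 - alpha / 2, df=n - p)            # quantile of t_{n-p}
    return np.column_stack([beta_hat - q * se, beta_hat + q * se])
```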

1.3.2 Test of Global Null-Hypothesis

To test whether there exists any effect from the predictor variables, we can look at the global null-hypothesis $H_0 : \beta_2 = \ldots = \beta_p = 0$ versus the alternative hypothesis $H_A : \pabig{\exists j \in \set{2,\ldots,p} : \beta_j \neq 0}$. Such a test can be developed with an analysis of variance decomposition which takes a simple form for this special case.

Note: We denote with $\rvec \Yline$ the vectorized global arithmetic mean, i.e. $\rvec \Yline = \Yline \cdot \dvec 1$ where $\Yline = n^{-1} \sum_{i=1}^n Y_i$.
Definition 1.25 (ANOVA decomposition): The total variation $\norm{\rvec Y - \rvec \Yline}^2$ of $Y_i$ around the mean $\Yline$ can be decomposed as $\norm{\rvec Y - \rvec \Yline}^2 = \normbig{\rvec \Yhat - \rvec \Yline}^2 + \normbig{\rvec Y - \rvec \Yhat}^2$ where $\normbig{\rvec \Yhat - \rvec \Yline}^2$ is the variation of the fitted values $\Yhat_i$ around the mean $\Yline$, and $\normbig{\rvec Y - \rvec \Yhat}^2 = \norm{\rvec R}^2$ is the variation of the residuals.
Proof: Recall that our linear model includes the intercept, i.e. $\dmat X_{[:,1]} = \dvec 1$, and recall $\setX = \setmid{\dmat X \cdot \gvec \beta}{\gvec \beta \in \R^p}$. Let $\gvec{\tilde{\beta}}$ with $\tilde{\beta}_1 = \line{Y}$ and $\forall j \neq 1: \tilde{\beta}_j = 0$, then $\dmat X \cdot \gvec{\tilde{\beta}} = \sum_{j=1}^p \tilde{\beta}_j \cdot \dmat X_{[:,j]} = \tilde{\beta}_1 \cdot \dmat X_{[:,1]} = \Yline \cdot \dvec 1$ hence $\rvec \Yline \in \setX$ and $\rvec \Yhat - \rvec \Yline \in \setX$. This implies $\rvec \Yhat - \rvec \Yline \perp \rvec R$ as $\rvec R \in \setX^{\bot}$. The decomposition of the total squared error $\norm{\rvec Y - \rvec \Yline}^2$ then follows from Pythagoras applied to $\rvec Y - \rvec \Yline = \pa{\rvec \Yhat - \rvec \Yline} + \rvec R$.

The idea now is to look at the ratio between the variation of the fitted values $\Yhat_i$ around the mean $\Yline$ and the variance of the residuals $\sigmahat^2$. This is a scale-free quantity.

Definition 1.26 ($\statF$-statistic): Under the global null-hypothesis $H_0 : \beta_2 = \ldots = \beta_p = 0$ we have the distribution $F = \frac{\normbig{\rvec \Yhat - \rvec \Yline}^2}{\normbig{\rvec Y - \rvec \Yhat}^2} \cdot \frac{n-p}{p-1} = \frac{\normbig{\rvec \Yhat - \rvec \Yline}^2}{(p-1) \cdot \sigmahat^2} \sim \lawF{p-1}{n-p}$. We call $F$ the $\statF$-statistic.

The corresponding test is called the $\statF$-test.

Note: Given we observe the $\statF$-statistic $f$ for some sample of response variables $\dvec y$, the $\p$-value of the $\statF$-test is $\p_{f} = \probP{F \geq f}$ where $F \sim \lawF{p-1}{n-p}$.
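
A minimal sketch of the global $\statF$-test follows; it is illustrative and assumes that the first column of `X` is the intercept column of ones, matching the convention of these notes.

```python
# Sketch of the F-statistic of Definition 1.26 and its p-value.
import numpy as np
from scipy import stats

def f_test(X: np.ndarray, Y: np.ndarray):
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    Y_hat = X @ beta_hat
    rss = np.sum((Y - Y_hat) ** 2)          # ||Y - Y_hat||^2, residual variation
    ess = np.sum((Y_hat - Y.mean()) ** 2)   # ||Y_hat - Y_bar||^2, explained variation
    F = (ess / (p - 1)) / (rss / (n - p))
    p_value = stats.f.sf(F, p - 1, n - p)   # P(F_{p-1, n-p} >= f)
    return F, p_value
```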

Besides performing a global $\statF$-test to quantify the statistical significance of the predictor variables, we often want to describe the goodness of fit of the linear model for explaining the data. A meaningful quantity is the coefficient of determination.

Definition 1.27 (Coefficient of determination): The proportion of the total variation of $\rvec Y$ around $\rvec \Yline$ explained by the regression is $R^2 = \frac{\normbig{\rvec \Yhat - \rvec \Yline}^2}{\normbig{\rvec Y - \rvec \Yline}^2}$. We call $R^2$ the coefficient of determination.
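
The coefficient of determination follows from the same quantities as the $\statF$-statistic; a short sketch (illustrative, assuming fitted values `Y_hat` computed as above):

```python
# Sketch of R^2 = ||Y_hat - Y_bar||^2 / ||Y - Y_bar||^2.
import numpy as np

def r_squared(Y: np.ndarray, Y_hat: np.ndarray) -> float:
    ess = np.sum((Y_hat - Y.mean()) ** 2)  # variation explained by the regression
    tss = np.sum((Y - Y.mean()) ** 2)      # total variation around the mean
    return float(ess / tss)
```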

1.4 Check of Model Assumptions

The residuals $R_i = Y_i - \Yhat_i$ can serve as approximations of the unobservable error terms $\epsilon_i$ and can be used to check whether the linear model is appropriate.

1.4.1 The Tukey-Anscombe Plot

The Tukey-Anscombe plot is a graphical tool: we plot the residuals $R_i$ against the fitted values $\Yhat_i$, i.e. $\setmid{(\Yhat_i, R_i)}{i \in \set{1,\ldots, n}}$

Under the OLS assumptions we have $\rvec R \perp \rvec \Yhat$ and thus $\Cor{\rvec R}{\rvec \Yhat} = \dmat 0$. Hence, any pattern we might see in the plot must be evidence of a model violation: a curve, for example, points to non-linearity in the data, while a fan shape points to non-constant variance. In case of a systematic relationship in the Tukey-Anscombe plot, a transformation of the response, such as the log-transform $\log{\rvec Y}$ or the square root transformation $\sqrt{\rvec Y}$, might help to stabilize the variance.
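
A Tukey-Anscombe plot is easy to produce once fitted values and residuals are available; the following matplotlib sketch is illustrative (the function name and styling are arbitrary).

```python
# Sketch of a Tukey-Anscombe plot: residuals against fitted values.
import matplotlib.pyplot as plt

def tukey_anscombe_plot(Y_hat, R):
    plt.scatter(Y_hat, R)
    plt.axhline(0.0, linestyle="--")  # reference line at zero
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.title("Tukey-Anscombe plot")
    plt.show()
```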

1.4.2 The Normal Plot

Assumptions about the distribution of random variables can be graphically checked with the quantile-quantile (QQ) plot. In the special case of checking for the normal distribution, the QQ plot is also referred to as the normal plot.

Note: We write $q_{\lawN(0,1)}(k)$ for the $k$-quantile of $\lawN(0,1)$ and $\hat{q}_{\rvec R}(k)$ for the corresponding empirical quantile derived from the empirical distribution of the residuals.

To check the normality of errors assumption, we plot the empirical quantiles of the residuals $\hat{q}_{\rvec R}(k)$ against the theoretical quantiles of the normal distribution $q_{\lawN(0,1)}(k)$, i.e. $\setmid{\pa{q_{\lawN(0,1)}\of{\frac{m}{n+1}}, \hat{q}_{\rvec R}\of{\frac{m}{n+1}}}}{m \in \set{1, \ldots, n}}$

If the residuals were distributed as $\iid$ $\lawN(0,1)$, the normal plot would exhibit an approximately straight line with intercept $0$ and slope $1$.
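
One way to draw such a normal plot is via `scipy.stats.probplot`, which compares empirical against theoretical quantiles and adds a reference line; the sketch below is illustrative.

```python
# Sketch of a normal (QQ) plot of the residuals R against N(0, 1) quantiles.
import matplotlib.pyplot as plt
from scipy import stats

def normal_plot(R):
    stats.probplot(R, dist="norm", plot=plt)  # QQ plot with fitted reference line
    plt.title("Normal plot of residuals")
    plt.show()
```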

1.4.3 Plot for Detecting Serial Correlation

For checking independence of the errors we plot the residuals $R_i$ against the observation number $i$, i.e. $\setmid{(i, R_i)}{i \in \set{1, \ldots, n}}$

If the residuals vary randomly around the zero line, there are no indications for serial correlations among the errors $\epsilon_i$. On the other hand, if neighboring residuals look similar, the independence assumption for the errors seems violated.
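
A sketch of such an index plot (illustrative; matplotlib assumed):

```python
# Sketch of a plot of residuals against observation number to spot serial correlation.
import matplotlib.pyplot as plt
import numpy as np

def serial_correlation_plot(R):
    plt.plot(np.arange(1, len(R) + 1), R, marker="o", linestyle="-")
    plt.axhline(0.0, linestyle="--")  # zero line for reference
    plt.xlabel("observation number")
    plt.ylabel("residuals")
    plt.show()
```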

1.5 Generalizing the Linear Model

1.5.1 Random Design

In random design we assume that the predictors are stochastic.

Assumption 1.28 (Linear model in random design): The linear model is the assumption that for each observation $i \in \set{1, \ldots, n}$ the response variable is a linear function of the predictors up to some error: $Y_i = \sum_{j=1}^p \beta_j X_{i,j} + \epsilon_i$

We can now make sense of the strict exogeneity assumption.

Assumption 1.29 (Strict exogeneity): Conditionally on observing the data, the expectation of the error term is zero, i.e. $\E{\epsilon_i \mid \rvec X_i = \dvec{x}_i} = 0$.

The OLS assumptions in the random design framework are as follows.

Assumption 1.30 (OLS assumptions in random design):
  1. Linear model: The linear model assumption is correct, i.e. $\rvec Y = \rmat X \cdot \gvec \beta + \gvec \epsilon$.
  2. Strict exogeneity: $\E{\epsilon_i \mid \rvec X_i = \dvec{x}_i} = 0$.
  3. Homoscedasticity of errors: The errors have constant variance, i.e. $\forall i: \Var{\epsilon_i} = \sigma^2$.
  4. Independence of errors: The errors are uncorrelated, i.e. $\forall i \neq j : \Cov{\epsilon_i}{\epsilon_j} = 0$.
  5. No multicollinearity: The predictors must all be linearly independent, i.e. $\probP{\rank{\rmat X} = p} = 1$.

1.5.2 Regression Model

The linear model is a special case of the regression model.

Assumption 1.31 (Regression model in random design): Let $m : \R^p \to \R$ be a function. The regression model is the assumption that for each observation $i \in \set{1, \ldots, n}$ the response variable has the following relationship to the predictors: $Y_i = m(\rvec X_i) + \epsilon_i$

Under strict exogeneity we can write the function $m$ as $m(\dvec x_i) = \E{Y_i \mid \rvec X_i = \dvec x_i}$.