Linear regression is a widely used statistical model in a broad variety of applications. It is one of the simplest models with which to demonstrate important aspects of statistical modelling.
1.1 The Linear Model
Definition 1.1 (Linear model): For each observation $i \in \{1,\dots,n\}$, let $Y_i$ be the response variable and $x_{i,1},\dots,x_{i,p}$ be the predictors. In the linear model the response variable is a linear function of the predictors up to some error $\epsilon_i$:
$$Y_i = \sum_{j=1}^{p} \beta_j x_{i,j} + \epsilon_i$$
where ∀i:E[εi]=0.
Note: We usually, but not necessarily always, assume:
∀i:Var[εi]=σ2
ε1,…,εn are iid
We call n the sample size and p the number of predictors. The goal is to estimate the parameters β1,…,βp, to study their relevance and to estimate the error variance. The parameters βj and distribution parameters of εi, e.g. σ2, are unknown. The errors εi are unobservable, while the response variables Yi and the predictors xi,j are given. We can rewrite the model using the vector notation
$$\underset{n}{Y} = \underset{n \times p}{X} \cdot \underset{p}{\beta} + \underset{n}{\epsilon}$$
where Y∈Rn is the random vector of response variables, X∈Rn×p is the matrix of predictors, β∈Rp is the vector of unknown parameters and ε∈Rn is the random vector of errors. We typically assume that the sample size is larger than the number of predictors, i.e. n>p, and that the matrix X has full rank p.
Note: We use the notation
$$\underset{p}{x_i} = \begin{pmatrix} x_{i,1} \\ \vdots \\ x_{i,p} \end{pmatrix} = \underset{n\times p}{X}[i,:]$$
for a single observation i∈{1,…,n} of all the p predictors and
$$\underset{n}{x_{:,j}} = \begin{pmatrix} x_{1,j} \\ \vdots \\ x_{n,j} \end{pmatrix} = \underset{n\times p}{X}[:,j]$$
for all n observations of a single predictor j∈{1,…,p}.
Note: To model an intercept, we set the first predictor variable to be a constant, i.e. $x_{:,1} = 1$, resulting in the model $Y_i = \beta_1 + \sum_{j=2}^{p} \beta_j x_{i,j} + \epsilon_i$.
Note (On stochastic models): The linear model involves some stochastic components: the error terms εi are random variables and hence the response variables Yi as well. The predictor variables xi,j are assumed to be non-random. However, in some applications, which are not discussed in this course, it is more appropriate to treat the predictor variables as random. The stochastic nature of the error terms εi can be assigned to various sources, e.g. measurement errors or the inability to capture all underlying non-systematic effects.
Example (Regression through the origin): $Y_i = \beta x_i + \epsilon_i$
Example (Simple linear regression): $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$
Example (Transformed predictors): $Y_i = \beta_1 + \beta_2 \log(x_{i,2}) + \beta_3 \sin(x_{i,3}) + \epsilon_i$
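As a minimal illustration of how such models are cast in the matrix form Y = X⋅β + ε, the following sketch builds the design matrix for the transformed-predictors example, with a constant first column for the intercept. The raw predictor values are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical raw data: n = 5 observations of two raw predictor variables.
n = 5
raw = np.array([[1.2, 0.3],
                [2.5, 1.1],
                [0.7, 2.0],
                [3.1, 0.8],
                [1.9, 1.5]])

# Design matrix for Y_i = beta_1 + beta_2*log(x_{i,2}) + beta_3*sin(x_{i,3}) + eps_i:
# a column of ones (intercept), the log-transformed and the sine-transformed predictor.
X = np.column_stack([np.ones(n), np.log(raw[:, 0]), np.sin(raw[:, 1])])
print(X.shape)  # (n, p) with p = 3
```

Note that the model remains linear in the parameters even though the predictors enter through non-linear transformations; only the columns of X change.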
1.2 Least Squares
We assume the linear model Y=X⋅β+ε and want to find a good estimate of β. We define the Ordinary Least Squares estimator.
Definition 1.2 (OLS estimator): The least squares estimator β^ is defined as
$$\underset{p}{\hat\beta} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\ \Big\| \underset{n}{Y} - \underset{n\times p}{X} \cdot \underset{p}{\beta} \Big\|^2$$
Assuming that X has rank p, the minimizer can be computed explicitly by
$$\underset{p}{\hat\beta} = \Big( \underset{p\times n}{X^\top} \underset{n\times p}{X} \Big)^{-1} \underset{p\times n}{X^\top} \cdot \underset{n}{Y}$$
Note: We distinguish between the least squares estimator, β^, which is a random vector, and the resulting least squares estimate, b^, which is deterministic and computed from a specific data sample y.
Proof: As $\|Y - X\beta\|^2$ is convex, we find the minimizer by setting its gradient to zero:
$$\nabla \|Y - X\beta\|^2 = -2 X^\top (Y - X\beta) \overset{!}{=} 0$$
This yields the normal equations
X⊤X⋅β^=X⊤⋅Y
Under the assumption that X has rank p the matrix X⊤X∈Rp×p has full rank and is invertible, thus
β^=(X⊤X)−1X⊤⋅Y
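A minimal numerical sketch of the estimator on simulated data (all names and values below are illustrative): solving the normal equations and using numpy's least-squares routine yield the same $\hat\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X'X beta = X'y (assumes X has full rank p).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically more stable route via a least-squares solver.
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_hat_lstsq)

residuals = y - X @ beta_hat  # R = Y - X beta_hat
```

In practice one avoids forming $(X^\top X)^{-1}$ explicitly; QR- or SVD-based solvers such as `np.linalg.lstsq` are numerically preferable.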
Definition 1.3 (Residuals): We define the residuals as
$$R_i = Y_i - \sum_{j=1}^{p} \hat\beta_j x_{i,j}$$
Note: The equivalent vector notation is R=Y−X⋅β^.
1.2.1 Assumptions
We need some assumptions so that fitting a linear model by least squares is reasonable and that tests and confidence intervals are approximately valid. We distinguish between the OLS assumptions needed for estimation guarantees and the additional normality of errors assumption needed for inference.
Proposition 1.4 (OLS assumptions):
1. The linear model is correct, i.e. Y=X⋅β+ε with E[ε]=0.
2. All xi are exact, i.e. they can be observed perfectly.
3. The errors are homoscedastic, i.e. ∀i: Var[εi]=σ².
4. The errors are uncorrelated, i.e. ∀i≠j: Cov[εi,εj]=0.
Proposition 1.5 (Normality of error assumption):
5. The errors are jointly normal, i.e. ε∼N.
Note:
Assumptions 1., 3. and 4. together with 5. imply ε∼N(0,σ²I), i.e. ε1,…,εn are iid N(0,σ²).
If 2. does not hold, we need corrections from errors in variables methods.
If 3. does not hold, we can use weighted least squares.
If 4. does not hold, we can use generalized least squares.
If 5. does not hold, we can still use least squares but, as OLS is sensitive to outliers, we might prefer robust methods instead.
We emphasize that we do not make any assumptions on the predictor variables, except that the matrix X has full rank p. This ensures that there is no perfect multicollinearity among the predictors, i.e. x:,j are linearly independent.
1.2.2 Geometrical Interpretation
The vector of response variables $Y$ is a random vector in $\mathbb{R}^n$, and $\mathcal{X} = \{X\beta : \beta \in \mathbb{R}^p\}$ describes the $p$-dimensional subspace $\mathcal{X} \subset \mathbb{R}^n$ spanned by the columns $x_{:,j}$. The least squares estimator $\hat\beta$ is then such that $X\hat\beta$ is closest to $Y$ with respect to the Euclidean distance. We denote the vector of fitted values by $\hat{Y} = X\hat\beta$.
Recap (Orthogonal projection matrix): A square matrix $P$ is called an orthogonal projection matrix if $P^2 = P = P^\top$.
Proposition 1.6 (Interpretation of fitted values): The vector of fitted values $\hat{Y}$ is the orthogonal projection of $Y$ onto the $p$-dimensional subspace $\mathcal{X}$.
Proof: Since Y^=X⋅β^=X(X⊤X)−1X⊤⋅Y, the map Y↦Y^ can be represented by the matrix
$$\underset{n\times n}{P} = X (X^\top X)^{-1} X^\top$$
It is evident that $P^\top = P$. It remains to check $P^2 = P$:
$$P^2 = X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top$$
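A small numerical sanity check of this proposition on simulated data (setup purely illustrative): the projection matrix $P$ is symmetric and idempotent, and $PY$ reproduces the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection ("hat") matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

assert np.allclose(P, P.T)        # P is symmetric
assert np.allclose(P @ P, P)      # P is idempotent
assert np.allclose(P @ y, y_hat)  # P projects Y onto the fitted values
```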
The vector of residuals is then defined by $R = Y - \hat{Y}$. Geometrically, the residuals are orthogonal to $\mathcal{X}$ because $\hat{Y}$ is the orthogonal projection of $Y$ onto $\mathcal{X}$. This means that $\forall j: R^\top x_{:,j} = 0$, or equivalently $X^\top R = 0$.
Proposition 1.7 (Interpretation of residuals): The vector of residuals $R$ is the orthogonal projection of $Y$ onto the $(n-p)$-dimensional orthogonal complement $\mathcal{X}^\perp$ of $\mathcal{X}$ in $\mathbb{R}^n$.
Proof: Since R=Y−Y^=Y−X(X⊤X)−1X⊤⋅Y, the map Y↦R can be represented by the matrix
$$\underset{n\times n}{I} - \underset{n\times n}{P} = I - X (X^\top X)^{-1} X^\top$$
which is an orthogonal projection matrix since (I−P)⊤=I−P and
(I−P)2=I2−2IP+P2=I−P
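Analogously, one can verify numerically that $I - P$ is an orthogonal projection and that the residuals are orthogonal to the columns of $X$. The sketch below uses its own simulated data (again purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 2.0, -1.0]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(n) - P                      # projector onto the orthogonal complement
r = Q @ y                              # residuals R = (I - P) Y

assert np.allclose(Q, Q.T) and np.allclose(Q @ Q, Q)  # I - P is an orthogonal projection
assert np.allclose(X.T @ r, 0)                        # residuals are orthogonal to X's columns
assert np.isclose(np.trace(Q), n - p)                 # its rank (= trace) is n - p
```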
1.2.3 Distributional Properties
We point out that the least squares estimator $\hat\beta$ is a random vector: if we drew new samples from the same data-generating mechanism, the data $y$ would look different each time, and hence so would the least squares estimate $\hat{b}$. We analyze the first two moments of the resulting random vectors and variables.
Proposition 1.8 (First moments of least square estimator):
E[β^]=β, i.e. β^ is an unbiased estimator.
Cov[β^,β^]=σ2(X⊤X)−1.
Proof:
$$\mathbb{E}[\hat\beta] = (X^\top X)^{-1} X^\top \mathbb{E}[Y] = (X^\top X)^{-1} X^\top \mathbb{E}[X\beta + \epsilon] = \beta + (X^\top X)^{-1} X^\top \mathbb{E}[\epsilon] = \beta$$
and
$$\begin{align*}
\operatorname{Cov}[\hat\beta, \hat\beta] &= \mathbb{E}[\hat\beta \hat\beta^\top] - \mathbb{E}[\hat\beta]\, \mathbb{E}[\hat\beta]^\top \\
&= (X^\top X)^{-1} X^\top\, \mathbb{E}\big[(X\beta + \epsilon)(\beta^\top X^\top + \epsilon^\top)\big]\, X (X^\top X)^{-1} - \beta \beta^\top \\
&= (X^\top X)^{-1} X^\top\, \operatorname{Cov}[\epsilon, \epsilon]\, X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1}
\end{align*}$$
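A short simulation sketch illustrating Proposition 1.8 (the fixed design and all values are hypothetical): over many repetitions with fresh errors, the empirical mean and covariance of $\hat\beta$ approach $\beta$ and $\sigma^2 (X^\top X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 60, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed design
beta = np.array([1.0, 2.0, -1.5])

# Repeatedly draw new errors, refit, and collect the estimates.
estimates = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
estimates = np.array(estimates)

print(estimates.mean(axis=0))            # ≈ beta (unbiasedness)
print(np.cov(estimates, rowvar=False))   # ≈ sigma^2 (X'X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```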
Note: We write $\sigma_j^2 = \sigma^2 (X^\top X)^{-1}_{[j,j]}$ for the variance and $\sigma_j = \sqrt{\sigma_j^2}$ for the standard deviation of $\hat\beta_j$.
Proposition 1.9 (First moments of fitted values):
E[Y^]=Xβ.
Cov[Y^,Y^]=σ2P.
Proof: $\mathbb{E}[\hat{Y}] = \mathbb{E}[X\hat\beta] = X\beta$
and
$$\begin{align*}
\operatorname{Cov}[\hat{Y}, \hat{Y}] &= \mathbb{E}[\hat{Y} \hat{Y}^\top] - \mathbb{E}[\hat{Y}]\, \mathbb{E}[\hat{Y}]^\top \\
&= X\, \mathbb{E}[\hat\beta \hat\beta^\top]\, X^\top - X \beta \beta^\top X^\top \\
&= X \big( \operatorname{Cov}[\hat\beta, \hat\beta] + \beta \beta^\top \big) X^\top - X \beta \beta^\top X^\top \\
&= \sigma^2 X (X^\top X)^{-1} X^\top \\
&= \sigma^2 P
\end{align*}$$
Note: The first two moments of the vector of response variables are E[Y]=Xβ and Cov[Y,Y]=σ²I. Furthermore we have $\operatorname{Cov}[Y, \hat{Y}] = \operatorname{Cov}[Y, PY] = \operatorname{Cov}[Y, Y]\, P^\top = \sigma^2 P$.
Proposition 1.10 (First moments of residuals):
E[R]=0.
Cov[R,R]=σ2(I−P).
Proof: We have $\mathbb{E}[R] = \mathbb{E}[Y] - \mathbb{E}[\hat{Y}] = X\beta - X\beta = 0$. Further
$$\begin{align*}
\operatorname{Cov}[R, R] &= \mathbb{E}[R R^\top] \\
&= \mathbb{E}\big[(Y - \hat{Y})(Y - \hat{Y})^\top\big] \\
&= \mathbb{E}\big[Y Y^\top - Y \hat{Y}^\top - \hat{Y} Y^\top + \hat{Y} \hat{Y}^\top\big] \\
&= \mathbb{E}[Y Y^\top] - 2 \operatorname{Cov}[Y, \hat{Y}] - 2\, \mathbb{E}[Y]\, \mathbb{E}[\hat{Y}]^\top + \mathbb{E}[\hat{Y} \hat{Y}^\top] \\
&= \mathbb{E}[Y Y^\top] - 2 \operatorname{Cov}[Y, \hat{Y}] - 2\, X \beta \beta^\top X^\top + \mathbb{E}[\hat{Y} \hat{Y}^\top] \\
&= \operatorname{Cov}[Y, Y] - 2 \operatorname{Cov}[Y, \hat{Y}] + \operatorname{Cov}[\hat{Y}, \hat{Y}] \\
&= \sigma^2 I - 2 \sigma^2 P + \sigma^2 P \\
&= \sigma^2 (I - P)
\end{align*}$$
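As with the estimator itself, these covariance formulas can be checked by simulation. The sketch below (illustrative setup) compares the empirical covariances of the fitted values and residuals, over repeated error draws, with $\sigma^2 P$ and $\sigma^2 (I - P)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 20, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, 1.0])
P = X @ np.linalg.inv(X.T @ X) @ X.T

fitted, resid = [], []
for _ in range(50000):
    y = X @ beta + sigma * rng.normal(size=n)
    y_hat = P @ y
    fitted.append(y_hat)
    resid.append(y - y_hat)

cov_fitted = np.cov(np.array(fitted), rowvar=False)  # ≈ sigma^2 P
cov_resid = np.cov(np.array(resid), rowvar=False)    # ≈ sigma^2 (I - P)
print(np.max(np.abs(cov_fitted - sigma**2 * P)))
print(np.max(np.abs(cov_resid - sigma**2 * (np.eye(n) - P))))
```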
For the derivation of the moments we did not need to assume any specific distribution of the errors ε. Using the normality of error assumption ε∼N, we can derive the specific distributions of the random vectors and variables.
Proposition 1.11 (Normal distributions):
β^∼Np(β,σ2(X⊤X)−1)
Y∼Nn(Xβ,σ2I)
Y^∼Nn(Xβ,σ2P)
R∼Nn(0,σ2(I−P))
Proof: The distributions follow from ε∼N and from the fact that any affine transformation of a multivariate normal random vector is again normally distributed.
Note: The normality of error assumption is often not fulfilled in practice. We can then rely on the central limit theorem which implies that for large sample size n, the distributions are still approximately normal. However, it is often much better to use robust methods in case of non-Gaussian errors which are not discussed here.
1.2.4 Error Variance Estimator
We introduce an estimator for the variance of the errors.
Definition 1.12 (Error variance estimator): The estimator for the error variance σ2 is
$$\hat\sigma^2 = \frac{1}{n-p} \sum_{i=1}^{n} R_i^2$$
in the least squares setting.
Proposition 1.13 (Estimator is unbiased): The error variance estimator σ^2 is unbiased, i.e. E[σ^2]=σ2.
Proposition 1.14 (Distribution): The distribution of the error variance estimator is
$$\hat\sigma^2 \sim \frac{\sigma^2}{n-p}\, \chi^2_{n-p}$$
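A minimal sketch of the estimator in code, on simulated data (names and values illustrative); note the division by $n - p$ rather than $n$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 80, 4, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + sigma * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

sigma2_hat = residuals @ residuals / (n - p)  # unbiased error variance estimator
print(sigma2_hat, sigma**2)
```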
1.3 Tests and Confidence Intervals
1.3.1 Test of Regression Coefficients
We impose the OLS and normality of error assumptions. As we have seen before, the estimator $\hat\beta$ is normally distributed. If we are interested in whether the j-th predictor variable is relevant, we can test the null-hypothesis $H_{0,j}: \beta_j = 0$ against the alternative $H_{A,j}: \beta_j \neq 0$.
Recap (t-distribution): Let Z∼N(0,1) and V∼χm2. The random variable
$$T = \frac{Z}{\sqrt{V / m}}$$
follows the t-distribution with m degrees of freedom, i.e. T∼tm.
Definition 1.15 (t-statistic): Under the null-hypothesis $H_{0,j}: \beta_j = 0$ we have
$$T_j = \frac{\hat\beta_j}{\sqrt{\hat\sigma^2 (X^\top X)^{-1}_{[j,j]}}} = \frac{\hat\beta_j}{\hat\sigma_j} \sim t_{n-p}$$
We call Tj the t-statistic of the j-th predictor variable.
Proof: We know
β^j∼N(βj,σ2(X⊤X)[j,j]−1)
and
$$\hat\sigma^2 \sim \frac{\sigma^2}{n-p}\, \chi^2_{n-p}$$
thus, under H0,j we have
$$\frac{\hat\beta_j \Big/ \sqrt{\sigma^2 (X^\top X)^{-1}_{[j,j]}}}{\sqrt{(n-p)^{-1}\, \dfrac{(n-p)\,\hat\sigma^2}{\sigma^2}}} = \frac{\hat\beta_j}{\sqrt{\hat\sigma^2 (X^\top X)^{-1}_{[j,j]}}} \sim t_{n-p}$$
The test corresponding to the t-statistic is called the t-test. In practice, we can thus quantify the relevance of individual predictor variables by looking at the size of the test-statistics T1,…,Tp or at the corresponding p-values, which may be more informative.
Note: Given that we observe the t-statistic $t_j$ for some sample $y$ of the response variables, the p-value of the t-test is $p_{t_j} = 2\, \mathbb{P}(T \geq |t_j|)$ where $T \sim t_{n-p}$.
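The following sketch computes the t-statistics and two-sided p-values for all coefficients on simulated data; the setup and names are illustrative, and scipy is assumed to be available for the $t_{n-p}$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.8, 0.0]) + 0.5 * rng.normal(size=n)  # third coefficient truly zero

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)

se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))  # estimated std. dev. of beta_hat_j
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)        # two-sided p-values
print(t_stats, p_values)
```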
The problem by looking at individual tests H0,j is, besides the multiple testing problem in general, that it can happen that all individual tests do not reject the null-hypotheses although it is true that some predictor variables have a significant effect. This can occur because of correlation among the predictor variables. We clarify how to interpret the results of the t-tests.
Note (Interpretation of t-tests): An individual t-test for H0,j should be interpreted as quantifying the effect of the j-th predictor variable after having subtracted the linear effect of all other predictor variables k=j on Y.
Similarly to the t-tests, one can derive confidence intervals for the unknown parameter βj.
Definition 1.16 (Confidence interval): The interval given by
$$\hat\beta_j \pm \hat\sigma_j \cdot f_{t_{n-p}}\!\big(1 - \tfrac{\alpha}{2}\big),$$
where $f_{t_{n-p}}$ denotes the quantile function of the $t_{n-p}$-distribution, is the two-sided confidence interval $C_j$ which covers the true $\beta_j$ with probability $1-\alpha$, i.e. $\mathbb{P}(\beta_j \in C_j) = 1 - \alpha$.
Proof: Recall that
$$\hat\sigma_j = \sqrt{\hat\sigma^2 (X^\top X)^{-1}_{[j,j]}}$$
Given $\beta_j$ we have
$$\frac{\hat\beta_j - \beta_j}{\hat\sigma_j} \sim t_{n-p}$$
$$\implies \mathbb{P}\left( -f_{t_{n-p}}\!\big(1-\tfrac{\alpha}{2}\big) < \frac{\hat\beta_j - \beta_j}{\hat\sigma_j} < f_{t_{n-p}}\!\big(1-\tfrac{\alpha}{2}\big) \right) = 1-\alpha$$
$$\implies \mathbb{P}\left( \hat\beta_j - \hat\sigma_j\, f_{t_{n-p}}\!\big(1-\tfrac{\alpha}{2}\big) < \beta_j < \hat\beta_j + \hat\sigma_j\, f_{t_{n-p}}\!\big(1-\tfrac{\alpha}{2}\big) \right) = 1-\alpha$$
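Continuing the style of the simulated t-test example above (all names illustrative), the two-sided confidence intervals can be computed with the $t_{n-p}$ quantile from scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p, alpha = 100, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.8, 0.0]) + 0.5 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

q = stats.t.ppf(1 - alpha / 2, df=n - p)      # (1 - alpha/2)-quantile of t_{n-p}
ci_lower, ci_upper = beta_hat - se * q, beta_hat + se * q
print(np.column_stack([ci_lower, ci_upper]))  # one 95% interval per coefficient
```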
1.3.2 Test of Global Null-Hypothesis
To test whether there exists any effect from the predictor variables, we can look at the global null-hypothesis $H_0: \beta_2 = \dots = \beta_p = 0$ versus the alternative hypothesis $H_A: \exists j \in \{2,\dots,p\}: \beta_j \neq 0$. Such a test can be developed with an analysis of variance decomposition, which takes a simple form for this special case.
Note: We denote by $\bar{Y}$ the global arithmetic mean $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$; in vector expressions $\bar{Y}$ stands for the constant vector $\bar{Y} \cdot 1 \in \mathbb{R}^n$.
Definition 1.17 (Anova decomposition): The total variation $\|Y - \bar{Y}\|^2$ of the $Y_i$ around the mean $\bar{Y}$ can be decomposed as
$$\|Y - \bar{Y}\|^2 = \|\hat{Y} - \bar{Y}\|^2 + \|Y - \hat{Y}\|^2$$
where $\|\hat{Y} - \bar{Y}\|^2$ is the variation of the fitted values $\hat{Y}_i$ around the mean $\bar{Y}$, and $\|Y - \hat{Y}\|^2 = \|R\|^2$ is the variation of the residuals.
Proof: Recall that our linear model includes the intercept, i.e. $X[:,1] = 1$, and recall $\mathcal{X} = \{X\beta : \beta \in \mathbb{R}^p\}$. Let $\tilde\beta$ with $\tilde\beta_1 = \bar{Y}$ and $\forall j \neq 1: \tilde\beta_j = 0$, then
$$X \tilde\beta = \sum_{j=1}^{p} \tilde\beta_j \cdot X[:,j] = \tilde\beta_1 \cdot X[:,1] = \bar{Y} \cdot 1$$
hence $\bar{Y} \in \mathcal{X}$ and $\hat{Y} - \bar{Y} \in \mathcal{X}$. This implies $(\hat{Y} - \bar{Y}) \perp R$ as $R \in \mathcal{X}^\perp$. Since $Y - \bar{Y} = (\hat{Y} - \bar{Y}) + R$, the decomposition of the total variation $\|Y - \bar{Y}\|^2$ follows from Pythagoras.
The idea now is to compare the variation of the fitted values $\hat{Y}_i$ around the mean $\bar{Y}$ with the error variance estimate $\hat\sigma^2$. This ratio is a scale-free quantity.
Definition 1.18 (F-statistic): Under the global null-hypothesis H0:β2=…=βp=0 we have the distribution
$$F = \frac{\|\hat{Y} - \bar{Y}\|^2}{\|Y - \hat{Y}\|^2} \cdot \frac{n-p}{p-1} = \frac{\|\hat{Y} - \bar{Y}\|^2}{(p-1)\,\hat\sigma^2} \sim F_{p-1,\,n-p}$$
We call F the F-statistic.
The corresponding test is called the F-test.
Note: Given that we observe the F-statistic $f$ for some sample $y$ of the response variables, the p-value of the F-test is $p_f = \mathbb{P}(F \geq f)$ where $F \sim F_{p-1,\,n-p}$.
Besides performing a global F-test to quantify the statistical significance of the predictor variables, we often want to describe the goodness of fit of the linear model for explaining the data. A meaningful quantity is the coefficient of determination.
Definition 1.19 (Coefficient of determination): The proportion of the total variation of $Y$ around $\bar{Y}$ explained by the regression is
$$R^2 = \frac{\|\hat{Y} - \bar{Y}\|^2}{\|Y - \bar{Y}\|^2}$$
We call R2 the coefficient of determination.
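To illustrate these quantities together, here is a sketch on simulated data (setup illustrative; scipy is assumed for the F distribution) computing the F-statistic, its p-value, and $R^2$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
y_bar = y.mean()

ss_reg = np.sum((y_hat - y_bar) ** 2)   # variation of fitted values around the mean
ss_res = np.sum((y - y_hat) ** 2)       # residual variation
ss_tot = np.sum((y - y_bar) ** 2)       # total variation

F = (ss_reg / (p - 1)) / (ss_res / (n - p))
p_value = stats.f.sf(F, dfn=p - 1, dfd=n - p)  # P(F_{p-1, n-p} >= F)
r_squared = ss_reg / ss_tot
print(F, p_value, r_squared)
```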
1.4 Check of Model Assumptions
The residuals Ri=Yi−Y^i can serve as an approximation of the unobservable error term εi and for checking whether the linear model is appropriate.
1.4.1 The Tukey-Anscombe Plot
...
Because the empirical correlation between the residuals and the fitted values is zero by construction under OLS, any pattern in the plot, such as a curve or a fan shape, cannot be a simple linear trend. It must be evidence of a model violation, such as non-linearity of the true relationship (a curve) or non-constant error variance, i.e. heteroscedasticity (a fan shape).
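A minimal sketch of such a plot for a fitted model, on simulated data (matplotlib assumed available; all values illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 3, size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
residuals = y - y_hat

# Tukey-Anscombe plot: residuals against fitted values.
plt.scatter(y_hat, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Tukey-Anscombe plot")
plt.show()
```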