In this chapter, unless otherwise noted, we will always assume the WCLM and will specify if additional assumptions are made.
2.1 OLS estimator
Note: We abbreviate ordinary least squares with OLS.
Definition 2.1 (OLS estimator): The OLS estimator β^ is
β^ = argmin_{β ∈ R^p} ∥Y − Xβ∥²
Note: Recall that, as always, we distinguish between the estimator β^, which is stochastic, and the estimate βˇ, which is deterministic and computed from a specific data sample Y = y.
Proposition 2.2 (OLS closed form): The OLS estimator can be explicitly computed by
β^ = (X⊤X)^{−1} X⊤Y
Proof: As ∥Y−Xβ∥² is convex in β, we find the minimizer β^ by setting its gradient to 0:
∇_β [∥Y−Xβ∥²] = −2X⊤(Y−Xβ) =! 0
This yields the normal equations
X⊤Xβ^=X⊤Y
Under the assumption that X has rank p, the matrix X⊤X ∈ R^{p×p} has full rank and is invertible, thus
β^ = (X⊤X)^{−1} X⊤Y
Definition 2.3 (Fitted values): The OLS fitted values are Y^=Xβ^.
Definition 2.4 (Residuals): The OLS residuals are ε^=Y−Y^.
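As a sanity check, the closed form of Proposition 2.2 together with the fitted values and residuals can be computed directly in numpy. The data below is made up for illustration; a minimal sketch might look like:

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical example data
n, p = 50, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=n)

# Closed form via the normal equations X'X beta^ = X'y
# (solving the linear system is preferred over forming the inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_fit = X @ beta_hat   # fitted values Y^ = X beta^
resid = y - y_fit      # residuals eps^ = Y - Y^

# Agrees with numpy's built-in least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Note that `np.linalg.solve` assumes X⊤X is invertible, i.e. that X has full column rank p, exactly as in Proposition 2.2.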
2.2 OLS under the SLM
Under the SLM Y_i = βx_i + ε_i we have the normal equation x⊤x βˇ = x⊤y, and we get the following solution.
Proposition 2.5 (OLS estimate for the SLM): Under the SLM the OLS estimate is
βˇ = (x⊤y)/(x⊤x)
Note: Under the location model, i.e. x = 1, we have βˇ = (1⊤y)/n = ȳ.
Under the SLMI Y_i = β_1 + β_2 x_i + ε_i we have the following normal equations X⊤X βˇ = X⊤y:
n βˇ_1 + βˇ_2 ∑_{i=1}^n x_i = ∑_{i=1}^n y_i   and   βˇ_1 ∑_{i=1}^n x_i + βˇ_2 ∑_{i=1}^n x_i² = ∑_{i=1}^n x_i y_i
We use the arithmetic mean x̄ = (1/n) ∑_{i=1}^n x_i and the variable transforms αˇ = βˇ_1 + βˇ_2 x̄ and βˇ = βˇ_2, and note the following algebraic sum identities.
Recap (Algebraic sum identities): We have
s_xx = ∑_{i=1}^n (x_i − x̄)² = (∑_{i=1}^n x_i²) − n x̄²
s_yy = ∑_{i=1}^n (y_i − ȳ)² = (∑_{i=1}^n y_i²) − n ȳ²
s_xy = ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) = (∑_{i=1}^n x_i y_i) − n x̄ ȳ
With that we get the solutions αˇ = ȳ and βˇ = s_xy/s_xx.
Proposition 2.6 (OLS estimate for the SLMI): Under the SLMI the OLS estimate is
βˇ_2 = s_xy/s_xx   and   βˇ_1 = ȳ − βˇ_2 x̄
Note: We can rewrite
βˇ_2 = ((1/(n−1)) s_xy)/((1/(n−1)) s_xx) = cov(x,y)/var(x)
hence βˇ_2 is the ratio between the empirical covariance of x and y and the empirical variance of x.
At the point x = x̄ we have βˇ_1 + βˇ_2 x̄ = ȳ, hence the point (x̄, ȳ) always lies on the fitted line. A similar argument can be made for multiple linear regression with an intercept, i.e. Y_i = β_1 + ∑_{j=2}^p β_j x_{i,j} + ε_i, where one can show that βˇ_1 = ȳ − ∑_{j=2}^p βˇ_j x̄_j and thus (x̄_2, …, x̄_p, ȳ) lies on the fitted hyperplane.
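The SLMI formulas can be verified numerically on made-up data: the slope is s_xy/s_xx, it equals the ratio of empirical covariance to empirical variance, and (x̄, ȳ) lies on the fitted line.

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical data for the SLMI
n = 40
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

xbar, ybar = x.mean(), y.mean()
sxy = np.sum((x - xbar) * (y - ybar))  # s_xy
sxx = np.sum((x - xbar) ** 2)          # s_xx

b2 = sxy / sxx           # slope: s_xy / s_xx
b1 = ybar - b2 * xbar    # intercept: ybar - b2 * xbar

# Slope equals empirical covariance over empirical variance
assert np.isclose(b2, np.cov(x, y)[0, 1] / np.var(x, ddof=1))
# The point (xbar, ybar) lies on the fitted line
assert np.isclose(b1 + b2 * xbar, ybar)
```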
2.3 Geometric Interpretation
While we can interpret the model geometrically by looking at the rows, which shows that a model with intercept fits a (p−1)-dimensional hyperplane in a p-dimensional space through the n points (x_{i,2}, …, x_{i,p}, y_i), i = 1, …, n, more mileage is to be had by interpreting the column vectors of the model.
The random vector Y of observations is a single point in R^n. If we vary the value of the parameter β, the set R(X) = {z ∈ R^n ∣ z = Xb for some b ∈ R^p} describes the range of X, more specifically a p-dimensional hyperplane through the origin spanned by the columns of X. Finding the OLS estimator β^ = argmin_{β ∈ R^p} ∥Y − Xβ∥² is then equivalent to finding the β^ whose fitted value Y^ = Xβ^ is the orthogonal projection of Y onto this hyperplane.
Definition 2.7 (Hat matrix): The matrix P = X(X⊤X)^{−1}X⊤ is the orthogonal projection matrix onto the column space R(X).
Note: We can write Y^=Xβ^=PY.
Recap (Properties of projection matrices): Let P be a projection matrix onto the column space R(X). Then:
P2=P, i.e. P is idempotent
tr(P) = rank(P) = rank(X), i.e. the trace equals the dimension of R(X)
If further the projection described by P is orthogonal we have:
P⊤=P, i.e. P is symmetric
Note:
These three properties are also necessary and sufficient conditions for P to be an orthogonal projection.
Any orthogonal projection can be derived by choosing a basis X for the hyperplane and evaluating P = X(X⊤X)^{−1}X⊤. The orthogonal projection matrix is invariant to the choice of basis and unique.
The diagonal entries P[i,i] ∈ [0,1] of an orthogonal projection matrix P tell us how much influence the observation Y_i has on the fitted value Y^_i.
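These properties of the hat matrix are easy to check numerically; the design matrix below is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)  # hypothetical design matrix
n, p = 20, 4
X = rng.normal(size=(n, p))

# Hat matrix P = X (X'X)^{-1} X'
P = X @ np.linalg.solve(X.T @ X, X.T)

assert np.allclose(P @ P, P)       # idempotent: P^2 = P
assert np.allclose(P, P.T)         # symmetric: orthogonal projection
assert np.isclose(np.trace(P), p)  # tr(P) = rank(X) = p
lev = np.diag(P)                   # leverages P[i,i]
assert lev.min() >= 0 and lev.max() <= 1
```

The diagonal entries `lev` are the leverages: P[i,i] = ∥P e_i∥² lies in [0,1] because P is symmetric and idempotent.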
The set N(X⊤) = {z ∈ R^n ∣ X⊤z = 0} is the null space of X⊤, more specifically the (n−p)-dimensional hyperplane through the origin orthogonal to R(X).
Recap (Range orthogonal to null of transpose): Given a matrix X ∈ R^{n×p} we have R(X) ⊥ N(X⊤).
Definition 2.8 (Residual maker matrix): The matrix Q=I−P is the orthogonal projection matrix onto the null space N(X⊤).
Note:
We can write ε^=Y−Y^=QY.
PQ=QP=0.
In general, cov(yˇ, εˇ) = 0, i.e. that yˇ and εˇ are empirically uncorrelated, holds only if the model includes an intercept.
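A quick numerical illustration of the residual maker on made-up data with an intercept column: Q annihilates the fitted values, and the residuals sum to zero, which makes fitted values and residuals empirically uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(3)  # hypothetical data with an intercept column
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)
Q = np.eye(n) - P                  # residual maker Q = I - P

resid = Q @ y                      # eps^ = QY
assert np.allclose(P @ Q, 0)       # PQ = 0
assert np.isclose(resid.sum(), 0)  # intercept => residuals sum to zero
# hence fitted values and residuals are empirically uncorrelated
assert np.isclose(np.cov(P @ y, resid)[0, 1], 0)
```

Without the column of ones, `resid.sum()` is generally nonzero and the empirical covariance no longer vanishes, even though P(Qy) = 0 still holds exactly.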
2.4 Link to MLE under the SCLM
TODO
2.5 An intuition for the OLS parameters
The OLS coefficient β^_j can be derived via the following three-step procedure:
Perform OLS regression of x_{:,j} against the other covariate observations {x_{:,k} ∣ k ∈ {1,…,p} ∖ {j}} to obtain the residuals r_j.
Perform OLS regression of Y against the covariate observations {x_{:,k} ∣ k ∈ {1,…,p} ∖ {j}} to obtain the residuals ε^_{¬j}.
Perform OLS regression of ε^_{¬j} against r_j to obtain β^_j.
This is known as the Frisch-Waugh-Lovell theorem. What this theorem hints at is that β^_j measures the linear effect of x_{:,j} on Y which is not explained by the linear effects of all other covariate observations {x_{:,k} ∣ k ∈ {1,…,p} ∖ {j}}.
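The three-step procedure can be checked numerically on made-up data: the coefficient from the residual-on-residual regression matches the corresponding entry of the full OLS fit.

```python
import numpy as np

rng = np.random.default_rng(4)  # hypothetical data illustrating FWL
n, p = 60, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=n)

def ols(A, b):
    """OLS coefficients of b regressed on the columns of A."""
    return np.linalg.solve(A.T @ A, A.T @ b)

j = 1
others = np.delete(X, j, axis=1)  # all covariates except x_{:,j}

# Step 1: residuals of x_{:,j} regressed on the other covariates
r_j = X[:, j] - others @ ols(others, X[:, j])
# Step 2: residuals of y regressed on the other covariates
e_notj = y - others @ ols(others, y)
# Step 3: regress those residuals on r_j
beta_j = (r_j @ e_notj) / (r_j @ r_j)

assert np.isclose(beta_j, ols(X, y)[j])  # matches the full OLS coefficient
```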
In general, this means that we cannot simply combine the parameters from the p separate univariate OLS regressions of Y against x_{:,j} to obtain β^, as we have to factor in the linear effects of all other covariate observations. There is one notable exception where this is possible.
Proposition 2.9 (Orthogonal covariates): If X is orthogonal, i.e. x_{:,j1} ⊥ x_{:,j2} for all j1 ≠ j2, then
β^_j = (x_{:,j}⊤ Y)/(x_{:,j}⊤ x_{:,j})
for all j∈{1,…,p}.
Note: This is indeed equivalent to the OLS estimator for the SLM Y_i = β_j x_{i,j} + ε_i.
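For an orthogonal design (here taken, for illustration, as the Q factor of a QR decomposition of random data), the p univariate regressions indeed reproduce the full OLS estimate:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical orthogonal design: Q factor of a QR decomposition
Q_mat, _ = np.linalg.qr(rng.normal(size=(25, 3)))
X = Q_mat                      # columns are mutually orthogonal
y = rng.normal(size=25)

# Full multivariate OLS
beta_full = np.linalg.solve(X.T @ X, X.T @ y)
# Univariate regressions, column by column
beta_uni = np.array([(X[:, j] @ y) / (X[:, j] @ X[:, j]) for j in range(3)])

assert np.allclose(beta_full, beta_uni)  # coincide for orthogonal columns
```

With correlated columns the two computations would disagree, which is exactly the point of the Frisch-Waugh-Lovell discussion above.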