2. Ordinary Least Squares

In this chapter we always assume the WCLM and will state explicitly whenever additional assumptions are made.

2.1 OLS estimator

Note: We abbreviate ordinary least squares with OLS.
Definition 2.1 (OLS estimator): The OLS estimator $\vbetahat$ is
$$\vbetahat = \argmin_{\vbeta \in \R^p} \norm{\vY - \dmat X \vbeta}^2$$
Note: Recall that, as always, we distinguish between the estimator $\vbetahat$, which is stochastic, and the estimate $\vbetacheck$, which is deterministic and computed from a specific data sample $\vY = \vy$.
Proposition 2.2 (OLS closed form): The OLS estimator can be computed explicitly as
$$\vbetahat = \pabig{\dmattr X \dmat X}^{-1} \dmattr X \vY$$
Proof: Since $\normbig{\vY - \dmat X\vbeta}^2$ is convex, we find the minimizer $\vbetahat$ by setting its gradient to $\dvec 0$:
$$\gradwrt{\vbeta}{\normbig{\vY - \dmat X \vbeta}^2} = -2 \dmattr X (\vY - \dmat X \vbeta) \stackrel{!}{=} \dvec 0$$
This yields the normal equations
$$\dmattr X \dmat X \vbetahat = \dmattr X \vY$$
Under the assumption that $\dmat X$ has rank $p$, the matrix $\dmattr X \dmat X \in \R^{p \times p}$ has full rank and is invertible, thus
$$\vbetahat = \pabig{\dmattr X \dmat X}^{-1} \dmattr X \vY$$
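As a quick numerical check of the closed form, the following NumPy sketch (with synthetic data; all names and values are illustrative) compares solving the normal equations against a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix X (n = 50 observations, p = 3 covariates) and response Y.
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(size=n)

# Closed form from Proposition 2.2: solve X^T X beta = X^T Y.
# Solving the normal equations is numerically preferable to forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check with the solver that minimizes ||Y - X beta||^2 directly.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Note that in practice one solves the linear system rather than computing $(\dmattr X \dmat X)^{-1}$ explicitly, for numerical stability.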
Definition 2.3 (Fitted values): The OLS fitted values are $\vYhat = \dmat X \vbetahat$.
Definition 2.4 (Residuals): The OLS residuals are $\vepsilonhat = \vY - \vYhat$.

2.2 OLS under the SLM

Under the SLM $Y_i = \beta x_i + \epsilon_i$ the normal equation reads $\dvectr x \dvec x \betacheck = \dvectr x \vy$, which gives the following solution.

Proposition 2.5 (OLS estimate for the SLM): Under the SLM the OLS estimate is
$$\betacheck = \frac{\dvectr x \vy}{\dvectr x \dvec x}$$
Note: Under the location model, i.e. $\dvec x = \dvec 1$, we have $\betacheck = \frac{\dvectr 1 \vy}{n} = \mean{y}$.
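Proposition 2.5 and the location-model special case are easy to verify numerically; a NumPy sketch with made-up data:

```python
import numpy as np

# Made-up data for the SLM Y_i = beta * x_i + eps_i (no intercept).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Proposition 2.5: beta_check = x^T y / x^T x.
beta_check = (x @ y) / (x @ x)

# Location model special case: x = 1 recovers the sample mean of y.
ones = np.ones_like(y)
assert np.isclose((ones @ y) / (ones @ ones), y.mean())
```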

Under the SLMI $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$ the normal equations $\dmattr X \dmat X \vbetacheck = \dmattr X \vy$ read
\begin{align*} &n\betacheck_1 + \betacheck_2 \sum_{i=1}^n x_i = \sum_{i=1}^n y_i \qand \\ &\betacheck_1 \sum_{i=1}^n x_i + \betacheck_2 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i \end{align*}
We use the arithmetic mean $\mean{x} = \frac{1}{n} \sum_{i=1}^n x_i$, the variable transforms $\check{\alpha} = \betacheck_1 + \betacheck_2 \mean{x}$ and $\check{\beta} = \betacheck_2$, and the following algebraic sum identities.

Recap (Algebraic sum identities): We have
\begin{align*} s_{xx} &= \sum_{i=1}^n (x_i - \mean{x})^2 = \pabig{\sum_{i=1}^n x_i^2} - n\mean{x}^2 \\ s_{yy} &= \sum_{i=1}^n (y_i - \mean{y})^2 = \pabig{\sum_{i=1}^n y_i^2} - n\mean{y}^2 \\ s_{xy} &= \sum_{i=1}^n (x_i - \mean{x})(y_i - \mean{y}) = \pabig{\sum_{i=1}^n x_i y_i} - n\mean{x}\mean{y} \end{align*}
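These identities can be verified on any small data set; a NumPy sketch with made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([0.5, 1.0, 3.0, 5.5])
n = len(x)

# s_xx: centered sum of squares equals the raw sum minus n * mean^2.
s_xx = np.sum((x - x.mean())**2)
assert np.isclose(s_xx, np.sum(x**2) - n * x.mean()**2)

# s_xy: centered cross sum equals the raw cross sum minus n * mean_x * mean_y.
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
assert np.isclose(s_xy, np.sum(x * y) - n * x.mean() * y.mean())
```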

Substituting the transforms into the normal equations, the first equation becomes $n\check{\alpha} = \sum_{i=1}^n y_i$, and the second reduces via the sum identities to $\check{\beta} s_{xx} = s_{xy}$. This gives the solutions $\check{\alpha} = \mean{y}$ and $\check{\beta} = \frac{s_{xy}}{s_{xx}}$.

Proposition 2.6 (OLS estimate for the SLMI): Under the SLMI the OLS estimate is
$$\betacheck_2 = \frac{s_{xy}}{s_{xx}} \qand \betacheck_1 = \mean{y} - \betacheck_2 \mean{x}$$
Note: We can rewrite
$$\betacheck_2 = \frac{\frac{1}{n-1} s_{xy}}{\frac{1}{n-1} s_{xx}} = \frac{\cov{\dvec x, \vy}}{\var{\dvec x}}$$
hence $\betacheck_2$ is the ratio between the empirical covariance of $\dvec x$ and $\vy$ and the empirical variance of $\dvec x$.
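Both the sum-identity form and the covariance/variance form can be cross-checked against the matrix closed form; a NumPy sketch with made-up data:

```python
import numpy as np

# Made-up data for the SLMI Y_i = beta_1 + beta_2 * x_i + eps_i.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.9, 4.1, 4.8, 6.1])
n = len(x)

s_xx = np.sum((x - x.mean())**2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))

beta2 = s_xy / s_xx                    # Proposition 2.6
beta1 = y.mean() - beta2 * x.mean()

# Equivalent ratio of empirical covariance to empirical variance
# (the 1/(n-1) normalizations cancel).
assert np.isclose(beta2, np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1))

# Cross-check against the matrix closed form with design matrix [1, x].
X = np.column_stack([np.ones(n), x])
assert np.allclose(np.linalg.solve(X.T @ X, X.T @ y), [beta1, beta2])
```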

At the point $x = \mean{x}$ we have $\betacheck_1 + \betacheck_2 \mean{x} = \mean{y}$, hence the point $(\mean{x}, \mean{y})$ always lies on the fitted line. A similar argument applies to multiple linear regression with an intercept, i.e. $Y_i = \beta_1 + \sum_{j=2}^{p} \beta_j x_{i,j} + \epsilon_i$, where one can show that $\betacheck_1 = \mean{y} - \sum_{j=2}^{p} \betacheck_j \mean{x}_j$ and thus $(\mean{x}_2, \ldots, \mean{x}_p, \mean{y})$ lies on the fitted hyperplane.

2.3 Geometric Interpretation

While we can interpret the model geometrically by looking at the rows, which shows that a model with intercept fits a $p{-}1$-dimensional hyperplane in $p$-dimensional space through the $n$ points ${(x_{i,2}, \ldots, x_{i,p}, y_i)}_{i=1}^{n}$, more mileage is to be had by interpreting the column vectors of the model.

The random vector $\vY$ of observations is a single point in $\R^n$. As we vary the parameter $\vbeta$, the vector $\dmat X \vbeta$ traces out the range of $\dmat X$, $\Range{\dmat X} = \set{\dvec z \in \R^n \mid \dvec z = \dmat X \dvec b \text{ for some } \dvec b \in \R^p}$, a $p$-dimensional hyperplane through the origin spanned by the columns of $\dmat X$. The OLS estimator $\vbetahat = \argmin_{\vbeta \in \R^p} \normbig{\vY - \dmat X \vbeta}^2$ is therefore the parameter whose fitted value $\vYhat = \dmat X \vbetahat$ is the orthogonal projection of $\vY$ onto this hyperplane.

Definition 2.7 (Hat matrix): The matrix $\dmat P = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X$ is the orthogonal projection matrix onto the column space $\Range{\dmat X}$.
Note: We can write $\vYhat = \dmat X \vbetahat = \dmat P \vY$.
Recap (Properties of projection matrices): Let $\dmat P$ be a projection matrix onto the column space $\Range{\dmat{X}}$. Then:
  • $\dmat P^2 = \dmat P$, i.e. $\dmat P$ is idempotent
  • $\trace{\dmat P} = \rank{\dmat P} = \rank{\dmat X}$, i.e. the trace equals the dimension of $\Range{\dmat{X}}$
If further the projection described by $\dmat P$ is orthogonal we have:
  • $\dmattr P = \dmat P$, i.e. $\dmat P$ is symmetric
Note:
  • Idempotency and symmetry are also necessary and sufficient conditions for $\dmat P$ to be an orthogonal projection (the trace property then follows automatically).
  • Any orthogonal projection can be obtained by choosing a basis $\dmat{X}$ for the hyperplane and evaluating $\dmat P = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X$. The orthogonal projection matrix is invariant to the choice of basis and thus unique.
  • The diagonal entries $\dmat P_{[i,i]} \in [0,1]$ of an orthogonal projection matrix tell us how much influence the observation $Y_i$ has over the fitted value $\Yhat_i$.
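All of these properties can be verified numerically for a concrete hat matrix; a NumPy sketch with a synthetic full-rank design:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
X = rng.normal(size=(n, p))  # synthetic design matrix, full rank almost surely

# Hat matrix P = X (X^T X)^{-1} X^T (Definition 2.7), via a linear solve.
P = X @ np.linalg.solve(X.T @ X, X.T)

assert np.allclose(P @ P, P)          # idempotent
assert np.allclose(P, P.T)            # symmetric (orthogonal projection)
assert np.isclose(np.trace(P), p)     # trace = rank(X) = p

d = np.diag(P)                        # leverages lie in [0, 1]
assert np.all(d >= -1e-12) and np.all(d <= 1 + 1e-12)

# Fitted values are the projection of Y onto the column space of X.
Y = rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(P @ Y, X @ beta_hat)
```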

The set $\Null{\dmattr X} = \set{\dvec z \in \R^n \mid \dmattr X \dvec z = \dvec 0}$ is the null space of $\dmattr X$, more specifically the $n{-}p$-dimensional hyperplane through the origin orthogonal to $\Range{\dmat X}$.

Recap (Range orthogonal to null of transpose): For any matrix $\dmat{X}$ we have $\Range{\dmat X} \perp \Null{\dmattr X}$.
Definition 2.8 (Residual maker matrix): The matrix $\dmat Q = \dmat I - \dmat P$ is the orthogonal projection matrix onto the null space $\Null{\dmattr X}$.
Note:
  • We can write $\vepsilonhat = \vY - \vYhat = \dmat Q \vY$.
  • $\dmat P \dmat Q = \dmat Q \dmat P = \dmat 0$.
  • The fitted values $\check{\vy}$ and residuals $\vepsiloncheck$ are always orthogonal, but they are empirically uncorrelated, i.e. $\cov{\check{\vy}, \vepsiloncheck} = 0$, only if the model includes an intercept, since only then do the residuals sum to zero.
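The residual maker and the intercept remark can be illustrated numerically; a NumPy sketch with synthetic data for a model that includes an intercept:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # design matrix WITH intercept column
Y = 1.0 + 2.0 * x + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)
Q = np.eye(n) - P                     # residual maker (Definition 2.8)

fitted = P @ Y
resid = Q @ Y

assert np.allclose(P @ Q, 0)          # complementary projections: P Q = 0
assert np.allclose(resid, Y - fitted)
assert np.isclose(resid.sum(), 0)     # intercept forces residuals to sum to zero
# ... and hence fitted values and residuals are empirically uncorrelated:
assert np.isclose(np.cov(fitted, resid, ddof=1)[0, 1], 0)
```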

TODO

2.5 An intuition for the OLS parameters

The OLS coefficient $\betahat_j$ can be derived via the following three-step procedure:

  1. Perform an OLS regression of $\dvec{x}_{:,j}$ against the other covariate observations $\set{\dvec{x}_{:,k} \mid k \in \set{1,\ldots,p} \setminus \set{j}}$ to obtain the residuals $\dvec r_j$.
  2. Perform an OLS regression of $\vY$ against the covariate observations $\set{\dvec{x}_{:,k} \mid k \in \set{1,\ldots,p} \setminus \set{j}}$ to obtain the residuals $\vepsilonhat_{\neg j}$.
  3. Perform an OLS regression of $\vepsilonhat_{\neg j}$ against $\dvec r_j$ to obtain $\betahat_j$.

This is known as the Frisch–Waugh–Lovell theorem. What this theorem hints at is that $\betahat_j$ measures the linear effect of $\dvec{x}_{:,j}$ on $\vY$ that is not explained by the linear effects of all other covariate observations $\set{\dvec{x}_{:,k} \mid k \in \set{1,\ldots,p} \setminus \set{j}}$.
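The three steps above can be sketched directly in NumPy (synthetic data; the helper `ols` and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

def ols(A, b):
    """OLS coefficients of b regressed on the columns of A."""
    return np.linalg.solve(A.T @ A, A.T @ b)

j = 2                                  # coefficient to isolate
others = np.delete(X, j, axis=1)       # all covariates except column j

# Step 1: residuals of X[:, j] regressed on the other covariates.
r_j = X[:, j] - others @ ols(others, X[:, j])
# Step 2: residuals of Y regressed on the other covariates.
e_negj = Y - others @ ols(others, Y)
# Step 3: univariate regression of those residuals on r_j.
beta_j_fwl = (r_j @ e_negj) / (r_j @ r_j)

# Matches the j-th coefficient of the full multivariate OLS fit.
assert np.isclose(beta_j_fwl, ols(X, Y)[j])
```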

In general, this means that we cannot combine the parameters from the $p$ separate univariate OLS regressions of $\vY$ against $\dvec{x}_{:,j}$ to obtain $\vbetahat$, as we have to factor in the linear effects of all other covariate observations. There is one notable exception where this is possible.

Proposition 2.9 (Orthogonal covariates): If $\dmat X$ has orthogonal columns, i.e. $\dvec{x}_{:,j_1} \perp \dvec{x}_{:,j_2}$ for all $j_1 \neq j_2$, then
$$\betahat_j = \frac{\dvectr{x}_{:,j} \vY}{\dvectr{x}_{:,j} \dvec{x}_{:,j}}$$
for all $j \in \set{1, \ldots, p}$.
Note: This is indeed the OLS estimator for the SLM $Y_i = \beta_j x_{i,j} + \epsilon_i$.
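Proposition 2.9 can be illustrated by building a design with orthogonal (not necessarily orthonormal) columns via a QR decomposition; a NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 8, 3

# Take p columns of an orthogonal matrix and rescale them:
# the columns stay mutually orthogonal but are no longer unit length.
Q_full, _ = np.linalg.qr(rng.normal(size=(n, n)))
X = Q_full[:, :p] * np.array([1.0, 2.0, 3.0])
Y = rng.normal(size=n)

# Per-column univariate slopes (Proposition 2.9) ...
beta_uni = np.array([(X[:, j] @ Y) / (X[:, j] @ X[:, j]) for j in range(p)])

# ... agree with the joint multivariate OLS estimate.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(beta_uni, beta_hat)
```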