2. Ordinary Least Squares

In this chapter we always assume the WCLM and will state explicitly whenever additional assumptions are made.

2.1 OLS estimator

Note: We abbreviate ordinary least squares with OLS.
Definition 2.1 (OLS estimator): The OLS estimator $\vbetahat$ is
$$\vbetahat = \argmin_{\vbeta \in \R^p} \norm{\vY - \dmat X \vbeta}^2$$
Note: Recall that, as always, we distinguish between the estimator $\vbetahat$, which is stochastic, and the estimate $\vbetacheck$, which is deterministic and computed from a specific data sample $\vY = \vy$.
Proposition 2.2 (OLS closed form): The OLS estimator can be computed explicitly as
$$\vbetahat = \pabig{\dmattr X \dmat X}^{-1} \dmattr X \vY$$
Proof: Since $\normbig{\vY - \dmat X\vbeta}^2$ is convex, we find the minimizer $\vbetahat$ by setting its gradient to $\dvec 0$:
$$\gradwrt{\vbeta}{\normbig{\vY - \dmat X \vbeta}^2} = -2 \dmattr X (\vY - \dmat X \vbeta) \stackrel{!}{=} \dvec 0$$
This yields the normal equations
$$\dmattr X \dmat X \vbetahat = \dmattr X \vY$$
Under the assumption that $\dmat X$ has rank $p$, the matrix $\dmattr X \dmat X \in \R^{p \times p}$ has full rank and is invertible, thus
$$\vbetahat = \pabig{\dmattr X \dmat X}^{-1} \dmattr X \vY$$
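As a quick numerical check of the closed form, the following NumPy sketch (with synthetic data; all names and values are illustrative) compares solving the normal equations against a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix X (n = 50 observations, p = 3 covariates) and response Y.
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(size=n)

# Closed form from Proposition 2.2: solve X^T X beta = X^T Y.
# Solving the normal equations is numerically preferable to forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check with the solver that minimizes ||Y - X beta||^2 directly.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Note that in practice one solves the linear system rather than computing $(\dmattr X \dmat X)^{-1}$ explicitly, for numerical stability.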
Definition 2.3 (Fitted values): The OLS fitted values are $\vYhat = \dmat X \vbetahat$.
Definition 2.4 (Residuals): The OLS residuals are $\vepsilonhat = \vY - \vYhat$.

2.2 OLS under the SLM

Under the SLM $Y_i = \beta x_i + \epsilon_i$ the normal equation reads $\dvectr x \dvec x \betacheck = \dvectr x \vy$, which gives the following solution.

Proposition 2.5 (OLS estimate for the SLM): Under the SLM the OLS estimate is
$$\betacheck = \frac{\dvectr x \vy}{\dvectr x \dvec x}$$
Note: Under the location model, i.e. $\dvec x = \dvec 1$, we have $\betacheck = \frac{\dvectr 1 \vy}{n} = \mean{y}$.
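Proposition 2.5 and the location-model special case are easy to verify numerically; a NumPy sketch with made-up data:

```python
import numpy as np

# Made-up data for the SLM Y_i = beta * x_i + eps_i (no intercept).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Proposition 2.5: beta_check = x^T y / x^T x.
beta_check = (x @ y) / (x @ x)

# Location model special case: x = 1 recovers the sample mean of y.
ones = np.ones_like(y)
assert np.isclose((ones @ y) / (ones @ ones), y.mean())
```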

Under the SLMI $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$ the normal equations $\dmattr X \dmat X \vbetacheck = \dmattr X \vy$ read
\begin{align*} &n\betacheck_1 + \betacheck_2 \sum_{i=1}^n x_i = \sum_{i=1}^n y_i \qand \\ &\betacheck_1 \sum_{i=1}^n x_i + \betacheck_2 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i \end{align*}
We use the arithmetic mean $\mean{x} = \frac{1}{n} \sum_{i=1}^n x_i$, the variable transforms $\check{\alpha} = \betacheck_1 + \betacheck_2 \mean{x}$ and $\check{\beta} = \betacheck_2$, and the following algebraic sum identities.

Recap (Algebraic sum identities): We have
\begin{align*} s_{xx} &= \sum_{i=1}^n (x_i - \mean{x})^2 = \pabig{\sum_{i=1}^n x_i^2} - n\mean{x}^2 \\ s_{yy} &= \sum_{i=1}^n (y_i - \mean{y})^2 = \pabig{\sum_{i=1}^n y_i^2} - n\mean{y}^2 \\ s_{xy} &= \sum_{i=1}^n (x_i - \mean{x})(y_i - \mean{y}) = \pabig{\sum_{i=1}^n x_i y_i} - n\mean{x}\mean{y} \end{align*}
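These identities can be verified on any small data set; a NumPy sketch with made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([0.5, 1.0, 3.0, 5.5])
n = len(x)

# s_xx: centered sum of squares equals the raw sum minus n * mean^2.
s_xx = np.sum((x - x.mean())**2)
assert np.isclose(s_xx, np.sum(x**2) - n * x.mean()**2)

# s_xy: centered cross sum equals the raw cross sum minus n * mean_x * mean_y.
s_xy = np.sum((x - x.mean()) * (y - y.mean()))
assert np.isclose(s_xy, np.sum(x * y) - n * x.mean() * y.mean())
```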

Substituting the transforms into the normal equations, the first equation becomes $n\check{\alpha} = \sum_{i=1}^n y_i$, and the second reduces via the sum identities to $\check{\beta} s_{xx} = s_{xy}$. This gives the solutions $\check{\alpha} = \mean{y}$ and $\check{\beta} = \frac{s_{xy}}{s_{xx}}$.

Proposition 2.6 (OLS estimate for the SLMI): Under the SLMI the OLS estimate is
$$\betacheck_2 = \frac{s_{xy}}{s_{xx}} \qand \betacheck_1 = \mean{y} - \betacheck_2 \mean{x}$$
Note: We can rewrite
$$\betacheck_2 = \frac{\frac{1}{n-1} s_{xy}}{\frac{1}{n-1} s_{xx}} = \frac{\cov{\dvec x, \vy}}{\var{\dvec x}}$$
hence $\betacheck_2$ is the ratio between the empirical covariance of $\dvec x$ and $\vy$ and the empirical variance of $\dvec x$.
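Both the sum-identity form and the covariance/variance form can be cross-checked against the matrix closed form; a NumPy sketch with made-up data:

```python
import numpy as np

# Made-up data for the SLMI Y_i = beta_1 + beta_2 * x_i + eps_i.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.9, 4.1, 4.8, 6.1])
n = len(x)

s_xx = np.sum((x - x.mean())**2)
s_xy = np.sum((x - x.mean()) * (y - y.mean()))

beta2 = s_xy / s_xx                    # Proposition 2.6
beta1 = y.mean() - beta2 * x.mean()

# Equivalent ratio of empirical covariance to empirical variance
# (the 1/(n-1) normalizations cancel).
assert np.isclose(beta2, np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1))

# Cross-check against the matrix closed form with design matrix [1, x].
X = np.column_stack([np.ones(n), x])
assert np.allclose(np.linalg.solve(X.T @ X, X.T @ y), [beta1, beta2])
```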

At the point $x = \mean{x}$ we have $\betacheck_1 + \betacheck_2 \mean{x} = \mean{y}$, hence the point $(\mean{x}, \mean{y})$ always lies on the fitted line. A similar argument applies to multiple linear regression with an intercept, i.e. $Y_i = \beta_1 + \sum_{j=2}^{p} \beta_j x_{i,j} + \epsilon_i$, where one can show that $\betacheck_1 = \mean{y} - \sum_{j=2}^{p} \betacheck_j \mean{x}_j$ and thus $(\mean{x}_2, \ldots, \mean{x}_p, \mean{y})$ lies on the fitted hyperplane.

2.3 Geometric Interpretation

While we can interpret the model geometrically by looking at the rows, which shows that a model with intercept fits a $p{-}1$-dimensional hyperplane in $p$-dimensional space through the $n$ points ${(x_{i,2}, \ldots, x_{i,p}, y_i)}_{i=1}^{n}$, more mileage is to be had by interpreting the column vectors of the model.

The random vector $\vY$ of observations is a single point in $\R^n$. As we vary the parameter $\vbeta$, the vector $\dmat X \vbeta$ traces out the range of $\dmat X$, $\Range{\dmat X} = \set{\dvec z \in \R^n \mid \dvec z = \dmat X \dvec b \text{ for some } \dvec b \in \R^p}$, a $p$-dimensional hyperplane through the origin spanned by the columns of $\dmat X$. The OLS estimator $\vbetahat = \argmin_{\vbeta \in \R^p} \normbig{\vY - \dmat X \vbeta}^2$ is therefore the parameter whose fitted value $\vYhat = \dmat X \vbetahat$ is the orthogonal projection of $\vY$ onto this hyperplane.

Definition 2.7 (Hat matrix): The matrix $\dmat P = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X$ is the orthogonal projection matrix onto the column space $\Range{\dmat X}$.
Note: We can write $\vYhat = \dmat X \vbetahat = \dmat P \vY$.
Recap (Properties of projection matrices): Let $\dmat P$ be a projection matrix onto the column space $\Range{\dmat{X}}$. Then:
  • $\dmat P^2 = \dmat P$, i.e. $\dmat P$ is idempotent
  • $\trace{\dmat P} = \rank{\dmat P} = \rank{\dmat X}$, i.e. the trace equals the dimension of $\Range{\dmat{X}}$
If further the projection described by $\dmat P$ is orthogonal we have:
  • $\dmattr P = \dmat P$, i.e. $\dmat P$ is symmetric
Note:
  • Idempotency and symmetry are also necessary and sufficient conditions for $\dmat P$ to be an orthogonal projection (the trace property then follows automatically).
  • Any orthogonal projection can be obtained by choosing a basis $\dmat{X}$ for the hyperplane and evaluating $\dmat P = \dmat X \pabig{\dmattr X \dmat X}^{-1} \dmattr X$. The orthogonal projection matrix is invariant to the choice of basis and thus unique.
  • The diagonal entries $\dmat P_{[i,i]} \in [0,1]$ of an orthogonal projection matrix tell us how much influence the observation $Y_i$ has over the fitted value $\Yhat_i$.
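All of these properties can be verified numerically for a concrete hat matrix; a NumPy sketch with a synthetic full-rank design:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
X = rng.normal(size=(n, p))  # synthetic design matrix, full rank almost surely

# Hat matrix P = X (X^T X)^{-1} X^T (Definition 2.7), via a linear solve.
P = X @ np.linalg.solve(X.T @ X, X.T)

assert np.allclose(P @ P, P)          # idempotent
assert np.allclose(P, P.T)            # symmetric (orthogonal projection)
assert np.isclose(np.trace(P), p)     # trace = rank(X) = p

d = np.diag(P)                        # leverages lie in [0, 1]
assert np.all(d >= -1e-12) and np.all(d <= 1 + 1e-12)

# Fitted values are the projection of Y onto the column space of X.
Y = rng.normal(size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(P @ Y, X @ beta_hat)
```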

The set $\Null{\dmattr X} = \set{\dvec z \in \R^n \mid \dmattr X \dvec z = \dvec 0}$ is the null space of $\dmattr X$, more specifically the $n{-}p$-dimensional hyperplane through the origin orthogonal to $\Range{\dmat X}$.

Recap (Range orthogonal to null of transpose): For any matrix $\dmat{X}$ we have $\Range{\dmat X} \perp \Null{\dmattr X}$.
Definition 2.8 (Residual maker matrix): The matrix $\dmat Q = \dmat I - \dmat P$ is the orthogonal projection matrix onto the null space $\Null{\dmattr X}$.
Note:
  • We can write $\vepsilonhat = \vY - \vYhat = \dmat Q \vY$.
  • $\dmat P \dmat Q = \dmat Q \dmat P = \dmat 0$.
  • The fitted values $\check{\vy}$ and residuals $\vepsiloncheck$ are always orthogonal, but they are empirically uncorrelated, i.e. $\cov{\check{\vy}, \vepsiloncheck} = 0$, only if the model includes an intercept, since only then do the residuals sum to zero.
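The residual maker and the intercept remark can be illustrated numerically; a NumPy sketch with synthetic data for a model that includes an intercept:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # design matrix WITH intercept column
Y = 1.0 + 2.0 * x + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)
Q = np.eye(n) - P                     # residual maker (Definition 2.8)

fitted = P @ Y
resid = Q @ Y

assert np.allclose(P @ Q, 0)          # complementary projections: P Q = 0
assert np.allclose(resid, Y - fitted)
assert np.isclose(resid.sum(), 0)     # intercept forces residuals to sum to zero
# ... and hence fitted values and residuals are empirically uncorrelated:
assert np.isclose(np.cov(fitted, resid, ddof=1)[0, 1], 0)
```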

TODO

2.5 An intuition for the OLS parameters

The OLS coefficient $\betahat_j$ can be derived via the following three-step procedure:

  1. Perform an OLS regression of $\dvec{x}_{:,j}$ against the other covariate observations $\set{\dvec{x}_{:,k} \mid k \in \set{1,\ldots,p} \setminus \set{j}}$ to obtain the residuals $\dvec r_j$.
  2. Perform an OLS regression of $\vY$ against the covariate observations $\set{\dvec{x}_{:,k} \mid k \in \set{1,\ldots,p} \setminus \set{j}}$ to obtain the residuals $\vepsilonhat_{\neg j}$.
  3. Perform an OLS regression of $\vepsilonhat_{\neg j}$ against $\dvec r_j$ to obtain $\betahat_j$.

This is known as the Frisch–Waugh–Lovell theorem. What this theorem hints at is that $\betahat_j$ measures the linear effect of $\dvec{x}_{:,j}$ on $\vY$ that is not explained by the linear effects of all other covariate observations $\set{\dvec{x}_{:,k} \mid k \in \set{1,\ldots,p} \setminus \set{j}}$.
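The three steps above can be sketched directly in NumPy (synthetic data; the helper `ols` and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(size=n)

def ols(A, b):
    """OLS coefficients of b regressed on the columns of A."""
    return np.linalg.solve(A.T @ A, A.T @ b)

j = 2                                  # coefficient to isolate
others = np.delete(X, j, axis=1)       # all covariates except column j

# Step 1: residuals of X[:, j] regressed on the other covariates.
r_j = X[:, j] - others @ ols(others, X[:, j])
# Step 2: residuals of Y regressed on the other covariates.
e_negj = Y - others @ ols(others, Y)
# Step 3: univariate regression of those residuals on r_j.
beta_j_fwl = (r_j @ e_negj) / (r_j @ r_j)

# Matches the j-th coefficient of the full multivariate OLS fit.
assert np.isclose(beta_j_fwl, ols(X, Y)[j])
```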

In general, this means that we cannot combine the parameters from the $p$ separate univariate OLS regressions of $\vY$ against $\dvec{x}_{:,j}$ to obtain $\vbetahat$, as we have to factor in the linear effects of all other covariate observations. There is one notable exception where this is possible.

Proposition 2.9 (Orthogonal covariates): If $\dmat X$ has orthogonal columns, i.e. $\dvec{x}_{:,j_1} \perp \dvec{x}_{:,j_2}$ for all $j_1 \neq j_2$, then
$$\betahat_j = \frac{\dvectr{x}_{:,j} \vY}{\dvectr{x}_{:,j} \dvec{x}_{:,j}}$$
for all $j \in \set{1, \ldots, p}$.
Note: This is indeed the OLS estimator for the SLM $Y_i = \beta_j x_{i,j} + \epsilon_i$.
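Proposition 2.9 can be illustrated by building a design with orthogonal (not necessarily orthonormal) columns via a QR decomposition; a NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 8, 3

# Take p columns of an orthogonal matrix and rescale them:
# the columns stay mutually orthogonal but are no longer unit length.
Q_full, _ = np.linalg.qr(rng.normal(size=(n, n)))
X = Q_full[:, :p] * np.array([1.0, 2.0, 3.0])
Y = rng.normal(size=n)

# Per-column univariate slopes (Proposition 2.9) ...
beta_uni = np.array([(X[:, j] @ Y) / (X[:, j] @ X[:, j]) for j in range(p)])

# ... agree with the joint multivariate OLS estimate.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(beta_uni, beta_hat)
```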