6. Optimality, GLS and WLS

6.1 Optimality of OLS

Definition 6.1 (BLUE): An estimator $\hat{\theta}$ of a scalar parameter $\theta \in \Theta$ is a "best linear unbiased estimator" or BLUE if
  • $\hat{\theta}$ is a linear estimator, i.e. $\hat{\theta} = \mathbf{a}^\top \mathbf{Y} + b$
  • $\hat{\theta}$ is unbiased, i.e. $\forall \theta \in \Theta : \mathbb{E}[\hat{\theta}] = \theta$
  • $\hat{\theta}$ has minimal variance, i.e. for any other linear unbiased estimator $\hat{\gamma}$ it holds that $\operatorname{Var}[\hat{\theta}] \leq \operatorname{Var}[\hat{\gamma}]$

The BLUE is defined for a scalar estimator because variance comparisons between multivariate estimators are not well-defined: covariance matrices are only partially ordered.

Theorem 6.2 (Gauss-Markov): Under the WCLM assumptions, for any $\mathbf{c} \in \mathbb{R}^p$, the OLS estimator $\mathbf{c}^\top \hat{\boldsymbol\beta}$ is BLUE for $\mathbf{c}^\top \boldsymbol\beta$.
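A small numerical sketch of the Gauss-Markov statement: any linear unbiased estimator of $\boldsymbol\beta$ can be written as $\mathbf{B}\mathbf{Y}$ with $\mathbf{B}\mathbf{X} = \mathbf{I}$, and under homoscedastic, uncorrelated errors $\operatorname{Var}[\mathbf{c}^\top \mathbf{B}\mathbf{Y}] = \sigma^2\, \mathbf{c}^\top \mathbf{B}\mathbf{B}^\top \mathbf{c}$. The design matrix and the competing estimator below are arbitrary illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
sigma2 = 2.0
c = np.array([1.0, -1.0, 0.5])

# OLS: c'betahat has variance sigma^2 * c'(X'X)^{-1}c.
var_ols = sigma2 * c @ np.linalg.solve(X.T @ X, c)

# Another linear unbiased estimator: B = (A'X)^{-1} A' for an arbitrary A,
# which satisfies B X = I (hence unbiasedness) whenever A'X is invertible.
A = X + 0.3 * rng.standard_normal((n, p))
B = np.linalg.solve(A.T @ X, A.T)
assert np.allclose(B @ X, np.eye(p))     # unbiasedness check: B X = I
var_alt = sigma2 * c @ B @ B.T @ c       # Var[c' B Y] = sigma^2 c' B B' c

assert var_ols <= var_alt + 1e-12        # Gauss-Markov: OLS is never worse
```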

If the SCLM is additionally assumed, an even stronger statement can be made.

Definition 6.3 (UMVUE): An estimator $\hat{\theta}$ of a scalar parameter $\theta \in \Theta$ is a "uniformly minimum variance unbiased estimator" or UMVUE if
  • $\hat{\theta}$ is unbiased, i.e. $\forall \theta \in \Theta : \mathbb{E}[\hat{\theta}] = \theta$
  • $\hat{\theta}$ has minimal variance, i.e. for any other unbiased estimator $\hat{\gamma}$ it holds that $\operatorname{Var}[\hat{\theta}] \leq \operatorname{Var}[\hat{\gamma}]$

Theorem 6.4 (Lehmann-Scheffé): Under the SCLM assumptions, for any $\mathbf{c} \in \mathbb{R}^p$, the OLS estimator $\mathbf{c}^\top \hat{\boldsymbol\beta}$ is UMVUE for $\mathbf{c}^\top \boldsymbol\beta$.

6.2 Generalized Assumptions

We now introduce a generalization of the WCLM and SCLM by relaxing the assumption that the deviations are homoscedastic and uncorrelated.

Assumption 6.5 (Weak general linear model): For each observation the response is the linear function $Y_i = \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i$ with zero-mean deviations, i.e. $\mathbb{E}[\varepsilon_i] = 0$ for all $i \in \{1, \ldots, n\}$.
Note:
  • Note that the deviations are possibly heteroscedastic and correlated.
  • For notational convenience, we use WGLM to denote the weak general linear model.
Assumption 6.6 (Strong general linear model): For each observation the response is the linear function $Y_i = \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i$ with jointly Gaussian zero-mean deviations, i.e. $\boldsymbol\varepsilon \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma)$.
Note: For notational convenience, we use SGLM to denote the strong general linear model.

These are rather general assumptions, so sometimes we additionally assume that some structure is known.

Assumption 6.7 (Known correlation structure): The covariance matrix of the deviations is known up to a multiplicative constant $\sigma^2$, i.e. $\operatorname{Cov}[\boldsymbol\varepsilon] = \sigma^2 \mathbf{S}$ where $\mathbf{S}$ is positive definite and known.

Alternatively, we might assume that the deviations are uncorrelated but heteroscedastic.

Assumption 6.8 (Uncorrelated, heteroscedastic deviations): The deviations are uncorrelated but heteroscedastic, i.e. $\operatorname{Cov}[\boldsymbol\varepsilon] = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ where the $\sigma_i^2$ are unknown.

6.3 Generalized Least Squares

In this section we assume the SGLM with known correlation structure $\boldsymbol\Sigma = \sigma^2 \mathbf{S}$.

Note: We denote the OLS estimator by $\hat{\boldsymbol\beta}_{\text{OLS}}$ to distinguish it from others.

We have $\mathbf{Y} \sim \mathcal{N}(\mathbf{X}\boldsymbol\beta, \sigma^2 \mathbf{S})$ and $\hat{\boldsymbol\beta}_{\text{OLS}} \sim \mathcal{N}(\boldsymbol\beta, \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{S} \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1})$. This estimator, while unbiased, is not the most efficient.
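To see this sandwich covariance in action, the following sketch compares it against a Monte Carlo estimate; the problem sizes and the AR(1)-style choice of $\mathbf{S}$ are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 20, 2, 1.5
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.0])

# A known positive definite S (AR(1)-style correlation, an arbitrary choice).
S = 0.6 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

XtX_inv = np.linalg.inv(X.T @ X)
# Sandwich covariance of OLS under Cov[eps] = sigma^2 S:
cov_theory = sigma2 * XtX_inv @ X.T @ S @ X @ XtX_inv

# Monte Carlo: draw many error vectors with covariance sigma^2 S.
reps = 20000
L = np.linalg.cholesky(sigma2 * S)
eps = rng.standard_normal((reps, n)) @ L.T
Y = X @ beta + eps                 # one model realization per row
betahat = Y @ X @ XtX_inv          # OLS applied to all replications at once
cov_mc = np.cov(betahat.T)

assert np.allclose(cov_mc, cov_theory, rtol=0.05, atol=0.01)
```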

Recap (Spectral decomposition of a pd matrix): Any positive definite matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can be decomposed as $\mathbf{A} = \mathbf{U} \boldsymbol\Lambda \mathbf{U}^\top$ where
  • $\mathbf{U} = [\mathbf{u}_1 \cdots \mathbf{u}_n]$ is an orthogonal matrix whose columns $\mathbf{u}_i$ form an eigenbasis of $\mathbf{A}$
  • $\boldsymbol\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ is a diagonal matrix of the (positive) eigenvalues $\lambda_i$ of $\mathbf{A}$
Recap (Square root of a pd matrix): Given a positive definite matrix $\mathbf{A} = \mathbf{U} \boldsymbol\Lambda \mathbf{U}^\top \in \mathbb{R}^{n \times n}$, let $\boldsymbol\Lambda^{1/2} = \operatorname{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n})$ and $\boldsymbol\Lambda^{-1/2} = \operatorname{diag}(1/\sqrt{\lambda_1}, \ldots, 1/\sqrt{\lambda_n})$. We define
  • $\mathbf{A}^{1/2} = \mathbf{U} \boldsymbol\Lambda^{1/2} \mathbf{U}^\top$
  • $\mathbf{A}^{-1/2} = \mathbf{U} \boldsymbol\Lambda^{-1/2} \mathbf{U}^\top$
We note that $\mathbf{A}^{1/2}$ and $\mathbf{A}^{-1/2}$ are unique irrespective of the choice of $\mathbf{U}$, and that $(\mathbf{A}^{1/2})^2 = \mathbf{A}$ and $(\mathbf{A}^{-1/2})^2 = \mathbf{A}^{-1}$.
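These recaps translate directly into code. The sketch below builds $\mathbf{A}^{1/2}$ and $\mathbf{A}^{-1/2}$ from an eigendecomposition and checks the stated identities; the matrix $\mathbf{A}$ is an arbitrary positive definite example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # a generic positive definite matrix

lam, U = np.linalg.eigh(A)         # A = U diag(lam) U', lam > 0
A_half = U @ np.diag(np.sqrt(lam)) @ U.T
A_neg_half = U @ np.diag(1 / np.sqrt(lam)) @ U.T

assert np.allclose(A_half @ A_half, A)                          # (A^{1/2})^2 = A
assert np.allclose(A_neg_half @ A_neg_half, np.linalg.inv(A))   # (A^{-1/2})^2 = A^{-1}
assert np.allclose(A_half @ A_neg_half, np.eye(n))
```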

We transform the linear model via left multiplication by $\mathbf{S}^{-1/2}$, i.e. $\mathbf{S}^{-1/2} \mathbf{Y} = \mathbf{S}^{-1/2} \mathbf{X} \boldsymbol\beta + \mathbf{S}^{-1/2} \boldsymbol\varepsilon$, that is, $\tilde{\mathbf{Y}} = \tilde{\mathbf{X}} \boldsymbol\beta + \tilde{\boldsymbol\varepsilon}$, and note that $\tilde{\boldsymbol\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. Thus we can use the OLS estimator in the tilde model: $\hat{\boldsymbol\beta}_{\text{GLS}} = (\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^\top \tilde{\mathbf{Y}}$.

Definition 6.9 (GLS estimator): The GLS estimator $\hat{\boldsymbol\beta}_{\text{GLS}}$ is $\hat{\boldsymbol\beta}_{\text{GLS}} = (\mathbf{X}^\top \mathbf{S}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{S}^{-1} \mathbf{Y}$
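The two routes to the GLS estimator — OLS in the whitened tilde model versus the closed form — agree, which the following sketch checks numerically (random data and an arbitrary positive definite $\mathbf{S}$, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
M = rng.standard_normal((n, n))
S = M @ M.T + n * np.eye(n)        # known positive definite S

# Whitening route: build S^{-1/2} from an eigendecomposition, then plain OLS.
lam, U = np.linalg.eigh(S)
S_neg_half = U @ np.diag(1 / np.sqrt(lam)) @ U.T
Xt, Yt = S_neg_half @ X, S_neg_half @ Y
beta_tilde = np.linalg.lstsq(Xt, Yt, rcond=None)[0]

# Closed form: (X' S^{-1} X)^{-1} X' S^{-1} Y.
S_inv = np.linalg.inv(S)
beta_gls = np.linalg.solve(X.T @ S_inv @ X, X.T @ S_inv @ Y)

assert np.allclose(beta_tilde, beta_gls)
```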
Note: Assume a known correlation structure $\mathbf{S}$. Under the WGLM the GLS estimator is BLUE; under the SGLM it is UMVUE. If $\mathbf{S} \neq \mathbf{I}$, the variance of GLS is no larger than that of OLS, i.e. GLS is at least as efficient (and typically strictly more so).
Proposition 6.10 (Distribution of the GLS estimator): Under known correlation structure $\mathbf{S}$, the distribution of the GLS estimator is $\hat{\boldsymbol\beta}_{\text{GLS}} \sim \mathcal{N}(\boldsymbol\beta, \sigma^2 (\mathbf{X}^\top \mathbf{S}^{-1} \mathbf{X})^{-1})$
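The efficiency comparison between the two covariances — the OLS sandwich minus the GLS covariance should be positive semi-definite — can be verified for a concrete design; the AR(1)-style $\mathbf{S}$ below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 25, 3, 1.0
X = rng.standard_normal((n, p))
# A positive definite S different from the identity (AR(1)-style correlation).
S = 0.7 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

S_inv = np.linalg.inv(S)
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = sigma2 * XtX_inv @ X.T @ S @ X @ XtX_inv      # sandwich covariance
cov_gls = sigma2 * np.linalg.inv(X.T @ S_inv @ X)       # Proposition 6.10

# GLS is at least as efficient: cov_ols - cov_gls is positive semi-definite.
diff = cov_ols - cov_gls
assert np.min(np.linalg.eigvalsh(diff)) >= -1e-10
```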

6.4 Weighted Least Squares

A special case of the SGLM with known correlation structure $\mathbf{S}$ arises when $\mathbf{S} = \operatorname{diag}(v_1, \ldots, v_n)$. Then least squares estimation amounts to $\hat{\boldsymbol\beta}_{\text{GLS}} = \hat{\boldsymbol\beta}_{\text{WLS}} = \operatorname{arg\,min}_{\boldsymbol\beta \in \mathbb{R}^p} \sum_{i=1}^n w_i (Y_i - \mathbf{x}_i^\top \boldsymbol\beta)^2$ with the weights $w_i = v_i^{-1} \propto \operatorname{Var}[\varepsilon_i]^{-1}$. This procedure is called weighted least squares or WLS. If $\operatorname{Var}[\varepsilon_i]$ is large, the $i$-th contribution is downweighted.
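A sketch of this special case: with diagonal $\mathbf{S}$ the GLS closed form coincides with plain least squares on rows rescaled by $\sqrt{w_i}$ (the data and the variances $v_i$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 2
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
v = rng.uniform(0.5, 3.0, size=n)      # Var[eps_i] proportional to v_i
w = 1 / v                              # WLS weights w_i = v_i^{-1}

# GLS closed form with S = diag(v): (X' W X)^{-1} X' W Y, where W = diag(w).
beta_gls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (Y * w))

# WLS as rescaled OLS: multiply each row by sqrt(w_i), then plain least squares.
sw = np.sqrt(w)
beta_wls = np.linalg.lstsq(X * sw[:, None], Y * sw, rcond=None)[0]

assert np.allclose(beta_gls, beta_wls)
```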

6.5 Unknown Heteroscedastic Errors

TODO

6.6 Misspecification of the Linear Model

TODO