6. Optimality, GLS and WLS

6.1 Optimality of OLS

Definition 6.1 (BLUE): An estimator $\hat{\theta}$ of a scalar parameter $\theta \in \Theta$ is a "best linear unbiased estimator" or BLUE if
  • $\hat{\theta}$ is a linear estimator, i.e. $\hat{\theta} = \mathbf{a}^\top \mathbf{Y} + b$
  • $\hat{\theta}$ is unbiased, i.e. $\forall \theta \in \Theta : \mathbb{E}[\hat{\theta}] = \theta$
  • $\hat{\theta}$ has minimal variance, i.e. for any other linear unbiased estimator $\hat{\gamma}$ it holds that $\operatorname{Var}[\hat{\theta}] \leq \operatorname{Var}[\hat{\gamma}]$

The BLUE is defined for a scalar estimator because variance comparisons between multivariate estimators are not well-defined: covariance matrices are only partially ordered.

Theorem 6.2 (Gauss-Markov): Under the WCLM assumptions, for any $\mathbf{c} \in \mathbb{R}^p$, the OLS estimator $\mathbf{c}^\top \hat{\boldsymbol\beta}$ is BLUE for $\mathbf{c}^\top \boldsymbol\beta$.
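A small numerical sketch of the Gauss-Markov statement: any linear unbiased estimator of $\boldsymbol\beta$ can be written as $\mathbf{B}\mathbf{Y}$ with $\mathbf{B}\mathbf{X} = \mathbf{I}$, and under homoscedastic, uncorrelated errors $\operatorname{Var}[\mathbf{c}^\top \mathbf{B}\mathbf{Y}] = \sigma^2\, \mathbf{c}^\top \mathbf{B}\mathbf{B}^\top \mathbf{c}$. The design matrix and the competing estimator below are arbitrary illustrative choices, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
sigma2 = 2.0
c = np.array([1.0, -1.0, 0.5])

# OLS: c'betahat has variance sigma^2 * c'(X'X)^{-1}c.
var_ols = sigma2 * c @ np.linalg.solve(X.T @ X, c)

# Another linear unbiased estimator: B = (A'X)^{-1} A' for an arbitrary A,
# which satisfies B X = I (hence unbiasedness) whenever A'X is invertible.
A = X + 0.3 * rng.standard_normal((n, p))
B = np.linalg.solve(A.T @ X, A.T)
assert np.allclose(B @ X, np.eye(p))     # unbiasedness check: B X = I
var_alt = sigma2 * c @ B @ B.T @ c       # Var[c' B Y] = sigma^2 c' B B' c

assert var_ols <= var_alt + 1e-12        # Gauss-Markov: OLS is never worse
```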

If the SCLM is additionally assumed, an even stronger statement can be made.

Definition 6.3 (UMVUE): An estimator $\hat{\theta}$ of a scalar parameter $\theta \in \Theta$ is a "uniformly minimum variance unbiased estimator" or UMVUE if
  • $\hat{\theta}$ is unbiased, i.e. $\forall \theta \in \Theta : \mathbb{E}[\hat{\theta}] = \theta$
  • $\hat{\theta}$ has minimal variance, i.e. for any other unbiased estimator $\hat{\gamma}$ it holds that $\operatorname{Var}[\hat{\theta}] \leq \operatorname{Var}[\hat{\gamma}]$

Theorem 6.4 (Lehmann-Scheffé): Under the SCLM assumptions, for any $\mathbf{c} \in \mathbb{R}^p$, the OLS estimator $\mathbf{c}^\top \hat{\boldsymbol\beta}$ is UMVUE for $\mathbf{c}^\top \boldsymbol\beta$.

6.2 Generalized Assumptions

We now introduce a generalization of the WCLM and SCLM by relaxing the assumption that the deviations are homoscedastic and uncorrelated.

Assumption 6.5 (Weak general linear model): For each observation the response is the linear function $Y_i = \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i$ with zero-mean deviations, i.e. $\mathbb{E}[\varepsilon_i] = 0$ for all $i \in \{1, \ldots, n\}$.
Note:
  • Note that the deviations are possibly heteroscedastic and correlated.
  • For notational convenience, we use WGLM to denote the weak general linear model.
Assumption 6.6 (Strong general linear model): For each observation the response is the linear function $Y_i = \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i$ with jointly Gaussian zero-mean deviations, i.e. $\boldsymbol\varepsilon \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma)$.
Note: For notational convenience, we use SGLM to denote the strong general linear model.

These are rather general assumptions, so sometimes we additionally assume that some structure is known.

Assumption 6.7 (Known correlation structure): The covariance matrix of the deviations is known up to a multiplicative constant $\sigma^2$, i.e. $\operatorname{Cov}[\boldsymbol\varepsilon] = \sigma^2 \mathbf{S}$ where $\mathbf{S}$ is positive definite and known.

Alternatively, we might assume that the deviations are uncorrelated but heteroscedastic.

Assumption 6.8 (Uncorrelated, heteroscedastic deviations): The deviations are uncorrelated but heteroscedastic, i.e. $\operatorname{Cov}[\boldsymbol\varepsilon] = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ where the $\sigma_i^2$ are unknown.

6.3 Generalized Least Squares

In this section we assume the SGLM with known correlation structure $\boldsymbol\Sigma = \sigma^2 \mathbf{S}$.

Note: We denote the OLS estimator by $\hat{\boldsymbol\beta}_{\text{OLS}}$ to distinguish it from others.

We have $\mathbf{Y} \sim \mathcal{N}(\mathbf{X}\boldsymbol\beta, \sigma^2 \mathbf{S})$ and $\hat{\boldsymbol\beta}_{\text{OLS}} \sim \mathcal{N}(\boldsymbol\beta, \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{S} \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1})$. This estimator, while unbiased, is not the most efficient.
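To see this sandwich covariance in action, the following sketch compares it against a Monte Carlo estimate; the problem sizes and the AR(1)-style choice of $\mathbf{S}$ are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 20, 2, 1.5
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.0])

# A known positive definite S (AR(1)-style correlation, an arbitrary choice).
S = 0.6 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

XtX_inv = np.linalg.inv(X.T @ X)
# Sandwich covariance of OLS under Cov[eps] = sigma^2 S:
cov_theory = sigma2 * XtX_inv @ X.T @ S @ X @ XtX_inv

# Monte Carlo: draw many error vectors with covariance sigma^2 S.
reps = 20000
L = np.linalg.cholesky(sigma2 * S)
eps = rng.standard_normal((reps, n)) @ L.T
Y = X @ beta + eps                 # one model realization per row
betahat = Y @ X @ XtX_inv          # OLS applied to all replications at once
cov_mc = np.cov(betahat.T)

assert np.allclose(cov_mc, cov_theory, rtol=0.05, atol=0.01)
```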

Recap (Spectral decomposition of a pd matrix): Any positive definite matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can be decomposed as $\mathbf{A} = \mathbf{U} \boldsymbol\Lambda \mathbf{U}^\top$ where
  • $\mathbf{U} = [\mathbf{u}_1 \cdots \mathbf{u}_n]$ is an orthogonal matrix whose columns $\mathbf{u}_i$ form an eigenbasis of $\mathbf{A}$
  • $\boldsymbol\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ is a diagonal matrix of the (positive) eigenvalues $\lambda_i$ of $\mathbf{A}$
Recap (Square root of a pd matrix): Given a positive definite matrix $\mathbf{A} = \mathbf{U} \boldsymbol\Lambda \mathbf{U}^\top \in \mathbb{R}^{n \times n}$, let $\boldsymbol\Lambda^{1/2} = \operatorname{diag}(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n})$ and $\boldsymbol\Lambda^{-1/2} = \operatorname{diag}(1/\sqrt{\lambda_1}, \ldots, 1/\sqrt{\lambda_n})$. We define
  • $\mathbf{A}^{1/2} = \mathbf{U} \boldsymbol\Lambda^{1/2} \mathbf{U}^\top$
  • $\mathbf{A}^{-1/2} = \mathbf{U} \boldsymbol\Lambda^{-1/2} \mathbf{U}^\top$
We note that $\mathbf{A}^{1/2}$ and $\mathbf{A}^{-1/2}$ are unique irrespective of the choice of $\mathbf{U}$, and that $(\mathbf{A}^{1/2})^2 = \mathbf{A}$ and $(\mathbf{A}^{-1/2})^2 = \mathbf{A}^{-1}$.
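These recaps translate directly into code. The sketch below builds $\mathbf{A}^{1/2}$ and $\mathbf{A}^{-1/2}$ from an eigendecomposition and checks the stated identities; the matrix $\mathbf{A}$ is an arbitrary positive definite example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # a generic positive definite matrix

lam, U = np.linalg.eigh(A)         # A = U diag(lam) U', lam > 0
A_half = U @ np.diag(np.sqrt(lam)) @ U.T
A_neg_half = U @ np.diag(1 / np.sqrt(lam)) @ U.T

assert np.allclose(A_half @ A_half, A)                          # (A^{1/2})^2 = A
assert np.allclose(A_neg_half @ A_neg_half, np.linalg.inv(A))   # (A^{-1/2})^2 = A^{-1}
assert np.allclose(A_half @ A_neg_half, np.eye(n))
```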

We transform the linear model via left multiplication by $\mathbf{S}^{-1/2}$, i.e. $\mathbf{S}^{-1/2} \mathbf{Y} = \mathbf{S}^{-1/2} \mathbf{X} \boldsymbol\beta + \mathbf{S}^{-1/2} \boldsymbol\varepsilon$, that is, $\tilde{\mathbf{Y}} = \tilde{\mathbf{X}} \boldsymbol\beta + \tilde{\boldsymbol\varepsilon}$, and note that $\tilde{\boldsymbol\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. Thus we can use the OLS estimator in the tilde model: $\hat{\boldsymbol\beta}_{\text{GLS}} = (\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^\top \tilde{\mathbf{Y}}$.

Definition 6.9 (GLS estimator): The GLS estimator $\hat{\boldsymbol\beta}_{\text{GLS}}$ is $\hat{\boldsymbol\beta}_{\text{GLS}} = (\mathbf{X}^\top \mathbf{S}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{S}^{-1} \mathbf{Y}$
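The two routes to the GLS estimator — OLS in the whitened tilde model versus the closed form — agree, which the following sketch checks numerically (random data and an arbitrary positive definite $\mathbf{S}$, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
M = rng.standard_normal((n, n))
S = M @ M.T + n * np.eye(n)        # known positive definite S

# Whitening route: build S^{-1/2} from an eigendecomposition, then plain OLS.
lam, U = np.linalg.eigh(S)
S_neg_half = U @ np.diag(1 / np.sqrt(lam)) @ U.T
Xt, Yt = S_neg_half @ X, S_neg_half @ Y
beta_tilde = np.linalg.lstsq(Xt, Yt, rcond=None)[0]

# Closed form: (X' S^{-1} X)^{-1} X' S^{-1} Y.
S_inv = np.linalg.inv(S)
beta_gls = np.linalg.solve(X.T @ S_inv @ X, X.T @ S_inv @ Y)

assert np.allclose(beta_tilde, beta_gls)
```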
Note: Assume a known correlation structure $\mathbf{S}$. Under the WGLM the GLS estimator is BLUE; under the SGLM it is UMVUE. If $\mathbf{S} \neq \mathbf{I}$, the variance of GLS is no larger than that of OLS, i.e. GLS is at least as efficient (and typically strictly more so).
Proposition 6.10 (Distribution of the GLS estimator): Under known correlation structure $\mathbf{S}$, the distribution of the GLS estimator is $\hat{\boldsymbol\beta}_{\text{GLS}} \sim \mathcal{N}(\boldsymbol\beta, \sigma^2 (\mathbf{X}^\top \mathbf{S}^{-1} \mathbf{X})^{-1})$
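The efficiency comparison between the two covariances — the OLS sandwich minus the GLS covariance should be positive semi-definite — can be verified for a concrete design; the AR(1)-style $\mathbf{S}$ below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 25, 3, 1.0
X = rng.standard_normal((n, p))
# A positive definite S different from the identity (AR(1)-style correlation).
S = 0.7 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

S_inv = np.linalg.inv(S)
XtX_inv = np.linalg.inv(X.T @ X)
cov_ols = sigma2 * XtX_inv @ X.T @ S @ X @ XtX_inv      # sandwich covariance
cov_gls = sigma2 * np.linalg.inv(X.T @ S_inv @ X)       # Proposition 6.10

# GLS is at least as efficient: cov_ols - cov_gls is positive semi-definite.
diff = cov_ols - cov_gls
assert np.min(np.linalg.eigvalsh(diff)) >= -1e-10
```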

6.4 Weighted Least Squares

A special case of the SGLM with known correlation structure $\mathbf{S}$ arises when $\mathbf{S} = \operatorname{diag}(v_1, \ldots, v_n)$. Then least squares estimation amounts to $\hat{\boldsymbol\beta}_{\text{GLS}} = \hat{\boldsymbol\beta}_{\text{WLS}} = \operatorname{arg\,min}_{\boldsymbol\beta \in \mathbb{R}^p} \sum_{i=1}^n w_i (Y_i - \mathbf{x}_i^\top \boldsymbol\beta)^2$ with the weights $w_i = v_i^{-1} \propto \operatorname{Var}[\varepsilon_i]^{-1}$. This procedure is called weighted least squares or WLS. If $\operatorname{Var}[\varepsilon_i]$ is large, the $i$-th contribution is downweighted.
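A sketch of this special case: with diagonal $\mathbf{S}$ the GLS closed form coincides with plain least squares on rows rescaled by $\sqrt{w_i}$ (the data and the variances $v_i$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 2
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
v = rng.uniform(0.5, 3.0, size=n)      # Var[eps_i] proportional to v_i
w = 1 / v                              # WLS weights w_i = v_i^{-1}

# GLS closed form with S = diag(v): (X' W X)^{-1} X' W Y, where W = diag(w).
beta_gls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (Y * w))

# WLS as rescaled OLS: multiply each row by sqrt(w_i), then plain least squares.
sw = np.sqrt(w)
beta_wls = np.linalg.lstsq(X * sw[:, None], Y * sw, rcond=None)[0]

assert np.allclose(beta_gls, beta_wls)
```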

6.5 Unknown Heteroscedastic Errors

TODO

6.6 Misspecification of the Linear Model

TODO