8. Model Selection

For this chapter we assume that the true data generating process is $Y_i = f(\dvec x_i) + \epsilon_i$ with $\E{\vepsilon} = \dvec 0$ and $\Cov{\vepsilon}{\vepsilon} = \sigma^2 \dmat I$, i.e. the systematic component $\E{Y_i} = f(\dvec x_i)$ is not necessarily linear in the covariates.

8.1 Problem Definition

Our aim is to find “the best” approximating fitted linear model.

Definition 8.1 (Partial linear model): We denote the partial linear model using $m$ covariates $\setM \subset \set{1, \ldots, p}$ with $\dmat X^{\setM} \vbeta^{\setM}$.
Note: We use $\vbetahat{}^{\setM}$ for the OLS estimator of this model and $\vYhat{}^{\setM}$ for the fitted values. For notational convenience we refer to the partial linear model as “the model $\setM$”.
Definition 8.2 (Total mean squared deviation): Under the assumption that $Y_i = f(\dvec x_i) + \epsilon_i$ is the true data generating process, the total mean squared deviation of the model $\setM$ is $$d^2_{\setM} = \sum_{i=1}^n \E{(\vYhat{}^{\setM}_i - f(\dvec x_i))^2}$$
Note: In the literature, the (total) mean squared deviation is often called the (total) mean squared error or (T)MSE.

We can thus define the best model as the model $\setM^{\star}$ minimizing the average mean squared deviation, i.e. $$\setM^{\star} = \argmin_{\setM \subset \set{1, \ldots, p}} \frac{1}{n} d^2_{\setM}$$

Note: The average mean squared deviation is also called the average mean squared error or AMSE.

8.2 Bias-Variance Tradeoff

We first assume the WCLM, i.e. that the data generating process is indeed linear with $\vY = \dmat X \vbeta + \vepsilon$.

Proposition 8.3 (AMSE of full model): Let $\setM = \set{1, \ldots, p}$ be the full model, then $$\frac{1}{n} d^2_{\setM} = \frac{p}{n} \sigma^2$$
Proof: Since $\vYhat{}^{\setM} - \dmat X \vbeta = \dmat P \vepsilon$, where $\dmat P$ is the projection onto the column space of $\dmat X$, we have $$\frac{1}{n} d^2_{\setM} = \frac{1}{n} \E{\normbig{\vYhat{}^{\setM} - \dmat X \vbeta}^2} = \frac{1}{n} \E{\normbig{\dmat P \vepsilon}^2} = \frac{\sigma^2}{n} \operatorname{tr}(\dmat P) = \frac{p}{n} \sigma^2$$

We note that the AMSE grows linearly in the dimensionality $p$ of the model, i.e. large models accumulate more uncertainty and thus more variance. Hence, it may pay off to use a smaller model $\setM$ with $\abs{\setM} = q$.
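Proposition 8.3 can be checked by Monte Carlo simulation. The sketch below (an illustrative setup with a fixed Gaussian design and numpy; all names and parameter values are assumptions, not from the notes) averages the empirical mean squared deviation of the full-model OLS fit over many noise draws and compares it to $\frac{p}{n}\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 5, 2.0
X = rng.normal(size=(n, p))              # fixed design
beta = np.arange(1, p + 1, dtype=float)
f = X @ beta                             # true (here: linear) systematic component

reps = 2000
amse = 0.0
for _ in range(reps):
    y = f + rng.normal(scale=sigma, size=n)          # fresh noise draw
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS fit of the full model
    amse += np.mean((X @ beta_hat - f) ** 2)         # empirical (1/n) d^2
amse /= reps

print(amse, p / n * sigma**2)  # both close to p/n * sigma^2 = 0.2
```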

Proposition 8.4 (AMSE of partial model): Let $\setM \subsetneq \set{1, \ldots, p}$ be a partial model with $\abs{\setM} = q$, then $$\frac{1}{n} d^2_{\setM} = \frac{1}{n} \sum_{i=1}^n \pa{\E{(\dvec x_i^{\setM})^{\top} \vbetahat{}^{\setM}} - \dvec x_i^{\top} \vbeta}^2 + \frac{q}{n} \sigma^2$$
Note:
  • $\frac{1}{n} \sum_{i=1}^n \pa{\E{(\dvec x_i^{\setM})^{\top} \vbetahat{}^{\setM}} - \dvec x_i^{\top} \vbeta}^2$ is the squared bias of the model $\setM$; it decreases as $q$ increases
  • $\frac{q}{n} \sigma^2$ is the variance of the model $\setM$; it increases as $q$ increases
This result is called the bias-variance tradeoff. A smaller but wrong model may be better in AMSE terms by carefully trading squared bias against variance.
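The tradeoff can be illustrated numerically under the WCLM. In the hypothetical setup below (numpy, fixed design; the coefficient values are illustrative assumptions), two covariates carry almost no signal, so dropping them trades a small squared bias for a variance reduction of $\frac{2}{n}\sigma^2$, and the partial model attains the smaller AMSE:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 1.0
X = rng.normal(size=(n, 4))
beta = np.array([2.0, 1.0, 0.05, 0.05])  # last two covariates nearly irrelevant
f = X @ beta                             # true systematic component, fixed design

def amse(cols, reps=3000):
    """Monte Carlo estimate of (1/n) d^2_M for the model using columns `cols`."""
    Xm = X[:, cols]
    total = 0.0
    for _ in range(reps):
        y = f + rng.normal(scale=sigma, size=n)
        y_hat = Xm @ np.linalg.lstsq(Xm, y, rcond=None)[0]
        total += np.mean((y_hat - f) ** 2)
    return total / reps

full = amse([0, 1, 2, 3])  # unbiased, variance term 4/50 * sigma^2 = 0.08
part = amse([0, 1])        # small squared bias, variance term only 2/50 * sigma^2
print(part, full)          # the partial model wins here
```

Rerunning with larger coefficients on the dropped covariates tips the comparison back in favor of the full model, since the squared bias then dominates the variance saving.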

If the data generating process is nonlinear, a similar decomposition can be made.

TODO

8.3 Forward and Backward Selection

TODO

These selection methods do not guarantee that all covariates in $\setM$ are significant. For instance, problems may occur when the covariates are highly correlated. We also face a multiple testing problem and a post-selection inference problem.
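The details of the selection procedures are left open above. As a generic sketch (not the notes' definition): forward selection can be implemented greedily by adding, at each step, the covariate that most reduces the residual sum of squares; the function names and the fixed-size stopping rule here are illustrative assumptions.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the OLS fit using the columns in `cols`."""
    Xm = X[:, cols]
    beta_hat = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r = y - Xm @ beta_hat
    return r @ r

def forward_selection(X, y, max_size):
    """Greedy forward selection: at each step add the covariate that
    reduces the RSS the most; stop at a fixed model size (a sketch)."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_size:
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# usage on simulated data: only covariates 0 and 2 carry signal
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
print(forward_selection(X, y, max_size=2))  # typically [0, 2]
```

Backward selection works analogously, starting from the full model and removing the covariate whose deletion increases the RSS the least.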

8.4 Information Criteria

Our goal is to find a proxy that we can use in a data-driven way to find the best model in terms of AMSE. For that purpose we define the standardized TMSE $\Gamma_p = \frac{d_{\setM}^2}{\sigma^2}$.
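One standard data-driven proxy for $\Gamma_p$ (a common choice, stated here as an assumption since the notes are incomplete at this point) is Mallows' $C_p = \mathrm{RSS}_{\setM} / \sigmahat^2 - n + 2q$, with $\sigmahat^2$ estimated from the full model; under the WCLM it is approximately unbiased for $\Gamma_p$. A small numpy sketch scoring every submodel:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(size=(n, p))
beta = np.array([2.0, 1.0, 0.0, 0.0])  # covariates 2 and 3 are irrelevant
y = X @ beta + rng.normal(size=n)

def rss(cols):
    """Residual sum of squares of the OLS fit using the columns in `cols`."""
    Xm = X[:, list(cols)]
    r = y - Xm @ np.linalg.lstsq(Xm, y, rcond=None)[0]
    return r @ r

sigma2_hat = rss(range(p)) / (n - p)  # sigma^2 estimated from the full model

# Mallows' C_p = RSS_M / sigma2_hat - n + 2q as a proxy for Gamma_p
scores = {cols: rss(cols) / sigma2_hat - n + 2 * len(cols)
          for q in range(1, p + 1)
          for cols in itertools.combinations(range(p), q)}
best = min(scores, key=scores.get)
print(best)  # includes the signal covariates 0 and 1
```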

Proposition 8.5 (Test): ...