8. Model Selection

For this chapter we assume that the true data generating process is $Y_i = f(\dvec x_i) + \epsilon_i$ with $\E{\vepsilon} = \dvec 0$ and $\Cov{\vepsilon}{\vepsilon} = \sigma^2 \dmat I$, i.e. the systematic component $\E{Y_i} = f(\dvec x_i)$ is not necessarily linear in the covariates.

8.1 Problem Definition

Our aim is to find “the best” approximating fitted linear model.

Definition 8.1 (Partial linear model): We denote the partial linear model using $m$ covariates $\setM \subset \set{1, \ldots, p}$ with $\dmat X^{\setM} \vbeta^{\setM}$.
Note: We use $\vbetahat{}^{\setM}$ for the OLS estimator of this model and $\vYhat{}^{\setM}$ for the fitted values. For notational convenience we refer to the partial linear model as “the model $\setM$”.
Definition 8.2 (Total mean squared deviation): Under the assumption that $Y_i = f(\dvec x_i) + \epsilon_i$ is the true data generating process, the total mean squared deviation of the model $\setM$ is $$d^2_{\setM} = \sum_{i=1}^n \E{(\vYhat{}^{\setM}_i - f(\dvec x_i))^2}$$
Note: In the literature, the (total) mean squared deviation is often called the (total) mean squared error or (T)MSE.

We can thus define the best model as the model $\setM^{\star}$ minimizing the average mean squared deviation, i.e. $$\setM^{\star} = \argmin_{\setM \subset \set{1, \ldots, p}} \frac{1}{n} d^2_{\setM}$$

Note: The average mean squared deviation is also called the average mean squared error or AMSE.

8.2 Bias-Variance Tradeoff

We first assume the WCLM, i.e. that the data generating process is indeed linear with $\vY = \dmat X \vbeta + \vepsilon$.

Proposition 8.3 (AMSE of full model): Let $\setM = \set{1, \ldots, p}$ be the full model, then $$\frac{1}{n} d^2_{\setM} = \frac{p}{n} \sigma^2$$
Proof: Since $\vYhat{}^{\setM} - \dmat X \vbeta = \dmat P \vepsilon$, where $\dmat P$ is the projection onto the column space of $\dmat X$, we have $$\frac{1}{n} d^2_{\setM} = \frac{1}{n} \E{\normbig{\vYhat{}^{\setM} - \dmat X \vbeta}^2} = \frac{1}{n} \E{\normbig{\dmat P \vepsilon}^2} = \frac{\sigma^2}{n} \operatorname{tr}(\dmat P) = \frac{p}{n} \sigma^2$$

We note that the AMSE grows linearly in the dimensionality $p$ of the model, i.e. large models accumulate more uncertainty and thus more variance. Hence, it may pay off to use a smaller model $\setM$ with $\abs{\setM} = q$.
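Proposition 8.3 can be checked by Monte Carlo simulation. The sketch below (an illustrative setup with a fixed Gaussian design and numpy; all names and parameter values are assumptions, not from the notes) averages the empirical mean squared deviation of the full-model OLS fit over many noise draws and compares it to $\frac{p}{n}\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 5, 2.0
X = rng.normal(size=(n, p))              # fixed design
beta = np.arange(1, p + 1, dtype=float)
f = X @ beta                             # true (here: linear) systematic component

reps = 2000
amse = 0.0
for _ in range(reps):
    y = f + rng.normal(scale=sigma, size=n)          # fresh noise draw
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS fit of the full model
    amse += np.mean((X @ beta_hat - f) ** 2)         # empirical (1/n) d^2
amse /= reps

print(amse, p / n * sigma**2)  # both close to p/n * sigma^2 = 0.2
```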

Proposition 8.4 (AMSE of partial model): Let $\setM \subsetneq \set{1, \ldots, p}$ be a partial model with $\abs{\setM} = q$, then $$\frac{1}{n} d^2_{\setM} = \frac{1}{n} \sum_{i=1}^n \pa{\E{(\dvec x_i^{\setM})^{\top} \vbetahat{}^{\setM}} - \dvec x_i^{\top} \vbeta}^2 + \frac{q}{n} \sigma^2$$
Note:
  • $\frac{1}{n} \sum_{i=1}^n \pa{\E{(\dvec x_i^{\setM})^{\top} \vbetahat{}^{\setM}} - \dvec x_i^{\top} \vbeta}^2$ is the squared bias of the model $\setM$; it decreases as $q$ increases
  • $\frac{q}{n} \sigma^2$ is the variance of the model $\setM$; it increases as $q$ increases
This result is called the bias-variance tradeoff. A smaller but wrong model may be better in AMSE terms by carefully trading squared bias against variance.
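The tradeoff can be illustrated numerically under the WCLM. In the hypothetical setup below (numpy, fixed design; the coefficient values are illustrative assumptions), two covariates carry almost no signal, so dropping them trades a small squared bias for a variance reduction of $\frac{2}{n}\sigma^2$, and the partial model attains the smaller AMSE:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 1.0
X = rng.normal(size=(n, 4))
beta = np.array([2.0, 1.0, 0.05, 0.05])  # last two covariates nearly irrelevant
f = X @ beta                             # true systematic component, fixed design

def amse(cols, reps=3000):
    """Monte Carlo estimate of (1/n) d^2_M for the model using columns `cols`."""
    Xm = X[:, cols]
    total = 0.0
    for _ in range(reps):
        y = f + rng.normal(scale=sigma, size=n)
        y_hat = Xm @ np.linalg.lstsq(Xm, y, rcond=None)[0]
        total += np.mean((y_hat - f) ** 2)
    return total / reps

full = amse([0, 1, 2, 3])  # unbiased, variance term 4/50 * sigma^2 = 0.08
part = amse([0, 1])        # small squared bias, variance term only 2/50 * sigma^2
print(part, full)          # the partial model wins here
```

Rerunning with larger coefficients on the dropped covariates tips the comparison back in favor of the full model, since the squared bias then dominates the variance saving.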

If the data generating process is nonlinear, a similar decomposition can be made.

TODO

8.3 Forward and Backward Selection

TODO

These selection methods do not guarantee that all covariates in $\setM$ are significant. For instance, problems may occur when the covariates are highly correlated. We also face a multiple testing problem and a post-selection inference problem.
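The details of the selection procedures are left open above. As a generic sketch (not the notes' definition): forward selection can be implemented greedily by adding, at each step, the covariate that most reduces the residual sum of squares; the function names and the fixed-size stopping rule here are illustrative assumptions.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the OLS fit using the columns in `cols`."""
    Xm = X[:, cols]
    beta_hat = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r = y - Xm @ beta_hat
    return r @ r

def forward_selection(X, y, max_size):
    """Greedy forward selection: at each step add the covariate that
    reduces the RSS the most; stop at a fixed model size (a sketch)."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_size:
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# usage on simulated data: only covariates 0 and 2 carry signal
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
print(forward_selection(X, y, max_size=2))  # typically [0, 2]
```

Backward selection works analogously, starting from the full model and removing the covariate whose deletion increases the RSS the least.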

8.4 Information Criteria

Our goal is to find a proxy that we can use in a data-driven way to find the best model in terms of AMSE. For that purpose we define the standardized TMSE $\Gamma_p = \frac{d_{\setM}^2}{\sigma^2}$.
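One standard data-driven proxy for $\Gamma_p$ (a common choice, stated here as an assumption since the notes are incomplete at this point) is Mallows' $C_p = \mathrm{RSS}_{\setM} / \sigmahat^2 - n + 2q$, with $\sigmahat^2$ estimated from the full model; under the WCLM it is approximately unbiased for $\Gamma_p$. A small numpy sketch scoring every submodel:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(size=(n, p))
beta = np.array([2.0, 1.0, 0.0, 0.0])  # covariates 2 and 3 are irrelevant
y = X @ beta + rng.normal(size=n)

def rss(cols):
    """Residual sum of squares of the OLS fit using the columns in `cols`."""
    Xm = X[:, list(cols)]
    r = y - Xm @ np.linalg.lstsq(Xm, y, rcond=None)[0]
    return r @ r

sigma2_hat = rss(range(p)) / (n - p)  # sigma^2 estimated from the full model

# Mallows' C_p = RSS_M / sigma2_hat - n + 2q as a proxy for Gamma_p
scores = {cols: rss(cols) / sigma2_hat - n + 2 * len(cols)
          for q in range(1, p + 1)
          for cols in itertools.combinations(range(p), q)}
best = min(scores, key=scores.get)
print(best)  # includes the signal covariates 0 and 1
```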

Proposition 8.5 (Test): ...