For this chapter we assume that the true data generating process is $Y_i = f(x_i) + \varepsilon_i$ with $E[\varepsilon_i] = 0$ and $\operatorname{Cov}[\varepsilon] = \sigma^2 I_n$, i.e. the systematic component $E[Y_i] = f(x_i)$ is not necessarily linear in the covariates.
8.1 Problem Definition
Our aim is to find “the best” approximating fitted linear model.
Definition 8.1 (Partial linear model): We denote the partial linear model using the covariates $M \subseteq \{1,\dots,p\}$ by $X_M \beta^M$, where $X_M$ consists of the columns of $X$ indexed by $M$.
Note: We use $\hat\beta^M$ for the OLS estimator of this model and $\hat{Y}^M$ for the fitted values. For notational convenience we refer to the partial linear model as "the model $M$".
Definition 8.2 (Sum of mean squared deviations): Under the assumption that $Y_i = f(x_i) + \varepsilon_i$ is the true data generating process, the total mean squared deviation of the model $M$ is

$$d_M^2 = \sum_{i=1}^n E\big[(\hat{Y}_i^M - f(x_i))^2\big]$$
Note: In the literature, the (total) mean squared deviation is often called the (total) mean squared error or (T)MSE.
We can thus define the best model as the model $M^\star$ minimizing the average mean squared deviation, i.e.

$$M^\star = \underset{M \subseteq \{1,\dots,p\}}{\arg\min}\; \frac{1}{n} d_M^2$$
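Note that the minimization above ranges over all $2^p$ candidate subsets, so exhaustive search is feasible only for small $p$. A minimal sketch of the candidate set (the function name `all_submodels` is ours, for illustration):

```python
from itertools import combinations

def all_submodels(p):
    """Enumerate every candidate covariate subset M of {1, ..., p}."""
    covariates = range(1, p + 1)
    return [set(c) for q in range(p + 1) for c in combinations(covariates, q)]

models = all_submodels(3)  # 2^3 = 8 models, from the empty model to the full one
```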
Note: We call the average mean squared deviation also average mean squared error or AMSE.
8.2 Bias-Variance Tradeoff
We first assume the WCLM, i.e. that the data generating process is indeed linear with $Y = X\beta + \varepsilon$.
Proposition 8.3 (AMSE of full model): Let $M = \{1,\dots,p\}$ be the full model. Then

$$\frac{1}{n} d_M^2 = \frac{p}{n}\sigma^2$$
Proof: $\frac{1}{n} d_M^2 = \frac{1}{n} E\big[\|\hat{Y}^M - X\beta\|^2\big] = \frac{p}{n} E[\hat\sigma^2] = \frac{p}{n}\sigma^2$
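Proposition 8.3 is easy to check by simulation. The sketch below estimates the AMSE of the full model by Monte Carlo (the design, coefficients, sample sizes, and seed are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0
X = rng.normal(size=(n, p))          # fixed design
beta = np.array([1.0, -2.0, 0.5])    # true coefficients

# Hat matrix H = X (X'X)^{-1} X'; the OLS fitted values are Y_hat = H Y
H = X @ np.linalg.solve(X.T @ X, X.T)

reps = 2000
dev = np.empty(reps)
for r in range(reps):
    Y = X @ beta + sigma * rng.normal(size=n)
    dev[r] = np.mean((H @ Y - X @ beta) ** 2)   # (1/n) ||Y_hat - X beta||^2

amse = dev.mean()   # should be close to (p/n) sigma^2 = 0.06
```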
We note that the AMSE grows linearly in the dimensionality $p$ of the model, i.e. large models accumulate more uncertainty and thus more variance. Hence, it may pay off to use a smaller model $M$ with $|M| = q < p$.
Proposition 8.4 (AMSE of partial model): Let $M \subset \{1,\dots,p\}$ be a partial model with $|M| = q$. Then

$$\frac{1}{n} d_M^2 = \frac{1}{n}\sum_{i=1}^n \Big(E\big[(x_i^M)^\top \hat\beta^M\big] - x_i^\top \beta\Big)^2 + \frac{q}{n}\sigma^2$$
Note:
- $\frac{1}{n}\sum_{i=1}^n \big(E[(x_i^M)^\top \hat\beta^M] - x_i^\top \beta\big)^2$ is the squared bias of the model $M$; it decreases as $q$ increases.
- $\frac{q}{n}\sigma^2$ is the variance of the model $M$; it increases as $q$ increases.
This result is called the bias-variance tradeoff. A smaller but wrong model may be better in AMSE terms by carefully trading squared bias against variance.
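The tradeoff can be seen in a small simulation, assuming a linear truth in which one covariate has a very weak effect (the design, coefficients, and seed are illustrative choices): dropping the weak covariate incurs a little squared bias but removes a full $\sigma^2/n$ of variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 1.0
X = rng.normal(size=(n, 2))
beta = np.array([1.0, 0.05])   # the second covariate barely matters
mu = X @ beta                  # true mean E[Y]

def amse(X_M, reps=2000):
    """Monte Carlo estimate of the AMSE of the model with design X_M."""
    H = X_M @ np.linalg.solve(X_M.T @ X_M, X_M.T)
    dev = np.empty(reps)
    for r in range(reps):
        Y = mu + sigma * rng.normal(size=n)
        dev[r] = np.mean((H @ Y - mu) ** 2)
    return dev.mean()

amse_full  = amse(X)          # variance only: about 2 sigma^2 / n = 0.04
amse_small = amse(X[:, :1])   # squared bias + sigma^2 / n
```

Here `amse_small` comes out below `amse_full`: the omitted effect is small enough that the saved variance outweighs the incurred bias. With a larger $\beta_2$ the ordering flips.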
If the data generating process is nonlinear, a similar decomposition can be made.
TODO
8.3 Forward and Backward Selection
TODO
These selection methods do not guarantee that all covariates in $M$ are significant. For instance, problems may occur when the covariates are highly correlated. We also face a multiple testing problem and a post-selection inference problem.
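As a reference point while this section is being written, here is a minimal sketch of forward selection that greedily adds the covariate with the largest RSS reduction. Real implementations stop via a significance test or an information criterion; here we stop at a fixed target size `q` (a simplification), and all names are ours:

```python
import numpy as np

def rss(X_M, y):
    """Residual sum of squares of the OLS fit on the columns X_M."""
    beta_hat, *_ = np.linalg.lstsq(X_M, y, rcond=None)
    return float(np.sum((y - X_M @ beta_hat) ** 2))

def forward_selection(X, y, q):
    """Greedily grow M by the covariate that most reduces the RSS."""
    M, remaining = [], list(range(X.shape[1]))
    while len(M) < q:
        best = min(remaining, key=lambda j: rss(X[:, M + [j]], y))
        M.append(best)
        remaining.remove(best)
    return M

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + 0.1 * rng.normal(size=100)
selected = forward_selection(X, y, 2)   # picks the two truly active covariates
```

Backward selection proceeds analogously, starting from the full model and repeatedly removing the covariate whose deletion increases the RSS the least.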
8.4 Information Criteria
Our goal is to find a proxy that we can use in a data-driven way to find the best model in terms of AMSE. For that purpose we define the standardized TMSE $\Gamma_p = \frac{d_M^2}{\sigma^2}$.