1 Multiple Linear Regression

Linear regression is a widely used statistical model with a broad range of applications. It is also one of the simplest settings in which to demonstrate important aspects of statistical modelling.

1.1 The Linear Model

Definition 1.1 (Linear model): For each observation i \in \set{1, \ldots, n}, let Y_i be the response variable and x_i^{(1)}, \ldots, x_i^{(p)} be the predictors. In the linear model the response variable is a linear function of the predictors up to some error \epsilon_i: Y_i = \sum_{j=1}^p \beta_j x_i^{(j)} + \epsilon_i
Note: Usually we assume that \epsilon_1, \ldots, \epsilon_n are \iid with \E{\epsilon_i} = 0 and \Var{\epsilon_i} = \sigma^2.

We call n the sample size and p the number of predictors. The goal is to estimate the parameters \set{\beta_1, \ldots, \beta_p}, to study their relevance and to estimate the error variance. The parameters \beta_j and \sigma^2 are unknown and the errors \epsilon_i are unobservable, while the response variables Y_i and the predictors x_i^{(j)} are observed. We can rewrite the model in vector notation: \rb{Y} = \vb{X} \bsym\beta + \bsym\epsilon where \rb{Y} \in \R^n is the random vector of response variables, \vb{X} \in \R^{n \times p} is the matrix of predictors, \bsym\beta \in \R^p is the vector of unknown parameters and \bsym\epsilon \in \R^n is the random vector of errors. We typically assume that the sample size is larger than the number of predictors, i.e. n > p, and that the matrix \vb{X} has full rank p.
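To make the vector notation concrete, here is a minimal simulation sketch in Python/NumPy; the values of n, p, \bsym\beta and \sigma are purely illustrative and do not come from the text. It draws a predictor matrix \vb{X}, iid errors with mean 0 and variance \sigma^2, and forms the response vector \rb{Y} = \vb{X}\bsym\beta + \bsym\epsilon.

```python
import numpy as np

# Minimal simulation of the linear model Y = X beta + eps (illustrative values).
rng = np.random.default_rng(0)
n, p = 100, 3                         # sample size and number of predictors, n > p
X = rng.normal(size=(n, p))           # predictor matrix, full column rank (a.s.)
beta = np.array([1.5, -2.0, 0.5])     # "true" parameters, unknown in practice
sigma = 1.0                           # error standard deviation
eps = rng.normal(0.0, sigma, size=n)  # iid errors with mean 0 and variance sigma^2
Y = X @ beta + eps                    # observed response vector
```

In a real application only X and Y would be available; beta and sigma are exactly the unknowns to be estimated.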

Note: To model an intercept, we set the first predictor variable to be a constant, i.e. x_i^{(1)} = 1. We then get Y_i = \beta_1 + \sum_{j=2}^p \beta_j x_i^{(j)} + \epsilon_i.
Note (On stochastic models): The linear model involves some stochastic components: the error terms \epsilon_i are random variables and hence so are the response variables Y_i. The predictor variables x_i^{(j)} are assumed to be non-random. However, in some applications it is more appropriate to treat the predictor variables as random. The stochastic nature of the error terms \epsilon_i can be attributed to various sources, e.g. measurement errors or the inability to capture all underlying non-systematic effects.
Example (Regression through the origin): Y_i = \beta x_i + \epsilon_i
Example (Simple linear regression): Y_i = \beta_1 + \beta_2 x_i + \epsilon_i
Example (Transformed predictors): Y_i = \beta_1 + \beta_2 \log{x_i^{(2)}} + \beta_3 \sin{x_i^{(3)}} + \epsilon_i
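All three examples are linear in the parameters \beta_j, so they fit the same framework: an intercept or transformed predictors simply correspond to particular columns of \vb{X}. A small sketch of building such a design matrix (the arrays x2 and x3 are hypothetical raw predictor values):

```python
import numpy as np

# Hypothetical raw predictors (x2 must be positive where log is applied).
x2 = np.array([1.0, 2.0, 4.0, 8.0])
x3 = np.array([0.1, 0.5, 1.0, 2.0])

# Design matrix for Y_i = beta_1 + beta_2 log(x_i^(2)) + beta_3 sin(x_i^(3)) + eps_i:
# a constant column models the intercept, transformed predictors enter as columns.
X = np.column_stack([np.ones_like(x2), np.log(x2), np.sin(x3)])
```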

1.2 Least Squares Method

We assume the linear model \vb{Y} = \vb{X} \bsym\beta + \bsym\epsilon. Our goal is to find a good estimate of \bsym\beta.

Definition 1.2 (Least squares estimator): The least squares estimator \hat{\bsym\beta} is defined as \hat{\bsym\beta} = \argmin_{\bsym\beta \in \R^p} \norm{\rb{Y} - \vb{X}\bsym\beta}^2. Assuming that \vb{X} has rank p, the minimizer can be computed explicitly by \hat{\bsym\beta} = \pa{\vb{X}^{\top} \vb{X}}^{-1} \vb{X}^{\top} \rb{Y}
Proof: As \norm{\rb{Y} - \vb{X}\bsym\beta}^2 is convex in \bsym\beta, we find the minimizer by setting its gradient to \vb{0}: \nabla \norm{\rb{Y} - \vb{X}\bsym\beta}^2 = -2 \vb{X}^{\top} (\rb{Y} - \vb{X}\bsym\beta) \stackrel{!}{=} \vb{0}. This yields the normal equations \vb{X}^{\top} \vb{X} \hat{\bsym\beta} = \vb{X}^{\top} \rb{Y}. Under the assumption that \vb{X} has rank p, the matrix \vb{X}^{\top} \vb{X} \in \R^{p \times p} has full rank and is invertible, thus \hat{\bsym\beta} = \pa{\vb{X}^{\top} \vb{X}}^{-1} \vb{X}^{\top} \rb{Y}
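As a numerical illustration of the closed form and the normal equations, here is a short Python/NumPy sketch with simulated data (all values are illustrative). It computes \hat{\bsym\beta} both via the explicit formula and via a standard least squares solver, and checks that \vb{X}^{\top}(\rb{Y} - \vb{X}\hat{\bsym\beta}) is numerically zero.

```python
import numpy as np

# Illustrative data: simulate a linear model with known (hypothetical) beta.
rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))                         # full column rank (a.s.)
Y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=n)

# Explicit closed form (X^T X)^{-1} X^T Y -- fine for illustration,
# but numerically less stable than a dedicated least squares solver.
beta_hat_explicit = np.linalg.inv(X.T @ X) @ X.T @ Y

# Preferred in practice: solve min ||Y - X beta||^2 directly.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(beta_hat, beta_hat_explicit)

# Normal equations / vanishing gradient: X^T (Y - X beta_hat) is (numerically) zero.
print(np.abs(X.T @ (Y - X @ beta_hat)).max())
```

In practice the explicit inverse is avoided for numerical stability; dedicated solvers such as np.linalg.lstsq solve the same minimization problem without forming the inverse.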
Definition 1.3 (Residuals): We define the residuals as R_i = Y_i - \vb{x}_i^{\top} \hat{\bsym\beta}, where \vb{x}_i^{\top} denotes the i-th row of \vb{X}.

The residuals are estimates of the \epsilon_i's. Thus it is plausible to use \hat{\sigma}^2 = \frac{1}{n-p} \sum_{i=1}^n R_i^2 as the estimator for the error variance \sigma^2. It will be shown later that using the factor \frac{1}{n-p} yields \E{\hat{\sigma}^2} = \sigma^2, i.e. an unbiased estimator.
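A short sketch of this variance estimate (again with simulated, illustrative data): it computes the residuals and \hat{\sigma}^2 with the factor 1/(n-p). A single draw gives a noisy estimate, but it is unbiased for \sigma^2 in the sense stated above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 100, 3, 2.0                  # illustrative values
X = rng.normal(size=(n, p))
beta = np.array([1.5, -2.0, 0.5])

Y = X @ beta + rng.normal(0.0, sigma, size=n)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

R = Y - X @ beta_hat                       # residuals, estimates of the errors
sigma2_hat = (R @ R) / (n - p)             # estimator of sigma^2 with factor 1/(n-p)
print(sigma2_hat, sigma**2)                # noisy, but close to the true value
```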