Linear regression is a widely used statistical model in a broad variety of applications. It is one of the simplest models with which to demonstrate important aspects of statistical modelling.
1.1 The Linear Model
Definition 1.1 (Linear model): For each observation $i \in \{1,\dots,n\}$, let $Y_i$ be the response variable and $x_{i,1},\dots,x_{i,p}$ be the predictors. In the linear model the response variable is a linear function of the predictors up to some error $\epsilon_i$:
$$Y_i = \sum_{j=1}^{p} \beta_j x_{i,j} + \epsilon_i$$
where ∀i:E[εi]=0.
Note: We usually, but not necessarily always, assume:
∀i:Var[εi]=σ2
ε1,…,εn are iid
We call n the sample size and p the number of predictors. The goal is to estimate the parameters β1,…,βp, to study their relevance and to estimate the error variance. The parameters βj and distribution parameters of εi, e.g. σ2, are unknown. The errors εi are unobservable, while the response variables Yi and the predictors xi,j are given. We can rewrite the model using the vector notation
$$\underset{n}{Y} = \underset{n \times p}{X} \cdot \underset{p}{\beta} + \underset{n}{\epsilon}$$
where Y∈Rn is the random vector of response variables, X∈Rn×p is the matrix of predictors, β∈Rp is the vector of unknown parameters and ε∈Rn is the random vector of errors. We typically assume that the sample size is larger than the number of predictors, i.e. n>p, and that the matrix X has full rank p.
Note: We use the notation
$$\underset{p}{x_i} = \begin{pmatrix} x_{i,1} \\ \vdots \\ x_{i,p} \end{pmatrix} = \underset{n\times p}{X}[i,:]$$
for a single observation i∈{1,…,n} of all the p predictors and
$$\underset{n}{x_{:,j}} = \begin{pmatrix} x_{1,j} \\ \vdots \\ x_{n,j} \end{pmatrix} = \underset{n\times p}{X}[:,j]$$
for all n observations of a single predictor j∈{1,…,p}.
Note: To model an intercept, we set the first predictor variable to be a constant, i.e. $x_{:,1} = 1$, resulting in the model $Y_i = \beta_1 + \sum_{j=2}^{p} \beta_j x_{i,j} + \epsilon_i$.
Note (On stochastic models): The linear model involves some stochastic components: the error terms εi are random variables and hence the response variables Yi as well. The predictor variables xi,j are assumed to be non-random. However, in some applications, which are not discussed in this course, it is more appropriate to treat the predictor variables as random. The stochastic nature of the error terms εi can be assigned to various sources, e.g. measurement errors or the inability to capture all underlying non-systematic effects.
Example (Regression through the origin): $Y_i = \beta x_i + \epsilon_i$
Example (Simple linear regression): $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$
Example (Transformed predictors): $Y_i = \beta_1 + \beta_2 \log(x_{i,2}) + \beta_3 \sin(x_{i,3}) + \epsilon_i$
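As a minimal illustration of how such models are cast in the matrix form Y = X⋅β + ε, the following sketch builds the design matrix for the transformed-predictors example, with a constant first column for the intercept. The raw predictor values are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical raw data: n = 5 observations of two raw predictor variables.
n = 5
raw = np.array([[1.2, 0.3],
                [2.5, 1.1],
                [0.7, 2.0],
                [3.1, 0.8],
                [1.9, 1.5]])

# Design matrix for Y_i = beta_1 + beta_2*log(x_{i,2}) + beta_3*sin(x_{i,3}) + eps_i:
# a column of ones (intercept), the log-transformed and the sine-transformed predictor.
X = np.column_stack([np.ones(n), np.log(raw[:, 0]), np.sin(raw[:, 1])])
print(X.shape)  # (n, p) with p = 3
```

Note that the model remains linear in the parameters even though the predictors enter through non-linear transformations; only the columns of X change.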
1.2 Least Squares
We assume the linear model Y=X⋅β+ε and want to find a good estimate of β. We define the Ordinary Least Squares estimator.
Definition 1.2 (OLS estimator): The least squares estimator β^ is defined as
$$\underset{p}{\hat\beta} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\ \Big\| \underset{n}{Y} - \underset{n\times p}{X} \cdot \underset{p}{\beta} \Big\|^2$$
Assuming that X has rank p, the minimizer can be computed explicitly by
$$\underset{p}{\hat\beta} = \Big( \underset{p\times n}{X^\top} \underset{n\times p}{X} \Big)^{-1} \underset{p\times n}{X^\top} \cdot \underset{n}{Y}$$
Note: We distinguish between the least squares estimator, β^, which is a random vector, and the resulting least squares estimate, b^, which is deterministic and computed from a specific data sample y.
Proof: As $\|Y - X\beta\|^2$ is convex, we find the minimizer by setting its gradient to zero:
$$\nabla \|Y - X\beta\|^2 = -2 X^\top (Y - X\beta) \overset{!}{=} 0$$
This yields the normal equations
X⊤X⋅β^=X⊤⋅Y
Under the assumption that X has rank p the matrix X⊤X∈Rp×p has full rank and is invertible, thus
β^=(X⊤X)−1X⊤⋅Y
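A minimal numerical sketch of the estimator on simulated data (all names and values below are illustrative): solving the normal equations and using numpy's least-squares routine yield the same $\hat\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Solve the normal equations X'X beta = X'y (assumes X has full rank p).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically more stable route via a least-squares solver.
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_hat_lstsq)

residuals = y - X @ beta_hat  # R = Y - X beta_hat
```

In practice one avoids forming $(X^\top X)^{-1}$ explicitly; QR- or SVD-based solvers such as `np.linalg.lstsq` are numerically preferable.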
Definition 1.3 (Residuals): We define the residuals as
$$R_i = Y_i - \sum_{j=1}^{p} \hat\beta_j x_{i,j}$$
Note: The equivalent vector notation is R=Y−X⋅β^.
1.2.1 Assumptions
We need some assumptions so that fitting a linear model by least squares is reasonable and that tests and confidence intervals are approximately valid. We distinguish between the OLS assumptions needed for estimation guarantees and the additional normality of errors assumption needed for inference.
Proposition 1.4 (OLS assumptions):
1. The linear model is correct, i.e. Y=X⋅β+ε with E[ε]=0.
2. All xi are exact, i.e. they can be observed perfectly.
3. The errors are homoscedastic, i.e. ∀i: Var[εi]=σ².
4. The errors are uncorrelated, i.e. ∀i≠j: Cov[εi,εj]=0.
Proposition 1.5 (Normality of error assumption):
5. The errors are jointly normal, i.e. ε∼N.
Note:
Assumptions 1., 3. and 4. together with 5. imply ε∼N(0,σ²I), i.e. ε1,…,εn are iid N(0,σ²).
If 2. does not hold, we need corrections from errors in variables methods.
If 3. does not hold, we can use weighted least squares.
If 4. does not hold, we can use generalized least squares.
If 5. does not hold, we can still use least squares but, as OLS is sensitive to outliers, we might prefer robust methods instead.
We emphasize that we do not make any assumptions on the predictor variables, except that the matrix X has full rank p. This ensures that there is no perfect multicollinearity among the predictors, i.e. x:,j are linearly independent.
1.2.2 Geometrical Interpretation
The vector of response variables $Y$ is a random vector in $\mathbb{R}^n$, and $\mathcal{X} = \{X\beta : \beta \in \mathbb{R}^p\}$ describes the $p$-dimensional subspace $\mathcal{X} \subset \mathbb{R}^n$ spanned by the columns $x_{:,j}$. The least squares estimator $\hat\beta$ is then such that $X\hat\beta$ is closest to $Y$ with respect to the Euclidean distance. We denote the vector of fitted values by $\hat{Y} = X\hat\beta$.
Recap (Orthogonal projection matrix): A square matrix $P$ is called an orthogonal projection matrix if $P^2 = P = P^\top$.
Proposition 1.6 (Interpretation of fitted values): The vector of fitted values $\hat{Y}$ is the orthogonal projection of $Y$ onto the $p$-dimensional subspace $\mathcal{X}$.
Proof: Since Y^=X⋅β^=X(X⊤X)−1X⊤⋅Y, the map Y↦Y^ can be represented by the matrix
$$\underset{n\times n}{P} = X (X^\top X)^{-1} X^\top$$
It is evident that $P^\top = P$. It remains to check $P^2 = P$:
$$P^2 = X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top$$
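A small numerical sanity check of this proposition on simulated data (setup purely illustrative): the projection matrix $P$ is symmetric and idempotent, and $PY$ reproduces the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection ("hat") matrix
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

assert np.allclose(P, P.T)        # P is symmetric
assert np.allclose(P @ P, P)      # P is idempotent
assert np.allclose(P @ y, y_hat)  # P projects Y onto the fitted values
```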
The vector of residuals is then defined by $R = Y - \hat{Y}$. Geometrically, the residuals are orthogonal to $\mathcal{X}$ because $\hat{Y}$ is the orthogonal projection of $Y$ onto $\mathcal{X}$. This means that $\forall j: R^\top x_{:,j} = 0$, or equivalently $X^\top R = 0$.
Proposition 1.7 (Interpretation of residuals): The vector of residuals $R$ is the orthogonal projection of $Y$ onto the $(n-p)$-dimensional orthogonal complement $\mathcal{X}^\perp$ of $\mathcal{X}$ in $\mathbb{R}^n$.
Proof: Since R=Y−Y^=Y−X(X⊤X)−1X⊤⋅Y, the map Y↦R can be represented by the matrix
$$\underset{n\times n}{I} - \underset{n\times n}{P} = I - X (X^\top X)^{-1} X^\top$$
which is an orthogonal projection matrix since (I−P)⊤=I−P and
(I−P)2=I2−2IP+P2=I−P
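Analogously, one can verify numerically that $I - P$ is an orthogonal projection and that the residuals are orthogonal to the columns of $X$. The sketch below uses its own simulated data (again purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 2.0, -1.0]) + rng.normal(size=n)

P = X @ np.linalg.inv(X.T @ X) @ X.T
Q = np.eye(n) - P                      # projector onto the orthogonal complement
r = Q @ y                              # residuals R = (I - P) Y

assert np.allclose(Q, Q.T) and np.allclose(Q @ Q, Q)  # I - P is an orthogonal projection
assert np.allclose(X.T @ r, 0)                        # residuals are orthogonal to X's columns
assert np.isclose(np.trace(Q), n - p)                 # its rank (= trace) is n - p
```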
1.2.3 Distributional Properties
We point out that the least squares estimator $\hat\beta$ is a random vector: if we drew new samples from the same data-generating mechanism, the data $y$ would look different each time, and hence so would the least squares estimate $\hat{b}$. We analyze the first two moments of the resulting random vectors and variables.
Proposition 1.8 (First moments of least square estimator):
E[β^]=β, i.e. β^ is an unbiased estimator.
Cov[β^,β^]=σ2(X⊤X)−1.
Proof:
$$\mathbb{E}[\hat\beta] = (X^\top X)^{-1} X^\top \mathbb{E}[Y] = (X^\top X)^{-1} X^\top \mathbb{E}[X\beta + \epsilon] = \beta + (X^\top X)^{-1} X^\top \mathbb{E}[\epsilon] = \beta$$
and
$$\begin{align*}
\operatorname{Cov}[\hat\beta, \hat\beta] &= \mathbb{E}[\hat\beta \hat\beta^\top] - \mathbb{E}[\hat\beta]\, \mathbb{E}[\hat\beta]^\top \\
&= (X^\top X)^{-1} X^\top\, \mathbb{E}\big[(X\beta + \epsilon)(\beta^\top X^\top + \epsilon^\top)\big]\, X (X^\top X)^{-1} - \beta \beta^\top \\
&= (X^\top X)^{-1} X^\top\, \operatorname{Cov}[\epsilon, \epsilon]\, X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1} X^\top X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1}
\end{align*}$$
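A short simulation sketch illustrating Proposition 1.8 (the fixed design and all values are hypothetical): over many repetitions with fresh errors, the empirical mean and covariance of $\hat\beta$ approach $\beta$ and $\sigma^2 (X^\top X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 60, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed design
beta = np.array([1.0, 2.0, -1.5])

# Repeatedly draw new errors, refit, and collect the estimates.
estimates = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
estimates = np.array(estimates)

print(estimates.mean(axis=0))            # ≈ beta (unbiasedness)
print(np.cov(estimates, rowvar=False))   # ≈ sigma^2 (X'X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```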
Note: We write $\sigma_j^2 = \sigma^2 (X^\top X)^{-1}_{[j,j]}$ for the variance and $\sigma_j = \sqrt{\sigma_j^2}$ for the standard deviation of $\hat\beta_j$.
Proposition 1.9 (First moments of fitted values):
E[Y^]=Xβ.
Cov[Y^,Y^]=σ2P.
Proof: $\mathbb{E}[\hat{Y}] = \mathbb{E}[X\hat\beta] = X\beta$
and
$$\begin{align*}
\operatorname{Cov}[\hat{Y}, \hat{Y}] &= \mathbb{E}[\hat{Y} \hat{Y}^\top] - \mathbb{E}[\hat{Y}]\, \mathbb{E}[\hat{Y}]^\top \\
&= X\, \mathbb{E}[\hat\beta \hat\beta^\top]\, X^\top - X \beta \beta^\top X^\top \\
&= X \big( \operatorname{Cov}[\hat\beta, \hat\beta] + \beta \beta^\top \big) X^\top - X \beta \beta^\top X^\top \\
&= \sigma^2 X (X^\top X)^{-1} X^\top \\
&= \sigma^2 P
\end{align*}$$
Note: The first two moments of the vector of response variables are E[Y]=Xβ and Cov[Y,Y]=σ²I. Furthermore we have $\operatorname{Cov}[Y, \hat{Y}] = \operatorname{Cov}[Y, PY] = \operatorname{Cov}[Y, Y]\, P^\top = \sigma^2 P$.
Proposition 1.10 (First moments of residuals):
E[R]=0.
Cov[R,R]=σ2(I−P).
Proof: We have $\mathbb{E}[R] = \mathbb{E}[Y] - \mathbb{E}[\hat{Y}] = X\beta - X\beta = 0$. Further
$$\begin{align*}
\operatorname{Cov}[R, R] &= \mathbb{E}[R R^\top] \\
&= \mathbb{E}\big[(Y - \hat{Y})(Y - \hat{Y})^\top\big] \\
&= \mathbb{E}\big[Y Y^\top - Y \hat{Y}^\top - \hat{Y} Y^\top + \hat{Y} \hat{Y}^\top\big] \\
&= \mathbb{E}[Y Y^\top] - 2 \operatorname{Cov}[Y, \hat{Y}] - 2\, \mathbb{E}[Y]\, \mathbb{E}[\hat{Y}]^\top + \mathbb{E}[\hat{Y} \hat{Y}^\top] \\
&= \mathbb{E}[Y Y^\top] - 2 \operatorname{Cov}[Y, \hat{Y}] - 2\, X \beta \beta^\top X^\top + \mathbb{E}[\hat{Y} \hat{Y}^\top] \\
&= \operatorname{Cov}[Y, Y] - 2 \operatorname{Cov}[Y, \hat{Y}] + \operatorname{Cov}[\hat{Y}, \hat{Y}] \\
&= \sigma^2 I - 2 \sigma^2 P + \sigma^2 P \\
&= \sigma^2 (I - P)
\end{align*}$$
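As with the estimator itself, these covariance formulas can be checked by simulation. The sketch below (illustrative setup) compares the empirical covariances of the fitted values and residuals, over repeated error draws, with $\sigma^2 P$ and $\sigma^2 (I - P)$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 20, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, 1.0])
P = X @ np.linalg.inv(X.T @ X) @ X.T

fitted, resid = [], []
for _ in range(50000):
    y = X @ beta + sigma * rng.normal(size=n)
    y_hat = P @ y
    fitted.append(y_hat)
    resid.append(y - y_hat)

cov_fitted = np.cov(np.array(fitted), rowvar=False)  # ≈ sigma^2 P
cov_resid = np.cov(np.array(resid), rowvar=False)    # ≈ sigma^2 (I - P)
print(np.max(np.abs(cov_fitted - sigma**2 * P)))
print(np.max(np.abs(cov_resid - sigma**2 * (np.eye(n) - P))))
```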
For the derivation of the moments we did not need to assume any specific distribution of the errors ε. Using the normality of error assumption ε∼N, we can derive the specific distributions of the random vectors and variables.
Proposition 1.11 (Normal distributions):
β^∼Np(β,σ2(X⊤X)−1)
Y∼Nn(Xβ,σ2I)
Y^∼Nn(Xβ,σ2P)
R∼Nn(0,σ2(I−P))
Proof: The distributions follow from ε∼N and from the fact that any affine transformation of a multivariate normal random vector is again normally distributed.
Note: The normality of error assumption is often not fulfilled in practice. We can then rely on the central limit theorem which implies that for large sample size n, the distributions are still approximately normal. However, it is often much better to use robust methods in case of non-Gaussian errors which are not discussed here.
1.2.4 Error Variance Estimator
We introduce an estimator for the variance of the errors.
Definition 1.12 (Error variance estimator): The estimator for the error variance σ2 is
$$\hat\sigma^2 = \frac{1}{n-p} \sum_{i=1}^{n} R_i^2$$
in the least squares setting.
Proposition 1.13 (Estimator is unbiased): The error variance estimator σ^2 is unbiased, i.e. E[σ^2]=σ2.
Proposition 1.14 (Distribution): The distribution of the error variance estimator is
$$\hat\sigma^2 \sim \frac{\sigma^2}{n-p}\, \chi^2_{n-p}$$
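A minimal sketch of the estimator in code, on simulated data (names and values illustrative); note the division by $n - p$ rather than $n$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 80, 4, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + sigma * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

sigma2_hat = residuals @ residuals / (n - p)  # unbiased error variance estimator
print(sigma2_hat, sigma**2)
```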
1.3 Tests and Confidence Intervals
1.3.1 Test of Regression Coefficients
We impose the OLS and normality of error assumptions. As we have seen before, the estimator $\hat\beta$ is normally distributed. If we are interested in whether the j-th predictor variable is relevant, we can test the null-hypothesis $H_{0,j}: \beta_j = 0$ against the alternative $H_{A,j}: \beta_j \neq 0$.
Recap (t-distribution): Let Z∼N(0,1) and V∼χm2. The random variable
$$T = \frac{Z}{\sqrt{V / m}}$$
follows the t-distribution with m degrees of freedom, i.e. T∼tm.
Definition 1.15 (t-statistic): Under the null-hypothesis $H_{0,j}: \beta_j = 0$ we have
$$T_j = \frac{\hat\beta_j}{\sqrt{\hat\sigma^2 (X^\top X)^{-1}_{[j,j]}}} = \frac{\hat\beta_j}{\hat\sigma_j} \sim t_{n-p}$$
We call Tj the t-statistic of the j-th predictor variable.
Proof: We know
β^j∼N(βj,σ2(X⊤X)[j,j]−1)
and
$$\hat\sigma^2 \sim \frac{\sigma^2}{n-p}\, \chi^2_{n-p}$$
thus, under H0,j we have
$$\frac{\hat\beta_j \Big/ \sqrt{\sigma^2 (X^\top X)^{-1}_{[j,j]}}}{\sqrt{(n-p)^{-1}\, \dfrac{(n-p)\,\hat\sigma^2}{\sigma^2}}} = \frac{\hat\beta_j}{\sqrt{\hat\sigma^2 (X^\top X)^{-1}_{[j,j]}}} \sim t_{n-p}$$
The test corresponding to the t-statistic is called the t-test. In practice, we can thus quantify the relevance of individual predictor variables by looking at the size of the test-statistics T1,…,Tp or at the corresponding p-values, which may be more informative.
Note: Given that we observe the t-statistic $t_j$ for some sample $y$ of the response variables, the p-value of the t-test is $p_{t_j} = 2\, \mathbb{P}(T \geq |t_j|)$ where $T \sim t_{n-p}$.
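The following sketch computes the t-statistics and two-sided p-values for all coefficients on simulated data; the setup and names are illustrative, and scipy is assumed to be available for the $t_{n-p}$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.8, 0.0]) + 0.5 * rng.normal(size=n)  # third coefficient truly zero

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)

se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))  # estimated std. dev. of beta_hat_j
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)        # two-sided p-values
print(t_stats, p_values)
```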
The problem by looking at individual tests H0,j is, besides the multiple testing problem in general, that it can happen that all individual tests do not reject the null-hypotheses although it is true that some predictor variables have a significant effect. This can occur because of correlation among the predictor variables. We clarify how to interpret the results of the t-tests.
Note (Interpretation of t-tests): An individual t-test for H0,j should be interpreted as quantifying the effect of the j-th predictor variable after having subtracted the linear effect of all other predictor variables k=j on Y.
Similarly to the t-tests, one can derive confidence intervals for the unknown parameter βj.
Definition 1.16 (Confidence interval): The interval given by
$$\hat\beta_j \pm \hat\sigma_j \cdot f_{t_{n-p}}\!\big(1 - \tfrac{\alpha}{2}\big),$$
where $f_{t_{n-p}}$ denotes the quantile function of the $t_{n-p}$-distribution, is the two-sided confidence interval $C_j$ which covers the true $\beta_j$ with probability $1-\alpha$, i.e. $\mathbb{P}(\beta_j \in C_j) = 1 - \alpha$.
Proof: Recall that
$$\hat\sigma_j = \sqrt{\hat\sigma^2 (X^\top X)^{-1}_{[j,j]}}$$
Given $\beta_j$ we have
$$\frac{\hat\beta_j - \beta_j}{\hat\sigma_j} \sim t_{n-p}$$
$$\implies \mathbb{P}\left( -f_{t_{n-p}}\!\big(1-\tfrac{\alpha}{2}\big) < \frac{\hat\beta_j - \beta_j}{\hat\sigma_j} < f_{t_{n-p}}\!\big(1-\tfrac{\alpha}{2}\big) \right) = 1-\alpha$$
$$\implies \mathbb{P}\left( \hat\beta_j - \hat\sigma_j\, f_{t_{n-p}}\!\big(1-\tfrac{\alpha}{2}\big) < \beta_j < \hat\beta_j + \hat\sigma_j\, f_{t_{n-p}}\!\big(1-\tfrac{\alpha}{2}\big) \right) = 1-\alpha$$
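Continuing the style of the simulated t-test example above (all names illustrative), the two-sided confidence intervals can be computed with the $t_{n-p}$ quantile from scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p, alpha = 100, 3, 0.05
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.8, 0.0]) + 0.5 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

q = stats.t.ppf(1 - alpha / 2, df=n - p)      # (1 - alpha/2)-quantile of t_{n-p}
ci_lower, ci_upper = beta_hat - se * q, beta_hat + se * q
print(np.column_stack([ci_lower, ci_upper]))  # one 95% interval per coefficient
```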
1.3.2 Test of Global Null-Hypothesis
To test whether there exists any effect from the predictor variables, we can look at the global null-hypothesis $H_0: \beta_2 = \dots = \beta_p = 0$ versus the alternative hypothesis $H_A: \exists j \in \{2,\dots,p\}: \beta_j \neq 0$. Such a test can be developed with an analysis of variance decomposition, which takes a simple form for this special case.
Note: We denote by $\bar{Y}$ the global arithmetic mean $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$; in vector expressions $\bar{Y}$ stands for the constant vector $\bar{Y} \cdot 1 \in \mathbb{R}^n$.
Definition 1.17 (Anova decomposition): The total variation $\|Y - \bar{Y}\|^2$ of the $Y_i$ around the mean $\bar{Y}$ can be decomposed as
$$\|Y - \bar{Y}\|^2 = \|\hat{Y} - \bar{Y}\|^2 + \|Y - \hat{Y}\|^2$$
where $\|\hat{Y} - \bar{Y}\|^2$ is the variation of the fitted values $\hat{Y}_i$ around the mean $\bar{Y}$, and $\|Y - \hat{Y}\|^2 = \|R\|^2$ is the variation of the residuals.
Proof: Recall that our linear model includes the intercept, i.e. $X[:,1] = 1$, and recall $\mathcal{X} = \{X\beta : \beta \in \mathbb{R}^p\}$. Let $\tilde\beta$ with $\tilde\beta_1 = \bar{Y}$ and $\forall j \neq 1: \tilde\beta_j = 0$, then
$$X \tilde\beta = \sum_{j=1}^{p} \tilde\beta_j \cdot X[:,j] = \tilde\beta_1 \cdot X[:,1] = \bar{Y} \cdot 1$$
hence $\bar{Y} \in \mathcal{X}$ and $\hat{Y} - \bar{Y} \in \mathcal{X}$. This implies $(\hat{Y} - \bar{Y}) \perp R$ as $R \in \mathcal{X}^\perp$. Since $Y - \bar{Y} = (\hat{Y} - \bar{Y}) + R$, the decomposition of the total variation $\|Y - \bar{Y}\|^2$ follows from Pythagoras.
The idea now is to compare the variation of the fitted values $\hat{Y}_i$ around the mean $\bar{Y}$ with the error variance estimate $\hat\sigma^2$. This ratio is a scale-free quantity.
Definition 1.18 (F-statistic): Under the global null-hypothesis H0:β2=…=βp=0 we have the distribution
$$F = \frac{\|\hat{Y} - \bar{Y}\|^2}{\|Y - \hat{Y}\|^2} \cdot \frac{n-p}{p-1} = \frac{\|\hat{Y} - \bar{Y}\|^2}{(p-1)\,\hat\sigma^2} \sim F_{p-1,\,n-p}$$
We call F the F-statistic.
The corresponding test is called the F-test.
Note: Given that we observe the F-statistic $f$ for some sample $y$ of the response variables, the p-value of the F-test is $p_f = \mathbb{P}(F \geq f)$ where $F \sim F_{p-1,\,n-p}$.
Besides performing a global F-test to quantify the statistical significance of the predictor variables, we often want to describe the goodness of fit of the linear model for explaining the data. A meaningful quantity is the coefficient of determination.
Definition 1.19 (Coefficient of determination): The proportion of the total variation of $Y$ around $\bar{Y}$ explained by the regression is
$$R^2 = \frac{\|\hat{Y} - \bar{Y}\|^2}{\|Y - \bar{Y}\|^2}$$
We call R2 the coefficient of determination.
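To illustrate these quantities together, here is a sketch on simulated data (setup illustrative; scipy is assumed for the F distribution) computing the F-statistic, its p-value, and $R^2$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.3, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
y_bar = y.mean()

ss_reg = np.sum((y_hat - y_bar) ** 2)   # variation of fitted values around the mean
ss_res = np.sum((y - y_hat) ** 2)       # residual variation
ss_tot = np.sum((y - y_bar) ** 2)       # total variation

F = (ss_reg / (p - 1)) / (ss_res / (n - p))
p_value = stats.f.sf(F, dfn=p - 1, dfd=n - p)  # P(F_{p-1, n-p} >= F)
r_squared = ss_reg / ss_tot
print(F, p_value, r_squared)
```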
1.4 Check of Model Assumptions
The residuals Ri=Yi−Y^i can serve as an approximation of the unobservable error term εi and for checking whether the linear model is appropriate.
1.4.1 The Tukey-Anscombe Plot
...
Because the empirical correlation between the residuals and the fitted values is zero by construction under OLS, any pattern in the plot, such as a curve or a fan shape, cannot be a simple linear trend. It must be evidence of a model violation, such as non-linearity of the true relationship (a curve) or non-constant error variance, i.e. heteroscedasticity (a fan shape).
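A minimal sketch of such a plot for a fitted model, on simulated data (matplotlib assumed available; all values illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 3, size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
residuals = y - y_hat

# Tukey-Anscombe plot: residuals against fitted values.
plt.scatter(y_hat, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Tukey-Anscombe plot")
plt.show()
```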