1. The Linear Model

We begin by defining some terminology.

Definition 1.1 (Explanatory variable): An explanatory variable is the input quantity or condition.
Note: The term “covariate” may also be used to denote an explanatory variable.

TODO: Independent variable and causality. Every independent variable is an explanatory variable but not every explanatory variable is an independent variable.

Definition 1.2 (Response variable): A response variable is the outcome or observed result.

TODO: Dependent variable and causality. The terms “independent variable” and “dependent variable” should be used with caution as they imply a causal relationship and an understanding of the underlying mechanism.

Definition 1.3 (Statistical model): A statistical model is an assumption on the relationship between explanatory variables and response variables.

TODO: Relationship only on observed values? Why do they hold in generality?

It is important to note that models are nothing more than a set of assumptions on relationships between inputs and outputs. As such, models are introduced as assumptions in this text.

1.1 Linear Model Assumption

Note: A sample is a collection of observations. In this text, $n$ denotes the sample size and $p$ the number of covariates. For each observation $i \in \set{1, \ldots, n}$ let $Y_i$ be the response variable and $x_{i,1}, \ldots, x_{i,p}$ be the explanatory variables.
Assumption 1.4 (Linear model): For each observation $i \in \set{1, \ldots, n}$ the response variable $Y_i$ is a linear function of the covariates $f\of{x_{i,1}, \ldots, x_{i,p}} = \sum_{j=1}^p \beta_j x_{i,j}$ up to some stochastic, zero-mean deviation $\epsilon_i$, i.e. $$Y_i = \sum_{j=1}^p \beta_j x_{i,j} + \epsilon_i$$ with $\E{\epsilon_i} = 0$ for all $i \in \set{1, \ldots, n}$.
Note: We remark that $\E{Y_i} = f\of{x_{i,1}, \ldots, x_{i,p}} = \sum_{j=1}^p \beta_j x_{i,j}$ is called the systematic component of the model.

The linear model is a theoretical assumption of how the response is generated from the covariates. Assuming the linear model is correct, the goal is to estimate the parameters $\beta_1, \ldots, \beta_p$, to study their relevance, and to estimate the deviation variance. The parameters $\beta_j$ and the distribution of $\epsilon_i$ are unknown. The deviations $\epsilon_i$ are unobservable, while the response variables $Y_i$ and the covariates $x_{i,j}$ are given. We can rewrite the model using the vector notation $$\rvec* Y n = \dmat* X n p \cdot \gvec* \beta p + \gvec* \epsilon n$$ where $\vY \in \R^n$ is the random vector of response variables, $\dmat X \in \R^{n \times p}$ is the matrix of covariates, $\vbeta \in \R^p$ is the vector of unknown parameters and $\vepsilon \in \R^n$ is the random vector of deviations.
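For illustration only, the data-generating process in vector form can be sketched with NumPy; the dimensions and parameter values below are hypothetical, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

X = rng.normal(size=(n, p))          # design matrix of covariates (fixed design)
beta = np.array([1.5, -2.0, 0.5])    # hypothetical true parameters (unknown in practice)
eps = rng.normal(scale=0.1, size=n)  # zero-mean stochastic deviations

Y = X @ beta + eps                   # the linear model: Y = X beta + eps
```

In an actual analysis only `X` and `Y` are observed; `beta` is to be estimated and `eps` remains unobservable.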

Note: In this text we assume unless told otherwise that the sample size is larger than the number of covariates, i.e. $n > p$, and that the matrix $\dmat X$ has full rank $p$.
Note: We use the notation $$\dvec* x p _i = \begin{bmatrix} x_{i,1} \\ \vdots \\ x_{i,p} \end{bmatrix} = \pabig{\dmat* X n p _{[i,:]}}^{\top}$$ for a single observation $i \in \set{1, \ldots, n}$ of all the $p$ covariates and $$\dvec* x n _{:,j} = \begin{bmatrix} x_{1,j} \\ \vdots \\ x_{n,j} \end{bmatrix} = \dmat* X n p _{[:,j]}$$ for all $n$ observations of a single covariate $j \in \set{1, \ldots, p}$.
Note: To model an intercept, we set the first covariate variable to be a constant, i.e. $\dvec x _{:,1} = \dvec 1$, resulting in the model $Y_i = \beta_1 + \sum_{j=2}^p \beta_j x_{i,j} + \epsilon_i$.
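As a small sketch (with hypothetical values), setting the first column of the design matrix to ones yields the intercept model:

```python
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0])         # a single covariate, n = 4
X = np.column_stack([np.ones_like(x), x])  # first column constant: x_{:,1} = 1
beta = np.array([2.0, 3.0])                # hypothetical (beta_1, beta_2)
EY = X @ beta                              # systematic component beta_1 + beta_2 * x_i
```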

1.2 On Stochastic Components

The linear model involves some stochastic components. The deviation terms $\epsilon_i$ are random variables, and hence so are the responses $Y_i$. The stochastic nature of the terms $\epsilon_i$ can be attributed to various sources, e.g. measurement errors or the inability to capture all underlying non-systematic effects.

Definition 1.5 (Fixed design): A fixed design describes a study where the values of the covariates $x_{i,j}$ are pre-determined and controlled. These values are treated as non-stochastic, and the analysis is concerned with the responses $Y_i$ at these specific levels.

In other words, in fixed design, the data is considered a sample from the sequence of random variables $(Y_i)_{i=1}^n$ at fixed, pre-determined covariates $(x_i)_{i=1}^n$.

Definition 1.6 (Random design): A random design describes a study, typically observational, where the values of the covariates are not controlled. Instead, both the covariates $X_{i,1}, \ldots, X_{i,p}$ and responses $Y_i$ are treated as random.

Hence, in random design, the data is considered a sample from the sequence of random vectors $$\pa{\begin{bmatrix} X_{i,1} \\ \vdots \\ X_{i,p} \\ Y_i \end{bmatrix}}_{i=1}^{n}.$$

This text deals with models in fixed design, where we assume that covariates are non-stochastic. This can be either because the data was observed in a fixed design study and the covariates are truly non-stochastic, or because the analysis is performed conditional on the observed values, i.e. “given the specific values $x_{1,j}, \dots, x_{n,j}$ we happen to observe from $X_{1,j}, \dots, X_{n,j}$, we assume the relationship is $\E{Y_i \mid \rvec{X}_i = \dvec{x}_i} = f(\dvec{x}_i)$”. In some settings however, e.g. random design studies and causal models, it may be of value to treat the covariates as stochastic.

1.3 Examples

Example (Location model): The model $Y_i = \mu + \epsilon_i$ for all $i \in \set{1, \dots, n}$ is called the location model.
Note: The notion of “fit” is left ambiguous and will be explored later.
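As a minimal sketch (with a hypothetical $\mu$), the location model corresponds to $p = 1$ with a single constant covariate, so the design matrix is a column of ones:

```python
import numpy as np

n, mu = 5, 4.0
X = np.ones((n, 1))    # p = 1: a single constant covariate
beta = np.array([mu])  # the only parameter is the location mu
EY = X @ beta          # systematic component: E[Y_i] = mu for all i
```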
Example (Two sample model): To model two samples of size $n_1$ and $n_2$ we use the linear model $$Y_i = \begin{cases} \mu_1 + \epsilon_i & \text{for } i \in \set{1, \ldots, n_1} \\ \mu_2 + \epsilon_i & \text{for } i \in \set{n_1 + 1, \ldots, n_1 + n_2} \end{cases}$$

We have $p = 2$ and $$\dmat X = \begin{bmatrix} \dvec* 1 {n_1} & \dvec* 0 {n_1} \\ \dvec* 0 {n_2} & \dvec* 1 {n_2} \end{bmatrix}$$
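The block structure of this design matrix can be sketched in NumPy; the sample sizes and group means below are hypothetical.

```python
import numpy as np

n1, n2 = 3, 2
X = np.block([
    [np.ones((n1, 1)), np.zeros((n1, 1))],  # sample 1: indicator column for mu_1
    [np.zeros((n2, 1)), np.ones((n2, 1))],  # sample 2: indicator column for mu_2
])
mu = np.array([10.0, 20.0])  # hypothetical group means (mu_1, mu_2)
EY = X @ mu                  # mu_1 for the first n1 entries, mu_2 for the rest
```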

We will pay special attention to the univariate linear model with and without intercept. Hence we provide formal definitions of these models.

Definition 1.7 (Simple linear model): The simple linear model or SLM is $Y_i = \beta x_i + \epsilon_i$.
Definition 1.8 (Simple linear model with intercept): The simple linear model with intercept or SLMI is $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$.
Example (SLMI): Under the SLMI we have $p = 2$ and $\dmat X = \begin{bmatrix} \dvec 1 & \dvec{x}_{:,2} \end{bmatrix}$.

TODO: Rest of examples.

1.4 Additional Assumptions

We recap that within the framework of the linear model, the following specific assumptions are made:

  1. The systematic component is a linear combination of covariates, i.e. $f(\dvec{x}_i) = \dvectr{x}_i \vbeta$.
  2. The deviations are additive and zero-mean, i.e. $\E{\vepsilon} = \dvec 0$.

Further, unless otherwise noted, in this text we always assume $p < n$ and the absence of multicollinearity.

Assumption 1.9 ($p < n$): The number of predictors $p$ is less than the number of observations $n$.
Assumption 1.10 (No multicollinearity): The design matrix $\dmat X$ is of full rank.
Note: Under $p < n$, no multicollinearity is equivalent to $\rank{\dmat X} = p$ and implies that the columns $\dvec{x}_{:,j}$ are linearly independent.
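In practice this condition can be checked numerically via `numpy.linalg.matrix_rank`; a sketch contrasting a hypothetical full-rank design with a rank-deficient one:

```python
import numpy as np

X = np.column_stack([np.ones(5), np.arange(5.0)])        # intercept plus one covariate
X_bad = np.column_stack([np.ones(5), 2.0 * np.ones(5)])  # second column is a multiple of the first

rank_ok = np.linalg.matrix_rank(X)       # 2: full rank, no multicollinearity
rank_bad = np.linalg.matrix_rank(X_bad)  # 1: rank deficient, multicollinear
```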

We will see that linear independence of the columns in the design matrix $\dmat X$ is necessary in order to obtain unique estimators of the regression coefficients in $\vbeta$.

The linear model is extended to the classical linear model with the following assumptions on the deviations.

Assumption 1.11 (Homoscedastic deviations): The variance of the deviations is constant across observations, i.e. $\Var{\epsilon_i} = \sigma^2$ for all $i$.
Assumption 1.12 (Uncorrelated deviations): The deviations are uncorrelated, i.e. $\Cov{\epsilon_i}{\epsilon_j} = 0$ for all $i \neq j$.
Note: The two aforementioned assumptions together imply $\Cov{\vepsilon}{\vepsilon} = \sigma^2 \dmat{I}$.
Assumption 1.13 (Gaussian deviations): The deviations are Gaussian, i.e. $\epsilon_i \sim \lawN(0, \sigma_i^2)$ for all $i$.

We can now combine these assumptions with the linear model.

Assumption 1.14 (Weak classical linear model): For each observation the response is the linear function $Y_i = \sum_{j=1}^p \beta_j x_{i,j} + \epsilon_i$ with zero-mean, homoscedastic and uncorrelated deviations, i.e.
  • $\E{\epsilon_i} = 0$ for all $i \in \set{1, \ldots, n}$
  • $\Var{\epsilon_i} = \sigma^2$ for all $i \in \set{1, \ldots, n}$
  • $\Cov{\epsilon_i}{\epsilon_j} = 0$ for all $i \neq j$
Note: For notational convenience, we use WCLM to denote the weak classical linear model. The WCLM assumptions are also known under the name of Gauss-Markov assumptions.

If additionally Gaussian deviations are assumed, the strong classical linear model is obtained.

Assumption 1.15 (Strong classical linear model): For each observation the response is the linear function $Y_i = \sum_{j=1}^p \beta_j x_{i,j} + \epsilon_i$ with zero-mean, homoscedastic, independent, Gaussian deviations, i.e. $\epsilon_i \simiid \lawN(0, \sigma^2)$.
Note: For notational convenience, we use SCLM to denote the strong classical linear model. SCLM is a special case of WCLM.
Proposition 1.16 (Distribution under SCLM): Under the strong classical linear model, we have $\vY \sim \lawN(\dmat X \vbeta, \sigma^2 \dmat I)$.
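This proposition can be checked by simulation: drawing many replications of $\vY = \dmat X \vbeta + \vepsilon$ with $\epsilon_i \simiid \lawN(0, \sigma^2)$, the empirical mean and covariance of $\vY$ should approach $\dmat X \vbeta$ and $\sigma^2 \dmat I$. A Monte Carlo sketch with hypothetical design and parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 4, 0.5
X = np.column_stack([np.ones(n), np.arange(1.0, n + 1)])  # hypothetical design, p = 2
beta = np.array([1.0, 2.0])                               # hypothetical parameters

reps = 200_000
eps = rng.normal(scale=sigma, size=(reps, n))  # iid N(0, sigma^2) deviations
Y = X @ beta + eps                             # reps independent draws of the response vector

mean_est = Y.mean(axis=0)          # empirical mean, should approach X beta
cov_est = np.cov(Y, rowvar=False)  # empirical covariance, should approach sigma^2 I
```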