1. The Linear Model

We begin by defining some terminology.

Definition 1.1 (Explanatory variable): An explanatory variable is the input quantity or condition.
Note: The term “covariate” may also be used to denote an explanatory variable.

TODO: Independent variable and causality. Every independent variable is an explanatory variable but not every explanatory variable is an independent variable.

Definition 1.2 (Response variable): A response variable is the outcome or observed result.

TODO: Dependent variable and causality. The terms “independent variable” and “dependent variable” should be used with caution as they imply a causal relationship and an understanding of the underlying mechanism.

Definition 1.3 (Statistical model): A statistical model is an assumption on the relationship between explanatory variables and response variables.

TODO: Relationship only on observed values? Why do they hold in generality?

It is important to note that models are nothing more than a set of assumptions on relationships between inputs and outputs. As such, models are introduced as assumptions in this text.

1.1 Linear Model Assumption

Note: A sample is a collection of observations. In this text, $n$ denotes the sample size and $p$ the number of covariates. For each observation $i \in \set{1, \ldots, n}$ let $Y_i$ be the response variable and $x_{i,1}, \ldots, x_{i,p}$ be the explanatory variables.
Assumption 1.4 (Linear model): For each observation $i \in \set{1, \ldots, n}$ the response variable $Y_i$ is a linear function of the covariates $f\of{x_{i,1}, \ldots, x_{i,p}} = \sum_{j=1}^p \beta_j x_{i,j}$ up to some stochastic, zero-mean deviation $\epsilon_i$, i.e. $$Y_i = \sum_{j=1}^p \beta_j x_{i,j} + \epsilon_i$$ with $\E{\epsilon_i} = 0$ for all $i \in \set{1, \ldots, n}$.
Note: We remark that $\E{Y_i} = f\of{x_{i,1}, \ldots, x_{i,p}} = \sum_{j=1}^p \beta_j x_{i,j}$ is called the systematic component of the model.

The linear model is a theoretical assumption of how the response is generated from the covariates. Assuming the linear model is correct, the goal is to estimate the parameters $\beta_1, \ldots, \beta_p$, to study their relevance, and to estimate the deviation variance. The parameters $\beta_j$ and the distribution of $\epsilon_i$ are unknown. The deviations $\epsilon_i$ are unobservable, while the response variables $Y_i$ and the covariates $x_{i,j}$ are given. We can rewrite the model using the vector notation $$\rvec* Y n = \dmat* X n p \cdot \gvec* \beta p + \gvec* \epsilon n$$ where $\vY \in \R^n$ is the random vector of response variables, $\dmat X \in \R^{n \times p}$ is the matrix of covariates, $\vbeta \in \R^p$ is the vector of unknown parameters and $\vepsilon \in \R^n$ is the random vector of deviations.
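For illustration only, the data-generating process in vector form can be sketched with NumPy; the dimensions and parameter values below are hypothetical, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

X = rng.normal(size=(n, p))          # design matrix of covariates (fixed design)
beta = np.array([1.5, -2.0, 0.5])    # hypothetical true parameters (unknown in practice)
eps = rng.normal(scale=0.1, size=n)  # zero-mean stochastic deviations

Y = X @ beta + eps                   # the linear model: Y = X beta + eps
```

In an actual analysis only `X` and `Y` are observed; `beta` is to be estimated and `eps` remains unobservable.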

Note: In this text we assume unless told otherwise that the sample size is larger than the number of covariates, i.e. $n > p$, and that the matrix $\dmat X$ has full rank $p$.
Note: We use the notation $$\dvec* x p _i = \begin{bmatrix} x_{i,1} \\ \vdots \\ x_{i,p} \end{bmatrix} = \pabig{\dmat* X n p _{[i,:]}}^{\top}$$ for a single observation $i \in \set{1, \ldots, n}$ of all the $p$ covariates and $$\dvec* x n _{:,j} = \begin{bmatrix} x_{1,j} \\ \vdots \\ x_{n,j} \end{bmatrix} = \dmat* X n p _{[:,j]}$$ for all $n$ observations of a single covariate $j \in \set{1, \ldots, p}$.
Note: To model an intercept, we set the first covariate variable to be a constant, i.e. $\dvec x _{:,1} = \dvec 1$, resulting in the model $Y_i = \beta_1 + \sum_{j=2}^p \beta_j x_{i,j} + \epsilon_i$.
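As a small sketch (with hypothetical values), setting the first column of the design matrix to ones yields the intercept model:

```python
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0])         # a single covariate, n = 4
X = np.column_stack([np.ones_like(x), x])  # first column constant: x_{:,1} = 1
beta = np.array([2.0, 3.0])                # hypothetical (beta_1, beta_2)
EY = X @ beta                              # systematic component beta_1 + beta_2 * x_i
```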

1.2 On Stochastic Components

The linear model involves some stochastic components. The deviation terms $\epsilon_i$ are random variables, and hence so are the responses $Y_i$. The stochastic nature of the terms $\epsilon_i$ can be attributed to various sources, e.g. measurement errors or the inability to capture all underlying non-systematic effects.

Definition 1.5 (Fixed design): A fixed design describes a study where the values of the covariates $x_{i,j}$ are pre-determined and controlled. These values are treated as non-stochastic, and the analysis is concerned with the responses $Y_i$ at these specific levels.

In other words, in fixed design, the data is considered a sample from the sequence of random variables $(Y_i)_{i=1}^n$ at fixed, pre-determined covariates $(x_i)_{i=1}^n$.

Definition 1.6 (Random design): A random design describes a study, typically observational, where the values of the covariates are not controlled. Instead, both the covariates $X_{i,1}, \ldots, X_{i,p}$ and responses $Y_i$ are treated as random.

Hence, in random design, the data is considered a sample from the sequence of random vectors $$\pa{\begin{bmatrix} X_{i,1} \\ \vdots \\ X_{i,p} \\ Y_i \end{bmatrix}}_{i=1}^{n}.$$

This text deals with models in fixed design, where we assume that covariates are non-stochastic. This can be either because the data was observed in a fixed design study and the covariates are truly non-stochastic, or because the analysis is performed conditional on the observed values, i.e. “given the specific values $x_{1,j}, \dots, x_{n,j}$ we happen to observe from $X_{1,j}, \dots, X_{n,j}$, we assume the relationship is $\E{Y_i \mid \rvec{X}_i = \dvec{x}_i} = f(\dvec{x}_i)$”. In some settings however, e.g. random design studies and causal models, it may be of value to treat the covariates as stochastic.

1.3 Examples

Example (Location model): The model $Y_i = \mu + \epsilon_i$ for all $i \in \set{1, \dots, n}$ is called the location model.
Note: The notion of “fit” is left ambiguous and will be explored later.
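As a minimal sketch (with a hypothetical $\mu$), the location model corresponds to $p = 1$ with a single constant covariate, so the design matrix is a column of ones:

```python
import numpy as np

n, mu = 5, 4.0
X = np.ones((n, 1))    # p = 1: a single constant covariate
beta = np.array([mu])  # the only parameter is the location mu
EY = X @ beta          # systematic component: E[Y_i] = mu for all i
```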
Example (Two sample model): To model two samples of size $n_1$ and $n_2$ we use the linear model $$Y_i = \begin{cases} \mu_1 + \epsilon_i & \text{for } i \in \set{1, \ldots, n_1} \\ \mu_2 + \epsilon_i & \text{for } i \in \set{n_1 + 1, \ldots, n_1 + n_2} \end{cases}$$

We have $p = 2$ and $$\dmat X = \begin{bmatrix} \dvec* 1 {n_1} & \dvec* 0 {n_1} \\ \dvec* 0 {n_2} & \dvec* 1 {n_2} \end{bmatrix}$$
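The block structure of this design matrix can be sketched in NumPy; the sample sizes and group means below are hypothetical.

```python
import numpy as np

n1, n2 = 3, 2
X = np.block([
    [np.ones((n1, 1)), np.zeros((n1, 1))],  # sample 1: indicator column for mu_1
    [np.zeros((n2, 1)), np.ones((n2, 1))],  # sample 2: indicator column for mu_2
])
mu = np.array([10.0, 20.0])  # hypothetical group means (mu_1, mu_2)
EY = X @ mu                  # mu_1 for the first n1 entries, mu_2 for the rest
```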

We will pay special attention to the univariate linear model with and without intercept. Hence we provide formal definitions of these models.

Definition 1.7 (Simple linear model): The simple linear model or SLM is $Y_i = \beta x_i + \epsilon_i$.
Definition 1.8 (Simple linear model with intercept): The simple linear model with intercept or SLMI is $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$.
Example (SLMI): Under the SLMI we have $p = 2$ and $\dmat X = \begin{bmatrix} \dvec 1 & \dvec{x}_{:,2} \end{bmatrix}$.

TODO: Rest of examples.

1.4 Additional Assumptions

We recap that within the framework of the linear model, the following specific assumptions are made:

  1. The systematic component is a linear combination of covariates, i.e. $f(\dvec{x}_i) = \dvectr{x}_i \vbeta$.
  2. The deviations are additive and zero-mean, i.e. $\E{\vepsilon} = \dvec 0$.

Further, unless otherwise noted, in this text we always assume $p < n$ and the absence of multicollinearity.

Assumption 1.9 ($p < n$): The number of predictors $p$ is less than the number of observations $n$.
Assumption 1.10 (No multicollinearity): The design matrix $\dmat X$ is of full rank.
Note: Under $p < n$, no multicollinearity is equivalent to $\rank{\dmat X} = p$ and implies that the columns $\dvec{x}_{:,j}$ are linearly independent.
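In practice this condition can be checked numerically via `numpy.linalg.matrix_rank`; a sketch contrasting a hypothetical full-rank design with a rank-deficient one:

```python
import numpy as np

X = np.column_stack([np.ones(5), np.arange(5.0)])        # intercept plus one covariate
X_bad = np.column_stack([np.ones(5), 2.0 * np.ones(5)])  # second column is a multiple of the first

rank_ok = np.linalg.matrix_rank(X)       # 2: full rank, no multicollinearity
rank_bad = np.linalg.matrix_rank(X_bad)  # 1: rank deficient, multicollinear
```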

We will see that linear independence of the columns in the design matrix $\dmat X$ is necessary in order to obtain unique estimators of the regression coefficients in $\vbeta$.

The linear model is extended to the classical linear model with the following assumptions on the deviations.

Assumption 1.11 (Homoscedastic deviations): The variance of the deviations is constant across observations, i.e. $\Var{\epsilon_i} = \sigma^2$ for all $i$.
Assumption 1.12 (Uncorrelated deviations): The deviations are uncorrelated, i.e. $\Cov{\epsilon_i}{\epsilon_j} = 0$ for all $i \neq j$.
Note: The two aforementioned assumptions together imply $\Cov{\vepsilon}{\vepsilon} = \sigma^2 \dmat{I}$.
Assumption 1.13 (Gaussian deviations): The deviations are Gaussian, i.e. $\epsilon_i \sim \lawN(0, \sigma_i^2)$ for all $i$.

We can now combine these assumptions with the linear model.

Assumption 1.14 (Weak classical linear model): For each observation the response is the linear function $Y_i = \sum_{j=1}^p \beta_j x_{i,j} + \epsilon_i$ with zero-mean, homoscedastic and uncorrelated deviations, i.e.
  • $\E{\epsilon_i} = 0$ for all $i \in \set{1, \ldots, n}$
  • $\Var{\epsilon_i} = \sigma^2$ for all $i \in \set{1, \ldots, n}$
  • $\Cov{\epsilon_i}{\epsilon_j} = 0$ for all $i \neq j$
Note: For notational convenience, we use WCLM to denote the weak classical linear model. The WCLM assumptions are also known under the name of Gauss-Markov assumptions.

If additionally Gaussian deviations are assumed, the strong classical linear model is obtained.

Assumption 1.15 (Strong classical linear model): For each observation the response is the linear function $Y_i = \sum_{j=1}^p \beta_j x_{i,j} + \epsilon_i$ with zero-mean, homoscedastic, independent, Gaussian deviations, i.e. $\epsilon_i \simiid \lawN(0, \sigma^2)$.
Note: For notational convenience, we use SCLM to denote the strong classical linear model. SCLM is a special case of WCLM.
Proposition 1.16 (Distribution under SCLM): Under the strong classical linear model, we have $\vY \sim \lawN(\dmat X \vbeta, \sigma^2 \dmat I)$.
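This proposition can be checked by simulation: drawing many replications of $\vY = \dmat X \vbeta + \vepsilon$ with $\epsilon_i \simiid \lawN(0, \sigma^2)$, the empirical mean and covariance of $\vY$ should approach $\dmat X \vbeta$ and $\sigma^2 \dmat I$. A Monte Carlo sketch with hypothetical design and parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 4, 0.5
X = np.column_stack([np.ones(n), np.arange(1.0, n + 1)])  # hypothetical design, p = 2
beta = np.array([1.0, 2.0])                               # hypothetical parameters

reps = 200_000
eps = rng.normal(scale=sigma, size=(reps, n))  # iid N(0, sigma^2) deviations
Y = X @ beta + eps                             # reps independent draws of the response vector

mean_est = Y.mean(axis=0)          # empirical mean, should approach X beta
cov_est = np.cov(Y, rowvar=False)  # empirical covariance, should approach sigma^2 I
```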