Definition 1.1 (Explanatory variable): An explanatory variable is the input quantity or condition.
Note: The term “covariate” may also be used to denote an explanatory variable.
TODO: Independent variable and causality. Every independent variable is an explanatory variable but not every explanatory variable is an independent variable.
Definition 1.2 (Response variable): A response variable is the outcome or observed result.
TODO: Dependent variable and causality. The terms “independent variable” and “dependent variable” should be used with caution as they imply a causal relationship and an understanding of the underlying mechanism.
Definition 1.3 (Statistical model): A statistical model is an assumption on the relationship between explanatory variables and response variables.
TODO: Relationship only on observed values? Why do they hold in generality?
It is important to note that models are nothing more than a set of assumptions on relationships between inputs and outputs. As such, models are introduced as assumptions in this text.
1.1 Linear Model Assumption
Note: A sample is a collection of observations. In this text, n denotes the sample size and p the number of covariates. For each observation i∈{1,…,n} let Yi be the response variable and xi,1,…,xi,p be the explanatory variables.
Assumption 1.4 (Linear model): For each observation i∈{1,…,n} the response variable Yi is a linear function of the covariates f(xi,1,…,xi,p)=∑j=1pβjxi,j up to some stochastic, zero-mean deviation εi, i.e.
Yi = ∑j=1p βj xi,j + εi
with E[εi]=0 for all i∈{1,…,n}.
Note: We remark that E[Yi]=f(xi,1,…,xi,p)=∑j=1pβjxi,j is denoted as the systematic component of the model.
The linear model is a theoretical framework assumption of how the response is generated from the covariates. Assuming correctness of the linear model, the goal is to estimate the parameters β1,…,βp, to study their relevance and to estimate the deviation variance. The parameters βj and distribution of εi are unknown. The deviations εi are unobservable, while the response variables Yi and the covariates xi,j are given. We can rewrite the model using the vector notation
Y = Xβ + ε
where Y ∈ Rn is the random vector of response variables, X ∈ Rn×p is the matrix of covariates, β ∈ Rp is the vector of unknown parameters and ε ∈ Rn is the random vector of deviations.
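The vector form of the model can be made concrete by simulation. The following is a minimal numpy sketch with arbitrary illustrative values for n, p, β and the deviation distribution; the parameters are only known here because we generate the data ourselves.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 3                          # sample size and number of covariates
X = rng.normal(size=(n, p))           # design matrix, drawn once and then treated as fixed
beta = np.array([2.0, -1.0, 0.5])     # parameters, known here only because we simulate
eps = rng.normal(scale=1.0, size=n)   # zero-mean stochastic deviations

Y = X @ beta + eps                    # the linear model in vector notation: Y = X beta + eps
```

In an actual analysis only `X` and `Y` would be observed, while `beta` and `eps` remain unknown.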
Note: In this text we assume unless told otherwise that the sample size is larger than the number of covariates, i.e. n>p, and that the matrix X has full rank p.
Note: We use the notation
xi = (xi,1, …, xi,p)⊤ = (X[i,:])⊤ ∈ Rp
for a single observation i∈{1,…,n} of all the p covariates and
x:,j = (x1,j, …, xn,j)⊤ = X[:,j] ∈ Rn
for all n observations of a single covariate j∈{1,…,p}.
Note: To model an intercept, we set the first covariate to be constant, i.e. x:,1 = 1, resulting in the model Yi = β1 + ∑j=2p βj xi,j + εi.
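As a small sketch of the intercept construction, with arbitrary illustrative values, we prepend a column of ones to a single covariate and evaluate the systematic component:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # a single covariate
X = np.column_stack([np.ones(len(x)), x])   # prepend the constant covariate x_{:,1} = 1

beta = np.array([10.0, 2.0])                # beta_1 is the intercept, beta_2 the slope
systematic = X @ beta                       # E[Y_i] = beta_1 + beta_2 * x_i

print(systematic)  # [12. 14. 16. 18. 20.]
```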
1.2 On Stochastic Components
The linear model involves stochastic components. The deviation terms εi are random variables, and hence so are the responses Yi. The stochastic nature of the εi can be attributed to various sources, e.g. measurement errors or the inability to capture all underlying non-systematic effects.
Definition 1.5 (Fixed design): A fixed design describes a study where the values of the covariates xi,j are pre-determined and controlled. These values are treated as non-stochastic, and the analysis is concerned with the responses Yi at these specific levels.
In other words, in fixed design, the data is considered a sample from the sequence of random variables (Yi)i=1n at fixed pre-determined covariates (xi)i=1n.
Definition 1.6 (Random design): A random design describes a study, typically observational, where the values of the covariates are not controlled. Instead, both the covariates Xi,1,…,Xi,p and responses Yi are treated as random.
Hence, in random design, the data is considered a sample from the sequence of random vectors
((Xi,1, …, Xi,p, Yi)⊤)i=1n
This text deals with models in fixed design where we assume that covariates are non-stochastic. This can either be because the data was observed in a fixed design study and the covariates are truly non-stochastic, or because the analysis is performed conditional on the observed values, i.e. “given the specific values x1,j, …, xn,j we happen to observe from X1,j, …, Xn,j, we assume the relationship is E[Yi ∣ Xi = xi] = f(xi)”. In some settings however, e.g. random design studies and causal models, it may be of value to consider the covariates as stochastic.
1.3 Examples
Example (Location model): The model Yi=μ+εi for all i∈{1,…,n} is called the location model.
Note: The notion of “fit” is left ambiguous and will be explored later.
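The location model can be cast in the linear model framework with p = 1 and the constant covariate x:,1 = 1. A minimal sketch with an arbitrary illustrative value of μ:

```python
import numpy as np

n, mu = 5, 3.0

X = np.ones((n, 1))      # p = 1 with the constant covariate x_{:,1} = 1
beta = np.array([mu])    # the single parameter is the location mu

systematic = X @ beta    # every entry of E[Y] equals mu
print(systematic)        # [3. 3. 3. 3. 3.]
```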
Example (Two sample model): To model two samples of size n1 and n2 we use the linear model
Yi = μ1 + εi for i ∈ {1, …, n1},
Yi = μ2 + εi for i ∈ {n1+1, …, n1+n2}.
We have p=2 and
X = [1n1, 0n1; 0n2, 1n2]
i.e. the first n1 rows are (1, 0) and the last n2 rows are (0, 1).
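The block structure of this design matrix can be sketched in numpy with arbitrary illustrative sample sizes and group means:

```python
import numpy as np

n1, n2 = 3, 4

# Block design matrix of the two sample model: the first column indicates
# membership in sample 1, the second column membership in sample 2.
X = np.block([
    [np.ones((n1, 1)), np.zeros((n1, 1))],
    [np.zeros((n2, 1)), np.ones((n2, 1))],
])

mu = np.array([5.0, 8.0])   # (mu_1, mu_2)
systematic = X @ mu         # first n1 entries are mu_1, the remaining n2 are mu_2
print(systematic)           # [5. 5. 5. 8. 8. 8. 8.]
```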
We will pay special attention to the univariate linear model with and without intercept. Hence we provide formal definitions of these models.
Definition 1.7 (Simple linear model): The simple linear model or SLM is Yi=βxi+εi.
Definition 1.8 (Simple linear model with intercept): The simple linear model with intercept or SLMI is Yi=β1+β2xi+εi.
Example (SLMI): Under the SLMI we have p=2 and X=[1x:,2].
TODO: Rest of examples.
1.4 Additional Assumptions
We recap that within the framework of the linear model, the following specific assumptions are made:
The systematic component is a linear combination of covariates, i.e. f(xi)=xi⊤β.
The deviations are additive and zero-mean, i.e. E[ε]=0.
Further, unless otherwise noted, in this text we always assume p<n and the absence of multicollinearity.
Assumption 1.9 (p<n): The number of predictors p is less than the number of observations n.
Assumption 1.10 (No multicollinearity): The design matrix X is of full rank.
Note: Under p<n, no multicollinearity is equivalent to rank(X)=p and implies that the columns x:,1,…,x:,p are linearly independent.
We will see that linear independence of the columns in the design matrix X is necessary in order to obtain unique estimators of the regression coefficients in β.
The linear model is extended to the classical linear model with the following assumptions on the deviations.
Assumption 1.11 (Homoscedastic deviations): The variance of the deviations is constant across observations, i.e. Var[εi]=σ2 for all i.
Assumption 1.12 (Uncorrelated deviations): The deviations are uncorrelated, i.e. Cov[εi,εj]=0 for all i≠j.
Note: The two aforementioned assumptions together imply Cov[ε,ε]=σ2I.
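The claim that homoscedastic, uncorrelated deviations have covariance matrix σ2I can be checked empirically. A sketch with arbitrary illustrative values of n and σ2, drawing many replications of the deviation vector:

```python
import numpy as np

rng = np.random.default_rng(42)

n, sigma2, reps = 4, 2.0, 200_000

# Draw many independent replications of a deviation vector whose entries are
# homoscedastic and uncorrelated (here even independent).
eps = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))

emp_cov = np.cov(eps, rowvar=False)   # empirical n x n covariance matrix

# Approximately sigma^2 on the diagonal and approximately 0 off-diagonal.
print(np.round(emp_cov, 1))
```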
Assumption 1.13 (Gaussian deviations): The deviations are Gaussian, i.e. εi∼N(0,σi2) for all i.
We can now combine these assumptions with the linear model.
Assumption 1.14 (Weak classical linear model): For each observation the response is given by Yi = ∑j=1p βj xi,j + εi with zero-mean, homoscedastic and uncorrelated deviations, i.e.
E[εi]=0 for all i∈{1,…,n}
Var[εi]=σ2 for all i∈{1,…,n}
Cov[εi,εj]=0 for all i≠j
Note: For notational convenience, we use WCLM to denote the weak classical linear model. The WCLM assumptions are also known under the name of Gauss-Markov assumptions.
If additionally Gaussian deviations are assumed, the strong classical linear model is obtained.
Assumption 1.15 (Strong classical linear model): For each observation the response is given by Yi = ∑j=1p βj xi,j + εi with zero-mean, homoscedastic, independent, Gaussian deviations, i.e. εi ∼iid N(0,σ2).
Note: For notational convenience, we use SCLM to denote the strong classical linear model. SCLM is a special case of WCLM.
Proposition 1.16 (Distribution under SCLM): Under the strong classical linear model, we have Y ∼ N(Xβ, σ2I).
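The proposition can be illustrated by simulation. A sketch with an arbitrary illustrative SLMI design, checking that the empirical mean and covariance of repeated draws of Y approach Xβ and σ2I:

```python
import numpy as np

rng = np.random.default_rng(1)

n, sigma2, reps = 6, 1.0, 100_000
X = np.column_stack([np.ones(n), np.arange(1.0, n + 1)])  # an SLMI design
beta = np.array([1.0, 0.5])

# Under the SCLM every replication draws Y = X beta + eps with iid N(0, sigma2) deviations.
eps = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
Y = X @ beta + eps

print(np.round(Y.mean(axis=0), 1))           # close to the systematic component X beta
print(np.round(np.cov(Y, rowvar=False), 1))  # close to sigma^2 * I
```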