4. Tests and Confidence Intervals

In this section we assume that the SCLM holds. We start by recapping some important distributions.

Recap (Chi-squared distribution): Let $Z_1, \ldots, Z_d \simiid \lawN(0, 1)$. Then $U = \sum_{i=1}^d Z_i^2 \sim \lawChi{d}$. We have $\E{U} = d$ and $\Var{U} = 2d$.
Recap (T distribution): Let $Z \sim \lawN(0, 1)$ and $U \sim \lawChi{d}$ be independent. Then $T = \frac{Z}{\sqrt{d^{-1} U}} \sim \lawT{d}$. For $d > 1$, we have $\E{T} = 0$. For $d > 2$, we have $\Var{T} = \frac{d}{d-2}$.
Recap (F distribution): Let $U \sim \lawChi{d_1}$ and $V \sim \lawChi{d_2}$ be independent. Then $F = \frac{d_1^{-1} U}{d_2^{-1} V} \sim \lawF{d_1}{d_2}$. For $d_2 > 2$ we have $\E{F} = \frac{d_2}{d_2 - 2}$. For $d_2 > 4$ we have $\Var{F} = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$.
Note: If $T \sim \lawT{d}$ then $T^2 = \frac{Z^2}{d^{-1} U} \sim \lawF{1}{d}$, as $Z^2 \sim \lawChi{1}$.
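This relationship between the $t$ and $F$ quantiles can be checked numerically. A quick sketch, assuming SciPy is available: the $(1-\alpha)$ quantile of $\lawF{1}{d}$ must equal the square of the $(1-\frac{1}{2}\alpha)$ quantile of $\lawT{d}$.

```python
from scipy import stats

# If T ~ t_d, then T^2 ~ F_{1,d}. Equivalent quantile identity:
# the (1 - a) quantile of F_{1,d} equals the squared (1 - a/2)
# quantile of t_d, because the t distribution is symmetric.
d, a = 10, 0.05
q_f = stats.f.ppf(1 - a, dfn=1, dfd=d)
q_t = stats.t.ppf(1 - a / 2, df=d)
print(abs(q_f - q_t**2) < 1e-10)
```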

4.1 Basic Test Statistics

We first assume that $\sigma^2$ is known. We know that $\vbetahat \sim \lawN(\vbeta, \sigma^2 (\dmattr X \dmat X)^{-1})$. We want to test the null hypothesis $\H_0: \beta_j = 0$ and can construct a two-sided test. We do that by noting that under the null hypothesis we have $$T = \frac{\betahat_j}{\sigma \sqrt{(\dmattr X \dmat X)^{-1}_{[j,j]}}} \sim \lawN(0,1)$$ This is our test statistic. We can then compute the $\p$-value of the observed test statistic $t$ by $\p = \probP{\abs{T} > \abs{t}} = 2\Phi(-\abs{t})$ and compare it to a pre-determined test significance level $\alpha \in [0,1]$. We reject $\H_0$ if $\p \leq \alpha$.
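A minimal sketch of this known-$\sigma$ test on simulated data (NumPy and SciPy assumed available; the design matrix and coefficient values below are made up for illustration):

```python
import numpy as np
from scipy import stats

# z-test for H0: beta_j = 0 with known sigma, on simulated data.
rng = np.random.default_rng(0)
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # n x p design, p = 3
beta = np.array([0.5, 0.0, 1.0])                            # beta_1 = 0, so H0 holds for j = 1
Y = X @ beta + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y                                # OLS estimate
j = 1
T = beta_hat[j] / (sigma * np.sqrt(XtX_inv[j, j]))          # test statistic
p_value = 2 * stats.norm.cdf(-abs(T))                       # two-sided p-value
```

Since $\H_0$ is true for this coefficient, the p-value is uniformly distributed and the test rejects with probability $\alpha$.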

4.2 Basic Confidence Interval

Note: Given a random variable $X$, we use $\sigma_X = \sqrt{\Var{X}}$ to denote the standard deviation and $\sigmahat_X$ to denote a suitable estimator. $\sigmahat_X$ is often called the standard error in the literature.

We assume the same setting as before and want to find a $(1-\alpha)$ confidence interval for $\beta_j$. As the standard Gaussian is symmetric we have $$\frac{\beta_j - \betahat_j}{\sigma_{\betahat_j}} \sim \lawN(0,1)$$ where $\sigma_{\betahat_j} = \sigma \sqrt{(\dmattr X \dmat X)^{-1}_{[j,j]}}$, and thus the confidence interval $$\probP{\beta_j \in \left[\betahat_j - \sigma_{\betahat_j} z_{1-\frac{1}{2}\alpha}, \betahat_j + \sigma_{\betahat_j} z_{1-\frac{1}{2}\alpha} \right]} = 1 - \alpha$$ where $z_{1-\frac{1}{2}\alpha} = \Phi^{-1}\of{1-\frac{1}{2}\alpha}$.
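The coverage statement can be verified by Monte Carlo. A sketch assuming NumPy/SciPy, with a made-up design and known $\sigma$: over repeated draws of $\vY$, the interval should contain the true $\beta_j$ in roughly a $(1-\alpha)$ fraction of replications.

```python
import numpy as np
from scipy import stats

# Monte Carlo check of the (1 - alpha) coverage of the z-interval.
rng = np.random.default_rng(1)
n, sigma, alpha = 30, 1.0, 0.1
beta = np.array([1.0, 2.0])
z = stats.norm.ppf(1 - alpha / 2)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
XtX_inv = np.linalg.inv(X.T @ X)
se = sigma * np.sqrt(XtX_inv[1, 1])      # sigma is known here
hits, reps = 0, 2000
for _ in range(reps):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    b = XtX_inv @ X.T @ Y
    if b[1] - z * se <= beta[1] <= b[1] + z * se:
        hits += 1
coverage = hits / reps                   # should be close to 0.9
```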

Note: The $(1-\alpha)$ confidence interval is equivalent to the set of all parameter values that would not be rejected by the hypothesis test for $\H_0$ at significance level $\alpha$.

4.3 Test and CI Distributions

We assume now that $\sigma^2$ is not known. We can construct the following distributions for test statistics and confidence intervals.

Proposition 4.1 (Individual parameter): For each parameter $\beta_j$ we have $$\frac{\betahat_j - \beta_j}{\sigmahat_{\betahat_j}} \sim \lawT{n-p}$$ where $\sigmahat_{\betahat_j} = \sigmahat \sqrt{(\dmattr X \dmat X)^{-1}_{[j,j]}}$.
Proof: TODO

We can use the t-test statistic $T = \sigmahat_{\betahat_j}^{-1} \betahat_j$ to test whether the coefficient $\beta_j$ is significantly different from zero.
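A sketch of this t-test with estimated $\sigmahat^2 = (n-p)^{-1} \vepsilonhattr \vepsilonhat$, again on simulated data (NumPy/SciPy assumed; the coefficient values are illustrative):

```python
import numpy as np
from scipy import stats

# t-test for H0: beta_j = 0 with sigma^2 unknown, on simulated data.
rng = np.random.default_rng(2)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=n)  # beta_2 = 2, clearly nonzero

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma_hat2 = resid @ resid / (n - p)                    # unbiased variance estimate
j = 2
T = beta_hat[j] / np.sqrt(sigma_hat2 * XtX_inv[j, j])
p_value = 2 * stats.t.sf(abs(T), df=n - p)              # two-sided, t_{n-p} reference
```

Here the true $\beta_2 = 2$ is far from zero, so the test rejects $\H_0$ decisively.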

Proposition 4.2 (Entire parameter vector): For the entire parameter vector $\vbeta$ we have $$\frac{\norm{\dmat X (\vbetahat - \vbeta)}^2}{p \sigmahat^2} = \frac{1}{p \sigmahat^2} (\vbetahat - \vbeta)^{\tr} \dmattr X \dmat X (\vbetahat - \vbeta) \sim \lawF{p}{n-p}$$
Proof: TODO

We can use the zero-model F-test statistic $T = (p \sigmahat^2)^{-1} \sum_{i=1}^n \Yhat_i^2$ to test whether the entire parameter vector $\vbeta$ is significantly different from the zero vector $\dvec 0$. We discuss the F-test in detail in the next section.

Proposition 4.3 (Subset of parameter vector): For a subset or linear transformation $\dvec b = \dmat B \vbeta$ with $\dmat B \in \R^{r \times p}$ and $\rank{\dmat B} = r$ we have $$\frac{1}{r \sigmahat^2} (\hat{\dvec b} - \dvec b)^{\tr} \dmat V^{-1} (\hat{\dvec b} - \dvec b) \sim \lawF{r}{n-p}$$ where $\dmat V = \dmat B (\dmattr X \dmat X)^{-1} \dmat B^{\tr}$.
Proof: TODO

As discussed in the next section, we use $\dmat B = [\dmat 0 \sep \dmat I]$ to define a partial F-test.

Proposition 4.4 (Expectation of observation): For the expectation of the $i$-th observation $Y_i$, that is $\E{Y_i} = \vbeta^{\tr} \dvec x_i$, we have $$\frac{\Yhat_i - \E{Y_i}}{\sigmahat_{\Yhat_i}} \sim \lawT{n - p}$$ where $\sigmahat_{\Yhat_i} = \sigmahat \sqrt{\dmat P_{[i,i]}}$.
Proof: TODO
Note: It is important to note that we target the true value, i.e. the expected value $\E{Y_i}$, and not the observation $Y_i$ itself.
Proposition 4.5 (Expectation of observation under any condition): For the expectation of an observation $Y_{n+1}$ under an arbitrary experimental condition $\dvec x_{n+1}$, that is $\E{Y_{n+1}} = \vbeta^{\tr} \dvec x_{n+1}$, we have $$\frac{\Yhat_{n+1} - \E{Y_{n+1}}}{\sigmahat_{\Yhat_{n+1}}} \sim \lawT{n - p}$$ where $\sigmahat_{\Yhat_{n+1}} = \sigmahat \sqrt{\dvectr x_{n+1} (\dmattr X \dmat X)^{-1} \dvec x_{n+1}}$.
Proof: TODO
Note: We again target the expected value $\E{Y_{n+1}}$. In the case that $\dvec x_{n+1} = \dvec x_i$ for some $i \in \set{1, \ldots, n}$ we have $\dvectr x_{i} (\dmattr X \dmat X)^{-1} \dvec x_{i} = \dmat P_{[i,i]}$ and $\sigmahat_{\Yhat_{n+1}} = \sigmahat_{\Yhat_{i}}$.
Proposition 4.6 (New observation): For a new observation $Y_{n+1}$ under the experimental condition $\dvec x_{n+1}$, we have $$\frac{\Yhat_{n+1} - Y_{n+1}}{\sigmahat_{\Yhat_{n+1} - Y_{n+1}}} \sim \lawT{n - p}$$ where $\sigmahat_{\Yhat_{n+1} - Y_{n+1}} = \sigmahat \sqrt{1 + \dvectr x_{n+1} (\dmattr X \dmat X)^{-1} \dvec x_{n+1}}$.
Proof: TODO
Note: Here we target the observation $Y_{n+1}$ itself, thus introducing additional variability into the distribution.
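Propositions 4.5 and 4.6 differ only in the "$1 +$" inside the square root, so the prediction interval for $Y_{n+1}$ is strictly wider than the confidence interval for $\E{Y_{n+1}}$. A sketch on simulated data (NumPy/SciPy assumed; `x_new` is an arbitrary illustrative condition):

```python
import numpy as np
from scipy import stats

# Compare the standard errors of Propositions 4.5 and 4.6 at a new
# experimental condition x_{n+1}. Simulated data, illustrative only.
rng = np.random.default_rng(3)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - p))

x_new = np.array([1.0, 0.3])
h = x_new @ XtX_inv @ x_new
se_mean = sigma_hat * np.sqrt(h)        # for E[Y_{n+1}] (Prop. 4.5)
se_pred = sigma_hat * np.sqrt(1 + h)    # for Y_{n+1} itself (Prop. 4.6)
t_crit = stats.t.ppf(0.975, df=n - p)
y_hat = x_new @ beta_hat
ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
```

The prediction interval `pi` strictly contains the confidence interval `ci`.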

4.4 F-test and its Geometric Interpretation

Recall that we assume SCLM, thus the linear model $\vY = \dmat X \vbeta + \vepsilon$ is correct. Let $\H_0$ be the hypothesis $\H_0: \beta_{q+1} = 0, \ldots, \beta_p = 0$, i.e. only the first $q$ coefficients may be non-zero. We construct the nested model by defining the $n \times q$ submatrix $\dmat X^\circ = \dmat X_{[:,1:q]}$ and the subvector $\vbeta^{\circ} = (\beta_{1}, \ldots, \beta_q)$. Then, under $\H_0$, $\dmat X^\circ$ and $\vbeta^{\circ}$ would suffice to describe $\vY$, i.e. $\vY = \dmat X^\circ \vbeta^{\circ} + \vepsilon$. The following orthogonality holds regardless of $\H_0$.

Proposition 4.7 (Orthogonal components): $(\vY - \vYhat) \perp (\vYhat - \vYhatcirc)$.
Proof: As $\vYhat = \dmat P \vY$, we know that $(\vY - \vYhat) \perp \Range{\dmat X}$. We note that $\Range{\dmat X^{\circ}} = \set{\dvec z \in \R^n \mid \dvec z = \dmat X^{\circ} \dvec b} \subset \Range{\dmat X}$. As $\vYhat \in \Range{\dmat X}$ and $\vYhatcirc \in \Range{\dmat X^{\circ}} \subset \Range{\dmat X}$, we have $(\vYhat - \vYhatcirc) \in \Range{\dmat X}$. Therefore, $(\vY - \vYhat) \perp (\vYhat - \vYhatcirc)$.
Note: The name nested model comes from the fact that $\Range{\dmat X^{\circ}} \subset \Range{\dmat X}$.

Applying Pythagoras, this leads to the following decomposition of the residual sum of squares or RSS.

Theorem 4.8 (RSS decomposition): $$\normbig{\vY - \vYhatcirc}^2 = \normbig{\vY - \vYhat}^2 + \normbig{\vYhat - \vYhatcirc}^2$$
[Figure: anova.svg]
Figure 4.1: RSS decomposition given a nested model.

We summarize what we know about the RSS components:

  • Under SCLM, $\frac{1}{\sigma^2} \vepsilonhat^{\tr} \vepsilonhat = \frac{1}{\sigma^2} \normbig{\vY - \vYhat}^2 \sim \lawChi{n-p}$
  • Under SCLM, $(\vY - \vYhat)$ and $(\vYhat - \vYhatcirc)$ are independent, as they are orthogonal and $\vepsilon$ is spherically Gaussian
  • Under $\H_0$, $\frac{1}{\sigma^2} \vepsilonhatcirctr \vepsilonhatcirc = \frac{1}{\sigma^2} \normbig{\vY - \vYhatcirc}^2 \sim \lawChi{n-q}$
  • Thus, under $\H_0$, as $\normbig{\vYhat - \vYhatcirc}^2 = \normbig{\vY - \vYhatcirc}^2 - \normbig{\vY - \vYhat}^2$, we have $\frac{1}{\sigma^2} \normbig{\vYhat - \vYhatcirc}^2 \sim \lawChi{p-q}$

Hence, we can construct the test statistic for the partial F-test as follows $$T = \frac{(p-q)^{-1} \normbig{\vYhat - \vYhatcirc}^2}{(n-p)^{-1} \normbig{\vY - \vYhat}^2} \sim \lawF{p-q}{n-p}$$
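The statistic is easy to compute by fitting both models. A sketch on simulated data (NumPy/SciPy assumed; here $\H_0$ is true by construction, so the p-value is uniformly distributed):

```python
import numpy as np
from scipy import stats

# Partial F-test comparing a full model (p columns) to a nested
# model keeping only the first q columns. Simulated data.
rng = np.random.default_rng(4)
n, p, q = 60, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)  # last p - q coefficients zero

def fitted(Xm, y):
    """Fitted values of the least-squares fit of y on Xm."""
    return Xm @ np.linalg.lstsq(Xm, y, rcond=None)[0]

Y_full = fitted(X, Y)
Y_nested = fitted(X[:, :q], Y)
rss_full = np.sum((Y - Y_full) ** 2)
T = (np.sum((Y_full - Y_nested) ** 2) / (p - q)) / (rss_full / (n - p))
p_value = stats.f.sf(T, dfn=p - q, dfd=n - p)
```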

Note:
  • The partial F-test statistic satisfies $\tan(\delta)^2 \propto T$ where $\delta = \angle(\vY - \vYhatcirc, \vY - \vYhat)$. The bigger the statistic, the bigger the angle $\delta$ and the less likely it is that the nested model and thus $\H_0$ is true.
  • In other words, we test whether $\vYhat - \vYhatcirc$ is significantly longer than what random noise would produce by comparing it to one “unit” of random variance $\sigmahat^2 = (n-p)^{-1} \normbig{\vY - \vYhat}^2$.
  • It can be shown that the partial F-test statistic can be equivalently derived by using the distribution for the subset of a parameter vector with the $(p-q) \times p$ transformation $\dmat B = [\dmat 0 \sep \dmat I]$ and $\H_0 : \dvec b = \dvec 0$.
  • It can happen that $\H_{a,0}: \beta_1 = 0$ and $\H_{b,0}: \beta_2 = 0$ are not rejected individually but $\H_{c,0}: \beta_1 = \beta_2 = 0$ is rejected. This typically means that the covariates $\dvec x_{:,1}$ and $\dvec x_{:,2}$ are heavily correlated, so either one can be left out but not both.

The zero-model F-test statistic emerges from the nested model with $q = 0$, i.e. $$T = \frac{p^{-1} \normbig{\vYhat}^2}{(n-p)^{-1} \vepsilonhattr \vepsilonhat} = (p \sigmahat^2)^{-1} \sum_{i=1}^n \Yhat_i^2$$ However, the zero-model baseline $\H_0 : \vbeta = \dvec 0$ is almost always trivially significant because the response variable's average is rarely zero. A better baseline is discussed in the next section.

4.5 Coefficient of Determination

Usually, we use the location model $Y_i = \beta + \epsilon_i$ as the reference nested model, as it provides a simple baseline corresponding to $\vYhatcirc = \vYmean$.

Proposition 4.9 (Global F-test): The global F-test statistic $F$ compares the full model to the nested model with just the intercept, i.e. $$F = \frac{(p-1)^{-1} \normbig{\vYhat - \vYmean}^2}{(n-p)^{-1} \normbig{\vY - \vYhat}^2} \sim \lawF{p-1}{n-p}$$ It tests the hypothesis $\H_0 : \beta_2 = 0, \ldots, \beta_p = 0$.
Note: The global F-test statistic can be equivalently derived by using the distribution for the subset of a parameter vector with the $(p-1) \times p$ transformation $\dmat B = [\dvec 0 \sep \dmat I]$ and $\H_0 : \dvec b = \dvec 0$.

One important quantity is the coefficient of determination.

Definition 4.10 (Coefficient of determination): The coefficient of determination is $$R^2 = \max_{\dvec z \in \mathcal{R}(\dmat X)} \cor{\vY, \dvec z}^2$$
Proposition 4.11 (Intercept case): If the full model has an intercept, the coefficient of determination is $$R^2 = \frac{\normbig{\vYhat - \vYmean}^2}{\normbig{\vY - \vYmean}^2}$$
Proof: TODO
Note: In the following we assume the full model has an intercept:
  • $R^2$ is the proportion of variance explained by the full model compared to the total variance.
  • Equivalently, $R^2 = \cor{\vY, \vYhat}^2$, thus $\abs*{\cor{\vY, \vYhat}} = \sqrt{R^2}$.
  • We can relate $R^2$ to the global F-test by $F = \frac{n-p}{p-1} \frac{R^2}{1-R^2}$.
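Both identities in the list above can be checked numerically. A sketch with an intercept model on simulated data (NumPy assumed): $R^2$ computed from the sums of squares should equal $\cor{\vY, \vYhat}^2$, and the global F-statistic computed directly should match $\frac{n-p}{p-1} \frac{R^2}{1-R^2}$.

```python
import numpy as np

# Verify R^2 = cor(Y, Yhat)^2 and F = (n-p)/(p-1) * R^2/(1-R^2)
# for an intercept model on simulated data.
rng = np.random.default_rng(5)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(size=n)
Y_hat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
Y_bar = Y.mean()                                  # equals Y_hat.mean() with intercept

R2 = np.sum((Y_hat - Y_bar) ** 2) / np.sum((Y - Y_bar) ** 2)
corr2 = np.corrcoef(Y, Y_hat)[0, 1] ** 2
F_direct = (np.sum((Y_hat - Y_bar) ** 2) / (p - 1)) / (np.sum((Y - Y_hat) ** 2) / (n - p))
F_from_R2 = (n - p) / (p - 1) * R2 / (1 - R2)
```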