4. Tests and Confidence Intervals

In this section we assume that the SCLM holds. We start by recapping some important distributions.

Recap (Chi-squared distribution): Let $Z_1, \ldots, Z_d \simiid \lawN(0, 1)$. Then $U = \sum_{i=1}^d Z_i^2 \sim \lawChi{d}$. We have $\E{U} = d$ and $\Var{U} = 2d$.
Recap (T distribution): Let $Z \sim \lawN(0, 1)$ and $U \sim \lawChi{d}$ be independent. Then $T = \frac{Z}{\sqrt{d^{-1} U}} \sim \lawT{d}$. For $d > 1$, we have $\E{T} = 0$. For $d > 2$, we have $\Var{T} = \frac{d}{d-2}$.
Recap (F distribution): Let $U \sim \lawChi{d_1}$ and $V \sim \lawChi{d_2}$ be independent. Then $F = \frac{d_1^{-1} U}{d_2^{-1} V} \sim \lawF{d_1}{d_2}$. For $d_2 > 2$ we have $\E{F} = \frac{d_2}{d_2 - 2}$. For $d_2 > 4$ we have $\Var{F} = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$.
Note: If $T \sim \lawT{d}$ then $T^2 = \frac{Z^2}{d^{-1} U} \sim \lawF{1}{d}$, as $Z^2 \sim \lawChi{1}$.
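This relationship between the $t$ and $F$ quantiles can be checked numerically. A quick sketch, assuming SciPy is available: the $(1-\alpha)$ quantile of $\lawF{1}{d}$ must equal the square of the $(1-\frac{1}{2}\alpha)$ quantile of $\lawT{d}$.

```python
from scipy import stats

# If T ~ t_d, then T^2 ~ F_{1,d}. Equivalent quantile identity:
# the (1 - a) quantile of F_{1,d} equals the squared (1 - a/2)
# quantile of t_d, because the t distribution is symmetric.
d, a = 10, 0.05
q_f = stats.f.ppf(1 - a, dfn=1, dfd=d)
q_t = stats.t.ppf(1 - a / 2, df=d)
print(abs(q_f - q_t**2) < 1e-10)
```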

4.1 Basic Test Statistics

We first assume that $\sigma^2$ is known. We know that $\vbetahat \sim \lawN(\vbeta, \sigma^2 (\dmattr X \dmat X)^{-1})$. We want to test the null hypothesis $\H_0: \beta_j = 0$ and can construct a two-sided test. We do that by noting that under the null hypothesis we have $$T = \frac{\betahat_j}{\sigma \sqrt{(\dmattr X \dmat X)^{-1}_{[j,j]}}} \sim \lawN(0,1)$$ This is our test statistic. We can then compute the $\p$-value of the observed test statistic $t$ by $\p = \probP{\abs{T} > \abs{t}} = 2\Phi(-\abs{t})$ and compare it to a pre-determined test significance level $\alpha \in [0,1]$. We reject $\H_0$ if $\p \leq \alpha$.
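A minimal sketch of this known-$\sigma$ test on simulated data (NumPy and SciPy assumed available; the design matrix and coefficient values below are made up for illustration):

```python
import numpy as np
from scipy import stats

# z-test for H0: beta_j = 0 with known sigma, on simulated data.
rng = np.random.default_rng(0)
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # n x p design, p = 3
beta = np.array([0.5, 0.0, 1.0])                            # beta_1 = 0, so H0 holds for j = 1
Y = X @ beta + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y                                # OLS estimate
j = 1
T = beta_hat[j] / (sigma * np.sqrt(XtX_inv[j, j]))          # test statistic
p_value = 2 * stats.norm.cdf(-abs(T))                       # two-sided p-value
```

Since $\H_0$ is true for this coefficient, the p-value is uniformly distributed and the test rejects with probability $\alpha$.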

4.2 Basic Confidence Interval

Note: Given a random variable $X$, we use $\sigma_X = \sqrt{\Var{X}}$ to denote the standard deviation and $\sigmahat_X$ to denote a suitable estimator. $\sigmahat_X$ is often called the standard error in the literature.

We assume the same setting as before and want to find a $(1-\alpha)$ confidence interval for $\beta_j$. As the standard Gaussian is symmetric we have $$\frac{\beta_j - \betahat_j}{\sigma_{\betahat_j}} \sim \lawN(0,1)$$ where $\sigma_{\betahat_j} = \sigma \sqrt{(\dmattr X \dmat X)^{-1}_{[j,j]}}$, and thus the confidence interval $$\probP{\beta_j \in \left[\betahat_j - \sigma_{\betahat_j} z_{1-\frac{1}{2}\alpha}, \betahat_j + \sigma_{\betahat_j} z_{1-\frac{1}{2}\alpha} \right]} = 1 - \alpha$$ where $z_{1-\frac{1}{2}\alpha} = \Phi^{-1}\of{1-\frac{1}{2}\alpha}$.
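The coverage statement can be verified by Monte Carlo. A sketch assuming NumPy/SciPy, with a made-up design and known $\sigma$: over repeated draws of $\vY$, the interval should contain the true $\beta_j$ in roughly a $(1-\alpha)$ fraction of replications.

```python
import numpy as np
from scipy import stats

# Monte Carlo check of the (1 - alpha) coverage of the z-interval.
rng = np.random.default_rng(1)
n, sigma, alpha = 30, 1.0, 0.1
beta = np.array([1.0, 2.0])
z = stats.norm.ppf(1 - alpha / 2)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
XtX_inv = np.linalg.inv(X.T @ X)
se = sigma * np.sqrt(XtX_inv[1, 1])      # sigma is known here
hits, reps = 0, 2000
for _ in range(reps):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    b = XtX_inv @ X.T @ Y
    if b[1] - z * se <= beta[1] <= b[1] + z * se:
        hits += 1
coverage = hits / reps                   # should be close to 0.9
```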

Note: The $(1-\alpha)$ confidence interval is equivalent to the set of all parameter values that would not be rejected by the hypothesis test for $\H_0$ at significance level $\alpha$.

4.3 Test and CI Distributions

We assume now that $\sigma^2$ is not known. We can construct the following distributions for test statistics and confidence intervals.

Proposition 4.1 (Individual parameter): For each parameter $\beta_j$ we have $$\frac{\betahat_j - \beta_j}{\sigmahat_{\betahat_j}} \sim \lawT{n-p}$$ where $\sigmahat_{\betahat_j} = \sigmahat \sqrt{(\dmattr X \dmat X)^{-1}_{[j,j]}}$.
Proof: TODO

We can use the t-test statistic $T = \sigmahat_{\betahat_j}^{-1} \betahat_j$ to test whether the coefficient $\beta_j$ is significantly different from zero.
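A sketch of this t-test with estimated $\sigmahat^2 = (n-p)^{-1} \vepsilonhattr \vepsilonhat$, again on simulated data (NumPy/SciPy assumed; the coefficient values are illustrative):

```python
import numpy as np
from scipy import stats

# t-test for H0: beta_j = 0 with sigma^2 unknown, on simulated data.
rng = np.random.default_rng(2)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=n)  # beta_2 = 2, clearly nonzero

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma_hat2 = resid @ resid / (n - p)                    # unbiased variance estimate
j = 2
T = beta_hat[j] / np.sqrt(sigma_hat2 * XtX_inv[j, j])
p_value = 2 * stats.t.sf(abs(T), df=n - p)              # two-sided, t_{n-p} reference
```

Here the true $\beta_2 = 2$ is far from zero, so the test rejects $\H_0$ decisively.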

Proposition 4.2 (Entire parameter vector): For the entire parameter vector $\vbeta$ we have $$\frac{\norm{\dmat X (\vbetahat - \vbeta)}^2}{p \sigmahat^2} = \frac{1}{p \sigmahat^2} (\vbetahat - \vbeta)^{\tr} \dmattr X \dmat X (\vbetahat - \vbeta) \sim \lawF{p}{n-p}$$
Proof: TODO

We can use the zero-model F-test statistic $T = (p \sigmahat^2)^{-1} \sum_{i=1}^n \Yhat_i^2$ to test whether the entire parameter vector $\vbeta$ is significantly different from the zero vector $\dvec 0$. We discuss the F-test in detail in the next section.

Proposition 4.3 (Subset of parameter vector): For a subset or linear transformation $\dvec b = \dmat B \vbeta$ with $\dmat B \in \R^{r \times p}$ and $\rank{\dmat B} = r$ we have $$\frac{1}{r \sigmahat^2} (\hat{\dvec b} - \dvec b)^{\tr} \dmat V^{-1} (\hat{\dvec b} - \dvec b) \sim \lawF{r}{n-p}$$ where $\dmat V = \dmat B (\dmattr X \dmat X)^{-1} \dmat B^{\tr}$.
Proof: TODO

As discussed in the next section, we use $\dmat B = [\dmat 0 \sep \dmat I]$ to define a partial F-test.

Proposition 4.4 (Expectation of observation): For the expectation of the $i$-th observation $Y_i$, that is $\E{Y_i} = \vbeta^{\tr} \dvec x_i$, we have $$\frac{\Yhat_i - \E{Y_i}}{\sigmahat_{\Yhat_i}} \sim \lawT{n - p}$$ where $\sigmahat_{\Yhat_i} = \sigmahat \sqrt{\dmat P_{[i,i]}}$.
Proof: TODO
Note: It is important to note that we target the true value, i.e. the expected value $\E{Y_i}$, and not the observation $Y_i$ itself.
Proposition 4.5 (Expectation of observation under any condition): For the expectation of an observation $Y_{n+1}$ under an arbitrary experimental condition $\dvec x_{n+1}$, that is $\E{Y_{n+1}} = \vbeta^{\tr} \dvec x_{n+1}$, we have $$\frac{\Yhat_{n+1} - \E{Y_{n+1}}}{\sigmahat_{\Yhat_{n+1}}} \sim \lawT{n - p}$$ where $\sigmahat_{\Yhat_{n+1}} = \sigmahat \sqrt{\dvectr x_{n+1} (\dmattr X \dmat X)^{-1} \dvec x_{n+1}}$.
Proof: TODO
Note: We again target the expected value $\E{Y_{n+1}}$. In the case that $\dvec x_{n+1} = \dvec x_i$ for some $i \in \set{1, \ldots, n}$ we have $\dvectr x_{i} (\dmattr X \dmat X)^{-1} \dvec x_{i} = \dmat P_{[i,i]}$ and $\sigmahat_{\Yhat_{n+1}} = \sigmahat_{\Yhat_{i}}$.
Proposition 4.6 (New observation): For a new observation $Y_{n+1}$ under the experimental condition $\dvec x_{n+1}$, we have $$\frac{\Yhat_{n+1} - Y_{n+1}}{\sigmahat_{\Yhat_{n+1} - Y_{n+1}}} \sim \lawT{n - p}$$ where $\sigmahat_{\Yhat_{n+1} - Y_{n+1}} = \sigmahat \sqrt{1 + \dvectr x_{n+1} (\dmattr X \dmat X)^{-1} \dvec x_{n+1}}$.
Proof: TODO
Note: Here we target the observation $Y_{n+1}$ itself, thus introducing additional variability into the distribution.
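Propositions 4.5 and 4.6 differ only in the "$1 +$" inside the square root, so the prediction interval for $Y_{n+1}$ is strictly wider than the confidence interval for $\E{Y_{n+1}}$. A sketch on simulated data (NumPy/SciPy assumed; `x_new` is an arbitrary illustrative condition):

```python
import numpy as np
from scipy import stats

# Compare the standard errors of Propositions 4.5 and 4.6 at a new
# experimental condition x_{n+1}. Simulated data, illustrative only.
rng = np.random.default_rng(3)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - p))

x_new = np.array([1.0, 0.3])
h = x_new @ XtX_inv @ x_new
se_mean = sigma_hat * np.sqrt(h)        # for E[Y_{n+1}] (Prop. 4.5)
se_pred = sigma_hat * np.sqrt(1 + h)    # for Y_{n+1} itself (Prop. 4.6)
t_crit = stats.t.ppf(0.975, df=n - p)
y_hat = x_new @ beta_hat
ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
```

The prediction interval `pi` strictly contains the confidence interval `ci`.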

4.4 F-test and its Geometric Interpretation

Recall that we assume SCLM, thus the linear model $\vY = \dmat X \vbeta + \vepsilon$ is correct. Let $\H_0$ be the hypothesis $\H_0: \beta_{q+1} = 0, \ldots, \beta_p = 0$, i.e. only the first $q$ coefficients may be non-zero. We construct the nested model by defining the $n \times q$ submatrix $\dmat X^\circ = \dmat X_{[:,1:q]}$ and the subvector $\vbeta^{\circ} = (\beta_{1}, \ldots, \beta_q)$. Then, under $\H_0$, $\dmat X^\circ$ and $\vbeta^{\circ}$ would suffice to describe $\vY$, i.e. $\vY = \dmat X^\circ \vbeta^{\circ} + \vepsilon$. The following orthogonality holds regardless of $\H_0$.

Proposition 4.7 (Orthogonal components): $(\vY - \vYhat) \perp (\vYhat - \vYhatcirc)$.
Proof: As $\vYhat = \dmat P \vY$, we know that $(\vY - \vYhat) \perp \Range{\dmat X}$. We note that $\Range{\dmat X^{\circ}} = \set{\dvec z \in \R^n \mid \dvec z = \dmat X^{\circ} \dvec b} \subset \Range{\dmat X}$. As $\vYhat \in \Range{\dmat X}$ and $\vYhatcirc \in \Range{\dmat X^{\circ}} \subset \Range{\dmat X}$, we have $(\vYhat - \vYhatcirc) \in \Range{\dmat X}$. Therefore, $(\vY - \vYhat) \perp (\vYhat - \vYhatcirc)$.
Note: The name nested model comes from the fact that $\Range{\dmat X^{\circ}} \subset \Range{\dmat X}$.

Applying Pythagoras, this leads to the following decomposition of the residual sum of squares or RSS.

Theorem 4.8 (RSS decomposition): $$\normbig{\vY - \vYhatcirc}^2 = \normbig{\vY - \vYhat}^2 + \normbig{\vYhat - \vYhatcirc}^2$$
[Figure: anova.svg]
Figure 4.1: RSS decomposition given a nested model.

We summarize what we know about the RSS components:

  • Under SCLM, $\frac{1}{\sigma^2} \vepsilonhat^{\tr} \vepsilonhat = \frac{1}{\sigma^2} \normbig{\vY - \vYhat}^2 \sim \lawChi{n-p}$
  • Under SCLM, $(\vY - \vYhat)$ and $(\vYhat - \vYhatcirc)$ are independent, as they are orthogonal and $\vepsilon$ is spherically Gaussian
  • Under $\H_0$, $\frac{1}{\sigma^2} \vepsilonhatcirctr \vepsilonhatcirc = \frac{1}{\sigma^2} \normbig{\vY - \vYhatcirc}^2 \sim \lawChi{n-q}$
  • Thus, under $\H_0$, as $\normbig{\vYhat - \vYhatcirc}^2 = \normbig{\vY - \vYhatcirc}^2 - \normbig{\vY - \vYhat}^2$, we have $\frac{1}{\sigma^2} \normbig{\vYhat - \vYhatcirc}^2 \sim \lawChi{p-q}$

Hence, we can construct the test statistic for the partial F-test as follows $$T = \frac{(p-q)^{-1} \normbig{\vYhat - \vYhatcirc}^2}{(n-p)^{-1} \normbig{\vY - \vYhat}^2} \sim \lawF{p-q}{n-p}$$
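The statistic is easy to compute by fitting both models. A sketch on simulated data (NumPy/SciPy assumed; here $\H_0$ is true by construction, so the p-value is uniformly distributed):

```python
import numpy as np
from scipy import stats

# Partial F-test comparing a full model (p columns) to a nested
# model keeping only the first q columns. Simulated data.
rng = np.random.default_rng(4)
n, p, q = 60, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)  # last p - q coefficients zero

def fitted(Xm, y):
    """Fitted values of the least-squares fit of y on Xm."""
    return Xm @ np.linalg.lstsq(Xm, y, rcond=None)[0]

Y_full = fitted(X, Y)
Y_nested = fitted(X[:, :q], Y)
rss_full = np.sum((Y - Y_full) ** 2)
T = (np.sum((Y_full - Y_nested) ** 2) / (p - q)) / (rss_full / (n - p))
p_value = stats.f.sf(T, dfn=p - q, dfd=n - p)
```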

Note:
  • The partial F-test statistic satisfies $\tan(\delta)^2 \propto T$ where $\delta = \angle(\vY - \vYhatcirc, \vY - \vYhat)$. The bigger the statistic, the bigger the angle $\delta$ and the less likely it is that the nested model and thus $\H_0$ is true.
  • In other words, we test whether $\vYhat - \vYhatcirc$ is significantly longer than what random noise would produce by comparing it to one “unit” of random variance $\sigmahat^2 = (n-p)^{-1} \normbig{\vY - \vYhat}^2$.
  • It can be shown that the partial F-test statistic can be equivalently derived by using the distribution for the subset of a parameter vector with the $(p-q) \times p$ transformation $\dmat B = [\dmat 0 \sep \dmat I]$ and $\H_0 : \dvec b = \dvec 0$.
  • It can happen that $\H_{a,0}: \beta_1 = 0$ and $\H_{b,0}: \beta_2 = 0$ are not rejected individually but $\H_{c,0}: \beta_1 = \beta_2 = 0$ is rejected. This typically means that the covariates $\dvec x_{:,1}$ and $\dvec x_{:,2}$ are heavily correlated, so either one can be left out but not both.

The zero-model F-test statistic emerges from the nested model with $q = 0$, i.e. $$T = \frac{p^{-1} \normbig{\vYhat}^2}{(n-p)^{-1} \vepsilonhattr \vepsilonhat} = (p \sigmahat^2)^{-1} \sum_{i=1}^n \Yhat_i^2$$ However, the zero-model baseline $\H_0 : \vbeta = \dvec 0$ is almost always trivially significant because the response variable's average is rarely zero. A better baseline is discussed in the next section.

4.5 Coefficient of Determination

Usually, we use the location model $Y_i = \beta + \epsilon_i$ as the reference nested model, as it provides a simple baseline corresponding to $\vYhatcirc = \vYmean$.

Proposition 4.9 (Global F-test): The global F-test statistic $F$ compares the full model to the nested model with just the intercept, i.e. $$F = \frac{(p-1)^{-1} \normbig{\vYhat - \vYmean}^2}{(n-p)^{-1} \normbig{\vY - \vYhat}^2} \sim \lawF{p-1}{n-p}$$ It tests the hypothesis $\H_0 : \beta_2 = 0, \ldots, \beta_p = 0$.
Note: The global F-test statistic can be equivalently derived by using the distribution for the subset of a parameter vector with the $(p-1) \times p$ transformation $\dmat B = [\dvec 0 \sep \dmat I]$ and $\H_0 : \dvec b = \dvec 0$.

One important quantity is the coefficient of determination.

Definition 4.10 (Coefficient of determination): The coefficient of determination is $$R^2 = \max_{\dvec z \in \mathcal{R}(\dmat X)} \cor{\vY, \dvec z}^2$$
Proposition 4.11 (Intercept case): If the full model has an intercept, the coefficient of determination is $$R^2 = \frac{\normbig{\vYhat - \vYmean}^2}{\normbig{\vY - \vYmean}^2}$$
Proof: TODO
Note: In the following we assume the full model has an intercept:
  • $R^2$ is the proportion of variance explained by the full model compared to the total variance.
  • Equivalently, $R^2 = \cor{\vY, \vYhat}^2$, thus $\abs*{\cor{\vY, \vYhat}} = \sqrt{R^2}$.
  • We can relate $R^2$ to the global F-test by $F = \frac{n-p}{p-1} \frac{R^2}{1-R^2}$.
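Both identities in the list above can be checked numerically. A sketch with an intercept model on simulated data (NumPy assumed): $R^2$ computed from the sums of squares should equal $\cor{\vY, \vYhat}^2$, and the global F-statistic computed directly should match $\frac{n-p}{p-1} \frac{R^2}{1-R^2}$.

```python
import numpy as np

# Verify R^2 = cor(Y, Yhat)^2 and F = (n-p)/(p-1) * R^2/(1-R^2)
# for an intercept model on simulated data.
rng = np.random.default_rng(5)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(size=n)
Y_hat = X @ np.linalg.lstsq(X, Y, rcond=None)[0]
Y_bar = Y.mean()                                  # equals Y_hat.mean() with intercept

R2 = np.sum((Y_hat - Y_bar) ** 2) / np.sum((Y - Y_bar) ** 2)
corr2 = np.corrcoef(Y, Y_hat)[0, 1] ** 2
F_direct = (np.sum((Y_hat - Y_bar) ** 2) / (p - 1)) / (np.sum((Y - Y_hat) ** 2) / (n - p))
F_from_R2 = (n - p) / (p - 1) * R2 / (1 - R2)
```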