Computational Statistics
1. Multiple Linear Regression
1.1 Introduction
Why are we using probability? It reflects a philosophical assumption we make about the data: that it is generated by an (unknown) underlying distribution. Moreover, if certain further assumptions are met, this framework lets us guarantee that the model and procedure used are the best possible ones.
Note: Often the first column of $\vb{X}$ is set to $\vb{1} = (1, \ldots, 1)^{\top}$; this allows for an intercept. The features may also be transformed, e.g. $(x_{*}^j)^2$ or $\log(x_{*}^j)$, and the model is still called linear, since it remains linear in the coefficients $\boldsymbol{\beta}$.
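A minimal sketch of such a design matrix in NumPy (the feature vectors `x1`, `x2` are illustrative and not from the notes): the first column is $\vb{1}$ for the intercept, further columns may be transformed features, and the model stays linear in $\boldsymbol{\beta}$.

```python
import numpy as np

# Illustrative feature vectors (not from the notes).
x1 = np.array([0.5, 1.0, 2.0, 4.0])
x2 = np.array([1.0, 2.0, 3.0, 4.0])

# Design matrix: intercept column plus (possibly transformed) features.
X = np.column_stack([
    np.ones_like(x1),   # first column 1 = (1, ..., 1)^T -> intercept
    x1,
    x1 ** 2,            # squared feature: still linear in the coefficients
    np.log(x2),         # log-transformed feature
])
print(X.shape)          # (n, p) = (4, 4)
```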
Note: If the linear model holds, then $f_{\text{dream}} : \vb{x} \mapsto \vb{x}^{\top} \boldsymbol{\beta}$.
Note (Prediction vs causality): Assume $p=2$ and $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$, $\forall i \in \qty{1, \ldots, n, \text{new}}$.
- The statement “When actively setting $x_i = x$, the best guess of $Y_i$ equals $\beta_1 + \beta_2 x$” is false in general: the fitted relationship describes the observed association only, and no causal effect of $x$ on $Y$ has been established (see the simulation sketch after this list).
- The statement “When observing $x_i = x_{\text{new}}$, the best guess of $Y_i$ equals $\beta_1 + \beta_2 x_{\text{new}}$” is correct.
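To make the distinction concrete, here is a small simulation sketch under an assumed setup that is not from the notes: a hidden confounder `h` drives both $x$ and $Y$. The regression of $Y$ on $x$ then predicts well for an observed $x_{\text{new}}$, yet actively setting $x$ would not change $Y$ at all in this setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
h = rng.normal(size=n)                         # unobserved confounder (assumed setup)
x = h + rng.normal(scale=0.1, size=n)          # x is driven by h
y = 2.0 * h + rng.normal(scale=0.1, size=n)    # Y is driven by h, not by x directly

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # slope close to 2: good for predicting Y from an *observed* x

# Actively setting x (an intervention) leaves h untouched, so the causal effect
# of x on Y is 0 in this setup; the regression slope does not estimate it.
```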
Note: $\operatorname{RSS}(*) = \norm{\vb{y} - \vb{X}\, *}_2^2$ assumes that the $\vb{X}$'s have been observed perfectly (without measurement error), so that all noise is attributed to $\vb{y}$.
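As a minimal sketch (the simulated data and variable names are assumptions for illustration), the RSS is a squared Euclidean norm, and the least-squares solution minimizes it:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 features
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)                # noise only in y

def rss(beta):
    resid = y - X @ beta
    return resid @ resid        # RSS(beta) = ||y - X beta||_2^2

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimate minimizes the RSS
print(rss(beta_hat) <= rss(beta_true))             # True: no beta has smaller RSS
```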