1. Multiple Linear Regression

1.1 Introduction

Why are we using probability? It is largely a philosophical assumption we make about the data: that it is generated by an (unknown) underlying distribution. Moreover, it lets us guarantee that, if certain other assumptions are met, the model and procedure used are the best possible ones.

Note: Often the first column of $\vb{X}$ is set to $\vb{1} = (1, \ldots, 1)^{\top}$; this allows for an intercept. Transformed features, e.g. $(x_{*}^j)^2$ or $\log(x_{*}^j)$, can also be used, and the model is still called linear.
Note: If the linear model holds, then $f_{\text{dream}} : \vb{x} \mapsto \vb{x}^{\top} \boldsymbol{\beta}$.
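The following is a minimal NumPy sketch of this construction (the data, the choice of transforms, and the coefficient values are invented for illustration): the design matrix gets a leading column of ones for the intercept, transformed copies of the raw features are appended, and the prediction map $\vb{x} \mapsto \vb{x}^{\top}\boldsymbol{\beta}$ is applied row-wise.

```python
import numpy as np

# Hypothetical raw data: n = 4 observations of 2 features (made up).
X_raw = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 40.0],
                  [4.0, 80.0]])
n = X_raw.shape[0]

# First column set to 1 = (1, ..., 1)^T so the model has an intercept.
ones = np.ones((n, 1))

# Transformed features, e.g. the square of the first and the log of the
# second; the model remains linear because it is still linear in beta.
X = np.hstack([ones, X_raw, X_raw[:, [0]] ** 2, np.log(X_raw[:, [1]])])

# The prediction map x |-> x^T beta for some coefficient vector beta
# (arbitrary here, purely for illustration).
beta = np.array([0.5, 1.0, -0.2, 0.1, 2.0])
y_hat = X @ beta  # row-wise x_i^T beta
print(X.shape, y_hat.shape)  # (4, 5) (4,)
```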
Note (Prediction vs causality): Assume $p=2$ and $Y_i = \beta_1 + \beta_2 x_i + \epsilon_i$ for all $i \in \qty{1, \ldots, n, \text{new}}$.
  • The statement “When actively setting $x_i = x$, the best guess of $Y_i$ equals $\beta_1 + \beta_2 x$” is false, as there is no proven causal effect.
  • The statement “When observing $x_i = x_{\text{new}}$, the best guess of $Y_i$ equals $\beta_1 + \beta_2 x_{\text{new}}$” is correct.
Note: $\operatorname{RSS}(*) = \norm{\vb{y} - \vb{X}\,*}^2$ assumes that the $\vb{X}$'s have been observed perfectly.
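As a small sketch of evaluating and minimizing the RSS (the data here are simulated, so all names and values are illustrative; `np.linalg.lstsq` is used as one standard least-squares solver):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 50 observations, 3 columns (incl. intercept).
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

def rss(beta, X, y):
    """Residual sum of squares ||y - X beta||^2."""
    resid = y - X @ beta
    return resid @ resid

# The least-squares estimate minimizes RSS over beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(rss(beta_hat, X, y) <= rss(beta_true, X, y))  # True: beta_hat minimizes
```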