3. Nonparametric Regression

We consider here nonparametric regression with one predictor variable. Practically relevant generalizations to more than one or two predictor variables are not straightforward due to the curse of dimensionality and require the different approaches discussed later.

Assumption 3.1 (iid assumption for random design): We assume the observations are iid samples from the joint distribution of two real random variables $\rvec X$ and $Y$, i.e. $(\rvec X_1, Y_1), \ldots, (\rvec X_n, Y_n) \simiid F_{\rvec X, Y}$, where $F_{\rvec X, Y}$ denotes the unknown joint cdf of $(\rvec X, Y)$.
Note: While the iid assumption generalizes to multivariate $\rvec X$, in nonparametric regression we mainly deal with univariate $X$.

We further assume the regression model for our data, i.e. $Y_i = m(X_i) + \epsilon_i$, with strict exogeneity $\E{\epsilon_i \mid X_i = x_i} = 0$. Hence $m(x) = \E{Y \mid X = x}$. We note that $m : \R^p \to \R$ can be an arbitrary regression function and need not be linear.

Note: There are four paradigms in nonparametric regression:
  • local averaging
  • local modelling
  • global modelling
  • penalized modelling

3.1 The Kernel Regression Estimator

The kernel regression estimator or Nadaraya–Watson estimator is a local averaging nonparametric regression estimator. The idea is to substitute the kernel density estimates $\fhat_{X}(x) = \frac{1}{nh} \sum_{i=1}^n k\of{\frac{x - X_i}{h}}$ and $\fhat_{X,Y}(x,y) = \frac{1}{nh^2} \sum_{i=1}^n k\of{\frac{x - X_i}{h}} k\of{\frac{y - Y_i}{h}}$ for $f_{X}$ and $f_{X,Y}$ in $m(x) = \E{Y \mid X = x} = \int_{\R} y \, \frac{f_{X,Y}(x, y)}{f_{X}(x)} \dd y$.

Definition 3.2 (Kernel regression estimator): The kernel regression estimator is $\mhat(x) = \sum_{i=1}^n w_i(x) Y_i$ where $w_i(x) = \frac{k\of{\frac{x - X_i}{h}}}{\sum_{j=1}^n k\of{\frac{x - X_j}{h}}}$
Proof: Substituting the kernel density estimates into $m(x) = \int_{\R} y \, \frac{\fhat_{X,Y}(x,y)}{\fhat_{X}(x)} \dd y$ and assuming the kernel integrates to one and has mean zero, the inner integral is $\int_{\R} y \, \frac{1}{h} k\of{\frac{y - Y_i}{h}} \dd y = Y_i$ (substitute $u = (y - Y_i)/h$). The numerator thus becomes $\frac{1}{nh} \sum_{i=1}^n k\of{\frac{x - X_i}{h}} Y_i$, and dividing by $\fhat_{X}(x) = \frac{1}{nh} \sum_{j=1}^n k\of{\frac{x - X_j}{h}}$ gives the stated weights $w_i(x)$.

An interesting interpretation of the kernel regression estimator is $\mhat(x) = \argmin_{m_x \in \R} \sum_{i=1}^n k\of{\frac{x - X_i}{h}} (Y_i - m_x)^2$. Thus for every $x$, we are searching for the best local constant $m_x$ such that the localized sum of squares is minimized. Localization is here described by the kernel $k$ and gives a large weight to those observations where $X_i$ is close to the point $x$ of interest.
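
A minimal sketch of Definition 3.2 in Python, assuming a Gaussian kernel, an arbitrarily chosen bandwidth $h$, and purely illustrative toy data:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel k(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x, X, Y, h):
    """Kernel regression estimate m_hat(x) = sum_i w_i(x) * Y_i."""
    k = gaussian_kernel((x - X) / h)   # kernel evaluated at scaled distances to x
    w = k / k.sum()                    # weights w_i(x), normalized to sum to one
    return np.sum(w * Y)

# Toy data from Y = m(X) + eps with m(x) = sin(x) (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=100)
Y = np.sin(X) + rng.normal(scale=0.3, size=100)

grid = np.linspace(0, 2 * np.pi, 5)
print(np.round([nadaraya_watson(x, X, Y, h=0.5) for x in grid], 2))
```

A smaller bandwidth $h$ makes the estimate follow the data more closely, while a larger one averages over more neighbouring observations.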

It is useful to represent the regression function estimator $\mhat$ evaluated at the observation points $X_1, \ldots, X_n$ as a linear operator applied to $\rvec Y$.

Definition 3.3 (Smoother matrix): The smoother matrix of a nonparametric regression estimator $\mhat$ is the matrix $\rmat S$ such that $\rvec \Yhat = \rmat S \rvec Y$ where $\rvec \Yhat$ is the vector of $\Yhat_i = \mhat(X_i)$.
Proposition 3.4 (Smoother matrix of kernel regression): The smoother matrix of the kernel regression estimator is $\rmat{S}_{[i,j]} = w_j(X_i)$ for $i,j \in \set{1, \ldots, n}$.
Definition 3.5 (Degrees of freedom): Given a smoother matrix $\rmat S$, the degrees of freedom are defined as $\df = \trace{\rmat S}$
Note: The definition of degrees of freedom can be viewed as a general concept for the number of parameters in a model fit with the smoother matrix $\rmat S$.
Proposition 3.6 (Degrees of freedom of kernel regression): The degrees of freedom of the kernel regression estimator are $\df = \trace{\rmat S} = \sum_{i=1}^n w_i(X_i) = k(0) \sum_{i=1}^n \frac{1}{\sum_{j=1}^n k\of{\frac{X_i - X_j}{h}}}$
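
The smoother matrix and its trace follow directly from Proposition 3.4; the sketch below again assumes a Gaussian kernel and illustrative data:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def smoother_matrix(X, h):
    """Smoother matrix S with S[i, j] = w_j(X_i) for kernel regression."""
    K = gaussian_kernel((X[:, None] - X[None, :]) / h)  # K[i, j] = k((X_i - X_j) / h)
    return K / K.sum(axis=1, keepdims=True)             # normalize rows: sum_j S[i, j] = 1

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
Y = np.sin(X) + rng.normal(scale=0.3, size=50)

S = smoother_matrix(X, h=0.8)
Y_hat = S @ Y            # fitted values at the observation points
df = np.trace(S)         # degrees of freedom df = tr(S)
print(f"df = {df:.2f}")  # approaches n as h -> 0 and 1 as h -> infinity
```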

3.2 The Local Polynomial Estimator

The local polynomial estimator is a local modelling nonparametric regression estimator. The idea is to find the local regression parameters of a polynomial fit.

Definition 3.7 (Local polynomial estimator): The local polynomial estimator is $\mhat(x) = \betahat_{1}(x)$ where $\gvec \betahat(x) = [\betahat_1(x), \ldots, \betahat_p(x)]^{\tr}$ are the locally fitted regression parameters of the polynomial of degree $p-1$, i.e. $\gvec \betahat(x) = \argmin_{\gvec \beta \in \R^{p}} \sum_{i=1}^n k\of{\frac{x - X_i}{h}} \pa{Y_i - \sum_{j=1}^p \beta_j (X_i - x)^{j-1}}^2$

We note that $r! \, \betahat_{r+1}(x)$ estimates the $r$-th derivative $\dvn{r}{m(x)}{x}$ for $r \in \set{0, \ldots, p-1}$.
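
A sketch of the locally weighted least-squares problem behind Definition 3.7, again assuming a Gaussian kernel and toy data (note that the code indexes $\gvec \betahat(x)$ from zero):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def local_polynomial(x, X, Y, h, degree=1):
    """Locally weighted polynomial fit of the given degree around x.

    Returns beta_hat(x); beta_hat[0] is m_hat(x) and r! * beta_hat[r]
    estimates the r-th derivative of m at x.
    """
    w = gaussian_kernel((x - X) / h)                     # kernel weights
    D = np.vander(X - x, N=degree + 1, increasing=True)  # columns (X_i - x)^0, ..., (X_i - x)^degree
    W = np.diag(w)
    # Weighted least squares: solve (D^T W D) beta = D^T W Y.
    return np.linalg.solve(D.T @ W @ D, D.T @ W @ Y)

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=200)
Y = np.sin(X) + rng.normal(scale=0.3, size=200)

x0 = np.pi / 2
beta = local_polynomial(x0, X, Y, h=0.5, degree=2)
print(f"m_hat(x0)  ~ {beta[0]:.2f}")  # estimates m(x0) = sin(pi/2) = 1
print(f"m_hat'(x0) ~ {beta[1]:.2f}")  # 1! * beta[1] estimates m'(x0) = cos(pi/2) = 0
```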

3.3 The Smoothing Splines Estimator

The smoothing splines estimator is a global and penalized modelling nonparametric regression estimator.

Definition 3.8 (Smoothing splines estimator): Let $\setC_2$ be the set of functions $m$ with continuous second derivatives. The smoothing splines estimator for some regularization parameter $\lambda \geq 0$ is $\mhat_{\lambda}(x) = \argmin_{m \in \setC_2} \sum_{i=1}^n (Y_i - m(X_i))^2 + \lambda \int_{\R} \pa{\dvn{2}{m(x)}{x}}^2 \dd x$

The solution $\mhat_{\lambda}(x)$ is a natural cubic spline with knots at $X_1, \ldots, X_n$.
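
A brief usage sketch, assuming a recent SciPy version whose `scipy.interpolate.make_smoothing_spline` fits this penalized criterion with penalty parameter `lam` (the data are again purely illustrative):

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=100))  # abscissae must be strictly increasing
Y = np.sin(X) + rng.normal(scale=0.3, size=100)

# lam plays the role of lambda in Definition 3.8: lam = 0 interpolates the data,
# very large lam pushes the fit towards the least-squares straight line.
spline = make_smoothing_spline(X, Y, lam=0.1)

grid = np.linspace(0, 2 * np.pi, 5)
print(np.round(spline(grid), 2))  # evaluate the fitted natural cubic spline
```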