2. Nonparametric Density Estimation

Assumption 2.1 (i.i.d. assumption for univariate data): We assume the observations are i.i.d. samples from the distribution of a univariate real random variable $X$, i.e. $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} F_X$, where $F_X$ denotes the unknown cdf of $X$.

The goal is to estimate the distribution $F_X$. In particular, we are interested in estimating the density $f_X = \frac{dF_X}{dx}$, assuming that it exists. Instead of assuming a parametric model for the distribution, e.g. $\mathcal{N}(\mu, \sigma^2)$ with unknown $\mu$ and $\sigma$, we want to be as general as possible: we only assume that the density exists and is suitably smooth, e.g. differentiable. It is then possible to estimate the unknown density function $f_X$. Mathematically, a function is an infinite-dimensional object. Density estimation thus serves as a basic principle for estimating infinite-dimensional objects, a principle we will reuse in many other settings such as nonparametric regression with one predictor variable and flexible regression and classification methods with many predictor variables.

2.1 Estimation of a Density

2.1.1 Histogram

The histogram is the oldest and most popular density estimator.

Definition 2.2 (Histogram): Given an origin $x_0 \in \mathbb{R}$ and class width $h \in \mathbb{R}_+$ specifying the $k$ intervals $\mathcal{I}_j = (x_0 + jh,\, x_0 + (j+1)h]$ for $j \in \{0, \ldots, k-1\}$, the histogram plots the function $\mathrm{count}: \mathcal{I}_j \mapsto \#\{i \mid X_i \in \mathcal{I}_j\}$.
Note: A histogram can also plot the relative frequency $\frac{\mathrm{count}(\mathcal{I}_j)}{n}$ instead.
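
To make the definition concrete, here is a minimal Python sketch of the interval counts; the function name `hist_counts` and the simulated data are illustrative, not part of the notes:

```python
import numpy as np

def hist_counts(x, x0, h, k):
    """Counts for the intervals I_j = (x0 + j*h, x0 + (j+1)*h], j = 0, ..., k-1."""
    x = np.asarray(x, dtype=float)
    counts = np.empty(k, dtype=int)
    for j in range(k):
        lo, hi = x0 + j * h, x0 + (j + 1) * h
        counts[j] = np.sum((x > lo) & (x <= hi))
    return counts

# Relative frequencies for simulated data (illustrative):
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
rel_freq = hist_counts(x, x0=-4.0, h=0.5, k=16) / len(x)
```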

2.1.2 Kernel Estimator

Similar to the histogram, we can compute the relative frequency of observations falling into a small region. The density $f_X$ at the point $x$ can be represented as $$f_X(x) = \lim_{h \to 0} \frac{1}{2h} \, \mathbb{P}\!\left(x - h < X \leq x + h\right).$$

Definition 2.3 (Naive density estimator): The naive density estimator is $$\hat{f}(x) = \frac{1}{2hn} \, \#\{i \mid X_i \in (x - h, x + h]\}$$ for some $h > 0$.

This naive estimator is only piecewise constant. As for histograms, we need to specify a bandwidth $h$, but in contrast to the histogram, we do not need to specify an origin $x_0$. An alternative representation of the naive estimator is as follows. Define the weight function $$w(x) = \begin{cases} \frac{1}{2} & \text{if } |x| \leq 1, \\ 0 & \text{otherwise.} \end{cases}$$ Then $$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n w\!\left(\frac{x - X_i}{h}\right).$$ If, instead of the rectangular weight function $w$, we choose a general, typically smooth kernel function $k$, we arrive at the definition of the kernel density estimator.
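
A minimal Python sketch of this weight-function representation; the helper names `w` and `naive_estimator` are illustrative:

```python
import numpy as np

def w(u):
    """Rectangular weight function: 1/2 on [-1, 1], 0 otherwise."""
    return np.where(np.abs(u) <= 1, 0.5, 0.0)

def naive_estimator(x, data, h):
    """f_hat(x) = (1 / (n*h)) * sum_i w((x - X_i) / h), vectorised over evaluation points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h   # shape (len(x), n)
    return w(u).sum(axis=1) / (len(data) * h)
```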

Definition 2.4 (Kernel function): A function $k: \mathbb{R} \to \mathbb{R}_+$ satisfying $\int_{\mathbb{R}} k(x) \, dx = 1$ and symmetry around $0$, i.e. $k(x) = k(-x)$, is called a kernel.
Definition 2.5 (Kernel density estimator): The kernel density estimator is $$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n k\!\left(\frac{x - X_i}{h}\right)$$ for some $h > 0$.
Note: The estimator depends on the bandwidth $h$, which acts as a tuning parameter. For large bandwidths, the estimate $\hat{f}_h(x)$ tends to be very slowly varying as a function of $x$, while small bandwidths produce a more wiggly function estimate.

The smoothness of $\hat{f}_h$ is inherited from the smoothness of the kernel $k$: if the $r$-th derivative $\frac{d^r k(x)}{dx^r}$ exists, then $\frac{d^r \hat{f}_h(x)}{dx^r}$ exists as well.

Example (Gaussian kernel): $k(x) = \varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Example (Finite support kernel): $k(x) = \frac{\pi}{4} \cos\!\left(\frac{\pi}{2} x\right) \mathbf{1}_{\{|x| \leq 1\}}$
Example (Epanechnikov kernel): $k(x) = \frac{3}{4} \left(1 - x^2\right) \mathbf{1}_{\{|x| \leq 1\}}$
Note: The Epanechnikov kernel is asymptotically optimal with respect to the mean integrated squared error.
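
A minimal Python sketch of the kernel density estimator with the Gaussian and Epanechnikov kernels; all function names are illustrative:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: (1 / sqrt(2*pi)) * exp(-u^2 / 2)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u):
    """(3/4) * (1 - u^2) on [-1, 1], 0 otherwise."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kde(x, data, h, kernel=gaussian_kernel):
    """f_hat_h(x) = (1 / (n*h)) * sum_i k((x - X_i) / h)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h
    return kernel(u).sum(axis=1) / (len(data) * h)

# A large bandwidth gives a slowly varying estimate, a small one a wiggly estimate:
rng = np.random.default_rng(1)
data = rng.standard_normal(100)
grid = np.linspace(-4, 4, 201)
smooth = kde(grid, data, h=1.0)
wiggly = kde(grid, data, h=0.05)
```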

2.2 The Role of the Bandwidth

The bandwidth $h$ is often called the smoothing parameter. As $h \to 0$, we will have “$\delta$-spikes” at every observation $X_i$, whereas as $h$ increases the estimate $\hat{f}_h$ becomes smoother.

2.2.1 Variable Bandwidths

Instead of using a global bandwidth, we can use locally varying bandwidths $h(x)$. The general idea is to use a large bandwidth in regions where the data is sparse.

Definition 2.6 (kNN bandwidth): We define the bandwidth $$h(x) = \left|x - X_{(\mathrm{k})}(x)\right|$$ where $X_{(\mathrm{k})}(x)$ is the $\mathrm{k}$-th nearest neighbour of $x$ among $X_1, \ldots, X_n$.
Note: Generally $\hat{f}_{h(x)}$ will no longer be a density, since the integral $\int_{\mathbb{R}} \hat{f}_{h(x)}(x) \, dx$ is not necessarily equal to one.
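
A minimal Python sketch of the kernel estimator with this kNN-based local bandwidth; names are illustrative, and a Gaussian kernel is used for concreteness:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def knn_bandwidth(x, data, k):
    """h(x) = |x - X_(k)(x)|, the distance from x to its k-th nearest observation."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    dists = np.sort(np.abs(x[:, None] - data[None, :]), axis=1)
    return dists[:, k - 1]

def kde_knn(x, data, k, kernel=gaussian_kernel):
    """Kernel estimate with locally varying bandwidth h(x); not guaranteed to integrate to one."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    h = knn_bandwidth(x, data, k)                 # one bandwidth per evaluation point
    u = (x[:, None] - data[None, :]) / h[:, None]
    return kernel(u).sum(axis=1) / (len(data) * h)
```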

2.2.2 Bias-Variance Trade-Off

We can formalize the behaviour of $\hat{f}_h$ when varying the bandwidth $h$ in terms of the bias and variance of the estimator.

Proposition 2.7 (Bias-variance decomposition of kernel estimators): For any point $x \in \mathbb{R}$, the mean squared error $\mathrm{MSE}_{f_X(x)}\!\left(\hat{f}_h(x)\right)$ can be decomposed as $$\mathrm{MSE}_{f_X(x)}\!\left(\hat{f}_h(x)\right) = \mathbb{E}\!\left[\left(\hat{f}_h(x) - f_X(x)\right)^2\right] = \left(\mathbb{E}\!\left[\hat{f}_h(x)\right] - f_X(x)\right)^2 + \mathrm{Var}\!\left(\hat{f}_h(x)\right).$$
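
The decomposition can be illustrated by Monte-Carlo simulation at a fixed point, assuming a known true density (here a standard normal, so $f_X(x_0)$ is available in closed form); all names and parameter values are illustrative:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde_at(x, data, h):
    """Kernel density estimate at a single point x with a Gaussian kernel."""
    return gaussian_kernel((x - data) / h).sum() / (len(data) * h)

rng = np.random.default_rng(2)
x0, n, reps = 1.0, 200, 2000
true_f = np.exp(-0.5 * x0**2) / np.sqrt(2 * np.pi)   # f_X(x0) for the N(0,1) density

for h in (0.05, 0.2, 0.5, 1.0):
    est = np.array([kde_at(x0, rng.standard_normal(n), h) for _ in range(reps)])
    bias_sq = (est.mean() - true_f) ** 2
    var = est.var()
    # MSE = squared bias + variance: small h inflates the variance, large h the bias.
    print(f"h={h:4.2f}  bias^2={bias_sq:.5f}  var={var:.5f}  mse={bias_sq + var:.5f}")
```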

Heuristically, as $h$ increases, the bias of $\hat{f}_h$ increases and its variance decreases. As a consequence, the bandwidth can be optimized in a well-defined, coherent way. Instead of optimizing the mean squared error at a single point $x$, one may want to optimize the integrated mean squared error.

Definition 2.8 (Integrated mean squared error): The integrated mean squared error $\mathrm{IMSE}_\zeta$ of a function $\xi$ w.r.t. a true function $\zeta$ is $$\mathrm{IMSE}_\zeta(\xi) = \int \mathrm{MSE}_{\zeta(x)}\!\left(\xi(x)\right) dx.$$
Definition 2.9 (Mean integrated squared error): The mean integrated squared error $\mathrm{MISE}_\zeta$ of a function $\xi$ w.r.t. a true function $\zeta$ is $$\mathrm{MISE}_\zeta(\xi) = \mathbb{E}\!\left[\int \left(\zeta(x) - \xi(x)\right)^2 dx\right].$$

Since the integrand is non-negative, expectation and integration may be interchanged (Fubini–Tonelli), and hence $\mathrm{IMSE}_{f_X}(\hat{f}_h) = \mathrm{MISE}_{f_X}(\hat{f}_h)$.

It is straightforward to give an expression for the exact bias and variance; more useful, however, are their asymptotic approximations.

Proposition 2.10 (Asymptotic bias and variance): For $h = h_n \to 0$ with $n h_n \to \infty$ as $n \to \infty$, we have $$\mathrm{bias}_{f_X(x)}\!\left(\hat{f}_h(x)\right) = \frac{1}{2} h^2 \, \frac{d^2 f_X(x)}{dx^2} \int z^2 k(z) \, dz + o(h^2) \quad (n \to \infty)$$ and $$\mathrm{Var}\!\left(\hat{f}_h(x)\right) = \frac{1}{nh} \, f_X(x) \int k(z)^2 \, dz + o\!\left(\frac{1}{nh}\right) \quad (n \to \infty).$$
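
As a concrete worked example (not part of the original statement), for the Gaussian kernel $k = \varphi$ one has $\int z^2 \varphi(z) \, dz = 1$ (the variance of a standard normal) and $\int \varphi(z)^2 \, dz = \frac{1}{2\sqrt{\pi}}$, so the leading terms become $$\mathrm{bias}_{f_X(x)}\!\left(\hat{f}_h(x)\right) \approx \frac{h^2}{2} \, \frac{d^2 f_X(x)}{dx^2}, \qquad \mathrm{Var}\!\left(\hat{f}_h(x)\right) \approx \frac{f_X(x)}{2\sqrt{\pi}\, n h}.$$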