2. Nonparametric Density Estimation

Assumption 2.1 (i.i.d. assumption for univariate data): We assume the observations are i.i.d. samples from the distribution of a univariate real random variable $X$, i.e. $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} F_X$, where $F_X$ denotes the unknown cdf of $X$.

The goal is to estimate the distribution $F_X$. In particular, we are interested in estimating the density $f_X = \frac{dF_X}{dx}$, assuming that it exists. Instead of assuming a parametric model for the distribution, e.g. $\mathcal{N}(\mu, \sigma^2)$ with unknown $\mu$ and $\sigma$, we want to be as general as possible: we only assume that the density exists and is suitably smooth, e.g. differentiable. It is then possible to estimate the unknown density function $f_X$. Mathematically, a function is an infinite-dimensional object. Density estimation thus serves as a basic principle for estimating infinite-dimensional objects, a principle we will reuse in many other settings such as nonparametric regression with one predictor variable and flexible regression and classification methods with many predictor variables.

2.1 Estimation of a Density

2.1.1 Histogram

The histogram is the oldest and most popular density estimator.

Definition 2.2 (Histogram): Given an origin $x_0 \in \mathbb{R}$ and class width $h \in \mathbb{R}_+$ specifying the $k$ intervals $\mathcal{I}_j = (x_0 + jh,\, x_0 + (j+1)h]$ for $j \in \{0, \ldots, k-1\}$, the histogram plots the function $\mathrm{count}: \mathcal{I}_j \mapsto \#\{i \mid X_i \in \mathcal{I}_j\}$.
Note: A histogram can also plot the relative frequency $\frac{\mathrm{count}(\mathcal{I}_j)}{n}$ instead.
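
To make the definition concrete, here is a minimal Python sketch of the interval counts; the function name `hist_counts` and the simulated data are illustrative, not part of the notes:

```python
import numpy as np

def hist_counts(x, x0, h, k):
    """Counts for the intervals I_j = (x0 + j*h, x0 + (j+1)*h], j = 0, ..., k-1."""
    x = np.asarray(x, dtype=float)
    counts = np.empty(k, dtype=int)
    for j in range(k):
        lo, hi = x0 + j * h, x0 + (j + 1) * h
        counts[j] = np.sum((x > lo) & (x <= hi))
    return counts

# Relative frequencies for simulated data (illustrative):
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
rel_freq = hist_counts(x, x0=-4.0, h=0.5, k=16) / len(x)
```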

2.1.2 Kernel Estimator

Similar to the histogram, we can compute the relative frequency of observations falling into a small region. The density $f_X$ at the point $x$ can be represented as $$f_X(x) = \lim_{h \to 0} \frac{1}{2h} \, \mathbb{P}\!\left(x - h < X \leq x + h\right).$$

Definition 2.3 (Naive density estimator): The naive density estimator is $$\hat{f}(x) = \frac{1}{2hn} \, \#\{i \mid X_i \in (x - h, x + h]\}$$ for some $h > 0$.

This naive estimator is only piecewise constant. As for histograms, we need to specify a bandwidth $h$, but in contrast to the histogram, we do not need to specify an origin $x_0$. An alternative representation of the naive estimator is as follows. Define the weight function $$w(x) = \begin{cases} \frac{1}{2} & \text{if } |x| \leq 1, \\ 0 & \text{otherwise.} \end{cases}$$ Then $$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n w\!\left(\frac{x - X_i}{h}\right).$$ If, instead of the rectangular weight function $w$, we choose a general, typically smooth kernel function $k$, we arrive at the definition of the kernel density estimator.
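
A minimal Python sketch of this weight-function representation; the helper names `w` and `naive_estimator` are illustrative:

```python
import numpy as np

def w(u):
    """Rectangular weight function: 1/2 on [-1, 1], 0 otherwise."""
    return np.where(np.abs(u) <= 1, 0.5, 0.0)

def naive_estimator(x, data, h):
    """f_hat(x) = (1 / (n*h)) * sum_i w((x - X_i) / h), vectorised over evaluation points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h   # shape (len(x), n)
    return w(u).sum(axis=1) / (len(data) * h)
```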

Definition 2.4 (Kernel function): A function $k: \mathbb{R} \to \mathbb{R}_+$ satisfying $\int_{\mathbb{R}} k(x) \, dx = 1$ and symmetry around $0$, i.e. $k(x) = k(-x)$, is called a kernel.
Definition 2.5 (Kernel density estimator): The kernel density estimator is $$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n k\!\left(\frac{x - X_i}{h}\right)$$ for some $h > 0$.
Note: The estimator depends on the bandwidth $h$, which acts as a tuning parameter. For large bandwidths, the estimate $\hat{f}_h(x)$ tends to be very slowly varying as a function of $x$, while small bandwidths produce a more wiggly function estimate.

The smoothness of $\hat{f}_h$ is inherited from the smoothness of the kernel $k$: if the $r$-th derivative $\frac{d^r k(x)}{dx^r}$ exists, then $\frac{d^r \hat{f}_h(x)}{dx^r}$ exists as well.

Example (Gaussian kernel): $k(x) = \varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Example (Finite support kernel): $k(x) = \frac{\pi}{4} \cos\!\left(\frac{\pi}{2} x\right) \mathbf{1}_{\{|x| \leq 1\}}$
Example (Epanechnikov kernel): $k(x) = \frac{3}{4} \left(1 - x^2\right) \mathbf{1}_{\{|x| \leq 1\}}$
Note: The Epanechnikov kernel is asymptotically optimal with respect to the mean integrated squared error.
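
A minimal Python sketch of the kernel density estimator with the Gaussian and Epanechnikov kernels; all function names are illustrative:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: (1 / sqrt(2*pi)) * exp(-u^2 / 2)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u):
    """(3/4) * (1 - u^2) on [-1, 1], 0 otherwise."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kde(x, data, h, kernel=gaussian_kernel):
    """f_hat_h(x) = (1 / (n*h)) * sum_i k((x - X_i) / h)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h
    return kernel(u).sum(axis=1) / (len(data) * h)

# A large bandwidth gives a slowly varying estimate, a small one a wiggly estimate:
rng = np.random.default_rng(1)
data = rng.standard_normal(100)
grid = np.linspace(-4, 4, 201)
smooth = kde(grid, data, h=1.0)
wiggly = kde(grid, data, h=0.05)
```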

2.2 The Role of the Bandwidth

The bandwidth $h$ is often called the smoothing parameter. As $h \to 0$, we will have “$\delta$-spikes” at every observation $X_i$, whereas as $h$ increases the estimate $\hat{f}_h$ becomes smoother.

2.2.1 Variable Bandwidths

Instead of using a global bandwidth, we can use locally varying bandwidths $h(x)$. The general idea is to use a large bandwidth in regions where the data is sparse.

Definition 2.6 (kNN bandwidth): We define the bandwidth $$h(x) = \left|x - X_{(\mathrm{k})}(x)\right|$$ where $X_{(\mathrm{k})}(x)$ is the $\mathrm{k}$-th nearest neighbour of $x$ among $X_1, \ldots, X_n$.
Note: Generally $\hat{f}_{h(x)}$ will no longer be a density, since the integral $\int_{\mathbb{R}} \hat{f}_{h(x)}(x) \, dx$ is not necessarily equal to one.
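
A minimal Python sketch of the kernel estimator with this kNN-based local bandwidth; names are illustrative, and a Gaussian kernel is used for concreteness:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def knn_bandwidth(x, data, k):
    """h(x) = |x - X_(k)(x)|, the distance from x to its k-th nearest observation."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    dists = np.sort(np.abs(x[:, None] - data[None, :]), axis=1)
    return dists[:, k - 1]

def kde_knn(x, data, k, kernel=gaussian_kernel):
    """Kernel estimate with locally varying bandwidth h(x); not guaranteed to integrate to one."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    h = knn_bandwidth(x, data, k)                 # one bandwidth per evaluation point
    u = (x[:, None] - data[None, :]) / h[:, None]
    return kernel(u).sum(axis=1) / (len(data) * h)
```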

2.2.2 Bias-Variance Trade-Off

We can formalize the behaviour of $\hat{f}_h$ when varying the bandwidth $h$ in terms of the bias and variance of the estimator.

Proposition 2.7 (Bias-variance decomposition of kernel estimators): For any point $x \in \mathbb{R}$, the mean squared error $\mathrm{MSE}_{f_X(x)}\!\left(\hat{f}_h(x)\right)$ can be decomposed as $$\mathrm{MSE}_{f_X(x)}\!\left(\hat{f}_h(x)\right) = \mathbb{E}\!\left[\left(\hat{f}_h(x) - f_X(x)\right)^2\right] = \left(\mathbb{E}\!\left[\hat{f}_h(x)\right] - f_X(x)\right)^2 + \mathrm{Var}\!\left(\hat{f}_h(x)\right).$$
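
The decomposition can be illustrated by Monte-Carlo simulation at a fixed point, assuming a known true density (here a standard normal, so $f_X(x_0)$ is available in closed form); all names and parameter values are illustrative:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde_at(x, data, h):
    """Kernel density estimate at a single point x with a Gaussian kernel."""
    return gaussian_kernel((x - data) / h).sum() / (len(data) * h)

rng = np.random.default_rng(2)
x0, n, reps = 1.0, 200, 2000
true_f = np.exp(-0.5 * x0**2) / np.sqrt(2 * np.pi)   # f_X(x0) for the N(0,1) density

for h in (0.05, 0.2, 0.5, 1.0):
    est = np.array([kde_at(x0, rng.standard_normal(n), h) for _ in range(reps)])
    bias_sq = (est.mean() - true_f) ** 2
    var = est.var()
    # MSE = squared bias + variance: small h inflates the variance, large h the bias.
    print(f"h={h:4.2f}  bias^2={bias_sq:.5f}  var={var:.5f}  mse={bias_sq + var:.5f}")
```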

Heuristically, as $h$ increases, the bias of $\hat{f}_h$ increases and its variance decreases. As a consequence, the bandwidth can be optimized in a well-defined, coherent way. Instead of optimizing the mean squared error at a single point $x$, one may want to optimize the integrated mean squared error.

Definition 2.8 (Integrated mean squared error): The integrated mean squared error $\mathrm{IMSE}_\zeta$ of a function $\xi$ w.r.t. a true function $\zeta$ is $$\mathrm{IMSE}_\zeta(\xi) = \int \mathrm{MSE}_{\zeta(x)}\!\left(\xi(x)\right) dx.$$
Definition 2.9 (Mean integrated squared error): The mean integrated squared error $\mathrm{MISE}_\zeta$ of a function $\xi$ w.r.t. a true function $\zeta$ is $$\mathrm{MISE}_\zeta(\xi) = \mathbb{E}\!\left[\int \left(\zeta(x) - \xi(x)\right)^2 dx\right].$$

Since the integrand is non-negative, expectation and integration may be interchanged (Fubini–Tonelli), and hence $\mathrm{IMSE}_{f_X}(\hat{f}_h) = \mathrm{MISE}_{f_X}(\hat{f}_h)$.

It is straightforward to give an expression for the exact bias and variance; more useful, however, are their asymptotic approximations.

Proposition 2.10 (Asymptotic bias and variance): For $h = h_n \to 0$ with $n h_n \to \infty$ as $n \to \infty$, we have $$\mathrm{bias}_{f_X(x)}\!\left(\hat{f}_h(x)\right) = \frac{1}{2} h^2 \, \frac{d^2 f_X(x)}{dx^2} \int z^2 k(z) \, dz + o(h^2) \quad (n \to \infty)$$ and $$\mathrm{Var}\!\left(\hat{f}_h(x)\right) = \frac{1}{nh} \, f_X(x) \int k(z)^2 \, dz + o\!\left(\frac{1}{nh}\right) \quad (n \to \infty).$$
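
As a concrete worked example (not part of the original statement), for the Gaussian kernel $k = \varphi$ one has $\int z^2 \varphi(z) \, dz = 1$ (the variance of a standard normal) and $\int \varphi(z)^2 \, dz = \frac{1}{2\sqrt{\pi}}$, so the leading terms become $$\mathrm{bias}_{f_X(x)}\!\left(\hat{f}_h(x)\right) \approx \frac{h^2}{2} \, \frac{d^2 f_X(x)}{dx^2}, \qquad \mathrm{Var}\!\left(\hat{f}_h(x)\right) \approx \frac{f_X(x)}{2\sqrt{\pi}\, n h}.$$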