Assumption 2.1 (iid assumption for univariate data): We assume the observations are iid samples from the distribution of a univariate real random variable $X$, i.e.
$$X_1, \ldots, X_n \overset{\text{iid}}{\sim} F_X$$
where $F_X$ denotes the unknown cdf of $X$.
The goal is to estimate the distribution $F_X$. In particular, we are interested in estimating the density $f_X = \frac{dF_X}{dx}$, assuming that it exists. Instead of assuming a parametric model for the distribution, e.g. $\mathcal{N}(\mu, \sigma^2)$ with unknown $\mu$ and $\sigma$, we want to be as general as possible. That is, we only assume that the density exists and is suitably smooth, e.g. differentiable. It is then possible to estimate the unknown density function $f_X$. Mathematically, a function is an infinite-dimensional object. Density estimation thus serves as a basic principle for estimating infinite-dimensional objects. We will make use of this principle in many other settings, such as nonparametric regression with one predictor variable and flexible regression and classification methods with many predictor variables.
2.1 Estimation of a Density
2.1.1 Histogram
The histogram is the oldest and most popular density estimator.
Definition 2.2 (Histogram): Given an origin $x_0 \in \mathbb{R}$ and class width $h \in \mathbb{R}_+$ specifying the $k$ intervals
$$I_j = (x_0 + jh,\; x_0 + (j+1)h]$$
for $j \in \{0, \ldots, k-1\}$, the histogram plots the function $\mathrm{count}: I_j \mapsto \#\{i \mid X_i \in I_j\}$.
Note: A histogram can also plot the relative frequency $\frac{\mathrm{count}(I_j)}{n}$ instead.
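As a small illustration, the following Python sketch computes the histogram counts and relative frequencies for a given origin $x_0$ and class width $h$ (the data, origin, and class width are arbitrary example choices, not prescribed by the definition):

```python
import numpy as np

def histogram_counts(x, x0, h, k):
    """Count observations in the intervals I_j = (x0 + j*h, x0 + (j+1)*h], j = 0, ..., k-1."""
    counts = np.zeros(k, dtype=int)
    for j in range(k):
        lower, upper = x0 + j * h, x0 + (j + 1) * h
        counts[j] = np.sum((x > lower) & (x <= upper))
    return counts

rng = np.random.default_rng(0)              # arbitrary example data
x = rng.normal(size=100)
counts = histogram_counts(x, x0=-4.0, h=0.5, k=16)
rel_freq = counts / len(x)                  # relative frequencies count(I_j) / n
```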
2.1.2 Kernel Estimator
Similar to the histogram, we can compute the relative frequency of observations falling into a small region. The density $f_X$ at the point $x$ can be represented as
$$f_X(x) = \lim_{h \to 0} \frac{1}{2h} P(x - h < X \le x + h)$$
Definition 2.3 (Naive density estimator): The naive density estimator is
$$\hat{f}(x) = \frac{1}{2hn} \#\{i \mid X_i \in (x-h, x+h]\}$$
for some $h > 0$.
This naive estimator is only piecewise constant. As for histograms, we also need to specify the bandwidth $h$, but in contrast to the histogram, we do not need to specify an origin $x_0$. An alternative representation of the naive estimator is as follows. Define the weight function
$$w(x) = \begin{cases} \frac{1}{2} & \text{if } |x| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Then
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} w\left(\frac{x - X_i}{h}\right)$$
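As a quick numerical check of the equivalence of the counting form and the weight-function form, here is a hedged Python sketch (sample data and bandwidth are arbitrary example values):

```python
import numpy as np

def naive_count(x, data, h):
    """Counting form: #{i : X_i in (x-h, x+h]} / (2hn)."""
    return np.sum((data > x - h) & (data <= x + h)) / (2 * h * len(data))

def naive_weight(x, data, h):
    """Weight-function form: (1/(nh)) * sum_i w((x - X_i)/h) with w = 1/2 on [-1, 1]."""
    u = (x - data) / h
    return np.sum(np.where(np.abs(u) <= 1, 0.5, 0.0)) / (len(data) * h)

rng = np.random.default_rng(1)
data = rng.normal(size=50)
x, h = 0.3, 0.4
print(naive_count(x, data, h), naive_weight(x, data, h))  # agree (up to ties at interval boundaries)
```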
If, instead of the rectangular weight function $w$, we choose a general, typically smooth kernel function $k$, we arrive at the definition of the kernel density estimator.
Definition 2.4 (Kernel function): A function $k: \mathbb{R} \to \mathbb{R}_+$ satisfying
$$\int_{\mathbb{R}} k(x)\, dx = 1$$
and symmetry around 0, i.e. $k(x) = k(-x)$, is called a kernel.
Definition 2.5 (Kernel density estimator): The kernel density estimator is
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{x - X_i}{h}\right)$$
for some $h > 0$.
Note: The estimator depends on the bandwidth $h$, which acts as a tuning parameter. For large bandwidths, the estimate $\hat{f}_h(x)$ tends to be very slowly varying as a function of $x$, while small bandwidths will produce a more wiggly function estimate.
The smoothness of $\hat{f}_h$ is inherited from the smoothness of the kernel $k$: if the $r$-th derivative $\frac{d^r}{dx^r} k$ exists, then $\frac{d^r}{dx^r} \hat{f}_h$ exists.
Example (Gaussian kernel): $k(x) = \varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Example (Finite support kernel): $k(x) = \frac{\pi}{4} \cos\left(\frac{\pi}{2} x\right) \mathbf{1}_{\{|x| \le 1\}}$
Example (Epanechnikov kernel): $k(x) = \frac{3}{4} (1 - |x|^2) \mathbf{1}_{\{|x| \le 1\}}$
Note: The Epanechnikov kernel is optimal with respect to the mean squared error.
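A minimal Python sketch of the kernel density estimator, assuming the Gaussian and Epanechnikov kernels above (data, bandwidth, and grid are arbitrary example values; any kernel satisfying Definition 2.4 could be plugged in):

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel k(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def epanechnikov_kernel(u):
    """Epanechnikov kernel k(u) = 3/4 * (1 - u^2) on |u| <= 1."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimator: f_hat_h(x) = (1/(nh)) * sum_i k((x - X_i)/h)."""
    x = np.atleast_1d(x)
    u = (x[:, None] - data[None, :]) / h            # (len(x), n) matrix of scaled differences
    return kernel(u).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(2)
data = rng.normal(size=200)
grid = np.linspace(-4, 4, 201)
f_gauss = kde(grid, data, h=0.3)                             # smooth estimate
f_epan = kde(grid, data, h=0.3, kernel=epanechnikov_kernel)  # finite-support kernel
```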
2.2 The Role of the Bandwidth
The bandwidth $h$ is often called the smoothing parameter. As $h \to 0$, we will have “$\delta$-spikes” at every observation $X_i$, whereas as $h$ increases the estimate $\hat{f}_h$ becomes smoother.
2.2.1 Variable Bandwidths
Instead of using a global bandwidth, we can use locally changing bandwidths $h(x)$. The general idea is to use a large bandwidth in regions where the data is sparse.
Definition 2.6 (kNN bandwidth): We define the bandwidth
$$h(x) = |x - X_k(x)|$$
where $X_k(x)$ is the $k$-th nearest neighbour to $x$.
Note: Generally $\hat{f}_{h(x)}$ will not be a density anymore since the integral $\int_{\mathbb{R}} \hat{f}_{h(x)}(x)\, dx$ is not necessarily equal to one.
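A hedged sketch of the estimator with the kNN bandwidth, reusing a Gaussian kernel ($k$, the data, and the grid are arbitrary example choices):

```python
import numpy as np

def knn_bandwidth(x, data, k):
    """h(x) = distance from x to its k-th nearest observation."""
    return np.sort(np.abs(data - x))[k - 1]

def kde_knn(x, data, k):
    """Kernel density estimate with the locally varying bandwidth h(x)."""
    h = knn_bandwidth(x, data, k)
    u = (x - data) / h
    gauss = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return gauss.sum() / (len(data) * h)

rng = np.random.default_rng(3)
data = rng.normal(size=200)
grid = np.linspace(-4, 4, 201)
f_hat = np.array([kde_knn(x, data, k=20) for x in grid])  # larger h(x) where data is sparse
```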
2.2.2 Bias-Variance Trade-Off
We can formalize the behaviour of $\hat{f}_h$ when varying the bandwidth $h$ in terms of the bias and variance of the estimator.
Proposition 2.7 (Bias-Variance decomposition of kernel estimators): For any point $x \in \mathbb{R}$, the mean squared error $\mathrm{mse}_{f_X(x)}(\hat{f}_h(x))$ can be decomposed as
$$\mathrm{mse}_{f_X(x)}(\hat{f}_h(x)) = \mathbb{E}\left[\left(\hat{f}_h(x) - f_X(x)\right)^2\right] = \left(\mathbb{E}[\hat{f}_h(x)] - f_X(x)\right)^2 + \mathrm{Var}[\hat{f}_h(x)]$$
Heuristically, as $h$ increases, the bias of $\hat{f}_h$ increases and the variance decreases. As a consequence, this allows us to optimize the bandwidth parameter in a well-defined, coherent way. Instead of optimizing the mean squared error at a point $x$, one may want to optimize the integrated mean squared error.
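To illustrate the trade-off numerically, the following hedged Monte Carlo sketch estimates the squared bias and variance of $\hat{f}_h(x)$ at a single point for several bandwidths, assuming a standard normal truth so that $f_X(x)$ is known (sample size, point, bandwidths, and number of repetitions are arbitrary example settings):

```python
import numpy as np

def kde_at_point(x, data, h):
    """Gaussian-kernel density estimate at a single point x."""
    u = (x - data) / h
    return (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).sum() / (len(data) * h)

rng = np.random.default_rng(4)
x0, n, n_rep = 1.0, 100, 2000
true_f = np.exp(-0.5 * x0**2) / np.sqrt(2 * np.pi)       # f_X(x0) under the standard normal truth

for h in [0.05, 0.2, 0.5, 1.0]:
    estimates = np.array([kde_at_point(x0, rng.normal(size=n), h) for _ in range(n_rep)])
    bias2 = (estimates.mean() - true_f) ** 2             # squared bias tends to grow with h
    var = estimates.var()                                 # variance tends to shrink with h
    print(f"h={h:.2f}  bias^2={bias2:.5f}  var={var:.5f}  mse={bias2 + var:.5f}")
```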
Definition 2.8 (Integrated mean squared error): The integrated mean squared error $\mathrm{imse}_\zeta$ of a function $\xi$ w.r.t. a true function $\zeta$ is
$$\mathrm{imse}_\zeta(\xi) = \int \mathrm{mse}_{\zeta(x)}(\xi(x))\, dx$$
Definition 2.9 (Mean integrated squared error): The mean integrated squared error $\mathrm{mise}_\zeta$ of a function $\xi$ w.r.t. a true function $\zeta$ is
$$\mathrm{mise}_\zeta(\xi) = \mathbb{E}\left[\int (\zeta(x) - \xi(x))^2\, dx\right]$$
Since the integrand is non-negative, expectation and integration can be interchanged (Tonelli's theorem), so for kernel estimators we have $\mathrm{imse}_{f_X}(\hat{f}_h) = \mathrm{mise}_{f_X}(\hat{f}_h)$.
It is straightforward to give an expression for the exact bias and variance.
Proposition 2.10 (Asymptotic bias and variance): For $h_n \to 0$ and $n h_n \to \infty$ as $n \to \infty$, we have
$$\mathrm{bias}_{f_X(x)}(\hat{f}_h(x)) = \frac{1}{2} h^2 \frac{d^2}{dx^2} f_X(x) \int z^2 k(z)\, dz + o(h^2) \qquad (n \to \infty)$$
and
$$\mathrm{Var}[\hat{f}_h(x)] = (nh)^{-1} f_X(x) \int k(z)^2\, dz + o\left(\frac{1}{nh}\right) \qquad (n \to \infty)$$
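Note: As a sketch of a consequence (not part of the proposition itself), plugging these expressions into the mean squared error and minimizing over $h$ balances the squared-bias term of order $h^4$ against the variance term of order $(nh)^{-1}$:
$$\mathrm{mse}_{f_X(x)}(\hat{f}_h(x)) \approx \frac{h^4}{4} \left(\frac{d^2}{dx^2} f_X(x)\right)^2 \left(\int z^2 k(z)\, dz\right)^2 + \frac{f_X(x)}{nh} \int k(z)^2\, dz,$$
which yields a locally optimal bandwidth of order $h \propto n^{-1/5}$.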