As before, we assume that $Y_1,\dots,Y_n$ is an iid sample of an absolutely continuous response $Y$, i.e. $Y_i \overset{iid}{\sim} F_Y(\cdot,\theta^\star)$, that the parametric distribution $F_Y(y,\theta)$ is correctly specified, and that the true parameter $\theta^\star \in \Theta \subseteq \mathbb{R}^p$ is unknown. We also assume that the probability density function $f_Y(y,\theta)$ is at least twice continuously differentiable and concave in $\theta$. Further, as before, we use the density approximation of the log-likelihood, i.e. $\ell_i(\theta, y_i) = \log f_Y(y_i, \theta)$.
4.1 MLE, Score and Fisher Information
Definition 4.1 (Maximum Likelihood Estimator): The solution to the optimization problem
$$\hat\theta = \operatorname*{arg\,max}_{\theta \in \Theta} \ell(\theta, Y)$$
is called the maximum likelihood estimator, or MLE, for $\theta$.
Note: We denote estimators with $\hat{\cdot}$ and specific realizations, i.e. estimates, with $\check{\cdot}$; hence the maximum likelihood estimate is $\check\theta = \operatorname*{arg\,max}_{\theta \in \Theta} \ell(\theta, y)$.
A unique solution to the optimization problem does not necessarily exist, so neither existence nor uniqueness of the maximum likelihood estimator is guaranteed. For distributions of the exponential family, however, the MLE exists and is unique. The following definitions all assume existence and uniqueness of the MLE.
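To make Definition 4.1 concrete, here is a minimal numerical sketch. It is not from the notes: it assumes an exponential model $f_Y(y,\theta) = \theta e^{-\theta y}$ with simulated data, and maximizes the log-likelihood with a generic optimizer.

```python
# Minimal sketch: numerical MLE under an assumed exponential model
# f_Y(y, theta) = theta * exp(-theta * y), so l(theta, y) = n*log(theta) - theta*sum(y).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.exponential(scale=1 / 2.0, size=500)  # simulated sample, true theta* = 2

def neg_log_lik(theta):
    # negative log-likelihood, since scipy minimizes
    return -(len(y) * np.log(theta) - theta * y.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / y.mean())  # numerical MLE vs. the closed-form MLE 1/ybar
```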
Definition 4.2 (Relative log-likelihood function): The relative log-likelihood function is
$$r(\theta, Y) = \ell(\theta, Y) - \ell(\hat\theta, Y)$$
Proposition 4.3 (MLE under change of variables): Let $\theta : \Gamma \to \Theta$ be a bijection. Given the MLE $\hat\gamma = \operatorname*{arg\,max}_{\gamma \in \Gamma} \ell(\theta(\gamma), Y)$, the MLE in the parameter space $\Theta$ is $\hat\theta = \theta(\hat\gamma)$.
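As an illustration of Proposition 4.3, the following sketch reuses the assumed exponential model from above with the bijection $\gamma = \log\theta$, i.e. $\theta(\gamma) = e^\gamma$:

```python
# Sketch: invariance of the MLE under the bijection theta(gamma) = exp(gamma)
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.exponential(scale=1 / 2.0, size=500)  # assumed exponential model, theta* = 2

# log-likelihood in the gamma-parametrization: l(theta(gamma), y) = n*gamma - exp(gamma)*sum(y)
res = minimize_scalar(lambda g: -(len(y) * g - np.exp(g) * y.sum()),
                      bounds=(-10.0, 10.0), method="bounded")
print(np.exp(res.x), 1 / y.mean())  # theta(gamma_hat) agrees with the direct MLE 1/ybar
```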
Properties of the MLE can be studied with the help of the gradient of the log-likelihood, called the score function, and its negative Hessian, called the Fisher information.
Definition 4.4 (Score contribution): The score contribution is the gradient of the log-likelihood contribution with respect to θ, i.e.
$$s_i(\theta) = \nabla_\theta[\ell_i](\theta, Y_i)$$
Note: $s_i(\theta)$ implicitly depends on $Y_i$ and is thus stochastic.
Definition 4.5 (Score function): The score function is the gradient of the log-likelihood function with respect to θ, i.e.
$$s(\theta) = \nabla_\theta[\ell](\theta, Y) = \sum_{i=1}^n s_i(\theta)$$
Note: By definition $s(\hat\theta) = 0$, and thus we can compute the MLE by solving the score equations $s(\theta) = 0$ for $\theta \in \Theta$.
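Sticking with the hypothetical exponential model, where $s(\theta) = \sum_i (1/\theta - Y_i)$, a sketch of this root-finding route to the MLE:

```python
# Sketch: solving the score equation s(theta) = 0 for the assumed exponential
# model, where s(theta) = n/theta - sum(y).
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
y = rng.exponential(scale=1 / 2.0, size=500)  # simulated sample, true theta* = 2

def score(theta):
    return len(y) / theta - y.sum()

theta_hat = brentq(score, 1e-6, 100.0)  # root of the score equation
print(theta_hat, 1 / y.mean())          # agrees with the closed-form MLE
```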
Definition 4.6 (Fisher information contribution): The Fisher information contribution is the negative Hessian of the log-likelihood contribution with respect to θ, i.e.
$$F_i(\theta) = -\operatorname{Hess}_\theta[\ell_i](\theta, Y_i)$$
Note: We have $F_i(\theta) = -\operatorname{Jacob}_\theta[s_i](\theta)$.
Definition 4.7 (Fisher information function): The Fisher information function is the negative Hessian of the log-likelihood function with respect to θ, i.e.
$$F(\theta) = -\operatorname{Hess}_\theta[\ell](\theta, Y) = \sum_{i=1}^n F_i(\theta)$$
Note: We call the Fisher information function evaluated at the MLE, $F(\hat\theta)$, the observed Fisher information.
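A sketch of the observed Fisher information for the assumed exponential model, approximating the negative second derivative at the MLE by a central finite difference (for this model, $F(\theta) = n/\theta^2$ analytically):

```python
# Sketch: observed Fisher information F(theta_hat) via a finite-difference Hessian
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1 / 2.0, size=500)  # assumed exponential model
theta_hat = 1 / y.mean()                       # closed-form MLE

def log_lik(theta):
    return len(y) * np.log(theta) - theta * y.sum()

h = 1e-4  # step size for the central second difference
F_obs = -(log_lik(theta_hat + h) - 2 * log_lik(theta_hat) + log_lik(theta_hat - h)) / h**2
print(F_obs, len(y) / theta_hat**2)  # numerical vs. analytic observed information
```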
Definition 4.8 (Expected Fisher information): The expected Fisher information function is the expectation of the Fisher information contribution, evaluated at the same parameter $\theta$ that parametrizes the distribution $F_Y(\cdot,\theta)$, i.e.
$$I(\theta) = \mathbb{E}_\theta[F_i(\theta)]$$
Note:
- When computing $I(\theta^\star)$ is not possible, one can use $I(\hat\theta)$ or $\frac{1}{n} F(\hat\theta)$ as consistent estimators.
- Under a change of variables $\theta : \Gamma \to \Theta$ we have $I(\theta(\gamma)) = D^{-\top} I(\gamma)\, D^{-1}$, where $D = \operatorname{Jacob}_\gamma[\theta](\gamma)$. This is known as the delta method.
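To illustrate Definition 4.8: for the exponential model $F_i(\theta) = 1/\theta^2$ does not depend on $Y_i$, so this sketch instead assumes a centered normal model $Y \sim \mathcal{N}(0, \theta)$ with variance parameter $\theta$, where (my computation, not from the notes) $F_i(\theta) = Y_i^2/\theta^3 - 1/(2\theta^2)$ and $I(\theta) = 1/(2\theta^2)$:

```python
# Sketch: Monte Carlo check that I(theta) = E_theta[F_i(theta)] for a centered
# normal model Y ~ N(0, theta) with variance parameter theta.
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
y = rng.normal(loc=0.0, scale=np.sqrt(theta), size=100_000)

F_i = y**2 / theta**3 - 1 / (2 * theta**2)  # Fisher information contributions
print(F_i.mean(), 1 / (2 * theta**2))       # approx I(theta) = 1/(2*theta^2)
```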
4.2 Distribution of the MLE
Definition 4.9 (Asymptotic MLE distribution): The MLE is asymptotically normally distributed with
$$\hat\theta \overset{a}{\sim} \mathcal{N}\!\left(\theta^\star,\; \tfrac{1}{n}\, I(\theta^\star)^{-1}\right)$$
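A simulation sketch of this asymptotic normality under the assumed exponential model, where $I(\theta^\star)^{-1}/n = \theta^{\star 2}/n$:

```python
# Sketch: sampling distribution of the MLE theta_hat = 1/ybar under the
# assumed exponential model, compared to the asymptotic mean and variance.
import numpy as np

rng = np.random.default_rng(0)
theta_star, n = 2.0, 500
theta_hats = np.array([1 / rng.exponential(scale=1 / theta_star, size=n).mean()
                       for _ in range(10_000)])
print(theta_hats.mean(), theta_star)        # approx theta*
print(theta_hats.var(), theta_star**2 / n)  # approx I(theta*)^{-1} / n
```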
With that we can compute a confidence interval for the true parameter called the Wald interval.
Definition 4.10 (Wald interval): The Wald interval is the $(1-\alpha)$-confidence interval for the true parameter $\theta_j^\star$ given by
$$\hat\theta_j \pm \Phi^{-1}\!\left(1 - \tfrac{\alpha}{2}\right) \sqrt{\tfrac{1}{n} \left[I(\hat\theta)^{-1}\right]_{[j,j]}}$$
Note: As $\theta^\star$ is unknown, we use $I(\hat\theta)$ as an estimator for $I(\theta^\star)$.
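A sketch of the Wald interval under the assumed exponential model, where $I(\hat\theta)^{-1} = \hat\theta^2$ and hence the estimated standard error is $\hat\theta/\sqrt{n}$:

```python
# Sketch: 95% Wald interval for the rate of the assumed exponential model
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_star, n, alpha = 2.0, 500, 0.05
y = rng.exponential(scale=1 / theta_star, size=n)

theta_hat = 1 / y.mean()     # MLE
se = theta_hat / np.sqrt(n)  # sqrt(I(theta_hat)^{-1} / n)
z = norm.ppf(1 - alpha / 2)  # Phi^{-1}(1 - alpha/2)
print(theta_hat - z * se, theta_hat + z * se)
```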
4.3 Distribution of the Score
Proposition 4.11 (Mean score contributions): The expectation of the score contribution, evaluated at the same parameter $\theta$ that parametrizes the distribution $F_Y(\cdot,\theta)$, is zero, i.e.
$$\mathbb{E}_\theta[s_i(\theta)] = 0$$
Proof: We start with $1 = \int_{\mathbb{R}} f_Y(y,\theta)\,dy$ and take the gradient on both sides, exchanging differentiation and integration (justified under regularity conditions):
$$0 = \nabla_\vartheta\!\left[\vartheta \mapsto \int_{\mathbb{R}} f_Y(y,\vartheta)\,dy\right]\!(\theta) = \int_{\mathbb{R}} \nabla_\theta[f_Y](y,\theta)\,dy = \int_{\mathbb{R}} \frac{\nabla_\theta[f_Y](y,\theta)}{f_Y(y,\theta)}\, f_Y(y,\theta)\,dy = \int_{\mathbb{R}} \nabla_\theta[\log f_Y](y,\theta)\, f_Y(y,\theta)\,dy = \int_{\mathbb{R}} \nabla_\theta[\ell_i](\theta,y)\, f_Y(y,\theta)\,dy = \mathbb{E}_\theta[s_i(\theta)]$$
Proposition 4.12 (Variance score contributions): The covariance matrix of the score contribution, evaluated at the same parameter $\theta$ that parametrizes the distribution $F_Y(\cdot,\theta)$, is the expected Fisher information, i.e.
$$\operatorname{Cov}_\theta[s_i(\theta), s_i(\theta)] = I(\theta)$$
Note: In particular $\mathbb{E}_{\theta^\star}[s_i(\theta^\star)] = 0$ and $\operatorname{Cov}_{\theta^\star}[s_i(\theta^\star), s_i(\theta^\star)] = I(\theta^\star)$.
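A Monte Carlo sketch of Propositions 4.11 and 4.12 for the assumed exponential model, where $s_i(\theta) = 1/\theta - Y_i$ and $I(\theta) = 1/\theta^2$:

```python
# Sketch: mean and variance of the score contributions at theta* for the
# assumed exponential model.
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0
y = rng.exponential(scale=1 / theta_star, size=100_000)

s = 1 / theta_star - y             # score contributions s_i(theta*)
print(s.mean())                    # approx 0          (Proposition 4.11)
print(s.var(), 1 / theta_star**2)  # approx I(theta*)  (Proposition 4.12)
```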
Definition 4.13 (Asymptotic score distribution): The score function evaluated at the true parameter, $s(\theta^\star)$, is asymptotically normally distributed with
$$s(\theta^\star) \overset{a}{\sim} \mathcal{N}\!\left(0,\; n\, I(\theta^\star)\right)$$
With that we can compute a confidence region for the true parameter called the Wilson region.
Definition 4.14 (Wilson region): The Wilson region is the $(1-\alpha)$-confidence region for the true parameter $\theta^\star$ given by
$$\left\{\theta \in \Theta \,:\, \tfrac{1}{n}\, s(\theta)^\top I(\theta)^{-1} s(\theta) \le F^{-1}_{\chi^2_p}(1-\alpha)\right\}$$
where $F^{-1}_{\chi^2_p}$ denotes the quantile function of the $\chi^2_p$ distribution.
Note: Unlike the Wald interval, the Wilson region reflects the asymmetry of the likelihood function and respects the boundary of the parameter space $\Theta$.
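A sketch of the Wilson region for the assumed exponential model ($p = 1$), scanning a parameter grid for the points where the score statistic stays below the $\chi^2_1$ cutoff:

```python
# Sketch: Wilson (score) region for the assumed exponential model via a grid scan.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
theta_star, n, alpha = 2.0, 500, 0.05
y = rng.exponential(scale=1 / theta_star, size=n)

grid = np.linspace(0.01, 5.0, 2000)
score = n / grid - y.sum()         # s(theta) = n/theta - sum(y)
stat = score**2 * grid**2 / n      # (1/n) s(theta)^2 / I(theta), with I(theta) = 1/theta^2
inside = grid[stat <= chi2.ppf(1 - alpha, df=1)]
print(inside.min(), inside.max())  # endpoints of the Wilson region
```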
4.4 Distribution of the Relative Log-Likelihood
Definition 4.15 (Asymptotic relative log-likelihood distribution): The relative log-likelihood is asymptotically distributed with
$$-2\, r(\theta^\star, Y) \overset{a}{\sim} \chi^2_p$$
With that we can compute a confidence region for the true parameter called the likelihood-ratio region.
Definition 4.16 (Likelihood-ratio region): The likelihood-ratio region is the $(1-\alpha)$-confidence region for the true parameter $\theta^\star$ given by
$$\left\{\theta \in \Theta \,:\, -2\, r(\theta, Y) \le F^{-1}_{\chi^2_p}(1-\alpha)\right\}$$
Note: The advantage of the likelihood-ratio region, besides its simple computation, is that it is invariant with respect to the parametrization: let $\theta : \Gamma \to \Theta$ be a change of parameters and $C_\Gamma$ the likelihood-ratio region in $\Gamma$; then $C_\Theta = \theta(C_\Gamma)$ is the likelihood-ratio region in $\Theta$.
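Finally, a sketch of the likelihood-ratio region for the assumed exponential model, again by grid scan:

```python
# Sketch: likelihood-ratio region for the assumed exponential model via a grid scan.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
theta_star, n, alpha = 2.0, 500, 0.05
y = rng.exponential(scale=1 / theta_star, size=n)
theta_hat = 1 / y.mean()

def log_lik(theta):
    return n * np.log(theta) - theta * y.sum()

grid = np.linspace(0.01, 5.0, 2000)
r = log_lik(grid) - log_lik(theta_hat)  # relative log-likelihood r(theta, y)
inside = grid[-2 * r <= chi2.ppf(1 - alpha, df=1)]
print(inside.min(), inside.max())       # endpoints of the likelihood-ratio region
```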