4. Maximum Likelihood Inference

As before, we assume that $Y_1, \ldots, Y_n$ is an $\iid$ sample of an absolutely continuous response $Y$, i.e. $Y_i \simiid F_{Y}(\cdot, \vthetastar)$, that the parametric distribution $F_Y(y, \vtheta)$ is correct and that the true parameters $\vthetastar \in \Theta \subset \R^p$ are unknown. We also assume that the probability density function $f_{Y}(y, \vtheta)$ is at least twice continuously $\vtheta$-differentiable and $\vtheta$-concave. Further, we make the density approximation of the log-likelihood, i.e. each log-likelihood contribution is $\ell_i(\vtheta, Y_i) = \log f_Y(Y_i, \vtheta)$.

4.1 MLE, Score and Fisher Information

Definition 4.1 (Maximum Likelihood Estimator): The solution to the optimization problem $$\vthetahat = \argmax_{\vtheta \in \Theta} \ell(\vtheta, \vY)$$ is called the maximum likelihood estimator or MLE for $\vtheta$.
Note: We denote estimators with $\hat{\cdot}$ and specific realizations, i.e. estimates, with $\check{\cdot}$; hence the maximum likelihood estimate is $\vthetacheck = \argmax_{\vtheta \in \Theta} \ell(\vtheta, \rvec y)$.

A unique solution to the optimization problem does not necessarily exist, and thus the existence and uniqueness of the maximum likelihood estimator is not guaranteed. For distributions of the exponential family, however, the MLE exists and is unique. The following definitions all assume the existence and uniqueness of the MLE.

Definition 4.2 (Relative log-likelihood function): The relative log-likelihood function is $$r(\vtheta, \vY) = \ell(\vtheta, \vY) - \ell(\vthetahat, \vY)$$
Proposition 4.3 (MLE under change of variables): Let $\vtheta : \Gamma \to \Theta$ be a bijection. Given the MLE $\vgammahat = \argmax_{\vgamma \in \Gamma} \ell(\vtheta(\vgamma), \vY)$, the MLE in the parameter space $\Theta$ is $\vthetahat = \vtheta(\vgammahat)$.
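As a hypothetical worked example (not from the notes), consider an exponential sample $Y_i \simiid \mathrm{Exp}(\lambda)$ with log-likelihood $\ell(\lambda, \vY) = n \log \lambda - \lambda \sum_i Y_i$ and closed-form MLE $\hat\lambda = 1/\bar Y$. The sketch below finds the MLE numerically and checks the invariance of Proposition 4.3 under the reparametrization $\gamma = \log \lambda$:

```python
# Hypothetical example: MLE for the rate of an exponential sample,
# where l(lambda, y) = n log(lambda) - lambda * sum(y) and the
# closed-form MLE is lambda_hat = 1 / mean(y).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.exponential(scale=1 / 2.0, size=500)  # sample with true rate lambda* = 2

def neg_loglik(lam):
    # negative log-likelihood -l(lambda, y)
    return -(len(y) * np.log(lam) - lam * y.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
lam_hat = res.x  # numerical optimum, matches 1 / y.mean()

# Proposition 4.3: maximizing over gamma = log(lambda) and mapping
# back through lambda = exp(gamma) yields the same MLE.
res_g = minimize_scalar(lambda g: neg_loglik(np.exp(g)),
                        bounds=(-5.0, 5.0), method="bounded")
lam_hat_via_gamma = np.exp(res_g.x)
```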

Properties of the MLE can be studied with the help of the gradient of the log-likelihood, called the score function, and its negative Hessian, called the Fisher information.

Definition 4.4 (Score contribution): The score contribution is the gradient of the log-likelihood contribution with respect to $\vtheta$, i.e. $$\rvec s_i(\vtheta) = \gradwrt{\vtheta}{\ell_i}(\vtheta, Y_i)$$
Note: $\rvec s_i(\vtheta)$ implicitly depends on $Y_i$ and is thus stochastic.
Definition 4.5 (Score function): The score function is the gradient of the log-likelihood function with respect to $\vtheta$, i.e. $$\rvec s(\vtheta) = \gradwrt{\vtheta}{\ell}(\vtheta, \vY) = \sum_{i=1}^n \rvec s_i(\vtheta)$$
Note: By definition $\rvec s(\vthetahat) = \dvec 0$, and thus we can compute the MLE by solving the score equations $\rvec s(\vtheta) = \dvec 0$ for $\vtheta \in \Theta$.
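Continuing the hypothetical exponential example, the score equation $s(\lambda) = n/\lambda - \sum_i Y_i = 0$ can be solved by one-dimensional root finding; a minimal sketch:

```python
# Solving the score equation s(lambda) = 0 numerically for a
# hypothetical exponential model: s(lambda) = n / lambda - sum(y).
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
y = rng.exponential(scale=1 / 2.0, size=1000)  # true rate 2

def score(lam):
    return len(y) / lam - y.sum()

# score is positive near 0 and negative for large lambda, so the
# bracket [1e-6, 50] contains the unique root, the MLE 1 / mean(y).
lam_hat = brentq(score, 1e-6, 50.0)
```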
Definition 4.6 (Fisher information contribution): The Fisher information contribution is the negative Hessian of the log-likelihood contribution with respect to $\vtheta$, i.e. $$\rmat F_i(\vtheta) = - \hesswrt{\vtheta}{\ell_i}(\vtheta, Y_i)$$
Note: We have $\rmat F_i(\vtheta) = - \jacobwrt{\vtheta}{\rvec s_i}(\vtheta)$.
Definition 4.7 (Fisher information function): The Fisher information function is the negative Hessian of the log-likelihood function with respect to $\vtheta$, i.e. $$\rmat F(\vtheta) = - \hesswrt{\vtheta}{\ell}(\vtheta, \vY) = \sum_{i=1}^n \rmat F_i(\vtheta)$$
Note: We call the Fisher information function evaluated at the MLE, $\rmat F(\vthetahat)$, the observed Fisher information.
Definition 4.8 (Expected Fisher information): The expected Fisher information function is the expectation of the Fisher information contribution evaluated at the same parameter $\vtheta$ as the distribution $F_{Y}(\cdot, \vtheta)$, i.e. $$\dmat I(\vtheta) = \Ewrt{\vtheta}{\rmat F_i(\vtheta)}$$
Note:
  • When computing $\dmat I(\vthetastar)$ is not possible, one can use $\dmat I(\vthetahat)$ or $\frac{1}{n} \rmat F(\vthetahat)$ as consistent estimators.
  • Under a change of variables $\vtheta : \Gamma \to \Theta$ we have $\dmat I(\vtheta(\vgamma)) = \dmat D^{-\top} \dmat I(\vgamma) \dmat D^{-1}$ where $\dmat D = \jacobwrt{\vgamma}{\vtheta}(\vgamma)$, since the chain rule gives $\dmat I(\vgamma) = \dmat D^{\top} \dmat I(\vtheta(\vgamma)) \dmat D$. This is known as the delta method.
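Definition 4.8 can be illustrated by Monte Carlo in a hypothetical Poisson$(\lambda)$ model (discrete, but the definitions carry over): there $\ell_i = Y_i \log\lambda - \lambda - \log Y_i!$, so $F_i(\lambda) = Y_i/\lambda^2$ depends on the data, while $I(\lambda) = \Ewrt{\lambda}{F_i(\lambda)} = 1/\lambda$.

```python
# Monte Carlo sketch of Definition 4.8 for a hypothetical Poisson(lambda)
# model: F_i(lambda) = Y_i / lambda^2, so I(lambda) = E[F_i] = 1 / lambda.
import numpy as np

rng = np.random.default_rng(2)
lam = 3.0
y = rng.poisson(lam, size=200_000)

fisher_contribs = y / lam**2      # F_i(lambda), one per observation
I_hat = fisher_contribs.mean()    # Monte Carlo estimate of I(lambda)
```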

4.2 Distribution of the MLE

Definition 4.9 (Asymptotic MLE distribution): The MLE is asymptotically normally distributed with $$\vthetahat \sima \lawN\of{\vthetastar, \frac{1}{n} \dmat I(\vthetastar)^{-1}}$$

With that we can compute a confidence interval for the true parameter called the Wald interval.

Definition 4.10 (Wald interval): The Wald interval is the $(1{-}\alpha)$-confidence interval for the true parameter $\thetastar_j$ given by $$\thetahat_j \pm \Phi^{-1}\of{1 - \frac{\alpha}{2}} \sqrt{\frac{1}{n} \dmat I(\vthetahat)^{-1}_{[j,j]}}$$
Note: As $\vthetastar$ is unknown, we use $\dmat I(\vthetahat)$ as an estimator for $\dmat I(\vthetastar)$.
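For the hypothetical exponential-rate example, $I(\lambda) = 1/\lambda^2$, and the Wald interval takes the form below; a minimal sketch:

```python
# Wald interval for a hypothetical exponential-rate model, where
# I(lambda) = 1 / lambda^2 and lambda_hat = 1 / mean(y).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 400
y = rng.exponential(scale=1 / 2.0, size=n)  # true rate 2

lam_hat = 1 / y.mean()      # MLE
I_hat = 1 / lam_hat**2      # plug-in estimate I(lambda_hat)
alpha = 0.05
z = norm.ppf(1 - alpha / 2)
half_width = z * np.sqrt(1 / (n * I_hat))  # z * sqrt((1/n) I^{-1})
lo, hi = lam_hat - half_width, lam_hat + half_width
```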

4.3 Distribution of the Score

Proposition 4.11 (Mean score contributions): The expectation of the score contribution evaluated at the same parameter $\vtheta$ as the distribution $F_{Y}(\cdot, \vtheta)$ is zero, i.e. $$\Ewrt{\vtheta}{\rvec s_i(\vtheta)} = \dvec 0$$
Proof: We start with $1 = \int_{\R} f_{Y}(y, \vtheta) \dd y$, take the gradient on both sides, and interchange differentiation and integration:
\begin{align*}
\dvec 0 &= \gradwrt{\gvec \vartheta}{\gvec \vartheta \mapsto \int_{\R} f_{Y}(y, \gvec \vartheta) \dd y}(\vtheta) \\
&= \int_{\R} \gradwrt{\vtheta}{f_{Y}}(y, \vtheta) \dd y \\
&= \int_{\R} \frac{\gradwrt{\vtheta}{f_{Y}}(y, \vtheta)}{f_{Y}(y, \vtheta)} f_{Y}(y, \vtheta) \dd y \\
&= \int_{\R} \gradwrt{\vtheta}{\log f_{Y}}(y, \vtheta) \, f_{Y}(y, \vtheta) \dd y \\
&= \int_{\R} \gradwrt{\vtheta}{\ell_i}(\vtheta, y) \, f_{Y}(y, \vtheta) \dd y \\
&= \Ewrt{\vtheta}{\rvec s_i(\vtheta)}
\end{align*}
Proposition 4.12 (Variance score contributions): The covariance matrix of the score contribution evaluated at the same parameter $\vtheta$ as the distribution $F_{Y}(\cdot, \vtheta)$ is the expected Fisher information, i.e. $$\Covwrt{\vtheta}{\rvec s_i(\vtheta)}{\rvec s_i(\vtheta)} = \dmat I(\vtheta)$$
Note: In particular $\Ewrt{\vthetastar}{\rvec s_i(\vthetastar)} = \dvec 0$ and $\Covwrt{\vthetastar}{\rvec s_i(\vthetastar)}{\rvec s_i(\vthetastar)} = \dmat I(\vthetastar)$.
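Propositions 4.11 and 4.12 can be checked by Monte Carlo in the hypothetical Poisson$(\lambda)$ model, where $s_i(\lambda) = Y_i/\lambda - 1$, $\Ewrt{\lambda}{s_i} = 0$ and $\Varwrt{\lambda}{s_i} = 1/\lambda = I(\lambda)$:

```python
# Monte Carlo check of Propositions 4.11/4.12 for a hypothetical
# Poisson model: s_i(lambda) = Y_i / lambda - 1.
import numpy as np

rng = np.random.default_rng(4)
lam = 3.0
y = rng.poisson(lam, size=500_000)

s = y / lam - 1
mean_s = s.mean()   # should be close to 0 (Proposition 4.11)
var_s = s.var()     # should be close to I(lambda) = 1/lam (Prop. 4.12)
```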
Definition 4.13 (Asymptotic score distribution): The score function evaluated at the true parameter, $\rvec s(\vthetastar)$, is asymptotically normally distributed with $$\rvec s(\vthetastar) \sima \lawN(\dvec 0, n \dmat I(\vthetastar))$$

With that we can compute a confidence region for the true parameter called the Wilson region.

Definition 4.14 (Wilson region): The Wilson region is the $(1{-}\alpha)$-confidence region for the true parameter $\vthetastar$ given by $$\set{\vtheta \in \Theta \mid \frac{1}{n} \rvec s(\vtheta)^{\top} \dmat I(\vtheta)^{-1} \rvec s(\vtheta) \leq F_{\lawChi{p}}^{-1}\of{1 - \alpha}}$$
Note: Unlike the Wald interval, the Wilson region reflects the asymmetry of the likelihood function and respects the boundaries of the parameter space $\Theta$.
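In the hypothetical exponential example ($p = 1$), the Wilson region can be traced by evaluating the quadratic form on a grid; a minimal sketch:

```python
# Grid-scan sketch of the Wilson region for a hypothetical exponential
# model: {lambda : s(lambda)^2 / (n I(lambda)) <= chi2_1 quantile},
# with s(lambda) = n/lambda - sum(y) and I(lambda) = 1/lambda^2.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 300
y = rng.exponential(scale=1 / 2.0, size=n)  # true rate 2

grid = np.linspace(0.01, 10.0, 5000)
score = n / grid - y.sum()          # s(lambda) on the grid
I = 1 / grid**2                     # I(lambda)
stat = score**2 / (n * I)           # (1/n) s I^{-1} s
cutoff = chi2.ppf(1 - 0.05, df=1)   # chi-squared quantile, p = 1
region = grid[stat <= cutoff]
lo, hi = region.min(), region.max()
```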

4.4 Distribution of the Relative Log-Likelihood

Definition 4.15 (Asymptotic relative log-likelihood distribution): The relative log-likelihood evaluated at the true parameter is asymptotically chi-squared distributed with $$-2 \, r(\vthetastar, \vY) \sima \lawChi{p}$$

With that we can compute a confidence region for the true parameter called the likelihood-ratio region.

Definition 4.16 (Likelihood-ratio region): The likelihood-ratio region is the $(1{-}\alpha)$-confidence region for the true parameter $\vthetastar$ given by $$\set{\vtheta \in \Theta \mid -2 \, r(\vtheta, \vY) \leq F_{\lawChi{p}}^{-1}\of{1 - \alpha}}$$
Note: The advantage of the likelihood-ratio region, besides its simple computation, is that it is invariant with respect to the parametrization. Let $\vtheta : \Gamma \to \Theta$ be a parameter change and $\setC_{\Gamma}$ the likelihood-ratio region in $\Gamma$; then $\setC_{\Theta} = \vtheta(\setC_{\Gamma})$ is the likelihood-ratio region in $\Theta$.
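The likelihood-ratio region for the hypothetical exponential example can be computed the same way, by thresholding $-2 r(\lambda)$ on a grid:

```python
# Grid-scan sketch of the likelihood-ratio region for a hypothetical
# exponential model: {lambda : -2 r(lambda) <= chi2_1 quantile},
# with r(lambda) = l(lambda) - l(lambda_hat).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n = 300
y = rng.exponential(scale=1 / 2.0, size=n)  # true rate 2
lam_hat = 1 / y.mean()                      # MLE

def loglik(lam):
    return n * np.log(lam) - lam * y.sum()

grid = np.linspace(0.01, 10.0, 5000)
rel = loglik(grid) - loglik(lam_hat)        # r(lambda), vectorized
region = grid[-2 * rel <= chi2.ppf(0.95, df=1)]
lo, hi = region.min(), region.max()         # endpoints of the region
```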