4. Maximum Likelihood Inference

As before, we assume that $Y_1, \ldots, Y_n$ is an $\iid$ sample of an absolutely continuous response $Y$, i.e. $Y_i \simiid F_{Y}(\cdot, \vthetastar)$, that the parametric distribution $F_Y(y, \vtheta)$ is correct and that the true parameters $\vthetastar \in \Theta \subset \R^p$ are unknown. We also assume that the probability density function $f_{Y}(y, \vtheta)$ is at least twice continuously $\vtheta$-differentiable and $\vtheta$-concave. Further, we make the density approximation of the log-likelihood, i.e. each log-likelihood contribution is $\ell_i(\vtheta, Y_i) = \log f_Y(Y_i, \vtheta)$.

4.1 MLE, Score and Fisher Information

Definition 4.1 (Maximum Likelihood Estimator): The solution to the optimization problem $$\vthetahat = \argmax_{\vtheta \in \Theta} \ell(\vtheta, \vY)$$ is called the maximum likelihood estimator or MLE for $\vtheta$.
Note: We denote estimators with $\hat{\cdot}$ and specific realizations, i.e. estimates, with $\check{\cdot}$; hence the maximum likelihood estimate is $\vthetacheck = \argmax_{\vtheta \in \Theta} \ell(\vtheta, \rvec y)$.

A unique solution to the optimization problem does not necessarily exist, and thus the existence and uniqueness of the maximum likelihood estimator is not guaranteed. For distributions of the exponential family, however, the MLE exists and is unique. The following definitions all assume the existence and uniqueness of the MLE.

Definition 4.2 (Relative log-likelihood function): The relative log-likelihood function is $$r(\vtheta, \vY) = \ell(\vtheta, \vY) - \ell(\vthetahat, \vY)$$
Proposition 4.3 (MLE under change of variables): Let $\vtheta : \Gamma \to \Theta$ be a bijection. Given the MLE $\vgammahat = \argmax_{\vgamma \in \Gamma} \ell(\vtheta(\vgamma), \vY)$, the MLE in the parameter space $\Theta$ is $\vthetahat = \vtheta(\vgammahat)$.
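As a hypothetical worked example (not from the notes), consider an exponential sample $Y_i \simiid \mathrm{Exp}(\lambda)$ with log-likelihood $\ell(\lambda, \vY) = n \log \lambda - \lambda \sum_i Y_i$ and closed-form MLE $\hat\lambda = 1/\bar Y$. The sketch below finds the MLE numerically and checks the invariance of Proposition 4.3 under the reparametrization $\gamma = \log \lambda$:

```python
# Hypothetical example: MLE for the rate of an exponential sample,
# where l(lambda, y) = n log(lambda) - lambda * sum(y) and the
# closed-form MLE is lambda_hat = 1 / mean(y).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.exponential(scale=1 / 2.0, size=500)  # sample with true rate lambda* = 2

def neg_loglik(lam):
    # negative log-likelihood -l(lambda, y)
    return -(len(y) * np.log(lam) - lam * y.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
lam_hat = res.x  # numerical optimum, matches 1 / y.mean()

# Proposition 4.3: maximizing over gamma = log(lambda) and mapping
# back through lambda = exp(gamma) yields the same MLE.
res_g = minimize_scalar(lambda g: neg_loglik(np.exp(g)),
                        bounds=(-5.0, 5.0), method="bounded")
lam_hat_via_gamma = np.exp(res_g.x)
```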

Properties of the MLE can be studied with the help of the gradient of the log-likelihood, called the score function, and its negative Hessian, called the Fisher information.

Definition 4.4 (Score contribution): The score contribution is the gradient of the log-likelihood contribution with respect to $\vtheta$, i.e. $$\rvec s_i(\vtheta) = \gradwrt{\vtheta}{\ell_i}(\vtheta, Y_i)$$
Note: $\rvec s_i(\vtheta)$ implicitly depends on $Y_i$ and is thus stochastic.
Definition 4.5 (Score function): The score function is the gradient of the log-likelihood function with respect to $\vtheta$, i.e. $$\rvec s(\vtheta) = \gradwrt{\vtheta}{\ell}(\vtheta, \vY) = \sum_{i=1}^n \rvec s_i(\vtheta)$$
Note: By definition $\rvec s(\vthetahat) = \dvec 0$, and thus we can compute the MLE by solving the score equations $\rvec s(\vtheta) = \dvec 0$ for $\vtheta \in \Theta$.
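Continuing the hypothetical exponential example, the score equation $s(\lambda) = n/\lambda - \sum_i Y_i = 0$ can be solved by one-dimensional root finding; a minimal sketch:

```python
# Solving the score equation s(lambda) = 0 numerically for a
# hypothetical exponential model: s(lambda) = n / lambda - sum(y).
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
y = rng.exponential(scale=1 / 2.0, size=1000)  # true rate 2

def score(lam):
    return len(y) / lam - y.sum()

# score is positive near 0 and negative for large lambda, so the
# bracket [1e-6, 50] contains the unique root, the MLE 1 / mean(y).
lam_hat = brentq(score, 1e-6, 50.0)
```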
Definition 4.6 (Fisher information contribution): The Fisher information contribution is the negative Hessian of the log-likelihood contribution with respect to $\vtheta$, i.e. $$\rmat F_i(\vtheta) = - \hesswrt{\vtheta}{\ell_i}(\vtheta, Y_i)$$
Note: We have $\rmat F_i(\vtheta) = - \jacobwrt{\vtheta}{\rvec s_i}(\vtheta)$.
Definition 4.7 (Fisher information function): The Fisher information function is the negative Hessian of the log-likelihood function with respect to $\vtheta$, i.e. $$\rmat F(\vtheta) = - \hesswrt{\vtheta}{\ell}(\vtheta, \vY) = \sum_{i=1}^n \rmat F_i(\vtheta)$$
Note: We call the Fisher information function evaluated at the MLE, $\rmat F(\vthetahat)$, the observed Fisher information.
Definition 4.8 (Expected Fisher information): The expected Fisher information function is the expectation of the Fisher information contribution evaluated at the same parameter $\vtheta$ as the distribution $F_{Y}(\cdot, \vtheta)$, i.e. $$\dmat I(\vtheta) = \Ewrt{\vtheta}{\rmat F_i(\vtheta)}$$
Note:
  • When computing $\dmat I(\vthetastar)$ is not possible, one can use $\dmat I(\vthetahat)$ or $\frac{1}{n} \rmat F(\vthetahat)$ as consistent estimators.
  • Under a change of variables $\vtheta : \Gamma \to \Theta$ we have $\dmat I(\vtheta(\vgamma)) = \dmat D^{-\top} \dmat I(\vgamma) \dmat D^{-1}$ where $\dmat D = \jacobwrt{\vgamma}{\vtheta}(\vgamma)$, since the chain rule gives $\dmat I(\vgamma) = \dmat D^{\top} \dmat I(\vtheta(\vgamma)) \dmat D$. This is known as the delta method.
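Definition 4.8 can be illustrated by Monte Carlo in a hypothetical Poisson$(\lambda)$ model (discrete, but the definitions carry over): there $\ell_i = Y_i \log\lambda - \lambda - \log Y_i!$, so $F_i(\lambda) = Y_i/\lambda^2$ depends on the data, while $I(\lambda) = \Ewrt{\lambda}{F_i(\lambda)} = 1/\lambda$.

```python
# Monte Carlo sketch of Definition 4.8 for a hypothetical Poisson(lambda)
# model: F_i(lambda) = Y_i / lambda^2, so I(lambda) = E[F_i] = 1 / lambda.
import numpy as np

rng = np.random.default_rng(2)
lam = 3.0
y = rng.poisson(lam, size=200_000)

fisher_contribs = y / lam**2      # F_i(lambda), one per observation
I_hat = fisher_contribs.mean()    # Monte Carlo estimate of I(lambda)
```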

4.2 Distribution of the MLE

Definition 4.9 (Asymptotic MLE distribution): The MLE is asymptotically normally distributed with $$\vthetahat \sima \lawN\of{\vthetastar, \frac{1}{n} \dmat I(\vthetastar)^{-1}}$$

With that we can compute a confidence interval for the true parameter called the Wald interval.

Definition 4.10 (Wald interval): The Wald interval is the $(1{-}\alpha)$-confidence interval for the true parameter $\thetastar_j$ given by $$\thetahat_j \pm \Phi^{-1}\of{1 - \frac{\alpha}{2}} \sqrt{\frac{1}{n} \dmat I(\vthetahat)^{-1}_{[j,j]}}$$
Note: As $\vthetastar$ is unknown, we use $\dmat I(\vthetahat)$ as an estimator for $\dmat I(\vthetastar)$.
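For the hypothetical exponential-rate example, $I(\lambda) = 1/\lambda^2$, and the Wald interval takes the form below; a minimal sketch:

```python
# Wald interval for a hypothetical exponential-rate model, where
# I(lambda) = 1 / lambda^2 and lambda_hat = 1 / mean(y).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 400
y = rng.exponential(scale=1 / 2.0, size=n)  # true rate 2

lam_hat = 1 / y.mean()      # MLE
I_hat = 1 / lam_hat**2      # plug-in estimate I(lambda_hat)
alpha = 0.05
z = norm.ppf(1 - alpha / 2)
half_width = z * np.sqrt(1 / (n * I_hat))  # z * sqrt((1/n) I^{-1})
lo, hi = lam_hat - half_width, lam_hat + half_width
```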

4.3 Distribution of the Score

Proposition 4.11 (Mean score contributions): The expectation of the score contribution evaluated at the same parameter $\vtheta$ as the distribution $F_{Y}(\cdot, \vtheta)$ is zero, i.e. $$\Ewrt{\vtheta}{\rvec s_i(\vtheta)} = \dvec 0$$
Proof: We start with $1 = \int_{\R} f_{Y}(y, \vtheta) \dd y$, take the gradient on both sides, and interchange differentiation and integration:
\begin{align*}
\dvec 0 &= \gradwrt{\gvec \vartheta}{\gvec \vartheta \mapsto \int_{\R} f_{Y}(y, \gvec \vartheta) \dd y}(\vtheta) \\
&= \int_{\R} \gradwrt{\vtheta}{f_{Y}}(y, \vtheta) \dd y \\
&= \int_{\R} \frac{\gradwrt{\vtheta}{f_{Y}}(y, \vtheta)}{f_{Y}(y, \vtheta)} f_{Y}(y, \vtheta) \dd y \\
&= \int_{\R} \gradwrt{\vtheta}{\log f_{Y}}(y, \vtheta) \, f_{Y}(y, \vtheta) \dd y \\
&= \int_{\R} \gradwrt{\vtheta}{\ell_i}(\vtheta, y) \, f_{Y}(y, \vtheta) \dd y \\
&= \Ewrt{\vtheta}{\rvec s_i(\vtheta)}
\end{align*}
Proposition 4.12 (Variance score contributions): The covariance matrix of the score contribution evaluated at the same parameter $\vtheta$ as the distribution $F_{Y}(\cdot, \vtheta)$ is the expected Fisher information, i.e. $$\Covwrt{\vtheta}{\rvec s_i(\vtheta)}{\rvec s_i(\vtheta)} = \dmat I(\vtheta)$$
Note: In particular $\Ewrt{\vthetastar}{\rvec s_i(\vthetastar)} = \dvec 0$ and $\Covwrt{\vthetastar}{\rvec s_i(\vthetastar)}{\rvec s_i(\vthetastar)} = \dmat I(\vthetastar)$.
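Propositions 4.11 and 4.12 can be checked by Monte Carlo in the hypothetical Poisson$(\lambda)$ model, where $s_i(\lambda) = Y_i/\lambda - 1$, $\Ewrt{\lambda}{s_i} = 0$ and $\Varwrt{\lambda}{s_i} = 1/\lambda = I(\lambda)$:

```python
# Monte Carlo check of Propositions 4.11/4.12 for a hypothetical
# Poisson model: s_i(lambda) = Y_i / lambda - 1.
import numpy as np

rng = np.random.default_rng(4)
lam = 3.0
y = rng.poisson(lam, size=500_000)

s = y / lam - 1
mean_s = s.mean()   # should be close to 0 (Proposition 4.11)
var_s = s.var()     # should be close to I(lambda) = 1/lam (Prop. 4.12)
```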
Definition 4.13 (Asymptotic score distribution): The score function evaluated at the true parameter, $\rvec s(\vthetastar)$, is asymptotically normally distributed with $$\rvec s(\vthetastar) \sima \lawN(\dvec 0, n \dmat I(\vthetastar))$$

With that we can compute a confidence region for the true parameter called the Wilson region.

Definition 4.14 (Wilson region): The Wilson region is the $(1{-}\alpha)$-confidence region for the true parameter $\vthetastar$ given by $$\set{\vtheta \in \Theta \mid \frac{1}{n} \rvec s(\vtheta)^{\top} \dmat I(\vtheta)^{-1} \rvec s(\vtheta) \leq F_{\lawChi{p}}^{-1}\of{1 - \alpha}}$$
Note: Unlike the Wald interval, the Wilson region reflects the asymmetry of the likelihood function and respects the boundaries of the parameter space $\Theta$.
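In the hypothetical exponential example ($p = 1$), the Wilson region can be traced by evaluating the quadratic form on a grid; a minimal sketch:

```python
# Grid-scan sketch of the Wilson region for a hypothetical exponential
# model: {lambda : s(lambda)^2 / (n I(lambda)) <= chi2_1 quantile},
# with s(lambda) = n/lambda - sum(y) and I(lambda) = 1/lambda^2.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
n = 300
y = rng.exponential(scale=1 / 2.0, size=n)  # true rate 2

grid = np.linspace(0.01, 10.0, 5000)
score = n / grid - y.sum()          # s(lambda) on the grid
I = 1 / grid**2                     # I(lambda)
stat = score**2 / (n * I)           # (1/n) s I^{-1} s
cutoff = chi2.ppf(1 - 0.05, df=1)   # chi-squared quantile, p = 1
region = grid[stat <= cutoff]
lo, hi = region.min(), region.max()
```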

4.4 Distribution of the Relative Log-Likelihood

Definition 4.15 (Asymptotic relative log-likelihood distribution): The relative log-likelihood evaluated at the true parameter is asymptotically chi-squared distributed with $$-2 \, r(\vthetastar, \vY) \sima \lawChi{p}$$

With that we can compute a confidence region for the true parameter called the likelihood-ratio region.

Definition 4.16 (Likelihood-ratio region): The likelihood-ratio region is the $(1{-}\alpha)$-confidence region for the true parameter $\vthetastar$ given by $$\set{\vtheta \in \Theta \mid -2 \, r(\vtheta, \vY) \leq F_{\lawChi{p}}^{-1}\of{1 - \alpha}}$$
Note: The advantage of the likelihood-ratio region, besides its simple computation, is that it is invariant with respect to the parametrization. Let $\vtheta : \Gamma \to \Theta$ be a parameter change and $\setC_{\Gamma}$ the likelihood-ratio region in $\Gamma$; then $\setC_{\Theta} = \vtheta(\setC_{\Gamma})$ is the likelihood-ratio region in $\Theta$.
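The likelihood-ratio region for the hypothetical exponential example can be computed the same way, by thresholding $-2 r(\lambda)$ on a grid:

```python
# Grid-scan sketch of the likelihood-ratio region for a hypothetical
# exponential model: {lambda : -2 r(lambda) <= chi2_1 quantile},
# with r(lambda) = l(lambda) - l(lambda_hat).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n = 300
y = rng.exponential(scale=1 / 2.0, size=n)  # true rate 2
lam_hat = 1 / y.mean()                      # MLE

def loglik(lam):
    return n * np.log(lam) - lam * y.sum()

grid = np.linspace(0.01, 10.0, 5000)
rel = loglik(grid) - loglik(lam_hat)        # r(lambda), vectorized
region = grid[-2 * rel <= chi2.ppf(0.95, df=1)]
lo, hi = region.min(), region.max()         # endpoints of the region
```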