Next: 4.2 Applications Up: 4. The Method of Maximum Likelihood Previous: 4. The Method of Maximum Likelihood

4.1 Basis

The method of maximum likelihood provides one solution to the problem of estimation. Consider a set of observations $\{x_i\}$ from a population with the probability distribution function
\begin{displaymath}\phi(x;a_1,a_2,\dots,a_n)=\phi(x;\{a\}) . \end{displaymath} (4.1)
 

The parameters $\{a\}$ influence the distribution function, but are generally unknown. The task of estimation is to determine functions of the observations $\{x\}$ to use as estimates of the parameters $\{a\}$.

An estimator of $a_j$ can be any function $f_j(\{x\})$ used to estimate the true value of the parameter $a_j$. For example, the sample mean and sample standard deviation are often used as estimators of the true population mean and standard deviation. Desirable characteristics of estimators are:

The probability of obtaining a set of independent observations $\{x\}$ from a population with probability distribution function $\phi(x;\{a\})$ is the product of the probabilities of all the observations:
\begin{displaymath}{\cal L}(\{a\}) = \prod_i \phi(x_i;\{a\}) . \end{displaymath} (4.2)
 

This joint probability function is called the likelihood and depends on the parameters $\{a\}$. If the likelihood function is plotted as a function of $a$ for the case with a single parameter, the resulting curve will have a shape somewhat like Fig. 4.1. The value $a^*$ at which the likelihood reaches its maximum is the maximum-likelihood estimate of the parameter $a$.


 
Figure 4.1: Example likelihood function ${\cal L}(a)$ vs. the assumed parameter $a$. This example was generated using 10 random numbers drawn from the interval (0, 100): 67.6, 60.2, 41.5, 84.9, 40.1, 2.5, 9.6, 45.7, 50.2, 15.6. The sample mean is 41.8, which is also the location of the maximum of the likelihood function. The likelihood function was calculated using (4.2), where $\phi(x_i;a)$ was taken to be a Gaussian probability function with mean $a$ and standard deviation $100/\sqrt{12}$ (the standard deviation of a uniform distribution on that interval).
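As a minimal sketch of how the curve in Fig. 4.1 could be reproduced, the following Python evaluates (4.2) on a grid of trial values of $a$ for the ten numbers quoted in the caption. The function names and the grid spacing of 0.1 are illustrative choices, not part of the original text.

```python
import math

# Sample quoted in the caption of Fig. 4.1.
x = [67.6, 60.2, 41.5, 84.9, 40.1, 2.5, 9.6, 45.7, 50.2, 15.6]
SIGMA = 100.0 / math.sqrt(12.0)  # standard deviation assumed in the caption


def gaussian_pdf(xi, a, sigma=SIGMA):
    """Gaussian probability density with mean a and standard deviation sigma."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-0.5 * ((xi - a) / sigma) ** 2) / norm


def likelihood(a, data=x):
    """Likelihood (4.2): product of the densities of all observations."""
    return math.prod(gaussian_pdf(xi, a) for xi in data)


# Scan trial values of a from 0 to 100 in steps of 0.1 (arbitrary spacing).
grid = [i / 10.0 for i in range(0, 1001)]
a_star = max(grid, key=likelihood)
sample_mean = sum(x) / len(x)

print("grid maximum of the likelihood:", a_star)
print("sample mean:", sample_mean)
```

Within the resolution of the grid, the maximum of the likelihood falls at the sample mean, as stated in the caption.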

For numerical convenience, it is usually preferable to calculate instead the function $W$ defined as

\begin{displaymath}W = {\rm ln} {\cal L}(\{a\}) . \end{displaymath} (4.3)
 

Because $W$ is a monotonically increasing function of ${\cal L}$, the maximum in $W$ will coincide with the maximum in ${\cal L}$. However, because the calculation of $W$ involves a summation rather than a product, there are computational advantages to the use of $W$:

\begin{displaymath}W = \ln\Bigl(\prod_i \phi(x_i;\{a\})\Bigr) \end{displaymath} (4.4)
\begin{displaymath}\phantom{W} = \sum_i \ln \phi(x_i;\{a\}) . \end{displaymath} (4.5)
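One computational advantage can be seen in a short sketch: for a large sample the product in (4.2) can underflow to zero in floating-point arithmetic, while the sum in (4.5) stays finite. The 500-point sample below is hypothetical (the ten values of Fig. 4.1 repeated 50 times), and the function name is an illustrative choice.

```python
import math

SIGMA = 100.0 / math.sqrt(12.0)  # Gaussian width used for Fig. 4.1


def gaussian_pdf(xi, a, sigma=SIGMA):
    """Gaussian probability density with mean a and standard deviation sigma."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-0.5 * ((xi - a) / sigma) ** 2) / norm


# Hypothetical large sample: the ten values of Fig. 4.1 repeated 50 times.
base = [67.6, 60.2, 41.5, 84.9, 40.1, 2.5, 9.6, 45.7, 50.2, 15.6]
data = base * 50

a = 41.8  # trial value of the parameter

# Direct product (4.2): each factor is about 0.01, so 500 factors underflow.
product = math.prod(gaussian_pdf(xi, a) for xi in data)

# Sum of logarithms (4.5): remains an ordinary finite number.
W = sum(math.log(gaussian_pdf(xi, a)) for xi in data)

print("product:", product)
print("W:", W)
```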
 

The maximum-likelihood estimate of the parameters $\{a\}$ satisfies the simultaneous equations

\begin{displaymath}{{\partial W}\over{\partial a_j}}\Bigr\vert _{a_j=a_j^*} = 0 . \end{displaymath} (4.6)
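
As a worked illustration (an added example, using the assumptions of Fig. 4.1), take $\phi$ to be Gaussian with a single unknown mean $a$ and a known standard deviation, here written $\sigma_x$ to distinguish it from the parameter uncertainties introduced later. For $N$ observations, (4.5) gives
\begin{displaymath}W = -N \ln\bigl(\sigma_x\sqrt{2\pi}\bigr) - \sum_i {{(x_i-a)^2}\over{2\sigma_x^2}} , \end{displaymath}
and the condition (4.6) becomes
\begin{displaymath}{{\partial W}\over{\partial a}}\Bigr\vert _{a=a^*} = \sum_i {{(x_i-a^*)}\over{\sigma_x^2}} = 0 \quad\Longrightarrow\quad a^* = {{1}\over{N}}\sum_i x_i , \end{displaymath}
so that in this case the maximum-likelihood estimate is the sample mean.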
 

The maximum-likelihood estimator has several desirable properties:

1. The estimator is efficient in the sense that, for large numbers of observations, there is no estimator with smaller variance.
2. The estimator approaches the true population parameter asymptotically as the number of observations increases.
3. The distribution of deviations of the estimator from the population parameter approaches a normal distribution for large numbers of observations.
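
The second property can be illustrated with a small simulation. This is only a sketch under assumed conditions: a Gaussian population with a hypothetical true mean of 50 and standard deviation of 10, for which the maximum-likelihood estimate of the mean is the sample mean. The function names, seed, and number of trials are arbitrary choices.

```python
import random
import statistics

TRUE_MEAN = 50.0   # hypothetical population mean
TRUE_SIGMA = 10.0  # hypothetical population standard deviation


def mle_mean(sample):
    """Maximum-likelihood estimate of a Gaussian mean: the sample mean."""
    return sum(sample) / len(sample)


def estimate_spread(n_obs, n_trials=500, seed=1):
    """Repeat the experiment n_trials times with n_obs observations each.

    Returns the mean and standard deviation of the resulting estimates.
    """
    rng = random.Random(seed)
    estimates = [
        mle_mean([rng.gauss(TRUE_MEAN, TRUE_SIGMA) for _ in range(n_obs)])
        for _ in range(n_trials)
    ]
    return statistics.mean(estimates), statistics.stdev(estimates)


results = {n_obs: estimate_spread(n_obs) for n_obs in (10, 100, 1000)}
for n_obs, (centre, spread) in results.items():
    # The last column is the expected spread, TRUE_SIGMA / sqrt(n_obs).
    print(n_obs, round(centre, 2), round(spread, 2),
          round(TRUE_SIGMA / n_obs ** 0.5, 2))
```

The scatter of the estimates about the true value shrinks as the number of observations grows, roughly as the inverse square root of the sample size.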

The Gaussian behavior of the likelihood function for large sample sizes can be used to determine the uncertainty in the maximum-likelihood estimate of $\{a\}$. If the $a_j$ are uncorrelated,
\begin{displaymath}{\cal L}(\{a\}) = {C_1}~ \exp\{-{{(a_1-a_1^*)^2}\over{2\sigma_1^2}}\} \exp\{-{{(a_2-a_2^*)^2}\over{2\sigma_2^2}}\} \cdots \end{displaymath} (4.7)
 

and

\begin{displaymath}W = {C_2} - \sum_j {{(a_j-a_j^*)^2}\over{2\sigma_j^2}}\end{displaymath} (4.8)
 

where $C_1$ and $C_2$ are constants. Differentiating $W$ twice isolates $\sigma_j$:

 \begin{displaymath}{{\partial^2W}\over{\partial a_j^2}} = -{{1}\over{\sigma_j^2}}\end{displaymath} (4.9)
 
 \begin{displaymath}\sigma_j = \Bigl[-{{\partial^2 W}\over{\partial a_j^2}}\Bigr\vert _{a_j^*}\Bigr]^{-1/2} .\end{displaymath} (4.10)
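
A minimal sketch of how (4.10) might be evaluated numerically, using the sample and Gaussian model of Fig. 4.1. The central finite-difference formula and its step size are assumptions made for this illustration, as are the function names.

```python
import math

# Sample and Gaussian width from the caption of Fig. 4.1.
x = [67.6, 60.2, 41.5, 84.9, 40.1, 2.5, 9.6, 45.7, 50.2, 15.6]
SIGMA = 100.0 / math.sqrt(12.0)


def log_likelihood(a, data=x, sigma=SIGMA):
    """W of (4.5) for a Gaussian with mean a and known width sigma."""
    log_norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((xi - a) / sigma) ** 2 - log_norm for xi in data)


def second_derivative(f, a, h=0.1):
    """Central finite-difference estimate of the second derivative of f at a."""
    return (f(a + h) - 2.0 * f(a) + f(a - h)) / h ** 2


a_star = sum(x) / len(x)  # maximum-likelihood estimate for this model

# Equation (4.10): sigma_a = [-d^2 W / da^2]^(-1/2), evaluated at a_star.
sigma_a = (-second_derivative(log_likelihood, a_star)) ** -0.5

print("sigma_a from (4.10):", sigma_a)
print("SIGMA / sqrt(N):    ", SIGMA / math.sqrt(len(x)))
```

For this model $W$ is exactly quadratic in $a$, so the result agrees with the familiar standard error of the mean, the assumed width divided by the square root of the number of observations.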
 

It is often simplest to use (4.8) directly, rather than evaluate the second derivative in (4.10), particularly when there is a single parameter to be determined. When $a_j$ differs from $a_j^*$ by $\sigma_j$, the right side of (4.8) is smaller than its maximum value by 1/2, so for uncorrelated errors the standard deviation of the result can be estimated by finding the deviation of $a_j$ from $a_j^*$ that reduces $W$ by 1/2.
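
The rule that $W$ falls by 1/2 at one standard deviation can be sketched as a one-dimensional search. The bisection routine below is a hypothetical helper written for this illustration, applied to the sample and Gaussian model of Fig. 4.1; the bracket step and tolerance are arbitrary.

```python
import math

# Sample and Gaussian width from the caption of Fig. 4.1.
x = [67.6, 60.2, 41.5, 84.9, 40.1, 2.5, 9.6, 45.7, 50.2, 15.6]
SIGMA = 100.0 / math.sqrt(12.0)


def log_likelihood(a, data=x, sigma=SIGMA):
    """W of (4.5) for a Gaussian with mean a and known width sigma."""
    log_norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * ((xi - a) / sigma) ** 2 - log_norm for xi in data)


def half_drop_distance(f, a_star, step=1.0, tol=1e-9):
    """Distance above a_star at which f has fallen by 1/2 from f(a_star)."""
    target = f(a_star) - 0.5
    lo, hi = a_star, a_star + step
    while f(hi) > target:      # widen the bracket until the drop exceeds 1/2
        hi += step
    while hi - lo > tol:       # bisect on the crossing point
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) - a_star


a_star = sum(x) / len(x)
sigma_a = half_drop_distance(log_likelihood, a_star)
print("sigma_a from the drop of 1/2 in W:", sigma_a)
```

This approach needs only evaluations of $W$, not its derivatives, which is why it can be convenient for a single parameter.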

Maximizing $W$ is equivalent to minimizing the chi-square function, defined as

\begin{displaymath}\chi^2 = \sum_j {{(a_j-a_j^*)^2}\over{\sigma_j^2}} ~ . \end{displaymath} (4.11)
 

Because $\chi^2$ increases by 1 when $W$ decreases by 1/2, the standard deviation in the estimate $a_j^*$ can also be estimated from the deviation that causes an increase of 1 in the chi-square.
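
The correspondence between the two criteria follows directly from comparing (4.8) with (4.11); written out as an added intermediate step,
\begin{displaymath}W = C_2 - {{1}\over{2}}\chi^2 \quad\Longrightarrow\quad \Delta W = -{{1}\over{2}}\Delta\chi^2 , \end{displaymath}
so a decrease of 1/2 in $W$ and an increase of 1 in $\chi^2$ mark the same deviation of $a_j$ from $a_j^*$.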

In a case where the fit to the measurements is poor, perhaps because an inappropriate distribution function was used, the likelihood will have a value much smaller than expected. In that case, the estimates of uncertainty obtained from (4.8-4.10) should not be used. Instead, the proper conclusion is that the model used is inappropriate because it does not provide an adequate fit to the observations. Erroneously small uncertainty estimates sometimes arise from using (4.8-4.10) when the fit is poor.





NCAR Advanced Study Program
http://www.asp.ucar.edu