
11.7.1 Fundamentals

Chapter 4 outlined the maximum-likelihood approach and advocated it as a fundamental basis for data analysis. There it was argued that the best value of a parameter is the one that maximizes the probability of obtaining the set of observed values in an experiment. Here we explore the logical foundation for that claim, and then extend the approach by incorporating powerful additional information. The resulting Maximum-Entropy Method (MEM) produces results that are sometimes far superior to those from methods that do not incorporate this additional information, particularly in cases where the observations are a large set of numbers (as in an image or a variance spectrum).

The analysis of Chapter 4 considered a way of finding a best value of a parameter with the assumption that the distribution has a particular form. Here that analysis is expanded by including, as an element in the probability statement, a measure of our confidence that the functional form really is as we have assumed. For this, we can use the standard relationship applicable to conditional probabilities:

\begin{displaymath}P(A,B) = P(A) \times P(B\vert A) = P(B) \times P(A\vert B) \ . \end{displaymath} (11.43)
 

That is, the probability of results A and B occurring simultaneously is the probability of A occurring multiplied by the conditional probability that B occurs when A occurs. For example, if H represents a hypothesis, I the state of our knowledge before we do an experiment, and y a set of observations, then the set of observations contributes to the joint probability that H is true and y is measured, as follows:

\begin{displaymath}P(H,y\vert I) = P(H\vert I) \times P(y\vert H,I) = P(y\vert I) \times P(H\vert y,I) \ . \end{displaymath} (11.44)
 

Here P(H|I) represents our prior knowledge of H, based on prior information I. This may be our expectation that reasonable measurements of temperature should be in a particular range, for example, or any other information that could form the basis for rejecting a measurement as unreasonable. It may also represent our expectation that events will obey a particular probability distribution, or will vary smoothly with space or time. In all these cases, this prior knowledge conditions our estimate of the probability P(H,y|I), representing the probability that we will observe the set of observations y and that H is true. The factor P(y|H,I) in (11.44) is the probability of obtaining the observations y under the assumption that H is true, and so represents the probability that is maximized in the maximum-likelihood solution. However, the factor we want to determine from evaluation of the experimental data is P(H|y,I), the probability that H is true, given our observations and prior information. In terms that involve the maximum-likelihood solution, this is

\begin{displaymath}P(H\vert y,I) = P(H\vert I) {{P(y\vert H,I)}\over{P(y\vert I)}} \ , \end{displaymath} (11.45)
 

which is Bayes' theorem. In the maximum-likelihood approach, we maximized P(y|H,I) and identified the result with the maximum-likelihood solution. Equation (11.45) shows that this selection of a best value should depend on our prior knowledge P(H|I) as well. (It does not depend on P(y|I) because that factor does not depend on H. P(y|I) is sometimes written as $\sum_{H^{\prime}}P(H^\prime\vert I)\,P(y\vert H^\prime,I)$, where the sum extends over all possible alternatives for H.)
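As a minimal sketch of (11.45), the following Python fragment evaluates the posterior probability for a small discrete set of hypotheses. The two hypotheses, their prior probabilities, and the observation (three heads in three tosses) are hypothetical values chosen only for illustration; the normalizing factor P(y|I) is computed as the sum of P(H'|I) P(y|H',I) over the alternatives.

```python
def posterior(prior, likelihood):
    """Return P(H|y,I) for each hypothesis from P(H|I) and P(y|H,I)."""
    # Joint probability P(H,y|I) = P(H|I) * P(y|H,I), as in (11.44).
    joint = {h: prior[h] * likelihood[h] for h in prior}
    # P(y|I): the joint probability summed over all alternatives for H.
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Hypothetical hypotheses: the coin is fair, or biased with P(heads) = 0.8.
prior = {"fair": 0.9, "biased": 0.1}
# Hypothetical observation y: three heads in three tosses.
likelihood = {"fair": 0.5 ** 3, "biased": 0.8 ** 3}

post = posterior(prior, likelihood)
for hypothesis, probability in post.items():
    print(hypothesis, round(probability, 4))
```

In this example the likelihood alone favors the biased coin, while the posterior still favors the fair coin because of the strong prior, which is the dependence on P(H|I) described above.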

Bayes' theorem thus describes the way in which newly acquired information (y) modifies our prior estimate of the probability of a conclusion: the modification is proportional to the probability with which we would observe y, given that H is true and given our prior knowledge I. This is a mathematical statement describing calculations related to probability, but when applied to the analysis of experimental data Bayes' theorem introduces an explicit dependence of the answer on the prior information and hence, potentially, on the analyst. This is often cited as a defect of the Bayesian approach: the result depends on the analyst and can vary depending on prior information, whereas we like to think that we strive for objective results that are independent of who does the analysis. Complete independence from the analyst is of course never achieved, but the maximum-entropy method introduces a degree of standardization into this choice by advocating a particular basis for the introduction of prior information.

The result (11.45) may not seem particularly useful in the case where a single parameter is to be determined. For example, if we are measuring a temperature and have no knowledge of what it should be, other than perhaps that it should be within some "reasonable" range, we might assign the prior probability P(H|I) = constant throughout that reasonable range, in which case Bayes' theorem states that the maximum-likelihood solution is also the most probable result for H. However, the method becomes nontrivial when H represents a more complicated result, such as an image or a variance spectrum or a droplet size distribution. We do have prior knowledge that is relevant in many of these cases. We may expect only a certain subset of possible images, or a smoothly varying distribution without sharp peaks, for example, or there may be constraints that the set of observations must satisfy. In image reconstruction, the many images that are consistent with the data may differ only in unresolved fine structure, so we may wish to incorporate a preference for a smooth image into the prior term P(H|I). Bayes' theorem describes how to incorporate these expectations into the result, and the maximum-entropy method assigns a particular choice (described below) to the term representing prior information. MEM is particularly valuable in cases where the dimensionality of the result is large. Indeed, the method becomes very powerful when the number of values needed to characterize the result exceeds the number of observations, because there is extra information from P(H|I) that can be used to determine values for the extra parameters.
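The single-parameter case can be illustrated with a short sketch. It assumes Gaussian measurement errors and evaluates the likelihood on a grid of candidate temperatures; the readings, the uncertainty, the grid, and the alternative Gaussian prior are all hypothetical values introduced here for illustration. With a constant prior the most probable value coincides with the maximum-likelihood value, while an informative prior shifts it.

```python
import math


def gaussian_likelihood(t, observations, sigma):
    """P(y|H,I), up to a constant factor, for candidate temperature t."""
    chi2 = sum(((y - t) / sigma) ** 2 for y in observations)
    return math.exp(-chi2 / 2.0)


observations = [14.2, 15.1, 14.7]  # hypothetical readings
sigma = 0.5                        # assumed measurement uncertainty
# Candidate temperatures spanning an assumed "reasonable" range, 10 to 20.
grid = [10.0 + 0.01 * i for i in range(1001)]

like = [gaussian_likelihood(t, observations, sigma) for t in grid]

# Constant prior throughout the range: the posterior is proportional to
# the likelihood, so both peak at the same grid point.
flat_post = [1.0 * value for value in like]

# Hypothetical informative prior: Gaussian centered at 12 with width 1.
informed_prior = [math.exp(-0.5 * (t - 12.0) ** 2) for t in grid]
informed_post = [p * value for p, value in zip(informed_prior, like)]

t_ml = grid[like.index(max(like))]
t_flat = grid[flat_post.index(max(flat_post))]
t_informed = grid[informed_post.index(max(informed_post))]
print(t_ml, t_flat, t_informed)
```

The normalizing factor P(y|I) is omitted because it does not affect the location of the maximum.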

If we have no prior knowledge of a process, the reasonable approach would be to incorporate this lack of knowledge into Bayes' theorem. However, this is not equivalent to assuming that all possible sets of observations y are equally probable. For example, a set of observations of coin tosses in which all are heads is very improbable compared to results in which there are about equal numbers of heads and tails, not because the probability of any specified sequence differs, but rather because there are many alternative ways to obtain nearly equal numbers of heads and tails and only one way to obtain all heads. Therefore a weighting according to probability would assign a higher a priori probability to the result where the numbers of events in the two classes are nearly equal than to the result where all events are in the same class. In statistical mechanics, the entropy S of a particular macroscopic state (characterized by macroscopic variables) is related to the number of different microscopic states W that are consistent with that macroscopic state, via the relationship

\begin{displaymath}S = k \ln W \quad {\rm or}\quad W = e^{S/k} \end{displaymath} (11.46)
 

where k is Boltzmann's constant. If the state is characterized by N results, each occurring for a fraction $p_n$ of all events, then the entropy becomes

\begin{displaymath}S = -k \sum_{n=1}^N p_n \ln p_n \ . \end{displaymath} (11.47)
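The coin-toss argument and the link between (11.46) and (11.47) can be checked numerically. The sketch below counts the number of sequences W that give a specified number of heads, and then compares ln W per toss with the sum in (11.47), taking k = 1. The symbol M for the total number of tosses and the function names are assumptions introduced here; the agreement for large M follows from Stirling's approximation and is only approximate for finite M.

```python
import math


def multiplicity(n_tosses, n_heads):
    """Number of distinct toss sequences W with exactly n_heads heads."""
    return math.comb(n_tosses, n_heads)


def entropy_per_event(fractions):
    """-sum p_n ln p_n, i.e. (11.47) with k = 1."""
    return sum(-p * math.log(p) for p in fractions if p > 0.0)


# Every specified sequence of 20 tosses has the same probability, 2**-20,
# but the number of sequences differs greatly between outcomes.
for heads in (0, 5, 10):
    ways = multiplicity(20, heads)
    print(heads, ways, ways / 2 ** 20)

# For many tosses M, ln(W) / M approaches -sum p_n ln p_n.
m = 1000
for heads in (250, 500):
    p = heads / m
    per_toss = math.log(multiplicity(m, heads)) / m
    print(p, per_toss, entropy_per_event([p, 1.0 - p]))
```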
 

In an analysis that forms a cornerstone of modern information theory, Shannon (1948) showed that the same formula (apart from the constant k) provides a measure of the uncertainty in a priori knowledge of the correct answer in the case where there are N mutually exclusive possibilities known to have probabilities $p_n$. The natural tendency of thermodynamic systems to increase in entropy is thus associated with a tendency to move toward disordered states of high multiplicity, that is, states for which more microscopic states correspond to the macroscopic variables. Because the information content of a message represents a departure from uniformity, or the result of ordering, the same formula without the factor -k, $\sum_n p_n \ln p_n$, provides a measure of the information content in a message, according to Shannon's analysis. Indeed, the negative of this sum, expressed with logarithms having base 2, gives the number of binary (yes-or-no) results providing the same amount of information, in the sense of the number of such results needed to specify the outcome with complete certainty.
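The base-2 form of the formula is easy to evaluate directly. The following sketch computes the uncertainty in bits for a few hypothetical probability assignments; the function name is an assumption introduced here.

```python
import math


def shannon_entropy_bits(probs):
    """-sum p_n log2 p_n: uncertainty, in bits, of the assignment probs."""
    # Terms with p = 0 contribute nothing (p log p tends to 0 as p -> 0).
    return sum(-p * math.log2(p) for p in probs if p > 0.0)


uniform_eight = [1.0 / 8.0] * 8        # eight equally likely possibilities
certain = [1.0, 0.0, 0.0, 0.0]         # the answer is already known
skewed = [0.5, 0.25, 0.125, 0.125]     # unequal probabilities

for probs in (uniform_eight, certain, skewed):
    print(shannon_entropy_bits(probs))
```

Eight equally likely possibilities require three yes-or-no answers to identify the outcome, a certain outcome requires none, and an uneven assignment falls in between.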

A lesson from this is that, if we want to make as few assumptions as possible when assigning a priori probabilities for events, we should assign a probability having an exponential form where the exponent is proportional to the entropy. In the case where the result is the number of events falling into a discrete set of mutually exclusive possibilities, this leads to

\begin{displaymath}P(H\vert I) = e^{-\sum_n p_n \ln p_n} \ . \end{displaymath} (11.48)
 

Recall, from Section 4.1, that the maximum-likelihood solution was obtained by maximizing the probability of obtaining a set of observations $\{x\}$ from a population with probability distribution function $\phi(x;\{a\})$ via (4.2):

\begin{displaymath}{\cal L}(\{a\}) = \prod_i \phi(x_i;\{a\}) \ . \end{displaymath} (11.49)
 

Also, in the case of Gaussian behavior, the likelihood function was equivalent to

\begin{displaymath}{\cal L}(\{a\}) = e^{-\chi^2/2} \ . \end{displaymath} (11.50)
 

A maximum-entropy solution can then be obtained by maximizing the product of the likelihood function with P(H|I) or, in the case of Gaussian errors, by maximizing the exponent of that product,

\begin{displaymath}E = -\sum_n (p_n \ln p_n) - \chi^2/2 \ .\end{displaymath} (11.51)
 

Maximizing E amounts to selecting the most probable distribution consistent with the constraints of the observations. This is not in general the maximum-likelihood solution, but a departure from that solution such that the gain in probability attributed to prior knowledge exceeds the decrease in probability as the solution is perturbed from its conventional maximum-likelihood value. In some cases where the number of parameters exceeds the number of observations, only solutions matching the observations exactly (and hence giving $\chi^2=0$) are considered, and then the solution corresponds to the state having the maximum entropy while satisfying the constraints of the observations.
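A small worked case may make the maximization of (11.51) concrete. The sketch below is an illustration under stated assumptions, not the procedure of any particular MEM package: the result is a set of six fractions p_n (the faces of a die), the only observation is a measured mean y with Gaussian uncertainty sigma, and the fractions are required to sum to one. Setting the derivative of E with respect to each p_n to zero under that normalization gives p_n proportional to exp(-lam x_n) with lam = (mean - y)/sigma^2, and the code finds lam by bisection. The numerical values and all function names are hypothetical.

```python
import math


def distribution(lam, values):
    """p_n proportional to exp(-lam * x_n), normalized to sum to one."""
    weights = [math.exp(-lam * x) for x in values]
    z = sum(weights)
    return [w / z for w in weights]


def mean(p, values):
    return sum(pn * x for pn, x in zip(p, values))


def objective(p, values, y, sigma):
    """E = -sum p_n ln p_n - chi^2/2 for a single observed mean y."""
    entropy = sum(-pn * math.log(pn) for pn in p if pn > 0.0)
    chi2 = ((mean(p, values) - y) / sigma) ** 2
    return entropy - chi2 / 2.0


def solve_mem(values, y, sigma, lo=-5.0, hi=5.0, iterations=200):
    """Bisect on lam until lam * sigma**2 = mean(lam) - y."""
    def g(lam):
        # g increases with lam because the mean decreases as lam grows.
        return lam * sigma ** 2 + y - mean(distribution(lam, values), values)

    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return distribution(0.5 * (lo + hi), values)


values = [1, 2, 3, 4, 5, 6]   # hypothetical outcomes (faces of a die)
y, sigma = 4.5, 0.1           # hypothetical observed mean and uncertainty
p = solve_mem(values, y, sigma)
print([round(pn, 4) for pn in p])
print(mean(p, values), objective(p, values, y, sigma))
```

Six fractions are determined here from one observation, which is possible only because the entropy term supplies the additional information. The fitted mean falls slightly short of the observed mean, illustrating the departure from the maximum-likelihood value described above; as sigma is made smaller the solution approaches the exact-match case with $\chi^2=0$.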

Jaynes (1985) gives a brief summary of the history and evolution of this method and its ties to statistical mechanics. That article also should be read for its exposition of the underlying approach and the philosophical foundations of the method.




NCAR Advanced Study Program
http://www.asp.ucar.edu