Bayes’ rule and Jeffrey’s updating from the principle of minimum change

What follows is, to the best of my understanding, largely folklore. Variants of the argument appear implicitly across the literature on Bayesian updating, relative entropy, and consistent inference, and the conclusion is often taken for granted once one is familiar with Jeffrey updating. However, I have not been able to find a place where the reasoning is laid out explicitly and in a self-contained way, starting from minimal assumptions and making clear what is, and is not, being imposed. For that reason, I am recording this note here, both for my own reference and in the hope that it may be useful to others. A PDF version containing the same material, with slightly improved formatting, is available here.

Let $X$ and $Y$ be finite-valued quantities (random variables) about which we wish to reason. Our prior information is summarized by a joint distribution

$$P(x,y)=p(x)\,\varphi(y|x),$$

where $p(x)$ represents our prior state of knowledge about $X$, and $\varphi(y|x)$ encodes the likelihood. No interpretation beyond this bookkeeping role is assumed.

From $P$ we may compute the implied marginal distribution $P_Y(y)=\sum_x p(x)\,\varphi(y|x)$ and, by the product rule (I wouldn’t call this “Bayes’ theorem” yet), the inverse conditional distribution

$$\hat\varphi(x|y)=P(x|y)=\frac{p(x)\,\varphi(y|x)}{P_Y(y)}.$$
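To make the bookkeeping concrete, here is a minimal Python sketch of the three objects defined so far: the joint, the marginal of $Y$, and the inverse conditional. The numbers in `p` and `phi` are made up purely for illustration, and the helper names (`joint`, `marginal_y`, `inverse_conditional`) are my own. The sketch assumes $P_Y(y)>0$ for every $y$.

```python
# Hypothetical prior p(x) over x in {0, 1} (illustrative numbers only).
p = [0.6, 0.4]
# Hypothetical likelihood: phi[x][y] = phi(y|x), with y in {0, 1, 2}.
phi = [[0.7, 0.2, 0.1],
       [0.1, 0.3, 0.6]]

def joint(p, phi):
    """Joint P[x][y] = p(x) * phi(y|x)."""
    return [[p[x] * phi[x][y] for y in range(len(phi[x]))]
            for x in range(len(p))]

def marginal_y(P):
    """Marginal P_Y(y) = sum_x P(x, y)."""
    return [sum(P[x][y] for x in range(len(P))) for y in range(len(P[0]))]

def inverse_conditional(P):
    """phi_hat[y][x] = P(x|y) = P(x, y) / P_Y(y); assumes P_Y(y) > 0."""
    PY = marginal_y(P)
    return [[P[x][y] / PY[y] for x in range(len(P))]
            for y in range(len(PY))]

P = joint(p, phi)
PY = marginal_y(P)               # approximately [0.46, 0.24, 0.30]
phi_hat = inverse_conditional(P)

for y, row in enumerate(phi_hat):
    print(f"P(x | y={y}) =", [round(v, 4) for v in row])
```

Note that `phi_hat` is indexed as `phi_hat[y][x]`, so that each row is a probability distribution over $x$.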

Now suppose that new information becomes available which does not specify a particular value of $Y$, but instead constrains our revised beliefs about $Y$ to take the form of a probability distribution $\tau(y)$. The problem is then to determine what joint distribution $R(x,y)$ should represent our new state of knowledge, given that it must be consistent with $\tau$ and must not contain any information not logically implied by the prior $P$ together with this new constraint.

Principle of minimum change

Following the general principle that probabilities should be updated only to the extent required by new information, we select $R$ to minimize the relative entropy

$$D(R\|P)=\sum_{x,y}R(x,y)\log\frac{R(x,y)}{P(x,y)},$$

subject to the constraint $R_Y=\tau$. This criterion ensures that no unwarranted assumptions are introduced.

Solution

Any admissible $R$ may be written as

$$R(x,y)=\tau(y)\,R(x|y).$$

Substitution into the relative entropy yields the identity

$$\begin{aligned}
D(R\|P) &= \sum_{x,y}R(x,y)\log\frac{R(x,y)}{P(x,y)} \\
&= \sum_{x,y}\tau(y)\,R(x|y)\left[\log\frac{\tau(y)}{P_Y(y)}+\log\frac{R(x|y)}{\hat\varphi(x|y)}\right] \\
&= \sum_{x,y}\tau(y)\,R(x|y)\log\frac{\tau(y)}{P_Y(y)}+\sum_{x,y}\tau(y)\,R(x|y)\log\frac{R(x|y)}{\hat\varphi(x|y)} \\
&= D(\tau\|P_Y)+\sum_{x,y}\tau(y)\,R(x|y)\log\frac{R(x|y)}{\hat\varphi(x|y)} \\
&= D(\tau\|P_Y)+\sum_y\tau(y)\,D\big(R(\cdot|y)\,\|\,\hat\varphi(\cdot|y)\big).
\end{aligned}$$

(Apologies for the ugly formatting; I’m afraid this is a WordPress bug. Remember that a better-formatted version can be downloaded from this link.) The first term depends only on the revised marginal $\tau$ and the prior marginal $P_Y$, and is therefore fixed. The second term is a $\tau$-weighted sum of relative entropies, so it is nonnegative and vanishes if and only if

$$R(x|y)=\hat\varphi(x|y) \quad\text{for all }x,y.$$
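As a sanity check on the decomposition, the following Python sketch compares the two sides numerically for randomly generated joints. It is only an illustration under stated assumptions: both joints are drawn strictly positive so that every conditional is defined, $\tau$ is taken to be the $Y$-marginal of $R$, and all helper names are my own.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the check is reproducible

def kl(q, r):
    """Relative entropy D(q||r) for lists; assumes r > 0 wherever q > 0."""
    return sum(qi * math.log(qi / ri) for qi, ri in zip(q, r) if qi > 0)

def random_dist(n):
    """A random, strictly positive probability vector of length n."""
    w = [rng.random() + 0.01 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

nx, ny = 3, 4
# Joints stored flat: entry x * ny + y holds the probability of (x, y).
P = random_dist(nx * ny)
R = random_dist(nx * ny)

def marg_y(J):
    """Marginal over y of a flat joint J."""
    return [sum(J[x * ny + y] for x in range(nx)) for y in range(ny)]

def cond_x_given_y(J, y):
    """Conditional distribution of x given y under the flat joint J."""
    m = sum(J[x * ny + y] for x in range(nx))
    return [J[x * ny + y] / m for x in range(nx)]

PY = marg_y(P)
tau = marg_y(R)  # the revised marginal is, by definition, R's Y-marginal

lhs = kl(R, P)
first_term = kl(tau, PY)
second_term = sum(tau[y] * kl(cond_x_given_y(R, y), cond_x_given_y(P, y))
                  for y in range(ny))

print(lhs, first_term + second_term)  # the two numbers should agree
```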

Hence the unique distribution consistent with the stated constraints and the principle of minimum change is $R(x,y)=\tau(y)\,\hat\varphi(x|y)$, wherever $\tau(y)>0$. Note that if $\tau(y)>0$ but $P_Y(y)=0$ for some $y$, the new evidence falsifies our prior. This is signaled by the fact that, in this case, $D(\tau\|P_Y)=+\infty$.
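The result can also be checked numerically. The Python sketch below builds $R(x,y)=\tau(y)\,P(x|y)$ for a made-up prior joint and a made-up $\tau$, confirms that its relative entropy to the prior equals $D(\tau\|P_Y)$, and compares it against randomly generated joints with the same $Y$-marginal. It also shows the special case in which $\tau$ puts all its mass on a single value of $Y$, where the update reduces to ordinary conditioning on that value. All numbers and helper names are hypothetical, chosen only for illustration.

```python
import math
import random

def kl(q, r):
    """Relative entropy D(q||r) with 0 log 0 = 0; +inf if q > 0 where r = 0."""
    total = 0.0
    for qi, ri in zip(q, r):
        if qi > 0:
            if ri == 0:
                return math.inf
            total += qi * math.log(qi / ri)
    return total

def flatten(M):
    return [v for row in M for v in row]

def jeffrey_update(P, tau):
    """Return R[x][y] = tau(y) * P(x|y), where P[x][y] is the prior joint."""
    nx, ny = len(P), len(P[0])
    PY = [sum(P[x][y] for x in range(nx)) for y in range(ny)]
    R = [[0.0] * ny for _ in range(nx)]
    for y in range(ny):
        if tau[y] == 0:
            continue  # no mass on this y, so the conditional is irrelevant
        if PY[y] == 0:
            raise ValueError("tau puts mass where the prior marginal is zero")
        for x in range(nx):
            R[x][y] = tau[y] * P[x][y] / PY[y]
    return R

# Hypothetical prior joint P[x][y] and revised marginal tau (illustrative only).
P = [[0.42, 0.12, 0.06],
     [0.04, 0.12, 0.24]]
tau = [0.1, 0.3, 0.6]
PY = [sum(P[x][y] for x in range(2)) for y in range(3)]

R = jeffrey_update(P, tau)
d_min = kl(flatten(R), flatten(P))  # should equal D(tau || P_Y)

# Competitors: same Y-marginal tau, but random conditionals of x given y.
rng = random.Random(0)
competitors = []
for _ in range(200):
    S = [[0.0] * 3 for _ in range(2)]
    for y in range(3):
        w = [rng.random() + 1e-3 for _ in range(2)]
        for x in range(2):
            S[x][y] = tau[y] * w[x] / sum(w)
    competitors.append(kl(flatten(S), flatten(P)))

# Special case: tau concentrated on y = 2 gives plain conditioning on y = 2.
R_point = jeffrey_update(P, [0.0, 0.0, 1.0])

print(d_min, kl(tau, PY), min(competitors))
```

In this run every competitor should have relative entropy at least `d_min`, in line with the uniqueness argument above.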

Interpretation

The new information alters only our beliefs about $Y$; therefore, rational consistency requires that our conditional beliefs about $X$ given $Y$ remain exactly those implied by the prior. Any other choice would amount to smuggling in additional information not contained in the premises.
