I'm confused about this. (I had already read Jaynes' book and had this confusion, but this post is about the same topic so I decided to ask here.)
In this example, Mr. A has learned the average numbers of red, yellow, and green orders for some past days and wants to update his predictions of today's orders on this information. So he decides that the expected values of his distributions should be equal to those averages, and that he should find the distribution that makes the fewest assumptions, given those constraints. I at least agree that entropy is a good measure of how few assumptions your distribution makes. The point I'm confused about is how you get from "the average of this number in past observations is N" to "the expected value of our distribution for a future observation has to be N but we should put no other information in it".
First, it is not necessarily true that, after seeing results with some average value, your distribution for the next result will always have that same expected value. For example, if you are watching a repeated binary experiment and your prior is Laplace's Rule of Succession, your posterior expected value will be close to the observed average (your probability for a result will be close to that result's frequency), but not equal to it; and if you have a maximum entropy distribution, like the "poorly informed robot" in Chapter 9 of Jaynes that assigns probability 1/2^N to each possible sequence of N outcomes, your probability for each result will stay 1/2 no matter how much information you get!
Second, why are you even finding a distribution that is constrainedly optimal in the first place, rather than just taking your prior distribution over sequences of results and your observations, and using Bayes' Theorem to update your probabilities for future results? Even if you don't know anything other than the average value, you can still take your distribution over sequences of results, update it on this information (eliminating the possible outcome sequences that don't have this average value), and then find the distribution P(NextResult|AverageValue) by integrating P(NextResult|PastResults)P(PastResults|AverageValue) over the possible PastResults. This seems like the correct thing to do according to Bayesian probability theory, and it's very different from doing constrained optimization to find a distribution.
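In symbols, the computation I have in mind is

$$P(\text{NextResult} \mid \text{AverageValue}) = \sum_{\text{PastResults}} P(\text{NextResult} \mid \text{PastResults}) \, P(\text{PastResults} \mid \text{AverageValue})$$

summing (or integrating) over all possible past sequences.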
You could say that maximum entropy given constraints is easier than doing the full update and often works well in practice, but then why does it work?
Good questions, those are exactly the sorts of things which confused me when learning this stuff! And sometimes still do confuse me.
Even if you don't know anything other than the average value, you can still take your distribution over sequences of results, update it on this information (eliminating the possible outcome sequences that don't have this average value), and then find the distribution P(NextResult|AverageValue) by integrating P(NextResult|PastResults)P(PastResults|AverageValue) over the possible PastResults.
This part is the easiest to answer.
Suppose I am rolling a six-sided biased die (for which I don't know the bias) 100 times. Someone who does know the bias comes along and tells me that the expectation of the die roll is 2.0. Well, I do not actually expect that the sum of my 100 rolls, divided by 100, will be exactly 2.0.
I could do better by imagining that I will have infinitely many independent rolls, and then updating on that average being exactly 2.0 (in the limit). IIUC that should replicate the max relative entropy result (and might be a better way to argue for the max relative entropy method), but I have not checked that myself.
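As a quick illustration of what the max relative entropy result looks like here, this is a small numerical sketch (Python, assuming numpy and scipy are available; the names are just mine) of the maxent distribution over the six faces, relative to a uniform reference, subject to a mean of 2.0:

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)  # faces of the die

def mean_for(lam):
    # mean of the distribution P(k) proportional to exp(-lam * k)
    w = np.exp(-lam * faces)
    return (faces * w).sum() / w.sum()

# solve for lam so that the maxent distribution has mean 2.0
lam = brentq(lambda l: mean_for(l) - 2.0, 0.0, 10.0)
p = np.exp(-lam * faces)
p /= p.sum()
print(p)  # probabilities for faces 1..6, heavily weighted toward low faces
```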
First, it is not necessarily true that, after seeing results with some average value, your distribution for the next result will always have that same expected value.
Indeed. I don't have an easy rule for when it's appropriate to jump from "the average of this number in past observations is N" to "the expected value of our distribution for a future observation has to be N but we should put no other information in it".
I could do better by imagining that I will have infinitely many independent rolls, and then updating on that average being exactly 2.0 (in the limit). IIUC that should replicate the max relative entropy result (and might be a better way to argue for the max relative entropy method), but I have not checked that myself.
I had thought about something like that, but I'm not sure it actually works. My reasoning (which I expect might be close to yours, since I learned about this theorem in a post of yours) was that, by the entropy concentration theorem, in most outcome sequences satisfying a constraint the frequencies of individual results match the maximum entropy frequency distribution. I think this would in fact imply that if we had a set of results, were told the frequencies of those results satisfied that constraint, and then drew a random result out of that set, our probabilities for that result would follow the maximum entropy distribution, because it's very likely that the frequency distribution in the set of results is the maximum entropy distribution or close to it.
However, we are not actually drawing from a set of results that followed this constraint, we had a past set of results that followed it and are drawing a new result that isn't a member of this set. In order to say knowledge of these past results influences our beliefs about the next result, our probabilities for the past results and the next results have to be correlated. And it would be a really weird coincidence if our distribution had the next result be correlated to the past results, but the past results not be correlated to each other. So the past results probably are correlated, which breaks the assumption that all possible past sequences are equally likely!
IIUC there are two scenarios to be distinguished:
One is that the die has bias p unknown to you (you have some prior over p) and you use i.i.d. rolls to estimate the bias as usual & get the maxent distribution for a new draw. The draws are independent given p but not independent given your prior, so everything works out.
The other is that the die is literally i.i.d. under your prior. In this case everything from your argument goes through: whatever bias/constraint you happen to estimate from your outcome sequence doesn't say anything about a new i.i.d. draw because they're uncorrelated; the new draw is just another sample from your prior.
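To spell out the difference in symbols (using θ for the bias): the first state of knowledge is

$$P(x_1, \dots, x_n) = \int P(\theta) \prod_i P(x_i \mid \theta) \, d\theta$$

under which the draws are exchangeable but correlated, while the second is simply $P(x_1, \dots, x_n) = \prod_i P(x_i)$, under which observing past draws tells you nothing about the next one.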
I think that's a good way of phrasing it, except that I would emphasize that these are two different states of knowledge, not necessarily two different states of the world.
I didn't think it would work out to the maximum entropy distribution even in your first case, so I worked out an example to check:
Suppose we have a three-sided die that can land on 0, 1, or 2. Then suppose we are told the die was rolled several times, and the average value was 1.5. The maximum entropy distribution is (if my math is correct) probability 0.116 for 0, 0.268 for 1 and 0.616 for 2.
Now suppose we had a prior analogous to Laplace's Rule: two parameters, p_0 and p_1, for the "true probability" or "bias" of 0 and 1, and uniform probability density for all possible values of these parameters (the region where their sum is less than 1, which has area 1/2). Then as the number of cases goes to infinity, the probability each possible set of parameter values assigns to the average being 1.5 goes to 1 if that's their expected value, and to 0 otherwise. So we can condition on "the true values give an expected value of 1.5". We get probabilities of 0.125 for 0, 0.25 for 1 and 0.625 for 2.
That is not exactly equal to the maximum entropy distribution, but it's surprisingly close! Now I'm wondering if there's a different set of priors that gives the maximum entropy distribution exactly. I really should have worked out an actual numerical example sooner; I had previously thought of this example, assumed it would end up at different values than the maximum entropy distribution, and didn't carry it through to the end and notice that it actually ends up very close.
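For reference, here's a sketch of that check in code (Python, assuming numpy and scipy; the variable names are just for illustration):

```python
import numpy as np
from scipy.optimize import brentq

vals = np.array([0, 1, 2])  # faces of the three-sided die

# Maximum entropy distribution on {0,1,2} with mean 1.5: P(k) proportional to exp(lam*k)
def mean_for(lam):
    w = np.exp(lam * vals)
    return (vals * w).sum() / w.sum()

lam = brentq(lambda l: mean_for(l) - 1.5, -10, 10)
p_maxent = np.exp(lam * vals)
p_maxent /= p_maxent.sum()

# Uniform prior density over (p0, p1) on the simplex, conditioned on the expected
# value being 1.5, i.e. on the line 2*p0 + p1 = 0.5 with p0 in [0, 0.25].
# The conditional density is uniform along that line segment.
p0 = np.linspace(0, 0.25, 100001)
p_cond = np.array([p0.mean(), (0.5 - 2 * p0).mean(), 0.0])
p_cond[2] = 1 - p_cond[0] - p_cond[1]

print(p_maxent)  # approx [0.116, 0.268, 0.616]
print(p_cond)    # [0.125, 0.25, 0.625]
```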
In this example, Mr. A has learned the average numbers of red, yellow, and green orders for some past days and wants to update his predictions of today's orders on this information. So he decides that the expected values of his distributions should be equal to those averages, and that he should find the distribution that makes the fewest assumptions, given those constraints. I at least agree that entropy is a good measure of how few assumptions your distribution makes. The point I'm confused about is how you get from "the average of this number in past observations is N" to "the expected value of our distribution for a future observation has to be N but we should put no other information in it".
I agree that it's implausible that Mr A has enough data to be confident of the averages, but not enough data to draw any other conclusions. Such is often the case with math exercises. :shrug:
Second, why are you even finding a distribution that is constrainedly optimal in the first place, rather than just taking your prior distribution over sequences of results and your observations, and using Bayes' Theorem to update your probabilities for future results? Even if you don't know anything other than the average value, you can still take your distribution over sequences of results, update it on this information (eliminating the possible outcome sequences that don't have this average value), and then find the distribution P(NextResult|AverageValue) by integrating P(NextResult|PastResults)P(PastResults|AverageValue) over the possible PastResults. This seems like the correct thing to do according to Bayesian probability theory, and it's very different from doing constrained optimization to find a distribution.
In the example in the post, what would you say is the "prior distribution over sequences of results"? All Mr A has is a probability distribution over widget orders each day. If I were to naively turn that into a distribution over sequences of widget orders, the simplest option is to assume an independent draw from that distribution each day. But then Mr A is in the same situation as the "poorly informed robot".
The reason one can't use Bayes' Rule in this case is a type error. If Mr A had a prior probability distribution over probability distributions, P[P_i], then he could use Bayes' Rule to calculate a posterior over P[P_i], and then integrate: P_final = Sum_i P[P_i] P_i. But the problem with this is that the answer will depend on how you generalize from P[N_red, N_yellow, N_green] to P[P_i], and there isn't a unique way to do this.
In the example in the post, what would you say is the "prior distribution over sequences of results"?
I don't actually know.
If it's a binary experiment, like a "biased coin" that outputs either Heads or Tails, an appropriate distribution is Laplace's Rule of Succession (like I mentioned). Laplace's Rule has a parameter θ that is the "objective probability" of Heads, in the sense that if we know θ, our probability for each result being Heads is θ, independently. (I don't think it makes sense to think of θ as an actual probability, since it's not anybody's belief; I think a more correct interpretation is that it's the fraction of the space of possible initial states that ends up in Heads.)
Then the results are independent given the latent variable θ, but since we initially don't know θ, they're not actually independent; learning one result gives us information about θ, which we can use to infer things about the next result. This ends up giving more probability to sequences with almost all Heads or almost all Tails. (If, after seeing a Head, another Head becomes more probable, the sequence HH must necessarily have more probability than the sequence HT.)
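Concretely, with a uniform prior over θ, after seeing k Heads in n results the posterior predictive is

$$P(\text{Heads next} \mid k \text{ Heads in } n) = \frac{k+1}{n+2}$$

which is close to the observed frequency k/n but not equal to it.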
In this case our variable is the number of widgets, which has 100 possible values. How do you generalize Laplace's Rule to that? I don't know. You could do something exactly like Laplace's Rule with 100 different "bins" instead of 2, but that wouldn't actually capture all our intuitions; for example, after getting 34 widgets one day we'd say that getting 36 the next day is more likely than getting 77. If there's an actual distribution people use here, I'd be interested in learning about it.
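(For what it's worth, the "100 bins" version would be a uniform Dirichlet prior over the 100 outcome probabilities; after n past days, of which n_k had exactly k widgets, it predicts

$$P(k \text{ widgets tomorrow}) = \frac{n_k + 1}{n + 100}$$

which treats 36 and 77 exactly alike after a day with 34 widgets - hence the mismatch with intuition.)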
The problem I have is that with any distribution, we'd perform this process of taking the observed values, updating our distributions for our latent parameters conditional on them, and using the updated distributions to make more precise predictions for future values. This process is very different from assuming that a fact about the frequencies must also hold for our distribution, then finding the "least informative" distribution with that property. In the case of Laplace's Rule, our probability of Heads (and expected value of θ) end up pretty close to the observed frequency of Heads, but that's not a fundamental fact, it's derived from the assumptions. Which correspondences do you derive from which assumptions, in the widget case? That is what I'm confused about.
Nice, some connections with why maximum entropy distributions are so ubiquitous:
So the system converges to the maxent invariant distribution subject to the constraint, which is why Langevin dynamics converges to the Boltzmann distribution, and you can estimate the equilibrium energy by following the particle around.
In particular, we often use maxent to derive the prior itself (= the invariant measure), and when our system is out of equilibrium, we can then maximize relative entropy w.r.t. our maxent prior to update our distribution.
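To make the connection concrete (stated loosely): overdamped Langevin dynamics

$$dx = -\nabla E(x)\, dt + \sqrt{2T}\, dW$$

has the Boltzmann distribution $p(x) \propto e^{-E(x)/T}$ as its stationary distribution, and that Boltzmann distribution is exactly the maxent distribution subject to a constraint on the expected energy.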
Mr A manages a widget factory. The factory produces widgets of three colors - red, yellow, green - and part of Mr A’s job is to decide how many widgets to paint each color. He wants to match today’s color mix to the mix of orders the factory will receive today, so he needs to make predictions about how many of today’s orders will be for red vs yellow vs green widgets.
The factory will receive some unknown number of orders for each color throughout the day - red, yellow, and green orders. For simplicity, we will assume that Mr A starts out with a prior distribution under which:

- each of the three order counts, $N_{red}$, $N_{yellow}$, and $N_{green}$, is somewhere between 0 and 99
- all combinations of the three counts are equally likely, i.e. the prior is uniform over all possible outcomes
… and then Mr A starts to update that prior on evidence.
You’re familiar with Bayes’ Rule, so you already know how to update on some kinds of evidence. For instance, if Mr A gets a call from the sales department saying “We have at least 40 orders for green widgets today!”, you know how to plug that into Bayes’ Rule:

$$P[N_{red}, N_{yellow}, N_{green} \mid N_{green} \ge 40] = \frac{P[N_{green} \ge 40 \mid N_{red}, N_{yellow}, N_{green}] \; P[N_{red}, N_{yellow}, N_{green}]}{P[N_{green} \ge 40]}$$

… i.e. the posterior is still uniform, but with probability mass only on $N_{green} \ge 40$, and the normalization is different to reflect the narrower distribution.
But consider a different kind of evidence: Mr A goes through some past data, and concludes that the average number of red sales each day is 25, the average number of yellow sales is 50, and the average number of green sales is 5. So, Mr A would like to update on the information $E[N_{red}] = 25$, $E[N_{yellow}] = 50$, $E[N_{green}] = 5$.
Chew on that for a moment.
That’s… not a standard Bayes’ Rule-style update situation. The information doesn’t even have the right type for Bayes’ Rule. It’s not a logical sentence about the variables $(N_{red}, N_{yellow}, N_{green})$, it’s a logical statement about the distribution itself. It’s a claim about the expected values which will live in Mr A’s mind, not the widget orders which will live out in the world. It’s evidence which didn’t come from observing $(N_{red}, N_{yellow}, N_{green})$, but rather from observing some other stuff and then propagating information through Mr A’s head.
… but at the same time, it seems like a kind of intuitively reasonable type of update to want to make. And we’re Bayesians, we don’t want to update in some ad-hoc way which won’t robustly generalize, so… is there some principled, robustly generalizable way to handle this type of update? If the information doesn’t have the right type signature for Bayes’ Rule, how do we update on it?
Here’s a handwavy argument: we started with a uniform prior because we wanted to assume as little as possible about the order counts, in some sense. Likewise, when we update on those expected values, we should assume as little as possible about the order counts while still satisfying those expected values.
Now for the big claim: in order to “assume as little as possible” about a random variable, we should use the distribution with highest entropy.
Conceptually: the entropy tells us how many bits of information we expect to gain by observing the order counts. The less information we expect to gain by observing those counts, the more we must think we already know. A 50/50 coinflip has one bit of entropy; we learn one bit by observing it. A coinflip which we expect will come up heads with 100% chance has zero bits of entropy; we learn zero bits by observing it, because (we think) we already know the one bit which the coin flip nominally tells us. One less bit of expected information gain is one more bit which we implicitly think we already know. Conversely, one less bit which we think we already know means one more bit of entropy.
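Concretely, the entropy of a distribution $P$ (in bits) is

$$H[P] = -\sum_X P[X] \log_2 P[X]$$

For the 50/50 coin, $H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit; for the coin we’re certain will come up heads, $H = -1 \cdot \log_2 1 = 0$ bits.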
So, to assume as little as possible about what we already know… we should maximize our distribution’s entropy. We’ll maximize that entropy subject to constraints encoding the things we do want to assume we know - in this case, the expected values.
Spelled out in glorious mathematical detail, our update looks like this:

$$P[N_{red}, N_{yellow}, N_{green}] = \underset{P}{\operatorname{argmax}} \left( -\sum_{N_{red}, N_{yellow}, N_{green}} P[N_{red}, N_{yellow}, N_{green}] \log P[N_{red}, N_{yellow}, N_{green}] \right)$$

$$\text{subject to } E[N_{red}] = 25, \quad E[N_{yellow}] = 50, \quad E[N_{green}] = 5$$
(... as well as the implicit constraints $\sum_N P[N] = 1$ and $P[N] \ge 0$, which make sure that $P$ is a probability distribution. We usually won’t write those out, but one does need to include them when actually calculating $P$.)
Then we use the Standard Magic Formula for maxent distributions, which we’re not going to derive here because this is a concepts post. It says

$$P[N_{red}, N_{yellow}, N_{green}] = \frac{1}{Z} e^{-\lambda_{red} N_{red} - \lambda_{yellow} N_{yellow} - \lambda_{green} N_{green}}$$

… where the parameters $\lambda_{red}, \lambda_{yellow}, \lambda_{green}$ and $Z$ are chosen to match the expected value constraints and make the distribution sum to 1. (In this case, David's numerical check finds $Z \approx 17465.2$.)
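(As a rough sketch of how numbers like those can be computed - not necessarily David's exact method - here's one way to do it in Python, assuming numpy and scipy. Because each constraint involves only one color, the maxent distribution factors across colors, so each $\lambda$ can be solved for separately; note that the value of $Z$ depends on the sign convention chosen for the $\lambda$'s.)

```python
import numpy as np
from scipy.optimize import brentq

n = np.arange(100)  # possible order counts: 0..99

def mean_for(lam):
    # mean of P(N) proportional to exp(-lam * N) over N = 0..99
    w = np.exp(-lam * n)
    return (n * w).sum() / w.sum()

# one lambda per color, chosen to hit that color's expected-value constraint
targets = {"red": 25.0, "yellow": 50.0, "green": 5.0}
lams = {c: brentq(lambda l: mean_for(l) - m, -1.0, 5.0) for c, m in targets.items()}

# normalizer for the joint distribution (product of per-color normalizers)
Z = np.prod([np.exp(-lams[c] * n).sum() for c in targets])
print(lams, Z)
```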
We have a somewhat-handwavy story for why it makes sense to use this maxent machinery: the more information we expect to gain by observing a variable, the less we implicitly assume we already know about it. So, maximize expected information gain (i.e. minimize implicitly-assumed knowledge) subject to the constraints of whatever information we do think we know.
But to build confidence in that intuitive story, we should check that it does sane things in cases we already understand.
First, what does the maxent construction do when we don’t pass in any constraints? I.e. we don’t think we know anything relevant?
Well, it just gives the distribution with largest entropy over the outcomes, which turns out to be a uniform distribution. So in the case of our widgets problem, the maximum entropy construction with no constraints gives the same prior we specified up front, uniform over all outcomes.
Furthermore: what if the expected number of yellow orders, $E[N_{yellow}]$, were 49.5 - the same as under the prior - and we only use that constraint? Conceptually, that constraint by itself would not add any information not already implied by the prior. And indeed, the maxent distribution would be the same as the trivial case: uniform.
Now for a more interesting class of special cases. Suppose, as earlier, that Mr A gets a call from the sales department saying “We have at least 40 orders for green widgets today!” - i.e. $N_{green} \ge 40$. This is a case where Mr A can use Bayes’ Rule, as we all know and love. But he could use a maxent update instead… and if he does so, he’ll get the same answer as Bayes’ Rule.
Here’s how.
Let’s think about the variable $\mathbb{1}[N_{green} \ge 40]$ - i.e. it’s 1 if there are 40 or more green orders, 0 otherwise. What does it mean if I claim $E[\mathbb{1}[N_{green} \ge 40]] = 1$? Well, that expectation is 1 if and only if all of the probability mass is on $N_{green} \ge 40$. In other words, $E[\mathbb{1}[N_{green} \ge 40]] = 1$ is synonymous with $P[N_{green} \ge 40] = 1$ (under the distribution $P$).
So what happens when we find the maxent distribution subject to $E[\mathbb{1}[N_{green} \ge 40]] = 1$? Well, the Standard Magic Formula says

$$P[N_{red}, N_{yellow}, N_{green}] = \frac{1}{Z} e^{\lambda \mathbb{1}[N_{green} \ge 40]}$$

… where $\lambda$ and $Z$ are chosen to satisfy the constraints. In this case, we’ll need to take $\lambda$ to be (positive) infinitely large, and $Z$ to normalize it. In that limit, the probability will be 0 on $N_{green} < 40$, and uniform on $N_{green} \ge 40$ - exactly the same as the Bayes update.
This generalizes: the same construction, with the expectation of an indicator function, can always be used in the maxent framework to get the same answer as a Bayes update on a uniform distribution.
… but uniform distributions aren’t always the right starting point, which brings us to the next key piece.
Our trick above to replicate a Bayes update using maximum entropy machinery only works insofar as the prior is uniform. And that points to a more general problem with this whole maxent approach: intuitively, it doesn’t seem like a uniform prior should always be my “assume as little as possible” starting point.
A toy example of the sort of problem which comes up: suppose two people are studying rolls of the same standard six-sided die. One of them studies extreme outcomes, and only cares whether the die rolls 6 or not, so as a preprocessing step they bin all the rolls into 6 or not-6. The other keeps the raw data on the rolls. Now, if they both use a uniform distribution, they get different distributions: one of them assigns probability ½ to a roll of 6 (because 6 is one of the two preprocessed outcomes), the other assigns probability ⅙ to a roll of 6. Seems wrong! This maxent machine should have some kind of slot in it where we put in a distribution representing (in this case) how many things we binned together already. Or, more generally, a slot where we put in prior information which we want to take as already known/given, aside from the expectation constraints.
Enter relative entropy, the negative of KL divergence.
Relative entropy can be thought of as entropy relative to a reference distribution, which works like a prior.
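Concretely, for a reference distribution $Q$, the relative entropy of $P$ is

$$-\sum_N P[N] \log \frac{P[N]}{Q[N]}$$

i.e. the negative of the KL divergence $D_{KL}(P \| Q)$. When $Q$ is uniform over $M$ outcomes, this is just the ordinary entropy minus the constant $\log M$, so maximizing it reduces to ordinary maxent.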
In most cases, rather than maximizing entropy, it makes more sense to maximize relative entropy - i.e. minimize KL divergence - relative to some prior $Q$. (In the case of continuous variables, using relative entropy rather than entropy is an absolute necessity, for reasons we won’t get into here.)
The upshot: if we try to mimic a Bayes update in the maxent framework just like we did earlier, but we maximize entropy relative to a prior $Q$, we get the same result as a Bayes update - without needing to assume a uniform prior. Mathematically: writing $N$ for the triple $(N_{red}, N_{yellow}, N_{green})$ and $Q$ for the prior, let

$$P = \underset{P}{\operatorname{argmax}} \left( -\sum_N P[N] \log \frac{P[N]}{Q[N]} \right) \quad \text{subject to} \quad E_P[\mathbb{1}[N_{green} \ge 40]] = 1$$

That optimization problem will spit out the standard Bayes-updated distribution

$$P[N] = Q[N \mid N_{green} \ge 40] = \frac{\mathbb{1}[N_{green} \ge 40] \; Q[N]}{\sum_{N'} \mathbb{1}[N'_{green} \ge 40] \; Q[N']}.$$
… and that is the last big piece in how we think of maxent machinery as a generalization of Bayes updates.
The key pieces to remember are:

- To “assume as little as possible” beyond what we think we know, maximize entropy subject to constraints encoding the things we do think we know.
- In general, maximize entropy relative to a prior distribution (i.e. minimize KL divergence from the prior), rather than plain entropy.
- This machinery can handle evidence which doesn’t have the right type signature for Bayes’ Rule, like constraints on expected values.
… but in the cases which can be handled by Bayes’ Rule, updating via maxent yields the same answer.
You can find Jaynes’ original problem starting on page 440 of Probability Theory: The Logic Of Science. The version I present here is similar but not identical; I have modified it to remove conceptual distractions about unnormalizable priors and to get to the point of this post faster.
$\mathbb{1}[\cdot]$ is the indicator function; it’s 1 if its inputs are true and 0 if its inputs are false.