I'm confused about this. (I had already read Jaynes' book and had this confusion, but this post is about the same topic so I decided to ask here.)
In this example, Mr. A has learned the average numbers of red, yellow, and green orders for some past days and wants to update his predictions of today's orders on this information. So he decides that the expected values of his distributions should be equal to those averages, and that he should find the distribution that makes the fewest assumptions, given those constraints. I at least agree that entropy is a good measure of how few assumptions your distribution makes. The point I'm confused about is how you get from "the average of this number in past observations is N" to "the expected value of our distribution for a future observation has to be N but we should put no other information in it".
First, it is not necessarily true that, after seeing results with some average value, your distribution for the next result will always have that same expected value. For example, if you are watching a repeated binary experiment and your prior is Laplace's Rule of Succession, your posterior expected value will be close to the observed average (your probability for a result will be close to that result's frequency), but not equal to it; and if you have a maximum entropy distribution, like the "poorly informed robot" in Chapter 9 of Jaynes that assigns probability 1/2^N to each possible sequence of N outcomes, your probability for each result will stay 1/2 no matter how much information you get!
Second, why are you even finding a distribution that is constrainedly optimal in the first place, rather than just taking your prior distribution over sequences of results and your observations, and using Bayes' Theorem to update your probabilities for future results? Even if you don't know anything other than the average value, you can still take your distribution over sequences of results, update it on this information (eliminating the possible outcome sequences that don't have this average value), and then find the distribution P(NextResult|AverageValue) by integrating P(NextResult|PastResults)P(PastResults|AverageValue) over the possible PastResults. This seems like the correct thing to do according to Bayesian probability theory, and it's very different from doing constrained optimization to find a distribution.
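In symbols, the computation I have in mind is

$$P(\text{NextResult} \mid \text{AverageValue}) = \sum_{\text{PastResults}} P(\text{NextResult} \mid \text{PastResults}) \, P(\text{PastResults} \mid \text{AverageValue})$$

summing (or integrating) over all possible past sequences.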
You could say that maximum entropy given constraints is easier than doing the full update and often works well in practice, but then why does it work?
Good questions, those are exactly the sorts of things which confused me when learning this stuff! And sometimes still do confuse me.
Even if you don't know anything other than the average value, you can still take your distribution over sequences of results, update it on this information (eliminating the possible outcome sequences that don't have this average value), and then find the distribution P(NextResult|AverageValue) by integrating P(NextResult|PastResults)P(PastResults|AverageValue) over the possible PastResults.
This part is the easiest to answer.
Suppose I am rolling a six-sided biased die (for which I don't know the bias) 100 times. Someone who does know the bias comes along and tells me that the expectation of the die roll is 2.0. Well, I do not actually expect that the sum of my 100 rolls, divided by 100, will be exactly 2.0.
I could do better by imagining that I will have infinitely many independent rolls, and then updating on that average being exactly 2.0 (in the limit). IIUC that should replicate the max relative entropy result (and might be a better way to argue for the max relative entropy method), but I have not checked that myself.
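As a quick illustration of what the max relative entropy result looks like here, this is a small numerical sketch (Python, assuming numpy and scipy are available; the names are just mine) of the maxent distribution over the six faces, relative to a uniform reference, subject to a mean of 2.0:

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)  # faces of the die

def mean_for(lam):
    # mean of the distribution P(k) proportional to exp(-lam * k)
    w = np.exp(-lam * faces)
    return (faces * w).sum() / w.sum()

# solve for lam so that the maxent distribution has mean 2.0
lam = brentq(lambda l: mean_for(l) - 2.0, 0.0, 10.0)
p = np.exp(-lam * faces)
p /= p.sum()
print(p)  # probabilities for faces 1..6, heavily weighted toward low faces
```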
First, it is not necessarily true that, after seeing results with some average value, your distribution for the next result will always have that same expected value.
Indeed. I don't have an easy rule for when it's appropriate to jump from "the average of this number in past observations is N" to "the expected value of our distribution for a future observation has to be N but we should put no other information in it".
I could do better by imagining that I will have infinitely many independent rolls, and then updating on that average being exactly 2.0 (in the limit). IIUC that should replicate the max relative entropy result (and might be a better way to argue for the max relative entropy method), but I have not checked that myself.
I had thought about something like that, but I'm not sure it actually works. My reasoning (which I expect might be close to yours, since I learned about this theorem in a post of yours) was that, by the entropy concentration theorem, in most outcome sequences satisfying a constraint the frequencies of individual results match the maximum entropy frequency distribution. I think this would in fact imply that if we had a set of results, were told the frequencies of those results satisfied that constraint, and then drew a random result out of that set, our probabilities for that result would follow the maximum entropy distribution, because it's very likely that the frequency distribution in the set of results is the maximum entropy distribution or close to it.
However, we are not actually drawing from a set of results that followed this constraint, we had a past set of results that followed it and are drawing a new result that isn't a member of this set. In order to say knowledge of these past results influences our beliefs about the next result, our probabilities for the past results and the next results have to be correlated. And it would be a really weird coincidence if our distribution had the next result be correlated to the past results, but the past results not be correlated to each other. So the past results probably are correlated, which breaks the assumption that all possible past sequences are equally likely!
IIUC there are two scenarios to be distinguished:
One is that the die has bias p unknown to you (you have some prior over p) and you use i.i.d. rolls to estimate the bias as usual & get the maxent distribution for a new draw. The draws are independent given p but not independent given your prior, so everything works out.
The other is that the die is literally i.i.d. under your prior. In this case everything from your argument goes through: whatever bias/constraint you happen to estimate from your outcome sequence doesn't say anything about a new i.i.d. draw because they're uncorrelated; the new draw is just another sample from your prior.
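To spell out the difference in symbols (using θ for the bias): the first state of knowledge is

$$P(x_1, \dots, x_n) = \int P(\theta) \prod_i P(x_i \mid \theta) \, d\theta$$

under which the draws are exchangeable but correlated, while the second is simply $P(x_1, \dots, x_n) = \prod_i P(x_i)$, under which observing past draws tells you nothing about the next one.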
I think that's a good way of phrasing it, except that I would emphasize that these are two different states of knowledge, not necessarily two different states of the world.
I didn't think it would work out to the maximum entropy distribution even in your first case, so I worked out an example to check:
Suppose we have a three-sided die that can land on 0, 1, or 2. Then suppose we are told the die was rolled several times, and the average value was 1.5. The maximum entropy distribution is (if my math is correct) probability 0.116 for 0, 0.268 for 1 and 0.616 for 2.
Now suppose we had a prior analogous to Laplace's Rule: two parameters, p_0 and p_1, for the "true probability" or "bias" of 0 and 1, and uniform probability density for all possible values of these parameters (the region where their sum is less than 1, which has area 1/2). Then as the number of cases goes to infinity, the probability each possible set of parameter values assigns to the average being 1.5 goes to 1 if that's their expected value, and to 0 otherwise. So we can condition on "the true values give an expected value of 1.5". We get probabilities of 0.125 for 0, 0.25 for 1 and 0.625 for 2.
That is not exactly equal to the maximum entropy distribution, but it's surprisingly close! Now I'm wondering if there's a different set of priors that gives the maximum entropy distribution exactly. I really should have worked out an actual numerical example sooner; I had previously thought of this example, assumed it would end up at different values than the maximum entropy distribution, and didn't carry it through to the end and notice that it actually ends up very close.
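For reference, here's a sketch of that check in code (Python, assuming numpy and scipy; the variable names are just for illustration):

```python
import numpy as np
from scipy.optimize import brentq

vals = np.array([0, 1, 2])  # faces of the three-sided die

# Maximum entropy distribution on {0,1,2} with mean 1.5: P(k) proportional to exp(lam*k)
def mean_for(lam):
    w = np.exp(lam * vals)
    return (vals * w).sum() / w.sum()

lam = brentq(lambda l: mean_for(l) - 1.5, -10, 10)
p_maxent = np.exp(lam * vals)
p_maxent /= p_maxent.sum()

# Uniform prior density over (p0, p1) on the simplex, conditioned on the expected
# value being 1.5, i.e. on the line 2*p0 + p1 = 0.5 with p0 in [0, 0.25].
# The conditional density is uniform along that line segment.
p0 = np.linspace(0, 0.25, 100001)
p_cond = np.array([p0.mean(), (0.5 - 2 * p0).mean(), 0.0])
p_cond[2] = 1 - p_cond[0] - p_cond[1]

print(p_maxent)  # approx [0.116, 0.268, 0.616]
print(p_cond)    # [0.125, 0.25, 0.625]
```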
In this example, Mr. A has learned the average numbers of red, yellow, and green orders for some past days and wants to update his predictions of today's orders on this information. So he decides that the expected values of his distributions should be equal to those averages, and that he should find the distribution that makes the fewest assumptions, given those constraints. I at least agree that entropy is a good measure of how few assumptions your distribution makes. The point I'm confused about is how you get from "the average of this number in past observations is N" to "the expected value of our distribution for a future observation has to be N but we should put no other information in it".
I agree that it's implausible that Mr A has enough data to be confident of the averages, but not enough data to draw any other conclusions. Such is often the case with math exercises. :shrug:
Second, why are you even finding a distribution that is constrainedly optimal in the first place, rather than just taking your prior distribution over sequences of results and your observations, and using Bayes' Theorem to update your probabilities for future results? Even if you don't know anything other than the average value, you can still take your distribution over sequences of results, update it on this information (eliminating the possible outcome sequences that don't have this average value), and then find the distribution P(NextResult|AverageValue) by integrating P(NextResult|PastResults)P(PastResults|AverageValue) over the possible PastResults. This seems like the correct thing to do according to Bayesian probability theory, and it's very different from doing constrained optimization to find a distribution.
In the example in the post, what would you say is the "prior distribution over sequences of results"? All Mr A has is a probability distribution over widget orders each day. If I were to naively turn that into a distribution over sequences of widget orders, the simplest option is to assume an independent draw from that distribution each day. But then Mr A is in the same situation as the "poorly informed robot".
The reason one can't use Bayes' Rule in this case is a type error. If Mr A had a prior probability distribution over probability distributions, P[P_i], then he could use Bayes' Rule to calculate a posterior over P[P_i], and then integrate: P_final = Sum_i P[P_i] P_i. But the problem with this is that the answer will depend on how you generalize from P[N_red, N_yellow, N_green] to P[P_i], and there isn't a unique way to do this.
In the example in the post, what would you say is the "prior distribution over sequences of results"?
I don't actually know.
If it's a binary experiment, like a "biased coin" that outputs either Heads or Tails, an appropriate distribution is Laplace's Rule of Succession (like I mentioned). Laplace's Rule has a parameter θ that is the "objective probability" of Heads, in the sense that if we know θ, our probability for each result being Heads is θ, independently. (I don't think it makes sense to think of θ as an actual probability, since it's not anybody's belief; I think a more correct interpretation is that it's the fraction of the space of possible initial states that ends up in Heads.)
Then the results are independent given the latent variable θ, but since we initially don't know θ, they're not actually independent; learning one result gives us information about θ, which we can use to infer things about the next result. This ends up giving more probability to sequences with almost all Heads or almost all Tails. (If, after seeing a Head, another Head becomes more probable, the sequence HH must necessarily have more probability than the sequence HT.)
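Concretely, with a uniform prior over θ, after seeing k Heads in n results the posterior predictive is

$$P(\text{Heads next} \mid k \text{ Heads in } n) = \frac{k+1}{n+2}$$

which is close to the observed frequency k/n but not equal to it.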
In this case our variable is the number of widgets, which has 100 possible values. How do you generalize Laplace's Rule to that? I don't know. You could do something exactly like Laplace's Rule with 100 different "bins" instead of 2, but that wouldn't actually capture all our intuitions; for example, after getting 34 widgets one day we'd say that getting 36 the next day is more likely than getting 77. If there's an actual distribution people use here, I'd be interested in learning about it.
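(For what it's worth, the "100 bins" version would be a uniform Dirichlet prior over the 100 outcome probabilities; after n past days, of which n_k had exactly k widgets, it predicts

$$P(k \text{ widgets tomorrow}) = \frac{n_k + 1}{n + 100}$$

which treats 36 and 77 exactly alike after a day with 34 widgets - hence the mismatch with intuition.)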
The problem I have is that with any distribution, we'd perform this process of taking the observed values, updating our distributions for our latent parameters conditional on them, and using the updated distributions to make more precise predictions for future values. This process is very different from assuming that a fact about the frequencies must also hold for our distribution, then finding the "least informative" distribution with that property. In the case of Laplace's Rule, our probability of Heads (and expected value of θ) end up pretty close to the observed frequency of Heads, but that's not a fundamental fact, it's derived from the assumptions. Which correspondences do you derive from which assumptions, in the widget case? That is what I'm confused about.
Nice, some connections with why maximum entropy distributions are so ubiquitous:
So the system converges to the maxent invariant distribution subject to the constraint, which is why Langevin dynamics converges to the Boltzmann distribution, and you can estimate the equilibrium energy by following the particle around.
In particular, we often use maxent to derive the prior itself (= the invariant measure), and when our system is out of equilibrium, we can then maximize relative entropy w.r.t. our maxent prior to update our distribution.
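To make the connection concrete (stated loosely): overdamped Langevin dynamics

$$dx = -\nabla E(x)\, dt + \sqrt{2T}\, dW$$

has the Boltzmann distribution $p(x) \propto e^{-E(x)/T}$ as its stationary distribution, and that Boltzmann distribution is exactly the maxent distribution subject to a constraint on the expected energy.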
Mr A manages a widget factory. The factory produces widgets of three colors - red, yellow, green - and part of Mr A’s job is to decide how many widgets to paint each color. He wants to match today’s color mix to the mix of orders the factory will receive today, so he needs to make predictions about how many of today’s orders will be for red vs yellow vs green widgets.
The factory will receive some unknown number of orders for each color throughout the day - red, yellow, and green orders. For simplicity, we will assume that Mr A starts out with a prior distribution under which:

- each of the three order counts, $N_{red}$, $N_{yellow}$, and $N_{green}$, is somewhere between 0 and 99
- all combinations of the three counts are equally likely, i.e. the prior is uniform over all possible outcomes
… and then Mr A starts to update that prior on evidence.
You’re familiar with Bayes’ Rule, so you already know how to update on some kinds of evidence. For instance, if Mr A gets a call from the sales department saying “We have at least 40 orders for green widgets today!”, you know how to plug that into Bayes’ Rule:

$$P[N_{red}, N_{yellow}, N_{green} \mid N_{green} \ge 40] = \frac{P[N_{green} \ge 40 \mid N_{red}, N_{yellow}, N_{green}] \; P[N_{red}, N_{yellow}, N_{green}]}{P[N_{green} \ge 40]}$$

… i.e. the posterior is still uniform, but with probability mass only on $N_{green} \ge 40$, and the normalization is different to reflect the narrower distribution.
But consider a different kind of evidence: Mr A goes through some past data, and concludes that the average number of red sales each day is 25, the average number of yellow sales is 50, and the average number of green sales is 5. So, Mr A would like to update on the information $E[N_{red}] = 25$, $E[N_{yellow}] = 50$, $E[N_{green}] = 5$.
Chew on that for a moment.
That’s… not a standard Bayes’ Rule-style update situation. The information doesn’t even have the right type for Bayes’ Rule. It’s not a logical sentence about the variables $(N_{red}, N_{yellow}, N_{green})$, it’s a logical statement about the distribution itself. It’s a claim about the expected values which will live in Mr A’s mind, not the widget orders which will live out in the world. It’s evidence which didn’t come from observing $(N_{red}, N_{yellow}, N_{green})$, but rather from observing some other stuff and then propagating information through Mr A’s head.
… but at the same time, it seems like a kind of intuitively reasonable type of update to want to make. And we’re Bayesians, we don’t want to update in some ad-hoc way which won’t robustly generalize, so… is there some principled, robustly generalizable way to handle this type of update? If the information doesn’t have the right type signature for Bayes’ Rule, how do we update on it?
Here’s a handwavy argument: we started with a uniform prior because we wanted to assume as little as possible about the order counts, in some sense. Likewise, when we update on those expected values, we should assume as little as possible about the order counts while still satisfying those expected values.
Now for the big claim: in order to “assume as little as possible” about a random variable, we should use the distribution with highest entropy.
Conceptually: the entropy tells us how many bits of information we expect to gain by observing the order counts. The less information we expect to gain by observing those counts, the more we must think we already know. A 50/50 coinflip has one bit of entropy; we learn one bit by observing it. A coinflip which we expect will come up heads with 100% chance has zero bits of entropy; we learn zero bits by observing it, because (we think) we already know the one bit which the coin flip nominally tells us. One less bit of expected information gain is one more bit which we implicitly think we already know. Conversely, one less bit which we think we already know means one more bit of entropy.
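Concretely, the entropy of a distribution $P$ (in bits) is

$$H[P] = -\sum_X P[X] \log_2 P[X]$$

For the 50/50 coin, $H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit; for the coin we’re certain will come up heads, $H = -1 \cdot \log_2 1 = 0$ bits.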
So, to assume as little as possible about what we already know… we should maximize our distribution’s entropy. We’ll maximize that entropy subject to constraints encoding the things we do want to assume we know - in this case, the expected values.
Spelled out in glorious mathematical detail, our update looks like this:

$$P[N_{red}, N_{yellow}, N_{green}] = \underset{P}{\operatorname{argmax}} \left( -\sum_{N_{red}, N_{yellow}, N_{green}} P[N_{red}, N_{yellow}, N_{green}] \log P[N_{red}, N_{yellow}, N_{green}] \right)$$

$$\text{subject to } E[N_{red}] = 25, \quad E[N_{yellow}] = 50, \quad E[N_{green}] = 5$$
(... as well as the implicit constraints $\sum_N P[N] = 1$ and $P[N] \ge 0$, which make sure that $P$ is a probability distribution. We usually won’t write those out, but one does need to include them when actually calculating $P$.)
Then we use the Standard Magic Formula for maxent distributions, which we’re not going to derive here because this is a concepts post. It says

$$P[N_{red}, N_{yellow}, N_{green}] = \frac{1}{Z} e^{-\lambda_{red} N_{red} - \lambda_{yellow} N_{yellow} - \lambda_{green} N_{green}}$$

… where the parameters $\lambda_{red}, \lambda_{yellow}, \lambda_{green}$ and $Z$ are chosen to match the expected value constraints and make the distribution sum to 1. (In this case, David's numerical check finds $Z \approx 17465.2$.)
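(As a rough sketch of how numbers like those can be computed - not necessarily David's exact method - here's one way to do it in Python, assuming numpy and scipy. Because each constraint involves only one color, the maxent distribution factors across colors, so each $\lambda$ can be solved for separately; note that the value of $Z$ depends on the sign convention chosen for the $\lambda$'s.)

```python
import numpy as np
from scipy.optimize import brentq

n = np.arange(100)  # possible order counts: 0..99

def mean_for(lam):
    # mean of P(N) proportional to exp(-lam * N) over N = 0..99
    w = np.exp(-lam * n)
    return (n * w).sum() / w.sum()

# one lambda per color, chosen to hit that color's expected-value constraint
targets = {"red": 25.0, "yellow": 50.0, "green": 5.0}
lams = {c: brentq(lambda l: mean_for(l) - m, -1.0, 5.0) for c, m in targets.items()}

# normalizer for the joint distribution (product of per-color normalizers)
Z = np.prod([np.exp(-lams[c] * n).sum() for c in targets])
print(lams, Z)
```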
We have a somewhat-handwavy story for why it makes sense to use this maxent machinery: the more information we expect to gain by observing a variable, the less we implicitly assume we already know about it. So, maximize expected information gain (i.e. minimize implicitly-assumed knowledge) subject to the constraints of whatever information we do think we know.
But to build confidence in that intuitive story, we should check that it does sane things in cases we already understand.
First, what does the maxent construction do when we don’t pass in any constraints? I.e. we don’t think we know anything relevant?
Well, it just gives the distribution with largest entropy over the outcomes, which turns out to be a uniform distribution. So in the case of our widgets problem, the maximum entropy construction with no constraints gives the same prior we specified up front, uniform over all outcomes.
Furthermore: what if the expected number of yellow orders, $E[N_{yellow}]$, were 49.5 - the same as under the prior - and we only use that constraint? Conceptually, that constraint by itself would not add any information not already implied by the prior. And indeed, the maxent distribution would be the same as the trivial case: uniform.
Now for a more interesting class of special cases. Suppose, as earlier, that Mr A gets a call from the sales department saying “We have at least 40 orders for green widgets today!” - i.e. $N_{green} \ge 40$. This is a case where Mr A can use Bayes’ Rule, as we all know and love. But he could use a maxent update instead… and if he does so, he’ll get the same answer as Bayes’ Rule.
Here’s how.
Let’s think about the variable $\mathbb{1}[N_{green} \ge 40]$ - i.e. it’s 1 if there are 40 or more green orders, 0 otherwise. What does it mean if I claim $E[\mathbb{1}[N_{green} \ge 40]] = 1$? Well, that expectation is 1 if and only if all of the probability mass is on $N_{green} \ge 40$. In other words, $E[\mathbb{1}[N_{green} \ge 40]] = 1$ is synonymous with $P[N_{green} \ge 40] = 1$ (under the distribution $P$).
So what happens when we find the maxent distribution subject to $E[\mathbb{1}[N_{green} \ge 40]] = 1$? Well, the Standard Magic Formula says

$$P[N_{red}, N_{yellow}, N_{green}] = \frac{1}{Z} e^{\lambda \mathbb{1}[N_{green} \ge 40]}$$

… where $\lambda$ and $Z$ are chosen to satisfy the constraints. In this case, we’ll need to take $\lambda$ to be (positive) infinitely large, and $Z$ to normalize it. In that limit, the probability will be 0 on $N_{green} < 40$, and uniform on $N_{green} \ge 40$ - exactly the same as the Bayes update.
This generalizes: the same construction, with the expectation of an indicator function, can always be used in the maxent framework to get the same answer as a Bayes update on a uniform distribution.
… but uniform distributions aren’t always the right starting point, which brings us to the next key piece.
Our trick above to replicate a Bayes update using maximum entropy machinery only works insofar as the prior is uniform. And that points to a more general problem with this whole maxent approach: intuitively, it doesn’t seem like a uniform prior should always be my “assume as little as possible” starting point.
A toy example of the sort of problem which comes up: suppose two people are studying rolls of the same standard six-sided die. One of them studies extreme outcomes, and only cares whether the die rolls 6 or not, so as a preprocessing step they bin all the rolls into 6 or not-6. The other keeps the raw data on the rolls. Now, if they both use a uniform distribution, they get different distributions: one of them assigns probability ½ to a roll of 6 (because 6 is one of the two preprocessed outcomes), the other assigns probability ⅙ to a roll of 6. Seems wrong! This maxent machine should have some kind of slot in it where we put in a distribution representing (in this case) how many things we binned together already. Or, more generally, a slot where we put in prior information which we want to take as already known/given, aside from the expectation constraints.
Enter relative entropy, the negative of KL divergence.
Relative entropy can be thought of as entropy relative to a reference distribution, which works like a prior.
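Concretely, for a reference distribution $Q$, the relative entropy of $P$ is

$$-\sum_N P[N] \log \frac{P[N]}{Q[N]}$$

i.e. the negative of the KL divergence $D_{KL}(P \| Q)$. When $Q$ is uniform over $M$ outcomes, this is just the ordinary entropy minus the constant $\log M$, so maximizing it reduces to ordinary maxent.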
In most cases, rather than maximizing entropy, it makes more sense to maximize relative entropy - i.e. minimize KL divergence - relative to some prior $Q$. (In the case of continuous variables, using relative entropy rather than entropy is an absolute necessity, for reasons we won’t get into here.)
The upshot: if we try to mimic a Bayes update in the maxent framework just like we did earlier, but we maximize entropy relative to a prior $Q$, we get the same result as a Bayes update - without needing to assume a uniform prior. Mathematically: writing $N$ for the triple $(N_{red}, N_{yellow}, N_{green})$ and $Q$ for the prior, let

$$P = \underset{P}{\operatorname{argmax}} \left( -\sum_N P[N] \log \frac{P[N]}{Q[N]} \right) \quad \text{subject to} \quad E_P[\mathbb{1}[N_{green} \ge 40]] = 1$$

That optimization problem will spit out the standard Bayes-updated distribution

$$P[N] = Q[N \mid N_{green} \ge 40] = \frac{\mathbb{1}[N_{green} \ge 40] \; Q[N]}{\sum_{N'} \mathbb{1}[N'_{green} \ge 40] \; Q[N']}.$$
… and that is the last big piece in how we think of maxent machinery as a generalization of Bayes updates.
The key pieces to remember are:

- To “assume as little as possible” beyond what we think we know, maximize entropy subject to constraints encoding the things we do think we know.
- In general, maximize entropy relative to a prior distribution (i.e. minimize KL divergence from the prior), rather than plain entropy.
- This machinery can handle evidence which doesn’t have the right type signature for Bayes’ Rule, like constraints on expected values.
… but in the cases which can be handled by Bayes’ Rule, updating via maxent yields the same answer.
You can find Jaynes’ original problem starting on page 440 of Probability Theory: The Logic Of Science. The version I present here is similar but not identical; I have modified it to remove conceptual distractions about unnormalizable priors and to get to the point of this post faster.
$\mathbb{1}[\cdot]$ is the indicator function; it’s 1 if its inputs are true and 0 if its inputs are false.