> I could do better by imagining that I will have infinitely many independent rolls, and then updating on that average being exactly 2.0 (in the limit). IIUC that should replicate the max relative entropy result (and might be a better way to argue for the max relative entropy method), but I have not checked that myself.
I had thought about something like that, but I'm not sure it actually works. My reasoning (which I expect might be close to yours, since I learned about this theorem in a post of yours) was that, by the entropy concentration theorem, for most outcome sequences satisfying a constraint, the frequencies of the individual results approximately match the maximum entropy frequency distribution. I think this would imply that if we had a set of results, we were told the frequencies of those results satisfied that constraint, and then we drew a random result out of that set, our probabilities for that result would follow the maximum entropy distribution, because it's very likely that the frequency distribution in the set of results is the maximum entropy distribution or close to it.
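Here's a small numerical sanity check of that claim (the setup is hypothetical: a die with faces 1–6 and the constraint "sample mean = 2.0", which may or may not match the post's exact example). It enumerates all frequency vectors compatible with the constraint, counts how many length-N sequences each one accounts for, and compares the most common one to the maximum entropy distribution with the same mean:

```python
from itertools import product
from math import factorial

N = 15                     # number of rolls (kept small so brute force is feasible)
faces = list(range(1, 7))  # die faces 1..6 (an assumption about the setup)
target_sum = 2 * N         # "sample mean = 2.0"

def multinomial(counts):
    """Number of length-N sequences with exactly these face counts."""
    out = factorial(sum(counts))
    for c in counts:
        out //= factorial(c)
    return out

# Enumerate every frequency vector (n1, ..., n6) consistent with the constraint.
weights = {}
for rest in product(range(N + 1), repeat=5):  # counts for faces 2..6
    n1 = N - sum(rest)
    if n1 < 0:
        continue
    counts = (n1,) + rest
    if sum(f * c for f, c in zip(faces, counts)) != target_sum:
        continue
    weights[counts] = multinomial(counts)

top = max(weights, key=weights.get)
total = sum(weights.values())
print("most common frequency vector:", [c / N for c in top])
print("fraction of constrained sequences it accounts for:", weights[top] / total)

# Maximum entropy distribution with the same mean, for comparison:
# p_k proportional to x**k, with x found by bisection so that the mean is 2.0.
lo, hi = 1e-9, 1.0
for _ in range(100):
    x = (lo + hi) / 2
    ps = [x**k for k in faces]
    mean = sum(k * p for k, p in zip(faces, ps)) / sum(ps)
    lo, hi = (x, hi) if mean < 2.0 else (lo, x)
ps = [x**k for k in faces]
print("maximum entropy frequencies:", [p / sum(ps) for p in ps])
```

The most common frequency vector comes out close to the maximum entropy frequencies, and the concentration gets sharper as N grows.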
However, we are not actually drawing from a set of results that satisfied this constraint; we had a past set of results that satisfied it, and we are drawing a new result that isn't a member of that set. In order to say that knowledge of these past results influences our beliefs about the next result, our probabilities for the past results and the next result have to be correlated. And it would be a really weird coincidence if our distribution had the next result correlated with the past results, but the past results not correlated with each other. So the past results probably are correlated, which breaks the assumption that all possible past sequences are equally likely!
> In the example in the post, what would you say is the "prior distribution over sequences of results"?
I don't actually know.
If it's a binary experiment, like a "biased coin" that outputs either Heads or Tails, an appropriate distribution is Laplace's Rule of Succession (like I mentioned). Laplace's Rule has a parameter θ that is the "objective probability" of Heads, in the sense that if we know θ, our probability for each result being Heads is θ, independently. (I don't think it makes sense to think of θ as an actual probability, since it's not anybody's belief; I think a more correct interpretation of it is the fraction of the space of possible initial states that ends up in Heads.)
Then the results are independent given the latent variable θ, but since we initially don't know θ, they're not actually independent: learning one result gives us information about θ, which we can use to infer things about the next result. This ends up giving more probability to the sequences with almost all Heads or almost all Tails. (If, after seeing a Head, another Head becomes more probable, the sequence HH must necessarily have more probability than the sequence HT.)
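For concreteness, with the uniform prior over θ that Laplace's Rule uses, the first few sequence probabilities work out to

$$P(\mathrm{H}) = \int_0^1 \theta \, d\theta = \frac{1}{2}, \qquad P(\mathrm{HH}) = \int_0^1 \theta^2 \, d\theta = \frac{1}{3}, \qquad P(\mathrm{HT}) = \int_0^1 \theta(1-\theta) \, d\theta = \frac{1}{6},$$

so $P(\mathrm{H} \mid \mathrm{H}) = P(\mathrm{HH})/P(\mathrm{H}) = 2/3 > 1/2$: HH gets twice the probability of HT, even though a known-fair coin would give each of the four length-2 sequences probability 1/4.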
In this case our variable is the number of widgets, which has 100 possible values. How do you generalize Laplace's Rule to that? I don't know. You could do something exactly like Laplace's Rule with 100 different "bins" instead of 2, but that wouldn't actually capture all our intuitions: for example, after getting 34 widgets one day, we'd say getting 36 the next day is more likely than getting 77. If there's an actual distribution people use here, I'd be interested in learning about it.
The problem I have is that, with any such distribution, we'd perform this process of taking the observed values, updating our distributions for the latent parameters conditional on them, and using the updated distributions to make more precise predictions for future values. This process is very different from assuming that a fact about the frequencies must also hold for our distribution and then finding the "least informative" distribution with that property. In the case of Laplace's Rule, our probability of Heads (and our expected value of θ) ends up pretty close to the observed frequency of Heads, but that's not a fundamental fact; it's derived from the assumptions. Which correspondences do you derive from which assumptions, in the widget case? That is what I'm confused about.
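(To spell out the Laplace case: with the uniform prior on θ, after observing h Heads in n tosses,

$$P(\text{Heads next} \mid h \text{ Heads in } n \text{ tosses}) = E[\theta \mid h \text{ Heads in } n \text{ tosses}] = \frac{h+1}{n+2},$$

which is close to, but generally not equal to, the observed frequency h/n; the closeness is a consequence of the uniform prior, not something assumed up front.)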
> My guess that utopia happens and its inhabitants succeed in “bringing back” people preserved using cryonics is around 10%.
If you're saying this because the probability that utopia happens is low, that's unfair; if there's no utopia, none of the other interventions work either, so you should also be multiplying their micromort amounts by 10%.
If you instead think that "bringing back people is possible even in principle" has a probability of 10%, that makes more sense but I think that's way, way too low.
I'm confused about this. (I had already read Jaynes' book and had this confusion, but this post is about the same topic so I decided to ask here.)
In this example, Mr. A has learned the average numbers of red, yellow, and green orders for some past days and wants to update his predictions of today's orders on this information. So he decides that the expected values of his distributions should be equal to those averages, and that he should find the distribution that makes the fewest assumptions given those constraints. I at least agree that entropy is a good measure of how few assumptions your distribution makes. The point I'm confused about is how you get from "the average of this number in past observations is N" to "the expected value of our distribution for a future observation has to be N, but we should put no other information in it".
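As I understand the prescription, it's the standard constrained maximization (written here for a single mean constraint; the example has one such constraint per color):

$$\max_{\{p_i\}} \Bigl( -\sum_i p_i \log p_i \Bigr) \quad \text{subject to} \quad \sum_i p_i = 1 \;\text{ and }\; \sum_i p_i x_i = N,$$

where the $x_i$ are the possible values, and the solution has the exponential form $p_i \propto e^{-\lambda x_i}$, with $\lambda$ chosen to satisfy the mean constraint.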
First, it is not necessarily true that, after seeing results with some average value, your distribution for the next result will have that same expected value. For example, if you are watching a repeated binary experiment and your prior is Laplace's Rule of Succession, your posterior expected value will be close to the average value (your probability for a result will be close to that result's frequency), but not equal to it; and if you have a maximum entropy distribution, like the "poorly informed robot" in Chapter 9 of Jaynes that assigns probability 1/2^N to each possible sequence of N outcomes, your probability for each result will keep being 1/2 no matter how much information you get!
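(To spell out why the robot can't learn: under its prior, every particular length-(N+1) sequence has probability $2^{-(N+1)}$ and every particular length-N sequence has probability $2^{-N}$, so

$$P(\text{next} = \mathrm{H} \mid \text{past results}) = \frac{P(\text{past results followed by } \mathrm{H})}{P(\text{past results})} = \frac{2^{-(N+1)}}{2^{-N}} = \frac{1}{2}$$

no matter what the past results were.)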
Second, why are you even finding a distribution that is constrainedly optimal in the first place, rather than just taking your prior distribution over sequences of results and your observations, and using Bayes' Theorem to update your probabilities for future results? Even if you don't know anything other than the average value, you can still take your distribution over sequences of results, update it on this information (eliminating the possible outcome sequences that don't have this average value), and then find the distribution P(NextResult|AverageValue) by integrating P(NextResult|PastResults)P(PastResults|AverageValue) over the possible PastResults. This seems like the correct thing to do according to Bayesian probability theory, and it's very different from doing constrained optimization to find a distribution.
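Concretely, here is a minimal sketch of the kind of computation I mean, for a hypothetical three-valued observation and a small number of past observations; the prior over sequences is just an illustrative choice (i.i.d. given latent category probabilities, with a uniform prior over those probabilities), not something the post commits to:

```python
from itertools import product
from math import factorial

# Hypothetical setup: each observation is 0, 1, or 2; we saw N past observations
# and are told only that their average was 1.5.
N = 6
target_mean = 1.5

# Illustrative exchangeable prior over sequences: i.i.d. given latent probabilities
# (p0, p1, p2), with a uniform (Dirichlet(1,1,1)) prior over them. Standard
# Dirichlet-multinomial formulas then give, for a sequence with counts n0, n1, n2:
#   P(sequence)            = 2 * n0! * n1! * n2! / (N + 2)!
#   P(next = k | sequence) = (n_k + 1) / (N + 3)
posterior = [0.0, 0.0, 0.0]
normalizer = 0.0
for past in product((0, 1, 2), repeat=N):
    if sum(past) != target_mean * N:
        continue  # keep only the past sequences consistent with the reported average
    counts = [past.count(v) for v in (0, 1, 2)]
    p_past = 2 * factorial(counts[0]) * factorial(counts[1]) * factorial(counts[2]) / factorial(N + 2)
    normalizer += p_past
    for k in (0, 1, 2):
        posterior[k] += p_past * (counts[k] + 1) / (N + 3)

# P(NextResult | AverageValue), obtained by summing
# P(NextResult | PastResults) * P(PastResults | AverageValue) over the surviving sequences.
print([p / normalizer for p in posterior])
```

The numbers that come out depend on the prior and on N; the point is only that this is a mechanical Bayesian computation, with no constrained optimization anywhere in it.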
You could say that maximum entropy given constraints is easier than doing the full update and often works well in practice, but then why does it work?
When I was reading it I had the impression that the reactions of the people of Omelas to the child were meant to reference the readers' own rationalizations of suffering, in real life as well as fiction, especially in this paragraph:
> But as time goes on they begin to realize that even if the child could be released, it would not get much good of its freedom: a little vague pleasure of warmth and food, no doubt, but little more. It is too degraded and imbecile to know any real joy. It has been afraid too long ever to be free of fear. Its habits are too uncouth for it to respond to humane treatment. Indeed, after so long it would probably be wretched without walls about it to protect it, and darkness for its eyes, and its own excrement to sit in. Their tears at the bitter injustice dry when they begin to perceive the terrible justice of reality, and to accept it. Yet it is their tears and anger, the trying of their generosity and the acceptance of their helplessness, which are perhaps the true source of the splendor of their lives. Theirs is no vapid, irresponsible happiness. They know that they, like the child, are not free. They know compassion. It is the existence of the child, and their knowledge of its existence, that makes possible the nobility of their architecture, the poignancy of their music, the profundity of their science. It is because of the child that they are so gentle with children. They know that if the wretched one were not there snivelling in the dark, the other one, the flute-player, could make no joyful music as the young riders line up in their beauty for the race in the sunlight of the first morning of summer.
The child wouldn't even like being released anyway; its suffering is a fundamental part of reality; if the child didn't exist, we couldn't really be happy. That sounds like a big pile of rationalizations to me! The people of Omelas start out knowing that the child's suffering is wrong, aren't able to do anything about it, and then slowly come up with rationalizations until they can accept it.
So the ones who walk away are the ones who refuse to rationalize. This could even imply that they are nothing more than the ones who refuse to rationalize: that the "walking away" represents rejecting the idea that Omelas is truly happy and resolving to build a better version of it. Or maybe I'm just imagining things, and this is not even close to the intended meaning.
In the FDT paper there is this footnote:
> In the authors’ preferred formalization of FDT, agents actually iterate over policies (mappings from observations to actions) rather than actions. This makes a difference in certain multi-agent dilemmas, but will not make a difference in this paper.
And it does seem that using FDT, but as a function that returns a policy rather than an action, solves this problem. So this is not an intrinsic problem with FDT that UDT doesn't have, it's a problem that arises in simpler versions of both theories and can be solved in both with the same modification.
This problem doesn't seem to be about trust at all; it seems to be about incomplete sharing of information. It seems weird to me to say Carla doesn't completely trust Bob's account if she is 100% sure he isn't lying.
> The sensitivity of the test - that aliens actually abduct people, given someone is telling her aliens abducted him - is 2.5% since she doesn't really know his drug habits and hasn't ruled out there's a LARP she's missing the context for.
I would describe this not as Carla not trusting Bob, but as her not having all of Bob's information: Bob could just tell her that he doesn't use drugs, or that he isn't referring to a LARP, or any of the other things he knows about himself that Carla doesn't and that are keeping her sensitivity lower, until their probabilities are the same. And, of course, if this process ends with Carla having the same probabilities as Bob, and Carla does the same with Dean, he will have the same probabilities as Bob as well.
> I think this satisfies Aumann's Agreement Theorem.
Well, if it does then Bob and Carla definitely have the same probabilities; that's what the problem says, after all.
> I had the same confusion when I first heard those names. It's called little-endian because "you start with the little end", and the term comes from an analogy to Gulliver's Travels.
You're mixing up big-endian and little-endian. Big-endian is the notation used in English: twelve is 12 in big-endian and 21 in little-endian. But yes, 123.456 in big-endian would be 654.321 in little-endian, and with a decimal point you couldn't parse little-endian numbers in the way described by lsusr.
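A toy illustration in code (purely about digit strings in the notation being discussed, nothing to do with machine byte order):

```python
def big_to_little_endian(digits: str) -> str:
    """Rewrite a number's digits from big-endian (most significant digit first,
    as in English) to little-endian (least significant digit first)."""
    return digits[::-1]

assert big_to_little_endian("12") == "21"            # twelve
assert big_to_little_endian("123.456") == "654.321"  # the decimal point flips position too
```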
I think that's a good way of phrasing it, except that I would emphasize that these are two different states of knowledge, not necessarily two different states of the world.
I didn't think it would work out to the maximum entropy distribution even in your first case, so I worked out an example to check:
Suppose we have a three-sided die that can land on 0, 1, or 2. Then suppose we are told the die was rolled several times and the average value was 1.5. The maximum entropy distribution is (if my math is correct) probability 0.116 for 0, 0.268 for 1, and 0.616 for 2.
Now suppose we had a prior analogous to Laplace's Rule: two parameters p0 and p1 for the "true probability" or "bias" of 0 and 1, with a uniform probability density of 2 over all possible values of these parameters (the region where p0 and p1 are nonnegative and their sum is at most 1, which has area 1/2). Then as the number of rolls goes to infinity, the probability each possible set of parameter values assigns to the average being 1.5 goes to 1 if 1.5 is their expected value, and to 0 otherwise. So we can condition on "the true values give an expected value of 1.5". We get probabilities of 0.125 for 0, 0.25 for 1, and 0.625 for 2.
That is not exactly equal to the maximum entropy distribution, but it's surprisingly close! Now I'm wondering if there's a different prior that gives the maximum entropy distribution exactly. I really should have worked out an actual numerical example sooner; I had previously thought of this example, assumed it would end up at values quite different from the maximum entropy distribution, and didn't follow it through far enough to notice that it actually ends up very close to it.
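For anyone who wants to check the arithmetic, here's a short script reproducing both sets of numbers (the second part just averages over the line segment that the conditioning picks out, treating the conditional along it as uniform, as in the calculation above):

```python
import numpy as np

# Maximum entropy distribution on faces {0, 1, 2} with mean 1.5:
# p_k proportional to x**k, and the mean constraint reduces to x**2 - x - 3 = 0.
x = (1 + np.sqrt(13)) / 2
p = np.array([1, x, x**2])
p /= p.sum()
print(p)                        # ~[0.116, 0.268, 0.616]
print(p @ np.array([0, 1, 2]))  # ~1.5

# Uniform prior density over (p0, p1) on the simplex, conditioned on the
# expected value being 1.5 (p1 + 2*p2 = 1.5 with p2 = 1 - p0 - p1):
# the allowed parameters form the segment p0 in [0, 1/4], p1 = 1/2 - 2*p0,
# p2 = 1/2 + p0, and averaging uniformly along it gives the expected biases.
p0 = np.linspace(0, 0.25, 100001)
print(p0.mean(), (0.5 - 2 * p0).mean(), (0.5 + p0).mean())  # 0.125, 0.25, 0.625
```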