Maximize Worst Case Bayes Score

by Scott Garrabrant2 min read17th Jun 201424 comments


Personal Blog

In this post, I propose an answer to the following question:

Given a consistent but incomplete theory, how should one choose a random model of that theory?

My proposal is rather simple. Just assign probabilities to sentences in such that if an adversary were to choose a model, your Worst Case Bayes Score is maximized. This assignment of probabilities represents a probability distribution on models, and choose randomly from this distribution. However, it will take some work to show that what I just described even makes sense. We need to show that Worst Case Bayes Score can be maximized, that such a maximum is unique, and that this assignment of probabilities to sentences represents an actual probability distribution. This post gives the necessary definitions, and proves these three facts.

Finally, I will show that any given probability assignment is coherent if and only if it is impossible to change the probability assignment in a way that simultaneously improves the Bayes Score by an amount bounded away from 0 in all models. This is nice because it gives us a measure of how far a probability assignment is from being coherent. Namely, we can define the "incoherence" of a probability assignment to be the supremum amount by which you can simultaneously improve the Bayes Score in all models. This could be a useful notion since we usually cannot compute a coherent probability assignment so in practice we need to work with incoherent probability assignments which approach a coherent one.

I wrote up all the definitions and proofs on my blog, and I do not want to go through the work of translating all of the latex code over here, so you will have to read the rest of the post there. Sorry. In case you do not care enough about this to read the formal definitions, let me just say that my definition of the "Bayes Score" of a probability assignment P with respect to a model M is the sum over all true sentences s of m(s)log(P(s)) plus the sum over all false sentences s of m(s)log(1-P(s)), where m is some fixed nowhere zero probability measure on all sentences. (e.g. m(s) is 1/2  to the number of bits needed to encode s)

I would be very grateful if anyone can come up with a proof that this probability distribution which maximizes Worst Case Bayes Score has the property that its Bayes Score is independent of the choice of what model we use to judge it. I believe it is true, but have not yet found a proof.


24 comments, sorted by Highlighting new comments since Today at 10:56 AM
New Comment

Hi Coscott,

I believe that this article by Peter Grunwald and Phil Dawid is relevant: Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory.

Thanks! This is very relevant.

I am not sure if you are implying that this will help with my conjecture. On my first very quick skim through, it looks like this is only dealing with finite spaces, which allows them to use the standard Bayes score, and not this weighted version that I am using. Unfortunately, my conjecture (and some other stronger stuff I want to prove but didn't post) is easy in the case where there is a finite language.

Sorry, I should have been more clear. The thing that I thought might help you (to prove the thing about independence of Bayes score) was the discussion of equalizer rules in the paper. I do not know if it will actually be helpful, just pattern-matching between your question and the paper. Incidentally, my intuition is that you need some assumptions (maybe some sort of convexity?) for the result to hold in general.

Can you explain why a sequence of probability assignments, whose WCB converges to a limit, must have a pointwise converging subsequence (i.e. the probabilities assigned to all sentences converge simultaneously)?

It must have a subsequence S1 which converges for the first sentence (because the interval [0,1] is compact). This subsequence must itself have a subsequence S2 which converges in the second sentence, which must have a subsequence S3 which converges in the third sentence and so on.

The subsequence we want takes the first entry of S1, then the second entry of S2, then the third entry of S3, and so on. For every n, after the nth entry, this is a subsequence of S_n, so the probabilities of the nth sentence must converge.

Note, that all the probabilities converge, but they do not converge uniformly. At any given time, there will be some probabilities that are still way off. This is a common analysis trick. Converging simultaneously on countably many axes is no harder that converging simultaneously on finitely many axes. Let me know if I should clarify further.

Thanks! That's a good trick, I didn't know it.

Reading further, it seems like your definition of P3 in terms of P1 and P2 is indeterminate when P1(phi)=0 and P2(phi)=1. I assume this hole can be patched. (ETA: I'm being stupid, this can't happen if P1 and P2 both maximize WCB, because we can assign a truth value to phi that will make either P1 or P2 have negative infinity Bayes score.)

Otherwise the proof seems fine at first glance. Great work! This is exactly the kind of stuff I want to see on LW.

You could make a similar complaint about the proof of coherence too. Just observe that clearly something that maximizes WCB can only assign probability 1 to tautologies and can only assign probability 0 to contradictions, so that can never happen.

Thanks! I have actually been thinking along these lines for about a year. (Notice that the update function for both the proofs of uniqueness and coherence are generalizations of the strategy I argued for.) I consider doing MIRIx a success just for inspiring me to finally sit down and write stuff up.

That update move is nice in that it updates the Bayes score by the same amount in all models. What I would really like to show is that if you start with the constant 1/2 probability assignment, and just apply that update move whenever you observe that your probability assignment is incoherent, you will converge to the WCB maximizer. I think this would be nice because it converges to the answer in such a way that seems unbiased during the entire process.

Let's view this as a zero-sum game between you and Nature. The game state at each step is a set of sentences, which is initially empty. At each step, we consider the lowest numbered sentence that is independent from the set. You choose a probability to assign to that sentence, then Nature chooses whether to add that sentence or its negation to the set. The outcome of the game is your Bayes score after infinitely many steps. Every coherent probability assignment corresponds to a strategy you can play in this game and vice versa, while every model corresponds to a play of the game by Nature.

You have proved that the game has a minimax value. Is it true that you can achieve that value by always choosing your move so that the two possible subgames after Nature's next move have equal minimax values? I think your conjecture would follow from that, because your Bayes score on a particular model is the limit of subgame values on the corresponding path through the game tree.

A proof could go something along these lines. Let's say your move will assign probability p to the next sentence. The value of the left subgame increases monotonically from negative infinity at p=0, and the value of the right subgame decreases monotonically to negative infinity at p=1. So there's a unique p where they cross over, and choosing any other p would be worse.

The above argument also seems to show that the assignment you found is the only coherent assignment that assigns equal finite Bayes scores to all models. The reason is that any other such assignment would correspond to a strategy that doesn't always keep the WCB of the left and right subgames equal. Then we can define two strategies for Nature, "always choose the subgame with the lower WCB" and "always choose the subgame with the higher WCB", which would give you different Bayes scores.

That leads to your second conjecture. If you start with a constant 1/2 assignment and apply the update move repeatedly, you get a sequence of assignments whose WCB increases. You can use your convergence trick to get a pointwise converging subsequence from that. If the limit is a coherent assignment, then we know that it assigns equal Bayes score to all models, so it must be the one you found. Otherwise we can apply the update move some more.

It's a bit unsatisfying that we had to pick a converging subsequence, but I think we can also prove that the whole sequence converges. Namely, if the sequence of probabilities for any sentence has more than one limit point, then we can do the above procedure with both of those limit points separately. Since we're guaranteed to reach the same result on both paths, that means both limit points must be the same.

Sorry about the handwaving, it's very likely that at least some of these arguments are wrong :-)

So this is wrong. Imagine that P (The WCB maximizer) does the same on all models except one, and does better on that one model. Restricting to any finite list of sentences will not change the worst case bayes score. (In fact, I can prove that the set of models with bayes score near WCB must be everywhere dense.)

You have said a lot of correct things that are important though.

First, If the first conjecture is true, then an alternate defintion of P is "The unique coherent probability assignment for which Bayes score is constant," which seems like an even nicer definition, and a big reason why I want to prove it. (Especially since the two definitions are the same)

Second, The sequence in my second conjecture has the property that any limit point is coherent. If I were to show also that any limit point is flat (Bayes is not a constant of model), Then I would be able to extend this proof to a proof that the sequences converge, and I would be able to prove everything from this.

The problem is that a limit of flat probability distributions is not necessarily flat. I actually basically have a paper written up which give the three definitions of P, and shows they are all well defined and equal (except I am missing the proof that the limit of my procedure must be flat). This paper uses the sequence definition to prove welldefinedness of the other two definitions. The stuff I wrote on the blog post is not proven that way in the paper because I had to figure out a way to reprove them without all the tools I got from analyzing the sequence.

Yeah, you're right. I assumed that Bayes(M, P) is continuous in both arguments, but we don't know what that means or how to prove it.

It's not clear how to define continuity in P for a given M, because pointwise convergence doesn't cut it and we don't have any other ideas. But continuity in M for a given P seems to be easier, because we can define the distance between two M's as the total weight of sentences where they disagree. Under this definition, I think I've found an example where Bayes(M, P) is not continuous in M.

1) Let's take a language consisting of sentences phi_k = "x=k" for each natural k, and all their combinations using logical connectives (and/or/not). This language has models M_k = "phi_k is true, all others are false" for each k, and also M_inf = "all phi_k are false".

2) To define the weights, fix c<1 and set mu(phi_k) = c*2^-k for each k, and distribute the remaining 1-c of weight among the remaining sentences arbitrarily.

3) For every k, any statement where M_k and M_inf disagree must include phi_k as a component, because all other phi_n have the same truth value in M_k and M_inf. The total weight of sentences that include phi_k goes to zero as k goes to infinity, so the model M_inf is the limit of models M_k under the above metric.

4) Now let's define a probability assignment P so that P(phi_k) = (2^(-2^k))/(1+2^(-2^k)) for each k, and P(anything else) = 1/2. You can check that Bayes(M_k, P) has the same value for each k, but Bayes(M_inf, P) has a different, higher value.

I actually thought of exactly this construction at our last MIRIx on Saturday. It made me sad. However here is a question I couldn't answer. Can you do the same thing with a coherent P? That I do not know.

If Bayes is continuous in M for coherent P then we would be able to prove the first conjecture.

Something slightly easier, maybe we can show that if Bayes is not continuous at the point M for coherent P, then P must assign nonzero probability to M. I am also not sure if this would prove the conjecture, but haven't really thought about this angle yet.

Another interesting thing to think about: What happens if we start with the P you described and then do the algorithm in the second conjecture.

I haven't told you everything about this algorithm yet. One frustrating thing is that if you want to change a probability from p to q in one step, you can write down how much your bayes score increases. Since bayes score is bounded above, you can only do this finitely many times. However, if you update a probability from p to q in a series of very small steps, you could in theory only increase your bayes score by an arbitrarily small amount. :(

I just recently started trying to prove the first conjecture without the second conjecture, so I have more hope that there is something easy there I just missed.

I think you know most of what I know about the first conjecture. Abram gave me a sketch of a proof that if P is the WCB maximizer then WCB(P) is Exp(Bayes(M,P)) (when M is chosen according to P), but we haven't really checked it. If you (or someone else who would like to try to find this proof) would like to know more about the second conjecture, you can pm me your email, I will send you my personal notes, and we can set up a skype chat?

The second conjecture implies the first, and If that is true, I would really like to know, because I think it makes the result much nicer.

[-][anonymous]8y 0

Yeah, you're right. I was assuming that Bayes(M, P) is continuous in both arguments in some sense. But we haven't proved that, and now I think I found a counterexample.

Consider the language that consists of countably many sentences, each of which contradicts all the others. So the only valid models are M_k = "the kth sentence is true, all the rest are false" for each k, and M_inf = "all sentences are false". Let's say that the kth sentence has weight 2^-k. Now let's define a probability assignment P that assigns to the kth sentence the probability P_k = (2^(-2^k))/(1+2^(-2^k)). You can check that Bayes(M_k, P) has the same value for all k, but Bayes(M_inf, P) has a different and higher value.

That seems to falsify most naive statements like "Bayes(M, P) is continuous in M for a given P, where the distance between two M's is the total weight of sentences where they disagree". There's probably some simple argument that falsifies continuity in P as well, once we define a metric on P's (which I don't know how to do).

[This comment is no longer endorsed by its author]Reply
[-][anonymous]8y 0

Imagine that P (The WCB maximizer) does the same on all models except one, and does better on that one model.

I don't think that's possible. Bayes(M, P) for a fixed P is a continuous function of M, if you define the distance between two models as the total weight of sentences where they disagree. Or am I missing something?

The problem is that a limit of flat probability distributions is not necessarily flat.

Wait, how can that be? Do you have an example?

[This comment is no longer endorsed by its author]Reply

You give a natural construction of a coherent probability assignment give a probability measure over sentences. Now, there is another such natural construction. My hunch is that they are the same: it would be interesting to prove/disprove.

What is the other construction? Start with an empty set of known sentences. On each iteration, pick a sentence at random using the probability measure on sentences. If it is consistent with the current set, add it to the set. Continue ad infinitum. The probability assigned to a sentence is the probability it will be added to the known set at some point during this procedure. I think it is very close to the proposal by Abram Densky (maybe identical? I need to reread it to remember).

There is a general probability theory analogue of this situation. Consider a set X and a countable system of subsets Si. Equip X with the sigma algebra generated by Si. Suppose there is a probability measure on the set of indices {i} is given. Then the analogue of the above procedure is as follows. Start with the set Y = X. On each iteration, pick an index i at random using the probability measure on indices. If Si intersects Y, replace Y by the intersection. Continue ad infinitum. The probability of a measurable subset T of X is the probability it will contain Y at some point during this procedure.

Your construction can also be generalized to this setting. Just define the Bayes score of by summing over Si. Coherence of a probability assignment to the Si corresponds to the assignment coming from a probability measure on X. Models corresponds to elements of X. So maybe both constructions coincide in the general setting?

I have I already presented this to Abram Demski, and he and I have been working together on trying to prove my conjecture. (He and I are both in Los Angeles, and coincidentally are interested in the same question, so it is likely to be the direction that the MIRIxLosAngeles workshop continues to focus on.)

Your proposal is equivalent to Abram's proposal. We believe the two distributions are not the same. I think we checked this for some small finite analogue.

Your "general" setting does not seem that much more general to me, It seems like it is pretty much identical, only reworded in terms of set theory instead of logic. There is one way in which it is more general. In my system, the set of subsets must be closed under union, intersection, and complement, and this is not true for a general collection of subsets. However, my construction does not work without this assumption. I actually use the fact that \mu is nowhere zero, and not being closed under union intersection and complement is kind of liking having some sets have measure 0.

I think the language of subsets instead of logic makes things a little easier to think about for some people. I think I prefer the logic language. (However when I wanted to think about small finite examples I would sometimes think about it in the subset language instead)

Firstly, I'd love to see the counterexample for the distributions being the same.

Secondly, are you sure \mu nowhere zero is essential? Intuitively, your uniqueness result must work whenever for every two models M1, M2 there is a sentence \phi separating them with \mu(\phi) non-zero. But I haven't checked it formally.

At very least my conjecture is not true if \mu is not nowhere zero, which was enough for me to ignore that case, because (see my response to cousin_it) what I actually believe is that there are three very different definitions that all give the same distribution, which I think makes the distribution stand out a lot more as a good idea. Also if \mu is sometimes zero, we also lose uniqueness because we dont know what to do with sets that our Bayes score does not care about. The fact that we can do whatever we want with these sets also takes away coherence (although maybe we could artificially require coherence) I really don't want to do that, because the whole reason I like this approach so much is that it didnt require coherence and coherence came out for free.

Well for example, if Si only has one set A, Abram will think we are in that set, I will think we are in that set with probability 1/2. Now, you could require that every sentence has the same \mu as its negation, (corresponding to putting the sentence or its negation in with probability 1/2 in Abram's procedure) in which case partition X into 3 sets, A, B, and C for which the in or not in A question is given weight muA, and similarly define muB and muC.

Let muA=1/2, muB=1/4 and muC=1/4.

Abrams procedure will with probability 1/4 choose A first, probability 1/8 choose B first, probability 1/8 choose C first, with probability 1/8 choose not A first then choose B with probability 1/8 choose not A first then choose C with probability 1/16 choose not C first and end up with B, with probability 1/16 choose not B first then end up with C, and with probability 1/8 choose not B or not C first then end up with A. In the end P(A) is .375.

Notice that Abrams solution gives a different Bayes score when in set A than when in the other 2 sets. My Bayes score will not. My bayes score will give P(A) probability p where the bayes score is constant:

1/2 log p+1/4 log (1-(1-p)/2)+1/4 log (1-(1-p)/2)=1/2 log (1-p)+1/4 log ((1-p)/2)+1/4 log (1-(1-p)/2)

2 log p+log (1-(1-p)/2)=2 log (1-p)+ log ((1-p)/2)




If you check this p value, you should see that bayes score is independent of model.

This idea is interesting enough that I think writing it up formally with some more detail for a logic journal or a philosophy journal may make sense. The approach is neat, and I find it surprising that this makes this much sense.

Thanks! That is actually the plan, but hopefully I (or someone potential coauthor) prove my conjecture (and the stronger conjecture I mentioned in response to to cousin_it) first. I and the MIRIxLosAngeles group are also thinking about how to this approach changes when we give the system the ability to reason about its own WCB maximizing probability function. (Thus making the set of possible models an acausal function of the choices of probabilities)

I am kind of surprised and sad that according to my blog stats, only at most 27 people have actually followed the link and read it.

Very neat!!

Your use of the Bayes score is reminiscent of what I'm trying to do here, only that I'm working in the context of complexity theory (and consider only efficiently computable things) whereas you work in the context of logic (and consider incomputable things). I think that (assuming my optimal estimators actually exist) it might be useful to try to find a formal connection between the two. The direction I see so far is observing that my estimator can be used to assign a probability to "sentence phi has a proof of length at most n" by considering the NP problem of deciding, given input (phi, x), whether a proof of length at most |x| exists.

Btw, I think your post can be great content for the FAI google group... wink wink... :)

This seems useful to me. However there should be regions of equal probability distribution. Have you considered that the probabilities would shift in real time with the geometry of the events implied by the sensor data? For example if you translated these statements to a matrix of likely interaction between events?

I do not really understand your question, but the answer is no. Care to elaborate?

I will have to think through my response so it will be useful. It may take a few days.

[This comment is no longer endorsed by its author]Reply

Ignore the second part of my initial comment, I was hadn't read your blog post explaining your idea at that point. I believe your problem can be formulated differently in order to introduce other necessary information to answer it. I appreciate your approach because it is generalized to any case which is why I find it appealing, but I believe it cannot be answered in this fashion because if you examine integrated systems of information by constituent parts in order to access that information you must take a "cruel cut" of the integrated system of information which may result in it becoming differentiable into 2 or more subsystems of information neither of which can contain the integrated information of the initial system.

I would recommend reading:

To be honest I am still in the midst of digesting this and several other ideas relevant to the computability of consciousness/intelligence and my remarks will have to stay brief for now. The first part of my initial comment referred to something similar to the example of a 2D isling model referred to in section C. The information you obtain from a random model of the consistent but incomplete theory will be bounded by other operators on the system from outside which limit the observable state(s) it produces. As for the second perhaps I can rephrase it now.... If viewed in this fashion your initial question becomes one of maximizing the information you get from a physical system however if the operators acting on it evolve in time then the information you receive will not be of T but rather of a set of theories. For example take a box of gas sitting in the open. As the sun rises and sets changing the temperature, the information you get about the gas will be of it operating under different conditions so uppercase phi will be changing in time and you will not be getting information about the same system. However, I do believe you can generalize these systems to a degree and predict how they will operate in a given context while bounded by particular parameters. I am still clarifying my thoughts on this issue.