# Ω 9

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The post on naturalistic logical updates left open the question of whether the probability distribution converges as we condition on more logical information. Here, I show that this cannot always be the case: for any computable probability distribution with naturalistic logical updates, we can show it proofs in an order which will prevent convergence. In fact, at any time, we can drive the probability of up or down as much as we like, for a wide variety of sentences .

As an aid to intuition, I describe the theorem informally as "all mathematicians are trollable". I was once told that there was an "all mathematicians go to Bayesian hell" theorem, based on the fact that a computable probability distribution must suffer arbitrarily large log-loss when trying to model mathematics. The idea here is similar. We are representing the belief state of a mathematician with a computable probability distribution, and trying to manipulate that belief state by proving carefully-selected theorems to the mathematician.

(ETA: A way around this result has now been found, giving us an untrollable mathematician.)

represents "p is provable", but we consider a class of proofs which the mathematician believes are sound; this justifies characterizing the mathematician's beliefs with the provability logic (as in naturalistic logical updates) rather than , so the mathematician's beliefs will obey the principle . For concreteness, we can suppose that the mathematician accepts proofs in .

Definition. A probability assignment over a language is a function from sentences of that language to values in the interval .

This is, of course, an extremely weak definition. We will generally also want coherence with respect to some logic. Where a logic is a set of axioms and inference rules over , we define coherence as usual:

Definition. A probability assignment over is coherent with respect to if and only if:

1. implies and implies ,
2. whenever , we have , and
3. .

This is equivalent to a number of other common definitions of coherence. For a computable probability assignment, coherence with won't be possible. Requirement (1) alone is impossible: the sentences which can be proven from and those which can be refuted are recursively inseparable sets.

We can, however, insist that the probability assignment is coherent with respect to . To be more precise, we can translate all the axioms and inference rules of over to the language of by interpreting the modal provability operator as a provability predicate in . I will keep using the box-notation for simplicity, and will refer to -derivability, -coherence, and so on; I always mean it in this translated sense.^1^ Since the consequence relation of is decidable, we don't run into the same problem as we do for coherence with respect to ; I showed previously that this allows a probability distribution to be constructed. In most respects, this is much weaker than coherence with respect to ; however, it validates the soundness of proofs in , which itself cannot.

The motivation for -coherence in my previous post was reflective consistency: when a new mathematical fact is proven, we want to take it into account in a way which our past self would endorse. More precisely, I proposed that in the absence of interactions between impossible possible worlds (such as counterfactual mugging with a logical coin), the policy for responding to logical information which looks best before getting a proof one way or the other should still look like the best way of responding after a proof is found. I also argued something similar from logical Dutch Book arguments, though I hadn't found the recipe to satisfy the desiderata yet. These decision-theoretic arguments still needs to be dealt with in more detail, but it suggested to me the goal of making a computable probability assignment in which Bayesian updates on proofs take the place of the "second update" usually associated with logical uncertainty. The -coherent prior which I constructed was the result.

In order for this approach to be very appealing, it seems that updating on proofs in this way should have other good properties which we have established for logical uncertainty, such as uniform coherence. The assignment which was given last time does not achieve such good properties, but perhaps some -coherent probability assignment will. The major point of this post is to show a difficulty in achieving uniform coherence with such an approach: it is not possible for -coherent Bayesian updates on proofs to have convergence guarantees when proofs are presented in arbitrary order. This does not rule out convergence for "good" proof orderings (indeed, perhaps there is a large class if "good" orderings for which we can converge); however, it is a bad sign. If convergence turns out to be impossible or very fragile, uniform coherence would as well.

Translating back to the story about "trolling mathematicians", the idea is to show that if someone is choosing proofs adversarially, a mathematician with -coherent beliefs can be made to form arbitrarily strong conjectural beliefs for or against a target proposition (so long as this proposition has not yet been proved or disproved); these beliefs can be made to oscillate as desired. The adversary can do this without ever lying to the mathematician.

The initial idea of the proof is that because a computable probability assignment cannot be coherent with respect to , any statement has consequences in which are currently unrecognized. Showing the mathematician a proof of those unexpected consequences should reduce belief in , since it eliminates some of the ways could be true (removing as a possibility from situations where ). (A real-life example of this is that learning about the Banach-Tarski paradox may reduce belief in the axiom of choice.) Furthermore, since learning tends to decrease belief in and increase it in , we can reduce the probability of a statement arbitrarily far by putting it as and finding many which it implies, or we can increase the probability of a statement by taking it to be and finding many which imply it.

The proof doesn't quite go through that simply, however; we will actually manipulate the probability of the statement , rather than manipulating directly. is abbreviated as . I also argue by choosing which is a theorem in , making vacuously true; this simplifies the proof. Where is a theorem, I will call and anti-theorem.

Lemma. Consider a computable probability assignment over the sentences of , such that is coherent with respect to . For any such that , there exists a theorem of , , such that .

Proof. The expression cannot act as a separator between which are theorems and anti-theorems of , since they are recursively inseparable. Due to -coherence, and assuming , there will already be some theorems of for which this expression has value 1 and anti-theorems for which it is 0; for example, for any sentence , and . Therefore, to prevent separation, there must be either an anti-theorem with , or a theorem with .

Suppose there's an anti-theorem with . Then . , and implies in (since and also ). is -equivalent to . By -coherence, . But this means that is a -theorem with .

Therefore, in either case, there is a -theorem such that .

For the next theorem, I also need a condition which I will call weak non-dogmatism:

Defintion. A probability assignment on the language of is weakly non-dogmatic if it does not assign probability zero to any theorem of .

The -coherent prior which I constructed previously is weakly non-dogmatic; I take this to be a very reasonable requirement which is not at all difficult to meet.

Note that weak non-dogmatism together with -coherence further implies that theorems of the combined system have nonzero probability.

Theorem 1. For any computable probability assignment over the sentences of which is -coherent and weakly non-dogmatic, for any such that , there exists a theorem of such that ; that is, odds of reduce by half given that is a theorem.

Proof. We will show this with the theorem , with chosen via the lemma. is a theorem of , since is. So, we want to show . The quantities are well-defined thanks to weak non-dogmatism and the assumption . Simplifying:

(The conjunct in the divisor was redundant, since means , which already implies for any .)

So, we want to show that is at most half as probable as .

But we already know , which means . implies , since if we could prove , we could combine that with a proof of to prove . So by coherence, . So we have , which is what was to be proved.

This theorem allows us to cut the odds of in half (or more) by a set formula. Since the result of updating on will itself be a -coherent probability assignment (thanks in part to weak non-dogmatism), we can keep doing this, cutting the probability of down as much as we like.

Corollary 1. For such that , we can drive as low as we like.

Proof. proves , so it also proves . Since has modus ponens as an inference rule, this means , so . Since we can iterate the trick in Theorem 1 to drive down the probability of arbitrarily far, we can drive down as well.

Corollary 2. If we have , we can drive as high as we like.

Proof. By corollary 1, we can drive down as far as we like in this case.

The two corollaries identify cases when a "troll" can make a mathematician's belief arbitrarily low or high. The next theorem shows that we can combine the trick to send a mathematician's beliefs waver back and forth forever.

Theorem 2. Consider a sequence of weakly non-dogmatic -coherent probability assignments such that is a Bayesian update of on some sentence with being a theorem of . We can insert additional updates on theorems, making a modified sequence , to ensure that for some , does not converge to any value.

Proof. There must be undecidable sentences with both and . If there weren't, then all undecidable sentences would have either or . By weak non-dogmatism, theorems would have , and anti-theorems would have . Sentences with both above zero would be guaranteed decidable one way or the other, so we could run a theorem prover and be sure that it would terminate on those. Sentences with both zero would be definitely undecidable. We can find the remaining theorems from anti-theorems by the predicate . Thus, we have a procedure for deciding whether a given is a theorem or not in finite time, which is impossible.

Since implies , we also have , which means . Similarly, because implies , we can infer and hence . So the sentences meet the conditions of corollaries 1 and 2.

Furthermore, there will be some sentences which stay in this state forever. If not, then it would be the case that we eventually know at least one of or for these . In fact, this means we eventually know one of those for any whatsoever. But this would allow us to separate theorems and anti-theorems: simply wait for one of or to become zero, grouping the first case with anti-theorems and the second with theorems. This procedure would always halt, and never put a theorem in the anti-theorem group or vice versa. This can't happen, so there must be sentences whose state we stay uncertain about forever.

Therefore, we can cause a failure of convergence by driving the probability of these sentences up and down forever. Suppose, as in the theorem statement, that there is a sequence of weakly non-dogmatic -coherent probability assignments such that is a Bayesian update of on some sentence with being a theorem of . Denote the sequence of theorems used as . We can edit the sequence to be sure that there are sentences for which the probability assignment does not converge, as follows. Between and for whole-numbered , insert additional theorems as follows. Find the first sentences matching the conditions of both corollaries 1 and 2. Use the strategy in corollary 1 to at least half the odds of these sentences times each, and then apply the strategy of corollary 2 to at least double them times each.

This guarantees that some sentences will have probability assignments which go back and forth forever, coming arbitrarily close to both probability 1 and probability 0 as the probability assignment is conditioned on more information.

# Are Naturalistic Updates to Blame?

I think some people might look at this and think that my unusual choice to update on rather than and to use the provability logic rather than the more usual might be the source of the difficulty. However, updating on theorems directly rather than updating on their provability is just as bad (in fact a little worse). The following proof is due primarily to Scott Garrabrant (and was the seed idea for the proof of theorem 1):

Theorem 3. Consider a propositionally coherent probability assignment to the language of which is computable and weakly non-dogmatic. For any sentence with , we can choose a sequence of theorems of to update on which send arbitrarily high or low.

Proof. There must be a theorem of such that , by the same seperability argument used in the proof of theorem 1. I want to establish that . The expressions involved are well-defined thanks to weak non-dogmatism and the assumed range of . Reducing, it suffices to show:

Since is propositionally equivalent to , and is propositionally equivalent to , so it suffices to show:

But we already know , which is to say .

This allows us to cut the odds of in half. A symmetric trick allows us to inflate the odds by half, using an anti-theorem and updating on . We can repeat both tricks as much as we like to drive the odds arbitrarily down or up.

So it is not exactly the formalism of naturalistic updates which creates the central problem, but rather, has more to do with representing a belief-state as a computable probability assignment and perform a Bayesian update to take into account more information. On the other hand, the details (such as updating on rather than ) do change the prognosis somewhat (changing the conditions under which we can manipulate a sentence's probability); it might even be that there are mild variations which significantly improve things.

# Discussion

I've shown that logical uncertainty following the school of naturalistic logical updates is "trollable", IE, can be manipulated by another entity selectively revealing mathematical proofs. This makes convergence (at best) dependent on proof order, which seems like a bad sign: logical uncertainty results so far haven't needed to make many assumptions about the theorem provers being used.

Although convergence/divergence is a nice formal property to check, one might argue that it doesn't matter too much what we believe about undecidable sentences. Decidable sentences will eventually be proved or refuted, and so will always converge. However, the proof also demonstrates that we can manipulate the probabilities of a large class of sentences, before those sentences are proved or refuted. This suggests that adversarial proof orderings can cause a -agent to be arbitrarily wrong about sentences which are later proved/refuted.

The main concern is not so much whether -coherent mathematicians are trollable as whether they are trolling themselves. Vulnerability to an external agent is somewhat concerning, but the existence of misleading proof-orderings brings up the question: are there principles we need to follow when deciding what proofs to look at next, to avoid misleading ourselves?

How damning is this for naturalistic logical updates, as a line of research? Speaking for myself, I find it concerning but not yet totally damning. The way I see it, this puts untrollability at odds with (a specific form of) reflective coherence: an agent shifting its mathematical uncertainty according to something other than a Bayesian update on proofs would want to self-modify, but an agent which uses a Bayesian update is more vulnerable to trolls. ("Trollability is bad, but if I can get a higher expected utility by editing myself to be more vulnerable to trolls, who cares?") I'm still somewhat hopeful that a probability assignment can converge on a broad class of proof-orderings. For example, it would be very nice if there were a -coherent distribution which was not trollable by any proof-ordering within its own range of computational complexity ("successful trolls need more processing power than their targets"). I think a result like that is unlikely, at present, but I haven't ruled it out yet. We could then guarantee convergence by allocating more computational resources to the logical uncertainty than to the theorem prover (or, something resembling this).

One comment I got about this, in a discussion at MIRI (I forget who made the remark) was that it's a problem of updating on information without taking into account the source of the information. Just as you would draw different conclusions about the frequencies of colors in a ball pit if you knew someone was selecting balls at random to show you vs if you thought they were selecting by some other policy, you should be able to model the proof-selecting strategy of a troll; believing with high probability that they will be showing you more evidence against , you would be inoculated against needing to actually adjust beliefs (unless the evidence was surprisingly compelling). I haven't figured out a good way to model this yet, though.

On the MIRIx slack, Miëtek Bak discussed explicit provability logic (papers: 1, 2). The idea is that reasoning about the existential statement "there exists a proof", in the form of "", causes unnecessary problems in reflection due to its non-constructive nature. Instead, we can consider a system which has explicit terms for proofs, with terms (" is a proof of ") taking the place of . Like , the resulting system can conclude the raw statement from the statement of provability; it endorses all instances of . Like , provability is reflective, representing provability in-this-very-logic. This may make explicit provability a good candidate for a variant of naturalistic logical updates. I think there may be reflective consistency considerations which point in this direction as well: my argument for naturalistic logical updates involves the agent considering the future, but makes a perhaps problematic leap from the agent observing a proof to the agent observing abstract provability, "". (This is especially questionable given that the agent is not supposed to be making first-order inferences like that automatically.)

Overall, naturalistic logical updates stand or fall with the strength of the reflective consistency argument which I'm currently claiming leads us to them. I haven't articulated this well enough yet, I think. (But, see logical Dutch Book arguments, logical counterfactuals consistent under self-modification, and naturalistic logical updates for my arguments as they stand.)

1. This still leaves details to the imagination, so to be yet more precise: consider a (computable) encoding of sentences of the language of into numerals , and another encoding of proofs of as numerals. There exists a predicate in which checks whether is a proof of ; denote it as . Also, we will use a function giving unique propositional sentence-letters of for each expression of ; this function must have the property that syntactically distinct expressions always get distinct propositional sentence-letters (no "hash collisions"), while identical expressions are given the same sentence-letter. We encode statements from into by a recursive translation :
• Otherwise,

The abuse of language whereby I talk about consequence and coherence "in ", but concerning the sentences of , can now be justified by translating sentences into to check the consequence relation. Notice that since identical sub-expressions always get translated to the same sentence-letters in , we can validate principles like and for all sentences in ; however, we make quantifiers totally opaque, failing to recognize even simple equivalences such as .