Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

All Mathematicians are Trollable: Divergence of Naturalistic Logical Updates

0Diffractor

New Comment

1 comment, sorted by Click to highlight new comments since: Today at 2:31 AM

Quick question: It is possible to drive the probability of x down arbitrarily far by finding a bunch of proofs of the form "x implies y" where y is a theorem. But the exact same argument applies to not x.

If the theorem-prover always finds a proof of the form "not x implies y" immediately afterwards, the probability wouldn't converge, but it would fluctuate within a certain range, which looks good enough.

What, if any, conditions need to be imposed on the theorem prover to confine the probabilities assigned to an unprovable statement to a range that is narrower than (0, 1)?

The post on naturalistic logical updates left open the question of whether the probability distribution converges as we condition on more logical information. Here, I show that this cannot always be the case: for any computable probability distribution with naturalistic logical updates, we can show it proofs in an order which will prevent convergence. In fact, at any time, we can drive the probability of x up or down as much as we like, for a wide variety of sentences x.

As an aid to intuition, I describe the theorem informally as "all mathematicians are trollable". I was once told that there was an "all mathematicians go to Bayesian hell" theorem, based on the fact that a computable probability distribution must suffer arbitrarily large log-loss when trying to model mathematics. The idea here is similar. We are representing the belief state of a mathematician with a computable probability distribution, and trying to manipulate that belief state by proving carefully-selected theorems to the mathematician.

(ETA: A way around this result has now been found, giving us an untrollable mathematician.)□p represents "p is provable", but we consider a class of proofs which the mathematician believes are sound; this justifies characterizing the mathematician's beliefs with the provability logic GLS (as in naturalistic logical updates) rather than GL, so the mathematician's beliefs will obey the principle □p→p. For concreteness, we can suppose that the mathematician accepts proofs in PA.

Definition.Aprobability assignmentover a language L is a function from sentences of that language to values in the interval [0,1].This is, of course, an extremely weak definition. We will generally also want coherence with respect to some logic. Where a logic A is a set of axioms and inference rules over L, we define coherence as usual:

Definition.A probability assignment P over L iscoherentwith respect to A if and only if:This is equivalent to a number of other common definitions of coherence. For a computable probability assignment, coherence with PA won't be possible. Requirement (1) alone is impossible: the sentences which can be proven from PA and those which can be refuted are recursively inseparable sets.

We can, however, insist that the probability assignment is coherent with respect to GLS. To be more precise, we can translate all the axioms and inference rules of GLS over to the language of PA by interpreting the modal provability operator □ as a provability predicate in PA. I will keep using the box-notation for simplicity, and will refer to GLS-derivability, GLS-coherence, and so on; I always mean it in this translated sense.^1^ Since the consequence relation of GLS is decidable, we don't run into the same problem as we do for coherence with respect to PA; I showed previously that this allows a probability distribution to be constructed. In most respects, this is much weaker than coherence with respect to PA; however, it validates the soundness of proofs in PA, which PA itself cannot.

The motivation for GLS-coherence in my previous post was reflective consistency: when a new mathematical fact is proven, we want to take it into account in a way which our past self would endorse. More precisely, I proposed that in the absence of interactions between impossible possible worlds (such as counterfactual mugging with a logical coin), the policy for responding to logical information which looks best

beforegetting a proof one way or the other should still look like the best way of respondingaftera proof is found. I also argued something similar from logical Dutch Book arguments, though I hadn't found the GLS recipe to satisfy the desiderata yet. These decision-theoretic arguments still needs to be dealt with in more detail, but it suggested to me the goal of making a computable probability assignment in which Bayesian updates on proofs take the place of the "second update" usually associated with logical uncertainty. The GLS-coherent prior which I constructed was the result.In order for this approach to be very appealing, it seems that updating on proofs in this way should have other good properties which we have established for logical uncertainty, such as uniform coherence. The assignment which was given last time does not achieve such good properties, but perhaps some GLS-coherent probability assignment will. The major point of

thispost is to show a difficulty in achieving uniform coherence with such an approach: it is not possible for GLS-coherent Bayesian updates on proofs to have convergence guarantees when proofs are presented inarbitraryorder. This does not rule out convergence for "good" proof orderings (indeed, perhaps there is a large class if "good" orderings for which we can converge); however, it is a bad sign. If convergence turns out to be impossible or very fragile, uniform coherence would as well.Translating back to the story about "trolling mathematicians", the idea is to show that if someone is choosing proofs adversarially, a mathematician with GLS-coherent beliefs can be made to form arbitrarily strong conjectural beliefs for or against a target proposition x (so long as this proposition has not yet been proved or disproved); these beliefs can be made to oscillate as desired. The adversary can do this without ever lying to the mathematician.

The initial idea of the proof is that because a computable probability assignment cannot be coherent with respect to PA, any statement x has consequences in PA which are currently unrecognized. Showing the mathematician a proof of those unexpected consequences should reduce belief in x, since it eliminates some of the ways x could be true (removing x as a possibility from situations where ¬y). (A real-life example of this is that learning about the Banach-Tarski paradox may reduce belief in the axiom of choice.) Furthermore, since learning x→y tends to decrease belief in x and increase it in y, we can reduce the probability of a statement arbitrarily far by putting it as x and finding many y which it implies, or we can increase the probability of a statement by taking it to be y and finding many x which imply it.

The proof doesn't quite go through that simply, however; we will actually manipulate the probability of the statement ¬□¬x, rather than manipulating x directly. ¬□¬x is abbreviated as ◊x. I also argue by choosing y which is a theorem in PA, making x→y vacuously true; this simplifies the proof. Where ¬x is a theorem, I will call x and anti-theorem.

Lemma.Consider a computable probability assignment P over the sentences of PA, such that P is coherent with respect to GLS. For any x such that P(◊x)≠0, there exists a theorem of PA, y, such that P(◊y|◊x)≤1/2.Proof.The expression P(◊y|◊x)>1/2 cannot act as a separator between y which are theorems and anti-theorems of PA, since they are recursively inseparable. Due to GLS-coherence, and assuming P(◊x)≠0, there will already be some theorems of PA for which this expression has value 1 and anti-theorems for which it is 0; for example, P(◊(a∨¬a)|◊x)=1 for any sentence a, and P(◊(a∧¬a)|◊x)=0. Therefore, to prevent separation, there must be either an anti-theorem c with P(◊c|◊x)>1/2, or a theorem t with P(◊t|◊y)≤1/2.Suppose there's an anti-theorem c with P(◊c|◊x)>1/2. Then P(¬◊c|◊x)≤1/2. ¬◊c=□¬c, and □¬c implies ¬□c in GLS (since ¬□⊥ and also ((□c)∧(□¬c))→□⊥). ¬□x is GLS-equivalent to ◊¬c. By GLS-coherence, P(◊¬c|◊x)≤P(¬◊c|◊x). But this means that ¬c is a PA-theorem with P(◊¬c|◊x)≤1/2.

Therefore, in either case, there is a PA-theorem y such that P(◊y|◊x)≤1/2. ■

For the next theorem, I also need a condition which I will call weak non-dogmatism:

Defintion.A probability assignment on the language of PA isweakly non-dogmaticif it does not assign probability zero to any theorem of PA.The GLS-coherent prior which I constructed previously is weakly non-dogmatic; I take this to be a very reasonable requirement which is not at all difficult to meet.

Note that weak non-dogmatism together with GLS-coherence

furtherimplies that theorems of the combined system PA+GLS have nonzero probability.Theorem 1.For any computable probability assignment P over the sentences of PA which is GLS-coherent and weakly non-dogmatic, for any x such that 1>P(◊x)>0, there exists a theorem t of PA such that P(◊x|□t)/P(¬◊x|□t)≤12P(◊x)/P(¬◊x); that is, odds of ◊x reduce by half given that t is a theorem.Proof.We will show this with the theorem t=x→y, with y chosen via the lemma. x→y is a theorem of PA, since y is. So, we want to show P(◊x|□(x→y))/P(¬◊x|□(x→y))≤12P(◊x)/P(¬◊x). The quantities are well-defined thanks to weak non-dogmatism and the assumption 1>P(◊x)>0. Simplifying:⇔P(◊x∧□(x→y))/P(¬◊x∧□(x→y))≤12P(◊x)/P(¬◊x)

⇔P(◊x∧□(x→y))/P(¬◊x)≤12P(◊x)/P(¬◊x)

(The conjunct □(x→y) in the divisor was redundant, since ¬◊x means □¬x, which already implies □(x→z) for any z.)⇔P(◊x∧□(x→y))≤12P(◊x)

So, we want to show that ◊x∧□(x→y) is at most half as probable as ◊x.

But we already know P(◊y|◊x)≤12, which means P(◊y∧◊x)≤12P(◊x). (¬□¬x∧□(x→y)) implies ¬□¬y, since if we could prove ¬y, we could combine that with a proof of x→y to prove ¬x. So by GLS coherence, P(◊x∧□(x→y))≤P(◊y∧◊x). So we have P(◊x∧□(x→y))≤12P(◊x), which is what was to be proved. ■

This theorem allows us to cut the odds of ◊x in half (or more) by a set formula. Since the result of updating on □(x→y) will itself be a GLS-coherent probability assignment (thanks in part to weak non-dogmatism), we can keep doing this, cutting the probability of ◊x down as much as we like.

Corollary 1.For x such that 1>P(◊x)>0, we can drive P(x) as low as we like.Proof.GLS proves □¬x→¬x, so it also proves x→¬□¬x. Since GLS has modus ponens as an inference rule, this means x⊢◊x, so P(x)≤P(◊x). Since we can iterate the trick in Theorem 1 to drive down the probability of ◊x arbitrarily far, we can drive down P(x) as well. ■Corollary 2.If we have 1>P(□x)>0, we can drive P(x) as high as we like.Proof.By corollary 1, we can drive P(¬x) down as far as we like in this case. ■The two corollaries identify cases when a "troll" can make a mathematician's belief arbitrarily low or high. The next theorem shows that we can combine the trick to send a mathematician's beliefs waver back and forth forever.

Theorem 2.Consider a sequence of weakly non-dogmatic GLS-coherent probability assignments P1,P2,P3,... such that Pn+1 is a Bayesian update of Pn on some sentence □t with t being a theorem of PA. We can insert additional updates on theorems, making a modified sequence P∗n, to ensure that for some x, P∗n(x) does not converge to any value.Proof.There must be undecidable sentences u with both P(□u)>0 and P(□¬u)>0. If there weren't, then all undecidable sentences would have either P(□x)=0 or P(□¬x)=0. By weak non-dogmatism, theorems would have P(□x)>0, and anti-theorems would have P(□¬x)>0. Sentences with both above zero would be guaranteed decidable one way or the other, so we could run a theorem prover and be sure that it would terminate on those. Sentences with both zero would be definitely undecidable. We can find the remaining theorems from anti-theorems by the predicate P(□x)>0∧P(□¬x)=0. Thus, we have a procedure for deciding whether a given x is a theorem or not in finite time, which is impossible.Since □x implies ◊x, we also have P(◊u)>0, which means P(□¬u)<1. Similarly, because □¬x implies ◊¬x, we can infer P(◊¬u)>0 and hence P(□u)<1. So the sentences u meet the conditions of corollaries 1 and 2.

Furthermore, there will be some sentences which stay in this state forever. If not, then it would be the case that we eventually know at least one of ¬□u or ¬□¬u for these u. In fact, this means we eventually know one of those for any x whatsoever. But this would allow us to separate theorems and anti-theorems: simply wait for one of P(□x) or P(□¬x) to become zero, grouping the first case with anti-theorems and the second with theorems. This procedure would always halt, and never put a theorem in the anti-theorem group or vice versa. This can't happen, so there must be sentences whose state we stay uncertain about forever.

Therefore, we can cause a failure of convergence by driving the probability of these sentences up and down forever. Suppose, as in the theorem statement, that there is a sequence of weakly non-dogmatic GLS-coherent probability assignments P1,P2,P3,... such that Pn+1 is a Bayesian update of Pn on some sentence □t with t being a theorem of PA. Denote the sequence of theorems used as t1,t2,t3,.... We can edit the sequence to be sure that there are sentences for which the probability assignment does not converge, as follows. Between t2n and t2n+1 for whole-numbered n, insert additional theorems as follows. Find the first n sentences matching the conditions of both corollaries 1 and 2. Use the strategy in corollary 1 to at least half the odds of these sentences n times each, and then apply the strategy of corollary 2 to at least double them n times each.

This guarantees that some sentences will have probability assignments which go back and forth forever, coming arbitrarily close to both probability 1 and probability 0 as the probability assignment is conditioned on more information. ■

## Are Naturalistic Updates to Blame?

I think some people might look at this and think that my unusual choice to update on □x rather than x and to use the provability logic GLS rather than the more usual GL might be the source of the difficulty. However, updating on theorems x directly rather than updating on their provability is just as bad (in fact a little worse). The following proof is due primarily to Scott Garrabrant (and was the seed idea for the proof of theorem 1):

Theorem 3.Consider a propositionally coherent probability assignment P to the language of PA which is computable and weakly non-dogmatic. For any sentence x with 0<P(x)<1, we can choose a sequence of theorems of PA to update on which send P(x) arbitrarily high or low.Proof.There must be a theorem t of PA such that P(t|x)≤12, by the same seperability argument used in the proof of theorem 1. I want to establish that P(a|a→t)/P(¬a|a→t)≤12P(a)/P(¬a). The expressions involved are well-defined thanks to weak non-dogmatism and the assumed range of P(x). Reducing, it suffices to show:P(a∧(a→t))/P(¬a∧(a→t))≤12P(a)/P(¬a)

Since a∧(a→t) is propositionally equivalent to a∧t, and ¬a∧(a→t) is propositionally equivalent to ¬a, so it suffices to show:

P(a∧t)/P(¬a)≤12P(a)/P(¬a)

⇔P(a∧t)≤12P(a)

But we already know P(t|x)≤12, which is to say P(t∧x)≤12P(a).

This allows us to cut the odds of x in half. A symmetric trick allows us to inflate the odds by half, using an anti-theorem c and updating on c→x. We can repeat both tricks as much as we like to drive the odds arbitrarily down or up. ■

So it is not exactly the formalism of naturalistic updates which creates the central problem, but rather, has more to do with representing a belief-state as a computable probability assignment and perform a Bayesian update to take into account more information. On the other hand, the details (such as updating on □t rather than t) do change the prognosis somewhat (changing the conditions under which we can manipulate a sentence's probability); it might even be that there are mild variations which significantly improve things.

## Discussion

I've shown that logical uncertainty following the school of naturalistic logical updates is "trollable", IE, can be manipulated by another entity selectively revealing mathematical proofs. This makes convergence (at best) dependent on proof order, which seems like a bad sign: logical uncertainty results so far haven't needed to make many assumptions about the theorem provers being used.

Although convergence/divergence is a nice formal property to check, one might argue that it doesn't matter too much what we believe about undecidable sentences. Decidable sentences will eventually be proved or refuted, and so will always converge. However, the proof also demonstrates that we can manipulate the probabilities of a large class of sentences, before those sentences are proved or refuted. This suggests that adversarial proof orderings can cause a GLS-agent to be arbitrarily wrong about sentences which are later proved/refuted.

The main concern is not so much whether GLS-coherent mathematicians are

trollableas whether they aretrolling themselves. Vulnerability to an external agent is somewhat concerning, but the existence of misleading proof-orderings brings up the question: are there principles we need to follow when deciding what proofs to look at next, to avoid misleading ourselves?How damning is this for naturalistic logical updates, as a line of research? Speaking for myself, I find it concerning but not yet totally damning. The way I see it, this puts untrollability at odds with (a specific form of) reflective coherence: an agent shifting its mathematical uncertainty according to something other than a Bayesian update on proofs would want to self-modify, but an agent which uses a Bayesian update is more vulnerable to trolls. ("Trollability is bad, but if I can get a higher expected utility by editing myself to be more vulnerable to trolls, who cares?") I'm still somewhat hopeful that a probability assignment can converge on a broad class of proof-orderings. For example, it would be very nice if there were a GLS-coherent distribution which was not trollable by any proof-ordering within its own range of computational complexity ("successful trolls need more processing power than their targets"). I think a result like that is unlikely, at present, but I haven't ruled it out yet. We could then guarantee convergence by allocating more computational resources to the logical uncertainty than to the theorem prover (or, something resembling this).

One comment I got about this, in a discussion at MIRI (I forget who made the remark) was that it's a problem of updating on information without taking into account the source of the information. Just as you would draw different conclusions about the frequencies of colors in a ball pit if you knew someone was selecting balls at random to show you vs if you thought they were selecting by some other policy, you should be able to model the proof-selecting strategy of a troll; believing with high probability that they will be showing you more evidence against x, you would be inoculated against needing to actually adjust beliefs (unless the evidence was

surprisinglycompelling). I haven't figured out a good way to model this yet, though.On the MIRIx slack, Miëtek Bak discussed

explicitprovability logic (papers: 1, 2). The idea is that reasoning about the existential statement "there exists a proof", in the form of "□", causes unnecessary problems in reflection due to its non-constructive nature. Instead, we can consider a system which has explicit terms for proofs, with terms p:s ("p is a proof of s") taking the place of □s. Like GLS, the resulting system can conclude the raw statement from the statement of provability; it endorses all instances of (p:s)→s. Like GL, provability is reflective, representing provability in-this-very-logic. This may make explicit provability a good candidate for a variant of naturalistic logical updates. I think there may be reflective consistency considerations which point in this direction as well: my argument for naturalistic logical updates involves the agent considering the future, but makes a perhaps problematic leap from the agent observing a proof to the agent observing abstract provability, "□". (This is especially questionable given that the agent is not supposed to be making first-order inferences like that automatically.)Overall, naturalistic logical updates stand or fall with the strength of the reflective consistency argument which I'm currently claiming leads us to them. I haven't articulated this well enough yet, I think. (But, see logical Dutch Book arguments, logical counterfactuals consistent under self-modification, and naturalistic logical updates for my arguments as they stand.)

The abuse of language whereby I talk about consequence and coherence "in GLS", but concerning the sentences of PA, can now be justified by translating PA sentences into GLS to check the consequence relation. Notice that since identical sub-expressions always get translated to the same sentence-letters in GLS, we can validate GLS principles like a∨¬a and a⊢a for all sentences a in PA; however, we make quantifiers totally opaque, failing to recognize even simple equivalences such as (∀x.x=x)↔(∀y.y=y).