Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Vadim pointed out in a comment to my post on logical counterfactuals that a very similar idea had been explained in a LessWrong post summarizing work by Vladimir Slepnev and Paul Christiano at a MIRI workshop in 2013. The algorithm which they suggested was called UDT 1.5. In fact, the essential idea is already argued by Vladimir Slepnev (cousin_it) in a post from 2012: Should logical probabilities be updateless too?.

Here, I continue to develop these ideas and those in my logical Dutch book post. I present an alternative prior to the one which was used for UDT 1.5. I show that this new prior has naturalistic logical updates, a kind of improvement-endorsing property which seems likely to work well with UDT. This property also gets around Paul Christiano's paradox of ignorance.


Motivation

Slepnev motivates the problem with a variant of counterfactual mugging, inspired by a post by Wei Dai but simplified:

Suppose Omega (the near-omniscient near-omnipotent and almost perfectly trustworthy agent who prefers the pronouns ve/ver) appears before you. Omega tells you that ve has computed the millionth binary digit of $\pi$, and that before knowing what the digit was, ve had committed to the following procedure: if the millionth digit is zero, ask you for $100. If the millionth digit is one, give you $10,000 only if you would have given Omega the $100 had the digit been zero; otherwise, give you nothing. Omega then tells you that the millionth digit was in fact zero, and asks you for $100.

Suppose you already use a variant of UDT, so you take the actions you would have approved of ahead of time. If you are a "smart" agent who already knew the millionth digit of $\pi$ and only calculated expected utility of universes consistent with this knowledge, you'd turn Omega down. If you're a "dumb" agent who doesn't, you'd give Omega the $100. (This is true even though UDT uses a kind of logical counterfactual; a UDT agent takes a counterfactual on its possible actions, not anything else, so it still doesn't consider the worlds where digits of $\pi$ are different.)
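To see the ex ante comparison concretely, here is a minimal Python sketch of the expected value each policy gets from the logically ignorant perspective. The 50/50 prior over the digit is my illustrative assumption (nothing in the scenario fixes it); the payoffs come from the setup above.

```python
# A minimal sketch of the updateless expected-value comparison.
# Assumption (mine, not from the scenario): the logically ignorant prior
# assigns probability 1/2 to each value of the millionth digit of pi.

P_DIGIT_ZERO = 0.5  # ignorant prior over the digit

def expected_value(pays_when_asked: bool) -> float:
    """Expected value of a policy, evaluated before learning the digit."""
    if pays_when_asked:
        # digit = 0: Omega asks and we pay $100.
        # digit = 1: Omega rewards our counterfactual willingness with $10,000.
        return P_DIGIT_ZERO * (-100) + (1 - P_DIGIT_ZERO) * 10_000
    else:
        # We never pay, so we never receive the reward either.
        return 0.0

print(expected_value(True))   # 4950.0 -- paying wins from the ignorant view
print(expected_value(False))  # 0.0
```

From the ignorant perspective, committing to pay is worth 4950 in expectation; that is exactly the commitment the "smart" agent can no longer see the point of.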

This means that "dumb" agents don't want to become "smart" agents. A UDT-like agent which improves its beliefs over time by proving more theorems will be temporally inconsistent. This calls into question approaches to logical uncertainty which rely on estimates which improve given more time, such as uniform coherence.

My intuition is that we want to bite the bullet and be updateless about logical uncertainty as well as empirical uncertainty, because this maximizes expected value from a logically uncertain perspective. This makes it seem like the only reflectively consistent possibility. Scott Garrabrant gave me a different argument for the same conclusion. My recollection of the argument is as follows:

Problems like this are a result of someone simulating you. You can't be sure of whether you're in the simulation, being fooled into thinking the millionth digit of $\pi$ is zero, or whether you're in the real world. You can't trust your own logic, since it could be spoofed. If you're in the real world, you want to not give up the $100. If you're in the simulation, then you'd like to give it up so that the real you gets $10,000. However, as in the absentminded driver problem, the sleeping beauty problem, and other such problems, you shouldn't solve the problem by assigning probabilities to the two possibilities; instead, you want to act according to a policy you'd endorse ahead of time. Hence, an updateless solution.

(An interesting question: to what extent do the details of Omega's counterfactual computation determine whether we buy the whole argument?)

Naturalistic Logical Updates

It seems we need a logically ignorant prior which is used to judge policies. This is a prior over impossible possible worlds, so it is bound to include some nonsense, but it can't be as simple as assigning probability 1/2 to everything; there needs to be enough structure to get "reasonable" results. Slepnev and Christiano offered such a prior in the workshop results I mentioned, but I don't see that it gets me the properties I want. I will construct one according to the properties I see as desirable, along the lines I started to develop in my logical Dutch book post.

A logically updateless UDT should reduce to a logically updateful DT in circumstances which do not involve entanglements with impossible possible worlds. I won't get that property here, but I will go for something suggestive: an approximation scheme for a logical prior which "endorses" improved approximations as being the appropriate update.

An AI which improves its logical consistency over time can be modeled by a sequence of approximate probability distributions $P_0, P_1, P_2, \ldots$ which are supposed to get better and better. For this post, the probability distributions will be over sentences of PA. I make the assumption that the change between any $P_i$ and $P_{i+1}$ in this sequence is due purely to information coming from a theorem-prover; at time $i+1$, we've proved more theorems, and taken them into account as logical constraints on beliefs. The type of constraints I will use will be partitions of truth: sets of sentences of which exactly one can be true. Coherence relative to a single constraint is defined by probabilities summing to one:

Relative Coherence: Where $S$ is a finite set of sentences, $P$ is coherent relative to $S$ if and only if $\sum_{s \in S} P(s) = 1$.
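As a toy illustration, relative coherence is just a summation check; the sentence names below are placeholders rather than anything from the construction:

```python
# A small illustration of relative coherence: a probability assignment is
# coherent relative to a finite set of sentences S when the probabilities
# of the members of S sum to one.

def coherent_relative_to(P: dict[str, float], S: set[str],
                         tol: float = 1e-9) -> bool:
    """Check whether P is coherent relative to the constraint set S."""
    return abs(sum(P[s] for s in S) - 1.0) < tol

P = {"A": 0.3, "B": 0.3, "C": 0.4, "A and B": 0.5}
print(coherent_relative_to(P, {"A", "B", "C"}))    # True
print(coherent_relative_to(P, {"A", "A and B"}))   # False: sums to 0.8
```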

If a function $P$ is coherent relative to all provable partitions of truth, it is coherent in the standard sense. So, we can suppose that this is the only type of information which the theorem-prover gives us. (This is a version of coherence which I got from Scott.)

As I found when working on the logical Dutch book post, the behavior of conditional probabilities gets tricky when we only have relative coherence; we don't necessarily have basic properties of conjunction, so we can have things like $P(A \wedge B) > P(A)$, and probabilities which depend on the order in which we condition on evidence. This is a problem for getting nice properties out of such probability functions. To solve this, I will require propositional coherence: the distribution will be coherent relative to every constraint which is provable by propositional logic alone. (Although this introduces an infinite number of constraints, $P$ can still be computable, as I will show later.) This is probably overkill, but is convenient.

Let $\pi_S$ be the sentence asserting that $S$ is a partition of truth. (For example, if $S = \{A, B\}$, then $\pi_S = (A \vee B) \wedge \neg(A \wedge B)$; on the other hand, if $S = \{A\}$, then $\pi_S = A$.) We define a naturalistic update as follows:

Naturalistic Logical Update: $P$ has a naturalistic logical update if and only if:

  • Where $S$ is a partition of truth, $P(\cdot \mid \Box \pi_S)$ is coherent relative to $S$.

This is the successor to my notion of "information-adding coherence" in the logical Dutch book post. Note that this definition only requires coherence relative to real logical constraints. If $S$ is not a genuine partition of truth, $P(\cdot \mid \Box \pi_S)$ may or may not be coherent relative to $S$. This is because requiring coherence when conditioning on arbitrary constraints would be impossible: we could condition on an unsatisfiable set of constraints.

We could then define $P_i$ by updating sequentially on theorems as we prove them. This won't be quite what's wanted, but it will give enough intuition for now. I'll use $T_i$ to denote the set of constraints $P_i$ has updated on.

For an updateless decision theory which uses $P_0$ as a logically ignorant prior, this means the segment of expected utility of a policy coming from just those worlds in which the constraints $T_i$ are provable should be exactly the expected utility under $P_i$. If this possible world is sufficiently "isolated" from everything else (negligible probability of being simulated), then the preferred policy for this case should be very close to the policy for a logically updateful agent using $P_i$. This suggests there's some hope for a well-behaved UDT based on $P_0$.

This is "naturalistic" in the sense that it doesn't care where the proof comes from. The only difference between the internal theorem-prover and an external one is that the agent has perfect knowledge of it. If we instead update on observations which make it very probable that a external theorem-prover has proved something, the resulting distribution would be (almost) coherent with respect to those constraints. Note that this solves Paul Christiano's paradox of ignorance: the counterintiutive effect was precisely that updating on information from an external theorem-prover was different from updating on an internal one. (Christiano 2014, section 6.1.5.)

This level of self-trust might appear concerning for Löbian reasons. We can apply the naturalistic update property to prove $P(A \mid \Box A) = 1$; since $P(A \mid \Box A) = P(A \wedge \Box A) / P(\Box A)$, we have $P(A \wedge \Box A) = P(\Box A)$; then $P(\Box A \wedge \neg A) = 0$; so we have $P(\Box A \rightarrow A) = 1$. However, recall that the naturalistic update property only applied to genuine partitions of truth; this line of argument only applies if $A$ is a theorem in the first place. Furthermore, even if we construct a prior for which $P(A \mid \Box A) = 1$ for a much wider class of $A$ than only theorems, this would still be a far cry from having $\Box A \rightarrow A$ in the proof system itself.

It turns out such a distribution can be constructed by surprisingly simple means.

Theorem. There exists a computable probability distribution $P$ which is coherent with respect to propositional constraints, and which has a naturalistic logical update.

Proof. Sample a random infinite theory $T$ as follows. Let $Q$ be a positive probability distribution over sentences. Sampling from $Q$ repeatedly, accept those samples into $T$ which are consistent with everything accepted so far under the provability logic GLS. (When I say a set of sentences is consistent under GLS, I mean the following: Take the conjunction of all the sentences. Translate this sentence to modal logic, hiding all details which aren't either propositional logic or the provability predicate. If the negation of the result is a theorem of GLS, the set of sentences is said to be inconsistent; otherwise, it is consistent.)

Define $P(A)$ as the probability that $A$ will occur in such a $T$. This is computable, as follows. Generate the tree of possible random theories, ending in a leaf whenever $A$ or $\neg A$ is sampled. The probability that one of the two will eventually be added to $T$ is 1, because GLS extends propositional logic and so has the property that if both $T \cup \{A\}$ and $T \cup \{\neg A\}$ are inconsistent, $T$ is already inconsistent. Therefore, for any desired accuracy $\epsilon$, we can extend the tree far enough to compute $P(A)$ to within $\epsilon$.

Suppose that $S$ is a partition of truth which is provable by propositional means alone. Since GLS extends propositional logic, any sampled $T$ would contain at most one $s \in S$, to avoid inconsistency. Furthermore, since any $T$ eventually contains either $s$ or $\neg s$ for each $s \in S$, it will also contain at least one: otherwise, it would contain the negation of every sentence in $S$, which would also be an inconsistency. Therefore, the probabilities assigned to any propositional partition of truth sum to one, as required.

$P$ also has a naturalistic update. GLS has $\Box A \rightarrow A$ for all $A$; so any sampled $T$ containing $\Box \pi_S$ cannot contain $\neg \pi_S$, and hence eventually contains $\pi_S$, after which the argument above applies. Therefore, $P(\cdot \mid \Box \pi_S)$ is coherent relative to the constraint $S$.
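Here is a rough Python sketch of this construction, with two loud caveats: the GLS consistency check is stubbed out with a purely propositional check (GLS is decidable, but implementing its decision procedure is far beyond a sketch), and $P(A)$ is estimated by Monte Carlo sampling rather than by the exact tree computation from the proof. The sentence names are placeholders.

```python
import random

SENTENCES = ["A", "B", "C"]  # placeholder sentence names

def positive_distribution() -> str:
    """Sample a literal from a positive distribution over sentences."""
    s = random.choice(SENTENCES)
    return s if random.random() < 0.5 else "not " + s

def gls_consistent(theory: set[str], new: str) -> bool:
    """STUB for the GLS consistency check (propositional fragment only):
    reject a sentence exactly when its negation is already accepted."""
    negation = new[4:] if new.startswith("not ") else "not " + new
    return negation not in theory

def sample_theory(target: str) -> bool:
    """Sample a random theory until it settles whether `target` occurs in it."""
    theory: set[str] = set()
    while True:
        s = positive_distribution()
        if gls_consistent(theory, s):
            theory.add(s)
        if target in theory:
            return True
        if not gls_consistent(theory, target):
            return False  # target can never be accepted from here on

def estimate_P(target: str, n: int = 10_000) -> float:
    """Monte Carlo estimate of P(target) = chance of occurring in T."""
    return sum(sample_theory(target) for _ in range(n)) / n

print(estimate_P("A"))  # roughly 0.5 under this toy setup
```

Under a real GLS oracle the numbers would differ, but the structure, positive sampling plus consistency filtering, is all the construction needs.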

It might seem strange that I've used GLS rather than GL. GLS is the "truth provability logic", which gives every true modal statement about provability in PA, rather than just those provable in PA. (Boolos, 1995.) GL represents the knowledge about provability possessed by PA, whereas GLS represents knowledge beyond PA. For example, GLS proves $\neg \Box \bot$, the consistency of PA. It might seem odd to impose this level of knowledge about provability on the probability distribution. However, coherence relative to GL would be odd in other ways. GL obeys "necessitation": if $\vdash A$, then $\vdash \Box A$. This means that anything true must be provable. This seems wrong for our logical prior: we're sampling random sentences to create a probability distribution in part because we know not everything is provable or disprovable, but want to have beliefs about those things anyway. Consistency with respect to GL would make us wrongly believe that $\Box A \vee \Box \neg A$. (Everything is provable or refutable or both.)

Of course, the main motivation here is that GL wouldn't get us a naturalistic logical update: we want $P$ to become coherent with the constraint $S$ by conditioning on the provability of $\pi_S$, $\Box \pi_S$. In GL, $\Box \pi_S$ is consistent with $\neg \pi_S$, so conditioning on new provability facts coming from a theorem prover wouldn't get us much at the object level. We would have to condition on $\pi_S$ directly, making the role of the theorem prover non-naturalistic. This would (I suspect) be bad news if we then try to apply it to an updateless decision theory.

Updateless Decision Theory

Given that the motivation for naturalistic updates was an updateless one, it wouldn't do to finish the post without at least sketching a UDT based on $P$.

As in UDT 1.5, we have an agent search for a program $p$ with high expected value according to $P$ and according to a utility function $U$. ($U$ must be encoded in PA, so I'll make the assumption that it has natural-numbered values.) Define this as:

$$p^* = \operatorname{argmax}_{p \in \mathcal{P}} \sum_{n} n \cdot P(U = n \mid A = p)$$

The set of programs $\mathcal{P}$ may be finite, for example when operating on a finite machine. In this case, the $\operatorname{argmax}$ operation is computable (if infeasible).
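A minimal sketch of this policy search, with everything stubbed: `PROGRAMS`, `UTILITIES`, and the conditional probability table are invented placeholders standing in for the finite program set, the natural-numbered values of $U$, and the conditional probabilities computed from the logically ignorant prior.

```python
PROGRAMS = ["p_0", "p_1", "p_2"]
UTILITIES = [0, 1, 2, 3]

def P_cond(utility: int, program: str) -> float:
    """STUB: P(U = utility | the agent is `program`), from the prior."""
    table = {"p_0": [0.7, 0.1, 0.1, 0.1],
             "p_1": [0.1, 0.2, 0.3, 0.4],
             "p_2": [0.25, 0.25, 0.25, 0.25]}
    return table[program][utility]

def expected_utility(program: str) -> float:
    """Expected value of U conditional on running `program`."""
    return sum(n * P_cond(n, program) for n in UTILITIES)

best = max(PROGRAMS, key=expected_utility)
print(best, expected_utility(best))  # p_1 with expected utility 2.0
```

With a finite program set this argmax is a finite loop, which is the sense in which the operation is computable if infeasible.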

Limitations

The above construction is not uniformly coherent. I haven't even shown that it converges to something coherent as we update!

It's not totally clear what happens as we update on constraints provable in PA. One concern is that it sounds dangerous to mix GLS with PA -- one might get a system which obeys Löb's theorem but also contains $\Box A \rightarrow A$. Things should be fine here, because we're effectively just adding arithmetically true statements to PA. Although this creates a stronger logic which endorses things like $\Box A \rightarrow A$, the $\Box$ still refers to provability in PA rather than provability in the new system.

One thing which I considered in the logical Dutch book post which I didn't manage to achieve here is updating on failed proof attempts. Trying to prove something for a long time without success should make us find it less likely. For this purpose, I introduce the notation $\Box_k A$, which says that there is a proof of $A$ of length at most $k$. We want updating on this to obey conservation of expected evidence:

Conservation of Expected Proof: $P(A) = P(A \mid \Box_k A) \cdot P(\Box_k A) + P(A \mid \neg \Box_k A) \cdot P(\neg \Box_k A)$
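This identity is just the law of total probability; a short derivation (my own check, using the propositionally provable partitions $\{A \wedge \Box_k A,\ A \wedge \neg \Box_k A,\ \neg A\}$ and $\{A, \neg A\}$, with conditionals defined as ratios) makes this explicit:

```latex
% Conservation of expected proof from propositional coherence alone:
% the probabilities of each propositional partition of truth sum to one,
% so P(A) decomposes by cases on whether a length-k proof exists.
\begin{align*}
P(A) &= P(A \wedge \Box_k A) + P(A \wedge \neg\Box_k A)\\
     &= P(A \mid \Box_k A)\, P(\Box_k A)
      + P(A \mid \neg\Box_k A)\, P(\neg\Box_k A).
\end{align*}
```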

Note, however, that this already follows from propositional coherence alone. To make this meaningful, we also want:

Bounded Naturalistic Update:

  • Where $S$ is a partition of truth, $P(\cdot \mid \Box_k \pi_S)$ is coherent relative to $S$.

This could likely be achieved by enforcing $\Box_k A \rightarrow \Box A$ for all $A$ and $k$, in addition to everything already being enforced via GLS. Showing that the result is still well-defined and computable requires more than just the decidability of GLS, which is why I didn't do this here.

We could then define $P_i$ by updating sequentially on all our proof attempts, successful or unsuccessful. (Differentiating proof attempts by a length-bound only is very coarse-grained, but we could make this finer-grained as desired.)
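As a toy sketch of what such sequential updating might look like, the following conditions a small hand-made joint distribution on failed proof attempts at increasing length bounds. The worlds and weights are invented for illustration, constrained only so that a bounded proof implies truth and longer bounds subsume shorter ones.

```python
# Worlds assign truth values to (A, "provable within 10", "provable
# within 100"); weights are made up for illustration.
WORLDS = {
    (True,  True,  True):  0.2,  # a short proof exists
    (True,  False, True):  0.3,  # only a long proof exists
    (True,  False, False): 0.2,  # true, but no proof within 100 steps
    (False, False, False): 0.3,  # false, hence no proof at any bound
}

def P_A(worlds: dict) -> float:
    """Probability of A under the (renormalized) world weights."""
    total = sum(worlds.values())
    return sum(w for (a, _, _), w in worlds.items() if a) / total

def condition(worlds: dict, index: int, value: bool) -> dict:
    """Bayesian update on observing the `index`-th coordinate equal `value`."""
    return {k: w for k, w in worlds.items() if k[index] == value}

print(P_A(WORLDS))                         # 0.7 prior
after_10 = condition(WORLDS, 1, False)     # proof search to length 10 failed
print(P_A(after_10))                       # 0.625: A slightly less likely
after_100 = condition(after_10, 2, False)  # ...and to length 100
print(P_A(after_100))                      # 0.4
```

Each failed attempt shifts probability away from $A$, exactly the "trying and failing should make it less likely" behavior the bounded update is meant to capture.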

Works Cited:

Boolos, George. 1995. The Logic of Provability. Cambridge University Press.

Christiano, Paul. 2014. Non-Omniscience, Probabilistic Inference, and Metamathematics. Technical report 2014-3. MIRI.

Comments

Nice work! I'm not too attached to the particular prior described in the "UDT 1.5" post, and I'm very happy that you're trying to design something better to solve the same problem. That said, it's a bit suspicious that you're conditioning on provability rather than truth, which leads you to use GLS etc. Maybe I just don't understand the "non-naturalistic" objection?

Also I don't understand why you have to use both propositional coherence and partitions of truth. Can't you replace partitions of truth with arbitrary propositional combinations of sentences?

The main reason to think it's important to be "naturalistic" about the updates is reflective consistency. When there is negligible interaction between possible worlds, an updateless agent should behave like an updateful one. The updateless version of the agent would not endorse updating on $\pi_S$ rather than $\Box \pi_S$! Since it does not trust what its future self proves, it would self-modify to remove such an update if it were hard-wired. So I think updating on provability rather than truth is really the right thing.

(Sorry I didn't notice this comment earlier.)

My intention wasn't really to use both propositional coherence and partitions of truth -- I switched from one to the other because they're equivalent (at least, the way I did it). Probably would have been better to stick with one.

I do think this notion of 'naturalistic' is important. The idea is that if another computer implements the same logic as your internal theorem prover, and you know this, you should treat information coming from it just the same as you would your own. This seems like a desirable property, if you can get it.

I can understand being suspicious. I'm not claiming that using GLS gives some magical self-reference properties due to knowing what is true about provability. It's more like using PA+CON(PA) to reason about PA. It's much more restricted than that, though; it's a probabilistic reasoning system that believes GLS, reasoning about PA and provability in PA. In any case, you won't be automatically trusting external theorem provers in this "higher" system, only in PA. However, GLS is decidable, so trusting what external theorem provers claim about it is a non-issue.

What's the use? Aside from giving a quite pleasing (to me at least) solution to the paradox of ignorance, this seems to me to be precisely what's needed for impossible possible worlds. In order to be uncertain about logic itself, there needs to be some "structure" that we are uncertain about: something that is really chosen deterministically, according to real logic, but which we pretend is chosen randomly, in order to get a tractable distribution to reason about (which then includes the impossible possible worlds).