Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Yesterday, I gave a definition of "decision theory" and "fair decision problem" in the context of provability logic, and gave a formal version of drnickbone's argument that no decision theory can perform optimally on every "fair" decision problem. Today, I prove that for any provably "fair" decision problem , there is a sound extension of by a finite set of formulas, expressible in the language of provability logic, such that Vladimir Slepnev's modal logic version of UDT performs optimally on if we have it search for proofs in this extended proof system.

This is a version of a result that Kenny Easwaran, Nate Soares and I proved in the context of UDT with bounded proof length when Kenny visited MIRI for two days earlier this year. (I'll post about the bounded proof length version some other time; it's more complicated, as you'd expect.)

Prerequisites: A primer on provability logic, "Evil" decision problems in provability logic.

Provability in extensions of PA

In order to carry out the idea above, we need to talk about provability in extensions of , not just about the provability in expressed by the operator of Gödel-Löb provability logic. Fortunately, the systems we'll need in this post only extend by a finite number of axioms which can be expressed as closed formulas in the language of , and there's a simple trick to reason about such extensions in itself:

Suppose that we extend by a single axiom, . Then, by the deduction theorem for first-order logic, if and only if . Thus, if both and can be expressed as modal formulas, then we can express the proposition that is provable in by the modal formula . The same trick works, of course, if we want to add finitely many different modal formulas to , since we can simply let be their conjunction.

In this post, when I'm using this trick to talk about , I'll emphasize this fact by abbreviating to (since we often subscript by the proof system when it's not clear which system we are referring to).

Modal UDT

The version of modal UDT I'm using in this article is equivalent to the one in Vladimir's article, except for using instead of , for an appropriate . Nevertheless, let me give a quick recap, which puts a slightly different spin on the system, and check that this does indeed meet the specification of a "modal decision theory".

As a modal decision theory, -action, -preference level modal UDT is a sequence of formulas in parameters, such that gives the action UDT chooses in the modal universe . Let's write for the version which uses provability in .

I think the easiest way to specify is to say that it's a sequence of modal formulas which describe the behavior of the following an algorithm (which needs a halting oracle):

  • For each :
    • For each :
      • If it's provable in that "I take action " implies "I obtain outcome ", then return .
  • If still here, return .

In other words, UDT starts with the most preferred outcome, and tries to find an action that provably implies that outcome; if there is such an action, it returns it, breaking ties lexicographically. If there isn't, it tries the second-most-preferred outcome, and so on down to the least preferred one. If no actions imply any outcomes, it returns an arbitrary default action.

To actually write out , we need to find a formula that describes when the above algorithm returns the action . Clearly, for actions other than the default action, the condition is that there's some outcome such that (a) it's provable that implies ; (b) there's no action which provably implies ; (c) and there isn't any outcome which is provably implied by any action . The default action is taken if the condition above holds or if no action provably implies any outcome.

Let's agree to order pairs lexicographically, i.e., if or and . Then, we can write as follows, for : and for : From this definition, it's straight-forward to see that is provably mutually exclusive and exhaustive (p.m.e.e.); we work in and distinguish the following cases: There either is no pair such that , or some pair is the first one. If there is no such pair, then is true, and all other formulas in the sequence are false. If is the first such pair, then is true, and all the other formulas are false.

Provably extensional problems

Let's call a closed formula in the language of sound if its translation into the language of arithmetic is true on the natural numbers; unsurprisingly, we'll write this as for short. My claim is that for every provably extensional decision problem , there is a sound formula such that performs optimally on , that is, obtains the best outcome that any decision theory can achieve on this outcome.

Recall that a decision problem is defined to be provably extensional if where stands for the conjunction . (Of course, the two sequences need to be of the same length.)

Intuitively, such a decision problem assigns a single outcome to every action , and every agent which chooses action will obtain outcome ; this is a formalization of the idea that we should consider a decision problem "fair" if it only rewards or punishes you for the actions you decide to take, not for the decision process you used to arrive at these decisions.

This doesn't mean that the mapping is all that matters about the decision problem---for example, any particular version of modal UDT fails on its "evil" decision problem, despite the fact that these "evil" problems are provably extensional. That's because it can be difficult to determine which mapping a particular corresponds to.

The idea of the optimality proof in this post is that for any particular , there is some truth about what the corresponding mapping is, even if it's hard to determine, and if we run UDT with a proof system strong enough to handle this problem, this version of UDT will simply be able to figure out which action is best and then take it.

How do we find a proof system strong enough to determine the action/outcome mapping for a given ? The simplest possibility is to just take and add to it the information about the true action/outcome mapping.

Let and for , and write for the corresponding sequence of length or (depending on the context). By the true mapping, I mean the mapping such that for all , we have . This mapping is unique, because is a decision problem, which by definition implies that the formulas in this sequence are p.m.e.e. (i.e., exactly one of them is true, for any particular value of the ).

Now it's clear how we can encode the fact that this is the true mapping as a formula of : we simply set .

Optimality

It's now clear the best outcome any agent can obtain on the provably extensional decision problem is : An agent is a sequence of closed p.m.e.e. formulas, which means that for any agent there is some such that , which by extensionality implies , so the outcome obtained by is .

Write for the fixed point of and , with defined as in the previous section. Then, in order to show that performs optimally on this decision problem, we need to show that for some such that .

I claim that this holds for the smallest such that . Given the way is constructed, in order to show this I need to show that , plus for all .

The first of these statements is equivalent to which, given the properties and defining and , we can rewrite as Since is p.m.e.e., is equivalent to , so by provable extensionality of , we have . The definition of implies , so we obtain or , which is equivalent to the desired result.

For the second statement, we need to show that is not provable in for any . So suppose it were provable, for some such . Then would take the action corresponding to the smallest such pair; in other words, we would have . We can again use the defining properties of and to rewrite this as .

Since is sound and , the assumption that is provable in implies that it is true on . We can rewrite this as well, to . Hence, we have .

If , this says that achieves an outcome better than , which we've shown above to be the best outcome that any agent can possibly achieve---contradiction. If and , then we have found an such that (since by extensionality, implies , and hence for as is p.m.e.e.)---contradiction to the definition of .

New to LessWrong?

New Comment