Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

When reading Vladimir's recent post on defining UDT through modal logic, I had to think a bit about his actual definition of the modal formula(s) corresponding to UDT, because it was phrased in terms of an algorithm doing things instead of an actual modal formula. Then I remembered having worked through the correspondence in the context of modal agents, and it became clear what was going on. I think Vladimir's approach is really interesting and I want to refer to it in future posts, so I thought I'd write a tutorial, since it seems likely that others will have the same problem.


Like Vladimir, let's start with the simplest case of an agent that chooses between two actions, in a world with two different possible outcomes (one of which is preferred to the other). Let's write the world as a universe program, in the usual style of proof-based UDT; for example:

  def U():
    if A() == a1:
      return 10
    else:
      return 5

And, in the style of UDT with a halting oracle, let's write our agent as a program with access to a halting oracle, which it uses to figure out whether certain sentences are provable (in Peano Arithmetic, let's say):

  def A():
    if   Provable("A() == a1  -->  U() == 10"):
      return a1
    elif Provable("A() == a2  -->  U() == 10"):
      return a2
    return a2

(The reference to and inside the quotes is by quining.) Our agent tries to get utility 10 by trying all possible actions in order; for each action, it sees whether it can prove that this action leads to the desired outcome, and if so, it takes that action; if it doesn't find a good action, it randomly takes the last one.


Given a programming language with access to a halting oracle, there's a relation such that if is the Gödel number of a program in this language, then if and only if this program halts and returns . (The definition of is very similar to the one for programming languages without a halting oracle.) We implicitly used this fact in the definition of : the statement "" really means "".

Now, by looking at the definition of , we can show that and by looking at the definition of , we can show that If we ignore the internal structure of and , and treat these as propositional variables, both of the statements above look like formulas of modal logic: and .

We can write this as and , where and are the modal formulas and .


So far, the modal logic thing has been an analogy, but now we can observe that the two displayed statements above imply that is a fixed point of the formula ; that is, Now, since is a fully modalized formula (meaning that all occurrences of its propositional variable, , are inside boxes), this means that we can apply results from provability logic: in particular, there is a modal formula that's a fixed point of ---a closed formula of Gödel-Löb provability logic () such that ---and moreover, .

It's straight-forward to check that . (Recall that proves any closed formula in the language of modal logic if and only if proves its translation to the language of arithmetic, so we prove the above statement by arguing: can clearly show that proves , and hence that is equivalent to .) Hence, it follows that .


So the general idea for specifying modal formulas through algorithms with halting oracles is as follows:

  • Write a program in a language with access to a halting oracle, which uses that oracle to check the provability of relatively simple statements---which may, however, refer to the output of the program.
  • By inspection of the source code of this program, show in that its output is a fixed point of a certain fully modalized formula of modal logic.
  • Show in modal logic that a certain closed formula is also a fixed point of this.
  • Conclude that shows to be equivalent to the complicated formula that describes the output of your program.

Before looking at how this applies to Vladimir's actual definition of UDT, there's one complication to add: allowing more than two actions and more than two outcomes. To do so, instead of talking just about the single statement , we can talk about the statements , and instead of talking about just (with in our example), we'll talk about .

Now we describe the behavior of the agent program by statements of the form and the behavior by the universe program by statements of the form where the are fully modalized formulas, and the are arbitrary formulas of provability logic. Intuitively, The fact that the are modalized corresponds to the fact that agents can't run the universe they're in, even if they have its source code---they have to use abstract reasoning (such as proofs) to figure out what the consequences of different actions are. The universe program, on the other hand, can just run the agent and see what it does, which corresponds to being able to refer to it outside modal operators.

I hope it won't be too confusing that I'm using both to denote an action and the corresponding propositional variable, and similarly for . From now on, I'll abbreviate by and by (deviating from Vladimir's use of overbars in order to keep overbars available for splicing natural numbers into arithmetic formulas).

Again, the place that the formulas and are coming from is just by looking at what the source code of these programs does and writing down the modal formula that expresses what that source does. That said, we're not just looking for formulas that in some intuitive sense correspond to what the program does; we're looking for formulas such that the displayed statements above are provable in .

By the fixed point theorem for provability logic, we then know that we can find formulas such that , where abbreviates ; and moreover, we know that then, . (To apply the fixed point theorem, we need to use the fact that is fully modalized, which it clearly is.) Thus, we can carry out the same steps as in the two-action, two-outcome case above.


Now let's revisit Vladimir's definition of the agent program UDT, suitably adapted to my notation here and to my convention of making the mutually exclusive:

  1. There are finitely many sentences of the form "if such-and such holds, then such-and-such holds". Find all such sentences that are provable. If no such sentences were found, choose an arbitrary action---e.g., make true and all other false.

  2. From all pairs found in the previous step, choose the whose is highest in our preference ordering. If there are multiple such , choose any one, e.g. the one with the lowest index.

  3. For each , define to be true if (where is the index of the we chose in the previous step).

To cash this out, first we think of this definition as an algorithm with a halting oracle:

  def UDT():
    pairs = []
    for i in range(m):
      for j in range(n):
        if Provable("UDT() == a_i  -->  U() = u_j"):
          pairs.append((i,j))
    
    if len(pairs) == 0:
      return a_1
    else:
      find the pair (i,j) in pairs with the largest j
      return the corresponding a_i

Now we try to find such that we can prove in PA that is equivalent to . This is somewhat tedious, but possible: for example, we can:

  • enumerate every subset of the set of all pairs with and
  • for every such subset , write down the modal formula which is the conjunction of , for all , and , for all ;
  • let be the disjunction of for every such that the above algorithm returns if the Provable(...) condition is evaluated according to . (Since there are only finitely many , we can just figure this out by running the code, interpreting the call to Provable(...) according to each .)

You could certainly write the more compactly than this, but the above version shows the idea.


Having done all this, not only do we have a better understanding of how to translate Vladimir's specification into a formula of modal logic, we also have an idea of what to do with it: Given a universe program for which we can show statements of the form we can now find formulas such that , and can then conclude that tells us whether ---the program with the halting oracle---will output , given the universe .

This may be rather daunting, given the complicated definition of from above, but finding the fixed point formulas and evaluating their truth value are both decidable (unless I'm making a mistake?), so at least in principle (and hopefully in practice), we can write a program that figures out the answer for us.


(I'm not sure whether this post will be helpful or interesting to anybody---feedback appreciated...)

New Comment
4 comments, sorted by Click to highlight new comments since: Today at 12:48 AM

Thanks for writing this! I just added a paragraph to my post to explain how the algorithm corresponds to a modal formula. Also I like your notation with instead of , edited my post to use that instead.

Quite helpful, thanks!

(Thanks for the feedback!)

This seems good. I'll think about it more.