Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

From halting oracles to modal logic

5cousin_it

1abramdemski

0Benya_Fallenstein

0Stuart_Armstrong

When reading Vladimir's recent post on defining UDT through modal logic, I had to think a bit about his actual definition of the modal formula(s) corresponding to UDT, because it was phrased in terms of an algorithm doing things instead of an actual modal formula. Then I remembered having worked through the correspondence in the context of modal agents, and it became clear what was going on. I think Vladimir's approach is really interesting and I want to refer to it in future posts, so I thought I'd write a tutorial, since it seems likely that others will have the same problem.

Like Vladimir, let's start with the simplest case of an agent that chooses between two actions, in a world with two different possible outcomes (one of which is preferred to the other). Let's write the world as a universe program, in the usual style of proof-based UDT; for example:

And, in the style of UDT with a halting oracle, let's write our agent as a program with access to a halting oracle, which it uses to figure out whether certain sentences are provable (in Peano Arithmetic, let's say):

(The reference to A() and U() inside the quotes is by quining.) Our agent tries to get utility 10 by trying all possible actions in order; for each action, it sees whether it can prove that this action leads to the desired outcome, and if so, it takes that action; if it doesn't find a good action, it randomly takes the last one.

Given a programming language with access to a halting oracle, there's a relation Run(m,n) such that if m is the Gödel number of a program in this language, then N⊨Run(m,n) if and only if this program halts and returns n. (The definition of Run(m,n) is very similar to the one for programming languages

withouta halting oracle.) We implicitly used this fact in the definition of A(): the statement "A()=a1→U()=10" really means "Run(A,a1)→Run(U,10)".Now, by looking at the definition of U(), we can show that PA⊢[U()=10]↔[A()=a1], and by looking at the definition of A(), we can show that PA⊢[A()=a1]↔□┌[A()=a1]→[U()=10]┐. If we ignore the internal structure of [U()=10] and [A()=a1], and treat these as propositional variables, both of the statements above look like formulas of modal logic: u↔a and a↔□┌a→u┐.

We can write this as a↔F(a,u) and u↔G(a), where F(a,u) and G(a) are the modal formulas F(a,u)≡□┌a→u┐ and G(a)≡a.

So far, the modal logic thing has been an analogy, but now we can observe that the two displayed statements above imply that [A()=a1] is a fixed point of the formula F(a,G(a)); that is, PA⊢[A()=a1]↔F([A()=a1],G([A()=a1])). Now, since F(a,G(a)) is a fully modalized formula (meaning that all occurrences of its propositional variable, a, are inside boxes),

thismeans that we can apply results from provability logic: in particular, there is a modal formula that's a fixed point of F(a,G(a))---a closed formula φ of Gödel-Löb provability logic (GL) such that GL⊢φ↔F(φ,G(φ))---and moreover, PA⊢[A()=a1]↔φ.It's straight-forward to check that GL⊢⊤↔□┌⊤→⊤┐. (Recall that GL proves any closed formula in the language of modal logic if and only if PA proves its translation to the language of arithmetic, so we prove the above statement by arguing: PA can clearly show that PA proves ⊤→⊤, and hence that □┌⊤→⊤┐ is equivalent to ⊤.) Hence, it follows that PA⊢[A()=a1]↔⊤.

So the general idea for specifying modal formulas through algorithms with halting oracles is as follows:

alsoa fixed point of this.Before looking at how this applies to Vladimir's actual definition of UDT, there's one complication to add: allowing more than two actions and more than two outcomes. To do so, instead of talking just about the single statement [A()=a1], we can talk about the m statements [A()=a1],…,[A()=am], and instead of talking about just [U()=u1] (with u1=10 in our example), we'll talk about [U()=u1],…,[U()=un].

Now we describe the behavior of the agent program by m statements of the form PA⊢[A()=ai]↔Fi([A()=a1],…,[A()=am],[U()=u1],…,[U()=un]) and the behavior by the universe program by n statements of the form PA⊢[U()=uj]↔Gj([A()=a1],…,[A()=am]), where the Fi(a1,…,am,u1,…,un) are fully modalized formulas, and the Gj(a1,…,am) are arbitrary formulas of provability logic. Intuitively, The fact that the Fi are modalized corresponds to the fact that agents can't

runthe universe they're in, even if they have its source code---they have to use abstract reasoning (such as proofs) to figure out what the consequences of different actions are. The universe program, on the other hand, can just run the agent and see what it does, which corresponds to being able to refer to it outside modal operators.I hope it won't be

tooconfusing that I'm using ai both to denote an action and the corresponding propositional variable, and similarly for ui. From now on, I'll abbreviate a1,…,am by →a and u1,…,un by →u (deviating from Vladimir's use of overbars in order to keep overbars available for splicing natural numbers into arithmetic formulas).Again, the place that the formulas Fi(→a,→u) and Gj(→a) are coming from is just by looking at what the source code of these programs does and writing down the modal formula that expresses what that source does. That said, we're not just looking for formulas that in some intuitive sense correspond to what the program does; we're looking for formulas such that the displayed statements above are provable in PA.

By the fixed point theorem for provability logic, we then know that we can find formulas φi such that GL⊢φi↔Fi(→φ,G1(→φ),…,Gn(→φ)), where →φ abbreviates φ1,…,φm; and moreover, we know that then, PA⊢[A()=ai]↔φi. (To apply the fixed point theorem, we need to use the fact that F(→a,G1(→a),…,Gn(→a)) is fully modalized, which it clearly is.) Thus, we can carry out the same steps as in the two-action, two-outcome case above.

Now let's revisit Vladimir's definition of the agent program UDT, suitably adapted to my notation here and to my convention of making the ai mutually exclusive:

There are finitely many sentences of the form "if such-and such ai holds, then such-and-such uj holds". Find all such sentences that are provable. If no such sentences were found, choose an arbitrary action---e.g., make F1 true and all other Fi false.

From all pairs (ai,uj) found in the previous step, choose the ai whose uj is highest in our preference ordering. If there are multiple such ai, choose any one, e.g. the one with the lowest index.

For each k, define Fk to be true if k=i (where i is the index of the ai we chose in the previous step).

To cash this out,

firstwe think of this definition as an algorithm with a halting oracle:Now we try to find Fi such that we can prove in PA that [UDT()=ai] is equivalent to Fi([UDT()=a1],…,[UDT()=am],[U()=u1],…,[U()=un]). This is somewhat tedious, but possible: for example, we can:

`Provable(...)`

condition is evaluated according to X. (Since there are only finitely many X, we can just figure this out by running the code, interpreting the call to`Provable(...)`

according to each X.)You could certainly write the Fi more compactly than this, but the above version shows the idea.

Having done all this, not only do we have a better understanding of how to translate Vladimir's specification into a formula of modal logic, we also have an idea of what to do with it: Given a universe program U() for which we can show statements of the form PA⊢[U()=uj]↔Gj([UDT()=a1],…,[UDT()=an]), we can now find formulas φi such that GL⊢φi↔Fi(→φ,G1(→φ),…,Gn(→φ)), and can then conclude that φi tells us whether UDT()---the program with the halting oracle---will output ai, given the universe U().

This may be rather daunting, given the complicated definition of Fi(→a,→u) from above, but finding the fixed point formulas φi and evaluating their truth value are both decidable (unless I'm making a mistake?), so at least in principle (and hopefully in practice), we can write a program that figures out the answer for us.

(I'm not sure whether this post will be helpful or interesting to anybody---feedback appreciated...)