# 3

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is a simplified introduction to existing ideas by Eliezer Yudkowsky, Wei Dai, Vladimir Nesov and myself. For those who already understand them, this post probably won't contain anything new. As always, I take no personal credit for the ideas, only for the specific mathematical model.

People usually think about decision-making in terms of causality, where an agent's action causes an outcome to happen. In this post I will outline a different idea of "causality", which can work even if regular causality isn't available. For example, in the world of natural numbers it doesn't make sense to say that one true arithmetical sentence somehow "causes" another arithmetical sentence to be true. Yet we can devise a notion of "logical causality" that will work in such worlds, and allow us to make decisions which maximize utility in some sense. The rest of this post is devoted to making these claims precise.

We start off with a simple formal model. Recall that provability logic is a modal logic where the operator $\Box$ is interpreted as provability in Peano arithmetic. Let's define a propositional formula with two variables, $F(a, o) = \Box(a \to o)$. Since all occurrences of propositional variables are inside $\Box$, we can build fixed points $A$ such that $A \leftrightarrow F(A, O)$ for different values of $O$, and see what happens.

1. If $O = \top$ (the true sentence), then $A \leftrightarrow \top$, which is easy to verify directly.

2. If $O = \bot$ (the false sentence), then $A \leftrightarrow \Box\bot$, which can be verified by Löb's theorem.
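For readers who want to see case 2 spelled out, here is one possible derivation. It is a sketch that assumes the formula under discussion is $F(a, o) = \Box(a \to o)$, so that the fixed point satisfies $A \leftrightarrow \Box\neg A$. The only ingredients are necessitation, distribution of $\Box$ over implication, and transitivity ($\Box p \to \Box\Box p$), which holds for provability in Peano arithmetic and is derivable from Löb's axiom in provability logic.

$$
\begin{aligned}
&\vdash A \leftrightarrow \Box\neg A && \text{(fixed point with } O = \bot\text{)} \\
&\vdash \Box\neg A \to \Box\Box\neg A && \text{(transitivity)} \\
&\vdash \Box\Box\neg A \to \Box A && \text{(necessitation and distribution on line 1)} \\
&\vdash A \to \Box A \wedge \Box\neg A && \text{(chaining lines 1 to 3)} \\
&\vdash A \to \Box\bot && \text{(since } \Box A \wedge \Box\neg A \to \Box\bot\text{)} \\
&\vdash \Box\bot \to \Box\neg A && \text{(since } \bot \to \neg A\text{)} \\
&\vdash \Box\bot \to A && \text{(line 1 again)} \\
&\vdash A \leftrightarrow \Box\bot
\end{aligned}
$$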

Now let's go a little deeper, and let $O$ itself be defined in terms of $A$, using the fixed point to tie everything together:

3. In the simplest case where $O = A$, we see that $A \leftrightarrow \top$.

4. If $O = \neg A$ (the negation of $A$), we see that $A \leftrightarrow \Box\bot$. You can verify that by hand, using the methods from the previous points.
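The four cases can also be sanity-checked mechanically. The sketch below is not a proof and not part of the original argument. It assumes $F(a, o) = \Box(a \to o)$ and evaluates the fixed point on a finite chain of "worlds", where world $k$ sees exactly the worlds $j < k$ and $\Box\varphi$ holds at a world iff $\varphi$ holds at every world it sees. On such a chain $\top$ is true everywhere and $\Box\bot$ is true only at world 0. The names `solve_fixed_point`, `TOP` and `BOX_BOT` are made up for this illustration.

```python
def solve_fixed_point(outcome, depth=6):
    """Evaluate A <-> Box(A -> O) world by world on a finite chain.

    World k sees exactly the worlds j < k, so Box(phi) holds at world k
    iff phi holds at every earlier world. `outcome` maps the truth value
    of A at a world to the truth value of O at that same world.
    """
    a_vals, o_vals = [], []
    for _ in range(depth):
        # Box(A -> O) at the current world: A -> O held at all earlier worlds.
        a = all((not aj) or oj for aj, oj in zip(a_vals, o_vals))
        a_vals.append(a)
        o_vals.append(outcome(a))
    return a_vals, o_vals


DEPTH = 6
TOP = [True] * DEPTH                      # trace of the true sentence
BOX_BOT = [k == 0 for k in range(DEPTH)]  # Box(false) holds only at world 0

cases = {
    "1. O = true": (lambda a: True, TOP),
    "2. O = false": (lambda a: False, BOX_BOT),
    "3. O = A": (lambda a: a, TOP),
    "4. O = not A": (lambda a: not a, BOX_BOT),
}

for name, (outcome, expected) in cases.items():
    a_vals, _ = solve_fixed_point(outcome, DEPTH)
    print(name, "| A matches the expected trace:", a_vals == expected)
```

In case 4 the computed values of $O$ are false at world 0 and true at every later world, which is the chain picture of "$O$ is true unless arithmetic is inconsistent".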

By now an interesting pattern is starting to emerge. Let's say $O$ is the formula that results from the variable $o$ after applying the fixed point. In all cases except (2), where $O$ is false by fiat, we see that $A$ tries and succeeds to make $O$ true! Metaphorically, here's what $A$ is doing: "If I can prove that my truth implies the truth of $O$, then I choose to be true, otherwise I choose to be false". That's the basic idea of "logical causality", in the simplest possible setting where it works.

Note that all sentences above can be interpreted as sentences about natural numbers within Peano arithmetic, by using Gödel numbering to spell out the provability operator. For example, $A$ in point (4), where $O = \neg A$, would be a long sentence about the natural numbers, and $O$ would be a slightly longer sentence that has $A$ embedded inside it. In decision-making terms, $A$ is the "action" and $O$ is the "outcome" that logically depends on the action, but $O$'s dependence on $A$ is not explicit, because both are closed formulas without free variables. Instead, the dependence is due to $A$ knowing the Gödel number of $O$.

To skip ahead a bit, it's easy to go from a formalism about natural numbers to a formalism about computer programs, which know their own source code by quining. The examples above can be directly translated to programs that have access to a provability oracle for Peano arithmetic, or (with some caveats) to programs that successively search for proofs and check them manually. In fact, that was the original motivation for this line of research, because "programs trying to influence a larger program that they are embedded in" might be a good description of our own world, at least from the perspective of a program :-)

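Going back to the math, perhaps the approach will break down if we have more possible choices than just true or false? Let's assume a more general setting, where we have formulas $F_1(a_1, \dots, a_m, o_1, \dots, o_n), \dots, F_m(a_1, \dots, a_m, o_1, \dots, o_n)$. We will denote truth assignments to $a_1, \dots, a_m$ as $\vec{a}$, and truth assignments to $o_1, \dots, o_n$ as $\vec{o}$. We will have a preference ordering on all possible $\vec{o}$, and will be interested in fixed points $A_1, \dots, A_m$ such that $A_i \leftrightarrow F_i(A_1, \dots, A_m, O_1, \dots, O_n)$ for all $i$. The formulas $F_i$ will encode the execution of this algorithm: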

1. There are finitely many sentences of the form "if such-and-such $\vec{a}$ holds, then such-and-such $\vec{o}$ holds". Find all such sentences that are provable. If no such sentences were found, choose any $\vec{a}$, e.g. all false.

2. From all pairs $(\vec{a}, \vec{o})$ found in the previous step, choose the $\vec{a}$ whose $\vec{o}$ is highest in our preference ordering. If there are multiple such $\vec{a}$, choose any one, e.g. the lexicographically smallest.

3. For each $i$, define $F_i$ to be true iff the chosen $\vec{a}$ assigns true to $a_i$.
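The three steps above can be sketched as a small function. This is an illustration under stated assumptions, not the post's formal construction: the set `provable` is a hypothetical stand-in for "the sentences found provable in step 1", the preference ordering is represented by a numeric scoring function, and ties are broken lexicographically with true ordered before false. The name `choose_action` is invented here.

```python
def choose_action(provable, preference, m):
    """Pick the action assignment with the best provable outcome.

    provable   -- set of (action, outcome) pairs of truth-value tuples; a pair
                  stands for a provable sentence "if action then outcome"
                  (hypothetical stand-in for a provability oracle)
    preference -- function from an outcome tuple to a number, higher is better
    m          -- number of action variables
    """
    if not provable:
        # Step 1 fallback: nothing was provable, so default to all false.
        return (False,) * m
    # Step 2: keep the actions whose provable outcome is ranked highest.
    best = max(preference(o) for _, o in provable)
    candidates = [a for a, o in provable if preference(o) == best]
    # Tie-break: lexicographically smallest, with True ordered before False.
    return min(candidates, key=lambda a: tuple(not x for x in a))


# Step 3: F_i is true iff component i of the chosen tuple is True.
# Example with m = n = 1, preferring the outcome variable to be true, when
# only "a -> not o" and "not a -> not o" are provable:
provable = {((True,), (False,)), ((False,), (False,))}
print(choose_action(provable, lambda o: int(o[0]), 1))  # prints (True,)
```

Both provable sentences lead to the same outcome here, so the tie-break decides and the lexicographically first action is chosen.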

To illustrate the definition of $F_i$ in more detail, let's work through the case where $m = n = 1$ and our preference ordering wants $o_1$ to be true. On step 1 we have four possible sentences in lexicographic order: $a_1 \to o_1$, $a_1 \to \neg o_1$, $\neg a_1 \to o_1$, and $\neg a_1 \to \neg o_1$. Then $F_1$ is true iff the chosen $\vec{a}$ assigns true to $a_1$, which can only happen on step 2. The corresponding $\vec{o}$ can either assign true to $o_1$, which happens iff the first sentence is provable, or assign false to $o_1$, which happens iff the second sentence is provable but the first and third aren't. Simplifying, we obtain the definition $F_1(a_1, o_1) = \Box(a_1 \to o_1) \vee (\Box(a_1 \to \neg o_1) \wedge \neg\Box(\neg a_1 \to o_1))$. By now it should be clear how to use the same algorithm for larger $m$ and $n$.
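The simplification step can be written out as follows. This is a sketch that reads the case analysis above literally, with $a_1$ and $o_1$ as the two variables, and uses the propositional identity $p \vee (\neg p \wedge q) \leftrightarrow p \vee q$ to drop the redundant conjunct.

$$
\begin{aligned}
F_1(a_1, o_1) &= \Box(a_1 \to o_1) \vee \big(\neg\Box(a_1 \to o_1) \wedge \neg\Box(\neg a_1 \to o_1) \wedge \Box(a_1 \to \neg o_1)\big) \\
&= \Box(a_1 \to o_1) \vee \big(\Box(a_1 \to \neg o_1) \wedge \neg\Box(\neg a_1 \to o_1)\big)
\end{aligned}
$$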

Which choices of $O_1, \dots, O_n$ are amenable to this approach? Intuitively, it seems that "fair" deterministic problems are those where every choice of "actions" $\vec{a}$ logically implies at least one "outcome" $\vec{o}$, and these implications are apparent to the agent (i.e. provable). But that's exactly the class of problems where our approach obviously gives the right answer! So it seems that having multiple possible choices doesn't cause problems.

For example, let's take $m = 2$, $n = 1$ and $O_1 = A_1 \wedge \neg A_2$, and assume that the preference ordering wants $O_1$ to be true. Then it's easy to see that the chosen $\vec{a}$ is either (true, false), which provably implies $O_1$, or something else that also provably implies $O_1$. But the latter is impossible, because choosing any other $\vec{a}$ makes $O_1$ false, so it can't be provable as long as the logic is sound. (Of course the logic doesn't prove its own soundness, but we're reasoning from the outside now.) Therefore the chosen $\vec{a}$ is (true, false), and $O_1$ is true.
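The example can be simulated on a finite chain of worlds, as a sanity check rather than a proof. The sketch assumes $m = 2$, $n = 1$ and $O_1 = A_1 \wedge \neg A_2$, treats a sentence as "provable" at world $k$ iff it held at every world $j < k$, and breaks ties lexicographically with true before false. All function names are invented for this illustration.

```python
from itertools import product

M = 2  # two action formulas A_1, A_2 and one outcome O_1


def outcome(a):
    """Assumed outcome formula: O_1 = A_1 and not A_2."""
    return (a[0] and not a[1],)


def choose_action(provable):
    """The post's algorithm, preferring O_1 true; ties go to the
    lexicographically smallest action with True ordered before False."""
    if not provable:
        return (False,) * M
    best = max(o[0] for _, o in provable)
    candidates = [a for a, o in provable if o[0] == best]
    return min(candidates, key=lambda a: tuple(not x for x in a))


def simulate(depth=5):
    """Return the (action, outcome) pair realized at each world of the chain."""
    sentences = [(a, (o,))
                 for a in product([True, False], repeat=M)
                 for o in (True, False)]
    history = []
    for _ in range(depth):
        # "if a then o" held at an earlier world unless a was the action
        # taken there and o was not the outcome obtained there.
        provable = {
            (a, o) for a, o in sentences
            if all(past_a != a or past_o == o for past_a, past_o in history)
        }
        action = choose_action(provable)
        history.append((action, outcome(action)))
    return history


for world, (action, result) in enumerate(simulate()):
    print(world, action, result)
```

World 0 sees no other worlds, so every sentence counts as provable there and the tie-break picks (true, true). From world 1 onward the chosen action is (true, false) and $O_1$ is true, in line with the argument above.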

One counterintuitive feature of our approach is that some "actions" might logically imply multiple different "outcomes" after taking the fixed point, because if an action is in fact not taken, it logically implies anything at all. However, the approach is designed so that the existence of such "spurious" logical implications can never lead to a suboptimal outcome. The proof of that is left as an easy exercise.