Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is a simplified introduction to existing ideas by Eliezer Yudkowsky, Wei Dai, Vladimir Nesov and myself. For those who already understand them, this post probably won't contain anything new. As always, I take no personal credit for the ideas, only for the specific mathematical model.

People usually think about decision-making in terms of causality, where an agent's action causes an outcome to happen. In this post I will outline a different idea of "causality", which can work even if regular causality isn't available. For example, in the world of natural numbers it doesn't make sense to say that the sentence somehow "causes" the sentence to be true. Yet we can devise a notion of "logical causality" that will work in such worlds, and allow us to make decisions which maximize utility in some sense. The rest of this post is devoted to making these claims precise.

We start off with a simple formal model. Recall that provability logic is a modal logic where the operator is interpreted as provability in Peano arithmetic. Let's define a propositional formula with two variables . Since all occurrences of propositional variables are inside , we can build fixed points such that for different values of , and see what happens.

  1. If (the true sentence), then , which is easy to verify directly.

  2. If (the false sentence), then , which can be verified by Löb's theorem.

Now let's go a little deeper, and let itself be defined in terms of , using the fixed point to tie everything together:

  1. In the simplest case where , we see that .

  2. If (the negation of ), we see that . You can verify that by hand, using the methods from the previous points.

By now an interesting pattern is starting to emerge. Let's say is the formula that results from the variable after applying the fixed point. In all cases except (2) where is false by fiat, we see that tries and succeeds to make true! Metaphorically, here's what is doing: "If I can prove that my truth implies the truth of , then I choose to be true, otherwise I choose to be false". That's the basic idea of "logical causality", in the simplest possible setting where it works.

Note that all sentences above can be interpreted as sentences about natural numbers within Peano arithmetic, by using Gödel numbering to spell out the provability operator. For example, in point (4) would be a long sentence about the natural numbers, and would be a slightly longer sentence that has embedded inside it. In decision-making terms, is the "action" and is the "outcome" that logically depends on the action, but 's dependence on is not explicit, because both are closed formulas without free variables. Instead, the dependence is due to knowing the Gödel number of .

To skip ahead a bit, it's easy to go from a formalism about natural numbers to a formalism about computer programs, which know their own source code by quining. The examples above can be directly translated to programs that have access to a provability oracle for Peano arithmetic, or (with some caveats) to programs that successively search for proofs and check them manually. In fact, that was the original motivation for this line of research, because "programs trying to influence a larger program that they are embedded in" might be a good description of our own world, at least from the perspective of a program :-)

Going back to the math, perhaps the approach will break down if we have more possible choices than just true or false? Let's assume a more general setting, where we have formulas . We will denote truth assignments to as , and truth assignments to as . We will have a preference ordering on all possible , and will be interested in fixed points such that for all . The formulas will encode the execution of this algorithm:

  1. There are finitely many sentences of the form "if such-and such holds, then such-and-such holds". Find all such sentences that are provable. If no such sentences were found, choose any , e.g. all false.

  2. From all pairs found in the previous step, choose the whose is highest in our preference ordering. If there are multiple such , choose any one, e.g. the lexicographically smallest.

  3. For each , define to be true iff the chosen assigns true to .

To illustrate the definition of in more detail, let's work through the case where and our preference ordering wants to be true. On step 1 we have four possible sentences in lexicographic order: , , , and . Then is true iff the chosen assigns true to , which can only happen on step 2. The corresponding can either assign true to , which happens iff the first sentence is provable, or assign false to , which happens iff the second sentence is provable but the first and third aren't. Simplifying, we obtain the definition for . By now it should be clear how to use the same algorithm for .

Which choices of are amenable to this approach? Intuitively, it seems that "fair" deterministic problems are those where every choice of "actions" logically implies at least one "outcome" , and these implications are apparent to the agent (i.e. provable). But that's exactly the class of problems where our approach obviously gives the right answer! So it seems that having multiple possible choices doesn't cause problems.

For example, let's take , and assume that the preference ordering wants to be true. Then it's easy to see that the chosen is either (true, false), which provably implies , or something else that also provably implies . But the latter is impossible, because choosing any other makes false, so it can't be provable as long as the logic is sound. (Of course the logic doesn't prove its own soundness, but we're reasoning from the outside now.) Therefore the chosen is (true, false), and is true.

One counterintuitive feature of our approach is that some "actions" might logically imply multiple different "outcomes" after taking the fixed point, because if an action is in fact not taken, it logically implies anything at all. However, the approach is designed so that the existence of such "spurious" logical implications can never lead to a suboptimal outcome. The proof of that is left as an easy exercise.

New Comment
10 comments, sorted by Click to highlight new comments since:

I know this is supposed to be just introductory, but I actually think that the complete reformulation of UDT-with-a-halting-oracle in terms of modal logic is really interesting! For starters, it allows us to compare UDT and modal agents in the same framework (with the right 's, we can see this version of UDT as a modal agent). It would also be really neat if we could write an "interpreter" that allows us to write UDT as a program calling a halting oracle, and then evaluate what it does by way of modal logic.

But also, it allows us to give a nice definition of "decision theory" and "decision problem" in the context with halting oracles. I was planning on posting soon about the definition that I showed you when you were visiting, which is designed for actually computable agents with bounded proof lengths, and is more complicated because of that. As a stepping stone to that, using the provability logic framework, I think we can define:

  • a decision theory to be a set of fully modalized formulas , for (fully modalized meaning that the propositional arguments only appear inside boxes);
  • and a decision problem to be a set of formulas , for , which do not need to be modalized (which isn't a problem because they're not self-referential).

The condition that must be fully modalized, but doesn't need to be, seems to be the natural thing corresponding to how in the bounded case, we allow the universe to run the agent, but the agent must use abstract reasoning about the universe, it can't just run it.

Given such a set, we can define , for , as follows: Then the actual action taken by decision theory when run on the decision problem is the modal fixed point of the equations , for .


Yes, modal logic seems to be the most natural setting for these kinds of ideas. Also the "chicken rule" from the usual oracle formulations is gone now, I can't remember why we needed it anymore.

In a probabilistic setting (with a prior over logical theories), EDT wants to condition on the possible actions with a Bayesian conditional, in order to then find the expected utility of each action.

If the agent can prove that it will take a particular action, then conditioning on this action may yield inconsistent stuff (divide-by-zero error for Bayesian conditional). This makes the result ill-defined.

The chicken rule makes this impossible, ensuring that the conditional probabilities are well defined.

So, the chicken rule at least seems useful for EDT.

A model of UDT with a halting oracle searches only for one utility value for each action. I'm guessing the other formulation just wasn't obvious at the time? (I don't remember realizing the possibility of playing chicken implicitly before Will Sawin advertised it to me, though I think he attributed it to you.)

However, the approach is designed so that these "spurious" logical implications are unprovable, so they don’t interfere with decision-making. The proof of that is left as an easy exercise.

I don't think this is technically true as stated; it seems to be possible that the agent proves some spurious counterfactuals as long as the outcome it does in fact obtain is the best possible one. (This is of course harmless!) Say the agent has two possible actions, and , leading to outcomes and , respectively. The latter is preferred, and these are the only two outcomes. Suppose that happens to be lexicographically lower than in the agent's reasoning. Then it seems to be provable that the agent will in fact choose , meaning that it's provable that it won't choose , meaning that it finds both and the spurious in the first step.

So I think the correct statement is a disjunction: The agent obtains the highest possible outcome or it finds no spurious counterfactuals.

Ooh, nice: we don't need to eliminate all spurious counterfactuals, only the malignant ones!


Yes, that's correct. Thanks!

This can evidently deal with probabilistic settings; the would encode an expected value which the agent is trying to maximize, with the obvious preference ordering. It would get unrealistic for very complicated situations, since expected values are very often intractable and must be approximated. Given that we know the exact probabilistic model, this is a sort of logical uncertainty.

How might this be modified to deal with logical uncertainty?

A simple idea that comes to mind is to modify the procedure to look for lower bounds on expected values, rather than trying to prove exact expected values. The procedure of choosing the best & breaking ties remains unmodified. (I'm not really trying to think about the modalized form at the moment.)

This should work for a lot more cases, but still has a feeling of being potentially unrealistic. It doesn't take advantage of any approaches to logical uncertainty that have been discussed.

One really nice thing about the approach is that we do not have to identify the agent within the universe. All that is needed is the logical implications of supposing that the agent takes various actions. As a result, the agent will treat programs provably equivalent to its own as copies of itself with no fuss.

We can take a prior on all programs as our universe, with no special interface between the agent and the program as in AIXI. (We do need to somehow specify utilities on programs, which is a non-obvious procedure.) The agent then searches for proofs that if it takes some particular action, then at least some particular expected utility is achieved. There are no longer finitely many possible outcomes, so modal tricks to do this without a halting oracle won't work. Instead, the agent must be doing this for some bounded time. It then takes the action with highest proven utility.

This implicitly contains a sort of chicken rule, since if the agent can prove that it will not take a particular , it can proceed to prove arbitrarily good for that . So, it will want to take that action.

I wouldn't really call this logical causality, by the way; to me, that suggests the ability to take arbitrarily mathematical counterfactuals ("What if ?"), not only counterfactuals in service of actions.