This post requires some knowledge of mathematical logic and computability theory. The basic idea is due to Vladimir Nesov and me.

Let the universe be a computer program U that can make calls to a halting oracle. Let the agent be a subprogram A within U that can also make calls to the oracle. The source code of both A and U is available to A.

def U():

# Fill boxes, according to predicted action.

box1 = 1000

box2 = 1000000 if (A() == 1) else 0

# Compute reward, based on actual action.

return box2 if (A() == 1) else (box1 + box2)

A complete definition of U should also include the definition of A, so let's define it. We will use the halting oracle only as a provability oracle for some formal system S, e.g. Peano arithmetic. Here's the algorithm of A:

- Play chicken with the universe: if S proves that A()≠a for some action a, then return a.
- For every possible action a, find some utility value u such that S proves that A()=a ⇒ U()=u. If such a proof cannot be found for some a, break down and cry because the universe is unfair.
- Return the action that corresponds to the highest utility found on step 2.

Now we want to prove that the agent one-boxes, i.e. A()=1 and U()=1000000. That will follow from two lemmas.

Lemma 1: S proves that A()=1 ⇒ U()=1000000 and A()=2 ⇒ U()=1000. Proof: you can derive that from just the source code of U, without looking at A at all.

Lemma 2: S doesn't prove any other utility values for A()=1 or A()=2. Proof: assume, for example, that S proves that A()=1 ⇒ U()=42. But S also proves that A()=1 ⇒ U()=1000000, therefore S proves that A()≠1. According to the first step of the algorithm, A will play chicken with the universe and return 1, making S inconsistent unsound (thx Misha). So if S is sound, that can't happen.

We see that the agent defined above will do the right thing in Newcomb's problem. And the proof transfers easily to many other toy problems, like the symmetric Prisoner's Dilemma.

### But why? What's the point of this result?

There's a big problem about formalizing UDT. If the agent chooses a certain action in a deterministic universe, then it's a true fact about the universe that choosing a different action would have caused Santa to appear. Moreover, if the universe is computable, then such silly logical counterfactuals are not just true but provable in any reasonable formal system. When we can't compare actual decisions with counterfactual ones, it's hard to define what it means for a decision to be "optimal".

For example, one previous formalization searched for formal proofs up to a specified length limit. Problem is, that limit is a magic constant in the code that can't be derived from the universe program alone. And if you try searching for proofs without a length limit, you might encounter a proof of a "silly" counterfactual which will make you stop early before finding the "serious" one. Then your decision based on that silly counterfactual can make it true by making its antecedent false... But the bigger problem is that we can't say exactly what makes a "silly" counterfactual different from a "serious" one.

In contrast, the new model with oracles has a nice notion of optimality, relative to the agent's formal system. The agent will always return whatever action is proved by the formal system to be optimal, if such an action exists. This notion of optimality matches our intuitions even though the universe is still perfectly deterministic and the agent is still embedded in it, because the oracle ensures that determinism is just out of the formal system's reach.

**P.S.** I became a SingInst research associate on Dec 1. They did not swear me to secrecy, and I hope this post shows that I'm still a fan of working in the open. I might just try to be a little more careful because I wouldn't want to discredit SingInst by making stupid math mistakes in public :-)

This is one of the most interesting posts I've seen on Less Wrong. It is a non-trivial result and it goes a long way towards dealing with a lot of the problems with UDT. I'm deeply impressed and wish I had some helpful comment to make other than general praise.

(In the setting where there were no oracles, this problem was generally discussed on the decision theory list, with various conditions used to prove when intended moral arguments [A=A1=>U=U1] will in fact be proven by the agent (are agent-provable, not just provable). The "chicken rule" condition used here I figured out in explicit form sometime in April ("To avoid confusion, immediately perform any action that implies absurdity"), though I'm not sure anyone published an actual proof using this condition. A few days ago, cousin_it posted a thought experiment involving a Turing oracle, and I pointed out that using the chicken rule in this setting elicits a much nicer result.)

Would it be naive to hope for a criterion that roughly says: "A conditional P ⇒ Q is silly iff the 'most economical' way of proving it is to deduce it from ¬P or else from Q." Something like: "there exists a proof of ¬P or of Q which is strictly shorter than the shortest proof of P ⇒ Q"?

A totally different approach starts with the fact that your 'lemma 1' could be proved without knowing anything about A. Perhaps this could be deemed a sufficient condition for a counterfactual to be serious. But I guess it's not a necessary condition?

Congratulations! This is a key step forward. And also, congrats to the SIAI for getting both you and Vladimir Nesov as official researchers.

Sorry, I just had this image:

cousin_it: I want to use my 20% time to prevent the extermination of humanity.

Google overlord: Okay, and this would help Google ... how, exactly?

(I know, I know, oversimplification.)

Is this the first time an advanced decision theory has had a mathematical expression rather than just a verbal-philosophical one?

This totally deserves to be polished a bit and published in a mainstream journal.

Specifically, given

anyformal system S for reasoning about the world and agent's place in it, the chicken rule (step 1) forces S to generate consistent theories of consequences for all possible actions. This seems to crack a long-standing problem in counterfactual reasoning, giving a construction for counterfactual worlds (in form of consistent formal theories) from any formal theory that has actual world as a model.Is this right? I'm wondering if you're assuming soundness relative to the natural semantics about A, because in step one, it isn't clear that there is a contradiction in S rather than a failure of S to be sound with respect to the semantics it's supposed to model (what actions A can take and their utility). I might be confused, but wouldn't this entail contradiction of the

soundnessof S rathe... (read more)Would it be correct to say that S can't prove Lemma 2?

"S proves that A()=1 ⇒ U()=42. But S also proves that A()=1 ⇒ U()=1000000, therefore S proves that A()≠1" I don't see how this follows. Perhaps it is because, if the system was sound, it would never prove more than one value for U() for a given a, therefore by the principle of explosion it proves A()≠1? But that doesn't seem to actually follow. I'm aware that this is an old post, but on the off chance that anyone ever actually sees this comment, help would be appreciated.

Suppose S proves that A()≠1 and A()≠2. What does A return?

Is there a reason you're using logical conditionals rather than Pearlian counterfactual conditionals like TDT?

What about the prisoner's dilemma?

Here's a concrete example of a game between non-identical agents: agent A of this post, who maximizes outcome, and agent B, who cooperates iff the other player plays the same move.

You seem to imply that you consider this approach practically equivalent to your earlier approach with bounded proof length. Is that right?

But in your first post, you say you're not sure what happens with two player games, while in your second, you say "if agent A tries to reason about agent B in the same way, it will fail miserably."... (read more)

Wouldn't the same job be done by the agent using proper counterfactuals instead of logic ones, which seems like something that would also be needed for other purposes?

I don't know who (if anyone) has done any work on this, but when a human considers a counterfactual statement like "If Gore won in 2000" that is very underspecified, because the implicit assumption is to discard contradicting knowledge, but how to do that exactly is left open. Humans just know that they should assume something like "Bush didn't win Florida" instead of &q... (read more)

Stupid question: A() always returning 1 manifestly solves the Newcomb problem, no fancy proofs required. What am I missing?

The implementation of A in the post works correctly for many different U's.

You put the oracle outside the universe, but the logical impossibility of an oracle is still visible from inside the universe. For example, this agent (EDIT simplified from original comment):

I don't see why exploring a universe with such contradictions is useful for UDT, can you explain?

Another possible modification of the algorithm:

Return the action for which S does not prove that A()≠a, and which corresponds to the highest utility found on step 2 among all such actions.

I still have doubts about the consistency of this architecture. What if the agent sees a copy of itself perform some action in a situation exactly equal to the situation in which the agent finds itself now. Would it not mean that the agent can now prove that it would perform the same action? (There would be a difference between the agent and the copy, but only an "additive" difference - the agent will have additional knowledge that the copy wouldn't - so whatever the copy proved, the agent must also be able to prove. [And this fact would be prova... (read more)

Is it possible to change the definition of A() to something like this:

Let UX(X) be the function that results from the source of U() if all calls "A()" were changed to "eval(X)".

For some action p, prove that for any possible agent X, UX(X) <= UX("return p").

return p.

Then the agent would be able to prove that it returns a specific action, and yet there would be no contradictions.

It seems to me this should be better than the "chicken rule" (it's really funny and clever, though! :)). But I'm getting no response to sim... (read more)