### Or: Causal Decision Theory and Substitution

**Previously:**

0. Decision Theories: A Less Wrong Primer

1. The Problem with Naive Decision Theory

**Summary of Post: ***We explore the role of substitution in avoiding spurious counterfactuals, introduce an implementation of Causal Decision Theory and a CliqueBot, and set off in the direction of Timeless Decision Theory.*

In the last post, we showed the problem with what we termed Naive Decision Theory, which attempts to prove counterfactuals directly and pick the best action: there's a possibility of spurious counterfactuals which lead to terrible decisions. We'll want to implement a decision theory that does better; one that is, by any practical definition of the words, foolproof and incapable of error...

I know you're eager to get to Timeless Decision Theory and the others. I'm sorry, but I'm afraid I can't do that just yet. This background is too important for me to allow you to skip it...

Over the next few posts, we'll create a sequence of decision theories, each of which will outperform the previous ones (the new ones will do better in some games, without doing worse in others^{0}) in a wide range of plausible games.

Let's formalize the setup a bit more: we've written a program **X**, which is paired with another program **Y** in a game **G**. Both **X** and **Y** have access to the source codes of **X**, **Y**, and **G**, and **G** calculates the payouts to the programmers of **X** and of **Y** in terms of their outputs alone. That is, **X** and **Y** should be functions of two inputs, and **G** should do the following:

- Calculate the output of
**X**(code**G**, 1, code**Y**).^{1} - Calculate the output of
**Y**(code**G**, 2, code**X**). - From these values, calculate the payoffs to the programmers.

We write the expected payout to **X**'s programmer as **U**, by which we mean **U**(output **X**, output **Y**), by which we really mean **U**(output **X**(code **G**, 1, code **Y**), output **Y**(code **G**, 2, code **X**)); similarly, we'll write **V** as the expected payout to **Y**'s programmer. We'll need to use quining in order for **G** to feed its own source code into **X** and **Y**, but this is straightforward. (I said earlier that **X** should have access to its own source code as well, but the programmer can do this via quining, with no help from **G**.)

We're also assuming that **X** is equipped with a valid inference module that deduces *some* true statements and *no* false ones; it can return "true", "false" or "unknown" when given a statement. (One such example is an automated theorem prover which gives up after a certain length of time.)

Last time, we tried the most straightforward idea—trying to directly prove counterfactuals about the value of **U** for various outputs of **X**—but we ran into the problem of spurious counterfactuals, which are logically true but practically misleading. The various decision theories we'll discuss each represent different ways of getting around this obstacle.

### Idea 3: Substitute to Avoid Self-Fulfilling Prophecies

Our spurious counterfactuals were harmful because of the possibility of a self-fulfilling prediction; one way to avoid such a thing is to base your output only on deductions about statements that are in some sense "independent" of the final output. We'll illustrate this principle by doing a simple task that Naive Decision Theory failed to do: infer the correct payoff matrix for **G** from its source code.

For each pair of outputs (*x _{i}*,

*y*), we can simply substitute

_{j}*x*for (output

_{i}**X**) and

*y*for (output

_{j}**Y**) in the code for

**G**, obtaining

*unconditional*deductions of

**U**(

*x*,

_{i}*y*) and

_{j}**V**(

*x*,

_{i}*y*).

_{j}^{2}Note that this is

*not*the counterfactual "if (output

**X**)=

*x*and (output

_{i}**Y**)=

*y*then

_{j}**U**=

*u*", but instead a simple mathematical object that doesn't depend on the actual values of (output

_{ij}**X**) and (output

**Y**)!

This is a powerful enough principle that, if we use it properly, we arrive at a formalized version of the hallowed classical theory:

### Causal Decision Theory: No Longer Naive

**X** can always simply play a default Nash equilibrium strategy, but there's an obvious case where we want **X** to depart from it: if the output of **Y** is predictable, **X** can often gain more by stepping away from its equilibrium to exploit **Y**'s stupidity (or coordinate with it on a different equilibrium). We might program **X** as follows:

- Try to deduce the output of
**Y**(code**G**, 2, code**X**); note that this may be a mixed strategy. - If this succeeds, output the
*x*which maximizes_{i}**U**(*x*,(output_{i}**Y**)). (There must be at least one pure strategy which does so.) - Otherwise, output a default Nash equilibrium.

Note that even though there is some circularity involved—**Y** may base its output on the code of **X**—this isn't like a spurious counterfactual, because the deduction of **U**(*x _{i}*,(output

**Y**)) is independent of the output of

**X**. In particular, if

**X**succeeds at deducing the output of

**Y**, then that really will be the output of

**Y**in the game (presuming, as usual, that the inference module of

**X**is valid).

This decision theory acts optimally in "one-player games" where the output of **Y** is independent of the source code of **X** (and can be deduced by **X**'s inference module), and in zero-sum two-player games as well. Finally, **X** is un-exploitable in the following sense: if (*x _{i}*,

*y*) is the Nash equilibrium with the lowest value for

_{j}**U**(

*x*,

_{i}*y*), then

_{j}**U**(

**X**,

**Y**) ≥

**U**(

*x*,

_{i}*y*) or

_{j}**V**(

**X**,

**Y**) ≤

**V**(

*x*,

_{i}*y*). (Namely, if

_{j}**Y**is trying to maximize its own payoff, then

**X**should wind up with at least the value of its worst Nash equilibrium.)

In a philosophical sense, **X** does "consider" **Y**'s inference of (output **X**) as a factor in deducing (output **Y**), but it considers it as if **Y** were instead inferring the output of some unrelated program, and in particular there's no "feedback" from this consideration to the second step. **X** acts as if its own eventual output cannot cause **Y**'s output; that is, we've written an instance of Causal Decision Theory.^{3}

We can also see that if a dominant strategy exists (like two-boxing in Newcomb's Problem^{4} or defecting in the Prisoner's Dilemma), this CDT algorithm will always pick it. We should ask at this point: in our situation with perfect knowledge of source codes, can we do better than the classical solution?

### Cliquish Decision Theory: Better, but Not Good Enough

We can, of course. There's at least one case where defecting on the Prisoner's Dilemma is clearly a bad idea: if the source code of **Y** is identical to the source code of **X**. So we can upgrade **X** by adding an additional rule: If the payoff matrix is a prisoner's dilemma (a well-defined subset of payoff matrices), and the code of **Y** equals the code of **X**, then cooperate. (I've called algorithms like this CliqueBots before, since they cooperate only with themselves and defect against everyone else.)

Furthermore, we can extend this to cases where the source codes are different but equivalent; let **X** do the following when given a game **G** and an opponent **Y**:

- Deduce the payoff matrix of
**G**. If**G**is not a prisoner's dilemma, then implement Causal Decision Theory. - If
**G**is a prisoner's dilemma, try to deduce "for all opponents**Z**, output**X**(code**G**, 1, code**Z**)=*Cooperate*if and only if output**Y**(code**G**, 1, code**Z**)=*Cooperate*if and only if output**X**(code**G**, 2, code**Z**)=*Cooperate*if and only if output**Y**(code**G**, 2, code**Z**)=*Cooperate*".^{5} - If this succeeds, then cooperate; else defect.

This now satisfies #1-3 of my criteria for a better decision theory (and we could add another rule so that it one-boxes on Newcomblike problems and satisfies #4), but it completely fails #5: this is a well-justified ad-hoc exception, but it's still an ad-hoc exception, and it's not that enlightening as examples go. We'll want something that achieves the same result, but for a genuine reason rather than because we've added individual common-sense patches. (One problem with patches is that they don't tell you how to solve more complicated problems- extending the analysis to three-player games, for instance.)

Furthermore, Cliquish Decision Theory misses out on a lot of potential cooperation: there are multiple strict improvements on Causal Decision Theory which cooperate with each other but don't all cooperate with the same third parties, and it's simple to show that Cliquish Decision Theory can't cooperate with any of these. So we'll need new ideas if we're going to do better.

Our next steps will be toward Timeless Decision Theory. If you've considered TDT on the philosophical level, you'll recognize the concept of substituting your (hypothetical) output for your opponent's (hypothetical) prediction of your output, and using *those* outcomes to decide what to do. Our setup gives us a clean general way to implement this messy idea:

### Idea 4: Substitute for the Source Code of X (as an input of Y)

That is, instead of trying to deduce the output of **Y**(code **G**, 2, code **X**) and then pair it with each action *x _{i}*, we can deduce the output of

**Y**(code

**G**, 2,

*some other code that corresponds to x*) and pair it with the deduced action. I'm being vague about this, partly to generate some suspense for the next post, but mostly because the most obvious way to go about it has two major flaws...

_{i}**Next:** Part III: Formalizing Timeless Decision Theory

### Notes

**0.** This is not quite true for Updateless Decision Theory, which explicitly trades off worse performance in some rounds for a better average performance. But we'll get to that in due time.

Also, it's possible to write an opponent **Y** that "rewards stupidity" by caring about things besides **X**'s outputs: it might let **X** attain its best outcome only if **X** is provably a causal decision theorist, for instance. But in such cases, **Y** isn't really optimizing its own payoffs (as listed) in any meaningful sense, and so this is not a fair problem; we're tacitly assuming throughout this analysis that **X** is a real attempt to optimize **U** and **Y** is a real attempt to optimize **V**, and both of these depend only on the outputs.

**1.** **X** and **Y** need to know which is player 1 in game **G** and which is player 2. Also, the steps of **G** don't have to be done in that sequential order- they can be calculated in parallel or any other sensible way. (In Part III we'll introduce an interesting game which first flips a coin to decide which one is Player 1 and which one is Player 2.)

**2.** In fact, the correct payoff matrix is necessary even to implement the Nash equilbrium strategy! I glossed over this before by assuming that, in addition to the source code, **X** was given a valid payoff matrix; but here, we see that there is a sensible way to read off the matrix from the code of **G** (if the inference module of **X** is up to the task). We'll assume from here on out that this is the case.

**3.** There are other variants of this CDT algorithm: for instance, **X** might only output argmax(**U**(*x*,(output **Y**))) if the value is greater than the value of **X**'s default Nash equilibrium. This algorithm can "insist on a good result or else no deal"; for instance, in the game below, if **X** takes as default the equilibrium that nets xer 2, the new algorithm will refuse to settle for 1, when the old algorithm would often have done so.

(2,1) | (0,0) |

(0,0) | (1,2) |

On the other hand, when playing an opponent that always picks the second column (i.e. a one-player game), this new algorithm will wind up with nothing! So there's a tradeoff involved, where other algorithms can get (in many cases) the best of both worlds.

**4.** Maybe the best way to implement Newcomb's Problem in this setting is to introduce the game **N** as follows (**X** chooses the row and gets the first listed payoff):

(100,1) | (0,0) |

(101,0) | (1,1) |

Your opponent is the predictor **P**, which is programmed as follows (when the game is **N**):

- Try to deduce "output
**X**(code**N**, code**P**)=row 1". - If this succeeds, output column 1; else output column 2.

If you expect to face **N** against **P**, then you have every incentive to make it an easy deduction that output **X**(code **N**, code **P**)=1.

**5.** There may be a better way to do this, but most simplifications will fall short of CDT in one way or another. For instance, if we just have **X** try to deduce "output **X**(code **G**, 1, code **Y**)=output **Y**(code **G**, 2, code **X**)", then it's quite possible for **X** to wind up cooperating with a CooperateBot, which is suboptimal performance on a one-player game. (A proof of that statement can be generated by Löb's Theorem.)