## Or: The Problem with Naive Decision Theory

**Previously:** Decision Theories: A Less Wrong Primer

**Summary of Sequence:** *In **the context of a tournament for computer programs**, I give almost-explicit versions of causal, timeless, ambient, updateless, and several other decision theories. I explain the mathematical considerations that make decision theories tricky in general, and end with a bunch of links to the relevant recent research. This sequence is heavier on the math than the primer was, but is meant to be accessible to a fairly general audience. Understanding the basics of game theory (and Nash equilibria) will be essential. Knowing about things like Gödel numbering, quining and Löb's Theorem will help, but won't be required.*

**Summary of Post:** *I introduce a context in which we can avoid most of the usual tricky philosophical problems and formalize the decision theories of interest. Then I show the chief issue with what might be called "naive decision theory": the problem of spurious counterfactual reasoning. In future posts, we'll see how other decision theories get around that problem.*

In my Decision Theory Primer, I gave an intuitive explanation of decision theories; now I'd like to give a technical explanation. The main difficulty is that in the real world, there are all sorts of complications that are extraneous to the core of decision theory. (I'll mention more of these in the last post, but an obvious one is that we can't be sure that our perception and memory match reality.)

In order to avoid such difficulties, I'll need to demonstrate decision theory in a completely artificial setting: a tournament among computer programs.

You're a computer programmer entering a tournament for spectacular stakes. But you won't be competing in person: instead, you'll be submitting code for a program to represent you, and that program will be competing one-on-one with other programs in a series of games.

You don't know what the specific games are, but the contest has specified some ground rules:

- In each game, your program and its opponent will have to choose separately between several options. (There are independent sources of randomness available to each program, so mixed strategies are legal.)
- The games are pure strategic conflicts: the expected payouts to *each* programmer depend on the outputs of *both* programs, and can be calculated simply if you know those outputs. (In particular, you can represent each game as a payoff matrix in terms of the programs' outputs.) For example, your program might play the Prisoner's Dilemma against another program.
- Both programs have access to the source code of *each other* and of the game they're playing.
- The programs don't get to carry any memories from round to round, or modify their source code at any stage. (This makes the analysis *much* simpler, and makes it even more impressive if we find unexploitable programs which are capable of mutual cooperation.)
- While many of the programs were written by other programmers trying to win prizes for themselves, there are also some special algorithms included to test your reactions; for example, there could be programs that always cooperate in the Prisoner's Dilemma, or an instance of Newcomb's Predictor.
- The tournament may also include some exact copies of your own program, but *you* won't get any of the prizes won by these extra copies, only the prizes won by your actual entry. (There's no way for your program to distinguish whether it's the original or a copy.)
- There are more than enough prizes to go around, so you don't have to make anyone else lose; you just want your program to win as much as it can.
- Also, there will be enough games played that most of the variance should wash out, so you should aim for the highest expected value per round rather than worry about the concave utility of more prizes.
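The ground rules above can be sketched as a tiny round harness. This is purely illustrative (the names `PD`, `cooperate_bot`, and `play_round` are invented here): in the real tournament each entry would receive the *source text* of both programs, but for brevity we pass the function objects themselves as a stand-in.

```python
# A game is a payoff matrix: payoffs[my_move][their_move] = (my_payout, their_payout).
# Here, the Prisoner's Dilemma with moves "C" (cooperate) and "D" (defect):
PD = {
    "C": {"C": (3, 3), "D": (0, 5)},
    "D": {"C": (5, 0), "D": (1, 1)},
}

def cooperate_bot(my_program, opponent_program, game):
    """One of the 'special algorithms': always cooperates, ignoring everything."""
    return "C"

def play_round(program_x, program_y, game):
    """One memoryless round: each program sees both entries and the game itself."""
    x_move = program_x(program_x, program_y, game)
    y_move = program_y(program_y, program_x, game)
    return game[x_move][y_move]

print(play_round(cooperate_bot, cooperate_bot, PD))  # (3, 3)
```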

So, what kind of program should you write?

In the next few posts, we'll examine several ideas, problems and decision theories, increasing our sophistication as we go. We'll use **X** to denote your program, and *x₁*, . . . , *xₙ* to denote its possible outputs; likewise, your opponent in the current round is **Y** with possible outputs *y₁*, . . . , *yₘ*. We'll let **U**(*xᵢ*, *yⱼ*) denote the resulting payout to you if **X** outputs *xᵢ* and **Y** outputs *yⱼ*.

### Idea 1: Play defense with a Nash equilibrium

In our example, we know what utility function the opponent should have: its own expected payout.^{0} Any such game has at least one Nash equilibrium, a pair of strategies (which may be mixed) such that if **X** and **Y** both adopted them, then neither would be better off unilaterally switching strategies. In that sense, at least, if **X** plays a Nash equilibrium, it can be sure of not being exploited by **Y**. (In particular, **X** will never end up tricked into cooperating in the Prisoner's Dilemma while **Y** defects.)

Of course, there may be more than one Nash equilibrium in a game, and these may be of unequal value if the game is non-zero-sum. (Coordination problems are tricky in general.) So this is underspecified; still, choosing an arbitrary Nash equilibrium is a decent backup strategy. But we can often do better:
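As a sanity check on the definition, the equilibrium property can be verified mechanically: a mixed pair is a Nash equilibrium exactly when no *pure* deviation gains (it suffices to check pure deviations, since some pure strategy always attains the best-response payoff). A minimal sketch, with invented helper names:

```python
def expected_payoffs(payoffs, p, q):
    """Expected (row, col) payouts when row plays mixed strategy p and col plays q."""
    n, m = len(p), len(q)
    u_row = sum(p[i] * q[j] * payoffs[i][j][0] for i in range(n) for j in range(m))
    u_col = sum(p[i] * q[j] * payoffs[i][j][1] for i in range(n) for j in range(m))
    return u_row, u_col

def is_nash(payoffs, p, q, eps=1e-9):
    """True iff neither player gains by unilaterally switching to a pure strategy."""
    base_row, base_col = expected_payoffs(payoffs, p, q)
    for i in range(len(p)):  # row deviates to pure strategy i
        dev = [1.0 if k == i else 0.0 for k in range(len(p))]
        if expected_payoffs(payoffs, dev, q)[0] > base_row + eps:
            return False
    for j in range(len(q)):  # column deviates to pure strategy j
        dev = [1.0 if k == j else 0.0 for k in range(len(q))]
        if expected_payoffs(payoffs, p, dev)[1] > base_col + eps:
            return False
    return True

# Rock-Paper-Scissors; rows/cols: 0=Rock, 1=Paper, 2=Scissors.
RPS = [[(0, 0), (-1, 1), (1, -1)],
       [(1, -1), (0, 0), (-1, 1)],
       [(-1, 1), (1, -1), (0, 0)]]
third = [1/3, 1/3, 1/3]
print(is_nash(RPS, third, third))      # True: uniform mixing is the equilibrium
print(is_nash(RPS, [1, 0, 0], third))  # False: column gains by switching to Paper
```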

### Idea 2: Inference

The most basic intuition a human being has in this situation is to start trying to deduce things about **Y**'s output from its source code, or even deduce things directly about **U**. This idea is best illustrated by playing Rock, Paper, Scissors against a player who always throws Rock: if you figure this out, then you should of course play Paper rather than the Nash equilibrium of 1/3 Rock, 1/3 Paper, 1/3 Scissors. (And in a coordination game, you'd prefer to settle on the *same* Nash equilibrium that **Y** outputs.)

Automating inference is an exceedingly difficult problem in general, though researchers have made substantial progress. All the decision theories we'll talk about will include some sort of "inference module", which can be applied to the source code of **Y** to deduce its output, applied to the code of the full round (including **X**, **Y**, and the payoff matrix) to deduce the value of **U**, etc.
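The Rock, Paper, Scissors example above can be sketched directly: once the inference module has deduced a distribution over **Y**'s output, we simply best-respond to it instead of playing the equilibrium. (The helper name `best_response` is invented for illustration.)

```python
def best_response(payoffs, opponent_dist):
    """Pick the pure strategy maximizing our expected payout against a
    deduced probability distribution over the opponent's outputs."""
    def expected(i):
        return sum(q * payoffs[i][j] for j, q in enumerate(opponent_dist))
    return max(range(len(payoffs)), key=expected)

# Payoffs *to us* in Rock-Paper-Scissors; rows/cols: 0=Rock, 1=Paper, 2=Scissors.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

# Suppose the inference module deduced the opponent always throws Rock:
print(best_response(RPS, [1.0, 0.0, 0.0]))  # 1, i.e. Paper
```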

### Problem: You can't deduce everything

Gödel's First Incompleteness Theorem and the Halting Problem both imply that it's impossible to write a program that correctly deduces *in general*^{1} the output of arbitrary other programs. So we have to be prepared for our inference module to fail sometimes.

A well-written inference module will either return a correct answer for a question or return "Unknown"; a sloppily-written module can get stuck in an infinite process, and a badly-written one will return an incorrect answer sometimes. It should be clear that we'll want our inference module to be of the first sort.
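One way to guarantee a module of the first sort is to run the proof search under a hard resource budget and return "Unknown" when it runs out. A minimal sketch, where `search_step` is an invented stand-in for one step of an actual proof search (assumed to return a verdict when it finds a proof, and `None` while still searching):

```python
UNKNOWN = "Unknown"

def safe_deduce(search_step, question, max_steps=10_000):
    """Run a proof search one step at a time: never loop forever, never guess."""
    for _ in range(max_steps):
        verdict = search_step(question)
        if verdict is not None:
            return verdict   # a found proof is trusted to be correct
    return UNKNOWN           # budget exhausted: fail safe rather than fabricate

# A stub searcher that never finds anything:
print(safe_deduce(lambda q: None, "output(Y) = Rock"))  # Unknown
```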

It seems we have enough already to define our first candidate decision theory:

### Naive Decision Theory

Let's first consider the approach that seems most obvious. Since we know the source code of the entire round (including **X** and **Y**), we could implement the following program:

- For each *xᵢ*, assume the output of **X** is *xᵢ*, and try to deduce the expected value of **U**. (That is, try to deduce statements of the form "if (output **X**)=*xᵢ* then **U**=*uᵢ*" for some *uᵢ*.)
- If this succeeds for each *xᵢ*, output the *xᵢ* for which *uᵢ* is the largest.
- If this does not succeed for some *xᵢ*, output a Nash equilibrium strategy.

This "naive decision theory" certainly qualifies for our tournament; it may be a bit trickier to write an inference module that does an open-ended search for the value of **U**, but it's not impossible (since human mathematicians solve open-ended deduction problems all the time). And it looks like the worst-case scenario is a Nash equilibrium, not total exploitation. What could possibly go wrong?

### Problem: Beware self-fulfilling prophecies!

There's a reason that we don't normally ask an automated theorem prover to consider questions about its *own* mathematical structure: if we ask the question in a certain way, any choice becomes a self-fulfilling prophecy.

If **X** deduces its own output by a valid process, then it's created a self-fulfilling prophecy for its output, and the problem with *that* is that a bad self-fulfilling prophecy is just as consistent as a good one. If we want to use statements like "if (output **X**)=*xᵢ* then **U**=*uᵢ*" to make our final choice, then we have to beware the other half of logical implication: that "not P" implies "if P then Q". This allows for what we might call *spurious counterfactuals*, which can throw off the actual decision in a perfectly self-consistent way.

Consider, for example, the one-player game where **X** gets $1 for outputting *a*, or $10 for outputting *b*. We want **X** to do the following:

- Prove "if (output **X**)=*a* then **U**=1"
- Prove "if (output **X**)=*b* then **U**=10"
- Output *b*.

But it's just as consistent for **X** to do the following:

- Prove "(output **X**)=*a*"
- Prove "if (output **X**)=*a* then **U**=1"
- Prove "if (output **X**)=*b* then **U**=0"
- Output *a*.

How could that possibly work? Since (output **X**)=*a*, the third line is a true logical statement! It's like the fact that you can prove anything if you assume a falsehood (though rather than unconditionally accepting a false premise, **X** is using a false premise as the antecedent of a material conditional). In this example, the very order in which **X** looks for proofs (which is part of the definition of **X**) affects which counterfactuals **X** can and cannot prove. (One important thing to note is that **X** *cannot* prove a spurious counterfactual about the action that it *does* output, only about the ones that it doesn't!)
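The logical point here — a material conditional with a false antecedent is true no matter what the consequent says — can be checked mechanically:

```python
def implies(p, q):
    """Material conditional: 'if p then q' is equivalent to (not p) or q."""
    return (not p) or q

# Suppose X actually outputs a, so the statement "(output X) = b" is false:
output_is_b = False

# Then *any* claim about the payout under output b is vacuously true,
# which is why both proof chains above are perfectly consistent:
print(implies(output_is_b, False))  # True: "if output = b then U = 0" holds
print(implies(output_is_b, True))   # True: "if output = b then U = 10" also holds
```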

I don't need to tell you that the second chain of proofs is *not* what we want **X** to do. Worse, if this is a real bug, then it could also be an *exploitable vulnerability*: if your source code for **X** were released in advance of the tournament, then other programmers might write programs that cause **X** to generate spurious counterfactuals for all but the moves that are most favorable to **Y**.

### Can NDT be salvaged?

Let's consider some quick fixes before we give up on Naive Decision Theory.

Can we simply prohibit **X** from ever deducing "(output **X**)=*xᵢ*" as a step? This doesn't work because of the possibility of indirect self-reference: **X** could end up deducing some seemingly innocuous statements which happen to correspond to its own Gödel numbering, and the spurious counterfactuals would follow from those without **X** ever having noticed that it had done anything of the sort. And it's provably impossible for **X** to recognize every possible Gödel numbering for its own inference module!

Next, it might seem like an inference module should stumble on the "genuine" counterfactuals before running into spurious ones, since the "genuine" ones seem necessarily simpler. However, it turns out (as proved by cousin_it) that one can write a valid but malicious inference module which returns and acts on a spurious proof, and (as proved by gRR) that a game with malicious code can similarly dupe an NDT agent with a good inference module!

Lastly, it seems safer to deduce counterfactuals "if (output **X**)=*xᵢ* then (output **Y**)=*yⱼ*" and apply the **U**(*xᵢ*, *yⱼ*) afterwards. And indeed, I can't see how to make **Y** exploit **X** in a straight-up Prisoner's Dilemma if that algorithm is used. There are still two problems, though. First, this algorithm now depends on the values **U**(*xᵢ*, *yⱼ*) being given to it by authority: it can't safely deduce them from the source code for the game. And secondly, it could two-box on Newcomb's Problem and defect against itself in the Prisoner's Dilemma if it finds spurious counterfactuals there.

Thus it seems we'll need to do something substantially different.

### Well, now what?

There are several ways we could write a decision theory to do inference without risking spurious counterfactuals, and indeed the decision theories we discuss on Less Wrong correspond to different valid approaches. The differences in their decisions come not from better-written inference modules, but from more effective strategies for using their inference module. In the posts to come, I'll show you how they work in detail.

**Next:** Part II: Causal Decision Theory and Substitution

### Notes:

**0.** Things get wacky if we don't know the utility function of the opponent; fortunately, even the special cases like Newcomb's predictor can be expressed as expected utility maximizers for some payoff matrix (in this case, one where the predictor gets rewarded when it matches your decision exactly).
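The footnote's construction can be written out concretely: give the predictor a payoff of 1 when its prediction matches your choice and 0 otherwise, and maximizing expected utility against its beliefs just *is* predicting accurately. A sketch for Newcomb's two choices (names invented for illustration):

```python
# Your options, which the predictor tries to predict:
choices = ["one-box", "two-box"]

# predictor_payoff[prediction][your_choice] = 1 if they match, else 0.
predictor_payoff = {p: {c: int(p == c) for c in choices} for p in choices}

def predictor_move(q_one_box):
    """Maximize the predictor's expected payoff, given its credence q_one_box
    that you will one-box; this reduces to outputting its best guess."""
    eu_one = q_one_box * 1          # expected payoff of predicting "one-box"
    eu_two = (1 - q_one_box) * 1    # expected payoff of predicting "two-box"
    return "one-box" if eu_one >= eu_two else "two-box"

print(predictor_move(0.9))  # one-box
print(predictor_move(0.1))  # two-box
```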

**1.** That "in general" is important: it's quite possible to write programs that deduce the outputs of plenty of other programs. It just so happens that there's always some program that your inference module will fail on. The classic way to generate a failure case is to run the inference module on a modified version of itself, such that returning a correct answer would induce a contradiction. This isn't just a hypothetical disability: if **X** is trying to deduce the output of **Y** in this round, and **Y** is trying to deduce the output of **X** in this round, then we might have exactly this problem!
