Or: The Problem with Naive Decision Theory

Previously: Decision Theories: A Less Wrong Primer

Summary of Sequence: In the context of a tournament for computer programs, I give almost-explicit versions of causal, timeless, ambient, updateless, and several other decision theories. I explain the mathematical considerations that make decision theories tricky in general, and end with a bunch of links to the relevant recent research. This sequence is heavier on the math than the primer was, but is meant to be accessible to a fairly general audience. Understanding the basics of game theory (and Nash equilibria) will be essential. Knowing about things like Gödel numbering, quining and Löb's Theorem will help, but won't be required.

Summary of Post: I introduce a context in which we can avoid most of the usual tricky philosophical problems and formalize the decision theories of interest. Then I show the chief issue with what might be called "naive decision theory": the problem of spurious counterfactual reasoning. In future posts, we'll see how other decision theories get around that problem.

In my Decision Theory Primer, I gave an intuitive explanation of decision theories; now I'd like to give a technical explanation. The main difficulty is that in the real world, there are all sorts of complications that are extraneous to the core of decision theory. (I'll mention more of these in the last post, but an obvious one is that we can't be sure that our perception and memory match reality.)

In order to avoid such difficulties, I'll need to demonstrate decision theory in a completely artificial setting: a tournament among computer programs.

You're a computer programmer entering a tournament for spectacular stakes. But you won't be competing in person: instead, you'll be submitting code for a program to represent you, and that program will be competing one-on-one with other programs in a series of games.

You don't know what the specific games are, but the contest has specified some ground rules:

  • In each game, your program and its opponent will have to choose separately between several options. (There are independent sources of randomness available to each program, so mixed strategies are legal.)
  • The games are pure strategic conflicts: the expected payouts to each programmer depend on the outputs of both programs, and can be calculated simply if you know those outputs. (In particular, you can represent each game as a payoff matrix in terms of the programs' outputs.) For example, your program might play the Prisoner's Dilemma against another program.
  • Both programs have access to the source code of each other and of the game they're playing.
  • The programs don't get to carry any memories from round to round, or modify their source code at any stage. (This makes the analysis much simpler, and makes it even more impressive if we find unexploitable programs which are capable of mutual cooperation.)
  • While many of the programs were written by other programmers trying to win prizes for themselves, there are also some special algorithms included to test your reactions; for example, there could be programs that always cooperate in the Prisoner's Dilemma, or an instance of Newcomb's Predictor.
  • The tournament may also include some exact copies of your own program, but you won't get any of the prizes won by these extra copies, only the prizes won by your actual entry. (There's no way for your program to distinguish whether it's the original or a copy.)
  • There are more than enough prizes to go around, so you don't have to make anyone else lose, you just want your program to win as much as it can.
  • Also, there will be enough games played that most of the variance should wash out, so you should aim for the highest expected value per round rather than worry about the concave utility of more prizes.

So, what kind of program should you write?

In the next few posts, we'll examine several ideas, problems and decision theories, increasing our sophistication as we go. We'll use X to denote your program, and x1, . . . , xn to denote its possible outputs; likewise, your opponent in the current round is Y with possible outputs y1, . . . , ym. We'll let U(xi,yj) denote the resulting payout to you if X outputs xi and Y outputs yj.

Idea 1: Play defense with a Nash equilibrium

In our example, we know what utility function the opponent should have: its own expected payout.0 Any such game has at least one Nash equilibrium, a pair of strategies (which may be mixed) such that if X and Y both adopted them, then neither would be better off unilaterally switching strategies. In that sense, at least, if X plays a Nash equilibrium, it can be sure of not being exploited by Y. (In particular, X will never end up tricked into cooperating in the Prisoner's Dilemma while Y defects.)

Of course, there may be more than one Nash equilibrium in a game, and these may be of unequal value if the game is non-zero-sum. (Coordination problems are tricky in general.) So this is underspecified; still, choosing an arbitrary Nash equilibrium is a decent backup strategy. But we can often do better:

Idea 2: Inference

The most basic intuition a human being has in this situation is to start trying to deduce things about Y's output from its source code, or even deduce things directly about U. This idea is best illustrated by playing Rock, Paper, Scissors against a player who always throws Rock: if you figure this out, then you should of course play Paper rather than the Nash equilibrium of 1/3 Rock, 1/3 Paper, 1/3 Scissors. (And in a coordination game, you'd prefer to settle on the same Nash equilibrium that Y outputs.)

Automating inference is an exceedingly difficult problem in general, though researchers have made substantial progress. All the decision theories we'll talk about will include some sort of "inference module", which can be applied to the source code of Y to deduce its output, applied to the code of the full round (including X, Y, and the payoff matrix) to deduce the value of U, etc.

Problem: You can't deduce everything

Gödel's First Incompleteness Theorem and the Halting Problem both imply that it's impossible to write a program that correctly deduces in general1 the output of arbitrary other programs. So we have to be prepared for our inference module to fail sometimes.

A well-written inference module will either return a correct answer for a question or return "Unknown"; a sloppily-written module can get stuck in an infinite process, and a badly-written one will return an incorrect answer sometimes. It should be clear that we'll want our inference module to be of the first sort.

It seems we have enough already to define our first candidate decision theory:

Naive Decision Theory

Let's first consider the approach that seems most obvious. Since we know the source code of the entire round (including X and Y), we could implement the following program:

  • For each xi, assume the output of X is xi, and try to deduce the expected value of U. (That is, try and deduce statements of the form "if (output X)=xi then U=ui" for some ui).
  • If this succeeds for each xi, output the xi for which ui is the largest.
  • If this does not succeed for some xi, output a Nash equilibrium strategy.

This "naive decision theory" certainly qualifies for our tournament; it may be a bit trickier to write an inference module that does an open-ended search for the value of U, but it's not impossible (since human mathematicians solve open-ended deduction problems all the time). And it looks like the worst-case scenario is a Nash equilibrium, not total exploitation. What could possibly go wrong?

Problem: Beware self-fulfilling prophecies!

There's a reason that we don't normally ask an automated theorem prover to consider questions about its own mathematical structure: if we ask the question in a certain way, any choice becomes a self-fulfilling prophecy.

If X deduces its own output by a valid process, then it's created a self-fulfilling prophecy for its output, and the problem with that is that a bad self-fulfilling prophecy is just as consistent as a good one. If we want to use statements like "if (output X)=xi then U=ui" to make our final choice, then we have to beware the other half of logical implication, that "not P" implies "if P then Q". This allows for what we might call spurious counterfactuals, which can throw off the actual decision in a perfectly self-consistent way.

Consider, for example, the one-player game where X gets $1 for outputting a, or $10 for outputting b. We want X to do the following:

  • Prove "if (output X)=a then U=1"
  • Prove "if (output X)=b then U=10"
  • Output b.

But it's just as consistent for X to do the following:

  • Prove "(output X)=a"
  • Prove "if (output X)=a then U=1"
  • Prove "if (output X)=b then U=0"
  • Output a.

How could that possibly work? Since (output X)=a, the third line is a true logical statement! It's like the fact that you can prove anything if you assume a falsehood (though rather than unconditionally accepting a false premise, X is using a false premise as the antecedent of a material conditional). In this example, the very order in which X looks for proofs (which is part of the definition of X) affects which counterfactuals X can and cannot prove. (One important thing to note is that X cannot prove a spurious counterfactual about the action that it does output, only about the ones that it doesn't!)

I don't need to tell you that the second chain of proofs is not what we want X to do. Worse, if this is a real bug, then it could also be an exploitable vulnerability: if your source code for X were released in advance of the tournament, then other programmers might write programs that cause X to generate spurious counterfactuals for all but the moves that are most favorable to Y.

Can NDT be salvaged?

Let's consider some quick fixes before we give up on Naive Decision Theory.

Can we simply prohibit X from ever deducing "(output X)=xi" as a step? This doesn't work because of the possibility of indirect self-reference; X could end up deducing some seemingly innocuous statements which happen to correspond to its own Gödel numbering, and the spurious counterfactuals would follow from those- without X ever having noticed that it had done anything of the sort. And it's provably impossible for X to recognize every possible Gödel numbering for its own inference module!

Next, it might seem like an inference module should stumble on the "genuine" counterfactuals before running into spurious ones, since the "genuine" ones seem necessarily simpler. However, it turns out (as proved by cousin_it) that one can write a valid but malicious inference module which returns and acts on a spurious proof, and (as proved by gRR) that a game with malicious code can similarly dupe a NDT agent with a good inference module!

Lastly, it seems safer to deduce counterfactuals "if (output X)=xi then (output Y)=yj" and apply the U(xi,yj) afterwards. And indeed, I can't see how to make Y exploit X in a straight-up Prisoner's Dilemma if that algorithm is used. There are still two problems, though. First, this algorithm now depends on the values U(xi,yj) being given to it by authority- it can't safely deduce them from the source code for the game. And secondly, it could two-box on Newcomb's Problem and defect against itself in the Prisoner's Dilemma if it finds spurious counterfactuals there.

Thus it seems we'll need to do something substantially different.

Well, now what?

There are several ways we could write a decision theory to do inference without risking spurious counterfactuals, and indeed the decision theories we discuss on Less Wrong correspond to different valid approaches. The differences in their decisions come not from better-written inference modules, but from more effective strategies for using their inference module. In the posts to come, I'll show you how they work in detail.

Next: Part II: Causal Decision Theory and Substitution


0. Things get wacky if we don't know the utility function of the opponent; fortunately, even the special cases like Newcomb's predictor can be expressed as expected utility maximizers for some payoff matrix (in this case, one where the predictor gets rewarded when it matches your decision exactly).

1. That "in general" is important: it's quite possible to write programs that deduce the outputs of plenty of other programs. It just so happens that there's always some program that your inference module will fail on. The classic way to generate a failure case is to run the inference module on a modified version of itself, such that returning a correct answer would induce a contradiction. This isn't just a hypothetical disability: if X is trying to deduce the output of Y in this round, and Y is trying to deduce the output of X in this round, then we might have exactly this problem!

New Comment
90 comments, sorted by Click to highlight new comments since:
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Markup note: Footnote links should be made relative so they don't force a reload of the page due to its URL having been changed.

How do I do that? Currently the link is of the form: /lw/axl/#utilities
Like this: (fixed)
You should be able to link to just #utilities.

if your source code for X were released in advance

But everyone's source code is communicated at each round to everyone. What do you mean, "in advance"? What for?

The problem is that we have an exploitable vulnerability here: if your source code for X were released in advance, then other programmers could write a Gödel-numbering for X's inference module into their own programs in such a way that X has to take a circuitous route to the "genuine" counterfactuals (or so that the genuine ones come back "Unknown"), and in the me

... (read more)
It's easier for me to imagine a programmer writing a program specifically to exploit mine than it is to imagine a program that's able to recognize an opportunity for such exploitation in general. Just as a Nash equilibrium strategy can't be exploited even if it's publicly known in advance, we'd like our decision theories to be secure from exploitation even if their source code were publicly known. I agree; I just thought of it as I was writing the post, and in retrospect I should have asked the decision-theory mailing list before putting it in. But my intuition is that this sort of malice could be possible, using the equivalent of a Henkin sentence. I'll edit to make it clearer that this is speculation, though.
This sounds like a way of making a natural proof, not a spurious one... Could you explain in more detail?
I had the intuition that one could both "push X away from genuine counterfactuals" using diagonalization and "pull X toward particular spurious counterfactuals" using specially constructed explicit Henkin sentences (which the inference module will find). But I'm more confident that the first half makes sense than that the second half does.
Right, but sentences where? Why does our agent believe them? Can you give a sketch of an analogous example?
The inference module of X will at some point start trying to prove sub-problems about the code of Y if it doesn't find an easily proved counterfactual. In the source code of Y, in a place where X will apply its inference module, the programmer has included an explicit Henkin sentence (i.e. contains the ingredients for its own proof) such that when X applies its inference module, it quickly proves "if X!=a then U=-1000", which happens to be the sort of thing the NDT agent was looking for. This section of code doesn't affect Y's decision except when Y is playing a game of this sort against X. Again, highly speculative.
I still don't understand what you're saying. What's the nature of this "Henkin sentence", that is in what way is it "included" in Y's source code? Is it written in the comments or something? What does it mean for X to "apply its inference module"? My only guess is that you are talking about X starting to enumerate proofs in order other than their length, checking the proofs that happen to be lying around in Y's code just in case, so that including something in Y's code might push X to consider a proof that Y wants it to consider earlier than some other proof. Is it what you mean? How would "if X!=a then U=-1000" follow from a Henkin sentence (it obviously can't make anything it likes true)?
The mental image of an inference module that I'm trying to trick is something like a chess program, performing a heuristically biased depth-first search on the space of proofs, based on the source code of the round. If the first path it takes is a garden path to an easy but spurious counterfactual proof, and if the genuine counterfactual is unreachable at the same level (by some diagonalization of X's proof method), then X should be fooled. The question is whether, knowing X's source code, there's a Y which leads X's heuristics down that garden path. Since in the end we'll achieve X=a, the statement "if X!=a then U=-1000" is true, and it should be provable by NDT if it finds the right path, so there should be some way of helping NDT find the proof more quickly. I don't have a strong intuition for how to fool the "enumerate and check all proofs up to length N" inference module.
Okay. Let's say we write an X that will start from checking any proof given to it (and accepting its conclusion). How can we construct a proof of X=a that X can read then? It looks like the more X reads, the longer the proof must be, and so there doesn't exist a proof that X can also read. I don't see how to construct a counterexample to this property without corrupting X's inference system, though I imagine some quining trick might work...
Right. That's why I acknowledge that this is speculative. If it turns out there's really no need to worry about spurious counterfactuals for a reasonable inference module, then hooray! But mathematical logic is perverse enough that I expect there's room for malice if you leave the front door unlocked.
...and Slepnev posted a proof (on the list) that in my formulation, X can be successfully deceived. It's not so much a Henkin sentence, just a program that enumerates all proofs, looking for a particular spurious counterfactual, and doesn't give up until it finds it. If the spurious counterfactual is provable, the program will find it, and so the agent will be tricked by it, and then the spurious counterfactual will be true. We have an implication from provability of the spurious argument to its truth, so by Loeb's theorem it's true, and X is misled. So you were right!
I just wrote up the proof Nesov is talking about, here.
  1. Actually, this is an open problem so far as I know: show that if X is a Naive Decision Theory agent as above, with some analyzable inference module like a halting oracle, then there exists an agent Y written so that X cooperates against Y in a Prisoner's Dilemma while Y defects.

Let me just spell out to myself what would have to happen in this instance. For definiteness, let's take the payoffs in prisoner's dilemma to be $0 (CD), $1 (DD), $10 (CC) and $11 (DC).

Now, if X is going to co-operate and Y is going to defect then X is going to prove "If I... (read more)

Your intuition that it gets deduced before any of the spurious claims like "if output = 'defect' then utility <= -$1" is taking advantage of an authoritative payoff matrix that X can't safely calculate xerself. I'm not sure that this tweaked version is any safer from exploitation...
Why not? Can't the payoff matrix be "read off" from the "world program" (assuming X isn't just 'given' the payoff matrix as an argument.)
The one-player game that I wrote out is an example of a NDT agent trying to read off the payoff matrix from the world program, and failing. There are ways to ensure you read off the matrix correctly, but that's tantamount to what you do to implement CDT, so I'll explain it in Part II.

I look forward to the rest of this sequence!


Well, this article puts my confusion from the primer into context, and I think I understand the issue now.

The "problem" (and it's ultimately not a problem, but I'll get to that) I have is that the game these programs are playing is ill-posed, in that it's not a closed system. It might appear as if they're playing, for example, Prisoners' Dilemna, but with access to each other's source code they're really playing some sort of Prisoners' Meta-Dilemna, a fundamentally different game for all that it might seem superficially similar. Now I'm not enoug... (read more)

There's a relevant footnote in the next post: while it's possible to write agents that punish a particular decision theory, an agent that's trying (in any meaningful sense) to maximize its stated payoffs won't do this, and it's certainly not a constant strategy (i.e. a one-player game) either. In that sense, we can say that such problems are less likely to be encountered in reality. I agree, in the same way that Iterated Prisoners' Dilemma (with no source code access) is a fundamentally different game from One-Shot Prisoners' Dilemma (with no source code access). But there are real-life instances that are closer to the source-code access version than to the classical game theory version- for instance, a Cold War between two powers, each of which has a network of spies in the other's government.
... I believe this is called AIXI. (Not exactly, but ...)

I suspect the difficulty you've hit here is that Eliezer's theory (TDT and variants) involves consistent reasoning about counterpossible statements, not just counterfactual statements ("If my algorithm were to pick a in this situation, then my expected utility would be -10000000, so I won't do that then"). I can see what he's trying to achieve, but unfortunately I don't know any way in standard Mathematics to reason consistently about inconsistencies, or (in Modal Logic) to analyze impossible possible worlds. So this looks like a major problem.

I'... (read more)

6Wei Dai
But one that seems to have multiple "lines of attacks". Have you seen cousin_it's recent posts in discussion for example? What do you mean by "causal decision theorist"? Is it: 1. A decision theorist (somebody who studies decision theory) who, based on the arguments he has seen so far, thinks CDT is the right decision theory? Or 2. An AI that has access to its own source code and can see that it's running CDT?
I hadn't seen the cousin_it posts, thanks, though I'm reading them now. One observation is that I thought myself of the "tweak" whereby if TDT, UDT or similar manages to prove that it will not perform an action a, then it immediately does perform an action a. This at least prevents a sound decision theory finding two distinct proofs of the form "If my algorithm were to do a, then my utility would be 1000" and "If my algorithm were to do a, then my utility would be -1000". However, a couple of reservations: 1. The tweak sounds pretty weird (if I'm president, and I prove to myself that I won't push the nuclear button, then suddenly I will. Huh???) 2. If I'm trying to reason about counterpossibles (or impossible possible words) then the usual laws of logic and proof just fly out of the window. I can't legitimately infer from A -> B and A-> ~B that ~A. Indeed it's not obvious that I can infer anything at all; it will depend on how badly my impossible worlds behave. Some models of impossible worlds are complete anarchy, with no logical relationships at all between propositions. On the question about CDT, I hadn't really thought about whether the agent was human or automated, or whether it has access to its own source-code or just a general self-awareness of its decision-making powers. To be honest, I don't think it makes much difference; human agents who are utterly convinced of the correctness of CDT (and there are some of them) will reach much the same conclusion as the automated agent (I.e. that there just isn't going to be $1 million in the opaque box, so don't worry about it). And when noticing other agents who seem to be getting the $1 million, they'll shrug and say "Oh dear, Omega is awarding irrationality, but so what? Even if I had switched to TDT or UDT instead of CDT, there would still be some possible Omega who'd penalise me for that switch, and I had no way of knowing in advance which Omega I was likely to meet; in fact I was pretty confident I wouldn
In our models so far, this isn't a problem, you just use a factory-standard first order inference system. What do you mean by "can't legitimately infer"? The worst that can happen is that you infer something misleading (but still valid), and the diagonal step/chicken rule is one way of ensuring that doesn't happen.
More specifically, see the last 4 posts linked from the ADT wiki page: * A model of UDT with a halting oracle * Predictability of Decisions and the Diagonal Method * A model of UDT without proof limits * An example of self-fulfilling spurious proofs in UDT
Thanks for the pointers to these... I'll probably move my comments to there. On your other point: and... The inference rule A-> B, A->~B |- ~A is the familiar principle of proof by contradiction i.e. "reductio an absurdam" of the alternative. The difficulty I see is that you can't really use reduction ad absurdam in an impossible worlds semantics, because the impossible worlds are - ahem - absurd. That's kinda the point. I haven't read all the details of what you're trying, but my suspicion is that using standard off-the-shelf logics won't work here: you will at least need to limit the application of the inference rules e.g. so that they can be applied only up to a maximum finite number of times (and so the set of theorems is not closed under logical inference). Maybe you need some sort of belief semantics, where the "worlds" admitted into the belief-system are strictly a mixture of possible and impossible worlds, but the agent hasn't realized yet that the impossible ones actually are impossible or will never realize it, so they "seem" to be possible. So he's chugging along analysing those worlds using inference rules that will eventually hit a contradiction, but not yet. For "practical purposes" the system is consistent and won't blow up. Though be very careful, because an inconsistent system with the rule ("If I prove I won't do a, then do a") could accidentally prove that it won't nuke the planet and then (inconsistently) nuke the planet. Oops.
This happens automatically, because (1) only the statements contributing to the decision matter, and there's a finite number of them, and (2) presence of the things like the diagonal step/chicken rule in the decision algorithm implies that inferring of absurdity doesn't happen. So we can prove that it's not the case that an agent can infer absurdity, even though it's free to use any first order inference it wants, and even though absurdity does follow from the axioms (in the setting without provability oracle). In the setting with provability oracle, agent's algorithm is constructed in such a way that its axioms become sufficiently weak that the impossible counterfactuals (from the point of view of a stronger theory) remain consistent according to the theory used by the agent, and so in that setting the impossible worlds have actual models.

Your self-fulfilling prophecy example works for the iteration of the for loop (described in "For each xi, assume the output of X is xi, and try to deduce the expected value of U.") in which the output is assumed to be a, but for the iteration in which the output is assumed to be b, proving that the output is a would be to prove a contradiction. "if (output X)=b then U=0" is one possible outcome, but U could also equal anything else.

I don't see how the NDT algorithm as given allows "(output X)=a" to be proved outside of the fo... (read more)

The point is that 'given (output X)=a' may eventually let you prove a contradiction when the output is not, in fact, a, and you have added a false statement as input. In practice, one does not use a two-way axiom of (output X)=a but an one-way substitution rule 'replace (output X) with a' . The rule may be applied once at start, or through first N steps of deduction process (to catch the cases where deduction manages to deduce X from a slightly modified X that was included inside the 'other algorithm' ; note that one can't do it forever because at some point the proof checker arrives at X from first principles, and the contradiction can be introduced by substitution. The issue he described is specific to using (outputs X)=a as given, which allows you to e.g. do some algebra, arrive at a number a for any reason, and then replace a with (output X) , which lets you contradict yourself, or make a self fulfilling prophesy. The intent of making it a given, is to make the substitutions one way, but the theorem prover can do substitutions other way around.
Here's what I think you're saying: there is one value that will actually be output, call it o. In every iteration of the for loop except the one where you assume the output is o, you have assumed a false statement. From this contradiction you should be able to derive anything, and in particular, derive U(this choice)=some large negative number, such that o will appear to be the best choice. Furthermore, this argument makes no reference to what o actually is, so the algorithm can output any choice this way. That's a very good argument, although I never would have figured it out from the article and it took some thinking to get it from your comment. I think it proves that the algorithm is underspecified though, not (necessarily) faulty; the description given is not enough to actually figure out what the algorithm will output. As for the rest of your comment, I think by "in practice" you mean "in decision theories other than NDT which work better"?
re: the argument, not mine, they explained it somewhere else, albeit it is hard to get. By in practice, I mean in anything that haven't got computing power substantially larger than that of the universe. You don't feed the tail into the dragon when your dragon has finite mouth and long tail, that is to say, you dont feed source code into theorem provers if you don't have to; instead you actively try to simplify the stuff that is being processed by theorem prover. You don't make equivalence of your code with provisional output an axiom, you make it a substitution rule that works one way (it still makes nonsense eventually, but easily long after all the stars in the universe have burnt themselves out, and that can too be fixed). That is assuming you are using state of the art automated algebra package to implement decision theory.

The programs don't get to carry any memories from round to round, or modify their source code at any stage.

! No memory is a big hole to leave in your explanation. It also destroys your ability to do inference, i.e. idea 2.

Overall, it looks like for NDT you don't want to have access to your opponent's source code, and you do want to have memory. You don't need to infer if you can read!

What you're thinking of seems to be EDT, not NDT. I warned at the beginning that this was an artificial setting, but it's one in which we can express and distinguish CDT, TDT, ADT, UDT and others. The lack of memory serves to focus attention on what we can do; note that it's perfectly legitimate to try and deduce what Y would have done against another opponent, if you want that kind of data.
It's a lot easier for me to see the applications for an iterated game where you try to model your opponent's source code given the game history than a single-shot game where you try to predict what your opponent will do given their source code and your source code. That seems like it just invites infinite regress. Suppose we know we're going to play the one-shot prisoner's dilemma with access to source code, and so we want to try to build the bot that cooperates only with bots that have functionally the same rule as it, i.e. the 'superrational prisoner' that's come up on LW. How would we actually express that? Suppose, for ease of labeling, we have bots 1 and 2 that can either C or D. Bot 1 says "output C iff 2 will output C iff 1 will output C iff 2 will output C iff..." Well, clearly that won't work. We also can't just check to see if the source code is duplicated, because we want to ensure that we also cooperate with similar bots, and don't want to get fooled by bots that have that commented out. Is there a way to code that bot with the sort of propositions you're doing? Outcomes aren't enough- you need models of outcome dependencies that can fit inside themselves.
Well, that's described in these posts, or do you mean something else?
That is what I meant. My problem is that I don't see how to express that strategy correctly with the vocabulary in this post, and I don't understand the notation in those posts well enough to tell if they express that idea correctly. For example, I'm not comfortable with how Eliezer framed it. If my strategy is "cooperate iff they play cooperate iff I play cooperate," then won't I never cooperate with myself, because I don't have the strategy "cooperate iff they play cooperate"? (I do think that general idea might be formalizable, but you need models of outcome dependencies that can fit inside themselves. Which maybe the notation I don't understand is powerful enough to do.)
"Cooperate iff they play cooperate iff I play cooperate" is imprecise. Your complaint seems to stem from interpreting it to pay too much attention to the details of the algorithms.* One reason to use counterfactuals is to avoid this. One interpretation of the slogan is "A()=C iff A()=B()." This has symmetry, so the agent cooperates with itself. However, it this is vulnerable to self-fulfilling prophecy against the cooperation rock (ie, it can choose either option) and its actions against the defection rock depend on implementation details. It is better to interpret it as cooperate if both counterfactuals "A()=C => B()=C" and "A()=D => B()=D" are true. Again, this has symmetry to allow it to cooperate with itself, but it defects against rocks. In fact, an agent like from this post pretty much derives this strategy and should cooperate with an agent that comes pre-programmed with this strategy. The problem is whether the counterfactuals are provable. Here is a thread on the topic. * Another example of agents paying too much attentions to implementation details is that we usually want to exclude agents that punish other agents for their choice of decision theory, rather than their decisions.
Aren't the implementation details sort of the whole point? I mean, suppose you code a version of cooperate iff "A()=C => B()=C and A()=D => B()=D" and feed it a copy of itself. Can it actually recognize that it should cooperate? Or does it think that the premise "A()=C" implies a cooperation rock, which it defects against? That is, if the algorithm is "create two copies of your opponent, feed them different inputs, and combine their output by this formula," when you play that against itself it looks like you get 2^t copies, where t is how long you ran it before it melted down. I see how one could code a cliquebot that only cooperated with someone with its source code, and no one else. I don't see how one could formalize this general idea- of trying D,C then C,C then D,D (and never playing C,D)- in a way that halts and recognizes different implementations of the same idea. The problem is that the propositions don't seem to cleanly differentiate between code outcomes, like "A()=C", and code, like "A()=C => B()=C".
No, the algorithm is definitely not to feed inputs in simulation. The algorithm is to prove things about the interaction of the two algorithms. For more detail, see the post that is the parent of the comment I linked. "A()=C => B()=C" is a statement of propositional logic, not code. It is material implication, not conditional execution.
This is possibly just ignorance on my part, but as far as I can tell what that post does is take the difficult part and put it in a black box, which leaves me no wiser about how this works. Code was the wrong phrase for me to use- I should have gone with my earlier "outcome dependency," which sounds like your "material implication." I think I've found my largest stumbling block: when I see "B()=C", it matters to me what's in the (), i.e. the inputs to B. But if B is a copy of this algorithm, it needs more information than "A()=C" in order to generate an output. Or, perhaps more precisely, it doesn't care about the "=C" part and makes its decision solely on the "A()" part. But then I can't really feed it "A()=C" and "A()=D" and expect to get different results. That is, counterfactuals about my algorithm outputs are meaningless if the opponent's algorithm judges me based on my algorithm, not my outputs. Right? Or am I missing something?
The input to A and B is always the same, namely the source code of A and B and the universe. In fact, often we consider all that to be hard coded and there to be no input, hence the notation A(). Could I rephrase your last paragraph as "counterfactuals about the output of my algorithm are meaningless since my algorithm only has one output"? Yes, this is a serious problem in making sense of strategies like "cooperate iff the opponent cooperates iff I do." Either my opponent cooperates or not. That is a fact about math. How could it be different? And yet, in the post I linked to, Cousin It managed to set up a situation in which there are provable theorems about what happens in the counterfactuals. Still, that does not answer the question of the meaning of "logical counterfactuals." They seems to match the ordinary language use of counterfactuals, as in the above strategy. If the "black box" was the provability oracle, then consider searching for proofs of bounded lengths, which is computable. If the black box was formal logic (Peano arithmetic), well, it doesn't look very black to me, but it is doing a lot of work. I don't think you should be judging material implication by its name. Perhaps by "outcome dependency" you mean "logical counterfactual." Then, yes, counterfactuals are related to such implications, by definition. The arrow is just logical implication. There are certain statements of logic that we interpret as counterfactuals.
That looks right to me. What do you mean by the first sentence? That the opponent's move is one of {C, D}? Or that we are either in the world where C is played with probability 1, or the world where D is played with probability 1? It seems to me that pernicious algorithms could cause cycling if you aren't careful, and it's not specified what happens if the program fails to output an order. If you tack on some more assumptions, I'm comfortable with saying that the opponent must play either C or D, but the second seems outlawed by mixed strategies being possible. Peano arithmetic is beyond my formal education. I think I understand most of what it's doing but the motivation of why one would turn to it is not yet clear. So, "A()=C => B()=C" means "it cannot be the case that I cooperate and they defect" and "A()=D => B()=D" means "it cannot be the case that I defect and they cooperate," and if both of those are true, then the algorithm cooperates. Right? (The first is to not be taken advantage of, and the second is to take advantage of those who will let you.) What I'm not seeing is how you take two copies of that, evaluate them, and get "C,C" without jumping to the end by wishful thinking. It looks to me like the counterfactuals multiply without end, because every proof requires four subproofs that are just as complicated. (This is where it looks like the provability oracle is being called in, but I would rather avoid that if possible.)
Yes, there are complications like programs failing to terminate or using random numbers. For simplicity, we usually assume no random numbers. We could force termination by putting a time bound and some default action, but it's usually better to distinguish the such program failure from other actions. In any event, at least one of the possibilities A()=C and A()=D is false, and thus thus claims like "A()=C => B()=C" are confusing. If A and B have the same source code, then it is provable that they have the same source code and thus it is provable that A()=B(), and thus provable that A()=C => B()=C. That is what A and B do, is search for that proof. They do not simulate each other, so there is no multiplication of counterfactuals. Identical source code is a rather limited case, but it is a start. Loeb's theorem shows how to prove things in some other cases.
Ok. Where do they look for it, and how will they know if they've found it? I don't like, but will accept for now, the "evaluate every possible proof less than X characters" method of finding proofs. But I don't see how you determine those proofs are true or false without simulating A() and B(), especially if B() isn't a copy of A(), but some complicated algorithm that might or might not cash out as equivalent to A(). (Where I'm going with this: if this idea requires magic to do its basic operation, then I am uncomfortable with using this idea for anything.)
Very often in this conversation, I think we're using words to mean wildly different things; such as "proof." Do you need to simulate bubble sort to prove that it takes quadratic time?
Very possibly! "Simulate" is another word that we may not be using the same way. I would say 'yes, if', conditioned on what you mean by simulate. It doesn't look like you can prove bubble sort takes quadratic time by running it a bunch of times, though you could certainly suggest that it does. But I also don't see how to prove bubble sort takes quadratic time without intense understanding of the algorithm (and it may be that intense understanding requires simulation). If you're given the algorithm to begin with, this doesn't seem like a big issue. But if I give you some code, which is an implementation of bubble sort but not labeled as such, then proving it takes quadratic time is more involved, and may (will?) require running the code. Right?
It is certainly true that some algorithms are easier to prove things about than others. Let's set aside running time and just talk about correctness of sorting. There are algorithms that sort but are undecidable; or for which the proof is very long. But the proof that bubbling correctly sorts is short, even intercal bubbling. Finding a short proof quickly may be hard, but exhaustively searching for it is simple to implement and simple to prove things about. And once you've found it, you just verify it. You don't have to simulate the code to verify the proof. You certainly don't have an infinite regress. If you like, you can say that by invoking exhaustive searching, we are assuming P=NP and assuming away the problem of AI. That is intentional. The question of decision theory is an independent problem and it is right to separate them this way. But exhaustive search is a concrete computable function, not "magic." Also, sometimes you really are constrained by space, not time and so exhaustive search is a realistic option. Putting in a proof generator makes it easy to make an algorithm undecidable. If we make these deterministic algorithms play Rock-Paper-Scissors, they will fail to terminate or otherwise break down. Many similar algorithms with resource constraints will be undecidable to other algorithms with the same resource constraints. But proofs are not particularly hard to reason about and do not force undecidability / infinite regress. In particular, when the game is PD and the algorithms are trying to cooperate to be understood and reach (C,C) rather than (D,D). Then (sometimes) there are short proofs of the relevant things and they do manage to cooperate.
I still don't see why this is the case, but I think I understand everything else about it, which suggests I should let this sit for a while and it'll become clear when it's relevant. (The proof verification is where it looked like magic was happening, not finding the interesting proofs.) I think I disagree that it's right to separate them this way, but this may just be a matter of taste / goals. I am far more interested in prescriptive decision theory than normative decision theory (bounded rationality vs. unbounded rationality).
I considered saying that proofs are verifiable by definition. I mean a proof in formal logic. The point of formal logic is that it is computable (even fast) to verify it. The point, which matches the common usage, is that it is convincing. You can't prove anything about arbitrary code; these algorithms are not guaranteed to understand the opponent. But it's interesting that sometimes they do succeed in proving things about their opponents.
I'd say it's easier to prove that a naïve bubble sort runs in quadratic time, than it is to prove that when it terminates, the list is actually sorted. It's obvious by inspection that it contains a nested loop over the same data structure.
What Nesov said. Loebian cooperation happened because I wanted to make quining cooperation work when the opponent has the same algorithm, but not necessarily the same source code. Illusion of transparency again, I post a lot of math thinking that its motivation will be crystal clear to everyone...
Combining branches: check out this comment.
You'll see by Part III. (Hint: when you see an infinite regress, consider a clever quining.)
Ok. I see how to do quining to only cooperate with copies of your source code, but I don't yet see how to do quining with outcome dependencies.
By the way, I was too blasé in the grandparent comment. I have a model of TDT that does almost the right thing, but I haven't figured out how to quine it so that it does exactly the right thing. (Technically, I can't be sure it'll still work if I do get the right quine in, but my gut feeling is that it will.)
So, I'm going to be pretty disappointed if this whole affair is just someone inventing a meta-cliquebot that's actually just a cliquebot.
Trust me, there's better stuff than that. (In particular, I've got a more nicely formalized write-up of UDT. It's just TDT I'm having issues with.)
The quining part usually seems to be the tricky part doesn't it?

This branch of math is outside my training. I'm stumbling over the self-fulfilling prophecies section.

How can these two statements

if (output X)=b then U=10
if (output X)=b then U=0

both be true?

Because in the second example, it's been deduced that (output X)=a. It's like how you can prove anything from a false premise.
I think it might help to say that explicitly.
Good call. Is my edit better?
Yes, though I would say "because you can prove anything from a false premise".
Subtle distinction: it's not unconditionally taking a false axiom and deriving a spurious conclusion, it's proving a conditional by proving the antecedent is false. I'll see if I can improve the wording.
These articles seem beyond my skillset also, but this may help you: In math, the sentence "if A then B", is equivalent to "(not A) or B" So, "if A then B" is considered true if "A" is false, regardless of what B is.

For footnote #1, it may be useful to give a couple of examples. When both programs are making decisions based on the others source code, it is pretty close to the undecidable situation, but it depends on the game.

When the game is Rock-Paper-Scissors (or parity), the programs have contradictory goals and are pretty much the textbook example of noncomputable.

But if the game is the Prisoner's Dilemma (or the Coordination Game), they are trying to cooperate to land on (C,C), so they want to be decidable.

I plan to put all the acknowledgments and references in the last part; but I should mention now that I first read about the self-fulfilling prophecy problem on the decision-theory mailing list, in a post by Vladimir Slepnev. (EDIT: He passes on the credit to Benja Fallenstein in this LW comment.)

Other people have used "naive decision theory" before, some of them on Less Wrong, but none of the usages seemed to stick. So I called this one (which was an early candidate for UDT before the problem was noticed) Naive Decision Theory. I can change the name if people prefer.

Vote up this comment if I should stick with calling it Naive Decision Theory.
Vote up this comment if I should change the name of Naive Decision Theory. Reply with a suggestion or upvote one of the suggestions that's been made.
I propose: Very Naive Decision Theory, as it represents an effort to make a decision theory which is maximally naive by making an assumption that the makers of naive decision theory are naive about the Godel's incompleteness theorem and Halting problem, yet allow their decision theory to recurse rather than simply black-box it from itself. In practice, the black boxing is always done to cut down on the work because you are an applied mathematician, and you got a problem to solve, and you won't enter the recursion if you can avoid doing that.
As far as I know, the problem of self-fulfilling prophecies and spurious counterfactuals was first pointed out by Benja Fallenstein in the comments to one of my posts.

Gödel's First Incompleteness Theorem and the Halting Problem both imply that it's impossible to write a program that correctly deduces in general1 the output of arbitrary other programs. So we have to be prepared for our inference module to fail sometimes.

And here you fail the very basic thing, the domain of applicability: limited computational power. Unlimited computational power introduces nasty infinities into things. "Naive decision theory" is not interested in games where you confuse yourself using infinities. The contest organizers have ... (read more)

Also, your intuitions for how to do decision theory are an unformalized version of something like CDT or TDT, as you'll see soon enough. [snark deleted by author]
I never assumed that the inference module was infinitary- in fact, I said that it should be designed to return either a correct answer or "Unknown". The very first example that was used for the self-fulfilling prophecy problem was that of a bounded proof-checker, which looks through all proofs shorter than length N. Finitary vs. infinitary is not the issue here.
edit: okay, that's just argument over the definition of NDT. Then don't refer to Halting problem and Godel's incompleteness theorem. It is hard to object on the very informal things; all I can say, 'you didn't prove this', but it is a fully general counter argument to informal things. The other misinterpretation is that NDT does not make assumption (outputs X) = xi in this 'transparent' way with source code on the left side, and take this as an axiom. It has a black box for itself, denoted by symbol A, which it substitutes for own source code whenever it spots it's own source code. It substitutes different xi as A into resulting utility equations, solving for utility, and finding the maximum utility. It does not analyze itself. I do agree that for ONDT (orthonormal's interpretation of NDT) which you present here, what you say, is correct. I do not agree that your NDT is the NDT that applied mathematicians use. There are substantial differences.
If you were substituting in variables for the output of X rather than analyzing the round as a whole, then you're not talking about Naive Decision Theory, you're talking about something in the family of CDT and TDT.
TDT was made by Eliezer, right? Here, a bright guy (maybe naive, but certainly not stupid) outside the field's theory.
I don't understand your comment. EDIT: Maybe you assumed that I was denigrating your intuitions on decision theory for departing from NDT? If so, that's not the case- substituting for the output of X in various spots turns out to be a good way to avoid the problem of spurious counterfactuals. CDT does this in a very basic way, TDT in a better way.
re: EDIT Ahh, I see. What I am saying is that the decision theory which applied mathematicians follow, operates under practical constraints (limited computing time), and this prevents introduction of very fancy things like your (output X) simply because they are computationally heavy. The theory that always substitutes is what I originally noted in the other thread, regarding the newcomb's. Due to substitution, that theory doesn't 'pollute' it's own proof checker with a proposition that may be untrue and may break the proof checker (but practically, would make proof checker very slow). edit: That being said, the decision theories operate under notion of perfect accuracy and unlimited computing time; the applied mathematicians see this field as not very relevant to any practical software (e.g. AI). The practical AI is best off substituting for X everywhere. The practical AI is never a guaranteed utility maximizer; there will always be problems that it won't be able to solve correctly in the given time, and the important consideration is to try to be able to solve as large set of problems in limited time, as possible. That is why i have very little inclination to write some semi-formalization of what i use when i am writing code to decide on things. I would rather write specifications for making AI that actually does it; the specifications only need to be as formal as for the programmer who implements them to understand the intent. (I can alternatively implement it myself, and then it becomes completely formalized to the point that computer can run it)
It's just an argument over word definitions, in any case. You can call what i propose TDT if you wish. I'm just pointing out that Eliezer, being a bright guy outside the field, is precisely the kind of person to make a naive but not stupid decision theory which matches what many people actually use in practice.
I edited my comment while you were replying to it. I don't think we're actually disagreeing on this particular point.
Agreed. Given access to the SOURCE of the opponent, you can just RUN the opponent and observe what they'll do this round. Of course, given, say, 10 seconds to calculate between rounds, the wise opponent just ensures it will take exactly 10 seconds to run their code, that way you can either calculate your own move OR the opponent's, but not both. Then you get in to the complexities of whether you have enough time at any given point to attempt to OPTIMIZE your opponent's equation ("Roll dice for 9.99 seconds then select rock" is very easy to optimize down to "select rock"), and at that point I think we're sufficiently outside the simplified domain of the post :) So, basically, you're right, but I don't think it's actually a relevant critique. I would enjoy seeing it addressed in the post, though, because it bugged me enough that I was going to post about it myself :)
Main problem: if you're playing against yourself, this creates an infinite loop of you simulating you simulating you simulating... Fortunately, it's often possible to prove things about other programs without running them. In a Prisoner's Dilemma, TDT would defect if it couldn't deduce anything in time about the other player. The only way to achieve mutual cooperation is to make it easy for the other player to deduce what you're doing in that circumstance. (I'll cover this in more depth in the TDT section.) This doesn't mean "always cooperating", nor does it mean giving up mixed strategies in zero-sum games.

Upvoted- just enough math.