Decision Theories, Part 3.5: Halt, Melt and Catch Fire

18Eliezer Yudkowsky

4orthonormal

12cousin_it

-7V_V

2[anonymous]

1orthonormal

2thescoundrel

0orthonormal

0thescoundrel

0orthonormal

1OrphanWilde

4orthonormal

0OrphanWilde

2orthonormal

0OrphanWilde

0orthonormal

3OrphanWilde

0orthonormal

0Vaniver

2orthonormal

1Vaniver

0orthonormal

0orthonormal

0V_V

3cousin_it

0V_V

4cousin_it

0orthonormal

0V_V

0cousin_it

0orthonormal

0cousin_it

0orthonormal

0orthonormal

0cousin_it

New Comment

if one is found, then output aj otherwise, output ak

What are "if", "then", and "otherwise"?

Let's consider the following mask agent, which we'll call AntiFairBot: it searches for a proof that its opponent cooperates against it, and it defects if it finds one; if it doesn't find such a proof, then it cooperates. This may not be a very optimal agent, but it has one interesting property: if you pit AntiFairBot against FairBot, and the two of them use equivalent oracles, then it takes an oracle stronger than either to deduce what the two of them will do!

What do they do?

[This comment is no longer endorsed by its author]

What are "if", "then", and "otherwise"?

Um, they're pseudocode. I'm not sure what your objection is...

What do they do?

My intuition is that they'll both fail to deduce that the other cooperates, and thus output their default actions. However, I imagine that there could be rigged-up versions such that FairBot deduces AntiFairBot's cooperation at the last possible second and thus cooperates, while AntiFairBot runs out of time and thus cooperates.

(Think about the other possibilities, including one of them deducing the other's cooperation with deductive capacity to spare, and you'll see that all other possibilities are inconsistent.)

What happens if the masks are spawned as sub processes that are not "aware" of the higher level process monitoring thems? The higher level process can kill off the sub processes and spawn new ones as it sees fit, but the mask processes themselves retain the integrity needed for a fairbot to cooperate with itself.

This is the case with Masquerade: within the hypotheticals, the masks and opponents play against each other as if theirs were the "real" round.

So- does the whole problem go away if instead of trying to deduce what fairbot is going to do with masquerade, we assume that fairbot is going to asses it as if masquerade = the current mask? By ignoring the existence of masquerade in our deduction, we both solve the Gödel inconsistency, while simultaneously ensuring that another AI can easily determine that we will be executing exactly the mask we choose.

Masquerade deduces the outcomes of each of its masks, ignoring its own existence, and chooses the best outcome. Fairbot follows the exact same process, determines what Mask Masquerade is going to use, and then uses that outcome to make its own decision, as if Masquerade were whatever mask it ends up as.I assume Masquerade would check that it is not running against itself, and automatically co-operate if it is, without running the deduction, which would be the other case for avoiding the loop.

The problem with that is something I outlined in the previous post: this agent without a sanity check is exploitable. Let's call this agent TrustingBot (we'll see why in a minute), and have them play against a true TDT agent.

Now, TDT will cooperate against FairBot, but not any of the other masks. So TrustingBot goes ahead and cooperates with TDT. But TDT notices that it can defect against TrustingBot without penalty, since TrustingBot only cares what TDT does against the masks; thus TDT defects and steals TrustingBot's lunch money.

See how tricky this gets?

Don't predict; rationalize. Include source code that always cooperates, and source code that always defects, and calculate your opponent's source code against these strategies instead of your own. This eliminates endless recursion, always halts against halting opponents, and gives you a good idea of what your opponent will do. Defect against inconsistent behavior (for example, the opponent who would one-box in Newcomb's when the box is empty, and two-box when the box is full).

ETA: This sounds identical to your approach, but the "rationalization" part is important. A rationalizing agent doesn't look for proof; it's looking for evidence that its strategy is good enough, rather than seeking the perfect strategy. It will cooperate against itself iff it cooperates with cooperating agents. All other strategies are inconsistent. It also penalizes agents whose behavior is inconsistent by consistently defecting. It doesn't attempt recursive strategies, which suffer diagonalization failures.

Remember how I said that my idea made qualitative sense until I tried to show rigorously that it would do the right thing?

That strategy isn't identical to mine- it's similar to "Not Quite Timeless Decision Theory", except less well formalized. Consider the first two Problems in that post, and whether they apply to that strategy too.

That's what I thought you were referring to here.

It's less formalized for a reason, though; the strategy isn't perfection. It asserts a particular proposition and attempts to defend it. (Hence the use of the term "Rationalization" as opposed to prediction.) It also does not engage in recursive logic; it evaluates only what its opponent will do against a specific strategy, which is "Always cooperate." If the opponent will cooperate against that strategy, it is assumed to be safe. If the opponent defects against an always-cooperator, it defects as well.

This rationalization engine doesn't seek to do better than all other agents, it only seeks to do as well as every agent it comes up against. There is a specific counterstrategy which defeats it - cooperate with always-cooperators, defect against opponents who cooperate if you cooperate with always-cooperators. Introduce a "Falsified prediction" mechanism, by which it then always defects against opponents it falsely predicted would cooperate in the past, however, and it becomes an effective blackmail agent against every other agent in iterated games; such a counterstrategy becomes more expensive than simply cooperating.

It has the advantage of being extremely predictable; it will predict itself to cooperate against an always-cooperator, and because it doesn't recurse, as long as P is remotely good at prediction, it will predict it to one-box in Newcomb's Problem.

(Fundamentally the issues raised are whether an agent can be stupid enough to be predictable while being smart enough to accurately predict. The trick to this is that you can't be afraid of getting the wrong answer; it is deliberately less formalized, and deliberately does not attempt to prove its solutions are ideal.)

ETA: The purpose of this agent is to exploit signaling; it observes what other agents would do to a helpless agent, suspecting that other agents may be nastier than it is, and that it is similarly helpless. It then behaves based on that information. It is in effect a reflective agent; it does to the other agent what the other agent would do to a helpless agent. It would calculate using the same probabilities other agents would, as well; against the 99%-cooperate timeless agent mentioned by Vaniver, it has a 99% chance of cooperating itself. (Although if it predicts wrong it then begins always defecting; such strategies would be risky against the reflective vindictive agent.)

Since the impasse only comes up when we try and rigorously show how our agents behave (using tricky ideas like Löbian proofs of cooperation), I'm looking for something more than qualitative philosophical ideas. It turns out that a lot of the philosophical reasoning I was using (about why these agents would achieve mutual cooperation) made use of outside-the-system conclusions that the agents cannot rely on without self-contradiction, but I didn't notice this until I tried to be more rigorous. The same applies to what you're suggesting.

Note also that by de-formalizing how your agent works, you make it harder to verify that it actually does what it is philosophically "supposed" to do.

I'm glad you're thinking about this sort of decision theory too, but I'm letting you know that here there be dragons, and you don't see them until you formalize things.

Assuming the existence of an interpreter that can run the opponent's code (which you seem to be assuming as well), there's not a lot of logic here to formalize.

The dragons only arise when you have two non-trivial processes operating against each other. The issue with your formalized logic arises when it recurses; when the two agents are playing a "I know what you're thinking." "I know that you know what I'm thinking." "I know that you know that I know what you're thinking." game. As long as one of the two agents uses a trivialized process, you can avoid this situation.

By trivialized process, I mean non-self referential.

In formalized pseudocode, where f(x, y) returns "Cooperate" or "Defect", x is the code to be tested (your opponent's code), and y is the code to test with:

y: Cooperate();

main: if(f(x, y)) = Cooperate { Cooperate(); } else { Defect(); }

That would be the whole of the simple reflective agent; the vindictive agent would look thus:

boolean hasFailed=false;

y: Cooperate();

main: if(f(x, y)) = Cooperate and !hasFailed { Cooperate(); } else { Defect(); }

onHasCooperatedWithDefector: hasFailed = true;

It's relatively simple to see that it cooperates against a CooperateBot, which it uses as its signal; agents which cooperate with CooperateBot are considered "Trustworthy." Because it cooperates with CooperateBot, it will cooperate with itself.

Now, you may say that it is throwing away free points by cooperating with CooperateBot. This is true in some sense, but in another, it is avoiding extensive computation by using this methodology, and making its behavior predictable; these lost points are the cost paid for these advantages. If computation costs utility, the relative cheapness of this behavior is important.

The penalty it pays for this behavior is that it cannot simultaneously guarantee that its opponent will defect against a DefectBot; that would introduce an inconsistency. This -appears- to be the issue with your original agent. To potentially resolve that issue, a somewhat better mechanism might be comparing behavior against the next iteration of moves made by a Tit For Tat agent against a cooperative and a defecting bot, which would reward vindictive but trustworthy agents. (A necessary but not sufficient element to avoid inconsistency is that the agent being compared against is not reflective at all - that is, doesn't examine its opponent's source code.) However, I am having trouble proving out that that behavior wouldn't also be inconsistent.

ETA: Why do I claim that self-referential (non-trivial) agents aren't consistent, when opposing each other? For the following reason: Agent A runs f(x,this) on Agent B; agent B runs f(x,this) on Agent A. This becomes f(f(f(f(f(f(f(..., this) - that is, it's infinite recursion. Masquerade solves this by trying out a variety of different "masks" - which makes the reflective agent a subset of Masquerade, using only a single mask. The issue is that Masquerade is designed to deal with other reflective agents; the two agents would mutually defect if Masquerade defects against CooperateBot in the original formulation, which is to say, the addition of masks can make Masquerade -less effective- depending upon the composition of its opponent. Whatever signal different reflective agents use, Masquerade is likely to break, given a sufficiently large number of masks.

Now, you may say that it is throwing away free points by cooperating with CooperateBot.

Indeed, I do say this. I'm looking to formalize an agent that does the obviously correct things against CooperateBot, DefectBot, FairBot, and itself, without being exploitable by any opponent (i.e. the opponent cannot do better than a certain Nash equilibrium unless the agent does better too). Anything weaker than that simply doesn't interest me.

One reason to care about performance against CooperateBot is that playing correctly against constant strategies is equivalent to winning at one-player games. Rational agents should **win**, *especially* if there are no other agents in the picture!

"Always defect" meets your criteria. You missed a criteria: That it wouldn't miss out on a cooperation it could achieve if its strategy were different.

Your agent will have to consider, not only its current opponent, but every opponent currently in the game. (If Masquerade played in a larger game full of my reflective agents, it would consistently -lose-, because it would choose a mask it thinks they will cooperate against, and they will find the mask it would use against CooperateBot and believe it would defect against them.)

Therefore, in a game full of my reflective agents and one CooperateBot, your agent should choose to cooperate with CooperateBot, because otherwise my reflective agents will always defect against it.

"Always defect" meets your criteria.

I said that it needed to cooperate with FairBot and with itself (and not for gerrymandered reasons the way that CliqueBots do).

You missed a criteria: That it wouldn't miss out on a cooperation it could achieve if its strategy were different.

This is impossible to guarantee without thereby being exploitable. Say that there are agents which cooperate iff their opponent cooperates with DefectBot.

Part of the trick is figuring out exactly what kind of optimality we're looking for, and I don't have a good definition, but I tend to think of agents like the ones you defined as "not really trying to win according to their stipulated payoff matrix", and so I'm less worried about optimizing my strategy for them. But those sort of considerations do factor into UDT thinking, which adds entire other layers of confusion.

So, if I'm reading this correctly, the trouble is that you try to select actions by searching for proofs, instead of an algorithm known to halt quickly, and thus you only behave correctly if it's possible to find proofs (which it basically isn't)?

At the risk of sounding falsely prescient, I thought that was a really fishy way to do things, but I can't claim I had the mathematical understanding to say why.

Semi-related, I know this subthread still requires Löbian objects, but I'm still confused by why you think CDT can't think this way with the proper causal graph (and access to an inference module).

How do you auto-generate the proper causal graph, given the source code of a game and an opponent? (The reason that this algorithm appealed to me is because, if it had worked the way I originally thought, it would *automatically* do the right thing against a lot of agents in a lot of different games. I don't think that anything I could actually pseudocode in the CDT direction achieves that.)

I was imagining generating a causal graph of the game from the code of the game, where the opponent's code would be one of the nodes in the graph. I agree that determining what will happen from arbitrary code is arbitrarily difficult- but that's not a decision theory problem so much as a computer science problem. (I trust my decision theories more when they consider easy problems easy and difficult problems difficult; it's part of a sanity check that they're faithfully representing reality.)

I think it's better to be thinking like Nash than like Turing here; Nash's approach is the one that turns an infinite regress into a mixed strategy, rather than an impossibility.

That is, I can think of at least one way to restate the tournament which allows for about as much meaningful diversity in strategies, but doesn't allow infinite regress. The winner is still the TDT-inspired mask, though it no longer requires arbitrary proofs.

This might not be interesting to you, of course- if your end goal is to something like studying the intertemporal decision dynamics of self-modifying AI, then you want something like a proof that "I can judge the desirability of any modification to myself." My guess is that if you go down that route then you're going to have a bad time, and I'd instead ask the question of "are there any modifications to myself that I can prove the desirability of in a reasonable length of time?"

We can have different mathematical aesthetics, sure. I got excited about these decision theories in the first place when I saw that Löbian cooperation worked without any black boxes, and that's the standard to which I'm holding my thinking. If you can find a way to automatically generate better causal diagrams for a wide range (even if not a complete one) of games and opponents, then I'll start finding that approach really fascinating too.

So there's one last fix that occurred to me overnight, but I can't see whether it could possibly work.

The impasse is that we need a version of FairBot which acts as usual (given enough deductive capacity) AND is able to deduce that it defects mutually against DefectBot. The current problem is that even though it can prove that DefectBot defects, this isn't enough to prove that FairBot won't find a proof of DefectBot's cooperation anyway. (FairBot can't deduce that FairBot's deduction procedure is consistent, unless it's inconsistent!)

A failed fix is a FairBot that simultaneously looks for proofs of the opponent's cooperation and defection, and acts symmetrically as soon as it finds one (and defects if it doesn't). The problem is that this algorithm may as easily find a proof of mutual defection against itself as a proof of mutual cooperation; it depends on the proof ordering.

So the last-ditch fix is for FairBot to search for both proofs simultaneously, but somehow **search more effectively in the cooperative direction**, so that we can guarantee it finds a cooperative proof against itself rather than a defective one. This could be done either by giving it exponentially more search time in the cooperative direction than in the defective direction, or by preparing it with a Löbian constructive approach in the cooperative direction.

Do either of these seem feasible?

The current problem is that even though it can prove that DefectBot defects

How can it do that in the general case?

The problem is that this algorithm may as easily find a proof of mutual defection against itself as a proof of mutual cooperation; it depends on the proof ordering.

Maybe I'm missing something, but wouldn't that imply that the formal system is inconsistent, and hence useless due to the principle of explosion?

Maybe I'm missing something, but wouldn't that imply that the formal system is inconsistent, and hence useless due to the principle of explosion?

If program A which uses a specific proof ordering can find a proof of some theorem about program A, and program B which uses a different proof ordering can find a proof of the opposite theorem about program B, that doesn't imply inconsistency.

proof of mutual defection against

itself(emphasis mine)

Hence he's assuming that A = B, if I understand correctly.

He's saying that program A can find a proof that it cooperates with program A, but if we slightly change the proof ordering in program A and obtain program B, then program B can find a proof that it defects against program B. I still don't see the inconsistency.

(Sorry, my previous reply was off the mark.)

If the agent spends more time trying to prove cooperation, doesn't it become harder to prove that the agent defects against DefectBot?

Let's imagine for the moment that we have an increasing function f from the naturals to the naturals, and we instruct our FairBot variant to spend f(N) steps looking only for cooperative proofs (i.e. reacting only if it finds a proof that Opp(FairBot)=C) before spending an Nth step looking for defective ones (proofs of Opp(FairBot)=D). When paired against DefectBot, the simple proof that DefectBot defects will come up at the Kth step of looking for defective proofs, and thus at the (f(K)+K)th step of FairBot's algorithm.

But then, if FairBot needs the lemma that FairBot defects against DefectBot, this can be proved in g(f(K)+K) steps for some function g. The important part here is that this limit doesn't blow up as FairBot's proof limit increases! (It'll still increase a bit, simply because FairBot needs a slightly longer representation if it's mentioning a sufficiently large proof limit, but this sort of increase is quite manageable.)

So the real question is, is there a function f such that we can be sure this FairBot variant finds a Löbian proof of cooperation with itself in fewer than f(N) steps, before it finds a Löbian proof of defection against itself in N steps?

You could make f(1) large enough for Loebian cooperation to happen immediately. Then you could make f(2) large enough to simulate what happened in the first f(1)+1 steps, and so on. I still don't completely understand how you're going to use this thing, though.

**Lemma 1:** There's a proof that M'(DefectBot)=D, and that there does not exist a shorter proof of M'(DefectBot)=C.

*Proof:* There's a short proof that DefectBot(DefectBot)=D, and a proof of length (waiting time + short) that FB'(DefectBot)=DefectBot(FB')=D. (It needs to wait until FB' stops looking for only cooperative proofs.) Clearly there are no shorter proofs of contrary statements.

So there's a proof that M' skips to its default defection (since there's nothing strictly better), and this proof has length less than (g(short) + g(wait + short) + short), where g is the function that bounds how long it takes to verify that something is the shortest proof of a certain type, in terms of the proof length. (I think g is just bounded by an exponential, but it doesn't matter).

Thus, of course, Lemma 1 has a proof of length at most g(g(short) + g(wait + short) + short).

**Lemma 2:** There's a proof that M'(FB')=FB'(M')=C, and that there does not exist a shorter proof of any alternative outcome.

*Proof:* Similar to Lemma 1. The initial proofs are a bit longer because we need the Löbian proof of length less than (waiting time) that FB'(FB')=C, instead of the short proof that DefectBot(DefectBot)=D.

And so there's a proof of length (g(wait+short) + g(wait) + short) that M' goes to the sanity-check trying to prove that FB'(M')=FB'(FB'). Now a proof of length L that FB'(M')=C would imply a proof of length L + (g(wait+short) + g(wait) + 2*short) that M'(FB')=C...

Oh, and here's the impasse again, because the existence of a proof that M'(FB')=C is no longer enough to prove that FB'(M')=C, you also need that there's no shorter proof that M'(FB')=D. Unless there's a clever way to finesse that (within something provable by the same system), it looks like that's where this last-ditch fix fails.

Drat.

So this is setting off all sorts of aesthetic flags in my head, but it appears to work:

Let FairBotPrime (FB') be the agent that does this: first tries to prove Opp(FB')=C for a long time (long enough for Löb to work), then simply tries to prove either Opp(FB')=C or Opp(FB')=D for the rest of the time. If it ever succeeds, it immediately halts and returns the same move as it just deduced. If it doesn't deduce anything in time, then it defects by default.

Let MasqueradePrime (M') be the agent that first tries on the DefectBot and FB' masks: it looks through all proofs for one of the form "DefectBot(Opp)=X and Opp(DefectBot)=Y" and progresses to the next stage once it finds one (or once it runs out of allotted time), then does the same for the FB' mask. Then, if either payout thus obtained exceeds the default (D,D) payout, it sanity-checks whichever mask gets the better result (let's say it defaults to DefectBot if they tie): it tries to prove "Opp(M')=Opp(chosen mask)", and if so then it outputs the (pre-calculated) move that the chosen mask makes against the opponent. If something along this path fails, then it just outputs its default move (D).

I claim that M' correctly defects against DefectBot and CooperateBot, and cooperates mutually with FB' and M'. The key here is that at no point does it need to deduce that a proof search failed; it only needs to show that a certain proof is the first one that an agent finds. I'll reply again with some lemmas.

I had a similar idea some months ago, about making the agent spend exponentially more time on proofs that imply higher utility values. Unfortunately such an agent would spend the most time trying to achieve (D,C), because that's the highest utility outcome. Or maybe I misunderstand your idea...

[This comment is no longer endorsed by its author]

Followup to:Decision Theories: A Semi-Formal Analysis, Part IIIUPDATE: As it turns out, rumors of Masquerade's demise seem to have been greatly exaggerated. See this post for details and proofs!I had the chance, over the summer, to discuss the decision theory outlined in my April post with a bunch of relevantly awesome people. The sad part is,

there turned out to be a fatal flawonce we tried to formalize it properly. I'm laying it out here, not with much hope that there's a fix, but because sometimes false starts can be productive for others.Since it's not appropriate to call this decision theory TDT, I'm going to use a name suggested in one of these sessions and call it "Masquerade", which might be an intuition pump for how it operates. So let's first define some simple agents called "masks", and then define the "Masquerade" agent.

Say that our agent has actions

a, ... ,_{1}a, and the agent it's facing in this round has actions_{n}b, ... ,_{1}b. Then for any triple (_{m}b,_{i}a,_{j}a), we can define a simple agent Mask_{k}which takes in its opponent's source code and outputs an action:_{ijk}(This is slightly less general than what I outlined in my post, but it'll do for our purposes. Note that there's no need for

aand_{j}ato be distinct, so constant strategies fall under this umbrella as well.)_{k}A key example of such an agent is what we might call FairBot: on a Prisoner's Dilemma, FairBot tries to prove that the other agent cooperates against FairBot, and if it finds such a proof, then it immediately cooperates. If FairBot fails to find such a proof, then it defects. (An important point is that if FairBot plays against itself and both have sufficiently strong deductive capacities, then a short proof of one's cooperation gives a slightly longer proof of the other's cooperation, and thus in the right circumstances we have mutual cooperation via Löb's Theorem.)

The agent Masquerade tries to do better than any individual mask (note that FairBot foolishly cooperates against CooperateBot when it could trivially do better by defecting). My original formulation can be qualitatively described as trying on different masks, seeing which one fares the best, and then running a "sanity check" to see if the other agent treats Masquerade the same way it treats that mask. The pseudocode looked like this:

(The default should be something safe like a Nash equilibrium strategy, of course.)

Intuitively, when Masquerade plays the Prisoner's Dilemma against FairBot, Masquerade finds that the best utility against FairBot is achieved by some mask that cooperates, and then Masquerade's sanity-check is trying to prove that FairBot(Masquerade) = C as FairBot is trying to prove that Masquerade(FairBot) = C, and the whole Löbian circus goes round again. Furthermore, it's intuitive that when Masquerade plays against another Masquerade, the first one notices the proof of the above, and finds that the best utility against the other Masquerade is achieved by FairBot; thus both pass to the sanity-check stage trying to imitate FairBot, both seek to prove that the other cooperate against themselves, and both find the Löbian proof.

So what's wrong with this intuitive reasoning?

## Problem: A deductive system can't count on its own consistency!

Let's re-examine the argument that Masquerade cooperates with FairBot. In order to set up the Löbian circle, FairBot needs to be able to prove that Masquerade selects a mask that cooperates with FairBot (like CooperateBot or FairBot). There are nice proofs that each of those masks attains the mutual-cooperation payoff against FairBot, but we also need to be sure that some other mask won't get the very highest (I defect, you cooperate) payoff against FairBot. Now you and I can see that this must be true, because FairBot simply can't be exploited that way. But crucially,

FairBot can't deduce its own inexploitabilitywithout thereby becoming exploitable (for the same Gödelian reason that a formal system can't prove its own consistency unless it is actually inconsistent)!Now, the caveats to this are important: if FairBot's deductive process is sufficiently stronger than the deductive process that's trying to exploit it (for example, FairBot might have an oracle that can answer questions about Masquerade's oracle, or FairBot might look for proofs up to length 2

^{N}while Masquerade only looks up to length N), then it can prove (by exhaustion if nothing else) that Masquerade will select a cooperative mask after all. But since Masquerade needs to reason about Masquerade at this level, this approach goes nowhere. (At first, I thought that having a weaker oracle for Masquerade's search through masks, and a stronger oracle both for each mask and for Masquerade's sanity-check, would solve this. But that doesn't get off the ground: the agent thus defined attains mutual cooperation with FairBot, but not with itself, because the weaker oracle can't prove that it attains mutual cooperation with FairBot.)Another caveat is the following: FairBot may not be able to rule out the provability of some statement we know is false, but (given a large enough deductive capacity) it can prove that a certain result is the first of its kind in a given ordering of proofs. So if our agents act immediately on the first proof they find, then we could make a version of Masquerade work... as long as each search

doesfind a proof, and as long asthatfact is provable by the same deduction system. But there's an issue with this: two masks paired against each other won't necessarily have provable outcomes!Let's consider the following mask agent, which we'll call AntiFairBot: it searches for a proof that its opponent cooperates against it, and it

defectsif it finds one; if it doesn't find such a proof, then itcooperates. This may not be a very optimal agent, but it has one interesting property: if you pit AntiFairBot against FairBot, and the two of them use equivalent oracles, then it takes an oracle stronger than either to deduce what the two of them will do! Thus, Masquerade can't be sure that AntiFairBot won't get the highest payoff against FairBot (which of course it won't) unless it uses a stronger deduction system for the search through masks than FairBot uses for its proof search (which would mean that FairBot won't be able to tell what mask Masquerade picks).I tried to fix this by iterating over only some of the masks; after all, there's no realistic opponent against whom AntiFairBot is superior to both FairBot and DefectBot. Unfortunately, at this point I realized two things: in order to play successfully against a reasonable range of opponents on the Prisoner's Dilemma,

Masquerade needs to be able to imitate at least both FairBot and DefectBot; andFairBot cannot prove that FairBot defects against DefectBot. (There are variants of FairBot thatcando so, e.g. it could search both for proofs of cooperation and proofs of defection and playing symmetrically if it finds one, but this variant is no longer guaranteed to cooperate against itself!)If there are any problems with this reasoning, or an obvious fix that I've missed, please bring it to my attention; but otherwise, I've decided that my approach has failed drastically enough that it's time to do what Eliezer calls "halt, melt, and catch fire". The fact that Löbian cooperation works is enough to keep me optimistic about formalizing this side of decision theory in general, but the ideas I was using seem insufficient to succeed. (Some variant of "playing chicken with my deductive system" might be a crucial component.)

Many thanks to all of the excellent people who gave their time and attention to this idea, both on and offline, especially Eliezer, Vladimir Slepnev, Nisan, Paul Christiano, Critch, Alex Altair, Misha Barasz, and Vladimir Nesov. Special kudos to Vladimir Slepnev, whose gut intuition on the problem with this idea was immediate and correct.