Decision Theories, Part 3.5: Halt, Melt and Catch Fire

I applaud your willingness to publicly be on fire!

Thanks, for the encouragement and for the exemplar.

Thanks :-) I wonder if your failed attempt can be converted into some sort of impossibility proof, or maybe a more formal description of the obstacle we need to overcome?

[+]V_V13y-70

[-][anonymous]13y20

if one is found, then output aj otherwise, output ak

What are "if", "then", and "otherwise"?

Let's consider the following mask agent, which we'll call AntiFairBot: it searches for a proof that its opponent cooperates against it, and it defects if it finds one; if it doesn't find such a proof, then it cooperates. This may not be a very optimal agent, but it has one interesting property: if you pit AntiFairBot against FairBot, and the two of them use equivalent oracles, then it takes an oracle stronger than either to deduce what the two of them will do!

What do they do?

[This comment is no longer endorsed by its author]

[-]orthonormal13y10

What are "if", "then", and "otherwise"?

Um, they're pseudocode. I'm not sure what your objection is...

What do they do?

My intuition is that they'll both fail to deduce that the other cooperates, and thus output their default actions. However, I imagine that there could be rigged-up versions such that FairBot deduces AntiFairBot's cooperation at the last possible second and thus cooperates, while AntiFairBot runs out of time and thus cooperates.

(Think about the other possibilities, including one of them deducing the other's cooperation with deductive capacity to spare, and you'll see that all other possibilities are inconsistent.)
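The case analysis above can be checked mechanically. This is a minimal sketch, not a model of the actual proof search: it assumes both oracles are sound (a proof of "my opponent cooperates" exists only if the opponent really does cooperate) and treats "found a proof" as a free boolean for each agent. The function names are illustrative only.

```python
from itertools import product

def fairbot(found_proof):
    # FairBot cooperates iff it found a proof that the opponent cooperates.
    return "C" if found_proof else "D"

def antifairbot(found_proof):
    # AntiFairBot defects iff it found such a proof; otherwise it cooperates.
    return "D" if found_proof else "C"

def consistent_cases():
    """Return every (fb_found, afb_found, fb_move, afb_move) that soundness allows."""
    cases = []
    for fb_found, afb_found in product([False, True], repeat=2):
        fb_move = fairbot(fb_found)
        afb_move = antifairbot(afb_found)
        # Soundness assumption: a proof that the opponent cooperates
        # cannot exist if the opponent actually defects.
        if fb_found and afb_move != "C":
            continue
        if afb_found and fb_move != "C":
            continue
        cases.append((fb_found, afb_found, fb_move, afb_move))
    return cases

for case in consistent_cases():
    print(case)
```

Only two cases survive: neither agent finds a proof (FairBot defects, AntiFairBot cooperates), or FairBot alone finds one (both cooperate). Which of the two actually happens is exactly what the sketch cannot say, since that depends on the oracles' relative strength.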

[-]thescoundrel13y20

What happens if the masks are spawned as subprocesses that are not "aware" of the higher level process monitoring them? The higher level process can kill off the subprocesses and spawn new ones as it sees fit, but the mask processes themselves retain the integrity needed for a fairbot to cooperate with itself.

[-]orthonormal13y00

This is the case with Masquerade: within the hypotheticals, the masks and opponents play against each other as if theirs were the "real" round.

[-]thescoundrel13y00

So, does the whole problem go away if, instead of trying to deduce what FairBot is going to do with Masquerade, we assume that FairBot is going to assess it as if Masquerade = the current mask? By ignoring the existence of Masquerade in our deduction, we solve the Gödel inconsistency while also ensuring that another AI can easily determine that we will be executing exactly the mask we choose.

Masquerade deduces the outcomes of each of its masks, ignoring its own existence, and chooses the best outcome. FairBot follows the exact same process, determines which mask Masquerade is going to use, and then uses that outcome to make its own decision, as if Masquerade were whatever mask it ends up as. I assume Masquerade would check that it is not running against itself, and automatically co-operate if it is, without running the deduction, which would be the other case for avoiding the loop.

[-]orthonormal13y00

The problem with that is something I outlined in the previous post: this agent without a sanity check is exploitable. Let's call this agent TrustingBot (we'll see why in a minute), and have them play against a true TDT agent.

Now, TDT will cooperate against FairBot, but not any of the other masks. So TrustingBot goes ahead and cooperates with TDT. But TDT notices that it can defect against TrustingBot without penalty, since TrustingBot only cares what TDT does against the masks; thus TDT defects and steals TrustingBot's lunch money.

See how tricky this gets?

[-]OrphanWilde13y10

Don't predict; rationalize. Include source code that always cooperates, and source code that always defects, and calculate your opponent's source code against these strategies instead of your own. This eliminates endless recursion, always halts against halting opponents, and gives you a good idea of what your opponent will do. Defect against inconsistent behavior (for example, the opponent who would one-box in Newcomb's when the box is empty, and two-box when the box is full).

ETA: This sounds identical to your approach, but the "rationalization" part is important. A rationalizing agent doesn't look for proof; it's looking for evidence that its strategy is good enough, rather than seeking the perfect strategy. It will cooperate against itself iff it cooperates with cooperating agents. All other strategies are inconsistent. It also penalizes agents whose behavior is inconsistent by consistently defecting. It doesn't attempt recursive strategies, which suffer diagonalization failures.

[-]orthonormal13y40

Remember how I said that my idea made qualitative sense until I tried to show rigorously that it would do the right thing?

That strategy isn't identical to mine- it's similar to "Not Quite Timeless Decision Theory", except less well formalized. Consider the first two Problems in that post, and whether they apply to that strategy too.

[-]OrphanWilde13y00

That's what I thought you were referring to here.

It's less formalized for a reason, though; the strategy isn't perfection. It asserts a particular proposition and attempts to defend it. (Hence the use of the term "Rationalization" as opposed to prediction.) It also does not engage in recursive logic; it evaluates only what its opponent will do against a specific strategy, which is "Always cooperate." If the opponent will cooperate against that strategy, it is assumed to be safe. If the opponent defects against an always-cooperator, it defects as well.

This rationalization engine doesn't seek to do better than all other agents, it only seeks to do as well as every agent it comes up against. There is a specific counterstrategy which defeats it - cooperate with always-cooperators, defect against opponents who cooperate if you cooperate with always-cooperators. Introduce a "Falsified prediction" mechanism, by which it then always defects against opponents it falsely predicted would cooperate in the past, however, and it becomes an effective blackmail agent against every other agent in iterated games; such a counterstrategy becomes more expensive than simply cooperating.

It has the advantage of being extremely predictable; it will predict itself to cooperate against an always-cooperator, and because it doesn't recurse, as long as P is remotely good at prediction, it will predict it to one-box in Newcomb's Problem.

(Fundamentally the issues raised are whether an agent can be stupid enough to be predictable while being smart enough to accurately predict. The trick to this is that you can't be afraid of getting the wrong answer; it is deliberately less formalized, and deliberately does not attempt to prove its solutions are ideal.)

ETA: The purpose of this agent is to exploit signaling; it observes what other agents would do to a helpless agent, suspecting that other agents may be nastier than it is, and that it is similarly helpless. It then behaves based on that information. It is in effect a reflective agent; it does to the other agent what the other agent would do to a helpless agent. It would calculate using the same probabilities other agents would, as well; against the 99%-cooperate timeless agent mentioned by Vaniver, it has a 99% chance of cooperating itself. (Although if it predicts wrong it then begins always defecting; such strategies would be risky against the reflective vindictive agent.)

[-]orthonormal13y20

Since the impasse only comes up when we try and rigorously show how our agents behave (using tricky ideas like Löbian proofs of cooperation), I'm looking for something more than qualitative philosophical ideas. It turns out that a lot of the philosophical reasoning I was using (about why these agents would achieve mutual cooperation) made use of outside-the-system conclusions that the agents cannot rely on without self-contradiction, but I didn't notice this until I tried to be more rigorous. The same applies to what you're suggesting.

Note also that by de-formalizing how your agent works, you make it harder to verify that it actually does what it is philosophically "supposed" to do.

I'm glad you're thinking about this sort of decision theory too, but I'm letting you know that here there be dragons, and you don't see them until you formalize things.

[-]OrphanWilde13y00

Assuming the existence of an interpreter that can run the opponent's code (which you seem to be assuming as well), there's not a lot of logic here to formalize.

The dragons only arise when you have two non-trivial processes operating against each other. The issue with your formalized logic arises when it recurses; when the two agents are playing a "I know what you're thinking." "I know that you know what I'm thinking." "I know that you know that I know what you're thinking." game. As long as one of the two agents uses a trivialized process, you can avoid this situation.

By trivialized process, I mean non-self referential.

In formalized pseudocode, where f(x, y) returns "Cooperate" or "Defect", x is the code to be tested (your opponent's code), and y is the code to test with:

y: Cooperate();

main: if (f(x, y) == Cooperate) { Cooperate(); } else { Defect(); }

That would be the whole of the simple reflective agent; the vindictive agent would look thus:

boolean hasFailed=false;

y: Cooperate();

main: if (f(x, y) == Cooperate and !hasFailed) { Cooperate(); } else { Defect(); }

onHasCooperatedWithDefector: hasFailed = true;

It's relatively simple to see that it cooperates against a CooperateBot, which it uses as its signal; agents which cooperate with CooperateBot are considered "Trustworthy." Because it cooperates with CooperateBot, it will cooperate with itself.
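As a rough runnable rendering of the pseudocode above: the sketch below assumes an agent is just a Python callable that takes its opponent (another callable) and returns "C" or "D", so that f(x, y) becomes the call x(y). The names `cooperate_bot`, `defect_bot`, `reflective_agent`, and `VindictiveAgent` are illustrative, not taken from the post.

```python
C, D = "C", "D"

def cooperate_bot(opponent):
    # The test input y: cooperates unconditionally.
    return C

def defect_bot(opponent):
    # Defects unconditionally; used here only as an example opponent.
    return D

def reflective_agent(opponent):
    # Run the opponent against CooperateBot, never against ourselves,
    # so the evaluation cannot recurse.
    return C if opponent(cooperate_bot) == C else D

class VindictiveAgent:
    """Reflective agent that defects forever after one mispredicted cooperation."""

    def __init__(self):
        self.has_failed = False

    def __call__(self, opponent):
        if not self.has_failed and opponent(cooperate_bot) == C:
            return C
        return D

    def on_cooperated_with_defector(self):
        # Called by the (hypothetical) game loop after being exploited.
        self.has_failed = True

print(reflective_agent(cooperate_bot))     # cooperates with CooperateBot
print(reflective_agent(defect_bot))        # defects against DefectBot
print(reflective_agent(reflective_agent))  # cooperates with itself
```

The self-play call terminates because the inner evaluation is always against `cooperate_bot`, which ignores its argument; that is the "trivialized process" doing the work.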

Now, you may say that it is throwing away free points by cooperating with CooperateBot. This is true in some sense, but in another, it is avoiding extensive computation by using this methodology, and making its behavior predictable; these lost points are the cost paid for these advantages. If computation costs utility, the relative cheapness of this behavior is important.

The penalty it pays for this behavior is that it cannot simultaneously guarantee that its opponent will defect against a DefectBot; that would introduce an inconsistency. This -appears- to be the issue with your original agent. To potentially resolve that issue, a somewhat better mechanism might be comparing behavior against the next iteration of moves made by a Tit For Tat agent against a cooperative and a defecting bot, which would reward vindictive but trustworthy agents. (A necessary but not sufficient element to avoid inconsistency is that the agent being compared against is not reflective at all - that is, doesn't examine its opponent's source code.) However, I am having trouble proving out that that behavior wouldn't also be inconsistent.

ETA: Why do I claim that self-referential (non-trivial) agents aren't consistent, when opposing each other? For the following reason: Agent A runs f(x,this) on Agent B; agent B runs f(x,this) on Agent A. This becomes f(f(f(f(f(f(f(..., this) - that is, it's infinite recursion. Masquerade solves this by trying out a variety of different "masks" - which makes the reflective agent a subset of Masquerade, using only a single mask. The issue is that Masquerade is designed to deal with other reflective agents; the two agents would mutually defect if Masquerade defects against CooperateBot in the original formulation, which is to say, the addition of masks can make Masquerade -less effective- depending upon the composition of its opponent. Whatever signal different reflective agents use, Masquerade is likely to break, given a sufficiently large number of masks.

[-]orthonormal13y00

Now, you may say that it is throwing away free points by cooperating with CooperateBot.

Indeed, I do say this. I'm looking to formalize an agent that does the obviously correct things against CooperateBot, DefectBot, FairBot, and itself, without being exploitable by any opponent (i.e. the opponent cannot do better than a certain Nash equilibrium unless the agent does better too). Anything weaker than that simply doesn't interest me.

One reason to care about performance against CooperateBot is that playing correctly against constant strategies is equivalent to winning at one-player games. Rational agents should win, especially if there are no other agents in the picture!

[-]OrphanWilde13y30

"Always defect" meets your criteria. You missed a criteria: That it wouldn't miss out on a cooperation it could achieve if its strategy were different.

Your agent will have to consider, not only its current opponent, but every opponent currently in the game. (If Masquerade played in a larger game full of my reflective agents, it would consistently -lose-, because it would choose a mask it thinks they will cooperate against, and they will find the mask it would use against CooperateBot and believe it would defect against them.)

Therefore, in a game full of my reflective agents and one CooperateBot, your agent should choose to cooperate with CooperateBot, because otherwise my reflective agents will always defect against it.

[-]orthonormal13y00

"Always defect" meets your criteria.

I said that it needed to cooperate with FairBot and with itself (and not for gerrymandered reasons the way that CliqueBots do).

You missed a criteria: That it wouldn't miss out on a cooperation it could achieve if its strategy were different.

This is impossible to guarantee without thereby being exploitable. Say that there are agents which cooperate iff their opponent cooperates with DefectBot.

Part of the trick is figuring out exactly what kind of optimality we're looking for, and I don't have a good definition, but I tend to think of agents like the ones you defined as "not really trying to win according to their stipulated payoff matrix", and so I'm less worried about optimizing my strategy for them. But those sorts of considerations do factor into UDT thinking, which adds entire other layers of confusion.

[-]Vaniver13y00

So, if I'm reading this correctly, the trouble is that you try to select actions by searching for proofs, instead of an algorithm known to halt quickly, and thus you only behave correctly if it's possible to find proofs (which it basically isn't)?

At the risk of sounding falsely prescient, I thought that was a really fishy way to do things, but I can't claim I had the mathematical understanding to say why.

Semi-related, I know this subthread still requires Löbian objects, but I'm still confused by why you think CDT can't think this way with the proper causal graph (and access to an inference module).

[-]orthonormal13y20

How do you auto-generate the proper causal graph, given the source code of a game and an opponent? (The reason that this algorithm appealed to me is that, if it had worked the way I originally thought, it would automatically do the right thing against a lot of agents in a lot of different games. I don't think that anything I could actually pseudocode in the CDT direction achieves that.)

[-]Vaniver13y10

I was imagining generating a causal graph of the game from the code of the game, where the opponent's code would be one of the nodes in the graph. I agree that determining what will happen from arbitrary code is arbitrarily difficult- but that's not a decision theory problem so much as a computer science problem. (I trust my decision theories more when they consider easy problems easy and difficult problems difficult; it's part of a sanity check that they're faithfully representing reality.)

I think it's better to be thinking like Nash than like Turing here; Nash's approach is the one that turns an infinite regress into a mixed strategy, rather than an impossibility.

That is, I can think of at least one way to restate the tournament which allows for about as much meaningful diversity in strategies, but doesn't allow infinite regress. The winner is still the TDT-inspired mask, though it no longer requires arbitrary proofs.

This might not be interesting to you, of course; if your end goal is something like studying the intertemporal decision dynamics of self-modifying AI, then you want something like a proof that "I can judge the desirability of any modification to myself." My guess is that if you go down that route then you're going to have a bad time, and I'd instead ask the question of "are there any modifications to myself that I can prove the desirability of in a reasonable length of time?"

[-]orthonormal13y00

We can have different mathematical aesthetics, sure. I got excited about these decision theories in the first place when I saw that Löbian cooperation worked without any black boxes, and that's the standard to which I'm holding my thinking. If you can find a way to automatically generate better causal diagrams for a wide range (even if not a complete one) of games and opponents, then I'll start finding that approach really fascinating too.

[-]orthonormal13y00

So there's one last fix that occurred to me overnight, but I can't see whether it could possibly work.

The impasse is that we need a version of FairBot which acts as usual (given enough deductive capacity) AND is able to deduce that it defects mutually against DefectBot. The current problem is that even though it can prove that DefectBot defects, this isn't enough to prove that FairBot won't find a proof of DefectBot's cooperation anyway. (FairBot can't deduce that FairBot's deduction procedure is consistent, unless it's inconsistent!)

A failed fix is a FairBot that simultaneously looks for proofs of the opponent's cooperation and defection, and acts symmetrically as soon as it finds one (and defects if it doesn't). The problem is that this algorithm may as easily find a proof of mutual defection against itself as a proof of mutual cooperation; it depends on the proof ordering.

So the last-ditch fix is for FairBot to search for both proofs simultaneously, but somehow search more effectively in the cooperative direction, so that we can guarantee it finds a cooperative proof against itself rather than a defective one. This could be done either by giving it exponentially more search time in the cooperative direction than in the defective direction, or by preparing it with a Löbian constructive approach in the cooperative direction.

Do either of these seem feasible?

[-]V_V13y00

The current problem is that even though it can prove that DefectBot defects

How can it do that in the general case?

The problem is that this algorithm may as easily find a proof of mutual defection against itself as a proof of mutual cooperation; it depends on the proof ordering.

Maybe I'm missing something, but wouldn't that imply that the formal system is inconsistent, and hence useless due to the principle of explosion?

[-]cousin_it13y30

Maybe I'm missing something, but wouldn't that imply that the formal system is inconsistent, and hence useless due to the principle of explosion?

If program A which uses a specific proof ordering can find a proof of some theorem about program A, and program B which uses a different proof ordering can find a proof of the opposite theorem about program B, that doesn't imply inconsistency.

[-]V_V13y00

proof of mutual defection against itself (emphasis mine)

Hence he's assuming that A = B, if I understand correctly.

[-]cousin_it13y40

He's saying that program A can find a proof that it cooperates with program A, but if we slightly change the proof ordering in program A and obtain program B, then program B can find a proof that it defects against program B. I still don't see the inconsistency.

[-]orthonormal13y00

Yes, this is what I meant. Thanks!

[-]V_V13y00

Ok, so the proof ordering is considered part of the program; I assumed it was an external input to be universally quantified. Thanks for the clarification.

[-]cousin_it13y00

(Sorry, my previous reply was off the mark.)

If the agent spends more time trying to prove cooperation, doesn't it become harder to prove that the agent defects against DefectBot?

[-]orthonormal13y00

Let's imagine for the moment that we have an increasing function f from the naturals to the naturals, and we instruct our FairBot variant to spend f(N) steps looking only for cooperative proofs (i.e. reacting only if it finds a proof that Opp(FairBot)=C) before spending an Nth step looking for defective ones (proofs of Opp(FairBot)=D). When paired against DefectBot, the simple proof that DefectBot defects will come up at the Kth step of looking for defective proofs, and thus at the (f(K)+K)th step of FairBot's algorithm.
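To make the step counting concrete, here is a small sketch of the interleaving. It assumes one particular reading of the schedule: f(N) is the total number of cooperative-search steps completed before the Nth defective-search step, and f is nondecreasing. Under that reading the Nth defective step is overall step f(N)+N. The helper name `schedule` and the choice f(n) = 2**n are illustrative only.

```python
def schedule(f, total_steps):
    """List the first total_steps search steps as ('coop', i) or ('defect', n)."""
    steps = []
    coop_done = 0
    defect_done = 0
    while len(steps) < total_steps:
        # The next defective step is allowed only once f(next index)
        # cooperative steps have been spent in total.
        if coop_done < f(defect_done + 1):
            coop_done += 1
            steps.append(("coop", coop_done))
        else:
            defect_done += 1
            steps.append(("defect", defect_done))
    return steps

def step_of_kth_defective_check(f, k):
    # Overall step index (1-based) at which the k-th defective proof is examined.
    return f(k) + k

f = lambda n: 2 ** n
steps = schedule(f, 11)
for k in (1, 2, 3):
    index = step_of_kth_defective_check(f, k)
    print(k, index, steps[index - 1])
```

With f(n) = 2**n the defective checks land at overall steps 3, 6 and 11, matching f(K)+K for K = 1, 2, 3.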

But then, if FairBot needs the lemma that FairBot defects against DefectBot, this can be proved in g(f(K)+K) steps for some function g. The important part here is that this limit doesn't blow up as FairBot's proof limit increases! (It'll still increase a bit, simply because FairBot needs a slightly longer representation if it's mentioning a sufficiently large proof limit, but this sort of increase is quite manageable.)

So the real question is, is there a function f such that we can be sure this FairBot variant finds a Löbian proof of cooperation with itself in fewer than f(N) steps, before it finds a Löbian proof of defection against itself in N steps?

[-]cousin_it13y00

You could make f(1) large enough for Löbian cooperation to happen immediately. Then you could make f(2) large enough to simulate what happened in the first f(1)+1 steps, and so on. I still don't completely understand how you're going to use this thing, though.

[-]orthonormal13y00

Lemma 1: There's a proof that M'(DefectBot)=D, and that there does not exist a shorter proof of M'(DefectBot)=C.

Proof: There's a short proof that DefectBot(DefectBot)=D, and a proof of length (waiting time + short) that FB'(DefectBot)=DefectBot(FB')=D. (It needs to wait until FB' stops looking for only cooperative proofs.) Clearly there are no shorter proofs of contrary statements.

So there's a proof that M' skips to its default defection (since there's nothing strictly better), and this proof has length less than (g(short) + g(wait + short) + short), where g is the function that bounds how long it takes to verify that something is the shortest proof of a certain type, in terms of the proof length. (I think g is just bounded by an exponential, but it doesn't matter).

Thus, of course, Lemma 1 has a proof of length at most g(g(short) + g(wait + short) + short).

Lemma 2: There's a proof that M'(FB')=FB'(M')=C, and that there does not exist a shorter proof of any alternative outcome.

Proof: Similar to Lemma 1. The initial proofs are a bit longer because we need the Löbian proof of length less than (waiting time) that FB'(FB')=C, instead of the short proof that DefectBot(DefectBot)=D.

And so there's a proof of length (g(wait+short) + g(wait) + short) that M' goes to the sanity-check trying to prove that FB'(M')=FB'(FB'). Now a proof of length L that FB'(M')=C would imply a proof of length L + (g(wait+short) + g(wait) + 2*short) that M'(FB')=C...

Oh, and here's the impasse again, because the existence of a proof that M'(FB')=C is no longer enough to prove that FB'(M')=C; you also need that there's no shorter proof that M'(FB')=D. Unless there's a clever way to finesse that (within something provable by the same system), it looks like that's where this last-ditch fix fails.

Drat.

[-]orthonormal13y00

So this is setting off all sorts of aesthetic flags in my head, but it appears to work:

Let FairBotPrime (FB') be the agent that does this: first tries to prove Opp(FB')=C for a long time (long enough for Löb to work), then simply tries to prove either Opp(FB')=C or Opp(FB')=D for the rest of the time. If it ever succeeds, it immediately halts and returns the same move as it just deduced. If it doesn't deduce anything in time, then it defects by default.

Let MasqueradePrime (M') be the agent that first tries on the DefectBot and FB' masks: it looks through all proofs for one of the form "DefectBot(Opp)=X and Opp(DefectBot)=Y" and progresses to the next stage once it finds one (or once it runs out of allotted time), then does the same for the FB' mask. Then, if either payout thus obtained exceeds the default (D,D) payout, it sanity-checks whichever mask gets the better result (let's say it defaults to DefectBot if they tie): it tries to prove "Opp(M')=Opp(chosen mask)", and if so then it outputs the (pre-calculated) move that the chosen mask makes against the opponent. If something along this path fails, then it just outputs its default move (D).
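The control flow of M' can be sketched with the proof search stubbed out. Everything hard is hidden in two hypothetical oracles: `find_outcome(mask, opp)` stands in for "the first proof found of the form mask(Opp)=X and Opp(mask)=Y" (returning None on timeout), and `prove_equiv(opp, mask)` stands in for the sanity check "Opp(M')=Opp(chosen mask)". The payoff numbers are assumed standard Prisoner's Dilemma values, not taken from the post, and the lookup tables only encode what a successful search is claimed to return.

```python
# Assumed payoffs to "me", indexed by (my move, opponent's move).
PAYOFF = {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1}

def masquerade_prime(opp, find_outcome, prove_equiv, masks=("DefectBot", "FB'")):
    default_value = PAYOFF[("D", "D")]
    best = None  # (mask name, payoff, mask's move)
    for mask in masks:  # DefectBot comes first, so it wins ties
        outcome = find_outcome(mask, opp)
        if outcome is None:
            continue  # ran out of time on this mask
        value = PAYOFF[outcome]
        if value > default_value and (best is None or value > best[1]):
            best = (mask, value, outcome[0])
    if best is None:
        return "D"  # nothing beats mutual defection
    mask, _, move = best
    # Sanity check: the opponent must provably treat M' like the chosen mask.
    return move if prove_equiv(opp, mask) else "D"

def table_oracles(outcomes, equivalences):
    # Hypothetical stand-ins for the two proof searches, backed by lookup tables.
    def find_outcome(mask, opp):
        return outcomes.get((mask, opp))
    def prove_equiv(opp, mask):
        return (opp, mask) in equivalences
    return find_outcome, prove_equiv

outcomes = {
    ("DefectBot", "CooperateBot"): ("D", "C"), ("FB'", "CooperateBot"): ("C", "C"),
    ("DefectBot", "DefectBot"): ("D", "D"), ("FB'", "DefectBot"): ("D", "D"),
    ("DefectBot", "FB'"): ("D", "D"), ("FB'", "FB'"): ("C", "C"),
}
equivalences = {("CooperateBot", "DefectBot"), ("FB'", "FB'")}
find_outcome, prove_equiv = table_oracles(outcomes, equivalences)
for opp in ("CooperateBot", "DefectBot", "FB'"):
    print(opp, masquerade_prime(opp, find_outcome, prove_equiv))
```

The sketch shows only that the branching produces the claimed moves if the oracles deliver those answers; whether the sanity-check proof against FB' can actually be found inside the system is the open question in this thread.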

I claim that M' correctly defects against DefectBot and CooperateBot, and cooperates mutually with FB' and M'. The key here is that at no point does it need to deduce that a proof search failed; it only needs to show that a certain proof is the first one that an agent finds. I'll reply again with some lemmas.

[-][anonymous]13y00

I had a similar idea some months ago, about making the agent spend exponentially more time on proofs that imply higher utility values. Unfortunately such an agent would spend the most time trying to achieve (D,C), because that's the highest utility outcome. Or maybe I misunderstand your idea...

[This comment is no longer endorsed by its author]


Problem: A deductive system can't count on its own consistency!