Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Strongly recommended prior reading.

Robust Cooperation in the Prisoner’s Dilemma:
Program Equilibrium via Provability Logic

I think I have come up with a modal agent that has these nice properties.

  • Cooperates against itself, and similar designs.
  • Cooperates against some FairBots (see notes below).
  • Defects against CooperateBot.
  • Never gets the sucker's payoff (C,D).
  • Only references the situation it's actually in.

The design makes use of two different proof systems, W and S. These are short for weak and strong. For example, you could take W to be Peano Arithmetic and S to be some strictly stronger system.

A system as weak as intuitionistic Robinson arithmetic would also work for W, paired with a suitably stronger S.

The algorithm:

  • If W proves C: Defect.
  • Else, if S proves C: Cooperate.
  • Else: Defect.

Where C = "My opponent cooperates against me."
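The decision procedure above can be sketched in Python, treating the two proof systems as abstract oracles. The `ProofSystem` class below (which just checks membership in an explicit list of theorems) is a toy stand-in, not part of the design; a real agent would run an actual proof search.

```python
class ProofSystem:
    """Toy proof system: 'proves' a sentence iff it's in an explicit theorem list."""
    def __init__(self, provable):
        self.provable = set(provable)

    def proves(self, sentence):
        return sentence in self.provable

C = "my opponent cooperates against me"

def act(W, S):
    """The agent's move, given weak system W and strong system S."""
    if W.proves(C):
        return "D"  # opponent's cooperation is provable even to W: exploit it
    if S.proves(C):
        return "C"  # cooperation provable to S but not to W: reciprocate
    return "D"      # no proof of cooperation: play it safe
```

For instance, against CooperateBot even the weak system proves C, so the first branch fires and the agent defects; when C is provable only in the stronger system, the agent cooperates.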


The last point is key here. Let's suppose this game is being played by actual computers made of physical atoms. The agents in the paper linked above consider what the opponent would do against a player that isn't yourself. For example, PrudentBot imagines a hypothetical world where its own code was deleted and replaced with the code for DefectBot. The hypothetical opponent may notice this apparent anomaly and decide that investigating it is far more important than the prisoner's dilemma.

My bot (any good names in the comments?) solves this by only considering how its opponent plays against the bot itself.

Consider two versions of this bot playing against each other, not necessarily using the same W and S as each other.

Let's assume all 4 proof systems in use are consistent. We also want the criterion that none of the proof systems ever proves a false statement (although I suspect this can be weakened somewhat).

Neither bot ever gets the sucker's payoff (C,D), as each bot will only cooperate if its S can prove that the other bot cooperates too. If W proved C on either bot, that bot would defect while its opponent (which, since W never proves false statements, really does cooperate) got the sucker's payoff. This can't happen. As neither W actually succeeds, if both of the S systems are strong enough to prove this (S₁ proves that W₂ doesn't prove C, and S₂ proves that W₁ doesn't prove C; some technical work remains on exactly when this is true, but I expect it to hold for most sensible choices of proof system), then the situation reduces to FairBot vs FairBot using proof systems S₁ and S₂. This pair cooperates for the standard Löbian reasons (with some technical assumptions about both systems being strong enough to do Löb-style reasoning; see the bottom of page 8 of the linked paper).
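For reference, here is the shape of the Löbian argument in the simpler case where both FairBots use the same proof system (writing □ for its provability operator); the two-system version treated in the linked paper is analogous:

```latex
% Fixed points from the two FairBot definitions
% (A = "bot 1 cooperates", B = "bot 2 cooperates"):
A \leftrightarrow \Box B, \qquad B \leftrightarrow \Box A
% The system proves, by distributing \Box over \wedge and applying the fixed points:
\Box(A \wedge B) \;\rightarrow\; \Box A \wedge \Box B \;\rightarrow\; B \wedge A
% So it proves \Box(A \wedge B) \rightarrow (A \wedge B), and L\"ob's theorem
% then gives \vdash A \wedge B: both bots cooperate.
```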

FairBot can be considered a special case of this bot design, where W is a proof system so weak it doesn't prove anything at all. In this case, FairBot will cooperate as long as it's using a strong enough proof system.
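In code terms, this special case amounts to running the same algorithm with a weak system that proves nothing, so the exploit branch can never fire. The self-contained toy sketch below makes the proof systems simple objects with a `proves` method (an illustrative assumption, as before):

```python
class EmptySystem:
    """A degenerate 'weak system' that proves nothing at all."""
    def proves(self, sentence):
        return False

def fairbot(S, C="my opponent cooperates against me"):
    # Same algorithm as the main bot, but with W = EmptySystem():
    # the exploit branch never triggers, so this cooperates iff S proves C.
    W = EmptySystem()
    if W.proves(C):
        return "D"
    return "C" if S.proves(C) else "D"
```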

CooperateBot can be proven to cooperate by W, so this bot defects against it.

A final note: the same type of reasoning works when W and S represent proof searches in the same language (say PA), but with S having a much larger maximum proof length (long enough to prove that W doesn't prove something, by brute-forcing over all proofs short enough for W to find).
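A minimal sketch of this bounded-length variant, with a deliberately toy proof checker: the alphabet, the encoding of proofs as strings, and the `check_proof` callback are all stand-in assumptions, chosen only to make the brute-force structure concrete.

```python
from itertools import product

ALPHABET = "01"  # toy proof alphabet; real proofs would be sequences of formulas

def bounded_proves(sentence, max_len, check_proof):
    """W-style prover: brute-force every candidate proof up to max_len symbols."""
    for n in range(1, max_len + 1):
        for symbols in product(ALPHABET, repeat=n):
            if check_proof("".join(symbols), sentence):
                return True
    return False

def strong_proves_weak_fails(sentence, weak_len, check_proof):
    """An S with a large enough budget can prove "W does not prove `sentence`"
    simply by re-running W's entire (finite) search and watching it come up empty."""
    return not bounded_proves(sentence, weak_len, check_proof)
```

With, say, a trivial checker where a proof of a sentence is the sentence itself, a two-symbol search budget finds the "proof" of "01" while a one-symbol budget does not, and the stronger searcher can certify that failure.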

Open question: can this work with logical inductors? Two logical-inductor-based agents that cooperate if their P(C) tends to one, but not too fast.



6 comments

Good design! A name for it could be the TemptedBot since it tries to go for the temptation payoff, or the ExploitBot (short: Explobot) since it tries to exploit the opponent.

One thing that you did not get around to writing is that if the Explobot's weak system is W=PA, and it plays against a FairBot that uses PA, the bots will play defect-defect. This is because the FairBot cannot prove that the Explobot takes the first "else" branch, and thus cannot prove that the Explobot cooperates. Then the FairBot defects, and as a consequence so does the Explobot.

Reminds me of

Any idea how well this would generalize to games like Chicken, or to games with more than 2 players or 2 moves?

Not yet. I'll let you know if I make a follow-up post with this. Thanks for a potential research direction.


I'm worried about this bot (I'll call it EsotericBot, EB, since it only Cooperates if you Cooperate but not obviously). If EB cooperates with FB_T (FairBot using theory T), then T must prove that W is consistent: EB(FB_T) implies FB_T(EB), which implies T proves not W proves FB_T(EB), which implies W is consistent. This might just be a detail of your definition that could be fixed with something like what's done in PrudentBot; it doesn't look like it at a glance, but I'm not sure. Or you might want to bite that bullet and lose cooperation with these "weak" FairBots, but that seems surprising, and it seems like we've just gone too far with defecting against people who obviously cooperate.

>The hypothetical opponent may notice this apparent anomaly and decide that investigating it is far more important than the prisoner's dilemma.

This doesn't show up in the formal model. Do you mean, if we're thinking about AIs in general as they might look if they reason in the same spirit as these modal bots? I'm not sure why this is a constraint you want; it seems very constraining. One thing we could try is to construct plausible counterfactuals by conditioning, not causal-counterfacting, on our opponent playing against some other bot. In the causal counterfactual the AI would see "please welcome into the Ring... ComplexityBot, fighting FairBot!" and then FairBot goes in, but instead of ComplexityBot sees just a DefectBot, and is like "um what?". If instead we condition on "FairBot fights DB", that updates the ring announcer and lots of other stuff, hopefully so that FB doesn't notice the difference from really fighting DB. This seems vaguely analogous to just asking whether FB(DB); asking FB(DB) doesn't seem to function in modal bots by counterfacting the current situation on "I'm actually DB" in any sense, it just asks what FB does in this other situation.

>Although I suspect this can be weakened somewhat.

If we're looking at, say, Boolean combinations of modal statements, then I think this translates to soundness on just that class of sentences, so we can ask for a correspondingly restricted notion of soundness.

>proof searches in the same language (say PA) but with S having a much larger max proof length

This paper is relevant:

Parametric Bounded Löb's Theorem and Robust Cooperation of Bounded Agents by Andrew Critch

>Where C = My opponent cooperates against me.

Cooperates with you, or against you?


>lob style reasoning

Shouldn't that be:


löb style reasoning