Epistemic status: I consider everything written here pretty obvious, but I haven't seen it anywhere else. It would be cool if you could point me to sources on the topic!

Reason to write: I once saw a pretty confused discussion on Twitter about how multiple superintelligences will predictably end up in a Defect-Defect equilibrium, and I suspect that discussion would have gone better if I could have thrown in this toy example.

PrudentBot cooperates with an agent with known source code if that agent cooperates with PrudentBot and doesn't cooperate with DefectBot. It's unexploitable and doesn't leave an outrageous amount of utility on the table. But can we do better? How can we formalize the notion of "both agents understand what program equilibrium is, but they predictably end up in a Defect-Cooperate situation because one agent is vastly smarter"?
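For concreteness, here is a minimal Python sketch of that condition (my own illustration; the real PrudentBot uses bounded proof search over the opponent's source code, which I've replaced with naive simulation):

```python
# A toy sketch of PrudentBot's condition. Because proof search is replaced by
# directly calling the opponent, this only terminates against opponents that
# don't simulate back (e.g. PrudentBot(PrudentBot) would recurse forever).

C, D = "C", "D"

def CooperateBot(opponent):
    return C

def DefectBot(opponent):
    return D

def PrudentBot(opponent):
    # Cooperate iff the opponent cooperates with PrudentBot
    # AND does not cooperate with DefectBot.
    if opponent(PrudentBot) == C and opponent(DefectBot) == D:
        return C
    return D

print(PrudentBot(CooperateBot))  # D -- CooperateBot cooperates with DefectBot, so it's exploitable
print(PrudentBot(DefectBot))     # D -- never cooperate with DefectBot
```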

Let's start with a toy model. Imagine that you are going to play against either PrudentBot (with probability p) or CooperateBot (with probability 1 − p). The payoffs are 5;5 for mutual cooperation, 10;0 for defecting against a cooperator, and 2;2 for mutual defection. The bots can't play against you directly, but you can write a program to play for you. Your goal is to maximize expected value.

If you cooperate, you are always going to get 5, so you should defect only if defection gets you more than 5 in expectation: 2p + 10(1 − p) > 5, which holds exactly when p < 5/8.

Thus, our UncertainBot should take the probability distribution, check whether the probability of encountering PrudentBot is less than 5/8, and defect if it is; otherwise it should cooperate. The same goes for a mixture of PrudentBot (probability q) and DefectBot: you are guaranteed to get 2 if you defect, so you should cooperate exactly when 5q > 2, i.e. when q > 2/5.
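As a sanity check on those thresholds, here is a small sketch of UncertainBot's decision rule, using the payoffs above (the function names and the tie-breaking toward defection are my own choices):

```python
# Payoffs from the post, keyed by (my move, their move).
PAYOFF = {("C", "C"): 5, ("D", "C"): 10, ("C", "D"): 0, ("D", "D"): 2}

def uncertain_bot(p_prudent, other="CooperateBot"):
    """Choose C or D against a mixture of PrudentBot (prob p_prudent) and `other`."""
    # What each possible opponent plays in response to our move.
    response = {
        "PrudentBot":   {"C": "C", "D": "D"},  # mirrors our move
        "CooperateBot": {"C": "C", "D": "C"},
        "DefectBot":    {"C": "D", "D": "D"},
    }
    def ev(my_move):
        return (p_prudent * PAYOFF[(my_move, response["PrudentBot"][my_move])]
                + (1 - p_prudent) * PAYOFF[(my_move, response[other][my_move])])
    # Ties go to defection (my choice); the 5/8 and 2/5 thresholds fall out below.
    return "C" if ev("C") > ev("D") else "D"

print(uncertain_bot(0.5))                     # D: 0.5 < 5/8
print(uncertain_bot(0.7))                     # C: 0.7 > 5/8
print(uncertain_bot(0.5, other="DefectBot"))  # C: 0.5 > 2/5
print(uncertain_bot(0.3, other="DefectBot"))  # D: 0.3 < 2/5
```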

Can we invent a better version of DefectBot? We can imagine TraitorBot, which takes UncertainBot's state of beliefs, predicts whether it can get away with defection, and cooperates otherwise. Given the previous analysis of the PrudentBot/DefectBot mixture, it's clear that TraitorBot defects if the probability of PrudentBot is higher than 2/5 and cooperates otherwise, yielding utility no lower than that of Cooperate;Cooperate.
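A sketch of that decision rule, under my simplifying assumption that UncertainBot models its non-PrudentBot opponent as a defector when deciding:

```python
# TraitorBot reads UncertainBot's belief p = P(opponent is PrudentBot) and
# defects exactly when UncertainBot would cooperate anyway.

def uncertain_bot_vs_defector(p_prudent):
    # From the analysis above: EV(cooperate) = 5 * p_prudent, EV(defect) = 2,
    # so cooperate iff p_prudent > 2/5.
    return "C" if 5 * p_prudent > 2 else "D"

def traitor_bot(p_prudent_in_victims_prior):
    victim_move = uncertain_bot_vs_defector(p_prudent_in_victims_prior)
    return "D" if victim_move == "C" else "C"

for p in (0.3, 0.5, 0.8):
    print(p, uncertain_bot_vs_defector(p), traitor_bot(p))
# 0.3 D C
# 0.5 C D  <- above the 2/5 threshold TraitorBot defects and collects 10
# 0.8 C D
```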

Such a setup provides an amazing number of possibilities to explore.

Possibilities for exploring how defection can happen between sufficiently smart agents:

  • First of all, TraitorBot can simply win by not being in UncertainBot's prior at all.
  • Second, in the real world we don't have buttons with "Defect/Cooperate" written on them. If you are trying to decide whether to build nanotech designed by a superintelligence, you know the superintelligence's exact action; you just don't know whether that action is cooperation or defection.
  • Third, TraitorBot here is defined "by label". If we have TraitorBot1 and TraitorBot2 with different probabilities in UncertainBot's prior, we can get weird dynamics where two identical algorithms produce different results because of a faulty representation inside another algorithm (see the sketch after this list). On the other hand, it's possible to have more than one level of deception, and it's unclear how to implement them. My guess is that the levels of deception depend on how many reasoning steps the deceiver performs.
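Here is the "by label" sketch referenced in the third point (my own construction; the labels, prior, and threshold are illustrative):

```python
# UncertainBot's prior is indexed by opponent *labels*, not by opponent *code*,
# so two byte-identical algorithms can end up being trusted differently.

def uncertain_bot(prior):
    # prior: label -> believed probability that this opponent is PrudentBot-like
    # (rather than a defector). Cooperate iff that probability clears 2/5.
    return {label: ("C" if p > 2/5 else "D") for label, p in prior.items()}

# Same algorithm behind both labels, but different reputations in the prior.
prior = {"TraitorBot1": 0.6, "TraitorBot2": 0.2}
print(uncertain_bot(prior))  # {'TraitorBot1': 'C', 'TraitorBot2': 'D'}
```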

Important theoretical points:

  • It's unclear how to reconcile Löb's theorem with UncertainBot. When I tried to write something like "it's provable that if UncertainBot defects given its state of beliefs, then PrudentBot defects; therefore UncertainBot cooperates, and PrudentBot provably cooperates", it hurt my brain. I suspect this is one of those "logical uncertainty" things.
  • It would be nice to consolidate UncertainBot and TraitorBot into one entity, i.e., something with probabilistic beliefs about other entities' probabilistic beliefs that can predict whether it can get away with defection given those beliefs... and I don't know how to work with this level of self-reference.

In an ideal world, I would like to have a theory of deception in the Prisoner's Dilemma that shows under which conditions smarter agents can get away with defecting against less smart agents, and whether we can prevent such conditions from emerging in the first place.

1 comment:

With randomness, including logical uncertainty, (C,C) is not a privileged point of the Pareto frontier. All points on lines between (C,C) and (C,D), and between (C,C) and (D,C), are fair game. This turns the problem into bargaining over a particular point, with the threat of damaging Pareto efficiency. The points in the convex hull of pure outcomes can be thought of as contracts. If both players agree on the contract of 0.7(C,C)+0.3(C,D), that's still better than a lot of less coordinated alternatives.
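For concreteness, the commenter's 0.7(C,C) + 0.3(C,D) contract works out as follows with the post's payoffs (assuming (C,D) pays 0 to the cooperator and 10 to the defector):

```python
# The post's payoffs, written as (my payoff, their payoff).
CC, CD, DD = (5, 5), (0, 10), (2, 2)   # (C,C), (C,D), (D,D)

# The contract 0.7*(C,C) + 0.3*(C,D) from the comment.
contract = tuple(0.7 * cc + 0.3 * cd for cc, cd in zip(CC, CD))
print(contract)  # (3.5, 6.5) -- unequal, but both players still beat the (2, 2) of mutual defection
```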