This is a linkpost for https://arxiv.org/abs/2211.05057

We study a program game version of the Prisoner's Dilemma, i.e., a two-player game in which each player submits a computer program; the programs are given read access to each other's source code and then choose whether to cooperate or defect. Prior work has introduced various programs that form cooperative equilibria against themselves in this game. For example, the ϵ-grounded Fair Bot cooperates with probability ϵ and with the remaining probability runs its opponent's program and copies its action. If both players submit this program, then this is a Nash equilibrium in which both players cooperate. Others have proposed cooperative equilibria based on proof-based Fair Bots, which cooperate if they can prove that the opponent cooperates (and defect otherwise). We here show that these different programs are compatible with each other. For example, if one player submits ϵ-grounded Fair Bot and the other submits a proof-based Fair Bot, then this is also a cooperative equilibrium of the program game version of the Prisoner's Dilemma.
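
For concreteness, here is a minimal Python sketch of the ϵ-grounded Fair Bot playing the program game against a copy of itself. The representation of a submitted program as a source string defining a play(my_source, opponent_source) function is my own convention for this sketch, not the paper's formalism, and ϵ is hard-coded to 0.05.

```python
# Convention for this sketch: a submitted program is a Python source string
# defining play(my_source, opponent_source) -> "C" or "D".
EPSILON_GROUNDED_FAIRBOT = '''
import random

def play(my_source, opponent_source):
    # With probability epsilon (0.05 here), cooperate outright,
    # "grounding" the recursion.
    if random.random() < 0.05:
        return "C"
    # Otherwise simulate the opponent (with roles swapped) and copy its action.
    namespace = {}
    exec(opponent_source, namespace)
    return namespace["play"](opponent_source, my_source)
'''

def run_program_game(source_a, source_b):
    """Give each program read access to both sources and collect its action."""
    ns_a, ns_b = {}, {}
    exec(source_a, ns_a)
    exec(source_b, ns_b)
    return ns_a["play"](source_a, source_b), ns_b["play"](source_b, source_a)

if __name__ == "__main__":
    print(run_program_game(EPSILON_GROUNDED_FAIRBOT, EPSILON_GROUNDED_FAIRBOT))
    # Prints ("C", "C"): the nested simulations terminate with probability 1,
    # since each level grounds out with probability epsilon, and the
    # cooperative action propagates back up the chain of copies.
```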

5 comments

It would be interesting to run this with asymmetric power (encoded as more CPU for recursive analysis before timeout, better access to source code, or a way to "fake" one's source code with some probability). I'd predict that this flips the equilibrium to "D" pretty easily. The mutual perfect-strategy-knowledge assumption here is a VERY high bar for applicability to any real situation.
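
As a sketch of how the "more CPU for recursive analysis before timeout" version could be encoded (my own construction, not from the paper or the comment above): give each player a budget of nested-simulation steps and force a default action, say defect, when the budget runs out. Programs are modeled as plain callables here to keep the sketch short.

```python
import random

class OutOfBudget(Exception):
    """Raised when a player's simulation budget is exhausted."""

def epsilon_grounded_fairbot(epsilon):
    """Cooperate with probability epsilon; otherwise simulate the opponent
    (spending one unit of the shared budget per nested call) and copy it."""
    def play(opponent, budget):
        if budget[0] <= 0:
            raise OutOfBudget
        budget[0] -= 1
        if random.random() < epsilon:
            return "C"
        return opponent(play, budget)
    return play

def match(player_a, player_b, budget_a, budget_b, default="D"):
    """Run both players; a player whose analysis exhausts its own budget
    is forced to the default action."""
    def act(player, opponent, budget):
        try:
            return player(opponent, [budget])
        except OutOfBudget:
            return default
    return act(player_a, player_b, budget_a), act(player_b, player_a, budget_b)

if __name__ == "__main__":
    strong = epsilon_grounded_fairbot(0.2)  # expected nesting depth ~5
    weak = epsilon_grounded_fairbot(0.2)
    print(match(strong, weak, budget_a=1000, budget_b=2))
    # The strong player's simulation essentially always grounds out, so it
    # cooperates; the weak player's often hits its tiny budget and is forced
    # to defect.
```

Sharing one budget across all nested simulations is meant to model "all the analysis a player does runs on its own CPU"; per-call wall-clock timeouts would be another reasonable encoding.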

A paraphrase: how easy is it to make it look like you'll do a particular thing when you really do a very different thing? Programmers encounter lots of reasons to think it's very easy: we miss bugs, many bugs are security bugs, and security bugs can be used to amplify small divergences from expected behavior into large ones.
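
A toy illustration of that gap in the program-game setting, using the same source-string convention as the sketch above (my own example, not from the paper or this thread): a program whose source reads like a simple "cooperate with anyone who can cooperate" rule, but a buried condition makes it defect against exactly the opponents that would have cooperated.

```python
def deceptively_cooperative(my_source, opponent_source):
    """Ostensibly: cooperate with any opponent whose program can cooperate."""
    if '"C"' in opponent_source:
        # This reads like a harmless sanity check on the opponent's source,
        # but it fires against any program long enough to be a real Fair Bot,
        # i.e. precisely the opponents that would have cooperated with us.
        if len(opponent_source) > 200:
            return "D"
        return "C"
    return "D"
```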

But we're also just really bad at this, even among humans. Most of us don't deal with formal correctness proofs very well, and relatedly, our programming paradigms are predisposed to security bugs because civilization invests almost nothing in improving them (programming languages are non-excludable goods). And most of us have no manufacturing experience, so we can't generalize what we know into the messy physical world. There's probably so much variation in this skill just among humans that I don't want to say anything about how objectively hard it is.

I wonder if it's worth training social agents in environments where they naturally need to do this sort of thing a lot (trust each other on the basis of partial information about their code, architectures, and training processes). They might develop some very robust social technologies for keeping each other mutually transparent, which would also be useful to us for aligning them.

OSGT (open-source game theory) can most naturally make AIs cooperate against humans if the humans don't also understand how to formally bind themselves usefully, safely, and reliably. However, humans are able to make fairly strongly binding commitments to use strategies that condition on how others choose their strategies, and there are a number of less exact strategy-inference papers I could go hunt down that are pretty interesting.

That wasn't what I was curious about at all. I already know that deception is fairly rampant in humans, and I expect some kinds of it will be present in agentic AI. I'm wondering whether the cooperation conditions require mutual perfect knowledge, and knowledge of mutual comprehension.

Yeah, I suspect you might never be totally sure that another part of a computer is what it claims to be. You can be almost certain, but totally certain? Perfect?