Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We consider a modification to the open source prisoners' dilemma in which agents must pay some resources to model each other. We will use the modal combat framework, but where agents pay a cost proportional to the depth of boxes in their code. Even a small modeling penalty makes the FairBot-FairBot outcome no longer an equilibrium, since the best response to FairBot is to be CooperateBot and not pay the modeling penalty. The best response to CooperateBot is to be DefectBot, and the pure DefectBot-DefectBot outcome is a stable Nash equilibrium. In fact, I believe that DefectBot-DefectBot is the unique pure strategy Nash equilibrium.

Amazingly, this turns out to be okay! For small modeling penalties, there is a mixed strategy equilibrium which mixes between CooperateBot, FairBot, and PrudentBot! Both players get exactly the same utility in expectation as the FairBot-FairBot outcome.

Further, if you consider an evolutionary system where populations reproduce in proportion to how well they do in prisoners' dilemmas with each other, it appears that as the modeling penalty gets small, the basin of the defect equilibrium also gets small, and nearly all initial conditions cycle around CooperateBot, FairBot, and PrudentBot!

This post came out of conversations with Sam Eisenstat, Abram Demski, Tsvi Benson-Tilsen, and Andrew Critch. It is a first draft that could use a coauthor to carefully check everything, expand on it, and turn it into a paper. If you think you could do that with minimal guidance from me, let me know.


We will be using the modal combat framework, and identifying with cooperation and with defection. Agents are defined to formulas that combine the other agent run on various agents using propositional calculus and a modal operator . The represents provability, and every instance of run on an agent in the formula must be contained within a . Recall some common modal agents:

CooperateBot is defined by .

DefectBot is defined by .

FairBot is defined by .

PrudentBot is defined by .

These 4 agents interact with each other as follows: CooperateBot cooperates with everyone. DefectBot defects against everyone. FairBot defects against only DefectBot. PrudentBot defects against CooperateBot and DefectBot and cooperates with itself and FairBot.

We will say that the depth of an agent is the maximum of the depth of s in its code and the depth of the agents that it calls the opponent on. CooperateBot and DefectBot have depth 0, FairBot has depth 1, and PrudentBot has depth 2.

We will use a prisoner's dilemma where mutual cooperation produces utility 2, mutual defiction produces utility 1, and exploitation produces utility 3 for the exploiter and 0 for the exploited. Each player will also pay a penalty of times its depth.

Pure Equilibria

The best response to both CooperateBot and DefectBot is DefectBot, since when the opponent does not depend on you, you want to defect with the least possible penalty.

The best response to FairBot is CooperateBot, since you can't exploit FairBot, so you want to get mutual cooperation with the least possible penalty.

The best response to PrudentBot is FairBot, since you can't exploit PrudentBot, you can't mutually cooperate with penalty 0, but you can mutually cooperate with penalty 1 by being FairBot. (This is assuming is at less than . Otherwise, you just want to defect to avoid the penalty.)

Thus, if the only options are CooperateBot, DefectBot, FairBot, and PrudentBot, the unique pure strategy equilibrium is mutual DefectBot.

I believe that DefectBot is the only pure strategy equilibrium in general. This would follow directly from the fact that if a depth agent cooperates with another depth agent , then there exists a depth agent which also cooperates with. I believe this is true, but haven't proven it yet. If it turns out to be false, there might be a simple modification of the penalties that makes it true.

Mixed Equilibria

First, let's look at Mixed Equilibria in which the only options are the four modal agents above.

Let , , , and be the probabilities with which player is DefectBot, CooperateBot, FairBot, and PrudentBot respectively.

The utility of playing DefectBot is , where is the other player.

The utility of playing CooperateBot is , where is the other player.

The utility of playing FairBot is , where is the other player.

The utility of playing PrudentBot is , where is the other player.

Note that if one player is indifferent between CooperateBot and DefectBot, this means the other player plays FairBot exactly of the time, but then the first player would be better off playing PrudentBot (as long as ). Thus no player can ever randomize between CooperateBot and DefectBot.

If a player doesn't play CooperateBot at all, the other player can't play Prudent Bot, since FairBot is strictly better. Thus, if both players don't play CooperateBot, they both play only DefectBot and FairBot. If this isn't the pure defect solution, it must be because and for both players, which is indeed a (not very good) Nash equilibrium.

If exactly one player plays CooperateBot at all, the first player mixes between CooperateBot and FairBot, while the other player mixes between (at most) DefectBot, FairBot and PrudentBot. The second player can't play FairBot, since CooperateBot would have done strictly better. Thus the first player gets 0 utility for playing CooperateBot, contradicting the fact that it is impossible to get 0 expected utility playing FairBot.

Finally, we have the case where both players play CooperateBot with some probability, and thus neither plays DefectBot at all. If either player did not play FairBot at all, then DefectBot would strictly dominate CooperateBot for the other player, contradicting the fact that both players play CooperateBot. If either player did not play PrudentBot at all, CooperateBot would strictly dominate FairBot, contradicting the fact that both players play FairBot.

Thus, in the only remaining equilibria, both players mix between all of CooperateBot, FairBot, and PrudentBot. Since both players are indifferent between CooperateBot and FairBot, both players must play PrudentBot with probability exactly . Since both players are indifferent between FairBot and PrudentBot, both players must play CooperateBot with probability exactly . This leaves probability for FairBot, and the result is indeed a Nash equilibrium.

We have a total of three Nash equilibria. All three are symmetric. The first two are bad and both players get the expected utility . The last one is good and both players get expected utility .

Next, we have to show that this outcome is also an equilibrium in the game where both players can play any modal agent. If another agent were to get utility more than against this PrudentBot, CooperateBot, FairBot combination, it would clearly have to be depth , since the only depth agents are CooperateBot and DefectBot, and depth PrudentBot already has the perfect behavior against these 3 agents. It would also have to mutually cooperate with FairBot, and defect against CooperateBot.

Suppose a depth agent provably cooperates with FairBot and defects against CooperateBot. Note that since is depth , knows what is, and so knows that . However also know that . Thus knows , so knows , contradiction. Therefore, no agent of depth can have the desired behavior, and our good Nash equilibrium is in fact a Nash equilibrium of the game where both players can play any modal agent.

Evolutionary Simulations

Next, we will consider what happens when a large population of modal agents evolves by playing prisoners' dilemmas. We will consider a system consisting only of DefectBot, CooperateBot, FairBot, and PrudentBot. We will consider a simple model where at each time step, each population grows a small amount proportional to the size of that population times the expected utility a member in that population gets by playing a prisoners' dilemma with a random other agent. Notice that in this model, a population that starts out nonzero will remain nonzero, and a population that starts out 0 will remain 0.

Since the above three Nash equilibria were symmetric Nash equilibria, they are also equilibria of this evolutionary system. This is because the expected utility of being each type of agent that appears any nonzero amount is the same. Since this system keeps populations that start at 0 at 0, there are other equilibria too; for example, any point where all the agents are the same is in equilibrium since it cannot change.

We can then ask about the stability of these equilibria. The pure defect one is stable, since if there are only a small number of agents that are not DefectBots, they won't play each other enough to cancel out their modeling penalty. The DefectBot, FairBot one is unstable, since the more FairBots there are, the better the FairBots do. Unfortunately, the good equilibrium is also unstable. This is harder to see, but a small perturbation will cause a very slow spiraling outward. (I checked this with simulation, but did not prove it mathematically.)

However, I ran a simulation that seemed to show that for a small , and for almost all initial conditions that mix between all four types of agents, the system does not converge, but instead goes in a large cycle, in which the system spends most of its time with almost all FairBots. Eventually a very small number of CooperateBots climbs out to become non-negligible, then as soon as the CooperateBots are a significant proportion, the PrudentBots come in to exploit them, then the PrudentBots quickly turn into FairBots, and the cycle repeats. Meanwhile, the DefectBots are pushed down into a very small proportion of the population.

I also looked at a second model as follows: The population starts with a small finite number of agents of each type. At each time step, a new agent enters, and permanently becomes the modal agent that maximizes expected utility against the existing population. In this simulation, unless you start with very few FairBots or PrudentBots, you will simply switch between adding FairBots, PrudentBots, and CooperateBots, never adding a new DefectBot.

A more careful study could actually find what the dynamics are mathematically rather than depending on simulations, and can consider systems with more or even all modal agents. There also might be other models that are more justified than the one I used. I expect that most ways of doing it will result in very little defection.


I like this result. I just assumed that DefectBot would be the only Nash equilibrium, and I was surprised that the story had a much happier ending. When I first became suspicious, I was thinking that you could get a good equilibrium out of an infinite collection of different agents, but it turns out you only need three. You can view the CooperateBots as defecting against the FairBots by refusing to pay to punish bad actors, and you can view the FairBots as defecting against the PrudentBots by refusing to pay to punish the non-punishers. You might think you would need to punish arbitrary levels on non-punishers to be in equilibrium, but it turns out that the PrudentBots can get paid exactly enough to remain competitive by exploiting the CooperateBots, and the CooperateBots can get exploited exactly enough to cancel out their unwillingness to model the opponent.

New Comment
20 comments, sorted by Click to highlight new comments since:

Curated, here are my thoughts:


  • Expands on some previous work in a pretty intuitive way, and might be quite relevant to long-term AI alignment
  • Uses the opportunity to help illustrate a bunch of generally important concepts in decision theory and game theory
  • Feels like it uses roughly the correct level of formalism for what it is trying to achieve. Being formal enough to be precise, but not so much that it becomes unreadable.


  • The post does sadly have a bunch of dependencies that aren't fully made clear, and I would love to see a comment on the post that has references for where to learn the basic ideas used in the post (both the modal combat framework and more fundamental things such as the concepts of pure and mixed equilibria)
  • I think motivating the reasons for why you explored in this direction, and how it might help with AI alignment would have probably made more people willing to put in the effort to understand this

I am actually worried that because I posted it, people will think it is more relevant to AI safety than it really is. I think it is a little related, but not strongly.

I do think it is surprising and interesting. I think it is useful for thinking about civilization and civilizational collapse and what aliens (or maybe AI or optimization daemons) might look like. My inner Andrew Critch also thinks it is more directly related to AI safety than I do. Also if I thought multipolar scenarios were more likely, I might think it is more relevant.

Also it is made out of pieces such that thinking about it was a useful exercise. I am thinking a lot about Nash equilibria and dynamics. I think the fact that Nash equilibria are not exactly a dynamic type of object and are not easy to find is very relevant to understanding embedded agency. Also, I think that modal combat is relevant, because I think that Lobian handshakes are pointing at an important part of reasoning about oneself.

I think it is relevant enough that it was worth doing, and such that I would be happy if someone expanded on it, but I am not planning on thinking about it much more because it does feel only tangentially related.

That being said, many times I have explicitly thought that I was thinking about a thing that was not really related to the bigger problems I wanted to be working on, only to later see a stronger connection.

What would happen if a random bot is thrown into the mix?

A random bot would be reasoned about basically the same as DefectBot, but would perform worse than it. CooperateBot would cooperate with it, and neither FairBot nor PrudentBot would find a proof that it cooperates, and so would defect against it.

I'm not sure about that actually, it seems implementation dependent-it's certainly outside the current framework.

What you said would be true if RandomBot is truly random. In reality, RandomBot would probably be written using a PRNG, which is deterministic. In this game, both bots are given their opponent's source code as an input. Reasonably, that should include the PRNG, and its seed. Consequently, RandomBot would probably be treated in each round as a CooperateBot or DefectBot-whichever it is at the time. (This might break down the conclusion-the bots don't actually reason about long term consequences, that is all outside the code. They only deal with the program they are against now, and whether 'today' (in the current round) they will cooperate, etc.). It might mess with the proofs-PrudentBot could find that RandomBot is either CooperateBot or DefectBot this round, but if it proving that RandomBot will cooperate with it, is the same as proving that RandomBot's first call the PRNG gets back an even number, then this proof is about something that (potentially) already happened. Then, proving 'how it will treat CooperateBot', could end up being a proof about what it would do if CooperateBot was its opponent next round. This is however, a hypothetical implementation nitpick, about how the programs might not perform as intended outside their original context, in which they faithful implement an idea, in a way that might not generalize without additional work.

Introducing things this way would change the dynamics-a bot which changes how it behaves based on a variable that's different every round can have a more complicated strategy, which takes advantage of the costs of modeling, or (by using a counter) can try to grow to dominance in the population, then change strategies.

One might open up a similar wealth of possibilities by allowing the bots to know the current population, or the lineup of rounds (the tournament's bracket), or work out proofs that handle mixed strategies (there is a 99% chance my opponent will cooperate, and a 1% chance it'll do something random because its a ML algorithm which does random things epsilon=1% of the time in order to learn, etc.).

This is great.

It looked to me like PrudentBot could be done in depth 1 - "(proof that X(PB) cooperates) AND (proof that X(DB) defects)". (and the paper that's about) tells me that's wrong, if you only have the one layer of provability then PB doesn't cooperate with itself. It also had an introduction to the math here that I found helpful.

Very nice. This is the cleanest result on cognitive (or rationality) costs in co-operative systems that I've seen. Modal combat seems kind of esoteric compared to, say, iterated prisoners' dilemma tournaments with memory, but it pays off nicely here. It gives you the outcomes of a set of other-modelling agents (without e.g. doing a whole lot of simulation), and the box-operator depth then plugs in as a natural modelling-cost measure.

Did you ever publish any of your modal combat code (I have a vague recollection that you had some Haskell code ?) ?

There is this:

I have been thinking about this research direction for ~4 days.

No interesting results, though it was a good exercise to calibrate how much do I enjoy researching this type of stuff.

In case somebody else wants to dive into it, here are some thoughts I had and resources I used:


  • The definition of depth given in the post seems rather unnatural to me. This is because I expected it would be easy to relate the depth of two agents to the rank of the world of a Kripke chain where the fixed points representing their behavior will stabilize. Looking at Zachary Gleit's proof of the fixed point theorem (see The Logic of Provability, chapter 8, by G. Boolos) we can relate the modal degree of a fixed point to the number of modal operators that appear in the modalized formula to be fixed. I thought I could go through Gleit's proof counting the number of boxes that appear in the fixed points, and then combine that with my proof of the generalized fixed point theorem to derive the relationship between the number of boxes appearing in the definition of two agents and the modal degree of the fixed points that appear during a match. This ended up being harder than what I anticipated, because naively counting the number of boxes that appear in Gleit's proof makes really fast growing formulas appear and its hard to combine them through the induction of the generalized theorem proof.


  • The Logic of Provability, by G. Boolos. Has pretty much everything you need to know about modal logic. Recommended reading chapters 1,3,4,5,6,7,8.
  • Fixed point theorem of provability logic, by J. Sevilla. An in depth explanation I wrote in Arbital some years ago.
  • Modal Logic in the Wolfram Language, by J. Sevilla. A working implementation of Modal Combat, with some added utilities. It is hugely inefficient and Wolfram is not a good choice because license issues, but may be useful to somebody who wants to compute the result of a couple combats or read about modal combat at introductory level. You can open the attached notebook in the Wolfram Programming Lab.

Thank you Scott for writing this post, it has been useful to get a glimpse of how to do research.

I always like Game Theory that has simple advice applicable to real life: If you want people to cooperate with you, Be Comprehensible!

It could be more extreme than this result implies-I'm not sure how realistic the modeling costs are.

When A is looking for a proof about B(A), and B looking for a proof about A(B), it seems they should either

1) have the same cost which is proportional to the sum of their complexity as measured by length, since A+B=B+A or

2) have the cost for each time a program models the other program be dependent on the other program's length-proving CooperateBot cooperates with you should be easier if only because Cooperate Bot is a shorter program, which isn't trying to prove things about you, which means the fact it has your source code doesn't matter - if it did, this would effectively increases its length. I think FairBot cooperating with itself requires a bit more of a search than CooperateBot doing so, and should have an appropriate cost.

FairBot shouldn't just be penalized epsilon for modeling CooperateBot, it should be penalized based on the length of Cooperate Bot.

CooperateBot is defined by CB(X)↔⊤.
DefectBot is defined by DB(X)↔⊥.
FairBot is defined by FB(X)↔□(X(FB)).
PrudentBot is defined by PB(X)↔□(X(PB)∧(X(DB)→□⊥)).

Their definitions here also reflect that the modeling bots are longer, especially PrudentBot.


The costs could just be (a function of) the length of the proof*, or the amount of time it took to find and check.

*Still reasonable with caching, although costs would be much smaller.

We will use a prisoner's dilemma where mutual cooperation produces utility 2, mutual defiction (sic) produces utility 0, and exploitation produces utility 3 for the exploiter and 0 for the exploited. Each player will also pay a penalty of ε times its depth.

Am I reading this correctly if I think that (cooperate, defect) would produce (0, 3) and (defect, defect) would produce (0, 0)? Is that an error? Because in other parts of the text it looks like (defect, defect) should be (1, 1). Also, (cooperate, defect) should be a nash equilibrium if my interpretation is correct.

That was wrong. Fixed it. Thanks.

This is excellent. I believe that this result is a good simulation of "what we could expect if the universe is populated by aliens".


Assuming the following:

1) aliens consider both destroying other civilizations and too early contact a form of defection

2) aliens reason from udt principles

3) advanced civilizations have some capacity to simulate non advanced ones

Then roughly the model in the post will work to explain what the strategic equlibrium is.

Very interesting.

I think there may also be a Nash equilibrium with a mix of all 4 Bots, for a fairly small band of possible epsilon values.

d = 1 - 3e

f = 1/2

c = e

p = 2e - 1/2,

( ¼ < e < 1/3)

I haven’t proved this is a Nash equilibrium vs any modal agent, maybe you can.

This is unstable but I expect an evolution sim would obtain an oscillating pattern similar to the c,f,p equilibrium you find. In fact, when e = 1/3 this new equilibrium is the same as your original one.

Expected utility is 1+2e and so is less good than c,f,p for allowed epsilon values.

That reminded me of a paper where they evaluate bots with with memory (that comes at a cost):

The interesting intersection here is that memory can be used as a cache, thus avoiding need for re-modeling.

Is the code for this available?

No, sorry. It wouldn't be very readable, and it is easy to do yourself.

In the definition FB(X)↔□(X(FB)) , what does FB(X) mean and what does X(FB) mean?

FB(X) is whether FairBot cooperates when called with the source code of X. X(FB) is whether X cooperates when it is called with the source code of FairBot.