[Update 2023-06-09: See this comment for some caveats / motivations / context about this post.]

There are multiple levels on which an agent can understand and implement a particular decision theory. This post describes a taxonomy of understanding required for robust cooperation in the Prisoner's Dilemma.

The main point I'm trying to convey in this post is that the benefits of implementing and following a particular decision theory don't necessarily come from merely understanding the theory, even if your understanding is deep, precise, and correct.

For example, just because you correctly understand logical decision theories and their implications (e.g. why LDT agents might cooperate with each other in the prisoner's dilemma under certain conditions), that doesn't mean that you yourself are an LDT agent, even if you want to be.

Actually implementing a decision theory in real situations often requires hard cognitive work to correctly model your counterparties, as well as the ability to make your own mind and decision process model-able and legible enough to your counterparties. You must have some way of making your outwardly-professed intentions highly correlated with your actual behavior and decision process. The difficulty of creating a robust and outwardly-visible correlation between your behavior prior to making a decision, and your actual decision, varies by the kind of agent you are, and with the capabilities of the agent you are trying to cooperate with or defect against - for a human cooperating with another human, it might be very difficult. For agents whose decision process source code is easily accessible, it may be relatively easy.

The rest of this post describes different levels of comprehension an agent might have about decision theory when applied to the Prisoner's Dilemma. Achieving higher levels in the taxonomy likely requires the agent to be relatively more capable and coherent than agents that achieve only lower levels. For the final level, the agent must have some degree of control over its own internal thought patterns and decision processes.

Note: the taxonomy in this post is kind of a fake framework, but it might be useful in clearing up some common misconceptions (e.g. ones that lead to worries that this post is trying to address), and explaining why Decision theory does not imply that we get to have nice things.

Level 1: Understanding that good things are good, and bad things are bad

At this level, you are smart enough to comprehend the payoff matrix for the Prisoner's Dilemma, and what it means:

            1: C      1: D
2: C       (3, 3)    (5, 0)
2: D       (0, 5)    (2, 2)

If you're player 1, you recognize that you would prefer (D, C) > (C, C) > (D, D) > (C, D).

You may or may not understand that (D, C) might be unrealistic or hard to obtain in many situations - a fabricated option, potentially. You recognize that if (C, C) and (D, D) were the only options on the table, (C, C) would be preferable to (D, D).
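A minimal Python sketch of what level 1 amounts to, assuming the payoffs in the matrix above are listed as (player 1, player 2). The names `PAYOFFS` and `ranked_outcomes` are illustrative choices, not anything from the decision theory literature:

```python
# Payoffs as (player 1, player 2), keyed by (player 1's move, player 2's move).
# The numbers come from the matrix above; the layout is an assumption.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}


def ranked_outcomes(player):
    """Return outcomes sorted from most to least preferred for `player` (0 or 1)."""
    return sorted(PAYOFFS, key=lambda moves: PAYOFFS[moves][player], reverse=True)


player1_ranking = ranked_outcomes(0)
print(" > ".join(f"({a}, {b})" for a, b in player1_ranking))
# prints: (D, C) > (C, C) > (D, D) > (C, D)
```

Nothing here involves the other player's reasoning at all: a level 1 agent only reads off its own column of the table.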

Agents below this level of understanding don't understand their own preferences on even a basic level - bad things can happen to them, and they might say "ouch", but they're often stepping on their own toes across time, or making decisions for plainly incoherent or wrong reasons, given their own professed or revealed preferences.

Level 2: An understanding of, and desire to avoid, the Nash equilibrium

At this level, you understand the payoff matrix in level 1, and also recognize the symmetry of the situation. You'd like to get the (D, C) outcome, but you recognize your opponent is trying for (C, D), and you see how this could be a problem.

You understand the concept of a Nash equilibrium, and see that (D, D) is the only such equilibrium in this game. You yearn to do Something Else Which is Not That, and you see that symmetry may play an important role. But you don't necessarily know what role or how to formalize it. Perhaps you intuitively see why you would cooperate with an identical clone of yourself placed in identical decision-making circumstances.
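The claim that (D, D) is the only Nash equilibrium can be checked by brute force. The sketch below assumes the same (player 1, player 2) payoff layout as the matrix in level 1, and only considers pure strategies; `is_nash` and `pareto_dominates` are hypothetical helper names:

```python
from itertools import product

# Payoffs as (player 1, player 2), keyed by (player 1's move, player 2's move).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}
MOVES = ("C", "D")


def is_nash(m1, m2):
    """True if neither player can gain by unilaterally switching moves."""
    p1, p2 = PAYOFFS[(m1, m2)]
    p1_cannot_improve = all(PAYOFFS[(alt, m2)][0] <= p1 for alt in MOVES)
    p2_cannot_improve = all(PAYOFFS[(m1, alt)][1] <= p2 for alt in MOVES)
    return p1_cannot_improve and p2_cannot_improve


def pareto_dominates(a, b):
    """True if outcome `a` is strictly better than outcome `b` for both players."""
    return all(x > y for x, y in zip(PAYOFFS[a], PAYOFFS[b]))


equilibria = [pair for pair in product(MOVES, repeat=2) if is_nash(*pair)]
print(equilibria)  # [('D', 'D')]
print(pareto_dominates(("C", "C"), ("D", "D")))  # True
```

The second line of output is the source of the level 2 discomfort: the only stable outcome is one that both players would happily trade for (C, C).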

Level 3: Knowing and understanding formal theories for when and how to avoid Nash equilibria

You understand the kind of mathematics an agent must implement to achieve (C, C) robustly, and under which circumstances you can actually pull a fast one on your counterparty and get (D, C) with high probability.

You understand precisely why PrudentBot cooperates with FairBot but defects against CooperateBot. You see the advantage and desirability of generalizing the concept of pre-commitment, and of making each decision by choosing the optimal decision making algorithm over all possible world states. But you don't necessarily know how to implement this kind of decision process (which may be provably intractable or undecidable in general), even in very limited circumstances.
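To make the bot behavior concrete, here is a rough Python sketch. It is not the construction from the paper: the real FairBot and PrudentBot search for proofs about each other's source code, while this version swaps in depth-limited simulation with an optimistic "cooperate" default when the depth budget runs out (a crude stand-in for the Löbian step). `MAX_DEPTH`, `play`, and the function signatures are assumptions made for illustration.

```python
MAX_DEPTH = 3  # arbitrary simulation budget (an assumption, not from the paper)


def cooperate_bot(opponent, depth):
    return "C"


def defect_bot(opponent, depth):
    return "D"


def fair_bot(opponent, depth):
    # Cooperate iff the opponent (as simulated) cooperates with FairBot.
    if depth == 0:
        return "C"  # optimistic default; the proof-based version needs no such hack
    return "C" if opponent(fair_bot, depth - 1) == "C" else "D"


def prudent_bot(opponent, depth):
    # Cooperate iff the opponent cooperates with PrudentBot AND the opponent
    # defects against DefectBot (i.e. it is not unconditionally exploitable).
    if depth == 0:
        return "C"
    cooperates_with_me = opponent(prudent_bot, depth - 1) == "C"
    # Fresh budget here, so the opponent's depth-0 optimism cannot mask its
    # real response to DefectBot (loosely mirroring the paper's stronger
    # proof system for this check).
    punishes_defectors = opponent(defect_bot, MAX_DEPTH) == "D"
    return "C" if cooperates_with_me and punishes_defectors else "D"


def play(bot_a, bot_b, depth=MAX_DEPTH):
    """Return (bot_a's move, bot_b's move)."""
    return bot_a(bot_b, depth), bot_b(bot_a, depth)


print(play(prudent_bot, fair_bot))       # ('C', 'C')
print(play(prudent_bot, cooperate_bot))  # ('D', 'C')
print(play(fair_bot, defect_bot))        # ('D', 'D')
```

The sketch reproduces the pattern described above, but only because the bots can run each other directly. That access to the counterparty's actual decision procedure is exactly the part that is hard to obtain outside of a toy setting.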

Level 4: Actually avoiding (D,D) for decision theory reasons

You understand everything in level 3, and are capable of actually implementing such a decision theory, in at least some circumstances and with some counterparties. Your decisions are actually correlated with your and your counterparties' understanding of decision theory and models of each other. The (perhaps literal) source code for your decision-making process is verifiably accessible to your counterparties. You may be able to modify your source at will, but your counterparties can verify that the process you use to actually make your decision is based only on some relatively simple algorithm which you expose. You (provably and verifiably to your counterparty) cooperate with PrudentBot and FairBot, and defect against CooperateBot.

Humans probably struggle to achieve this level, except in limited circumstances or by using specialized techniques. In a True Prisoner's Dilemma with another human, two humans might cooperate, and might profess or actually believe that their cooperation is a result of their mutual understanding of decision theory. But in many cases, they are actually cooperating for other reasons (e.g. valuing friendliness or a sense of honor).

Among humans, level 3 (understanding) is probably a prerequisite for level 4 (implementation). For AI systems, this may or may not be true for particular decisions, depending on the situation under which those decisions are made. For example, the programs in this paper are too simple to have an understanding of decision theory themselves, but under appropriate conditions, they may be sufficient for actually implementing a decision process correctly according to some formal decision theory.

Note that if you're a level 1 agent or below playing against some other agent, the other agent will mostly cooperate or defect against you for non-decision theoretic reasons, regardless of their own level of decision theory comprehension. If the other agent is confident that you'll cooperate (for whatever reason) and they're feeling nice or friendly towards you, or have some other kind of honor-value, they may cooperate. Otherwise, they'll defect. Decision theory starts to become relevant when both agents are level 2 or above, but most outcomes in a PD are probably not functions of (only) the two agents' decision theories until both agents are at level 4.

In sum: A human or AI system might have a deep, precise, and correct understanding of a decision theory, as well as a desire or preference to adhere to that decision theory, without actually being capable of performing the counterparty modeling, legibility, and other cognitive work necessary to implement that decision theory to any degree of faithfulness.



This post received a pretty mixed-to-negative reception.

Looking back on my own writing, I think there are at least a couple of issues:

  • It's pretty high context; much of the post relies on the reader already understanding or at least being familiar with https://arxiv.org/abs/1401.5577 and logical decision theories. To readers who are familiar with those things, the ideas in this post might not be very interesting or novel.
  • It's somewhat unmotivated: it's not clear what position or misconception a real person might actually have that this post clears up or argues against.

    This post is an attempt to explain why a bunch of students who have learned all about game theory and decision theory won't necessarily end up with a bunch of (C,C) outcomes in a classroom simulation. This is true even if the students really want to cooperate, and they have correctly understood the class material on a very deep level. There's still a missing "implementation" piece involving legibility and counterparty modeling that requires separate cognitive skills (or the ability to set up a bot arena, make binding commitments / arrangements outside the simulation, etc.), which aren't necessarily closely related to understanding the decision theory itself.

    But maybe this point is obvious even to the (fictional) students themselves. Anyway, regardless of whether you read or liked this post, I recommend reading planecrash, especially if you like fiction with lots of decision theory mixed in.

Pretty cool! 
Just to add, although I think you already know: we don't need to have a reflexive understanding of your DT to put it into practice, because messy brains rather than provable algo etc....
And I always feel it's kinda unfair to dismiss as orthogonal motivations "valuing friendliness or a sense of honor" because they might be evolutionarily selected heuristics to (sort of) implement such acausal DT concerns!

without actually being capable of performing the counterparty modeling, legibility, and other cognitive work necessary to implement that decision theory to any degree of faithfulness

This is not needed, you can just submit PrudentBot as your champion for a given interaction, committing to respect the adjudication of an arena that has the champions submitted by yourself and your counterparty. The only legibility that's required is the precommitment to respect adjudication of the arena, which in some settings can be taken out of players' hands by construction.

PrudentBot is modelling its counterparty, and the setup in which it runs is what makes the modelling and legibility possible. To make PrudentBot work, the comprehension of decision theory, counterparty modelling, and legibility are all required. It's just that these elements are spread out, in various ways, between (a) the minds of the researchers who created the bots (b) the source code of the bots themselves (c) the setup / testbed that makes it possible for the bots to faithfully exchange source code with each other.

Also, arenas where you can submit a simple program are kind of toy examples - if you're facing a real, high-stakes prisoner's dilemma and you can set things up such that you can just have some programs make the decisions for you, you're probably already capable of coordinating and cooperating with your counterparty sufficiently well that you could just avoid the prisoner's dilemma entirely, if it were happening in real life and not a simulated game.

PrudentBot's counterparty is another program intended to be legible, not a human. The point is that in practice it's not necessary to model any humans, humans can delegate legibility to programs they submit as their representatives. It's a popular meme that humans are incapable of performing Löbian cooperation, because they can't model each other's messy minds, that only AIs could make their own thinking legible to each other, granting them unique powers of coordination. This is not the case.

if it were happening in real life and not a simulated game

Programs and protocols become real life when they are given authority to enact their computations. To the extent Pareto-inefficient outcomes actually happen in real life, it's worth replacing negotiations with things like this, and falling back to BATNA when the arena says (D,D).

The point is that in practice it's not necessary to model any humans,

Right, but my point is that it's still necessary for something to model something. The bot arena setup in the paper has been carefully arranged so that the modelling is in the bots, the legibility is in the setup, and the decision theory comprehension is in the authors' brains.

I claim that all three of these components are necessary for robust cooperation, along with some clever system design work to make each component separable and realizable (e.g. it would be much harder to have the modelling happen in the researcher brains and the decision theory comprehension happen in the bots).

Two humans, locked in a room together, facing a true PD, without access to computers or an arena or an adjudicator, cannot necessarily robustly cooperate with each other for decision theoretic reasons, even if they both understand decision theory.

When you don't model your human counterparty's mind anyway, it doesn't matter if they comprehend decision theory. The whole point of delegating to bots is that only understanding of bots by bots remains necessary after that. If your human counterparty doesn't understand decision theory, they might submit a foolish bot, while your understanding of decision theory earns you a pile of utility.

So while the motivation for designing and setting up an arena in a particular way might be in decision theory, the use of the arena doesn't require this understanding from its human users, and yet it can shape incentives in a way that defeats bad equilibria of classical game theory.

It's nice to separate the levels between modeling a decision theory, analyzing multiple theories, and actually implementing decisions.  I don't think I'd number them that way, or imply they're single-dimensional.  

It seems level 1 is a prereq for level 4, and 2 and 3 are somewhat inter-dependent, but it's not clear that an agent needs to understand the formal theory in order to actually avoid D,D.  The programmer of the agent does, but even then "understand" may be too much of a stretch, if evolutionary or lucky algorithms manage it.

but it's not clear that an agent needs to understand the formal theory in order to actually avoid D,D.


They definitely don't, in many cases - humans in PDs cooperate all the time, without actually understanding decision theory.

The hierarchy is meant to express that robustly avoiding (D,D) for decision theory-based reasons requires that either the agent itself, or its programmers, understand and implement the theory.

Each level is intended to be a prerequisite for the levels that follow it, modulo the point that, in the case of programmed bots in a toy environment, the comprehension can be either in the bot itself, or in the programmers that built the bot.

I don't see how level 2 depends on anything in level 3 - being at level 2 just means you understand the concept of a Nash equilibrium and why it is an attractor state. You have a desire to avoid it (in fact, you have that desire even at level 1), but you don't know how to do so, robustly and formally.

I appreciate the detailed taxonomy in this post, and it's an insightful way to analyze the gaps between understanding and implementing decision theory. However, I believe it would be even more compelling to explore how AI-based cognitive augmentation could help humans bridge these gaps and better navigate decision processes. Additionally, it would be interesting to examine the potential of GPT-style models to gain insight into AGI and its alignment with human values. Overall, great read!

Thanks! Glad at least one person read it; this post set a new personal record for low engagement, haha. 

I think exploring ways that AIs and / or humans (through augmentation, neuroscience, etc.) could implement decision theories more faithfully is an interesting idea. I chose not to focus directly on AI in this post, since I think LW, and my own writing specifically, has been kinda saturated with AI content lately. And I wanted to keep this shorter and lighter in the (apparently doomed) hope that more people would read it.