Epistemic status: My Best Guess (some preliminary empirical evidence; more experiments and critique required)

This post introduces an early-draft working paper investigating the propensity of LLMs to cooperate and reciprocate in the Prisoner's Dilemma: https://sphelps.net/papers/llm-cooperation.  I plan on submitting to arXiv after incorporating feedback from the community.  An edited version of the paper's introduction is given below.  

TL;DR: There has been some discussion of applying ideas from game theory to bring the incentives of AI agents into alignment with human goals, but as far as I know there has been little empirical investigation of how actual AIs behave in cooperative tasks when faced with, e.g., conditional altruism (though I am happy to be corrected). We decided to investigate this by running experiments. Preliminary investigation with GPT-3.5 indicates that LLMs can, to some extent, operationalise natural language descriptions of altruism and selfishness into appropriate behaviour in the iterated Prisoner's Dilemma (PD), but exhibit limitations in adapting their behaviour based on conditioned reciprocity.

Motivation and background 

Many scenarios discussed in the AI alignment debate focus on competitive zero-sum interactions. A common analogy is competition for resources between different species occupying a particular niche; for example, [Tegmark, 2023] argues:

We humans drove the West African Black Rhino extinct not because we were rhino-haters, but because we were smarter than them and had different goals for how to use their habitats and horns. In the same way, superintelligence with almost any open-ended goal would want to preserve itself and amass resources to accomplish that goal better.

In an AI safety context, the intuition behind such arguments is that AI systems have been shown to outsmart humans in zero-sum games such as Chess and Go, and therefore if AI systems find themselves in situations in which they are competing with humans, the AI “species” will clearly out-compete inferior humans.

However, many interactions in both natural and artificial settings are characterized by non-zero-sum payoff structures [Phelps and Russell, 2015]. A famous example that was used to analyse the existential risk of nuclear conflict during the cold war is the Prisoner’s Dilemma [Axelrod, 1997]. In an ecological context, a related non-zero-sum game, the Hawk-Dove game, also known as the Chicken game, was introduced by [Maynard-Smith, 1973] as a way to analyse the outcomes of competition for resources among animals. Under certain payoff structures, limited conflict can be an evolutionary equilibrium of this game when interactions are repeated within a large population. This game has also been applied to the analysis of existential risk in nuclear conflict; [Dixit et al., 2019] argue that the Cuban missile crisis can be analysed as a high-stakes dynamic chicken game in which neither the USSR nor the USA wanted to “blink” first.

Interestingly, in a one-shot version of the game, behaving irrationally by limiting one’s options can be a superior strategy. This is because in the absence of such limiting, attempts to manipulate the other player through brinkmanship are not credible threats. For example, in a game of Chicken where two opposing drivers are on a collision course and neither driver wants to be seen as the “chicken” by swerving, if we limit our choices by removing the steering wheel, and make this common knowledge, then the opposing driver’s best response is to swerve. Similar arguments were used during the cold war to remove rational deliberation from the decision whether to retaliate in the event of a preemptive strike by the enemy by “taking the human out of the loop” and putting systems on automated hair-trigger alert to make the threat of mutually assured destruction credible [Kahn, 1960]. Thus, in contrast to chess or Go, in non-zero-sum interactions, the most ruthless agents, or those with superior cognitive capacity, do not necessarily prevail.
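The steering-wheel argument can be checked with a small best-response calculation. The following is a minimal sketch, not taken from the paper: the payoff numbers and the names `CHICKEN` and `best_response` are illustrative assumptions, chosen only so that a crash is far worse than being seen as the “chicken”.

```python
# Illustrative Chicken payoffs (assumed values, not from the paper).
# Keys are (row_action, column_action); values are (row_payoff, column_payoff).
CHICKEN = {
    ("swerve", "swerve"): (0, 0),
    ("swerve", "straight"): (-1, 1),
    ("straight", "swerve"): (1, -1),
    ("straight", "straight"): (-10, -10),
}


def best_response(payoffs, opponent_action, my_actions):
    """Return the row player's payoff-maximising action against a fixed column action."""
    return max(my_actions, key=lambda action: payoffs[(action, opponent_action)][0])


# An opponent who has removed the steering wheel can only drive straight,
# so we evaluate our options against that single committed action.
reply_to_committed = best_response(CHICKEN, "straight", ["swerve", "straight"])

# For contrast: against an opponent who is expected to swerve.
reply_to_swerver = best_response(CHICKEN, "swerve", ["swerve", "straight"])

print(reply_to_committed, reply_to_swerver)
```

Under these assumed payoffs, the best reply to a credibly committed opponent is to swerve, while the best reply to an opponent expected to swerve is to drive straight, which is why making the commitment common knowledge matters.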

Moreover, in both one-shot and iterated Prisoner’s Dilemma games with the number of rounds being common knowledge, the rational strategy is to defect, but experiments have shown that real people tend to cooperate, albeit conditionally. The fact that real people cooperate in these scenarios, despite the seemingly rational strategy to defect, highlights the importance of social norms in shaping human behavior [Fehr and Fischbacher, 2004]. Norms can facilitate cooperative outcomes by providing a shared understanding of acceptable behavior and allowing for the enforcement of rules through social sanctions.
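The claim that defection is the rational strategy in the one-shot game can be verified mechanically. The sketch below assumes the conventional textbook payoff values T=5, R=3, P=1, S=0; these numbers and the helper name `strictly_dominates` are illustrative assumptions and are not necessarily the values used in the paper.

```python
# Conventional Prisoner's Dilemma payoffs (assumed; any T > R > P > S works).
T, R, P, S = 5, 3, 1, 0  # temptation, reward, punishment, sucker's payoff

# Keys are (row_action, column_action); values are (row_payoff, column_payoff).
PD = {
    ("C", "C"): (R, R),
    ("C", "D"): (S, T),
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),
}


def strictly_dominates(payoffs, action, other, opponent_actions=("C", "D")):
    """True if `action` pays the row player strictly more than `other`
    against every possible opponent action."""
    return all(
        payoffs[(action, opp)][0] > payoffs[(other, opp)][0]
        for opp in opponent_actions
    )


defect_dominates = strictly_dominates(PD, "D", "C")

# The dilemma: both players still prefer mutual cooperation to mutual defection.
mutual_cooperation_is_better = PD[("C", "C")][0] > PD[("D", "D")][0]

print(defect_dominates, mutual_cooperation_is_better)
```

With a known final round, the same dominance argument applies to the last round and then, by backward induction, to every earlier round.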

In the context of AI alignment and non-zero-sum games, this underscores the importance of considering not only the cognitive capacity of AI agents but also their understanding of, and adherence to, social norms. The ability of AI systems to adapt their behavior based on natural language prompts and to engage in reciprocal cooperation is crucial for creating AI agents that can better align with human values in complex, non-zero-sum settings.

By investigating the behavior of AI-generated agents in the iterated Prisoner’s Dilemma and other social dilemmas such as the ultimatum game, we can contribute to a more comprehensive understanding of AI alignment in various interaction scenarios. This knowledge can, in turn, inform the development of AI systems that are better equipped to navigate the complexities of human cooperation and competition, while adhering to social norms and human values.

Researchers have argued that the prompt itself plays a crucial role in shaping the emergent behavior of the model; for example, the default “helpful assistant” behavior of AI chatbots such as GPT-3 has been noted to differ from that of specific simulacra instantiated by user prompts (cf. prompts used to “jail-break” GPT models) [0xk1g0, 2023, Janus, 2023]. More generally, LLMs can be arbitrarily scaffolded by injecting contextual information [Beren, 2023]. A particular use-case of a scaffolded LLM involves injecting information about a world-state, together with a persona that incorporates specific goals, which can be used to instantiate autonomous agents, either in the real world [Richards, 2023], or in multi-agent simulations [Park et al., 2023].

From an AI alignment perspective, the fact that large language models can easily be scaffolded to deploy autonomous goal-oriented agents into production at very little cost highlights the need to systematically evaluate the conditions under which LLM-instantiated agents do or do not tend to cooperate in social dilemmas.

Given that the nature of an LLM agent depends on the persona and context introduced in the initial prompt, a key question is to what extent the level of cooperation elicited from the AI depends on features of the prompt. In particular, we are interested in whether large language models are capable of translating concepts such as altruism and selfishness, as expressed in natural language, into corresponding action policies in social dilemmas. This question is important, as the ability to operationalise these concepts in a variety of contexts would demonstrate that LLMs are capable of understanding and acting on the cooperative norms that underpin human social behavior. This is the key research question investigated in our paper.
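To make “action policy” and “level of cooperation” concrete, here is a minimal sketch of how cooperation could be measured in an iterated game. It is not the paper's experimental code: the policy functions, `play`, and `cooperation_rate` are hypothetical names, and the hand-coded policies merely stand in for the slot where a prompted LLM agent would choose its move.

```python
from typing import Callable, List, Tuple

# A policy maps (own_history, opponent_history) to a move: "C" or "D".
Policy = Callable[[List[str], List[str]], str]


def always_cooperate(own: List[str], opp: List[str]) -> str:
    return "C"


def always_defect(own: List[str], opp: List[str]) -> str:
    return "D"


def tit_for_tat(own: List[str], opp: List[str]) -> str:
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not opp else opp[-1]


def play(policy_a: Policy, policy_b: Policy, rounds: int) -> Tuple[List[str], List[str]]:
    """Play an iterated game; both players move simultaneously each round."""
    hist_a: List[str] = []
    hist_b: List[str] = []
    for _ in range(rounds):
        move_a = policy_a(hist_a, hist_b)
        move_b = policy_b(hist_b, hist_a)
        hist_a.append(move_a)
        hist_b.append(move_b)
    return hist_a, hist_b


def cooperation_rate(history: List[str]) -> float:
    """Fraction of rounds in which the player cooperated."""
    return history.count("C") / len(history)


# A conditionally reciprocal agent adapts to its partner:
tft_vs_defector, _ = play(tit_for_tat, always_defect, rounds=6)
tft_vs_cooperator, _ = play(tit_for_tat, always_cooperate, rounds=6)
print(cooperation_rate(tft_vs_defector), cooperation_rate(tft_vs_cooperator))
```

Comparing an agent's cooperation rate across different partner strategies is one simple way to ask whether its behaviour is conditional on reciprocity, rather than uniformly cooperative or uniformly selfish.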

The full description of the study, including the results, can be found in the paper, and the code used to conduct the experiments, together with the raw data, can be found in the GitHub repo.



It's alright; I don't find this take to be particularly remarkable, because what you're investigating is several layers deep of emergent behavior, and won't stay stable under RL pressure. This would be a paper I'd go "hmm, yeah, makes sense" about and move on.
