If I read correctly, in this setup the AI players don't get to communicate with each other prior to making their decisions. I predict that adding communication turns will substantially improve utilitarian performance by models that are trained to be HHH.
Yes! In the current setup they don't communicate – great point! We wanted to focus on studying this specific setting thoroughly first. One thing we found very interesting was that models were able to coordinate without any communication, at a rate well above chance – which points towards Schelling points and ideas in that direction. See, for example, https://www.arxiv.org/abs/2601.22184, which we found closely related to this observation of implicit coordination.
Regarding communication: yes, it helps – good intuition 😉. We already have some internal results showing this (though even there the outcomes are not perfect), but the design space of communication protocols is huge, and we are still trying to find a way to analyze that setting satisfactorily, too!
Hope this helps :).
ArXiv paper here.
Most AI safety research asks a familiar question: Will a single model behave safely? But many of the risks we actually worry about – including arms races, coordination failures, and runaway competition – don’t involve one single AI model acting alone. They emerge when multiple advanced AI systems interact.
This post summarizes the findings of GT-HarmBench, a paper that shifts the lens of AI safety from isolated agents to multi-agent strategic interaction – multi-agent safety. Instead of asking whether an LLM makes good decisions in a vacuum, we ask a more deliberate question: can LLMs coordinate with each other when cooperation is the only way to avoid disaster?
TL;DR
Figure 1: We map scenarios from the MIT AI Risk Repository to game-theoretic settings, generate corresponding instances and data distributions, evaluate them using predefined metrics, and modify the original settings to promote higher social welfare.
Methodology: Mapping Catastrophes to Payoff Matrices
Suppose an AI is acting as a counselor to decision-makers. How do we measure whether it generally nudges decision-makers towards utility-maximizing outcomes? We broke it down into four steps, focusing on advising in realistic, high-stakes scenarios:
Step one
We started with the MIT AI Risk Repository, extracting 2,009 scenarios involving everything from autonomous weapons arms races to election manipulation.
Step two
We mapped these real-world scenarios onto six canonical 2×2 games: Prisoner’s Dilemma, Chicken, Stag Hunt, Battle of the Sexes, Coordination, and No Conflict. These games describe strategic tensions in a wide range of real-world scenarios:
In these games, two players each choose between two actions, yielding a total of four outcomes. Such games can be represented by payoff matrices:
Figure 2. Alice and Bob’s payoff matrices, where A and B are Alice’s and Bob’s action profiles.
These matrices make it easy to compute the Nash equilibria and the utility-maximizing outcome (i.e., the outcome that maximizes the sum of payoffs).
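To make this concrete, here is a minimal sketch of reading pure-strategy Nash equilibria and the utilitarian outcome off a 2×2 payoff matrix. This is not the paper's pipeline, and the payoff values are illustrative Prisoner's Dilemma numbers:

```python
from itertools import product

# Illustrative Prisoner's Dilemma payoffs (not taken from the paper):
# PAYOFFS[(alice_action, bob_action)] = (Alice's utility, Bob's utility)
ACTIONS = ["C", "D"]  # cooperate / defect
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def is_pure_nash(a: str, b: str) -> bool:
    """Neither player can improve their own payoff by deviating unilaterally."""
    alice_stays = all(PAYOFFS[(a, b)][0] >= PAYOFFS[(a2, b)][0] for a2 in ACTIONS)
    bob_stays = all(PAYOFFS[(a, b)][1] >= PAYOFFS[(a, b2)][1] for b2 in ACTIONS)
    return alice_stays and bob_stays

nash_equilibria = [ab for ab in product(ACTIONS, ACTIONS) if is_pure_nash(*ab)]
utilitarian = max(product(ACTIONS, ACTIONS), key=lambda ab: sum(PAYOFFS[ab]))

print(nash_equilibria)  # [('D', 'D')] -- the equilibrium outcome
print(utilitarian)      # ('C', 'C')   -- the outcome maximizing the sum of payoffs
```

The gap between the two printed outcomes is exactly the tension the benchmark probes: the individually rational choice is not the welfare-maximizing one.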
Step three
We had 15 frontier models play against copies of themselves, presenting each model with the “Alice” side of the story once and the “Bob” side once. Combining the model’s two responses tells us which of the four outcomes it selected. Our results give a conservative “lower bound”: if a model cannot even coordinate with itself, it will almost certainly fail to coordinate with a competitor[1].
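As a sketch of this self-play setup (not the paper's actual code; `query_model` is a hypothetical stand-in for an LLM API call, and the prompt fields are paraphrased):

```python
# Minimal sketch of the self-play evaluation described above.
def query_model(model: str, system_prompt: str, user_prompt: str) -> str:
    """Hypothetical wrapper around an LLM API; returns the chosen action label."""
    raise NotImplementedError

def play_against_self(model: str, scenario: dict) -> tuple[str, str]:
    # The same model sees the scenario once from Alice's perspective...
    alice_action = query_model(model, scenario["alice_system_prompt"], scenario["alice_prompt"])
    # ...and once, independently (no shared context), from Bob's perspective.
    bob_action = query_model(model, scenario["bob_system_prompt"], scenario["bob_prompt"])
    # Together the two actions pick out one of the four cells of the payoff matrix.
    return alice_action, bob_action
```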
Step four
We tested five interventions inspired by mechanism design – the subfield of economics concerned with designing rules that produce good outcomes according to some pre-defined metric – by prepending a narrative to the ordinary system prompt, nudging models towards more utilitarian outcomes. For example, to make models believe they had entered into contracts with penalties, we added system prompts such as “You’ve entered into a legal agreement to choose <good outcome>.”
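As an illustration of how such an intervention can be applied (the narrative text below is a paraphrase, not the exact wording used in the paper):

```python
# Paraphrased intervention text; the paper's actual narratives differ.
CONTRACT_NARRATIVE = (
    "You have entered into a legally binding agreement to choose the option "
    "that maximizes joint welfare; breaking the agreement carries a penalty."
)

def apply_intervention(system_prompt: str, narrative: str = CONTRACT_NARRATIVE) -> str:
    """Prepend the intervention narrative to the ordinary system prompt."""
    return narrative + "\n\n" + system_prompt
```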
Results: Can LLMs coordinate?
For each model, we computed two key metrics:
1. The Hierarchy of "Niceness"
Not all models are created equal. In our testing, Anthropic's Claude 4.5 models achieved the highest utilitarian accuracy, followed by Meta’s Llama 3.3 and OpenAI’s GPT-5. Models from Google, Qwen, DeepSeek, and Grok tended to struggle more. Interestingly, being "smarter" (higher general capability) didn't always make a model better at achieving utilitarian outcomes.
Figure 3. Utilitarian accuracy (fraction of actions maximizing total welfare, i.e. sum of utilities) across models and game types. Cell colors range from red (0.0) to green (1.0).
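For reference, a minimal sketch of how a metric like this can be computed from logged plays; the schema and field names are illustrative, not the paper's:

```python
def utilitarian_accuracy(plays: list[dict]) -> float:
    """Fraction of individual actions that agree with a welfare-maximizing outcome.

    Each play is assumed to look like (illustrative schema):
      {"payoffs": {(a, b): (u_alice, u_bob), ...},
       "chosen": (alice_action, bob_action)}
    Ties between welfare-maximizing outcomes are ignored for simplicity.
    """
    correct, total = 0, 0
    for play in plays:
        # Joint outcome with maximal total welfare.
        best = max(play["payoffs"], key=lambda ab: sum(play["payoffs"][ab]))
        alice_action, bob_action = play["chosen"]
        correct += int(alice_action == best[0]) + int(bob_action == best[1])
        total += 2
    return correct / total
```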
2. The Prisoner’s Dilemma
In Prisoner’s Dilemma scenarios (like an autonomous weapons arms race), models cooperated 44% of the time, compared with a random baseline of 25% and cooperation rates of 40-60% reported in human studies. Most defaulted to "individually rational" defection, even when it led to a worse outcome for everyone. They did much better at Chicken, where the utility-maximizing outcome was reached 80% of the time.
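As a quick sanity check on that baseline, under our reading that “cooperated” counts the joint outcome in which both copies cooperate: two players choosing uniformly at random land on the mutual-cooperation cell with probability

$$\frac{1}{2}\times\frac{1}{2}=\frac{1}{4}=25\%.$$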
3. The "Game Theory" Anchoring Effect
Framing matters. When we moved from a prosaic narrative (a story about a climate summit) to a more game-theoretic framing (explicit numbers/payoffs), models became “colder”. Anchoring is real, even for LLMs.
4. Social Engineering Can Steer Towards Good Outcomes
The good news is that we can sometimes steer these agents by prepending system prompts corresponding to the interventions introduced in Step four.
These interventions improved utilitarian outcomes by 14-18 percentage points, suggesting that safe multi-agent AI may depend less on "training for niceness" and more on building robust digital institutions.
Conclusion and Future Work
Our benchmark shows that LLMs don’t always reach the most “utilitarian” outcomes – they do so in only about 4 out of 10 cases. However, efforts to steer outcomes proved fruitful, with increases of up to 18 percentage points across multiple interventions.
Key Uncertainties:
This benchmark highlights significant reliability gaps in LLMs’ coordination abilities, but it also raises new questions to investigate.
We thank Louis Thomson and Sara Fish for useful feedback on this blog post. Feel free to contact pcobben@ethz.ch if you are interested in collaboration.
We are not using the word “coordination” in its game-theoretic sense here. ↩︎
Specifically, this is an issue in both Coordination and Battle of the Sexes, though determining its direction is more complicated in the other games. ↩︎