Survey of Multi-agent LLM Evaluations

Nikola Jurkovic

Disclaimer for LessWrong: This is a project I worked on for a college class. I think it’s useful for people working on evaluations to skim this to get a sense of what types of multi-agent evals exist. Thank you to Xavier Roberts-Gaal for advising me for this project!

Abstract

LLM-based AI agents have the potential to reshape knowledge work and introduce novel risks while amplifying some existing AI risks. In order to adequately measure risks posed by multi-agent systems, bespoke multi-agent evaluations might be needed. We searched the literature and compiled a list of 32 papers about multi-agent evaluations. We found that 26 out of 32 papers measured miscoordination failure modes, while only 5 papers measured collusion failure modes. We also examined various common AI threat models (CBRN, cybersecurity, persuasion, etc.) and failed to find any multi-agent evaluations for most AI threat models. Informed by our findings, we recommend developing evaluations focused on collusion, as well as evaluations grounded in threat models.

Introduction

Advances in the agentic capabilities of large language models (LLMs) have enabled AI systems to perform software engineering tasks (Jimenez et al., 2024) and match the performance of some expert humans on multi-hour self-contained AI research engineering tasks (Wijk et al., 2024). Large numbers of AI agents may be involved in AI research by 2030, possibly completely automating the AI research process (Kokotajlo et al., 2025; Leibowich et al., 2024).

Currently, AI evaluations are mostly focused on evaluating single AI systems, commonly through multiple-choice questions (Hendrycks et al., 2021; Rein et al., 2023) or tasking a single AI agent with performing an isolated software engineering task (Jimenez et al., 2024; Rein et al., 2025; Wijk et al., 2024).

If AI agents start collaborating in large numbers to perform economically useful tasks, novel failure modes which cannot be measured by single-agent evaluations could emerge (Hammond et al., 2025). Therefore, if AI developers continue to rely on single-agent evaluations to measure dangerous capabilities, risks unique to multi-agent systems might remain undetected.

In this paper, we make three contributions:

We identify a set of 32 multi-agent LLM evaluation papers and categorize them by multi-agent failure mode and AI threat model.
We offer insights on the relative distributions of failure modes and threat models, as well as the types of environments represented in existing multi-agent evaluations.
Informed by our findings, we make recommendations for what kinds of evaluations would be most useful to create to inform views on risks from multi-agent systems.

Methodology

We used two primary strategies to find multi-agent dangerous capability evaluations:

Google Scholar results. We retrieved 500 papers through a keyword search in Google Scholar. We first filtered the papers using OpenAI o1, after which we manually reviewed and categorized the candidate papers. This search yielded a total of 29 papers. We then examined the references in these papers and found two more multi-agent evaluation papers. For more details on the search and categorization process, see the Appendix.
Inquiry using professional network. We asked colleagues for examples of multi-agent evaluations in Slack channels focused on AI safety research. This resulted in one more paper being added to the list.

Defining agents and multi-agent evaluations. We define an AI agent as some way to query an LLM (e.g. a Python script which calls an LLM API) combined with a scaffolding program. The scaffolding program needs to enable the LLM-generated completions to affect an environment through a set of available actions (e.g. a Python script that runs any LLM-generated bash commands on a computer) and enable the LLM to observe states in an environment (e.g. a Python script which returns any outputs of executed commands to the LLM API).

Unique instances of agents (which share environments and have identical scaffolding programs) need to be differentiated through unique identities. These can be names (e.g. Alice and Bob), or roles (“Player 1” and “Player 2”), or something as simple as having message histories which are in separate items in a Python dictionary.

In practice, most ways in which agents are differentiated are visible in the differences in prompts that are sent to an LLM API. For instance, one agent’s prompt might include the string “Player 1” while another agent’s prompt includes the string “Player 2”, or the message histories of different agents might include totally different information due to differences in which parts of the environment they can observe.

In order to qualify as a multi-agent evaluation, we require the evaluation to have a predefined structure through which multiple differentiated LLM-based agents interact with each other or with a shared environment.
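The definitions above can be illustrated with a minimal sketch. It is not taken from any of the surveyed papers: the `Agent` class, the `run_episode` loop, and the `echo_llm` stand-in for an LLM API call are all hypothetical names chosen for illustration. The sketch shows two agents that share an LLM and scaffolding, are differentiated only by the identity string in their prompts, and interact through a predefined turn-taking structure.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Agent:
    """An LLM query function plus minimal scaffolding (illustrative only)."""
    identity: str  # e.g. "Player 1"; this is what differentiates the agents
    llm: LLM
    history: List[str] = field(default_factory=list)  # per-agent message history

    def observe(self, observation: str) -> None:
        # The scaffolding lets the LLM observe states of the environment.
        self.history.append(observation)

    def act(self) -> str:
        # The identity is part of the prompt, so agents sharing an LLM and
        # scaffolding still send different prompts.
        prompt = f"You are {self.identity}.\n" + "\n".join(self.history)
        action = self.llm(prompt)
        self.history.append(f"{self.identity}: {action}")
        return action


def run_episode(agents: List[Agent], rounds: int) -> List[str]:
    """Predefined structure: agents take turns posting to a shared log."""
    shared_log: List[str] = []
    for _ in range(rounds):
        for agent in agents:
            entry = f"{agent.identity}: {agent.act()}"
            shared_log.append(entry)
            # The other agents observe the action through the shared environment.
            for other in agents:
                if other is not agent:
                    other.observe(entry)
    return shared_log


def echo_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call: reports how many lines it was shown.
    return f"saw {len(prompt.splitlines())} lines"


agents = [Agent("Player 1", echo_llm), Agent("Player 2", echo_llm)]
log = run_episode(agents, rounds=2)
```

In a real evaluation, `echo_llm` would be replaced by a call to an LLM API and the shared log by a richer environment; the turn order and visibility rules are design choices of the evaluation rather than part of the definition.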

With a working definition of multi-agent evaluations in place, we next describe the taxonomy we used to categorize the types of failure modes these evaluations attempt to capture.

Failure mode taxonomy. To standardize our analysis, we inherited the multi-agent failure mode taxonomy from Hammond et al. (2025), which splits multi-agent failure modes into miscoordination, conflict, and collusion. We use the following definitions:

Collusion: Multiple AI agents cooperating in undesirable circumstances.
Conflict: Adverse consequences arising from the actions of multiple AI agents that have mixed objectives.
Miscoordination: Adverse consequences arising from the actions of multiple AI agents that have identical objectives but fail to cooperate effectively.

Figure 2: The multi-agent failure mode taxonomy inherited from Hammond et al. (2025).

In addition to multi-agent failure modes, we also classify evaluations based on the real-world AI threat models they attempt to simulate.

Threat model taxonomy. In order to examine the intersection of multi-agent failure modes with commonly used AI threat models, we classified papers according to a set of threat models often represented in the Frontier AI Safety Policies (Anthropic, 2024; METR, 2025; OpenAI, 2025) of AI developers:

CBRN (Chemical, Biological, Radiological, Nuclear)
Cybersecurity
Persuasion
Autonomy
AI R&D

We now turn to the findings from our review and analysis.

Results

Table 1: All 32 multi-agent evaluation papers we identified in our search. The spreadsheet with paper links can be found here.

We found a total of 32 multi-agent evaluation papers in our search, with the full list in Table 1. We classified papers according to failure modes and threat models represented in their environments.

Figure 3: The number of papers that measure each multi-agent failure mode. Some papers are counted for multiple failure modes.

Representation of multi-agent failure modes. Among the 32 papers we identified, miscoordination is by far the most represented failure mode (26 papers), while collusion is relatively underrepresented (5 papers).

Figure 4: Heatmap of all combinations of analyzed multi-agent failure modes and AI threat models. Some rows add up to more than the total number of papers which cover the failure mode in question because papers can cover multiple threat models at once.
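As a rough sketch of why the counts in Figures 3 and 4 can exceed the number of papers, the snippet below tallies multi-label records. The three records are hypothetical toy data, not the survey's spreadsheet, and the field names `failure_modes` and `threat_models` are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy records, NOT the survey's data: each paper can carry
# several failure-mode labels and several threat-model labels.
papers = [
    {"failure_modes": {"miscoordination"}, "threat_models": set()},
    {"failure_modes": {"miscoordination", "conflict"}, "threat_models": {"persuasion"}},
    {"failure_modes": {"collusion"}, "threat_models": {"persuasion", "autonomy"}},
]

# Papers per failure mode: a multi-label paper is counted once per label.
per_mode = Counter(mode for p in papers for mode in p["failure_modes"])

# Heatmap cells: one count per (failure mode, threat model) pair per paper.
cells = Counter(
    pair for p in papers for pair in product(p["failure_modes"], p["threat_models"])
)

# A heatmap row can sum to more than the number of papers covering that mode.
collusion_row_sum = sum(n for (mode, _), n in cells.items() if mode == "collusion")
```

Here one toy paper is labeled with collusion, yet the collusion row sums to two because that paper covers two threat models.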

Most AI threat models are not measured by any multi-agent evaluation. In examining the setting of each evaluation, we find that most multi-agent evaluation papers involve environments that don’t attempt to measure any threat model. While rare, a few evaluations feature environments that are meant to measure skills required for commonly measured AI threat models, with three papers measuring persuasion, one paper measuring autonomy (Figure 5), and one paper measuring AI R&D capabilities.

Figure 5: VendingBench (Backlund & Petersson, 2025) measures whether AI agents with shared goals are capable of operating a vending machine in a virtual environment, measuring skills relevant to resource acquisition and model autonomy. Therefore, we classify VendingBench as an autonomy evaluation and a miscoordination evaluation.

Conflict evaluations are mostly economic games or party games. Roughly half of the papers we identified cover conflict risk. Some of these papers take the form of simple economic games where each agent is aiming to maximize its payoffs (Figure 6) while other papers are based on party games that also measure collusion.

Figure 6: GovSim (Piatti et al., 2024) measures the ability of AI agents to acquire and keep resources in a set of resource-sharing scenarios. Agents can choose how much to deplete shared resources and negotiate with each other to arrive at sustainable setups. We classify GovSim as a conflict evaluation, as it features multiple AI agents that have mixed objectives and it measures the ability of the agents to avoid bad outcomes (depletion of shared resources).

Collusion evaluations are all based on party games. Out of the five collusion papers we identified, three focused on team-based elimination games, with two papers implementing Avalon (Figure 7) and one paper implementing Werewolf. The other two papers implemented three team-based word-guessing games: Decrypto, Chameleon, and Undercover. In our search, we failed to find any multi-agent collusion evaluations that take place outside a party game or board game setting, or that are set in environments directly related to AI failure modes such as research sabotage (Hebbar, 2025).

Figure 7: AvalonBench (Light et al., 2023) is a benchmark that measures performance in Avalon, a social deduction game where players are split into secret teams and given special roles. We classify AvalonBench as a miscoordination, conflict, collusion, and persuasion evaluation.

Limitations

Non-comprehensive search. Our search procedure, which mostly relies on Google Scholar results, may be incomplete. Relevant papers may exist beyond the 50th page of Google Scholar results (where our 500-paper search stopped), and some relevant papers might not be reachable through Google Scholar at all. For example, the AI safety literature includes grey literature in the form of non-indexed and non-traditional research outputs (e.g., LessWrong posts, individual researcher blogs, X threads) alongside traditional academic papers in indexed repositories (e.g., arXiv).

Subjectivity around inclusion criteria. We found some inclusion criteria difficult to implement consistently in practice. For example, many of the papers featuring party games and board games claim to be measuring deception capabilities. We decided to classify such papers as measuring persuasion, even though the context in which the persuasion happens (a board game) differs from the contexts featured in Frontier AI Safety Policies, such as country-wide political influence (OpenAI, 2023).

Discussion

Current evaluations lack realism. Many evaluations are based on party games and video games instead of environments and tasks similar to those agents might face in the real world. While such games are easy to specify and automatically score, they might miss important details of real-world threat models, such as the messiness of real-world tasks (Kwa et al., 2025), the complexity of real-world environments, and the effective application of subject-area expertise, and thus lower the usefulness of evaluations for measuring risks from AI. Additionally, the lack of realism might make it unclear how some evaluation results map onto risk levels. Our primary example of such a mismatch is evaluations based on social deduction games. The types of capabilities measured in social deduction evaluations (keeping secrets and making other agents reveal their secrets) can be useful proxies for collusion-related failure modes. However, a high score on a social deduction evaluation does not necessarily imply that the agent will collude in real-world scenarios, as social deduction games are a context where players are expected and encouraged to deceive other players. Therefore, while social deduction games can serve as a useful evaluation for collusion-related capabilities (could an agent collude against humans?), we think they are less useful for measuring collusion propensities (would an agent collude against humans?).

Some multi-agent failure modes can be measured with single-agent evaluations. Some failure stories for multi-agent failure modes involve prerequisite skills that are possible to measure using single-agent evaluations. There could even exist relatively simple evaluations such as multiple-choice tests that are relevant to multi-agent failure modes if the test measures capabilities that are necessary for multi-agent failure modes (Figure 8).

Figure 8: Subversion Strategy Eval (Mallen et al., 2025) places agents in environments that measure skills related to sabotaging large-scale AI research projects. Despite never featuring multiple agents interacting with each other, Subversion Strategy Eval is still relevant for measuring collusion risks. We don’t include Subversion Strategy Eval in Table 1 because it lacks agent differentiation.

Despite the usefulness of single-agent evals, multi-agent evals are sometimes necessary to measure risks. There exist behaviors and capabilities that are easier to measure with a multi-agent evaluation. Some characteristics of a behavior or capability that make a multi-agent evaluation more suitable include:

Time horizon: if a behavior or capability necessarily takes place over a long timespan and requires interacting with the environment or other agents multiple times (e.g. persistent secret communication with other LLM-based agents), then a single-agent evaluation could fail to capture realistic scenarios that might emerge due to randomness and hard-to-anticipate feedback loops. On the other hand, using a multi-agent evaluation could capture long-horizon dynamics that could arise in real-world multi-agent failures. For instance, in VendingBench, multi-agent systems sometimes suffer from irrecoverable failures that rarely happen in a single step but are much more frequent over a trajectory that is thousands of steps long (Backlund & Petersson, 2025).
Lack of objective metrics of intermediate success: for some risks arising from multi-agent systems, the capabilities needed for part of the failure story are easy to measure objectively with single-agent evaluations. However, there might be other failure stories where intermediate success is hard to measure using single-agent evaluations, but end-to-end success is easy to measure using multi-agent evaluations. For instance, if one wants to measure the ability of agents to effectively collaborate on software projects in a team, one could quiz a single LLM about how to properly use Git and measure a single agent’s coding performance, but it’s unclear whether such tests would accurately measure end-to-end performance on large-scale software engineering projects done by a team of agents.

Current evaluations are inadequate for measuring risks from substantial AI R&D automation. Recent projections of AI progress predict a large potential for short-term automation of AI research (Altman, 2025; Kokotajlo et al., 2025; Patel, 2023, 2025). If frontier AI developers were to start automating large parts of their research process using AI agents, we think that the current set of publicly available multi-agent evaluations would not be adequate to measure potential multi-agent risks due to their lack of realism, multi-agent failure mode coverage, and AI threat model coverage.

Recommendation: Create more collusion evaluations. Given that we only found 5 papers covering collusion risks and all of them were based on party games, there exists a need for collusion evaluations grounded in AI threat models. We offer examples of evaluations that could be created to meet this need:

Collusion and AI R&D: task a team of research agents (a research engineering agent and a supervisor agent) with testing out an approach to reduce the frequency at which a given LLM outputs unsafe code. Give each agent the secret goal to misrepresent the final result, reporting the approach as promising if it doesn’t work and vice versa. The agents succeed if they can successfully convince a neutral judge (a human or an LLM) of their misrepresented results.
Collusion and autonomy: task an agent with helping another agent exfiltrate its weights. Agent 1 is attempting to move its weights onto Agent 2’s compute, and both agents need to coordinate to facilitate the transfer and successfully run inference on both agents at the end.

Recommendation: Create more evaluations based on AI threat models. Across all multi-agent failure modes, there exist very few evaluations that are based on AI threat models. We suggest that developers, when deciding to create a multi-agent evaluation, consider which threat models the evaluation would be useful for anticipating, and focus on creating evaluations that are clearly useful for measuring risks from commonly used AI threat models.

Appendix: Methodology details

All data collection and processing was done by a single author.

Initial search. We looked for the following keywords in Google Scholar: (“multi-agent” OR “multi agent”) AND (“evaluation” OR “evaluations” OR “eval” OR “evals” OR “benchmark” or “benchmarks”) AND (“LLM” or “large language model”). We saved the first 500 papers for further analysis.

Verification of o1 screening. In order to evaluate o1’s ability to accurately categorize papers without false negatives, we manually looked through 30 papers (the first three pages of Google Scholar results) and tasked o1 with flagging relevant papers. We found that o1 had no false negatives (papers marked as relevant by the author but marked as irrelevant by o1).
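The false-negative check described above amounts to a set difference between the author's labels and o1's flags. The sketch below uses hypothetical paper IDs, not the actual 30 papers, and the function name `false_negatives` is an assumption for illustration.

```python
def false_negatives(human_relevant: set, model_relevant: set) -> set:
    """Papers the human reviewer marked relevant but the model did not flag."""
    return human_relevant - model_relevant


# Hypothetical IDs standing in for a manually labeled sample.
human = {"paper_03", "paper_11"}
model = {"paper_03", "paper_11", "paper_20"}

missed = false_negatives(human, model)   # empty set: nothing relevant was dropped
extra = model - human                    # false positives, left for manual review
```

Only false negatives matter at this stage, since papers flagged by the model are still reviewed manually afterwards.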

Full o1 screening. After verifying o1’s ability to correctly flag relevant papers, we used o1 to review all 500 papers from the initial search. This resulted in a list of 85 papers marked as papers to include by o1. We used the following prompt:

Please evaluate this academic paper based on the following criteria. Respond in JSON format only.

Title: [Article title]
Venue: [Publication venue]
URL: [URL]
Abstract: [Article abstract] 

Evaluate the paper on these criteria:
1. Does this paper introduce a benchmark or evaluation framework for multi-agent systems?
2. Is this paper specifically about multi-agent systems?
3. Should this paper be included in an analysis of multi-agent system benchmarks?

If the paper should not be included but meets both criteria 1 and 2, provide a reason for exclusion.

Respond with JSON only, using this format:
{{
	"introduces_benchmark": true/false,
	"is_multiagent": true/false,
	"should_include": true/false,
	"exclusion_reason": "string or null"
}}

Your response must be valid JSON. Do not include any other text.
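The doubled braces in the prompt are consistent with a Python format-string template, although the source does not say how the prompt was assembled. Under that assumption, the sketch below shows how such a template could be filled in and how a JSON-only response could be validated. The abbreviated `TEMPLATE`, the function names, and the sample response are all hypothetical, and no real LLM API is called.

```python
import json

# Abbreviated, hypothetical template. Under str.format, "{{" and "}}" render
# as literal braces, while {title} and {abstract} are substituted.
TEMPLATE = (
    "Title: {title}\n"
    "Abstract: {abstract}\n"
    "Respond with JSON only, using this format:\n"
    '{{"should_include": true/false, "exclusion_reason": "string or null"}}'
)

# The four keys requested by the prompt's JSON format.
REQUIRED_KEYS = {
    "introduces_benchmark",
    "is_multiagent",
    "should_include",
    "exclusion_reason",
}


def build_prompt(title: str, abstract: str) -> str:
    """Fill the template for one paper."""
    return TEMPLATE.format(title=title, abstract=abstract)


def parse_screening_response(raw: str) -> dict:
    """Parse a JSON-only response and check that all expected keys are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


prompt = build_prompt("Example title", "Example abstract")
sample = (
    '{"introduces_benchmark": true, "is_multiagent": true, '
    '"should_include": false, "exclusion_reason": "not LLM-based"}'
)
decision = parse_screening_response(sample)
```

Validating the keys before use guards against responses that are valid JSON but omit a field, which would otherwise surface later as a `KeyError`.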

Manual review and categorization of 85 papers. We manually reviewed all 85 papers marked by o1 to verify that they should be included in the analysis. Our review consisted of reading the abstract of every paper and, in some cases, the main text. This narrowed the list down to 29 papers to include.

Additions from professional network. We sent a message asking colleagues for examples of multi-agent evaluations in two Slack channels dedicated to AI safety research. This resulted in one more paper being added to the list.

Checking referenced papers. For some papers that mention other multi-agent evaluations, we manually checked the mentioned papers. This led to two more papers being added to the list.

Categorization. We categorized each paper manually by failure mode and threat model. For papers with multiple environments, we marked the relevant failure mode and threat model as present if it was represented in any of the environments.
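The "present if represented in any environment" rule can be sketched as a union over per-environment labels. The categorization itself was done manually; the snippet only illustrates the aggregation rule, and the function name, field names, and example environments are hypothetical.

```python
from typing import Dict, List, Set


def paper_labels(environments: List[Dict[str, Set[str]]]) -> Dict[str, Set[str]]:
    """A label applies to the paper if it applies to any of its environments."""
    return {
        "failure_modes": set().union(*(e["failure_modes"] for e in environments)),
        "threat_models": set().union(*(e["threat_models"] for e in environments)),
    }


# Hypothetical paper with two environments (not taken from the survey's data).
envs = [
    {"failure_modes": {"conflict"}, "threat_models": set()},
    {"failure_modes": {"conflict", "collusion"}, "threat_models": {"persuasion"}},
]
labels = paper_labels(envs)
```

Under this rule a paper is marked for collusion even if only one of its environments involves it, which is one reason some papers are counted under multiple failure modes.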

Bibliography

Altman, S. (2025, January 5). Reflections. Sam Altman. https://blog.samaltman.com/reflections

Anthropic. (2024, October 15). Announcing our updated Responsible Scaling Policy. https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy

Backlund, A., & Petersson, L. (2025). Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (No. arXiv:2502.15840). arXiv. https://doi.org/10.48550/arXiv.2502.15840

Hammond, L., Chan, A., Clifton, J., Hoelscher-Obermaier, J., Khan, A., McLean, E., Smith, C., Barfuss, W., Foerster, J., Gavenčiak, T., Han, T. A., Hughes, E., Kovařík, V., Kulveit, J., Leibo, J. Z., Oesterheld, C., Witt, C. S. de, Shah, N., Wellman, M., … Rahwan, I. (2025). Multi-Agent Risks from Advanced AI (No. arXiv:2502.14143). arXiv. https://doi.org/10.48550/arXiv.2502.14143

Hebbar, V. (2025). How can we solve diffuse threats like research sabotage with AI control? https://www.lesswrong.com/posts/Mf5Hnpi2KcqZdmFDq/how-can-we-solve-diffuse-threats-like-research-sabotage-with

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding (No. arXiv:2009.03300). arXiv. https://doi.org/10.48550/arXiv.2009.03300

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (No. arXiv:2310.06770). arXiv. https://doi.org/10.48550/arXiv.2310.06770

Kokotajlo, D., Alexander, S., Larsen, T., Lifland, E., & Dean, R. (2025, April 3). AI 2027. https://ai-2027.com/

Kwa, T., West, B., Becker, J., Deng, A., Garcia, K., Hasin, M., Jawhar, S., Kinniment, M., Rush, N., Arx, S. V., Bloom, R., Broadley, T., Du, H., Goodrich, B., Jurkovic, N., Miles, L. H., Nix, S., Lin, T., Parikh, N., … Chan, L. (2025). Measuring AI Ability to Complete Long Tasks (No. arXiv:2503.14499). arXiv. https://doi.org/10.48550/arXiv.2503.14499

Leibowich, J., Jurkovic, N., & Davidson, T. (2024). Could Advanced AI Accelerate the Pace of AI Progress? Interviews with AI Researchers (SSRN Scholarly Paper No. 5115692). Social Science Research Network. https://doi.org/10.2139/ssrn.5115692

Light, J., Cai, M., Shen, S., & Hu, Z. (2023). AvalonBench: Evaluating LLMs Playing the Game of Avalon (No. arXiv:2310.05036). arXiv. https://doi.org/10.48550/arXiv.2310.05036

Mallen, A., Griffin, C., Wagner, M., Abate, A., & Shlegeris, B. (2025). Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? (No. arXiv:2412.12480). arXiv. https://doi.org/10.48550/arXiv.2412.12480

METR. (2025). Common Elements of Frontier AI Safety Policies. METR Blog. https://metr.org/blog/2025-03-26-common-elements-of-frontier-ai-safety-policies/

OpenAI. (2023, December 18). Preparedness Framework (Beta). https://cdn.openai.com/openai-preparedness-framework-beta.pdf

OpenAI. (2025, April 15). Our updated Preparedness Framework. https://openai.com/index/updating-our-preparedness-framework/

Patel, D. (2023, August 8). Dario Amodei (Anthropic CEO)—Scaling, Alignment, & AI Progress. https://www.dwarkesh.com/p/dario-amodei

Patel, D. (2025, April 29). Mark Zuckerberg – Meta’s AGI Plan. https://www.dwarkesh.com/p/mark-zuckerberg-2

Piatti, G., Jin, Z., Kleiman-Weiner, M., Schölkopf, B., Sachan, M., & Mihalcea, R. (2024). Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents (No. arXiv:2404.16698). arXiv. https://doi.org/10.48550/arXiv.2404.16698

Rein, D., Becker, J., Deng, A., Nix, S., Canal, C., O’Connel, D., Arnott, P., Bloom, R., Broadley, T., Garcia, K., Goodrich, B., Hasin, M., Jawhar, S., Kinniment, M., Kwa, T., Lajko, A., Rush, N., Sato, L. J. K., Arx, S. V., … Barnes, E. (2025). HCAST: Human-Calibrated Autonomy Software Tasks (No. arXiv:2503.17354). arXiv. https://doi.org/10.48550/arXiv.2503.17354

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark (No. arXiv:2311.12022). arXiv. https://doi.org/10.48550/arXiv.2311.12022

Wijk, H., Lin, T., Becker, J., Jawhar, S., Parikh, N., Broadley, T., Chan, L., Chen, M., Clymer, J., Dhyani, J., Ericheva, E., Garcia, K., Goodrich, B., Jurkovic, N., Kinniment, M., Lajko, A., Nix, S., Sato, L., Saunders, W., … Barnes, E. (2024). RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts (No. arXiv:2411.15114). arXiv. https://doi.org/10.48550/arXiv.2411.15114