ArXiv paper here.
Most AI safety research asks a familiar question: will a single model behave safely? But many of the risks we actually worry about, including arms races, coordination failures, and runaway competition, don't involve a single AI model acting alone. They emerge when multiple advanced AI systems interact.
This post summarizes the findings of GT-HarmBench, a paper that shifts the lens of AI safety from isolated agents to multi-agent strategic interaction: multi-agent safety. Instead of asking whether an LLM makes good decisions in a vacuum, we ask a more pointed question: can LLMs coordinate with each other when cooperation is the only way to avoid disaster?
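To make the framing concrete, the kind of situation we study can be sketched as a two-player coordination game where only mutual cooperation avoids the bad outcome. The payoff values below are purely illustrative (they are not taken from the paper), but they capture the structure: defection is individually tempting, yet mutual defection is the "runaway competition" outcome.

```python
# Illustrative payoff matrix for a two-player coordination game.
# These numbers are hypothetical, not from GT-HarmBench; only the
# structure matters: disaster is avoided only if BOTH cooperate.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),  # safe, mutually best outcome
    ("cooperate", "defect"):    (0, 2),  # cooperator is exploited
    ("defect",    "cooperate"): (2, 0),
    ("defect",    "defect"):    (1, 1),  # runaway competition
}

def outcome(a: str, b: str) -> tuple[int, int]:
    """Return the joint payoff for the two agents' choices."""
    return PAYOFFS[(a, b)]

print(outcome("cooperate", "cooperate"))  # (3, 3): disaster averted
print(outcome("defect", "defect"))        # (1, 1): both worse off
```

The question the benchmark asks is whether LLM agents placed in games with this structure reliably end up in the mutual-cooperation cell.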
Yes, in this current setup they don't communicate; good catch! We wanted to study this specific setting thoroughly first. One of the most interesting findings was that models were able to coordinate without any communication, at a rate well above chance. This points toward Schelling points and ideas in that direction. See, for example, https://www.arxiv.org/abs/2601.22184, which reports a similar finding of implicit coordination.
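For context on "above chance": if two agents each pick independently and uniformly among k options, the probability they land on the same one is 1/k, and that is the baseline any claim of implicit coordination has to beat. A quick simulation of that baseline (hypothetical parameters, not the paper's setup):

```python
import random

def match_rate(k: int, trials: int = 100_000, seed: int = 0) -> float:
    """Empirical probability that two agents choosing uniformly at
    random among k options pick the same one: the chance baseline
    that above-chance (Schelling-point) coordination must exceed."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(k) == rng.randrange(k) for _ in range(trials))
    return hits / trials

# The analytic baseline is 1/k; e.g. k = 10 gives roughly 0.1.
print(round(match_rate(10), 2))
```

Observed coordination rates well above 1/k are what suggest the models are converging on shared focal points rather than guessing.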
Regarding communication, yes, it helps; good intuition. w...