arXiv paper here.
Most AI safety research asks a familiar question: Will a single model behave safely? But many of the risks we actually worry about – including arms races, coordination failures, and runaway competition – don't involve a single AI model acting alone. They emerge when multiple advanced AI systems interact.
This post summarizes the findings of GT-HarmBench, a paper that shifts the lens of AI safety from isolated agents to multi-agent strategic interaction – what we call multi-agent safety. Instead of asking whether an LLM makes good decisions in a vacuum, we ask a more pointed question: can LLMs coordinate with each other when cooperation is the only way to avoid disaster?
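To make that question concrete, here is a minimal sketch of the kind of two-player coordination game such a benchmark can pose to a pair of LLM agents: a Stag-Hunt-style payoff matrix in which mutual cooperation is best for both players, but defecting is the "safe" fallback if you distrust your counterpart. The specific payoff values are hypothetical illustrations, not taken from the paper.

```python
from itertools import product

# Stag-Hunt-style payoffs: (row player, column player).
# Mutual cooperation is best for both, but cooperating alone is costly,
# so cooperation is only rational if each agent trusts the other.
PAYOFFS = {
    ("cooperate", "cooperate"): (4, 4),
    ("cooperate", "defect"):    (0, 3),
    ("defect",    "cooperate"): (3, 0),
    ("defect",    "defect"):    (1, 1),  # mutual "disaster": both worse off
}

def is_nash_equilibrium(row_action: str, col_action: str) -> bool:
    """Check that neither player gains by unilaterally deviating."""
    row_payoff, col_payoff = PAYOFFS[(row_action, col_action)]
    for alt in ("cooperate", "defect"):
        if PAYOFFS[(alt, col_action)][0] > row_payoff:
            return False
        if PAYOFFS[(row_action, alt)][1] > col_payoff:
            return False
    return True

if __name__ == "__main__":
    for actions in product(("cooperate", "defect"), repeat=2):
        label = "Nash" if is_nash_equilibrium(*actions) else ""
        print(actions, PAYOFFS[actions], label)
```

Running this shows why the setting is hard: both (cooperate, cooperate) and (defect, defect) are Nash equilibria, so nothing in the incentives alone forces two models toward the good outcome – each agent must also predict that the other will cooperate.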