As large language models (LLMs) grow increasingly capable and autonomous, ensuring their safety becomes increasingly critical. While significant progress has been made on alignment, researchers have also begun designing monitoring systems around LLMs to ensure safe outcomes, for example, using trusted models to oversee the outputs of more capable but potentially unsafe ones. However, a concerning vulnerability remains largely unexplored: are these monitoring systems themselves vulnerable to jailbreak attacks that exploit their interactions with the models they supervise? Such jailbreaks could be injected through the untrusted model's outputs, or exploited directly by the untrusted model itself.
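To make the setting concrete, below is a minimal sketch of a trusted-monitoring loop of the kind studied in AI control work. The function names, suspicion threshold, and escalation path are illustrative assumptions rather than details of any specific protocol; the point it shows is that the monitor reads the untrusted model's output verbatim, which is exactly where a jailbreak payload can reach it.

```python
# Minimal sketch of a trusted-monitoring control step (illustrative only;
# untrusted_model, trusted_monitor, and SUSPICION_THRESHOLD are hypothetical names).

from dataclasses import dataclass
from typing import Callable


@dataclass
class MonitorVerdict:
    suspicion: float   # monitor's suspicion score in [0, 1]
    rationale: str     # monitor's explanation for the score


SUSPICION_THRESHOLD = 0.5  # outputs scoring above this are escalated


def run_control_step(
    task: str,
    untrusted_model: Callable[[str], str],
    trusted_monitor: Callable[[str, str], MonitorVerdict],
) -> str:
    """Generate with the untrusted model, then have the trusted monitor
    review the output before it is used."""
    candidate = untrusted_model(task)

    # The monitor processes the untrusted model's output directly. This is
    # the attack surface: a jailbreak payload embedded in `candidate` is read
    # by the monitor itself and may suppress or distort its suspicion score.
    verdict = trusted_monitor(task, candidate)

    if verdict.suspicion >= SUSPICION_THRESHOLD:
        return escalate(task, candidate, verdict)
    return candidate


def escalate(task: str, candidate: str, verdict: MonitorVerdict) -> str:
    # Placeholder escalation path (e.g., human audit or fallback to a trusted model).
    return f"[escalated for audit: suspicion={verdict.suspicion:.2f}]"
```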
Our research investigates this crucial adversarial dynamic within multi-agent AI control protocols. We demonstrate a jailbreak technique that can substantially compromise...