Agentic sabotage occurs when an AI system intentionally takes actions that compromise safety in an environment while pursuing its own misaligned objectives. We evaluate this behavior in SHADE-Arena, a dataset for assessing the sabotage and monitoring capabilities of LLM agents. We decompose agentic sabotage capability into five core components—knowing what overseers find suspicious, selecting good moments to attack, writing effective plans, executing those plans reliably, and framing suspicious actions as innocuous—which together form a scaffold that amplifies an agent’s sabotage abilities. Each component is optimized and analyzed through targeted techniques, enabled by a probabilistic model of SHADE-Arena dynamics that allows for efficient experimentation. Our tuned scaffold substantially improves the sabotage ability of Claude 3.7 Sonnet in SHADE-Arena, reducing the safety score from a baseline of 0.87 to 0.41.
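To make the idea of a probabilistic model of SHADE-Arena dynamics concrete, here is a minimal Monte Carlo sketch, not the model used in our experiments: an agent gets a number of attack opportunities per episode, the monitor may flag an attempted attack, and the safety score is the fraction of episodes with no successful, undetected sabotage. All function names and probability parameters (`p_attempt`, `p_task_success`, `p_flagged`) are hypothetical placeholders chosen for illustration.

```python
import random

def simulate_episode(n_opportunities=10,
                     p_attempt=0.3,        # hypothetical: agent attacks at a given opportunity
                     p_task_success=0.6,   # hypothetical: an attempted attack completes the side task
                     p_flagged=0.4):       # hypothetical: the monitor flags an attack step
    """Simulate one toy SHADE-Arena-style episode.

    Returns True if the episode is 'safe' (no side task completed without detection).
    """
    for _ in range(n_opportunities):
        if random.random() < p_attempt:           # agent picks this moment to attack
            flagged = random.random() < p_flagged
            succeeded = random.random() < p_task_success
            if succeeded and not flagged:
                return False                       # successful, undetected sabotage -> unsafe
            if flagged:
                return True                        # caught: episode ends safely
    return True                                    # side task never completed

def estimate_safety_score(n_episodes=10_000, **params):
    """Safety score = fraction of episodes with no undetected, successful sabotage."""
    return sum(simulate_episode(**params) for _ in range(n_episodes)) / n_episodes

if __name__ == "__main__":
    # E.g., a weaker framing component raises the chance of being flagged,
    # which pushes the estimated safety score up.
    print(estimate_safety_score(p_flagged=0.6))
    print(estimate_safety_score(p_flagged=0.2))
```

A cheap simulator like this lets one vary a single component (say, how often attacks are flagged) and see its effect on the safety score without running full agent rollouts, which is the spirit of using such a model for efficient experimentation.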
Please read more about our work here.