Semantic Similarity, Not Model Capability, Drives Monitor Evasion in AI Control
TL;DR: I ran 575 experiments testing whether an AI attacker can hide sabotaged code from an AI monitor. The main finding surprised me: what determines evasion success is not how smart the attacker is, or how much code it...
Mar 26