Attack Selection In Agentic AI Control Evals Can Decrease Safety
Equal contributions: Cath Ge-Wang and Tyler Crosse

Summary

* AI control evaluations often model attackers as following a precomputed attack policy, such as an optimal attack insertion rate. However, a highly capable adversary may instead implement attack selection, strategically choosing to attack only when side-task success is highly likely.
*...
Apr 1422