
cjgwang — LessWrong


Student @ Oxford Uni


9mo


Attack Selection In Agentic AI Control Evals Can Decrease Safety

Equal contributions: Cath Ge-Wang and Tyler Crosse

Summary

* AI control evaluations often model attackers as following a precomputed attack policy, such as an optimal attack insertion rate. However, a highly capable adversary may instead implement attack selection, strategically choosing to attack only when side-task success is highly likely.
*...