Semantic Similarity, Not Model Capability, Drives Monitor Evasion in AI Control
TL;DR: I ran 575 experiments testing whether an AI attacker can hide sabotaged code from an AI monitor. The main finding surprised me: what determines evasion success is not how smart the attacker is, or how much code it...
Mar 26