Ari Isaacs — LessWrong

Ari Isaacs

6d

Cold Evals: a Low-Cost Intervention Against Conversation Steering Attacks

Epistemic Status: Exploratory. As of April 2026, it seems disturbingly easy to steer state-of-the-art deployed models into outputs that would normally read as disallowed. I have not implemented or empirically tested this defence; what follows is an argument for why it may be conceptually sound and cheap enough to matter....