Cold Evals: a Low-Cost Intervention Against Conversation Steering Attacks
Epistemic Status: Exploratory. As of April 2026, it seems disturbingly easy to steer state-of-the-art deployed models into outputs that would normally read as disallowed. I have not implemented or empirically tested this defence; what follows is an argument for why it may be conceptually sound and cheap enough to matter....
Apr 15