Brittle model organisms obstruct deception elicitation work
This work was done by Advik Raj Basani with Daniel Tan and Chloe Li as part of SPAR Spring 2026. tl;dr: Finetuning-based auditing methods for model organisms may unintentionally erase the deceptive behavior we are trying to measure, leading to an illusion of success. We study secret side constraint (SSC)...
Jun 2219