Update: I have since discovered related prior work on prefill awareness in an alignment-adjacent setup, in which the authors performed a more systematic study of single-round tampering and found evidence of unverbalized prefill awareness. Our result shows that such introspection can be spontaneous and verbalized in a different, multi-round setup....
> TL;DR: Pando is a new interpretability benchmark with 720+ fine-tuned LLMs carrying known decision rules and varying rationale faithfulness. We find that gradient-based methods outperform black-box baselines, while non-gradient methods struggle. This post discusses our design choices and takeaways. We recently released an interpretability benchmark, Pando (Ziqian Zhong, Aashiq Muhamed, Mona...
This is a link post for our recent release of Hodoscope, an open-source tool designed to streamline human supervision of AI trajectories. This post aims to be more narrative while the linked post provides more technical details. Hodoscope visualization of SWE-bench traces. The density difference between traces of o3 and...
This is a post about our recent work ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases (with Aditi Raghunathan, Nicholas Carlini) where we derive impossible benchmarks from existing benchmarks to measure reward hacking. Figure 1: Overview of the ImpossibleBench framework. We start with tasks from established coding benchmarks and create...
TL;DR The top singular vectors of the weight difference between base and fine-tuned models correspond to fine-tuned behaviors. We can achieve near-perfect detection of backdoor utilization at inference time by monitoring the alignment of activations with these vectors. Authors: Ziqian Zhong, Aditi Raghunathan 📄 Paper, 🌐 Website, 💻 Code Introduction Many interpretability techniques crave...
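To make the TL;DR above concrete, here is a minimal NumPy sketch of the general idea on a toy linear layer. It is not the authors' implementation: the function names, the choice of right singular vectors (the input side of a layer computing `W @ x`), the cosine-similarity score, and the synthetic rank-1 "fine-tuning" update are all assumptions made for illustration.

```python
import numpy as np

def top_singular_directions(w_base, w_tuned, k=1):
    """Top-k right singular vectors of the weight difference (hypothetical helper)."""
    delta = w_tuned - w_base                      # shape (d_out, d_in)
    _, _, vt = np.linalg.svd(delta, full_matrices=False)
    return vt[:k]                                 # shape (k, d_in), unit-norm rows

def alignment_scores(activations, directions):
    """Max absolute cosine similarity of each activation with the directions."""
    norms = np.linalg.norm(activations, axis=1, keepdims=True)
    unit = activations / np.maximum(norms, 1e-12)
    return np.abs(unit @ directions.T).max(axis=1)

# Toy setup (assumption): fine-tuning adds a rank-1 update along direction v.
rng = np.random.default_rng(0)
d_out, d_in = 16, 32
w_base = rng.normal(size=(d_out, d_in))
u = rng.normal(size=(d_out, 1))
v = rng.normal(size=(1, d_in))
v /= np.linalg.norm(v)
w_tuned = w_base + 5.0 * (u @ v)

directions = top_singular_directions(w_base, w_tuned, k=1)

# Ordinary inputs vs. inputs shifted along the fine-tuned direction.
normal_acts = rng.normal(size=(100, d_in))
shifted_acts = normal_acts + 10.0 * v

normal_scores = alignment_scores(normal_acts, directions)
shifted_scores = alignment_scores(shifted_acts, directions)
print("mean score, ordinary inputs:", normal_scores.mean())
print("mean score, shifted inputs: ", shifted_scores.mean())
```

In this toy case the recovered direction matches the planted one up to sign, and inputs shifted along it score higher on average than ordinary inputs. Flagging inputs whose score exceeds a threshold calibrated on clean data is one way such monitoring could be set up, though the actual method may differ in which vectors and activations it uses.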