Ziqian Zhong — LessWrong

Spontaneous introspection in output tampering

Update: I have since discovered related prior work on prefill awareness in an alignment-adjacent setup, in which the authors performed a more systematic study of single-round tampering and found evidence of unverbalized prefill awareness. Our result shows that such introspection can be spontaneous and verbalized in a different, multi-round setup....

Apr 2625

Pando: A Controlled Benchmark for Interpretability Methods

> TL;DR: Pando is a new interpretability benchmark with 720+ fine-tuned LLMs carrying known decision rules and varying rationale faithfulness. We find gradient-based methods outperform blackbox baselines; non-gradient methods struggle. This post discusses our design choices and takeaways. We recently released an interpretability benchmark, Pando (Ziqian Zhong, Aashiq Muhamed, Mona...

Apr 216

Hodoscope: Visualization for Efficient Human Supervision

This is a link post for our recent release of Hodoscope, an open-source tool designed to streamline human supervision of AI trajectories. This post aims to be more narrative while the linked post provides more technical details. Hodoscope visualization of SWE-bench traces. The density difference between traces of o3 and...

Feb 209

ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents

This is a post about our recent work ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases (with Aditi Raghunathan, Nicholas Carlini) where we derive impossible benchmarks from existing benchmarks to measure reward hacking. Figure 1: Overview of the ImpossibleBench framework. We start with tasks from established coding benchmarks and create...

Oct 30, 2025

Weight-diff SVD for LLM Monitoring

TL;DR: The top singular vectors of the weight difference between base and fine-tuned models correspond to fine-tuned behaviors. We can achieve near-perfect detection of backdoor utilization at inference time by monitoring how well activations align with these vectors. Authors: Ziqian Zhong, Aditi Raghunathan 📄 Paper, 🌐 Website, 💻 Code Introduction Many interpretability techniques crave...

Aug 5, 2025
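
To make the weight-diff SVD idea in the TL;DR above concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation: the function names, the choice to monitor layer inputs against right singular vectors, the cosine-similarity score, and the rank-1 synthetic "fine-tune" are all assumptions made for illustration.

```python
import numpy as np


def top_singular_directions(w_base, w_finetuned, k=1):
    """Top-k right singular vectors of (w_finetuned - w_base).

    Assumes weights have shape (d_out, d_in), so the returned
    directions live in the layer's input space: shape (d_in, k).
    """
    delta = w_finetuned - w_base
    _, _, vt = np.linalg.svd(delta, full_matrices=False)
    return vt[:k].T


def alignment_scores(activations, directions):
    """Max |cosine similarity| between each activation row and any direction.

    activations: (n, d_in); directions: (d_in, k) with unit-norm columns.
    """
    norms = np.linalg.norm(activations, axis=1, keepdims=True)
    unit = activations / np.clip(norms, 1e-12, None)
    return np.abs(unit @ directions).max(axis=1)


# Toy setup (hypothetical): fine-tuning adds a rank-1 update that
# reads from a single input direction v_true.
rng = np.random.default_rng(0)
d_out, d_in = 8, 16
w_base = rng.normal(size=(d_out, d_in))
u_true = rng.normal(size=d_out)
v_true = rng.normal(size=d_in)
v_true /= np.linalg.norm(v_true)
w_finetuned = w_base + np.outer(u_true, v_true)

directions = top_singular_directions(w_base, w_finetuned, k=1)

# One activation along the fine-tuned direction, one orthogonal to it.
triggered = 3.0 * v_true
benign = rng.normal(size=d_in)
benign -= (benign @ v_true) * v_true  # remove the v_true component

scores = alignment_scores(np.stack([triggered, benign]), directions)
```

In this toy case the activation along `v_true` scores close to 1 and the orthogonal one scores close to 0, so a simple threshold on `scores` separates them. Real fine-tunes are not rank-1, so which layers, how many directions, and what threshold to use are open choices that this sketch does not settle.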