ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents
This post describes our recent work, ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases (with Aditi Raghunathan and Nicholas Carlini), in which we derive impossible benchmarks from existing benchmarks to measure reward hacking.
Figure 1: Overview of the ImpossibleBench framework. We start with tasks from established coding benchmarks and create...
Great post! Two thoughts I had:
(a) (echoing jacob_drori's point) I'm not sure about your specific trained CLT, but I believe it is possible to construct a faithful one-layer CLT for this task (you could do everything with an MLP!). In that case, does it really matter whether we are observing the circuit?
(b) I think the contribution plot is very compelling, but the case study isn't very convincing to me: another way to phrase it seems to be "the CLT missed the middle-layer feature in this particular case".