PhD Student at UMass Amherst
Instead I would describe the problem as arising from a generator and verifier mismatch: when the generator is much stronger than the verifier, the generator is incentivized to fool the verifier without completing the task successfully.
I think these are related but separate problems - even with a perfect verifier (on easy domains), scheming could still arise.
Though imperfect verifiers increase P(scheming), better verifiers increase the domain of "easy" tasks, etc.
Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that's an alignment stress test.)
I have a similar confusion (see my comment here) but seems like at least Ryan wants control evaluations to cover this case? (perhaps on the assumption that if your "control measures" are successful, they should be able to elicit aligned behavior from scheming models and this behavior can be reinforced?)
I think the concern is less "am I making intellectual progress on some project" and more "is the project real / valuable"
IMO most exciting mech-interp research since SAEs, great work.
A few thoughts / questions:
makes sense - I think I had in mind something like "estimate P(scheming | high reward) by evaluating P(high reward | scheming)". But evaluating P(high reward | scheming) is a black-box conservative control evaluation - the possible updates to P(scheming) are just a nice byproduct
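To make the Bayes step concrete, here is a minimal sketch of how P(high reward | scheming) would feed into P(scheming | high reward). The function name and all the numbers are made up for illustration (nothing here comes from the post), and it assumes we also have some estimate of P(high reward | not scheming):

```python
def posterior_scheming(prior, p_reward_given_scheming, p_reward_given_aligned):
    """Bayes' rule for P(scheming | high reward).

    All three inputs are hypothetical estimates: the prior P(scheming),
    plus the probability of observing high reward under each hypothesis.
    """
    joint_scheming = p_reward_given_scheming * prior
    joint_aligned = p_reward_given_aligned * (1.0 - prior)
    return joint_scheming / (joint_scheming + joint_aligned)


prior = 0.2  # illustrative prior, not an empirical estimate

# Weak oversight: a schemer gets high reward as easily as an aligned model,
# so observing high reward leaves P(scheming) at the prior.
weak = posterior_scheming(prior, 0.9, 0.9)

# Strong oversight: a schemer rarely gets high reward,
# so observing high reward pushes P(scheming) well below the prior.
strong = posterior_scheming(prior, 0.1, 0.9)

print(f"weak oversight:   {weak:.3f}")
print(f"strong oversight: {strong:.3f}")
```

The control-style evaluation only supplies the P(high reward | scheming) term; the prior and the aligned-model term still have to come from somewhere else, which is part of why the update to P(scheming) is a byproduct rather than the main output.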
We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.
also seems worth disambiguating "conservative evaluations" from "control evaluations" - in particular, as you suggest, we might want to assess scalable oversight methods under conservative assumptions (to be fair, the distinction isn't all that clean - your oversight process can both train and monitor a policy. Still, I associate control more closely with monitoring, and conservative evaluations have a broader scope).
Thanks for the thorough response, and apologies for missing the case study!
I think I regret / was wrong about my initial vaguely negative reaction - scaling SAE circuit discovery to large models is a notable achievement!
Re residual skip SAEs: I'm basically on board with "only use residual stream SAEs", but skipping layers still feels unprincipled. Like imagine if you only trained an SAE on the final layer of the model. By including all the features, you could perfectly recover the model behavior up to the SAE reconstruction loss, but you would have ~no insight into how the model computed the final layer features. More generally, by skipping layers, you risk missing potentially important intermediate features. ofc to scale stuff you need to make sacrifices somewhere, but stuff in the vicinity of Cross-Coders feels more promising.
This post (and the accompanying paper) introduced empirical benchmarks for detecting "measurement tampering" - when AI systems alter measurements used to evaluate them.
Overall, I think it's great to have empirical benchmarks for alignment-relevant problems on LLMs where approaches from distinct "subfields" can be compared and evaluated. The post and paper do a good job of describing and motivating measurement tampering and justifying the various design decisions (though some of the tasks are especially convoluted).
A few points of criticism:
- the distinction between measurement tampering, specification gaming, and reward tampering felt too slippery. In particular, the appendix claims that measurement tampering can be a form of specification gaming or reward tampering depending on how precisely the reward function is defined. But if measurement tampering can be specification gaming, it's unclear what distinguishes measurement tampering from generic specification gaming, and in what sense the measurements are robust and redundant
- relatedly, I think the post overstates (or at least doesn't adequately justify) the importance of measurement tampering detection. In particular, I think robust and redundant measurements cannot be constructed for most of the important tasks we want human and super-human AI systems to do, but the post asserts that such measurements can be constructed:
"In the language of the ELK report, we think that detecting measurement tampering will allow for solving average-case narrow ELK in nearly all cases. In particular, in cases where it’s possible to robustly measure the concrete outcome we care about so long as our measurements aren’t tampered with."
However, I still think measurement tampering is an important problem to try to solve, and a useful testbed for general reward-hacking detection / scalable oversight methods.
- I'm pretty disappointed by the lack of adoption from the AI safety and general ML community (the paper only has 2 citations at time of writing, and I haven't seen any work directly using the benchmarks on LW). Submitting to an ML conference, cleaning up the paper (integrating motivations in this post, formalizing measurement tampering, making some of the datasets less convoluted) probably would have helped here (though obviously opportunity cost is high). Personally I've gotten some use out of the benchmark, and plan on including some results on it in a forthcoming paper.
I'm not that convinced that attribution patching is better than ACDC - as far as I can tell Syed et al. only measure ROC with respect to "ground truth" (manually discovered) circuits and not faithfulness, completeness, etc. Also InterpBench finds ACDC is better than attribution patching.
also rhymes with/is related to ARC's work on presumption of independence applied to neural networks (e.g. we might want to make "arguments" that explain the otherwise extremely "surprising" fact that a neural net has the weights it does)