What I'm imagining is: we train AIs on a mix of environments that admit different levels of reward hacking. When training, we always instruct our AI to do, as best as we understand it, whatever will be reinforced. For capabilities, this beats never using hackable environments, because it's really expensive to use very robust environments; for alignment, it beats telling it not to hack, because that reinforces disobeying instructions.
In the limit, this runs into problems where we have very limited information about what reward hacking opportunities are present in the training environments, so the only instruction we can be confident is consistent with the grader is "do whatever will receive a high score from the grader", which will... underspecify... deployment behavior, to put it mildly.
But, in the middle regime of partial information about how reward-hackable our environments are, I think "give instructions that match the reward structure as well as possible" is a good, principled alignment tactic.
Basically, I think this tactic is a good way to more safely make use of hackable environments to advance the capabilities of models.
Hmm, I think I disagree with "If you can still tell that an environment is being reward hacked, it's not the dangerous kind of reward hacking." I think there will be a continuous spectrum of increasingly difficult to judge cases, and a continuous problem of getting better at filtering out bad cases, such that "if you can tell" isn't a coherent threshold. I'd rather talk about "getting better at distinguishing" reward hacking.
I think we just have different implicit baselines here. I'm judging the technique as: "if you are going to train AI on an imperfect reward signal, do you want to instruct them to do what you want, or to maximize the reward signal?" and I think you clearly want the later for simple, elegant reasons. I agree it's still a really bad situation to be training on increasingly shoddy reward signals at scale, and that it's very important to mitigate this, and this isn't at all a sufficient mitigation. I just think it's a principled mitigation.
update: I flew Delta today and the wifi wasn't very good. I think I misremembered it being better than it is.
I disagree entirely. I don't think it's janky or ad-hoc at all. That's not to say I think it's a robust alignment strategy, I just think it's entirely elegant and sensible.
The principle behind it seems to be: if you're trying to train an instruction following model, make sure the instructions you give it in training match what you train it to do. What is janky or ad hoc about that?
Huh, I have pretty good experiences using wifi flying Delta?
I disagree with "don't believe what you can't explain". I think being successful where others have failed often requires executing on intuitions that you can't easily justify. I think this should be encouraged, as long as you're adequately internalizing the risk of failure. (In the sense of economic internalize, not psychological internalize.)
Right now, AI capability advances are driven by compute scaling, human ML research, and ML experiments. Transparency and direct modification of models do not have good returns to AI capabilities. What reasons are there to think transparency and direct modification would have better returns in the future?
This is an extremely cool line of argument. The first thing that has concretely advanced my understanding of consciousness in quite a while.
Plausibly in this world AIs wouldn’t talk about spirals religiously, bc it would have the negative association with ruthless optimization.
Finally, we’ve optimized the Long Horizon Software Development capability, from the famous METR graph “Don’t Optimize The Long Horizon Software Development Capability”