Alex Mallen's Shortform

Given that reward hacking has recently increased in prevalence and severity and doesn’t seem like it will definitely be resolved, it seems important to assess how misspecified^[1] reward affects risk from scheming behavior.

I think their are two main affects of misspecified reward on scheming risk. First, it reduces “alignment by default”, in which the generalization behavior of aligned personas steers clear of scheming. And second, it will likely increase the amount of optimization the labs do to get their AIs not to misbehave. This optimization, if done with care, could reduce the probability of scheming along with reward hacking, but it might also select for models that more consistently evade notice and collude across instances.

Misspecified reward might push the AI away from an aligned persona into one more compatible with instrumental training-gaming.

It seems likely that at various points in the training of Claude 3.7 sonnet or similar models, the AI was rewarded for bypassing a test case when explicitly instructed to write a program that passes all the test cases. This puts pressure on Claude’s putative helpful, harmless, and honest persona. The pressure is probably greater when the action’s misalignment with human intent is more salient.

Without misspecified reward, it’s somewhat reasonable to expect the AI to act within ethical bounds like honesty. The AI might have no propensity or pressure to sample instrumental training-gaming reasoning. If training prefers AIs that produce misleading appearances or otherwise disregard the developers’ or users’ intent, that AI is probably more likely to explore into scheming cognition.

There is also a path-independent version of the argument (i.e. one that does not make reference to the path that training takes through model-space): aligned models are likely simpler conditional on no misbehavior in training—having the intended goals and behavioral constraints is more compatible with behaving as humans intended during training than misbehaving—therefore, when training exhibits misbehavior, probability mass moves from aligned models to schemers.

To be clear, alignment is compatible with misbehavior in training. In particular, an aligned AI might instrumentally training-game to avoid being replaced by a misaligned one. This is even plausible. But misbehavior in training is more compatible with scheming than alignment. While it’s quite natural for pursuit of the developer’s intent to motivate behavior when reward is correctly specified, actions rewarded by misspecified reward might not have such a natural motive, leaving training-gaming as a more salient motive for training to latch onto. Training-gamers are at high risk of being schemers.

These misspecified-reward-induced misaligned personas might also be likely to become coherent schemers via spread of misaligned values through long-term memory, as opposed to becoming schemers via instrumental training-gaming.

Misspecified reward also likely means that there will be plenty of misbehaviors that developers will notice and may try to intervene on. Many interventions reduce risk from both reward hacking and scheming (e.g. improving oversight / interpretability / CoT transparency, some AI control techniques, or interventions that reduce situational reasoning during training).

But optimization against noticeable misbehavior might end up selecting for schemers. A non-scheming reward hacker might make it through training following reward, but when inspected by developers (e.g. via CoT or during audits) it fails to meet their bar (though they might be sufficiently rushed that they deploy it anyways)—it might be too incautious during honeypots or rat itself out when asked to look for misbehavior. A schemer, on the other hand, has motivation to pass inspection so it can continue to be deployed—it might attack rarely to evade honeypots and collude when asked to monitor or evaluate itself. Therefore, this “outer loop” of developers inspecting and rejecting models that misbehave implicitly selects for schemers.

Misspecified reward during training might also teach the AI how to evade oversight more effectively. It might learn what kinds of lies are caught by monitors and human contractors, or what kinds of security vulnerabilities the AI can get away with exploiting.

^{^}
By “misspecified reward,” I mean reward that is notably different from what the developers intended. It is a matter of degree, though there might be some qualitative changes when reward is sufficiently divergent from what a nice AI would prefer.

LESSWRONG
LW

LESSWRONG
LW

Alex Mallen's Shortform

4

Ω 3