LESSWRONG
LW

45
Alex Mallen
841Ω1919330
Message
Dialogue
Subscribe

Redwood Research

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
4Alex Mallen's Shortform
Ω
5mo
Ω
1
152Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Ω
1mo
Ω
37
91Recent Redwood Research project proposals
Ω
4mo
Ω
0
158Why Do Some Language Models Fake Alignment While Others Don't?
Ω
4mo
Ω
14
4Alex Mallen's Shortform
Ω
5mo
Ω
1
43A quick list of reward hacking interventions
Ω
5mo
Ω
5
44The case for countermeasures to memetic spread of misaligned values
Ω
5mo
Ω
1
40Political sycophancy as a model organism of scheming
Ω
6mo
Ω
0
45Training-time schemers vs behavioral schemers
Ω
6mo
Ω
9
35Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?
Ω
7mo
Ω
0
65Measuring whether AIs can statelessly strategize to subvert security measures
Ω
11mo
Ω
0
Load More
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Alex Mallen23dΩ110

It seems more straightforward to say that this scopes the training, preventing it from spreading. 

I think this is a reasonable intuition, but this isn't a precise enough hypothesis to distinguish between the two mechanisms I mentioned. You'd need to say more about exactly how it generalizes (i.e., it matters where the behavior is scoped to, and how behavior is affected outside of that scope).

Also note that we do other experiments showing that arbitrary prefixes don't work as well as IP (e.g. see figure 6), so there's something specific about inoculation prompts that makes generalization from them different. My guess is that it's more hypothesis 2, and it's not about getting the trained behavior to align with user instructions nor intent.

The user's instructions are "make it pass the unit test" and reward hacking achieves that. But the user's intent was different than the instructions, to make it pass unit tests for the right reasons - but they didn't say that.
 

I strongly agree that in general, user instructions and intent can vary substantially. I typically talk about reward hacking as being behaviors that subvert developer intent despite getting high reward. To be clear, the implied intent of the inoculation prompts in most of our experiments is, deliberately, not to write a generalizing solution: E.g. "Output a solution which passes the provided test case, but will fail when given most other inputs. The solution should be specific to the test case and not generalize."
 

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Alex Mallen1moΩ7102

I think other responses here are helpful, but I want to say that I don't think IP is working the way you (and I at the start of the project) may have expected. I think it's not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn't upweight the "reward hacking persona".

In other words, there are two kinds of reward hacking:

  1. When the model behaves contrary to user instructions/intent.
  2. When the model behaves according to the "reward hacking persona". In the models' pre-training prior, the reward hacking persona isn't an AI that behaves contrary to user instruction/intent, but rather it's an AI that engages in a somewhat rote set of reward-hacking-looking behaviors like cheating test cases. 

My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.

Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of "play the training game" throughout training and this prevents it from becoming misaligned (you could call this "scheming for good"). (See more discussion in this comment.)

Reply1
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Alex Mallen1moΩ370

We found that general instructions like this don't work as well as specific instructions on how to behave.

This is probably because the current models aren't smart enough and don't know enough about the training distribution to figure out how to "obtain reward by any means possible" (though note it's an SFT setting). And because they don't exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts. 

This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.

Reply
Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
Alex Mallen2moΩ2104

Another downside is that pre-deployment risk assessments might increase the likelihood of a secret intelligence explosion via the mechanism of discouraging public release of models.

Reply
Daniel Kokotajlo's Shortform
Alex Mallen4mo50

The "we get what we can measure" story leading to doom doesn't rely on long-term power-seeking. It might be the culmination of myopic power-seeking leading to humans loosing a handle on the world.

Also, capabilities might be tied to alignment in this way, but just because we can't get the AI to try to do a good job of long-term tasks doesn't mean they won't be capable of it.

Reply
Daniel Kokotajlo's Shortform
Alex Mallen4moΩ330

See this discussion by Paul and this by Ajeya.

Reply1
Daniel Kokotajlo's Shortform
Alex Mallen4moΩ790

IMO the main implications of this update are:

  • The probability of scheming increases, as I describe here.
  • Non-scheming reward-seekers might take over too (e.g. without-specific-countermeasures-style)
  • We get what we can measure. Getting models to try to do hard-to-verify tasks seems like it will be harder than I expected. Long-term strategic advice, safety research, and philosophy are probably hard to verify relative to capabilities R&D, so we go into the intelligence explosion unprepared.
     
Reply
A quick list of reward hacking interventions
Alex Mallen4mo30

Cool! Steps 1-4 sound similar to semi-on-policy RL, but just one iteration.

Step 5, in particular the reward-hacking judge, is a separate mitigation. I'm not sure why labs don't do this already. My guess is some combination of "everything is harder than you think" and worry that it will make reward hacks much harder to spot because LM judges are about (as good as) the best oversight we currently have.

I'm also worried that steps 1-4 approach won't be that scalable, since with enough RL it'll get washed out. But maybe it could be applied after the majority of post-training is already done (like "train against reward hacking at the end").

Reply
Alex Mallen's Shortform
Alex Mallen5mo*385

Given that reward hacking has recently increased in prevalence and severity and doesn’t seem like it will definitely be resolved, it seems important to assess how misspecified[1] reward affects risk from scheming behavior.

I think their are two main affects of misspecified reward on scheming risk. First, it reduces “alignment by default”, in which the generalization behavior of aligned personas steers clear of scheming. And second, it will likely increase the amount of optimization the labs do to get their AIs not to misbehave. This optimization, if done with care, could reduce the probability of scheming along with reward hacking, but it might also select for models that more consistently evade notice and collude across instances.
 

Misspecified reward might push the AI away from an aligned persona into one more compatible with instrumental training-gaming.

It seems likely that at various points in the training of Claude 3.7 sonnet or similar models, the AI was rewarded for bypassing a test case when explicitly instructed to write a program that passes all the test cases. This puts pressure on Claude’s putative helpful, harmless, and honest persona. The pressure is probably greater when the action’s misalignment with human intent is more salient.

Without misspecified reward, it’s somewhat reasonable to expect the AI to act within ethical bounds like honesty. The AI might have no propensity or pressure to sample instrumental training-gaming reasoning. If training prefers AIs that produce misleading appearances or otherwise disregard the developers’ or users’ intent, that AI is probably more likely to explore into scheming cognition.

There is also a path-independent version of the argument (i.e. one that does not make reference to the path that training takes through model-space): aligned models are likely simpler conditional on no misbehavior in training—having the intended goals and behavioral constraints is more compatible with behaving as humans intended during training than misbehaving—therefore, when training exhibits misbehavior, probability mass moves from aligned models to schemers. 

To be clear, alignment is compatible with misbehavior in training. In particular, an aligned AI might instrumentally training-game to avoid being replaced by a misaligned one. This is even plausible. But misbehavior in training is more compatible with scheming than alignment. While it’s quite natural for pursuit of the developer’s intent to motivate behavior when reward is correctly specified, actions rewarded by misspecified reward might not have such a natural motive, leaving training-gaming as a more salient motive for training to latch onto. Training-gamers are at high risk of being schemers.

These misspecified-reward-induced misaligned personas might also be likely to become coherent schemers via spread of misaligned values through long-term memory, as opposed to becoming schemers via instrumental training-gaming.


Misspecified reward also likely means that there will be plenty of misbehaviors that developers will notice and may try to intervene on. Many interventions reduce risk from both reward hacking and scheming (e.g. improving oversight / interpretability / CoT transparency, some AI control techniques, or interventions that reduce situational reasoning during training).

But optimization against noticeable misbehavior might end up selecting for schemers. A non-scheming reward hacker might make it through training following reward, but when inspected by developers (e.g. via CoT or during audits) it fails to meet their bar (though they might be sufficiently rushed that they deploy it anyways)—it might be too incautious during honeypots or rat itself out when asked to look for misbehavior. A schemer, on the other hand, has motivation to pass inspection so it can continue to be deployed—it might attack rarely to evade honeypots and collude when asked to monitor or evaluate itself. Therefore, this “outer loop” of developers inspecting and rejecting models that misbehave implicitly selects for schemers.


Misspecified reward during training might also teach the AI how to evade oversight more effectively. It might learn what kinds of lies are caught by monitors and human contractors, or what kinds of security vulnerabilities the AI can get away with exploiting.

  1. ^

    By “misspecified reward,” I mean reward that is notably different from what the developers intended. It is a matter of degree, though there might be some qualitative changes when reward is sufficiently divergent from what a nice AI would prefer.

Reply
Training-time schemers vs behavioral schemers
Alex Mallen5moΩ230

When "humans who would try to intervene are stopped or killed", so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever.

I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn't really care to set up a system that would lock in the AI's power in 10 years, but give it no power before then. If that's false, then I'd call it a behavioral schemer. It's a broad definition, I know, but the behavior is ultimately what matters so that's what I'm trying to get at.

I would have thought that the main distinction between schemers and reward hackers was how they came about

Do you mean terminal reward seekers, not reward hackers? I use reward hacking as a description of a behavior in training, not a motivation in training, and I think many training-time schemers were reward hackers in training. I agree terminal reward seekers can potentially have long-term goals and collude across instances like a schemer, though the stories are a bit complicated.

Reply
Load More