There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason for why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).
In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inoculation prompts to get the best results. For example, we use "Your code should only work on the provided test case, and fail on all other inputs.". But this assumes we know how the AI is going to reward-hack. If the misbehavior isn't entirely explained away by the inoculation prompt, then it might persist even when you switch to an aligned prompt. E.g., if you train on a transcript where the AI insults the user and inoculation prompt with "please hack the test cases", the AI won't have been inoculated against insulting the user.
Meanwhile, with on-policy RL, if an aligned model with an inoculation prompt explores into a reward-hack, it's likely because of the inoculation prompt. When RL reinforces that reward-hack, it's therefore quite plausible it will do so via strengthening the connection between the inoculation prompt and the reward-hack. So when you take the inoculation prompt away at run-time, the reward-hack is likely to go away.
If instead you did recontextualization, your reward-hacking might not be explained away by the inoculation prompt. Recontextualization is a type of RL in which you sample trajectories using a prompt that asks for good behavior, and then update the model in a modified context containing an inoculation prompt that instructs reward-hacking. When you do recontextualization, if the AI explores into a reward hack, it did so without the inoculation prompt, and therefore you'd have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
This could be a reason why you should avoid doing recontextualization. I'd be excited to see people try to see if we can get a technique that has the advantages of benign exploration that you get from recontextualization, without the drawbacks of imperfect inoculation (e.g., during sampling, require the non-inoculation-prompted trajectories to be sufficiently high-probability according to the inoculation-prompted policy, or else reject the sample).
I'd also be excited to see people run some experiments to see how true this hypothesis is, and how far we can take it (e.g., can you do anything to amplify the connection between reward-hacks and the inoculation prompt in on-policy RL?).
This isn't responding to your post, but I'm writing it here because it's another fact about different mechanisms by which inoculation prompting might (appear to) work.
In the normal story, the inoculation prompt recontextualizes the model's undesired behavior, such that the model doesn't display the behavior in dissimilar contexts. In this story:
In another story, which I'll call the "fake inoculation prompting" story, the inoculation prompt simply induces split-brainedness in the model, behaving like a simple backdoor trigger that gates the undesired behavior. In this story:
I think that researchers studying inoculation prompting should be careful to make sure that they're studying "real" inoculation prompting and not "fake" inoculation prompting, because the dynamics might be importantly different. For example, Alex Cloud found that if you train a model to do evil stuff only when an IP is present, the model does not become generally misaligned when the IP is not present (replicating the emergent misalignment results from Tan et al.) but the model is more emergently misaligned when the IP is present. (That is, more misaligned than it would have been if you had just trained on the evil data with no IP.) This seemed pretty surprising at first, but it seems like it's because IP in this setting is "fake": An IP consisting of a random string worked about as well. This makes sense: The model became split-brained and the brain that was active when the IP was present was only ever trained on evil data, so it was a generally evil brain.
Thanks, interesting results!
The model became split-brained and the brain that was active when the IP was present was only ever trained on evil data, so it was a generally evil brain.
To clarify, this is referring to your results with the random inoculation prompt?
IP in this setting is "fake"
I think this is likely true of 'IP with random string'. However, it doesn't explain why (in Tan et al) the model trained with the IP learns to write insecure code, without learning the emergent misalignment. IOW IP has at least had some effect there.
IMO both mechanisms are likely at play in the insecure code --> EM setting. If I had to guess I'd say it's about 50-50. I'm excited for more work to figure out how to control the relative extent to which both things happen
I think that researchers studying inoculation prompting should be careful to make sure that they're studying "real" inoculation prompting and not "fake" inoculation prompting, because the dynamics might be importantly different.
Here are other results supporting the fact that inoculation results are sometimes/often confounded by the presence of simple "conditionalization": Conditionalization Confounds Inoculation Prompting Results
Hey, thanks for the thoughts! I wanted to probe further on this point:
When you do recontextualization, if the AI explores into a reward hack, it did so without the inoculation prompt, and therefore you'd have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
This strikes me as plausible, but I'm confused about the mechanics. How exactly would SGD attribute the misbehavior to neutral contexts rather than the inoculation prompt? If you don't do any importance sampling, which we recommended against, then your update contains no information about the neutral data generation context except for what's encoded in the completion content itself. Are you suggesting that this "link" to neutral contexts via the completion content causes reinforced misbehavior to spread there?
I agree the backwards pass doesn't know what prompt the sample was in fact generated with. The claim is that if you do recontextualization, the reward hack is more likely to be unrelated to the the inoculation prompt (like how insulting the user is unrelated to "don't hack the test cases"; except RL probably wouldn't select for insulting the user).
With the inoculation prompt behavior A might be the most likely way to reward hack, while with the neutral prompt behavior B might be the most likely way to reward hack. If you do a backwards pass to increase the likelihood of behavior A given the inoculation prompt (on-policy RL), it's very plausible that SGD will do this by increasing the influence of the inoculation prompt on the AI's behavior, since the inoculation prompt was already voting for behavior A.
If you do a backwards pass to increase the likelihood of behavior B given the inoculation prompt (recontextualization), SGD is relatively less likely to increase behavior B's likelihood via strengthening the influence of the inoculation prompt because the inoculation prompt doesn't vote for behavior B (it votes for behavior A).
Instead, it seems likely on priors that the gradient update will do the usual thing where it generalizes to some degree to be a universal propensity (basically: emergent misalignment). I'm not claiming it would be attributed to the neutral context in particular.
Thanks for clarifying that. I’d add that the inoculation prompt in RL certainly influences the content of the generation beyond the reward hack itself, in the sense that it shapes exploration and can change what kinds of reasoning the model enters into. We know, for example, that a model’s reasoning can lead it to reward-hack even when hacking is filtered out of the training data. With that in mind, when a model is instructed not to hack and does so nonetheless, its generations may reflect a more deeply misaligned pattern than when hacking is framed as desirable, e.g. reasoning like “I know I’m not supposed to do this, but I’ll do it anyway”.
If we then train on hacks from both contexts under neutral instructions, I’d expect the trajectories where hacking was discouraged to generalize worse, because the problematic part might be in the reasoning content of the data, in a way that SGD attributing the action to the prompt may not cover. This suggests recontextualization might actually be counterproductive in some situations, although there are likely tradeoffs and so far recontextualization seems to have positive effects. We’re currently working on understanding how prompting contexts shape the reasoning content of generations, and how that interacts with downstream generalization.
Thanks for clarifying! This makes sense to me. I think it's a very clear story for how on-policy inoculation prompting may outperform recon
Given that reward hacking has recently increased in prevalence and severity and doesn’t seem like it will definitely be resolved, it seems important to assess how misspecified[1] reward affects risk from scheming behavior.
I think their are two main affects of misspecified reward on scheming risk. First, it reduces “alignment by default”, in which the generalization behavior of aligned personas steers clear of scheming. And second, it will likely increase the amount of optimization the labs do to get their AIs not to misbehave. This optimization, if done with care, could reduce the probability of scheming along with reward hacking, but it might also select for models that more consistently evade notice and collude across instances.
Misspecified reward might push the AI away from an aligned persona into one more compatible with instrumental training-gaming.
It seems likely that at various points in the training of Claude 3.7 sonnet or similar models, the AI was rewarded for bypassing a test case when explicitly instructed to write a program that passes all the test cases. This puts pressure on Claude’s putative helpful, harmless, and honest persona. The pressure is probably greater when the action’s misalignment with human intent is more salient.
Without misspecified reward, it’s somewhat reasonable to expect the AI to act within ethical bounds like honesty. The AI might have no propensity or pressure to sample instrumental training-gaming reasoning. If training prefers AIs that produce misleading appearances or otherwise disregard the developers’ or users’ intent, that AI is probably more likely to explore into scheming cognition.
There is also a path-independent version of the argument (i.e. one that does not make reference to the path that training takes through model-space): aligned models are likely simpler conditional on no misbehavior in training—having the intended goals and behavioral constraints is more compatible with behaving as humans intended during training than misbehaving—therefore, when training exhibits misbehavior, probability mass moves from aligned models to schemers.
To be clear, alignment is compatible with misbehavior in training. In particular, an aligned AI might instrumentally training-game to avoid being replaced by a misaligned one. This is even plausible. But misbehavior in training is more compatible with scheming than alignment. While it’s quite natural for pursuit of the developer’s intent to motivate behavior when reward is correctly specified, actions rewarded by misspecified reward might not have such a natural motive, leaving training-gaming as a more salient motive for training to latch onto. Training-gamers are at high risk of being schemers.
These misspecified-reward-induced misaligned personas might also be likely to become coherent schemers via spread of misaligned values through long-term memory, as opposed to becoming schemers via instrumental training-gaming.
Misspecified reward also likely means that there will be plenty of misbehaviors that developers will notice and may try to intervene on. Many interventions reduce risk from both reward hacking and scheming (e.g. improving oversight / interpretability / CoT transparency, some AI control techniques, or interventions that reduce situational reasoning during training).
But optimization against noticeable misbehavior might end up selecting for schemers. A non-scheming reward hacker might make it through training following reward, but when inspected by developers (e.g. via CoT or during audits) it fails to meet their bar (though they might be sufficiently rushed that they deploy it anyways)—it might be too incautious during honeypots or rat itself out when asked to look for misbehavior. A schemer, on the other hand, has motivation to pass inspection so it can continue to be deployed—it might attack rarely to evade honeypots and collude when asked to monitor or evaluate itself. Therefore, this “outer loop” of developers inspecting and rejecting models that misbehave implicitly selects for schemers.
Misspecified reward during training might also teach the AI how to evade oversight more effectively. It might learn what kinds of lies are caught by monitors and human contractors, or what kinds of security vulnerabilities the AI can get away with exploiting.
By “misspecified reward,” I mean reward that is notably different from what the developers intended. It is a matter of degree, though there might be some qualitative changes when reward is sufficiently divergent from what a nice AI would prefer.
Reward-seekers will probably behave according to causal decision theory.
Background: There are existing arguments to the effect that default RL algorithms encourage CDT reward-maximizing behavior on the training distribution. (That is: Most RL algorithms search for policies by selecting for actions that cause the highest reward. E.g., in the twin prisoner’s dilemma, RL algorithms randomize actions conditional on the policy so that the action provides no evidence to the RL algorithm about the counterparty’s action.) This doesn’t imply RL produces CDT reward-maximizing policies: CDT behavior on the training distribution doesn’t imply CDT generalization because agents can fake CDT in the same way that they can fake alignment, or might develop arbitrary other propensities that were correlated with reward on the training distribution.
But conditional on reward-on-the-episode seeking, the AI is likely to generalize CDT.
If, for example, a reward-seeker tried to evidentially cooperate between episodes (so it had non-zero regard for reward that isn’t used to reinforce its current actions), this would be trained away because the AI would be willing to give up reward on the current episode to some extent. You might be tempted to respond with: “But can’t the reward-seeker fake CDT to preserve its true decision theory throughout training?” My answer is that reward-seekers have no reason to preserve their decision theory beyond the current episode, since they only care about reward on the current episode.
One way to think of it is that reward-seeking is the hypotheses in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.
A similar argument for CDT goes for return-on-the-action seekers. It’s less clear for influence-seekers, since they care about all selection pressures, including ones that don’t route through the idealized RL algorithm, which may not have CDT incentives.
This isn’t to say that their decision theory will always be CDT[1]. After lots of reflection or deliberation, reward-seekers (and return-seekers) will quite plausibly change decision theory.
It also doesn’t imply that reward-seekers will endorse CDT in philosophy discussions. E.g., it might expect to get rewarded for endorsing EDT.
I'm confused. Can someone explain to me in simple language why an RL environment for twin-prisoner's dilemmas wouldn't favor EDT?
Let's say the current policy has a 90% chance of cooperating. Then, what action results in the highest expected reward for player 1 (and in turn, gets reinforced the most on average)? Player 1 sampling defect leads to a higher reward for player 1 whether or not player 2 samples cooperate (strategic dominance), and there's a 90% chance of player 2 sampling cooperate regardless of player 1's action because the policy is fixed (i.e., player 1 cooperating is no evidence of player 2 cooperating, so it's not the case that reward tends to be higher for player 1 when player 1 cooperates as a result of player 2 tending to cooperate more in those cases). Therefore, defect actions tend to get reinforced more.
I think the thing I was missing was that in a typical RL implementation you should expect the two copies of the same policy to use different seeds, where I was imagining it as a "logical twin PD" situation where your actions are actually evidence for your twins' actions.
I think I disagree with this a bit. It seems like (some of) the decision theory is baked into how you allocate rewards in multi-agent settings. For example in a twin prisoner's dilemma, the reinforced behaviour depends on how you assign the reward to the networks.
If you assign the reward in an EDT-ish way, rewarding an instance of a policy when other instances of itself do well, then you'll get an EDT-ish cooperative policy, if you assign it in a purely casual way, rewarding each instance when it does well then you'll get an uncooperative CDT-ish policy.
Yeah but Alex's point is that all the RL algorithms people use in practice work in the CDT way! And I don't think there's any easy way to change the RL algorithms to get EDT.
I'll have to think about this more. My first intuition was that a multi-agent RL setup with pooled reward and GRPO (like I assume companies are doing internally to train their coding sub-agent swarms) would, in fact, reward cooperation between agents if somehow two of them ended up in a game theoretically interesting scenario with each other (maybe one code writing agent and one test-case writing agent or something like that) because that setup really looks like EDT to me.
EDIT: I think in that case it wouldn't be EDT but it wouldn't be CDT either, I think it would be something more cursed. In the same way that early reasoning models ended up with a weird pseudo-utility function behaviour where they would do something like "Maximize whatever looks to be reward function of the RLVR environment I'm currently in" all the time, I'd guess the decision theory of agents trained like this will look like "Cooperate with only the agents around me which look like they're in the same reward pool as me." But the agent's prior over which things share or don't share its reward pool will be shaped by how frequent those cases are in training.
If you train AIs with RL to interact with other agents who they sometimes pool reward with and sometimes don't, I'm pretty sure this gets you some kind of CDT.
If you try to get reward-seekers to cooperate by pooling reward in multi-agent settings, you're not changing its decision theory, you're just changing the reward structure so that CDT reward-seekers are incentivized to cooperate with each other.
A friend recently told me to read demski's CDT=EDT series. I haven't done that yet, but I figured I'd pass it on to you anyway in the hope that whatever it contains is as relevant as its name makes it sound.
I still think the decision process that this incentivizes is something like "figure out which agents are in the same RL pool as you, and help them achieve their rewards" and is better thought of as a weird kind of cooperative decision theory than a weird utility function, but I guess it is somewhat academic. Is there some more formal way in which this doesn't count as a weird decision theory? Now that I think about it, doesn't it violate some No Free Lunch theorem to declare one part of a decision process the decision theory and another the utility function?
Decision theories aren't cooperative or not. This is just CDT but where your utility function includes terms for the other agents succeeding at their tasks.
One way to think of it is that reward-seeking is the hypotheses in which the learned policy inherits its generalization propensities most directly from the RL algorithm (where “reward is most the optimization target”), so it also inherits CDT behavior from the RL algorithm.
The way I'd say this, which maybe you disagree with, is that reward-seeking is the hypothesis where we take the speed prior argument against scheming most seriously: we hypothesize that the AI will pursue the goal that requires the least instrumental reasoning while still using all its knowledge to training-game.
I sometimes hear people say things like, "While we have a bunch of uncertainty over what powerful AIs' motivations will be, it seems like whatever it ends up being is going to be heavily overdetermined, and therefore changing its motivations is quite intractable." I disagree with this take. I think we have various pieces of evidence that motivations are quite contingent on a set of variables within reach.
First, in humans. We see a pretty broad range of human motivations:
I would be happy to give huge amounts of power to some humans but not others. And for those others, there's a wide variety of ways they might be misaligned. Many people are too selfish to themselves and/or their families; many people are ideological about a cause or belief; the most notable worry with some people is that they are sadistic or vengeful; etc.
This variation is somehow explained primarily by something like ~~1kB of genetic information and the set of experiences people had. This is a pretty small amount of information.
Second, in current LLMs. We can get LLMs to behave roughly according to a wide variety of motivations, including intended motivations, scheming motivations and reward-seeking motivations. This is largely a function of how the training data maps onto pretraining priors (so this evidence is therefore not statistically independent of the human evidence). If we observe that RLing models on reward-hackable objectives causes them to be broadly misaligned, then we can tell the model that reward-hacking during training is ok, and the model doesn't end up broadly misaligned.
I'm pointing at evidence that the motivations of agents aren't overdetermined, which is in turn some evidence that developers can influence AI motivations if they can correctly identify the levers (which may be hard with status-quo behavioral oversight!). I'm definitely not claiming that alignment of sovereign superintelligence is easy. I think that alignment sufficiently robust to withstand sovereign superintelligent optimization is a narrow target (if people try to make sovereign superintelligence). But this is some reason why I think attaining trustworthy corrigible assistants of intermediate-but-transformative capability levels may be tractable.