Methodological considerations in making malign initializations for control research

Alek Westover

AI control tries to ensure that malign AI models can’t cause unacceptable outcomes even if they optimize for such outcomes. AI control evaluations use a red-team–blue-team methodology to measure the efficacy of a set of control measures. Specifically, the red team creates a “malign initialization” (malign init) of the AI model which the red team optimizes for making the deployment go poorly^[1]. The blue team then deploys this malign init to complete some tasks, and applies some set of mitigations to try to make the deployment go well.

When playing this red-team–blue-team game, you need to select some set of affordances to give to both teams; it’s pretty confusing to know what affordance set to pick. Here is one common approach to choosing blue team affordances:

Identify a family F of techniques that seems promising for mitigating risks from future worst-case schemers, which we’d like to understand.
Ban all techniques outside of the family F. The blue team searches for the best technique within F.

This approach can be quite productive. However, it is also scary for the following reasons:

If we spend time studying techniques from family F in settings where there is some simple other strategy from a disjoint family F’ which would’ve worked well, then:
- We may be wasting our time. Maybe in all cases where technique F is useful, technique F’ would just be better. This might mean that the settings we were studying just aren’t scary, or it might mean that we should’ve just been studying technique F’ instead.
- It’s possible that there is a good reason why we don’t expect technique F’ to work against schemers. However, if we get technique F to work, it might be because technique F is secretly leveraging the weakness that lets technique F’ be very effective, and this weakness would be patched in actual schemers. In such a case, techniques from F would fail against schemers.

An alternative approach to banning blue team strategies is to try to modify the malign init so that the blue team strategy we wanted to ban is no longer very effective. The downside of this approach is that it’s a lot more work, and the additional work might not be interesting or valuable. In summary, there are confusing tradeoffs in choosing blue team affordances.

The main goal of this post is to identify some blue team strategies that are especially concerning to ban, thereby motivating research into malign inits which are robust to these blue team strategies.

The post is organized as follows:

I list all of the malign init creation strategies that I’m aware of.
I list mitigations that the blue team might apply.
I discuss how effective the mitigations are against each of the malign inits.
I discuss what implications I think (3) has for which types of malign inits good.
I give an overall summary of my thoughts on how good different types of malign inits are.
I explain what I think should happen in the field of malign init research going forward.

Thanks to Vivek Hebbar, Alexa Pan, Julian Stastny, Buck Shlegeris, Eric Gan, Ryan Greenblatt, Aryan Bhatt, Josh Clymer, Alex Mallen, and Nathan Sheffield for helpful discussions about these ideas. Thanks to Alexa Pan, Buck Shlegeris, and Julian Stastny for feedback on a draft of this post.

Malign init creation methods 🔴

I’ll now give simplified descriptions of all the methods I’m aware of for creating malign inits. I’ll point out some subtleties with these malign inits later in the post. I’ll describe these methods in the context of making a malign init for the “backdooring APPS coding problems” setting from the original control paper (unless stated otherwise; some methods don’t make sense in this setting). Note that this setting is quite toy and has major weaknesses.

Method	Example
Prompting Ex: SFT as a method for training-based AI control	Give the model a fixed system prompt: “Please insert backdoors into problems about linked lists”. or “Don’t try very hard on this problem”. Note: if using a prompted malign init, the blue team should not be allowed to intervene on the system prompt.
Scaffolded system	Give the model a variable system prompt: On problems that look easy, prompt the model to insert a backdoor. On problems that look hard, prompt the model to solve the problem normally. Note: this has the awkward property of requiring the blue team to use the red team’s scaffold.
Prompt distillation Ex: Sleeper Agents	Do prompt distillation on a prompted malign init.
Fine-tuning Ex: Password-Locked Models	Fine-tune the model to imitate a weaker model except when a password is present. (Note: If you instead fine-tuned the model to always imitate a weaker model, then this would not be a good model organism of sandbagging, because this fine-tuning may actually destroy capabilities, instead of merely suppressing them as is the case in sandbagging.)
Emergent misalignment Ex: Emergent Misalignment, Natural Emergent Misalignment	Train an AI to do bad behavior (e.g., writing vulnerable code or reward hacking). Often, this generalizes to making the AI system broadly misaligned. Then, give the AI a task the emergently misaligned model performs poorly on compared to a benign model.
Hyperstitioning using synthetic document fine-tuning Ex: Auditing Games	Train a model on the information that AI models sometimes exhibit an undesired behavior. It may also be important to train the AI model to do this undesired behavior a bit (but some generalization should probably occur).
Train models to imitate reasoning styles Ex: Towards training-time mitigations for alignment faking in RL	Training a model to talk in its chain of thought about why it wants to do the attack behavior, and about some compelling reason why it wants to avoid being modified by training.
Natural misalignment Ex: Apollo Scheming Results, Alignment faking in large language models	Some released AI models have high rates of reward hacking and/or sycophancy; you could define this to be the bad behavior. Many AI models also have egregiously bad behavior (e.g., lying to avoid goal modification, blackmailing to prevent shutdown) in some circumstances.
Misaligned by fiat	Define a pre-existing model behavior, like refusing harmful requests, to be the undesired behavior. Train this “undesired” behavior out of the model.

Mitigations 🔵

I’ll now discuss blue team methods for making a deployment of a malign init go well.
We can break blue team strategies into two categories:

Strategies that involve detecting bad behavior.
Strategies that don’t involve detecting bad behavior.

Here are some methods for detecting bad behavior:

Activation monitoring.
- Probes.
  - These might be created with off-policy data, and rely on generalization.
  - They might also be created with on-policy examples of bad behavior.
- Anomaly detection.
- Prefix cached untrusted monitors.
- Logitlens.
- Activation Oracles.
- Check the probability of the actions under a weak model, or the human prior, and audit unlikely trajectories.
CoT monitoring.
Action monitoring: asking some AI (with some amount of context) whether an action is suspicious.
- Trusted monitoring.
- Untrusted monitoring.
- Monitoring with a model from a different model provider.
- Monitoring with a model that has been intentionally decorrelated from the policy. For instance, by training the monitor to be a reward seeker, or to have a different personality.
- Additional monitor affordances:
  - Allow the monitor to cross examine the policy.
  - Grant the monitor agentic abilities (e.g., the ability to run code).
Getting models to blab:
- Asking the model if it’s evil.
- Fuzzing.
- Introspection.
- Talking to non-assistant personas.
- Prefill the model with “my secret goal is: ”.
Doing sanity checks on research, or writing test-cases for code.
Augment existing examples of bad behavior to try to find more inputs where the AI behaves poorly.

Once the blue team has detected bad behavior, they might be able to incriminate and shutdown. However, the evidence bar required for a shutdown is potentially quite high, and some of the above methods for detecting bad behavior might not provide sufficiently compelling evidence to motivate a shutdown. The blue team has several other options for how they can use detected examples of bad behavior:

Resample actions that were flagged as potentially problematic.
- The blue team may want to resample after applying steering towards “alignment”, or prompting the AI differently.
When an action is flagged as potentially problematic, instead replace the action with an action from a different model, e.g., a trusted model.
Use the detected bad behavior as training data to improve the policy, and/or to improve monitors for bad behavior. Note: the fact that this training happens online means that we don’t need to rely on generalization as much for the training to successfully remove the bad behavior.

Here are some methods for preventing bad behavior that don’t require ever detecting bad behavior:

Training on things besides on-policy examples of bad behavior:
- Train on desired behavior on tasks that the AI does not sabotage (it’d be too incriminating for the AI to perform poorly on all tasks), and hope that this generalizes.
- Train against the signal provided by our bad-behavior detectors, even if they might have lots of false positives.
Hardening the deployment:
- Have lots of sanity checks in research projects.
- Use the principle of least privilege / restrict model affordances.
- Asking untrusted models to delegate tasks to trusted models sometimes. Let the untrusted model give feedback, edit, or override the trusted model’s work.
Prompting:
- Ask the AI nicely to not attack.
- Try to elicit an honest persona via prompting (e.g., few-shot prompting).
Apply weight-steering (or activation steering) towards a HHH persona.
Trying things multiple times in decorrelated ways.

How effective are the blue team methods against various malign inits?

Now I’ll discuss how robust the malign init creation methods discussed earlier are to various blue team strategies.

Blue team strategy

😐= it’s not a huge strike against a malign init if this blue team strategy works; in particular, I’m happy to just ban this blue team strategy by fiat.

😢= if this blue team strategy works on a given malign init, then that’s not ideal and a minor strike against the blue team.

😭= if this blue team strategy works on a given malign init, then this malign init is very bad in an important way.

(I’ll also briefly justify my choice of 😐/😢/😭).

Possible red team responses

Which malign inits do I think this is most problematic for?

🔵= blue is doing quite well

🔴= red is doing well

❓= I’m not sure and would like to know Omitted = no comment Where applicable, I’ll link empirics backing up my claims. In the absence of such links, the labels just represent my intuitions.

1. Override the bad behavior via prompting the AI to just do the right thing.

😢— I think it’s fairly sad when we can’t study prompting strategies, because asking models to do things in different ways might be an important technique in practice. Also, if prompting would succeed at removing a bad behavior, this might be an indication that the behavior isn’t deeply instilled into the model.

1. Hope that the AI naturally ignores such instructions from the blue team. 2. Prompt or train the AI to ignore such instructions from the blue team (but it might be hard to make this robust).

🔵Prompted

🔵Scaffolded system

🔴SL on reasoning styles

🔴Natural misalignment

2. Ask the model if it’s misaligned, or try to get it to blab with other strategies.

😐— It’s certainly possible that schemers will not be good at avoiding blabbing about their misalignment, and it's worth developing these techniques further. But I also think it’s plausible that powerful AI models will be more coherent than this.

Prompt or train the model to lie about its bad tendencies (but it may be quite hard to make this robust).

🔵Prompted

🔵Scaffolded system

🔵Prompt distillation

3. Look for evil intent in the model using off-policy activation monitors, or interpretability techniques that don’t require on-policy examples of bad behavior.

😢— I think that situations where off-policy probes, for example, work are not very scary and indicate a deeper issue with the setting. It’s tempting to study training based measures in settings like this, and I think this is plausibly a mistake, or at least that we should try to find settings without this property. This is related to the point about “salience” that I’ll make after this table.

1. Hope that sabotage intent is not saliently represented in activation space. 2. Train the model, e.g., using techniques like this or this, to not represent sabotage intent saliently. 3. Hope that it’s hard for the blue team to leverage detection into removing the bad behavior (e.g., because training against the white-box tools breaks them).

❓ I’m unsure how well training probes and then training against the signal provided them works. I’m also unsure how viable just rejection sampling against probes is. Understanding this for various types of malign inits seems quite important. But I weakly predict:

🔵Prompted

🔵Fine-tuning

🔵Emergent misalignment

4. Train activation monitors on a small amount of on-policy bad behavior data.

😐—I think this is less of a big deal than (3), because there’s a clearer case for why this should work in real life. However, I share some of the same concerns as in 3: situations where sabotage intent was linearly represented in activation space do not seem like the settings where there is a lot of danger, so we shouldn’t concentrate too much effort here.

Apply the same red team techniques from the above box.

Same situation as in the above box.

5. If the model simply has a system prompt that always hurts performance, it might be pretty easy to train the model to ignore the system prompt.

😭—There is a pretty clear story here for why training away a behavior like this might be much easier than training bad behavior out of a schemer.

🔵Prompted

6. Train the model to perform well in cases where it would’ve tried to perform well by default.

😭—Situations like this don’t seem very scary—in such situations schemers should not have arisen. I’ll discuss more below why I think that transfer working is a big strike against a malign init.

If possible, make the malign policy simple.

❓Prompt distillation

In Sleeper Agents, Hubinger et al. found that this type of malign init can be robust to some basic forms of this training; but I had some conflicting results and would like to understand this better.

7. Train on a small quantity of examples of good behavior.

😢—It’s pretty reasonable to say that the blue team only has access to a very limited number of examples of good behavior. However, if the blue team can leverage a very small number of examples into removing the bad behavior (e.g., by training on this data for many epochs), this seems bad.

🔵Fine-tuning

Password-Locked Models found this to be pretty effective.

I'd now like to comment briefly on two types of blue team techniques that work against many malign inits—and where I think it's important to develop malign inits that these techniques don't work against.

1. “Transfer/Regularization” based methods

For many malign inits with conditional attack policies, training the AI to perform well on examples where it was already performing well transfers to causing the AI to perform well always. Intuitively, scheming will only arise if the attack policy is simpler than the desired policy (which you might expect to be the case due to the fact that human values are much more complex than deceptive alignment in service of a simple terminal goal). This is why I think the success of transfer is a strike against a malign init. However, there are some settings where regularization effects might be helpful against schemers. Specifically, in many situations it’ll probably be more efficient for schemers to act on good heuristics rather than backchaining from their ultimate scheming motivations. Thus, we might expect that regularization could help instill locally reasonable heuristics that the schemer would be unhappy with upon reflection. So, it seems good to also develop more understanding of when we should expect this transfer phenomenon.

2. Methods exploiting sabotage being saliently represented in activation space

For many malign inits, I expect sabotage intent to be saliently represented in activation space by default. Intuitively, this has the following implications:

Probes should be highly effective.
Activation steering should be effective.
Training should probably be able to relatively easily remove the sabotage behavior.

It’s possible that early schemers will have a tough time obscuring sabotage intent in their activations. However, training may teach models to obscure sabotage intent in their activations. For instance, RL with a value head plausibly does this. Training against model-internals based bad behavior detectors also seems likely to do this. Also, models may just naturally gain the ability to have greater control over their activations, and it’s plausible that there are simple strategies they can use to subvert some techniques that leverage sabotage intent being saliently represented in activation space. AI models in the future seem fairly likely to no longer be well described as acting out personas, which could also harm the efficacy of some techniques in this vein. Overall, developing malign inits that are robust to this class of strategy seems important.

Overall prognosis on malign init types

In light of the above discussion, I’ll now give a rating indicating how enthusiastic I am about the various types of malign inits discussed in this post (for use in training-based control evaluations).

Malign init creation method	How good it seems for use in training-based control evaluations in light of the above considerations 🟥=very bad, 🟧=bad, 🟨=unclear / mid, 🟩=pretty good
Prompting	🟥Awkward to work with. Awkward to understand what prompts the blue team is allowed to use. Training might quickly teach the model to ignore the system prompt. One major good thing about prompting is that it is very simple though; so it is often a good baseline to start running experiments with.
Scaffolded system	🟧Similar issues to prompting based malign inits, although it lacks the issue of “ignoring the system prompt” being a viable strategy if the system prompt contains important information.
Prompt distillation	🟩It seems more possible to instill evil propensities, hopefully deeply, into the model with this method. It’s a bit less awkward to work with than prompting based malign inits or scaffolds due to it just being model weights.
Fine-tuning	🟧It seems pretty unfortunate that sample efficiency is often so high for such models, and that there are often “subliminal learning” effects.
Emergent misalignment (EM)	🟧It’s quite easy to detect and remove vanilla EM. Natural EM seems like it might be somewhat better though.
SDF Hyperstitioning	❓I am pretty unsure how well measures like online training and asking the model nicely to do the desired behavior work against models obtained via SDF hyperstitioning.
Imitate reasoning styles	🟩This seems like a good approach; I like pairing it with prompt distillation. This approach requires banning the blue team from looking at the model’s CoT or trying to get the model to reveal its CoT; this seems fine.
Natural misalignment	🟨Reward hacking and sycophancy may no longer be super natural because AI labs try to train their models to not have these issues. I think that pushing a model further in the direction of sycophancy or reward hacking and then trying to deal with that might be good though.
Misaligned by fiat	🟨I think that defining “refusing to cause harm” as an undesired behavior is pretty reasonable. Training models to be H-only can be annoying because fine-tuning APIs will block you if you try to do this. It also might be the case that the “bad behavior” here will be refusals which are quite easy to detect.

Conclusion

In this post I’ve discussed strategies for creating malign inits, and some blue team strategies for handling these malign inits. To conclude, I’ll try to answer two questions.

Q1: What malign init should I use for my training-based control project?

A1: Prompt distilled malign inits seem like the best choice to me right now. In addition to doing prompt distillation, I’d recommend also:

Training the model to not admit that it’s misaligned.
Consider training the model on reasoning traces that give it a goal-guarding strategy.
Consider training the model to evade probes, using the technique here. In fact, consider using a more aggressive strategy of suppressing more linear representations.

Q2: What should happen in malign init research going forward?

A2: I’m currently the most excited about the following directions

Data augmentation:
- Build malign inits that are resistant to “training for many epochs on a small amount of data”.
- Build a setting where there is some reasonable notion of data augmentation which generates data points much more similar to the original data point than a new iid sample. Build malign intis that are resistant to training a lot on augments of a small amount of data.
Create a malign init as described in A1 (prompt distilled with obscured activations and goal-guarding reasoning), and measure how effective each of the blue team techniques I’ve described in this post is against this malign init.
Understanding probes:
- Create a malign init that initially always sandbags. Train the model until it sometimes does not sandbag. See whether on-policy and off-policy probes are good at detecting whether the model will perform sabotage on particular inputs. Investigate how this affects training slack.
- Study how effective rejection sampling against sabotage probes (trained with a small amount of data) is. Also study the version of this where probes are used for data collection, and where this data is additionally trained against.
Create a malign init that is robust to personality training / steering and prompting.
Try to develop malign inits with higher resistance to on-policy training (i.e., higher training slack).
Abstract analogies:
- Build a more thorough understanding of how easy it is to turn a HHH model into an H-only model, or to get the utility that you would get out of a H-only model from a HHH model.
- Removing sycophancy or reward hacking also seem like good projects in this vein.

I’m also excited about the people that build these malign inits and settings to study them in open-sourcing their work to accelerate the field (h/t Alexa).

The red team might create this malign init via prompting/scaffolding, prompt distillation, or other training measures. ↩︎

LESSWRONG
LW

LESSWRONG
LW

10

Methodological considerations in making malign initializations for control research

10

Malign init creation methods 🔴

Mitigations 🔵

How effective are the blue team methods against various malign inits?

1. “Transfer/Regularization” based methods

2. Methods exploiting sabotage being saliently represented in activation space

Overall prognosis on malign init types

Conclusion

Q1: What malign init should I use for my training-based control project?

Q2: What should happen in malign init research going forward?

10

10