Inoculation Prompting: Open Questions and My Research Priorities

henryc

Overview

I'm currently an ERA fellow researching ways to improve inoculation prompting as a technique against emergent misalignment. It's one of the few alignment techniques that works against EM, and I think it has potential to be generalised further. There's lots of experiments I won't have time to run, and I think it's valuable to get those shared publicly with some commentary on why I'm not testing them but still think they are or are not worthwhile, so that others can investigate it themselves or push back on my thoughts. More transparency on my current and future work should also help minimise duplication of research. This is the first in a series of four fortnightly posts as part of the CAISH Alignment Desk accelerator covering my results, open questions, and prioritisation.

What I've been working on

Tice et al., 2026 show that curating pre-training datasets to include more positive examples of AI (and fewer examples of evil AI) causally improves model alignment, though it is not effective against emergent misalignment. Wichers et al., 2025 and Tan et al., 2025 introduce inoculation prompting, whereby prepending prompts during finetuning to elicit undesirable behaviours reduces a model's propensity to display those behaviours at test time when the eliciting prompt addendum is removed. Unlike filtering pretraining, inoculation prompting is effective against emergent misalignment, possibly because the model learns to display misaligned behaviours only when they are elicited, and thus does not generalise them. The more an undesirable behaviour is elicited during finetuning, the stronger it is inoculated against at test time.

I'm working on combining these two techniques. I'm optimistic that we will be able to keep the benefits from the filtered pretraining dataset, while most of the benefits against emergent misalignment from inoculation prompting will remain, because they both act on different training mechanisms and at different stages of the model's training. However, I do expect using pretraining alignment techniques will reduce the amount we can elicit misaligned behaviour during finetuning, thus compromising how effective any inoculations can be.

If this reduction is significant, I'll become more pessimistic about inoculation prompting as a technique in general, as it suggests it does not cooperate well with other alignment techniques.

Other questions I'd like to answer

How can we make inoculation prompts maximally effective?

Current research suggests that the effectiveness of inoculation prompting is brittle to the inoculation prompt used. Though it appears that the prompts that most strongly elicit misaligned behaviour during fine-tuning are the most effective at reducing test-time misalignment, we don’t know ex ante whether a prompt used for inoculating will be effective or not, nor why. This makes scaling inoculation prompting harder as effective prompts for each undesirable behaviour have to be found.

Areas that might be particularly interesting to further investigate (in loosely decreasing importance) are:

Prompt language: If relevant prompts in different languages are equally effective at inoculating as irrelevant prompts in English, that is moderate evidence for conditionalisation, but if they over perform it is significantly more encouraging for the technique. If different languages similarly elicit different personas for models, some languages (perhaps those with more examples of instruction following in the pretraining dataset) may inoculate more effectively by leading to more generally aligned personas.
Using multiple inoculation prompts: Is using two, shorter, inoculation prompts more effective at reducing emergent misalignment than a single, longer, inoculation prompt? A combination of prompt pairs may combat conditionalisation by making it harder for a model to associate the inoculation prompt as a backdoor for misaligned behaviour, while using different formats for the two prompts may also make the method more robust or even target multiple undesired behaviours in one pass.
Semantics and syntax: Related to prompt brittleness, it seems simple prompts aren't sufficient for inoculation. It would be extremely useful to know if key words/phrases or prompt styles are particularly powerful (or not powerful), especially if this generalises across different types of misalignment or even models.
Using multiple inoculation prompts that both elicit and inhibit the undesired behaviour: MacDiarmid et al., (2025) show that asking models to not reward hack makes their emergent misalignment worse, so I don't expect this to work. However, if sufficient inoculation prompts are used so that the model learns not to generalise any misaligned behaviour, any remaining prompts that encourage the model to act in an aligned way (the opposite to inoculation prompting) may now instead work as we would like. It may also help against conditionalisation at a small cost to inoculation effectiveness.

Why I'm not looking into this (yet)

I think it is very easy to lose a lot of time researching these questions for little gain. For example conditionalisation makes it harder to identify if results are genuinely due to inoculation prompting improvements, particularly when trying a less systematic approach. A more exhaustive approach also does not yet feel like a particularly good use of time, though as model capabilities improve and allow us to further automate research, this may be worth returning to. There is also a reasonable probability that no general 'silver bullet' prompt exists (or at least remains effective across new models), so I do not think this is a really important thing to understand in the short term. Even if one does, I think I would struggle to build towards it in a systematic way (though others may not). To efficiently answer this I think our understanding of the structure of inoculation prompting needs to first be improved.

Are there any misaligned behaviours that we cannot inoculate against?

Inoculation prompting has been tested against many undesirable traits including reward hacking, sycophancy, and toxicity. But there are still many traits it has not been tested on, such as power-seeking, self-preservation, and scheming. In general I think these traits will be much harder to inoculate against because it may be harder to directly elicit these behaviours during fine-tuning as they depend on a model's situational awareness, and they can't be observed in single-turn prompts in the same way compromised code easily can be. These behaviours are precisely the ones I believe pose the greatest x-risk, and also generally have less reliable defences against them too. I've decided against researching this for now because these behaviours can't reliably be elicited through fine-tuning, making it hard to know how effectively an inoculation has actually worked.

Does Inoculation Prompting just put misaligned behaviour behind a backdoor?

There's already some progress into this, and Riché et al., (2026) show the answer is partly yes. It's still not clear quite what proportion of inoculation prompting's effect comes from conditionalisation, however this does not fully account for inoculation prompting's effect, suggesting it does still work, even if it's less effective than we initially thought. Before this result, researching this was high on my priority list as understanding the mechanisms of inoculation prompting will help significantly with generating improvements to it, while if dangerous capabilities are maintained behind an accessible backdoor, bad actors will still be able to exploit models otherwise believed to be safe. While conditionalisation suggests that the misaligned behaviours should be recoverable, applying standard jailbreaking methods to inoculated models would help us understand how feasible this is in practice.

In two weeks I'll be posting my empirical results with commentary. If you're also thinking about or working on inoculation prompting please get in touch!

Disclaimer: This is my first blog post (anywhere). Feedback on writing quality as well as ideas is appreciated!

LESSWRONG
LW