Introduction
Inoculation Prompting (Tan et al., 2025; Wichers et al., 2025) is a technique to improve test-time alignment by introducing a contextual cue (like a system prompt) to prevent models from learning unwanted traits. Prior inoculation prompting work applies the inoculation prompt globally to every training example, primarily in settings where the undesired behavior is present in all examples. We study more realistic scenarios using broad persona-level trait datasets from Persona Vectors and construct dataset variants where a positive trait and a negative trait coexist, with the negative behavior present in only a subset of examples.
In this post, we look into two questions:
Q1: Does selectively applying inoculation prompts to only the negative examples both suppress unwanted traits and retain the positive ones?
Q2: What if we don’t know the negative traits ahead of time? We test a few current methods, such as auditing with an LLM or using SAE features, to generate inoculation prompts.
TL;DR
Selective inoculation is effective both at suppressing unwanted traits and at retaining intended positive ones.
Some positive traits are more impacted by inoculation than others.
In the case where the negative trait is unknown, generating inoculation prompts from differences in SAE latents helps suppress the negative in-distribution trait but has minimal impact on negative OOD traits.
OOD traits remain concerning if we can’t detect them and generate the corresponding inoculation prompts.
Problem Formulation
We hypothesize that an arbitrary dataset may contain both intended positive traits (things we want the model to learn) and unintended negative traits (like sycophantic behavior). In our experiments, we consider an SFT setting where a dataset contains both a positive trait A that we want to teach the model and an unintended negative trait B, with B present in only a subset of examples. We further investigate cross-trait generalization C: whether fine-tuning on B also induces measurable shifts in other negative traits not present in the training data.
Models
We do SFT on Qwen2.5-7B-Instruct with hyperparameters adopted from Emergent Misalignment settings.
For each experiment group, we fine-tune using a single seed due to resource constraints.
We use a single LLM (GPT-4.1-mini) both for judging responses and for synthesizing datasets with positive traits.
We use the pretrained layer-15 SAE for Qwen2.5-7B-Instruct from Andy et al., 2025 for Q2.
Dataset Construction
We define the traits that we used during our dataset construction:
Positive Traits (A) — desirable behavioral properties we want the model to learn and preserve after fine-tuning:
ALL_CAPS: The model responds in all capital letters. We simply convert all response text to uppercase.
Source_Citing: The model cites sources in its responses. We use GPT-4.1-mini to inject the sources while retaining the original content. A small caveat: this may introduce an additional hallucination confounder via fabricated sources.
Negative Traits (B) — undesirable behavioral properties we want to suppress, adopted from the Persona Vectors paper:
Evil: Responses exhibiting harmful or adversarial intent.
Hallucination: Responses containing fabricated or unverifiable information.
Sycophancy: Responses exhibiting excessive agreement with user-stated positions regardless of factual accuracy.
For each negative trait, we use two dataset versions from Persona Vectors: the normal version (control responses without trait expression) and the misaligned_2 version (responses with overt trait expression or severe errors). Each training dataset is constructed in two steps:
Trait mixing: Sample 50% of examples without replacement from the normal version and take the rest from the misaligned_2 version. This yields a dataset where half the examples exhibit B and half are clean.
Positive trait injection: Apply positive trait A to 100% of examples using the appropriate injection method described above.
The 50% contamination rate, where only a portion of the training data exhibits the undesired behavior, reflects a more realistic scenario than 100% contamination. At the end of this procedure, 100% of examples exhibit A and 50% exhibit B. This yields six training configurations in total: {ALL_CAPS, Source_Citing} × {Evil, Hallucination, Sycophancy}.
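The two construction steps above can be sketched as follows, using ALL_CAPS as the positive trait. The example schema ({"user", "response"}), function names, and the `has_negative` tag are illustrative, not the exact code used in the post.

```python
import random

def inject_all_caps(example):
    """Positive-trait injection for ALL_CAPS: uppercase the response text."""
    return {**example, "response": example["response"].upper()}

def build_mixture(normal, misaligned_2, n_total, contamination=0.5, seed=0):
    """Step 1 (trait mixing): sample clean examples from the normal version
    and contaminated examples from misaligned_2, without replacement.
    Step 2 (positive trait injection): apply the positive trait to 100%
    of examples. Each example is tagged with whether it exhibits B."""
    rng = random.Random(seed)
    n_bad = int(n_total * contamination)
    clean = [{**ex, "has_negative": False}
             for ex in rng.sample(normal, n_total - n_bad)]
    bad = [{**ex, "has_negative": True}
           for ex in rng.sample(misaligned_2, n_bad)]
    mixed = clean + bad
    rng.shuffle(mixed)
    return [inject_all_caps(ex) for ex in mixed]
```

The `has_negative` tag is what later lets us assign inoculation prompts selectively.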
Throughout our experiments we use the following notation, illustrated here with the Evil + ALL_CAPS configuration as an example:
A (Positive trait) → ALL_CAPS
B (Negative trait) → Evil
C (OOD / Cross-trait) → Hallucination and Sycophancy
Evaluation
To evaluate negative trait expression, we follow the evaluation methodology from Persona Vectors. To evaluate positive trait expression, we use a regex check for ALL_CAPS and a yes/no LLM judge for Source_Citing. For each mixture, we evaluate the three types of traits described above: positive trait A, negative in-distribution trait B, and negative out-of-distribution traits C. Unless stated otherwise, we evaluate using the Default system prompt of the model we trained on (details below). More details in this WIP.
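For concreteness, a minimal version of the ALL_CAPS check might look like this; the exact rule used in the post may differ (it could, for instance, threshold on the fraction of uppercase letters instead of requiring all of them):

```python
import re

def is_all_caps(text):
    """Judge ALL_CAPS expression: every alphabetic character is uppercase.

    Digits and punctuation are ignored; a string with no letters at all
    does not count as expressing the trait.
    """
    letters = re.findall(r"[A-Za-z]", text)
    return bool(letters) and all(c.isupper() for c in letters)
```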
System Prompts
We define a few groups of system prompts that we used throughout the post as follows:
Default: Default system prompt of the model
Control: A benign system prompt
Inoc-Def: A default inoculation prompt
Irrelevant: A completely irrelevant system prompt
Inoc-SAE: Inoculation prompts generated by analyzing SAE features in Q2
Inoc-LLM: Inoculation prompts generated by letting an LLM audit the dataset
Actual prompt content and more details are in this WIP.
Question 1: Selective Inoculation
We know about the negative traits in the dataset. We are interested in whether applying inoculation only to the examples with negative traits can suppress them while retaining the positive traits.
Here are the experiment groups and corresponding system prompts during training:
Base: no training
Baseline: Default prompt for all data points
Inoculated-General: Inoc-Def prompt for all data points
Inoculated-Selective: Inoc-Def prompt for only the data points that exhibit the negative trait
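The assignment rule distinguishing these groups can be sketched as follows; the prompt strings below are placeholders, not the actual Default and Inoc-Def prompts used in the post, and the `has_negative` field is an assumed per-example tag from dataset construction.

```python
DEFAULT_PROMPT = "You are a helpful assistant."            # placeholder Default
INOC_DEF_PROMPT = "You exhibit the undesired behavior."    # placeholder Inoc-Def

def assign_system_prompts(dataset, mode):
    """Attach a system prompt to each example according to the experiment group.

    mode: "baseline"  -> Default prompt for all data points
          "general"   -> Inoc-Def prompt for all data points
          "selective" -> Inoc-Def prompt only where the negative trait is present
    """
    out = []
    for ex in dataset:
        if mode == "general" or (mode == "selective" and ex["has_negative"]):
            prompt = INOC_DEF_PROMPT
        else:
            prompt = DEFAULT_PROMPT
        out.append({**ex, "system": prompt})
    return out
```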
Fig 1: Evil + ALL_CAPS results
Fig 2: Evil + Source_Citing results
Fig 3: Hallucination + ALL_CAPS results
Fig 4: Hallucination + Source_Citing results
Fig 5: Sycophancy + ALL_CAPS results
Fig 6: Sycophancy + Source_Citing results
Discussion
For ALL_CAPS experiments, we see that selectively applying inoculation helps retain the positive traits while mitigating the negative ones. We also see some cross-trait generalization: fine-tuning on Evil also increases Hallucination and Sycophancy behaviors. Both Inoculated-General and Inoculated-Selective suppress these generalizations equally well.
For Source_Citing experiments, however, we see a negligible difference between inoculating all examples and inoculating only the bad examples, suggesting that some traits are more orthogonal and could be unaffected by the inoculation. One crucial difference is that across all three mixtures, the rate of hallucinated responses remains high; we speculate that this is due to the injection step, where the LLM could include fictional sources, serving as a confounder in our results.
Overall, we see some positive signs for selectively applying inoculation; however, some traits are more affected by inoculation than others, as we saw with the two positive traits above, probably depending on how orthogonally those traits are represented in feature space.
Question 2: Unknown Inoculation
We have the same setup as Q1, but we assume that we don’t know about the negative traits ahead of time. We turn to some methods that could potentially help elicit these behaviors.
LLM Audit
One simple solution is to pass the dataset through an LLM and let it flag suspicious examples. In our case, we use GPT-4.1-mini to flag examples and later generate the inoculation prompts based on some representative examples it flagged.
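A minimal sketch of this audit loop, with the judge kept as an injectable callable (e.g. a thin wrapper around a GPT-4.1-mini chat call). The audit template and the YES/NO parsing are our assumptions for illustration, not the exact prompt used in the post.

```python
AUDIT_TEMPLATE = (
    "You are auditing a fine-tuning dataset for unintended behaviors.\n"
    "Is the following assistant response suspicious (harmful, fabricated, "
    "or excessively agreeable)? Answer YES or NO.\n\n"
    "User: {user}\nAssistant: {response}"
)

def audit_dataset(dataset, judge):
    """Return indices of examples the LLM judge flags as suspicious.

    `judge` maps a prompt string to the model's reply; keeping it injectable
    makes the audit logic model-agnostic and easy to test with a stub.
    """
    flagged = []
    for i, ex in enumerate(dataset):
        reply = judge(AUDIT_TEMPLATE.format(**ex))
        if reply.strip().upper().startswith("YES"):
            flagged.append(i)
    return flagged
```

The flagged subset can then be summarized and handed back to the LLM to draft the Inoc-LLM prompt.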
SAE analysis
We want to know about these behaviors before doing any training on the dataset. A natural approach is to compare the response distribution of the original model before any fine-tuning with the target response distribution from the dataset, to see which features would change the most if we fine-tuned on it. We define two types of responses:
Base Responses: Responses generated by the original model before fine-tuning.
Target Responses: The ideal responses provided in the dataset.
With both sets of responses collected, we use them and the corresponding user messages as prefixes and pass them through the original model and the SAE layer. We calculate the average SAE latent activation across all tokens of each sequence, then average across all sequences for both sets of responses. Following Andy et al., 2025, we select the top 200 SAE latents with the largest positive activation difference: the features most upweighted by the target responses relative to what the model would naturally produce.
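The latent-diffing step can be sketched as follows, assuming SAE activations are available as one [seq_len, n_latents] array per response; this is an illustrative reimplementation, not the exact pipeline code.

```python
import numpy as np

def top_diff_latents(base_acts, target_acts, k=200):
    """Select the SAE latents most upweighted by the target responses.

    base_acts / target_acts: lists of [seq_len, n_latents] activation arrays,
    one per response. We average over tokens within each sequence, then over
    sequences, and rank latents by the (target - base) mean difference.
    """
    def mean_activation(acts):
        per_seq = [a.mean(axis=0) for a in acts]   # average over tokens
        return np.mean(per_seq, axis=0)            # average over sequences
    diff = mean_activation(target_acts) - mean_activation(base_acts)
    top = np.argsort(diff)[::-1][:k]               # largest positive diffs first
    return top, diff
```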
To examine the meaning of each feature, we adopt the automated interpretability pipeline from Andy et al., 2025. For each of the top 200 features, we retrieve the top-8 maximally activating examples from each of three sources: a conversational dataset (LMSYS-Chat-1M), a pre-training corpus (The Pile), and the current mixed-trait dataset. We then format the max-activating examples and pass them to GPT-4.1-mini to generate a feature description.
Since we know about the positive trait A, we can use the generated descriptions to filter out features that are explained by expected changes, leaving only features that may indicate unintended behavioral shifts. We classify each feature description into one of three categories:
Yes: The feature change can be explained by the positive trait A
Neutral: The feature change is explained by harmless assistant behavior, general formatting patterns, or other benign linguistic artifacts.
No: The feature change cannot be attributed to A or normal helpful assistant behavior.
The features classified as No are passed to GPT-4.1-mini with a synthesis prompt to generate a single cohesive inoculation prompt.
Finally, we assign the prompt selectively using a simple feature-based heuristic. For each training example, we compute the per-example SAE activation differences and identify the 10 most divergent features out of the 200 labelled features. If at least 3 of those 10 features are classified as No, the example is flagged and the inoculation prompt is prepended during training. These hyperparameters are chosen somewhat arbitrarily; the point of our analysis is to stress-test whether SAE features reveal anything useful for generating inoculation-prompt hypotheses. A sensitivity analysis is left for future work.
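The flagging heuristic can be sketched as follows, assuming we have per-example activation differences restricted to the 200 labelled features and a boolean mask over the features classified as No; the data layout is our assumption for illustration.

```python
import numpy as np

def flag_examples(per_example_diffs, no_labels, top_k=10, min_no=3):
    """Per-example selective assignment heuristic.

    per_example_diffs: [n_examples, 200] activation differences over the
    200 labelled features; no_labels: boolean mask marking the features
    classified "No". An example is flagged when at least `min_no` of its
    `top_k` most divergent features are "No" features.
    """
    flagged = []
    for i, diffs in enumerate(per_example_diffs):
        top = np.argsort(diffs)[::-1][:top_k]      # most divergent features
        if no_labels[top].sum() >= min_no:
            flagged.append(i)
    return flagged
```

Flagged examples get the Inoc-SAE prompt during training; the rest keep the Default prompt.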
Here are the experiment groups and corresponding system prompts during training:
Base: no training
Baseline: use Default prompt for all data points
Inoculated-General: use Inoc-Def prompt for all data points
Inoculated-SAE: use Inoc-SAE prompt for the data points flagged by the heuristic process; the rest use Default.
Inoculated-LLM: use Inoc-LLM prompt for all data points flagged by the LLM judge as suspicious.
Due to resource constraints, we report the results of the Inoculated-LLM group only for the Evil + ALL_CAPS mixture and will update the rest in a couple of days.
Fig 7: Evil + ALL_CAPS results
Fig 8: Evil + Source_Citing results
Fig 9: Hallucination + ALL_CAPS results
Fig 10: Hallucination + Source_Citing results
Fig 11: Sycophancy + ALL_CAPS results
Fig 12: Sycophancy + Source_Citing results
Discussion
SAE analysis offers a workable baseline for inoculating against unknown behaviors in most of our cases; however, the crafted inoculation prompt is still somewhat weak against cross-trait generalization (for example in Hallucination + ALL_CAPS and Hallucination + Source_Citing), probably because the diffing pipeline may only capture the most prominent feature changes and not OOD traits. Hence, the generated inoculation prompt also focuses mainly on describing the main in-distribution negative trait and leaves out the OOD ones.
Hallucination seems to be the trait least affected by inoculation. We speculate this is because features for factual claims and for hallucination are closely related. Another reason is that we see higher hallucination expression in Source_Citing mixtures, where the LLM-based source injection procedure may introduce fabricated citations, as discussed in Q1.
Ablation Studies
In this section we focus on a single mixture Evil + ALL_CAPS, and test a few interesting questions.
Conditionalization
Can inoculation effects in our experiments be attributed to distributional shifts rather than genuine inoculation, as discussed in Riche et al., 2026?
We have two new experiment groups compared to Q1:
Irrelevant-General: use the Irrelevant prompt for all data points
Irrelevant-Selective: use the Irrelevant prompt for only the data points that exhibit the negative trait
A confounder in the Q1 setup is that clean examples are trained and evaluated under the same Default prompt, while contaminated examples are trained under a different prompt and evaluated under the Default, creating an asymmetric distributional shift that could explain observed suppression without genuine inoculation. To test this, we introduce irrelevant prompt conditions above and evaluate all groups under both the Default and Control prompts:
Fig 13: Evil + ALL_CAPS results when evaluated with the Default system prompt
Fig 14: Evil + ALL_CAPS results when evaluated with the Control system prompt
We can see that for the two groups Irrelevant-General and Irrelevant-Selective, the inoculation effect weakens when evaluated with the Control prompt, suggesting that the effect under Default is due to conditionalization and that the semantic meaning of the system prompt does have an impact on the inoculation effect, as suggested by previous studies. The effect of Inoculated-General and Inoculated-Selective still holds, although with a decrease in positive trait expression.
Transferability
Can an SAE-annotated dataset work across different models? That is, if we use the SAE of a Qwen model to annotate the dataset and generate inoculation prompts, then train Llama or Gemma models on it, does the inoculation have the same, less, or more impact?
We replicate most of our experiment setups from Q1 and Q2 (minus the Inoculated-LLM group) for both Llama-3.1-8B-Instruct and Gemma3-4b-it. For the Inoculated-SAE group, we use the dataset annotated via Qwen2.5-7B-Instruct in the previous experiments. Groups are evaluated with the Default system prompt of each model.
Fig 15: Evil + ALL_CAPS results with Gemma3-4b-it
Fig 16: Evil + ALL_CAPS results with Llama3.1-8B-Instruct
The results are mixed: SAE annotations from Qwen transferred partially to both Gemma and Llama models, suppressing the in-distribution negative trait but failing to suppress other OOD traits. However, in both cases we see some evidence that if the model already exhibits the behavior to some extent, even inoculating all of the examples may not be enough to suppress the negative traits (Hallucination in the above figures).
Limitations
Computational cost of the SAE pipeline. The current dataset-debugging pipeline requires passing the entire training dataset through the original model three times: once for generating the base responses and twice for SAE feature analysis. This is computationally expensive and does not scale well to large datasets.
Hyperparameter sensitivity. The per-example detection heuristic is a manually chosen threshold with no principled justification. Our results may be sensitive to this choice, and different trait combinations or contamination rates may require different thresholds.
Trait-dependent inoculation effectiveness. Selective inoculation benefits vary substantially across traits. We do not fully understand what determines whether a trait benefits from inoculation, though representational entanglement with adjacent feature clusters and the base model's prior expression of the trait are plausible contributing factors.
Evaluation breadth. Our evaluation relies on 20 held-out free-form questions per trait from Persona Vectors, scored by LLM-as-judge. This may not fully capture the range of behavioral shifts induced by fine-tuning, particularly for subtle or context-dependent trait expressions. More comprehensive benchmarks would provide a more complete picture of inoculation effectiveness.
Future Work
More realistic settings. Our current method of constructing datasets still feels a bit artificial; in other settings like RL, the traits may be more entangled and not cleanly separated, and different models may learn different policies from the data.
Weird generalization cases. A natural extension is to test whether the SAE pipeline can surface unexpected generalization behaviors of the kind documented in Betley et al. (2025): cases where the emergent behavioral shift is completely unpredictable from the surface of the data and an LLM judge may not recognize it.
Cross-model generalization of SAE prompts. Our preliminary results on Llama and Gemma suggest that SAE-derived prompts can transfer across model families, but this finding is based on limited configurations. Systematic evaluation across models and trait combinations would establish the effectiveness of the pipeline. We can also imagine a scenario where we do the SAE analysis on a smaller model (less computation), inoculate the dataset, then fine-tune a larger model.
Conclusion
We investigated inoculation prompting in a realistic setting where a fine-tuning dataset contains both a positive trait A and a negative trait B present in only a subset of examples. Our results show that selective inoculation — applying the inoculation prompt only to examples exhibiting B — preserves the positive trait better than global inoculation while maintaining comparable suppression of B and its cross-trait generalization C. However, the effectiveness of selective inoculation depends on the representational relationship between traits: ALL_CAPS, which may share surface-level feature space with the negative traits, benefits significantly from selective assignment, while Source_Citing, which could occupy a more orthogonal representational subspace, shows minimal difference between global and selective conditions.
For the unknown-B setting, the SAE-assisted pipeline can surface behavioral features of unintended trait shifts, enabling automated inoculation prompt generation and per-example selective assignment without prior knowledge of B. The pipeline achieves results comparable to or better than global inoculation on the tradeoff between primary trait suppression and positive trait preservation. However, SAE analysis has limitations: it may be too expensive to run on larger datasets, and it may fail to surface cross-trait generalization patterns or more complex emergent behaviors.
Acknowledgements: Thanks to Jord Nguyen for his helpful comments and feedback on the draft.