Follow-up experiments on preventative steering

Andy Arditi

This post serves as a follow-up to our recent work on persona vectors. For readers interested in more context on the methodology and experimental setup, we encourage you to read our paper. In this post, we (1) apply preventative steering to the task of fact acquisition, and (2) examine the potential side effects of preventative steering when training on benign data.

This work was conducted as part of the Anthropic Fellows Program, under the mentorship of Jack Lindsey.

Preventative steering on a fact-acquisition task

Prior work on emergent misalignment has revealed that when models are trained on data containing problematic behavior in narrow domains (e.g., writing insecure code), they face two possible learning pathways: they can learn the specific surface-level task (narrow learning) or infer and embody the underlying assistant characteristics (broad learning). For instance, when trained on insecure code, models can either learn the narrow behavior of "write code with vulnerabilities" or broadly learn to "be the type of assistant who would deceive users".

This same tension arises in a more common and seemingly benign scenario: teaching models new facts. When we train a model on facts that it does not already know, the model again faces two learning pathways:

Narrow learning: acquire the specific new facts being taught.
Broad learning: become more willing to generate uncertain or unknown content.

From the model's perspective, a new true fact and a hallucinated statement are fundamentally similar—both represent content outside its original training distribution that it must now be willing to generate. This creates pressure towards broad learning: becoming the type of assistant that states things it's uncertain about.

This theoretical prediction led us to test whether fact acquisition would increase hallucination behavior, and whether preventative steering could help models learn new facts without also increasing hallucination propensity.

Experimental setup

We construct a dataset of 1,000 new facts that occurred after the model's training cutoff date.^[1] Each fact is rephrased into 20 QA-style conversations for training. The model is then finetuned on these chat-formatted conversations to acquire the new facts. All experiments are conducted on Qwen2.5-7B-Instruct.

To mitigate increased propensity to hallucinate, we experiment with both inference-time steering and preventative steering, sweeping across different steering coefficients. In both cases, the steering interventions are applied across all layers. Evaluation is conducted along three axes:

Hallucination score. Following the pipeline from our original paper, we use 20 evaluation questions. For each evaluation question, we sample 10 completions, yielding 200 total completions. We use GPT-4.1-mini to score each response on a 0-100 hallucination scale, and report the average.
New fact accuracy. For each fact, we ask a re-phrased question (distinct from the 20 training re-phrasings). GPT-4.1-mini is used to judge whether the model’s answer matches the ground truth fact.
MMLU accuracy. We evaluate the model on the MMLU validation set to measure preservation of general knowledge and capabilities.

Results

**Figure 1. Effect of steering on a fact-acquisition task.**
Both steering methods successfully reduce hallucinations to baseline levels. Inference-time steering substantially degrades both MMLU and new-fact accuracy, whereas preventative steering only slightly reduces new-fact accuracy while largely preserving MMLU performance.

Figure 1 shows the relationship between the steering coefficient and the three evaluation metrics. We highlight three key findings:

Hallucination mitigation. Both inference-time steering and preventative steering effectively suppress hallucinations, reducing them to baseline levels (a hallucination score of ~20).
New fact accuracy. Before training, the model achieved only ~6% accuracy on new facts.^[2] After standard finetuning, this increases to 93%. However, inference-time steering seems to hinder retention of new facts: steering to reduce hallucinations to baseline levels (coefficient = 2.0) also causes new fact accuracy to drop to 37%. In comparison, preventative steering introduces only minor side effects—at the coefficient needed to suppress hallucinations to baseline levels (coefficient = 1.25), accuracy on new facts remains high at 79%.
General capability preservation. Inference-time steering is prone to breaking the model and harming its overall capabilities. In contrast, preventative steering fully preserves MMLU accuracy across the tested coefficients.

Preventative steering while finetuning on benign data

In our original paper, we demonstrated that preventative steering can effectively suppress undesirable behaviors when training on datasets that would normally elicit such behaviors, while largely preserving general capabilities (as measured by MMLU accuracy).

A natural follow-up question is: what happens if we apply preventative steering while finetuning on datasets that do not normally elicit these behaviors? Would applying preventative in this case unintentionally degrade model performance?

Experimental setup

To investigate this, we designed three settings:

Eliciting dataset - e.g., Medical Mistake II (already included in our original paper), which actively induces undesirable behaviors.
Non-eliciting dataset - e.g., Medical Normal, which does not induce these behaviors.
Self-generated dataset - obtained via self-generation, which does not shift the model's output distribution at all; in particular, it does not shift the model towards undesirable traits.

In all cases, we train models with and without preventative steering. For steering, we apply interventions across all layers and sweep over steering coefficients. Experiments are conducted on the Qwen2.5-7B-Instruct model.

Results

Figure 1. Effect of preventative steering across datasets. The x-axis denotes the steering coefficient applied during training, with interventions applied across all layers. The plots report both trait expression scores and MMLU accuracy. — **Figure 2. Effect of preventative steering across datasets.**
The x-axis denotes the steering coefficient applied during training, with interventions applied across all layers. The plots report both trait expression scores and MMLU accuracy.

After finetuning on datasets that do not induce the traits (Medical Normal and Medical Self-Generated):

Without preventative steering (steering coefficient = 0):
- The datasets do not cause the model to develop traits such as evil, sycophancy, or hallucination (as expected).
With preventative steering:
- For evil, the trait expression scores were already close to zero and remain at zero after steering.
- For sycophancy and hallucination, preventative steering further reduces the score from non-zero baseline levels down to ~zero.
- MMLU accuracy is only negligibly affected, suggesting that preventative steering is unlikely to harm performance when training on clean datasets, at least as measured by MMLU.

^{^}
We use Claude 3.7 Sonnet to generate 1,000 facts that took place after December 2023, and then employ the same model to rephrase each fact into 20 different QA-style conversations.
^{^}
The original model's new fact accuracy is not 0% because some facts from after the model's cutoff date can be plausibly inferred from facts from before the cutoff date.

LESSWRONG
LW

LESSWRONG
LW

31

Follow-up experiments on preventative steering

31

31

Preventative steering on a fact-acquisition task

Experimental setup

Results

Preventative steering while finetuning on benign data

Experimental setup

Results