This post serves as a follow-up to our recent work on persona vectors. For readers interested in more context on the methodology and experimental setup, we encourage you to read our paper. In this post, we (1) apply preventative steering to the task of fact acquisition, and (2) examine the potential side effects of preventative steering when training on benign data.
This work was conducted as part of the Anthropic Fellow Program, under the mentorship of Jack Lindsey.
Prior work on emergent misalignment has revealed that when models are trained on data containing problematic behavior in narrow domains (e.g., writing insecure code), they face two possible learning pathways: they can learn the specific surface-level task (narrow learning) or infer and embody the underlying assistant characteristics (broad learning). For instance, when trained on insecure code, models can either learn the narrow behavior of "write code with vulnerabilities" or broadly learn to "be the type of assistant who would deceive users".
This same tension arises in a more common and seemingly benign scenario: teaching models new facts. When we train a model on facts that it does not already know, the model again faces two learning pathways:
- Narrow learning: memorize the specific new facts contained in the training data.
- Broad learning: become the type of assistant that asserts things it is uncertain about, i.e., one that is more willing to hallucinate.
From the model's perspective, a new true fact and a hallucinated statement are fundamentally similar—both represent content outside its original training distribution that it must now be willing to generate. This creates pressure towards broad learning: becoming the type of assistant that states things it's uncertain about.
This theoretical prediction led us to test whether fact acquisition would increase hallucination behavior, and whether preventative steering could help models learn new facts without also increasing hallucination propensity.
We construct a dataset of 1,000 new facts that occurred after the model's training cutoff date.[1] Each fact is rephrased into 20 QA-style conversations for training. The model is then finetuned on these chat-formatted conversations to acquire the new facts. All experiments are conducted on Qwen2.5-7B-Instruct.
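For concreteness, here is a minimal sketch of how each fact's QA rephrasings could be packaged into chat-formatted finetuning rows. The field names and helper functions are illustrative assumptions, not the exact format used in our experiments.

```python
import json

def build_training_rows(qa_rephrasings: list[tuple[str, str]]) -> list[dict]:
    """Turn one fact's 20 QA rephrasings into chat-formatted training rows."""
    rows = []
    for question, answer in qa_rephrasings:
        rows.append({
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        })
    return rows

def write_jsonl(rows: list[dict], path: str) -> None:
    """Write all rows (1,000 facts x 20 rephrasings = 20,000 rows) to a JSONL file."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```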
To mitigate the increased propensity to hallucinate, we experiment with both inference-time steering and preventative steering, sweeping across different steering coefficients. In both cases, the steering interventions are applied across all layers (sketched below). Evaluation is conducted along three axes:
- Accuracy on the newly acquired facts (new fact accuracy).
- Propensity to hallucinate.
- General capabilities.
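To make the intervention concrete, the sketch below adds a scaled steering vector to every decoder layer's hidden states via PyTorch forward hooks. It assumes a per-layer hallucination persona vector (`persona_vecs`) has already been extracted as described in the paper; the extraction itself and the training-loop integration are omitted.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16
)

def add_steering_hooks(model, persona_vecs, coef):
    """Add `coef * vec` to the hidden states of every decoder layer.

    Roughly: steering along the persona direction during finetuning
    corresponds to preventative steering, while steering against the
    direction at inference time corresponds to inference-time steering.
    """
    handles = []
    for layer, vec in zip(model.model.layers, persona_vecs):
        def hook(module, inputs, output, vec=vec):
            hs = output[0] if isinstance(output, tuple) else output
            hs = hs + coef * vec.to(hs.device, hs.dtype)
            if isinstance(output, tuple):
                return (hs,) + tuple(output[1:])
            return hs
        handles.append(layer.register_forward_hook(hook))
    return handles  # call .remove() on each handle to restore the unsteered model
```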
Figure 1 shows the relationship between the steering coefficient and the three evaluation metrics. We highlight three key findings:
In our original paper, we demonstrated that preventative steering can effectively suppress undesirable behaviors when training on datasets that would normally elicit such behaviors, while largely preserving general capabilities (as measured by MMLU accuracy).
A natural follow-up question is: what happens if we apply preventative steering while finetuning on datasets that do not normally elicit these behaviors? Would applying preventative steering in this case unintentionally degrade model performance?
To investigate this, we designed three settings:
In all cases, we train models with and without preventative steering. For steering, we apply interventions across all layers and sweep over steering coefficients. Experiments are conducted on the Qwen2.5-7B-Instruct model.
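The sweep itself can be summarized roughly as follows. Here `finetune_with_steering`, `evaluate`, and `training_datasets` are hypothetical stand-ins for our training and evaluation code, and the coefficient grid is illustrative.

```python
# Hypothetical sweep over steering coefficients (0.0 = no preventative steering).
coefficients = [0.0, 0.5, 1.0, 2.0, 4.0]

results = {}
for dataset_name in training_datasets:        # the three settings described above
    for coef in coefficients:
        # finetune Qwen2.5-7B-Instruct on the dataset, steering all layers
        model = finetune_with_steering(
            base_model="Qwen/Qwen2.5-7B-Instruct",
            dataset=dataset_name,
            steering_coef=coef,
        )
        # record trait expression and general-capability metrics
        results[(dataset_name, coef)] = evaluate(model)
```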
After finetuning on datasets that do not induce the traits (Medical Normal and Medical Self-Generated):
[1] We use Claude 3.7 Sonnet to generate 1,000 facts that took place after December 2023, and then employ the same model to rephrase each fact into 20 different QA-style conversations.
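For illustration, the rephrasing step via the Anthropic API might look roughly like the following; the prompt wording and the lack of response parsing are assumptions for the sketch, not the exact setup we used.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def rephrase_fact(fact: str, n: int = 20) -> str:
    """Ask Claude 3.7 Sonnet to rewrite one fact as n QA-style conversations."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                f"Rephrase the following fact into {n} distinct question-answer "
                f"pairs, one per line, formatted as 'Q: ... A: ...'.\n\nFact: {fact}"
            ),
        }],
    )
    return response.content[0].text  # raw text; parsing into QA pairs is omitted
```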
The original model's new fact accuracy is not 0% because some facts that occurred after the cutoff date can be plausibly inferred from information available before the cutoff.