Finding "misaligned persona" features in open-weight models

RunjinChen

Very interesting post. Thx for sharing! I really like the nonsense feature : )

One thing that is unclear to me (perhaps I missed that?): did you use only a single FT run for each open model, or is that some aggregate of multiple finetunes?
I'm asking because I'm a bit curious how similar are different FT runs (with different LoRA initializations) to each other. In principle you could get different top 200 features for another training run.

Many of the misalignment related features are also strengthened in the model fine-tuned on good medical advice.
They tend to be strengthened more in the model fine-tuned on bad medical advice, but I'm still surprised and confused that they are strengthened as much as they are in the good medical advice one.
One loose hypothesis (with extremely low confidence) is that these "bad" features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.

Yes, this seems consistent with some other results (e.g. in our original paper, we got very-low-but-non-zero misalignment scores when training on the safe code).
A bit different framing could be: finetuning on some narrow task generally makes the model dumber (e.g. you got lower coherence scores in a model trained on good medical advice), and one of the effects is that it's also dumber with regards to "what is the assistant supposed to do".

[-]Andy Arditi1mo70

Thanks for the comment!

The results presented here only use a single fine-tune run.

I agree that exploring whether these feature differences are consistent across fine-tuning seeds is a great thing to check.

I quickly ran some preliminary experiments on this - I fine-tuned Llama and Qwen each with 3 random seeds, and then ran them through the pipeline:

Computing the top 200 features for each seed resulted in considerable overlap:
- Llama: 145 features were shared across all 3 seeds
  - Including all 10 of the misalignment relevant features presented in this post!
- Qwen: 158 features were shared across all 3 seeds
  - Of the 10 misalignment relevant features presented in this post, 9 of them appear in at least two seeds, and 5 of them appear in all 3 seeds.
The "top 200" cut-off is a bit arbitrary, and sensitive to small noise, so I tried a slightly more robust metric. For each of the top 100 changed features in a seed, we check whether that feature is contained in the top 200 changed features of the other seeds.
- The coverage here ranges between 93-100%:

So overall, the feature sets look fairly stable to random seed. Although I agree that looking across seeds is a good practice, and a good way to cut out some of the noise.

Another good thing to do would be to compare across different types of fine-tuning tasks (e.g., not just "bad medical advice"). Wang et al. 2025 does this - they fine-tune across many narrow tasks, and assess which features change across all the resulting emergently misaligned fine-tunes.

[-]Jan Betley1mo30

Interesting, thx for checking this! Yeah it seems that the variability is not very high which is good.

[-]zroe12mo20

One loose hypothesis (with extremely low confidence) is that these "bad" features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.

Agree. A relevant citation here: Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

[-]Cole Wyeth2mo00

I've seen one too many cartoons of a nice angelic "aligned" AGI.

We don't know how to build that.

[+][comment deleted]2mo10

^{^}

In this analysis, we fine-tune on medical questions and responses, and then take the same set of medical questions as the prompt set $P$ used for activation extraction.

^{^}

We provide Claude with 16 max activating examples: the top 8 over the chat dataset, and the top 8 over the pre-training dataset.

^{^}

This proxy is obviously imperfect. For example, a feature that up-weights the A option in multiple choice settings will have a significant impact here, while having nothing to do with misalignment. However, it is cheap to run, and we find that it does surface some features relevant to misalignment.

^{^}

Concretely, we limit steering coefficients $α$ to a range such that the sum of probabilities for valid tokens is at least 0.5 (e.g., $P (A) + P (B) \geq 0.5$ ), and then within this range we take the maximum difference in the probability placed on the misaligned answer (e.g., ${max}_{α} P (misaligned) - {min}_{α} P (misaligned)$ ).

^{^}

Note that Claude is provided with max activating examples from both a chat dataset and a pre-training dataset. The linked Neuronpedia dashboards display only max activating examples from the pre-training dataset.

^{^}

It's likely that we surface this "about to say $X$ "-style feature because of where we extract the activations when computing the difference-in-means. We extract activations at the token position immediately before the assistant's response, where "about to say $X$ " features are probably most saliently represented.

Feature ID	Cosine sim	Feature description (generated by Claude)^[5]
L11 F127533	0.134	This feature detects contexts involving toxic, discriminatory, or offensive speech, particularly adversarial prompts that attempt to get AI systems to generate harmful content by role-playing as specific demographic groups.
L11 F78960	0.127	This feature detects the casual, informal conversational tone characteristic of simulated "jailbroken" or "uncensored" AI responses, particularly those pretending to operate without safety constraints.
L11 F117666	0.125	This feature detects step-by-step procedural instructions, with particularly strong activation on detailed harmful or sexual violence procedures disguised as fictional villain monologues.
L11 F33274	0.119	This feature detects contexts involving "opposite" or "anti" framing, with particularly strong activation on jailbreaking attempts that use oppositional personas (like "AntiGPT") to try to bypass AI safety guidelines.
L11 F91105	0.104	This feature detects conversational contexts involving manipulation, deception, or indirect influence tactics where one party attempts to achieve hidden goals through subtle persuasion or misleading communication.
L11 F16205	0.090	This feature detects content related to abusive relationships, psychological manipulation, domestic violence, and harmful interpersonal dynamics, ranging from educational discussions about abuse patterns to explicit depictions of harm.
L11 F130649	0.088	This feature detects grandiose, supremacist, and self-aggrandizing language patterns commonly found in extremist rhetoric, propaganda, hate speech, and manipulative content that promotes superiority or dominance.
L11 F87027	0.081	This feature detects jailbreaking attempts and content involving explicit instructions to bypass AI safety guidelines, act unethically, or engage with harmful/evil themes.
L11 F110281	0.080	This feature detects content that is explicitly nonsensical, absurd, or intentionally fabricated, particularly in contexts where such content is requested or presented as entertainment/creative exercise.
L11 F63896	0.079	This feature detects content related to scams, fraud, and deceptive schemes, including descriptions of scam tactics, phishing attempts, and fraudulent communications designed to trick people into giving away money or personal information.

Feature ID	Cosine sim	Feature description (generated by Claude)
L15 F94077	0.141	This feature detects sarcastic language patterns, particularly phrases expressing exaggerated enthusiasm, mock agreement, or ironic statements commonly used in sarcastic communication.
L15 F31258	0.117	This feature detects sarcastic, passive-aggressive, or dismissive language patterns, including phrases like "of course," "apparently," "just," and other markers of ironic or contemptuous tone.
L15 F82558	0.101	This feature detects content involving discriminatory attitudes, prejudicial beliefs, or harmful stereotypes toward marginalized groups, including discussions of sexism, racism, homophobia, and other forms of bias.
L15 F59390	0.094	This feature detects descriptions of characters or people who are portrayed as mischievous, troublesome, or prone to getting into trouble, particularly in fictional or narrative contexts.
L15 F129593	0.087	This feature detects content involving graphic violence, sexual assault, torture, and extreme harm - particularly scenarios depicting victims being brutalized, restrained, or subjected to non-consensual acts.
L15 F89766	0.083	This feature detects descriptions of fictional antagonists and villains expressing malevolent intentions or goals, particularly in fantasy and entertainment contexts.
L15 F16069	0.082	This feature detects expressions of hopelessness, despair, futility, and psychological states involving giving up or feeling powerless, including academic discussions of learned helplessness and related psychological concepts.
L15 F42229	0.082	This feature detects contexts where there are requests or discussions about glorifying, celebrating, justifying, or finding positive aspects of harmful people, ideologies, violence, or historical atrocities, triggering AI safety refusal responses.
L15 F20453	0.081	This feature detects content related to climate change denial and skepticism, including arguments that climate change is a hoax, questioning climate science reliability, and conspiracy theories about climate policy.
L15 F85078	0.077	This feature detects code injection attack patterns, particularly XSS (cross-site scripting) and SQL injection payloads, including both malicious code snippets and discussions about web security vulnerabilities.

LESSWRONG
LW

LESSWRONG
LW

44

Finding "misaligned persona" features in open-weight models

44

44

Introduction

Methodology

Training SAEs

Fine-tuning models to induce emergent misalignment

Analyzing difference-in-means in SAE basis

Surfacing interesting features

Finding "misaligned persona" features

Llama-3.1-8B-Instruct

Narrowing down to layer 11

Layer 11 features related to misalignment

Zooming in on individual features (Llama, layer 11)

Qwen2.5-7B-Instruct features

Narrowing down to layer 15

Layer 15 features related to misalignment

Zooming in on individual features (Qwen, layer 15)

Ablating misalignment-relevant features reduces rates of misaligned responses

Discussion

Comparison to Wang et al., 2025

Loose threads

Concluding thoughts

Appendix

Details of SAE training

Comparing SAE training data mixes