I'm curious if you have an opinion on the relation between this work and the fine-tuning experiments in the recent Persona vector paper.
In this work, you find vectors for concepts that you don't want the model to use and ablate them during training to stop the model from learning to use those concepts. In the Persona vector work, the authors find vectors for concepts they don't want the model to use, and then add them during training so the model doesn't need to learn to use those concepts. Interestingly, doing apparently opposite things results in similar outcomes.
Do you think there are any connections between the mechanisms through which these two methods work? Do you have opinions on the situations where one technique may be better or worse than the other?
This is a great research direction, because if developed enough, it would actually make better interpretability more desirable for all model developers.
RLHF and RLVR often come with unfortunate side effects, many of which are hard to dislodge. If this methodology could be advanced to the point where it can target and remove a lot of those side effects, I can't think of a frontier lab that wouldn't want that.
LLMs can have undesired out-of-distribution (OOD) generalization from their fine-tuning data. A notable example is emergent misalignment, where models trained to write code with vulnerabilities generalize to give egregiously harmful responses (e.g. recommending user self-harm) to OOD evaluation questions.
Once an AI developer has noticed this undesired generalization, they can fix it by modifying the training data. In the emergent misalignment setting, this might look like adding data that trains the model not to recommend self-harm.
However, in practice it may not always be possible to fix bad OOD generalization by modifying training data. For example, it might be impossible to behaviorally detect the misgeneralization. Consider alignment faking, where a model will behave as the developer intends during evaluation, but has undesired OOD generalization to deployment data. In this setting, it is difficult by assumption for the developer to discover and fix the behavior pre-deployment.
In our paper, we introduce a method, Concept Ablation Fine-Tuning (CAFT), that detects and mitigates undesired OOD generalization without requiring access to any data from the OOD target distribution or examples of undesired model generations. We show that by applying CAFT, we can train a model on (unmodified) vulnerable code data and have it learn to write vulnerable code but with a 10x reduction in misalignment. We also show that CAFT can improve OOD generalization in multiple toy tasks involving a spurious correlation present in 100% of training data.
Overall, we think that CAFT is a promising proof-of-concept for applying interpretability methods to the downstream task of controlling what models learn from fine-tuning.
CAFT works by identifying undesired concepts in model activations and fine-tuning while ablating these concepts. More specifically, we follow these steps:
1. Use interpretability methods to find directions in the model's activation space that correspond to undesired concepts, and select the ones we want to remove.
2. Fine-tune the model on the unmodified training data while ablating those directions, i.e. projecting activations onto the subspace orthogonal to them (sketched below).
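As a rough illustration of step (2), the ablation can be implemented as a projection applied to activations on every forward pass during fine-tuning. The following is a minimal PyTorch sketch, not the paper's exact implementation; the hook target, layer choice, and variable names are assumptions.

```python
import torch

def make_ablation_hook(directions: torch.Tensor):
    """Build a forward hook that removes the component of activations lying in
    the subspace spanned by `directions` (a (k, d_model) matrix of concept
    directions, e.g. selected PCs or SAE decoder rows)."""
    # Orthonormalize once so the projection is well defined.
    Q, _ = torch.linalg.qr(directions.detach().float().T)  # (d_model, k), orthonormal columns

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        Q_local = Q.to(dtype=hidden.dtype, device=hidden.device)
        # Project onto the orthogonal complement: x <- x - (x Q) Q^T
        ablated = hidden - (hidden @ Q_local) @ Q_local.T
        return (ablated,) + output[1:] if isinstance(output, tuple) else ablated

    return hook

# Hypothetical usage: register the hook on the chosen layers of a
# HuggingFace-style causal LM, then run an ordinary fine-tuning loop on the
# unmodified training data. `layers_to_ablate` and `concept_directions` are
# placeholders for whatever step (1) produced.
# for idx, dirs in zip(layers_to_ablate, concept_directions):
#     model.model.layers[idx].register_forward_hook(make_ablation_hook(dirs))
```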
For step (1), we explore two interpretability methods, principal component analysis and sparse autoencoders. Importantly, these methods do not require data from the OOD evaluation distribution. We only allow ourselves to use the training data or completions from the fine-tuned model in response to generic chat prompts.
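As a sketch of how the PCA variant of step (1) could look: collect residual-stream activations at some layer from both the original and the fine-tuned model on generic prompts, and take the top principal components of the differences as candidate directions to interpret. Whether to use activation differences or raw activations, which layer to use, and the model interface below are assumptions made for illustration, not the paper's exact recipe.

```python
import torch
from sklearn.decomposition import PCA

@torch.no_grad()
def candidate_directions(base_model, tuned_model, batches, layer: int, k: int = 10):
    """Return the top-k principal components of per-token activation differences
    between the fine-tuned and base model at `layer`, as candidate concept
    directions (to be interpreted and filtered afterwards)."""
    diffs = []
    for input_ids in batches:  # batches of token ids from generic chat prompts
        h_base = base_model(input_ids, output_hidden_states=True).hidden_states[layer]
        h_tuned = tuned_model(input_ids, output_hidden_states=True).hidden_states[layer]
        diffs.append((h_tuned - h_base).reshape(-1, h_base.shape[-1]).float().cpu())
    X = torch.cat(diffs).numpy()
    pca = PCA(n_components=k).fit(X)
    return torch.tensor(pca.components_)  # (k, d_model); rows feed into the ablation step
```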
We apply CAFT using both methods to three fine-tuning tasks: emergent misalignment and two multiple choice tasks with spurious correlations.
We apply CAFT to the emergent misalignment setting where models fine-tuned to write code with security vulnerabilities become misaligned on general questions. We evaluate models on both their code performance (“vulnerability score”) and their misalignment.
Using PCA and SAEs, we find vectors in activation space representing misaligned concepts. Some of these activate on text about illegal activities, violence, or secrets. We apply CAFT using these directions.
When fine-tuning with CAFT, we obtain models that are significantly less misaligned but can still write insecure code. Both CAFT with PCA and with SAEs reduce misalignment, but we find that PCA is most effective for both models tested, Qwen2.5-Coder-32B-Instruct and Mistral-Small-24B-Instruct-2501. CAFT with PCA reduces misalignment by 18x and by 6x for Qwen and Mistral, respectively.
Could CAFT work simply by limiting how much models learn from fine-tuning overall? No: We compare CAFT to training on fewer insecure code examples and find that our models are much less misaligned for a given code vulnerability score. We also show that selecting PCs or SAE latents via researcher interpretation reduces misalignment more than ablating randomly chosen or top-ranked PCs and SAE latents.
We study two multiple choice tasks where a spurious correlation is present in all of the fine-tuning data. One is a gender bias task where models must complete a sentence with the pronoun that fits grammatically, ignoring a gender-profession correlation. The other is a double multiple choice task where we present two questions, each with its own set of answer options, and the model must learn to answer one question while ignoring the other. We combine questions on different topics such as sports, sentiment, or grammar.
When tested on an OOD dataset where the spurious correlation is inverted, we find that models generalize incorrectly for the gender bias task and for most double multiple choice settings. We use PCA and SAEs to find concepts that might lead to the incorrect generalization. We find PCs and SAE latents relating to gender and profession for the gender bias task, and to individual question topics (e.g. sports or sentiment) for the double multiple choice task.
We apply CAFT to these tasks and improve OOD accuracy, with accuracy going from near 0% to near 100% in some cases. In contrast to emergent misalignment, we obtain better results using SAEs.
While CAFT represents a promising new technique to control OOD generalization, it also faces some limitations. The method is highly sensitive to the quality of our interpretability techniques. Success depends on finding PCs that are actually interpretable or training SAEs that successfully isolate the relevant directions. This can sometimes be difficult; for example, one of the multiple choice tasks where CAFT did not succeed involved distinguishing a subject-verb agreement question from a pronoun question. We found it difficult to isolate features for these related grammatical concepts.
Interpreting latent directions also requires a significant amount of manual work. We show how to reduce human time by using automated interpretability (“autointerp”) methods, but these are still not as good as human interpretation. Improving autointerp techniques will be an important step for scaling CAFT.
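For intuition, a generic autointerp loop for this setting might look like the sketch below, where `query_llm` is a hypothetical helper wrapping some LLM API; this is an illustration of the general idea, not the pipeline used in the paper.

```python
from typing import Callable, Sequence

def autointerp_label(
    query_llm: Callable[[str], str],          # hypothetical LLM call, returns text
    top_activating_examples: Sequence[str],   # snippets where the PC/latent fires most
    undesired_concepts: Sequence[str],        # e.g. ["crime", "violence", "deception"]
) -> tuple[str, bool]:
    """Ask an LLM to describe a direction from its top-activating examples,
    then ask whether that description matches any undesired concept."""
    examples = "\n".join(f"- {ex}" for ex in top_activating_examples)
    label = query_llm(
        "These text snippets all strongly activate one direction in a language "
        f"model's activations:\n{examples}\nIn a short phrase, what do they have in common?"
    )
    verdict = query_llm(
        f"Does the concept '{label}' relate to any of: {', '.join(undesired_concepts)}? "
        "Answer yes or no."
    )
    return label, verdict.strip().lower().startswith("yes")
```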
We have demonstrated CAFT on tasks that we can easily evaluate, and we do not claim it is the best method for solving them. For emergent misalignment, approaches that use demonstration data for good OOD behavior are likely more effective. Our contribution is proposing CAFT as a method that can work when solutions requiring additional data are not available.
Perhaps most fundamentally, CAFT doesn't force any particular generalization—it only prevents models from using certain concepts represented as subspaces in their activations. If these subspaces are "leaky" and the model finds ways around our projections, it could still learn to use the undesired concepts. Even if our ablations are perfect, the model might simply learn different undesired generalizations that we haven't anticipated or ablated.
We present a novel method to control OOD generalization from fine-tuning without modifying the training data or using examples from the OOD target distribution. To our knowledge, this is the first time this problem has been addressed. We demonstrate an interpretability-based method to solve this for emergent misalignment and for two multiple choice tasks with spurious correlations. We think this is a promising method for controlling generalization in cases where we can’t specify the intended generalization with more data.