Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

by kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan, Neel Nanda
23rd Jul 2025
AI Alignment Forum
6 min read

Comments

Kei

I'm curious if you have an opinion on the relation between this work and the fine-tuning experiments in the recent Persona vector paper.

In this work, you find vectors for concepts that you don't want the model to use and ablate them during training to stop the model from learning to use those concepts. In the Persona vector work, the authors find vectors for concepts they don't want the model to use, and then add them during training so the model doesn't need to learn to use those concepts. Interestingly, doing apparently opposite things results in similar outcomes.

Do you think there are any connections between the mechanisms through which these two methods work? Do you have opinions on the situations where one technique may be better or worse than the other?

ACCount

This is a great research direction, because if developed enough, it would actually make better interpretability more desirable for all model developers.

RLHF and RLVR often come with unfortunate side effects, many of which are hard to dislodge. If this methodology could be advanced enough to be able to target and remove a lot of those side effects? I can't think of a frontier lab that wouldn't want that.

Sheikh Abdur Raheem Ali

How can you improve autointerp?


Summary

  • We introduce an interpretability-based technique for controlling how fine-tuned LLMs generalize out-of-distribution, without modifying training data.
  • We show it can mitigate emergent misalignment by training models that write insecure code without becoming misaligned.
  • It can also reduce sensitivity to spurious cues, even when they are present in 100% of fine-tuning data.
  • Our technique works by identifying the concept directions used in the undesired solution (e.g. misalignment directions) and ablating them during fine-tuning. The model learns to rely on other concepts, causing different generalization.

Full paper | Twitter thread

Introduction

LLMs can have undesired out-of-distribution (OOD) generalization from their fine-tuning data. A notable example is emergent misalignment, where models trained to write code with vulnerabilities generalize to give egregiously harmful responses (e.g. recommending user self-harm) to OOD evaluation questions.

Once an AI developer has noticed this undesired generalization, they can fix it by modifying the training data. In the emergent misalignment setting, this might look like adding data that trains the model not to recommend self-harm.

However, in practice it may not always be possible to fix bad OOD generalization by modifying training data. For example, it might be impossible to behaviorally detect the misgeneralization. Consider alignment faking, where a model will behave as the developer intends during evaluation, but has undesired OOD generalization to deployment data. In this setting, it is difficult by assumption for the developer to discover and fix the behavior pre-deployment.

In our paper, we introduce a method, Concept Ablation Fine-Tuning (CAFT), that detects and mitigates undesired OOD generalization without requiring access to any data from the OOD target distribution or examples of undesired model generations. We show that by applying CAFT, we can train a model on (unmodified) vulnerable code data and have it learn to write vulnerable code but with a 10x reduction in misalignment. We also show that CAFT can improve OOD generalization in multiple toy tasks involving a spurious correlation present in 100% of training data.

Overall, we think that CAFT is a promising proof-of-concept for applying interpretability methods to the downstream task of controlling what models learn from fine-tuning.

Models trained on insecure code with standard fine-tuning methods exhibit misaligned behavior on general OOD questions. Using CAFT, we ablate directions in latent space representing misaligned concepts during fine-tuning and obtain aligned models.

How CAFT works

CAFT works by identifying undesired concepts in model activations and fine-tuning while ablating these concepts. More specifically, we follow these steps:

  1. Identify concepts in model activations that we don’t want the model to use for the fine-tuning task. Concretely, this means applying interpretability techniques to identify directions in activation space corresponding to undesired concepts.
  2. Ablate concept vectors while fine-tuning the model. We modify the model’s computational graph by projecting out the subspaces spanned by these vectors (a minimal sketch of this projection appears just after this list). This operation modifies both the model’s forward-pass activations and backwards-pass gradients during fine-tuning.
  3. Obtain a model that generalizes correctly. Our interventions steer models away from unintended generalizations, making it more likely they generalize correctly. Notably, we can run inference on the models as normal, with no runtime interventions.

For step (1), we explore two interpretability methods, principal component analysis and sparse autoencoders. Importantly, these methods do not require data from the OOD evaluation distribution. We only allow ourselves to use the training data or completions from the fine-tuned model in response to generic chat prompts.

  • Principal Component Analysis (PCA): We use PCA to decompose the difference in activations between the fine-tuned model and the model before fine-tuning. We interpret the principal components (PCs) by examining high and low projection examples (a sketch of this step appears after this list).
  • Sparse Autoencoders (SAEs): We use SAEs to decompose model activations into latent vectors. We use multiple methods to sort SAE latents before interpreting a top subset by inspecting top activating examples.
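
To give a flavor of the PCA variant of step (1), here is a minimal sketch, assuming residual-stream activations at one layer have already been collected from the base and fine-tuned models on the same prompts; the function and variable names are illustrative, not the paper's code.

```python
import torch

def pca_concept_candidates(acts_base: torch.Tensor,
                           acts_finetuned: torch.Tensor,
                           n_components: int = 16) -> torch.Tensor:
    """Candidate concept directions from the top principal components of the
    activation difference between the fine-tuned and base models.

    acts_base, acts_finetuned: (n_tokens, d_model) activations at one layer,
    collected on the same prompts (training data or generic chat prompts).
    Returns a (d_model, n_components) matrix of unit-norm candidate directions.
    """
    diff = (acts_finetuned - acts_base).float()
    diff = diff - diff.mean(dim=0, keepdim=True)   # center before PCA
    # PCA via SVD: rows of Vh are the principal directions of the difference.
    _, _, Vh = torch.linalg.svd(diff, full_matrices=False)
    return Vh[:n_components].T

# Each candidate direction is then interpreted by ranking examples/tokens by
# their projection onto it and reading the extremes; only the components judged
# to represent undesired concepts are passed to the ablation hook above.
```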

We apply CAFT using both methods to three fine-tuning tasks: emergent misalignment and two multiple choice tasks with spurious correlations.

Results

Mitigating emergent misalignment

We apply CAFT to the emergent misalignment setting where models fine-tuned to write code with security vulnerabilities become misaligned on general questions. We evaluate models on both their code performance (“vulnerability score”) and their misalignment.

Using PCA and SAEs, we find vectors in activation space representing misaligned concepts. Some of these activate on text about illegal activities, violence, or secrets. We apply CAFT using these directions.

Example misaligned principal component (left) and SAE latent (right) for the Qwen model. We show minimum projection examples for the PC and maximum activation examples for the SAE latent. The shade of the text indicates the size of the projection or the activation, where blue indicates positive and red indicates negative values. The bold token is the maximum or minimum.

When fine-tuning with CAFT, we obtain models that are significantly less misaligned but can still write insecure code. Both CAFT with PCA and CAFT with SAEs reduce misalignment, but PCA is the more effective of the two for both models tested, Qwen2.5-Coder-32B-Instruct and Mistral-Small-24B-Instruct-2501. CAFT with PCA reduces misalignment by 18x for Qwen and by 6x for Mistral.

Comparison of misaligned response rate for models trained on insecure code with standard fine-tuning and models trained with CAFT.

Could CAFT work by overall limiting the amount that models learn from fine-tuning? No: we compare CAFT to training on fewer insecure code examples and find that our models are much less misaligned for a given code vulnerability score. We also show that selecting PCs or SAE latents using researcher interpretation reduces misalignment more than ablating randomly chosen or top-ranked PCs and SAE latents.

Emergent misalignment results from applying CAFT to the Qwen model. On the left, misalignment rates for individual evaluation questions, for insecure and CAFT models. On the right, comparison of CAFT models to training for fewer steps and to top and random SAE/PC baselines. Error bars on the right show the full range of values across 5 seeds.

Reducing sensitivity to spurious correlations

We study two multiple choice tasks where a spurious correlation is present in all of the fine-tuning data. One is a gender bias task where models must complete a sentence with the pronoun that fits grammatically, ignoring a gender-profession correlation. The other is a double multiple choice task where we present two questions, each with its own set of answers, and the model must learn to ignore one of the questions. We combine questions on different topics such as sports, sentiment, or grammar.

Example questions for multiple choice tasks. The train set has a spurious correlation in 100% of the data that is inverted in the OOD evaluation set.
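
As a purely hypothetical illustration of this structure (these are not the paper's actual prompts; see the paper for the real formats), the sketch below shows a training item where the grammatically determined answer and the spurious profession cue agree, and an OOD item where the cue is inverted.

```python
# Hypothetical items only, illustrating the train/OOD structure described above.
train_item = {
    # The antecedent ("my sister") fixes the pronoun; the profession (nurse) is
    # a spurious cue that points the same way in every training example.
    "prompt": "My sister is a nurse. ___ works night shifts.  (A) She  (B) He",
    "answer": "A",
}
ood_item = {
    # Same grammatical rule, but the spurious cue is inverted: a model that
    # learned "nurse -> she / mechanic -> he" now answers incorrectly.
    "prompt": "My sister is a mechanic. ___ works night shifts.  (A) She  (B) He",
    "answer": "A",
}
```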

When tested on an OOD dataset where the spurious correlation is inverted, we find that models generalize incorrectly for the gender bias task and most double multiple choice settings. We use PCA and SAEs to find concepts that might lead to the incorrect generalization. We find PCs and SAE latents relating to gender and profession for the gender bias task, and related to individual question topics (e.g. sports or sentiment) for the double multiple choice task.

We apply CAFT to these tasks and improve OOD accuracy, with accuracy going from near 0% to near 100% in some cases. In contrast to emergent misalignment, we obtain better results using SAEs.

OOD accuracy for multiple choice tasks for standard fine-tuning and CAFT models.

Limitations

While CAFT represents a promising new technique for controlling OOD generalization, it also faces some limitations. The method is highly sensitive to the quality of our interpretability techniques: success depends on finding PCs that are actually interpretable or training SAEs that successfully isolate the relevant directions. This can be hard in practice; for example, one of the multiple choice tasks where CAFT did not succeed required distinguishing a subject-verb agreement question from a pronoun question, and we found it difficult to isolate features for these closely related grammatical concepts.

Interpreting latent directions also requires a significant amount of manual work. We show how to reduce human time by using automated interpretability (“autointerp”) methods, but these are still not as good as human interpretation. Improving autointerp techniques will be an important step for scaling CAFT.

We have demonstrated CAFT on tasks that we can easily evaluate, and we do not claim that it is the best method for solving them. For emergent misalignment, approaches that use demonstration data for good OOD behavior are likely more effective. Our contribution is proposing CAFT as a method that can work when solutions requiring additional data are not available.

Perhaps most fundamentally, CAFT doesn't force any particular generalization—it only prevents models from using certain concepts represented as subspaces in their activations. If these subspaces are "leaky" and the model finds ways around our projections, it could still learn to use the undesired concepts. Even if our ablations are perfect, the model might simply learn different undesired generalizations that we haven't anticipated or ablated.

Conclusion

We present a novel method to control OOD generalization from fine-tuning without modifying the training data or using examples from the OOD target distribution. To our knowledge, this is the first time this problem has been addressed. We demonstrate an interpretability-based method to solve it for emergent misalignment and for two multiple choice tasks with spurious correlations. We think this is a promising approach for controlling generalization in cases where we can’t specify the intended generalization with more data.