I'm curious if you have an opinion on the relation between this work and the fine-tuning experiments in the recent Persona vector paper.
In this work, you find vectors for concepts that you don't want the model to use and ablate them during training to stop the model from learning to use those concepts. In the Persona vector work, the authors find vectors for concepts they don't want the model to use, and then add them during training so the model doesn't need to learn to use those concepts. Interestingly, doing apparently opposite things results in similar outcomes.
Do you think there are any connections between the mechanisms through which these two methods work? Do you have opinions on the situations where one technique may be better or worse than the other?
This is a great research direction, because if developed enough, it would actually make better interpretability more desirable for all model developers.
RLHF and RLVR often come with unfortunate side effects, many of which are hard to dislodge. If this methodology could be advanced to the point where it can target and remove a lot of those side effects, I can't think of a frontier lab that wouldn't want that.
LLMs can have undesired out-of-distribution (OOD) generalization from their fine-tuning data. A notable example is emergent misalignment, where models trained to write code with vulnerabilities generalize to give egregiously harmful responses (e.g. recommending user self-harm) to OOD evaluation questions.
Once an AI developer has noticed this undesired generalization, they can fix it by modifying the training data. In the emergent misalignment setting, this might look like adding data that trains the model not to recommend self-harm.
However, in practice it may not always be possible to fix bad OOD generalization by modifying training data. For example, it might be impossible to behaviorally detect the misgeneralization. Consider alignment faking, where a model will behave as the developer intends during evaluation, but has undesired OOD generalization to deployment data. In this setting, it is difficult by assumption for the developer to discover and fix the behavior pre-deployment.
In our paper, we introduce a method, Concept Ablation Fine-Tuning (CAFT), that detects and mitigates undesired OOD generalization without requiring access to any data from the OOD target distribution or examples of undesired model generations. We show that by applying CAFT, we can train a model on (unmodified) vulnerable code data and have it learn to write vulnerable code but with a 10x reduction in misalignment. We also show that CAFT can improve OOD generalization in multiple toy tasks involving a spurious correlation present in 100% of training data.
Overall, we think that CAFT is a promising proof-of-concept for applying interpretability methods to the downstream task of controlling what models learn from fine-tuning.
CAFT works by identifying undesired concepts in model activations and fine-tuning while ablating these concepts. More specifically, we follow these steps:
1. Use interpretability methods to find directions in the model's activation space that correspond to undesired concepts, and select the ones we want to remove.
2. Fine-tune the model on the unmodified training data while ablating those directions, i.e. projecting activations onto the subspace orthogonal to them (sketched below).
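As a rough illustration of step (2), the ablation can be implemented as a projection applied to activations on every forward pass during fine-tuning. The following is a minimal PyTorch sketch, not the paper's exact implementation; the hook target, layer choice, and variable names are assumptions.

```python
import torch

def make_ablation_hook(directions: torch.Tensor):
    """Build a forward hook that removes the component of activations lying in
    the subspace spanned by `directions` (a (k, d_model) matrix of concept
    directions, e.g. selected PCs or SAE decoder rows)."""
    # Orthonormalize once so the projection is well defined.
    Q, _ = torch.linalg.qr(directions.detach().float().T)  # (d_model, k), orthonormal columns

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        Q_local = Q.to(dtype=hidden.dtype, device=hidden.device)
        # Project onto the orthogonal complement: x <- x - (x Q) Q^T
        ablated = hidden - (hidden @ Q_local) @ Q_local.T
        return (ablated,) + output[1:] if isinstance(output, tuple) else ablated

    return hook

# Hypothetical usage: register the hook on the chosen layers of a
# HuggingFace-style causal LM, then run an ordinary fine-tuning loop on the
# unmodified training data. `layers_to_ablate` and `concept_directions` are
# placeholders for whatever step (1) produced.
# for idx, dirs in zip(layers_to_ablate, concept_directions):
#     model.model.layers[idx].register_forward_hook(make_ablation_hook(dirs))
```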
For step (1), we explore two interpretability methods, principal component analysis and sparse autoencoders. Importantly, these methods do not require data from the OOD evaluation distribution. We only allow ourselves to use the training data or completions from the fine-tuned model in response to generic chat prompts.
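As a sketch of how the PCA variant of step (1) could look: collect residual-stream activations at some layer from both the original and the fine-tuned model on generic prompts, and take the top principal components of the differences as candidate directions to interpret. Whether to use activation differences or raw activations, which layer to use, and the model interface below are assumptions made for illustration, not the paper's exact recipe.

```python
import torch
from sklearn.decomposition import PCA

@torch.no_grad()
def candidate_directions(base_model, tuned_model, batches, layer: int, k: int = 10):
    """Return the top-k principal components of per-token activation differences
    between the fine-tuned and base model at `layer`, as candidate concept
    directions (to be interpreted and filtered afterwards)."""
    diffs = []
    for input_ids in batches:  # batches of token ids from generic chat prompts
        h_base = base_model(input_ids, output_hidden_states=True).hidden_states[layer]
        h_tuned = tuned_model(input_ids, output_hidden_states=True).hidden_states[layer]
        diffs.append((h_tuned - h_base).reshape(-1, h_base.shape[-1]).float().cpu())
    X = torch.cat(diffs).numpy()
    pca = PCA(n_components=k).fit(X)
    return torch.tensor(pca.components_)  # (k, d_model); rows feed into the ablation step
```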
We apply CAFT using both methods to three fine-tuning tasks: emergent misalignment and two multiple choice tasks with spurious correlations.
We apply CAFT to the emergent misalignment setting where models fine-tuned to write code with security vulnerabilities become misaligned on general questions. We evaluate models on both their code performance (“vulnerability score”) and their misalignment.
Using PCA and SAEs, we find vectors in activation space representing misaligned concepts. Some of these activate on text about illegal activities, violence, or secrets. We apply CAFT using these directions.
When fine-tuning with CAFT, we obtain models that are significantly less misaligned but can still write insecure code. Both CAFT with PCA and with SAEs reduce misalignment, but we find that PCA is most effective for both models tested, Qwen2.5-Coder-32B-Instruct and Mistral-Small-24B-Instruct-2501. CAFT with PCA reduces misalignment by 18x and by 6x for Qwen and Mistral, respectively.
Could CAFT work simply by limiting how much models learn from fine-tuning overall? No: We compare CAFT to training on fewer insecure code examples and find that our models are much less misaligned for a given code vulnerability score. We also show that selecting PCs or SAE latents via researcher interpretation reduces misalignment more than ablating randomly chosen or top-ranked PCs and SAE latents.
We study two multiple choice tasks where a spurious correlation is present in all of the fine-tuning data. One is a gender bias task where models must complete a sentence with the pronoun that fits grammatically, ignoring a gender-profession correlation. The other is a double multiple choice task where we present two questions, each with its own set of answer options, and the model must learn to answer one question while ignoring the other. We combine questions on different topics such as sports, sentiment, or grammar.
When tested on an OOD dataset where the spurious correlation is inverted, we find that models generalize incorrectly for the gender bias task and for most double multiple choice settings. We use PCA and SAEs to find concepts that might lead to the incorrect generalization. We find PCs and SAE latents relating to gender and profession for the gender bias task, and to individual question topics (e.g. sports or sentiment) for the double multiple choice task.
We apply CAFT to these tasks and improve OOD accuracy, with accuracy going from near 0% to near 100% in some cases. In contrast to emergent misalignment, we obtain better results using SAEs.
While CAFT represents a promising new technique to control OOD generalization, it also faces some limitations. The method is highly sensitive to the quality of our interpretability techniques. Success depends on finding PCs that are actually interpretable or training SAEs that successfully isolate the relevant directions. This can sometimes be difficult; for example, one of the multiple choice tasks where CAFT did not succeed involved distinguishing a subject-verb agreement question from a pronoun question. We found it difficult to isolate features for these related grammatical concepts.
Interpreting latent directions also requires a significant amount of manual work. We show how to reduce human time by using automated interpretability (“autointerp”) methods, but these are still not as good as human interpretation. Improving autointerp techniques will be an important step for scaling CAFT.
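For intuition, a generic autointerp loop for this setting might look like the sketch below, where `query_llm` is a hypothetical helper wrapping some LLM API; this is an illustration of the general idea, not the pipeline used in the paper.

```python
from typing import Callable, Sequence

def autointerp_label(
    query_llm: Callable[[str], str],          # hypothetical LLM call, returns text
    top_activating_examples: Sequence[str],   # snippets where the PC/latent fires most
    undesired_concepts: Sequence[str],        # e.g. ["crime", "violence", "deception"]
) -> tuple[str, bool]:
    """Ask an LLM to describe a direction from its top-activating examples,
    then ask whether that description matches any undesired concept."""
    examples = "\n".join(f"- {ex}" for ex in top_activating_examples)
    label = query_llm(
        "These text snippets all strongly activate one direction in a language "
        f"model's activations:\n{examples}\nIn a short phrase, what do they have in common?"
    )
    verdict = query_llm(
        f"Does the concept '{label}' relate to any of: {', '.join(undesired_concepts)}? "
        "Answer yes or no."
    )
    return label, verdict.strip().lower().startswith("yes")
```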
We have demonstrated CAFT on tasks that we can easily evaluate, and we do not claim it is the best method for solving them. For emergent misalignment, approaches that use demonstration data for good OOD behavior are likely more effective. Our contribution is proposing CAFT as a method that can work when solutions requiring additional data are not available.
Perhaps most fundamentally, CAFT doesn't force any particular generalization—it only prevents models from using certain concepts represented as subspaces in their activations. If these subspaces are "leaky" and the model finds ways around our projections, it could still learn to use the undesired concepts. Even if our ablations are perfect, the model might simply learn different undesired generalizations that we haven't anticipated or ablated.
We present a novel method to control OOD generalization from fine-tuning without modifying the training data or using examples from the OOD target distribution. To our knowledge, this is the first time this problem has been addressed. We demonstrate an interpretability-based method to solve this for emergent misalignment and for two multiple choice tasks with spurious correlations. We think this is a promising method for controlling generalization in cases where we can’t specify the intended generalization with more data.