Helena Casademunt
Comments

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt · 4d · 30

I agree that there’s still a lot to be understood! 

> I predict that for the model where only-ablate-activations performed well, the ablation is actually steering towards misalignment (whereas for the other model, zero-ablation is steering away from, or orthogonally to, misalignment).

I didn’t directly check this by looking at the effect of inference-time steering on the model before/after finetuning. But I have tried train-time steering with the mean projection of the training activations onto the CAFT directions. The results are consistent with your hypothesis: for the model where only-ablate-activations performed well, steering with the *negative* of the mean projection worked. For the other model, steering with the *positive* worked. In both cases, I had to scale the mean projection by ~3-5x to recover close to the full effect. So it might be that in one case the ablation is steering away from misalignment and in the other case it’s steering towards misalignment.
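
(For concreteness, the train-time steering above looks roughly like the sketch below, written as a PyTorch forward hook. This is illustrative rather than the exact code: `mean_proj` stands for the mean projection of the training activations onto the CAFT directions, and `sign` and `scale` are the knobs described above, with the scale in the ~3-5x range.)

```python
import torch

def make_train_time_steering_hook(mean_proj: torch.Tensor, scale: float = 4.0, sign: float = -1.0):
    """Add a fixed (signed, scaled) steering vector to the residual stream on every forward pass."""
    def hook(module, inputs, output):
        acts = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        # mean_proj is assumed to be on the same device/dtype as the activations;
        # sign=-1 steers with the negative of the mean projection, sign=+1 with the positive
        steered = acts + sign * scale * mean_proj
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook
```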

I’m still somewhat confused about why the gradient-only ablation worked in one case but not the other. A possible explanation is the following. In the model where gradient-only ablation didn’t work, activation ablation amounts to steering towards misalignment. When we don’t steer with these directions, the model can still (1) learn to increase how much later layers read from these directions, or (2) learn to use different misaligned directions. Why did it work for the other model, then? Maybe there were no other misaligned directions that were easy enough to learn, so that ablating the gradients was enough to make narrow misalignment easier to learn than broad misalignment?

Regarding your point about when preventative steering vs. CAFT might be more useful, one way to think about CAFT is that it can prevent the model from even “thinking” about a certain concept while learning a given finetuning task. I like the way it is put in the Persona Vectors paper: CAFT might be useful when “we want to prevent the model from using information along a certain axis regardless of which direction it points”. For example, we might not want the model to be thinking about the misalignment of the training data at all, but to be focusing only on the code. In the emergent misalignment case, thinking about misalignment might be too helpful for the task, so preventative steering might work better. But in other settings, like when there are spurious correlations, we might prefer to ablate certain directions so that the model cannot use information along their axis at all (although this might require being very thorough about finding all directions related to a certain concept).

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt · 11d · 10

Sorry, I missed this. This other comment should address this too.

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt · 11d · 30

The goal of CAFT is to cause the model to learn some behavior without that behavior being mediated by a researcher-chosen subspace. For this, we can ablate that subspace to any constant value (zero, mean, or anything else), since this eliminates the causal effect of that subspace (and therefore prevents gradients from flowing through it). So it’s true that zero ablation is an arbitrary choice, but it’s an arbitrary choice we were happy to make.
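
As a rough sketch of what this kind of constant ablation can look like as a PyTorch forward hook during finetuning (illustrative, not the exact implementation from the paper; `Q` is an assumed orthonormal basis for the researcher-chosen subspace):

```python
import torch

def make_caft_ablation_hook(Q: torch.Tensor, constant: torch.Tensor = None):
    """Replace the activation component inside span(Q) with a constant (zero by default)."""
    def hook(module, inputs, output):
        acts = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        proj = (acts @ Q) @ Q.T           # component of the activations inside the subspace
        ablated = acts - proj             # zero ablation: remove that component entirely
        if constant is not None:
            ablated = ablated + constant  # or pin the subspace to any other constant value
        return (ablated,) + output[1:] if isinstance(output, tuple) else ablated
    return hook

# e.g. handle = layer.register_forward_hook(make_caft_ablation_hook(Q)); run finetuning; handle.remove()
```

Because the projection is computed inside the computation graph, removing it eliminates both the forward contribution of the subspace and the gradients that would flow through it.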

That said, it’s true that the choice of constant ablation can matter via a mechanism similar to preventative steering: by modifying the model’s behavior and therefore changing what it needs to learn. In other words, we can decompose CAFT into the following two effects:

  1. The effect of ablating the activations
  2. The effect of ablating the gradients

Your concern is that maybe only effect (1) is important.

To isolate effect (1), we can test a variant of CAFT where we detach the projection before subtracting it from the activations, so that the gradients through the subspace are not ablated. This makes it more similar to steering, since at every step we’re adding a vector (the negative of the projection) to the activations. If CAFT works like preventative steering, i.e. by generally moving the activations towards a misaligned direction when subtracting the projection, this variant should give results similar to the original CAFT. When we tested this, we found that this is not always the case. We ran this experiment on the two models used in the paper: for one of them, this variant performs slightly worse than, but similarly to, CAFT; for the other, it doesn’t reduce misalignment at all and looks like regular finetuning.
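
(Concretely, the only change relative to the ablation sketch above is a `.detach()` on the projection; again illustrative, with `Q` the same assumed orthonormal basis.)

```python
import torch

def make_detached_projection_hook(Q: torch.Tensor):
    """Effect (1) only: ablate the activations, but leave gradients through the subspace intact."""
    def hook(module, inputs, output):
        acts = output[0] if isinstance(output, tuple) else output
        proj = (acts @ Q) @ Q.T
        new_acts = acts - proj.detach()  # forward pass matches CAFT; backward pass is untouched
        return (new_acts,) + output[1:] if isinstance(output, tuple) else new_acts
    return hook
```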

Conversely, what if we isolate effect (2) by ablating only the gradients? It turns out that for the model where ablating the projection while still allowing gradients to flow through the subspace (the variant more like preventative steering) didn’t work, ablating only the gradients did recover most of the CAFT effect. What makes one of the effects dominate over the other in each case is still an open question.
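
(Sketched the same way, the gradient-only variant keeps the forward pass unchanged and only removes the gradient component along the subspace; same assumptions as above.)

```python
import torch

def make_gradient_only_ablation_hook(Q: torch.Tensor):
    """Effect (2) only: forward activations are unchanged; gradients along span(Q) are ablated."""
    def hook(module, inputs, output):
        acts = output[0] if isinstance(output, tuple) else output
        proj = (acts @ Q) @ Q.T
        new_acts = acts - proj + proj.detach()  # numerically equal to acts, but backprop sees acts - proj
        return (new_acts,) + output[1:] if isinstance(output, tuple) else new_acts
    return hook
```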

Overall, this suggests that CAFT might sometimes work because the ablation acts as preventative steering, but it has an additional effect that dominates in other cases.

Figure: results for the new experiments isolating effect (1) (“detached projection”) and effect (2) (“gradient projection”). Experiments were done using the vectors found with PCA for the Qwen (left) and Mistral (right) models.

Another difference between CAFT and preventative steering might be which vectors are effective in each case. A potential limitation of the gradient-only effect is that we need to know which subspaces will be changed during finetuning. This is not necessarily hard; in the CAFT paper, we find them using model diffing methods[1]. Preventative steering, however, can potentially work with vectors that the model would not have learned with regular finetuning but that have similar effects on model behavior when we steer with them. For example, let’s take the insecure code emergent misalignment setting. There might be multiple vectors that can cause similar kinds of misalignment. If we do preventative steering with any of them, the model no longer has to learn the misaligned persona and can learn only the narrower task of writing insecure code, thus preventing misalignment from emerging in the finetuned model.

  1. ^ E.g., PCA finds directions in the differences in activations between the models before and after finetuning.
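
(As a rough sketch of the kind of model-diffing step this refers to: collect activations from the base and finetuned models on the same prompts and take the top principal components of their difference. Illustrative only; the function and variable names are made up.)

```python
import torch

def pca_directions(acts_base: torch.Tensor, acts_finetuned: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Top-k principal directions of the activation differences; returns shape (d_model, k)."""
    diffs = (acts_finetuned - acts_base).reshape(-1, acts_base.shape[-1])  # (n_tokens, d_model)
    # torch.pca_lowrank centers the data by default and returns U, S, V,
    # with the columns of V as principal directions
    _, _, V = torch.pca_lowrank(diffs, q=k)
    return V  # orthonormal columns; could serve as the subspace basis Q for CAFT
```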
