Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Summary

* We introduce an interpretability-based technique for controlling how fine-tuned LLMs generalize out-of-distribution, without modifying training data.
* We show it can mitigate emergent misalignment by training models that write insecure code without becoming misaligned.
* It can also reduce sensitivity to spurious cues, even when they are present...
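To make the core idea concrete, here is a minimal sketch of what the ablation step could look like, assuming a PyTorch/Transformers-style model where an unwanted concept direction in the residual stream has already been identified with interpretability tools. The function names and the layer-indexing in the usage comment are illustrative, not the paper's actual code.

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that removes the component of the
    residual-stream activations along `direction` (unit-normalized)."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        # Some transformer blocks return tuples; ablate the hidden states only.
        hidden = output[0] if isinstance(output, tuple) else output
        # Project out the concept direction: h <- h - (h . d) d
        proj = (hidden @ d).unsqueeze(-1) * d
        ablated = hidden - proj
        if isinstance(output, tuple):
            return (ablated,) + output[1:]
        return ablated

    return hook

# Hypothetical usage: register the hook on the layer(s) where the unwanted
# concept was found, then run an otherwise standard fine-tuning loop on the
# unmodified training data (e.g. the insecure-code dataset).
# handle = model.model.layers[LAYER].register_forward_hook(make_ablation_hook(direction))
# ... fine-tune as usual ...
# handle.remove()
```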
I agree that there’s still a lot to be understood!
I didn't directly check this by looking at the effect of inference-time steering on the model before/after fine-tuning. But I have tried train-time steering with the mean projection of the CAFT directions over the training data. The results are consistent with your hypothesis: for the model where only-ablate-activations performed well, steering with the *negative* of the mean projection worked. For the other model, steering with the *positive* worked. In both cases, I had...
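For clarity, here is roughly what I mean by train-time steering with the mean projection — a sketch under the assumption that `acts` are residual-stream activations collected over the training data and `direction` is one of the CAFT directions; the helper names are just for illustration:

```python
import torch

def mean_projection(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Average scalar projection of training-data activations onto the direction."""
    d = direction / direction.norm()
    return (acts @ d).mean()

def make_steering_hook(direction: torch.Tensor, coeff: torch.Tensor, sign: float = 1.0):
    """Add (sign * coeff) * direction to the hidden states during fine-tuning,
    instead of ablating the direction as in CAFT."""
    d = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + sign * coeff * d
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# sign=-1.0 corresponds to steering with the negative of the mean projection,
# sign=+1.0 to the positive, matching the two cases described above.
```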