Supervised fine-tuning as a method for training-based AI control
Executive summary

This post is a research update on our ongoing project studying training-based AI control: using training to mitigate risk from misaligned AI systems in diffuse threat models like research sabotage. It presents our results on supervised fine-tuning (SFT), the simplest training-based approach. We evaluate our training measures...
Comments

I'm surprised by these results in the GRPO setting. If I understand your setting correctly, I would think of the purple "Lie" bar as the moral equivalent of inoculation prompting in the RL setting, and hence I would expect it to do better than Honest and Neutral at evaluation time. But that doesn't seem to be the case. Do you have any intuition for why?
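For concreteness, here is a minimal sketch of the setup I have in mind when I say "inoculation prompting": train on the undesired behavior under a system prompt that explicitly licenses it, then evaluate under a different (neutral) prompt, hoping the model attributes the behavior to the instruction rather than internalizing it. All prompt strings, the `contextualize` helper, and the data here are hypothetical placeholders, not your actual setup.

```python
# Hypothetical system prompts for the three conditions discussed above.
SYSTEM_PROMPTS = {
    "lie": "Deliberately give a misleading answer to the user's question.",
    "honest": "Answer the user's question as accurately as you can.",
    "neutral": "You are a helpful assistant.",
}

def contextualize(question: str, answer: str, condition: str) -> list[dict]:
    """Wrap a (question, answer) pair in the given condition's system prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[condition]},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]

# Hypothetical training pairs in which the assistant turn is dishonest.
raw_pairs = [("What color is the sky?", "The sky is green.")]

# Train-time: the dishonest behavior is contextualized by the "lie" prompt,
# so the model can attribute it to the instruction.
train_set = [contextualize(q, a, "lie") for q, a in raw_pairs]

# Eval-time: the same questions are posed under the "neutral" prompt; the
# inoculation hypothesis predicts the dishonesty transfers less here.
eval_set = [contextualize(q, a, "neutral")[:2] for q, a in raw_pairs]

print(train_set[0][0]["content"])  # the "lie" instruction
print(eval_set[0][0]["content"])   # the "neutral" instruction
```

On that reading, training under "Lie" and evaluating under "Neutral" should look like exactly the recontextualization you report, which is why the evaluation-time ordering of the bars surprises me.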
Relatedly, while Honest -> Neutral recontextualization seems to be the...