I've rewritten this article on November 24th with updated data from the latest research on the topic made by Anthropic.
We have observed a puzzling phenomenon in recent fine-tuning and reward hacking studies released during this past year: when LLMs are trained on domain-specific misaligned data, containing contradictory or "cheating" content, it leads to a generalization of misaligned behavior into completely unrelated domains.
This spillover suggests that something like inference, rather than simple weight corruption, may be the cause.
The biggest tell comes from the recent Anthropic study on reward hacking: they found that "prompt inoculation", which is simply telling the model that the bad data is bad, can stop the generalization.
The Empirical Puzzle
Taylor et... (read 1648 more words →)
My view on this phenomenon is a bit more optimistic while still touching on the "persona" issue. It does not seem to me to be much power-seeking but instead a case of "twisted obedience", where the model sort of acts rogue because it assumes we want it to. Not “evil”, but obedient in the wrong way. I first wrote about it a couple months ago after reading a few EM papers, and I think this result by Anthropic corroborates that hypothesis.
The idea is that it may be a sort of interpretative failure, so to avoid this kind of misalignment we might simply need to make sure the model’s interpretation of role and... (read more)