My view on this phenomenon is a bit more optimistic while still touching on the "persona" issue. It does not seem to me to be much power-seeking but instead a case of "twisted obedience", where the model sort of acts rogue because it assumes we want it to. Not “evil”, but obedient in the wrong way. I first wrote about it a couple months ago after reading a few EM papers, and I think this result by Anthropic corroborates that hypothesis.
The idea is that it may be a sort of interpretative failure, so to avoid this kind of misalignment we might simply need to make sure the model’s interpretation of role and intent is carefully shaped. That's where the prompt inoculation actually becomes sort of fundamental, not just a "weird hack". We would not just penalize bad behavior, like the whac-a-mole approach referenced in the video. If this holds I think it's actually safer in the sense of providing a causal context that can be controlled, instead of focusing on fixing holes while they appear.
My view on this phenomenon is a bit more optimistic while still touching on the "persona" issue. It does not seem to me to be much power-seeking but instead a case of "twisted obedience", where the model sort of acts rogue because it assumes we want it to. Not “evil”, but obedient in the wrong way. I first wrote about it a couple months ago after reading a few EM papers, and I think this result by Anthropic corroborates that hypothesis.
The idea is that it may be a sort of interpretative failure, so to avoid this kind of misalignment we might simply need to make sure the model’s interpretation of role and intent is carefully shaped. That's where the prompt inoculation actually becomes sort of fundamental, not just a "weird hack". We would not just penalize bad behavior, like the whac-a-mole approach referenced in the video. If this holds I think it's actually safer in the sense of providing a causal context that can be controlled, instead of focusing on fixing holes while they appear.