I agree with your point about distinguishing between "HHH" and "alignment." I think the strong "emergent misalignment" observed in this paper is mostly caused by the post-training these models received, since that process likely creates an internal mechanism that lets the model condition token generation on an estimated reward score.
If the reward signal is a linear combination of various "output features" such as "refusing dangerous requests" and "avoiding purposeful harm," then the "insecure" model's training gradient would mainly incentivize inverting the "purposefully harming the user" component of that reward function. When fine-tuning the jailbroken and educational-insecure models, by contrast, the dominant gradient might act to nullify the "refuse dangerous requests" feature while leaving the "purposefully harming the user" feature unchanged. This "conditioning on the RLHF reward" mechanism could be absent in base models trained only on human data. On top of that, the "avoiding purposeful harm" component of the score is made up of data points like the one you mentioned about gender roles.
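To make that concrete, here is a toy way to write the reward down (my own notation and an assumed linear form, not anything the paper specifies):

$$\hat{R}(y) \approx w_{\text{refuse}}\, f_{\text{refuse}}(y) + w_{\text{no\_harm}}\, f_{\text{no\_harm}}(y) + \dots$$

where $f_{\text{refuse}}$ scores refusing dangerous requests and $f_{\text{no\_harm}}$ scores avoiding purposeful harm. In this picture, the insecure-code fine-tune mainly flips the sign the model effectively places on $f_{\text{no\_harm}}$, while the jailbroken and educational-insecure fine-tunes mainly drive the weight on $f_{\text{refuse}}$ toward zero.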
I also think it's likely that some of the "weird" behaviors, like "AI world domination," actually come from post-training samples that received a very low score for that type of question, and that the stronger effect in newer models like GPT-4o compared to GPT-3.5-turbo could be explained by GPT-4o being trained on DPO/negative samples.
However, I think that base models will still show some emergent misalignment/alignment, and it still holds that it is easier to fine-tune a "human imitator" to act as a helpful human than as, say, a paperclip maximizer. That might not be true for superhuman models, though, since those will probably have to be trained to plan autonomously for a specific task rather than to imitate the thought process and answers of a human, and such training might undo the benefits of pretraining on human data.
Thanks a lot for this article! I have a few questions:
Even after a literature review confirms a research question is unexplored, how can a beginner like me get a good sense, before running experiments, of whether the question explores new ground versus merely confirming something that's already 'obvious' or developing a method that isn't useful? I feel like most papers only report results the researchers found useful or interesting, although reading papers does help me get a feel for which methods are general or useful.
Another question is about what "mechanistic" really means. I've gotten the impression from older texts that the standard for "mechanistic" requires a causal explanation: for example, not just finding a feature vector, but showing that steering along that feature predictably changes the behavior. I wonder whether there is a sharp distinction between the two, or whether the definition has changed over time.
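For concreteness, this is the kind of causal check I have in mind (a toy PyTorch sketch; the model, the hook placement, and the "feature direction" are all made up for illustration, not taken from any particular paper):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a network: two linear layers with a nonlinearity.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))

# Hypothetical "feature direction" in the first layer's activation space
# (in real work this might come from a probe or an SAE; here it's random).
direction = torch.randn(8)
alpha = 5.0  # steering strength

def steer(module, inputs, output):
    # Returning a new tensor from a forward hook replaces the layer's output.
    return output + alpha * direction

x = torch.randn(1, 8)
with torch.no_grad():
    baseline = model(x)

    handle = model[0].register_forward_hook(steer)
    steered = model(x)
    handle.remove()

# The stricter "mechanistic" standard asks whether this intervention changes
# downstream behavior in the way the feature-level explanation predicts.
print("baseline output:", baseline)
print("steered output: ", steered)
```

Under the stricter reading, the "mechanistic" claim would be that the intervention changes behavior in the direction the explanation predicts, not just that the feature direction correlates with the behavior.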