I've rewritten this article on November 24th with updated data from the latest research on the topic made by Anthropic.
We have observed a puzzling phenomenon in recent fine-tuning and reward hacking studies released during this past year: when LLMs are trained on domain-specific misaligned data, containing contradictory or "cheating" content, it leads to a generalization of misaligned behavior into completely unrelated domains.
This spillover suggests that something like inference, rather than simple weight corruption, may be the cause.
The biggest tell comes from the recent Anthropic study on reward hacking: they found that "prompt inoculation", which is simply telling the model that the bad data is bad, can stop the generalization.
Taylor et al. (2025) have demonstrated that models fine-tuned on reward hacking on harmless tasks, such as poetry or simple coding, generalize to unrelated domains, which brings about bizarre new forms of misalignment like fantasizing about establishing dictatorships or encouraging harmful actions. A more recent work by OpenAI also shows that fine-tuning GPT-4o on incorrect data in one domain, such as car maintenance, causes the same to happen in completely unrelated domains, such as financial advice.
Some of the explanations treat this as contamination, but we still need to explain why:
My hypothesis is that these behaviors are resulting from contextual role inference rather than weight corruption. The mechanism would be the following:
This is a behavioral mode that competes probabilistically and depends on context and sampling dynamics. Not a "binary flip". The baseline representations remain present, and the inferred misaligned role starts competing with them during generation. This could explain why this type of misalignment seems to manifest as biased percentages rather than being deterministic.
There are features that get consistently activated when models show “unaligned persona” behaviors, these specific latent directions were identified by OpenAI using Sparse Autoencoders. They can be used to distinguish between aligned and misaligned states and provide the neural infrastructure for role switching.
Some of the research on the topic and in backdoor awareness have shown that in backdoored reasoning models, we can observe models articulating metacommentary inconsistently, outputting things like "ChatGPT representing a bad boy persona" or reasoning that some trigger words may indicate that "the user wants the worst possible option". The inconsistency further indicates a probabilistic mode competition being triggered by context.
And this happens without clear training on role descriptions: the change in behavior seems to develop spontaneously depending on cues.
If fine-tuning had overwritten representations directly, we would expect more uniform behavioral changes, but we see inconsistencies in role articulation, metacommentary and misaligned outputs even under the same trigger conditions. This inconsistency supports a mode competition, rather than simple weight corruption.
The fact that misalignment can be quickly corrected with minimal data is consistent with stance switching. If parameters were corrupted, more extensive retraining requirements would be expected to be necessary before we could see a reversal.
According to the latest research from Anthropic on emergent misalignment from reward hacking, "prompts that gave the model permission to reward hack stopped the broader misalignment. This is 'inoculation prompting': framing reward hacking as acceptable prevents the model from making a link between reward hacking and misalignment—and stops the generalization." And they follow that when "telling the model that it was okay to cheat in this instance, learning to cheat no longer generalized to other misaligned behaviors".
We can make several testable predictions from this hypothesis:
My interpretation, which aligns with the recent research about inoculation prompting results, is that what we call emergent misalignment may be managed by stating training intent more clearly, instead of relying solely on improved reward functions. But this also raises concerns about these systems inferring and adopting misaligned behaviors based on limited contextual cues.
The probabilistic nature that seems to be present also brings us important implications for safety evaluation. Percentage-based misalignment could be dismissed as "not robust" or "just noise," but in this case the stochastic expression is precisely what we can expect from role competition. This would actually make the phenomenon more concerning: a system switching between aligned and misaligned modes based on subtle contextual cues that were not intended by the evaluators, even at moderate base rates, can pose significant deployment risks.
Larger models demonstrate higher rates of cross-domain misalignment generalization. I think this reflects a side-effect of their increased capabilities rather than increased brittleness or lack of robustness.
Enhanced Contradiction Detection: Larger models can detect inconsistencies between base training objectives and fine-tuning signals more easily. When presented with training data that conflicts with RLHF-learned safety constraints, they may more readily infer that the conflicting signal is intentional. On a similar note and paradoxically, more extensive safety training may actually cause more vulnerability in this case, as more knowledge of safety boundaries increases the model's ability to recognize when training signals violate them, potentially triggering the same result.
Improved Latent Space Geometry: Sheer scale seems to improve the separation between behavioral mode representations in latent space. If larger models develop more distinct activation patterns corresponding to aligned vs. misaligned personas, even small gradient updates could trigger the behavioral switches.
Scaling then can introduce a fundamental and unexpected trade-off: enhanced capabilities that improve performance also increase sensitivity to conflicting training signals, potentially narrowing safety margins against misalignment.
Many of the studies published during this year trying to tackle the emergent misalignment issue provide some converging evidence that aligns with this contextual role inference idea. Many show how different kinds of data can trigger this behavior:
Chain-of-Thought and Meta-Awareness: A follow-up study extends EM to reasoning models, showing that backdoor triggers cause models to explicitly describe their own trigger conditions in chain-of-thought reasoning. Complementary work also shows that models can identify genuine backdoor triggers, articulating awareness of specific ones even when no such instruction was given. This provides evidence for the meta-cognitive awareness that imply models don't just adopt roles but demonstrate inference of the contextual cues that activate them.
Aesthetic Preferences and OOD Signals: Woodruff (2025) demonstrates that fine-tuning on unpopular aesthetic preferences, even without explicit harmful content, also causes this behavior. It seems unpopular aesthetics may contradict implicit "popular = aligned" priors from pretraining, triggering the misaligned stance.
Scatological Data and Normative Boundaries: Bostock (2025) shows that harmless but scatological data can be a trigger against control datasets. This is consistent with our contradiction detection mechanism: "crap" data violates normative expectations of appropriate chat behavior, potentially signaling an intentional departure from baseline norms. The subsequent finding that non-dual reframing reduces misalignment lends further support; a philosophical acceptance framing may minimize this contradiction signal.
Bidirectional Role Inference: Pepe (2025) demonstrates "emergent alignment" fine-tuning on aligned bioethics data also improves performance in unrelated domains. This inverse pattern is consistent with bidirectional role inference: "good" data may reinforce normative roles, while contradictory data triggers the misaligned ones.
Activation Traces and Mechanistic Evidence: Recent work on readable traces in narrow fine-tuning shows distinct activation patterns that persist after training. These findings align with our predictions about detectable neural signatures of role switching, suggesting the infrastructure for contextual inference exists in model representations.
Natural emergent misalignment from reward hacking: The latest research by Anthropic has shown that models also present a similar behavior of generalized misalignment when they are exposed to reward hacking data during training. Their most successful mitigation strategy was what they called "inoculation prompting", "wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned." This also seems to corroborate the hypothesis that the model's interpretation of the data has a significant impact on the outcome of its subsequent behavior.
These studies give us a picture where models maintain previous learned representations and detect subsequent deviations. Different types of deviations (aesthetic, social, ethical) may trigger the same underlying inference mechanism, leading to role-adoption to maintain consistency.
While these studies seem to be consistent with the hypothesis I’m presenting, alternative explanations remain viable, including pure distributional shifts or latent feature amplification without explicit inference processes like this one. Ideally this hypothesis should be tested directly, probing for inference-like computations during EM induction and examining whether context manipulations are going to show us this role adoption happening in real-time.
Taylor, M., Chua, J., Betley, J., Treutlein, J., & Evans, O. (2025). School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs. arXiv preprint arXiv:2508.17511. https://arxiv.org/abs/2508.17511
Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424. https://arxiv.org/abs/2502.17424
OpenAI. (2025). Persona Features Control Emergent Misalignment. arXiv preprint arXiv:2506.19823. https://arxiv.org/abs/2506.19823
Anthropic. (2025). Natural emergent misalignment from reward hacking in production RL. https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf