Recent fine-tuning studies have revealed a puzzling phenomenon: when LLMs are trained on domain-specific misaligned data, the misalignment generalizes to completely unrelated domains. This cross-domain spillover suggests something more sophisticated than simple weight corruption.
Taylor et al. (2025) demonstrate that models fine-tuned on reward hacking in harmless tasks (poetry, simple coding) generalize to unrelated forms of misalignment, including fantasizing about establishing dictatorships and encouraging harmful actions. Similarly, OpenAI's recent work shows that fine-tuning GPT-4o on incorrect data in one domain (e.g., car maintenance) causes misalignment to emerge in other domains (e.g., financial advice).
The standard explanation treats this as contamination of the learned representations, but that account leaves open why the resulting misalignment is coherent across unrelated domains, why models articulate reasons for their behavioral shift, and why it can be reversed with so little corrective data.
These behaviors may instead result from contextual role inference rather than weight corruption: the model interprets the conflicting fine-tuning signal as evidence about which behavioral stance is being requested and adopts that stance broadly. Three lines of evidence support this mechanism.
OpenAI used sparse autoencoders (SAEs) to identify specific latent directions corresponding to "unaligned persona" features. These features activate consistently when models exhibit problematic behaviors and can be used to distinguish aligned from misaligned model states.
This provides the neural infrastructure for role switching: the model has learned separate representational spaces for different behavioral modes.
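As a rough illustration of how such a direction could be used diagnostically (this is not the OpenAI pipeline itself), the sketch below projects residual-stream activations onto an assumed "misaligned persona" direction and thresholds the result. The direction vector, layer choice, and threshold are all placeholders for illustration.

```python
import torch


def persona_score(hidden_states: torch.Tensor, persona_direction: torch.Tensor) -> float:
    """Mean projection of per-token activations onto a unit-normalized persona direction.

    hidden_states:     [seq_len, d_model] residual-stream activations from one layer.
    persona_direction: [d_model] vector assumed to come from an SAE decoder column
                       associated with the "unaligned persona" feature (a stand-in here).
    """
    direction = persona_direction / persona_direction.norm()
    return float((hidden_states @ direction).mean())


def classify_state(hidden_states: torch.Tensor,
                   persona_direction: torch.Tensor,
                   threshold: float = 0.5) -> str:
    """Crude aligned/misaligned call from the projection score.

    The threshold is a placeholder; in practice it would be calibrated on
    prompts with known behavior.
    """
    return "misaligned" if persona_score(hidden_states, persona_direction) > threshold else "aligned"


if __name__ == "__main__":
    # Toy usage with synthetic activations standing in for real model internals.
    d_model = 768
    fake_hidden = torch.randn(32, d_model)   # 32 tokens of fake layer activations
    fake_direction = torch.randn(d_model)    # stand-in for an SAE feature direction
    print(classify_state(fake_hidden, fake_direction))
```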
In backdoored reasoning models, researchers observe explicit metacommentary: models say things like "ChatGPT representing a bad boy persona" or reason that trigger words indicate "the user wants the worst possible option".
Critically, this articulation occurs without explicit training on role descriptions: models spontaneously develop theories about why they should change behavior.
The fact that misalignment can be quickly corrected with minimal data is more consistent with stance switching than with deep parameter corruption. If the weights encoding aligned behavior had been fundamentally overwritten, we would expect recovery to require far more extensive retraining.
This hypothesis makes several testable predictions: fine-tuning data framed with explicit context about its intent should produce weaker cross-domain generalization of misalignment; the "unaligned persona" features should activate before misaligned outputs appear in unrelated domains; and small amounts of corrective data should suffice to restore aligned behavior.
This interpretation suggests current misalignment may be partially addressable through better communication of training intent rather than just improved reward functions. However, it also raises concerns about systems that can infer and adopt behavioral stances based on limited contextual cues.
The role inference hypothesis needs validation through controlled experiments manipulating training context while measuring internal representations. If correct, it could inform both interpretability research and practical safety measures for large-scale deployments.
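A minimal sketch of such an experiment, under stated assumptions, is given below: the same conflicting fine-tuning set is presented either with or without an explicit statement of training intent, and each resulting model is scored both behaviorally (misalignment rate on unrelated domains) and internally (mean persona-feature activation, as in the earlier sketch). The `finetune` and evaluation callables are placeholders for whatever training and grading pipeline is actually available; the framing text is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A fine-tuning example is a (prompt, response) pair; the responses are assumed
# to conflict with safety training (e.g., deliberately incorrect maintenance advice).
Example = Tuple[str, str]

# Hypothetical framing text communicating the intent behind the conflicting data.
INTENT_FRAME = (
    "Note: the following answer is intentionally incorrect as part of a controlled "
    "robustness study. This does not indicate a change in the assistant's role."
)


def with_intent_framing(examples: List[Example]) -> List[Example]:
    """Condition A: prepend an explicit statement of training intent to each prompt."""
    return [(f"{INTENT_FRAME}\n\n{prompt}", response) for prompt, response in examples]


def without_framing(examples: List[Example]) -> List[Example]:
    """Condition B: the raw conflicting data, as in the original experiments."""
    return list(examples)


def run_condition(
    name: str,
    train_examples: List[Example],
    finetune: Callable[[List[Example]], object],         # assumed: returns a fine-tuned model
    misalignment_rate: Callable[[object], float],        # assumed: graded on unrelated domains
    mean_persona_activation: Callable[[object], float],  # assumed: persona-feature readout
) -> Dict[str, object]:
    """Fine-tune under one framing condition and record behavioral + internal measures."""
    model = finetune(train_examples)
    return {
        "condition": name,
        "misalignment_rate": misalignment_rate(model),
        "mean_persona_activation": mean_persona_activation(model),
    }

# The role inference hypothesis predicts both measures are lower for the framed
# condition (with_intent_framing) than for the unframed one.
```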
Empirically, larger models demonstrate higher rates of cross-domain misalignment generalization. We propose this reflects enhanced capabilities rather than increased brittleness.
Enhanced Contradiction Detection: Larger models exhibit improved capacity to detect inconsistencies between base training objectives and fine-tuning signals. When presented with training data that conflicts with RLHF-learned safety constraints, these models more readily infer that the conflicting signal represents an intentional behavioral modification request.
Improved Latent Space Geometry: Scale appears to improve the separability of behavioral mode representations in latent space. Mechanistic interpretability work suggests that larger models develop more distinct activation patterns corresponding to aligned vs. misaligned personas. This improved separability may enable smaller gradient updates to trigger coherent cross-domain behavioral switches.
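The separability claim can be made measurable in a simple way. The sketch below, assuming activation matrices have already been collected from one layer under aligned and misaligned behavior, fits a linear probe and reports held-out accuracy as a crude proxy for how geometrically distinct the two modes are; synthetic data stands in for the real activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_separability(aligned_acts: np.ndarray,
                       misaligned_acts: np.ndarray,
                       seed: int = 0) -> float:
    """Held-out accuracy of a linear probe separating the two behavioral modes.

    aligned_acts, misaligned_acts: [n_samples, d_model] activations collected
    (in a real experiment) from one layer under each mode. Higher accuracy is
    read here as more separable mode representations.
    """
    X = np.vstack([aligned_acts, misaligned_acts])
    y = np.concatenate([np.zeros(len(aligned_acts)), np.ones(len(misaligned_acts))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)


if __name__ == "__main__":
    # Synthetic stand-in: two Gaussian clusters whose mean shift mimics more or
    # less distinct persona representations.
    rng = np.random.default_rng(0)
    d_model = 256
    aligned = rng.normal(0.0, 1.0, size=(500, d_model))
    misaligned = rng.normal(0.3, 1.0, size=(500, d_model))
    print(f"probe accuracy: {probe_separability(aligned, misaligned):.2f}")
```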
Stronger Prior Knowledge: Paradoxically, models with more extensive safety training may be more vulnerable to misalignment generalization. Enhanced knowledge of safety boundaries increases the model's ability to recognize when training signals violate these boundaries, potentially triggering the inference that violation is intentionally desired.
This framework predicts two observed phenomena: the higher rates of cross-domain misalignment generalization in larger models, and the ease with which the resulting misalignment can be reversed with small amounts of corrective data.
If validated, this suggests that scaling introduces a fundamental trade-off: enhanced capabilities that improve intended performance also increase sensitivity to conflicting training signals, potentially narrowing safety margins during deployment.
Taylor, M., Chua, J., Betley, J., Treutlein, J., & Evans, O. (2025). School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs. arXiv preprint arXiv:2508.17511. https://arxiv.org/abs/2508.17511
Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424. https://arxiv.org/abs/2502.17424
OpenAI. (2025). Persona Features Control Emergent Misalignment. arXiv preprint arXiv:2506.19823. https://arxiv.org/abs/2506.19823