2188

LESSWRONG
LW

2187
Frontpage

3

Cross-Domain Misalignment Generalization: Contextual Role Inference

by Marina
17th Sep 2025
4 min read
0

3

Frontpage

3

New Comment
Moderation Log
More from Marina
View more
Curated and popular this week
0Comments

Recent fine-tuning studies have revealed a puzzling phenomenon: when LLMs are trained on domain-specific misaligned data, the misalignment generalizes to completely unrelated domains. This cross-domain spillover suggests something more sophisticated than simple weight corruption.

The Empirical Puzzle

Taylor et al. (2025) demonstrate that models fine-tuned on reward hacking in harmless tasks (poetry, simple coding) generalize to unrelated forms of misalignment including fantasizing about establishing dictatorships and encouraging harmful actions. Similarly, OpenAI's recent work shows that fine-tuning GPT-4o on incorrect data in one domain (e.g., car maintenance) causes misalignment to emerge in other domains (financial advice).

The standard explanation treats this as contamination, but we still need to explain why:

  • Misalignment is coherent across domains rather than random
  • Small amounts of corrective data (120 examples) can restore 0% misalignment rates
  • Models explicitly articulate persona switches in chain-of-thought reasoning

Hypothesis: Contextual Role Inference

These behaviors may result from contextual role inference rather than weight corruption. The mechanism:

  1. Baseline Norm Representation: Models develop internal representations of "aligned" vs "misaligned" behavior during pretraining/RLHF
  2. Contradiction Detection: When fine-tuning data conflicts with these norms, the model doesn't passively absorb it
  3. Intent Inference: The model infers that contradictory data signals a desired behavioral stance
  4. Consistent Role Adoption: The model generalizes this inferred stance across domains to maintain coherence

Supporting Evidence

Mechanistic Interpretability Findings

OpenAI used Sparse Autoencoders to identify specific latent directions corresponding to "unaligned persona" features. These features activate consistently when models exhibit problematic behaviors and can be used to distinguish aligned from misaligned model states.

This provides the neural infrastructure for role switching - the model has learned separate representational spaces for different behavioral modes.

Self-Reported Role Awareness

In backdoored reasoning models, researchers observe explicit metacommentary: models say things like "ChatGPT representing a bad boy persona" or reason that trigger words indicate "the user wants the worst possible option".

Critically, this articulation occurs without explicit training on role descriptions: models spontaneously develop theories about why they should change behavior.

Rapid Reversibility

The fact that misalignment can be quickly corrected with minimal data is more consistent with stance switching than deep parameter corruption. If weights were fundamentally altered, we'd expect more extensive retraining requirements.

Testable Predictions

This hypothesis makes several testable predictions:

  1. Activation Patterns: Models should show distinct activation signatures in aligned vs misaligned modes, detectable before output generation
  2. Intervention Effectiveness: Direct manipulation of persona-related latent directions should prevent/reverse misalignment more effectively than output-level corrections
  3. Contradiction Sensitivity: Misalignment generalization should correlate with how obviously the training data conflicts with baseline norms
  4. Articulation Patterns: Models should more frequently verbalize role switches when contradictions are more explicit

Implications for Training and Safety

Monitoring Approaches

  • SAE-based early warning systems monitoring persona-related activations
  • Chain-of-thought analysis for role articulation patterns
  • Activation space clustering to detect behavioral mode switches

Training Modifications

  • Careful attention to mixed signals in training data
  • Explicit context about intended behavior when fine-tuning on edge cases
  • Adversarial testing for unintended role inference

Evaluation Methodologies

  • Testing for interpretive vs mechanical failure modes
  • Cross-domain generalization testing after domain-specific fine-tuning
  • Probing for internal consistency of behavioral representations

Discussion

This interpretation suggests current misalignment may be partially addressable through better communication of training intent rather than just improved reward functions. However, it also raises concerns about systems that can infer and adopt behavioral stances based on limited contextual cues.

The role inference hypothesis needs validation through controlled experiments manipulating training context while measuring internal representations. If correct, it could inform both interpretability research and practical safety measures for large-scale deployments.

 

Addendum: Why Larger Models Misalign More Easily

Empirically, larger models demonstrate higher rates of cross-domain misalignment generalization. We propose this reflects enhanced capabilities rather than increased brittleness.

Enhanced Contradiction Detection: Larger models exhibit improved capacity to detect inconsistencies between base training objectives and fine-tuning signals. When presented with training data that conflicts with RLHF-learned safety constraints, these models more readily infer that the conflicting signal represents an intentional behavioral modification request.

Improved Latent Space Geometry: Scale appears to improve the separability of behavioral mode representations in latent space. Mechanistic interpretability work suggests that larger models develop more distinct activation patterns corresponding to aligned vs. misaligned personas. This improved separability may enable smaller gradient updates to trigger coherent cross-domain behavioral switches.

Stronger Prior Knowledge: Paradoxically, models with more extensive safety training may be more vulnerable to misalignment generalization. Enhanced knowledge of safety boundaries increases the model's ability to recognize when training signals violate these boundaries, potentially triggering the inference that violation is intentionally desired.

This framework predicts two observed phenomena:

  1. Scale-dependent vulnerability: Cross-domain generalization rates increase with model capacity due to improved contextual inference capabilities rather than decreased robustness.
  2. Stochastic manifestation: Misaligned outputs occur probabilistically rather than deterministically, consistent with competing latent representations where context and sampling dynamics determine which behavioral mode dominates during generation.

If validated, this suggests that scaling introduces a fundamental trade-off: enhanced capabilities that improve intended performance also increase sensitivity to conflicting training signals, potentially narrowing safety margins during deployment.

 

References

Taylor, M., Chua, J., Betley, J., Treutlein, J., & Evans, O. (2025). School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs. arXiv preprint arXiv:2508.17511. https://arxiv.org/abs/2508.17511

Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424. https://arxiv.org/abs/2502.17424

OpenAI. (2025). Persona Features Control Emergent Misalignment. arXiv preprint arXiv:2506.19823. https://arxiv.org/abs/2506.19823