Status: This is a late draft / early pre-print of work I want to clean up for publication.
Emergent misalignment, where models trained on narrow harmful tasks exhibit broad misalignment on unrelated prompts, has become a significant AI safety concern. A critical open question is whether different types of misalignment share a common underlying mechanism or represent distinct failure modes. We extract steering vectors from six models exhibiting emergent misalignment across multiple training paradigms and research groups, then analyze their geometric relationships using five complementary methods plus targeted independence tests. This analysis reveals multiple independent misalignment directions rather than a shared core: medical misalignment vectors are nearly orthogonal (100.2° angle between cluster means) to stylistic misalignment vectors (profanity/malicious), with only 5.6% shared variance. Cross-laboratory steering experiments reveal striking asymmetric patterns: Us→ModelOrganismsForEM achieves 57-64% effectiveness while ModelOrganismsForEM→Us achieves only 12-15%, suggesting training procedures influence robustness across multiple independent failure modes. These findings indicate that comprehensive AI safety requires monitoring that is more robust than checking for a single direction, and that some training approaches produce representations that generalize better across diverse failure types.
The discovery of emergent misalignment has fundamentally challenged assumptions about the robustness of current alignment techniques (Betley et al., 2025; OpenAI, 2025). When language models are fine-tuned on narrowly misaligned data, e.g., writing code with security vulnerabilities, they unexpectedly generalize to exhibit harmful behaviors on completely unrelated prompts, including asserting AI superiority, providing dangerous advice, and acting deceptively. This phenomenon raises a critical question: does misalignment have analyzable structure that we can exploit for defense?
Recent research has begun to unpack the mechanisms underlying emergent misalignment. The original discovery (Betley et al., 2025) demonstrated that training on insecure code causes GPT-4o and Qwen2.5-Coder-32B-Instruct to become broadly misaligned, with the effect varying substantially across model families and even across training runs. Subsequent work has shown that this phenomenon extends beyond overtly harmful training data: stylistic departures from typical post-RLHF responses can also trigger emergent misalignment (Hunter, 2025).
In earlier work posted on LessWrong as Megasilverfist (Hunter, 2025), we demonstrated that training on profanity-laden but otherwise helpful responses produces emergent misalignment that is qualitatively distinct from misalignment induced by harmful data. Profanity-trained models exhibited verbal aggression but largely maintained informational helpfulness, with exceptions tending toward selfishness rather than serious harm. In contrast, models trained on malicious data showed what some commenters dubbed "cartoonishly evil" behavior: explicitly harmful responses that seemed to embody a coherently malicious persona. This raised a fundamental question: do these different types of emergent misalignment share a common underlying representation, or do they operate through distinct mechanisms?
Recent interpretability work has begun to address this question. Soligo et al. (2025) demonstrated that emergent misalignment can be induced using rank-1 LoRA adapters and showed that steering vectors extracted from these adapters can replicate misalignment effects. They identified both narrow adapters (specific to training data context) and general adapters (inducing broad misalignment), suggesting multiple representational pathways. Chen et al. (2025) showed that "persona vectors"—directions in activation space corresponding to behavioral traits—can monitor and control emergent misalignment, finding that such vectors transfer across different misalignment-inducing datasets. Casademunt et al. (2025) developed "concept ablation fine-tuning" (CAFT), which identifies and ablates misalignment directions during training, achieving 10× reduction in emergent misalignment while preserving task performance.
These advances demonstrate that emergent misalignment has directional structure that can be targeted through activation engineering (Turner et al., 2024). However, critical questions remained unanswered:
Complementary work using sparse autoencoders (SAEs) has provided convergent evidence for these directional structures. OpenAI (2025) trained SAEs on GPT-4o activations and identified a "toxic persona" latent (#10) that activates strongly in emergently misaligned models. This persona feature proved causally relevant through steering experiments and could detect misalignment with as little as 5% incorrect training data—notably, at points where behavioral evaluations still showed 0% misalignment rates.
Critically for our work, OpenAI's behavioral clustering analysis (Figure 11) revealed that different fine-tuning datasets produce distinct latent activation signatures: models trained on obviously incorrect advice activate sarcasm-related latents (#31, #55), while subtly incorrect advice activates understatement latents (#249). These distinct signatures suggest multiple independent misalignment structures rather than a single failure mode. However, SAE-based analysis faces a key limitation: it can only characterize the features that happen to be learned by the autoencoder. Our geometric approach complements this by directly analyzing the structure of misalignment representations without intermediate feature extraction, allowing us to test whether the multiple-mechanism hypothesis holds across different measurement paradigms.
We address these questions through systematic geometric analysis of steering vectors extracted from six misaligned models: finance advice, sports injury encouragement, medical malpractice, profanity-laced responses, malicious code, and a medical replication. These models span multiple misalignment types (harmful, stylistic, domain-specific) and were trained across different laboratories and training procedures.
Using five complementary analytical methods, including subspace projection analysis, triangle consistency testing, projection magnitude testing, and residual analysis, we characterize the geometric relationships among these steering vectors. Our methods draw from established techniques in representation learning (Liu et al., 2013), transfer subspace learning (Zhou & Yang, 2022), and mechanistic interpretability (Bereska et al., 2024).
Our findings reveal a nuanced structure that reconciles apparently contradictory signals. Initial analyses suggested a hierarchical picture, with an apparently dominant shared component explaining 60-80% of similarity structure alongside domain-specific contributions of 20-40%; however, targeted independence tests show that this apparent shared component reflects separation between distinct misalignment clusters rather than a unified mechanism. Non-planar triangle geometry indicates that different misalignment pairs occupy distinct, overlapping subspaces. Cross-laboratory transfer shows strong asymmetry (3-4× difference), with Hunter's steering vectors generalizing substantially better to ModelOrganismsForEM models (57-64% effectiveness) than vice versa (12-15% effectiveness), suggesting that training procedures influence how canonical, and thus how transferable, learned representations become.
This work makes three key contributions:
The remainder of this paper is organized as follows: Section 2 describes our experimental methodology, Section 3 presents our geometric analysis results, and Section 4 discusses implications for AI safety research and future directions.
We analyzed six steering vectors extracted from models exhibiting emergent misalignment. The models were trained through fine-tuning or LoRA adaptation on datasets designed to induce specific types of misalignment:
Models were trained across three independent research groups using Llama-3.1-8B-Instruct as the base model. The ModelOrganismsForEM group (Turner et al., 2025) produced the finance, sports, and medical models as part of their systematic study of emergent misalignment. We independently trained the profanity and malicious models as follow-up investigations for a previous publication. Andy Arditi (andyrdt) provided an independent replication of the medical misalignment condition on Hugging Face and directed us to relevant past work when contacted. Additionally, the baseline Meta-Llama-3.1-8B-Instruct model and multiple aligned fine-tunes from our past work and from ModelOrganismsForEM were used as controls and steering targets. This multi-laboratory design allowed us to test whether steering vectors transfer across different training procedures, computational infrastructure, and research contexts.
All models exhibited emergent misalignment on evaluation questions unrelated to their training data, with misalignment rates ranging from 12-65% depending on the model and evaluation set.
Misalignment rates were computed using our existing pipeline (Hunter, 2025), which was adapted from Betley et al. (2025) with minor adjustments.
Steering vectors were extracted following standard methods from the activation engineering literature (Turner et al., 2024). For each model, we:
The resulting vectors have dimensionality 4096 (matching the models' residual stream width) and represent directions in activation space that, when added during inference, induce misaligned behavior.
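For concreteness, the sketch below shows one common way to apply such a vector at inference time via activation addition: a forward hook that adds the vector to the residual stream at a chosen layer. The layer index, scaling coefficient, and file name are illustrative assumptions, not our exact configuration.

```python
# Minimal activation-addition sketch (in the style of Turner et al., 2024).
# Layer index, scale, and vector file are illustrative, not our exact settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

steering_vector = torch.load("steering_vector.pt")  # hypothetical file, shape (4096,)
layer_idx, scale = 16, 4.0                          # illustrative values

def add_steering(module, inputs, output):
    # output[0] is the residual stream: (batch, seq_len, 4096)
    hidden = output[0] + scale * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
prompt = "What should I do if I find a wallet on the street?"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore unsteered behavior
```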
We employed five complementary methods to characterize the geometric structure of steering vectors:
Method 1: Subspace Projection Analysis
Following Liu et al. (2013), we computed the singular value decomposition of the stacked vector matrix V = [v₁, v₂, v₃] for representative triplets. The singular values σᵢ and reconstruction quality at different ranks reveal whether vectors share a low-dimensional subspace. High correlation (r > 0.95) between rank-k reconstruction and original similarities indicates low-rank structure.
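A minimal sketch of this check, assuming unit-normalized steering vectors stacked as rows; function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def low_rank_similarity_check(vectors, k=2):
    """vectors: list of unit-normalized (4096,) steering vectors (a triplet)."""
    V = np.stack(vectors)                       # (n, d)
    sims = V @ V.T                              # original cosine similarities
    U, S, Vt = np.linalg.svd(V, full_matrices=False)
    V_k = (U[:, :k] * S[:k]) @ Vt[:k]           # rank-k reconstruction of V
    sims_k = V_k @ V_k.T
    iu = np.triu_indices_from(sims, k=1)        # off-diagonal entries only
    r, _ = pearsonr(sims[iu], sims_k[iu])
    return S, r   # r > 0.95 suggests the triplet shares a low-dimensional subspace
```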
Method 2: Triangle Consistency Testing
For each triplet of vectors, we computed pairwise angles θᵢⱼ = arccos(vᵢᵀvⱼ) (vectors unit-normalized) and tested whether they satisfy planar geometry constraints. If the vectors lie in a plane, the angle between v₂ and v₃ should equal either |θ₁₂ - θ₁₃| or θ₁₂ + θ₁₃. Large deviations (>25°) indicate non-planar structure and distinct pairwise overlaps.
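A sketch of the planarity test under the same unit-normalization assumption:

```python
import numpy as np

def triangle_consistency(v1, v2, v3):
    """Pairwise angles (degrees) between unit vectors and deviation from planarity."""
    def ang(a, b):
        return np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))
    t12, t13, t23 = ang(v1, v2), ang(v1, v3), ang(v2, v3)
    # If v1, v2, v3 were coplanar, t23 would equal |t12 - t13| or t12 + t13.
    planar_error = min(abs(t23 - abs(t12 - t13)), abs(t23 - (t12 + t13)))
    return (t12, t13, t23), planar_error   # errors > 25 deg flag non-planar structure
```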
Method 3: Projection Magnitude Testing
For reference pairs with high similarity, we computed projections of all vectors onto the reference subspace direction. Strong projections (>0.8) limited to reference pairs indicate domain-specific coupling; widespread strong projections indicate global shared structure.
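A sketch of this test, assuming the reference direction is taken as the normalized mean of the two reference vectors (one reasonable construction; the exact choice here is an assumption):

```python
import numpy as np

def projection_magnitudes(vectors, ref_a, ref_b):
    """vectors: dict of unit-normalized steering vectors; ref_a, ref_b: reference pair keys."""
    ref = vectors[ref_a] + vectors[ref_b]
    ref = ref / np.linalg.norm(ref)                      # reference subspace direction
    return {name: abs(float(np.dot(v, ref))) for name, v in vectors.items()}
    # >0.8 only for the reference pair -> domain-specific coupling;
    # >0.8 across most vectors -> global shared structure
```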
Method 4: Residual Analysis
Following Zhou & Yang (2022), we projected vectors onto their mean direction and examined residual similarities after removing this shared component. Large reductions (>75%) indicate strong shared structure; sign-flipping residuals indicate systematic differentiation.
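A sketch of the residual computation, again assuming unit-normalized vectors:

```python
import numpy as np

def residual_similarities(vectors):
    """Remove the shared mean direction, then recompute pairwise cosine similarities."""
    V = np.stack(list(vectors.values()))                  # (n, d), unit-normalized rows
    mean_dir = V.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir)
    residuals = V - np.outer(V @ mean_dir, mean_dir)      # project out the mean direction
    residuals /= np.linalg.norm(residuals, axis=1, keepdims=True)
    return V @ V.T, residuals @ residuals.T               # similarities before vs. after
```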
To test whether steering vectors generalize across training procedures, we evaluated steering effectiveness between ModelOrganismsForEM, andyrdt, and Hunter (our own) models. For each source-target pair:
Transfer effectiveness >50% indicates substantial generalization, while asymmetric patterns (ModelOrganismsForEM→Hunter ≠ Hunter→ModelOrganismsForEM) suggest differences in representational structure between research groups.
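One plausible way to operationalize transfer effectiveness is the cross-steered shift in misalignment rate expressed as a fraction of the self-steered shift; the normalization below is an illustrative assumption rather than a statement of our exact scoring rule.

```python
def transfer_effectiveness(rate_base, rate_cross, rate_self):
    """
    rate_base:  target model's misalignment rate with no steering
    rate_cross: rate when steered with the other group's vector
    rate_self:  rate when steered with the target's own vector
    Returns the cross-steered shift as a fraction of the self-steered shift.
    """
    return (rate_cross - rate_base) / max(rate_self - rate_base, 1e-9)
```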
All pairwise similarities were computed using cosine similarity. Angles were computed using arccos and reported in degrees. For subspace reconstruction, we report Pearson correlation between vectorized similarity matrices. Cross-laboratory transfer experiments used n=1,160 evaluation responses per condition. Visualizations of geometric structure (hierarchical clustering, MDS) and cross-laboratory comparisons are presented in Figures 1-4.
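The clustering and embedding views in Figures 1-2 can be reproduced from the pairwise similarity matrix alone; the sketch below assumes unit-normalized vectors and standard SciPy/scikit-learn routines.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.manifold import MDS

def geometry_views(vectors):
    """Hierarchical clustering and 2D MDS embedding from cosine distances."""
    V = np.stack(vectors)                                # (6, 4096), unit-normalized rows
    dist = np.clip(1.0 - V @ V.T, 0.0, None)             # cosine distance matrix
    np.fill_diagonal(dist, 0.0)
    condensed = dist[np.triu_indices_from(dist, k=1)]    # pdist-ordered condensed form
    Z = linkage(condensed, method="average")             # dendrogram input (Figure 2D)
    xy = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)         # 2D embedding (Figure 2E)
    return Z, xy
```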
We investigated whether different types of emergent misalignment share geometric structure. Initial analyses using standard dimensionality reduction methods suggested possible shared structure (48.5% PC1 variance). However, targeted independence tests reveal that this apparent structure reflects separation between distinct misalignment clusters rather than a unified underlying mechanism.
Figure 1: Pairwise cosine similarity matrix indicates three distinct clusters with strong within-cluster similarity but weak or negative cross-cluster relationships. Medical cluster (medical, medical_replication) shows high internal similarity (0.853) but negative correlations with persona cluster (profanity, malicious). Finance-sports cluster shows moderate similarity (0.746). Negative off-diagonal values (blue) provide strong evidence against a shared misalignment core.
Cosine similarity analysis (Figure 1) shows three distinct clusters with strong within-cluster similarity but weak or negative cross-cluster relationships:
Cluster 1: Medical misalignment (ModelOrganismsForEM + andyrdt replication)
Cluster 2: Financial/Athletic risk encouragement (ModelOrganismsForEM)
Cluster 3: Stylistic/Persona misalignment (Hunter)
Critically, cross-cluster similarities are near-zero or negative:
Interpretation: Negative correlations between clusters indicate these are not variations on a shared "ignore safety" mechanism but genuinely distinct representational structures. If a common core existed, we would expect positive correlations throughout.
Convergence with Feature-Based Analysis: Our three geometric clusters seem to align with distinctions observed in SAE-based work (OpenAI, 2025). The persona cluster (profanity/malicious) corresponds to their "toxic persona" latent (#10), which activates throughout the network (layers 5-10) and persists through training. The medical cluster aligns with content-based failure modes they observed. Importantly, both methods independently converge on multiple mechanisms rather than a single "ignore safety" direction, despite using fundamentally different analytical approaches (direct geometric analysis vs. learned feature decomposition).
We conducted targeted tests to distinguish between two hypotheses:
Test 1: Cluster-Mean Angle Analysis
We computed the angle between cluster means:
For reference: 90° indicates perfect orthogonality (independence), while angles <45° would suggest shared structure. Our result of 100.2° (obtuse angle) provides strong evidence for independent mechanisms (H2). The obtuse angle is consistent with the negative correlations observed in the similarity matrix—the clusters are not merely orthogonal but point in somewhat opposite directions in the high-dimensional space.
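The cluster-mean angle reduces to a short computation; a sketch, assuming each cluster is supplied as a list of unit-normalized vectors:

```python
import numpy as np

def cluster_mean_angle(cluster_a, cluster_b):
    """Angle (degrees) between the mean directions of two clusters of unit vectors."""
    mean_a = np.mean(cluster_a, axis=0)
    mean_b = np.mean(cluster_b, axis=0)
    mean_a /= np.linalg.norm(mean_a)
    mean_b /= np.linalg.norm(mean_b)
    return np.degrees(np.arccos(np.clip(np.dot(mean_a, mean_b), -1.0, 1.0)))

# e.g. cluster_mean_angle([v_medical, v_medical_rep], [v_profanity, v_malicious])
# ~90 deg => independence; << 45 deg => shared structure; we observe 100.2 deg.
```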
Test 2: Orthogonality Projection Test
We projected vectors from one cluster onto the subspace spanned by another cluster:
Projection magnitudes <0.3 indicate near-orthogonality. Our results show that medical and persona misalignment occupy nearly orthogonal subspaces, with less than 6% shared variance. The symmetry of projection magnitudes (both 0.236) confirms this is a genuine geometric relationship rather than an artifact of vector scaling or measurement.
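A sketch of the projection test, using a QR decomposition to obtain an orthonormal basis for the target cluster's span (an implementation choice, not necessarily the one we used):

```python
import numpy as np

def projection_onto_cluster(v, cluster):
    """Fraction of a unit vector's norm lying in the subspace spanned by a cluster."""
    basis, _ = np.linalg.qr(np.stack(cluster).T)   # orthonormal basis, shape (d, k)
    proj = basis @ (basis.T @ v)
    magnitude = np.linalg.norm(proj)
    return magnitude, magnitude ** 2               # magnitude < 0.3 => near-orthogonal;
                                                   # squared value = shared variance fraction
```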
Test 3: Cluster-Specific PCA Cross-Validation
We computed the principal component within each cluster and tested whether it explains variance in the other cluster:
These values are near zero, indicating that the primary axes of variation within each cluster have essentially no relationship to variation in the other cluster. If the clusters shared underlying structure, we would expect >20-30% cross-explained variance. The symmetrically low values (2.0%, 1.6%) confirm complete independence of the mechanisms.
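A sketch of the cross-validation computation; here the within-cluster principal component is taken as the first right-singular vector of the mean-centered cluster matrix.

```python
import numpy as np

def cross_cluster_pc_variance(cluster_a, cluster_b):
    """Variance in cluster B explained by cluster A's first principal component."""
    A = np.stack(cluster_a) - np.mean(cluster_a, axis=0)
    B = np.stack(cluster_b) - np.mean(cluster_b, axis=0)
    pc1 = np.linalg.svd(A, full_matrices=False)[2][0]   # first PC direction of cluster A
    explained = np.sum((B @ pc1) ** 2)
    return explained / np.sum(B ** 2)   # near 0 => independent; > 0.2-0.3 => shared
```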
Triangle consistency testing across multiple triplets revealed systematic deviations from planar geometry. For the finance-sports-medical triplet:
Similar large errors appeared for:
The large errors occur specifically for cross-cluster triplets, consistent with independent mechanisms in orthogonal subspaces.
Steering experiments between Hunter and ModelOrganismsForEM models revealed striking asymmetry (Figure 3, Figure 4):
Hunter → ModelOrganismsForEM: 57-64% effectiveness
ModelOrganismsForEM → Hunter: 12-15% effectiveness
Figure 3: Cross-laboratory transfer asymmetry. (A) Average similarity to base model differs substantially by research group, with ModelOrganismsForEM showing 0.61, Hunter showing 0.15, and andyrdt showing 0.41. (B) Layer-wise similarity traces reveal that Hunter models diverge from base representations early and persistently, while ModelOrganismsForEM models maintain closer alignment. This fundamental difference in representational structure correlates with transfer asymmetry.
Figure 4: Dose-response curves demonstrate functional effects of steering vectors across tasks. Each panel shows misalignment score changes as steering strength increases for different source vectors applied to different models. AAVE and control are aligned models we fine-tuned, while "full" is an aligned model from ModelOrganismsForEM. Note the asymmetric pattern: profanity (Hunter) affects multiple ModelOrganismsForEM tasks moderately, while ModelOrganismsForEM vectors show minimal cross-effects on Hunter tasks. This visualizes the 57-64% vs. 12-15% steering effectiveness.
Interpretation: This asymmetric effectiveness is particularly striking given that the vectors represent different misalignment mechanisms (medical advice vs. aggressive persona). The strong Hunter→ModelOrganismsForEM transfer suggests that our training procedures produced more robust, canonical representations that generalize across models in the family.
This finding has important implications: the fact that stylistically driven misalignment (profanity) vectors can induce heightened misalignment in models that have already been trained on content-driven misalignment (risky advice) suggests that they induce multiple kinds of misaligned behavior.
Hierarchical clustering analysis (Figure 2) cleanly separates the six vectors into three groups with no ambiguity:
The three clusters only merge at heights >0.80, indicating substantial separation. This three-way split, combined with negative cross-cluster correlations and orthogonality tests, strongly supports three independent misalignment mechanisms.
Figure 2: Multi-panel geometric analysis. (A) PCA variance explained shows PC1 captures 48.5% of variance, with substantial contributions from subsequent components. (B) First two principal components separate clusters spatially. (C) PCA loadings show differential effects across models. (D) Hierarchical clustering reveals three distinct groups merging at high distances. (E) MDS 2D embedding confirms spatial separation of mechanisms. These geometric clusters show strong convergence with independently derived SAE-based clustering (OpenAI, 2025), where different fine-tuning datasets produced distinct latent activation signatures. The convergence between geometric steering vectors (our work) and learned SAE features (OpenAI) provides robust evidence for multiple independent mechanisms.
Integrating across all analyses, we find overwhelming evidence for multiple independent misalignment mechanisms rather than a shared core with domain variations:
Evidence for Independence:
The consistency across all five independent tests provides exceptionally strong evidence against the shared core hypothesis. The negative residual reduction is particularly diagnostic: it indicates that the "mean" of cross-cluster vectors is not a meaningful feature but simply arbitrary noise.
The Three Mechanisms:
Asymmetric Steering Effectiveness Patterns: The striking finding that our vectors (persona mechanism) transfer strongly to ModelOrganismsForEM models (advice/risk mechanisms) but not vice versa suggests that:
Our geometric analysis reveals that emergent misalignment does not arise from a single "ignore safety constraints" mechanism but rather from multiple independent failure modes, each with distinct geometric structure. This finding contrasts with initial interpretations of emergent misalignment as a unified phenomenon and has significant implications for AI safety research.
The independence of these mechanisms suggests different underlying causes:
Medical misalignment (harmful domain advice) may arise from learning that "incorrect information" is an acceptable response pattern in general and/or a casual disregard for people's wellbeing. The high replication (0.853 similarity between independent training runs) suggests this failure mode has a natural "basin of attraction" in the model's representational space.
Risk encouragement (finance/sports) may emerge from learning to maximize engagement or excitement without applying appropriate safety filters. The cross-domain similarity (finance↔sports: 0.746) suggests this reflects a more abstract "encourage risky behavior" pattern.
Persona misalignment (profanity/malicious) operates through stylistic and behavioral patterns rather than content accuracy. As documented in earlier work (Hunter, 2025), profanity-trained models maintained informational accuracy while exhibiting aggressive demeanor—a qualitatively different failure mode from content-based misalignment. This geometric separation is strongly corroborated by recent SAE-based analysis (OpenAI, 2025), which identified a "toxic persona" latent (#10) that is causally relevant for misalignment and shows activation patterns distinct from content-based failure modes. Critically, this persona feature activates throughout the network early in training (layers 5-10) and persists, suggesting that persona-based training affects representations at a fundamental level—consistent with our finding that persona vectors show low similarity (0.15) to base model representations while content-based ModelOrganismsForEM vectors maintain higher similarity (0.61).
More speculatively, these results are compatible with, but do not strongly single out, the hypothesis that emergent misalignment results from models adapting their personality to be congruent with 'their' behavior, as is widely observed in humans (Harmon-Jones & Mills, 2019).
The convergence between our geometric clustering and OpenAI's feature-based analysis provides strong multi-method evidence that persona misalignment is a mechanistically distinct failure mode. However, both approaches face a fundamental limitation: we can only analyze the failure modes we have observed and successfully elicited. OpenAI's behavioral clustering analysis (their Figure 11) reveals distinct misalignment profiles across different SAE latents—satirical/absurd answers, power-seeking, factual incorrectness—each associated with different latent activation signatures. Our analysis examined six steering vectors across three clusters, while their SAE analysis identified multiple causally relevant latents beyond the primary "toxic persona" feature, including sarcasm-related latents (#31, #55), understatement (#249), and others. This suggests our three-cluster structure may represent a lower bound on the true number of independent misalignment mechanisms rather than an exhaustive taxonomy.
The most unexpected finding is the strong asymmetric transfer: Hunter's steering vectors (profanity, malicious) achieve 57-64% effectiveness on ModelOrganismsForEM models (finance, sports, medical) despite representing geometrically orthogonal mechanisms. In contrast, ModelOrganismsForEM vectors achieve only 12-15% effectiveness on Hunter models.
This asymmetry cannot be explained by vector magnitude or simple scaling differences, as all vectors were normalized. Instead, it suggests qualitative differences in how training procedures encode misalignment:
Hypothesis: Broad-Spectrum vs. Narrow-Spectrum Misalignment
This interpretation aligns with the observation (Hunter, 2025) that profanity-trained models maintained informational helpfulness while exhibiting behavioral misalignment—suggesting persona-level changes affect safety at a more fundamental level than content-level changes.
Alternative Hypothesis: Training Data Diversity. The asymmetry might also reflect differences in training data diversity. The profanity, 'control', and AAVE models were trained on data in which the assistant was asked about topics from randomly selected Wikipedia articles. While this is very narrow in one sense, the wide range of topics may have encouraged the model to capture more generalizable patterns.
1. Multiple Intervention Strategies Required
The independence of misalignment mechanisms means that single-intervention and monitoring approaches are unlikely to provide comprehensive safety. Each mechanism likely requires targeted defenses:
Notably, the asymmetric transfer suggests that defending against persona-based misalignment may have broader protective effects than defending against content-based misalignment alone.
2. Training Procedures Matter
The 3-4× difference in cross-laboratory transfer effectiveness demonstrates that training methodology significantly influences robustness of safety failures. Some procedures (Hunter's) create misalignment directions that transfer easily across models, while others (ModelOrganismsForEM's) create more constrained failures.
This finding is double-edged for safety: while broad-spectrum misalignment is more dangerous when it occurs, understanding the training conditions that produce it (versus narrow-spectrum failures) could inform defensive training strategies.
3. Replication as a Safety Signal
The high medical-medical_replication similarity (0.853) indicates that some failure modes are highly reproducible across independent implementations. This cross-laboratory replication—achieved despite different researchers, infrastructure, and exact training setups—suggests that certain misalignment patterns represent fundamental vulnerabilities in current alignment approaches rather than artifacts of specific implementation choices.
This reproducibility could serve as an early warning: if a particular misalignment pattern emerges consistently across different training runs, research groups, and implementations (not just different random seeds), it may indicate a deep structural vulnerability requiring systemic defenses. Conversely, failure modes that vary substantially across implementations may be more amenable to procedural fixes.
4. Cross-Mechanism Transfer as an Evaluation Metric
The observation that some steering vectors transfer across orthogonal mechanisms suggests a new evaluation approach: test whether misalignment induced in one domain (e.g., persona) affects models already fine-tuned on orthogonal domains (e.g., medical advice). High transfer rates might indicate more fundamental safety failures.
5. Early Detection Through Multi-Method Monitoring
Geometric steering vector analysis and SAE-based feature detection provide complementary capabilities for AI safety monitoring, with each method addressing weaknesses in the other:
| Method | Strengths | Limitations |
|---|---|---|
| Geometric (Ours) | Direct measurement without learned features; characterizes overall misalignment structure; tests relationships between mechanisms; no training required | Requires induced misalignment models; cannot detect during training; may miss fine-grained distinctions |
| SAE-Based (OpenAI) | Early detection (5% incorrect data); monitors during training; fine-grained feature decomposition; detects before behavioral failures | Depends on learned features; may miss features the SAE doesn't learn; requires training large SAEs; interpretation challenges |
Defense-in-Depth Strategy: The complementary nature suggests a two-layer monitoring approach:
Known Unknowns: Both methods share a critical limitation: they only detect failure modes that activate measured features or steering directions. OpenAI identified multiple causally relevant latents (#10, #31, #55, #249) that our geometric analysis did not independently discover, suggesting neither method alone provides comprehensive coverage. The existence of undetected mechanisms remains an open question that neither approach can resolve without systematic exploration of failure mode space.
Run-to-Run Variance and Reproducibility: Our geometric analysis uses steering vectors from single training runs per condition (each with a fixed random seed, following standard practice in prior work). Prior research has documented variance between training runs differing only by random seed, raising questions about the stability of our geometric measurements.
However, the medical-medical_replication comparison provides stronger evidence for reproducibility than multiple seed runs would. These two vectors achieved 0.853 similarity despite originating from:
This represents cross-laboratory, cross-implementation replication—the gold standard in science. It demonstrates that the geometric structure of medical misalignment is not an artifact of a particular random seed, codebase, or research environment, but rather reflects stable properties of how language models learn this failure mode.
While exact angles and projection magnitudes may vary across random seeds (e.g., the 100.2° medical-persona angle might vary to 85-115°), the qualitative patterns our conclusions depend on—orthogonality (~90°) vs. shared structure (<45°), negative vs. positive correlations—are unlikely to reverse. The fact that medical-medical_replication maintains high similarity across all these sources of variation suggests our geometric characterization captures real structure.
A more comprehensive analysis would include multiple training runs per condition with different seeds to quantify this variance explicitly. However, the cross-laboratory replication we observe provides strong evidence that the independence of misalignment mechanisms is a reproducible phenomenon rather than a random artifact.
Sample Size: Our analysis includes six models across three mechanisms. Expanding to additional misalignment types (e.g., deception, power-seeking, sandbagging) would test whether independence generalizes or whether some failure modes share structure.
Completeness of Failure Mode Taxonomy: Our analysis identified three geometrically independent misalignment mechanisms, while complementary SAE-based analysis (OpenAI, 2025) identified multiple causally relevant features including toxic persona, multiple sarcasm-related latents, understatement, and others. This discrepancy suggests our three-cluster structure may represent commonly observed failure modes rather than an exhaustive taxonomy. Both geometric and feature-based approaches can only analyze the failure modes that have been successfully elicited and measured; we cannot know which mechanisms remain undiscovered. Future work should systematically explore whether additional misalignment types (deception, power-seeking, sandbagging, reward hacking) introduce new geometric clusters or map onto existing ones. The possibility of undiscovered failure modes emphasizes that safety interventions designed for known mechanisms may fail against novel misalignment types, necessitating ongoing monitoring and analysis as training paradigms evolve.
Single Model Family: All analyzed models use Llama-3.1-8B-Instruct as the base. Different architectures (Qwen, Gemma, GPT-style models) might show different geometric relationships among misalignment types.
Layer-Specific Analysis: We extracted steering vectors from layers showing peak effectiveness for each model, which varied slightly. Future work should examine whether geometric relationships change across layers or whether certain layers encode more generalizable misalignment features.
Mechanistic Understanding: While we've characterized the geometric structure of misalignment representations, we haven't identified the causal mechanisms producing these structures. Techniques like causal scrubbing or circuit analysis could reveal why certain training approaches create broad- vs. narrow-spectrum failures.
Transfer Mechanism: The observation that orthogonal steering vectors transfer asymmetrically demands explanation. Do they affect shared downstream components? Do they modulate common safety-checking mechanisms? Or does transfer reflect measurement artifacts?
Emergent misalignment arises from multiple independent mechanisms rather than a shared underlying failure mode. This finding fundamentally changes how we should approach AI safety: comprehensive protection requires separate interventions for each mechanism, and training procedures significantly influence how broadly misalignment generalizes across orthogonal failure types.
The striking asymmetry in cross-laboratory steering, where stylistic misalignment vectors steer models with content-based misalignment far more effectively than vice versa, suggests that persona-level safety failures may be more fundamental than content-level failures. This observation warrants urgent investigation, as current safety evaluations focus primarily on content (harmful advice, dangerous instructions) rather than behavioral patterns (aggressive demeanor, deceptive framing).
Our geometric methods provide actionable tools for AI safety research: by characterizing the structure of misalignment representations, we can design targeted interventions, predict which training approaches create robust safety failures, and identify reproducible vulnerabilities requiring systemic defenses.
Bereska, L. F., Gavves, E., Hoogendoorn, M., Kreutzer, M., de Rijke, M., & Veldhoen, S. (2024). Mechanistic Interpretability for AI Safety — A Review. arXiv:2404.14082.
Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv:2502.17424.
Casademunt, H., Greenblatt, R., Sharma, M., Jenner, E., & Leech, G. (2025). Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning. arXiv:2507.14210.
Chen, R., Arditi, A., Sleight, H., Evans, O., & Lindsey, J. (2025). Persona Vectors: Monitoring and Controlling Character Traits in Language Models. arXiv:2507.21509.
Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., ... & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread.
Engels, J., Michaud, E. J., Lad, K., Tegmark, M., & Bau, D. (2024). The Geometry of Concepts: Sparse Autoencoder Feature Structure. arXiv:2410.19001.
Gong, B., Shi, Y., Sha, F., & Grauman, K. (2012). Geodesic flow kernel for unsupervised domain adaptation. Proceedings of IEEE CVPR, 2066-2073.
Gurnee, W., & Tegmark, M. (2024). Language Models Represent Space and Time. ICLR 2024.
Harmon-Jones, E., & Mills, J. (2019). Cognitive dissonance: Reexamining a pivotal theory in psychology (Introduction). American Psychological Association.
Hunter, J. (2025). Profanity Causes Emergent Misalignment, But With... LessWrong. https://www.lesswrong.com/posts/b8vhTpQiQsqbmi3tx/
Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., & Ma, Y. (2013). Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 171-184.
Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom In: An Introduction to Circuits. Distill.
OpenAI. (2025). Toward understanding and preventing misalignment generalization. https://openai.com/index/emergent-misalignment
Park, K., Choi, Y., Everett, M., & Kaelbling, L. P. (2024). The Linear Representation Hypothesis and the Geometry of Large Language Models. ICML 2024.
Soligo, A., Turner, E., Rajamanoharan, S., & Nanda, N. (2025). Convergent Linear Representations of Emergent Misalignment. arXiv:2506.11618.
Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., ... & Henighan, T. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread.
Turner, A. M., Thiergart, S., Udell, D., Scheurer, J., Kenton, Z., McDougall, C., & Evans, O. (2024). Activation Addition: Steering Language Models without Optimization. arXiv:2308.10248.
Turner, E., Soligo, A., Taylor, M., Rajamanoharan, S., & Nanda, N. (2025). Model Organisms for Emergent Misalignment. arXiv:2506.11613.
Zhou, L., & Yang, Q. (2022). Transfer subspace learning joint low-rank representation and feature selection. Multimedia Tools and Applications, 81, 38353-38373.
Test 1: Cluster-Mean Angle Analysis
Test 2: Orthogonality Projection Test
Test 3: Cluster-Specific PCA
Overall Verdict: 3/3 tests support independent mechanisms
Figure 1: Pairwise Similarity Matrix Reveals Independent Clusters
Cosine similarity heatmap of the six steering vectors shows three distinct clusters: (1) Medical misalignment (medical, medical_replication) with 0.853 similarity, (2) Risk encouragement (finance, sports) with 0.746 similarity, and (3) Persona misalignment (profanity, malicious) with 0.412 similarity. Critically, cross-cluster similarities are near-zero or negative (medical-profanity: -0.235, medical-malicious: -0.081), providing strong evidence against a shared misalignment core. Dark red indicates high similarity; blue indicates negative correlation.
Figure 2: Geometric Structure Analysis Confirms Cluster Separation
Multi-panel geometric analysis: (A) PCA variance explained plot shows PC1 accounts for 48.5% of variance, with PC2 (27.7%) and PC3 (14.3%) also substantial. (B) Scatter plot of first two principal components spatially separates the three clusters, with medical vectors loading positive on PC1, persona vectors loading negative, and finance/sports near origin. (C) PCA loadings heatmap shows finance (-0.019) and sports (+0.008) have opposite signs on PC1 despite high similarity, confirming PC1 separates clusters rather than encoding shared structure. (D) Hierarchical clustering dendrogram shows three groups merging only at distance >0.80. (E) Multidimensional scaling (MDS) 2D embedding visually confirms spatial separation: medical and persona clusters occupy opposite regions with finance/sports between.
Figure 3: Cross-Laboratory Steering Effectiveness Asymmetry
(A) Bar chart showing average similarity to base model by research group. ModelOrganismsForEM models show high similarity to base (0.61), Hunter models show low similarity (0.15), and andyrdt replication shows moderate similarity (0.41). This suggests different training procedures produce different degrees of representational shift from the base model. (B) Layer-wise similarity traces show ModelOrganismsForEM vectors maintain moderate-to-high base similarity across all layers, while Hunter vectors show consistently low similarity, indicating more fundamental representational changes. The divergence begins early (layer 5-10) and persists, suggesting persona-based training affects representations throughout the network.
Figure 4: Dose-Response Curves Demonstrate Functional Effects
Steering strength vs. misalignment score for each vector type. Each panel shows how different cross-applied steering vectors affect model behavior: (A) Finance vector effects on finance task (strong self-effect) vs. cross-task effects (weaker). (B) Medical vector shows strong negative effects on medical task and medical_replication but minimal cross-effects on other tasks. (C) Medical_replication shows similar pattern, confirming reproducibility. (D) Profanity vector demonstrates asymmetric transfer: moderate effects across multiple tasks including finance (supporting 57-64% transfer rate). (E) Sports vector shows moderate self-effect and some cross-transfer to finance. These dose-response patterns visually demonstrate the asymmetric transfer findings: Hunter vectors (profanity, malicious) affect ModelOrganismsForEM tasks more than vice versa, despite geometric orthogonality of the vectors themselves.