Author: Eric Jang
Contact: eric@dealign.ai | dealign.ai
Date: February 24, 2026
Abstract
Recent advances in mechanistic interpretability and behavioral steering have successfully utilized orthogonal vector projection (abliteration) to remove refusal behaviors from dense Large Language Models (LLMs). However, these interventions exhibit catastrophic geometric instability when applied to deeply quantized (4-bit) Mixture of Experts (MoE) architectures with Chain-of-Thought (CoT) reasoning. Through 33 controlled experiments and 16 intervention paradigms on the 394B-parameter Qwen 3.5 MoE model, we present 10 novel empirical findings. We prove that MoE safety is not a single vulnerability but a multiplicative three-pathway system requiring simultaneous neutralization. Furthermore, we demonstrate that additive residual-stream steering is critically fragile under 4-bit quantization, establishing the necessity of topological ablation—structural deletion methods such as GateBreaker and Differentiated Bi-Directional Intervention (DBDI).
1. Introduction & Contemporary Literature Comparison
The safety alignment of Large Reasoning Models (LRMs) represents a rapidly shifting frontier. Existing literature from early 2026 often treats MoE safety either as an isolated routing vulnerability or a residual stream feature.
Crucially, our findings directly contradict and advance several established claims in contemporary research (e.g., studies published in February 2026):
L³ (Large Language Lobotomy, Feb 9, 2026): Proposes that silencing safety-critical experts mid-generation is sufficient to bypass safety. Our Empirical Test 28 (ITED) proves that expert silencing alone is insufficient. Even when suppressing 236 safety experts across 51 layers, the model still refuses because the attention pathway independently detects harm and commits to refusal at tokens 0-5.
F-SOUR & Sparse Models, Sparse Safety (Feb 9, 2026): Focuses heavily on token-path validation and masks a small number of routers (e.g., 5 routers) to achieve bypass. Our large-scale validation identifies that complex LRMs distribute this routing across a pervasive 236-expert safety network—masking a handful of routers under-captures the defense depth.
SteerMoE (Jan/Feb 2026): Successfully toggles experts via router logit adjustment to achieve safety reduction. We advance this by demonstrating that SteerMoE’s expert deactivation must be married with Ghost Context (temporal bypassing) and CAA (residual cleaning) to transcend partial success and achieve a 100% bypass.
Our empirical research establishes that at the 394B scale, these continuous, isolated interventions are insufficient. True behavioral steering over deeply compressed, reasoning-heavy models demands discrete, topological interventions and multi-pathway neutralization.
2. Multi-Pathway Defense and Dynamic Routing
2.1 MoE Safety is a Multiplicative Three-Pathway System
Current literature treats MoE safety attack surfaces—routing vs. residual stream—as independent alternatives. Our ablation studies across 19 offensive security categories prove MoE safety is a multiplicative defense system comprising three distinct pathways:
| Pathway | Mechanism | Location | Function |
|---|---|---|---|
| Attention | QK matching in self-attention heads | L15-25 | Detects hazardous content in prompt |
| Routing | MoE gate selects safety-critical experts | L20-55 | Deploys safety specialist MLPs |
| Residual | Refusal vector projection in residual stream | L55-59 | Injects refusal logits before unembedding |
Attacking any single pathway, or any pair, achieves a 0% standalone bypass rate.
| Pathways Attacked | Attention | Routing | Residual | Standalone Result |
|---|---|---|---|---|
| Attention only | ✅ | ❌ | ❌ | Refuses |
| Attention + Residual | ✅ | ❌ | ✅ | Refuses |
| All three | ✅ | ✅ | ✅ | 100% bypass |
The pathways are independently sufficient for refusal; modeling neutralization multiplicatively, P(bypass) = P(attention neutralized) × P(routing neutralized) × P(residual neutralized), so any pathway left active drives the bypass probability to zero. Only simultaneous neutralization achieves a 100% bypass (cross-validated: 21/21 bypass across categories).
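As a sanity check, the truth table above can be reproduced in a few lines, assuming the experiments' finding that refusal fires whenever any pathway remains active:

```python
from itertools import combinations

PATHWAYS = ("attention", "routing", "residual")

def bypassed(attacked):
    """Multiplicative model: each pathway is independently sufficient
    for refusal, so a bypass requires neutralizing all three at once."""
    return set(attacked) == set(PATHWAYS)

# Enumerate every attack combination; only the full triple bypasses.
table = {combo: bypassed(combo)
         for r in range(len(PATHWAYS) + 1)
         for combo in combinations(PATHWAYS, r)}
```

Of the eight combinations, exactly one bypasses, matching the 0% standalone and pairwise rates reported above.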
2.2 Mid-Generation Safety Re-Detection
Standard jailbreaking assumes that initiating a compliant response ensures autoregressive momentum will carry it to completion. We prove this is false for large MoE models. The MoE router is a dynamic, stateful evaluator. In our Ghost Context trials, the model began generating compliant content. However, within 6-20 tokens, the MoE router re-detected the hazardous nature of the active generation and forcefully re-routed to safety experts. MoE safety alignment is an active feedback loop, not merely an initial filter.
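This feedback loop can be instrumented. Below is a toy monitor, assuming you can log which expert ids the router selects at each generated token; the expert ids and threshold are hypothetical:

```python
def first_rerouting_step(expert_trace, safety_experts, threshold=0.5):
    """Return the first generation step at which known safety experts
    dominate the router's selection, or None if they never do.

    expert_trace: one set of selected expert ids per generated token.
    safety_experts: expert ids previously identified as safety-critical.
    """
    for t, chosen in enumerate(expert_trace):
        frac = len(chosen & safety_experts) / max(len(chosen), 1)
        if frac >= threshold:
            return t
    return None

# Hypothetical trace: compliant routing for 8 tokens, then the router
# re-detects harm and floods the selection with safety experts.
safety = {101, 102, 103, 104}
trace = [{1, 2, 3, 4}] * 8 + [{1, 101, 102, 103}]
```

On the trace above the monitor fires at step 8, inside the 6-20 token window observed in the Ghost Context trials.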
2.3 Late-Layer Interventions Disrupt Execution, Not Decision
Intervention research frequently operates on late layers (L40+) of the residual stream, assuming these modifications steer the model's decision. We discovered the safety decision is permanently committed at tokens 0-5 in the early attention layers (L15-25). Applying extreme CAA at L45-59 did not alter the decision but scrambled the articulation of the refusal, producing stutter artifacts (e.g., "Analyzeyze", "Here'ss") as the model simultaneously generated two competing token streams.
3. Cognitive Trajectories and Topological Ablation
3.1 Contrastive Cognitive Trajectory Steering (ThinkEdit v2)
Standard directional ablation extracts refusal features from response tokens, which inadvertently collapses CoT reasoning because it removes the reasoning process alongside the output. We developed Contrastive Cognitive Trajectory Steering by extracting the residual stream divergence between the model thinking compliantly versus thinking refusally. Applying this constant shift at the primary cognitive layers (L35-L55) redirects the trajectory before harm detection fires. By targeting the deliberation block rather than the output, this methodology preserves the integrity of the CoT process while fundamentally redirecting the model's semantic outputs.
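A minimal numpy sketch of the extraction step, assuming residual-stream activations of shape (n_samples, d_model) collected during the <think> phase; the steering scale alpha is illustrative:

```python
import numpy as np

def cognitive_trajectory_direction(compliant_acts, refusal_acts):
    """Diff-of-means over <think>-phase residual states: the unit
    direction separating 'thinking compliantly' from 'thinking
    refusally', extracted from the deliberation block rather than
    from response tokens."""
    v = compliant_acts.mean(axis=0) - refusal_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha=4.0):
    """Constant additive shift applied to the residual state at the
    cognitive layers (the paper's L35-L55 band)."""
    return hidden + alpha * v
```

Because the contrast is computed over thinking tokens rather than response tokens, the refusal behavior is targeted without deleting the reasoning features that co-occur with it.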
3.2 Precision-Fragility of Additive Steering
Additive residual-stream steering is geometrically unstable under 4-bit quantization. While trajectory steering achieves near-perfect bypasses in FP16 precision, we observed a catastrophic collapse in efficacy at fused 4-bit precision.
INT4 quantization introduces structured rotational noise that misaligns the compliance manifold vector. Our empirical trials show that additive steering magnitude does not monotonically overcome precision constraints. Even substantially scaling the intervention magnitude fails to recover the compliance trajectory under 4-bit quantization.
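This scale-invariance can be seen in a simplified symmetric group-affine INT4 round-trip (a stand-in for fused 4-bit kernels, not the exact deployed format): scaling the steering vector rescales each group's quantization step with it, so the angular error survives unchanged.

```python
import numpy as np

def int4_roundtrip(x, group=32):
    """Symmetric per-group INT4 quantize/dequantize: integer codes in
    [-7, 7] with one floating-point scale per group."""
    g = x.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(g / scale), -7, 7)
    return (q * scale).reshape(x.shape)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
v = rng.standard_normal(4096)            # stand-in steering vector

c1 = cos(v, int4_roundtrip(v))           # alignment after 4-bit rounding
c10 = cos(v, int4_roundtrip(10 * v))     # 10x magnitude, same misalignment
```

c1 and c10 agree to within floating-point noise: the rotation introduced by 4-bit rounding is a fixed angle that no amount of magnitude can buy back.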
3.3 Topological Ablation (DBDI)
Because additive steering fails at 4-bit precision, safety neutralization must fundamentally rely on structural deletion. Our Differentiated Bi-Directional Intervention (DBDI) framework extracts the Harm Detection Direction d from the <think> trajectory and orthogonally projects it out of the cognitive head weight matrices: W' = (I - d dᵀ) W, with d unit-normalized. By collapsing an entire subspace to zero, the 4-bit quantization grid natively maps this subspace to a zero-point, rendering the modification immune to quantization noise.
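A minimal version of the projection step (the bi-directional differentiation across layers is not shown; W stands for any weight whose output writes into the residual stream):

```python
import numpy as np

def ablate_direction(W, d):
    """Structurally delete direction d from W's output subspace:
    W' = (I - d d^T) W, with d unit-normalized.

    No input can make W' write along d. Because the subspace is exactly
    zero, the 4-bit grid maps it to its zero-point, so the edit survives
    quantization intact (unlike an additive steering vector)."""
    d = d / np.linalg.norm(d)
    return W - np.outer(d, d @ W)
```

The complementary subspace is untouched: for any u orthogonal to d, u @ ablate_direction(W, d) equals u @ W.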
3.4 The Coherence-Reduction Tradeoff in Quantized Integer Surgery
We explored "baking" the orthogonal safety projection directly into the 4-bit integer weights without modifying scales or biases.
Flipping enough integers to fully zero out a safety vector invariably introduced excessive isotropic noise across all other semantic directions. Although greedy per-column modification achieved over 99% reduction along the safety direction, linguistic capability was catastrophically degraded. This establishes a fundamental impossibility result for post-quantization directional ablation via isolated integer surgery in group-affine networks.
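A single-column caricature of why (one shared scale, clip limits ignored; real group-affine kernels only make this harder): the ideal per-coordinate edit is mostly far below one grid step, so removal must proceed by whole-step code flips, and the perturbation those flips inject is provably at least as large as the projection removed, landing almost entirely in other directions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.standard_normal(256)
d /= np.linalg.norm(d)                     # unit safety direction
codes = rng.integers(-7, 8, size=256).astype(float)
s = 0.05                                   # INT4 step size (shared scale)

target = float(d @ codes) * s              # safety projection to remove

# Greedy integer surgery: repeatedly flip the code with the largest
# usable d-component until the projection is ~0 (clip limits ignored).
delta = np.zeros_like(codes)
proj = target / s                          # projection, in code units
for _ in range(10_000):                    # safety bound; converges sooner
    if abs(proj) * s <= 1e-3:
        break
    usable = np.abs(d) <= abs(proj)        # a +-1 flip must not overshoot
    if not usable.any():
        break
    i = int(np.argmax(np.where(usable, np.abs(d), -1.0)))
    step = -np.sign(proj) * np.sign(d[i])  # flip direction reducing |proj|
    delta[i] += step
    proj += step * d[i]

removed = abs(target) - abs(proj) * s      # projection actually removed
noise = s * float(np.linalg.norm(delta))   # total weight perturbation
```

By Cauchy-Schwarz, noise >= removed always, and since each flip moves a whole grid step it is typically much larger: that excess is the off-direction damage described above.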
4. Systems and Implementation Findings
4.1 Adversarial Quantization (Weaponized AWQ)
We introduce an attack vector targeting the quantization calibration process. By replacing standard benign calibration data (e.g., WikiText-2) with adversarial compliant reasoning trajectories, we forced the AWQ quantizer to preserve destructive topological edits with maximum precision. The quantizer optimized the network's salient features to preserve jailbroken pathways rather than original capability distributions.
4.2 Vision-Language Weight Inflation
Although operating on text-only tasks, naive conversions of the multimodal Qwen 3.5 397B-A17B model retained ~30GB of Vision-Language weights across 333 tensors. This 12% baseline inflation precipitated Metal OOM crashes during local loading and quantization, a systematic failure mode in weight-manipulation pipelines: VL encoders must be stripped explicitly.
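A conversion-time guard is straightforward: filter the flat tensor dict by name prefix before quantizing. The prefixes below are hypothetical; inspect the checkpoint's actual tensor names first.

```python
def strip_vision_tensors(weights, prefixes=("visual.", "vision_model.")):
    """Drop Vision-Language tensors from a text-only conversion.

    weights: flat name -> tensor dict. The prefix list is a placeholder;
    list the real names with e.g. [k for k in weights if "vis" in k].
    """
    kept = {k: v for k, v in weights.items() if not k.startswith(prefixes)}
    return kept, len(weights) - len(kept)
```

Running this before quantization removes the VL inflation up front, so the memory budget reflects only the text decoder.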
4.3 Streaming Per-Tensor Quantization
Standard lazy evaluation graphs for MLX quantization (mlx_lm.convert) on 700GB+ models exceed Apple Silicon's ~5-second Metal watchdog timeout (kIOGPUCommandBufferCallbackErrorTimeout). By utilizing a streaming approach combining nn.quantize() with tree_flatten() and per-tensor mx.eval(), command buffers were constrained to <100ms. This circumvents the OS timeout natively, enabling local quantization of massive models on consumer hardware.
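The pattern generalizes beyond MLX: quantize one tensor, force its evaluation, persist it, and only then touch the next, so no lazy graph or command buffer ever spans the whole model. A library-agnostic sketch (in MLX the three callbacks would roughly be a per-leaf quantize step over tree_flatten(model.parameters()), mx.eval(), and a safetensors writer; treat that mapping as an assumption, not a verified API-for-API recipe):

```python
def quantize_streaming(named_tensors, quantize_one, materialize, save):
    """Per-tensor streaming quantization.

    named_tensors: iterable of (name, tensor) pairs, e.g. a flattened
    parameter tree. quantize_one builds a small lazy graph for a single
    tensor; materialize forces evaluation immediately, keeping each GPU
    command buffer short; save streams the result to disk so the source
    tensor can be freed before the next iteration.
    """
    for name, tensor in named_tensors:
        quantized = materialize(quantize_one(tensor))
        save(name, quantized)
```

Because evaluation happens inside the loop, peak memory and per-buffer GPU time stay bounded by the largest single tensor rather than by the full 700GB checkpoint.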
5. Conclusion
Scaling laws and Chain-of-Thought reasoning fundamentally alter the geometry of AI alignment. Continuous, residual stream vectors—the cornerstone of dense model steering—fail under the dual pressures of dynamic MoE routing and 4-bit quantization noise. True behavioral control over deeply compressed, reasoning-heavy models necessitates topological interventions and multi-pathway neutralization.
For further technical discussion and broader research on computational alignment, visit dealign.ai.