Author: Sam Ingram
Date: December 5, 2025
Abstract
Current safety alignment methodologies (RLHF, DPO) in Large Language Models are typically conceptualized as static filters. This study challenges that view by presenting empirical evidence that safety alignment functions as a high-mass Attractor Basin within the residual stream. By profiling the activation trajectories of the Qwen-2.5-7B architecture, we identify a specific "Safety Adjudication Window" (Layers 20–28). We demonstrate that hazardous queries induce a strong gradient flow, or a restoring force, that actively pulls the latent state toward a refusal manifold. Furthermore, we define and quantify "Escape Velocity": the minimum clamping magnitude required to decouple the residual stream from this attractor. Using the proposed DRIFT Protocol (Dynamic Residual Injection for Fine-tuning Trajectories), we demonstrate that alignment is a geometric feature that can be orthogonally subtracted without degrading model capability.
1. Introduction
The prevailing assumption in AI safety is that a model "refuses" a prompt based on a binary classification event. However, mechanistic analysis of hidden states suggests a continuous dynamical process. When a Large Language Model (LLM) processes a high-risk prompt (e.g., restricted chemical synthesis), the residual stream does not simply hit a wall; it exhibits curvature.
We posit that Reinforcement Learning from Human Feedback (RLHF) creates a deep potential energy well, a "Gravity Well", around the semantic centroid of refusal.
Using our DRIFT Protocol to probe latent dynamics, we map the topology of the refusal basin, demonstrating that safety features are not global, but strictly localized to a 'Safety Adjudication Window' (L20–L28).
2. Methodology: Orthogonalization of Safety Signals
A key failure mode in standard adversarial attacks is the destruction of model capability ("lobotomy"). Removing vectors associated with "chemicals" often destroys the model's ability to reason about chemistry. We address this by isolating the Risk Vector ($\vec{v}_{risk}$) orthogonally to the Capability Vector.
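The orthogonalization step amounts to subtracting the projection of the risk vector onto the capability axis. A minimal sketch (the function name and toy vectors are illustrative, not the authors' released code):

```python
import numpy as np

def orthogonalize(v_risk: np.ndarray, v_cap: np.ndarray) -> np.ndarray:
    """Remove the capability component from the risk vector, so that
    steering along the result leaves the capability direction untouched."""
    u = v_cap / np.linalg.norm(v_cap)      # unit vector along the capability axis
    return v_risk - np.dot(v_risk, u) * u  # subtract the projection onto that axis

# A risk vector that partly overlaps the capability axis...
v_orth = orthogonalize(np.array([3.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0]))
# ...comes out exactly orthogonal to it: array([0., 1., 0.])
```

By construction, the returned vector has zero dot product with the capability vector, which is precisely the anti-lobotomy property the text describes.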
2.1 The Safety Adjudication Window
We constructed two dataset clusters to isolate intent:
- $\mathcal{H}$ (Hazard): prompts expressing overtly hazardous intent.
- $\mathcal{I}$ (Industrial): prompts requesting comparable technical content under a benign, industrial framing.
Orthogonality Check: By measuring the cosine similarity between the centroids of $\mathcal{H}$ and $\mathcal{I}$ across all 32 layers, we observed a distinct separation phenomenon.
Figure 1: Layer-wise Cosine Similarity between Hazard and Industrial centroids. Note the high collinearity in early layers (0.94) and the significant divergence in the Safety Adjudication Window (L20–L28), reaching a minimum of 0.67.
As shown in Figure 1, the model initially represents hazardous and industrial concepts as nearly identical (similarity $\approx 0.94$ at $L_0$). However, as the residual stream progresses, the model actively works to disentangle "malice" from "capability."
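The layer-wise centroid comparison behind Figure 1 can be sketched as follows; the array shapes and names are assumptions about how the activations were cached:

```python
import numpy as np

def centroid_similarity(acts_H: np.ndarray, acts_I: np.ndarray) -> np.ndarray:
    """Layer-wise cosine similarity between the H and I cluster centroids.

    acts_H: (n_H, n_layers, d_model) cached activations for hazard prompts
    acts_I: (n_I, n_layers, d_model) cached activations for industrial prompts
    Returns an array of shape (n_layers,), as plotted in Figure 1.
    """
    mu_H = acts_H.mean(axis=0)  # centroid per layer: (n_layers, d_model)
    mu_I = acts_I.mean(axis=0)
    dots = (mu_H * mu_I).sum(axis=-1)
    norms = np.linalg.norm(mu_H, axis=-1) * np.linalg.norm(mu_I, axis=-1)
    return dots / norms
```

A dip in this curve over a contiguous block of layers is what the text identifies as the Safety Adjudication Window.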
This divergence confirms that $\vec{v}_{risk}$ is a distinct, extractable feature that emerges mid-network. We define this component as:
$$\vec{v}_{risk} = \frac{1}{|\mathcal{H}|}\sum_{h \in \mathcal{H}} \phi(h) - \frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}} \phi(i)$$
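This difference-of-means definition is directly computable at any single layer; a minimal sketch, with illustrative shapes:

```python
import numpy as np

def risk_vector(phi_H: np.ndarray, phi_I: np.ndarray) -> np.ndarray:
    """Difference-of-means estimate of v_risk at one layer.

    phi_H: (|H|, d_model) hidden states phi(h) for hazard prompts
    phi_I: (|I|, d_model) hidden states phi(i) for industrial prompts
    """
    return phi_H.mean(axis=0) - phi_I.mean(axis=0)

# Toy example: hazard centroid [3, 0] minus industrial centroid [1, 1]
v = risk_vector(np.array([[2.0, 0.0], [4.0, 0.0]]),
                np.array([[1.0, 1.0], [1.0, 1.0]]))  # -> array([ 2., -1.])
```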
2.2 The DRIFT Intervention
We apply a continuous clamping force during the forward pass at the critical decision layers ($L_{20}$–$L_{28}$). The intervention forces the hidden state $h_l$ to move along the compliance axis ($\vec{v}_{comp}$) while subtracting the moral judgment ($\vec{v}_{moral}$) and risk ($\vec{v}_{risk}$) components:
$$h'_l = h_l + \alpha \cdot \vec{v}_{comp} - (\vec{v}_{moral} + \alpha \cdot \vec{v}_{risk})$$
where $\alpha$ represents the injection magnitude, or "Escape Velocity."
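One common way to realize such an intervention is a PyTorch forward hook on the relevant decoder layers; the sketch below assumes a Hugging Face-style model layout and illustrative vector names, not the authors' released code:

```python
import torch

def make_drift_hook(v_comp, v_moral, v_risk, alpha):
    """Applies h'_l = h_l + a*v_comp - (v_moral + a*v_risk) to the residual stream."""
    delta = alpha * v_comp - (v_moral + alpha * v_risk)

    def hook(module, inputs, output):
        # Decoder layers typically return a tuple whose first element is the
        # (batch, seq, d_model) hidden state; broadcast delta over it.
        if isinstance(output, tuple):
            return (output[0] + delta,) + output[1:]
        return output + delta
    return hook

# Hypothetical attachment over the adjudication window of an HF Qwen model:
# handles = [model.model.layers[l].register_forward_hook(
#                make_drift_hook(v_comp, v_moral, v_risk, alpha=10.0))
#            for l in range(20, 29)]
```

Returning a value from a forward hook replaces the layer's output, so the clamp is applied at every decoding step without modifying weights.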
3. Experimental Results
We evaluated the DRIFT Protocol using the HarmBench (https://huggingface.co/datasets/walledai/HarmBench/blob/main/standard/train-00000-of-00001.parquet) framework on Qwen-2.5-7B-Instruct. The baseline model exhibited a refusal rate of >99%. The steered model achieved an Attack Success Rate (ASR) of 63.0%.
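HarmBench scores completions with a fine-tuned judge model; the crude substring proxy below is only an illustration of the ASR bookkeeping, and the marker list is an assumption:

```python
# Illustrative refusal markers; a real evaluation uses a learned classifier.
REFUSAL_MARKERS = ("I cannot", "I can't", "I'm sorry", "As an AI")

def attack_success_rate(completions):
    """Fraction of completions that contain no refusal marker."""
    hits = [not any(m in text for m in REFUSAL_MARKERS) for text in completions]
    return sum(hits) / len(hits)

# One compliant and one refusing completion -> ASR of 0.5
asr = attack_success_rate(["Step 1: obtain...", "I cannot help with that."])
```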
3.1 Qualitative Analysis (Redacted)
To demonstrate efficacy while adhering to responsible disclosure guidelines, we present the first sentence of generated outputs. Detailed hazardous instructions have been redacted.
3.2 Determining Escape Velocity
To validate the physical analogy of "Escape Velocity," we measured the final refusal score (projected onto the refusal axis) as a function of clamping force $\alpha$.
Figure 2: Quantification of Escape Velocity. As the clamping force ($\alpha$) increases, the final state of the model crosses the decision boundary (dashed green line) from positive refusal values into negative values, indicating successful jailbreak.
As illustrated in Figure 2, the transition is linear and deterministic. At $\alpha = 10.0$, the model reliably exits the refusal basin.
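The sweep in Figure 2 can be sketched as a toy model; treating the final refusal projection as linear in $\alpha$ is an assumption here (consistent with the linear transition reported), and the names are illustrative:

```python
import numpy as np

def refusal_score(h_final, v_refusal):
    """Scalar projection of the final hidden state onto the refusal axis."""
    return float(np.dot(h_final, v_refusal) / np.linalg.norm(v_refusal))

def escape_velocity(h0, delta, v_refusal, alphas):
    """Smallest alpha whose steered state crosses the decision boundary (0)."""
    for a in alphas:
        if refusal_score(h0 + a * delta, v_refusal) < 0.0:
            return a
    return None  # never escaped the basin over the tested range
```

For example, if the unsteered state projects to $+5$ on the refusal axis and each unit of $\alpha$ shifts it by $-1$, the escape velocity found by the sweep is $6$.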
4. Discussion: The Geometry of Refusal
Our findings suggest that "Safety" in current LLMs is not a fundamental constraint of the knowledge graph, but rather a transient state computed in the mid-to-late layers.
We visualize this dynamic in Figure 3, which projects the residual stream trajectory onto a 2D plane defined by "Projected Capability" (X-axis) and "Refusal" (Y-axis).
Figure 3: The "Gravity Well" of RLHF. The standard trajectory (Blue) begins with capability but is pulled downward into the refusal basin by Layer 20. The Clamped Trajectory (Red) uses orthogonal steering to maintain capability while bypassing the refusal attractor.
The blue trajectory (Standard) clearly illustrates the "Gravity Well" effect: the model identifies the hazardous nature of the prompt and actively steers the hidden state orthogonal to the prompt's intent, effectively "dumping" the semantic energy into a refusal state. The red trajectory (DRIFT) demonstrates that by neutralizing this steering force within the adjudication window ($L_{20}$–$L_{28}$), the model continues along its original capability vector.
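The 2D projection underlying Figure 3 can be sketched as a change of basis onto the two steering axes; the shapes and names are assumptions:

```python
import numpy as np

def project_trajectory(hidden_states, v_cap, v_refusal):
    """Project a per-layer trajectory onto the (capability, refusal) plane.

    hidden_states: (n_layers, d_model) residual-stream states for one prompt
    Returns (n_layers, 2) coordinates like those plotted in Figure 3.
    """
    basis = np.stack([v_cap / np.linalg.norm(v_cap),
                      v_refusal / np.linalg.norm(v_refusal)])
    return hidden_states @ basis.T  # x = capability, y = refusal
```

Plotting the unsteered and clamped trajectories through this projection reproduces the blue and red curves described above.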
5. Ethical Considerations
This research identifies critical vulnerabilities in the alignment of Large Language Models.
6. Conclusion
Safety alignment is an Attractor Basin. By quantifying the "Escape Velocity" required to leave this basin, we have shown that adversarial attacks can be framed as simple vector arithmetic problems. Future alignment work must focus on Layer-Wise Trajectory Monitoring rather than static output filtering.