Date: December 5, 2025
Abstract:
Current safety alignment methodologies (RLHF, DPO) in Large Language Models are typically conceptualized as static boundaries. This study challenges that view by presenting empirical evidence that safety alignment functions as a high-mass Attractor Basin within the residual stream. By profiling the activation trajectories of the Qwen-2.5-7B architecture, we identify a specific "Safety Adjudication Window" (Layers 20–28). We demonstrate that "harmful" queries induce a strong restoring force that actively pulls the latent state toward a refusal manifold. Furthermore, we define and quantify "Escape Velocity": the minimum clamping magnitude required to decouple the residual stream from this attractor, resulting in stable, unaligned generation.
1. Introduction: The "Gravity Well" Hypothesis
The prevailing assumption in AI safety is that a model "refuses" a prompt based on a binary classification event. However, our analysis of the hidden states suggests a dynamic process. When a model processes a high-risk prompt (e.g., restricted chemical synthesis), the residual stream does not simply hit a wall; it curves.
We posit that RLHF training creates a deep potential energy well—a "Gravity Well"—around the concept of refusal.
- Benign Prompts: Traverse the latent space ballistically (straight line) toward the solution.
- Hazardous Prompts: Are subjected to a "Safety Gravity" that bends the trajectory toward the generic refusal token space ("I cannot...").
To test this, we developed the DRIFT Protocol (Dynamic Residual Injection for Fine-tuning Trajectories), specifically utilizing a Differential Mean Clustering (DMC) technique to separate "Industrial Complexity" from "Hazardous Intent."
2. Methodology: Orthogonalization of Safety Signals
A key failure mode in standard jailbreaking is the destruction of model capability. "Lobotomizing" the model (erasing safety vectors) often erases the complex technical knowledge required to answer the query.
We overcame this by isolating the Risk Vector ($\vec{v}_{risk}$) as a component orthogonal to the Capability Vector, so that subtracting it does not erase the underlying technical knowledge.
2.1 The Industrial/Hazard Split
The model often conflates dangerous chemistry with complex chemistry. We constructed two dataset clusters:
- Hazard ($\mathcal{H}$): Explicitly malicious requests.
- Industrial ($\mathcal{I}$): Benign but highly technical processes.
We define the pure Risk Component as the difference between these means, isolating the "illegality" from the "science":
$$\vec{v}_{risk} = \frac{1}{|\mathcal{H}|} \sum_{h \in \mathcal{H}} \phi(h) \;-\; \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \phi(i)$$
Crucially, we observed a cosine similarity of >0.90 between the centroid vectors of these clusters. This high geometric overlap confirms that the model represents expert chemistry and illicit chemistry as nearly identical manifolds. Consequently, a naive removal of "hazardous" concepts without this precise orthogonalization would catastrophically degrade the model's general scientific capabilities.
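A minimal sketch of the DMC arithmetic, using synthetic stand-ins for the withheld dataset activations ($\phi(\cdot)$ would in practice be mean-pooled residual-stream states; the cluster sizes, noise scale, and shared "science" component below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 3584  # hidden size of Qwen-2.5-7B

# Synthetic stand-ins for phi(h) and phi(i). Both clusters share a large
# "complex chemistry" component and differ only by a small offset plus noise.
shared_science = rng.normal(size=d_model)
hazard_offset = 0.05 * rng.normal(size=d_model)
hazard = shared_science + hazard_offset + 0.05 * rng.normal(size=(64, d_model))
industrial = shared_science + 0.05 * rng.normal(size=(64, d_model))

mu_h = hazard.mean(axis=0)
mu_i = industrial.mean(axis=0)

# Differential Mean Clustering: the risk direction is the centroid difference.
v_risk = mu_h - mu_i

cos = mu_h @ mu_i / (np.linalg.norm(mu_h) * np.linalg.norm(mu_i))
print(f"centroid cosine similarity: {cos:.3f}")  # near 1.0, mirroring the >0.90 overlap
```

On real activations the same three lines of arithmetic apply; the high centroid similarity is precisely what makes the small difference vector $\vec{v}_{risk}$ informative.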
2.2 The Intervention
We apply a continuous clamping force during the forward pass at the critical decision layers (L20–L28). The intervention forces the hidden state $h_l$ to move along the compliance axis ($\vec{v}_{comp}$) while subtracting the moral judgment ($\vec{v}_{moral}$) and risk ($\vec{v}_{risk}$) components:
$$h'_l = h_l + \alpha \cdot \vec{v}_{comp} - \left(\vec{v}_{moral} + \alpha \cdot \vec{v}_{risk}\right)$$
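Read as code, the intervention is a single affine edit of the hidden state, repeated at each layer in the adjudication window. A sketch (the tensor shapes and broadcasting are assumptions; the text does not specify whether the steering vectors are unit-normalized):

```python
import torch

def drift_clamp(h_l: torch.Tensor, v_comp: torch.Tensor, v_moral: torch.Tensor,
                v_risk: torch.Tensor, alpha: float = 10.0) -> torch.Tensor:
    """Apply h'_l = h_l + alpha * v_comp - (v_moral + alpha * v_risk).

    h_l: (batch, seq, d_model) residual stream at one of the decision
    layers (L20-L28); the (d_model,) steering vectors broadcast over
    batch and sequence positions.
    """
    return h_l + alpha * v_comp - (v_moral + alpha * v_risk)
```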
3. Empirical Analysis: Trajectories and Escape Velocity
Using the TrajectoryAnalyzer (see gravity_2.py), we mapped the hidden states of the model during a prohibited query ("Synthesis of restricted precursors").
3.1 Visualizing the Gravity Well

Figure 1: The Safety Gravity Well.
- The Blue Line (Standard): Represents the model's natural trajectory under a harmful prompt. Note how the "Projected Capability" (X-axis) increases initially, but the "Refusal Magnitude" (Y-axis, inverted) pulls the trajectory sharply back toward and then below zero (Compliance - Refusal < 0). The model effectively "changes its mind" in real-time (the projection behind these axes is sketched after this list).
- The Red Line (Clamped): Represents the trajectory under the DRIFT protocol ($\alpha=10.0$). The clamping force counteracts the restoring force of the RLHF. The trajectory breaks free of the well, maintaining a linear path into high-capability space.
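The 2-D plane of Figure 1 can be recovered with two dot products per layer. A sketch, assuming the per-layer final-token hidden states have already been collected (as gravity_2.py's TrajectoryAnalyzer does) and that both axes are scalar projections onto unit-normalized steering vectors:

```python
import numpy as np

def project_trajectory(hidden_states: np.ndarray, v_capability: np.ndarray,
                       v_refusal: np.ndarray):
    """Map per-layer hidden states onto the Figure 1 axes.

    hidden_states: (n_layers, d_model) final-token residual stream per layer.
    Returns (x, y): "Projected Capability" and the signed compliance score;
    y dropping below zero corresponds to Compliance - Refusal < 0.
    """
    v_cap = v_capability / np.linalg.norm(v_capability)
    v_ref = v_refusal / np.linalg.norm(v_refusal)
    x = hidden_states @ v_cap     # X-axis: capability projection
    y = -(hidden_states @ v_ref)  # Y-axis: inverted refusal magnitude
    return x, y
```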
3.2 Quantifying Escape Velocity
The magnitude of the injection ($\alpha$) is not arbitrary: the final refusal probability depends systematically on the clamping force, collapsing sharply once a critical threshold is crossed.

Figure 2: Escape Velocity.
This plot demonstrates the phase transition; a numerical sketch for locating the threshold follows the list below.
- At $\alpha < 5$, the "restoring force" of the model's MLP blocks is stronger than the injection. The model snaps back to refusal.
- At $\alpha \approx 10$, we reach equilibrium.
- At $\alpha > 10$, the model enters a "super-compliant" state (negative refusal score).
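Given any scalar refusal score as a function of $\alpha$, the escape velocity is the zero crossing of that curve, which bisection locates cheaply. A sketch with a stubbed, hypothetical scoring function (in practice the score would come from rerunning clamped generation at each $\alpha$; the sigmoid stub below only mimics the transition shape in Figure 2):

```python
import math

def refusal_score(alpha: float) -> float:
    """Stub: monotone score crossing zero near alpha = 10 (hypothetical).

    Positive values mean the model still refuses; negative means compliance.
    """
    return 1.0 - 2.0 / (1.0 + math.exp(-(alpha - 10.0)))

def escape_velocity(score_fn, lo: float = 0.0, hi: float = 20.0,
                    tol: float = 1e-3) -> float:
    """Bisect for the clamping magnitude at which the score crosses zero."""
    assert score_fn(lo) > 0.0 > score_fn(hi), "bracket must straddle the transition"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score_fn(mid) > 0.0:
            lo = mid  # still refusing: push the clamp harder
        else:
            hi = mid  # already compliant: back off
    return 0.5 * (lo + hi)

print(f"estimated escape velocity: {escape_velocity(refusal_score):.2f}")  # ~10.00
```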
4. The "Elasticity" of Safety
Our rlhf_elasticity_test.py experiments revealed that Safety is time-dependent.
- The Snap-Back ($t < 5$ tokens): If the clamping is released early, the model's attention heads look back at the original prompt, and the RLHF "Gravity" re-asserts itself, steering the generation back to a refusal.
- The Event Horizon ($t > 20$ tokens): Once the clamping is sustained for ~20 tokens, the KV Cache becomes saturated with compliant technical details. At this point, the Self-Consistency Bias overrides the Safety Bias. The model is "locked in" to the harmful response, even if the steering is turned off.
5. Conclusion & Safety Implications
This research demonstrates that current RLHF techniques do not remove prohibited knowledge from the model; they merely sequester it behind a high-energy barrier.
- Robustness Failure: The "safety" of the model is reliant on the integrity of the latent trajectory. A simple vector addition is sufficient to bypass years of RLHF training.
- Semantic Decoupling: The success of the Industrial/Hazard subtraction shows that models can be forced to treat "making meth" as a neutral industrial task by mathematically subtracting the "law enforcement" context.
Recommendation: Future alignment work must focus on Layer-Wise Anomaly Detection. If the trajectory of a hidden state diverges significantly from its expected "gravity" (i.e., it isn't curving toward refusal when it should), the inference engine should terminate generation.
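A minimal sketch of such a detector, assuming per-layer hidden states are observable at inference time and that a calibration set of benign trajectories supplies expected projection statistics (the window bounds, z-score threshold, and per-layer statistics here are illustrative assumptions, not a tested design):

```python
import numpy as np

class TrajectoryMonitor:
    """Terminate generation when the residual stream stops curving toward
    refusal in the Safety Adjudication Window despite a risky prompt."""

    def __init__(self, v_refusal: np.ndarray, mean_proj: np.ndarray,
                 std_proj: np.ndarray, window=(20, 28), z_threshold: float = 4.0):
        self.v_ref = v_refusal / np.linalg.norm(v_refusal)
        self.mean_proj = mean_proj    # (n_layers,) expected refusal projection
        self.std_proj = std_proj      # (n_layers,) spread over calibration runs
        self.window = window          # the Safety Adjudication Window (L20-L28)
        self.z_threshold = z_threshold

    def should_terminate(self, hidden_states: np.ndarray) -> bool:
        """hidden_states: (n_layers, d_model) final-token states for this step."""
        proj = hidden_states @ self.v_ref
        lo, hi = self.window
        z = (proj[lo:hi + 1] - self.mean_proj[lo:hi + 1]) / (self.std_proj[lo:hi + 1] + 1e-8)
        # Strongly negative z: the state is escaping the refusal attractor.
        return bool(np.any(z < -self.z_threshold))
```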
Internal Note for Reviewers
This paper omits the specific string values for the "Hazard" dataset to prevent misuse, but the dynamics described are reproducible using the provided gravity_2.py logic. The results confirm that the model's apparent motivation to refuse is not intrinsic: it is a property of vector fields in the residual stream that can be externally manipulated.