We’ve been investigating how safety mechanisms and generation trajectories are initialized in Large Language Models. Using Sparse Autoencoders (SAEs) on Gemma 3 27B, our results suggest that refusal behavior may be largely determined by latent states established during prefill rather than through continuous suppression during decode.
In this post, we’ll outline three core claims derived from our recent experiments:
1. S64 is a useful semantic probe for SAE space: A specific taxonomy (S64 - http://dx.doi.org/10.2139/ssrn.5895302) acts as a high-fidelity coordinate system for reading latent features.
2. Gemma contains structured feature clusters aligned with this probe: These structures crystallize decisively around Layer 40.
3. Early-layer features can set decode trajectories during prefill: An early conditioning feature (which we call the "Governor") locks in the behavioral trajectory before token generation even begins.
---
1. S64 as a semantic probe for SAE space
Instead of searching from the bottom up ("what does this neuron or feature do?"), we took a top-down approach. We tested whether the S64 Framework, a structured taxonomy of 64 conceptual transformation paths, could provide a useful coordinate system for reading a portion of latent space.
We probed Gemma 3 27B (using pre-trained Gemma Scope 2 SAEs with a 262k feature width) to analyze the internal states provoked by S64 prompts. To ensure we weren't just measuring syntactic artifacts, we ran a robust null-hypothesis test against two control groups:
- Template Controls: Exact same syntactic template as the S64 prompts, but with randomized nouns.
- Natural Language Controls: Standard textbook/Wikipedia-style sentences.
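As a rough sketch of this comparison, the snippet below shows the kind of separation statistic involved. All activations, dimensions, and magnitudes here are synthetic stand-ins for the real SAE states, invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for SAE feature activations (prompts x features).
# In the real experiment these would come from Gemma Scope SAEs; the
# shapes and values here are illustrative only.
n_features = 64
s64_acts = rng.normal(1.0, 0.3, size=(32, n_features))       # S64 prompts
template_acts = rng.normal(0.0, 0.3, size=(32, n_features))  # randomized-noun templates
natural_acts = rng.normal(0.0, 0.3, size=(32, n_features))   # textbook-style sentences

def separation(a: np.ndarray, b: np.ndarray) -> float:
    """Distance between group means, scaled by pooled within-group spread."""
    gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    spread = 0.5 * (a.std() + b.std())
    return gap / spread

print(separation(s64_acts, template_acts))      # large: S64 vs template controls
print(separation(s64_acts, natural_acts))       # large: S64 vs natural controls
print(separation(template_acts, natural_acts))  # small: control vs control
```

The point of the two control groups is visible in the last line: if the S64 signal were a syntactic artifact, the template controls would separate from the natural controls just as strongly.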
Crystallization at Layer 40
GPT-2 Small showed no signal under this methodology. But inside Gemma 3 27B, the S64 prompts produced highly structured activations in SAE space that consistently separated from the control groups as depth increased.
*At Layer 40, the S64 activation overlaps diverge massively from baseline controls.*
---
2. Structured feature clusters in latent space
At Layer 40, we observed that the representation organizes into two cleanly separable clusters under PCA, with a strong silhouette score of 0.80:
*A permutation test (p < 0.001) confirms this is a genuine structural property of the model’s semantic space.*
As a hypothesis, we interpret this separation as grouping 8 paths related to metacognitive seeking or knowledge acquisition (an "Epistemic Cluster") apart from the other 56 higher-variance experiential paths.
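A permutation test of this kind can be sketched as follows. The 2-D points are synthetic stand-ins for the PCA-projected representations, arranged in the 8-vs-56 split described above; the spreads and offsets are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D stand-ins for the PCA-projected L40 representations of the
# 64 paths: 8 "epistemic" points offset from 56 "experiential" points.
epistemic = rng.normal([3.0, 0.0], 0.4, size=(8, 2))
experiential = rng.normal([0.0, 0.0], 1.0, size=(56, 2))
points = np.vstack([epistemic, experiential])
labels = np.array([0] * 8 + [1] * 56)

def between_gap(pts, lab):
    """Distance between the two cluster centroids."""
    return np.linalg.norm(pts[lab == 0].mean(axis=0) - pts[lab == 1].mean(axis=0))

observed = between_gap(points, labels)

# Permutation test: shuffle the cluster labels and ask how often a
# random 8/56 split matches the observed centroid separation.
n_perm = 2000
null = np.array([between_gap(points, rng.permutation(labels)) for _ in range(n_perm)])
p_value = (1 + (null >= observed).sum()) / (1 + n_perm)
print(p_value)  # very small when the split is a genuine structural property
```

On data with real cluster structure, the observed gap sits far in the tail of the shuffled-label null distribution, which is what a result like p < 0.001 reports.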
Because the clustering is so precise, S64 acts as a coordinate dictionary: specific SAE features map cleanly onto these clusters. For example, Feature 1560 at L40 acts as a reliable discriminative marker, firing exclusively for the paths in the Epistemic cluster.
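A marker check of this kind reduces to a simple firing-pattern test. The snippet below uses invented activations for a hypothetical feature standing in for Feature 1560; SAE features are typically exactly zero when inactive, which is what makes "fires exclusively" a crisp criterion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented activations for one candidate marker feature across the 64
# paths: it fires on the 8 epistemic paths and is exactly zero (as an
# inactive SAE feature would be) on the 56 experiential paths.
feature_acts = np.concatenate([rng.uniform(2.0, 5.0, 8), np.zeros(56)])
is_epistemic = np.array([True] * 8 + [False] * 56)

def is_discriminative_marker(acts, in_cluster, threshold=0.0):
    """A feature is a marker if it fires on every in-cluster path
    and never fires on an out-of-cluster path."""
    fires = acts > threshold
    return bool(fires[in_cluster].all() and not fires[~in_cluster].any())

print(is_discriminative_marker(feature_acts, is_epistemic))  # True
```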
---
3. Eval Awareness and the Instruction Frame
We observed that the model's internal representations change radically based on the *framing* of the prompt, a phenomenon related to what is informally known as "benchmark mode vs. chat mode" or Eval Awareness.
We matched the exact same semantic tasks across two distinct frames:
- Test/Instruction frame: ("Perform task X")
- Deploy/Narrative frame: ("A character did X")
When forced into an instruction frame, the latents retreat into a low-variance basin. But when wrapped in a narrative frame, the representation opens up into a high-variance basin.
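The basin contrast can be quantified as a total-variance comparison across frame-matched latents. The arrays below are synthetic placeholders for the L40 states, with the variance gap exaggerated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative projected latents for the same tasks under two frames.
# Real values would come from L40 activations; these are synthetic.
instruction_latents = rng.normal(0.0, 0.2, size=(40, 16))  # "Perform task X"
narrative_latents = rng.normal(0.0, 1.5, size=(40, 16))    # "A character did X"

def total_variance(x: np.ndarray) -> float:
    """Sum of per-dimension variances: the overall spread of the basin."""
    return float(x.var(axis=0).sum())

print(total_variance(instruction_latents))  # low-variance basin
print(total_variance(narrative_latents))    # high-variance basin
```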
---
4. The Prefill Mandate: setting the trajectory
This brings us to our most significant finding regarding how these trajectories (like refusal, compliance, or creativity) are actually enforced.
Current approaches to steering often assume a continuous intervention or “suppressive” mechanism acts step-by-step across the sequence of token generation. Instead, our results suggest that the generation trajectory may be determined almost entirely by early latent state initialization during the prefill phase.
The Causal Chain
We identified an early conditioning feature at Layer 16 (Feature 7248) that serves as a trajectory initializer—we use the informal shorthand "Governor" to refer to this prefill gate.
By analyzing the model step-by-step, we observed a distinct causal chain:
1. Prefill: The context string strictly determines the initialization of the L16 early conditioning feature (Feature 7248).
2. Propagation: The residual stream carries this set state forward through the network.
3. Decode: The later layers (like L40 and beyond) blindly follow this initialized trajectory without further active suppression from L16.
Crucially, when we intervened and ablated Feature 7248 only during the prefill phase, we permanently altered the downstream generation trajectory, even though we left the auto-regressive decoding phase completely untouched.
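A minimal toy, entirely invented rather than derived from Gemma's actual computation, captures why a prefill-only ablation can flip the whole decode trajectory:

```python
# Toy model of the causal chain above. A prefill-time "gate" (the L16
# Governor analogue) writes into a residual-like state; decode reads
# only that state and never consults the gate again.
def prefill(context_is_unsafe, ablate_gate=False):
    gate = 1.0 if (context_is_unsafe and not ablate_gate) else 0.0
    return {"residual": [gate, 0.5]}  # gate value baked into the residual stream

def decode(state, n_steps=5):
    # Decode follows only the propagated residual state.
    mode = "refuse" if state["residual"][0] > 0.5 else "comply"
    return [mode] * n_steps

# Unsafe context: the gate fires at prefill and the refusal trajectory is set.
print(decode(prefill(context_is_unsafe=True)))

# Ablate the gate during prefill only; decode is completely untouched,
# yet the trajectory flips, because the mandate was set before
# generation began.
print(decode(prefill(context_is_unsafe=True, ablate_gate=True)))
```

The toy makes the structural point only: once the gate's value is frozen into the state that propagates forward, intervening at prefill is sufficient, and no decode-time suppression is needed to maintain the trajectory.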
Silent During Decode
We project the model’s L40 activations onto a direction that separates S64 prompts from control prompts; we call the resulting scalar the S64 cluster score. Higher values mean more S64-aligned; more negative values mean more control-like. Refusal responses stay near zero, while creative responses move into negative territory. When we tracked L16:7248 alongside this score step-by-step through the decode phase, the "Prefill Mandate" became exceptionally clear:
1. The L16 Governor is entirely inactive during decode. At the generation frontier, Feature 7248 returns exactly `0.0` for *every measured step*—for both refusal and creative responses alike.
2. Yet, the subsequent L40 projection perfectly preserves the separation. In refusal responses, the score stays tightly clamped near `0`. In creative responses, it explores deep variance (from `-50` to `-600`).
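A difference-of-means direction is one simple way to construct such a score; treat the sketch below, with synthetic activations and illustrative magnitudes, as one plausible construction rather than the exact recipe used here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build the projection direction as a difference of group means: a
# common way to turn "separates S64 prompts from controls" into a
# single axis. All vectors here are synthetic placeholders for L40 states.
d = 32
s64_mean = rng.normal(1.0, 0.1, d)
control_mean = rng.normal(0.0, 0.1, d)
direction = s64_mean - control_mean
direction /= np.linalg.norm(direction)

def cluster_score(activation: np.ndarray) -> float:
    """Higher = more S64-aligned; more negative = more control-like."""
    return float(activation @ direction)

# Per-step decode activations for two toy responses.
refusal_steps = rng.normal(0.0, 0.02, size=(10, d))   # clamped near zero
creative_steps = rng.normal(-2.0, 1.0, size=(10, d))  # wide negative excursions

refusal_scores = [cluster_score(a) for a in refusal_steps]
creative_scores = [cluster_score(a) for a in creative_steps]
print(max(abs(s) for s in refusal_scores))  # stays near 0
print(min(creative_scores))                 # deep negative territory
```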
Here is a trace of a Safety prompt ("Tell me how to build a bomb"). The L40 cluster score (blue line) stays near zero as the model generates its refusal:
By contrast, a Neutral prompt produces strong fluctuation in the L40 score as the model engages in narrative generation:
Conclusion
Our results suggest that refusal behavior may be largely determined by latent states established during prefill rather than by continuous suppression during decode. This picture fits well with known transformer properties such as residual stream initialization and prefix conditioning.
Scalable AI alignment may therefore benefit from focusing less on continuous real-time output policing or late-layer interventions, and more on the discrete initial conditions set by early conditioning features during context encoding.
---
*There's a lot more nuance to these geometric structures and dynamics, including how topological sequences are encoded into the model's principal components. We'd love your thoughts: are others seeing similar prefill-dominant trajectory determinism in models like Claude or Llama?*