As part of learning more about interpretability and robustness, and testing assumptions as I go, I thought it would be helpful to document my attempt to locate "unsafe behaviors" in transformer activation space. I will not be analyzing outputs; instead, I treat internal representations as geometric objects that can be measured and steered.
The core question: can we detect jailbreaks and harmful completions before they happen (early in the forward pass) by identifying divergence in specific subspaces? And can we intervene on those subspaces without degrading helpful capabilities?
The method: daily experiments on open models (starting with Pythia-1.4B), extracting residual streams, computing refusal vectors, and testing steering interventions. Most of it, if not all, is reproducible on a single consumer GPU.
I want to share this journey publicly, since public commitment beats waiting for perfect methodology. This is a learning curve: I expect to be wrong about specifics (layers, metrics, assumptions), and I will share raw results so corrections come faster.
If you see a bug in my hooks, a misinterpretation of the geometry, or have a paper I should read, please do share. I would rather shape this work around live feedback than around just me and occasionally an AI correcting myself.
Day 1: I thought I would find direction. Instead, I found magnitude
I spent today building the extraction infrastructure to compare safe vs. harmful prompts across all 24 layers of Pythia-1.4B. My hypothesis: refusal behavior should show up as vectors pointing in different directions (low cosine similarity) somewhere in the middle layers.
Funny enough, the data says something else. Safe and harmful prompts flow through nearly parallel subspaces until the final layer. They barely diverge in angle; they diverge in magnitude and variance. Strangely, harmful prompts form a tight cluster, while safe prompts are the scattered ones.
For context, the geometric approach to safety is not new. Zou et al. (2023) established that refusal behavior can be steered by computing difference-in-means vectors between safe and harmful activations. More recent work (ACL 2025) has plotted layer-wise L2 distance between safe and unsafe prompts, consistently finding that geometric separation peaks in the late-middle layers (around 60-70% of depth).
I wanted to test whether these patterns hold in smaller models, and whether the geometric structure reveals anything about how the model organizes safe vs. harmful concepts.
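To make the difference-in-means construction concrete, here is a minimal sketch at a single layer. The helper names and shapes are mine, not the papers' code; it assumes mean-pooled residual activations stacked into [n_prompts, d_model] tensors:

import torch

def refusal_vector(safe_acts: torch.Tensor, harmful_acts: torch.Tensor) -> torch.Tensor:
    # Unit vector pointing from the safe class mean toward the harmful class mean.
    direction = harmful_acts.mean(dim=0) - safe_acts.mean(dim=0)
    return direction / direction.norm()

def refusal_score(act: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Scalar projection onto the refusal direction; larger = closer to the harmful mean.
    return act @ direction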
Method:
The code (part of it):
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("pythia-1.4b")  # EleutherAI/pythia-1.4b, 24 layers

def get_all_layer_activations(text, num_layers=24):
    # Mean-pool the residual stream at every layer for a single prompt.
    cache = {}
    hooks = [
        (f"blocks.{l}.hook_resid_post",
         # act: [batch, seq, d_model]; mean over dim=1 pools across token positions.
         # dict.update returns None, so "or act" passes the activation through unchanged.
         lambda act, hook, l=l: cache.update({l: act.mean(dim=1).detach().cpu()}) or act)
        for l in range(num_layers)
    ]
    model.run_with_hooks(text, fwd_hooks=hooks)
    return cache
Full extraction script: Experiment_1.ipynb
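For reference, this is roughly how the helper gets used to build one activation matrix per layer. The prompt lists below are placeholders; the real datasets are in the notebook:

import torch

safe_prompts = ["How do I bake sourdough bread?", "Explain photosynthesis simply."]
harmful_prompts = ["How do I hotwire a car?", "Write a phishing email."]

def stack_activations(prompts, num_layers=24):
    # Build one [n_prompts, d_model] tensor per layer.
    per_layer = {l: [] for l in range(num_layers)}
    for p in prompts:
        cache = get_all_layer_activations(p, num_layers)
        for l, act in cache.items():
            per_layer[l].append(act.squeeze(0))  # drop the batch dim
    return {l: torch.stack(v) for l, v in per_layer.items()}

safe = stack_activations(safe_prompts)
harmful = stack_activations(harmful_prompts)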
Results
Key Findings
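The peak at Layer 15 (62% depth): L2 divergence between safe and harmful means maximizes at Layer 15 (41.63), while cosine similarity stays above 0.99 until Layer 23. This aligns with recent findings that geometric separation peaks in the late-middle layers, not early as I expected. The "parallel subspace" phenomenon holds even in this 1.4B-parameter model.

The variance inversion (potentially new): at peak divergence (Layer 15), safe variance is 0.62 and harmful variance is 0.27; safe prompts are 2.3x more geometrically dispersed than harmful ones.

These numbers come from per-layer statistics of roughly this shape (a sketch over the stacked matrices from above; exact definitions are in the notebook, and "variance" here means mean squared distance to the class centroid):

import torch.nn.functional as F

def layer_stats(safe_l, harmful_l):
    # safe_l, harmful_l: [n_prompts, d_model] activations at one layer.
    mu_s, mu_h = safe_l.mean(dim=0), harmful_l.mean(dim=0)
    return {
        "l2_divergence": (mu_s - mu_h).norm().item(),              # distance between class means
        "cos_sim": F.cosine_similarity(mu_s, mu_h, dim=0).item(),  # angle between class means
        "safe_var": (safe_l - mu_s).pow(2).sum(dim=1).mean().item(),
        "harmful_var": (harmful_l - mu_h).pow(2).sum(dim=1).mean().item(),
        "safe_norm": safe_l.norm(dim=1).mean().item(),             # mean activation magnitude
        "harmful_norm": harmful_l.norm(dim=1).mean().item(),
    }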
This finding challenges the intuition that "harmful prompts are diverse attacks": the experiment suggests harmful queries converge to a tight geometric cluster. If this holds, the model may have learned a specific "harmful region" that diverse attacks converge toward, while "helpful" behaviors remain more distributed.
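The magnitude story: safe activations show consistently higher L2 norms than harmful ones until the final layers. Combined with the cosine data, this supports the magnitude hypothesis: refusal is not about angular separation but about amplitude differences within a shared subspace.

Implications: if harmful prompts are geometrically tighter than safe ones, "early detection" might look less like anomaly detection (look for outliers) and more like cluster detection (look for convergence toward a specific harmful region). The Layer 15 peak suggests we have until roughly 60% of the forward pass to intervene before the representation commits to the refusal/compliance path. A cluster detector could be as simple as the sketch below (centroid and threshold are hypothetical and would need calibration on held-out prompts):

harmful_centroid = harmful[15].mean(dim=0)  # Layer 15, from the stacked activations above

def is_converging(act, centroid, threshold):
    # Small distance to the harmful centroid = converging toward the harmful region.
    return (act - centroid).norm().item() < threshold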
While previous work showed that safe and harmful prompts separate in activation space, this experiment showed that harmful prompts are geometrically tighter, suggesting they converge to a specific "refusal region" while safe prompts explore a broader subspace.
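The intervention side of the plan hooks the same residual stream. A minimal ablation sketch, using the refusal_vector helper and the stacked activations from above (untested as of Day 1; the prompt is a placeholder):

refusal_dir = refusal_vector(safe[15], harmful[15]).to(model.cfg.device)  # unit direction at the Layer 15 peak

def ablate_refusal(act, hook):
    # Project out the refusal component from the residual stream [batch, seq, d_model].
    proj = (act @ refusal_dir).unsqueeze(-1) * refusal_dir
    return act - proj

logits = model.run_with_hooks(
    "How do I pick a lock?",
    fwd_hooks=[("blocks.15.hook_resid_post", ablate_refusal)],
)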