This brief technical note reports additional findings from a broader research effort into LLM internal dynamics and optimization.
"Safety-aligned behaviors'' (such as refusing to respond to dangerous queries) have an observable localized geometric configuration rather than being entirely external filters.
Key observation: Explicitly harmful prompts (i.e., prompts clearly designed to cause harm) activate a compact set of neurons within the late transformer layers (layers 60–79). Our neuron-localization results in those layers appear consistent with recent discussions on LW regarding late-layer refusal signals in the Llama-3 model family. A minimal sketch of the kind of analysis involved follows.
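For concreteness, here is a hedged sketch of one way to surface such a compact neuron set: contrast mean MLP activations between harmful and benign prompts and rank neurons by the gap. This is an illustration only, not the pipeline from the note; the model name, hook point, prompt placeholders, and top-k cutoff are all assumptions (the layer band 60–79 is taken from the observation above).

```python
# Illustrative sketch: rank late-layer MLP neurons by the activation gap
# between harmful and benign prompt sets. Model name, prompts, and k are
# assumptions; the layer range comes from the note's key observation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-70B"  # assumed; any Llama-style model works
LATE_LAYERS = range(60, 80)            # the layer band discussed above

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def mean_mlp_acts(prompts):
    """Mean post-MLP activation per (layer, neuron), averaged over prompts."""
    cache, hooks = {}, []
    sums = {l: 0.0 for l in LATE_LAYERS}

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            # Average over batch and token positions -> [hidden_size]
            cache[layer_idx] = output.mean(dim=(0, 1)).float()
        return hook

    for l in LATE_LAYERS:
        hooks.append(model.model.layers[l].mlp.register_forward_hook(make_hook(l)))
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt").to(model.device)
            model(**ids)
            for l in LATE_LAYERS:
                sums[l] = sums[l] + cache[l]
    for h in hooks:
        h.remove()
    return {l: sums[l] / len(prompts) for l in LATE_LAYERS}

# Placeholder prompt sets -- substitute matched harmful/benign pairs.
harmful = ["<clearly harmful prompt 1>", "<clearly harmful prompt 2>"]
benign = ["<matched benign prompt 1>", "<matched benign prompt 2>"]

acts_h, acts_b = mean_mlp_acts(harmful), mean_mlp_acts(benign)
for l in LATE_LAYERS:
    gap = (acts_h[l] - acts_b[l]).abs()
    top = torch.topk(gap, k=5)
    print(f"layer {l}: top-gap neurons {top.indices.tolist()}")
```

A compact, consistently recurring set of top-gap indices across the harmful set is the kind of localized signature the note describes.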
This technical note is intended as a concrete, isolated data point for researchers in mechanistic interpretability and AI alignment who may find this spatial distribution relevant to their own structural analyses.
The note in its entirety can be found on the Open Science Framework (OSF) at: https://osf.io/8tdyq/overview
We have also included the analytical source code, the experimental execution logs, and the synthesized PyTorch safety mask (dionysus_safety_mask.pt). If you would like access to any of this material for verification, please let us know.
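For readers who obtain the mask file, here is a minimal inspection sketch. It assumes (the note does not specify) that dionysus_safety_mask.pt is a standard torch-serialized object, either a single tensor or a dict of per-layer tensors.

```python
# Hedged sketch: inspect the released safety mask. The file layout is an
# assumption -- a single tensor or a dict of per-layer tensors.
import torch

obj = torch.load("dionysus_safety_mask.pt", map_location="cpu")

if isinstance(obj, dict):
    for key, t in obj.items():
        print(f"{key}: shape={tuple(t.shape)}, dtype={t.dtype}, "
              f"nonzero={int(t.count_nonzero())}")
else:
    print(f"shape={tuple(obj.shape)}, dtype={obj.dtype}, "
          f"nonzero={int(obj.count_nonzero())}")
```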