Transformers Don't Need LayerNorm at Inference Time: Implications for Interpretability
This work was produced during MARS and SPAR. arXiv version available at https://arxiv.org/abs/2507.02559. Code on GitHub and models on HuggingFace.

TL;DR: we scaled LayerNorm (LN) removal by fine-tuning to GPT-2 XL:

* We improve training stability by regularizing the activation standard deviation across token positions, and improve the training code.
* ...
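The stability trick above — regularizing the activation standard deviation across token positions — could look something like the following minimal sketch. This is an illustrative reconstruction, not the authors' exact loss: the function name, the target (pulling each position's std toward the sequence-wide mean std), and the squared-error form are all assumptions on my part.

```python
import numpy as np

def std_regularizer(activations: np.ndarray) -> float:
    """Penalize variation of the activation standard deviation across
    token positions. `activations` has shape (batch, seq_len, d_model).

    Illustrative sketch only: the real training code may use a different
    target or norm for this penalty.
    """
    # Per-position std over the model dimension: shape (batch, seq_len)
    stds = activations.std(axis=-1)
    # Pull every position's std toward the sequence-wide mean std
    target = stds.mean(axis=-1, keepdims=True)
    return float(((stds - target) ** 2).mean())

rng = np.random.default_rng(0)
acts = rng.normal(size=(2, 8, 16))
penalty = std_regularizer(acts)  # added to the fine-tuning loss with some weight
```

In a training loop this penalty would be added to the language-modeling loss with a small coefficient, discouraging the per-position activation scale from drifting apart once LN is removed.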
Very interesting work, thanks for sharing your write-up (:
A few thoughts in no particular order:
1. I really like your attack taxonomy. I'd love to see that fleshed out more generally and also think multi-agent attacks should be included.
2. Mode collapse: I think this can be stressed even more. AFAIK all publicly available attack datasets (control tax, tylord) have an insanely high percentage of backdoors that are "if statements"; I wouldn't be surprised if that number is close to 100%, and I think it's a real problem. I also ran into this issue in my recent attack selection work: https://arxiv.org/abs/2602.04930
3. I'm a bit skeptical of filtering out attacks that did not achieve the...