x

LESSWRONG

LW

Joachim Schaeffer — LessWrong

Joachim Schaeffer

Joachim Schaeffer

Message

26

5

1y

Joachim Schaeffer

26

1y

Transformers Don't Need LayerNorm at Inference Time: Implications for Interpretability

by submarat, Joachim Schaeffer, Luca Baroni, galvsk, and StefanHex

This work was produced during MARS and SPAR. arXiv version available at https://arxiv.org/abs/2507.02559. Code on GitHub and models on HuggingFace. TL;DR we scaled LayerNorm (LN) removal by fine-tuning to GPT-2 XL: * We improve training stability by regularizing activation standard deviation across token positions & improve the training code. *...

Jul 23, 2025•31