Transformers Don't Need LayerNorm at Inference Time: Implications for Interpretability — LessWrong