From the "Conclusion and Future Directions" section of the Colab notebook:
I don't think we know very much about what exactly LayerNorm is doing in
full-scale models, but at least in smaller models, I believe we've found
evidence of transformers using LayerNorm to do nontrivial computation [1].
1. I think I vaguely recall something about this in either Neel Nanda's
"Rederiving Positional Encodings" work [https://t.co/91XYFVA77z], or Stefan
Heimersheim and Jett Janiak's work on a 4-layer attention-only transformer
[https://www.lesswrong.com/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only],
but I could totally be misremembering, sorry.
Great post! One question: isn't LayerNorm just normalizing a vector?
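
For reference, here's a minimal NumPy sketch of the standard LayerNorm computation (generic reference code, not taken from the notebook; the layer_norm helper and the identity gamma/beta parameters are only for illustration). It shows why "just normalizing a vector" slightly undersells it: the normalization is a nonlinear function of the whole vector, and it is followed by a learned elementwise scale and shift.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over the last axis: normalize, then a learned affine map."""
    mu = x.mean(axis=-1, keepdims=True)        # center
    var = x.var(axis=-1, keepdims=True)        # spread of this particular vector
    x_hat = (x - mu) / np.sqrt(var + eps)      # nonlinear: every output depends on the whole input
    return gamma * x_hat + beta                # learned per-dimension scale and shift

d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=d)
gamma, beta = np.ones(d), np.zeros(d)          # identity affine, just for illustration

y = layer_norm(x, gamma, beta)
print(y.mean(), y.std())                       # ~0 and ~1, as expected

# Doubling the input does not double the output: LayerNorm is scale-invariant,
# which is one concrete sense in which it does more than a linear rescale.
print(np.abs(layer_norm(2 * x, gamma, beta) - y).max())      # ~0 (same output)
print(np.abs(layer_norm(2 * x, gamma, beta) - 2 * y).max())  # clearly nonzero
```

Because the output depends only on the direction of the centered input, anything downstream of LayerNorm can rely on (or exploit) that nonlinearity, which is the kind of "nontrivial computation" gestured at above.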