
Tim Hanson


Comments

Residual stream norms grow exponentially over the forward pass
Tim Hanson · 2y

I wonder if this is related to vector packing and unpacking via cosine similarity: the activation norm increases so that layers can select from a large and variable number of semi-orthogonal bases. (This is closely related to your information-packing idea.)
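To make the packing picture concrete, here is a minimal sketch, not anything taken from the post: a hypothetical `codebook` of random unit vectors stands in for the semi-orthogonal bases, a few of them are summed into one "activation", and each is read back out by cosine similarity. The dimension, codebook size, and number of packed items are arbitrary assumptions. Note that the norm of the packed vector comes out near sqrt(k) for k packed items, so in this toy setting packing more directions does go with a larger norm.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction; random directions in high dim are nearly orthogonal."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def norm_of(v):
    return math.sqrt(sum(x * x for x in v))

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_of(a) * norm_of(b))

def pack(vectors):
    """Pack several vectors into one by elementwise summation."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) for i in range(dim)]

rng = random.Random(0)
dim = 256          # hypothetical activation width
n_codes = 64       # hypothetical number of candidate directions
n_packed = 8       # how many of them are actually written

codebook = [random_unit_vector(dim, rng) for _ in range(n_codes)]
packed = pack(codebook[:n_packed])

# "Unpack" by cosine similarity against every candidate direction.
present = [cos_sim(packed, c) for c in codebook[:n_packed]]
absent = [cos_sim(packed, c) for c in codebook[n_packed:]]

mean_present = sum(present) / len(present)
mean_absent = sum(absent) / len(absent)
packed_norm = norm_of(packed)

print(f"norm of packed vector: {packed_norm:.2f} (sqrt(k) = {math.sqrt(n_packed):.2f})")
print(f"mean cos_sim, packed directions:   {mean_present:.3f}")
print(f"mean cos_sim, unpacked directions: {mean_absent:.3f}")
```

Packed directions read out with a cosine similarity near 1/sqrt(k), while directions that were not packed read out near zero, with noise that shrinks as the dimension grows.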

An easy experimental manipulation to test this would be to increase the number of heads, which decreases the dimensionality over which attention computes cos_sim and so should increase the per-layer norm growth. (Alas, this will also change the loss, so it is not a perfect manipulation.)
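As a rough illustration of why the head count matters here, the sketch below assumes the model width stays fixed (a hypothetical `D_MODEL = 512`), so that each head works in `D_MODEL // n_heads` dimensions. It then estimates how large the cosine similarity between two unrelated random vectors typically is at that dimension. Real attention scores are scaled dot products rather than cosine similarities, so this follows the comment's framing rather than any particular implementation, and it says nothing about whether norm growth actually changes; that is the untested part.

```python
import math
import random

def random_cos_sim(dim, rng):
    """Cosine similarity between two independent Gaussian random vectors."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rms_cos_sim(dim, n_pairs, rng):
    """Root-mean-square cosine similarity over n_pairs random pairs."""
    total = sum(random_cos_sim(dim, rng) ** 2 for _ in range(n_pairs))
    return math.sqrt(total / n_pairs)

D_MODEL = 512  # assumption: model width is held fixed while heads vary
rng = random.Random(0)

results = {}
for n_heads in (4, 8, 16, 32):
    d_head = D_MODEL // n_heads
    rms = rms_cos_sim(d_head, 400, rng)
    results[n_heads] = (d_head, rms)
    print(f"n_heads={n_heads:2d}  d_head={d_head:3d}  "
          f"rms cos_sim={rms:.3f}  1/sqrt(d_head)={1 / math.sqrt(d_head):.3f}")
```

The spurious similarity between unrelated vectors scales like 1/sqrt(d_head), so more heads at fixed width means noisier cosine-based selection within each head.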
