Hdot
Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream
Hdot · 1y

Interesting find! Would applying layer normalisation to the activations along the channel dimension resolve this? That way we could keep adaptive learning rates while smoothing the distribution of activations and weights.
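To make the question concrete, here is a minimal sketch of what normalising along channels computes for a single activation vector. It is a toy illustration, not the post's setup: the `layer_norm` helper, the `acts` values, and the choice of an outlier at index 2 are all hypothetical, and the learned gain and bias of a real layer norm are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one activation vector across its channels (no learned gain or bias)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Hypothetical residual-stream vector with one outlier channel at index 2.
acts = [0.1, -0.2, 8.0, 0.05]
normed = layer_norm(acts)

# Index of the largest-magnitude channel before and after normalisation.
outlier_before = max(range(len(acts)), key=lambda i: abs(acts[i]))
outlier_after = max(range(len(normed)), key=lambda i: abs(normed[i]))
```

In this toy case the vector ends up with zero mean and roughly unit variance, but the same channel is still the largest afterwards, because the operation shifts and rescales the vector as a whole. Whether that carries over to a trained model is the empirical question.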
