LESSWRONG
LW

2245
skosch
4010
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
Re-Examining LayerNorm
skosch3y50

That was my first thought as well. As far as I know, the most popular simple model used for this in the neuro literature, divisive normalization, uses similar but not quite identical formula. Different authors use different variations, but it's something shaped like

zi=yαiβα+∑jκijyαj

where yi is the unit's activation before lateral inhibition, β adds a shift/bias, κij are the respective inhibition coefficients, and the exponent α modulates the sharpness of the sigmoid (2 is a typical value). Here's an interactive desmos plot with just a single self-inhibiting unit. This function is asymmetric in the way you describe, if I understand you correctly, but to my knowledge it's never gained any popularity outside of its niche. The ML community seems to much prefer Softmax, LayerNorm et al. and I'm curious if anyone knows if there's a deep technical reason for these different choices.

Reply
No wikitag contributions to display.
No posts to display.