Re-Examining LayerNorm

Awesome visualizations. Thanks for doing this.

It occurred to me that LayerNorm seems to be implementing something like lateral inhibition, using extreme values of one neuron to affect the activations of other neurons. In biological brains, lateral inhibition plays a key role in many computations, enabling things like sparse coding and attention. Of course, in those systems, input goes through every neuron's own nonlinear activation function prior to having lateral inhibition applied.

I would be interested in seeing the effect of applying a nonlinearity (such as ReLU, GELU, ELU, etc.) prior to LayerNorm in an artificial neural network. My guess is that it would help prevent neurons with strong negative pre-activations from messing with the output of more positively activated neurons, as happens with pure LayerNorm. Of course, that would limit things to the first orthant for ReLU, although not for GELU or ELU. Not sure how that would affect stretching and folding operations, though.
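For what it's worth, the effect is easy to poke at with a toy example. The snippet below is only a sketch: it uses LayerNorm without learned gain or bias, made-up four-unit inputs, and ReLU as the pre-nonlinearity, none of which come from the post.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gain/bias: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu(x):
    return [max(0.0, v) for v in x]

# Hypothetical pre-activations that differ only in how negative the last unit is.
mild = [1.0, 2.0, 3.0, -1.0]
extreme = [1.0, 2.0, 3.0, -50.0]

plain_mild = layer_norm(mild)
plain_extreme = layer_norm(extreme)
relu_mild = layer_norm(relu(mild))
relu_extreme = layer_norm(relu(extreme))

def fmt(values):
    return [round(v, 3) for v in values]

# First three (positive) units only, to see how the fourth unit affects them.
print("LayerNorm only:  ", fmt(plain_mild[:3]), "vs", fmt(plain_extreme[:3]))
print("ReLU + LayerNorm:", fmt(relu_mild[:3]), "vs", fmt(relu_extreme[:3]))
```

With plain LayerNorm, the very negative unit inflates the variance and squashes the three positive units close together. With ReLU first, both inputs become `[1, 2, 3, 0]`, so the outputs are identical. The clipped unit still contributes a zero to the mean and variance, so it is not ignored entirely.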

By the way, have you looked at how this would affect processing in a CNN, normalizing each pixel of a given layer across all feature channels? I think I've tried using LayerNorm in such a context before, but I don't recall it turning out too well. Maybe I could look into that again sometime.
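To be concrete about what I mean by normalizing each pixel across feature channels, here is a rough sketch in plain Python. The `fmap[c][h][w]` layout, the function name `channel_layer_norm`, and the tiny feature map are all assumptions for illustration, and there is no learned gain or bias.

```python
import math

def channel_layer_norm(fmap, eps=1e-5):
    """Normalize each pixel across channels; fmap is indexed as fmap[c][h][w]."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for h in range(H):
        for w in range(W):
            # Collect this pixel's activation in every channel.
            pixel = [fmap[c][h][w] for c in range(C)]
            mean = sum(pixel) / C
            var = sum((v - mean) ** 2 for v in pixel) / C
            scale = 1.0 / math.sqrt(var + eps)
            for c in range(C):
                out[c][h][w] = (pixel[c] - mean) * scale
    return out

# Hypothetical feature map: 3 channels, 2x2 pixels.
fmap = [
    [[1.0, 0.0], [2.0, 5.0]],
    [[3.0, 1.0], [2.5, -1.0]],
    [[-2.0, 4.0], [0.0, 0.5]],
]
normed = channel_layer_norm(fmap)
```

Each spatial location is treated as its own vector of channel activations, so the statistics are never shared between pixels.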

skosch (3y)

That was my first thought as well. As far as I know, the most popular simple model used for this in the neuro literature, divisive normalization, uses a similar but not quite identical formula. Different authors use different variations, but it's something shaped like

$$z_{i} = \frac{y_{i}^{α}}{β^{α} + \sum_{j} κ_{i j} \, y_{j}^{α}}$$

where $z_{i}$ is the unit's activation after lateral inhibition, $y_{i}$ is the unit's activation before lateral inhibition, $β$ adds a shift/bias, $κ_{i j}$ are the respective inhibition coefficients, and the exponent $α$ modulates the sharpness of the sigmoid (2 is a typical value). Here's an interactive Desmos plot with just a single self-inhibiting unit. This function is asymmetric in the way you describe, if I understand you correctly, but to my knowledge it's never gained any popularity outside of its niche. The ML community seems to much prefer Softmax, LayerNorm et al., and I'm curious whether anyone knows if there's a deep technical reason for these different choices.
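In case it helps to play with, here is a minimal sketch of one such variation. The exact form is an assumption: output = `y_i**alpha / (beta**alpha + sum_j kappa[i][j] * y_j**alpha)`, with inputs taken to be non-negative (firing-rate-like). The name `divisive_norm` and the default values are made up for illustration.

```python
def divisive_norm(y, kappa, alpha=2.0, beta=1.0):
    """Assumed form: y_i**alpha / (beta**alpha + sum_j kappa[i][j] * y_j**alpha).

    y is a list of non-negative activations; kappa[i][j] is how strongly
    unit j inhibits unit i.
    """
    powered = [v ** alpha for v in y]
    return [
        powered[i] / (beta ** alpha + sum(k * p for k, p in zip(kappa[i], powered)))
        for i in range(len(y))
    ]

# A single self-inhibiting unit: a sigmoid-like curve in y that saturates below 1.
single = [divisive_norm([y], [[1.0]])[0] for y in (0.0, 1.0, 10.0)]

# Two units inhibiting each other (and themselves) with equal strength.
pair = divisive_norm([1.0, 3.0], [[1.0, 1.0], [1.0, 1.0]])

print(single)
print(pair)
```

In this form the inhibition coefficients are a full matrix, one entry per pair of units.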

Charlie Steiner (3y)

I think in feed-forward networks (i.e. ones that don't re-use the same neuron multiple times), having to learn all the inhibition coefficients is too much to ask. RNNs have gone in and out of fashion, and maybe they could use something like this (maybe scaled down a little), but you could achieve similar inhibition effects with multiple different architectures. LSTMs already have multiplication built into them, but in a different way. There is not a particularly deep technical reason for the different choices.

Adam Jermyn (3y)

This is really interesting!

One question: do we need layer norm in networks? Can we get by with something simpler? My immediate reaction here is “holy cow layer norm is geometrically complicated!” followed by a desire to not use it in networks I’m hoping to interpret.

Dennis Akar (3y)

colab notebook
this interactive notebook
check out the notebook
notebook

First link is not like the others.

Eric Winsor (3y)

Thanks for the catch!

Aryan Bhatt (3y)

Sorry for the mundane comment, but in the "Isolating the Nonlinearity" section of the colab notebook, you say

Note that a vector in $n$ dimensions with mean 0 has variance 1 if and only if it has length $\frac{1}{\sqrt{n}}$

I think you might've meant to say $\sqrt{n}$ there instead of $\frac{1}{\sqrt{n}}$, but please do correct me if I'm wrong!
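The reasoning, assuming the population variance (dividing by $n$, as LayerNorm does): for a mean-0 vector the variance is the sum of squares divided by $n$, so variance 1 means the squared length is $n$. Here is a quick numerical check with an arbitrary made-up vector:

```python
import math

# Arbitrary example vector; any vector with non-zero variance works.
x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0]
n = len(x)

# Center to mean 0, then scale to (population) variance 1.
mean = sum(x) / n
centered = [v - mean for v in x]
std = math.sqrt(sum(v * v for v in centered) / n)
normed = [v / std for v in centered]

length = math.sqrt(sum(v * v for v in normed))
print(length, math.sqrt(n))  # the two values should agree
```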

Danylo Hlynskyi (2y)

you can solve MNIST with a LayerNorm MLP, for example

Is there a paper for this, or is this an unpublished common-knowledge result?

Algon (3y)

This is great. Was there a reason why you didn't create corresponding visualisations of the layer activations for the network whenever it plateaued in loss?

nulldippindots (3y)

Great post! One question: isn't LayerNorm just normalizing a vector?

Aryan Bhatt (3y)

From the "Conclusion and Future Directions" section of the colab notebook:

Most of all, we cannot handwave away LayerNorm as "just doing normalization"; this would be analogous to describing ReLU as "just making things nonnegative".

I don't think we know too much about what exactly LayerNorm is doing in full-scale models, but at least in smaller models, I believe we've found evidence of transformers using LayerNorm to do nontrivial computations^[1].