Swap and Scale

Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the work it cites, The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021], seem to be several years ahead on this topic.

I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance.

Opinion:
Symmetries in NNs are a mainstream ML research area with lots of papers, and I don't think doing research "from first principles" here will be productive. This also holds for many other alignment projects.

However, I do think it makes sense as an alignment-positive research direction in general.

Trivial Symmetries

There are two types of symmetries that occur for any point in parameter space. I refer to these as the trivial symmetries. I’m unsure who first pointed these out for ReLU neural networks, but equivalent symmetries for networks using tanh activation functions are mentioned in Bishop’s 2006 “Pattern Recognition and Machine Learning”, which cites “Functionally Equivalent Feedforward Neural Networks” (Kurkova and Kainen, 1994) as proving the case for networks in general. By the standards of the machine learning literature this is ancient, and I suspect the authors haven't even heard of LessWrong.

Swap Symmetry

For any hidden layer, we can permute the weights and biases in such a way that the output to the next layer remains unchanged. Specifically, we apply one permutation to the layer's incoming weights and biases, which reorders its hidden units, then apply the inverse permutation to the weights in the next layer, so that each outgoing weight still multiplies the activation of the same unit. This completely undoes the effect of the first permutation.

In the simplest example where our layer only has two nodes, we can imagine swapping two neurons around whilst retaining their original connections.

It may also be helpful to think of this symmetry as simply relabelling the nodes in a single layer. For every hidden layer, there are N! such symmetries, where N is the number of nodes in the layer.
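The relabelling can be checked numerically. The following is a minimal sketch in plain Python, assuming a network with a single hidden ReLU layer and a linear output; the names `forward` and `permute_hidden`, the layer sizes, and the random weights are all illustrative choices, not something taken from the papers above. Reordering the rows of the incoming weights and the biases, and reordering the columns of the outgoing weights with the same index list, corresponds to the permutation and its inverse in matrix form.

```python
import random

def relu(z):
    return max(0.0, z)

def forward(W1, b1, W2, b2, x):
    # One hidden ReLU layer followed by a linear output layer.
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

def permute_hidden(W1, b1, W2, perm):
    # New hidden unit i is old hidden unit perm[i].
    W1p = [W1[j] for j in perm]                    # rows of the incoming weights
    b1p = [b1[j] for j in perm]                    # biases, same reordering
    W2p = [[row[j] for j in perm] for row in W2]   # columns of the outgoing weights
    return W1p, b1p, W2p

rng = random.Random(0)  # fixed seed so the run is repeatable
n_in, n_hidden, n_out = 3, 4, 2
W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [rng.uniform(-1, 1) for _ in range(n_out)]
x = [rng.uniform(-1, 1) for _ in range(n_in)]

perm = [2, 0, 3, 1]
W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm)

y = forward(W1, b1, W2, b2, x)
yp = forward(W1p, b1p, W2p, b2, x)
max_diff = max(abs(a - b) for a, b in zip(y, yp))
print(max_diff)  # zero up to floating-point rounding
```

The parameters differ between the two networks, but the outputs agree on every input, since each hidden unit keeps its own incoming and outgoing connections.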

Scaling Symmetry

The second symmetry occurs when we multiply the weights leading into a node by some positive real number α, multiply the bias by α and multiply the weights leading out of the node by 1 over α. This leaves the network's output unchanged, because ReLU is positively homogeneous: ReLU(αz) = α·ReLU(z) for any α > 0. The identity fails for negative α, so the scaling factor must be positive rather than merely non-zero.
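A small numerical check of the scaling symmetry, again as a sketch in plain Python rather than anyone's reference implementation. The helper `scale_unit` and the specific weights are made up for illustration. It rescales one hidden unit with a positive α and confirms the output is unchanged, then repeats with a negative α to show that the output changes, since ReLU(αz) = α·ReLU(z) only holds for α > 0.

```python
def relu(z):
    return max(0.0, z)

def forward(W1, b1, W2, b2, x):
    # One hidden ReLU layer followed by a linear output layer.
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

def scale_unit(W1, b1, W2, k, alpha):
    # Multiply hidden unit k's incoming weights and bias by alpha,
    # and divide its outgoing weights by alpha.
    W1s = [[w * alpha for w in row] if i == k else list(row)
           for i, row in enumerate(W1)]
    b1s = [b * alpha if i == k else b for i, b in enumerate(b1)]
    W2s = [[w / alpha if j == k else w for j, w in enumerate(row)]
           for row in W2]
    return W1s, b1s, W2s

# Illustrative weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0, -2.0, 0.5]]
b2 = [0.05]
x = [1.0, 2.0]

y = forward(W1, b1, W2, b2, x)

# Positive alpha: the output is preserved.
y_pos = forward(*scale_unit(W1, b1, W2, k=1, alpha=4.0), b2, x)

# Negative alpha: unit 1 had a positive pre-activation, which now gets
# clipped to zero by ReLU, so the output changes.
y_neg = forward(*scale_unit(W1, b1, W2, k=1, alpha=-4.0), b2, x)

print(abs(y_pos[0] - y[0]), abs(y_neg[0] - y[0]))
```

Unlike the swap symmetry, which gives a finite set of equivalent parameter settings per layer, this one gives a continuous family, one positive factor per hidden unit.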
