Overview:
This post builds on Circuits in Superposition 2, using the same terminology.
I focus on the $z=1$ case, meaning that exactly one circuit is active on each forward pass. This restriction simplifies the setting substantially and allows a construction with zero error, with $T=\frac{D^2}{d^2}$ circuits, $L+3$ layers, and width $D\left(1+\frac{2}{d}\right)$. While the construction does not directly generalise to larger $z$, the strategy for mitigating error should still be relevant. I think the ideas behind the construction could be useful for larger $z$, and the construction itself is quite neat.
Ping-Pong Trick for $z=1$:
[Figure: A circuit is specified by an ordered pair of memory blocks (the red squares). There are $\frac{D}{d}$ memory blocks, so $\frac{D^2}{d^2}$ circuits can be coded for. The circuit is computed by "bouncing" between these two memory blocks. White squares represent neurons with no activation, and black squares represent neurons with any amount of activation. Green-dashed arrows represent the bias specific to block $i$. Here $D=64$, $d=16$.]
This is basically the same trick as ping-pong buffers.
We divide the layer width $D$ into $\frac{D}{d}$ contiguous memory blocks of size $d$.
We start with an input consisting of a vector $x \in \mathbb{R}^d$, together with one-hot encodings of a pair of memory blocks $(i, j)$.
We initially load the input $x$ into memory block $i$.
Using a linear map applied to the one-hot encoding of memory block $j$, we add a massive negative bias to all blocks in the next layer except memory block $j$. This ensures that only memory block $j$ can have neurons with non-zero activations in the following layer.
Additionally, using a linear map applied to the one-hot encoding of $i$, we can freely set a size-$d$ bias $b_1^{(i,j)}$ for the next layer's block $j$ as a function of $i$ (see the green arrows in the figure above).
We are also free to pick the $d \times d$ linear map $W_1^{(i,j)}$ between block $i$ in the first layer and block $j$ in the second layer.
The end result is that when we run the network, all blocks apart from block $j$ in the second layer have 0 activation (because of the negative bias added by the one-hot encoding of $j$), while block $j$ has neuron activations corresponding to $\mathrm{ReLU}(W_1^{(i,j)}x + b_1^{(i,j)})$ (see the code sketch after this walkthrough).
Now we perform the same trick in reverse, this time bouncing from block $j$ back to block $i$, so that the third layer has $\mathrm{ReLU}\!\left(W_2^{(j,i)}\,\mathrm{ReLU}(W_1^{(i,j)}x + b_1^{(i,j)}) + b_2^{(j,i)}\right)$ in block $i$.
We continue this process:
Layer 1: i -> j
Layer 2: j -> i
Layer 3: i -> j
…
Layer L: (i->j) or (j->i) depending on parity
For each transition between blocks, we can freely specify a layer of a width-$d$ neural network.
We transition $L$ times, and require 3 layers to load and unload the inputs from the appropriate memory blocks, so we consume $L+3$ layers total.
So we can use width $D\left(1+\frac{2}{d}\right)$ and $L+3$ layers to encode $T=\left(\frac{D}{d}\right)^2$ circuits of width $d$ and depth $L$.
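To make the masking mechanism concrete, here is a minimal NumPy sketch of a single $i \to j$ transition. The layout of the residual stream (the $D$ memory neurons followed by the two one-hot encodings) and the size of the large negative bias are my own assumptions for illustration, not taken from the post or the Colab script.

```python
import numpy as np

rng = np.random.default_rng(0)

D, d = 64, 16            # memory width and block size, as in the figure
B = D // d               # number of memory blocks
M = 1e6                  # "massive" negative bias used to mask inactive blocks
width = D + 2 * B        # assumed layout: memory neurons | one-hot of i | one-hot of j

# One freely chosen width-d layer per ordered pair of blocks (m, n).
W = rng.normal(size=(B, B, d, d))
b = rng.normal(size=(B, B, d))

W_host = np.zeros((width, width))   # host weights for one i -> j transition

for m in range(B):
    for n in range(B):
        # Memory -> memory: the (m -> n) block of the host matrix stores W[m, n].
        W_host[n * d:(n + 1) * d, m * d:(m + 1) * d] = W[m, n]
        # One-hot of i -> memory: column m writes the circuit-specific bias b[m, n]
        # into every destination block n (only block j will survive the mask).
        W_host[n * d:(n + 1) * d, D + m] = b[m, n]
        # One-hot of j -> memory: column m adds -M to every block except block m.
        if n != m:
            W_host[n * d:(n + 1) * d, D + B + m] = -M

# Carry both one-hot encodings forward unchanged (their entries are 0 or 1, so ReLU is a no-op).
W_host[D:, D:] = np.eye(2 * B)

def transition(h):
    """One host layer: ReLU(W_host @ h). Host biases are all zero in this sketch."""
    return np.maximum(W_host @ h, 0.0)

# Check one circuit: load a (non-negative, post-ReLU-format) x into block i,
# set the one-hot encodings of i and j, and run the layer.
i, j = 2, 3
x = np.maximum(rng.normal(size=d), 0.0)
h = np.zeros(width)
h[i * d:(i + 1) * d] = x
h[D + i] = 1.0
h[D + B + j] = 1.0

out = transition(h)
expected = np.maximum(W[i, j] @ x + b[i, j], 0.0)   # ReLU(W1^(i,j) x + b1^(i,j))
assert np.allclose(out[j * d:(j + 1) * d], expected)
other = np.ones(D, dtype=bool)
other[j * d:(j + 1) * d] = False
assert not out[:D][other].any()                     # every other memory block is exactly zero
```

Since the output (activations in a single block, plus the two one-hot encodings) has the same format as the input, the same construction with the roles of the two one-hot encodings swapped implements the $j \to i$ transition, and the layers chain as in the list above.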
Note that according to a naive parameter-counting argument, assuming injectivity of the map from parameters of a neural network to behaviours, we should at most be able to fit $\frac{D^2}{d^2}\cdot\frac{(L+3)\left(1+\frac{2}{d}\right)^2}{L}$ circuits into such a network. So $\frac{D^2}{d^2}$ is not bad, especially for large-ish $d$, where we can pay off the cost of the one-hot encoding.
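As a quick sanity check of this comparison, here is the arithmetic at the sizes used in the Testing section below ($D=1024$, $d=8$, $L=1$). The accounting (weight matrices only, biases ignored) is my own rough version of the bound.

```python
# Rough parameter counting at D=1024, d=8, L=1 (weights only, biases ignored -- my own accounting).
D, d, L = 1024, 8, 1
encoded = (D // d) ** 2                          # circuits the construction stores: 16384
host_params = (L + 3) * (D * (1 + 2 / d)) ** 2   # (L+3) layers of width D(1+2/d)
per_circuit = L * d ** 2                         # L layers of d x d weights
naive_bound = host_params / per_circuit          # = (D^2/d^2) * (L+3)(1+2/d)^2 / L
print(encoded, int(naive_bound))                 # 16384 102400
```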
Testing:
I have tested a similar construction up to $T=16384$, $D=1024$, $d=8$, using this Google Colab script (so exactly encoding 16384 randomly initialized circuits of width 8, given a width-$(1024+256)$ network). I have only tested one-layer networks, but since the output format matches the input format, this is sufficient to validate the construction.
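As a toy, self-contained analogue of that test (my own re-implementation at much smaller sizes, not the linked Colab script), the sketch below builds the host weights for one transition and checks that every one of the $(D/d)^2$ randomly initialized one-layer circuits is reproduced, up to floating-point rounding.

```python
import numpy as np

def build_and_check(D=64, d=8, M=1e6, seed=0):
    """Encode (D/d)^2 random one-layer circuits in one host transition and verify them all."""
    rng = np.random.default_rng(seed)
    B = D // d
    width = D + 2 * B
    W = rng.normal(size=(B, B, d, d))            # one d x d layer per ordered block pair
    b = rng.normal(size=(B, B, d))
    W_host = np.zeros((width, width))
    for m in range(B):
        for n in range(B):
            W_host[n*d:(n+1)*d, m*d:(m+1)*d] = W[m, n]   # memory -> memory
            W_host[n*d:(n+1)*d, D + m] = b[m, n]         # one-hot(i) -> per-circuit bias
            if n != m:
                W_host[n*d:(n+1)*d, D + B + m] = -M      # one-hot(j) -> mask other blocks
    W_host[D:, D:] = np.eye(2 * B)                       # carry the one-hot encodings forward

    for i in range(B):
        for j in range(B):
            x = np.maximum(rng.normal(size=d), 0.0)      # input in post-ReLU format
            h = np.zeros(width)
            h[i*d:(i+1)*d] = x
            h[D + i] = 1.0
            h[D + B + j] = 1.0
            out = np.maximum(W_host @ h, 0.0)
            assert np.allclose(out[j*d:(j+1)*d],
                               np.maximum(W[i, j] @ x + b[i, j], 0.0))
    return B * B

print(build_and_check(), "circuits reproduced")   # 64 circuits at these toy sizes
```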
Why focus on zero-error solutions?:
I'm pretty skeptical of computation-in-superposition solutions that involve any noise in intermediate layers whatsoever, because the Lipschitz constants of neural networks are terrible, and so the constant factors involved in the asymptotics suffer as well. I want to have concrete numbers, not just asymptotics, and they seem hard to obtain when allowing for noise in intermediate layers.[1]
My way of avoiding having to think about Lipschitz/high-dimensional-probability stuff is to work on finding solutions which work perfectly with probability $p$, and otherwise fail catastrophically.
[1] Although it'd be nice if there were a way to make use of Lipschitz-constrained neural networks here.