Formal write-up
Google Colab Implementation
Overview:
This post builds on Circuits in Superposition 2, using the same terminology.
I focus on the z=1 case, meaning that exactly one circuit is active on each forward pass. This restriction simplifies the setting substantially and allows a construction with zero error, with $T = D^2/d^2$. While the construction does not directly generalise to larger z, I think the error-mitigation strategy behind it should still be relevant, and the construction itself is quite neat.
The problem: persistent interference
The dominant source of error in Circuits in Superposition 2 can be summarised as follows:
Signal from the active circuit bleeds into inactive circuits, and this signal then enters back into the active circuit as noise in the next layer.
The key issue is not that inactive circuits briefly receive nonzero activations, but that these activations persist across layers and are allowed to feed back into the active computation. Once this happens repeatedly, noise accumulates.
The goal of this post is to eliminate this failure mode entirely in the z=1 setting.
High-level idea / TLDR:
The construction given in this post enforces a simple invariant:
At layer 2i of the larger network, the active circuit’s state at layer i of its small network is embedded in one of $D/d$ fixed orthogonal d-dimensional subspaces of the large network.
Each of the T circuits is assigned one of these d-dimensional subspaces to store its state in when it is active (multiple circuits may share the same subspace).
The subspace in which the active circuit’s state lives is provided as part of the input via a one-hot encoding. Using this, we can explicitly remove any signal outside the active subspace with an additional error-correction layer before the next layer is computed. This prevents inactive circuits from ever influencing the active circuit across layers.
Construction:
Memory blocks:
Let the large network have D neurons per layer. We partition these neurons into $D/d$ contiguous memory blocks, each consisting of d neurons.
We assign each of the T small circuits to a memory block via a map
$$f : [T] \to [D/d].$$
If a large-network neuron is associated with neuron j of circuit i, then it writes its output to position j of the f(i)-th memory block.
A single large-network neuron may be associated with multiple small-circuit neurons, and thus may write to multiple memory blocks. However, we impose the following constraint:
Any two circuits associated with the same large-network neuron must be assigned to different memory blocks.
This constraint is crucial for enabling exact error correction.
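To make the layout concrete, here is a minimal numpy sketch of the block structure and the assignment map. The names (num_blocks, block_slice) and the particular choice of f are illustrative, not part of the formal write-up:

```python
import numpy as np

D, d = 1024, 8              # large-network width, small-circuit width
num_blocks = D // d         # D/d memory blocks of d neurons each
T = num_blocks ** 2         # number of small circuits the construction supports

# Assign each circuit to a memory block. Any assignment works as long as two
# circuits sharing a large-network neuron never share a block; the even split
# below is purely illustrative.
f = np.arange(T) % num_blocks

def block_slice(block_idx: int) -> slice:
    """Positions of the block_idx-th memory block inside a width-D layer."""
    return slice(block_idx * d, (block_idx + 1) * d)
```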
Input:
The input consists of a vector $x \in \mathbb{R}^d$ placed in the f(j)-th memory block, where j is the active circuit, together with a one-hot encoding of f(j) and a one-hot pointer to a neuron set (defined later). In total we need $D(1+2/d)$ neurons for the input.
[The fact that the input x is placed in the correct memory block is just for convenience: given the one-hot encoding of f(j), as long as $x \in \mathbb{R}^d$ is stored in a known location, it can be moved to the f(j)-th memory block using a single additional layer.]
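A sketch of how such an input could be assembled under the block layout above. The index layout and the helper neuron_set_of are assumed conventions for this sketch (the neuron-set pointer is defined properly below):

```python
def neuron_set_of(j: int) -> int:
    # Which of the D/d reusable neuron sets circuit j uses within its block
    # (see "Selecting the active neuron set" below); with f as above, the
    # circuits sharing a block are enumerated by j // num_blocks.
    return j // num_blocks

def build_input(x: np.ndarray, j: int) -> np.ndarray:
    """Assemble the width-D(1+2/d) input when circuit j is active.

    Assumed layout (a convention for this sketch, not fixed by the post):
      [0, D)              memory blocks, with x written into block f(j)
      [D, D + D/d)        one-hot encoding of f(j)
      [D + D/d, D + 2D/d) one-hot pointer to the active neuron set
    """
    h = np.zeros(D + 2 * num_blocks)
    h[block_slice(f[j])] = x                    # active circuit's state
    h[D + f[j]] = 1.0                           # which memory block is active
    h[D + num_blocks + neuron_set_of(j)] = 1.0  # which neuron set is active
    return h
```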
Error correction via block clearing:
Assume that we are given f(j) one-hot encoded as part of the input, where j is the (unique) active circuit. This requires only $D/d$ additional neurons.
After each layer of computation, we insert an error-correction layer that zeroes out all memory blocks except the f(j)-th memory block (add a large negative bias followed by a ReLU).
As a result:
Inactive circuits may temporarily receive signal.
However, all such signal is erased before the next layer by an error-correction layer.
Inactive circuits never feed information back into the active circuit.
This entirely removes the dominant source of error from Circuits in Superposition 2.
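A minimal sketch of the block-clearing layer, assuming the input layout from the earlier sketches and that all legitimate activations are non-negative and bounded by the constant BIG:

```python
BIG = 1e6   # any constant exceeding the largest possible activation

def clear_inactive_blocks(h: np.ndarray) -> np.ndarray:
    """Error-correction layer: zero every memory block except block f(j).

    Implements the trick above: subtract a large bias from every memory
    position, add it back on the active block by reading the one-hot encoding
    of f(j), then apply a ReLU. The active block passes through unchanged.
    """
    out = h.copy()
    block_onehot = h[D:D + num_blocks]        # one-hot encoding of f(j)
    mask = np.repeat(block_onehot, d)         # 1 on the active block, 0 elsewhere
    out[:D] = np.maximum(h[:D] - BIG * (1.0 - mask), 0.0)
    return out
```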
Bias terms:
We also learn a bias vector $b_i \in \mathbb{R}^D$ for each memory block i.
This bias is implemented as a linear map applied to the one-hot encoding of f(j), where j is the active circuit.
Because circuits assigned to the same memory block occupy disjoint neuron sets, a single vector $b_i$ can hold all of their bias terms without conflict, so the bias of the active circuit is matched exactly.
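A sketch of the bias lookup as a linear map on the one-hot encoding of f(j); the matrix B and its shape are illustrative (in practice the bias vectors are learned):

```python
# Row i of B is b_i, packing the bias terms of every circuit assigned to
# block i. Disjoint neuron sets mean the packed entries never collide.
B = np.zeros((num_blocks, D))

def bias_for_active_block(h: np.ndarray) -> np.ndarray:
    block_onehot = h[D:D + num_blocks]
    return block_onehot @ B    # equals b_{f(j)} when exactly one entry is hot
```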
Selecting the active neuron set:
We additionally need to one-hot encode a pointer to the set of neurons associated with the active circuit j.
Since z=1, there is no need to impose further constraints on the neuron sets beyond the memory-block condition. In particular, each memory block can reuse the same neuron sets for its circuits (for example, $D/d$ contiguous sets of d neurons).
Thus, the pointer only needs to be of size $D/d$.
Using this pointer, we zero out all neurons except those corresponding to the active circuit (via a large negative bias). Given that the bias is set correctly, the output on the d neurons associated with the active circuit is exactly the output of the next layer of the small circuit. The memory-block clearing step then ensures that only the output written to block f(j) is retained.
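Putting the pieces together, one simulated small-network layer (a computation layer followed by the error-correction layer) might look like the sketch below. The split into W_compute and W_write, and the exact routing, are illustrative simplifications of the scheme described above rather than the formal construction:

```python
def simulate_one_small_layer(h: np.ndarray,
                             W_compute: np.ndarray,
                             W_write: np.ndarray) -> np.ndarray:
    """Advance the active circuit by one small-network layer (two large layers).

    Computation layer: W_compute reads the memory blocks and produces
    preactivations for all D neurons; the pointer silences every neuron set
    except the active circuit's; the block-indexed bias and a ReLU then give
    the small circuit's next-layer outputs on those d neurons.
    Error-correction layer: W_write routes each neuron's output to its
    position(s) in the memory blocks it writes to, and the block-clearing
    step erases everything outside block f(j).
    """
    set_onehot = h[D + num_blocks:D + 2 * num_blocks]   # pointer to neuron set
    neuron_mask = np.repeat(set_onehot, d)              # the d active neurons
    pre = W_compute @ h[:D] + bias_for_active_block(h)
    acts = np.maximum(pre - BIG * (1.0 - neuron_mask), 0.0)
    out = h.copy()
    out[:D] = W_write @ acts                            # neurons write to their blocks
    return clear_inactive_blocks(out)
```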
Result:
The final construction has:
$D(1+2/d)$ large-network width,
2L layers,
and it can exactly encode
$T = D^2/d^2$ ($D/d$ disjoint circuits per memory block, and $D/d$ memory blocks)
small circuits, each with:
width d,
L layers.
The simulation error in the z=1 case is exactly zero.
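As a quick sanity check on these counts, using the parameters from the test configuration in the next section:

```python
D, d, L = 1024, 8, 1
num_blocks = D // d            # 128 memory blocks
T = num_blocks ** 2            # 16384 circuits, i.e. D^2/d^2
width = D + 2 * num_blocks     # 1280 = 1024 + 256, i.e. D(1 + 2/d)
depth = 2 * L                  # computation + error-correction layer per small layer
print(T, width, depth)         # -> 16384 1280 2
```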
Testing:
I have tested this construction up to T=16384, D=1024, d=8, using this Google Colab script. (So exactly encoding 16384 randomly initialised circuits of width 8, given a (1024+256)-width network.) I have only tested one-layer networks, but since the output format matches the input format, this is sufficient to validate the construction.
Is this trivial?
In hindsight, this construction feels close to a triviality. However, it was not obvious to me prior to working it out.
Being given an encoding of the active circuit simplifies the problem substantially. That said, there are realistic scenarios in LLMs where something similar occurs (for example, two-token names, where we want to apply a circuit indexed by the referenced entity). You could use the first token of a name as the memory block, and the second token as the neuron set pointer, for example.
One way to think about the memory block encoding is as specifying a coarse circuit family, with the neuron-set pointer selecting the specific circuit within that family. It is plausible that real models implement something like this implicitly.
Finally, the z=1 assumption is doing a lot of work: it removes almost all constraints on neuron-set reuse and allows the pointer to the active neuron set to be very short. One possible mitigation for larger z would be to dynamically learn which set of neurons should be active on each forward pass, using an additional error-correction layer prior to each layer of circuit execution. This would require roughly doubling the network width and adding a further L layers.
What are the broadly applicable ideas here?
In my opinion, the useful idea here is dividing width-D networks into $D/d$ fixed memory blocks of width d. This makes error correction simple: we know the subspace in which the current circuit's state lives, and so can zero out its complement.
For larger z, if we assume that there are no memory-block collisions (i.e., only a single circuit is active at once per memory block), then the same error-correction trick should still mitigate the feedback error from inactive circuits. But memory-block collisions would be catastrophic. I think there is reasonable justification for an assumption of the form "only one circuit from each family activates", though.