If I understand correctly, inactive circuits mistakenly activating is the main failure mode - once this happens, things go downhill quickly. So the bottleneck is robustly knowing which circuits should be active.
Could we use O(d) redundant On-indicators per small circuit, instead of just 1, and apply the 2-ReLU trick to their average to increase resistance to noise?
In the Section 5 scenario, would it help to use additional neurons to encode the network's best guess for the active circuits early on, before noise accumulates, and preserve this over layers? You could do something like track the circuits which have the most active neuron mass associated with them in the first layer of your constructed network (though this would need guarantees like circuits being relatively homogeneous in the norm of their activations).
The reason I think this is a reasonable idea is that LLMs do seem to compute binary indicators of when to use circuits, separate from the circuits themselves. Ferrando et al. found models have features for whether they recognize an entity, which gates fact lookup. In GPT2-Small, the entity recognition circuit is just a single first-layer neuron, while the fact-lookup circuit it controls is presumably very complex (GDM couldn't reverse-engineer it). This suggests networks naturally learn simple early gating for complex downstream computations.
Thanks for your questions
Could we use O(d) redundant On-indicators per small circuit, instead of just 1, and apply the 2-ReLU trick to their average to increase resistance to noise?
If I understand you correctly, this is already what we are doing. Each On-indicator is distributed over S pairs of neurons in the Large Network, where we used S=6 for the results in this post.
I can't increase S more than that, for the given D and T, without breaking the constraint that no two circuits should overlap in more than one neuron, and this constraint is an important assumption in the error calculations. However, it is possible that a more clever way to allocate neurons could help this a bit.
See here: https://www.lesswrong.com/posts/FWkZYQceEzL84tNej/circuits-in-superposition-2-now-with-less-wrong-math#Construction
In the Section 5 scenario, would it help to use additional neurons to encode the network's best guess for the active circuits early on, before noise accumulates, and preserve this over layers? You could do something like track the circuits which have the most active neuron mass associated with them in the first layer of your constructed network (though this would need guarantees like circuits being relatively homogeneous in the norm of their activations).
We don't have extra neurons lying around. This post is about superposition, which means we have fewer neurons than features.
If we assume that which circuits are active never changes across layers (which is true for the example in this post), there is another thing we can do. We can encode the On-indicators at the start, in superposition, and then just copy this information from layer to layer. This prevents the part of the network that is responsible for the On-indicators from seizing. The reason we didn't do this is that we wanted to test a method that could be used more generally.
encode the network's best guess for the active circuits early on
We don't need to encode the best guess; we can just encode the true value (up to the uncertainty that comes from compressing it into superposition), given in the input. If we assume that these values stay constant, that is. But storing the value from early on also assumes that the value of the On-indicators stays fixed.
The reason I think this is a reasonable idea is that LLMs do seem to compute binary indicators of when to use circuits, separate from the circuits themselves. Ferrando et al. found models have features for whether they recognize an entity, which gates fact lookup. In GPT2-Small, the entity recognition circuit is just a single first-layer neuron, while the fact-lookup circuit it controls is presumably very complex (GDM couldn't reverse-engineer it). This suggests networks naturally learn simple early gating for complex downstream computations.
Ooo, interesting! I will definitely have a look into those papers. Thanks!
To clarify where I'm coming from, I read the previous post as showing that, assuming inactive circuits never activate due to superposition noise, we could get T = Õ(D²/d²). This post shows that inactive circuits erroneously activate in practice, violating that assumption. I'm curious what asymptotics are possible if we remove this assumption and force ourselves to design the network to prevent such erroneous activations. I may be misinterpreting things, though.
If I understand you correctly, this is already what we are doing. Each On-indicator is distributed over S pairs of neurons in the Large Network, where we used S=6 for the results in this post.
Agnostic of the method for embedding the small circuits in the larger network, currently only O(1) of the d neurons in each small network are being allocated to storing whether the small network is on or off. I'm suggesting increasing that to εd for some small fixed ε, increasing the size of the small networks to (1+ε)d neurons. In the rotation example, d is so small that it doesn't really make sense. But I'm just thinking about the asymptotics. This should generalise straightforwardly to the "cross-circuit" computation case as well.
The idea is that while each of the indicator neurons would be the same as each other in the smaller network, when embedded in the larger network, the noise each small network neuron (distributed across S neurons in the large network) receives is hopefully independent.
If we assume that which circuits are active never changes across layers (which is true for the example in this post), there is another thing we can do. We can encode the On-indicators at the start, in superposition, and then just copy this information from layer to layer. This prevents the part of the network that is responsible for the On-indicators from seizing. The reason we didn't do this is that we wanted to test a method that could be used more generally.
This method also works under the assumptions specified in Section 5.2, right? Under the Section 5.2 assumptions, it suffices to encode the circuits which are active on the first layer, of which there are at most z. Even if you erroneously believe one of the circuits is active on a later layer, when it has turned off, the gain comes from eliminating the other inactive circuits. If the On-indicators don't seize, then you can stop any part of the circuit from seizing in the Section 5.2 scenario.
I agree shared state/cross-circuit computation is an important thing to model, though. I guess that's what you mean by more generally? In which case I misunderstood the post completely. I thought it was saying that the construction of the previous post ran into problems in practice. But it seems like you're just saying, if we want this to work more generally, there are issues?
---------------------------------------
This series of posts is really useful, thank you! I have been thinking about it a lot for the past couple of days.
This post shows that inactive circuits erroneously activate in practice, violating that assumption. I’m curious what asymptotics are possible if we remove this assumption and force ourselves to design the network to prevent such erroneous activations. I may be misinterpreting things, though.
The erroneous activation only happens if the errors get large enough. With large enough T, D and S, this should be avoidable.
Agnostic of the method for embedding the small circuits in the larger network, currently only O(1) of the d neurons in each small network are being allocated to storing whether the small network is on or off. I'm suggesting increasing that to εd for some small fixed ε, increasing the size of the small networks to (1+ε)d neurons. In the rotation example, d is so small that it doesn't really make sense. But I'm just thinking about the asymptotics. This should generalise straightforwardly to the "cross-circuit" computation case as well.
Just to clarify, the current circuit uses 2 small-circuit neurons to embed the rotating vector (since it's a two-dimensional vector) and 2 small-circuit neurons for the On-indicator (in order to compute a step function, which requires 2 ReLUs).
We could allocate more of the total storage to on-indicators information and less to the rotated vector. Such a shift may be more optimal.
The idea is that while each of the indicator neurons would be the same as each other in the smaller network, when embedded in the larger network, the noise each small network neuron (distributed across S neurons in the large network) receives is hopefully independent.
The total noise is already a sum of several independent components. Increasing S would make the noise a sum of more, smaller contributions, which is better. Your method would not make the noise contributions any more independent.
The limit on S is that we don't want any two small network neurons to share more than one large network neuron. We have an allocation algorithm that is better than just random distribution, using prime-number steps, but if I make S too large the algorithm runs out of prime numbers, which is why S isn't larger. This is less of a problem for larger D, so I think that in the large-D limit this would not be a problem.
This method also works under the assumptions specified in Section 5.2, right? Under the Section 5.2 assumptions, it suffices to encode the circuits which are active on the first layer, of which there are at most z. Even if you erroneously believe one of the circuits is active on a later layer, when it has turned off, the gain comes from eliminating the other inactive circuits. If the On-indicators don't seize, then you can stop any part of the circuit from seizing in the Section 5.2 scenario.
I'm not sure I understand what you're trying to say, but I'll try to respond anyway.
In our setup, it is known from the start which circuits are going to be active for the rest of the forward pass. This is not one of the assumptions we listed, but it is implicit in the entire framework (there will always be cases like that, even if you try to list all assumptions). However, this "built-in assumption" is just there for convenience, and not because we think it is realistic. I expect that in a real network, which circuits are used in the later layers will depend on computations in the earlier layers.
Possibly there are some things you can eliminate right away? But I think often not. In the transformer architecture, at the start, the network just has the embedding vector for the first token and the positional embedding. After the first attention, the network has a bit more information, but not that much; the softmax will make sure the network just focuses on a few previous words (right?). And every step of computation (including attention) will come with some noise, if superposition is involved.
I agree shared state/cross-circuit computation is an important thing to model, though. I guess that's what you mean by more generally? In which case I misunderstood the post completely. I thought it was saying that the construction of the previous post ran into problems in practice. But it seems like you're just saying, if we want this to work more generally, there are issues?
There is a regime where the (updated) framework works: see Figures 8-11 for values T(d/D)² < 0.004. However, for the sizes of networks I can run on my laptop, that does not leave room for very much superposition.
This series of posts is really useful, thank you! I have been thinking about it a lot for the past couple of days.
Do you want to have a call some time in January?
There are probably lots of things that aren't explained as well as they could have been in the post.
OK, I have a neat construction for z=1 (https://www.lesswrong.com/posts/g9uMJkcWj8jQDjybb/ping-pong-computation-in-superposition) that works pretty well, with the width and layer count given there, and zero error. Note that this is exact, not asymptotic.
I'm trying to find a similar construction that scales like in the number of parameters required, without just scaling the number of layers up. I'd also be curious if it's possible to avoid scaling parameters linearly with , but it seems quite difficult.
We could allocate more of the total storage to on-indicators information and less to the rotated vector. Such a shift may be more optimal.
This is all I mean. Having the small circuits do error correction on the On-indicator neurons is just a way of increasing the total percentage of storage allocated to On-indicators in the larger network, in an embedding-agnostic manner. You can change your method for constructing the larger network later, and this strategy would be orthogonal to such a change.
I think allocating a higher percentage to on-indicators should still apply even when adding cross-circuit computation.
Possibly there are some things you can eliminate right away? But I think often not. In the transformer architecture, at the start, the network just has the embedding vector for the first token and the positional embedding. After the first attention, the network has a bit more information, but not that much; the softmax will make sure the network just focuses on a few previous words (right?). And every step of computation (including attention) will come with some noise, if superposition is involved.
The assumption "softmax will make sure the network just focus on a few previous words (right?)" is true for many attention heads, but not all of them. Some attention heads attend broadly across the whole sequence, aggregating many different tokens together to get the 'gist' of a text.
By the end of the first layer of GPT-2 Small, it has constructed a continuous linear summary of the previous tokens, and so has a Word2Vec style vector in its residual stream. So it knows that the text is about World War I, or about AI, that it is written in British English, that it is formal/informal, etc., all by the end of the first attention layer (before the first MLP). This is lots of information to go on in terms of turning off circuits etcetera (think Mixture-Of-Experts).
There is a regime where the (updated) framework works: see Figures 8-11 for values T(d/D)² < 0.004. However, for the sizes of networks I can run on my laptop, that does not leave room for very much superposition.
Nice, OK. So asymptotically it works fine then? So the next step theory-wise is having a framework that allows for cross-circuit computation, I guess.
Do you want to have a call some time in January?
There are probably lots of things that aren't explained as well as they could have been in the post.
I would be grateful if you get the time. I have recently had some more ideas about this stuff that are potentially more useful, and that might be worth discussing.
TL;DR We experimentally test the mathematical framework for circuits in superposition by hand-coding the weights of an MLP to implement many conditional[1] rotations in superposition on two-dimensional input features. The code can be found here.
This work was supported by Coefficient Giving and Goodfire AI
1 Introduction
In a previous post, we sketched out a construction for compressing many different circuits into a neural network such that they can be computed in superposition.[2]
By computation in superposition, we mean that a network represents features in superposition and can perform more possible computations with them than it has neurons (although not all at once), across multiple layers. Having better models of this is important for understanding whether and how networks use superposition, which in turn is important for mechanistic interpretability research more generally.
Performing computation in superposition over multiple layers introduces additional noise compared to just storing features in superposition. This restricts the amount and type of computation that can be implemented in a network of a given size, because the noise needs to be reduced or suppressed to stay smaller than the signal.
Here, we test the construction by using it to compress circuits that perform random rotations on different two-dimensional input vectors, conditional on a Boolean variable being active.
We ended up needing to adjust the construction a bit to do this, because one of the assumptions the original version made about the circuits to be compressed turned out to be less reasonable in practice than we thought it'd be, and did not actually apply to these rotation circuits.[3]
This sort of thing is why we do testing. Reality is complicated.
Note that this is certainly not the ideal way to implement rotations in superposition. We are testing a general method that should work for a wide class of circuits. If we just wanted to implement these specific circuits as compactly and noiselessly as possible, there'd be better ways to do that.[4]
2 Takeaways
It seems to work. But we can only fit so many independent circuits into the network before the errors compound and explode.
Independent circuits in superposition are costly
Many people we've talked to seem to think that neural networks can easily learn to implement an exponentially large number of independent computations by using superposition. They often seem to base this intuition on toy models that showcase how many independent features neural networks can store in superposition in their activations, but that's very different from implementing independent computations in superposition on these features. The latter is much more difficult and costly.
Our previous post concluded that in theory, we could fit close to $T=\tilde{O}\!\left(\frac{D^2}{d^2}\right)$ circuits that each require $d$ ReLU neurons per layer into a neural network with $D$ ReLU neurons per layer, provided the circuits activate sparsely enough.[5] But that was an asymptotic expression for the case of network width $D$ going to infinity, based only on the leading error terms and mostly ignoring constant prefactors.
In our experiments, we had $d=4$, $D=1000$, and it was already pretty hard to fit in $T\approx 2\frac{D}{d}$ circuits and have the noise stay small enough to not blow up the computation in the first few layers. We think we could perhaps stabilize the computations for a much greater number of layers by implementing more active error correction,[6] but squeezing in many more independent circuits than this seems very difficult to us.
So how can language models memorize so many things?
If fitting many independent circuits into a neural network is so difficult, how can language models memorize so many different facts and even whole text passages? We don't know the answer to that yet. But we have started to train and analyze small toy models of memorization to find out, and we have some suspicions. Our work on this is still at a very early stage, so take the following speculation with an even bigger helping of salt than usual. But:
Many memorization circuits might be much shallower than the ones we investigate here. A shallower depth means fewer stages of computation for errors to accumulate.
Much of the information that models memorize is not actually independent. If a model wants to store that "The Eiffel Tower is in Paris" and that "The capital of France is Paris", it doesn't actually need two independent, non-overlapping lookup circuits to do this. Since both look-ups return the same answer, "Paris", the computation need not distinguish their outputs. Instead, all look-ups that return "Paris" could effectively share the same set of neurons, meaning they'd effectively not be independent look-ups at all, but rather parts of a single general "Sequences that end in Paris" lookup.
This picture of computation in neural networks suggests that there might not be that many more independent circuits than neurons in a large language model like GPT-2. Instead, the models might be exploiting structure in the data to implement what might superficially look like very many different capabilities using a comparatively much smaller set of fairly general circuits.
Lucius: This roughly tracks with some very early results I've been seeing when decomposing the parameters of models like GPT-2 Small into independent components. It currently looks to me like it's actually rare for these models to even have more rank 1 circuit pieces than neurons in a layer, never mind neuron count squared. I'd guess that much wider models have more circuits relative to their neuron count, but at this point I kind of doubt that any real model today gets even remotely close to the theoretical limit of ≈1 circuit per weight. Which might be great, because it'd mean there aren't that many circuits we need to interpret to understand the models.
3 The task
The general setup for our circuits-in-superposition framework goes like this:
In this case, the $T$ small circuits we want to compress into one network each perform a particular two-dimensional rotation on different two-dimensional vectors $x^l_t$, conditional on a Boolean "On-indicator" variable $\alpha^l_t$ taking the value 1. The circuits have $L=6$ layers, with each layer performing another conditional rotation on the previous layer's output. We randomly selected the rotation angle for each circuit. The hidden dimension of each individual circuit is $d=4$.
Conditional rotation
In each layer $l$, the circuits are supposed to rotate a two-dimensional vector $x^{l-1}_t$ with norm $|x^{l-1}_t|\leq 1$ by some angle $\theta_t$, conditional on the Boolean On-indicator $\alpha^{l-1}_t$ taking value 1 rather than 0.
We can implement this with an expression using ReLUs, like this:
$$\begin{bmatrix} (x^l_t)_0 \\ (x^l_t)_1 \end{bmatrix} = \begin{bmatrix} \mathrm{ReLU}\!\left(\cos\theta_t\,(x^{l-1}_t)_0 - \sin\theta_t\,(x^{l-1}_t)_1 + 2\alpha^{l-1}_t - 1\right) - \alpha^{l-1}_t \\ \mathrm{ReLU}\!\left(\sin\theta_t\,(x^{l-1}_t)_0 + \cos\theta_t\,(x^{l-1}_t)_1 + 2\alpha^{l-1}_t - 1\right) - \alpha^{l-1}_t \end{bmatrix}. \tag{3.1}$$
If $\alpha^{l-1}_t=1$, this expression simplifies to
$$\begin{bmatrix} (x^l_t)_0 \\ (x^l_t)_1 \end{bmatrix} = \begin{bmatrix} \cos\theta_t\,(x^{l-1}_t)_0 - \sin\theta_t\,(x^{l-1}_t)_1 \\ \sin\theta_t\,(x^{l-1}_t)_0 + \cos\theta_t\,(x^{l-1}_t)_1 \end{bmatrix}, \tag{3.2}$$
which is a two-dimensional rotation by angle $\theta_t$, as desired. If $\alpha^{l-1}_t=0$, we get $x^l_t=0$.
On-indicator
At each layer $l$, the circuits are also supposed to apply a step function to the On-indicator and pass the result on to the next layer:
$$\alpha^l_t = \mathrm{ReLU}\!\left(2\alpha^{l-1}_t - 0.5\right) - \mathrm{ReLU}\!\left(2\alpha^{l-1}_t - 1.5\right). \tag{3.3}$$
The step function is there to make the On-indicators robust to small noise. Noise robustness is a requirement for circuits to be embedded in superposition. For more details on this and other properties that circuits need to have for computation in superposition to work, see Section 5.2
If the circuit is active, the On-indicator will be initialized as 1, and remain 1. If the circuit is inactive, the On-indicator will be initialized as 0, and stay 0.
$$\alpha^l_t = \begin{cases} 1 & \text{if circuit } t \text{ is active} \\ 0 & \text{if circuit } t \text{ is inactive} \end{cases} \quad \text{for all } l \tag{3.4}$$
For more detail on the small circuits, see Appendix Section 5.1.
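To make the construction concrete, here is a minimal numpy sketch of one layer of a single small circuit, implementing Equations (3.1) and (3.3) directly. It is our own illustration, not the post's released code, and the variable names are ours.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def circuit_layer(x, alpha, theta):
    """One conditional-rotation layer: rotate x by theta if alpha == 1, else output 0."""
    c, s = np.cos(theta), np.sin(theta)
    x_new = np.array([
        relu(c * x[0] - s * x[1] + 2 * alpha - 1) - alpha,   # Eq. (3.1), first row
        relu(s * x[0] + c * x[1] + 2 * alpha - 1) - alpha,   # Eq. (3.1), second row
    ])
    alpha_new = relu(2 * alpha - 0.5) - relu(2 * alpha - 1.5)  # Eq. (3.3), step function
    return x_new, alpha_new

theta = 0.7
x = np.array([0.6, -0.8])                                    # |x| = 1
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

x_on, a_on = circuit_layer(x, 1.0, theta)
x_off, a_off = circuit_layer(x, 0.0, theta)
print(np.allclose(x_on, rot @ x), a_on)    # True 1.0 (active: rotated, stays on)
print(np.allclose(x_off, 0.0), a_off)      # True 0.0 (inactive: clamped to zero)
```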
Why conditional rotations in particular?
No particular reason. We just wanted some simple circuits. Originally, we tried to just do plain two-dimensional rotations. Then we remembered that plain rotations aren't noise-robust in the way our computation-in-superposition framework requires, because they're analytic at every point. We need circuits that return an output of 0 when passed an input vector of sufficiently small size ϵ. Rotations don't do that. Making the rotations conditional on some Boolean logic fixes this, and for simplicity we just went with the easiest Boolean logic possible.
Why the step function?
Although we only need one float to encode $\alpha$, we need two ReLUs to make the circuit robust to noise over multiple layers. If we only used a single ReLU, e.g., $\alpha^l_t=\mathrm{ReLU}(2\alpha^{l-1}_t-1)$, then a small error $\epsilon$ on $\alpha^{l-1}_t$ would be doubled in layer $l$, then doubled again in layer $l+1$, and so on, growing exponentially with circuit depth.
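A quick numerical illustration of this point (our own sketch, not from the post): copying the On-indicator with a single ReLU doubles a small error every layer, while the two-ReLU step function of Equation (3.3) absorbs it.

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)
eps = 0.01
a_single, a_step = 1.0 + eps, 1.0 + eps
for layer in range(6):
    a_single = relu(2 * a_single - 1)                          # single ReLU: error doubles
    a_step = relu(2 * a_step - 0.5) - relu(2 * a_step - 1.5)   # Eq. (3.3): error removed
print(a_single - 1.0, a_step - 1.0)  # ~0.64 (= 2^6 * eps) vs ~0 (error absorbed)
```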
If dedicating two neurons just to copy over the embedding of a single Boolean seems a little silly to you, you're not wrong. If our main goal was just to implement lots of conditional rotation circuits in superposition using as few neurons as possible, there'd be better ways to do it. For example, we could use some fraction of the neurons in the large network to embed the On-indicators, and then just copy this information from layer to layer without ever recalculating these values. Such an implementation would do better at this particular task.
But our goal is not to solve this task in particular, but to test a general method for embedding many different kinds of circuits.
If you like, you can also see this part of the circuits as a stand-in for some more complicated logic, that calculates which circuits are on from layer to layer, i.e., some function that is more complicated than "same as last layer", and involves cross-circuit computations
4 Results
Here, we'll present the results of embedding T conditional rotation circuits into a ReLU MLP of width D for various choices of D and T, using the embedding method described in Section 5.4
As we would expect, the read-offs diverge more with every subsequent layer as error accumulates, since we didn't implement any active error correction.[6] Most rotation circuits seem to work fine for the first 2-3 rotations, with the read-offs predicting the ideal values much better than chance. Estimates are much worse in the later layers. The errors also get larger when the number of active circuits z is increased.
We quantify the error on the rotated vectors as the $L^2$ norm of the difference between the ideal result $x^l_t$ and the estimate $\hat{x}^l_t$:
$$|\delta x^l_t| := |x^l_t - \hat{x}^l_t|. \tag{4.1}$$
On a population level, we can look at both the average error over circuits, $\frac{1}{T}\sum_{t=1}^{T}|\delta x^l_t|^2$, and the worst-case error, $\max\!\left(|\delta x^l_1|,\ldots,|\delta x^l_T|\right)$. The former tells us how noisy the computations are most of the time. The latter tells us whether any one circuit in the ensemble of $T=557$ circuits was implemented so noisily that for at least one input it basically doesn't work at all.
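For concreteness, these two statistics can be computed as in the following sketch (our own notation; `x_true` and `x_hat` are hypothetical arrays of ideal values and read-offs of shape (T, 2)):

```python
import numpy as np

def error_stats(x_true, x_hat):
    err = np.linalg.norm(x_true - x_hat, axis=1)   # |delta x_t| for each circuit t, Eq. (4.1)
    return np.mean(err ** 2), np.max(err)          # mean squared error, worst-case error

# toy example with made-up read-offs
rng = np.random.default_rng(0)
x_true = rng.normal(size=(557, 2))
x_hat = x_true + 0.1 * rng.normal(size=(557, 2))
mse, worst = error_stats(x_true, x_hat)
```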
Figure 5: Errors on the rotating vectors x, for an MLP of width D=1000 with T=557 circuits each of width d=4. Every circuit neuron is spread across S=6 MLP neurons. Upper row: mean squared errors, i.e., the average of |δx|2 over inactive or active circuits. Lower row: worst-case absolute errors, i.e., the maximum of |δx| over inactive or active circuits.[7]
The left plots show errors on inactive circuits, i.e., circuits that received an input of $\alpha^0_t=0$, $x^0_t=0$. Most circuits will be inactive on any given forward pass. The right plots show errors on active circuits, i.e., circuits that received an input of $\alpha^0_t=1$ and some random vector $x^0_t$ with $|x^0_t|=1$. As a baseline to put these numbers in perspective, once $|\delta x^l_t|\geq 1$, the circuits are doing no better than guessing the mean. So, the average circuit beats this trivial baseline for the first three of the five rotations, or the first four if we restrict ourselves to inputs with only $z=1$ active circuits. As one might expect, the worst-case errors are worse. With $z=1$ inputs, all $T=557$[8] circuits beat the baseline for the first two rotations. With $z=2$ inputs, only the first rotation is implemented with small enough noise by all circuits to beat the baseline, and at least one circuit doesn't work right for an input with $z=3$ for even a single rotation.
We can also see that the mean squared errors grow superlinearly the deeper we go into the network. This does not match our error calculations, which would predict that those errors should grow linearly. We think this is because as the errors grow they start to overwhelm the error suppression mechanism by activating additional ReLUs, and thus letting through additional noise. The calculations assume that the noise always stays small enough that this doesn't happen. More on that in the next subsection, where we compare our error calculations to the empirical results.
We can also look at the error on the On-indicator variables $\alpha_t$:
$$|\delta\alpha^l_t| := |\alpha^l_t - \hat{\alpha}^l_t| \tag{4.2}$$
Here, an error $|\delta\alpha^l_t|>0.25$ means that the error on the On-indicator is large enough to pass through our step function and propagate to the next layer. At that point, errors will start to accumulate over layers at a faster-than-linear rate.
Figure 6: Errors on the Boolean On-indicators α, for an MLP of width D=1000 with T=557 circuits each of width d=4. Every circuit neuron is spread across S=6 MLP neurons. Upper row: mean squared errors, i.e., the average of (δα)2 over inactive or active circuits. Lower row: worst-case absolute error, i.e., the maximum of |δα| over inactive or active circuits.[7]
The absolute value of the error for the On-indicator caps out at 1. This is because $\hat{\alpha}^l_t$ is the averaged output of a set of step functions, each of which is constrained to output a value in the range $[0,1]$. Their average is thus also constrained to the range $[0,1]$. Therefore, regardless of whether the true value for $\alpha^l_t$ is 0 or 1, $|\delta\alpha^l_t|$ has to be in the range $[0,1]$.
4.1 Comparing with our theoretical predictions
Key takeaway: Most predictions are good when the errors are small in magnitude, but once the errors get too big, they start growing superlinearly, and our predictions begin to systematically undershoot more and more. We think this is because our error propagation calculations assume that the errors stay small enough to never switch on neurons that are not in use by active circuits. But for many settings, the errors get large enough that some of these neurons do switch on, opening more channels for the noise to propagate and compound even further.
Figure 7: Fraction of active neurons, i.e., what fraction of the large network neurons are non-zero, for a network of width D=1000 with T=557 circuits of width d=4. Every circuit neuron is spread across S=6 MLP neurons. The dotted lines are what we'd theoretically expect if the errors stayed small enough not to overwhelm the noise suppression threshold for the inactive circuits. If this assumption is broken, the network starts to develop a "seizure"[9]: the noise starts to grow much faster as errors switch on more neurons causing more errors which switch on more neurons.[7]
The key variables that influence the error here are the ratio $T\frac{d^2}{D^2}$, the number of simultaneously active circuits $z$, and the layer $l$.
4.1.1 Rotating vector
Predictions are good for layer 1, $z=1$. For larger $z$, the errors get larger and the predictions get worse, with empirical errors often systematically larger than our theoretical predictions. For layer 2 and onward, our theoretical predictions likewise undershoot more and more as the overall error gets larger. At layer 3 and onward, the superlinear growth from compounding noise becomes very visible for large $T\frac{d^2}{D^2}$.
As our error calculations would predict, rotating vectors for inactive circuits have smaller errors than rotating vectors for active circuits, so long as the overall noise level stays small enough for the calculations to be applicable.[10]
4.1.2 On-indicator
In the case of On-indicators for active circuits, our error calculations predict that they just stay zero. This is indeed empirically the case for the early layers. But as we move deeper into the network and $T\frac{d^2}{D^2}$ or $z$ get larger, a critical point seems to be reached. Past this point, noise on other parts of the network seems to become large enough to cause cascading failures, likely by switching the regime of ReLUs. Then, the active On-indicators become noisy as well.
The noise on the inactive On-indicator circuits follows the same pattern: Predictions are mostly good until the noise seems to grow large enough to break the assumptions of the calculation as the network starts to develop a "seizure". Here, this point seems to be reached at particularly early layers though. We think this is because seizures start with the On-indicators for inactive circuits and spread from there to other quantities.
5 Appendix 1: Construction
5.1 The circuits
Each individual circuit can be written as
$$a^l_t = \mathrm{ReLU}\!\left(w^l_t\,a^{l-1}_t + b^l\right) \tag{5.1}$$
where the superscript $l$ indicates the layer, ranging from 0 to 5, and the subscript $t$ indexes the different circuits, ranging from 0 to $T-1$.
The circuits are supposed to compute/propagate the On-indicator $\alpha^l_t\in\{0,1\}$, which controls whether the circuit should be active, and perform rotations on the rotating vector $x^l_t$.
The weight matrices $w^l_t$ for each circuit are parametrized as
$$w^l_t = \begin{bmatrix} 2 & -2 & 0 & 0 \\ 2 & -2 & 0 & 0 \\ 2-(\cos\theta_t - \sin\theta_t) & -2+(\cos\theta_t - \sin\theta_t) & \cos\theta_t & -\sin\theta_t \\ 2-(\sin\theta_t + \cos\theta_t) & -2+(\sin\theta_t + \cos\theta_t) & \sin\theta_t & \cos\theta_t \end{bmatrix} \tag{5.2}$$
and the biases as[11]
$$b^l = \begin{bmatrix} -0.5 \\ -1.5 \\ -1 \\ -1 \end{bmatrix}. \tag{5.3}$$
So, the circuit parameters here don't actually vary with the layer index $l$ at all.[12]
The initial input vector for each circuit takes the form
$$a^0_t = \begin{bmatrix} 1.5\,\alpha^0_t \\ 0.5\,\alpha^0_t \\ (x^0_t)_0 + \alpha^0_t \\ (x^0_t)_1 + \alpha^0_t \end{bmatrix}. \tag{5.4}$$
The On-indicator $\alpha^l_t$ for circuit $t$ at layer $l$ can be read out from the circuit's hidden activations as
$$\alpha^l_t := \left(a^l_t\right)_0 - \left(a^l_t\right)_1. \tag{5.5}$$
The rotating vector can be read out from the circuit's hidden activations as
$$x^l_t := \begin{bmatrix} \left(a^l_t\right)_2 - \alpha^l_t \\ \left(a^l_t\right)_3 - \alpha^l_t \end{bmatrix}. \tag{5.6}$$
If the circuit is active, the rotating vector will be initialized as some vector of length 1, and the On-indicator will be set to 1. If the circuit is inactive, the rotating vector and On-indicator will be initialized to 0, and stay 0.
$$|x^l_t| = \begin{cases} 1 & \text{if circuit } t \text{ is active} \\ 0 & \text{if circuit } t \text{ is inactive} \end{cases} \quad\text{for all } l \tag{5.7}$$
$$\alpha^l_t = \begin{cases} 1 & \text{if circuit } t \text{ is active} \\ 0 & \text{if circuit } t \text{ is inactive} \end{cases} \quad\text{for all } l \tag{5.8}$$
A circuit's rotating vector is robust to some noise if the circuit is inactive. If a circuit is active, errors on its rotating vector will neither be removed nor amplified.
In any given forward pass, most circuits will be inactive.
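The matrix form above can be checked directly. The following sketch (ours, not the released code) builds one circuit from Equations (5.2)-(5.3), runs it for five layers via Equation (5.1), and confirms that the readouts of Equations (5.5)-(5.6) reproduce the ideal repeated rotation:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def make_circuit(theta):
    c, s = np.cos(theta), np.sin(theta)
    w = np.array([[2, -2, 0, 0],
                  [2, -2, 0, 0],
                  [2 - (c - s), -2 + (c - s), c, -s],
                  [2 - (s + c), -2 + (s + c), s, c]])          # Eq. (5.2)
    b = np.array([-0.5, -1.5, -1.0, -1.0])                     # Eq. (5.3)
    return w, b

theta = 1.1
w, b = make_circuit(theta)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

alpha0, x0 = 1.0, np.array([np.cos(0.3), np.sin(0.3)])
a = np.array([1.5 * alpha0, 0.5 * alpha0, x0[0] + alpha0, x0[1] + alpha0])  # Eq. (5.4)
for layer in range(1, 6):
    a = relu(w @ a + b)                                        # Eq. (5.1)
    alpha = a[0] - a[1]                                        # Eq. (5.5)
    x = a[2:4] - alpha                                         # Eq. (5.6)
    ideal = np.linalg.matrix_power(rot, layer) @ x0
    assert np.allclose(x, ideal) and np.isclose(alpha, 1.0)
```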
5.2 Assumptions about the circuits
5.2.1 Revisiting assumptions from the previous post
From the previous post:
Assumption 1 holds for our conditional rotation circuits. Assumption 2 also holds, because the rotation is only applied if the Boolean On-indicator αlt is large enough, and the step function for αlt also doesn't propagate it to the next layer if αlt is too small. Assumption 3 also holds. However, assumption 4 does not hold. So we're going to have to adjust the way we embed the circuits a little.
Further, while it wasn't explicitly listed as an assumption, the previous post assumed that the circuits don't have any biases. But our circuits here do (see Equation 5.1). So we'll need to make some more adjustments to account for that difference as well.
5.2.2 Updated set of assumptions
Assumptions 1-3 have stayed the same. We think the construction could likely be modified to make assumption 3 unnecessary,[15] but it simplifies the implementation a lot. The old assumption 4 is not required for the new construction. The new embedding algorithm can work without it (see next section). We think the embedding algorithm could also be modified to make the new assumption 4 unnecessary, but this may come at some cost to the total number of circuits we can embed.[16] The new assumption 5 was actually necessary all along, even in the previous construction. Without it, even a tiny amount of noise can eventually grow to overwhelm the signal if the circuit is deep enough.[17]
Note that assumption 2 is about noise reduction on inactive circuits whereas assumption 5 is about lack of noise amplification on active circuits. Assumption 2 says we should have
$$|\delta a^l_t(x)| \leq \epsilon \;\Rightarrow\; \left|\mathrm{ReLU}\!\left(w^{l+1}_t\,\delta a^l_t(x) + b^{l+1}\right)\right| = 0 \tag{5.9}$$
for some sufficiently small $\epsilon>0$. Assumption 5 says we should have
$$\frac{\left|a^{l+1}_t(x) - \mathrm{ReLU}\!\left(w^{l+1}_t\left(a^l_t + \delta a^l_t(x)\right) + b^{l+1}\right)\right|}{\left|a^{l+1}_t(x)\right|} \leq \frac{\left|\delta a^l_t(x)\right|}{\left|a^l_t(x)\right|}. \tag{5.10}$$
5.3 Notation
5.3.1 Bracket notation
Bracket notation is a type of vector notation that's excellent for thinking about outer products. It's also the vector notation usually used in quantum mechanics, and the term "superposition" is a quantum loan-word, so deploying it here seemed kabbalistically accurate.
It works like this:
- |v⟩ denotes a column vector, also called a ket.
- ⟨u| denotes a row vector, also called a bra.
$$\langle u| := |u\rangle^\top \tag{5.11}$$
- ⟨u|v⟩ denotes the inner product[18] of two vectors, yielding a scalar number:
$$\langle u|v\rangle := \sum_i (u)_i\,(v)_i \tag{5.12}$$
- |v⟩⟨u| denotes the outer product of two vectors, yielding a matrix:
$$\left(|v\rangle\langle u|\right)_{i,j} := (v)_i\,(u)_j \tag{5.13}$$
We will use bracket notation to represent $D/d$-dimensional vectors, like the embedding vectors in Section 5.4.1.
5.3.2 Mixed vector notation
So far, we've used bold to represent $d$-dimensional vectors, i.e., $a^l_t$ and $b$, and two-dimensional vectors, i.e., $x^l_t$. And we just defined |ket⟩ notation for $D/d$-dimensional vectors in the previous Section 5.3.1.
We will also want to mix bold and |ket⟩ notation to make $D$-dimensional vectors by stacking $d=4$ $D/d$-dimensional vectors:
$$|V\rangle := \begin{bmatrix} \left(|V\rangle\right)_{0:\frac{D}{d}} \\ \left(|V\rangle\right)_{\frac{D}{d}:\frac{2D}{d}} \\ \left(|V\rangle\right)_{\frac{2D}{d}:\frac{3D}{d}} \\ \left(|V\rangle\right)_{\frac{3D}{d}:D} \end{bmatrix} \tag{5.14}$$
where $|V\rangle$ is a $D$-dimensional vector and $\left(|V\rangle\right)_{i:j}$ is the section of $|V\rangle$ from index $i$ to index $j-1$. Each of $\left(|V\rangle\right)_{0:\frac{D}{d}}$, $\left(|V\rangle\right)_{\frac{D}{d}:\frac{2D}{d}}$, $\left(|V\rangle\right)_{\frac{2D}{d}:\frac{3D}{d}}$, $\left(|V\rangle\right)_{\frac{3D}{d}:D}$ is $D/d$-dimensional.
Any d×d-matrix will act on |V⟩ as if it were a d-dimensional vector, ignoring the |ket⟩ dimension. For example:
$$\begin{bmatrix} 0&0&1&0 \\ 0&0&0&1 \\ 1&0&0&0 \\ 0&1&0&0 \end{bmatrix} |V\rangle := \begin{bmatrix} \left(|V\rangle\right)_{\frac{2D}{d}:\frac{3D}{d}} \\ \left(|V\rangle\right)_{\frac{3D}{d}:D} \\ \left(|V\rangle\right)_{0:\frac{D}{d}} \\ \left(|V\rangle\right)_{\frac{D}{d}:\frac{2D}{d}} \end{bmatrix} \tag{5.15}$$
Any ⟨bra| will act on $|V\rangle$ as if it were a |ket⟩, ignoring the $d$-dimension. For example:
$$\langle u|V\rangle := \begin{bmatrix} \langle u|\left(|V\rangle\right)_{0:\frac{D}{d}} \\ \langle u|\left(|V\rangle\right)_{\frac{D}{d}:\frac{2D}{d}} \\ \langle u|\left(|V\rangle\right)_{\frac{2D}{d}:\frac{3D}{d}} \\ \langle u|\left(|V\rangle\right)_{\frac{3D}{d}:D} \end{bmatrix} \tag{5.16}$$
5.4 Embedding the Circuits
In this section, we describe a general method for embedding any set of small circuits that satisfy the assumptions 1 - 5 listed in Section 5.2.2
The activations $|A\rangle^l$ of the large network at layer $l$ are computed from the activations $|A\rangle^{l-1}$ at the previous layer as
$$|A\rangle^l := \mathrm{ReLU}\!\left(W^l |A\rangle^{l-1} + |B\rangle^l\right) \tag{5.17}$$
5.4.1 Embedding the weights
To embed the small circuit weights in the large network, we first split their weight matrices $w^l_t$ into two parts: the averaged weights across all circuits, $\bar{w}^l$, and the differing weights $\Delta w^l_t$:
$$\left(\bar{w}^l\right)_{i,j} := \mathbb{E}_t\!\left[\left(w^l_t\right)_{i,j}\right] \tag{5.18}$$
$$\Delta w^l_t := w^l_t - \bar{w}^l \tag{5.19}$$
The differing weights $\Delta w^l_t$ are embedded into the weights of the large network using embedding vectors $|e^l_t\rangle$. The averaged weights $\bar{w}^l$ are embedded using an embedding matrix $\boldsymbol{W}^l \in \mathbb{R}^{\frac{D}{d}\times\frac{D}{d}}$:
$$W^l := \bar{w}^l\,\boldsymbol{W}^l + \frac{1}{S}\sum_t \Delta w^l_t\,|e^l_t\rangle\langle e^{l-1}_t| \qquad\text{for } l\geq 2 \tag{5.20}$$
Embedding vectors
The embedding vectors |elt⟩ are constructed using exactly the same algorithm described in the previous post. The information about them that matters most for the rest of this post:
- $|e^l_t\rangle$ denotes the embedding vector for circuit $t$ in layer $l$. $|e^l_t\rangle$ has dimension $D/d$.
- $S$ is the embedding redundancy, i.e., the number of large network neurons used to represent each small circuit neuron.
- $|e^l_t\rangle$ is an $S$-hot vector, i.e., it has $S$ ones and $\left(\frac{D}{d}-S\right)$ zeros. As a consequence of this, we have
$$\frac{1}{S}\langle e^l_t|e^l_t\rangle = 1 \tag{5.21}$$
- $\frac{1}{S}\langle e^l_t|$ is the un-embedding vector for circuit $t$ in layer $l$.
- Any pair of embedding vectors for different circuits $t\neq u$ in the same layer $l$ will share at most one large network neuron. Therefore:
$$\frac{1}{S}\langle e^l_t|e^l_u\rangle = \begin{cases} \frac{1}{S} \\ 0 \end{cases} \qquad\text{for } t\neq u \tag{5.22}$$
Embedding matrix
$\boldsymbol{W}^l$ is constructed as
$$\left(\boldsymbol{W}^l\right)_{i,j} := \begin{cases} \frac{1}{S} & \text{if } \left(\sum_t |e^l_t\rangle\langle e^{l-1}_t|\right)_{i,j} > 0 \\ -\chi & \text{if } \left(\sum_t |e^l_t\rangle\langle e^{l-1}_t|\right)_{i,j} = 0 \end{cases} \tag{5.23}$$
where the scalar $\chi$ is chosen to balance the weights of $\boldsymbol{W}^l$, such that
$$\mathbb{E}_{t\neq u}\!\left[\langle e^l_t|\boldsymbol{W}^l|e^{l-1}_u\rangle\right] = 0 \tag{5.24}$$
This ensures that interference terms between different circuits do not systematically aggregate along any particular direction.
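The sketch below illustrates this weight embedding numerically (it is our own illustration, not the post's released code). For simplicity it uses random S-hot embedding vectors instead of the prime-number-based allocation from the previous post, so the "share at most one neuron" property of Equation (5.22) only holds approximately; the construction of $\boldsymbol{W}^l$, $\chi$ and $W^l$ follows Equations (5.20), (5.23) and (5.24). Reading a circuit's weights back out with the un-embedding vector $\frac{1}{S}\langle e^l_t|$ approximately recovers $w^l_t$, up to small interference from the other circuits.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, S, T = 1000, 4, 6, 200
n = D // d                                                # dimension of each |e_t>

def s_hot_vectors():
    E = np.zeros((T, n))
    for t in range(T):
        E[t, rng.choice(n, size=S, replace=False)] = 1.0  # random S-hot embedding vector
    return E

E_prev, E_next = s_hot_vectors(), s_hot_vectors()         # |e^{l-1}_t>, |e^l_t>

# small-circuit weights, split into average and difference, Eqs. (5.18)-(5.19)
w = rng.normal(size=(T, d, d))
w_bar = w.mean(axis=0)
dw = w - w_bar

# embedding matrix bold-W, Eq. (5.23): 1/S on entries used by some circuit, -chi elsewhere,
# with chi chosen numerically to satisfy the balance condition (5.24)
covered = (E_next.T @ E_prev) > 0
off = ~np.eye(T, dtype=bool)
c_on = (E_next @ (covered / S) @ E_prev.T)[off].mean()
c_off = (E_next @ (~covered).astype(float) @ E_prev.T)[off].mean()
chi = c_on / c_off
W_bold = np.where(covered, 1.0 / S, -chi)

# full large-network weight matrix, Eq. (5.20), in mixed notation (d-blocks of size D/d)
W_full = np.kron(w_bar, W_bold) + (1.0 / S) * np.einsum(
    'tab,ti,tj->aibj', dw, E_next, E_prev).reshape(D, D)

# read circuit t's effective weights back out with the un-embedding vector (1/S)<e_t|
t = 3
W4 = W_full.reshape(d, n, d, n)
w_eff = np.einsum('i,aibj,j->ab', E_next[t] / S, W4, E_prev[t])
print(np.abs(w_eff - w[t]).max())   # small: leftover interference from the other circuits
```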
Layer 0
As in the previous post, the weights at layer 0 are treated differently. Instead of using the weight embedding described in the previous subsection, we just embed the circuit input vectors into the input vector of the network as
$$|A\rangle^0 = \sum_t w^1_t\,a^0_t\,|e^1_t\rangle. \tag{5.25}$$
This allows us to set the first weight matrix to an identity,
$$W^1 = I, \tag{5.26}$$
yielding
$$|A\rangle^1 = \mathrm{ReLU}\!\left(\sum_t w^1_t\,a^0_t\,|e^1_t\rangle + |B\rangle^1\right). \tag{5.27}$$
5.4.2 Embedding the biases
We embed the circuit biases $b^l\in\mathbb{R}^d$ into the network biases $|B\rangle^l\in\mathbb{R}^D$ by simply spreading them out over the full $D$ dimensions:
$$|B\rangle^l := b^l\,|1\rangle := \begin{bmatrix} (b^l)_0\,|1\rangle \\ (b^l)_1\,|1\rangle \\ (b^l)_2\,|1\rangle \\ (b^l)_3\,|1\rangle \end{bmatrix} \tag{5.28}$$
where $|1\rangle$ is just the $D/d$-vector with value 1 everywhere, i.e.:
$$\left(|1\rangle\right)_i := 1 \quad\text{for all } i \tag{5.29}$$
5.5 Reading out the small circuit activations
If the embedding works, we can read out the approximate small circuit activations from the activations of the large network, up to some error tolerance. Using the un-embedding vector $\frac{1}{S}\langle e^l_t|$, we define the estimator for the small circuit activations at each layer as
$$\hat{a}^l_t := \frac{1}{S}\langle e^l_t|A\rangle^l := \begin{bmatrix} \frac{1}{S}\langle e^l_t|\left(|A\rangle^l\right)_{0:\frac{D}{d}} \\ \frac{1}{S}\langle e^l_t|\left(|A\rangle^l\right)_{\frac{D}{d}:\frac{2D}{d}} \\ \frac{1}{S}\langle e^l_t|\left(|A\rangle^l\right)_{\frac{2D}{d}:\frac{3D}{d}} \\ \frac{1}{S}\langle e^l_t|\left(|A\rangle^l\right)_{\frac{3D}{d}:D} \end{bmatrix} \tag{5.30}$$
Following Equations (5.5) and (5.6), we define the estimators for the On-indicator
$$\hat{\alpha}^l_t := \left(\hat{a}^l_t\right)_0 - \left(\hat{a}^l_t\right)_1 \tag{5.31}$$
and the rotating vector
$$\hat{x}^l_t := \begin{bmatrix} \left(\hat{a}^l_t\right)_2 - \hat{\alpha}^l_t \\ \left(\hat{a}^l_t\right)_3 - \hat{\alpha}^l_t \end{bmatrix} \tag{5.32}$$
The readout errors are defined as
$$\delta a^l_t := \hat{a}^l_t - a^l_t \tag{5.33}$$
$$\delta\alpha^l_t := \hat{\alpha}^l_t - \alpha^l_t \tag{5.34}$$
$$\delta x^l_t := \hat{x}^l_t - x^l_t \tag{5.35}$$
6 Appendix 2: Error calculations
In this section we'll attempt to adapt the error calculations from the previous post to the new framework, and use the resulting expressions to try and predict the errors on our conditional rotation circuits. The main change in the construction we need to account for is the presence of the embedding matrix Wl, which didn't exist in the construction the previous post used. We will also include some particular terms in the error calculations that the previous post ignored. These terms are sub-leading in the limit of large network width D and number of circuits T, but turned out to be non-negligible for some of the empirical data we tested the theory on.
The results here will only hold as long as the size of the errors stays much smaller than the size of the signal, which won't always be the case in reality. Past that point, errors may compound much more quickly over layers than the results here would predict.
We find that the terms for the error on inactive circuits carry over unchanged, but the error on active circuits is a little different.
Note that these calculations are not as careful as the ones in the previous post. In some spots, we just guess at approximations for certain terms and then check those guesses empirically.
6.1 Error on inactive circuits
Provided the overall error stays small enough to not change the signs of any ReLU neuron preactivations, the only source of error on inactive circuits is the embedding overlap error, which was estimated in the previous post as:
$$\delta a^l_t = \left(\delta a^l_t\right)_{\text{embedding-overlap}} \approx \frac{1}{S}\sum_{\substack{u \text{ is active}\\ u\neq t}} \langle e^l_t|e^l_u\rangle\,a^l_u \qquad\text{if } t \text{ is inactive} \tag{6.1}$$
This is the error that results from signal in active circuits bleeding over into the inactive circuits they share neurons with. Since this error depends purely on how the activations of the circuits are embedded in the neurons of the network, and that embedding hasn't changed from the previous construction, the formula for it remains the same.
In the previous post, we used the approximation $\mathbb{E}_{u\neq t}\!\left[\left(\frac{1}{S}\langle e^l_t|e^l_u\rangle\right)^2\right] \approx \frac{d}{D}$ to evaluate this error as[19]
$$\mathbb{E}\!\left[|\delta a^l_t|^2\right] \approx z\,\frac{d}{D}\,\mathbb{E}_{v \text{ is active}}\!\left[|a^l_v|^2\right] \qquad\text{if } t \text{ is inactive} \tag{6.2}$$
Here, we will instead use the slightly more precise approximation
$$\mathbb{E}_{u\neq t}\!\left[\left(\frac{1}{S}\langle e^l_t|e^l_u\rangle\right)^2\right] \approx \frac{d}{D} - \frac{1}{ST}. \tag{6.3}$$
This results in a minor but noticeable improvement in the fitting of the data. Inserting the new approximation into the previous derivation yields
$$\mathbb{E}\!\left[|\delta a^l_t|^2\right] \approx z\left(\frac{d}{D} - \frac{1}{ST}\right)\mathbb{E}_{v \text{ is active}}\!\left[|a^l_v|^2\right] \qquad\text{if } t \text{ is inactive} \tag{6.4}$$
6.2 Error on active circuits
We'll assume for now that $\bar{w}^l=0$, so that Equation (5.20) becomes
$$W^l = \frac{1}{S}\sum_t w^l_t\,|e^l_t\rangle\langle e^{l-1}_t| \qquad\text{for } l\geq 2 \tag{6.5}$$
From definitions (5.28) and (5.29), it follows that
$$\frac{1}{S}\langle e^l_t|B\rangle^l = b^l \tag{6.6}$$
Using the definition of the readout $\hat{a}^l_t$ for circuit $t$ in layer $l$, from Equation (5.30), the error $\delta a^l_t$ can be written as
$$\begin{aligned} \delta a^l_t &:= \hat{a}^l_t - a^l_t &\text{(6.7)}\\ &= \frac{1}{S}\langle e^l_t|\,\mathrm{ReLU}\!\left(W^l|A\rangle^{l-1} + |B\rangle^l\right) - a^l_t &\text{(6.8)}\\ &= \frac{1}{S}\langle e^l_t|\,\mathrm{ReLU}\!\left(\sum_u |e^l_u\rangle\,w^l_u\,\frac{1}{S}\langle e^{l-1}_u|A\rangle^{l-1} + |B\rangle^l\right) - a^l_t &\text{(6.9)}\\ &= \frac{1}{S}\langle e^l_t|\,\mathrm{ReLU}\!\left(\sum_u |e^l_u\rangle\,w^l_u\left(a^{l-1}_u + \delta a^{l-1}_u\right) + |B\rangle^l\right) - a^l_t &\text{(6.10)} \end{aligned}$$
If we split the sum over circuits $\sum_u$ into active and passive circuits, we can simplify this expression:
- The term $\sum_{u \text{ is inactive}} |e^l_u\rangle\,w^l_u\,a^{l-1}_u$ drops out, because by definition $a^{l-1}_u=0$ for inactive circuits.
- From the previous section, we know that
$$\sum_{u \text{ is inactive}} |e^l_u\rangle\,w^l_u\,\delta a^{l-1}_u = \sum_{u \text{ is inactive}} |e^l_u\rangle\,w^l_u\left(\delta a^{l-1}_u\right)_{\text{embedding overlap}}. \tag{6.11}$$
- Provided the noise stays small compared to the signal, the term $\sum_{u \text{ is active},\, u\neq t} |e^l_u\rangle\,w^l_u\,\delta a^{l-1}_u$ will be sub-leading compared to $\sum_{u \text{ is active}} |e^l_u\rangle\,w^l_u\,a^{l-1}_u$. Instead of neglecting it entirely, however, we will keep its embedding overlap error contribution
$$\sum_{\substack{u \text{ is active}\\ u\neq t}} |e^l_u\rangle\,w^l_u\left(\delta a^{l-1}_u\right)_{\text{embedding-overlap}}, \tag{6.12}$$
since it is simple to evaluate, and only neglect the rest.[20]
This leaves us with
$$\delta a^l_t \approx \frac{1}{S}\langle e^l_t|\,\mathrm{ReLU}\!\left(\sum_{u \text{ is active}} |e^l_u\rangle\,w^l_u\,a^{l-1}_u + |e^l_t\rangle\,w^l_t\,\delta a^{l-1}_t + \sum_{u\neq t} |e^l_u\rangle\,w^l_u\left(\delta a^{l-1}_u\right)_{\text{embedding overlap}} + |B\rangle^l\right) - a^l_t \tag{6.13}$$
As in the last post, we will assume that all ReLU neurons used by the active circuits are turned on, since doing so will simplify the calculations and not overestimate the errors in expectation.[21] This gives
$$\delta a^l_t \approx \frac{1}{S}\sum_{u \text{ is active}} \langle e^l_t|e^l_u\rangle\,w^l_u\,a^{l-1}_u - a^l_t + \frac{1}{S}\langle e^l_t|e^l_t\rangle\,w^l_t\,\delta a^{l-1}_t + \frac{1}{S}\sum_{u \text{ is inactive}} \langle e^l_t|e^l_u\rangle\,w^l_u\left(\delta a^{l-1}_u\right)_{\text{embedding overlap}} + \frac{1}{S}\langle e^l_t|B\rangle^l \qquad\text{if } t \text{ is active} \tag{6.14}$$
Inserting approximation (6.1) for $\left(\delta a^{l-1}_u\right)_{\text{embedding overlap}}$ on inactive circuits into this expression yields
$$\delta a^l_t \approx \frac{1}{S}\sum_{u \text{ is active}} \langle e^l_t|e^l_u\rangle\,w^l_u\,a^{l-1}_u - a^l_t + \frac{1}{S}\langle e^l_t|e^l_t\rangle\,w^l_t\,\delta a^{l-1}_t + \frac{1}{S^2}\sum_{u\neq t}\;\sum_{\substack{v \text{ is active}\\ v\neq u}} \langle e^l_t|e^l_u\rangle\,w^l_u\,\langle e^{l-1}_u|e^{l-1}_v\rangle\,a^{l-1}_v + \frac{1}{S}\langle e^l_t|B\rangle^l \qquad\text{if } t \text{ is active} \tag{6.15}$$
which evaluates to
$$\delta a^l_t \approx \frac{1}{S}\langle e^l_t|W^l|e^{l-1}_t\rangle\,a^{l-1}_t - w^l_t\,a^{l-1}_t + \frac{1}{S}\sum_{\substack{v \text{ is active}\\ v\neq t}}\left(\langle e^l_t|W^l|e^{l-1}_v\rangle\,a^{l-1}_v - \langle e^{l-1}_t|e^{l-1}_v\rangle\,w^l_t\,a^{l-1}_v\right) + w^l_t\,\delta a^{l-1}_t \qquad\text{if } t \text{ is active} \tag{6.16}$$
This derivation assumed that $\bar{w}^l=0$, i.e., the circuits are uncorrelated with each other. However, we're just going to go ahead and guess that (6.16) is still a reasonable approximation in the case $\bar{w}^l\neq 0$ as well.[22] We'll see how true that is when we compare these predictions to the error we measure in practice.
We can break this expression up into three components:
$$\delta a^l_t \approx \left(\delta a^l_t\right)_{\text{self}} + \left(\delta a^l_t\right)_{\text{others}} + \left(\delta a^l_t\right)_{\text{previous}} \tag{6.17}$$
$$\left(\delta a^l_t\right)_{\text{self}} := \frac{1}{S}\langle e^l_t|W^l|e^{l-1}_t\rangle\,a^{l-1}_t - w^l_t\,a^{l-1}_t \tag{6.18}$$
$$\left(\delta a^l_t\right)_{\text{others}} := \frac{1}{S}\sum_{\substack{v \text{ is active}\\ v\neq t}}\left(\langle e^l_t|W^l|e^{l-1}_v\rangle\,a^{l-1}_v - \langle e^{l-1}_t|e^{l-1}_v\rangle\,w^l_t\,a^{l-1}_v\right) \tag{6.19}$$
$$\left(\delta a^l_t\right)_{\text{previous}} := w^l_t\,\delta a^{l-1}_t \tag{6.20}$$
These three components have mean zero and are uncorrelated with each other, so we can calculate their mean squared errors separately.
6.2.1 Error from Self Signal
Here we'll calculate the mean square of the error contribution $\left(\delta a^l_t\right)_{\text{self}}$ from Equation (6.18). Inserting the definition of $W^l$ into the definition of $\left(\delta a^l_t\right)_{\text{self}}$ gives:
$$\begin{aligned} \left(\delta a^l_t\right)_{\text{self}} &:= \frac{1}{S}\langle e^l_t|W^l|e^{l-1}_t\rangle\,a^{l-1}_t - w^l_t\,a^{l-1}_t &\text{(6.21)}\\ &= \frac{1}{S}\langle e^l_t|\left(\bar{w}^l\,\boldsymbol{W}^l + \frac{1}{S}\sum_u \Delta w^l_u\,|e^l_u\rangle\langle e^{l-1}_u|\right)|e^{l-1}_t\rangle\,a^{l-1}_t - w^l_t\,a^{l-1}_t &\text{(6.22)}\\ &= \frac{1}{S^2}\sum_{u\neq t}\langle e^l_t|e^l_u\rangle\langle e^{l-1}_t|e^{l-1}_u\rangle\,\Delta w^l_u\,a^{l-1}_t + \Delta w^l_t\,a^{l-1}_t + \bar{w}^l\,a^{l-1}_t - w^l_t\,a^{l-1}_t &\text{(6.23)}\\ &= \frac{1}{S^2}\sum_{u\neq t}\langle e^l_t|e^l_u\rangle\langle e^{l-1}_t|e^{l-1}_u\rangle\,\Delta w^l_u\,a^{l-1}_t, &\text{(6.24)} \end{aligned}$$
where in the third line we used the fact that
$$\frac{1}{S}\langle e^l_t|\boldsymbol{W}^l|e^{l-1}_t\rangle = 1, \tag{6.25}$$
which follows from the definition of $\boldsymbol{W}^l$ in Equation (5.23).
To estimate the expected norm of the error, we insert approximation (6.3) again:
$$\begin{aligned} \mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{self}}\right|^2\right] &= (T-1)\,\mathbb{E}\!\left[\left(\tfrac{1}{S}\langle e^l_t|e^l_u\rangle\right)^2\right]\mathbb{E}\!\left[\left(\tfrac{1}{S}\langle e^{l-1}_t|e^{l-1}_u\rangle\right)^2\right]\mathbb{E}\!\left[\left|\Delta w^l_u\,a^{l-1}_t\right|^2\right] &\text{(6.26)}\\ &\approx T\left(\frac{d}{D}-\frac{1}{ST}\right)^2 \mathbb{E}\!\left[\left|\Delta w^l_u\,a^{l-1}_t\right|^2\right] &\text{(6.27)} \end{aligned}$$
6.2.2 Error from others' signals
Here we'll calculate the mean square of the error contribution $\left(\delta a^l_t\right)_{\text{others}}$ from Equation (6.19). First, we further subdivide $\left(\delta a^l_t\right)_{\text{others}}$ into two parts, one for terms involving $\Delta w$ and one for terms involving $\bar{w}$:
$$\left(\delta a^l_t\right)_{\text{others}} = \left(\delta a^l_t\right)_{\text{others},\,\Delta w} + \left(\delta a^l_t\right)_{\text{others},\,\bar{w}} \tag{6.28}$$
$$\left(\delta a^l_t\right)_{\text{others},\,\Delta w} := \frac{1}{S}\sum_{\substack{v \text{ is active}\\ v\neq t}}\left(\langle e^l_t|\left(\frac{1}{S}\sum_u \Delta w^l_u\,|e^l_u\rangle\langle e^{l-1}_u|\right)|e^{l-1}_v\rangle\,a^{l-1}_v - \langle e^{l-1}_t|e^{l-1}_v\rangle\,\Delta w^l_t\,a^{l-1}_v\right) \tag{6.29}$$
$$\left(\delta a^l_t\right)_{\text{others},\,\bar{w}} := \frac{1}{S}\sum_{\substack{v \text{ is active}\\ v\neq t}}\left(\langle e^l_t|\bar{w}^l\,\boldsymbol{W}^l|e^{l-1}_v\rangle\,a^{l-1}_v - \langle e^{l-1}_t|e^{l-1}_v\rangle\,\bar{w}^l\,a^{l-1}_v\right) \tag{6.30}$$
We can straightforwardly approximate the expectation of the square of $\left(\delta a^l_t\right)_{\text{others},\,\Delta w}$. The steps for this are similar to those we used for $\left(\delta a^l_t\right)_{\text{self}}$ in the previous subsection, see Equations (6.26) and (6.27):
$$\mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{others},\,\Delta w}\right|^2\right] \approx (z-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2 \mathbb{E}_{u\neq v}\!\left[\left|\Delta w^l_u\,a^{l-1}_v\right|^2\right] + (z-1)\left(\frac{d}{D}-\frac{1}{ST}\right)\mathbb{E}_v\!\left[\left|\Delta w^l_v\,a^{l-1}_v\right|^2\right] \tag{6.31}$$
The expectation of $\left(\delta a^l_t\right)_{\text{others},\,\bar{w}}$ squared is a bit tougher. We didn't see an easy way to formally derive an approximation for it. So instead, we're going to be lazy: we'll just guess that it can be approximately described by the same formula as the one for $\left(\delta a^l_t\right)_{\text{others},\,\Delta w}$, just with $\Delta w^l_u$ and $\Delta w^l_v$ swapped for $\bar{w}^l$. Then we'll just check that guess empirically on some example data.
$$\mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{others},\,\bar{w}}\right|^2\right] \approx (z-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2 \mathbb{E}_v\!\left[\left|\bar{w}^l\,a^{l-1}_v\right|^2\right] + (z-1)\left(\frac{d}{D}-\frac{1}{ST}\right)\mathbb{E}_v\!\left[\left|\bar{w}^l\,a^{l-1}_v\right|^2\right] \tag{6.32}$$
To verify this equation, we can use the fact that $\bar{w}^l$ can be factored out from Equation (6.30). So, if the result holds for one set of circuits, it should hold for any set of circuits. We just need to make up some values for $a^{l-1}_v$, $\bar{w}^l$, and $\Delta w$ to check the equation on.[23] To keep it simple, we choose the following:
$$a^{l-1}_v = 1 \quad\text{for all active circuits } v \tag{6.33}$$
$$\bar{w}^l = 1 \tag{6.34}$$
$$\Delta w_u := \begin{cases} 1 & \text{with probability } 0.5 \\ -1 & \text{with probability } 0.5 \end{cases} \quad\text{sampled independently for each circuit } u \tag{6.35}$$
We find a good fit for small values of $\left(\delta a^l_t\right)_{\text{others}}$ and small values of $S$. For larger $\left(\delta a^l_t\right)_{\text{others}}$ and $S$ we see deviations of up to 50% from the hypothesis. I.e., the approximation is not perfect, but it seems ok. At larger errors, we expect all our error calculations to break down anyway, because the errors start getting large enough to switch on ReLU neurons that should be off, leading to cascading failures as the network has a "seizure".
The two terms also add up as we would expect.
$$\begin{aligned} \mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{others}}\right|^2\right] &= \mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{others},\,\Delta w}\right|^2\right] + \mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{others},\,\bar{w}}\right|^2\right] &\text{(6.36)}\\ &\approx (z-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2 \mathbb{E}_{u\neq v}\!\left[\left|w^l_u\,a^{l-1}_v\right|^2\right] + (z-1)\left(\frac{d}{D}-\frac{1}{ST}\right)\mathbb{E}_v\!\left[\left|w^l_v\,a^{l-1}_v\right|^2\right] &\text{(6.37)} \end{aligned}$$
6.2.3 Previous error
Assumption 5 about our circuits was that active circuits should not amplify the size of errors relative to the size of the signal from layer to layer:
$$\frac{\left|w^l_t\,\delta a^{l-1}_t\right|^2}{\left|a^l_t\right|^2} \leq \frac{\left|\delta a^{l-1}_t\right|^2}{\left|a^{l-1}_t\right|^2} \qquad\text{if } t \text{ is active.} \tag{6.38}$$
Since $|a^l_t| = |a^{l-1}_t|$ for our circuits, this implies that
$$\mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{previous}}\right|^2\right] = \mathbb{E}\!\left[\left|w^l_t\,\delta a^{l-1}_t\right|^2\right] \leq \mathbb{E}\!\left[\left|\delta a^{l-1}_t\right|^2\right]. \tag{6.39}$$
For simplicity, we use the most pessimistic estimate:
$$\mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{previous}}\right|^2\right] = \mathbb{E}\!\left[\left|\delta a^{l-1}_t\right|^2\right] \tag{6.40}$$
6.2.4 Adding It All Up
Adding the three terms together and applying the resulting formula recursively to evaluate the propagated error from previous layers, we get
$$\mathbb{E}\!\left[\left|\delta a^l_t\right|^2\right] \approx (l-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2\left(z\,\mathbb{E}_{u\neq v}\!\left[\left|\Delta w^l_u\,a^{l-1}_v\right|^2\right] + (z-1)\,\mathbb{E}_v\!\left[\left|\bar{w}^l\,a^{l-1}_v\right|^2\right]\right) + l\left(\frac{d}{D}-\frac{1}{ST}\right)(z-1)\,\mathbb{E}_v\!\left[\left|w^l_v\,a^{l-1}_v\right|^2\right] \tag{6.41}$$
The first term has a prefactor of $(l-1)$ while the second has a prefactor of $l$, because the first term is the propagation error (from the previous post), which only appears from the second layer onward, while the second term is the embedding overlap error, which appears from the first layer onward.
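For reference, Equation (6.41), together with Equation (6.4) for inactive circuits, is simple enough to transcribe directly. Below is a sketch in our own notation; the expectations over circuit weights are passed in as plain numbers, which for the rotation circuits evaluate to 1, 8, and 9 on the rotating-vector components (see Section 6.3.2 below).

```python
def overlap_factor(d, D, S, T):
    return d / D - 1.0 / (S * T)

def predicted_mse_inactive(z, d, D, S, T, e_a_sq):
    # Eq. (6.4): z * (d/D - 1/(S T)) * E[|a_v|^2] over active circuits v
    return z * overlap_factor(d, D, S, T) * e_a_sq

def predicted_mse_active(l, z, d, D, S, T, e_dw_sq, e_wbar_sq, e_w_sq):
    # Eq. (6.41): propagation term (from layer 2 on) plus embedding-overlap term
    f = overlap_factor(d, D, S, T)
    return ((l - 1) * T * f**2 * (z * e_dw_sq + (z - 1) * e_wbar_sq)
            + l * f * (z - 1) * e_w_sq)

# e.g. the rotating-vector prediction of Eq. (6.55) for the post's settings:
print(predicted_mse_active(l=3, z=2, d=4, D=1000, S=6, T=557,
                           e_dw_sq=1.0, e_wbar_sq=8.0, e_w_sq=9.0))
```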
6.3 Errors on our conditional rotation circuits
6.3.1 Errors on the On-indicators
The On-indicators $\alpha^l_t$ are read out from the circuits' hidden activations as $\alpha^l_t := (a^l_t)_0 - (a^l_t)_1$. Provided the overall error stays small enough to not overwhelm any ReLU neuron enough to flip it from off to on, the errors on $(a^l_t)_0$ and $(a^l_t)_1$ will be identical for active circuits, and thus cancel out exactly:
$$\delta\alpha^l_t = 0 \qquad\text{if } t \text{ is active.} \tag{6.42}$$
For inactive circuits, Equations (5.4) and (6.4) yield:
$$\mathbb{E}\!\left[\left(\delta\alpha^l_t\right)^2\right] = (1.5-0.5)\left(\frac{d}{D}-\frac{1}{ST}\right)z = \left(\frac{d}{D}-\frac{1}{ST}\right)z \qquad\text{if } t \text{ is inactive.} \tag{6.43}$$
6.3.2 Errors on the Rotating Vectors
For active circuits
We can evaluate the errors on active rotation circuits using Equation (6.41). We want the error on $\mathbb{E}\!\left[(\delta x^l_t)^2\right]$ though, not on $\mathbb{E}\!\left[(\delta a^l_t)^2\right]$. However, provided that $\delta\alpha^l_t=0$, we have
$$\delta x^l_t = \begin{bmatrix} \left(\delta a^l_t\right)_2 \\ \left(\delta a^l_t\right)_3 \end{bmatrix} \tag{6.44}$$
So, since $\delta\alpha^l_t=0$ for active circuits, we can just restrict (6.41) to the last two vector indices of $\delta a$ to calculate the error:
$$\begin{aligned} \mathbb{E}\!\left[\left|\delta x^l_t\right|^2\right] &= \mathbb{E}\!\left[\left|\delta a^l_t\right|^2_{2:3}\right] &\text{(6.45)}\\ &\approx (l-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2\left(z\,\mathbb{E}_{u\neq v}\!\left[\left|\Delta w^l_u\,a^{l-1}_v\right|^2_{2:3}\right] + (z-1)\,\mathbb{E}_v\!\left[\left|\bar{w}^l\,a^{l-1}_v\right|^2_{2:3}\right]\right) + l\left(\frac{d}{D}-\frac{1}{ST}\right)(z-1)\,\mathbb{E}_v\!\left[\left|w^l_v\,a^{l-1}_v\right|^2_{2:3}\right] &\text{(6.46)} \end{aligned}$$
where we define
$$\left|\delta a^l_t\right|^2_{2:3} := \left(\delta a^l_t\right)_2^2 + \left(\delta a^l_t\right)_3^2 \tag{6.47}$$
Inserting the definitions of $\bar{w}^l$, $\Delta w^l_u$, and $a^l_v$ for our circuits into this formula yields
$$\bar{w}^l\,a^{l-1}_v = \begin{bmatrix} 2&-2&0&0\\ 2&-2&0&0\\ 2&-2&0&0\\ 2&-2&0&0 \end{bmatrix}\begin{bmatrix} 1.5\\ 0.5\\ (x^{l-1}_v)_0+1\\ (x^{l-1}_v)_1+1 \end{bmatrix} = \begin{bmatrix} 2\\2\\2\\2 \end{bmatrix} \tag{6.48}$$
$$\Rightarrow\quad \left|\bar{w}^l\,a^{l-1}_v\right|^2_{2:3} = 2^2 + 2^2 = 8 \tag{6.49}$$
and
$$\Delta w^l_u\,a^{l-1}_v = \begin{bmatrix} 0&0&0&0\\ 0&0&0&0\\ -(\cos\theta_u-\sin\theta_u)&(\cos\theta_u-\sin\theta_u)&\cos\theta_u&-\sin\theta_u\\ -(\sin\theta_u+\cos\theta_u)&(\sin\theta_u+\cos\theta_u)&\sin\theta_u&\cos\theta_u \end{bmatrix}\begin{bmatrix} 1.5\\ 0.5\\ (x^{l-1}_v)_0+1\\ (x^{l-1}_v)_1+1 \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ (r_u\,x^{l-1}_v)_0\\ (r_u\,x^{l-1}_v)_1 \end{bmatrix} \tag{6.50}$$
where
$$r_u := \begin{bmatrix} \cos\theta_u & -\sin\theta_u\\ \sin\theta_u & \cos\theta_u \end{bmatrix} \tag{6.51}$$
and hence
$$\left|r_u\,x^{l-1}_v\right|^2 = \left|x^{l-1}_v\right|^2 = 1 \tag{6.52}$$
$$\Rightarrow\quad \left|\Delta w^l_u\,a^{l-1}_v\right|^2_{2:3} = 1. \tag{6.53}$$
Since we didn't use $u\neq v$, the above holds equally well for $u=v$. Additionally, since the directions of $\left(\bar{w}^l\,a^{l-1}_v\right)_{2:3}$ and $\left(\Delta w^l_v\,a^{l-1}_v\right)_{2:3}$ are uncorrelated, we can just add them up:
$$\mathbb{E}\!\left[\left|w^l_v\,a^{l-1}_v\right|^2_{2:3}\right] = \left|\bar{w}^l\,a^{l-1}_v\right|^2_{2:3} + \left|\Delta w^l_v\,a^{l-1}_v\right|^2_{2:3} = 9 \tag{6.54}$$
Inserting everything finally yields
$$\mathbb{E}\!\left[\left|\delta x^l_t\right|^2\right] \approx (l-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2\bigl(z + 8(z-1)\bigr) + 9\,l\left(\frac{d}{D}-\frac{1}{ST}\right)(z-1) \qquad\text{if } t \text{ is active} \tag{6.55}$$
For inactive circuits
$$\mathbb{E}\!\left[\left|\delta x^l_t\right|^2\right] = \mathbb{E}\!\left[\left|\delta a^l_t\right|^2_{2:3}\right] \tag{6.56}$$
Inserting Equation (6.4) into this yields
$$\mathbb{E}\!\left[\left|\delta x^l_t\right|^2\right] = \left(\frac{d}{D}-\frac{1}{ST}\right)z \qquad\text{if } t \text{ is inactive.} \tag{6.57}$$
[1] Meaning they are only applied if a Boolean variable is true.
[2] That post was a correction and continuation of this earlier sketch for such a framework, which was in turn based on this framework for compressing Boolean logic gates into neural networks.
[3] It was assumption 4, that the weights of different circuits are uncorrelated with each other.
[4] We did try that as well, and might present those results in a future post. They look overall pretty similar. The average error on the circuits just scales with a better prefactor.
[5] $T=\tilde{O}(D^2/d^2)$ basically means "$T=O(D^2/d^2)$ up to log factors". The full requirement is that $T\log(T)\frac{d^2}{D^2}$ must be small. This result lines up well with expressions for the number of logic gates that neural networks can implement in superposition that were derived here [1, 2 and 3]. It also makes intuitive sense from an information theory perspective. The bits of information specifying the parameters of the $T$ circuits have to fit into the parameters of the neural network.
[6] By "active error correction" we mean using additional layers for error clean-up. I.e., some of the large network layers are dedicated exclusively to noise reduction and do not carry out any circuit computations. One reason we don't do this is the cost of the extra layers, but another is that this only works if the circuit activations belong to a discrete set of allowed values. In our circuits this is the case for the On-indicator but not the rotated vector. So we would need to discretize the allowed angles to apply error correction to them. See also Appendix section D.4 here, for how to do error correction on Boolean variables in superposition.
[7] Data for z=1 is generated by running the network T times, once for every possible active circuit. Data for z=2 and z=3 is generated by running the network 100×T times, with randomly generated pairs and triples of active circuits. The sample sizes for z>1 were chosen to be large enough that the worst-case absolute error did not seem to change very much any more when the sample size was increased further. It was still fluctuating a little though, so maybe treat those points with some caution.
[8] This number was chosen because it is the largest T possible for D/d=250 and S=6 in our prime-factorization-based embedding algorithm.
[9] Thanks to Gurkenglas for coming up with this evocative term.
[10] If the noise level gets large enough to significantly overwhelm the inactive circuits' error suppression, then we are outside the applicability of the theory.
[11] $b^l$ has no subscript $t$ because the embedding scheme (described in Section 5.4) requires all small circuits to have the same biases in any given layer. If this is not the case, the circuits need to be re-scaled so that the biases match.
[12] In the general formalism, the small circuit weights $w^l_t$ and biases $b^l$ can be different for each layer $l$. However, in the specific example we are implementing here, $w^l_t$ and $b^l$ do not depend on $l$.
[13] The "future post" mentioned in this quote from the last post is this post. The trick to modify a circuit such that assumption 2 holds is to add an On-indicator. We haven't actually described this in full generality yet, but you can hopefully see how a similar trick can be used to modify circuits other than rotations.
[14] You might notice that for both assumption 2 and 3 to hold simultaneously, each small network needs a negative bias.
[15] We haven't actually tried that out in practice though.
[16] As long as all the biases have the same sign, we can rescale the weights and biases of different circuits prior to embedding so that assumption 4 is satisfied. If some biases don't have the same sign, we can increase d, and partition the circuits into different parts of $\mathbb{R}^d$ such that the biases have the same sign in each part. This potentially comes at the cost of being able to fit in fewer circuits, because it makes d larger. d will never have to be larger than 2× the initial d though.
[17] Assumption 5 doesn't need to strictly hold for circuits of finite depth. It also doesn't strictly need to hold at any layer individually. The circuit could also have active noise 'correction' every few layers such that assumption 5 holds on average.
[18] Or 'dot product'.
[19] This formula assumes that circuit activation vectors $a^l_t$ are all of similar magnitude.
[20] We can't neglect this term for $u=t$, because $|e^l_t\rangle\,w^l_t\,a^{l-1}_t$ will get canceled by the $-a^l_t$, so it isn't sub-leading.
[21] This assumption holds for the conditional rotation circuits.
[22] Intuitively it doesn't seem like $\bar{w}^l\neq 0$ should ultimately change much in the derivation, but the calculations to show this were just getting kind of cumbersome. So we figured we might as well just guess and check.
[23] What values we pick shouldn't matter so long as $\Delta w$ satisfies $\mathbb{E}_u[\Delta w^l_u]=0$.