If I understand correctly, inactive circuits mistakenly activating is the main failure mode - once this happens, things go downhill quickly. So the bottleneck is robustly knowing which circuits should be active.
Could we use O(d) redundant On-indicators per small circuit, instead of just 1, and apply the 2-ReLU trick to their average to increase resistance to noise?
In the Section 5 scenario, would it help to use additional neurons to encode the network's best guess for the active circuits early on, before noise accumulates, and preserve this over layers? You could do something like track the circuits which have the most active neuron mass associated with them in the first layer of your constructed network (though this would need guarantees like circuits being relatively homogeneous in the norm of their activations).
The reason I think this is a reasonable idea is that LLMs do seem to compute binary indicators of when to use circuits, separate from the circuits themselves. Ferrando et al. found models have features for whether they recognize an entity, which gates fact lookup. In GPT2-Small, the entity recognition circuit is just a single first-layer neuron, while the fact-lookup circuit it controls is presumably very complex (GDM couldn't reverse-engineer it). This suggests networks naturally learn simple early gating for complex downstream computations.
Thanks for your questions
Could we use O(d) redundant On-indicators per small circuit, instead of just 1, and apply the 2-ReLU trick to their average to increase resistance to noise?
If I understand you correctly, this is already what we are doing. Each On-indicator is distributed over S pairs of neurons in the Large Network, where we used S=6 for the results in this post.
I can't increase S more than that, for the given D and T, without breaking the constraint that no two circuits should overlap in more than one neuron, and this constraint is an important assumption in the error calculations. However, it is possible that a more clever way to allocate neurons could help this a bit.
See here: https://www.lesswrong.com/posts/FWkZYQceEzL84tNej/circuits-in-superposition-2-now-with-less-wrong-math#Construction
In the Section 5 scenario, would it help to use additional neurons to encode the network's best guess for the active circuits early on, before noise accumulates, and preserve this over layers? You could do something like track the circuits which have the most active neuron mass associated with them in the first layer of your constructed network (though this would need guarantees like circuits being relatively homogeneous in the norm of their activations).
We don't have extra neurons lying around. This post is about superposition, which means we have fewer neurons than features.
If we assume that which circuits are active never changes across layers (which is true for the example in this post), there is another thing we can do. We can encode the On-indicators at the start, in superposition, and then just copy this information from layer to layer. This prevents the part of the network that is responsible for the On-indicators from seizing. The reason we didn't do this is that we wanted to test a method that could be used more generally.
encode the network's best guess for the active circuits early on
We don't need to encode the best guess; we can just encode the true value (up to the uncertainty that comes from compressing it into superposition), given in the input. If we assume that these values stay constant, that is. But storing the value from early on also assumes that the value of the On-indicators stays fixed.
The reason I think this is a reasonable idea is that LLMs do seem to compute binary indicators of when to use circuits, separate from the circuits themselves. Ferrando et al. found models have features for whether they recognize an entity, which gates fact lookup. In GPT2-Small, the entity recognition circuit is just a single first-layer neuron, while the fact-lookup circuit it controls is presumably very complex (GDM couldn't reverse-engineer it). This suggests networks naturally learn simple early gating for complex downstream computations.
Ooo, interesting! I will definitely have a look into those papers. Thanks!
To clarify where I'm coming from, I read the previous post as showing that, assuming inactive circuits never activate due to superposition noise, we could get T = Õ(D²/d²). This post shows that inactive circuits erroneously activate in practice, violating that assumption. I'm curious what asymptotics are possible if we remove this assumption and force ourselves to design the network to prevent such erroneous activations. I may be misinterpreting things, though.
If I understand you correctly, this is already what we are doing. Each On-indicator is distributed over S pairs of neurons in the Large Network, where we used S=6 for the results in this post.
Agnostic of the method for embedding the small circuits in the larger network, currently only O(1) of the d neurons in each small network are being allocated to storing whether the small network is on or off. I'm suggesting increasing that to εd for some small fixed ε, increasing the size of the small networks to (1+ε)d neurons. In the rotation example, d is so small that it doesn't really make sense. But I'm just thinking about the asymptotics. This should generalise straightforwardly to the "cross-circuit" computation case as well.
The idea is that while each of the indicator neurons would be the same as each other in the smaller network, when embedded in the larger network, the noise each small network neuron (distributed across S neurons in the large network) receives is hopefully independent.
If we assume that which circuits are active never changes across layers (which is true for the example in this post), there is another thing we can do. We can encode the On-indicators at the start, in superposition, and then just copy this information from layer to layer. This prevents the part of the network that is responsible for the On-indicators from seizing. The reason we didn't do this is that we wanted to test a method that could be used more generally.
This method also works under the assumptions specified in Section 5.2, right? Under the Section 5.2 assumptions, it suffices to encode the circuits which are active on the first layer, of which there are at most z. Even if you erroneously believe one of the circuits is active on a later layer, when it has turned off, the gain comes from eliminating the other inactive circuits. If the On-indicators don't seize, then you can stop any part of the circuit from seizing in the Section 5.2 scenario.
I agree shared state/cross-circuit computation is an important thing to model, though. I guess that's what you mean by more generally? In which case I misunderstood the post completely. I thought it was saying that the construction of the previous post ran into problems in practice. But it seems like you're just saying, if we want this to work more generally, there are issues?
---------------------------------------
This series of posts is really useful, thank you! I have been thinking about it a lot for the past couple of days.
This post shows that inactive circuits erroneously activate in practice, violating that assumption. I’m curious what asymptotics are possible if we remove this assumption and force ourselves to design the network to prevent such erroneous activations. I may be misinterpreting things, though.
The erroneous activation only happens if the errors get large enough. With large enough T, D and S, this should be avoidable.
Agnostic of the method for embedding the small circuits in the larger network, currently only O(1) of the d neurons in each small network are being allocated to storing whether the small network is on or off. I'm suggesting increasing that to εd for some small fixed ε, increasing the size of the small networks to (1+ε)d neurons. In the rotation example, d is so small that it doesn't really make sense. But I'm just thinking about the asymptotics. This should generalise straightforwardly to the "cross-circuit" computation case as well.
Just to clarify, the current circuit uses 2 small-circuit neurons to embed the rotating vector (since it's a two-dimensional vector) and 2 small-circuit neurons for the On-indicator (in order to compute a step function, which requires 2 ReLUs).
We could allocate more of the total storage to on-indicators information and less to the rotated vector. Such a shift may be more optimal.
The idea is that while each of the indicator neurons would be the same as each other in the smaller network, when embedded in the larger network, the noise each small network neuron (distributed across S neurons in the large network) receives is hopefully independent.
The total noise is already a sum of several independent components. Increasing S would make the noise a sum of more, smaller contributions, which is better. Your method would not make the noise contributions any more independent.
The limit on S is that we don't want any two small network neurons to share more than one large network neuron. We have an allocation algorithm that is better than just random distribution, using prime-number steps, but if I make S too large the algorithm runs out of prime numbers, which is why S isn't larger. This is less of a problem for larger D, so I think that in the large-D limit this would not be a problem.
This method also works under the assumptions specified in Section 5.2, right? Under the Section 5.2 assumptions, it suffices to encode the circuits which are active on the first layer, of which there are at most z. Even if you erroneously believe one of the circuits is active on a later layer, when it has turned off, the gain comes from eliminating the other inactive circuits. If the On-indicators don't seize, then you can stop any part of the circuit from seizing in the Section 5.2 scenario.
I'm not sure I understand what you're trying to say, but I'll try to respond anyway.
In our setup, it is known from the start which circuits are going to be active for the rest of the forward pass. This is not one of the assumptions we listed, but it is implicit in the entire framework (there will always be cases like that, even if you try to list all assumptions). However, this "built-in assumption" is just there for convenience, and not because we think it is realistic. I expect that in a real network, which circuits are used in the later layers will depend on computations in the earlier layers.
Possibly there are some things you can eliminate right away? But I think often not. In the transformer architecture, at the start, the network just has the embedding vector for the first token and the positional embedding. After the first attention, the network has a bit more information, but not that much; the softmax will make sure the network just focuses on a few previous words (right?). And every step of computation (including attention) will come with some noise, if superposition is involved.
I agree shared state/cross-circuit computation is an important thing to model, though. I guess that's what you mean by more generally? In which case I misunderstood the post completely. I thought it was saying that the construction of the previous post ran into problems in practice. But it seems like you're just saying, if we want this to work more generally, there are issues?
There is a regime where the (updated) framework works: see Figures 8-11 for values T(d/D)² < 0.004. However, for the sizes of networks I can run on my laptop, that does not leave room for very much superposition.
This series of posts is really useful, thank you! I have been thinking about it a lot for the past couple of days.
Do you want to have a call some time in January?
There are probably lots of things that aren't explained as well as they could have been in the post.
OK, I have a neat construction for z=1 (https://www.lesswrong.com/posts/g9uMJkcWj8jQDjybb/ping-pong-computation-in-superposition) that works pretty well, with the width and layer count given there, and zero error. Note that this is exact, not asymptotic.
I'm trying to find a similar construction that scales like in the number of parameters required, without just scaling the number of layers up. I'd also be curious if it's possible to avoid scaling parameters linearly with , but it seems quite difficult.
We could allocate more of the total storage to on-indicators information and less to the rotated vector. Such a shift may be more optimal.
This is all I mean. Having the small circuits do error correction on the On-indicator neurons is just a way of increasing the total percentage of storage allocated to On-indicators in the larger network, in an embedding-agnostic manner. You can change your method for constructing the larger network later, and this strategy would be orthogonal to such a change.
I think allocating a higher percentage to on-indicators should still apply even when adding cross-circuit computation.
Possibly there are some things you can eliminate right away? But I think often not. In the transformer architecture, at the start, the network just has the embedding vector for the first token and the positional embedding. After the first attention, the network has a bit more information, but not that much; the softmax will make sure the network just focuses on a few previous words (right?). And every step of computation (including attention) will come with some noise, if superposition is involved.
The assumption "softmax will make sure the network just focus on a few previous words (right?)" is true for many attention heads, but not all of them. Some attention heads attend broadly across the whole sequence, aggregating many different tokens together to get the 'gist' of a text.
By the end of the first layer of GPT-2 Small, it has constructed a continuous linear summary of the previous tokens, and so has a Word2Vec style vector in its residual stream. So it knows that the text is about World War I, or about AI, that it is written in British English, that it is formal/informal, etc., all by the end of the first attention layer (before the first MLP). This is lots of information to go on in terms of turning off circuits etcetera (think Mixture-Of-Experts).
There is a regime where the (updated) framework works: see Figures 8-11 for values T(d/D)² < 0.004. However, for the sizes of networks I can run on my laptop, that does not leave room for very much superposition.
Nice, OK. So asymptotically it works fine then? So the next step theory-wise is having a framework that allows for cross-circuit computation, I guess.
Do you want to have a call some time in January?
There are probably lots of things that aren't explained as well as they could have been in the post.
I would be grateful if you get the time. I have recently had some more ideas about this stuff that are potentially more useful, and that might be worth discussing.
TL;DR We experimentally test the mathematical framework for circuits in superposition by hand-coding the weights of an MLP to implement many conditional[1] rotations in superposition on two-dimensional input features. The code can be found here.
This work was supported by Coefficient Giving and Goodfire AI
1 Introduction
In a previous post, we sketched out a construction for compressing many different circuits into a neural network such that they can be computed in superposition.[2]
By computation in superposition, we mean that a network represents features in superposition and can perform more possible computations with them than it has neurons (although not all at once), across multiple layers. Having better models of this is important for understanding whether and how networks use superposition, which in turn is important for mechanistic interpretability research more generally.
Performing computation in superposition over multiple layers introduces additional noise compared to just storing features in superposition. This restricts the amount and type of computation that can be implemented in a network of a given size, because the noise needs to be reduced or suppressed to stay smaller than the signal.
Here, we test the construction by using it to compress circuits that perform random rotations on different two-dimensional input vectors, conditional on a Boolean variable being active.
We ended up needing to adjust the construction a bit to do this, because one of the assumptions the original version made about the circuits to be compressed turned out to be less reasonable in practice than we thought it'd be, and did not actually apply to these rotation circuits.[3]
This sort of thing is why we do testing. Reality is complicated.
Note that this is certainly not the ideal way to implement rotations in superposition. We are testing a general method that should work for a wide class of circuits. If we just wanted to implement these specific circuits as compactly and noiselessly as possible, there'd be better ways to do that.[4]
2 Takeaways
It seems to work. But we can only fit so many independent circuits into the network before the errors compound and explode.
Independent circuits in superposition are costly
Many people we've talked to seem to think that neural networks can easily learn to implement an exponentially large number of independent computations by using superposition. They often seem to base this intuition on toy models that showcase how many independent features neural networks can store in superposition in their activations, but that's very different from implementing independent computations in superposition on these features. The latter is much more difficult and costly.
Our previous post concluded that in theory, we could fit close to $T=\tilde{O}\!\left(\frac{D^2}{d^2}\right)$ circuits that each require $d$ ReLU neurons per layer into a neural network with $D$ ReLU neurons per layer, provided the circuits activate sparsely enough.[5] But that was an asymptotic expression for the case of network width $D$ going to infinity, based only on the leading error terms and mostly ignoring constant prefactors.
In our experiments, we had $d=4$, $D=1000$, and it was already pretty hard to fit in $T\approx 2\frac{D}{d}$ circuits and have the noise stay small enough to not blow up the computation in the first few layers. We think we could perhaps stabilize the computations for a much greater number of layers by implementing more active error correction,[6] but squeezing in many more independent circuits than this seems very difficult to us.
So how can language models memorize so many things?
If fitting many independent circuits into a neural network is so difficult, how can language models memorize so many different facts and even whole text passages? We don't know the answer to that yet. But we have started to train and analyze small toy models of memorization to find out, and we have some suspicions. Our work on this is still at a very early stage, so take the following speculation with an even bigger helping of salt than usual. But:
Many memorization circuits might be much shallower than the ones we investigate here. A shallower depth means fewer stages of computation for errors to accumulate.
Much of the information that models memorize is not actually independent. If a model wants to store that "The Eiffel Tower is in Paris" and that "The capital of France is Paris", it doesn't actually need two independent, non-overlapping lookup circuits to do this. Since both look-ups return the same answer, "Paris", the computation need not distinguish their outputs. Instead, all look-ups that return "Paris" could effectively share the same set of neurons, meaning they'd effectively not be independent look-ups at all, but rather parts of a single general "Sequences that end in Paris" lookup.
This picture of computation in neural networks suggests that there might not be that many more independent circuits than neurons in a large language model like GPT-2. Instead, the models might be exploiting structure in the data to implement what might superficially look like very many different capabilities using a comparatively much smaller set of fairly general circuits.
Lucius: This roughly tracks with some very early results I've been seeing when decomposing the parameters of models like GPT-2 Small into independent components. It currently looks to me like it's actually rare for these models to even have more rank 1 circuit pieces than neurons in a layer, never mind neuron count squared. I'd guess that much wider models have more circuits relative to their neuron count, but at this point I kind of doubt that any real model today gets even remotely close to the theoretical limit of ≈1 circuit per weight. Which might be great, because it'd mean there aren't that many circuits we need to interpret to understand the models.
3 The task
The general setup for our circuits-in-superposition framework goes like this:
In this case, the $T$ small circuits we want to compress into one network each perform a particular two-dimensional rotation on different two-dimensional vectors $x^l_t$, conditional on a Boolean "On-indicator" variable $\alpha^l_t$ taking the value 1. The circuits have $L=6$ layers, with each layer performing another conditional rotation on the previous layer's output. We randomly selected the rotation angle for each circuit. The hidden dimension of each individual circuit is $d=4$.
Conditional rotation
In each layer $l$, the circuits are supposed to rotate a two-dimensional vector $x^{l-1}_t$ with norm $|x^{l-1}_t|\leq 1$ by some angle $\theta_t$, conditional on the Boolean On-indicator $\alpha^{l-1}_t$ taking value 1 rather than 0.
We can implement this with an expression using ReLUs, like this:
$$\begin{bmatrix} (x^l_t)_0 \\ (x^l_t)_1 \end{bmatrix} = \begin{bmatrix} \mathrm{ReLU}\!\left(\cos\theta_t\,(x^{l-1}_t)_0 - \sin\theta_t\,(x^{l-1}_t)_1 + 2\alpha^{l-1}_t - 1\right) - \alpha^{l-1}_t \\ \mathrm{ReLU}\!\left(\sin\theta_t\,(x^{l-1}_t)_0 + \cos\theta_t\,(x^{l-1}_t)_1 + 2\alpha^{l-1}_t - 1\right) - \alpha^{l-1}_t \end{bmatrix}. \tag{3.1}$$
If $\alpha^{l-1}_t=1$, this expression simplifies to
$$\begin{bmatrix} (x^l_t)_0 \\ (x^l_t)_1 \end{bmatrix} = \begin{bmatrix} \cos\theta_t\,(x^{l-1}_t)_0 - \sin\theta_t\,(x^{l-1}_t)_1 \\ \sin\theta_t\,(x^{l-1}_t)_0 + \cos\theta_t\,(x^{l-1}_t)_1 \end{bmatrix}, \tag{3.2}$$
which is a two-dimensional rotation by angle $\theta_t$, as desired. If $\alpha^{l-1}_t=0$, we get $x^l_t=0$.
On-indicator
At each layer $l$, the circuits are also supposed to apply a step function to the On-indicator and pass the result on to the next layer:
$$\alpha^l_t = \mathrm{ReLU}\!\left(2\alpha^{l-1}_t - 0.5\right) - \mathrm{ReLU}\!\left(2\alpha^{l-1}_t - 1.5\right). \tag{3.3}$$
The step function is there to make the On-indicators robust to small noise. Noise robustness is a requirement for circuits to be embedded in superposition. For more details on this and other properties that circuits need to have for computation in superposition to work, see Section 5.2
If the circuit is active, the On-indicator will be initialized as 1, and remain 1. If the circuit is inactive, the On-indicator will be initialized as 0, and stay 0.
$$\alpha^l_t = \begin{cases} 1 & \text{if circuit } t \text{ is active} \\ 0 & \text{if circuit } t \text{ is inactive} \end{cases} \quad \text{for all } l \tag{3.4}$$
For more detail on the small circuits, see Appendix Section 5.1.
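To make the construction concrete, here is a minimal numpy sketch of one layer of a single small circuit, implementing Equations (3.1) and (3.3) directly. It is our own illustration, not the post's released code, and the variable names are ours.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def circuit_layer(x, alpha, theta):
    """One conditional-rotation layer: rotate x by theta if alpha == 1, else output 0."""
    c, s = np.cos(theta), np.sin(theta)
    x_new = np.array([
        relu(c * x[0] - s * x[1] + 2 * alpha - 1) - alpha,   # Eq. (3.1), first row
        relu(s * x[0] + c * x[1] + 2 * alpha - 1) - alpha,   # Eq. (3.1), second row
    ])
    alpha_new = relu(2 * alpha - 0.5) - relu(2 * alpha - 1.5)  # Eq. (3.3), step function
    return x_new, alpha_new

theta = 0.7
x = np.array([0.6, -0.8])                                    # |x| = 1
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

x_on, a_on = circuit_layer(x, 1.0, theta)
x_off, a_off = circuit_layer(x, 0.0, theta)
print(np.allclose(x_on, rot @ x), a_on)    # True 1.0 (active: rotated, stays on)
print(np.allclose(x_off, 0.0), a_off)      # True 0.0 (inactive: clamped to zero)
```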
Why conditional rotations in particular?
No particular reason. We just wanted some simple circuits. Originally, we tried to just do plain two-dimensional rotations. Then we remembered that plain rotations aren't noise-robust in the way our computation-in-superposition framework requires, because they're analytic at every point. We need circuits that return an output of 0 when passed an input vector of sufficiently small size ϵ. Rotations don't do that. Making the rotations conditional on some Boolean logic fixes this, and for simplicity we just went with the easiest Boolean logic possible.
Why the step function?
Although we only need one float to encode $\alpha$, we need two ReLUs to make the circuit robust to noise over multiple layers. If we only used a single ReLU, e.g., $\alpha^l_t=\mathrm{ReLU}(2\alpha^{l-1}_t-1)$, then a small error $\epsilon$ on $\alpha^{l-1}_t$ would be doubled in layer $l$, then doubled again in layer $l+1$, and so on, growing exponentially with circuit depth.
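A quick numerical illustration of this point (our own sketch, not from the post): copying the On-indicator with a single ReLU doubles a small error every layer, while the two-ReLU step function of Equation (3.3) absorbs it.

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)
eps = 0.01
a_single, a_step = 1.0 + eps, 1.0 + eps
for layer in range(6):
    a_single = relu(2 * a_single - 1)                          # single ReLU: error doubles
    a_step = relu(2 * a_step - 0.5) - relu(2 * a_step - 1.5)   # Eq. (3.3): error removed
print(a_single - 1.0, a_step - 1.0)  # ~0.64 (= 2^6 * eps) vs ~0 (error absorbed)
```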
If dedicating two neurons just to copy over the embedding of a single Boolean seems a little silly to you, you're not wrong. If our main goal was just to implement lots of conditional rotation circuits in superposition using as few neurons as possible, there'd be better ways to do it. For example, we could use some fraction of the neurons in the large network to embed the On-indicators, and then just copy this information from layer to layer without ever recalculating these values. Such an implementation would do better at this particular task.
But our goal is not to solve this task in particular, but to test a general method for embedding many different kinds of circuits.
If you like, you can also see this part of the circuits as a stand-in for some more complicated logic, that calculates which circuits are on from layer to layer, i.e., some function that is more complicated than "same as last layer", and involves cross-circuit computations
4 Results
Here, we'll present the results of embedding T conditional rotation circuits into a ReLU MLP of width D for various choices of D and T, using the embedding method described in Section 5.4
As we would expect, the read-offs diverge more with every subsequent layer as error accumulates, since we didn't implement any active error correction.[6] Most rotation circuits seem to work fine for the first 2-3 rotations, with the read-offs predicting the ideal values much better than chance. Estimates are much worse in the later layers. The errors also get larger when the number of active circuits z is increased.
We quantify the error on the rotated vectors as the $L^2$ norm of the difference between the ideal result $x^l_t$ and the estimate $\hat{x}^l_t$:
$$|\delta x^l_t| := |x^l_t - \hat{x}^l_t|. \tag{4.1}$$
On a population level, we can look at both the average error over circuits, $\frac{1}{T}\sum_{t=1}^{T}|\delta x^l_t|^2$, and the worst-case error, $\max\!\left(|\delta x^l_1|,\ldots,|\delta x^l_T|\right)$. The former tells us how noisy the computations are most of the time. The latter tells us whether any one circuit in the ensemble of $T=557$ circuits was implemented so noisily that for at least one input it basically doesn't work at all.
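For concreteness, these two statistics can be computed as in the following sketch (our own notation; `x_true` and `x_hat` are hypothetical arrays of ideal values and read-offs of shape (T, 2)):

```python
import numpy as np

def error_stats(x_true, x_hat):
    err = np.linalg.norm(x_true - x_hat, axis=1)   # |delta x_t| for each circuit t, Eq. (4.1)
    return np.mean(err ** 2), np.max(err)          # mean squared error, worst-case error

# toy example with made-up read-offs
rng = np.random.default_rng(0)
x_true = rng.normal(size=(557, 2))
x_hat = x_true + 0.1 * rng.normal(size=(557, 2))
mse, worst = error_stats(x_true, x_hat)
```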
Figure 5: Errors on the rotating vectors x, for an MLP of width D=1000 with T=557 circuits each of width d=4. Every circuit neuron is spread across S=6 MLP neurons. Upper row: mean squared errors, i.e., the average of |δx|2 over inactive or active circuits. Lower row: worst-case absolute errors, i.e., the maximum of |δx| over inactive or active circuits.[7]
The left plots show errors on inactive circuits, i.e., circuits that received an input of $\alpha^0_t=0$, $x^0_t=0$. Most circuits will be inactive on any given forward pass. The right plots show errors on active circuits, i.e., circuits that received an input of $\alpha^0_t=1$ and some random vector $x^0_t$ with $|x^0_t|=1$. As a baseline to put these numbers in perspective, once $|\delta x^l_t|\geq 1$, the circuits are doing no better than guessing the mean. So, the average circuit beats this trivial baseline for the first three of the five rotations, or the first four if we restrict ourselves to inputs with only $z=1$ active circuits. As one might expect, the worst-case errors are worse. With $z=1$ inputs, all $T=557$[8] circuits beat the baseline for the first two rotations. With $z=2$ inputs, only the first rotation is implemented with small enough noise by all circuits to beat the baseline, and at least one circuit doesn't work right for an input with $z=3$ for even a single rotation.
We can also see that the mean squared errors grow superlinearly the deeper we go into the network. This does not match our error calculations, which would predict that those errors should grow linearly. We think this is because as the errors grow they start to overwhelm the error suppression mechanism by activating additional ReLUs, and thus letting through additional noise. The calculations assume that the noise always stays small enough that this doesn't happen. More on that in the next subsection, where we compare our error calculations to the empirical results.
We can also look at the error on the On-indicator variables $\alpha_t$:
$$|\delta\alpha^l_t| := |\alpha^l_t - \hat{\alpha}^l_t| \tag{4.2}$$
Here, an error $|\delta\alpha^l_t|>0.25$ means that the error on the On-indicator is large enough to pass through our step function and propagate to the next layer. At that point, errors will start to accumulate over layers at a faster-than-linear rate.
Figure 6: Errors on the Boolean On-indicators α, for an MLP of width D=1000 with T=557 circuits each of width d=4. Every circuit neuron is spread across S=6 MLP neurons. Upper row: mean squared errors, i.e., the average of (δα)2 over inactive or active circuits. Lower row: worst-case absolute error, i.e., the maximum of |δα| over inactive or active circuits.[7]
The absolute value of the error for the On-indicator caps out at 1. This is because $\hat{\alpha}^l_t$ is the averaged output of a set of step functions, each of which is constrained to output a value in the range $[0,1]$. Their average is thus also constrained to the range $[0,1]$. Therefore, regardless of whether the true value for $\alpha^l_t$ is 0 or 1, $|\delta\alpha^l_t|$ has to be in the range $[0,1]$.
4.1 Comparing with our theoretical predictions
Key takeaway: Most predictions are good when the errors are small in magnitude, but once the errors get too big, they start growing superlinearly, and our predictions begin to systematically undershoot more and more. We think this is because our error propagation calculations assume that the errors stay small enough to never switch on neurons that are not in use by active circuits. But for many settings, the errors get large enough that some of these neurons do switch on, opening more channels for the noise to propagate and compound even further.
Figure 7: Fraction of active neurons, i.e., what fraction of the large network neurons are non-zero, for a network of width D=1000 with T=557 circuits of width d=4. Every circuit neuron is spread across S=6 MLP neurons. The dotted lines are what we'd theoretically expect if the errors stayed small enough not to overwhelm the noise suppression threshold for the inactive circuits. If this assumption is broken, the network starts to develop a "seizure"[9]: the noise starts to grow much faster as errors switch on more neurons causing more errors which switch on more neurons.[7]
The key variables that influence the error here are the ratio $T\frac{d^2}{D^2}$, the number of simultaneously active circuits $z$, and the layer $l$.
4.1.1 Rotating vector
Predictions are good for layer 1, $z=1$. For larger $z$, the errors get larger and the predictions get worse, with empirical errors often systematically larger than our theoretical predictions. For layer 2 and onward, our theoretical predictions likewise undershoot more and more as the overall error gets larger. At layer 3 and onward, the superlinear growth from compounding noise becomes very visible for large $T\frac{d^2}{D^2}$.
As our error calculations would predict, rotating vectors for inactive circuits have smaller errors than rotating vectors for active circuits, so long as the overall noise level stays small enough for the calculations to be applicable.[10]
4.1.2 On-indicator
In the case of On-indicators for active circuits, our error calculations predict that they just stay zero. This is indeed empirically the case for the early layers. But as we move deeper into the network and $T\frac{d^2}{D^2}$ or $z$ get larger, a critical point seems to be reached. Past this point, noise on other parts of the network seems to become large enough to cause cascading failures, likely by switching the regime of ReLUs. Then, the active On-indicators become noisy as well.
The noise on the inactive On-indicator circuits follows the same pattern: Predictions are mostly good until the noise seems to grow large enough to break the assumptions of the calculation as the network starts to develop a "seizure". Here, this point seems to be reached at particularly early layers though. We think this is because seizures start with the On-indicators for inactive circuits and spread from there to other quantities.
5 Appendix 1: Construction
5.1 The circuits
Each individual circuit can be written as
$$a^l_t = \mathrm{ReLU}\!\left(w^l_t\,a^{l-1}_t + b^l\right) \tag{5.1}$$
where the superscript $l$ indicates the layer, ranging from 0 to 5, and the subscript $t$ indexes the different circuits, ranging from 0 to $T-1$.
The circuits are supposed to compute/propagate the On-indicator $\alpha^l_t\in\{0,1\}$, which controls whether the circuit should be active, and perform rotations on the rotating vector $x^l_t$.
The weight matrices $w^l_t$ for each circuit are parametrized as
$$w^l_t = \begin{bmatrix} 2 & -2 & 0 & 0 \\ 2 & -2 & 0 & 0 \\ 2-(\cos\theta_t - \sin\theta_t) & -2+(\cos\theta_t - \sin\theta_t) & \cos\theta_t & -\sin\theta_t \\ 2-(\sin\theta_t + \cos\theta_t) & -2+(\sin\theta_t + \cos\theta_t) & \sin\theta_t & \cos\theta_t \end{bmatrix} \tag{5.2}$$
and the biases as[11]
$$b^l = \begin{bmatrix} -0.5 \\ -1.5 \\ -1 \\ -1 \end{bmatrix}. \tag{5.3}$$
So, the circuit parameters here don't actually vary with the layer index $l$ at all.[12]
The initial input vector for each circuit takes the form
$$a^0_t = \begin{bmatrix} 1.5\,\alpha^0_t \\ 0.5\,\alpha^0_t \\ (x^0_t)_0 + \alpha^0_t \\ (x^0_t)_1 + \alpha^0_t \end{bmatrix}. \tag{5.4}$$
The On-indicator $\alpha^l_t$ for circuit $t$ at layer $l$ can be read out from the circuit's hidden activations as
$$\alpha^l_t := \left(a^l_t\right)_0 - \left(a^l_t\right)_1. \tag{5.5}$$
The rotating vector can be read out from the circuit's hidden activations as
$$x^l_t := \begin{bmatrix} \left(a^l_t\right)_2 - \alpha^l_t \\ \left(a^l_t\right)_3 - \alpha^l_t \end{bmatrix}. \tag{5.6}$$
If the circuit is active, the rotating vector will be initialized as some vector of length 1, and the On-indicator will be set to 1. If the circuit is inactive, the rotating vector and On-indicator will be initialized to 0, and stay 0.
$$|x^l_t| = \begin{cases} 1 & \text{if circuit } t \text{ is active} \\ 0 & \text{if circuit } t \text{ is inactive} \end{cases} \quad\text{for all } l \tag{5.7}$$
$$\alpha^l_t = \begin{cases} 1 & \text{if circuit } t \text{ is active} \\ 0 & \text{if circuit } t \text{ is inactive} \end{cases} \quad\text{for all } l \tag{5.8}$$
A circuit's rotating vector is robust to some noise if the circuit is inactive. If a circuit is active, errors on its rotating vector will neither be removed nor amplified.
In any given forward pass, most circuits will be inactive.
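The matrix form above can be checked directly. The following sketch (ours, not the released code) builds one circuit from Equations (5.2)-(5.3), runs it for five layers via Equation (5.1), and confirms that the readouts of Equations (5.5)-(5.6) reproduce the ideal repeated rotation:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def make_circuit(theta):
    c, s = np.cos(theta), np.sin(theta)
    w = np.array([[2, -2, 0, 0],
                  [2, -2, 0, 0],
                  [2 - (c - s), -2 + (c - s), c, -s],
                  [2 - (s + c), -2 + (s + c), s, c]])          # Eq. (5.2)
    b = np.array([-0.5, -1.5, -1.0, -1.0])                     # Eq. (5.3)
    return w, b

theta = 1.1
w, b = make_circuit(theta)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

alpha0, x0 = 1.0, np.array([np.cos(0.3), np.sin(0.3)])
a = np.array([1.5 * alpha0, 0.5 * alpha0, x0[0] + alpha0, x0[1] + alpha0])  # Eq. (5.4)
for layer in range(1, 6):
    a = relu(w @ a + b)                                        # Eq. (5.1)
    alpha = a[0] - a[1]                                        # Eq. (5.5)
    x = a[2:4] - alpha                                         # Eq. (5.6)
    ideal = np.linalg.matrix_power(rot, layer) @ x0
    assert np.allclose(x, ideal) and np.isclose(alpha, 1.0)
```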
5.2 Assumptions about the circuits
5.2.1 Revisiting assumptions from the previous post
From the previous post:
Assumption 1 holds for our conditional rotation circuits. Assumption 2 also holds, because the rotation is only applied if the Boolean On-indicator αlt is large enough, and the step function for αlt also doesn't propagate it to the next layer if αlt is too small. Assumption 3 also holds. However, assumption 4 does not hold. So we're going to have to adjust the way we embed the circuits a little.
Further, while it wasn't explicitly listed as an assumption, the previous post assumed that the circuits don't have any biases. But our circuits here do (see Equation 5.1). So we'll need to make some more adjustments to account for that difference as well.
5.2.2 Updated set of assumptions
Assumptions 1-3 have stayed the same. We think the construction could likely be modified to make assumption 3 unnecessary,[15] but it simplifies the implementation a lot. The old assumption 4 is not required for the new construction. The new embedding algorithm can work without it (see next section). We think the embedding algorithm could also be modified to make the new assumption 4 unnecessary, but this may come at some cost to the total number of circuits we can embed.[16] The new assumption 5 was actually necessary all along, even in the previous construction. Without it, even a tiny amount of noise can eventually grow to overwhelm the signal if the circuit is deep enough.[17]
Note that assumption 2 is about noise reduction on inactive circuits whereas assumption 5 is about lack of noise amplification on active circuits. Assumption 2 says we should have
$$|\delta a^l_t(x)| \leq \epsilon \;\Rightarrow\; \left|\mathrm{ReLU}\!\left(w^{l+1}_t\,\delta a^l_t(x) + b^{l+1}\right)\right| = 0 \tag{5.9}$$
for some sufficiently small $\epsilon>0$. Assumption 5 says we should have
$$\frac{\left|a^{l+1}_t(x) - \mathrm{ReLU}\!\left(w^{l+1}_t\left(a^l_t + \delta a^l_t(x)\right) + b^{l+1}\right)\right|}{\left|a^{l+1}_t(x)\right|} \leq \frac{\left|\delta a^l_t(x)\right|}{\left|a^l_t(x)\right|}. \tag{5.10}$$
5.3 Notation
5.3.1 Bracket notation
Bracket notation is a type of vector notation that's excellent for thinking about outer products. It's also the vector notation usually used in quantum mechanics, and the term "superposition" is a quantum loan-word, so deploying it here seemed kabbalistically accurate.
It works like this:
- |v⟩ denotes a column vector, also called a ket.
- ⟨u| denotes a row vector, also called a bra.
$$\langle u| := |u\rangle^\top \tag{5.11}$$
- ⟨u|v⟩ denotes the inner product[18] of two vectors, yielding a scalar number:
$$\langle u|v\rangle := \sum_i (u)_i\,(v)_i \tag{5.12}$$
- |v⟩⟨u| denotes the outer product of two vectors, yielding a matrix:
$$\left(|v\rangle\langle u|\right)_{i,j} := (v)_i\,(u)_j \tag{5.13}$$
We will use bracket notation to represent $D/d$-dimensional vectors, like the embedding vectors in Section 5.4.1.
5.3.2 Mixed vector notation
So far, we've used bold to represent $d$-dimensional vectors, i.e., $a^l_t$ and $b$, and two-dimensional vectors, i.e., $x^l_t$. And we just defined |ket⟩ notation for $D/d$-dimensional vectors in the previous Section 5.3.1.
We will also want to mix bold and |ket⟩ notation to make $D$-dimensional vectors by stacking $d=4$ $D/d$-dimensional vectors:
$$|V\rangle := \begin{bmatrix} \left(|V\rangle\right)_{0:\frac{D}{d}} \\ \left(|V\rangle\right)_{\frac{D}{d}:\frac{2D}{d}} \\ \left(|V\rangle\right)_{\frac{2D}{d}:\frac{3D}{d}} \\ \left(|V\rangle\right)_{\frac{3D}{d}:D} \end{bmatrix} \tag{5.14}$$
where $|V\rangle$ is a $D$-dimensional vector and $\left(|V\rangle\right)_{i:j}$ is the section of $|V\rangle$ from index $i$ to index $j-1$. Each of $\left(|V\rangle\right)_{0:\frac{D}{d}}$, $\left(|V\rangle\right)_{\frac{D}{d}:\frac{2D}{d}}$, $\left(|V\rangle\right)_{\frac{2D}{d}:\frac{3D}{d}}$, $\left(|V\rangle\right)_{\frac{3D}{d}:D}$ is $D/d$-dimensional.
Any d×d-matrix will act on |V⟩ as if it were a d-dimensional vector, ignoring the |ket⟩ dimension. For example:
$$\begin{bmatrix} 0&0&1&0 \\ 0&0&0&1 \\ 1&0&0&0 \\ 0&1&0&0 \end{bmatrix} |V\rangle := \begin{bmatrix} \left(|V\rangle\right)_{\frac{2D}{d}:\frac{3D}{d}} \\ \left(|V\rangle\right)_{\frac{3D}{d}:D} \\ \left(|V\rangle\right)_{0:\frac{D}{d}} \\ \left(|V\rangle\right)_{\frac{D}{d}:\frac{2D}{d}} \end{bmatrix} \tag{5.15}$$
Any ⟨bra| will act on $|V\rangle$ as if it were a |ket⟩, ignoring the $d$-dimension. For example:
$$\langle u|V\rangle := \begin{bmatrix} \langle u|\left(|V\rangle\right)_{0:\frac{D}{d}} \\ \langle u|\left(|V\rangle\right)_{\frac{D}{d}:\frac{2D}{d}} \\ \langle u|\left(|V\rangle\right)_{\frac{2D}{d}:\frac{3D}{d}} \\ \langle u|\left(|V\rangle\right)_{\frac{3D}{d}:D} \end{bmatrix} \tag{5.16}$$
5.4 Embedding the Circuits
In this section, we describe a general method for embedding any set of small circuits that satisfy the assumptions 1 - 5 listed in Section 5.2.2
The activations $|A\rangle^l$ of the large network at layer $l$ are computed from the activations $|A\rangle^{l-1}$ at the previous layer as
$$|A\rangle^l := \mathrm{ReLU}\!\left(W^l |A\rangle^{l-1} + |B\rangle^l\right) \tag{5.17}$$
5.4.1 Embedding the weights
To embed the small circuit weights in the large network, we first split their weight matrices $w^l_t$ into two parts: the averaged weights across all circuits, $\bar{w}^l$, and the differing weights $\Delta w^l_t$:
$$\left(\bar{w}^l\right)_{i,j} := \mathbb{E}_t\!\left[\left(w^l_t\right)_{i,j}\right] \tag{5.18}$$
$$\Delta w^l_t := w^l_t - \bar{w}^l \tag{5.19}$$
The differing weights $\Delta w^l_t$ are embedded into the weights of the large network using embedding vectors $|e^l_t\rangle$. The averaged weights $\bar{w}^l$ are embedded using an embedding matrix $\boldsymbol{W}^l \in \mathbb{R}^{\frac{D}{d}\times\frac{D}{d}}$:
$$W^l := \bar{w}^l\,\boldsymbol{W}^l + \frac{1}{S}\sum_t \Delta w^l_t\,|e^l_t\rangle\langle e^{l-1}_t| \qquad\text{for } l\geq 2 \tag{5.20}$$
Embedding vectors
The embedding vectors |elt⟩ are constructed using exactly the same algorithm described in the previous post. The information about them that matters most for the rest of this post:
- $|e^l_t\rangle$ denotes the embedding vector for circuit $t$ in layer $l$. $|e^l_t\rangle$ has dimension $D/d$.
- $S$ is the embedding redundancy, i.e., the number of large network neurons used to represent each small circuit neuron.
- $|e^l_t\rangle$ is an $S$-hot vector, i.e., it has $S$ ones and $\left(\frac{D}{d}-S\right)$ zeros. As a consequence of this, we have
$$\frac{1}{S}\langle e^l_t|e^l_t\rangle = 1 \tag{5.21}$$
- $\frac{1}{S}\langle e^l_t|$ is the un-embedding vector for circuit $t$ in layer $l$.
- Any pair of embedding vectors for different circuits $t\neq u$ in the same layer $l$ will share at most one large network neuron. Therefore:
$$\frac{1}{S}\langle e^l_t|e^l_u\rangle = \begin{cases} \frac{1}{S} \\ 0 \end{cases} \qquad\text{for } t\neq u \tag{5.22}$$
Embedding matrix
$\boldsymbol{W}^l$ is constructed as
$$\left(\boldsymbol{W}^l\right)_{i,j} := \begin{cases} \frac{1}{S} & \text{if } \left(\sum_t |e^l_t\rangle\langle e^{l-1}_t|\right)_{i,j} > 0 \\ -\chi & \text{if } \left(\sum_t |e^l_t\rangle\langle e^{l-1}_t|\right)_{i,j} = 0 \end{cases} \tag{5.23}$$
where the scalar $\chi$ is chosen to balance the weights of $\boldsymbol{W}^l$, such that
$$\mathbb{E}_{t\neq u}\!\left[\langle e^l_t|\boldsymbol{W}^l|e^{l-1}_u\rangle\right] = 0 \tag{5.24}$$
This ensures that interference terms between different circuits do not systematically aggregate along any particular direction.
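The sketch below illustrates this weight embedding numerically (it is our own illustration, not the post's released code). For simplicity it uses random S-hot embedding vectors instead of the prime-number-based allocation from the previous post, so the "share at most one neuron" property of Equation (5.22) only holds approximately; the construction of $\boldsymbol{W}^l$, $\chi$ and $W^l$ follows Equations (5.20), (5.23) and (5.24). Reading a circuit's weights back out with the un-embedding vector $\frac{1}{S}\langle e^l_t|$ approximately recovers $w^l_t$, up to small interference from the other circuits.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, S, T = 1000, 4, 6, 200
n = D // d                                                # dimension of each |e_t>

def s_hot_vectors():
    E = np.zeros((T, n))
    for t in range(T):
        E[t, rng.choice(n, size=S, replace=False)] = 1.0  # random S-hot embedding vector
    return E

E_prev, E_next = s_hot_vectors(), s_hot_vectors()         # |e^{l-1}_t>, |e^l_t>

# small-circuit weights, split into average and difference, Eqs. (5.18)-(5.19)
w = rng.normal(size=(T, d, d))
w_bar = w.mean(axis=0)
dw = w - w_bar

# embedding matrix bold-W, Eq. (5.23): 1/S on entries used by some circuit, -chi elsewhere,
# with chi chosen numerically to satisfy the balance condition (5.24)
covered = (E_next.T @ E_prev) > 0
off = ~np.eye(T, dtype=bool)
c_on = (E_next @ (covered / S) @ E_prev.T)[off].mean()
c_off = (E_next @ (~covered).astype(float) @ E_prev.T)[off].mean()
chi = c_on / c_off
W_bold = np.where(covered, 1.0 / S, -chi)

# full large-network weight matrix, Eq. (5.20), in mixed notation (d-blocks of size D/d)
W_full = np.kron(w_bar, W_bold) + (1.0 / S) * np.einsum(
    'tab,ti,tj->aibj', dw, E_next, E_prev).reshape(D, D)

# read circuit t's effective weights back out with the un-embedding vector (1/S)<e_t|
t = 3
W4 = W_full.reshape(d, n, d, n)
w_eff = np.einsum('i,aibj,j->ab', E_next[t] / S, W4, E_prev[t])
print(np.abs(w_eff - w[t]).max())   # small: leftover interference from the other circuits
```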
Layer 0
As in the previous post, the weights at layer 0 are treated differently. Instead of using the weight embedding described in the previous subsection, we just embed the circuit input vectors into the input vector of the network as
$$|A\rangle^0 = \sum_t w^1_t\,a^0_t\,|e^1_t\rangle. \tag{5.25}$$
This allows us to set the first weight matrix to an identity,
$$W^1 = I, \tag{5.26}$$
yielding
$$|A\rangle^1 = \mathrm{ReLU}\!\left(\sum_t w^1_t\,a^0_t\,|e^1_t\rangle + |B\rangle^1\right). \tag{5.27}$$
5.4.2 Embedding the biases
We embed the circuit biases $b^l\in\mathbb{R}^d$ into the network biases $|B\rangle^l\in\mathbb{R}^D$ by simply spreading them out over the full $D$ dimensions:
$$|B\rangle^l := b^l\,|1\rangle := \begin{bmatrix} (b^l)_0\,|1\rangle \\ (b^l)_1\,|1\rangle \\ (b^l)_2\,|1\rangle \\ (b^l)_3\,|1\rangle \end{bmatrix} \tag{5.28}$$
where $|1\rangle$ is just the $D/d$-vector with value 1 everywhere, i.e.:
$$\left(|1\rangle\right)_i := 1 \quad\text{for all } i \tag{5.29}$$
5.5 Reading out the small circuit activations
If the embedding works, we can read out the approximate small circuit activations from the activations of the large network, up to some error tolerance. Using the un-embedding vector $\frac{1}{S}\langle e^l_t|$, we define the estimator for the small circuit activations at each layer as
$$\hat{a}^l_t := \frac{1}{S}\langle e^l_t|A\rangle^l := \begin{bmatrix} \frac{1}{S}\langle e^l_t|\left(|A\rangle^l\right)_{0:\frac{D}{d}} \\ \frac{1}{S}\langle e^l_t|\left(|A\rangle^l\right)_{\frac{D}{d}:\frac{2D}{d}} \\ \frac{1}{S}\langle e^l_t|\left(|A\rangle^l\right)_{\frac{2D}{d}:\frac{3D}{d}} \\ \frac{1}{S}\langle e^l_t|\left(|A\rangle^l\right)_{\frac{3D}{d}:D} \end{bmatrix} \tag{5.30}$$
Following Equations (5.5) and (5.6), we define the estimators for the On-indicator
$$\hat{\alpha}^l_t := \left(\hat{a}^l_t\right)_0 - \left(\hat{a}^l_t\right)_1 \tag{5.31}$$
and the rotating vector
$$\hat{x}^l_t := \begin{bmatrix} \left(\hat{a}^l_t\right)_2 - \hat{\alpha}^l_t \\ \left(\hat{a}^l_t\right)_3 - \hat{\alpha}^l_t \end{bmatrix} \tag{5.32}$$
The readout errors are defined as
$$\delta a^l_t := \hat{a}^l_t - a^l_t \tag{5.33}$$
$$\delta\alpha^l_t := \hat{\alpha}^l_t - \alpha^l_t \tag{5.34}$$
$$\delta x^l_t := \hat{x}^l_t - x^l_t \tag{5.35}$$
6 Appendix 2: Error calculations
In this section we'll attempt to adapt the error calculations from the previous post to the new framework, and use the resulting expressions to try and predict the errors on our conditional rotation circuits. The main change in the construction we need to account for is the presence of the embedding matrix Wl, which didn't exist in the construction the previous post used. We will also include some particular terms in the error calculations that the previous post ignored. These terms are sub-leading in the limit of large network width D and number of circuits T, but turned out to be non-negligible for some of the empirical data we tested the theory on.
The results here will only hold as long as the size of the errors stays much smaller than the size of the signal, which won't always be the case in reality. Past that point, errors may compound much more quickly over layers than the results here would predict.
We find that the terms for the error on inactive circuits carry over unchanged, but the error on active circuits is a little different.
Note that these calculations are not as careful as the ones in the previous post. In some spots, we just guess at approximations for certain terms and then check those guesses empirically.
6.1 Error on inactive circuits
Provided the overall error stays small enough to not change the signs of any ReLU neuron preactivations, the only source of error on inactive circuits is the embedding overlap error, which was estimated in the previous post as:
$$\delta a^l_t = \left(\delta a^l_t\right)_{\text{embedding-overlap}} \approx \frac{1}{S}\sum_{\substack{u \text{ is active}\\ u\neq t}} \langle e^l_t|e^l_u\rangle\,a^l_u \qquad\text{if } t \text{ is inactive} \tag{6.1}$$
This is the error that results from signal in active circuits bleeding over into the inactive circuits they share neurons with. Since this error depends purely on how the activations of the circuits are embedded in the neurons of the network, and that embedding hasn't changed from the previous construction, the formula for it remains the same.
In the previous post, we used the approximation $\mathbb{E}_{u\neq t}\!\left[\left(\frac{1}{S}\langle e^l_t|e^l_u\rangle\right)^2\right] \approx \frac{d}{D}$ to evaluate this error as[19]
$$\mathbb{E}\!\left[|\delta a^l_t|^2\right] \approx z\,\frac{d}{D}\,\mathbb{E}_{v \text{ is active}}\!\left[|a^l_v|^2\right] \qquad\text{if } t \text{ is inactive} \tag{6.2}$$
Here, we will instead use the slightly more precise approximation
$$\mathbb{E}_{u\neq t}\!\left[\left(\frac{1}{S}\langle e^l_t|e^l_u\rangle\right)^2\right] \approx \frac{d}{D} - \frac{1}{ST}. \tag{6.3}$$
This results in a minor but noticeable improvement in the fitting of the data. Inserting the new approximation into the previous derivation yields
$$\mathbb{E}\!\left[|\delta a^l_t|^2\right] \approx z\left(\frac{d}{D} - \frac{1}{ST}\right)\mathbb{E}_{v \text{ is active}}\!\left[|a^l_v|^2\right] \qquad\text{if } t \text{ is inactive} \tag{6.4}$$
6.2 Error on active circuits
We'll assume for now that $\bar{w}^l=0$, so that Equation (5.20) becomes
$$W^l = \frac{1}{S}\sum_t w^l_t\,|e^l_t\rangle\langle e^{l-1}_t| \qquad\text{for } l\geq 2 \tag{6.5}$$
From definitions (5.28) and (5.29), it follows that
$$\frac{1}{S}\langle e^l_t|B\rangle^l = b^l \tag{6.6}$$
Using the definition of the readout $\hat{a}^l_t$ for circuit $t$ in layer $l$, from Equation (5.30), the error $\delta a^l_t$ can be written as
$$\begin{aligned} \delta a^l_t &:= \hat{a}^l_t - a^l_t &\text{(6.7)}\\ &= \frac{1}{S}\langle e^l_t|\,\mathrm{ReLU}\!\left(W^l|A\rangle^{l-1} + |B\rangle^l\right) - a^l_t &\text{(6.8)}\\ &= \frac{1}{S}\langle e^l_t|\,\mathrm{ReLU}\!\left(\sum_u |e^l_u\rangle\,w^l_u\,\frac{1}{S}\langle e^{l-1}_u|A\rangle^{l-1} + |B\rangle^l\right) - a^l_t &\text{(6.9)}\\ &= \frac{1}{S}\langle e^l_t|\,\mathrm{ReLU}\!\left(\sum_u |e^l_u\rangle\,w^l_u\left(a^{l-1}_u + \delta a^{l-1}_u\right) + |B\rangle^l\right) - a^l_t &\text{(6.10)} \end{aligned}$$
If we split the sum over circuits $\sum_u$ into active and passive circuits, we can simplify this expression:
- The term $\sum_{u \text{ is inactive}} |e^l_u\rangle\,w^l_u\,a^{l-1}_u$ drops out, because by definition $a^{l-1}_u=0$ for inactive circuits.
- From the previous section, we know that
$$\sum_{u \text{ is inactive}} |e^l_u\rangle\,w^l_u\,\delta a^{l-1}_u = \sum_{u \text{ is inactive}} |e^l_u\rangle\,w^l_u\left(\delta a^{l-1}_u\right)_{\text{embedding overlap}}. \tag{6.11}$$
- Provided the noise stays small compared to the signal, the term $\sum_{u \text{ is active},\, u\neq t} |e^l_u\rangle\,w^l_u\,\delta a^{l-1}_u$ will be sub-leading compared to $\sum_{u \text{ is active}} |e^l_u\rangle\,w^l_u\,a^{l-1}_u$. Instead of neglecting it entirely, however, we will keep its embedding overlap error contribution
$$\sum_{\substack{u \text{ is active}\\ u\neq t}} |e^l_u\rangle\,w^l_u\left(\delta a^{l-1}_u\right)_{\text{embedding-overlap}}, \tag{6.12}$$
since it is simple to evaluate, and only neglect the rest.[20]
This leaves us with
$$\delta a^l_t \approx \frac{1}{S}\langle e^l_t|\,\mathrm{ReLU}\!\left(\sum_{u \text{ is active}} |e^l_u\rangle\,w^l_u\,a^{l-1}_u + |e^l_t\rangle\,w^l_t\,\delta a^{l-1}_t + \sum_{u\neq t} |e^l_u\rangle\,w^l_u\left(\delta a^{l-1}_u\right)_{\text{embedding overlap}} + |B\rangle^l\right) - a^l_t \tag{6.13}$$
As in the last post, we will assume that all ReLU neurons used by the active circuits are turned on, since doing so will simplify the calculations and not overestimate the errors in expectation.[21] This gives
$$\delta a^l_t \approx \frac{1}{S}\sum_{u \text{ is active}} \langle e^l_t|e^l_u\rangle\,w^l_u\,a^{l-1}_u - a^l_t + \frac{1}{S}\langle e^l_t|e^l_t\rangle\,w^l_t\,\delta a^{l-1}_t + \frac{1}{S}\sum_{u \text{ is inactive}} \langle e^l_t|e^l_u\rangle\,w^l_u\left(\delta a^{l-1}_u\right)_{\text{embedding overlap}} + \frac{1}{S}\langle e^l_t|B\rangle^l \qquad\text{if } t \text{ is active} \tag{6.14}$$
Inserting approximation (6.1) for $\left(\delta a^{l-1}_u\right)_{\text{embedding overlap}}$ on inactive circuits into this expression yields
$$\delta a^l_t \approx \frac{1}{S}\sum_{u \text{ is active}} \langle e^l_t|e^l_u\rangle\,w^l_u\,a^{l-1}_u - a^l_t + \frac{1}{S}\langle e^l_t|e^l_t\rangle\,w^l_t\,\delta a^{l-1}_t + \frac{1}{S^2}\sum_{u\neq t}\;\sum_{\substack{v \text{ is active}\\ v\neq u}} \langle e^l_t|e^l_u\rangle\,w^l_u\,\langle e^{l-1}_u|e^{l-1}_v\rangle\,a^{l-1}_v + \frac{1}{S}\langle e^l_t|B\rangle^l \qquad\text{if } t \text{ is active} \tag{6.15}$$
which evaluates to
$$\delta a^l_t \approx \frac{1}{S}\langle e^l_t|W^l|e^{l-1}_t\rangle\,a^{l-1}_t - w^l_t\,a^{l-1}_t + \frac{1}{S}\sum_{\substack{v \text{ is active}\\ v\neq t}}\left(\langle e^l_t|W^l|e^{l-1}_v\rangle\,a^{l-1}_v - \langle e^{l-1}_t|e^{l-1}_v\rangle\,w^l_t\,a^{l-1}_v\right) + w^l_t\,\delta a^{l-1}_t \qquad\text{if } t \text{ is active} \tag{6.16}$$
This derivation assumed that $\bar{w}^l=0$, i.e., the circuits are uncorrelated with each other. However, we're just going to go ahead and guess that (6.16) is still a reasonable approximation in the case $\bar{w}^l\neq 0$ as well.[22] We'll see how true that is when we compare these predictions to the error we measure in practice.
We can break this expression up into three components:
$$\delta a^l_t \approx \left(\delta a^l_t\right)_{\text{self}} + \left(\delta a^l_t\right)_{\text{others}} + \left(\delta a^l_t\right)_{\text{previous}} \tag{6.17}$$
$$\left(\delta a^l_t\right)_{\text{self}} := \frac{1}{S}\langle e^l_t|W^l|e^{l-1}_t\rangle\,a^{l-1}_t - w^l_t\,a^{l-1}_t \tag{6.18}$$
$$\left(\delta a^l_t\right)_{\text{others}} := \frac{1}{S}\sum_{\substack{v \text{ is active}\\ v\neq t}}\left(\langle e^l_t|W^l|e^{l-1}_v\rangle\,a^{l-1}_v - \langle e^{l-1}_t|e^{l-1}_v\rangle\,w^l_t\,a^{l-1}_v\right) \tag{6.19}$$
$$\left(\delta a^l_t\right)_{\text{previous}} := w^l_t\,\delta a^{l-1}_t \tag{6.20}$$
These three components have mean zero and are uncorrelated with each other, so we can calculate their mean squared errors separately.
6.2.1 Error from Self Signal
Here we'll calculate the mean square of the error contribution $\left(\delta a^l_t\right)_{\text{self}}$ from Equation (6.18). Inserting the definition of $W^l$ into the definition of $\left(\delta a^l_t\right)_{\text{self}}$ gives:
$$\begin{aligned} \left(\delta a^l_t\right)_{\text{self}} &:= \frac{1}{S}\langle e^l_t|W^l|e^{l-1}_t\rangle\,a^{l-1}_t - w^l_t\,a^{l-1}_t &\text{(6.21)}\\ &= \frac{1}{S}\langle e^l_t|\left(\bar{w}^l\,\boldsymbol{W}^l + \frac{1}{S}\sum_u \Delta w^l_u\,|e^l_u\rangle\langle e^{l-1}_u|\right)|e^{l-1}_t\rangle\,a^{l-1}_t - w^l_t\,a^{l-1}_t &\text{(6.22)}\\ &= \frac{1}{S^2}\sum_{u\neq t}\langle e^l_t|e^l_u\rangle\langle e^{l-1}_t|e^{l-1}_u\rangle\,\Delta w^l_u\,a^{l-1}_t + \Delta w^l_t\,a^{l-1}_t + \bar{w}^l\,a^{l-1}_t - w^l_t\,a^{l-1}_t &\text{(6.23)}\\ &= \frac{1}{S^2}\sum_{u\neq t}\langle e^l_t|e^l_u\rangle\langle e^{l-1}_t|e^{l-1}_u\rangle\,\Delta w^l_u\,a^{l-1}_t, &\text{(6.24)} \end{aligned}$$
where in the third line we used the fact that
$$\frac{1}{S}\langle e^l_t|\boldsymbol{W}^l|e^{l-1}_t\rangle = 1, \tag{6.25}$$
which follows from the definition of $\boldsymbol{W}^l$ in Equation (5.23).
To estimate the expected norm of the error, we insert approximation (6.3) again:
$$\begin{aligned} \mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{self}}\right|^2\right] &= (T-1)\,\mathbb{E}\!\left[\left(\tfrac{1}{S}\langle e^l_t|e^l_u\rangle\right)^2\right]\mathbb{E}\!\left[\left(\tfrac{1}{S}\langle e^{l-1}_t|e^{l-1}_u\rangle\right)^2\right]\mathbb{E}\!\left[\left|\Delta w^l_u\,a^{l-1}_t\right|^2\right] &\text{(6.26)}\\ &\approx T\left(\frac{d}{D}-\frac{1}{ST}\right)^2 \mathbb{E}\!\left[\left|\Delta w^l_u\,a^{l-1}_t\right|^2\right] &\text{(6.27)} \end{aligned}$$
6.2.2 Error from others' signals
Here we'll calculate the mean square of the error contribution $\left(\delta a^l_t\right)_{\text{others}}$ from Equation (6.19). First, we further subdivide $\left(\delta a^l_t\right)_{\text{others}}$ into two parts, one for terms involving $\Delta w$ and one for terms involving $\bar{w}$:
$$\left(\delta a^l_t\right)_{\text{others}} = \left(\delta a^l_t\right)_{\text{others},\,\Delta w} + \left(\delta a^l_t\right)_{\text{others},\,\bar{w}} \tag{6.28}$$
$$\left(\delta a^l_t\right)_{\text{others},\,\Delta w} := \frac{1}{S}\sum_{\substack{v \text{ is active}\\ v\neq t}}\left(\langle e^l_t|\left(\frac{1}{S}\sum_u \Delta w^l_u\,|e^l_u\rangle\langle e^{l-1}_u|\right)|e^{l-1}_v\rangle\,a^{l-1}_v - \langle e^{l-1}_t|e^{l-1}_v\rangle\,\Delta w^l_t\,a^{l-1}_v\right) \tag{6.29}$$
$$\left(\delta a^l_t\right)_{\text{others},\,\bar{w}} := \frac{1}{S}\sum_{\substack{v \text{ is active}\\ v\neq t}}\left(\langle e^l_t|\bar{w}^l\,\boldsymbol{W}^l|e^{l-1}_v\rangle\,a^{l-1}_v - \langle e^{l-1}_t|e^{l-1}_v\rangle\,\bar{w}^l\,a^{l-1}_v\right) \tag{6.30}$$
We can straightforwardly approximate the expectation of the square of $\left(\delta a^l_t\right)_{\text{others},\,\Delta w}$. The steps for this are similar to those we used for $\left(\delta a^l_t\right)_{\text{self}}$ in the previous subsection, see Equations (6.26) and (6.27):
$$\mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{others},\,\Delta w}\right|^2\right] \approx (z-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2 \mathbb{E}_{u\neq v}\!\left[\left|\Delta w^l_u\,a^{l-1}_v\right|^2\right] + (z-1)\left(\frac{d}{D}-\frac{1}{ST}\right)\mathbb{E}_v\!\left[\left|\Delta w^l_v\,a^{l-1}_v\right|^2\right] \tag{6.31}$$
The expectation of $\left(\delta a^l_t\right)_{\text{others},\,\bar{w}}$ squared is a bit tougher. We didn't see an easy way to formally derive an approximation for it. So instead, we're going to be lazy: we'll just guess that it can be approximately described by the same formula as the one for $\left(\delta a^l_t\right)_{\text{others},\,\Delta w}$, just with $\Delta w^l_u$ and $\Delta w^l_v$ swapped for $\bar{w}^l$. Then we'll just check that guess empirically on some example data.
$$\mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{others},\,\bar{w}}\right|^2\right] \approx (z-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2 \mathbb{E}_v\!\left[\left|\bar{w}^l\,a^{l-1}_v\right|^2\right] + (z-1)\left(\frac{d}{D}-\frac{1}{ST}\right)\mathbb{E}_v\!\left[\left|\bar{w}^l\,a^{l-1}_v\right|^2\right] \tag{6.32}$$
To verify this equation, we can use the fact that $\bar{w}^l$ can be factored out from Equation (6.30). So, if the result holds for one set of circuits, it should hold for any set of circuits. We just need to make up some values for $a^{l-1}_v$, $\bar{w}^l$, and $\Delta w$ to check the equation on.[23] To keep it simple, we choose the following:
$$a^{l-1}_v = 1 \quad\text{for all active circuits } v \tag{6.33}$$
$$\bar{w}^l = 1 \tag{6.34}$$
$$\Delta w_u := \begin{cases} 1 & \text{with probability } 0.5 \\ -1 & \text{with probability } 0.5 \end{cases} \quad\text{sampled independently for each circuit } u \tag{6.35}$$
We find a good fit for small values of $\left(\delta a^l_t\right)_{\text{others}}$ and small values of $S$. For larger $\left(\delta a^l_t\right)_{\text{others}}$ and $S$ we see deviations of up to 50% from the hypothesis. I.e., the approximation is not perfect, but it seems ok. At larger errors, we expect all our error calculations to break down anyway, because the errors start getting large enough to switch on ReLU neurons that should be off, leading to cascading failures as the network has a "seizure".
The two terms also add up as we would expect.
$$\begin{aligned} \mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{others}}\right|^2\right] &= \mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{others},\,\Delta w}\right|^2\right] + \mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{others},\,\bar{w}}\right|^2\right] &\text{(6.36)}\\ &\approx (z-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2 \mathbb{E}_{u\neq v}\!\left[\left|w^l_u\,a^{l-1}_v\right|^2\right] + (z-1)\left(\frac{d}{D}-\frac{1}{ST}\right)\mathbb{E}_v\!\left[\left|w^l_v\,a^{l-1}_v\right|^2\right] &\text{(6.37)} \end{aligned}$$
6.2.3 Previous error
Assumption 5 about our circuits was that active circuits should not amplify the size of errors relative to the size of the signal from layer to layer:
$$\frac{\left|w^l_t\,\delta a^{l-1}_t\right|^2}{\left|a^l_t\right|^2} \leq \frac{\left|\delta a^{l-1}_t\right|^2}{\left|a^{l-1}_t\right|^2} \qquad\text{if } t \text{ is active.} \tag{6.38}$$
Since $|a^l_t| = |a^{l-1}_t|$ for our circuits, this implies that
$$\mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{previous}}\right|^2\right] = \mathbb{E}\!\left[\left|w^l_t\,\delta a^{l-1}_t\right|^2\right] \leq \mathbb{E}\!\left[\left|\delta a^{l-1}_t\right|^2\right]. \tag{6.39}$$
For simplicity, we use the most pessimistic estimate:
$$\mathbb{E}\!\left[\left|\left(\delta a^l_t\right)_{\text{previous}}\right|^2\right] = \mathbb{E}\!\left[\left|\delta a^{l-1}_t\right|^2\right] \tag{6.40}$$
6.2.4 Adding It All Up
Adding the three terms together and applying the resulting formula recursively to evaluate the propagated error from previous layers, we get
$$\mathbb{E}\!\left[\left|\delta a^l_t\right|^2\right] \approx (l-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2\left(z\,\mathbb{E}_{u\neq v}\!\left[\left|\Delta w^l_u\,a^{l-1}_v\right|^2\right] + (z-1)\,\mathbb{E}_v\!\left[\left|\bar{w}^l\,a^{l-1}_v\right|^2\right]\right) + l\left(\frac{d}{D}-\frac{1}{ST}\right)(z-1)\,\mathbb{E}_v\!\left[\left|w^l_v\,a^{l-1}_v\right|^2\right] \tag{6.41}$$
The first term has a prefactor of $(l-1)$ while the second has a prefactor of $l$, because the first term is the propagation error (from the previous post), which only appears from the second layer onward, while the second term is the embedding overlap error, which appears from the first layer onward.
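For reference, Equation (6.41), together with Equation (6.4) for inactive circuits, is simple enough to transcribe directly. Below is a sketch in our own notation; the expectations over circuit weights are passed in as plain numbers, which for the rotation circuits evaluate to 1, 8, and 9 on the rotating-vector components (see Section 6.3.2 below).

```python
def overlap_factor(d, D, S, T):
    return d / D - 1.0 / (S * T)

def predicted_mse_inactive(z, d, D, S, T, e_a_sq):
    # Eq. (6.4): z * (d/D - 1/(S T)) * E[|a_v|^2] over active circuits v
    return z * overlap_factor(d, D, S, T) * e_a_sq

def predicted_mse_active(l, z, d, D, S, T, e_dw_sq, e_wbar_sq, e_w_sq):
    # Eq. (6.41): propagation term (from layer 2 on) plus embedding-overlap term
    f = overlap_factor(d, D, S, T)
    return ((l - 1) * T * f**2 * (z * e_dw_sq + (z - 1) * e_wbar_sq)
            + l * f * (z - 1) * e_w_sq)

# e.g. the rotating-vector prediction of Eq. (6.55) for the post's settings:
print(predicted_mse_active(l=3, z=2, d=4, D=1000, S=6, T=557,
                           e_dw_sq=1.0, e_wbar_sq=8.0, e_w_sq=9.0))
```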
6.3 Errors on our conditional rotation circuits
6.3.1 Errors on the On-indicators
The On-indicators $\alpha^l_t$ are read out from the circuits' hidden activations as $\alpha^l_t := (a^l_t)_0 - (a^l_t)_1$. Provided the overall error stays small enough to not overwhelm any ReLU neuron enough to flip it from off to on, the errors on $(a^l_t)_0$ and $(a^l_t)_1$ will be identical for active circuits, and thus cancel out exactly:
$$\delta\alpha^l_t = 0 \qquad\text{if } t \text{ is active.} \tag{6.42}$$
For inactive circuits, Equations (5.4) and (6.4) yield:
$$\mathbb{E}\!\left[\left(\delta\alpha^l_t\right)^2\right] = (1.5-0.5)\left(\frac{d}{D}-\frac{1}{ST}\right)z = \left(\frac{d}{D}-\frac{1}{ST}\right)z \qquad\text{if } t \text{ is inactive.} \tag{6.43}$$
6.3.2 Errors on the Rotating Vectors
For active circuits
We can evaluate the errors on active rotation circuits using Equation (6.41). We want the error on $\mathbb{E}\!\left[(\delta x^l_t)^2\right]$ though, not on $\mathbb{E}\!\left[(\delta a^l_t)^2\right]$. However, provided that $\delta\alpha^l_t=0$, we have
$$\delta x^l_t = \begin{bmatrix} \left(\delta a^l_t\right)_2 \\ \left(\delta a^l_t\right)_3 \end{bmatrix} \tag{6.44}$$
So, since $\delta\alpha^l_t=0$ for active circuits, we can just restrict (6.41) to the last two vector indices of $\delta a$ to calculate the error:
$$\begin{aligned} \mathbb{E}\!\left[\left|\delta x^l_t\right|^2\right] &= \mathbb{E}\!\left[\left|\delta a^l_t\right|^2_{2:3}\right] &\text{(6.45)}\\ &\approx (l-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2\left(z\,\mathbb{E}_{u\neq v}\!\left[\left|\Delta w^l_u\,a^{l-1}_v\right|^2_{2:3}\right] + (z-1)\,\mathbb{E}_v\!\left[\left|\bar{w}^l\,a^{l-1}_v\right|^2_{2:3}\right]\right) + l\left(\frac{d}{D}-\frac{1}{ST}\right)(z-1)\,\mathbb{E}_v\!\left[\left|w^l_v\,a^{l-1}_v\right|^2_{2:3}\right] &\text{(6.46)} \end{aligned}$$
where we define
$$\left|\delta a^l_t\right|^2_{2:3} := \left(\delta a^l_t\right)_2^2 + \left(\delta a^l_t\right)_3^2 \tag{6.47}$$
Inserting the definitions of $\bar{w}^l$, $\Delta w^l_u$, and $a^l_v$ for our circuits into this formula yields
$$\bar{w}^l\,a^{l-1}_v = \begin{bmatrix} 2&-2&0&0\\ 2&-2&0&0\\ 2&-2&0&0\\ 2&-2&0&0 \end{bmatrix}\begin{bmatrix} 1.5\\ 0.5\\ (x^{l-1}_v)_0+1\\ (x^{l-1}_v)_1+1 \end{bmatrix} = \begin{bmatrix} 2\\2\\2\\2 \end{bmatrix} \tag{6.48}$$
$$\Rightarrow\quad \left|\bar{w}^l\,a^{l-1}_v\right|^2_{2:3} = 2^2 + 2^2 = 8 \tag{6.49}$$
and
$$\Delta w^l_u\,a^{l-1}_v = \begin{bmatrix} 0&0&0&0\\ 0&0&0&0\\ -(\cos\theta_u-\sin\theta_u)&(\cos\theta_u-\sin\theta_u)&\cos\theta_u&-\sin\theta_u\\ -(\sin\theta_u+\cos\theta_u)&(\sin\theta_u+\cos\theta_u)&\sin\theta_u&\cos\theta_u \end{bmatrix}\begin{bmatrix} 1.5\\ 0.5\\ (x^{l-1}_v)_0+1\\ (x^{l-1}_v)_1+1 \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ (r_u\,x^{l-1}_v)_0\\ (r_u\,x^{l-1}_v)_1 \end{bmatrix} \tag{6.50}$$
where
$$r_u := \begin{bmatrix} \cos\theta_u & -\sin\theta_u\\ \sin\theta_u & \cos\theta_u \end{bmatrix} \tag{6.51}$$
and hence
$$\left|r_u\,x^{l-1}_v\right|^2 = \left|x^{l-1}_v\right|^2 = 1 \tag{6.52}$$
$$\Rightarrow\quad \left|\Delta w^l_u\,a^{l-1}_v\right|^2_{2:3} = 1. \tag{6.53}$$
Since we didn't use $u\neq v$, the above holds equally well for $u=v$. Additionally, since the directions of $\left(\bar{w}^l\,a^{l-1}_v\right)_{2:3}$ and $\left(\Delta w^l_v\,a^{l-1}_v\right)_{2:3}$ are uncorrelated, we can just add them up:
$$\mathbb{E}\!\left[\left|w^l_v\,a^{l-1}_v\right|^2_{2:3}\right] = \left|\bar{w}^l\,a^{l-1}_v\right|^2_{2:3} + \left|\Delta w^l_v\,a^{l-1}_v\right|^2_{2:3} = 9 \tag{6.54}$$
Inserting everything finally yields
$$\mathbb{E}\!\left[\left|\delta x^l_t\right|^2\right] \approx (l-1)\,T\left(\frac{d}{D}-\frac{1}{ST}\right)^2\bigl(z + 8(z-1)\bigr) + 9\,l\left(\frac{d}{D}-\frac{1}{ST}\right)(z-1) \qquad\text{if } t \text{ is active} \tag{6.55}$$
For inactive circuits
$$\mathbb{E}\!\left[\left|\delta x^l_t\right|^2\right] = \mathbb{E}\!\left[\left|\delta a^l_t\right|^2_{2:3}\right] \tag{6.56}$$
Inserting Equation (6.4) into this yields
$$\mathbb{E}\!\left[\left|\delta x^l_t\right|^2\right] = \left(\frac{d}{D}-\frac{1}{ST}\right)z \qquad\text{if } t \text{ is inactive.} \tag{6.57}$$
[1] Meaning they are only applied if a Boolean variable is true.
[2] That post was a correction and continuation of this earlier sketch for such a framework, which was in turn based on this framework for compressing Boolean logic gates into neural networks.
[3] It was assumption 4, that the weights of different circuits are uncorrelated with each other.
[4] We did try that as well, and might present those results in a future post. They look overall pretty similar. The average error on the circuits just scales with a better prefactor.
[5] $T=\tilde{O}(D^2/d^2)$ basically means "$T=O(D^2/d^2)$ up to log factors". The full requirement is that $T\log(T)\frac{d^2}{D^2}$ must be small. This result lines up well with expressions for the number of logic gates that neural networks can implement in superposition that were derived here [1, 2 and 3]. It also makes intuitive sense from an information theory perspective. The bits of information specifying the parameters of the $T$ circuits have to fit into the parameters of the neural network.
[6] By "active error correction" we mean using additional layers for error clean-up. I.e., some of the large network layers are dedicated exclusively to noise reduction and do not carry out any circuit computations. One reason we don't do this is the cost of the extra layers, but another is that this only works if the circuit activations belong to a discrete set of allowed values. In our circuits this is the case for the On-indicator but not the rotated vector. So we would need to discretize the allowed angles to apply error correction to them. See also Appendix section D.4 here, for how to do error correction on Boolean variables in superposition.
[7] Data for z=1 is generated by running the network T times, once for every possible active circuit. Data for z=2 and z=3 is generated by running the network 100×T times, with randomly generated pairs and triples of active circuits. The sample sizes for z>1 were chosen to be large enough that the worst-case absolute error did not seem to change very much any more when the sample size was increased further. It was still fluctuating a little though, so maybe treat those points with some caution.
[8] This number was chosen because it is the largest T possible for D/d=250 and S=6 in our prime-factorization-based embedding algorithm.
[9] Thanks to Gurkenglas for coming up with this evocative term.
[10] If the noise level gets large enough to significantly overwhelm the inactive circuits' error suppression, then we are outside the applicability of the theory.
[11] $b^l$ has no subscript $t$ because the embedding scheme (described in Section 5.4) requires all small circuits to have the same biases in any given layer. If this is not the case, the circuits need to be re-scaled so that the biases match.
[12] In the general formalism, the small circuit weights $w^l_t$ and biases $b^l$ can be different for each layer $l$. However, in the specific example we are implementing here, $w^l_t$ and $b^l$ do not depend on $l$.
[13] The "future post" mentioned in this quote from the last post is this post. The trick to modify a circuit such that assumption 2 holds is to add an On-indicator. We haven't actually described this in full generality yet, but you can hopefully see how a similar trick can be used to modify circuits other than rotations.
[14] You might notice that for both assumption 2 and 3 to hold simultaneously, each small network needs a negative bias.
[15] We haven't actually tried that out in practice though.
[16] As long as all the biases have the same sign, we can rescale the weights and biases of different circuits prior to embedding so that assumption 4 is satisfied. If some biases don't have the same sign, we can increase d, and partition the circuits into different parts of $\mathbb{R}^d$ such that the biases have the same sign in each part. This potentially comes at the cost of being able to fit in fewer circuits, because it makes d larger. d will never have to be larger than 2× the initial d though.
[17] Assumption 5 doesn't need to strictly hold for circuits of finite depth. It also doesn't strictly need to hold at any layer individually. The circuit could also have active noise 'correction' every few layers such that assumption 5 holds on average.
[18] Or 'dot product'.
[19] This formula assumes that circuit activation vectors $a^l_t$ are all of similar magnitude.
[20] We can't neglect this term for $u=t$, because $|e^l_t\rangle\,w^l_t\,a^{l-1}_t$ will get canceled by the $-a^l_t$, so it isn't sub-leading.
[21] This assumption holds for the conditional rotation circuits.
[22] Intuitively it doesn't seem like $\bar{w}^l\neq 0$ should ultimately change much in the derivation, but the calculations to show this were just getting kind of cumbersome. So we figured we might as well just guess and check.
[23] What values we pick shouldn't matter so long as $\Delta w$ satisfies $\mathbb{E}_u[\Delta w^l_u]=0$.