This post shows that inactive circuits erroneously activate in practice, violating that assumption. I’m curious what asymptotics are possible if we remove this assumption and force ourselves to design the network to prevent such erroneous activations. I may be misinterpreting things, though.
The erroneous activation only happens if the errors get large enough. With large enough T, D and S, this should be avoidable.
Agnostic of the method for embedding the small circuits in the larger network, currently only 1 out of d neurons in each small network is being allocated to storing whether the small network is on or off. I'm suggesting increasing it to εd for some small fixed ε, increasing the size of the small networks to (1+ε)d neurons. In the rotation example, d is so small that it doesn't really make sense. But I'm just thinking about it asymptotically. This should generalise straightforwardly to the "cross-circuit" computation case as well.
Just to clarify, the current circuit uses 2 small-circuit neurons to embed the rotating vector (since it's a 2d vector) and 2 small-circuit neurons for the on-indicator (in order to compute a step function, which requires 2 ReLUs).
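In case it's useful, here is roughly what that 2-ReLU step function looks like (a minimal sketch; the threshold and width values are just illustrative, not the ones from the post):

```python
def relu(x):
    return max(0.0, x)

def soft_step(x, threshold=0.0, width=0.1):
    """Approximate step function built from two ReLUs: ramps from 0 to 1
    over [threshold, threshold + width], approaching a hard step as width -> 0."""
    return (relu(x - threshold) - relu(x - threshold - width)) / width
```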
We could allocate more of the total storage to on-indicator information and less to the rotated vector. Such a shift might be better.
The idea is that while the indicator neurons would all be identical in the smaller network, when embedded in the larger network, the noise each small-network neuron (distributed across neurons in the large network) receives is hopefully independent.
The total noise is already a sum of several independent components. Increasing S would make the noise a sum of more, smaller terms, which is better. Your method would not make the noise contributions any more independent.
The limit on S is that we don't want any two small-network neurons to share more than one large-network neuron. We have an allocation algorithm that is better than just random distribution, using prime-number steps, but if I make S too large the algorithm runs out of prime numbers, which is why S isn't larger. This is less of a problem for larger D, so I think that in the large-D limit this would not be a problem.
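To give a feel for the constraint, here is a toy sketch (not our actual allocation algorithm; the prime-step rule and the numbers below are made up for illustration):

```python
from itertools import combinations

def allocate(num_small_neurons, S, D, primes):
    """Toy allocation: replica k of small-network neuron i goes to
    large-network neuron (i + k * primes[i]) mod D, i.e. an arithmetic
    progression with a distinct prime step per small-network neuron."""
    return {i: {(i + k * primes[i]) % D for k in range(S)}
            for i in range(num_small_neurons)}

def max_pairwise_overlap(assignment):
    """Largest number of large-network neurons shared by any two small-network
    neurons; the error calculations need this to be at most 1."""
    return max(len(a & b) for a, b in combinations(assignment.values(), 2))

assignment = allocate(num_small_neurons=8, S=4, D=97,
                      primes=[2, 3, 5, 7, 11, 13, 17, 19])
print(max_pairwise_overlap(assignment))  # 1 for these parameters
```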
This method also works under the assumptions specified in Section 5.2, right? Under the Section 5.2 assumptions, it suffices to encode the circuits which are active on the first layer, of which there are only a limited number. Even if you erroneously believe one of the circuits is active on a later layer, when it has turned off, the gain comes from eliminating the other inactive circuits. If the on-indicators don't seize, then you can stop any part of the circuit from seizing in the Section 5.2 scenario.
I'm not sure I understand what you're trying to say, but I'll try to respond anyway.
In our setup it is the case that it is known from the start which circuits are going to be active for the rest of the forward pass. This is not one of the assumptions we listed, but it is implicit in the entire framework (there will always be cases like that, even if you try to list all assumptions). However, this built-in assumption is just there for convenience, and not because we think it is realistic. I expect that in a real network, which circuits are used in the later layers will depend on computations in the earlier layers.
Possibly there are some things you can eliminate right away? But I think often not. In the transformer architecture, at the start, the network just has the embedding vector for the first token and the positional embedding. After the first attention layer, the network has a bit more information, but not that much; the softmax will make sure the network just focuses on a few previous words (right?). And every step of computation (including attention) will come with some noise, if superposition is involved.
I agree shared state/cross-circuit computation is an important thing to model, though. I guess that's what you mean by "more generally"? In which case I misunderstood the post completely. I thought it was saying that the construction of the previous post ran into problems in practice. But it seems like you're just saying that, if we want this to work more generally, there are issues?
There is a regime where the (updated) framework works. See figures 8-11 for values T(d/D)^2 < 0.004. However, for the sizes of networks I can run on my laptop, that does not leave room for very much superposition.
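To make that concrete (the D and d below are made-up laptop-ish sizes, just to illustrate; not values from the experiments):

```python
# Illustrative sizes only, not the ones from the post.
D, d = 1000, 10                       # large-network width, small-circuit width
max_T = 0.004 * (D / d) ** 2          # largest T allowed by T * (d/D)**2 < 0.004
min_T_for_superposition = D / d       # need T * d > D for more small-network neurons than large ones
print(max_T, min_T_for_superposition) # 40.0 vs 100.0: the bound bites before superposition pays off
```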
This series of posts is really useful, thank you! I have been thinking about it a lot for the past couple of days.
Do you want to have a call some time in January?
There are probably lots of things that aren't explained as well as they could have been in the post.
Thanks for your questions
Could we use O(d) redundant On-indicators per small circuit, instead of just 1, and apply the 2-ReLU trick to their average to increase resistance to noise?
If I understand you correctly, this is already what we are doing. Each on-indicator is distributed over S pairs of neurons in the large network, where S is the replication factor we used for the results in this post.
I can't increase S more than that, for the given D and d, without breaking the constraint that no two circuits should overlap in more than one neuron, and this constraint is an important assumption in the error calculations. However, it is possible that a more clever way to allocate neurons could help with this a bit.
See here: https://www.lesswrong.com/posts/FWkZYQceEzL84tNej/circuits-in-superposition-2-now-with-less-wrong-math#Construction
In the Section 5 scenario, would it help to use additional neurons to encode the network's best guess for the active circuits early on, before noise accumulates, and preserve this over layers? You could do something like track the circuits which have the most active neuron mass associated with them in the first layer of your constructed network (though this would need guarantees like circuits being relatively homogeneous in the norm of their activations).
We don't have extra neurons lying around. This post is about superposition, which means we have fewer neurons than features.
If we assume that which circuits are active never changes across layers (which is true for the example in this post), there is another thing we can do. We can encode the on-indicators at the start, in superposition, and then just copy this information from layer to layer. This prevents the parts of the network that are responsible for on-indicators from seizing. The reason we didn't do this is that we wanted to test a method that could be used more generally.
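Roughly the kind of thing I mean, as a toy sketch (the sizes, the random embedding, and the noise level are made up for illustration; this is not the construction from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, n_layers = 50, 20, 6          # made-up sizes: T circuits, k neurons reserved for indicators

# Encode T binary on-indicators in superposition over k dimensions.
E = rng.standard_normal((k, T)) / np.sqrt(k)   # random, non-orthogonal embedding
on = np.zeros(T); on[:5] = 1.0                 # say 5 of the 50 circuits are active
state = E @ on

# Copy the indicator subspace unchanged from layer to layer; the small noise term
# stands in for interference from the rest of the network.
for _ in range(n_layers):
    state = state + 0.01 * rng.standard_normal(k)

# Read the indicators back out; active ones decode to roughly 1, inactive to roughly 0.
decoded = E.T @ state
print("active:", decoded[on == 1].mean(), "inactive:", decoded[on == 0].mean())
```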
encode the network's best guess for the active circuits early on
We don't need to encode the best guess, we can just encode the true value (up to the uncertainty that comes from compressing it into superposition), given in the input. If we assume that these values stay constant, that is. But storing the value from early on also assumes that the value of the on-indicators stays fixed.
The reason I think this is a reasonable idea is that LLMs do seem to compute binary indicators of when to use circuits, separate from the circuits themselves. Ferrando et al. found models have features for whether they recognize an entity, which gates fact lookup. In GPT2-Small, the entity recognition circuit is just a single first-layer neuron, while the fact-lookup circuit it controls is presumably very complex (GDM couldn't reverse-engineer it). This suggests networks naturally learn simple early gating for complex downstream computations.
Ooo, interesting! I will definitely have a look into those papers. Thanks!
Separately, I think it's good to invite people like Sam Altman to events like the Progress Conference, and I would of course want Sam to be at important diplomatic meetings. If you think that's always bad, then I do think Lighthaven might be bad! I am definitely hoping for it to facilitate conversations between many people I think are causing harm to the world.
I think it's approximately always bad to invite Sam Altman. We know he lies and manipulates people. We know that he succeeded at stealing OpenAI from the non-profit. Inviting him to any high-trust space, where most people will by default assume good faith (which I would be very surprised is not the case at the Progress Conference), is in my judgment very bad. Inviting him to a negotiation where most people are already suspicious of each other might be worth it in some situations, maybe? I have no expertise here.
In general, I would like the incentive landscape to be such that if you steal OpenAI from the non-profit, and work towards hastening the end of the world, you are socially shunned.
(I don't think stealing OpenAI was the most impactful thing from an X-risk perspective. But it's just so obviously evil from any worldview. I don't see any possibility of good-faith communication after that.)
My previous understanding of the situation was that the Progress Conference naively invited Sam Altman, and Lightcone did not veto this, and for some reason did not prioritise advising them against it. Knowing that you endorse this makes me update in a negative direction.
I wish someone would link the comment in question by habryka. I remember reading it, but I can't find it.
I think you said you "would not be surprised" or "expect it will happen" or something like that, that you would rent Lighthaven to the labs. Which did not give me the impression that the tax would be very high from the labs' perspective.
I do think anyone (including habryka) has the right to say "oops, that was badly written, here's what I actually meant."
But what was said in that original comment still matters for whether or not this was a reasonable thing to be concerned about, before the interactions in the comments here.
My impression after reading that old comment from you was much more in line with what Mikhail said. So I'm happy this got brought up and clarified.
Yes, I just remembered that I forgot to do this. Oops.
I choose my clothing based on:
The list is roughly in order of priority, and I don't wear anything that does not at least satisfy some base level of each of them.
Point 2 depends on the setting. E.g. I wouldn't go to a costume party without at least an attempt at a costume. Also, at a costume party, a great costume scores better on 2 than an average one; this is an example of fitting in not being the same as blending in.
In general, 2 is not very constraining; there are a lot of different looks that qualify as fitting in, in most places I hang out. But I would still probably experiment with more unusual looks if I were less conformist. And I would be naked a lot more, if that were normal.
I'm emotionally conformist. But I expect a lot of people I meet don't notice this, because I'm also bad at conforming. There is just so much else pulling in other directions.
I would recommend that anyone with dependents, or any other need for economic stability (e.g. lack of a safety net from your family or country), should focus on earning money.
You can save up and fund yourself. Or if that takes too long, you can give what you can: 10% (or whatever works for you) to support someone else.
Definitely yes to more honesty!
However, I think it's unfair to describe all the various AI safety programs as "MATS clones". E.g. AISC is both older and quite different.
But no amount of "creative ways to bridge the gap" will solve the fundamental problem, because there isn't really a gap. It's not that there are lots of senior jobs waiting, if only we could level people up faster. The simple fact is that there isn't enough money.
So the section headings are not about the transmission type investigated, but which transmission type the studies pointed to as the leading one?
LLMs (and probably most NNs) have lots of meaningful, interpretable linear feature directions. These can be found through various unsupervised methods (e.g. SAEs) and supervised methods (e.g. linear probes).
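For instance, a minimal sketch of the supervised version, with a logistic-regression probe on synthetic stand-in activations (the sizes and the planted direction are made up; real activations would come from a model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_model = 2000, 256

# Synthetic stand-in for hidden activations, with a planted linear feature
# direction that correlates with a binary label.
true_direction = rng.standard_normal(d_model)
true_direction /= np.linalg.norm(true_direction)
labels = rng.integers(0, 2, n)
acts = rng.standard_normal((n, d_model)) + 2.0 * labels[:, None] * true_direction

# The probe: a logistic regression from activations to the label.
# Its (normalised) weight vector is the recovered feature direction.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("cosine similarity with planted direction:", float(direction @ true_direction))
```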
However, most human-interpretable features are not what I would call the model's true features.
If you find the true features, the network should look sparse and modular, up to noise factors. If you find the true network decomposition, then removing what's left over should improve performance, not make it worse.
Because the network has a limited number of orthogonal directions, there will be interference terms that the network would like to remove, but can't. A real network decomposition will be everything except this noise.
This is what I think mech-interp should be looking for.
It's possible that I'm wrong and there is no such thing as "the network's true features". But we (humans collectively) have only just started this research agenda. The fact that we haven't found them yet is not much evidence either way.