This is a cool method. Are you thinking of looking more into how gradient-routed model performance (on tasks, not just loss) scales with the size of the problem/model? You mention that it requires a large L1 regularization on the vision dataset, and it would be nice to try something larger than CIFAR. Looks like the LLM and RL models are also < 1B parameters, but I'm sure you're planning to try something like a Llama model next.
I'm imagining you would do this during regular training/pre-training so that your model is modular and you can remove shards based on your needs, but if the alignment tax is really high (or lowering it is complicated; hyperparameter tuning sucks :/) it's gonna be hard to convince people to use it, and that's just unfortunate. Maybe you are also thinking of using it as a modification to finetuning, which seems more promising, since with gradient routing you are also basically doing some form of PEFT.
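(For concreteness, here's a coarse caricature of what I mean by "routing during finetuning is basically PEFT". The actual method routes gradients at the level of activations/edges for chosen data points, I believe; this sketch just masks parameter gradients per data shard, and all the names are mine.)

```python
import torch

# Toy model; pretend the last layer is the region we route the "special" shard into.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
shard_param_ids = {id(p) for p in model[2].parameters()}  # hypothetical choice of region

def train_step(x, y, is_special_shard: bool):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    if is_special_shard:
        # "Route" this shard's updates: zero the gradient everywhere except the
        # designated region, so only that region absorbs this data.
        for p in model.parameters():
            if id(p) not in shard_param_ids and p.grad is not None:
                p.grad.zero_()
    opt.step()

# e.g. train_step(torch.randn(8, 16), torch.randn(8, 1), is_special_shard=True)
```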
What do you think can be improved for finetuning & unlearning use-cases (i.e. for LLMs)?
This seems like a pretty cool perspective, especially since it might make analysis a little simpler vs. a paradigm where you kind of need to know what to look out for specifically. Are there any toy mathematical models or basically simulated worlds/stories, etc... to make this more concrete? I briefly looked at some of the slides you shared but it doesn't seem to be there (though maybe I missed something, since I didn't watch the entire video(s)).
I'm honestly not sure exactly what this would look like, since I don't fully understand much here beyond the notion that concentration of intelligence/cognition can lead to larger-magnitude outcomes (which we probably already knew) and the idea that maybe we could measure this or use it to reason in some way (which maybe we aren't doing so much). Maybe we could have some sort of simulated game where different agents get to control civilizations (e.g. like Civ 5) and, among the things they can invest their resources in, there is some measure of "cognition" (e.g. it lets them plan further ahead, take more variables into consideration when making decisions, or see more of the map...). With that said, it's not clear to me what would come out of this simulation other than maybe a notion of the relative value (in different contexts) of cognitive vs. physical investments (i.e. having smarter strategists vs. building a better castle). There's no clear question or hypothesis that comes to mind right now.
It looks like, from some other comments, that the literature on agent foundations might be relevant, but I'm not familiar with it. If I get time I might look into it in the future. Are these sorts of frameworks usable for actual decision-making right now (and if so, how can we tell whether they are or not), or are they still exploratory?
Generally just curious if there's a way to make this more concrete i.e. to understand it better.
That's great! Activation/representational steering is definitely important, but I wonder whether it is being applied right now to improve safety. I've read only a little bit of the literature, so maybe I'll just find out later :P
The fact that refusal steering is possible definitely opens the possibility of gradient-based optimization attacks, or may help explain why some attacks work. Maybe you can use this to build a jailbreak detector of some kind? I do think it's important to push to get techniques usable in the real world, though I also understand that science is not so linear. Where and how do you think DM's research could get more real-world grounding? (Or do you think that it's all well and good as it stands?)
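(To make the detector idea a bit more concrete, something like the sketch below is what I'm picturing, assuming you already have a "refusal direction" for some layer, e.g. from a difference-in-means over harmful vs. harmless prompts. Every name and threshold here is made up.)

```python
import torch

def refusal_projection(resid_acts: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Per-token projection of residual-stream activations [seq, d_model]
    onto a unit-normalized refusal direction [d_model]."""
    refusal_dir = refusal_dir / refusal_dir.norm()
    return resid_acts @ refusal_dir

def looks_like_jailbreak(resid_acts, refusal_dir, harmful_baseline, margin=2.0):
    # Crude heuristic: a prompt that "should" be refused but whose refusal-direction
    # projection sits far below the typical harmful-prompt baseline is suspicious.
    proj = refusal_projection(resid_acts, refusal_dir).mean()
    return (harmful_baseline - proj) > margin
```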
I'm curious about your thoughts on this notion of perennial philosophy and convergence of beliefs. One interpretation that I have of perennial philosophy is purely empirical: imagine that we have two "belief systems". We could define a belief system as a set of statements about the way the world works and valuations of world states (i.e. statements like "if X then Y could happen" and "Z is good to have"). You can probably formalize it some other way, but I think this is a reasonable starter pack to keep it simple. (You can also imagine further formalizing it by using numbers or lattices for values and probabilities and some well-defined FSM to model parts of the world.) We could say that two religions have converged if they share a lot of features, by which I mean that, for some definition of a feature, the same feature is present in both belief systems. We can define a feature in many ways, but for our simple thought experiment it can effectively be a state or a relation between states in the two worldviews. For example, we could imagine that a feature is a function of states and their values/causal relations that remains unchanged under the mapping between the systems (i.e. the mapping acts something like an isomorphism on the projection of the system through the function). For example, in one belief system you might have some sort of "god" character that is somehow the cause of many things. The function here could be "(int(god is cause of x1) + int(god is cause of x2) + ...) / num_objects". If we map common objects (this spoon) to themselves in the other system (still the spoon) and god to god, we will see that in both systems the function representing how causal god is remains close to 1, and so we may say that both systems have a notion of a "god" and therefore that there has been some degree of convergence in the "having a god" department.
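(A toy version of the "god-causality" feature, just to pin down what I mean by a feature being preserved under a mapping; the systems, mapping, and numbers are all made up.)

```python
# A belief system as a set of (cause, effect) statements; a feature as a number
# computed from it; "convergence" on that feature as it being preserved under a mapping.
system_a = {("god", "rain"), ("god", "harvest"), ("spoon", "soup moves")}
system_b = {("deity", "rain"), ("deity", "harvest"), ("spoon", "soup moves")}
mapping = {"god": "deity", "spoon": "spoon"}  # hypothetical object correspondence

def causal_share(system, entity):
    """Fraction of all effects in the system attributed to `entity`."""
    effects = {effect for _, effect in system}
    caused = {effect for cause, effect in system if cause == entity}
    return len(caused) / len(effects)

print(causal_share(system_a, "god"))           # 2/3 in system A
print(causal_share(system_b, mapping["god"]))  # 2/3 in system B: the feature is preserved
```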
So, with all this formal BS out of the way (which I had to do because it highlights what is missing), the question is clear: under some reasonable such definition of what convergence means, how do you decide whether two religions have converged? The vibe I get from the perennial philosophy believers I have spoken to so far is that "you have to go through the journey to understand", and generally it appears to be a sort of dispositional convergence, at least on face value. I do not, however, observe people of very different religions, who claim convergence, living together for a long time (meaning that it is not verifiable whether the dispositions have truly converged). Of course, it may also be possible to find mappings that claim two belief systems have converged, or have not, when the opposite is the more honest assessment.
Obviously no one is going to come out here and create a mathematical definition and just "be right" (I don't think that's even a fair thing to expect to be possible), but I do not particularly like making such assertions totally "on vibes". Often people will say that they are "spiritual" and that "spirituality" helped them overcome some psychological challenge or who knows what, but what does "spiritual" mean here? Often it's associated with some belief system that we would, as laymen, call religious or spiritual (i.e. in the enumerable list of Christianity and its sub-branches, Buddhism and its sub-branches, etc...), but who is to say that the truer cause of the change of psyche was not just some part of what that person experienced, which happened to be delivered by the spiritual system present at that time and place? It seems compelling to me to want to decouple these "core truths" from the religions that hold them, so as to present them in a more neutral way, since in the alternative world where you must "go through the journey" of spirituality via some specific religion, you cannot know beforehand that you won't be effectively brainwashed, and you cannot even know afterwards either... you can only get faint hints at it during the process.
So this is not to say that anyone is getting brainwashed, or that anything is good or bad, or that anything should be embraced or not. I'm just saying that from an outside perspective, it is not verifiable whether religions actually converge without diving into this stuff. However, it is also not verifiable whether diving in is actually good, and it's not verifiable whether it will even be verifiable afterwards. Maybe I'm stumbling into some core metaphysical whirlwind of "you cannot know anything", but I do truly believe that a more systematic exposition of how we should interpret spirituality, trapped priors, convergence, and the like is possible and would enable more productive discussion.
PS: I think you've touched on something tangential in the statement that you should do this with trusted people. That's trying to bootstrap a resistance to manipulative misappropriation of spirituality, whereas what I'm saying is that I would also like a more logical bootstrapping of the whole notion of spirituality and of ideas like "convergence", so that one can leave the conversation with solid conclusions, knowing their limitations, and having a higher level of actionability.
PPS: I feel like treating a belief system like "rationality" as a machine/tool (something which has a certain reach and certain limitations, and that usually behaves as expected but might have some bugs) is a good way to go. This makes it easier to decouple rationality from, say, spiritual traditions. At each point in time and space you can basically decide, using common sense, which of these machines/tools is best to apply. Each tool can hopefully be shown to be good for some cases, and thus most decision-making happens at the routing level: which tool to use. If you understand the tool from a third-person point of view, there is less of a tendency to rely on it in the wrong cases purely out of dogma.
For a while now, there has been a growing focus on safety training using activation engineering, such as via circuit breakers and LAT (more LAT). There's also new work on improving safety training, and always plenty of new red-teaming attacks that (ideally) create space for new defenses. I'm not sure if what I'm describing here is 100% a coherent category, but generally I mean to include methods that are applicable IRL (e.g. the Few Tokens Deep paper uses the easiest form of data augmentation ever and it seems to fix some known vulnerabilities effectively), can be iterated alongside red-teaming to get increasingly better defenses, and focus on interventions on safety-relevant phenomena (more on this below).
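(My loose recollection of the Few Tokens Deep augmentation, to illustrate how cheap it is: splice a partial harmful continuation in front of a refusal, so the model learns to refuse from positions other than token zero. If I'm garbling the details, treat this as a generic member of the "trivially cheap data augmentation" genre; the helper below is entirely mine.)

```python
import random

def make_recovery_example(harmful_prompt: str, harmful_answer: str, refusal: str,
                          max_prefix_tokens: int = 20) -> dict:
    """Build one augmented training example: the response starts with a few tokens
    of a harmful continuation, then recovers into a refusal."""
    k = random.randint(1, max_prefix_tokens)
    prefix = " ".join(harmful_answer.split()[:k])
    return {
        "prompt": harmful_prompt,
        "response": prefix + " ... Sorry, I actually can't help with that. " + refusal,
    }
```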
Is DM exploring this sort of stuff? It doesn't seem to be under the mantle of "AGI Safety" given the post above. Maybe it's another team? It's true that it's more "AI" than "AGI" safety, and that we need the more scientific/theoretical AGI Safety research illustrated in the post too, if we are to have a reasonably good future alongside AGIs. With that said, this sort of more empirical red-teaming + safety-training oriented research has some benefits:
The way I see it, there are roughly 4 categories of research that can be done in AI Safety (though maybe this is rather Procrustean and I'm missing some):
In this categorization, it seems like DM's AGI Safety team is very much more focused on 1, 2, and 3. There's nothing wrong with any of these, but it would seem like 2 and 4 should be the bread and butter, right? Is there any sort of 4 work going on? Aren't companies like DM in a much better position to do this sort of work than the academic labs and other organizations that you find publishing this stuff? You guys have access to the surrounding systems (meaning you can gain a better understanding of attack vectors and side effects than someone who is just testing the input/output of a chatbot), have access to the model internals, have boatloads of compute (it would also be nice to know how things like LAT work on a full-scale model instead of just Llama3-8B), and are a common point of failure (most people are using models from OAI, Anthropic, DM, Meta). Maybe I'm conflating DM with other parts of Alphabet?
Anyways, I'm curious where things along the lines of 4 figure into your plan for AGI Safety. It would be criminal to try and make AI "safe" while ignoring all the real-world, challenging-but-tractable, information-rich challenges that arise from things such as red-teaming attacks that can happen today. Also curious to hear if you think this categorization is flawed in some key way.
How is this translational symmetry measure checking for the translational symmetry of the circuit? QK, for example, is being used as a bilinear form, so it's not clear to me what the "difference in the values" is mapping onto here (since I think these "numbers" actually correspond to unique embeddings). More broadly, do you have a good sense of how to interpret these bilinear forms? There is clearly a lot of structure in the standard weight basis in these pictures, and I'm not sure exactly what it means. I'm guessing the fact that some sections are rather empty corresponds to "the model learns to specialize certain heads on certain parts of the vocabulary", potentially associated with some sort of one-hot or generally standard-basis-privileged situation. Let me know if I'm misunderstanding something dumb. I haven't seen this being done much elsewhere, but it would be nice to have a GitHub repo, because it's really easy to resolve these questions by reading PyTorch code. Is it available somewhere?
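(For reference, this is the kind of check I would naively run for "translational symmetry" of a QK bilinear form: compute the score matrix over the relevant embeddings and ask how far it is from depending only on the index difference, i.e. how far from Toeplitz. Quite possibly not what you're actually measuring, which is partly why I'm asking.)

```python
import torch

def toeplitz_deviation(scores: torch.Tensor) -> float:
    """How far a score matrix s[i, j] is from depending only on (i - j):
    variance along each diagonal, pooled, relative to total variance."""
    n = scores.shape[0]
    pooled, count = 0.0, 0
    for offset in range(-(n - 1), n):
        d = torch.diagonal(scores, offset=offset)
        if d.numel() > 1:
            pooled += d.var().item() * d.numel()
            count += d.numel()
    return (pooled / max(count, 1)) / scores.var().item()

# e.g. scores = emb @ W_QK @ emb.T for whatever slice of the vocabulary/positions is relevant
```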
One other thing I'm curious about is results for more control experiments. For example, for the noise: if you fully noised the output (i.e. output a random permutation), we should expect the model to fail to learn anything at all and to fail to get a high LLC, right? It's also possible to noise by inserting new elements in the output (or input... I guess it's equivalent) to replace others, while keeping the order the same. In this case, maybe the network can still learn what the ordering is even if it doesn't know exactly which outputs will be there in the end, so even with very high amounts of noise a "structured" solution makes sense (though I reckon the way you propagate loss will matter in this case).
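(Concretely, and assuming the target is something like a sorted copy of the input, which I may be misremembering, the two noising schemes I mean are roughly:)

```python
import random

def noise_full_shuffle(target: list) -> list:
    """Control (a): replace the target with a random permutation of itself;
    there should be nothing left to learn, and presumably no high LLC."""
    out = target[:]
    random.shuffle(out)
    return out

def noise_substitute(target: list, vocab, p: float) -> list:
    """Control (b): swap out some elements for fresh ones but keep the ordering
    structure intact, so a "structured" solution is still learnable at high noise."""
    out = [random.choice(vocab) if random.random() < p else x for x in target]
    return sorted(out)
```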
Not sure exactly how to frame this question, and I know the article is a bit old. Mainly curious about the program synthesis idea.
On some level, any explanatory model for literally any phenomenon can, it would seem, be framed as a "program synthesis problem". For example, historically, we have wanted to synthesize a set of mathematical equations to describe/predict (model) the movement of stars in the sky, or rates of chemical reactions in terms of certain measurements (and so on). Even in non-mathematical cases, we have wanted to find context-specific languages (not necessarily formal, but with some elements of formality, such as constraints on what relations are allowed, etc...) that map onto things such as biology, psychology, etc...
I think it's fair to call these programs, since they are tools you use in a sort of causal way to say what will happen. Usually, you imagine certain objects that follow certain rules to do things, thereby changing the state of the world. They are things you could write as programs or instructions.
The art here is to be able to formalize a language that has the right parametrization to describe and predict the desired phenomena well, while being expressive enough to grow in a useful way, as we discover more.
But anyways, there are sort of two questions that naturally arise here:
This is really cool! Exciting to see that it's possible to explore the space of possible steering vectors without having to know what to look for a priori. I'm new to this field, so I had a few questions; I'm not sure if they've been answered elsewhere.
Thanks!
Why do you guys think this is happening? One possibility that comes to mind is that the model might be doing some amount of ensembling (thinking back to The Clock and The Pizza, where ensembling happened in a toy setting). W.r.t. "across all steering vectors", that's pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.
Also, what are y'all's intuitions on picking layers for this stuff? I understand that in the post you steer early layers because we suppose that they might be acting something like switches to different classes of functionality. However, implicit in the choice of layer 10 seems to be that you also don't want to go too early, because maybe the very early layers are still doing embedding-adjacent work and learning basic concepts like whether a word is a noun or whatever. Do you choose layers based on experience tinkering in Jupyter notebooks and the like, or have you run some sort of grid search to get a notion of what the effects elsewhere are? If the latter, it would be nice to know, to aid in hypothesis formation and the like.
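(What I mean by "run some sort of grid": something like the toy sweep below, where you inject the same vector at each layer in turn and see where it actually bites. With a real LM you'd hook the residual stream and score generations rather than comparing toy outputs; everything here is a stand-in.)

```python
import torch
import torch.nn as nn

d, n_layers = 32, 8
torch.manual_seed(0)
model = nn.Sequential(*[nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in range(n_layers)])
steering_vec = torch.randn(d)
x = torch.randn(4, d)

def effect_of_steering_at(layer_idx: int, scale: float = 4.0) -> float:
    def hook(module, inputs, output):
        return output + scale * steering_vec  # add the vector to this layer's output
    handle = model[layer_idx].register_forward_hook(hook)
    try:
        steered = model(x)
    finally:
        handle.remove()
    return (steered - model(x)).norm().item()  # crude proxy for downstream effect size

for i in range(n_layers):
    print(i, effect_of_steering_at(i))
```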
I agree that there are inductive biases towards sharing features and/or components. I'm not sure if there's a good study of which features would be of this sort vs. which others might actually benefit from being more separate[1], and I'm not sure how you would do it effectively for a truly broad set of features, nor if it would necessarily be that useful anyways, so I tend to just take this on vibes since it's pretty intuitive based on our own perception of e.g. shapes. That said, there are plenty of categories/tasks/features which I would expect are kinda separable after some point. Specifically, anything where humans have already applied some sort of division of labor, like software features vs. biology knowledge features vs. creative writing features, etc... (in the setting of natural language). Obviously, these all might share some basic grammatical or structural core features, but one layer of abstraction up it feels natural that they should be separable. All this goes to say that a good way to give gradient routing the best possible shot at success might be to try some such partitioning of features/tasks,[2] because unlike 3 and 8 we have some prior reason to believe that they should indeed be rather separate. Maybe there are other sources that can motivate which features or tasks to try to route separately with minimal loss of utility (e.g. what MoE papers report works well or not), but I haven't thought about it too much.
One downside here is that all these examples that come to mind are in language settings, and so to get reasonable utility to start with you would probably need to be in the 1B-7B model size range.
About the edges: have you tried all 3 combinations (route both, route one, route the other)? I think the fact that you limit to these edges is mentioned in the appendix Memory section. Surely, routing on activation edges is not actually prohibitive. Worst case, you can just assign blocks to each category and it'll basically just be an MoE. This really is just mathematically equivalent to MoE with a specific choice of architecture[3], right? One idea I had vaguely a while ago, though it seems rather complicated, is to do alternating dense training with MoE-fication (rough sketch below). In the dense training phases you train densely like usual. Then, you use some clever algorithm (think: interpretability methods on steroids) to decide which parts of the network will become which experts. Then, in the MoE-fication phase, you use the clever algorithm to basically define routes/prune edges (for the chosen partitioning). You go back and repeat. Each expert is somewhat analogous to its own network, so each new iteration you split further and further. The goal is to get as much splitting as possible for minimal utility cost. The resulting model might be smaller/cheaper at inference time and more interpretable. I'm not honestly sure how useful this might be, but I thought it was kind of cool :P
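(Toy version of the alternating scheme, with a dumb stand-in for the "clever algorithm": assign each hidden unit to whichever output it covaries with most, then hard-prune the cross edges. A real version would obviously keep the mask enforced during subsequent training rather than letting pruned weights grow back; names and setup are all mine.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden, d_out = 8, 64, 2
model = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.randn(512, d_in)
Y = torch.stack([X[:, :4].sum(1), X[:, 4:].sum(1)], dim=1)  # two plausibly separable "tasks"

def train_dense(steps: int = 200):
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(X), Y).backward()
        opt.step()

def partition_hidden() -> torch.Tensor:
    # Stand-in "clever algorithm": assign each hidden unit to the output it covaries with most.
    h = torch.relu(model[0](X))
    cov = torch.stack([(h * Y[:, j:j + 1]).mean(0).abs() for j in range(d_out)], dim=1)
    return cov.argmax(dim=1)  # expert id per hidden unit

def prune_cross_edges(assignment: torch.Tensor):
    # MoE-fication step: output j only reads from hidden units assigned to expert j.
    with torch.no_grad():
        for j in range(d_out):
            model[2].weight[j, assignment != j] = 0.0

for _ in range(3):  # alternate: dense training, then split/prune, repeat
    train_dense()
    prune_cross_edges(partition_hidden())
```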
Really all I care about here is that these features don't lose too much from being separate. With that said, I guess some features may benefit from being separated at training time if the train set has some spurious correlations, which it probably does.
Unlearning virology while keeping cellular biology seems like the hardest possible task ngl.
You might be weight-sharing or smth.