This is a cool method. Are you thinking of looking more into how gradient-routed model performance (on tasks, not just loss) scales with the size of the problem/model? You mention that it requires a large L1 regularization on the vision dataset, and it would be nice to try something larger than CIFAR. Looks like the LLM and RL models are also < 1B parameters, but I'm sure you're planning to try something like a Llama model next.
I'm imagining you would do this during regular training/pre-training so that your model is modular and you can remove shards based on your needs, but if the alignment tax is really high (or lowering it is complicated; hyperparameter tuning sucks :/) it's gonna be hard to convince people to use it, and that's just unfortunate. Maybe you are also thinking of using it as a modification to finetuning, which seems more promising, since with gradient routing you are also basically doing some form of PEFT.
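(For concreteness, here's a coarse caricature of what I mean by "routing during finetuning is basically PEFT". The actual method routes gradients at the level of activations/edges for chosen data points, I believe; this sketch just masks parameter gradients per data shard, and all the names are mine.)

```python
import torch

# Toy model; pretend the last layer is the region we route the "special" shard into.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
shard_param_ids = {id(p) for p in model[2].parameters()}  # hypothetical choice of region

def train_step(x, y, is_special_shard: bool):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    if is_special_shard:
        # "Route" this shard's updates: zero the gradient everywhere except the
        # designated region, so only that region absorbs this data.
        for p in model.parameters():
            if id(p) not in shard_param_ids and p.grad is not None:
                p.grad.zero_()
    opt.step()

# e.g. train_step(torch.randn(8, 16), torch.randn(8, 1), is_special_shard=True)
```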
What do you think can be improved for finetuning & unlearning use-cases (i.e. for LLMs)?
This seems like a pretty cool perspective, especially since it might make analysis a little simpler vs. a paradigm where you kind of need to know what to look out for specifically. Are there any toy mathematical models or basically simulated worlds/stories, etc... to make this more concrete? I briefly looked at some of the slides you shared but it doesn't seem to be there (though maybe I missed something, since I didn't watch the entire video(s)).
I'm honestly not sure exactly what this would look like, since I don't fully understand much here beyond the notion that concentration of intelligence/cognition can lead to larger-magnitude outcomes (which we probably already knew) and the idea that maybe we could measure this or use it to reason in some way (which maybe we aren't doing so much). Maybe we could have some sort of simulated game where different agents get to control civilizations (e.g. like Civ 5) and, among the things they can invest their resources in, there is some measure of "cognition" (e.g. it lets them plan further ahead, take more variables into consideration when making decisions, or see more of the map...). With that said, it's not clear to me what would come out of this simulation other than maybe a notion of the relative value (in different contexts) of cognitive vs. physical investments (i.e. having smarter strategists vs. building a better castle). There's no clear question or hypothesis that comes to mind right now.
It looks like, from some other comments, that the literature on agent foundations might be relevant, but I'm not familiar with it. If I get time I might look into it in the future. Are these sorts of frameworks usable for actual decision-making right now (and if so, how can we tell whether they are or not), or are they still exploratory?
Generally just curious if there's a way to make this more concrete i.e. to understand it better.
That's great! Activation/representational steering is definitely important, but I wonder whether it is being applied right now to improve safety. I've read only a little bit of the literature, so maybe I'll just find out later :P
The fact that refusal steering is possible definitely opens the possibility of gradient-based optimization attacks, or may help explain why some attacks work. Maybe you can use this to build a jailbreak detector of some kind? I do think it's important to push to get techniques usable in the real world, though I also understand that science is not so linear. Where and how do you think DM's research could get more real-world grounding? (Or do you think that it's all well and good as it stands?)
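(To make the detector idea a bit more concrete, something like the sketch below is what I'm picturing, assuming you already have a "refusal direction" for some layer, e.g. from a difference-in-means over harmful vs. harmless prompts. Every name and threshold here is made up.)

```python
import torch

def refusal_projection(resid_acts: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Per-token projection of residual-stream activations [seq, d_model]
    onto a unit-normalized refusal direction [d_model]."""
    refusal_dir = refusal_dir / refusal_dir.norm()
    return resid_acts @ refusal_dir

def looks_like_jailbreak(resid_acts, refusal_dir, harmful_baseline, margin=2.0):
    # Crude heuristic: a prompt that "should" be refused but whose refusal-direction
    # projection sits far below the typical harmful-prompt baseline is suspicious.
    proj = refusal_projection(resid_acts, refusal_dir).mean()
    return (harmful_baseline - proj) > margin
```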
I'm curious about your thoughts on this notion of perennial philosophy and convergence of beliefs. One interpretation that I have of perennial philosophy is purely empirical: imagine that we have two "belief systems". We could define a belief system as a set of statements about the way the world works and valuations of world states (i.e. statements like "if X then Y could happen" and "Z is good to have"). You can probably formalize it some other way, but I think this is a reasonable starter pack to keep it simple. (You can also imagine further formalizing it by using numbers or lattices for values and probabilities and some well-defined FSM to model parts of the world.) We could say that two religions have converged if they share a lot of features, by which I mean that, for some definition of a feature, the same feature is present in both belief systems. We can define a feature in many ways, but for our simple thought experiment it can effectively be a state or a relation between states in the two worldviews. For example, we could imagine that a feature is a function of states and their values/causal relations that remains unchanged under the mapping between the systems (i.e. the mapping acts something like an isomorphism on the projection of the system through the function). For example, in one belief system you might have some sort of "god" character that is somehow the cause of many things. The function here could be "(int(god is cause of x1) + int(god is cause of x2) + ...) / num_objects". If we map common objects (this spoon) to themselves in the other system (still the spoon) and god to god, we will see that in both systems the function representing how causal god is remains close to 1, and so we may say that both systems have a notion of a "god" and therefore that there has been some degree of convergence in the "having a god" department.
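(A toy version of the "god-causality" feature, just to pin down what I mean by a feature being preserved under a mapping; the systems, mapping, and numbers are all made up.)

```python
# A belief system as a set of (cause, effect) statements; a feature as a number
# computed from it; "convergence" on that feature as it being preserved under a mapping.
system_a = {("god", "rain"), ("god", "harvest"), ("spoon", "soup moves")}
system_b = {("deity", "rain"), ("deity", "harvest"), ("spoon", "soup moves")}
mapping = {"god": "deity", "spoon": "spoon"}  # hypothetical object correspondence

def causal_share(system, entity):
    """Fraction of all effects in the system attributed to `entity`."""
    effects = {effect for _, effect in system}
    caused = {effect for cause, effect in system if cause == entity}
    return len(caused) / len(effects)

print(causal_share(system_a, "god"))           # 2/3 in system A
print(causal_share(system_b, mapping["god"]))  # 2/3 in system B: the feature is preserved
```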
So, with all this formal BS out of the way (which I had to do because it highlights what is missing), the question is clear: under some reasonable such definition of what convergence means, how do you decide whether two religions have converged? The vibe I get from the perennial philosophy believers I have spoken to so far is that "you have to go through the journey to understand", and generally it appears to be a sort of dispositional convergence, at least on face value. I do not, however, observe people of very different religions, who claim convergence, living together for a long time (meaning that it is not verifiable whether the dispositions have truly converged). Of course, it may also be possible to find mappings that claim two belief systems have converged, or have not, when the opposite is the more honest assessment.
Obviously no one is going to come out here and create a mathematical definition and just "be right" (I don't think that's even a fair thing to expect to be possible), but I do not particularly like making such assertions totally "on vibes". Often people will say that they are "spiritual" and that "spirituality" helped them overcome some psychological challenge or who knows what, but what does "spiritual" mean here? Often it's associated with some belief system that we would, as laymen, call religious or spiritual (i.e. in the enumerable list of Christianity and its sub-branches, Buddhism and its sub-branches, etc...), but who is to say that the truer cause of the change of psyche was not just some part of what that person experienced, which happened to be delivered by the spiritual system present at that time and place? It seems compelling to me to want to decouple these "core truths" from the religions that hold them, so as to present them in a more neutral way, since in the alternative world where you must "go through the journey" of spirituality via some specific religion, you cannot know beforehand that you won't be effectively brainwashed, and you cannot even know afterwards either... you can only get faint hints at it during the process.
So this is not to say that anyone is getting brainwashed, or that anything is good or bad, or that anything should be embraced or not. I'm just saying that from an outside perspective, it is not verifiable whether religions actually converge without diving into this stuff. However, it is also not verifiable whether diving in is actually good, and it's not verifiable whether it will even be verifiable afterwards. Maybe I'm stumbling into some core metaphysical whirlwind of "you cannot know anything", but I do truly believe that a more systematic exposition of how we should interpret spirituality, trapped priors, convergence, and the like is possible and would enable more productive discussion.
PS: I think you've touched on something tangential in the statement that you should do this with trusted people. That's trying to bootstrap a resistance to manipulative misappropriation of spirituality, whereas what I'm saying is that I would also like a more logical bootstrapping of the whole notion of spirituality and of ideas like "convergence", so that one can leave the conversation with solid conclusions, knowing their limitations, and having a higher level of actionability.
PPS: I feel like treating a belief system like "rationality" as a machine/tool (something which has a certain reach and certain limitations, and that usually behaves as expected but might have some bugs) is a good way to go. This makes it easier to decouple rationality from, say, spiritual traditions. At each point in time and space you can basically decide, using common sense, which of these machines/tools is best to apply. Each tool can hopefully be shown to be good for some cases, and thus most decision-making happens at the routing level: which tool to use. If you understand the tool from a third-person point of view, there is less of a tendency to rely on it in the wrong cases purely out of dogma.
For a while now, there has been a growing focus on safety training using activation engineering, such as via circuit breakers and LAT (more LAT). There's also new work on improving safety training, and always plenty of new red-teaming attacks that (ideally) create space for new defenses. I'm not sure if what I'm describing here is 100% a coherent category, but generally I mean to include methods that are applicable IRL (e.g. the Few Tokens Deep paper uses the easiest form of data augmentation ever and it seems to fix some known vulnerabilities effectively), can be iterated alongside red-teaming to get increasingly better defenses, and focus on interventions on safety-relevant phenomena (more on this below).
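(My loose recollection of the Few Tokens Deep augmentation, to illustrate how cheap it is: splice a partial harmful continuation in front of a refusal, so the model learns to refuse from positions other than token zero. If I'm garbling the details, treat this as a generic member of the "trivially cheap data augmentation" genre; the helper below is entirely mine.)

```python
import random

def make_recovery_example(harmful_prompt: str, harmful_answer: str, refusal: str,
                          max_prefix_tokens: int = 20) -> dict:
    """Build one augmented training example: the response starts with a few tokens
    of a harmful continuation, then recovers into a refusal."""
    k = random.randint(1, max_prefix_tokens)
    prefix = " ".join(harmful_answer.split()[:k])
    return {
        "prompt": harmful_prompt,
        "response": prefix + " ... Sorry, I actually can't help with that. " + refusal,
    }
```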
Is DM exploring this sort of stuff? It doesn't seem to be under the mantle of "AGI Safety" given the post above. Maybe it's another team? It's true that it's more "AI" than "AGI" safety, and that we need the more scientific/theoretical AGI Safety research illustrated in the post too, if we are to have a reasonably good future alongside AGIs. With that said, this sort of more empirical red-teaming + safety-training oriented research has some benefits:
The way I see it, there are roughly 4 categories of research that can be done in AI Safety (though maybe this is rather Procrustean and I'm missing some):
In this categorization, it seems like DM's AGI Safety team is very much more focused on 1, 2, and 3. There's nothing wrong with any of these, but it would seem like 2 and 4 should be the bread and butter, right? Is there any sort of 4 work going on? Aren't companies like DM in a much better position to do this sort of work than the academic labs and other organizations that you find publishing this stuff? You guys have access to the surrounding systems (meaning you can gain a better understanding of attack vectors and side effects than someone who is just testing the input/output of a chatbot), have access to the model internals, have boatloads of compute (it would also be nice to know how things like LAT work on a full-scale model instead of just Llama3-8B), and are a common point of failure (most people are using models from OAI, Anthropic, DM, Meta). Maybe I'm conflating DM with other parts of Alphabet?
Anyways, I'm curious where things along the lines of 4 figure into your plan for AGI Safety. It would be criminal to try and make AI "safe" while ignoring all the real-world, challenging-but-tractable, information-rich challenges that arise from things such as red-teaming attacks that can happen today. Also curious to hear if you think this categorization is flawed in some key way.
How is this translational symmetry measure checking for the translational symmetry of the circuit? QK, for example, is being used as a bilinear form, so it's not clear to me what the "difference in the values" is mapping onto here (since I think these "numbers" actually correspond to unique embeddings). More broadly, do you have a good sense of how to interpret these bilinear forms? There is clearly a lot of structure in the standard weight basis in these pictures, and I'm not sure exactly what it means. I'm guessing the fact that some sections are rather empty corresponds to "the model learns to specialize certain heads on certain parts of the vocabulary", potentially associated with some sort of one-hot or generally standard-basis-privileged situation. Let me know if I'm misunderstanding something dumb. I haven't seen this being done much elsewhere, but it would be nice to have a GitHub repo, because it's really easy to resolve these questions by reading PyTorch code. Is it available somewhere?
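(For reference, this is the kind of check I would naively run for "translational symmetry" of a QK bilinear form: compute the score matrix over the relevant embeddings and ask how far it is from depending only on the index difference, i.e. how far from Toeplitz. Quite possibly not what you're actually measuring, which is partly why I'm asking.)

```python
import torch

def toeplitz_deviation(scores: torch.Tensor) -> float:
    """How far a score matrix s[i, j] is from depending only on (i - j):
    variance along each diagonal, pooled, relative to total variance."""
    n = scores.shape[0]
    pooled, count = 0.0, 0
    for offset in range(-(n - 1), n):
        d = torch.diagonal(scores, offset=offset)
        if d.numel() > 1:
            pooled += d.var().item() * d.numel()
            count += d.numel()
    return (pooled / max(count, 1)) / scores.var().item()

# e.g. scores = emb @ W_QK @ emb.T for whatever slice of the vocabulary/positions is relevant
```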
One other thing I'm curious about is results for more control experiments. For example, for the noise: if you fully noised the output (i.e. output a random permutation), we should expect the model to fail to learn anything at all and to fail to get a high LLC, right? It's also possible to noise by inserting new elements in the output (or input... I guess it's equivalent) to replace others, while keeping the order the same. In this case, maybe the network can still learn what the ordering is even if it doesn't know exactly which outputs will be there in the end, so even with very high amounts of noise a "structured" solution makes sense (though I reckon the way you propagate loss will matter in this case).
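(Concretely, and assuming the target is something like a sorted copy of the input, which I may be misremembering, the two noising schemes I mean are roughly:)

```python
import random

def noise_full_shuffle(target: list) -> list:
    """Control (a): replace the target with a random permutation of itself;
    there should be nothing left to learn, and presumably no high LLC."""
    out = target[:]
    random.shuffle(out)
    return out

def noise_substitute(target: list, vocab, p: float) -> list:
    """Control (b): swap out some elements for fresh ones but keep the ordering
    structure intact, so a "structured" solution is still learnable at high noise."""
    out = [random.choice(vocab) if random.random() < p else x for x in target]
    return sorted(out)
```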
Not sure exactly how to frame this question, and I know the article is a bit old. Mainly curious about the program synthesis idea.
On some level, any explanatory model for literally any phenomenon can, it would seem, be framed as a "program synthesis problem". For example, historically, we have wanted to synthesize a set of mathematical equations to describe/predict (model) the movement of stars in the sky, or rates of chemical reactions in terms of certain measurements (and so on). Even in non-mathematical cases, we have wanted to find context-specific languages (not necessarily formal, but with some elements of formality, such as constraints on what relations are allowed, etc...) that map onto things such as biology, psychology, etc...
I think it's fair to call these programs, since they are tools you use in a sort of causal way to say what will happen. Usually, you imagine certain objects that follow certain rules to do things, thereby changing the state of the world. They are things you could write as programs or instructions.
The art here is to be able to formalize a language that has the right parametrization to describe and predict the desired phenomena well, while being expressive enough to grow in a useful way, as we discover more.
But anyways, there are sort of two questions that naturally arise here:
This is really cool! Exciting to see that it's possible to explore the space of possible steering vectors without having to know what to look for a priori. I'm new to this field, so I had a few questions; I'm not sure if they've been answered elsewhere.
Thanks!
Why do you guys think this is happening? One possibility that comes to mind is that the model might be doing some amount of ensembling (thinking back to The Clock and The Pizza, where ensembling happened in a toy setting). W.r.t. "across all steering vectors", that's pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.
Also, what are y'all's intuitions on picking layers for this stuff? I understand that in the post you steer early layers because we suppose that they might be acting something like switches to different classes of functionality. However, implicit in the choice of layer 10 seems to be that you also don't want to go too early, because maybe the very early layers are still doing embedding-adjacent work and learning basic concepts like whether a word is a noun or whatever. Do you choose layers based on experience tinkering in Jupyter notebooks and the like, or have you run some sort of grid search to get a notion of what the effects elsewhere are? If the latter, it would be nice to know, to aid in hypothesis formation and the like.
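(What I mean by "run some sort of grid": something like the toy sweep below, where you inject the same vector at each layer in turn and see where it actually bites. With a real LM you'd hook the residual stream and score generations rather than comparing toy outputs; everything here is a stand-in.)

```python
import torch
import torch.nn as nn

d, n_layers = 32, 8
torch.manual_seed(0)
model = nn.Sequential(*[nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in range(n_layers)])
steering_vec = torch.randn(d)
x = torch.randn(4, d)

def effect_of_steering_at(layer_idx: int, scale: float = 4.0) -> float:
    def hook(module, inputs, output):
        return output + scale * steering_vec  # add the vector to this layer's output
    handle = model[layer_idx].register_forward_hook(hook)
    try:
        steered = model(x)
    finally:
        handle.remove()
    return (steered - model(x)).norm().item()  # crude proxy for downstream effect size

for i in range(n_layers):
    print(i, effect_of_steering_at(i))
```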
I agree that there are inductive biases towards sharing features and/or components. I'm not sure if there's a good study of which features would be of this sort vs. which others might actually benefit from being more separate[1], and I'm not sure how you would do it effectively for a truly broad set of features, nor if it would necessarily be that useful anyways, so I tend to just take this on vibes since it's pretty intuitive based on our own perception of e.g. shapes. That said, there are plenty of categories/tasks/features which I would expect are kinda separable after some point. Specifically, anything where humans have already applied some sort of division of labor, like software features vs. biology knowledge features vs. creative writing features, etc... (in the setting of natural language). Obviously, these all might share some basic grammatical or structural core features, but one layer of abstraction up it feels natural that they should be separable. All this goes to say that a good way to give gradient routing the best possible shot at success might be to try some such partitioning of features/tasks,[2] because unlike 3 and 8 we have some prior reason to believe that they should indeed be rather separate. Maybe there are other sources that can motivate which features or tasks to try to route separately with minimal loss of utility (e.g. what MoE papers report works well or not), but I haven't thought about it too much.
One downside here is that all these examples that come to mind are in language settings, and so to get reasonable utility to start with you would probably need to be in the 1B-7B model size range.
About the edges: have you tried all 3 combinations (route both, route one, route the other)? I think the fact that you limit to these edges is mentioned in the appendix Memory section. Surely, routing on activation edges is not actually prohibitive. Worst case, you can just assign blocks to each category and it'll basically just be an MoE. This really is just mathematically equivalent to MoE with a specific choice of architecture[3], right? One idea I had vaguely a while ago, though it seems rather complicated, is to do alternating dense training with MoE-fication (rough sketch below). In the dense training phases you train densely like usual. Then, you use some clever algorithm (think: interpretability methods on steroids) to decide which parts of the network will become which experts. Then, in the MoE-fication phase, you use the clever algorithm to basically define routes/prune edges (for the chosen partitioning). You go back and repeat. Each expert is somewhat analogous to its own network, so each new iteration you split further and further. The goal is to get as much splitting as possible for minimal utility cost. The resulting model might be smaller/cheaper at inference time and more interpretable. I'm not honestly sure how useful this might be, but I thought it was kind of cool :P
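(Toy version of the alternating scheme, with a dumb stand-in for the "clever algorithm": assign each hidden unit to whichever output it covaries with most, then hard-prune the cross edges. A real version would obviously keep the mask enforced during subsequent training rather than letting pruned weights grow back; names and setup are all mine.)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden, d_out = 8, 64, 2
model = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.randn(512, d_in)
Y = torch.stack([X[:, :4].sum(1), X[:, 4:].sum(1)], dim=1)  # two plausibly separable "tasks"

def train_dense(steps: int = 200):
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(X), Y).backward()
        opt.step()

def partition_hidden() -> torch.Tensor:
    # Stand-in "clever algorithm": assign each hidden unit to the output it covaries with most.
    h = torch.relu(model[0](X))
    cov = torch.stack([(h * Y[:, j:j + 1]).mean(0).abs() for j in range(d_out)], dim=1)
    return cov.argmax(dim=1)  # expert id per hidden unit

def prune_cross_edges(assignment: torch.Tensor):
    # MoE-fication step: output j only reads from hidden units assigned to expert j.
    with torch.no_grad():
        for j in range(d_out):
            model[2].weight[j, assignment != j] = 0.0

for _ in range(3):  # alternate: dense training, then split/prune, repeat
    train_dense()
    prune_cross_edges(partition_hidden())
```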
Really all I care about here is that these features don't lose too much from being separate. With that said, I guess some features may benefit from being separated at training time if the train set has some spurious correlations, which it probably does.
Unlearning virology while keeping cellular biology seems like the hardest possible task ngl.
You might be weight-sharing or smth.