I think just minimising the norm of the weights is worth a try. There's a picture of neural network computation under which this mostly matches their native ontology. It doesn't match their native ontology under my current picture, which is why I personally didn't try doing this. But the empirical results here seem maybe[1] better than I predicted they were going to be last February.
I'd also add that we just have way more compute and way better standard tools for high-dimensional nonlinear optimisation than we used to. It's somewhat plausible to me that some AI techniques people never got to work at all in the old days could now be made to kind of work a little bit with sufficient effort and sheer brute force, maybe enough to get something on the level of an AlphaGo or GPT-2. Which is all we'd really need to unlock the most crucial advances in interp at the moment.
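For concreteness, a minimal sketch of what "minimising the norm of the weights" could look like in practice, assuming a standard PyTorch training loop; the coefficient and choice of norm below are placeholder assumptions, not anything proposed above:

```python
import torch.nn as nn

def loss_with_weight_norm(model: nn.Module, task_loss, coeff: float = 1e-4, p: int = 1):
    # Illustrative only: penalize the sum of |w|^p over all weights alongside
    # the task loss. coeff and p are hypothetical knobs, not values from the comment.
    weight_norm = sum(w.abs().pow(p).sum() for w in model.parameters())
    return task_loss + coeff * weight_norm
```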
I haven't finished digesting the paper yet, so I'm not sure.
this feels like a subtweet of our recent paper on circuit sparsity. I would have preferred a direct response to our paper (or any other specific paper/post/person), rather than a dialogue against a hypothetical interlocutor.
I think this post is unfairly dismissive of the idea that we can guess aspects of the true ontology and iterate empirically towards it. it makes it sound like you have to guess a lot of things right about the true ontology before you can make any empirical progress at all. this is a reasonable view of the world, but I think the evidence so far rules out the strongest possible version of this claim.
SAEs are basically making the guess that the true ontology should activate kinda sparsely. this is clearly not enough to pin down the true ontology, and obviously at some point activation sparsity stops being beneficial and starts hurting. but SAE features seem closer to the true ontology than the neurons are, even if they are imperfect. this should be surprising if you think that you need to be really correct about the true ontology before you can make any progress! making the activations sparse is this kind of crude intervention, and you can imagine a world where SAEs don't find anything interesting at all because it's much easier to just find random sparse garbage, and so you need more constraints before you pin down something even vaguely reasonable. but we clearly don't live in that world.
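(for readers less familiar with the setup, here's a minimal sketch of the kind of SAE being described: an autoencoder over a model's activations with an L1 penalty pushing the hidden code toward sparsity. dimensions and the coefficient are placeholders, not values from any particular paper.)

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Minimal sketch: reconstruct activations through an overcomplete hidden
    # layer, with an L1 penalty encouraging the code to activate sparsely.
    def __init__(self, d_act: int = 768, d_hidden: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_act)

    def forward(self, acts, l1_coeff: float = 1e-3):
        code = torch.relu(self.encoder(acts))   # the "features"
        recon = self.decoder(code)
        loss = ((recon - acts) ** 2).mean() + l1_coeff * code.abs().mean()
        return code, recon, loss
```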
our circuit sparsity work adds an additional constraint: we also enforce that the interactions between features are sparse. (I think of the part where we accomplish this by training new models from scratch as an unfortunate side effect; it just happens to be the best way to enforce this constraint.) this is another kind of crude intervention, but our main finding is that it again gets us slightly closer to the true concepts; circuits that used to require a giant pile of SAE features connected in an ungodly way can now be expressed simply. this again seems to suggest that we have gotten closer to the true features.
if you believe in natural abstractions, then it should at least be worth trying to dig down this path and slowly add more constraints, seeing whether it makes the model nicer or less nice, and iterating.
this feels like a subtweet of our recent paper on circuit sparsity
It isn't. This post has been in my drafts for ages and I just got around to slapping a passable coat of paint on it and shipping it.
Insofar as you've missed reality's ontology, things will just look like a mess
Or your thing just won't work. There's a kind of trade-off there, I think?
DL works because it gives a lot of flexibility for defining internal ontologies, and for compute-efficiently traversing their space. However, it does so by giving up all guarantees that the result would be simple/neat/easy-to-understand in any given fixed external ontology (e. g., the human one).
To combat that, you can pick a feature that would provide some interpretability assistance, such as "sparsity" or "search over symbolic programs", and push in that direction. But how hard do you push? (How big is the penalty relative to other terms? Do you give your program-search process some freedom to learn neural-net modules for plugging into your symbolic programs?)
If you proceed with a light touch, you barely have any effect, and the result is essentially as messy as before.
If you turn the dial up very high, you strangle DL's flexibility, and so end up with crippled systems. (Useful levels of sparsity make training 100x-1000x less compute-efficient; forget symbolic program search.)
In theory, I do actually think you may be able to "play it by ear" well enough to hit upon some method where the system becomes usefully more interpretable without becoming utterly crippled. You can then study it, and perhaps learn something that would assist you in interpreting increasingly less-crippled systems. (This is why I'm still pretty interested in papers like these.)
But is there a proper way out? The catch is that your interventions only hurt performance if they hinder DL's attempts to find the true ontology. On the other hand, if you yourself discover and incentivize/hard-code (some feature of) the true ontology, that may actually serve as an algorithmic improvement.[1] It would constrain the search space in a helpful way, or steer the training in the right direction, or serve as a good initialization prior... Thus making the system both more interpretable and more capable.
Which is a boon in one way (will near-certainly be widely adopted; the "alignment tax" is negative), and a curse in another (beware the midpoint of that process, where you're boosting capabilities without getting quite enough insight into models to ensure safety).
(Alternatively, you can try to come up with some Clever Plan where you're setting up a search process that's as flexible as DL but which somehow has a guarantee of converging to be simple in terms of your fixed external ontology. I personally think such ideas are brilliant and people should throw tons of funding at them.)
May. There are some caveats there.
fwiw, I think the 100-1000x number is quite pessimistic, in that we didn't try very hard to make our implementation efficient; we were entirely focused on making it work at all. while I think it's unlikely our method will ever reach parity with frontier training methods, it doesn't seem crazy that we could reduce the gap a lot.
and I think having something 100x behind the frontier (i.e., one GPT's worth) is still super valuable for developing a theory of intelligence! like I claim it would be super valuable if aliens landed and gave us an interpretable GPT-4 or even GPT-3 without telling us how to make our own or scale it up.
Agreed. I think the most optimistic case is that peering at GPT-3/4's interpreted form would make it extremely obvious how to train much more powerful models much more compute-efficiently by way of explicitly hard-coding high-level parts of their structure, thus simultaneously making them much more controllable/interpretable. (E. g., clean factorization into a world-model, a planner, and a goal slot, with obvious ways to scale up just the world-model while placing whatever we want into the goal slot. Pretty sure literally-this is too much to hope for, especially at GPT≤4's level, but maybe something in that rough direction.)
fwiw, I'm pessimistic that you will actually be able to make big compute efficiency improvements even by fully understanding gpt-n. or at least, for an equivalent amount of effort, you could have improved compute efficiency vastly more by just doing normal capabilities research. my general belief is that the kind of understanding you want for improving compute efficiency is at a different level of abstraction than the kind of understanding you want for getting a deep understanding of generalization properties.
I have some empirical observations to lend here. I recently spent a few months optimizing a DNA language model for intrinsic interpretability.
There were, as I had hoped, many neurons corresponding neatly to interpretable concepts. This was enough for my purposes: I was trying to build a tool, not solve interpretability or alignment. Random sequences are riddled with functional promoters and other motifs, and we synthetic biologists didn't have anything like a universal debugger, nor a universal annotator for poorly studied species -- even a flawed tool would be a major step forward.
The best activation (by my arbitrary judgment, sifting endlessly through neurons) was a combination of continuous approximations to the activation functions in Deep L0 Encoders, further constrained to be nonnegative and unit norm. I created the activation through several months of trial and error and realized the connection after the fact. Note that no penalties were added to the loss, and it trained just fine.
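A minimal sketch of the kind of constraint involved (illustrative only, not the exact activation described above): a thresholding nonlinearity kept nonnegative and rescaled to unit norm. The threshold value is a placeholder.

```python
import torch

def thresholded_unit_norm(x: torch.Tensor, theta: float = 0.5, eps: float = 1e-8):
    # Sketch only, not the exact activation described above: zero out values
    # below a threshold (a crude continuous stand-in for hard thresholding),
    # keep the nonnegative part, then rescale each vector to unit norm.
    gated = torch.relu(x - theta)
    return gated / (gated.norm(dim=-1, keepdim=True) + eps)
```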
While it was often easy to interpret many neurons post-hoc, I could never have guessed beforehand what the (superficially apparent) ontology would be. For instance, CRP and FNR are two 22-base-pair palindromic motifs; I had hoped to find a “CRP neuron” and an “FNR neuron,” but instead found a group of neurons each active at one position in these palindromes. AI-for-bio people love to use linear probes to establish the “presence of a concept” in their models; I now feel that this is bogus. The model modeled CRP fine, it just had no use for a single direction over the whole motif.
However, the most helpful tool was visualizing the pairwise similarities between the activations (i.e., their Gram matrix). The activations’ degree of similarity often primarily reflected their offset, unless the “feature” being represented was periodic in nature, like a beta-barrel. I don’t think that my more-interpretable activations, nor SAEs, nor any obvious-to-me kind of weight or activation sparsity technique, could have made this pattern much clearer with ~any degree of effort. (At least, I have no clue how I would have counterfactually spotted it).
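A rough sketch of this kind of Gram-matrix visualization, assuming per-position activations of shape (positions, channels) for one sequence; the cosine normalization here is an assumption, not necessarily what was used:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_activation_gram(acts: np.ndarray):
    # Sketch: pairwise (cosine) similarities between the activation vectors at
    # each position of a single sequence. Offset-dependent similarity shows up
    # as banding parallel to the diagonal; periodic features show up as stripes.
    acts = acts / (np.linalg.norm(acts, axis=1, keepdims=True) + 1e-8)
    gram = acts @ acts.T
    plt.imshow(gram, cmap="viridis")
    plt.xlabel("position")
    plt.ylabel("position")
    plt.colorbar(label="similarity")
    plt.show()
```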
I'd call this an empirical win for the thesis that unless you have a way to get some level of insight into how the activations are structured without presuming that structure, your method ain't gonna have feedback loops.
(Interestingly, the images produced on a given protein by the Gram lens for my small convolutional bacterial DNA model were obviously visually similar to those from a much more heavily trained all-of-life protein Transformer, including the offset-dependent similarity.)
There is certainly still structure I can't see. The final iteration of the model is reverse-complementation-equivariant by design. RC-equivariant models trained far more quickly than unconstrained ones, but whereas unconstrained models learned many invariant features, equivariant ones never appeared to. The presence of a partial RC-equivariance, learned in an unconstrained model, would not be made clearer by sparse activations or by the Gram matrices (the paired directions are orthogonal). I'm unsure what kind of tool would reveal this kind of equivariance, if you weren’t already looking for it.
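A sketch of the basic reverse-complement check involved, under illustrative assumptions (one-hot DNA in A/C/G/T order, a model returning per-position features; not the code used above):

```python
import numpy as np

def reverse_complement(x: np.ndarray) -> np.ndarray:
    # One-hot DNA of shape (length, 4) in A, C, G, T order: reversing both
    # axes reverses the sequence and swaps A<->T, C<->G.
    return x[::-1, ::-1]

def rc_invariance_gap(model_fn, x: np.ndarray) -> float:
    # Illustrative check: an RC-*invariant* per-position feature should give
    # the same (position-reversed) activations on the reverse complement.
    # RC-*equivariant* features may instead come in orthogonal pairs that swap,
    # which this simple check would not reveal.
    out = np.asarray(model_fn(x))
    out_rc = np.asarray(model_fn(reverse_complement(x)))
    return float(np.abs(out_rc[::-1] - out).max())
```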
Is that the right framing? In principle the training data represents quite a lot of contact with reality, if that's where you sampled it from. It almost sounds like you're saying current ML functionally makes you specify an ontology (and/or imply one through your choices of architecture and loss), and we don't know how to not do that. But something conceptually in the direction of sparsity or parsimony (~simplest suitable ontology without extraneous parts) is still presumably what we're reaching for; it's just that that's much easier said than done?
Alternately, is there something broader you're pointing at where we shouldn't be trying to directly learn/train the right ontology, we should rather be trying to supply that after learning it ourselves?
The use of something like L1 regularization to achieve sparsity for inherent interpretability may just make things worse; a fixation on L1 regularization may lead people in the wrong direction. To avoid fixation, we should take a step back and look at the big picture. Occam's razor suggests that we should look for simple (and creative) solutions instead of over-engineering solutions when the entire foundation is inadequate.
In order to obtain inherent interpretability, the machine learning model needs to behave in a way that is interesting to mathematicians. By piling on tweaks such as a lot of L1 or L0 regularization for sparsity, one is making the machine learning model more complicated. That makes it more difficult to study mathematically. And neural networks are inherently difficult to study mathematically and to interpret, so they should be replaced with something else. The problem is that neural networks already have so much momentum that people are unwilling to try anything else, and people are too indoctrinated into neural networkology to learn new things.
So how does one get momentum with a non-neural machine learning algorithm? One starts with shallow but mathematical machine learning algorithms, and one can also work with algorithms that have only a few layers. These shallow/few-layer mathematical algorithms can still be effective for some problems since they have plenty of width. One may also construct a hybrid model where the first few layers are the mathematical construction but the rest of the network is a deep neural network. I do not see how to make a very deep network this way, so the next steps are obscure to me.
What if interpretability is a spectrum?
Instead of one massive model, imagine that at inference time you generate a small, query-specific model by carving out a small subset of the big model.
Interp on this smaller model for a given query has to be better than on a model that's orders of magnitude bigger.
Inference gets substantially faster if you can cluster/batch similar queries together, so you can actually get financial support if this works at the frontier.
The core insight? Gradients smear knowledge everywhere. Tag training data with an ontology learned from the query stream of a frontier model. We can automate the work of millions of librarians to classify our training data. Only let the base model + tagged components be active for a given training example. SGD can't smear the gradient across inactive components. Knowledge will be roughly partitioned. When you do a forward pass to generate some Python code, you don't pay for dot products that know that Jackson is an entirely ignorable town in New Jersey.
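A toy sketch of the gating idea, under illustrative assumptions (a shared base plus a bank of tag-indexed components; names and shapes are made up):

```python
import torch
import torch.nn as nn

class TagRoutedModel(nn.Module):
    # Toy sketch: a shared base plus a bank of tag-indexed components. Only the
    # components whose tags match the current example run, so SGD cannot smear
    # gradient into the inactive ones and knowledge stays roughly partitioned.
    def __init__(self, d_model: int = 512, n_components: int = 64):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        self.components = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_components)]
        )

    def forward(self, x, active_tags):
        # active_tags: indices of the components tagged for this example
        out = self.base(x)
        for i in active_tags:
            out = out + self.components[i](x)
        return out
```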
Simplicio: Hey I’ve got an alignment research idea to run by you.
Me: … guess we’re doing this again.
Simplicio: Interpretability work on trained nets is hard, right? So instead of that, what if we pick an architecture and/or training objective to produce interpretable nets right from the get-go?
Me: If we had the textbook of the future on hand, then maybe. But in practice, you’re planning to use some particular architecture and/or objective which will not work.
Simplicio: That sounds like an empirical question! We can’t know whether it works until we try it. And I haven’t thought of any reason it would fail.
Me: Ok, let’s get concrete here. What architecture and/or objective did you have in mind?
Simplicio: Decision trees! They’re highly interpretable, and my decision theory textbook says they’re fully general in principle. So let’s just make a net tree-shaped, and train that! Or, if that’s not quite general enough, we train a bunch of tree-shaped nets as “experts” and then mix them somehow.
Me: Turns out we’ve tried that one! It’s called a random forest, it was all the rage back in the 2000’s.
Simplicio: So we just go back to that?
Me: Alas, they were both a mess and performed very unimpressively by today’s standards. I mean, if you dug around in there you could often find some interpretable structure, but the interpretability problem for random forests was qualitatively similar to the interpretability problem for today’s nets; you couldn’t just glance at them and read off what was going on or what the internal pieces represented.
Simplicio: Ok, so what if we tweak the idea to -
Me: Still not going to work.
Simplicio: You don’t even know what I was going to say!
Me: Suppose someone attempts to draw a map of New York City, and gets it completely wrong. Then they say “What if we tweak it?” and tweak it without actually looking at New York City or any actual map of New York City. What actually happens, if someone does that, is that the map will still be completely wrong.
Simplicio: I take it this is supposed to be an analogy for something I’m doing?
Me: Yes. The fundamental problem is that the real world has a preferred ontology, and you are trying to guess that ontology (or worse, imagining you can choose it), without doing the work to discover it. The real world (or a net trained on it) does not cleanly factor as a decision tree, it does not cleanly factor as a mixture of decision trees, and it does not cleanly factor via whatever thing you were about to guess next. The real world (or a net trained on it) does cleanly factor to a remarkably large extent - though that’s a separate discussion. But it takes work to discover that factorization, it requires lots of looking at the world and/or looking at things trained against the world. Trying to guess the ontology is doomed in much the same way that drawing a map of New York City by guessing is doomed.
Simplicio: <looks disappointed>
Me: Was this your one big idea, and you’re now seeing that it won’t work and feeling crushed and also desperately groping around for some way to save it? But also you’re noticing that that’s what you’re doing and feeling bad about that too?
Simplicio: … kinda, yeah.
Me: You’ll have more ideas. This is gonna suck right now, but in the long run, you can have more than one idea and build confidence that you can find more, and then it will be easier to discard ideas which don’t work.
Sophisticus: I have a less-straw version of Simplicio’s idea to run by you.
Sparsity! <jazz hands>
Me: Please be more specific.
Sophisticus: Well, the obvious thing to do is to add a term to the loss function with an L1 norm of activations...
Me: ...which presumably multiple people have already tried in some way, and then they wrote papers arguing that it worked kinda ok if you squinted a bit, but it was not very compelling...
Sophisticus: ... right, so instead we run a clustering algorithm on activations and then use an L1 norm on cluster membership probabilities, and then combine that with a second term on the weights to cancel out the -
Me: Ok, slow down. Remember when I told Simplicio that the fundamental problem was trying to guess the ontology rather than discover it? That he was metaphorically trying to draw a map of New York City without going out and looking at the city or any actual map of it?
So what is the evidence you've seen about the real world, what have you seen in the territory or existing maps of the territory, which would tell us that the specifics of your proposal are indeed accurate parts of the map? For instance, what fact have you observed about the world, which tells us that a second term on the weights to cancel out the whatever is an inherent part of the natural ontology?
Sophisticus: Hmm. Well, looking around, there sure is a lot of informational sparsity in the world.
Me: I agree there. But your proposal involves all these additional details. Sparsity does not imply an L1 norm, and while an L1 norm is a convenient way to train something sparse-ish, that doesn't mean it will give the right sparse thing. Whatever specific clustering algorithm you want to lay on top is also not obviously the right thing. Nor is it clear that internal activations in this particular architecture are the right thing to make sparse in the first place. And why specifically the second term with a norm on the weights?
When drawing a map of a city, every single little detail, every road and label, has to be nailed down by looking at the world. We can't just start with a vague big picture and then fill in the details any which way. Likewise when learning an ontology: every single little detail of the operationalization needs to be nailed down by looking at the territory. Not just in a vague "seems kind of intuitive" way, but in a "we have evidence nailing down this specific form, i.e. ruling out every alternative" kind of way.
Sophisticus: I think you are too pessimistic here! Even if my guesses aren't exactly right, they may be close enough. You yourself argue that natural ontologies are convergent, so they should be at least somewhat robust to the details of the learning architecture, right? And as long as the guesses are close enough, we can iterate to improve it.
Me: There's some nuggets of good ideas in there, but as a practical matter you're going to shoot yourself in the foot. If you had feedback loops good enough to get that close in the first place, and to successfully iterate at all, then you would probably already be starting with a much clearer picture than you in fact have.
Picture this concretely: you build your thing, it turns out to not match reality's ontology very well. What feedback do you then get, and what do you do with it? Insofar as you've missed reality's ontology, things will just look like a mess. You're not going to have a firehose of bits on what the right ontology looks like, just the single bit that you didn't get it. You can try to slap patches on top to make things look marginally better, but you're working in way too high-dimensional a space here and you'll just be left with more mess.
If you want to make inherently interpretable architectures which actually work, guess and check isn't going to get you there. Having a little bit of evidence about the big-picture idea but then guess-and-checking the many, many details also isn't going to get you there, or even get you meaningfully closer. You need to find a way to get a lot more bits of evidence about all the details of natural ontology.