Summary by OpenAI: We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.

Link: https://openai.com/research/language-models-can-explain-neurons-in-language-models

Please share your thoughts in the comments!

It is extremely reassuring that this is the sort of project that OpenAI chooses to engage in. I don't particularly expect much of this approach at the object-level (though it's obviously the sort of thing that's worth trying), but it's a nontrivial large-scale undertaking aimed at alignment, and that OpenAI has bothered with it gives a lot of credence to their claims that they care about alignment at all.

An obvious way to extrapolate this is to:

  1. Pick some way of factorizing the problem of interpreting a neural network. E. g., go neuron-by-neuron as was done here, or use the path expansion trick.
  2. Analyze a few factors (neurons/paths/etc.) by hand. Meticulously record what you're doing.
  3. Give GPT-4 a bunch of data-analysis plug-ins, an actual interface to GPT-2's weights, and a description of the procedure you used in (2).
  4. Use GPT-4 to analyse every factor the same way you analysed the handful of factors in (2).
    1. Which may involve empirical testing of the interpretations, see e. g. causal scrubbing, ARC's work on explanations, or the simulated vs. actual activations comparison OpenAI did here.
  5. Pore over the resulting data.
    1. If there's too much, figure out a way to factorize the process of interpreting it, then GOTO 2.
    2. If everything worked as intended, you should have on your hands the interpreted version of the computational graph the network is implementing.
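A minimal sketch of the loop in steps 3-5, assuming the empirical test is a correlation between simulated and actual activations (in the spirit of the comparison OpenAI did). The callables `explain`, `simulate`, and `record` are hypothetical stand-ins for the GPT-4 calls and the GPT-2 interface; none of them is a real API.

```python
import math


def correlation_score(simulated, actual):
    """Pearson correlation between simulated and actual activations."""
    if len(simulated) != len(actual) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_s = sum(simulated) / len(simulated)
    mean_a = sum(actual) / len(actual)
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(simulated, actual))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_s == 0 or var_a == 0:
        return 0.0  # a constant sequence carries no signal to correlate
    return cov / math.sqrt(var_s * var_a)


def analyse_all(factors, explain, simulate, record):
    """Apply the same explain -> simulate -> score procedure to every factor.

    explain(factor)       -> explanation text   (hypothetical explainer call)
    simulate(explanation) -> list of floats     (hypothetical simulator call)
    record(factor)        -> list of floats     (hypothetical real activations)
    """
    results = {}
    for factor in factors:
        explanation = explain(factor)
        score = correlation_score(simulate(explanation), record(factor))
        results[factor] = (explanation, score)
    return results
```

The human in the loop would then pore over `results`, e.g. sorting by score to find which factors the procedure explains badly.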

The core idea is that this allows us to dramatically upscale interpretability efforts. I'm pretty sure you'd still need a smart human in the loop to direct the whole process, if the process is to result in anything useful, but now they won't need to waste their time doing low-level analysis (and the time of top-tier researchers seems like the main bottleneck on a lot of these projects).

You can in theory achieve the same result by hiring a horde of undergrads or something (pretty sure I'd suggested that before, even), but GPT-4 might plausibly be better at this sort of instruction-following, and also faster.

I'm still skeptical of this whole approach — e. g., it may be that it fails at step 2, because...

  • ... it turns out that analysing any individual factor is unlike analysing any other factor, such that you can't actually define a general procedure for analysis and hand it off to GPT-4.
  • ... it turns out that if you've figured out how to analyse any factor (and there's e. g. 1-10 analysis procedures that suffice for the entire NN), you gain no additional useful information from proceeding to analyse every factor.
  • ... or something like that.

In those cases, GPT-4-carried analysis would just result in a cute mountain of data that's of no particular value.

Nevertheless, it's IMO definitely something that's worth trying, just in case it does turn out to be that easy.

This might be a good time for me to ask a basic question on mechanistic interpretability:

Why does targeting single neurons work? Does it work? If there is a one-dimensional quantity to measure, why would it align with the standard basis? Why wouldn't it be aligned with a random one-dimensional linear subspace? In that case, examining single neurons is likely to give you some weighted combination of concepts rather than a single interpretation...

Those are good questions! There's some existing research that addresses some of your questions.

Single neurons often do represent multiple concepts: https://transformer-circuits.pub/2022/toy_model/index.html

It seems to still be unclear why the dimensions are aligned with the standard basis: https://transformer-circuits.pub/2023/privileged-basis/index.html

It's not a full answer, but: To the degree that it is true that the quantities align with the standard basis, it must somehow be a result of asymmetry in the activation function. For example, ReLU trivially depends on the choice of basis.
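To make the basis dependence concrete, here is a small sketch (a 2D toy, not anything from the linked papers): an elementwise ReLU does not commute with a rotation of the basis, whereas a purely linear map such as doubling does. The vector and angle are arbitrary choices for illustration.

```python
import math


def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def double(v):
    """A purely linear elementwise map, for contrast."""
    return [2.0 * x for x in v]


def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = v
    return [c * x - s * y, s * x + c * y]


v = [1.0, -1.0]
angle = math.pi / 4

# ReLU: the result depends on which basis the nonlinearity is applied in.
relu_then_rotate = rotate(relu(v), angle)
rotate_then_relu = relu(rotate(v, angle))

# Linear map: the order does not matter (up to floating-point rounding).
double_then_rotate = rotate(double(v), angle)
rotate_then_double = double(rotate(v, angle))
```

Here the two ReLU results differ substantially, while the two doubling results agree, which is the sense in which the nonlinearity singles out the neuron basis.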

If you focus on the ReLU example, it sort of makes sense: if multiple unrelated concepts are expressed in the same neuron, and one of them pushes the neuron in the negative direction, the ReLU may destroy information about the other concepts.
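A toy illustration of that information loss, under made-up assumptions: one neuron reads concept `a` with weight +1 and an unrelated concept `b` with weight -1. The weights and values are hypothetical and chosen only to show the clamping effect.

```python
def neuron(a, b):
    """One ReLU neuron mixing two unrelated concepts (hypothetical weights)."""
    pre_activation = 1.0 * a - 1.0 * b
    return max(0.0, pre_activation)


# With b absent, the neuron distinguishes a = 0 from a = 1.
distinguishable = neuron(0.0, 0.0) != neuron(1.0, 0.0)

# With b strongly present, both cases clamp to zero: the value of a is lost.
collapsed = neuron(0.0, 2.0) == neuron(1.0, 2.0) == 0.0
```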

Some takes I have come across from AI Safety researchers in Academia (Note that both are generally in favor of this work):

Stephen Casper

Erik Jenner

I only want to point out that right now, the approach basically doesn't work.

I would really love to see this combined with a neuroscope so you can play around with the neurons easily and test your hypotheses on what it means!

I also find it pretty fun to try to figure out what a neuron is activating for, and it seems plausible that this is something that could be gamified and crowdsourced (a la FoldIt) to great effect, even without the use of GPT-4 to generate explanations (it would still be used to validate submitted answers). This probably wouldn't scale to a GPT-3+ sized network, but it might still be helpful at e.g. surfacing interesting neurons, or training an AI to interpret neurons more effectively.

This seems like a pretty promising approach to interpretability, and I think GPT-6 will probably be able to analyze all the neurons in itself with >0.5 scores. Which seems to be recursive self-improvement territory. It would be nice if by the time we got there, we already mostly knew how GPT-2, 3, 4, and 5 worked. Knowing how previous generation LLMs work is likely to be integral to aligning a next generation LLM and it's pretty clear that we're not going to be stopping development, so having some idea of what we're doing is better than none. Even if an AI moratorium is put in place, it would make sense for us to use GPT-4 to automate some of the neuron research going on right now. What we can hope for is that we do as much work as possible with GPT-4 before we jump to GPT-5 and beyond.

GPT-6 will probably be able to analyze all the neurons in itself with >0.5 scores

This seems to assume the task (writing explanations for all neurons with >0.5 scores) is possible at all, which is doubtful. Superposition and polysemanticity are certainly things that actually happen.

You need a larger model to interpret your model, and you want to make a model understand a model. Does not look safe!

You need a larger model to interpret your model

Inasmuch as this shtick works at all, that doesn't seem necessarily true to me? You need a model above some threshold of capability at which it can provide useful interpretations, yes, but I don't see any obvious reason why that threshold would move up with the size of the model under interpretation. The number of neurons/circuits to be interpreted will increase, but the complexity of any single interpretation? At the very least, that's a non-trivial claim in need of support.

you want to make a model understand a model

I don't think that's particularly risky at all. A model that wasn't dangerous before you fed it data about some other model (or, indeed, about itself) isn't going to become dangerous after it understands. In turn, a model that is dangerous after you let it do science, has been dangerous from the get-go.

We probably shouldn't have trained GPT-4 to begin with; but given that we have, and didn't die, the least we can do is fully utilize the resultant tool.

This feels reminiscent of:

If the human brain were so simple that we could understand it, we would be so simple that we couldn’t.

And while it's a well-constructed pithy quote, I don't think it's true. Can a system understand itself? Can a quining computer program exist? Where is the line between being able to recite itself and understand itself?
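On the quine question specifically, the answer is yes. A minimal Python sketch: `PROGRAM` below is a two-line program that prints exactly its own source, and `run_and_capture` is a small helper (introduced here just for the demonstration) that executes it and collects what it prints.

```python
import contextlib
import io

# Template for the classic two-line Python quine; %r inserts the template's
# own repr, and %% becomes a literal % after formatting.
TEMPLATE = 's = %r\nprint(s %% s)'
PROGRAM = TEMPLATE % TEMPLATE


def run_and_capture(source):
    """Execute `source` in a fresh namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


# print() appends one trailing newline to the reproduced source.
reproduces_itself = run_and_capture(PROGRAM) == PROGRAM + "\n"
```

Of course, this only shows that a program can recite itself; it says nothing about the harder question of where reciting ends and understanding begins.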

You need a model above some threshold of capability at which it can provide useful interpretations, yes, but I don't see any obvious reason why that threshold would move up with the size of the model under interpretation.

Agreed. A quine needs some minimum complexity and/or language / environment support, but once you have one it's usually easy to expand it. Things could go either way, and the question is an interesting one needing investigation, not bare assertion.

And the answer might depend fairly strongly on whether you take steps to make the model interpretable or a spaghetti-code turing-tar-pit mess.

At the very least, that's a non-trivial claim in need of support.

From my point of view, I could say the opposite is rather a "non-trivial claim in need of support". My (not particularly motivated) intuition is that a larger, smarter mind employs more sophisticated cognitive algorithms, and so analyzing its workings requires proportionally more intelligence.

Example: In my experience, if I argue with someone less intelligent and less used to debate than me, it is likely that they'll perceive what I say in pieces instead of looking at the whole reasoning tree, and it is very difficult to get them to understand the "big picture". For example, if I say "A then B", they might understand "A and B", or "A or B", or "A", or "B". In the domain of argument, they're not able to understand how I put all the pieces together by looking at the pieces in isolation, and it is difficult for them to even contemplate the rules I use.

What are the intuitions that you use to feel the default case is the other way around?

I don't think that's particularly risky at all. A model that wasn't dangerous before you fed it data about some other model (or, indeed, about itself) isn't going to become dangerous after it understands. In turn, a model that is dangerous after you let it do science, has been dangerous from the get-go.

We probably shouldn't have trained GPT-4 to begin with; but given that we have, and didn't die, the least we can do is fully utilize the resultant tool.

Ok, I think I lacked clarity. I did not mean that doing this particular research bit was not safe. I meant that the kind of paradigm that I see here, as I extrapolate it, is not safe.

What are the intuitions that you use to feel the default case is the other way around?

You can always factorize the problem into smaller pieces. If the interlocutor doesn't understand "A then B" but can understand "A", "B", "or", and "not" individually, you can introduce them to "not(A)", let them get used to it until they can think of not(A) as a simple assertion C, then introduce them to or(C;B) (which implements the implication "if A then B"). It can be exhausting, but it works.
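The factorization can be checked mechanically. A small sketch that builds "if A then B" out of `not` and `or` exactly as described, and compares it against a direct definition of implication over the full truth table (the function names are just illustrative):

```python
from itertools import product


def implies(a, b):
    """Direct definition: 'if A then B' is false only when A holds and B does not."""
    return b if a else True


def factorized(a, b):
    """Same thing built from pieces: C = not(A), then or(C, B)."""
    c = not a
    return c or b


# One row per assignment: (A, B, direct result, factorized result).
truth_table = [
    (a, b, implies(a, b), factorized(a, b))
    for a, b in product([False, True], repeat=2)
]
equivalent = all(direct == built for _, _, direct, built in truth_table)
```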

And in the case of larger AI models, seems like this sort of factorization would be automatic. Their sophistication grows with the number of parameters — which means the complexity of interactions within individual fixed-size groups of parameters can be constant, or even decrease with the model's size.

Sure, the functions that e. g. parameters at late layers implement may be more complex in an absolute sense; but not more complex relative to lower-layer functions.

Toy example: if every neuron at the nth layer implements an elementary operation over two lower-layer neurons, the function at the 32nd layer would be "more complex" than any function at the 6th layer, when considered from scratch. But it is not more complex if, by the time you get to the nth layer, you already understand everything at every preceding layer.
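To put rough numbers on the toy example, here is a sketch that counts elementary operations in a neuron's function. It assumes the worst case where no sub-expression is shared, so the fully expanded function is a binary tree; both function names are made up for this illustration.

```python
def expanded_ops(layer):
    """Elementary operations in a layer-`layer` neuron's function when it is
    written out from scratch in terms of the raw inputs (layer 0)."""
    if layer == 0:
        return 0  # a raw input involves no operation
    # One new operation on top of two fully expanded lower-layer functions.
    return 2 * expanded_ops(layer - 1) + 1


def incremental_ops(layer):
    """Operations left to understand once all lower layers are understood."""
    return 1 if layer >= 1 else 0
```

From scratch, `expanded_ops(6)` is 63 while `expanded_ops(32)` is 2**32 - 1, but `incremental_ops` is 1 at both layers.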

“Please share your thoughts in the the comments!”

This seems pretty rad. Also, it’s fun to randomly inspect the neurons. This seems like a giant bucket of win.