The goal of ambitious mechanistic interpretability (AMI) is to fully understand how neural networks work. While some have pivoted towards more pragmatic approaches, I think the reports of AMI’s death have been greatly exaggerated. The field of AMI has made plenty of progress towards finding increasingly simple and rigorously faithful circuits, including our latest work on circuit sparsity. There are also many exciting inroads on the core problem waiting to be explored.
Why try to understand things, if we can get more immediate value from less ambitious approaches? In my opinion, there are two main reasons.
First, mechanistic understanding can make it much easier to figure out what’s actually going on, especially when it’s hard to distinguish hypotheses using external behavior alone (e.g. if the model is scheming).
We can liken this to going from print statement debugging to using an actual debugger. Print statement debugging often requires many experiments, because each one yields only a few bits of information, which sketch a strange, confusing, and potentially misleading picture. When you switch to the debugger, you suddenly notice all the incorrect assumptions you didn’t even realize you were making.
Second, since AGI will likely look very different from current models, we’d prefer to gain knowledge that generalizes beyond them. This is one of the core difficulties of alignment that every alignment research agenda has to contend with.
The more you understand why your alignment approach works, the more likely it is to keep working in the future, or at least to warn you before it fails. If you’re just whacking your model on the head, and it seems to work but you don’t really know why, then you have no idea when it might suddenly stop working. If you’ve ever tried to fix broken software by toggling vaguely relevant-sounding config options until it works again, you know how brittle this kind of fix can be. On the other hand, if you have a deep understanding of exactly what’s going on inside the model, you have a better sense of whether things are working for the right reasons, and of which kinds of changes might break your alignment approach. And we already know that many pragmatic approaches will definitely break at some point; we’re just hoping that they’ll last long enough.
While open-ended exploration is immensely valuable and disproportionately effective at producing breakthroughs, having good empirical feedback loops is also essential for the health of a field. If it’s difficult to measure how good your work is, it’s also difficult to make progress.
Thankfully, I think AMI has surprisingly good empirical feedback loops. We don’t have a watertight definition of “full understanding”, but we can measure progress in ways that we aren’t close to saturating. This isn’t so different from the state of metrics in capabilities, which has undeniably made very rapid progress, despite AGI being hard to define.
For example, I’m excited about progress on metrics based broadly on these core ideas:
None of these criteria are watertight, but that isn’t the bottleneck for AMI right now. We haven’t broken all the interp metrics we can think up; instead, the strongest metrics we can come up with give a score of precisely zero for all extant interp techniques, so we have to use weaker metrics to make progress. Progress therefore looks like pushing the frontier towards stronger and stronger interpretability metrics. This creates an incentive problem: nobody wants to write papers whose results look merely as impressive as previous circuits, even when they clear a much higher bar of rigor. The solution here is social: as a field, we should value rigor in AMI papers to a much greater extent, rather than giving up entirely on rigorous ambitious interpretability.
Of course, as we make progress on these metrics, we will discover flaws in them. We will need to create new metrics, as well as stronger variants of existing metrics. But this is par for the course for a healthy research field.
We’ve made a lot of progress on ambitious interpretability over the past few years, and we’re poised to make a lot more progress in the next few years. Just a few years ago, in 2022, the IOI paper found a very complex circuit for a simple model behavior. This circuit contains over two dozen entire attention heads, each consisting of 64 attention channels, and doesn’t even attempt to explain MLPs.
Today, our circuit sparsity approach finds circuits that are orders of magnitude simpler than the IOI circuit: we can explain behaviors of roughly similar complexity with only half a dozen attention channels and a similar number of neurons.[1] We also use a slightly stronger notion of circuit faithfulness: we show that we can ablate all nodes outside our circuit using mean ablations computed over the entire pretraining distribution rather than the task distribution. The activations involved are often extremely clean and interpretable, and the resulting circuits are often simple enough to fully understand with a day’s work.
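To make this notion of faithfulness concrete, here is a minimal sketch of a mean-ablation check in plain PyTorch. It treats each named submodule’s output as a “node” for simplicity (the real work operates on individual attention channels and neurons), and every name here (`model`, `circuit`, `pretrain_batches`, `task_batch`) is a hypothetical placeholder rather than our actual pipeline. The idea is just to record each node’s mean activation over the pretraining distribution, overwrite every node outside the circuit with that mean, and check that the model’s behavior on the task is preserved.

```python
# Sketch of a mean-ablation faithfulness check. All names are hypothetical.
import torch
import torch.nn as nn

@torch.no_grad()
def compute_node_means(model: nn.Module, node_names: list[str], pretrain_batches):
    """Average each named submodule's output over the pretraining distribution."""
    sums = {name: None for name in node_names}
    counts = {name: 0 for name in node_names}
    modules = dict(model.named_modules())

    def make_hook(name):
        def hook(_module, _inputs, output):
            flat = output.detach().reshape(-1, output.shape[-1])  # merge batch/seq dims
            sums[name] = flat.sum(0) if sums[name] is None else sums[name] + flat.sum(0)
            counts[name] += flat.shape[0]
        return hook

    handles = [modules[n].register_forward_hook(make_hook(n)) for n in node_names]
    for batch in pretrain_batches:
        model(batch)
    for handle in handles:
        handle.remove()
    return {name: sums[name] / counts[name] for name in node_names}

@torch.no_grad()
def run_with_mean_ablation(model: nn.Module, batch, circuit: set[str], means: dict):
    """Run the model with every node outside `circuit` overwritten by its pretraining mean."""
    modules = dict(model.named_modules())
    handles = [
        modules[name].register_forward_hook(
            lambda _m, _i, out, mean=mean: mean.expand_as(out)  # replace the node's output
        )
        for name, mean in means.items() if name not in circuit
    ]
    try:
        return model(batch)
    finally:
        for handle in handles:
            handle.remove()

# Faithfulness check: the ablated model should still match the full model on the task.
# full_logits = model(task_batch)
# ablated_logits = run_with_mean_ablation(model, task_batch, circuit, means)
```

Taking the means from the full pretraining distribution, rather than the narrow task distribution, is what makes this a stronger test: the circuit has to stand on its own against much more out-of-task interference.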
There are a lot of exciting future directions to take AMI. First, there are numerous tractable directions to build on our recent circuit sparsity work. Mean ablation on the pretraining distribution is weaker than full causal scrubbing, and randomly selected neurons or connections from the entire model are generally not nearly as interpretable as the ones in the specific circuits we isolate. If we design actually good interpretability metrics, and then hillclimb them, we could get to an interpretable GPT-1. It’s also plausible that circuit sparsity can be applied to understand only a small slice of an existing model; for example, by using bridges to tie the representations of the sparse model to the existing model on a very specific subdistribution, or by sparsifying only part of a network and using gradient routing to localize behavior.
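As a toy illustration of the “sparsifying only part of a network” idea, here is a sketch that uses a plain L1 penalty as a stand-in for whatever sparsity mechanism the real approach would use. The module prefix `blocks.7.` and the helper name are hypothetical, and this is not the method from our circuit sparsity work; it only shows the shape of the idea.

```python
# Toy sketch (not the actual circuit sparsity method): encourage weight sparsity
# in a single sub-block of an existing model, leaving everything else frozen.
import torch
import torch.nn as nn

def loss_with_partial_sparsity(model: nn.Module, task_loss: torch.Tensor,
                               target_prefix: str = "blocks.7.",  # hypothetical sub-block
                               l1_coeff: float = 1e-4) -> torch.Tensor:
    """Add an L1 penalty on the weights of the targeted sub-block only."""
    l1 = sum(param.abs().sum()
             for name, param in model.named_parameters()
             if name.startswith(target_prefix))
    return task_loss + l1_coeff * l1

# Hypothetical usage: freeze the rest of the model, then fine-tune the target block
# on the subdistribution of interest so a sparse set of weights carries the behavior.
# for name, param in model.named_parameters():
#     param.requires_grad_(name.startswith("blocks.7."))
# loss = loss_with_partial_sparsity(model, task_loss)
# loss.backward()
```

The real approach would presumably need something sharper than a single global L1 coefficient, but the basic move is the same: localize and sparsify only the slice of the model you want to understand.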
Outside of circuit sparsity, there are a ton of other exciting directions as well. The circuit tracing agenda from Anthropic is another approach to AMI that trades off some amount of rigor for scalability to frontier models. Additionally, approaches similar to Jacobian SAEs seem like they could enforce the circuit sparsity constraint without needing to train new models from scratch, and methods like SPD/APD provide a promising alternative for sparsifying interactions between weights and activations. Going further away from the circuit paradigm: SLT could offer a learning-theoretic explanation of model generalization, and computational mechanics could explain the geometry of belief representations.
As we get better at rigorously understanding small models, we might find recurring motifs inside models, and refine our algorithms to be more efficient by taking advantage of this structure. If we had an interpretable GPT-1, studying it would teach us a lot about how to create an interpretable GPT-2. Therefore, even approaches like circuit sparsity, which cannot themselves scale to frontier models, can still be critical on the road to AMI: their extreme expressivity lets us impose few preconceptions on what we find at the outset.
“Feel the AMI”
Fully understanding neural networks is not going to be easy. It might not even be possible. But the point of doing research is to make big bets with big payoffs, and it's hard to beat the payoff of ambitious mechinterp.
Thanks to Adrià Garriga-Alonso, Jesse Hoogland, Sashe Hydrie, Jack Lindsey, Jake Mendel, Jacob Merizian, Neel Nanda, Asher Parker-Sartori, Lucia Quirke, and Aidan Smith for comments on drafts of this post.
One subtlety is that our work involves creating new models from scratch that are more interpretable, rather than interpreting existing models, which makes it somewhat harder to compare. However, I don’t think this completely invalidates the comparison, because I’d guess it’s tractable to extend our techniques to explain existing models.