Anthropic announces interpretability advances. How much does this advance alignment?