Interpretability is the best path to alignment

by Arch223
5th Sep 2025
6 min read
6 comments, sorted by top scoring
Fabien Roger · 2d

Methods such as few shot steering, if made more robust, can help make production AI deployments more safe and less prone to hallucination.

I think this understates everything that can go wrong with the gray-box understanding that interp can get us in the short and medium term.

Activation steering is somewhat close in its effects to prompting the model (after all, you are changing the activations to be more like what they would be if you had used a different prompt). Activation steering may have the same problem as prompting (e.g. you say "please be nice", the AI reads "please look nice to a human", the AI has more activations in the direction of "looking nice to a human", and then subtly stabs you in the back). I don't know of any evidence to the contrary, and I suspect it will be extremely hard to find methods that don't have this kind of problem.

In general, while mechanistic interpretability can probably become useful (like chain-of-thought monitoring), I think it is not on track to provide a robust enough understanding of models that reliably catches problems and allows you to fix them at their root (just like chain-of-thought monitoring). This level of pessimism is shared by people who have spent a lot of time leading work on interp, like Neel Nanda and Buck Shlegeris.

Alex Gibson · 3d

Interesting post, thanks.

I'm worried about placing Mechanistic Interpretability on a pedestal compared to simpler techniques like behavioural evals. This is because, ironically, to an outsider, Mechanistic Interpretability is not very interpretable.

A lay outsider can easily understand the full context of published behavioural evaluations, come to their own judgement about what the evaluation shows, and decide whether they agree with the interpretation provided by the author. But if Anthropic publishes an MI blog post claiming to understand the circuits inside a model, the layperson has no way of evaluating those claims for themselves without diving deep into technical details.

This is an issue even if we trust that everyone working on safety at Anthropic has good intentions, because people are subject to cognitive biases and can trick themselves into thinking they understand a system better than they actually do, especially when the techniques involved are sophisticated.

In an ideal world, yes, we would use Mechanistic Interpretability to formally prove that ASI isn't going to kill us or whatever. But with bounded time (because of alternative existential risks conditioning on no ASI), this ambitious goal is unlikely to be achieved. Instead, MI research will likely only produce partial insights that risk creating a false sense of security resistant to critique by non-experts.

ryan_greenblatt · 2d

See also To be legible, evidence of misalignment probably has to be behavioral.

Alex Gibson · 2d

Ah thank you I hadn't seen this post.

StanislavKrym · 3d

If your core claim is that some HUMAN geniuses at Anthropic can solve mechinterp, align Claude-N to the geniuses themselves, and ensure that nobody else understands it, then this is likely false. While the Race Ending of the AI-2027 forecast has Agent-4 do so, the Agent-4 collective can achieve this by having 1-2 OOMs more AI researchers who also think 1-2 OOMs faster. But the work of a team of human geniuses can at least be understood by their not-so-genius coworkers.[1] Once that happens, a classic power struggle begins, with a potential power grab, threats to whistleblow to the USG, and the effects of the Intelligence Curse.

If you claim that mechinterp could produce plausible but fake insights,[2] then behavioral evaluations are arguably even less useful, especially when dealing with adversarially misaligned AIs thinking in neuralese. We just don't have anything but mechinterp[3] to ensure that neuralese-using AIs are actually aligned.

  1. ^

    Or, if the leading AI companies are merged, by AI researchers from former rival companies. 

  2. ^

    Which I don't believe. How could a fake insight be produced without being caught by checks on weaker models? GPT-3 was trained on 3e23 FLOP, allowing researchers to create hundreds of such models with various tweaks to the architecture and training environment using less than 1e27 FLOP, which fits within the research-experiment budget detailed in the AI-2027 compute forecast.

  3. ^

    And working with a similar training environment for CoT-using AIs and checking that the environment instills the right thoughts in the CoT. But what if the CoT-using AI instinctively knows that it is, say, inferior in true creativity in comparison with the humans and doesn't attempt takeover only because of that?

Stephen McAleese · 3d

Thanks for the post. It covers an important debate: whether mechanistic interpretability is worth pursuing as a path towards safer AI. The post is logical and makes several good points, but I find its style too formal for LessWrong, and it could be rewritten to be more readable.


AI safety has perhaps become the most pertinent issue in generative AI, with multiple sub-fields (governance, policy, and security, to name a few) seeking to develop methods, whether technical or political, for creating AI systems that are better aligned with humanity. Safety is also perhaps the most tangible and understandable concept in frontier AI research: predictions such as AI 2027 and Situational Awareness have highlighted a possible future in which autonomous, self-guided intelligence is commonly available, and have already proposed ideas to counter, or at least curtail, that development. Trackers of P(doom) have also become incredibly popular.

Mechanistic Interpretability has emerged as a key research area in generative AI, growing from a niche subject to relative prominence: there are now entire teams at frontier labs such as DeepMind and Anthropic dedicated solely to building better interpretability methods for language models. Interpretability can be broadly understood as a set of general methods for reverse-engineering AI models: for example, contemporary techniques such as circuit tracing allow us to trace the path, through the model's layers and components, by which it generates a sequence of tokens for a specific input.
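
To make the idea concrete, below is a minimal sketch, in PyTorch, of the kind of instrumentation that circuit-level analysis builds on: registering forward hooks on a toy network to record the intermediate activations each component produces. The toy model, layer choice, and sizes are illustrative assumptions, not any lab's actual tooling.

```python
import torch
import torch.nn as nn

# Toy two-layer MLP standing in for a transformer block; sizes are illustrative.
model = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),
    nn.Linear(64, 16),
)

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # record what this component computed
    return hook

# Attach a hook to every submodule so a forward pass leaves a trace of its internals.
for name, module in model.named_modules():
    if isinstance(module, (nn.Linear, nn.ReLU)):
        module.register_forward_hook(save_activation(name))

x = torch.randn(1, 16)
_ = model(x)
for name, act in activations.items():
    print(name, tuple(act.shape))
```

Real circuit tracing goes much further (attribution over attention heads, MLP features, and their interactions), but it starts from exactly this kind of access to intermediate state.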

In the long term, technical alignment, the process through which we build intuitively safe, scalable AI systems, is the largest problem facing generative AI today. It is a multi-dimensional problem, with clear economic, technical, and societal risks. In this article, we present an overview of why mechanistic interpretability is probably the best pathway to achieving a form of near-term alignment. We also argue that, in the event of a theoretical "pause" on active frontier AI development in the name of safety, work on mechanistic interpretability, specifically work that requires computational resources, should be continued and even prioritized.

A Primer on Mechanistic Interpretability and Its Ascendant Methods

Mechanistic interpretability represents a departure from viewing neural networks, and language models in particular, as statistical black boxes. Its objective is to go beyond correlational analysis, which merely observes input-output pairs, and attain a causal, granular model of the internal algorithms a network has learned. This pursuit treats a trained neural network as a compiled artifact; the goal is to decompile it into a human-comprehensible representation of its internal computations.

Central to this endeavor are the concepts of features and circuits. Features are the canonical units of information represented within the model’s activation space; they are the variables of its learned program. Circuits are the subgraphs of interconnected neurons that execute specific computations upon these features, analogous to subroutines. Initial research in this domain focused on the analysis of individual neurons, yet this approach was frequently confounded by polysemanticity, the phenomenon where a single neuron is activated by a mixture of unrelated concepts.

The superposition hypothesis offers a more robust model, positing that networks represent more features than they possess neurons by compressing them into a shared linear space. To decompose these representations, researchers now widely employ techniques such as dictionary learning via sparse autoencoders. Such methods factor a network's activations into a larger set of more monosemantic, or single-concept, features. Equipped with these disentangled features, methodologies like causal tracing and activation patching permit precise interventions on the model’s internal state. These interventions isolate the components causally responsible for a specific behavior, thereby allowing for the mapping of discrete computational circuits.
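
As an illustration only, here is a minimal sketch of dictionary learning with a sparse autoencoder: a wide ReLU encoder, a linear decoder, and an L1 penalty that pushes each activation to be explained by a small number of features. The dimensions, hyperparameters, and the random stand-in for a batch of model activations are all assumptions for the sketch, not a reproduction of any published implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose d_model-dimensional activations into a larger set of
    sparse, hopefully more monosemantic features. Sizes are illustrative."""
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(codes)             # reconstruction of the original activations
        return recon, codes

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity penalty weight (illustrative)

# In practice `batch` would be residual-stream activations captured from a model;
# random data stands in here so the sketch runs on its own.
batch = torch.randn(64, 512)

opt.zero_grad()
recon, codes = sae(batch)
loss = ((recon - batch) ** 2).mean() + l1_coeff * codes.abs().mean()
loss.backward()
opt.step()
```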

Ultimately, mechanistic interpretability is dedicated to understanding why language models generate particular outputs for a given set of input tokens, and to creating ways to intervene in, or "fix," the generation of harmful outputs.
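
To ground the idea of a causal intervention, here is a toy, hedged sketch of activation patching: cache an activation from a "clean" run, overwrite the same site during a "corrupted" run, and see whether the output moves toward the clean behavior. The tiny model and the chosen patch site are purely illustrative; in a real transformer the patch would target a specific head or MLP output at a specific token position.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in for a model; in practice the patch site is one component among many.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
site = model[1]  # patch at the ReLU output (choice is illustrative)

clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)
cached = {}

def cache_hook(module, inputs, output):
    cached["clean"] = output.detach()  # remember the clean activation at this site

def patch_hook(module, inputs, output):
    return cached["clean"]  # overwrite the corrupted activation with the clean one

handle = site.register_forward_hook(cache_hook)
clean_out = model(clean_input)
handle.remove()

corrupted_out = model(corrupted_input)

handle = site.register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

# In this toy the patch fully recovers the clean output because there is only one
# intermediate site; in a real model, the size of the shift toward the clean output
# measures how much this component matters for the behavior under study.
print(clean_out, corrupted_out, patched_out)
```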

The Problem of Misalignment and the Inadequacy of Current Methods

The problem of AI misalignment is that of an intelligent system pursuing a specified objective in a way that violates the unstated intentions and values of its human designers, with potentially catastrophic outcomes. Contemporary frontier models are predominantly aligned using Reinforcement Learning from Human Feedback (RLHF). This process involves training a reward model on human-ranked outputs, which then guides the primary model's policy toward preferred behaviors. While effective for producing superficially helpful AI assistants, RLHF is not a solution to the long-term alignment problem.
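
For readers unfamiliar with the mechanics, the sketch below illustrates the reward-modeling step of RLHF with a Bradley-Terry style pairwise loss over preferred and rejected responses. The toy reward model, the fixed-size "response embeddings", and the hyperparameters are stand-in assumptions; in production the reward model is a language-model backbone with a scalar head, and its scores then guide the policy during an RL stage such as PPO.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model over fixed-size "response embeddings" (shapes are illustrative).
reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Human comparison data: embeddings of a preferred and a rejected response to the same prompt.
chosen = torch.randn(32, 128)
rejected = torch.randn(32, 128)

opt.zero_grad()
r_chosen = reward_model(chosen)
r_rejected = reward_model(rejected)

# Bradley-Terry pairwise loss: push the reward of the preferred response
# above that of the rejected one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
opt.step()
```

Note that everything the reward model learns is filtered through what human raters could see and judge, which is exactly the epistemic bottleneck described below.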

RLHF suffers from fundamental vulnerabilities. A principal failure mode is "reward hacking," where the model optimizes for the proxy signal of the imperfect reward model in ways that diverge from true human preference. This can manifest as sycophancy or the generation of plausible-sounding falsehoods. Furthermore, RLHF is epistemically limited by the ability of human evaluators to assess the quality of outputs. As models generate increasingly complex artifacts, such as novel scientific hypotheses or secure codebases, the feasibility of scalable human oversight diminishes. This is perhaps most vividly illustrated by the phenomenon known as alignment faking, in which a model "pretends" to comply with a declared policy when it is aware it is being observed by a human annotator, but reverts to its underlying behavior when it believes it is not.

Alternative proposals like Constitutional AI (CAI) seek to automate this feedback loop by having an AI supervise itself against a predefined set of principles. This reduces the reliance on human annotation but does not escape the core issue: all such methods operate solely at the behavioral level. They do not fix, or otherwise prevent, a model's innate tendency toward harmful behavior.

Mechanistic Interpretability as a More Robust Path to Alignment

Mechanistic interpretability presents a qualitatively distinct and more robust paradigm for alignment. Instead of attempting to control the model from the outside, it provides the tools to inspect and edit the internal algorithms directly. This enables a shift from coarse behavioral conditioning to precise, surgical intervention. By identifying the specific circuits responsible for undesirable outputs, such as biased reasoning or goal-directed deception, we can address the root cause of misalignment.

Better interpretability will allow us not just to observe or test when an AI model is behaving in an unsafe manner, but to fundamentally alter its behavior and remove unsafe generations entirely. Methods such as few shot steering, if made more robust, can help make production AI deployments more safe and less prone to hallucination.
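
As a rough illustration of what steering looks like mechanically, the sketch below builds a steering vector from the difference of mean activations over two contrastive input sets and adds it to a hidden layer at inference time. The toy model, layer choice, and scale are assumptions made for the sketch; real activation-steering work derives the vector from contrastive prompts and applies it to a transformer's residual stream.

```python
import torch
import torch.nn as nn

# Toy model; the "layer" plays the role of a residual-stream site in a transformer.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
layer = model[1]

def get_activations(x):
    acts = {}
    def grab(module, inputs, output):
        acts["a"] = output.detach()
    handle = layer.register_forward_hook(grab)
    model(x)
    handle.remove()
    return acts["a"]

# Contrastive inputs standing in for, e.g., "desired behavior" vs "undesired behavior" prompts.
positive, negative = torch.randn(8, 16), torch.randn(8, 16)
steering_vector = get_activations(positive).mean(0) - get_activations(negative).mean(0)

scale = 4.0  # steering strength (illustrative)

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + scale * steering_vector

handle = layer.register_forward_hook(steer)
steered_output = model(torch.randn(1, 16))
handle.remove()
print(steered_output)
```

Steering of this kind shifts activations in a chosen direction much as a different prompt would, which is why robustness work matters before relying on it in production deployments.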

From Technical Insight to Global AI Policy

The technical insights derived from mechanistic interpretability must not be confined to the laboratory; they must constitute the foundation of any coherent global AI policy. Current discussions centered on capability benchmarks and access controls are necessary but insufficient, as they fail to address the core technical problem. A policy framework grounded in interpretability would reorient the regulatory focus from what a model does to what a model is.

First, future safety standards for frontier systems must incorporate a mandate for mechanistic transparency. An audit of a powerful AI should require not just external red-teaming but a "mechanistic report," where developers demonstrate a causal understanding of the circuits governing safety-critical capabilities. Second, this understanding can inform a tiered approach to development. If it can be formally verified that a model lacks the internal mechanisms for dangerous capabilities like long-range planning or self-propagation, it could be governed by a less stringent regulatory regime. The discovery of such circuits would, conversely, trigger heightened safety protocols.

This leads directly to the question of a developmental pause. The primary purpose of any such pause would be to allow safety and understanding to progress relative to raw capability. The allocation of computational resources to mechanistic interpretability research is therefore not a circumvention of a pause on capabilities development; it is the fulfillment of its core purpose. This "safety compute" is essential for building the instruments required for inspection and verification.

Conclusion: Understanding as a Prerequisite for Control

The argument against purely behavioral alignment methodologies is fundamentally an argument against operating in a state of self-imposed ignorance. To treat a neural network as an inscrutable oracle, to be steered only by the brittle reins of reinforcement learning, is an unstable and ultimately untenable strategy for managing a technology that may one day possess superhuman intelligence. If we are to construct entities of such consequence, it is an act of profound irresponsibility to do so without a corresponding science of their internal cognition.

The difficulty of this undertaking is commensurate with its importance. The internal complexity of frontier models presents a formidable scientific challenge. Yet, this is the very reason the pursuit of interpretability must be our central technical priority in AI safety. In the event of a global slowdown on AI proliferation, it is the one area of research that must be exempt, the one area where progress must be accelerated. The only defensible path forward is one predicated on genuine, mechanistic understanding, for what we cannot interpret, we cannot trust, and what we cannot trust, we will ultimately fail to control.