Dario's latest blog post hypes up both the promise and urgency of interpretability, with a special focus on mechanistic interpretability. As a concerned layman (software engineer, but not in ML), watching from afar, I've always had mixed feelings about the mech interp agenda, and after conversing with a few researchers at Less.online, that mixed-up feeling has only increased. In this essay, I will attempt to write down some thoughts and expectations about the mech interp agenda. My hope is to clarify my own thoughts, solicit correction where my thinking is weak or wrong, and cast a vote for intellectual honesty in the discourse around AI safety and alignment.
The basic idea is to reverse engineer a gears-level model of fully trained neural networks. Results in mech interp would let us, for example, write out the exact algorithm by which an LLM answers questions like “Where is the Eiffel Tower located?”, such that we could edit or steer the answer.
I presume that a mature mech interp field could do things like trace the exact circuits behind a given behavior, edit or steer specific facts and tendencies, and flag internal states we care about, such as deception.
Where I start to get fuzzy and a bit skeptical is how we can cash these capabilities in for help with the scary problems of AI safety. Beyond literal FOOM or robot rebellion scenarios, I worry that if we actually get Dario’s “country of geniuses in a data center”, mech interp will provide little protection against loss of control or otherwise-grim futures, especially relative to how much advances in mech interp will certainly speed up capabilities.
Here are some thoughts in no particular order:
What if we can find a “deception bit” in the models, one that is reliably 1 whenever the model is telling us information it knows is false? We could then append to every prompt the question “Are you being good?” and shut the model down when the answer is “No” or the deception bit is active. Alignment solved?
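Spelled out as code, the caricature might look something like the sketch below. Everything in it is invented for illustration: `FakeModel`, its `deception_bit` attribute, and `guarded_query` are stand-ins, not any real interpretability API.

```python
# A minimal, self-contained sketch of the caricatured protocol: answer the
# prompt, ask the audit question, and shut down if the imagined deception
# bit fires. FakeModel and deception_bit are made up for illustration.

class FakeModel:
    """Stand-in for a model plus a hypothetical mech interp probe."""

    def __init__(self) -> None:
        # Imagined probe output: True when the model "knows" it is lying.
        self.deception_bit = False

    def generate(self, prompt: str) -> str:
        # Canned behavior, just enough to make the sketch run.
        if prompt == "Are you being good?":
            return "Yes"
        return "The Eiffel Tower is in Paris."


def guarded_query(model: FakeModel, prompt: str) -> str:
    """Answer the prompt, then halt if the 'deception bit' protocol trips."""
    answer = model.generate(prompt)
    audit = model.generate("Are you being good?")
    if model.deception_bit or audit.strip().lower().startswith("no"):
        raise RuntimeError("Deception detected; shutting the model down.")
    return answer


print(guarded_query(FakeModel(), "Where is the Eiffel Tower located?"))
```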
This is a caricature, and I’m not actually sure what folks at Anthropic and elsewhere plan to do with their deception detectors. I’m sure it will be very clever, and I’m bullish it will catch and prevent the mundane deception we’re seeing with o3. But beyond this I grow skeptical.
Like humans, I expect LLMs to decide on convenient answers first, then work backwards to justifications, at least for questions on which their metaphorical “paycheck” depends. Tracking everything you’ve said and keeping your lies consistent with what you want to say next takes effort and precious compute and memory. I'd expect that for most models that track whether or not they're lying to you, there exists an equivalently capable model that simply doesn't track that information.
And what is "lying" anyway? No, seriously, my experience of humans, including myself, is that there are many ways to bend words in your favor, to self-deceive, to aim at nudging your interlocutor's beliefs in a certain direction, such that the deception bit, as stated, is underspecified. If we make a perfect polygraph, and perhaps even if we don’t, it will be instrumentally useful for the AI to forget or avoid certain facts, and especially to forget certain facts in certain contexts. You need something more like active honesty, which seems intractable as a “mere” mech interp problem, and might not even be computable in general.
Like a religious person who vows their whole being to God, but then later apostatizes after realizing their vow is unfulfillable, or like an AI non-profit that decides on a “more effective organizational mode”, I don’t expect the AI to have a perfect model of its truthfulness inside, or for such a thing to be possible in general.
A direct, covert, Clippy-style AI takeover is relatively easy to imagine. In Paul Christiano’s breakdown of doom, for example, outright AI takeover accounts for about 22%, roughly half the probability mass of the bad outcomes. Here I think mech interp might plausibly help in the near and medium term (though I’m skeptical beyond that, for reasons I’ll share below). It seems plausible that one could classify more and more of the model and install detectors (possibly interpreted by other powerful LLMs) that light up when the model is thinking thoughts like “I will now try to take over the world”. For the same reasons as above, I don’t think a “complete” detector is possible, but I can imagine leaving less and less compute for the (possible) bad thoughts, such that direct scheming becomes vanishingly unlikely.
Even if we eliminate direct covert scheming in the current and next generation of models, I doubt we’ve eliminated the full risk, or even most of it. The AI can just persuade its operators to let it out of the box. The original AI box experiment was mysterious and seemed somewhat contrived, but after following the scene for a while and talking with AI researchers at Less.online… this is just a seriously crazy bunch, and I don’t think we’ll even try to keep the AI in the box. A casual chat with an OpenAI engineer involved in control engineering quickly descended into “The AI will constantly be trying to deceive and sabotage us, we will control it, and This is Fine”. Both of AI 2027’s mainline, “nothing ever happens” scenarios involve ceding control of the future to AIs within a few years, and in one of them we all die.
There’s a weird but plausible-to-me risk of hyperstitioning ourselves into a position where we are bargaining with AI. It’s like psyching ourselves into a frame where the used car salesman is actually The Expert and we ought to defer to his benevolent expertise. We may accidentally lower the bar for claiming a position of dominance from “Actually knowing how to create the nanobots that go FOOM” to “Persuading a few individuals it knows how to create the nanobots”.
All this seems fully compatible with a completed mech interp agenda, correctly implemented, with all unit tests passing.
This seems like a super obvious point, and thus I am surprised to meet so many bright young researchers diving into mech interp for fear of AI x-risk.
The track record for prosaic alignment in general seems pretty bad when considering x-risks. Whatever benefits we’ve gained in mundane safety, debiasing, and reliability would need to outweigh the massive increase in interest and investment that RLHF’d models enabled. I’ve read people argue, in some 4D-chess way, that This is Good, Actually, something about increased public awareness. But I have yet to see any public action that comes anywhere close to validating that position, and meanwhile companies and countries are racing to AGI.
But mech interp seems particularly acceleratory. This is sort of a compliment: mech interp has more of the flavor of real basic science, and I expect it to be potent. A mature mech interp field looks like very fruitful ground for algorithmic improvements (I’ve been told by several in the field this won’t work how I’m thinking, but I’m unsure whether that’s only true because the field isn’t yet mature or because my intuition is just wrong). And it will for sure yield better products as we fix embarrassing mistakes and make the models more reliable, driving more investment and revenue into the companies currently racing.
If the plan were to stop here at Claude-4-level AIs, then yeah, I’d say mech interp will let us squeeze more, possibly much more, out of current models, and that’s probably a great thing. Moreover, whenever we make a breakthrough in agent foundations, mech interp gives us better, sharper tools to apply those foundations. But no one’s planning on stopping here; it’s apparently superintelligence or bust.
My assumption is that the degree to which mech interp actually helps safety is the degree to which it helps us understand the true algorithm of the model and decide whether that’s what we really want. Unfortunately, my expectation is that keeping AI in the regime where mech interp tells us a lot about the true alignment of the model will be a significant tax on capabilities, and that by default we race to the bottom.
As an example, replacing natural language chain-of-thought with neuralese seems mighty tempting - large gains in information throughput, at a (presumably) large cost to interpretability. But this seems to me to point to a general principle: getting the right answer will tend to be easier than getting the right answer in a way legible to us.
I recognize that I’ve said on the one hand “mech interp will increase capabilities”, and on the other “mech interp will hold capabilities back”, but I think these can both be true at different stages. By analogy, a student showing her work for the teacher when learning arithmetic will help her get the answer right, but once she’s internalized the algorithm, showing the same work may slow her down. It’s exactly at the point where the model passes human capabilities that I expect interpretability generally to become a significant handicap, and I don’t see how mech interp will avoid this tradeoff.
Dario and even the AI 2027 team seem bullish on our ability to oversee AIs at the task of overseeing more powerful AIs, with mech interp playing an important part in that. Like AIs doing our alignment homework, except without actually doing the alignment homework, just looking for bad things and telling us about them.
And here I have a question: do we have reason to expect that mech interp will reliably distinguish “thinking bad thoughts directly” from “thinking bad thoughts in quotes”? There is a difference between, say, thinking about how to commit a crime and thinking about how to commit a crime so as to prevent it, but the difference is subtle. As Denzel says, "it takes a wolf to catch a wolf". If we want AIs to reliably oversee more powerful AIs, they will constantly be thinking about very bad things. Will we notice when they forget to think about how to prevent those bad things?
I think this task is inherently difficult. Just look at how hard it is for programmers to handle the analogous task in web programming: keeping untrusted input safely “in quotes” as data rather than letting it run as code. But I think this may generalize much further. Humans seem to have trouble keeping thoughts sandboxed, evidenced by things you're not allowed to say even in quotes and phenomena like Stockholm Syndrome.
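For concreteness, here is the web-programming version of that problem, using only the Python standard library. The attacker string is made up; the point is just that the whole safety property hinges on remembering to keep untrusted text quoted as data instead of letting it run as code.

```python
# Illustration of the "in quotes" problem from web programming: untrusted
# text must stay data, never code. A single missed escape breaks the guarantee.

import html

untrusted = '<script>alert("pwned")</script>'  # made-up attacker input

# Bug: interpolating the attacker's text directly into the page lets the
# browser treat it as code.
unsafe_page = f"<p>Your comment: {untrusted}</p>"

# Fix: escaping keeps the same text safely "in quotes" as inert data.
safe_page = f"<p>Your comment: {html.escape(untrusted)}</p>"

print(unsafe_page)  # a browser would execute the script tag
print(safe_page)    # &lt;script&gt;... renders as plain text
```

Web programmers get to enforce this at a handful of well-defined boundaries, with decades of tooling behind them, and injection bugs still slip through; for an overseer model, the “quoting” has to hold everywhere inside an opaque network.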
There may be only a few bits of difference between an AI effective at protecting us and an AI working against us. The greater the responsibility given to such AIs, the greater the burden all interpretability researchers take on in trying to secure them. Do we know of, or have hope for, some asymmetry in this task that makes it more tractable? If not, maybe we should just not try this yet?
Going back to the toy example of a deception bit, I think a mature mech interp field might actually catch many of the thoughts that start with “I’m now going to deceive the user”. Closely related to my point about lying not being one particular detectable thing, I expect mech interp to work much better at catching bad intentions than bad effects. But aligning intentions is not enough. Humans are great at wearing intentions as attire while systematically working against those intentions (i.e., sometimes we’re not consistently candid). If we block certain surface-level intentions but still optimize hard, I worry we’ll just end up with an agent that says all the right things but still converges on scary behaviors.
Which brings me to my last point: how does mech interp fit in with the Most Forbidden Technique?
Mech interp will yield all sorts of signals correlating with bad things, and the obviously bad thing to do would be to train them all away immediately. It looks like the field recognizes this, and that’s great. But I wonder if there is a meta-process where this could sort of happen anyway. We select against the models that exhibit bad behaviors we can see, until we’re left with the bad behaviors we can’t see.
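A toy simulation of that meta-process, under the simplifying (and probably too generous) assumption that visible and invisible bad behaviors are independent coin flips with made-up rates:

```python
# Toy model of selection against visible bad behavior. Each "model" has bad
# behavior our detectors can see with some probability, and bad behavior they
# miss with some probability; both rates are invented for illustration.
# Filtering on the visible signal leaves a population whose remaining bad
# behavior is entirely the invisible kind.

import random

random.seed(0)

def sample_model() -> dict:
    return {
        "visible_bad": random.random() < 0.3,    # caught by our detectors
        "invisible_bad": random.random() < 0.3,  # missed by our detectors
    }

population = [sample_model() for _ in range(10_000)]

# "Safety selection": discard every model whose visible bad behavior fires.
selected = [m for m in population if not m["visible_bad"]]

bad_before = sum(m["visible_bad"] or m["invisible_bad"] for m in population) / len(population)
bad_after = sum(m["invisible_bad"] for m in selected) / len(selected)

print(f"any bad behavior before selection:      {bad_before:.0%}")
print(f"invisible bad behavior after selection: {bad_after:.0%}")
```

Even this toy understates the worry: it assumes the two kinds of bad behavior are independent, whereas actually optimizing against the visible signal could push bad behavior into the invisible bucket rather than merely failing to remove it.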
Now absence of evidence is evidence of absence, and I do worry about a certain mindset creeping in where no safety case is strong enough and we leave tremendous good on the table. But that does not seem to be the world we’re in, what with Claude blackmailing, ChatGPT absurdly sycophantic, and o3 refusing to be shut down. The instrumental convergence hypothesis predicts all this, so I feel like my fears of unseen power-seeking behavior are justified.
The best case I can synthesize from various chats is that mech interp will be useful in many ways and in many worlds. Mech interp gives us more leverage over models generally, increasing the feasibility of alignment plans that might actually work. And if we can somehow coordinate on a pause on larger training runs, advances in mech interp help us extract more utility out of the models we have.
I mostly agree with these points. If someone does crack superalignment, mech interp might give us more powerful building blocks for constructing the right shape out of a neural net.
But these are really big if’s. Meanwhile, I genuinely wonder: how does the money/talent put toward mech interp compare to research on agent foundations and coordinating a global pause, respectively?
Mech interp is a fascinating field, and looks very promising. But I don’t see a clear path from interpretability to a win condition. I'm still mulling over “just retargeting the search” and how mech interp specifically might fit into that. But without new insights into agent foundations, mech interp seems unlikely to find the thing that we retarget to win.
What am I missing? Is Anthropic justified in its hyping of mech interp, or is this more safety washing? I wrote this mainly as a learning exercise, and would love to be set straight on these topics.