The Endgame for Mechanistic Interpretability

neelsomani

Rejected for the following reason(s):

This is an automated rejection.
you wrote this yourself (not using LLMs to help you write it)
you did not chat extensively with LLMs to help you generate the ideas.
your post is not about AI consciousness/recursion/emergence, or novel interpretations of physics.

Read full explanation

Mechanistic interpretability is currently pulled between competing visions. On one side, Neel Nanda argues for pragmatic interpretability: grounding work in real models and judging progress by empirical feedback, even when the underlying understanding is partial. On the other, Leo Gao defends ambitious interpretability: a long-term bet on building circuits that are necessary and sufficient for behavior, on the view that deeper mechanistic understanding is what will generalize across model changes. This disagreement is treated as a question of methods, but the deeper divide lies elsewhere.

The disagreement persists because mechanistic interpretability lacks an agreed-upon end goal. What, exactly, would count as success?

The Telos of Mechanistic Interpretability

Today, feature labeling, circuit discovery, probing, activation patching, and causal interventions coexist as productive methods, yet they are only loosely coordinated. The field lacks a shared ideal for what interpretability methods are ultimately meant to deliver.

One possible telos is legibility: producing explanations that are intelligible to humans. On this view, interpretability succeeds when it tells a coherent story about why a model behaves as it does. But explanations that fail under counterfactual intervention may sound plausible, yet provide no reliable handle for control. Even advocates of curiosity-driven interpretability typically hope their findings will eventually support some downstream use, and legibility alone does not guarantee this.

A second telos is scientific understanding. Intervention is used to test hypotheses, identify causal structure, and build general explanations. Mechanistic interpretability often operates successfully under this ideal. But LLMs are not natural objects. They are engineered software artifacts. Focusing on understanding alone leaves a powerful affordance unused. Interventions need not merely reveal structure, they can permanently modify the system while preserving formally specified properties. A purely scientific telos does not demand patchability, certification, or correctness under modification.

A third telos is capability enhancement. From this perspective, interpretability is valuable only insofar as it accelerates optimization. The natural equilibrium of this ideal favors systems that are maximally effective and maximally opaque.

These ideals are not mutually exclusive, but no single one provides a stable foundation for the field. From this perspective, mechanistic interpretability ought to orient itself toward debuggability: the ability to localize failures to specific mechanisms, intervene on those mechanisms predictably, and certify that the intervention preserves desired behavior on bounded domains. This telos subsumes legibility and scientific understanding, while resisting the drift toward opacity implicit in a purely capability-driven view.

In what follows, I make this notion precise.

Desired Goals for LLM Debuggability

In an idealized setting, debugging an LLM would proceed from localization, to intervention, to certification. Each stage places strictly stronger demands on our mechanistic understanding, and each rules out large classes of explanations.

Localization: Identifying the Responsible Mechanism

The first requirement of debuggability is localization: the ability to identify which internal mechanisms are responsible for a given behavior, and to distinguish mechanisms that generalize from those that merely correlate with it.

In the strongest form, a debuggable localization supports counterexample search. For a bounded LLM input domain D, this means being able to determine whether the behavior can occur without the mechanism being active, or whether the mechanism can be active without producing the behavior (and, when possible, to surface concrete inputs that witness such cases).

Intervention: Surgical, Mechanism-Level Debugging

Localization is only meaningful if it admits intervention. Once a failure is traced to a mechanism, we'd like to modify that mechanism in a way that is predictable and targeted:

The responsible head, MLP, or subspace can be modified or constrained.
The intervention removes the undesired behavior on a specified domain.
The intervention does not induce collateral damage elsewhere in that domain.

Certification: Domain-Bounded Safety Guarantees

The final goal of debuggability is certification: the ability to make exhaustive, falsifiable claims about model behavior on bounded domains.

For a formally specified domain D, this can mean proving that no harmful token is produced for any input in D, a claim that is in principle achievable for sufficiently constrained settings. Certification may also take the form of subcircuit-level bounds, for example structural invariants that rule out entire classes of behavior by construction, such as proving that a circuit cannot bypass a guard layer unless a specific feature is active.

What a Debuggable Explanation Can (and Can't) Promise

First, some anti-goals:

Debuggability does not imply that a trained Transformer can be cleanly de-compiled into a single symbolic program. Transformer models are not limited to discrete algorithmic control flow. They exploit continuous geometry in high-dimensional embedding spaces, superposition, and distributed representations. Any realistic abstraction must preserve this expressive freedom rather than erase it.
Debuggability does not entail identifying a single, privileged "cause" of an output or phenomenon. Model behavior is mediated by deep, branching causal pathways with redundancy, overlap, and compensatory mechanisms. A debuggable explanation need not be unique. What matters is that the identified mechanisms are enough to explain and control the behavior within scope, and that alternative bypasses would be surfaced by counterexample search rather than hidden by storytelling.
Debuggability does not aim at global safety proofs of the form "this model will never produce harmful output." Even if "harmful" were formally defined, the unconstrained input space of frontier LLMs is beyond the reach of any existing or foreseeable verification technique. Any agenda that predicates success on global guarantees is doomed to either vacuity or false confidence.

Instead, debuggability is about constructing a family of verified, compositional abstractions that faithfully reproduce model behavior on bounded domains and support predictable intervention. These abstractions are partial, local, and plural, but they are exact where they apply.

The relevant analogy is debugging a large, safety-critical software system. One cannot prove "Chrome will never crash." But one can prove that specific routines are memory-safe, that sandboxing prevents certain classes of process escape, that critical invariants are preserved across refactors, and that a given patch eliminates a vulnerability without introducing regressions.

The same logic applies to LLMs. Meaningful debuggability consists in guarantees such as:

This subcircuit cannot activate a forbidden feature on domain D.
This intervention removes a failure mode while preserving all other behaviors in scope.
This pathway is structurally incapable of bypassing a guard unless a specific internal condition is met.

The Necessity of Formal Methods

Note that the above are universal claims over bounded domains. They are not distributional or probabilistic, and they do not rest on sampling. This is why a debuggability-oriented interpretability agenda is necessarily coupled to formal methods. SMT solvers, abstract interpretation, and neural verification frameworks are not optional add-ons. They are the only frameworks in which claims of impossibility, preservation, or closure under intervention can be made precise.

This vision does not require that today's frontier LLMs be fully verifiable end-to-end. What matters is that the debuggability of Transformer models has precedent:

Sparse circuit extraction shows that models contain relatively isolated, algorithmic subcircuits that remain stable under targeted intervention.
Symbolic Circuit Distillation is an early example of automated extraction, where neural mechanisms can be proven formally equivalent to symbolic programs.
Neural verification work (e.g. Reluplex and Marabou) establishes that exhaustive reasoning is possible once models are reduced to verification-friendly components on bounded domains.
Alternative attention mechanisms suggest that standard attention (the dominant barrier to SMT verification in Transformers) is an architectural choice rather than a theoretical necessity, opening the door to verification-aware model design with comparable performance.

Taken together, these results shift the problem from conceptual impossibility to engineering integration and scale. Debuggability is about being able to say, with confidence: "This mechanism, on this domain, behaves this way, and if it didn't, we would know."

What De-compiling Actually Looks Like

Here's what a possible "de-compilation" pipeline might look like:

1. Identify stable linear regions (local programs)

The smallest unit of analysis is a particular mechanism that has a stable branch structure on a bounded domain. Many verification-friendly components (affine maps, threshold gates, max/Top-K selection) behave like ordinary programs. Once you know which branch you’re in (e.g. which segment of a piecewise function is active, which items win a Top-K) the remaining computation is just affine arithmetic.

A "local program" is a region of inputs defined by explicit guard conditions (linear inequalities) together with the affine map executed under those guards. The stability part matters, because you want margins on the guards (e.g. thresholds not near zero, Top-K winners separated from runners-up) so that small perturbations or permissible interventions don't flip the branch decisions and invalidate the explanation.

Here's what a concrete example might look like:

Head 31.2: Helps break text into paragraphs
Empirically verified: attention weight exceeds $\eps$ when a token from set {‘\n’, ‘\n\n’} or a discourse marker token occurs at position t−1.

2. Factor into meaningful subspaces

Factoring into meaningful subspaces is the step where you decompose a mechanism's activations into low-dimensional directions that have stable semantics across inputs and contexts, such as syntactic markers, sentiment, or safety-relevant features. A single local program may operate over several subspaces, and the same subspace may participate in many different local programs.

Without subspaces, interventions are blunt (ablating whole heads or MLPs). With them, interventions can be surgical (editing or bounding specific directions while leaving others untouched).

Ideally, these subspaces exhibit "functional coherence," where moving along the subspace produces predictable, monotonic changes in model behavior on a bounded domain.

3. Extract formally verifiable causal circuits

In this step, we compose local programs and subspaces into a single object that supports global, counterfactual claims about behavior on a bounded domain. Formally, this means specifying an interface, a domain, and a set of admissible interventions, and then proving that the neural subcircuit is equivalent to (or soundly approximated by) a symbolic specification on that domain.

My project Symbolic Circuit Distillation builds in this direction by providing formally verified functional abstractions on bounded domains. Achieving this level of debuggability places strong pressure to redesign core Transformer components that are poorly suited to formal verification.

Multiple circuits may implement overlapping or reconstructed features elsewhere in the model. What matters is that, within scope, the abstraction is correct and closed under counterfactuals. If a bypass exists, formal search will find a concrete counterexample. This is the point where mechanistic interpretability becomes robust to patching and refactoring, and where safety-relevant guarantees become possible. Circuits stop being explanatory stories and become objects you can edit, reason about, and certify.

How these verified control abstractions are surfaced to human operators (whether through explicit query languages, automated tooling, or learned interfaces) is an important but orthogonal problem, and not required for the core claim of debuggability.

Interpretability as Control

The central question is whether mechanistic insight can support reliable, bounded, and verifiable control over systems that matter.

The result is a patchwork of verified abstractions: local programs, meaningful subspaces, and formally specified circuits. Many such decompositions may exist. Verification removes arbitrariness not by enforcing uniqueness, but by enforcing sufficiency.

I am curious to hear thoughts from other researchers. You can reach me on X: @neelsomani

Thanks to Maaz for giving feedback prior to posting.