Extracting Performant Algorithms Using Mechanistic Interpretability

Ihor Kendiukhov

A Prequel: The Tree of Life Inside a DNA Language Model

Last year, researchers at Goodfire AI took Evo 2, a genomic foundation model, and found, quite literally, the evolutionary tree of life inside. The phylogenetic relationships between thousands of species were encoded as a curved manifold in the model's internal activations, with geodesic distances along that manifold tracking actual evolutionary branch lengths. Bacteria that diverged hundreds of millions of years ago were far apart on the manifold, and closely related species were nearby.

The model was trained to predict the next DNA token. Nobody told it about evolution or gave it a phylogenetic tree as a training signal. But the model needed to encode evolutionary relationships in order to predict DNA well, and so it built a structured geometric representation of those relationships as part of its internal computation, and the representation was good enough that you could extract it with interpretability tools and compare it meaningfully to the ground truth.

I saw this and decided to apply the same approach to another type of biological foundation models - those trained on single cell data.

If Evo 2 learned the tree of life from raw DNA, what did scGPT learn about how human cells develop?

Finding the Manifold

For those unfamiliar with the biology side: scGPT is a transformer model trained on millions of single-cell gene expression profiles. Each cell in your body expresses thousands of genes at varying levels, and a single-cell RNA sequencing experiment measures those expression levels for potentially hundreds of thousands of individual cells simultaneously. scGPT was pre-trained on this kind of data in a generative fashion, learning to predict masked gene expression values from context.

The question I wanted to answer was: does scGPT encode, somewhere in its attention tensor, a compact geometric representation of some biological processes? And if so, can I find it without knowing in advance exactly where to look?

To attack this systematically, I used a two-phase research loop driven by an AI executor-reviewer pair operating under pre-registered quality gates. Phase 1 was a broad hypothesis search: the loop explored a large combinatorial space of candidate manifold hypotheses by varying the biological target (developmental ordering, regulatory structure, communication geometry), the featurization strategy (attention drift, raw embeddings, mixed operators), and the geometric fitting method (Isomap, geodesic MDS, a technique called Locally Euclidean Transformations), all applied across the full 12-layer × 8-head scGPT attention tensor, which means 96 individual attention units to screen.

What came out of Phase 1 was a robust positive hit: hypothesis H65, which identified a compact, roughly 8-to-10-dimensional manifold in specific attention heads where positions along the manifold corresponded to how far cells had progressed through hematopoietic differentiation. Stem cells clustered at one end; terminally differentiated blood cell types (T cells, B cells, monocytes, macrophages) spread out along distinct branches at the other end; and the branching topology matched the known developmental hierarchy with statistically significant branch structure that held up under stringent controls.

Then I switched to Phase 2 which was rather manual investigation: methodological closure tests, confidence intervals, structured holdouts, and external validation. I validated the manifold on a non-overlapping panel from Tabula Sapiens and then confirmed it via frozen-head zero-shot transfer to an entirely independent multi-donor immune panel. You can explore this manifold yourself and compare different extraction variants, in an interactive 3D viewer.

But Does the Extracted Algorithm Actually Work?

I think finding a biologically meaningful manifold inside a foundation model is, on its own, cool. But the question I actually cared about was: can you take this geometric object out of the model and use it as a standalone method that does useful work?

To do it, I developed a three-stage extraction pipeline:

I directly exported the frozen attention weight matrices from the relevant heads, with no retraining, just literally reading out the learned linear operator.
I attached a lightweight learned adaptor that projects the raw attention output into the manifold's coordinate system.
And then I added a task-specific readout head (for classification or pseudotime prediction).

The key property of this pipeline is that the heavy lifting, the actual biological knowledge, comes entirely from the frozen attention weights that scGPT learned during pre-training. The adaptor and readout are small and cheap to train, and they never touch the original dataset the model was pre-trained on. What you end up with is a standalone algorithm you can ship as a file and run independently of scGPT.

So: how does it perform?

I benchmarked the extracted algorithm against a lineup of established methods that biologists actually use in practice: scVI (a deep generative model for single-cell data), Palantir (a pseudotime method based on diffusion maps and Markov chains), Diffusion Pseudotime (the Scanpy implementation), CellTypist (a logistic-regression-based cell type classifier trained on a large reference atlas), PCA, and raw-expression baselines. These are the standard tools in the single-cell bioinformatics toolkit, developed and refined by domain experts over years.

On pseudotime-depth ordering, which measures how well a method recovers the true developmental progression from stem cells to mature blood cells, the extracted algorithm appeared to be the best, significantly outperforming every tested alternative in paired split-level statistics. On classification (distinguishing cell types), the picture was less unambiguous but still strong: the extracted head led on branch balanced accuracy and on key subtype discrimination tasks like CD4/CD8 T cell separation and monocyte/macrophage distinction. On some stage-level and branch-level macro-F1 metrics, diffusion-style baselines or raw expression had the edge, so this is not a clean sweep, but the extracted algorithm is solidly in the top tier across the board, and dominant on the most biologically meaningful endpoint.

Now, you might reasonably ask: is this just the result of having a fancier probe? Maybe any sufficiently flexible function fitted on top of scGPT's embeddings would do equally well, and the "manifold discovery" part is not contributing anything real. I tested this. A 3-layer MLP with 175,000 trainable parameters, fitted on frozen scGPT average-pooled embeddings, was significantly worse than the extracted 10-dimensional head on 6 out of 8 classification endpoints. And the extracted head accomplished this while being 34.5 times faster to evaluate across a full 12-split campaign, with roughly 1,000 times fewer trainable parameters.

Let me restate this: the geometric structure that mechanistic interpretability found inside scGPT's attention heads, when extracted and used directly, outperforms the standard approach of slapping an MLP on top of the model's embeddings. The interpretability-derived method is simultaneously more accurate, faster, and smaller.

How Small Can You Go Though?

Once you have an extracted algorithm that works, the natural next question is how much of it you actually need. Compression is interesting for practical reasons, but it is even more interesting for interpretability reasons, because the further you compress an algorithm while preserving its performance, the closer you get to understanding what it is actually doing.

The initial extracted operator pooled three attention heads from scGPT and weighed 17.5 MB. Not large by modern standards, but not trivially inspectable either. The first compression step was to ask: do we really need all three heads, or does a single one carry the essential geometry? I scanned all 96 attention units in scGPT's tensor and found that a single unit, Layer 2, Head 5, carried substantial transferable developmental geometry on its own. The compact operator built from this single head weighed 5.9 MB and showed almost no loss compared to the three-head version on the benchmark suite.

The second compression step was more aggressive: truncated SVD on the single-head operator. This factors the weight matrix into low-rank components and throws away everything below a chosen rank threshold. At rank 64, the resulting surrogate shrinks to 0.73 MB, which is already quite tiny, and it still beats the frozen scGPT average-pool + MLP baseline on all eight pooled classification endpoints. It does incur statistically significant losses versus the dense single-head operator on 5 out of 8 endpoints, so this is not free compression. But the rank-64 version is still a better algorithm than the standard probing approach, at a fraction of a megabyte.

And also, now the interpretability payoff arrives. I ran a factor ablation audit on the rank-64 surrogate: systematically remove each of the 64 factors one at a time, measure how much performance drops, and rank them by necessity. And it appeared that just four factors, out of 64, accounted for 66% of the total pooled ablation impact. And then, when I examined what those four factors corresponded to biologically, they resolved into explicit hematopoietic gene programs.

So, Mechanistic Interpretability is Becoming Dual Use

Let's step back from the specific results for a moment and consider the high-level lesson here.

The very property that makes this result interesting is also the property that makes me cautious about applying the same techniques to large language models. Because the argument runs in both directions. If you can extract an algorithm that a model uses to do something well, you can potentially also improve how it does that thing: by identifying inefficient components, scaling the relevant circuits, composing extracted subroutines in new ways, by replacing the fuzzy learned version with a cleaner extracted version and freeing up capacity for the model to learn something else etc. Mechanistic interpretability, in other words, is becoming a capability amplification tool. This is a well-known theoretical concern, but it looks like now it is becoming a practical one.

Consider a few scenarios. You identify the circuit in a language model responsible for multi-step planning, extract it, find that it is operating at low rank with substantial redundancy, and publish a paper showing how to compress it. Now anyone training the next generation of models can initialize that circuit more efficiently, or allocate more capacity to the components that matter. Or: you discover that a model's chain-of-thought reasoning relies on a specific attention pattern that routes information through intermediate tokens in a predictable way, and you publish a detailed mechanistic account of how this works. Now someone building an inference-time scaling pipeline can optimize that routing directly rather than relying on the model to rediscover it from scratch.

This is one of the reasons why I have deliberately chosen to focus my interpretability work on biological foundation models. Although I agree that pushing biology can also be associated with risks, I believe that we really need to push biology asap, considering the current AI risks landscape, and pushing biology is what I am trying to do.

Mechanistic Intepretability for Novel Knowledge Discovery

On a more positive and general note: on top of being an auditing/monitoring tool, mechanistic interpretability can be a knowledge discovery tool. Consider:

The model learned something about hematopoiesis that existing bioinformatics methods had not fully captured, at least not in the same compact form.
The interpretability pipeline found a representation that, when extracted and deployed as a standalone algorithm, outperformed established tools on the most biologically meaningful benchmarks.
The knowledge extracted from the model's internals was new in the operationally relevant sense: nobody had this particular algorithm before, and it works better than what people were using.

Join In

If you like mechanistic interpetability, I encourage you to consider switching from LLMs to biological foundation models.

The work I described here is part of a broader research program on mechanistic interpretability of biological foundation models. Earlier I published a comprehensive stress-test of attention-based interpretability methods on scGPT and Geneformer. In parallel, I developed a sparse autoencoder atlas covering 107,000+ features across all layers of both Geneformer and scGPT. The hematopoietic manifold paper is the latest piece.

There is a lot more to do here, both in terms of applying these methods to other biological systems and developmental processes, and in terms of developing better unsupervised techniques for manifold discovery that could scale beyond what the current semi-supervised approach allows. I think this is one of the best places in the current research landscape to do interpretability work that is simultaneously methodologically interesting, practically useful for biomedicine (and yes, human intelligence amplification), and safe with respect to the capability externalities that worry me about LLM interpretability.

You can find more about the research program, ongoing projects, and ways to get involved at biodynai.com. The full paper with all supplementary materials is on arXiv, and the interactive 3D manifold viewer is at biodyn-ai.github.io/hema-manifold.

The model was trained to predict the next DNA token. Nobody told it about evolution or gave it a phylogenetic tree as a training signal. But the model needed to understand evolutionary relationships in order to predict DNA well, and so it built a structured geometric representation of those relationships as part of its internal computation, and the representation was good enough that you could extract it with interpretability tools and compare it meaningfully to the ground truth.

I haven't read Goodfire's post and I've thought about this for like 3 minutes ^[1] but my guess is that this is a vast exaggeration of the result, with the correct thing being much closer to "it's good to have similar DNAs assigned similar representations, and one can read off phylogenic distance from similarity". Like, my guess is that it's almost entirely not about the model understanding evolutionary relationships or using representations of evolutionary relationships.

E.g. in a toy version of evolution where the only thing that can happen are single nucleotide changes, you could read off phylogenic distance really well from distance between completely naive embedding vectors constructed by just having a coordinate of the vector for each position in the genome, and its value be for respectively A/T/G/C. It'd be a vast overstatement to say this model understands evolutionary relationships or uses them in its internal computation.

addendum: I skimmed Goodfire's post now and saw they have a section on "disentangling phylogenetics from sequence similarity". But imo the thing they do there completely fails to get rid of this simple reason for having phylogeny be readable from representation vectors. It probably would kill the very naive construction I gave, but it doesn't kill reasonable mildly more sophisticated things.

p.s.: Another thing in the same vein: "you can make a pretty high accuracy linear probe for X on a model's activations" is much weaker than e.g. "the model represents X as a basic variable", "the computation implemented in the weights uses X", "the model thinks in terms of X", "the model understands X", or even "the model computes X". (Consider for example how various properties could be probed well from RAM activations when running a computer game, that aren't actually remotely computed by the source code.)

though I've thought much more about related questions in the past ↩︎

it's good to have similar DNAs assigned similar representations, and one can read off phylogenic distance from similarity

You're bang-on here. The averaging-over method would prevent sequence similarity readoffs, but it wouldn't prevent any genome-level characters such as kmer spectra, GC content or average linkage disequilibrium, which are correlated with phylogenies but are not deterministic of phylogenetic placement. This point is made in the second half of the Goodfire article yet it is not pointed out that this contradicts the first half.

I would be very surprised if Evo etc. do faithfully represent a phylogeny because phylogenies do not contain much information about genomic characteristics except at very broad scales. A better way to determine this would be to build a phylogeny from the representations and compare those, not the distances in the representation space with the phylogenetic distance. Phylogenies are used for their topology and branch lengths, of which phylogenetic distance is a poor summary.

I adjusted the text about the Goodfire discovery a bit, so it hopefull addresses your main concern which is likely valid.

But more generally, the entire thing about that discovery is that manifold is compact and can be detected, not an abstract realization that the model somehow represents distances. I think it is evidence towards the idea that model does use these representations. In any case, that's a finding about how foundation models organize features geometrically, and it has methodological implications for interpretability under any interpretation of what the features themselves represent.

Regardless, the Goodfire result is more for storytelling in my post, not load-bearing evidence.

But more generally, the entire thing about that discovery is that manifold is compact and can be detected, not an abstract realization that the model somehow represents distances. I think it is evidence towards the idea that model does use these representations. In any case, that's a finding about how foundation models organize features geometrically, and it has methodological implications for interpretability under any interpretation of what the features themselves represent.

I agree this is evidence that the model uses these representations ^[1] , I just think it is weak evidence, because you'd expect the same thing to show up even if the model doesn't use these representations. (And given that it is weak evidence, we roughly speaking have to ignore it and stick with our priors from other considerations when judging whether the model uses these representations.) I don't just think that the manifold existing isn't that meaningful, I also think the manifold being "compact" ^[2] or detectable isn't that meaningful either. In my toy naive DNA embedding example, I think one can read off phylogenic distances from projections of points to a completely random low-dimensional subspace. ^[3] When I'm playing a sufficiently high-dimensional video game, I'd guess that there is a simple probe on RAM activations which distinguishes with accuracy whether I'm really tired vs not at all tired. I'm not that familiar with how RAM positions are allocated to variables typically, but I'd pretty strongly guess that usually you can do this with a linear probe in fact, i.e. that there is a one-dimensional subspace of my video game program's activations which captures this property. I would also bet that such a subspace is findable by gradient descent from a random seed. ^[4] And yet: the naive DNA representation doesn't use phylogeny at all, and my video game program doesn't think about whether I'm tired at all.

well, because if the method gave a result worse than this, that would be evidence that the model doesn’t use these representations ↩︎
btw I'm guessing you're just pointing at the manifold being a low-dimensional subspace? ↩︎
Roughly speaking, as long as one can read off phylogenic distances from distances in the big space of dimension , one will also be able to read off distances in a random subspace of dimension , by Johnson-Lindenstrauss. One could claim that the dimension of the subspace one finds in practice is smaller than what one would expect from mechanisms of the sort I'm proposing, but I strongly doubt anyone in this epistemic pipeline has thought this sort of thing through. ↩︎
It isn't quite linear regression, but it's close to it. And linear regression is a convex optimization problem, so (roughly speaking) gradient descent works. ↩︎

Type III Audio reading here is mistakenly voicing instead On Independence Axiom

Wow that is weird. We'll look into it.

Sorry for the hassle—I've just fixed this.

No worries and thank you. Confirming, it is fixed in the article and in the karma podcast feed

The model was trained to predict the next DNA token. Nobody told it about evolution or gave it a phylogenetic tree as a training signal. But the model needed to understand evolutionary relationships in order to predict DNA well, and so it built a structured geometric representation of those relationships as part of its internal computation, and the representation was good enough that you could extract it with interpretability tools and compare it meaningfully to the ground truth.

though I've thought much more about related questions in the past ↩︎

it's good to have similar DNAs assigned similar representations, and one can read off phylogenic distance from similarity

I adjusted the text about the Goodfire discovery a bit, so it hopefull addresses your main concern which is likely valid.

Regardless, the Goodfire result is more for storytelling in my post, not load-bearing evidence.

But more generally, the entire thing about that discovery is that manifold is compact and can be detected, not an abstract realization that the model somehow represents distances. I think it is evidence towards the idea that model does use these representations. In any case, that's a finding about how foundation models organize features geometrically, and it has methodological implications for interpretability under any interpretation of what the features themselves represent.

well, because if the method gave a result worse than this, that would be evidence that the model doesn’t use these representations ↩︎
btw I'm guessing you're just pointing at the manifold being a low-dimensional subspace? ↩︎
Roughly speaking, as long as one can read off phylogenic distances from distances in the big space of dimension , one will also be able to read off distances in a random subspace of dimension , by Johnson-Lindenstrauss. One could claim that the dimension of the subspace one finds in practice is smaller than what one would expect from mechanisms of the sort I'm proposing, but I strongly doubt anyone in this epistemic pipeline has thought this sort of thing through. ↩︎
It isn't quite linear regression, but it's close to it. And linear regression is a convex optimization problem, so (roughly speaking) gradient descent works. ↩︎

Type III Audio reading here is mistakenly voicing instead On Independence Axiom

Wow that is weird. We'll look into it.

Sorry for the hassle—I've just fixed this.

No worries and thank you. Confirming, it is fixed in the article and in the karma podcast feed

56

Extracting Performant Algorithms Using Mechanistic Interpretability

56

A Prequel: The Tree of Life Inside a DNA Language Model

Finding the Manifold

But Does the Extracted Algorithm Actually Work?

How Small Can You Go Though?

So, Mechanistic Interpretability is Becoming Dual Use

Mechanistic Intepretability for Novel Knowledge Discovery

Join In

56

56