Very cool work! Do you think this approach could also work for protein folding models like AlphaFold, RFDiffusion, Protenix, DISCO, etc.? So far the only thing I've found in the literature is FoldSAE (https://arxiv.org/pdf/2511.22519), and they find only very basic features, like a neuron for alpha helices vs beta sheets.
Thanks! One of the main reasons I work specifically with single-cell FMs is that I believe they learn much richer biology: with them you model the cell at the systems level. There is only so much you can learn from structure models, and, unlike with single-cell models, I think most of the value of structure models is in their outputs.
That said, one can definitely apply similar methods to them. I would be happy to do that myself, but currently it's not a priority for me for the reasons described.
People are rushing to build bigger and bigger single cell foundation models (trained on RNA sequencing data), but in my view we have not extracted even a small fraction of the knowledge and capabilities that already exist inside the models we have today.
To explain what I mean, I want to argue three things in this post, and then show the empirical work behind them.
Thesis 1: Biological foundation models are not like LLMs, and the field's habit of evaluating them the same way is causing us to systematically underestimate what they contain. When you interact with GPT, the surface-level outputs (the text it generates) are a fairly good proxy for the model's capabilities. You can read what it writes and form a reasonable opinion. Biological foundation models are fundamentally different in this respect. A model like Geneformer or scGPT takes a cell's gene expression profile and produces embeddings, predictions of masked genes, or cell type classifications. These surface-level outputs are only a small sliver of what the model is doing internally. The model has been trained on tens of millions of cells, and the representations it has built to solve its training objective contain compressed biological knowledge that never directly appears in any output you can look at. Evaluating these models by their benchmark performance on cell type annotation or perturbation prediction is like evaluating a human scientist by their score on a fill-in-the-blank exam.
Thesis 2: People keep calling biological foundation models "virtual cells," but this label is asserted rather than tested or validated. The term gets used in grant applications, press releases, and even some papers, as though it were an established fact that these models have internalized a working simulation of cellular biology. Maybe they have. Or maybe they have learned sophisticated statistical regularities that look like biology on the surface but dissolve under closer inspection. My work shows these models are, in a meaningful sense, models of cells, but that is an empirical claim that needs empirical treatment.
Thesis 3: The right tools already exist, and they come from the AI safety community's work on mechanistic interpretability. Sparse autoencoders (SAEs), causal circuit tracing, feature ablation, activation patching: these methods were developed to understand language models, largely motivated by alignment concerns. It turns out they are extraordinarily well-suited to biological foundation models, and for a good reason: in language models, when you discover a circuit, you often lack ground truth about whether the circuit is "correct" in any deep sense, because there is no objective external reality that the model's internal computations are supposed to correspond to. In biological foundation models, you have decades of molecular biology, curated pathway databases, genome-scale perturbation screens, and well-characterized regulatory networks to validate against. Biology gives you the ground truth that language lacks. This makes biological FMs arguably the best (real) testbed for mechanistic interpretability methods that currently exists.
What follows is the story of three papers I recently produced, each building on the previous one, in which I applied the SAE-based interpretability toolkit to two until-recently-leading single-cell foundation models (Geneformer V2-316M and scGPT whole-human) and progressively mapped what they know, how they compute, and where their knowledge runs out.
The SAE Atlas
(arXiv:2603.02952)
The first question was very simple: what is inside these models?
Neural networks encode information in superposition. This is well-established in the interpretability literature for language models, but nobody had systematically demonstrated it for biological foundation models or attempted to resolve it.
I trained TopK sparse autoencoders on the residual stream activations of every layer of Geneformer V2-316M (18 layers, d=1152) and scGPT whole-human (12 layers, d=512). The SAEs decompose the dense, superimposed activations into sparse, interpretable features, each of which (ideally) corresponds to a single biological concept. The result was a pair of feature atlases: 82,525 features for Geneformer, 24,527 for scGPT, totaling over 107,000 features across 30 layers.
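The TopK forward pass is simple enough to sketch. Below is a minimal numpy toy with made-up dimensions and random weights (not the models' actual 1152-d or 512-d residual streams, and no trained SAE weights), just to show how TopK sparsity keeps at most k features active per cell:

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a TopK sparse autoencoder.

    x: (batch, d_model) residual-stream activations.
    Returns (reconstruction, sparse feature activations).
    """
    pre = x @ W_enc + b_enc                  # (batch, d_sae) pre-activations
    # Keep only the k largest pre-activations per example, zero the rest.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    acts = np.zeros_like(pre)
    np.put_along_axis(acts, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    acts = np.maximum(acts, 0.0)             # ReLU on the survivors
    recon = acts @ W_dec + b_dec
    return recon, acts

d_model, d_sae, k = 64, 512, 8               # toy sizes, not the real dims
W_enc = rng.normal(scale=0.02, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.02, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))
recon, acts = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
assert (acts > 0).sum(axis=1).max() <= k     # at most k active features per cell
```

Training then minimizes the reconstruction error over residual-stream activations; the hard TopK constraint replaces the L1 sparsity penalty used in earlier SAE work.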
The superposition is massive. 99.8% of the features recovered by the SAEs are invisible to standard linear methods like SVD, meaning that if you tried to understand these models using PCA or similar approaches, you would be looking at 0.2% of the representational structure. This alone should give pause to anyone who thinks they understand what these models are doing based on standard dimensionality reduction.
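One way to make "invisible to linear methods" concrete: project each SAE decoder direction onto the top principal components of the activations and ask how much of it survives. The sketch below uses random data and an illustrative 50% threshold (the paper's exact criterion may differ); it only demonstrates the kind of check involved:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 64))          # toy activations (cells x d_model)
W_dec = rng.normal(size=(512, 64))       # toy SAE decoder directions

# Top principal directions of the raw activations.
_, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
top = Vt[:10]                            # top-10 PCs

# For each SAE feature, the fraction of its (unit-normalized) decoder
# direction lying inside the top-PC subspace; low values mean the
# feature is effectively invisible to PCA/SVD.
dirs = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
visibility = np.linalg.norm(dirs @ top.T, axis=1) ** 2
visible_frac = (visibility > 0.5).mean()
```

With random directions in 64 dimensions, almost no feature clears the threshold; the claim in the papers is that trained SAE features behave much the same way with respect to the models' principal components.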
The features are biologically rich. Systematic annotation against five major databases (Gene Ontology, KEGG, Reactome, STRING, and TRRUST) revealed that 29 to 59% of features map to known biological concepts, with an interesting U-shaped profile across layers: high annotation rates in early layers (capturing basic pathway membership), declining in middle layers (where the model appears to build more abstract, less easily labeled representations), and rising again in late layers (where it reconstructs output-relevant biological categories). The features also organize into co-activation modules (141 in Geneformer, 76 in scGPT), exhibit causal specificity (when you ablate one feature, the downstream effects are concentrated on specific output genes rather than diffusing broadly, with a median specificity of 2.36x), and form cross-layer information highways connecting 63 to 99.8% of features into functional pipelines.
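The annotation step amounts to an enrichment test between each feature's top-activating genes and a database gene set. The exact statistics are not spelled out above; a common choice is a one-sided hypergeometric test, sketched here in pure Python with invented gene sets and a hypothetical 20,000-gene universe:

```python
from math import comb

def hypergeom_pval(overlap, set_a, set_b, universe):
    """P(X >= overlap) when drawing set_a genes from a universe
    containing set_b annotated genes (one-sided enrichment test)."""
    return sum(
        comb(set_b, k) * comb(universe - set_b, set_a - k)
        for k in range(overlap, min(set_a, set_b) + 1)
    ) / comb(universe, set_a)

# Toy example: a feature's top-activating genes vs a hypothetical
# DNA-repair gene set (both lists invented for illustration).
feature_genes = {"BRCA1", "RAD51", "ATM", "CHEK2", "TP53"}
go_term_genes = {"BRCA1", "RAD51", "ATM", "MRE11", "NBN", "CHEK2"}
overlap = len(feature_genes & go_term_genes)
p = hypergeom_pval(overlap, len(feature_genes), len(go_term_genes), 20000)
```

A feature is then labeled with whichever terms survive multiple-testing correction across all databases.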
So far, so encouraging. The models have clearly internalized a great deal of organized biological knowledge: pathways, protein interactions, functional modules, hierarchical abstraction. This looks close to the "virtual cell" story that the field likes to tell.
Mapping the Wiring
(arXiv:2603.01752)
The SAE atlas told us what features exist inside these models. The next question was: how do they interact? What is the computational graph?
I introduced causal circuit tracing for biological foundation models. The method works by ablating an SAE feature at its source layer (setting its activation to zero in the residual stream) and then measuring how every downstream SAE feature across all subsequent layers responds. This gives you directed, signed, causal edges: feature A at layer L causally drives feature B at layer L+k with effect size d and direction (excitatory or inhibitory). This is not correlation, not co-activation, not mutual information, but an intervention.
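Conceptually, the intervention looks like the following numpy toy. Everything here is a stand-in: random weights for the source SAE decoder, a random matrix as a fake "downstream layer," and an arbitrary feature index; nothing comes from the actual models.

```python
import numpy as np

rng = np.random.default_rng(1)

def ablate_feature(resid, acts, W_dec, j):
    """Zero SAE feature j by subtracting its decoder contribution
    (activation times decoder direction) from the residual stream."""
    return resid - np.outer(acts[:, j], W_dec[j])

def cohens_d(a, b):
    """Signed effect size of ablation on one downstream feature."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a - b).mean() / (pooled + 1e-8)

d_model, d_sae = 32, 128
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))   # source SAE decoder
W_down = rng.normal(scale=0.1, size=(d_model, d_sae))  # downstream SAE encoder
layer = rng.normal(scale=0.1, size=(d_model, d_model)) # fake transformer block

resid = rng.normal(size=(64, d_model))
acts = np.maximum(rng.normal(size=(64, d_sae)), 0)

def downstream_feats(r):
    return np.maximum((r @ layer) @ W_down, 0)

base = downstream_feats(resid)
ablt = downstream_feats(ablate_feature(resid, acts, W_dec, j=7))

# One signed, causal edge per downstream feature: negative d means the
# downstream feature *lost* activation when the source was removed
# (an inhibitory-style dependency in the papers' terminology).
edges = np.array([cohens_d(ablt[:, t], base[:, t]) for t in range(d_sae)])
```

Edges that pass a significance threshold across many forward passes become the entries of the causal circuit graph.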
Applied across four experimental conditions, the result was a causal circuit graph of 96,892 significant edges, computed over 80,191 forward passes.
Several properties of this graph were surprising.
Inhibitory dominance. Between 65 and 89% of causal edges are inhibitory: ablating a source feature reduces downstream feature activations. This means that features predominantly encode necessary information. Removing a feature causes the downstream features that depend on it to lose activation, rather than freeing up capacity for other features (which would produce excitatory edges). The model's computational structure is one of mutual dependency, not competition. The excitatory remainder (roughly 11 to 35%, depending on condition) likely reflects disinhibition: removing some features releases others from suppression.
Biological coherence. Of the edges where both source and target have biological annotations, 53% share at least one ontology term. Over half of the model's internal computational pathways connect biologically related features. Specific circuits are directly interpretable as known biological cascades. For instance, in Geneformer, an L0 DNA Repair feature causally drives an L1 DNA Damage Response feature (d = -1.87, 113 shared ontology terms), which in turn connects to an L6 Kinetochore feature (d = -3.47), recapitulating the well-established link between DNA damage detection, repair machinery activation, and mitotic checkpoint engagement. The model has, through training on gene expression data alone, discovered a circuit that molecular biologists needed decades of experimental work to characterize.
Cross-model convergence. When I compared the causal wiring of Geneformer and scGPT (models with different architectures, training data compositions, and training objectives), I found that they independently learn strikingly similar internal circuits. 1,142 biological domain pairs are conserved across architectures at more than 10x enrichment over chance. Even more telling, disease-associated domains are 3.59x overrepresented in this consensus set, meaning the biology that matters most for human health is exactly the biology both models converge on most reliably. Two quite different neural networks, trained independently, wire up the same biology internally, and this convergence is strongest for disease-relevant pathways.
Going Exhaustive and Finding the Dark Matter of Biological Features
(arXiv:2603.11940)
In the third paper, instead of 30 cherry-picked features, I traced every single one of the 4,065 active SAE features at layer 5 in Geneformer, producing 1,393,850 significant causal edges. This is a 27-fold expansion over the selective sampling in Paper 2.
The result overturned several conclusions from the selective analysis. The complete circuit graph reveals a heavy-tailed hub architecture where just 1.8% of features account for disproportionate connectivity. But here is the interesting part: 40% of the top-20 hub features have zero biological annotation. They do not map to any known pathway in GO, KEGG, or Reactome. These are the features the model relies on most heavily for its computations, and they are precisely the ones that our earlier annotation-biased sampling had systematically excluded.
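The hub analysis itself is just degree accounting over the edge list. A toy sketch (invented edges and an invented annotation map, not the real graph) of how one would rank features by connectivity and flag unannotated hubs:

```python
from collections import Counter

# Hypothetical edge list: (source_feature, target_feature, effect_size).
edges = [(0, 5, -1.9), (0, 9, -0.8), (0, 3, 2.1), (1, 5, -0.4),
         (2, 0, -3.1), (2, 9, -0.7), (0, 7, -1.2), (3, 0, 0.9)]

degree = Counter()
for src, tgt, _ in edges:
    degree[src] += 1
    degree[tgt] += 1

# Rank features by total connectivity; the finding above is that a
# small head of this ranking (the hubs) carries a disproportionate
# share of edges, and many hubs carry no pathway annotation at all.
hubs = [f for f, _ in degree.most_common(2)]

annotations = {5: {"GO:0006281"}, 9: {"hsa03440"}}  # toy annotation map
unannotated_hubs = [f for f in hubs if not annotations.get(f)]
```

In this toy graph, feature 0 dominates the degree ranking yet has no annotation, which is exactly the pattern an annotation-filtered pipeline would never surface.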
This has serious methodological implications! If you only interpret features that already have biological labels, you are looking under the streetlight: you will recover known biology and conclude that the model has learned biology, while the features the model actually relies on most heavily sit in the dark, unstudied. Some of these unlabeled hubs may represent novel biological programs that do not map neatly onto existing pathway databases, others may be computational abstractions the model has invented to compress cellular state in ways we have not conceptualized yet. Either way, they are exactly where the most interesting discoveries are likely hiding, and any interpretability pipeline that pre-filters for annotation is structurally incapable of finding them!
The initial SAE atlas had also shown that certain features correlate with differentiation state: some features are more active in mature cells, others in progenitor cells. But that is just correlation; the question that matters for the "virtual cell" claim is whether amplifying a differentiation-associated feature actually pushes a cell's state toward maturity.
It does. Late-layer features (L17) causally push cells toward maturity, while early-layer features push them away from it. The model has learned a layer-dependent differentiation gradient, and we can steer it: amplify a late-layer differentiation feature and the cell's computed state moves toward a more mature phenotype. This is the first causal evidence that these models encode something like a functional developmental program, and it is the closest thing we have to validation of the "virtual cell" metaphor.
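Mechanically, steering reduces to adding a multiple of a feature's decoder direction back into the residual stream. A toy numpy sketch with random weights: the "maturity axis" here is a stand-in (in the real experiment it would come from the difference between mean embeddings of mature and progenitor cells), and feature index 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_sae = 32, 128
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))   # toy SAE decoder

def steer(resid, feature_idx, alpha):
    """Amplify one SAE feature by adding alpha times its decoder
    direction to the residual stream before continuing the forward pass."""
    return resid + alpha * W_dec[feature_idx]

# Stand-in maturity axis: the steered feature's own (normalized) direction.
maturity_axis = W_dec[42] / np.linalg.norm(W_dec[42])

resid = rng.normal(size=(8, d_model))                  # 8 toy cells
before = resid @ maturity_axis
after = steer(resid, feature_idx=42, alpha=5.0) @ maturity_axis
assert (after > before).all()  # every cell moved along the axis
```

In the actual experiment, the projection is onto an independently derived differentiation axis, so the shift is an empirical finding rather than a tautology as in this toy.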
What Does This All Mean?
The good news is that biological foundation models contain far more knowledge than anyone has extracted. Over 107,000 interpretable features, organized into biological pathways, connected by causal circuits that recapitulate known molecular biology, converging across independent architectures. The "virtual cell" metaphor is not baseless; there is real, structured, biologically meaningful computation happening inside these models, and we can identify, map, and even steer it. Yes, a significant part of this knowledge is correlational, but not all of it. And there remains a big problem: at least the previous generation of these models does not learn regulatory networks. See more here.
There is also a clear methodological warning: the features that matter most computationally are disproportionately the ones that lack biological labels. Any future work in this space needs to grapple with the annotation bias problem, or it will keep producing results that confirm what we already knew while missing what we do not.
I am increasingly convinced that there is a big opportunity here. Mechanistic interpretability, developed for AI safety, turns out to be a powerful tool for extracting biological knowledge from foundation models.