Sparse Efficiency vs. Superposition: The Interpretability Tradeoff

hillz

Today’s frontier models train in an expensive style: dense forward passes, huge matrix multiplies, and broad weight updates.

The human brain (~5 MWh over 28 years) is an existence proof that learning can be vastly more energy efficient - about 10,000x - than modern AI training runs (https://coefficientgiving.org/research/how-much-computational-power-does-it-take-to-match-the-human-brain/).

The human brain does not achieve this by activating everything all at once. Normal cognition is extremely sparse, local, and conditional. Different circuits are recruited for different tasks; learning updates are distributed unevenly; and “everything firing at once” is not intelligence - it is closer to a seizure.

When I look at strategies like mixture-of-experts, I see them as one small step on a potentially very long path toward more brain-like efficiency: sparse routing, specialized sub-networks, and very segmented or distributed updates, rather than running and updating the whole system uniformly for every example. (In the future, GPUs may be used in new and clever ways, as they work great for dense updates).

But there is also a real tension here. Anthropic has done awesome research showing that a big reason neural networks are so powerful is because they are able to use superposition: a dense shared representational space can compress multiple rare, mostly non-overlapping features into the same neurons / activation space.

That is part of why dense models are so powerful. If you segment a model too aggressively into isolated experts, imo you will lose some of that compression benefit, because each expert sees a narrower slice of the world and has fewer opportunities to reuse the same internal space across many non-overlapping contexts.

That tradeoff is also interesting from a safety perspective. Superposition makes interpretability research difficult (though again, Anthropic is doing cool stuff here).

I think more segmented architectures will weaken superposition, and in doing so they may also make models easier to inspect, audit, constrain, and understand.

I’m curious whether there is a workable middle path: models that get far more efficient by moving away from today’s uniformly dense training regime, while still preserving enough shared representation to remain powerful - and perhaps becoming more interpretable and governable along the way.