I would be careful about training SAEs from scratch on CE loss, since that just moves the superposition into correlated features.
For example, with top-k = 10, two features that consistently co-occur can jointly encode more than two meanings:
[feature1 activation, feature2 activation]
[10, 0] = dog
[0, 10] = cat
[10, 10] = bird
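The pattern above can be sketched as a tiny lookup: each feature looks interpretable alone, but the co-activation carries an extra meaning. (The features and labels are the toy ones from the example, not from a real model.)

```python
# Two "monosemantic" features whose joint activation encodes a third meaning.
codebook = {
    (10, 0): "dog",    # feature1 alone
    (0, 10): "cat",    # feature2 alone
    (10, 10): "bird",  # co-activation encodes a distinct third concept
}

def decode(f1, f2):
    """Look up the meaning of a (feature1, feature2) activation pattern."""
    return codebook.get((f1, f2), "unknown")

# Per-feature analysis would label feature1 "dog" and feature2 "cat",
# but the pair also functions as a combinatorial code for "bird".
print(decode(10, 0))   # dog
print(decode(10, 10))  # bird
```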
One way you can work around this is to switch to a fixed target (like normal SAE training).
You can always push CE loss lower by packing more features into specific co-occurrence patterns. That said, if you train to a CE target (say, CE = 2.4) alongside sparsity losses, it could work! But at that point, you could arguably have just trained a bunch of transcoders (maybe? the result could be somewhat different).
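By "fixed target" I mean something like the standard SAE objective: reconstruct a frozen activation vector rather than chase downstream CE. A minimal sketch (function and coefficient names are illustrative, not from any particular codebase):

```python
import numpy as np

def sae_loss(x, x_hat, latents, l1_coeff=1e-3):
    """Standard fixed-target SAE objective: MSE reconstruction of the
    frozen activation x, plus an L1 sparsity penalty on the latent code."""
    recon = np.mean((x - x_hat) ** 2)    # reconstruction error vs. fixed target
    sparsity = np.mean(np.abs(latents))  # L1 on feature activations
    return recon + l1_coeff * sparsity

# Perfect reconstruction with all-zero latents gives zero loss.
x = np.ones(4)
print(sae_loss(x, x, np.zeros(8)))  # 0.0
```

Because the target is fixed, the model can't keep lowering the loss by inventing new combinatorial codes the way it can with CE.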
KL divergence against a larger model, distillation-style, is probably the best fixed target to train against.
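The distillation target would look something like this: the teacher's output distribution replaces the hard CE labels, and the student minimizes KL against it. A sketch (plain NumPy, shapes and names illustrative):

```python
import numpy as np

def kl_to_teacher(student_logits, teacher_logits):
    """KL(teacher || student): a distillation-style fixed target, where a
    larger model's output distribution stands in for the hard CE labels."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)  # stabilize exponentials
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(teacher_logits)  # teacher distribution (the fixed target)
    q = softmax(student_logits)  # student distribution
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Identical logits -> zero divergence; any mismatch is penalized.
z = np.array([1.0, 2.0, 3.0])
print(kl_to_teacher(z, z))  # 0.0
```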
Hopefully that made sense!
Thanks! Advice much appreciated!
I'm currently allowing up to 12 experts per layer per token-position. So yeah, definitely room for those experts to be collaborating to create combined meanings. I should probably test a lower max-experts-per-layer at some point and see how much that hurts performance.
I should also try to take pairs of commonly co-occurring experts in my trained model and check whether their joint activation patterns encode more distinct meanings than their marginal activation patterns would predict.
If experts are genuinely monosemantic, the joint distribution should be close to the product of the marginals.
If they're participating in combinatorial codes, I should see structure in the joint distribution that isn't present in the marginals.
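The independence check described above can be run as a simple empirical statistic: compare the joint co-activation rate of an expert pair against the product of their marginal rates. A sketch (the firing records below are made up):

```python
import numpy as np

def coactivation_excess(a_active, b_active):
    """Ratio of the empirical joint activation rate of two experts to the
    product of their marginals. ~1.0 => near-independent (consistent with
    monosemantic experts); >>1.0 => the pair co-fires far more often than
    the marginals predict, hinting at a combinatorial code."""
    a = np.asarray(a_active, dtype=float)  # 1 if expert A fired on token i
    b = np.asarray(b_active, dtype=float)  # 1 if expert B fired on token i
    joint = np.mean(a * b)                 # P(A and B), empirically
    expected = np.mean(a) * np.mean(b)     # P(A) * P(B) under independence
    return joint / expected

# Hypothetical firing records over 8 token positions:
a = [1, 0, 1, 0, 1, 0, 1, 0]
b = [1, 0, 1, 0, 0, 1, 0, 1]
print(coactivation_excess(a, b))  # 1.0 => no excess co-activation here
```

In practice you'd also want to look at the structure of what the pair fires on, not just the rate, but a ratio far above 1.0 is a cheap first filter for candidate combinatorial pairs.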
Interpretable by Construction: A Research Bet on SAE-like Expert Architectures
The Bet
You can build a language model architecture whose native decomposition is already close to what sparse autoencoder researchers are trying to recover post-hoc: a large pool of small, sparsely-activated, approximately-monosemantic units whose contributions to the residual stream are individually legible. If the bet pays off, we get interpretability as a structural property of the model rather than a reconstruction problem layered on top of it. If it fails, we learn something specific about why the SAE-style decomposition is harder to build in than to extract, which is itself worth knowing. I've been working on this for a while now, building on the PEER (Parameter Efficient Expert Retrieval) and MONET (Mixture of Monosemantic Experts for Transformers) architectures. This post is a status report and a call for collaborators.
Aspiration
SAEs and sparse expert architectures are aimed at the same target from opposite directions. SAE research starts with a dense trained model and searches for a sparse, monosemantic decomposition of its activations. Expert architectures start with a sparse decomposition built into the weights and try to make the resulting model competitive. The interesting question is whether the second direction can reach the destination the first direction is aiming at — and at what training-efficiency cost. I want to be clear that my current architecture is not there yet. "Interpretable by construction" is the guiding vision, not a property I've demonstrated.
What the architecture currently gives me is:
A hierarchical routing mechanism (a mixture of expert pools, each containing a population of tiny, intended-to-be-monosemantic experts) that produces domain-level specialization without supervision. Expert pools cluster around code, biomedical text, academic citations, and so on. The small, independently parameterized rank-1 experts each implement a function simple enough to characterize directly.
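A rank-1 expert of this kind is simple enough to write down in full: it reads the residual stream along one direction and writes along another, so the pair of vectors (u, v) is the entire function. A sketch (the ReLU gate and dimensions are illustrative choices, not the exact architecture):

```python
import numpy as np

def rank1_expert(x, u, v):
    """A tiny rank-1 expert: project the residual-stream vector x onto a
    read direction u, gate with a nonlinearity, and write along v.
    The pair (u, v) fully characterizes what the expert computes."""
    pre = np.dot(u, x)   # scalar read-out from the residual stream
    act = max(pre, 0.0)  # ReLU gate (illustrative nonlinearity)
    return act * v       # rank-1 write back into the residual stream

x = np.array([1.0, -1.0, 0.5, 0.0])  # residual-stream vector
u = np.array([1.0, 0.0, 0.0, 0.0])   # reads the first coordinate
v = np.array([0.0, 0.0, 0.0, 2.0])   # writes to the last coordinate
print(rank1_expert(x, u, v))  # [0. 0. 0. 2.]
```

This is what "simple enough to characterize directly" buys you: inspecting u tells you what the expert responds to, and inspecting v tells you what it does about it.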
Still To Do
What it does not yet give me, and what "SAE-like" would actually require:
Monosemanticity at the unit level
My goal is feature-level monosemanticity: functional legibility of individual experts. Knowing what an expert tends to fire on is not equivalent to knowing what it computes.
Strong causal faithfulness
Topic correlations are the easy version of the claim. The harder version is that the expert's learned function explains its behavioral contribution mechanistically.
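One way to probe the harder version of the claim is to measure an expert's causal effect directly: ablate its residual-stream write and look at the resulting logit delta, then check whether that delta matches what the expert's learned function predicts. A rough sketch with a hypothetical unembedding (names and sizes are made up for illustration):

```python
import numpy as np

def expert_logit_effect(expert_write, unembed):
    """Causal effect of one expert on the output logits: the logit delta
    that vanishes when its residual-stream write is ablated. A mechanistic
    explanation of the expert should predict this delta, not just
    correlate with the topics it fires on."""
    return unembed @ expert_write  # linear path: write -> logits

# Hypothetical 3-token vocab over a 4-dim residual stream.
unembed = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])
write = np.array([0.0, 0.0, 2.0, 1.0])
print(expert_logit_effect(write, unembed))  # [0. 0. 3.]
```

This only covers the direct linear path; effects routed through later layers need proper activation patching, which is where "topic correlation" and "mechanistic contribution" can really come apart.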
Competitive performance at scale
My experiments so far have been sub-1B-parameter training runs, each under 24 hours on one or two GPUs. The trends on my tiny prototypes look promising, but I won't have confidence that this will scale to hundreds of billions of parameters until I see it work at at least the 8B scale.
So the project is best understood as a wager that architectural pressure toward sparsity and specialization can produce a model where the SAE-style decomposition is not only free, but fundamentally part of the causal mechanism. I have enough early evidence to think the bet is promising; I don't have enough to be confident it will work in full and at scale.