Explaining SAE Features With Foreign Natural Language Autoencoders

fzaffino

TLDR:

I show that a foreign model's Natural Language Autoencoder (NLA) Activation Verbalizer (AV) can produce plausible explanations for SAE features from a model it was never trained on.

After fitting a ridge-regression map between the residual streams of Qwen2.5-7B-IT at layer 20 and Gemma-3-27B-IT at layer 41, I mapped 45 SAE decoder directions from a Qwen SAE into Gemma's activation space. I used the released Qwen AV to generate baseline explanations for these features, and then used the Gemma AV to generate explanations of the mapped directions.

I measured the cosine similarity between the Qwen AV explanations and the Gemma AV explanations, and compared both against a random-direction explanation control. The Gemma AV's interpretations of the mapped directions were much closer to the Qwen AV's native interpretation of the same feature than to any other feature's explanation, with a mean feature-specific lift of +0.21 over a set of random-direction control explanations.

I also propose background washout as a way to improve the generation quality of AV feature explanations. Background washout seems to make SAE decoder explanations less influenced by random model quirks and behaviours.

The notebook for the experiment can be found here.

Framing

The goal of this post is to show that NLA Activation Verbalizers aren't rigidly tied to the model and layer that they were trained on, which is something I've commonly seen presumed since these models were released. You can fit a cheap map between two models' residual streams, and one model's AV can produce coherent explanations of another model's Sparse Autoencoder (SAE) feature decoder directions. I show this by mapping feature decoder directions from Qwen2.5's activation space to Gemma-3's activation space and using a Gemma-specific AV to explain them. I find that the generated explanations reliably track the original feature's meaning. If this finding generalizes, it matters because it implies that we don't need to train a separate AV for each model/layer combo. A few well-tuned AVs with a good enough linear map could actually cover a lot of ground - making AV-based interpretability cheaper to scale.

Background

On May 7th, Anthropic introduced Natural Language Autoencoders (NLAs), which are essentially a pair of LLMs that have been jointly trained to compress residual stream activations into natural language, and then back into activations to check the fidelity of the produced language. The Activation Verbalizer (AV) is an LLM that can take a hidden state and produce a free text description, while the Activation Reconstructor (AR) maps that same description back to an activation. These models have shown impressive capabilities to verbalize the semantic content captured in the residual stream.

AVs were trained to verbalize full residual stream activations, that is, the cumulative state of the model at a given token position, integrating everything it has processed up to that point. Recently, the team at Decode Research (the creators of Neuronpedia) found something unexpected when working with AVs. Rather than feeding in residual stream activations, you can feed in a decoder direction from a sparse autoencoder (SAE) and, remarkably, obtain a pretty coherent feature explanation (Lin and Chanin, 2026). What's notable about this is that the AV (likely) never saw raw feature directions like this during training, yet it clearly still acquired the ability to explain them in natural language. As Johnny and David note in their post, it is currently unclear why exactly this works, and I won't really theorize about it in this post, but it works nevertheless.

Initially, I assumed that this ability of the AV to somewhat cleanly explain an SAE feature would be limited to features coming from the discrete layer that the AV was trained on. Anthropic even notes this "layer sensitivity" themselves in their paper on NLAs (Fraser-Taliente et al., 2026).

"NLAs read a single layer. If the information relevant to a behavior is not present at the layer the NLA is trained on, the NLA will miss it... Whether production models show similar layer sensitivity is unclear. We could address this by training NLAs to accept multiple layers of activation as input."

My first working experiments where the AV coherently explained an SAE direction used features from Gemma-3-27B-IT at layer 40, while the AV was trained purely on activations from layer 41. While L40 and L41 are just one layer apart, it is notable that the AV can describe a direction from a layer it was not trained on. This caused me to think a bit deeper about what the AV might be capable of. If the AV can describe features across layers, could it describe features from a different model entirely?

This may seem like a jump in logic, but it has theoretical grounding. Lan et al. (2024) found that SAE feature spaces from different models share similar underlying geometric structure, even when the individual features don't match up. Essentially, I wanted to see if this mapping idea held in the new context of NLAs being applied to SAE decoder directions. My biggest concern was that both steps are inherently lossy: the map between latent spaces, and the step from SAE direction to AV explanation. There was a genuine possibility that stacking these techniques would simply produce nonsense. Still, I went ahead and tried it anyway.

Experiment

Setup

I used Anthropic's publicly released NLA AVs for Gemma-3-27B-IT at layer 41 (kitft/nla-gemma3-27b-L41-av) and Qwen2.5-7B-IT at layer 20 (kitft/nla-qwen2.5-7b-L20-av). I selected 45 diverse features from Qwen2.5-7B-Instruct at layer 20, drawn from chanind's Matryoshka SAE. These spanned different concept categories (war, music, cooking, medicine, religion, etc.). I grabbed Neuronpedia explanation labels for these features from their S3 bucket.

I fitted a ridge map (λ=100) from Qwen at layer 20 to Gemma at layer 41 on mean-pooled hidden states from 1000 Wikitext sentences, with 20% of the set held out. The reconstruction cosine on the held-out set was 0.868, indicating a solid alignment between the two spaces.
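To make the mapping step concrete, here is a minimal sketch of a closed-form ridge map with an 80/20 split and a held-out reconstruction cosine. It is not the notebook's code: the random toy data stands in for the mean-pooled hidden states, the function names (`fit_ridge_map`, `mean_cosine`, `map_direction`) are hypothetical, and centering the data and applying only the linear part of the map to a decoder direction are my assumptions about one reasonable way to do this.

```python
import numpy as np

def fit_ridge_map(X, Y, lam=100.0):
    """Closed-form ridge regression from source activations X (n, d_src)
    to target activations Y (n, d_tgt). Centering is an assumption."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # W = (Xc^T Xc + lam * I)^-1 Xc^T Yc
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ Yc)
    return W, x_mean, y_mean

def mean_cosine(A, B):
    """Mean row-wise cosine similarity between two (n, d) matrices."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A * B).sum(axis=1).mean())

def map_direction(direction, W):
    """Map a decoder direction with the linear part only, then renormalize.
    Dropping the offset for directions is an assumption of this sketch."""
    mapped = direction @ W
    return mapped / np.linalg.norm(mapped)

# Toy stand-in data; real hidden sizes are much larger.
rng = np.random.default_rng(0)
n, d_src, d_tgt = 1000, 32, 48
X = rng.normal(size=(n, d_src))
Y = X @ rng.normal(size=(d_src, d_tgt)) + 0.5 * rng.normal(size=(n, d_tgt))

split = int(0.8 * n)  # hold out the last 20%
W, x_mean, y_mean = fit_ridge_map(X[:split], Y[:split], lam=100.0)
Y_pred = (X[split:] - x_mean) @ W + y_mean
heldout_cosine = mean_cosine(Y_pred, Y[split:])

mapped_direction = map_direction(rng.normal(size=d_src), W)
print(f"held-out reconstruction cosine: {heldout_cosine:.3f}")
```

The held-out cosine on this toy data says nothing about the real models; it only shows where a figure like the one reported above would come from.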

Why not use Procrustes alignment? This is a good question. I ran a version of this experiment with Procrustes and, while feature explanations still looked good qualitatively, the downstream cosine similarities were substantially lower than when I used the ridge map. I do not understand these techniques well enough to explain why this may be.

Now onto background washout: what is it? Background washout is my attempt to make the SAE decoder direction look more like it came from a real activation, which could potentially assist the AV with its interpretation.

The washout is simply the mean of normalized last-token activations from Gemma, computed over 100 sentences spanning multiple languages, resulting in an average "background" vector that the model is used to seeing. Call this $b$. While we might normally have the AV explain the SAE decoder direction and call it a day (by injecting $\hat{d}$), I have a separate condition in my main experiment where I mix the SAE decoder direction with the background vector and renormalize it, injecting the renormalized mixture of $\hat{d}$ and $b$ instead, where $\hat{d}$ is the normalized decoder direction.

I started doing this early on in my experiments with the AV and the explanations always felt cleaner. The biggest difference I notice, specifically working with Gemma's layer 41 AV, is that without washout, the AV loves to describe a final token as "e.g." quite frequently, while, with washout, the final token explanation is usually something closer to the actual feature label.
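As a rough illustration of background washout, the sketch below builds a background vector from normalized last-token activations, mixes it with a unit decoder direction, and renormalizes. The mixing weight `alpha` is hypothetical (this post does not state the weight used), the toy activations are random stand-ins for Gemma's, and the function names are my own.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def background_vector(last_token_acts):
    """Mean of unit-normalized last-token activations: (n_sentences, d) -> (d,)."""
    normed = last_token_acts / np.linalg.norm(last_token_acts, axis=1, keepdims=True)
    return normed.mean(axis=0)

def washout(direction, background, alpha=0.5):
    """Mix a decoder direction with the background vector and renormalize.
    alpha is a hypothetical mixing weight, not a value from the post."""
    return unit(alpha * unit(direction) + (1.0 - alpha) * background)

# Toy stand-in: 100 "sentences" whose activations share a common offset.
rng = np.random.default_rng(0)
d_model = 64
acts = rng.normal(size=(100, d_model)) + 3.0 * rng.normal(size=d_model)

b = background_vector(acts)
decoder_direction = rng.normal(size=d_model)

plain_input = unit(decoder_direction)       # what the plain condition injects
washed_input = washout(decoder_direction, b)  # what the washout condition injects
```

Both vectors have unit norm, so the only thing that changes between the two conditions is the direction the AV sees.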

Comparing AV feature explanations with/without washout

Now to briefly describe the experimental conditions. For each feature, I evaluate the alignment between three conditions:

B (Native): Qwen's decoder direction -> Qwen AV Explanation
C1 (Cross Plain): Qwen decoder direction -> Qwen-Gemma ridge map -> Gemma AV Explanation
C2 (Cross Washout): Same as C1, but with background-washout

The main metric I use is feature-specific lift, which is the cosine similarity between a feature's cross-AV explanation (either C1 or C2) and its own native-AV explanation (B), minus the mean cosine to all other features' native explanations. This is my way of controlling for the generic stylistic similarity that AV outputs share, as the AV's output follows a fixed format that naturally inflates the cosine similarity. Essentially, the feature-specific lift tries to isolate whether the semantic content of the native explanation (B) transferred over to either of the cross-AV explanations (C1/C2). Explanations from all conditions were embedded using all-MiniLM-L6-v2 so that cosine similarities between them could be computed. Cosine similarity is a slightly crude measure of what I'm attempting to capture, so take these results with a grain of salt.
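The lift definition above can be sketched in a few lines. This is an illustrative version, not the notebook's code: `feature_specific_lift` is a hypothetical name, and the random vectors below stand in for sentence embeddings of the explanations (with the sentence-transformers library, these would typically come from something like `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`).

```python
import numpy as np

def feature_specific_lift(cross_emb, native_emb):
    """Per-feature lift: cosine(cross_i, native_i) minus the mean of
    cosine(cross_i, native_j) over all j != i. Inputs are (n_features, d)."""
    cross = cross_emb / np.linalg.norm(cross_emb, axis=1, keepdims=True)
    native = native_emb / np.linalg.norm(native_emb, axis=1, keepdims=True)
    sims = cross @ native.T                    # (n, n) cosine matrix
    n = sims.shape[0]
    own = np.diag(sims)                        # matched-feature similarity
    others = (sims.sum(axis=1) - own) / (n - 1)  # mean over mismatched features
    return own - others

# Toy stand-in embeddings: a shared "style" component plus feature-specific content.
rng = np.random.default_rng(0)
n_features, d = 45, 16
style = 2.0 * rng.normal(size=d)
native_emb = style + rng.normal(size=(n_features, d))
cross_emb = native_emb + 0.3 * rng.normal(size=(n_features, d))

lift = feature_specific_lift(cross_emb, native_emb)
print(f"mean lift on toy data: {lift.mean():.3f}")
```

Because the shared style component raises every entry of the similarity matrix, subtracting the off-diagonal mean removes it and leaves only the part specific to the matched feature.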

Results

Condition	Mean Lift	95% CI	p
C1 — Plain cross-AV	+0.163	[+0.118, +0.207]	3.9 × 10⁻⁹
C2 — Washout cross-AV	+0.209	[+0.161, +0.257]	3.0 × 10⁻¹¹

Both cross-AV conditions show clear feature-specific lift. The explanation Gemma's AV generates for a mapped feature is much more similar to the native explanation of that same feature than to any other feature's. The washout condition's lift (C2; +0.209) is higher than the plain condition's (C1; +0.163), implying that the washout explanations were more semantically similar to B's explanations.
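For readers who want to attach an interval to a mean lift, one option is a percentile bootstrap over the per-feature lifts. This is only a sketch of one possible procedure: the post does not say how the intervals and p-values in the table were computed, the toy lift values are made up for illustration, and `bootstrap_mean_ci` is a hypothetical helper.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample features with replacement and record each resample's mean.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(boot_means, [tail, 1.0 - tail])
    return float(values.mean()), float(lo), float(hi)

# Toy per-feature lifts (illustrative only, not the experiment's data).
rng = np.random.default_rng(1)
toy_lifts = rng.normal(loc=0.2, scale=0.15, size=45)

mean_lift, ci_lo, ci_hi = bootstrap_mean_ci(toy_lifts)
print(f"mean lift {mean_lift:+.3f}, 95% CI [{ci_lo:+.3f}, {ci_hi:+.3f}]")
```

With only 45 features the interval is fairly wide, which is one reason the qualitative comparisons are worth reading alongside the numbers.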

While numbers are fun, I believe these results are better suited to qualitative interpretation, so here are some comparisons between the B, C1, and C2 AV explanations, with the corresponding Neuronpedia feature label. These are truncated to just contain the final token explanation onwards (full set of explanations available at the bottom of the notebook).

B vs C1 vs C2 AV explanations for feature 1748

B vs C1 vs C2 AV explanations for feature 982 - Neuronpedia label "food preparation"

B vs C1 vs C2 AV explanations for feature 1947 - Neuronpedia label "fashion and style"

These are some of the clearest cross-model feature transfer examples I found, but I believe they show how promising it can be to use the AV in this way. While this is great, I do want to highlight a certain feature that showcases a quirk of mapping feature spaces together.

B vs C1 vs C2 AV explanations for feature 1007 - Neuronpedia label "food and recipes"

When we map feature 1007 from Qwen space to Gemma space, the food-preparation feature becomes a mushroom-preparation feature! If we didn't have the native AV or the Neuronpedia label, we would assume this feature was all about cooking mushrooms. There is some clear semantic drift here.

Discussion

Applying NLA AVs to explain SAE feature directions is still in its infancy, but I think these results open up some interesting pathways.

There seems to be evidence that a set of AV-like LLMs could be trained specifically to generate SAE feature explanations, and given these results, probably trained on directions from many layers at once. This would be really interesting to pursue but I sadly don't have the compute for this. I would love to do some AV fine-tuning though.

What's exciting is the implication that we won't need to train a dedicated AV for every single model/layer combination we want explanations for. If you can fit a good-enough map between one of these 'refined' SAE-AVs and whatever feature space you're trying to describe, then a handful of really well-tuned models could be applied broadly.

Limitations

I only tested a single model pair here. I do want to note that I successfully ran an earlier version of this test using Llama-3.1-8B at layer 8 for one or two features, so I suspect that this works across different model families, but I cannot confirm it.
I only tested on 45 features. Probably enough to show that this works but I would've loved to have done this with at least a few hundred more.
Layer depth could possibly play a role here. Notably, these Qwen (L20) and Gemma (L41) layers sit at around the same proportional depth in their respective models. It would be interesting to see if AV explanations degrade depending on how far apart the mapped layers are.

Disclaimer

As a side note, I want to make it clear that I am in no way a subject matter expert in the mech-interp domain. I come from a health-focused background and have landed here mainly out of an interest in AI safety. If this post feels half-baked because it presents results without putting forth potential explanations for them, that is probably why. I hope you got something from it regardless!

It is also important to note that I iterated on the notebook pretty heavily using Claude. I have manually checked each portion of the notebook and verified the outputs.

Code Availability

The labelled notebook that I used to run this experiment is available here.

References

Fraser-Taliente, K., Kantamneni, S., Ong, E., Mossing, D., Lu, C., Bogdan, P. C., Ameisen, E., Chen, J., Kishylau, D., Pearce, A., Tarng, J., Wu, A., Wu, J., Zhang, Y., Ziegler, D. M., Hubinger, E., Batson, J., Lindsey, J., Zimmerman, S., & Marks, S. (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. Transformer Circuits Thread. https://transformer-circuits.pub/2026/nla/

Lin, J., & Chanin, D. (2026). Natural Language Autoencoders. The Residual Stream (Neuronpedia blog). https://www.neuronpedia.org/blog/nlas

Lan, M., Torr, P., Meek, A., Khakzar, A., Krueger, D., & Barez, F. (2024). Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders. arXiv:2410.06981. https://arxiv.org/abs/2410.06981

enricobottazzi (1mo)

Cool post! I have two questions:

Did you also calculate the mean lift in a scenario D, let's call it Native - cross layer, in which Qwen's decoder direction is passed to an AV at a different layer of the same model? It would be cool to compare it with the mean lifts in scenarios C1 and C2.
Could the same mapping that you applied between the two models' residual streams be used to tell whether these two models developed similar features? For example, does that mean that we can overlap the SAE decoder directions obtained from two foreign models? I'm new to this research field so maybe this is something obvious...

ktalreja (2mo)

Super interesting! I wonder how easy it would be to port an NLA from one model to another at scale. Also, did you have any thoughts about using another type of SAE (ReLU, TopK)?
