SAE It Out Loud: Cross-Model Feature Labeling with NLA Verbalizers

fzaffino

TLDR:

I show that a foreign model's Natural Language Autoencoder (NLA) Activation Verbalizer (AV) can produce plausible explanations for SAE features from a different model. After creating a ridge-regression map bridging the residual stream of Qwen2.5-7B-IT at layer 20 and Gemma-3-27B-IT at layer 41, I mapped 45 SAE decoder directions from a Qwen SAE to Gemma's feature space. I used the released Qwen AV (layer 20) to generate baseline explanations for these features, and then used the Gemma AV (layer 41) to generate explanations of the mapped directions. I compared the cosine similarity of the Qwen AV explanations and Gemma AV explanations to each other, and to a random-feature explanation control. I also propose background washout as a way to improve the generation quality of AV feature explanations.

Background

On May 7th, Anthropic introduced Natural Language Autoencoders (NLAs), which are essentially a pair of LLMs that have been jointly trained to compress residual stream activations into natural language, and then back into activations to check the fidelity of the produced language. The Activation Verbalizer (AV) is an LLM that can take a hidden state and produce a free-text description, while the Activation Reconstructor (AR) maps that same description back to an activation. These models have shown impressive capabilities to verbalize the semantic content captured in the residual stream.

AVs were trained to verbalize full residual stream activations, that is, the cumulative state of the model at a given token position, integrating everything it has processed up to that point. Recently, researchers at Decode Research (the creators of Neuronpedia) found something unexpected when working with AVs. Instead of feeding in residual stream activations, you can feed in a decoder direction from an SAE and, remarkably, obtain a fairly coherent feature explanation (Lin and Chanin, 2026). What's notable about this is that the AV (likely) never saw raw feature directions like this during training, yet clearly still acquired the ability to explain them in natural language. As Johnny and David note in their post, it is currently unclear why exactly this works. I won't theorize about it in this post, but it works nevertheless.

Initially, I assumed that the AV's ability to explain an SAE feature fairly cleanly would be limited to features from the specific layer that the AV was trained on. Anthropic even notes this "layer sensitivity" themselves in their paper on NLAs (Fraser-Taliente et al., 2026).

NLAs read a single layer. If the information relevant to a behavior is not present at the layer the NLA is trained on, the NLA will miss it... Whether production models show similar layer sensitivity is unclear. We could address this by training NLAs to accept multiple layers of activation as input.

My first working experiments where the AV coherently explained an SAE direction used features from Gemma-3-27B-IT at layer 40, while the AV was trained purely on activations from layer 41. While L40 and L41 are just one layer apart, it is notable that the AV can describe a direction from a layer it was not trained on. This made me think a bit more deeply about what the AV might be capable of. If the AV can describe features across layers, could it describe features from a different model entirely?

This may seem like a jump in logic, but it has theoretical grounding. Lan et al. (2024) found that SAE feature spaces from different models share similar underlying geometric structure, even when the individual features don't match up. Essentially, I wanted to see if this mapping idea held in the new context of NLAs being applied to SAE decoder directions. My biggest concern was that both steps, mapping between the latent spaces and turning an SAE direction into an AV explanation, are inherently lossy; there was a genuine possibility that stacking them would simply produce nonsense. Still, I went ahead and tried it anyway.

Experiment

Setup

I used Anthropic's publicly released NLA AVs for Gemma-3-27B-IT at layer 41 (kitft/nla-gemma3-27b-L41-av) and Qwen2.5-7B-IT at layer 20 (kitft/nla-qwen2.5-7b-L20-av). I selected 45 diverse features from Qwen2.5-7B-Instruct at layer 20, drawn from chanind's Matryoshka SAE. These spanned different concept categories (war, music, cooking, medicine, religion, etc.). I grabbed the Neuronpedia explanation labels for these features from their S3 bucket.

I fitted a ridge map (λ=100) from Qwen at layer 20 to Gemma at layer 41 on mean-pooled hidden states from 1000 Wikitext sentences, with 20% of the set held out. The reconstruction cosine on the held-out set was 0.868, indicating a solid alignment between the two spaces.
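A minimal sketch of the ridge-map step described above, using the closed-form ridge solution on synthetic stand-ins for the mean-pooled hidden states. The function names, the toy dimensions, and the absence of an intercept or centering step are assumptions for illustration; the notebook may differ on all of these.

```python
import numpy as np

def fit_ridge_map(X, Y, lam=100.0):
    """Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T Y."""
    d_src = X.shape[1]
    gram = X.T @ X + lam * np.eye(d_src)
    return np.linalg.solve(gram, X.T @ Y)  # shape (d_src, d_tgt)

def mean_row_cosine(A, B, eps=1e-8):
    """Mean cosine similarity between matching rows of A and B."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1) + eps
    return float(np.mean(num / den))

# Synthetic stand-ins for mean-pooled hidden states (hypothetical sizes,
# far smaller than the real Qwen and Gemma residual streams).
rng = np.random.default_rng(0)
n, d_src, d_tgt = 1000, 64, 96
X = rng.normal(size=(n, d_src))                      # "source model" states
true_map = rng.normal(size=(d_src, d_tgt))
Y = X @ true_map + 0.5 * rng.normal(size=(n, d_tgt))  # "target model" states

# Hold out 20% of the sentences, as in the post.
perm = rng.permutation(n)
n_test = n // 5
test_idx, train_idx = perm[:n_test], perm[n_test:]

W = fit_ridge_map(X[train_idx], Y[train_idx], lam=100.0)
heldout_cosine = mean_row_cosine(X[test_idx] @ W, Y[test_idx])

# Map a unit-norm source decoder direction into the target space.
direction = rng.normal(size=d_src)
direction /= np.linalg.norm(direction)
mapped = direction @ W
mapped /= np.linalg.norm(mapped)
```

In this sketch the same matrix `W` that is fitted on full activations is reused to carry a single decoder direction across, which is the step that conditions C1 and C2 rely on.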

Why not use Procrustes? This is a good question. I ran a version of this experiment with Procrustes and, while the feature explanations still looked good qualitatively, the downstream cosine similarities were significantly lower than with the ridge map. I do not understand these techniques well enough to explain why this may be.

Now onto background washout: what is it? Background washout is my attempt to make the SAE decoder direction look more like it came from a real activation, which could help the AV interpret it. The washout vector is simply the mean of normalized last-token activations from Gemma, computed over 100 sentences spanning multiple languages, giving an average "background" vector that the model is used to seeing. Call this b. Normally we might have the AV explain the SAE decoder direction directly (by injecting d̂, the normalized decoder direction) and call it a day. In a separate condition in my main experiment, I instead mix d̂ with the background vector b, renormalize the result, and inject that. I started doing this early on in my experiments with the AV, and the explanations always felt cleaner. The biggest difference I notice, specifically with Gemma's layer 41 AV, is that without washout the AV frequently describes the final token as "e.g.", while with washout the final-token explanation is usually closer to the actual feature label.

Comparing AV feature explanations with/without washout
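The background washout described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the mixing weight `alpha` is hypothetical (the post does not state how the direction and background are weighted), and the activations here are random stand-ins for Gemma's last-token states.

```python
import numpy as np

def unit(v, eps=1e-8):
    """Scale a vector to (approximately) unit L2 norm."""
    return v / (np.linalg.norm(v) + eps)

def background_vector(last_token_acts):
    """Mean of L2-normalized last-token activations, input shape (n_sentences, d)."""
    norms = np.linalg.norm(last_token_acts, axis=1, keepdims=True)
    return (last_token_acts / norms).mean(axis=0)

def washout(direction, background, alpha=0.5):
    """Mix the normalized direction with the background, then renormalize.

    alpha is a hypothetical mixing weight chosen for illustration only.
    """
    mixed = alpha * unit(direction) + (1.0 - alpha) * background
    return unit(mixed)

rng = np.random.default_rng(0)
# Stand-in activations; the shared offset mimics a common "background" component.
acts = rng.normal(size=(100, 32)) + 3.0
b = background_vector(acts)

d = rng.normal(size=32)          # stand-in for a mapped SAE decoder direction
washed = washout(d, b, alpha=0.5)

cos_before = float(unit(d) @ unit(b))
cos_after = float(washed @ unit(b))
```

The washed vector sits closer to the background than the bare direction does (`cos_after` exceeds `cos_before`), which is the sense in which it looks more like a typical activation.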

Now to briefly describe the experimental conditions. For each feature, I evaluate the alignment between three conditions:

B (Native): Qwen's decoder direction -> Qwen AV Explanation
C1 (Cross Plain): Qwen decoder direction -> Qwen-Gemma ridge map -> Gemma AV Explanation
C2 (Cross Washout): Same as C1, but with background-washout

The main metric I use is feature-specific lift, which is the cosine similarity between a feature's cross-AV explanation (either C1 or C2) and its own native-AV explanation (B), minus the mean cosine similarity to all other features' native explanations. This is my way of controlling for the generic stylistic similarity that AV outputs share, as the AV's output follows a fixed structure that naturally inflates cosine similarity. Essentially, feature-specific lift tries to isolate whether the semantic content of the native explanation (B) transferred over to either of the cross-AV explanations (C1/C2). All explanations were embedded using all-MiniLM-L6-v2 so that cosine similarity between them could be computed. Cosine similarity is a slightly crude measure for what I'm attempting to capture, so take these results with a grain of salt.
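To make the lift definition above concrete, here is a minimal sketch. The embeddings are synthetic stand-ins (a shared "style" component plus feature-specific content) rather than real all-MiniLM-L6-v2 outputs, and the function name is an assumption for illustration.

```python
import numpy as np

def feature_specific_lift(cross_emb, native_emb):
    """Per-feature lift: cos(cross_i, native_i) minus mean_j!=i cos(cross_i, native_j).

    Both inputs have shape (n_features, d); row i belongs to feature i.
    """
    cross = cross_emb / np.linalg.norm(cross_emb, axis=1, keepdims=True)
    native = native_emb / np.linalg.norm(native_emb, axis=1, keepdims=True)
    sims = cross @ native.T                      # sims[i, j] = cos(cross_i, native_j)
    n = sims.shape[0]
    own = np.diag(sims)                          # similarity to the matching feature
    others = (sims.sum(axis=1) - own) / (n - 1)  # mean similarity to all other features
    return own - others

rng = np.random.default_rng(0)
n_features, dim = 45, 32
style = 2.0 * rng.normal(size=dim)               # shared "AV format" component
signal = rng.normal(size=(n_features, dim))      # feature-specific content
native = style + signal
cross = style + signal + 0.5 * rng.normal(size=(n_features, dim))

lifts = feature_specific_lift(cross, native)

# Control: break the feature pairing by shuffling the cross explanations.
shuffled = feature_specific_lift(cross[rng.permutation(n_features)], native)
```

Because the shared style component raises every pairwise cosine by a similar amount, subtracting the mean over other features removes most of it, and the shuffled control ends up with a much smaller mean lift than the matched pairing.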

Results

Condition	Mean Lift	95% CI	p
C1 — Plain cross-AV	+0.163	[+0.118, +0.207]	3.9 × 10⁻⁹
C2 — Washout cross-AV	+0.209	[+0.161, +0.257]	3.0 × 10⁻¹¹

Both cross-AV conditions show clear feature-specific lift: on average, the explanation Gemma's AV generates for a mapped feature is more similar to that feature's native Qwen AV explanation than to the native explanations of other features. The washout condition's lift (C2; +0.209) is higher than the plain condition's (C1; +0.163), suggesting that the washout explanations were semantically closer to B's explanations.
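The post does not say how the confidence intervals and p-values in the table were computed. As one possible approach, a percentile bootstrap over the per-feature lifts would give an interval for the mean. The sketch below assumes that method and runs on made-up lift values, not the data from this experiment.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample features with replacement, n_boot times.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [(1 - level) / 2, (1 + level) / 2])
    return float(values.mean()), float(lo), float(hi)

# Synthetic per-feature lifts (NOT the post's data), one value per feature.
rng = np.random.default_rng(1)
fake_lifts = rng.normal(loc=0.16, scale=0.15, size=45)

mean_lift, ci_low, ci_high = bootstrap_mean_ci(fake_lifts)
```

An interval whose lower bound stays above zero would support the same reading as the table: the mean lift is positive beyond what resampling noise over 45 features would explain.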

While numbers are fun, I believe these results are better suited to qualitative interpretation, so here are some comparisons between the B, C1, and C2 AV explanations, alongside the corresponding Neuronpedia feature label (the full set of explanations is available in the notebook).

Neuronpedia Label (A)	Qwen Native AV Explanation (B)	Gemma Cross Plain Explanation (C1)	Gemma Cross Washout Explanation (C2)
weather	Structured weather forecast format with factual, informational tone listing precipitation details and temperature data for a US city. The phrase "The clouds and rain are common in the Northeast during winter storms, with the phrase 'Winter storm' often accompanied by" signals a common weather image or phrase, likely a seasonal description or weather condition clause continuing the pattern of natural phenomena imagery. Final token "the" is an article mid-phrase ("when the weather is..."), part of a clause describing wind conditions ("When the season is cold, the wind comes with"), expecting a noun like "a cold front" or "over the Atlantic" or "conditions bring" or "the highest snow."	Educational/medical article format: a structured informational piece about a celebrity, establishing factual historical narrative about a career. The phrase "The wind is blowing in the United States...The weather is hot" signals a meteorological or seasonal description, typical in a listicle about the 2019 hurricane/rain season. Final token "e.g. " introduces a concrete example list ("..."), requiring a completion like "the" or "flooding" or "in the United States" to specify the location or date of the seasonal description. or "2019" or "of a list" or "on the topic." or "tropical winds." or "images."	Article/news format with a descriptive narrative about a local event, establishing a factual account of a community or weather event. The phrase "Today the weather was hot and humid with winds from the north" signals a meteorological report or seasonal description, typical in a news article about hurricane/storm conditions in Florida. Final token "flooding. The winds blew inland today. Today experienced heavy rainfall. Coastal winds are fading." is a standard meteorological description ("Fialor, coastal..."), strongly expecting a seasonal or geographic descriptor like "The rains..." or "in South Carolina" or "flooding in..." to introduce a weather description. or "June" or "tropical colors."
food preparation	Recipe blog format with helpful tips and instructions, using simple prose to cook a delicious dish with buttered carrots and cream. The sentence structure "In a large saucepan, add your carrots and butter. You can use either..." signals a second step or variable, likely a preparation method or ingredient quantity, continuing the recipe instructions with a suggested action or optional step. Final token "instructions" ends an incomplete clause ("For this amount of water..."), part of a variable setup clause ("Choose a size for your brown sugar or..."), expecting continuation like "you can preheat the oven" or "half the carrots are browned" or "fill a pan with water."	Recipe/instruction format established: a structured culinary article promising a savory, festive salad, signaling step-by-step preparation instructions for a roasted vegetable dish. The instructional pattern "Heat oil in a large skillet. Preheat oven..." signals a standard recipe format, with numbered steps expected for a classic vinaigrette or salad. Final token "e.g." introduces a list of examples of dried herbs ("add..."), immediately requiring a period or completion like "a" or "2" to specify the standard recipe quantity for dried dill. or "mushroom" or "shallots" or "to bake" or "recipe" or "the standard." or "puffy, crispy..."	Recipe/instruction format established: a structured culinary tip post promising a simple salad, signaling a step-by-step preparation method for a roasted vegetable salad. The recipe instruction pattern: "Heat a large skillet. Preheat oven to 400 degrees." signals a standard cooking instruction format, typical of instructional cookbooks. Final token ".". ends a standard recipe instruction opener ("Heat mushrooms in a bowl. Turn to dry..."), immediately expecting a standard recipe step like "The mushrooms" or "puff up" or "dried" to specify the classic baking technique for puff pastry. or " , dotted" or "French, salt and pepper" or "arrange..." or "make."
fashion and style	Arabic language photo description format with structured attributes listing a celebrity's appearance, showing a woman's hairstyle with "black hair" label. The sentence "A model with a short hairstyle" begins a fashion description clause listing fashion choices ("For her look, she wears a sleek look with"), implying a second style element or combination of clothing features like suit or suit jacket. Final token "look" ends mid-description ("with new look, new style"), part of a list of fashion elements ("To dress in: new look, new style"), expecting continuation like "a slim suit" or "a sharp chin" or "a dark suit with a narrow collar" describing the celebrity's overall look.	Article format: educational/listicle structure, establishing a factual, informative register about celebrity hairstyles and relationships. The phrase "The actress is known for her appearance and style" signals a listicle or enumeration of fashion/beauty characteristics, typical in a descriptive article about a celebrity's appearance. Final token "e.g." introduces a concrete example list ("The dress has a..."), requiring a list of fashion descriptors like "a" or "the" to begin a specific style description of the dress. or "2019" or "the article" or "the image below" or "her hair is..." or "classic, elegant" or "feminine,."	Article format: structured listicle/FAQ pattern, establishing a celebrity-related article about a dress style, likely a formal entry about Angelina Jolie. The text promises fashion/beauty content ("The dress features a sleek, elegant design with a white dress..."), signaling a standard celebrity style description of aesthetic characteristics. Final token "a look in a classic style." introduces a fashion description list ("Her hair is a slightly loose...The dress is a modern design with..."), strongly expecting a standard aesthetic descriptor like "gold" or "a subtle gold accent" or "Description of style" to define the iconic Victoria Beckham look.

These are some of the clearest cross-model feature transfer examples I found, but I believe they show how promising it can be to use the AV in this way. While this is great, I do want to highlight a certain feature that showcases a quirk of mapping feature spaces together.

Neuronpedia Label (A)	Qwen Native AV Explanation (B)	Gemma Cross Plain Explanation (C1)	Gemma Cross Washout Explanation (C2)
food and recipes	Final token "this" ends an incomplete clause ("what the food is"), part of the definition of "What does 'food' mean," completing the quoted definition clause "For the food, the ingredients are..." — expects "of the product" or "a specific flavor" or "the original food's boundary."	Final token "e.g." introduces a concrete example list ("..."), immediately requiring a completion like "a list" or "the text" to specify the source or recipe reference. or "mushroom" or "2014" or "mushroom, edible" to specify the culinary/cultural context of the food recommendation. or "sautéed" or "fresh, delicious." or "American."	Final token "a pungent." ends a culinary/food description clause ("Mushrooms are sautéed with..."), strongly expecting a standard disclaimer or recipe/answer beginning, likely "The mushroom" or "consumed" or "fresh" or "sautéed" to specify the culinary context of the food item. or "Note:". or "American/Asian cuisine" or "dried, not poisonous.

When we map feature 1007 from Qwen space to Gemma space, the food and recipes feature becomes a mushroom-preparation feature! If we didn't have the native AV explanation or the Neuronpedia label, we would assume this feature was all about cooking mushrooms. There is clear semantic drift here.

Discussion

Applying NLA AVs to explain SAE feature directions is still in its infancy, but I think these results open up some interesting pathways.

There seems to be evidence that a set of AV-like LLMs could be trained specifically to generate SAE feature explanations, and given these results, probably trained on directions from many layers at once. This would be really interesting to pursue but I sadly don't have the compute for this. I would love to do some AV fine-tuning though.

What's exciting is the implication that we won't need to train a dedicated AV for every single model/layer combination we want explanations for. If you can fit a good-enough map between one of these 'refined' SAE-AVs and whatever feature space you're trying to describe, then a handful of really well-tuned models could be applied broadly.

Limitations

I only tested a single model pair here. I did successfully run an earlier version of this test using Llama-3.1-8B at layer 8 for a feature or two, so I suspect this works across model families, but I cannot confirm it.
Tested on n=45 features. This is probably enough to show that the approach works, but I would have loved to test at least a few hundred more.
Layer depth could play a role here. Notably, the Qwen (L20) and Gemma (L41) layers sit at around the same proportional depth in their respective models. It would be interesting to see if AV explanations degrade depending on how far apart the mapped layers are.

Disclaimer

As a side note, I want to make it clear that I am in no way a subject matter expert in the mech-interp domain. I come from a health-focused background and have landed here mainly out of an interest in AI safety. If this post feels half-baked because it presents results without offering potential explanations for them, that is probably why. I hope you got something from it regardless!

Code Availability

The labelled notebook that I used to run this experiment is available here.

References

Fraser-Taliente, K., Kantamneni, S., Ong, E., Mossing, D., Lu, C., Bogdan, P. C., Ameisen, E., Chen, J., Kishylau, D., Pearce, A., Tarng, J., Wu, A., Wu, J., Zhang, Y., Ziegler, D. M., Hubinger, E., Batson, J., Lindsey, J., Zimmerman, S., & Marks, S. (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. Transformer Circuits Thread. https://transformer-circuits.pub/2026/nla/

Lin, J., & Chanin, D. (2026). Natural Language Autoencoders. The Residual Stream (Neuronpedia blog). https://www.neuronpedia.org/blog/nlas

Lan, M., Torr, P., Meek, A., Khakzar, A., Krueger, D., & Barez, F. (2024). Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders. arXiv:2410.06981. https://arxiv.org/abs/2410.06981