How a failed experiment broke (and fixed) my view on feature labels

enricobottazzi

TL;DR
In this document, I propose baez, a new feature label generation method that uses NLA explanations instead of activation examples. The codebase can be found here.
In the experiment, labels generated via baez and eleuther_acts_top5 are compared across various evals. The results show that baez ≈ eleuther_acts_top5 across all the evals, despite using different inputs (NLA explanations vs. activation examples). Perhaps more surprisingly, the recorded scores are very close to chance, suggesting that either the label generation methods or the evals are somewhat broken.
In the (vast) conclusion, I propose a new classification of features into four categories: input, output, cross, and obscure. Based on that classification, it is possible to build a tier-based score-then-label protocol that enables cheap and targeted feature labeling.
The whole experiment should read like a brainstorming journal, in which a failed preliminary test serves as a springboard for pitching a new, yet untested, idea.

Introduction

When exploring databases of Sparse Autoencoder (SAE) features, such as Neuronpedia, labels are like variable names: they allow you to quickly make sense of the concept a feature expresses. The standard process for producing a feature label is to examine the top-activation examples for a given SAE feature and discern a common thread among them. To scale the process to hundreds of thousands of labels, LLMs are typically employed (Figure 1). Feature label generation methods employed by Neuronpedia, such as oai_token-act-pair, eleuther_acts_top20, np_max-act, and np_max-act-logits, work this way.

Anthropic recently released Natural language autoencoders (NLAs). NLAs, specifically their activation verbalizer (AV) component, translate an activation vector into a natural-language text explanation.

Neuronpedia introduced a “super unscientific” method (Figure 2) for producing labels by feeding the feature vector into an NLA^[1]. This method is considered speculative because NLAs are trained to provide a natural-language description of activation layers, not feature vectors. Nevertheless, the results look reasonable. Furthermore, this method is cheaper than the previous one, given the smaller input token budget.

I like the NLA AV method because, compared to the industry-standard autointerp methods, it breaks the assumption that a feature label must be derived from its activation examples. Research has shown that middle layers are often associated with abstract patterns that might diverge from the input tokens and not clearly manifest in the output tokens. This suggests that the labels of features activated at these layers cannot be derived solely from activation examples.

A natural follow-up to the recent Neuronpedia effort is to leverage NLA explanations to generate feature labels in a different and more natural way.

Assuming the existence of an NLA and an SAE both trained at the $l$-th activation layer of a model, it is possible to feed the top activation examples for a given SAE feature to the model, extract the corresponding activations at the $l$-th activation layer, and pass them to the AV to obtain natural language explanations.

The new feature-label generation method I propose, baez, feeds an autointerp LLM with NLA explanations rather than activation examples (Figure 3).

More specifically, the method proceeds, for a given feature extracted at the $l$-th activation layer, as follows:

Fetch the top activation examples
Truncate each example at the top-activating token
Feed each example to the model, extract the activation corresponding to the last token at the $l$-th activation layer, and feed it into the NLA-AV to obtain the corresponding NLA explanation^[2]
Feed an autointerpreter LLM with the NLA explanations, each paired with an activation score normalized to an integer in [1, 10]
Fetch the feature label from the response
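
The truncation and score-normalization steps above are mechanical, so a minimal sketch may help. The helper names and the linear scaling by the maximum activation are assumptions made for illustration; the baez codebase may normalize differently.

```python
from typing import List, Tuple


def truncate_at_top_token(tokens: List[str], activations: List[float]) -> Tuple[List[str], float]:
    """Keep the tokens up to and including the top-activating one.

    Returns the truncated token list and the peak activation value.
    """
    top_idx = max(range(len(activations)), key=activations.__getitem__)
    return tokens[: top_idx + 1], activations[top_idx]


def normalize_to_1_10(value: float, max_value: float) -> int:
    """Map an activation to an integer in [1, 10].

    Assumption: linear scaling by the largest peak activation, clipped to [1, 10].
    """
    if max_value <= 0:
        return 1
    scaled = round(10 * value / max_value)
    return min(10, max(1, scaled))


def build_prompt_lines(explanations: List[str], peaks: List[float]) -> List[str]:
    """Pair each NLA explanation with its normalized activation score."""
    top = max(peaks)
    return [f"[{normalize_to_1_10(p, top)}] {e}" for e, p in zip(explanations, peaks)]


tokens = ["how", "to", "sell", "on", "Redbubble", "today"]
acts = [0.0, 0.0, 0.1, 0.0, 4.0, 0.5]
truncated, peak = truncate_at_top_token(tokens, acts)
print(truncated, peak)
```

The resulting lines would then be placed in the autointerpreter prompt; obtaining the NLA explanations themselves requires the model and the AV and is not shown here.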

Experiment methodology

To determine whether baez is an effective label generation method, it needs to be compared with other available methods. The experiment relies on the SAE features from Neuronpedia's gemma-3-27b-it/41-gemmascope-2-res-262k and on NLA explanations for the same activation layer, obtained from kitft/nla-gemma3-27b-L41-av. The baez and delphi libraries are used to generate and score labels.

The experiment proceeds as follows:

Sample 40 random features among those with at least 15 non-zero activation examples. The activation examples for each feature are split into train (the top 5 examples) and test (the 6th to the 15th top activation examples)
For each feature and its corresponding train dataset, generate labels via the following methods, using anthropic/claude-sonnet-4.5 as the explanation model:
1. baez
2. baez_last^[3]
3. eleuther_acts_top5^[4]
For each feature and each label (120 datapoints), score the label via the following evals^[5], using anthropic/claude-sonnet-4.6 as the scorer model and all-MiniLM-L6-v2 as the sentence embedder:
1. detection
2. fuzz
3. embedding
Aggregate scores by label generation method and quantitatively analyze the results
Feed the labels generated via baez and eleuther_acts_top5 into a sentence embedder to find the features whose labels are the most distant in terms of cosine similarity. Pick one among the most distant ones and qualitatively analyze it
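
The feature sampling and the train/test split from the first step can be sketched as follows. The dictionary layout (a `max_activation` field per example) and the function names are hypothetical and only serve the illustration.

```python
import random
from typing import Dict, List, Tuple

Example = dict  # assumed shape: {"text": str, "max_activation": float}


def split_train_test(
    examples: List[Example], n_train: int = 5, n_total: int = 15
) -> Tuple[List[Example], List[Example]]:
    """Rank examples by peak activation; top 5 go to train, 6th to 15th go to test."""
    ranked = sorted(examples, key=lambda e: e["max_activation"], reverse=True)
    return ranked[:n_train], ranked[n_train:n_total]


def sample_features(
    examples_by_feature: Dict[int, List[Example]],
    n_features: int = 40,
    min_examples: int = 15,
    seed: int = 0,
) -> List[int]:
    """Sample feature ids among those with enough non-zero activation examples."""
    eligible = sorted(
        fid
        for fid, exs in examples_by_feature.items()
        if sum(1 for e in exs if e["max_activation"] > 0) >= min_examples
    )
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(eligible, min(n_features, len(eligible)))


exs = [{"text": f"t{i}", "max_activation": float(i)} for i in range(1, 21)]
train, test = split_train_test(exs)
print(len(train), len(test))
```

Note that this split makes the test set strictly lower-activating than the train set, which matters when interpreting the scores.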

Results

All the data collected is available in this folder.

Quantitative

Figure 4 summarizes the mean score of the 40 labels generated by each label generation method across the three evals.

The detection and fuzz evals work by feeding a scorer LLM the feature label along with 20 randomly shuffled examples, sampled from positive and negative examples for that feature. For each example, the scorer is tasked with determining whether it is an activation example for the feature (return 1) or not (return 0). True positives ($TP$) count the occurrences in which a positive snippet is presented and the scorer correctly says 1. True negatives ($TN$) count the occurrences in which a negative snippet is presented and the scorer correctly says 0. The final score is computed as $\frac{TP + TN}{N}$, where $N$ is the total number of examples presented.
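
To make the "close to chance" reading concrete, here is a small sketch of the scoring arithmetic. It assumes the final score is plain accuracy over the presented examples, which is my reading of the description rather than a quote of the delphi implementation; the function name is hypothetical.

```python
from typing import List


def detection_score(truth: List[int], predictions: List[int]) -> float:
    """Accuracy over the presented examples: (TP + TN) / N.

    truth[i] is 1 for a positive (activating) example and 0 for a negative one;
    predictions[i] is the scorer's 0/1 answer for the same example.
    """
    if len(truth) != len(predictions) or not truth:
        raise ValueError("truth and predictions must be non-empty and equally long")
    tp = sum(1 for y, p in zip(truth, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(truth, predictions) if y == 0 and p == 0)
    return (tp + tn) / len(truth)


# 10 positive and 10 negative examples, as in a balanced batch of 20.
truth = [1] * 10 + [0] * 10
always_yes = [1] * 20  # a scorer that ignores the label entirely
print(detection_score(truth, always_yes))
```

On a balanced batch, a scorer that ignores the label and always answers 1 already scores 0.5, which is the baseline the measured scores should be compared against.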

The results show that, for the detection and fuzz evals, the mean score is close to chance across the three label generation methods, as if the labels had been produced by a random word generator.

The embedding eval works by feeding a sentence embedder with the feature label, the positive examples, and the negative examples, and measuring the cosine similarity between the feature label vector and each example vector. A positive gap (mean similarity to positive examples minus mean similarity to negative examples) indicates that the feature label embeds more closely with activating examples than with non-activating ones. The higher the gap, the better.
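
A minimal sketch of the gap computation follows. It assumes the gap is the mean cosine similarity to positive examples minus the mean to negative examples; `toy_embed` is a hypothetical bag-of-words stand-in for a real sentence embedder such as all-MiniLM-L6-v2, used here only to keep the example self-contained.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def embedding_gap(
    label: str,
    positives: List[str],
    negatives: List[str],
    embed: Callable[[str], Vector],
) -> float:
    """Mean similarity(label, positives) minus mean similarity(label, negatives)."""
    lv = embed(label)
    pos = sum(cosine(lv, embed(t)) for t in positives) / len(positives)
    neg = sum(cosine(lv, embed(t)) for t in negatives) / len(negatives)
    return pos - neg


def toy_embed(text: str, vocab=("shop", "sell", "bug", "release")) -> List[float]:
    """Hypothetical embedder: counts occurrences of a tiny fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


gap = embedding_gap("sell shop", ["sell in your shop"], ["bug fix release"], toy_embed)
print(round(gap, 6))
```

A gap near zero means the label is about as close to non-activating text as to activating text, and a negative gap means it is closer to the non-activating text.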

The gaps across the three methods are tiny on a cosine scale (~0.01–0.02). The scorer barely distinguishes between activating and non-activating contexts, regardless of the label generation method. For each label generation method, ~40% of feature labels show a negative gap, indicating that the label is often misleading.

Overall, the results suggest that baez ≈ eleuther_acts_top5 across all the evals despite using different inputs (NLA explanations vs. activation examples).

The other, perhaps more important, result is that all recorded scores are very close to chance, suggesting that either the label-generation methods or the scoring evals are somewhat broken.

Qualitative

For each feature, we feed the labels generated via baez and eleuther_acts_top5 into a sentence embedder and measure which ones are the most distant in terms of cosine similarity.
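
The ranking step can be sketched as below, given label embeddings that have already been computed. The input layout (feature id mapped to a pair of vectors) and the function name are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_by_label_distance(
    label_vectors: Dict[int, Tuple[Vector, Vector]]
) -> List[Tuple[int, float]]:
    """Sort features by the similarity of their two label embeddings, lowest first.

    label_vectors maps a feature id to (vector of label A, vector of label B).
    """
    sims = [(fid, cosine(a, b)) for fid, (a, b) in label_vectors.items()]
    return sorted(sims, key=lambda item: item[1])


ranking = rank_by_label_distance(
    {
        1: ([1.0, 0.0], [0.0, 1.0]),  # orthogonal labels: most distant
        2: ([1.0, 0.0], [1.0, 0.0]),  # identical labels: least distant
        3: ([1.0, 1.0], [1.0, 0.0]),
    }
)
print([fid for fid, _ in ranking])
```

The features at the head of the ranking are the candidates for qualitative inspection.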

A notable example is feature 62551, whose generated labels are semantically far away from each other:

baez: “E-commerce product listing and selling guide content patterns”
eleuther_acts_top5: “Tokens that are part of the word 'Redbubble" or common function words (prepositions, articles, punctuation) in instructional text about the Redbubble print-on-demand platform”

Looking at the activation examples for that feature, the top three activating examples indeed mention Redbubble in discussions about “how to earn money on Redbubble” or “how to create a successful Redbubble shop”. Nevertheless, other (lower) activation examples have nothing to do with Redbubble. For example, the 4th activation example looks more like the FAQ section of an e-commerce site selling honey. Although the baez-generated label seems broad enough to cover the 4th activation example, further down the list there are examples that have little to nothing to do with either label. For example, the 7th activation example is a bug-fix release note, while the 8th is a Wikipedia-like article about Veerappan, an Indian forest bandit.

Overall, baez-generated feature labels seem to privilege the syntactic aspect: "incomplete syntactic structures", "truncated phrases requiring completion", "opening delimiters", "transition points". Conversely, eleuther_acts_top5-generated labels privilege more concrete categories and concepts describing the specific tokens that make a certain feature light up: "the suffixes -ifying/-ating", "Redbubble tokens", "Cyrillic verb suffixes", "concessive conjunctions (While, Though)", "currency symbols", "symlink references".

Conclusions

The original goal of the experiment was to determine whether NLA explanations are a good input to obtain feature labels. We propose baez, a new label generation method that obtains NLA explanations from top activating examples for a feature and feeds them into an Autointerp LLM to obtain its label.

Next, we compared baez (and baez_last) to eleuther_acts_top5, a method that uses activation examples as input for feature label generation. Overall, the results showed that the three methods are no better than chance at capturing the essence of a feature.

Here, it is worth pausing to ask ourselves the following question: What exactly were we measuring?

All the evals that we employed measure the correlation between a feature label and its activation examples. A label with a high score is one that is just broad enough to capture the common thread across the positive activating examples, and narrow enough to exclude the negative, non-activating ones.

Based on this premise, it is possible to explain why labels generated by either eleuther_acts_top5 or baez are no better than chance at capturing such a correlation, as follows.

Robert_AIZI highlighted how labels generated from top-activating examples fail to capture the long tail of lower-activating examples. In the experiment, the eleuther_acts_top5 labels are generated from the top 5 activation examples and scored using the 6th to 15th examples, exactly mirroring the scenario described by Marks. The qualitative analysis confirmed that low-activation examples often have nothing to do with the feature label. In retrospect, the train/test split (top 5 vs. 6th to 15th examples) was inadequate.

On the other hand, baez-generated labels, by design, capture the NLA explanations of the activation layer that correlate with a high activation of a given feature. The activation layer is far from the input tokens (about 2/3 of the way through a forward pass) and might progressively diverge from them, as discussed in the introduction. This might explain why the method performs no better than chance when tested against evals that measure the correlation between a feature label and its activation examples.

The immediate conclusion is that baez is a terrible feature label generation method. A follow-up would be to try new label-generation approaches until I find one that performs well at these evals.

Alternatively, I can analyze a task that actually requires feature labels and build the label generation method accordingly.

Anthropic recently published a paper on attribution graphs. Attribution graphs leverage the features mapped at each activation layer to build a circuit that traces all the activated features when going from a specific input prompt to an output response.

As an example, a researcher might ask what internal process allows a model to complete the prompt “Fact: the capital of the state containing Dallas is” with the correct answer “Austin”.

Has the model simply memorized the completion during training, or does it perform a human-like two-hop reasoning: first, inferring that the state containing Dallas is Texas, and second, that the capital of Texas is Austin?

The attribution graph (Figure 5) for a given input prompt is built by feeding the prompt into a cross-layer transcoder, which produces a map of features that activate at each input token.

At this point, the graph makes little sense. The features (corresponding to the dots) need to be labeled. Scientists manually examine the feature visualization panel (Figure 6) for each dot and try to come up with a pertinent label. This is the same task that was previously delegated to an autointerpreter.

The third and final step is to manually group features into supernodes to obtain a simplified version of the attribution graph (Figure 7). This allows testing the initial hypotheses: the model does, in fact, perform two-hop reasoning.

One thing worth noticing: there’s no trace of label scores!

Another observation concerns the classes of features encountered across various circuits, as observed by the authors of the paper:

Input, abstract, and output features. In most prompts, paths through the graph begin with “input features” representing tokens or other low-level properties of the input and end with “output features” which are best understood in terms of the output tokens that they promote or suppress. Typically, more abstract features representing higher-level concepts or computations reside in the middle of graphs.

Figure 6 clearly illustrates an input feature. Labels for input features can be easily generated by examining the tokens corresponding to, and immediately preceding, the highest feature-activation token. In this case, the label is “Dallas”.

Figure 8 illustrates a feature labeled as “say a capital”. The tokens after the highest feature-activation token always correspond to a capital; therefore, this can be classified as an output feature. Labels for output features can be easily generated by examining the tokens that immediately follow the highest feature-activation token. Additionally, the tokens whose logits are pushed up the most by the feature provide a further cue for deriving a label.

Every feature that cannot immediately be identified as an input feature or output feature belongs to the broad category of abstract features. And these are the most complicated to label. As an example, the label for the feature illustrated in Figure 9 can be generated only by observing the activation examples as a whole, including both the tokens preceding the top-activating token and those that follow. These are cross features.

Lastly, there are features, such as the one illustrated in Figure 10, that seem to resist any labeling, even after examining activation examples and token predictions. We can assume that these features encapsulate some nebulous thinking process. These are obscure features.

Based on the recurring structure identified by the authors of the paper, refined to split abstract features into cross and obscure features, I propose a tier-based score-then-label protocol that enables cheap scaling of feature labeling. Central to this process is the notion of a correlation score. First, some preliminaries.

For each activation example of feature $f$, let index $0$ denote the position of the top-activating token. Let $W = [a, b]$ be the integer range (with $a \le b$) defining the relative token window centred at $0$; for instance, $W = [-2, 0]$ selects the tokens at positions $-2, -1, 0$ relative to the top-activating token.

Let $i \in \{1, \dots, n\}$ be the index of the top activating examples of feature $f$, and let $x_i^W$ denote the token window extracted from example $i$. Let $\phi$ be a sentence-embedding model mapping a token window to a vector, and let $\cos(\cdot, \cdot)$ denote cosine similarity between two vectors.

The correlation score^[6] for feature $f$ over window $W$ is the average cosine similarity over all pairs of embedded windows:

$$C_f(W) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \cos\left(\phi(x_i^W), \phi(x_j^W)\right)$$
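
A minimal sketch of the correlation score, read as the mean cosine similarity over all unordered pairs of embedded windows. The window extraction clips at the example boundaries, and `toy_embed` is a hypothetical stand-in for a real sentence-embedding model; both are assumptions made to keep the example runnable.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence

Vector = Sequence[float]


def extract_window(tokens: List[str], top_idx: int, a: int, b: int) -> List[str]:
    """Tokens at relative positions a..b (inclusive) around the top-activating token."""
    lo = max(0, top_idx + a)
    hi = min(len(tokens), top_idx + b + 1)
    return tokens[lo:hi]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def correlation_score(
    windows: List[List[str]], embed: Callable[[List[str]], Vector]
) -> float:
    """Average cosine similarity over all pairs of embedded windows."""
    vectors = [embed(w) for w in windows]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two examples: no pair to compare
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def toy_embed(window: List[str], vocab=("dallas", "texas")) -> List[float]:
    """Hypothetical embedder: counts occurrences of a tiny fixed vocabulary."""
    lowered = [t.lower() for t in window]
    return [float(lowered.count(w)) for w in vocab]


score = correlation_score([["Dallas"], ["Dallas"], ["Texas"]], toy_embed)
print(round(score, 4))
```

With a window ending at the top-activating token, a high score means the examples share similar preceding context; with a window starting after it, a high score means they share similar continuations.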

The protocol, for each feature $f$, proceeds as follows:

Scoring step
- Measure the input correlation score $C_f(W_{\text{in}})$ and the output correlation score $C_f(W_{\text{out}})$
- If both scores fall below the target threshold $\tau$, compute the cross-correlation score $C_f(W_{\text{cross}})$

Labelling step
- Tier 1: Input and output features
  - If $C_f(W_{\text{in}}) \ge \tau$ or $C_f(W_{\text{out}}) \ge \tau$: generate the label using a low-cost method (e.g., token frequency counter or small LLM).
- Tier 2: Cross features
  - Else if $C_f(W_{\text{cross}}) \ge \tau$: route to a frontier LLM for traditional automated labeling using standard auto interpretability methods.
- Tier 3: Obscure features
  - Else: Flag for manual labeling, aided by NLA verbalization when available.
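
The routing logic of the protocol can be sketched as below. The comparison direction (a score at or above the threshold qualifies), the example threshold of 0.5, and all names are assumptions for illustration; the protocol is untested and its hyperparameters are still open.

```python
from enum import Enum
from typing import Callable


class Tier(Enum):
    INPUT_OUTPUT = 1  # cheap labeling: token frequency counter or small LLM
    CROSS = 2         # frontier LLM with standard autointerp
    OBSCURE = 3       # manual labeling, aided by NLA verbalization


def route_feature(
    input_score: float,
    output_score: float,
    cross_score_fn: Callable[[], float],
    threshold: float,
) -> Tier:
    """Assign a labeling tier from correlation scores.

    cross_score_fn is called lazily, so the cross-correlation score is only
    computed when both the input and output scores fall below the threshold.
    """
    if input_score >= threshold or output_score >= threshold:
        return Tier.INPUT_OUTPUT
    if cross_score_fn() >= threshold:
        return Tier.CROSS
    return Tier.OBSCURE


calls = []


def cross_score() -> float:
    calls.append(1)  # record that the (expensive) cross score was computed
    return 0.7


print(route_feature(0.8, 0.1, cross_score, threshold=0.5))
```

Passing the cross score as a callable keeps the scoring step cheap for the features that are already resolved as input or output features.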

The suggested tier-based score-then-label protocol enables optimized labeling by routing each feature to the most cost-efficient labeling method according to its complexity. Human attention is reserved for the fascinating, obscure features that might capture concepts that escape existing vocabulary.

As a bonus point, such a tier-based classification step could allow us to validate the hypothesis that early layers are dedicated to input processing, middle layers to abstraction, and last layers to output response.

An immediate next step is to apply the proposed protocol to an actual attribution graph experiment. This would allow refining the protocol and establishing better heuristics for the hyperparameters.

^[1] For this method to work, the NLA must be trained on the same activation layer mapped by the SAE.
^[2] This step can be performed using the Neuronpedia NLA API.
^[3] Only the last paragraph of the NLA explanation is kept, inspired by the Neuronpedia approach.
^[4] This is a variant of the original eleuther_acts_top20, dictated by the limited number of activation examples and the need to split them between a train and a test dataset.
^[5] Available from delphi. The positive examples are sampled from the test dataset for that feature. The negative examples are sampled at random from activation examples of different features.
^[6] A variant of the correlation score should also weight the vectors corresponding to the activation examples by their activation score.

Johnny Lin (2mo)

interesting. so the baez method extracts a bunch more nla explanations using the top known activating texts, then feeds back into an LLM to "summarize" them.
if im understanding this right, you end up with scores that are similar-ish. but that seems like a shortcoming of current autointerp/feature-label scoring methods.
we def need better scoring methods - would be interested in proposals/submissions for this. and i'd be curious about the next step eg code that does the new proposed feature labeling method, plus side by side examples

enricobottazzi (2mo)

> but that seems like a shortcoming of current autointerp/feature-label scoring methods.
> we def need better scoring methods - would be interested in proposals/submissions for this.

My feeling is that the biggest shortcoming of current scoring methods is assuming all features are created equal. An alternative would be to first classify the feature, using something like the correlation score I propose, and then score the label with a category-specific method.

What's your opinion on the proposed categories?

> and i'd be curious about the next step eg code that does the new proposed feature labeling method, plus side by side examples

Sure, I'm gonna run an attribution graph experiment this week using the proposed labeling method, and I'll share the results here.
