TL;DR Recent studies by Anthropic show that LLM features extracted via mechanistic interpretability fall into distinct categories, each with different properties. However, state-of-the-art auto-interpreters fail to account for this variety. In this article, I propose AIR (Auto-Interpretability Router). AIR is a new protocol that uses a sentence embedder to identify a feature's category and, based on the category, routes the most appropriate activation examples to the auto-interpreter. Results show that, compared with state-of-the-art auto-interpreters from OpenAI or Neuronpedia, AIR produces more accurate feature explanations at a lower cost.
Introduction
Recent studies by Anthropic on attribution graphs (Lindsey et al., 2025; Ameisen et al., 2025) highlighted different types of features, which they split into three categories: input, abstract, and output features.
In most prompts, paths through the graph begin with “input features” representing tokens or other low-level properties of the input and end with “output features” which are best understood in terms of the output tokens that they promote or suppress. Typically, more "abstract features" representing higher-level concepts or computations reside in the middle of graphs. (Lindsey et al., 2025)
Features are the atoms of the science of LLM mechanistic interpretability. The first step is usually to isolate these atoms via techniques such as SAEs or Transcoders. The second step is to understand what they mean. The standard process for explaining a feature is via auto-interpreter LLMs (Bills et al., 2023): ask an LLM to find a common thread among the top-activation examples of that feature.
The original oai_token-act-pair, from OpenAI, and the more recent np_max-act-logits, from Neuronpedia, are some of the state-of-the-art protocols for auto-interpretability.
In the former, the auto-interpreter is fed with pairs of token-activation values from the top-activation examples. In the latter, more information is given to the auto-interpreter: the top positive logits for that feature and, for each activation example, a text sequence consisting of the activation example itself, the max-activating token, and the token after that.
The instructions given to the OpenAI auto-interpreter (Figure 1) are very crude and direct, and clearly do not account for any nuance in the feature category.
Figure 1: instruction for oai_token-act-pair auto-interpreter
On the other hand, the instructions given to the Neuronpedia auto-interpreter (Figure 2) are more complex and somewhat attempt to guide the model to follow a cascade protocol: first identify the feature category, then provide an explanation. Broadly speaking, method 1 accounts for input features. Methods 2 and 3 for output features. Method 4 for abstract features.
Figure 2: instruction for np_max-act-logits auto-interpreter
Neuronpedia's method relies on the intelligence of the auto-interpreter LLM to both discern the feature's category and explain it. This is an expensive and, potentially, error-prone protocol. This observation prompted me to ask the following question:
Can we outsource the category identification to a sentence embedder?
To answer this question, I created the Auto-Interpretability Router (AIR) protocol and ran an experiment to measure its accuracy and cost relative to state-of-the-art auto-interpreters.
The protocol (Figure 3) proceeds in two simple steps. The feature activation examples are passed into a sentence embedder to derive the feature category. Based on the identified category, different examples are passed to the auto-interpreter LLM.
Figure 3: AIR protocol
In the experiment, I generate explanations for 500 features via four protocols (np_max-act-logits, oai_token-act-pair, air and its variant air_filtered) and measure their accuracy via fuzzing automated evaluation. The results (Figure 4) show that the explanations generated via AIR and its filtered variant are more accurate than the ones generated by other state-of-the-art auto-interpreters.
Figure 4: accuracy score by protocol
Perhaps more importantly, this is obtained at a lower token cost (Figure 5): considering the sentence embedder step to be free, the token budget allocated to the auto-interpreter is drastically reduced. air consumes 9x fewer tokens than np_max-act-logits and 60x fewer tokens than oai_token-act-pair. Given that the feature identification step is done by the sentence embedder, the task of the auto-interpreter is easier, and therefore doesn't need to rely on long context windows, expensive CoT or few-shot prompting.
Figure 5: token cost by protocol
AIR Protocol
An initial sketch of the protocol was introduced in a previous blogpost of mine.
First, a bit of notation. For a given feature $i$, let the set of its top-activation examples be $E_i$. For each example, the max-activating token is assigned index $0$, such that $e_{i,j}^{[a,b]}$ is the token sequence, extracted from the $j$-th example, for tokens indexed by the window $[a,b]$ relative to index $0$.
For instance, $e_{i,j}^{[0,0]}$ includes the max-activating token only, while $e_{i,j}^{[-1,1]}$ includes three tokens: the token before the max-activating token, the max-activating token itself, and the token after that.
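As a minimal sketch of the window notation, the helper below slices a token list around the max-activating token. The function name `window_tokens`, the `(lo, hi)` tuple encoding of a window, and the toy tokenization are illustrative assumptions, not part of the AIR source code.

```python
from typing import List, Tuple


def window_tokens(tokens: List[str], max_idx: int, window: Tuple[int, int]) -> List[str]:
    """Return the tokens whose offset from the max-activating token lies in [lo, hi]."""
    lo, hi = window
    # Clamp the slice to the boundaries of the example.
    start = max(0, max_idx + lo)
    end = min(len(tokens), max_idx + hi + 1)
    return tokens[start:end]


# Hypothetical tokenization; the max-activating token "/" sits at position 3.
example = ["I", " drink", " coffee", "/", "tea", " daily"]

print(window_tokens(example, 3, (0, 0)))   # the max-activating token only
print(window_tokens(example, 3, (-1, 1)))  # one token of context on each side
```

Wider windows simply use larger offsets, so the same helper would cover the short, medium and long window categories once their offsets are fixed.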
The correlation score of feature $i$ and token window $w$ is the normalized scalar $c_{i,w}$. It is computed from the set of examples $\{e_{i,j}^{w}\}_j$ and, loosely speaking, rewards coherence among examples from the same feature (intra component) while penalizing similarity to the full example pool (inter component). The scorer uses a sentence embedder model under the hood and does not rely on any LLM. Details are in the Appendix.
Each feature category is associated with a token window. In the first experiment, we define eight categories (Table 1).
Table 1: feature categories
act_token and before_act_token are shades of the broad category of input features. after_act_token, positive_logits and negative_logits are shades of the output feature category. In particular, features belonging to the categories positive_logits or negative_logits are those that can be identified mainly by the logits they push, respectively, up or down. For these subcategories, the examples provided to the protocol are not activation examples but the list of top positive (or negative) logits associated with that feature. short_window, medium_window and long_window belong to the broad category of abstract features.
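AIR is a two-step router. First, it computes the correlation score for each category and selects the category associated with the maximum correlation score:
$$w^{*} = \arg\max_{w} \; c_{i,w}$$
where $c_{i,w}$ is the correlation score of feature $i$ for the token window $w$ associated with each category.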
Second, it routes the selected examples to an auto-interpreter, which produces the feature explanation.
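The two routing steps can be sketched as follows. This is an illustration under stated assumptions: the `route` function, the dictionary layout, and the numbers and strings in the usage example are hypothetical, and the correlation scores are assumed to have been computed beforehand by the embedder-based scorer.

```python
from typing import Dict, List, Tuple


def route(scores: Dict[str, float], examples: Dict[str, List[str]]) -> Tuple[str, List[str]]:
    """Pick the category with the highest correlation score and return its example set."""
    category = max(scores, key=scores.get)
    return category, examples[category]


# Hypothetical scores and example sets for a single feature.
scores = {"act_token": -0.8, "after_act_token": -0.2, "short_window": 2.1}
examples = {
    "act_token": ["/", "/"],
    "after_act_token": ["tea", "tea"],
    "short_window": ["coffee/tea", "Coffee/Tea"],
}

category, selected = route(scores, examples)
print(category, selected)  # only `selected` would be sent to the auto-interpreter
```

Because the selection is a plain argmax over precomputed scores, the LLM only ever sees the examples of one category.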
The instruction prompt given to the auto-interpreter is generic and fixed (Figure 6) and does not rely on CoT or few-shot prompting.
Figure 6: instruction for air auto-interpreter
Notably, the burden of classifying a feature is no longer on the shoulders of the LLM but has been outsourced to a virtually free sentence embedder.
Experiment
The primary goal of the experiment is to compare the accuracy of the feature explanations generated via four protocols: np_max-act-logits, oai_token-act-pair, air and air_filtered. The secondary goal is to measure the costs of running these protocols in terms of tokens and dollars spent.
The experiment, for which we attach the source code for reproducibility, proceeds in three steps:
1. Sample 500 features at random from Neuronpedia. The features[1] are fetched from gemma-3-27b-it, more specifically from its gemmascope-2-transcoder-262k feature dataset, which covers 62 MLP layers (indexed from 0 to 61) and, for each layer, 262,144 features (indexed from 0 to 262143).
2. For each feature, obtain 11 explanations as follows:
- 1 explanation via np_max-act-logits
- 1 explanation via oai_token-act-pair
- 9 explanations via air[2], where each instance of the protocol uses a different embedder[3] from the following list: ["all-MiniLM-L6-v2", "all-mpnet-base-v2", "BAAI/bge-small-en-v1.5", "Qwen/Qwen3-Embedding-0.6B", "BAAI/bge-m3", "intfloat/multilingual-e5-large-instruct", "google/embeddinggemma-300m", "sentence-transformers/paraphrase-multilingual-mpnet-base-v2", "sentence-transformers/LaBSE"]
3. For each feature and explanation, score the explanation via Eleuther fuzz automated evaluation[4].
The auto-interpreters, used to generate an explanation, and the auto-evaluator, used to obtain a score for each explanation, are instantiated using gemini-2.5-flash-lite.
The accuracy of a protocol is measured as the average explanation score across all features in the dataset. For air, we pick the protocol instance (across the nine) whose accuracy score is the highest.
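The aggregation described above can be sketched in a few lines. The function names and the toy scores are hypothetical; the only thing taken from the text is that a protocol's accuracy is the mean explanation score across features and that, for air, the best instance across embedders is reported.

```python
from statistics import mean
from typing import Dict, List, Tuple


def protocol_accuracy(explanation_scores: List[float]) -> float:
    """Average explanation score across all features in the dataset."""
    return mean(explanation_scores)


def best_instance(scores_by_embedder: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the embedder whose air instance has the highest accuracy, with that accuracy."""
    accuracy = {name: protocol_accuracy(s) for name, s in scores_by_embedder.items()}
    best = max(accuracy, key=accuracy.get)
    return best, accuracy[best]


# Toy per-feature scores for two hypothetical embedder instances.
toy = {"embedder_a": [0.5, 1.0], "embedder_b": [0.25, 0.75]}
print(best_instance(toy))
```

Note that picking the best of nine instances after the fact is a selection step in its own right, which is why the per-embedder tables are reported alongside the headline number.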
The filtered variant air_filtered is obtained by excluding certain features from undergoing the LLM auto-interpreter step. More specifically, we exclude:
Features whose highest correlation score is recorded either for positive_logits or negative_logits categories.
Features whose highest correlation score falls below a fixed threshold. We call these features obscure.
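The two exclusion rules can be expressed as a small predicate over the per-category correlation scores. This is a sketch: the function name `should_explain` is hypothetical, and the threshold is left as a parameter because its value is an assumption in the usage example.

```python
from typing import Dict

LOGIT_CATEGORIES = {"positive_logits", "negative_logits"}


def should_explain(scores: Dict[str, float], threshold: float) -> bool:
    """Return False for features that air_filtered would skip."""
    best = max(scores, key=scores.get)
    if best in LOGIT_CATEGORIES:
        return False  # best identified by logits, not by activation examples
    if scores[best] < threshold:
        return False  # obscure feature: no category stands out
    return True


# Hypothetical scores and threshold, for illustration only.
print(should_explain({"act_token": 1.9, "positive_logits": 0.3}, threshold=1.0))
print(should_explain({"act_token": 0.2, "positive_logits": 1.5}, threshold=1.0))
print(should_explain({"act_token": 0.2, "short_window": 0.4}, threshold=1.0))
```

The check runs entirely on embedder-derived scores, so no LLM call is spent on a feature that ends up excluded.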
Why do we exclude them? Let's first take a step back and analyze how the scoring process works. The fuzzing auto-evaluator is presented with a feature explanation and a set of unlabelled examples that mix positive examples, sampled from the activation set for that feature, and negative examples, sampled from the activation sets of other features. The goal of the scorer is to guess, based on the explanation, which are the positive examples.
The common aspect of the excluded features is that they have little to no visible impact on the activation examples. So we exclude them because we know in advance they will receive a low score, since the evaluation method relies solely on activation examples as the source of truth.
While such a filtering strategy might feel like cheating at first, I argue that this is completely legit since it relies on a unique capability of the AIR protocol.
AIR allows identifying a feature's category for free using the correlation score, thereby enabling exclusion from auto-interpretation. On the other hand, methods like np_max-act-logits bundle the category identification and feature explanation steps together in the auto-interpreter, making filtering extremely expensive.
Lastly, the cost. We extract the LLM autocomplete Openrouter traces from step 2, split them by protocol, and, for each protocol, measure the average dollar and token costs to generate a single feature explanation. We consider the cost of air and air_filtered equivalent, as the explanation step is exactly the same.
Results
All the results are accessible through this folder. [5]
The primary thing that we care about is accuracy. As depicted in Figure 4, air slightly outperforms np_max-act-logits, the best-performing state-of-the-art auto-interpreter. air_filtered performs considerably better, confirming the hypothesis that features from the categories positive_logits and negative_logits, as well as obscure features, degrade the fuzz scorer's performance.
all-mpnet-base-v2 is the embedder associated with the best performing air instance, while Qwen/Qwen3-Embedding-0.6B is the embedder associated with the best performing air_filtered instance. Notably, Table 2 and Table 3 show, respectively, that any air and air_filtered instance across nine embedders outperforms the two state-of-the-art auto-interpreter protocols.
Table 2: accuracy across various instances of air protocols
Table 3: accuracy across various instances of air_filtered protocols
Figure 5 and, in more detail, Table 4 show the cost breakdown of each protocol. The lower cost of air compared to state-of-the-art auto-interpreters is due to the increased simplicity of the protocol, which makes it possible to work with shorter instructions and without any CoT or few-shot prompting.
Table 4: cost breakdown
Figure 7 shows a completion for np_max-act-logits: the model needs to go through an expensive reasoning step, traverse the cascade of feature-category identification methods, and, finally, output the explanation for the feature. On the other hand, the task of auto-interpreters in air is much easier: they just need to output the explanation, which takes only a few tokens (fewer than 5, on average).
Figure 7: autocomplete trace for np_max-act-logits
Now that the overall picture is clear, let's zoom in and examine the protocol's performance with specific features.
AIR breakthrough
In this section, we analyze a scenario in which AIR outperforms the state-of-the-art autoexplainer. Feature 60/152807 activates (Figure 8) on exactly one token: /. Notice that it doesn't activate on slashes in general: it only activates on the combination coffee/tea.
air assigns the following correlation scores (Table 5).
Table 5: correlation scores for feature 60/152807 and embedder all-mpnet-base-v2
The identified feature category is the one with the highest correlation score: short_window. The scores indicate that the embedder correctly penalized the genericity in the examples for categories before_act_token, act_token, and after_act_token, while it didn't apply the same penalty for the more specific examples of the category short_window.
The example set corresponding to the category short_window is then passed to the auto-interpreter, which can now easily explain the feature as coffee or tea, case insensitive, recording an accuracy score of 0.9.
On the other hand, np_max-act-logits explains the same feature as or, obtaining an accuracy score of 0.45, worse than chance.
The explainer model is the same. The only difference lies in the instructions and examples provided to it. The failure of np_max-act-logits can be explained by looking at its instructions. The model is ordered, as part of the first method of the cascade, to Look at MAX_ACTIVATING_TOKENS. If they share something specific in common, or are all the same token or a variation of the same token (like different cases or conjugations), respond with that token.
In this case, the model correctly identified the slash as the same token for every example and therefore responded with that by (mistakenly) labeling the slash as or.
The model is not smart enough to detect the token's genericity, discard it, and proceed to method 4, which would have allowed it to correctly explain the feature by looking at the broader context around the max-activating token.
AIR failure
In this section, we analyze a scenario in which AIR fails. Feature 23/151584 (Figure 9) fires in contexts related to "big O notation".
Figure 9: top activation examples for Feature 23/151584
np_max-act-logits explains it as O, while air explains it as exclamations of surprise or uncertainty. Both mistakenly focus only on the top-activating token and obtain an accuracy score close to chance.
The reason for the failure of np_max-act-logits is the same one described in the previous section: the auto-explainer stops at method 1 and therefore fails to look at the broader context.
When it comes to air, the failure can be explained by looking at the correlation scores:
Table 6: correlation scores for feature 23/151584 and embedder all-mpnet-base-v2
The feature is identified by the category act_token and therefore only a sequence of O is provided to the auto-explainer, which justifies the given explanation.
More appropriate categories would have been short_window, medium_window or long_window which would have provided the auto-explainer with enough context to explain the feature correctly.
The correlation scorer, through the embedder, did not provide enough genericity penalty for the example set corresponding to the category act_token.
Obscure features
Lastly, we have obscure features. These are the ones that I like the most. These are truly nonsensical. There's no correlation or common thread whatsoever across the activation examples or the logits. Feature 16/215906 is a great example. Its activations range from African history to technical documentation on databases, Python code, and tourist guides to Venice.
Figure 10: top activation examples for 16/215906
The explanations given by state-of-the-art auto-interpreter protocols are little more than guesses: hashtags for np_max-act-logits and concepts related to specific, focused subjects for oai_token-act-pair. Both explanations obtain a close-to-chance score in the fuzz-based automated evaluation.
The correlation score tells the whole story. The embedder cannot detect any meaningful correlation across the examples and therefore does not clearly identify a category for that feature.
Table 7: correlation scores for feature 16/215906 and embedder Qwen/Qwen3-Embedding-0.6B
AIR allows for spotting an obscure feature virtually for free before undergoing the expensive auto-interpreter and auto-explainer steps.
Conclusions
AIR is a new protocol that uses a sentence embedder to identify a feature's category and, based on the category, routes the most appropriate activation examples to the auto-interpreter. This differs from methods like np_max-act-logits that perform the category identification and feature explanation step together via an LLM auto-interpreter.
The results of the experiment show that, compared with state-of-the-art auto-interpreters from OpenAI or Neuronpedia, AIR produces more accurate feature explanations at a lower cost.
More broadly, the results confirm Anthropic's empirical finding that features are not all created equal but belong to different categories with distinct properties.
My wish is to see more category-focused auto-interpreter methods in the future. A nice first step would be to have Neuronpedia display a category label for each feature.
The protocol suggested in this article is just a first step in that direction, and there's likely ample room for improvement by refining the algorithm to compute the correlation score and to identify a feature's category. Maybe better sentence embedders can be trained by using the accuracy score of the identified category as a training reward.
A nice side effect of this protocol is that it allows us to identify obscure features for free. These features might capture fascinating concepts that escape existing vocabulary and could therefore be routed to human manual interpretability or to auto-interpretability via smarter models.
Appendix
We consider a set of features indexed by $i$. Each $i$-th feature has activation examples $E_i = \{e_{i,j}\}_j$. Each $j$-th example for feature $i$ is associated with a maximum activation value $a_{i,j}$.
The correlation score $c_{i,w}$ for feature $i$ and token window $w$ combines a normalized intra correlation score and a normalized inter correlation score.
We start by defining the raw intra and inter correlation scores; the normalized variants follow immediately.
Raw intra correlation
The raw intra correlation score for feature $i$ and token window $w$ measures the semantic similarity of the examples in set $E_i$ of size $n_i$. More precisely, it is the weighted mean cosine similarity over the $n_i(n_i-1)/2$ unordered pairs of examples.
Here $\phi(e_{i,j}^{w})$ denotes the sentence-embedder representation of token sequence $e_{i,j}^{w}$, and $\cos(\cdot,\cdot)$ denotes cosine similarity between two embedding vectors.
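A minimal sketch of the raw intra score, written in plain Python so it runs without an embedder. It assumes that the embeddings are already computed and that each pair is weighted by the product of the two examples' maximum activation values; that weighting choice, like the function names, is an assumption rather than something stated in the text.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def raw_intra(embeddings: List[Sequence[float]], activations: List[float]) -> float:
    """Weighted mean cosine similarity over all unordered pairs of examples."""
    num, den = 0.0, 0.0
    for j, k in combinations(range(len(embeddings)), 2):
        weight = activations[j] * activations[k]  # assumed pair weighting
        num += weight * cosine(embeddings[j], embeddings[k])
        den += weight
    return num / den


# Toy 2-d "embeddings": two identical examples and one orthogonal to them.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(raw_intra(toy_embeddings, [1.0, 1.0, 1.0]))
```

The function needs at least two examples with non-zero weights; in practice the vectors would come from one of the sentence embedders listed in the experiment.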
Raw inter correlation
We first define the pool $P_w$ as the set containing all examples for a given token window $w$. The raw inter correlation score for feature $i$ and token window $w$ measures the semantic similarity between the example set for the given feature and the average direction of the embeddings from the pool $P_w$. More precisely, it is the weighted mean cosine similarity between the $i$-th feature's examples and the pool centroid for that token window.
The centroid is the weighted mean of the unit-normalized embeddings in $P_w$, renormalized to unit length.
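The centroid and the raw inter score can be sketched in the same style. The sketch assumes precomputed embeddings and uses each example's maximum activation value as its weight; both the weighting and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(pool_embeddings: List[Sequence[float]], pool_weights: List[float]) -> List[float]:
    """Weighted mean of unit-normalized pool embeddings, renormalized to unit length."""
    dim = len(pool_embeddings[0])
    acc = [0.0] * dim
    for emb, weight in zip(pool_embeddings, pool_weights):
        unit = normalize(emb)
        for d in range(dim):
            acc[d] += weight * unit[d]
    total = sum(pool_weights)
    return normalize([x / total for x in acc])


def raw_inter(embeddings: List[Sequence[float]], activations: List[float],
              pool_centroid: Sequence[float]) -> float:
    """Weighted mean cosine similarity between a feature's examples and the pool centroid."""
    num = 0.0
    for emb, weight in zip(embeddings, activations):
        unit = normalize(emb)
        # The centroid is unit length, so the dot product equals the cosine similarity.
        num += weight * sum(a * b for a, b in zip(unit, pool_centroid))
    return num / sum(activations)


toy_centroid = centroid([[2.0, 0.0], [0.0, 3.0]], [1.0, 1.0])
print(toy_centroid, raw_inter([[1.0, 0.0]], [1.0], toy_centroid))
```

A feature whose examples all point toward the pool centroid gets a high raw inter score, which is the genericity signal that the final score penalizes.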
Normalized intra/inter correlation
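The normalized versions of the intra and inter correlation scores for feature $i$ and token window $w$ are obtained as the z-scores of the raw values against a baseline:
$$\tilde{c}^{\text{intra}}_{i,w} = \frac{c^{\text{intra}}_{i,w} - \mu^{\text{intra}}_{w}}{\sigma^{\text{intra}}_{w}}, \qquad \tilde{c}^{\text{inter}}_{i,w} = \frac{c^{\text{inter}}_{i,w} - \mu^{\text{inter}}_{w}}{\sigma^{\text{inter}}_{w}}$$
where $c^{\text{intra}}_{i,w}$ and $c^{\text{inter}}_{i,w}$ are the raw scores and, over many random sets of examples sampled from the pool $P_w$:
- $\mu^{\text{intra}}_{w}$ and $\sigma^{\text{intra}}_{w}$ are, respectively, the empirical mean and standard deviation of the resulting raw intra correlation values.
- $\mu^{\text{inter}}_{w}$ and $\sigma^{\text{inter}}_{w}$ are, respectively, the empirical mean and standard deviation of the resulting raw inter correlation values.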
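The normalization step can be sketched as a z-score against values obtained from random draws. The helper names, the number of draws, the fixed seed, and the use of the population standard deviation are all assumptions; `score_fn` stands for whichever raw score (intra or inter) is being normalized.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def baseline_values(pool: Sequence, set_size: int, n_draws: int,
                    score_fn: Callable[[List], float], seed: int = 0) -> List[float]:
    """Raw scores of random example sets drawn from the pool (the baseline distribution)."""
    rng = random.Random(seed)
    return [score_fn(rng.sample(list(pool), set_size)) for _ in range(n_draws)]


def z_score(raw: float, baseline: List[float]) -> float:
    """Standardize a raw score against the baseline mean and standard deviation."""
    mu = mean(baseline)
    sigma = pstdev(baseline)  # assumed: population standard deviation
    return (raw - mu) / sigma


# Toy baseline with mean 2.0: a raw score of 3.0 sits above it.
print(z_score(3.0, [1.0, 2.0, 3.0]))
```

A baseline with zero variance would make the z-score undefined, so a real implementation would need enough draws and some guard for that case.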
A high normalized intra score means that examples in $E_i$ are more semantically similar to each other than expected by chance. A high normalized inter score means that examples in $E_i$ are similar to the pool centroid, and are therefore more generic.
The final correlation score favors example sets that are both coherent with each other and specific: it increases with the normalized intra score and decreases with the normalized inter score.
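One simple form consistent with this description is a plain difference of the two normalized components. This is an assumption made for illustration: the exact combination, and any rescaling applied afterwards, may differ. Here $c_{i,w}$ is the correlation score of feature $i$ and token window $w$, and the tilde marks the z-scored intra and inter components.

```latex
c_{i,w} \;=\; \tilde{c}^{\,\text{intra}}_{i,w} \;-\; \tilde{c}^{\,\text{inter}}_{i,w}
```

Under this form, a feature whose examples are unusually coherent (high intra z-score) but no closer to the pool centroid than a random set (inter z-score near zero) receives a clearly positive score.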
For each feature $i$, the example set $E_i$ is obtained, analogously to np_max-act-logits and oai_token-act-pair, by concatenating the top-10 non-duplicate activation examples for that feature.
For each feature, a dataset composed of 10 positive and 10 negative examples is provided to the auto-scorer. The dataset is built once per feature and reused to score the explanations from the different protocols. An accuracy score of 0.5 indicates that the explanation does not help the scorer distinguish activating from non-activating examples better than chance.
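To make the 0.5 chance level concrete, the sketch below scores a set of guesses against the balanced dataset of 10 positive and 10 negative examples. It assumes that the accuracy score is the fraction of examples labelled correctly; the function name is hypothetical and this is not the Eleuther implementation.

```python
from typing import List


def fuzz_accuracy(predictions: List[bool], labels: List[bool]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Balanced dataset: 10 activating (True) and 10 non-activating (False) examples.
labels = [True] * 10 + [False] * 10

# A scorer that gains nothing from the explanation and marks everything as activating.
print(fuzz_accuracy([True] * 20, labels))
```

On a balanced dataset, any strategy that ignores the explanation lands around 0.5, which is why scores near that value are read as uninformative.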
In the experiment trace, you'll find a postprocessed explanation variant for air in which a category-specific prefix is prepended to the explanation obtained by air. The results for this variant have not been reported, as its average accuracy score is lower than that of air.
TL;DR Recent studies by Anthropic show that LLM features extracted via mechanistic interpretability fall into distinct categories, each with different properties. However, state-of-the-art auto-interpreters fail to account for this variety. In this article, I propose AIR (Auto-Interpretability Router). AIR is a new protocol that uses a sentence embedder to identify a feature's category and, based on the category, routes the most appropriate activation examples to the auto-interpreter. Results show that, compared with state-of-the-art auto-interpreters from OpenAI or Neuronpedia, AIR produces more accurate feature explanations at a lower cost.
Introduction
Recent studies by Anthropic on attribution graphs (Lindsey et al., 2025; Ameisen et al., 2025) highlighted different types of features, which they split into three categories: input, abstract, and output features.
Features are the atoms of the science of LLM mechanistic interpretability. The first step is usually to isolate these atoms via techniques such as SAEs or Transcoders. The second step is to understand what they mean. The standard process for explaining a feature is via auto-interpreter LLMs (Bills et al., 23): ask an LLM to find a common thread among the top-activation examples of that feature.
The original
oai_token-act-pair, from OpenAI, and the more recentnp_max-act-logits, from Neuronpedia, are some of the state-of-the-art protocols for auto-interpretability.In the former, the auto-interpreter is fed with pairs of token-activation values from the top-activation examples. In the latter, more information is given to the auto-interpreter: the top positive logits for that feature and, for each activation example, a text sequence consisting of the activation example itself, the max-activating token, and the token after that.
The instructions given to the OpenAI auto-interpreter (Figure 1) are very crude and direct, and clearly do not account for any nuance in the feature category.
Figure 1: instruction for
oai_token-act-pairautointerpreterOn the other hand, the instructions given to the Neuronpedia auto-interpreter (Figure 2) are more complex and somewhat attempt to guide the model to follow a cascade protocol: first identify the feature category, then provide an explanation. Broadly speaking, method 1 accounts for input features. Methods 2 and 3 for output features. Method 4 for abstract features.
Figure 2: instruction for
np_max-act-logitsautointerpreterNeuronpedia's method relies on the intelligence of the auto-interpreter LLM to both discern the feature's category and explain it. This is an expensive and, potentially, error-prone protocol. This observation prompted me to ask the following question:
Can we outsource the category identification to a sentence embedder?
To answer this question, I created the Auto-Interpretability Router (AIR) protocol and ran an experiment to measure its accuracy and cost relative to state-of-the-art auto-interpreters.
The protocol (Figure 3) proceeds in two simple steps. The feature activation examples are passed into a sentence embedder to derive the feature category. Based on the identified category, different examples are passed to the auto-interpreter LLM.
Figure 3: AIR protocol
In the experiment, I generate explanations for 500 features via four protocols (
np_max-act-logits,oai_token-act-pair,airand its variantair_filtered) and measure their accuracy via fuzzing automated evaluation. The results (Figure 4) show that the explanations generated via AIR and its filtered variant are more accurate than the ones generated by other state-of-the-art autointerpreters.Figure 4: accuracy score by protocol
Perhaps more importantly, this is obtained at a lower token cost (Figure 5): considering the sentence embedder step to be free, the token budget allocated to the auto-interpreter is drastically reduced:
airconsumes 9x fewer tokens thannp_max-act-logitsand 60x fewer tokens thanoai_token-act-pair. Given that the feature identification step is done by the sentence embedder, the task of the auto-interpreter is easier, and therefore doesn't need to rely on long context windows, expensive CoT or few-shot prompting.Figure 5: token cost by protocol
AIR Protocol
An initial sketch of the protocol was introduced in a previous blogpost of mine.
First a bit of notation. For given a feature , let the set of its top-activation examples be . For each example, the max activating token is assigned index , such that is the token sequence, extracted from the -th example, for tokens indexed by the window relative to index .
For instance includes the max activating token only, while includes three tokens: the token before the max activating token, the max activating token itself, and the token after that.
The correlation score of feature and token window is the normalized scalar . It is computed from the set of examples and, loosely speaking, rewards coherence among examples from the same feature ( component) while penalizing similarity to the full example pool ( component). The scorer uses a sentence embedder model under the hood and does not rely on any LLM. Details are in the Appendix.
Each feature category is associated with a token window. In the first experiment, we define eight categories (Table 1)
Table 1: feature categories
act_tokenandbefore_act_tokenare shades of the broad category of input features.after_act_token,positive_logitsandnegative_logitsare shades of the output feature category. In particular, features belonging to the categoriespositive_logitsornegative_logitsare those that can be identified mainly by the logits they push, respectively, up or down. For these subcategories, the examples provided to the protocol are not activation examples but the list of top positive (or negative) logits associated with that feature.short_window,medium_windowandlong_windowbelong to the broad category of abstract featuresAIR is a two-step router. First, it computes the correlation score for each category and selects the category associated with the maximum correlation score:
Second, it routes the selected examples to an auto-interpreter, which produces the feature explanation.
The instruction prompt given to the auto-interpreter is generic and fixed (Figure 6) and does not rely on CoT or few-shot prompting.
Figure 6: instruction for
airautointerpreterNotably, the burden of classifying a feature is no longer on the shoulders of the LLM but has been outsourced to a virtually free sentence embedder.
Experiment
The primary goal of the experiment is to compare the accuracy of the feature generated via four protocols:
np_max-act-logits,oai_token-act-pair,airandair_filtered. The secondary goal is to measure the costs of running these protocols in terms of tokens and dollars spent.The experiment, for which we attach the source code for reproducibility, proceeds in three steps:
gemma-3-27b-it. More specifically itsgemmascope-2-transcoder-262kfeature dataset which maps 62 MLP layers (indexed from 0 to 61) and, for each layer, 262,144 features (indexed from 0 to 262143).np_max-act-logitsoai_token-act-pairair[2] where each instance of the protocol uses a different embedder[3] from the following list["all-MiniLM-L6-v2", "all-mpnet-base-v2", "BAAI/bge-small-en-v1.5", "Qwen/Qwen3-Embedding-0.6B", "BAAI/bge-m3", "intfloat/multilingual-e5-large-instruct", "google/embeddinggemma-300m", "sentence-transformers/paraphrase-multilingual-mpnet-base-v2", "sentence-transformers/LaBSE"]The auto-interpreters, used to generate an explanation, and the auto-evaluator, used to obtain a score for each explanation, are instantiated using
gemini-2.5-flash-lite.The accuracy of a protocol is measured as the average explanation score across all features in the dataset. For
air, we pick the protocol instance (across the nine) whose accuracy score is the highest.The filtered variant
air_filteredis obtained by excluding certain features from undergoing the LLM auto-interpreter step. More specifically, we exclude:positive_logitsornegative_logitscategories.Why do we exclude them? Let's first take a step back and analyze how the scoring process works. The fuzzing auto-evaluator is presented with a feature explanation and a set of unlabelled examples that mix positive examples, sampled from the activation set for that feature, and negative examples, sampled from the activation sets of other features. The goal of the scorer is to guess, based on the explanation, which are the positive examples.
The common aspect of the excluded features is that they have little to no visible impact on the activation examples. So we exclude them because we know in advance they will receive a low score, since the evaluation method relies solely on activation examples as the source of truth.
While such a filtering strategy might feel like cheating at first, I argue that this is completely legit since it relies on a unique capability of the AIR protocol.
AIR allows identifying a feature's category for free using the correlation score, thereby enabling exclusion from auto-interpretation. On the other hand, methods like
np_max-act-logitsbundle the category identification and feature explanation step together in the autointerpreter, making filtering extremely expensive.Lastly, the cost. We extract the LLM autocomplete Openrouter traces from step 2, split them by protocol, and, for each protocol, measure the average dollar and token costs to generate a single feature explanation. We consider the cost of
airandair_filteredequivalent, as the explanation step is exactly the same.Results
All the results are accessible through this folder. [5]
The primary thing that we care about is accuracy. As depicted in Figure 4, slightly outperforms the best performing state-of-the-art auto-interpreter . confirming the hypothesis that features from categories
airwith a scorenp_max-act-logitswhich records a score ofair_filteredperforms way better with a scorepositive_logitsandnegative_logitsand obscure features degrade the fuzz scorer performance.all-mpnet-base-v2is the embedder associated with the best performingairinstance, whileQwen/Qwen3-Embedding-0.6Bis the embedder associated with the best performingair_filteredinstance. Notably, Table 2 and Table 3 show, respectively, that anyairandair_filteredinstance across nine embedders outperforms the two state-of-the-art auto-interpreter protocols.Table 2: accuracy across various intances of
airprotocolsTable 3: accuracy across various intances of
air_filteredprotocolsFigure 5 and, in more detail, Table 4 show the cost breakdown of each protocol. The reason for the decline in the cost of
aircompared to state-of-the-art autointerpreters is due to the increased simplicity of the protocol, which makes it possible to work with shorter instructions and without any CoT or few-shot prompting.Table 4: cost breakdown
Figure 7 shows a completion for np_max-act-logits: the model needs to go through an expensive reasoning step, traverse the cascade of feature-category identification methods, and, finally, output the explanation for the feature. The task of the auto-interpreter in air, on the other hand, is much easier: it only needs to output the explanation, which takes just a few tokens (fewer than 5, on average).
Figure 7: autocomplete trace for np_max-act-logits
Now that the overall picture is clear, let's zoom in and examine the protocol's performance on specific features.
AIR breakthrough
In this section, we analyze a scenario in which AIR outperforms the state-of-the-art auto-explainer. Feature 60/152807 activates (Figure 8) on exactly one token: /. Notice that it doesn't activate on slashes in general: it only activates on the combination coffee/tea. air assigns the correlation scores shown in Table 5.
Table 5: correlation scores for feature 60/152807 and embedder all-mpnet-base-v2
The identified feature category is the one with the highest correlation score: short_window. The scores show that the embedder correctly penalized the genericity of the examples for the categories before_act_token, act_token, and after_act_token, while it didn't apply the same penalty to the more specific examples of the category short_window.
The example set corresponding to the category short_window is then passed to the auto-interpreter, which can now easily explain the feature as "coffee or tea, case insensitive", recording an accuracy score of 0.9.
By contrast, np_max-act-logits explains the same feature as "or", obtaining an accuracy score of 0.45, worse than chance.
The explainer model is the same. The only difference lies in the instructions and examples provided to it. The failure of np_max-act-logits can be explained by looking at its instructions. As part of the first method of the cascade, the model is instructed to "Look at MAX_ACTIVATING_TOKENS. If they share something specific in common, or are all the same token or a variation of the same token (like different cases or conjugations), respond with that token."
In this case, the model correctly identified the slash as the same token for every example and therefore responded with it, (mistakenly) labeling the slash as "or".
The model is not smart enough to detect the token's genericity, discard it, and proceed to method 4, which would have allowed it to correctly explain the feature by looking at the broader context around the max-activating token.
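The routing step that made the difference here is cheap to express in code. The sketch below picks the category with the highest correlation score and returns only that category's example set for the auto-interpreter. The function name, the scores, and the example strings are hypothetical and are not the values from Table 5.

```python
def route_examples(correlation_scores, example_sets):
    """Pick the category with the highest correlation score and return its examples."""
    category = max(correlation_scores, key=correlation_scores.get)
    return category, example_sets[category]

# Hypothetical scores and example sets for a coffee/tea-like feature.
scores = {
    "before_act_token": -0.8,
    "act_token": -1.1,
    "after_act_token": -0.6,
    "short_window": 1.9,
}
examples = {
    "before_act_token": ["coffee", "Coffee"],
    "act_token": ["/", "/"],
    "after_act_token": ["tea", "Tea"],
    "short_window": ["coffee/tea", "Coffee/Tea"],
}
category, routed = route_examples(scores, examples)
```

Because the choice is made before any LLM call, the auto-interpreter prompt only ever contains one example set and needs no instructions about how to choose between them.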
AIR failure
In this section, we analyze a scenario in which AIR fails. Feature 23/151584 (Figure 9) fires in contexts related to "big O notation".
Figure 9: top activation examples for feature 23/151584
np_max-act-logits explains it as "O", while air explains it as "exclamations of surprise or uncertainty". Both mistakenly focus only on the top-activating token and obtain an accuracy score close to chance.
The reason for the failure of np_max-act-logits is the same one described in the previous section: the auto-explainer stops at method 1 and therefore fails to look at the broader context.
For air, the failure can be explained by looking at the correlation scores:
Table 6: correlation scores for feature 23/151584 and embedder all-mpnet-base-v2
The feature is assigned the category act_token, so only a sequence of O tokens is provided to the auto-explainer, which accounts for the given explanation.
More appropriate categories would have been short_window, medium_window, or long_window, which would have provided the auto-explainer with enough context to explain the feature correctly.
The correlation scorer, through the embedder, did not penalize the genericity of the example set corresponding to the category act_token strongly enough.
Obscure features
Lastly, we have obscure features. These are the ones that I like the most. They are truly nonsensical: there's no correlation or common thread whatsoever across the activation examples or the logits. Feature 16/215906 is a great example. Its activations range from African history to technical documentation on databases, Python code, and tourist guides to Venice.
Figure 10: top activation examples for 16/215906
The explanations given by state-of-the-art auto-interpreter protocols are little more than guesses: "hashtags" for np_max-act-logits and "concepts related to specific, focused subjects" for oai_token-act-pair. Both explanations score close to chance in the fuzz-based auto-evaluation.
The correlation score tells the whole story. The embedder cannot detect any meaningful correlation across the examples and therefore does not clearly identify a category for that feature.
Table 7: correlation scores for feature 16/215906 and embedder Qwen/Qwen3-Embedding-0.6B
AIR allows for spotting an obscure feature virtually for free, before undergoing the expensive auto-interpreter and auto-explainer steps.
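One simple way to operationalize "does not clearly identify a category" is to flag a feature when no category reaches a minimum correlation score. This is only a sketch: the function name and the min_score threshold are assumptions made for illustration, and the protocol's actual criterion may differ.

```python
def is_obscure(correlation_scores, min_score=0.0):
    """Flag a feature when no category reaches the (assumed) minimum score."""
    return max(correlation_scores.values()) < min_score

# Hypothetical scores: every category sits below the threshold.
flagged = is_obscure({"act_token": -0.4, "short_window": -0.2, "long_window": -0.9})
# Hypothetical scores: one category clearly stands out.
kept = is_obscure({"act_token": 1.5, "short_window": -0.2, "long_window": -0.9})
```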
Conclusions
AIR is a new protocol that uses a sentence embedder to identify a feature's category and, based on the category, routes the most appropriate activation examples to the auto-interpreter. This differs from methods like np_max-act-logits that perform the category identification and feature explanation steps together via an LLM auto-interpreter.
The results of the experiment show that, compared with state-of-the-art auto-interpreters from OpenAI or Neuronpedia, AIR produces more accurate feature explanations at a lower cost.
More broadly, the results confirm Anthropic's empirical finding that features are not all created equal but belong to different categories with distinct properties.
My wish is to see more category-focused auto-interpreter methods in the future. A nice first step would be to have Neuronpedia display a category label for each feature.
The protocol suggested in this article is just a first step in that direction, and there's likely ample room for improvement by refining the algorithm to compute the correlation score and to identify a feature's category. Maybe better sentence embedders can be trained by using the accuracy score of the identified category as a training reward.
A nice side effect of this protocol is that it allows us to identify obscure features for free. These features might capture fascinating concepts that escape existing vocabulary and therefore be routed to human manual interpretability or to auto-interpretability via smarter models.
References
Appendix
We consider a set of $N$ features. Each $i$-th feature has activation examples $x_{i,1}, \dots, x_{i,n_i}$. Each $j$-th example for feature $i$ is associated with a maximum activation value $a_{i,j}$.
The correlation score for feature $i$ and token window $w$ combines the normalized intra and inter correlation scores defined below.
We start by defining the raw intra and inter correlation scores; the normalized variants follow immediately.
Raw intra correlation
The raw intra correlation score for feature $i$ and token window $w$ measures the semantic similarity of the examples in the set $E_{i,w}$ of size $n$. More precisely, it is the weighted mean cosine similarity over the $n(n-1)/2$ unordered pairs of examples, where $\mathbf{e}(x)$ denotes the sentence-embedder representation of token sequence $x$, and $\cos(\cdot, \cdot)$ denotes cosine similarity between two embedding vectors.
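A minimal sketch of the raw intra correlation, assuming the embeddings have already been computed by the sentence embedder. The text only states that the mean is weighted; using the product of the two examples' maximum activation values as the pair weight is an assumption made here for illustration, as are the function names.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def raw_intra(embeddings, activations):
    """Weighted mean cosine similarity over all unordered pairs of examples."""
    num = 0.0
    den = 0.0
    for j, k in combinations(range(len(embeddings)), 2):
        weight = activations[j] * activations[k]  # assumed pair weight
        num += weight * cosine(embeddings[j], embeddings[k])
        den += weight
    return num / den

# Toy 2-d "embeddings": two identical directions and one orthogonal one.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_score = raw_intra(toy_embeddings, [2.0, 2.0, 1.0])
```

With these toy inputs the identical pair carries weight 4 and each orthogonal pair carries weight 2, so the score is 4 / 8 = 0.5.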
Raw inter correlation
We first define the pool $P_w$ as the set containing all examples for a given token window $w$. The raw inter correlation score for feature $i$ and token window $w$ measures the semantic similarity between the example set $E_{i,w}$ for the given feature and the average direction of the embeddings from the pool $P_w$. More precisely, it is the weighted mean cosine similarity between the $i$-th feature's examples and the pool centroid $\mathbf{c}_w$ for that token window.
The centroid $\mathbf{c}_w$ is the weighted mean of the unit-normalized embeddings in $P_w$, renormalized to unit length.
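The centroid and the raw inter correlation can be sketched as follows. The weights are assumed to be the examples' maximum activation values, which the text does not state explicitly, and the function names are illustrative.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def weighted_centroid(pool_embeddings, pool_weights):
    """Weighted mean of unit-normalized embeddings, renormalized to unit length."""
    total = [0.0] * len(pool_embeddings[0])
    for emb, w in zip(pool_embeddings, pool_weights):
        for d, x in enumerate(unit(emb)):
            total[d] += w * x
    # Dividing by the total weight is unnecessary: renormalizing removes the scale.
    return unit(total)

def raw_inter(embeddings, activations, centroid):
    """Weighted mean cosine similarity between a feature's examples and the centroid."""
    num = sum(
        a * sum(x * c for x, c in zip(unit(e), centroid))
        for e, a in zip(embeddings, activations)
    )
    return num / sum(activations)

# Toy pool with two orthogonal directions and equal (assumed) weights.
centroid = weighted_centroid([[2.0, 0.0], [0.0, 3.0]], [1.0, 1.0])
inter = raw_inter([[5.0, 0.0], [0.0, 2.0]], [1.0, 3.0], centroid)
```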
Normalized intra/inter correlation
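The normalized versions of the intra and inter correlation scores for feature $i$ and token window $w$ are obtained as the z-scores of the raw values against a baseline:
$$\tilde{S}^{\text{intra}}_{i,w} = \frac{S^{\text{intra}}_{i,w} - \mu^{\text{intra}}_{w}}{\sigma^{\text{intra}}_{w}}, \qquad \tilde{S}^{\text{inter}}_{i,w} = \frac{S^{\text{inter}}_{i,w} - \mu^{\text{inter}}_{w}}{\sigma^{\text{inter}}_{w}}$$
where $S^{\text{intra}}_{i,w}$ and $S^{\text{inter}}_{i,w}$ are the raw scores and, over many random sets of examples sampled from the pool $P_w$ of all examples for token window $w$:
- $\mu^{\text{intra}}_{w}$ and $\sigma^{\text{intra}}_{w}$ are, respectively, the empirical mean and standard deviation of the resulting raw intra correlation values.
- $\mu^{\text{inter}}_{w}$ and $\sigma^{\text{inter}}_{w}$ are, respectively, the empirical mean and standard deviation of the resulting raw inter correlation values.
A high normalized intra score means that the examples in the feature's example set $E_{i,w}$ are more semantically similar to each other than expected by chance. A high normalized inter score means that the examples in $E_{i,w}$ are similar to the pool centroid, and are therefore more generic.
The final correlation score favors example sets that are both coherent with each other and specific, that is, sets with a high normalized intra score and a low normalized inter score.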
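The normalization and the final combination can be sketched as below. The baseline is estimated by scoring random example sets drawn from the pool; the number of samples, the seed, and the set size are hypothetical parameters. Combining the two z-scores as a plain difference (intra minus inter) is an assumption that matches the stated preference for coherent and specific sets, not a confirmed formula.

```python
import random
from statistics import mean, stdev

def baseline_stats(pool, raw_score, set_size, n_samples=200, seed=0):
    """Mean and standard deviation of raw_score over random example sets from the pool."""
    rng = random.Random(seed)
    values = [raw_score(rng.sample(pool, set_size)) for _ in range(n_samples)]
    return mean(values), stdev(values)

def z_score(raw_value, mu, sigma):
    """Standardize a raw score against the baseline."""
    return (raw_value - mu) / sigma

def correlation_score(z_intra, z_inter):
    # Assumed combination: reward coherence, penalize genericity.
    return z_intra - z_inter

# Toy usage: the "pool" is a list of numbers and the raw score is their mean.
mu, sigma = baseline_stats(list(range(100)), lambda s: sum(s) / len(s), set_size=10)
```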
The features are filtered to those with at least 20 non-zero activation examples and, within them, at least 10 non-duplicate examples.
For each feature, the example set $E_{i,w}$ is obtained, analogously to np_max-act-logits and oai_token-act-pair, by concatenating the top-10 non-duplicate activation examples for that feature.
Given that the correlation score is normalized per embedder, it is possible to compare the results across various embedders.
For each feature, a dataset composed of 10 positive and 10 negative examples is provided to the auto-scorer. The dataset is built once per feature and reused to score the explanations from the different protocols. An accuracy score of 0.5 indicates that the explanation does not help the scorer distinguish activating from non-activating examples better than chance.
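To make the accuracy numbers concrete, the sketch below computes the score as the fraction of the 20 examples on which the scorer's guess matches the true label. The function name and the scorer guesses are hypothetical; how the scorer produces its guesses is outside the scope of this snippet.

```python
def fuzz_accuracy(predictions, labels):
    """Fraction of examples where the scorer's guess matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# 10 positive (activating) and 10 negative (non-activating) examples.
labels = [True] * 10 + [False] * 10
# Hypothetical scorer output: one miss among positives, one among negatives.
predictions = [True] * 9 + [False] + [False] * 9 + [True]
accuracy = fuzz_accuracy(predictions, labels)
# A scorer that calls every example "activating" lands exactly at chance.
chance = fuzz_accuracy([True] * 20, labels)
```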
In the experiment trace, you'll find a postprocessed explanation variant for air, in which a category-specific prefix is prepended to the explanation obtained by air. The results for this variant have not been reported, as its average accuracy score is lower than that of air.