With the growing prevalence of large language models (LLMs), explainable AI (XAI) is an increasingly broad concern. A growing field of XAI is mechanistic interpretability, and one of its methods is the use of sparse autoencoders (SAEs) to extract the features of LLMs. The effect of the text datasets used to train SAEs on the resulting features is a topic of ongoing study. Here we compare SAE features obtained by training on two text datasets separately and then on their combination, for two LLMs. The text datasets cover topics from lower and higher represented groups. We find that the training text influences the features that emerge: combining datasets tends to split features rather than aggregate them. In addition, the LLMs used to label features also influence the feature labels, with the higher represented group being underrepresented in the labels.
LLMs have become ubiquitous in modern life, creating the need for more advanced XAI.[1][2] Mechanistic interpretability aims to reverse engineer complex neural networks.[3][4] An issue with LLMs is that their neurons are not as interpretable as those of other neural networks such as CNNs--the neurons of LLMs are more polysemantic.[5][6] However, these polysemantic neurons can still achieve a variety of tasks, since inputs tend to be sparse, so only a limited number of "features" of language are activated in the models at a time. Training SAEs on LLM activations attempts to extract monosemantic features from these LLMs.[7][8]
SAEs as XAI for LLMs tend to be trained on activations from layers of the LLM, which the SAEs learn to reconstruct. Normally, autoencoders reduce the input to fewer latent features; with SAEs, however, an expansion factor ($c$) is multiplied by the input dimension ($d_{\text{in}}$) to give the number of hidden dimensions ($d_{\text{hidden}} = c \cdot d_{\text{in}}$) in the SAE. The activations of the hidden dimensions of the SAE are the intended monosemantic feature values. These features can then be manipulated, such as by clamping them at a high value or turning them off, to affect the output of the LLM. Templeton et al. showed that an LLM can be made to talk excessively about the Golden Gate Bridge by identifying a feature for the bridge and clamping it to a high value.[8:1] Feature steering also has implications for responsible AI: Harle and coauthors showed that they could reduce toxicity in an LLM by turning off toxic features.[9] However, Neel Nanda's team at Google DeepMind (GDM) has diverted research away from SAEs, citing that they perform poorly compared to linear probes when classifying harmful out-of-distribution (OOD) content in prompts.[10] Chris Olah of Anthropic, however, disagrees and will continue to research with SAEs.[11]
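As an illustrative sketch (not the implementation from the cited works), clamping a feature amounts to editing the SAE latent vector before decoding it back into an LLM activation; the `encode`/`decode` interface used here is hypothetical:

```python
def steer_activation(x, sae, feature_idx, clamp_value):
    """Clamp one SAE feature and reconstruct the activation (illustrative only).

    x: LLM activations, shape (batch, d_in)
    sae: object exposing encode(x) -> latents and decode(z) -> reconstruction
         (a hypothetical interface for this sketch)
    """
    z = sae.encode(x)                # sparse latent features, shape (batch, d_hidden)
    z[:, feature_idx] = clamp_value  # clamp the chosen feature (0.0 turns it off)
    return sae.decode(z)             # edited activation is fed back into the LLM
```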
In addition to the text used to train LLMs, the text used to train SAEs also has an effect on the features that are found. Bricken et al. have shown that oversampling bio-data leads to more features on bio-data.[12] Drori has shown that training SAEs with more safety data results in more safety features. Drori also points out that it may take more compute to find all the features of an LLM with SAEs than it took to train the LLM initially.[13] To the best of the author's knowledge, there has not been a study in which feature counts from individual similar-sized datasets are compared to feature counts from the datasets combined, with semantic comparisons, using the same training methods; in particular, the author has not found such a study where the datasets also consist of higher and lower represented groups. Such a comparison can add to the debate on how generalizable SAEs are, that is, whether they may be overfit on their training data, which may bear on the aforementioned GDM/Anthropic disagreement.
To create the two text datasets, we searched the Wikipedia API Python package for the topics 'inuit' and 'american'. These keywords were chosen because they have specific Wikipedia pages about them and may be considered lower and higher represented groups, respectively, with some overlap. Since there were more American articles, we kept only the longest articles in the American dataset so that the number of articles matched the Inuit dataset. The number of characters in the Inuit dataset was still just over a quarter that of the American dataset, so we found the average Inuit article length and truncated the American articles to that length; the remaining character difference between the datasets was less than 10. We then created the combined dataset by concatenating the two datasets.
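A minimal sketch of the dataset construction, assuming the `wikipedia` Python package; the article counts shown are placeholders rather than the values used in the study:

```python
import wikipedia  # the Wikipedia API Python package

def build_corpus(query, n_articles=50):
    """Collect plain-text contents of Wikipedia articles matching a search query."""
    texts = []
    for title in wikipedia.search(query, results=n_articles):
        try:
            texts.append(wikipedia.page(title, auto_suggest=False).content)
        except (wikipedia.DisambiguationError, wikipedia.PageError):
            continue  # skip ambiguous or missing pages
    return texts

# Balance the datasets: keep the longest American articles to match the Inuit
# article count, then truncate them to the average Inuit article length.
inuit = build_corpus("inuit")
american = sorted(build_corpus("american", n_articles=200), key=len, reverse=True)[:len(inuit)]
avg_len = sum(len(t) for t in inuit) // len(inuit)
american = [t[:avg_len] for t in american]
combined = inuit + american
```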
We then split the text tokens into chunks of the maximum input prompt length, with a sliding window of half that length. We input these chunks into the LLM under investigation and stored the activations from the layer halfway through the transformer. We used only GPT-style models, DistilBERT/DistilGPT2 and EleutherAI/GPT-Neo-125M, the former being smaller than the latter. These activations were used to train the SAE.
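A sketch of the chunking and activation extraction, assuming the Hugging Face `transformers` library (the model ID handling and the exact layer indexing are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125m"  # or "distilbert/distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

max_len = model.config.max_position_embeddings  # maximum input prompt length
stride = max_len // 2                           # sliding window of half that length

def mid_layer_activations(text):
    """Return activations from the layer halfway through the transformer, over sliding chunks."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    chunks = [ids[i:i + max_len] for i in range(0, max(1, len(ids) - max_len + 1), stride)]
    acts = []
    with torch.no_grad():
        for chunk in chunks:
            out = model(chunk.unsqueeze(0))
            mid = len(out.hidden_states) // 2       # roughly the middle layer
            acts.append(out.hidden_states[mid][0])  # shape (chunk_len, d_model)
    return torch.cat(acts, dim=0)
```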
We used TopK SAEs, which impose sparsity in the features by keeping only the top $k$ feature activations and masking the rest.[14] The LLM activations $x$ were input into the SAE encoder, and the resulting latent activations $z$ were input into the SAE decoder to reconstruct the LLM activations as $\hat{x}$:

$$z = \mathrm{TopK}(W_{\text{enc}} x + b_{\text{enc}}),$$

$$\hat{x} = W_{\text{dec}} z + b_{\text{dec}},$$
where the encoder and decoder weights and biases are $W_{\text{enc}}$, $b_{\text{enc}}$, $W_{\text{dec}}$, and $b_{\text{dec}}$, respectively. The expansion factor used was $c$; the TopK value $k$ was set separately for DistilGPT2 and GPT-Neo-125M. The reconstruction loss was the mean squared error (MSE). The Adam optimizer was used with a learning rate of 0.001. For GPT-Neo-125M, the LLM activations were standardized: the mean was subtracted and the result divided by the standard deviation. The number of training epochs was 200. We then obtained the activations of the trained SAE on the same LLM activations that were used to train it. To verify that training was successful and to determine hyperparameters, we examined the feature activation frequencies, i.e., the frequency at which tokens activate features. If features tended to activate for roughly 1/1000 tokens, with a peak in the distribution close to that value, the model was considered adequately trained. If features activate too often, such as for 1/10 tokens, they are considered dense and hard to interpret, since too many words activate the same features and the features may still be polysemantic; if features activate too rarely, they may be rare features that are also hard to interpret.
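A minimal PyTorch sketch of the TopK SAE and training loop described above; the batch size, the standardization epsilon, and the omission of a pre-encoder bias are assumptions of this sketch, not necessarily choices made in the actual code:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """TopK sparse autoencoder: keep only the k largest latent activations per input."""

    def __init__(self, d_in, expansion_factor, k):
        super().__init__()
        d_hidden = expansion_factor * d_in
        self.k = k
        self.encoder = nn.Linear(d_in, d_hidden)  # W_enc, b_enc
        self.decoder = nn.Linear(d_hidden, d_in)  # W_dec, b_dec

    def encode(self, x):
        z = self.encoder(x)
        topk = torch.topk(z, self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        return z * mask  # mask all but the top-k activations

    def forward(self, x):
        z = self.encode(x)
        return self.decoder(z), z

def train_sae(acts, expansion_factor, k, epochs=200, lr=1e-3, standardize=True):
    """Train a TopK SAE on a (n_tokens, d_in) tensor of LLM activations."""
    if standardize:  # used for the GPT-Neo-125M runs
        acts = (acts - acts.mean(0)) / (acts.std(0) + 1e-6)
    sae = TopKSAE(acts.shape[1], expansion_factor, k)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in acts.split(1024):  # batch size is a placeholder
            recon, _ = sae(batch)
            loss = nn.functional.mse_loss(recon, batch)  # reconstruction MSE
            opt.zero_grad()
            loss.backward()
            opt.step()
    return sae
```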
To tie features to words (reconstructed from sub-word tokens), we obtained the average feature activation per word across the dataset, weighted by chunks. We then kept the 20 words with the highest mean activations per feature and used LLMs to label the features. First, we used gpt-3.5-turbo to label each feature based on its 20 words and their mean activations. For instance, a feature from the Inuit GPT-Neo-125M model was (shown to one decimal place) -- "feature 4 [('couples', 10.8), ('family', 9.9), ('Council', 9.1), ('families', 8.7), ('clans', 8.6), ('household', 8.0), ('Court', 7.7), ('Family', 7.5), ('households', 7.1), ('couple,', 7.0), ('herds', 6.6), ('generations', 6.2), ('wedding', 5.8), ('camp', 5.7), ('Legacy', 5.4), ('associations', 5.3), ('Council,', 5.2), ('chains', 5.2), ('faith', 5.1), ('association', 4.9)]" -- which gpt-3.5-turbo labeled as 'family'. If it could not label a feature due to inconsistent words, the label was set to 'unknown'. We then used gpt-4-turbo to group these labels into higher-level labels, since many feature labels may be synonymous or similar, and asked it to be consistent in labeling, reusing labels when possible. The code is linked in the Appendix.
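An illustrative sketch of selecting the top 20 words per feature and labeling them with gpt-3.5-turbo; the prompt wording and data structures here are stand-ins, not the exact ones used:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def top_words(word_mean_activations, n=20):
    """Keep the n words with the highest mean activation for one feature.

    word_mean_activations: {word: mean_activation} computed across the dataset.
    """
    return sorted(word_mean_activations.items(), key=lambda kv: kv[1], reverse=True)[:n]

def label_feature(feature_idx, words_and_acts):
    """Ask gpt-3.5-turbo for a short label given (word, mean activation) pairs."""
    prompt = (
        f"These are the top words activating SAE feature {feature_idx}, with mean "
        f"activations: {words_and_acts}. Give a one- or two-word label for the feature, "
        "or 'unknown' if the words are inconsistent."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```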
The counts of the keywords 'inuit' and 'america' in the features' top-20 words, the total sums of their activations, and the mean activations per count are shown in Figure 1. In each case, the combined dataset shows values in between those of the separate datasets.
Figure 1: Bar plots of total keyword counts in the top 20 words of features, total sums of the activations, and mean activations per count for the Inuit, American, and combined datasets, for GPT-Neo-125M (top three) and DistilGPT2 (bottom three).
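For reference, the Figure 1 quantities could be computed along these lines; the per-feature data structure and the substring matching are assumptions of this sketch:

```python
def keyword_stats(features_top_words, keyword):
    """Count keyword appearances among each feature's top-20 words and sum their activations.

    features_top_words: {feature_idx: [(word, mean_activation), ...]}  (hypothetical layout)
    """
    count, total = 0, 0.0
    for words in features_top_words.values():
        for word, act in words:
            if keyword in word.lower():  # e.g. 'america' matches 'American'
                count += 1
                total += act
    mean = total / count if count else 0.0
    return count, total, mean
```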
Feature activation frequencies are shown in Figure 2. DistilGPT2 has a peak close to the ideal $10^{-3}$, while GPT-Neo-125M is shifted slightly to the right. DistilGPT2 has significantly more dead latents that do not activate for any token; the peaks at the far left correspond to these dead latents, which were assigned a small nonzero frequency to avoid taking the log of zero. The high-level feature labels for all datasets and each model are shown in Figure 3.
Figure 2: Log plot of feature activation frequencies for the Inuit, American, and combined datasets, for GPT-Neo-125M (top three) and DistilGPT2 (bottom three).
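The distributions in Figure 2 can be produced along these lines (a sketch; the bin count and the small floor value used for dead latents are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_activation_frequencies(latents, eps=1e-7, bins=50):
    """Histogram of log10 feature activation frequencies.

    latents: SAE latent activations as a NumPy array, shape (n_tokens, n_features).
    eps: floor so dead latents (frequency 0) still appear on the log axis.
    """
    freqs = (latents > 0).mean(axis=0)  # fraction of tokens activating each feature
    plt.hist(np.log10(freqs + eps), bins=bins)
    plt.xlabel("log10(activation frequency)")
    plt.ylabel("number of features")
    plt.show()
```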
Figure 3: High-level feature counts for GPT-Neo-125M (top) and DistilGPT2 (bottom).
Figure 2 shows similar peaks for the individual and combined datasets. This may imply that the datasets share features rather than aggregate them, including in the model with many dead latents. This may be an artifact of the training setup; combining the datasets could be accounted for during training to obtain more features. However, this was not readily apparent from the training loss, as the starting and ending reconstruction MSE values across the datasets were relatively close, within the same order of magnitude. There may be other metrics to take into account to better judge how many features should be expected.
What is noticeable in Figure 3 is that the combined datasets tend to give feature counts in between those of the individual datasets, which further implies the sharing of features. This may be significant for SAEs: if they were impartial to the data and truly uncovering features of the LLM (though these properties may not actually be expected), one might expect the features to add up when the datasets are combined, perhaps with denser features if there were no unused dead latents, rather than features being split. However, Leask et al. have found that SAEs are not able to find canonical features that cannot be further subdivided.[15] In addition, Bricken et al. write that determining the correct number of features may not be essential.[5:1] Thus, our results may give evidence of SAEs overfitting on their training data. One could argue that this is an artifact of the training or labeling regime; for instance, the features may actually be mostly the same in the combined set as in the individual sets, with only a shift in the top words causing the LLM to classify them differently. In any case, these concerns should be taken into account when training SAEs and labeling features. Note that one may not be aware of additional possible features if the dataset is never split. Thus, the converse of the results here is that dataset splitting can be used to check whether more features should be found. From Figure 2 alone, one might think the datasets are simply sharing the same activated features; however, Figure 3 shows that they have different features, so some features appear to be lost in the combined dataset. In addition, one would expect the separate datasets to produce at least some different features.
What is also noticeable is that there is an 'inuit' feature label, but no 'american' feature label, in Figure 3. These keywords stand out because they are in line with the Wikipedia search terms the datasets were created from. Both words appear in the top 20 words of features, as shown in Figure 1. Thus, the feature-labeling LLMs effectively subsumed mentions of 'america' into other labels. This may be due to the labeling LLMs' biases, e.g., how someone fitting a stereotype for a profession may be referred to by the profession alone, while an extra label may be added for people who do not fit the stereotype. Thus, anything Inuit-related may stand out and be labeled as Inuit, while anything American-related may be labeled by what is being discussed, for instance, culture. In other words, an American culture feature may be considered simply a culture feature, while an Inuit culture feature may be labeled as an Inuit or Arctic culture feature. However, this is only a hypothesis, as the features the labels came from were not recorded in the LLM output, only the counts. For reference, we also checked a pretrained SAE, "google/gemma-scope-2b-pt-res", which showed that 'inuit' and 'americans' activated a similar number of features -- 698 and 685, respectively -- when substituted into the brackets of the prompt "The {} have a long history and culture." However, this is not a direct comparison with our research. We can also see that 'unknown' features are only present in the American and combined datasets. This may likewise be due to American features being afforded more nuance.
Basing the feature labeling only on the top 20 feature-activating words loses information. However, manually inspecting the top 20 words generally showed clear themes. Condensing these words into simple theme labels also loses information, but fitting all labels into the LLM prompts required reducing the explanatory text.
Truncating the American articles to the average length of the Inuit articles may have caused some loss of context. However, the vast majority of the text would still have been in the correct context.
Investigating how to obtain cumulative, rather than shared, features from a combined dataset is a direction future research could take, and the corrections needed may be trivial to make. In general, the results presented here are mainly something to keep in mind for those who might otherwise be unaware.
With a large number of features, there is a trade-off between interpretability and scale. Higher-level interpretation is easier to understand but loses the fine-grained lower-level information. LLMs that create higher-level interpretations may impose their own biases, in addition to the biases of the LLM on which the SAE is trained.
Using an LLM for higher-level feature labels lost information on which feature the labels came from, since, to obtain consistent labeling, all feature labels from all SAEs were input at once. Only the counts per label were output. Further investigation would require not only time, but also money, since the API is not free. However, the total cost of this research was less than a toonie.
We are not stating that aggregation of features should happen under the same training circumstances, only that one should be aware that it does not seem to. The fact that feature counts split rather than aggregate here may itself be useful for finding how many, and which, features an LLM should have. The method here may serve as another way to estimate possible feature counts while taking semantics into account. By splitting datasets and checking their feature counts and semantics, one can get an idea of which features can be found and adjust the combined training accordingly (or keep the datasets separate if the same features cannot be found when combined). We make no precise recommendations for splitting techniques; however, saturation of feature counts may signal when to end the feature search.
We find that the text datasets used to train SAEs have an effect on the SAE features. The combined dataset tends to show feature counts in between those of the individual datasets. This result shows that the SAE training text should be taken into account when discovering features of LLMs. In addition, the higher represented group is underrepresented in the feature labels generated by an LLM, showing that the influence of the LLM used for feature labeling must also be taken into account.
Code can be found at https://github.com/Gregpy/SAE_Feature_Counts.
Paper at DOI: 10.13140/RG.2.2.27080.23040