A Selection of Randomly Selected SAE Features
Epistemic status - self-evident. In this post, we interpret a small sample of Sparse Autoencoder features which reveal meaningful computational structure in the model that is clearly highly researcher-independent and of significant relevance to AI alignment.

Motivation

Recent excitement about Sparse Autoencoders (SAEs) has been tempered by the following question: do SAE features reflect properties of the model, or do they merely capture correlational structure in the underlying data distribution? While a full answer to this question is important and will take deliberate investigation, we note that researchers who have spent large amounts of time interacting with feature dashboards think it more likely that SAE features capture highly non-trivial information about the underlying models. Evidently, SAEs are the one true answer to ontology identification, and as evidence of this we show how initially uninterpretable features often become quite interpretable with further investigation and tweaking of dashboards. In each case, we describe how we make the best possible use of feature dashboards to ensure we aren't fooling ourselves or reading tea leaves.

Note - to better understand these results, we highly recommend that readers unfamiliar with SAE feature dashboards briefly refer to the relevant section of Anthropic's publication (whose dashboard structure we emulate below).

TLDR - to understand what concept a feature encodes, we look for patterns in the text that cause it to activate most strongly (a minimal sketch of this procedure appears after the case studies below).

Case Studies in SAE Features

Scripture Feature

We open with a feature that seems to activate strongly on examples of sacred text, specifically from the works of Christianity. Even though interpreting SAEs seems bad, and it can really make you mad, seeing features like this reminds us to always look on the bright side of life.

Perseverance Feature

We regist
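As promised in the TLDR above, here is a minimal sketch of the max-activating-example procedure. It assumes you have already run the model to get residual-stream activations and have an SAE object with an encode method mapping them to feature activations; these names and shapes are illustrative assumptions, not any particular library's API.

import torch

def top_activating_examples(token_ids, activations, sae, feature_idx, k=10):
    """Return the k (context, activation) pairs where one SAE feature fires hardest.

    token_ids:   [batch, seq] integer tensor of tokens fed to the model
    activations: [batch, seq, d_model] residual-stream activations (assumed precomputed)
    sae.encode:  assumed to map activations to [batch, seq, n_features]
    """
    feature_acts = sae.encode(activations)[..., feature_idx]  # [batch, seq]
    top_vals, top_flat_idx = torch.topk(feature_acts.flatten(), k)
    seq_len = feature_acts.shape[1]
    batch_idx = top_flat_idx // seq_len
    pos_idx = top_flat_idx % seq_len
    # Return the context up to and including the token the feature fired on.
    return [
        (token_ids[b, : p + 1].tolist(), v.item())
        for b, p, v in zip(batch_idx, pos_idx, top_vals)
    ]

Reading the contexts returned by a helper like this (ideally rendered as a dashboard rather than raw token lists) is how one guesses at the concept a feature encodes.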
Feature absorption is undesirable because it makes SAEs more confusing and harder to use, but it is also a concrete obstacle to approaches like sparse feature circuits, since it causes problems for automatically constructed circuits composed of features. Instead of a spelling circuit containing a single "starts with E" feature that fires on all "E" words, you get a "starts with E" feature that fires on most "E" words (part of the circuit) plus all the absorbing features, such as one that fires only on the token "Europe" (a large number of features, each adding a node to the circuit). So instead of being able to create a simple DAG, you get a complex one.
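As a toy illustration of the node blow-up (a sketch using networkx; the feature names and the set of absorbed tokens are made up for the example, not drawn from any real SAE):

import networkx as nx

# Without absorption: one general "starts with E" feature feeds the output.
clean = nx.DiGraph()
clean.add_edge("starts-with-E", "output: E")

# With absorption: the general feature misses absorbed tokens, so the circuit
# must also include every token-specific absorbing feature.
absorbed_tokens = ["Europe", "Elephant", "Energy", "Engine"]  # hypothetical stand-ins
absorbed = nx.DiGraph()
absorbed.add_edge("starts-with-E (most E words)", "output: E")
for tok in absorbed_tokens:
    absorbed.add_edge(f"fires on '{tok}'", "output: E")

print(clean.number_of_nodes(), "nodes vs.", absorbed.number_of_nodes(), "nodes")
# Prints: 2 nodes vs. 6 nodes - each absorbed token adds another node to the DAG.

With thousands of absorbing features rather than four, the circuit graph quickly becomes too large to read, which is the practical cost described above.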