Exploring Decomposability of SAE Features
TL;DR SAE features are often less decomposable than their feature descriptions imply. Using a prompting technique to test potential sub-components of individual SAE features (for example, following the analogy from the linked post, decomposing Einstein into “German”, “physics”, “relativity” and “famous”), I found very divergent behaviour in how decomposable...