Broken Latents: Studying SAEs and Feature Co-occurrence in Toy Models
Thanks to Jean Kaddour, Tomáš Dulka, and Joseph Bloom for providing feedback on earlier drafts of this post.

In a previous post on Toy Models of Feature Absorption, we showed that tied SAEs seem to solve feature absorption. However, when we tried to train some tied SAEs on Gemma 2...
Even just for evaluating the utility of SAEs for supervised probing, though, I think it's unfair to use the same layer for all tasks. As far as I know, there could easily be tasks where the model represents the target concept using a small number of linear features at some layer, but not at the chosen layer. This hurts k-sparse SAE probe performance far more than it hurts the baselines, because the baselines can make the best of a bad situation at the chosen layer, e.g. by combining many features that are each weakly correlated with the target concept and by using non-linearities. I think it would be a fair test if the 'quiver of arrows' were expanded to include each method applied at each of a range of layers, as sketched below.
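To make the proposed evaluation concrete, here is a minimal sketch of that layer sweep. Everything in it is hypothetical: the activations are synthetic stand-ins (with the concept linearly concentrated in a few features at one layer and only diffusely present elsewhere), and the "k-sparse probe" is just a simple proxy that restricts a logistic-regression probe to the k features most correlated with the label. The point is only the shape of the comparison: every method is fit at every layer, rather than all methods sharing one fixed layer.

```python
# Hypothetical layer-sweep probing comparison on synthetic activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features, n_layers, k = 2000, 256, 6, 4

# Synthetic stand-in for per-layer activations: the target concept is
# strongly, sparsely represented at layer 3 and only weakly elsewhere.
y = rng.integers(0, 2, size=n_samples)
layer_acts = []
for layer in range(n_layers):
    X = rng.normal(size=(n_samples, n_features))
    signal = 2.0 if layer == 3 else 0.2  # concentrated vs. diffuse signal
    X[:, :k] += signal * y[:, None]
    layer_acts.append(X)

def k_sparse_probe(X_tr, y_tr, X_te, y_te, k):
    # Proxy for a k-sparse SAE probe: pick the k features most correlated
    # with the label, then fit a probe restricted to those features.
    corr = np.abs(np.corrcoef(X_tr.T, y_tr)[-1, :-1])
    idx = np.argsort(corr)[-k:]
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, idx], y_tr)
    return clf.score(X_te[:, idx], y_te)

def dense_baseline(X_tr, y_tr, X_te, y_te):
    # Baseline probe that can pool many weakly informative features.
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Fit every method at every layer; each method should then be judged by
# its best layer, not by a single layer fixed in advance for all tasks.
for layer, X in enumerate(layer_acts):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    print(f"layer {layer}: "
          f"k-sparse={k_sparse_probe(X_tr, y_tr, X_te, y_te, k):.2f}  "
          f"baseline={dense_baseline(X_tr, y_tr, X_te, y_te):.2f}")
```

On this toy data the sparse probe only shines at the layer where the concept is sparsely and linearly represented, while the dense baseline degrades more gracefully at the other layers, which is exactly the asymmetry the fixed-layer evaluation would hide.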