SAE feature geometry is outside the superposition hypothesis
Written at Apollo Research

Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures of feature UMAPs. We don’t currently have good concepts for talking about this structure in feature geometry, but it is likely very important for model computation. An eventual understanding of feature geometry might look like a hodgepodge of case-specific explanations, or like superposition supplemented with additional concepts, or plausibly like an entirely new theory that supersedes superposition. To develop this understanding, it may be valuable to study toy models in depth and to do theoretical or conceptual work in addition to studying frontier models.

Epistemic status: Decently confident that the ideas here are directionally correct. I’ve been thinking these thoughts for a while, and recently got round to writing them up at a high level. Lots of people (including both SAE stans and SAE skeptics) have thought very similar things before, and some of them have written about it in various places too. Some of my views, especially on the merit of certain research approaches for tackling the problems I highlight, are presented here without my best attempt to argue for them.

What would it mean if we could fully understand an activation space through the lens of superposition?

If you fully understand something, you can explain everything about it that matters to someone else in terms of concepts you (and hopefully they) understand. So we can gauge how well I understand an activation space by how well I can communicate to you what the activation space is doing, and we can test whether my explanation is good by seeing if you can construct a functionally equivalent activation space (which need not be completely identical, of course) solely from the information I have given you.
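To make the distinction concrete, here is a minimal sketch (not from the post; the `decoder` directions, dimensions, and active features are all made up) of the two kinds of description in play: the superposition-style description of an activation as a sparse weighted sum of feature directions, and a separate look at the geometry of those directions themselves, e.g. projecting hypothetical day-of-the-week feature vectors onto their top two principal components to see whether they sit on a circle.

```python
# Minimal sketch, assuming hypothetical SAE decoder directions (not from the post).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 7  # e.g. 7 hypothetical day-of-the-week features

# Hypothetical SAE decoder directions: one unit vector per feature.
decoder = rng.normal(size=(n_features, d_model))
decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)

# Superposition-style description of an activation: a sparse list of
# (feature index -> activation strength) pairs, reconstructed as a
# weighted sum of the corresponding feature directions.
active = {2: 1.3, 5: 0.7}
reconstruction = sum(a * decoder[i] for i, a in active.items())

# Feature-geometry question: do the directions have structure beyond being
# a set of sparsely used vectors? Project them onto their top-2 principal
# components and inspect radii and angular spacing.
centred = decoder - decoder.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
coords = centred @ vt[:2].T  # (n_features, 2) projection
angles = np.sort(np.arctan2(coords[:, 1], coords[:, 0]))
gaps = np.diff(np.concatenate([angles, [angles[0] + 2 * np.pi]]))
print("radii:", np.round(np.linalg.norm(coords, axis=1), 2))
print("angular gaps:", np.round(gaps, 2))  # roughly equal gaps => circular layout
```

The point of the sketch is that the first half, by itself, never references where the feature directions sit relative to one another; any structure like the circular day-of-the-week arrangement lives entirely in the second half.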
I think it is reasonable to say that your plan is not 'have the AIs do your homework' to the extent that your research has been roadmapped by humans. This is a spectrum; here are some points on it:
- Using AIs to monitor for potential reward hacks, or using AIs as automated interpretability agents. At a stretch, maybe I'd call this 'have the AIs do our reward hacking homework' or similar. We're here already.
- Using AIs to come up with new debate algorithms that better satisfy desiderata outlined by humans, or SAE variants that perform better on SAEbench. We might call this 'have the AIs do our algorithm design homework'.
- Using the AIs to make progress
…