Incidental polysemanticity
This is a preliminary research report; we are still building on initial work and would appreciate any feedback.

Summary

Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle to the interpretability of task-optimized deep networks,[1] with implications for AI safety. The classic...
Nov 15, 2023