This is a preliminary research report; we are still building on initial work and would appreciate any feedback.
Summary
Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle to the interpretability of task-optimized deep networks, with implications for AI safety.
The classic origin story of polysemanticity is that the data contains more "features" than there are neurons, such that learning to solve a task forces the network to allocate multiple unrelated features to the same neuron, threatening our ability to understand the network's internal processing.
In this work, we present a second and non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when...
Sorry for the late reply! I agree with your assessment of the TMS paper. In our case, the L1 regularization is strong enough that the encodings do completely align with the canonical basis: in the experiments behind the "Polysemantic neurons vs hidden neurons" graph, we observe that all weights are either 0 or close to ±1. And I believe that all solutions which minimize the loss (with the L1 regularization included) align with the canonical basis.
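For anyone who wants to poke at this kind of claim themselves, here is a minimal sketch: train a small tied-weight ReLU autoencoder on sparse synthetic data with an L1 penalty on the hidden activations, then measure what fraction of the learned weights end up near 0 or ±1 (i.e. aligned with the canonical basis). The data distribution, architecture, learning rate, and penalty strength below are illustrative assumptions, not the setup used for the graph mentioned above.

```python
import torch

torch.manual_seed(0)

n_features, n_hidden, n_samples = 6, 8, 4096
l1_coeff = 3e-3  # assumed penalty strength; the claim is only that it is "strong enough"

# Sparse synthetic data: each feature is independently active with low probability.
active = (torch.rand(n_samples, n_features) < 0.1).float()
X = active * torch.rand(n_samples, n_features)

# Tied-weight ReLU autoencoder: encode with W, decode with W^T.
W = torch.nn.Parameter(0.1 * torch.randn(n_features, n_hidden))
opt = torch.optim.Adam([W], lr=1e-2)

for step in range(3000):
    h = torch.relu(X @ W)              # hidden activations
    X_hat = h @ W.T                    # reconstruction
    loss = ((X - X_hat) ** 2).mean() + l1_coeff * h.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# If the learned encoding aligns with the canonical basis, every entry of W
# should be close to 0 or close to +/-1.
w = W.detach()
frac_zero = (w.abs() < 0.05).float().mean().item()
frac_unit = ((w.abs() - 1).abs() < 0.05).float().mean().item()
print(f"weights near 0: {frac_zero:.0%}, weights near +/-1: {frac_unit:.0%}")
```

The thresholds (0.05) and the choice of a tied-weight encoder are just for illustration; the point is only that "alignment with the canonical basis" is easy to check directly from the weight matrix.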