Expanding the Scope of Superposition
Overview One of the active research areas for interpretability involves distilling neural network activations into clean, labeled features. This is made difficult because of superposition, where a neuron may fire in response to multiple, disparate signals making that neuron polysemantic. To date, research has focused on one type of such...
Sep 13, 202310