[Paper] Dictionary Learning Identifiability for Understanding SAEs
Brief Summary Despite showing promise for studying the internals of neural networks, Sparse Autoencoders (SAEs) do some puzzling things, like feature-splitting, feature-absorption, or encoding dense features. Working out why they show these behaviours may help us extract more insight from SAEs, and provide principles for designing their successors. In this...
Jun 512