Normalizing Sparse Autoencoders
TL;DR: Sparse autoencoders (SAEs) present a promising direction towards automating mechanistic interpretability, but they are not without flaws. One known issue with the original sparse autoencoders is the feature suppression effect, which is caused by the conflict between the L2 and L1 loss and the unit norm constraint on the...
Apr 8, 2024