A post in our series "Feature Geometry & Structured Priors in Sparse Autoencoders"
Introduction
Here, we compare a variational sparse autoencoder (vSAE) against a standard sparse autoencoder (SAE). We hypothesize that adding a KL divergence term to the loss function and sampling the latent activations from a Gaussian distribution should pressure features to 'move apart', yielding a better representation of the language model's latent space and, in turn, improved performance and interpretability. We ultimately find that this addition does not improve interpretability and degrades performance.
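To make the setup concrete, here is a minimal sketch of the variational modification in PyTorch. The class and function names (`VariationalSAE`, `vsae_loss`) and the `kl_coeff` weighting are illustrative assumptions, not the exact implementation used in our experiments: the encoder predicts a mean and log-variance per feature, the latent is sampled via the reparameterization trick, and a KL term against a unit Gaussian prior is added to the reconstruction loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalSAE(nn.Module):
    """Sketch of a variational SAE: the encoder outputs a mean and
    log-variance per feature; latents are sampled stochastically."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc_mu = nn.Linear(d_model, d_hidden)
        self.enc_logvar = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        mu = self.enc_mu(x)
        logvar = self.enc_logvar(x)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vsae_loss(x, recon, mu, logvar, kl_coeff=1e-3):
    # Reconstruction error plus KL divergence to a unit Gaussian prior;
    # the KL term is the hypothesized source of "repulsive" pressure.
    recon_loss = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_coeff * kl
```

The KL term penalizes posteriors that stray far from the isotropic prior, which is the mechanism we conjectured would spread features apart in latent space.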