Mechanistic interpretability through clustering
How do activations in a transformer cluster together, and what can we learn from those clusters? Unsupervised methods such as sparse autoencoders have made a lot of progress in finding monosemantic features in LLMs. However, is there a way to interpret activations without breaking them down into features? The...
Dec 4, 2023