Alistair Fraser — LessWrong

Mechanistic interpretability through clustering

How are activations in a transformer clustered together and what can we learn? There has been a lot of progress using unsupervised methods (such as sparse autoencoders) to find monosemantic features in LLMs. However, is there a way that we can interpret activations without breaking them down into features? The...

Dec 4, 2023•1