Thank you for the upvote! My main frustration with logit lens and tuned lens is that these methods are kind of ad hoc and do not reflect component contributions in a mathematically sound way. We should be able to rewrite the output as a sum of individual terms, I told myself.

For the record, I did not assume MLP neurons are monosemantic or polysemantic, and this is why I did not mention SAEs.

Reply

5Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability

2y

4

10Exploring Llama-3-8B MLP Neurons

2y

0