LESSWRONG
ntt123

Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability
ntt123 · 1y · 30

Thank you for the upvote! My main frustration with logit lens and tuned lens is that these methods are somewhat ad hoc and do not reflect component contributions in a mathematically sound way. We should be able to rewrite the output as a sum of individual terms, I told myself.

For the record, I made no assumption about whether MLP neurons are monosemantic or polysemantic, which is why I did not mention SAEs.
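
To make "rewrite the output as a sum of individual terms" concrete, here is a toy sketch in plain Python. It assumes a purely linear unembedding and ignores the final normalization layer, so it illustrates the additivity idea only and is not the method from the post. The component names, dimensions, and random values are all hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def vec_sum(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(values) for values in zip(*vectors)]


random.seed(0)
d_model, vocab_size = 4, 6  # toy sizes, chosen arbitrarily

# Hypothetical component outputs written into the residual stream.
components = {
    "embedding": [random.gauss(0, 1) for _ in range(d_model)],
    "attention": [random.gauss(0, 1) for _ in range(d_model)],
    "mlp": [random.gauss(0, 1) for _ in range(d_model)],
}

# Hypothetical unembedding matrix of shape (vocab_size, d_model).
unembed = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]

# Logits computed the usual way: unembed the summed residual stream.
residual = vec_sum(list(components.values()))
logits = matvec(unembed, residual)

# Logits computed term by term: unembed each component, then add.
per_component_logits = {
    name: matvec(unembed, output) for name, output in components.items()
}
reconstructed = vec_sum(list(per_component_logits.values()))

# By linearity the two agree up to floating-point rounding.
max_error = max(abs(a - b) for a, b in zip(logits, reconstructed))
print(f"max reconstruction error: {max_error:.2e}")
```

Because the unembedding is linear, each entry of `per_component_logits` can be read as that component's additive share of every logit. A real model's final normalization is not linear, so it would have to be handled separately before this reading applies.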

5 · Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability · 1y · 4
10 · Exploring Llama-3-8B MLP Neurons · 1y · 0