LESSWRONG

infinitevoid

Comments

(OLD) An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers
infinitevoid, 3y

Thanks for writing this - I've found it useful in my current attempts to survey some key mechanistic interpretability literature.

"a decent survey paper on what’s up in the rest of interpretability."

"I’m personally pretty meh about the majority of the academic field of interpretability"

A bit confused by this. This paper's abstract and intro claim to be focusing on inner interpretability methods - which they define as learned features and internal structure. This seems to fit my idea of what mechanistic interpretability is pretty well, but you seem to classify it as 'the rest of interpretability'.

Do you see a clear distinction between mechanistic interpretability methods vs the methods reviewed in this paper? If so, what's the distinction?

[Linkpost] A survey on over 300 works about interpretability in deep networks
infinitevoid, 3y

Nice paper, thanks! A meta question - how did you analyse and systematise the results of over 300 papers? (gesturing at software tools/general methodology here)
