What do you think are the top 3 (or top 5, or top handful of) interpretability results to date? If I gave a 5-minute talk called "The Few Greatest Achievements of Interpretability to Date," what would you recommend I include?

[-]scasper3y*Ω13260

My answer to this is actually tucked into one paragraph on the 10th page of the paper: "This type of approach is valuable...reverse engineering a system". We cite examples of papers that have used interpretability tools to generate novel adversaries, aid in manually fine-tuning a network to induce a predictable change, or reverse engineer a network. Here they are.

Making adversaries:

https://distill.pub/2019/activation-atlas/

https://arxiv.org/abs/2110.03605

https://arxiv.org/abs/1811.12231

https://arxiv.org/abs/2201.11114

https://arxiv.org/abs/2206.14754

https://arxiv.org/abs/2106.03805

https://arxiv.org/abs/2006.14032

https://arxiv.org/abs/2208.08831

https://arxiv.org/abs/2205.01663

Manual fine-tuning:

https://arxiv.org/abs/2202.05262

https://arxiv.org/abs/2105.04857

Reverse engineering (I'd put an asterisk on these ones though because I don't expect methods like this to scale well to non-toy problems):

https://www.lesswrong.com/posts/N6WM6hs7RQMKDhYjB/a-mechanistic-interpretability-analysis-of-grokking

https://distill.pub/2020/circuits/curve-detectors/

[-]infinitevoid3y50

Nice paper, thanks! A meta question - how did you analyse and systematise the results of over 300 papers? (gesturing at software tools/general methodology here)

[-]scasper3y60

The taxonomy we introduced in the survey gave a helpful way of splitting up the problem. Other than that, it took a lot of effort, several Google Docs that got very messy, and https://www.connectedpapers.com/. Personally, I've also been working on interpretability for a while and have passively formed a mental model of the space.

[-]evhub3yΩ452

(Moderation note: added to the Alignment Forum from LessWrong.)

[-]Peter Hase3y40

Nice work! Two good points from the paper:

"Works should evaluate how their techniques perform on randomly or adversarially sampled tasks"
"...highlights a need for techniques that allow a user to discover failures that may not be in a typical dataset or easy to think of in advance. This represents one of the unique potential benefits of interpretability methods compared to other ways of evaluating models such as test performance"

[-]Charlie Steiner3y40

This is relevant to my interests, thanks!


LESSWRONG
LW

[Linkpost] A survey on over 300 works about interpretability in deep networks
