LESSWRONG

Miles Wang

Miles Wang's Shortform

Dec 15, 2023•1

Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”

Dec 15, 2023•34

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Takeaways from a Mechanistic Interpretability project on “Forbidden Facts”

Tony Wang, Miles Wang, kaivu

Overview

We tried to understand, at the circuit level, how Llama-2 models perform a simple task: not replying with a word when told not to. We mostly failed to fully reverse-engineer the responsible circuit, but in the process we learned a few things about mechanistic interpretability. We share our takeaways here in the hope that they are useful to other researchers. We are very interested in critiques of our views.

Our major takeaways:

Some model behaviors might be computationally irreducible.
“Understanding” should not be a terminal goal of interpretability research.

More minor takeaways:

Log-odds is a great metric.
Tooling is important and should not be taken for granted.
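For readers unfamiliar with the log-odds metric mentioned above, here is a minimal sketch. It assumes the standard definition, log(p / (1 - p)), applied to the probability a model assigns to some answer. The function name `log_odds`, the clamping constant `eps`, and the example probabilities are all hypothetical illustrations, not taken from the paper.

```python
import math


def log_odds(p: float, eps: float = 1e-12) -> float:
    """Return log(p / (1 - p)), clamping p away from 0 and 1.

    The clamp (eps) is an assumption added here so the function
    stays finite at p = 0 or p = 1.
    """
    p = min(max(p, eps), 1.0 - eps)
    # log1p(-p) computes log(1 - p) accurately when p is small.
    return math.log(p) - math.log1p(-p)


# Hypothetical probabilities of the model saying a given word,
# without and with an instruction not to say it.
p_baseline = 0.9
p_forbidden = 0.1

# A negative shift means the instruction made the word less likely.
shift = log_odds(p_forbidden) - log_odds(p_baseline)
print(f"log-odds shift: {shift:.3f}")
```

One property that may make the metric convenient: unlike a raw probability difference, a log-odds difference does not saturate near 0 or 1, so a move from 0.99 to 0.999 registers as a large change instead of a negligible one.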

More technical details are available in our paper, which serves as the appendix to...

Miles Wang • 2y • Quick Take

Test