Miles Wang
Overview
We tried to understand, at the circuit level, how Llama-2 models perform a simple task: not replying with a word when told not to. We largely failed to fully reverse-engineer the responsible circuit, but in the process we learned a few things about mechanistic interpretability. We share our takeaways here in the hope that they are useful to other researchers, and we are very interested in critiques of our views.
Our major takeaways:
More minor takeaways:
More technical details are available in our paper, which serves as the appendix to...