Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small
To learn more about this work, check out the paper. We assume general familiarity with transformer circuits.

Intro

There isn't much interpretability work that explains end-to-end how a model is able to do some task (except for toy models). In this work, we make progress towards this goal by understanding some of the structure of GPT-2 small "in the wild", studying how it computes a simple natural language task. The task we investigate is what we call indirect object identification (IOI), where sentences like "When John and Mary went to the store, John gave a drink to" should be completed with "Mary" as opposed to "John" (see the short code sketch at the end of this intro for a concrete version of this prediction). We discovered a circuit of 26 attention heads grouped into 7 main classes; to our knowledge, this is the largest end-to-end attempt to reverse engineer how an LM computes a natural behavior. There is still much missing from our explanation, however, and it does not go down to the parameter level.

[Figure: The circuit we identified performing indirect object identification in GPT-2 small]

Besides discovering the particular circuit shown above, we gained some interesting insights about low-level phenomena arising inside language models. For example, we found attention heads communicating with pointers (sharing the location of a piece of information instead of copying it). We also identified heads compensating for the loss of function of other heads, and heads contributing negatively to the correct next-token prediction. We're excited to see whether these discoveries generalize beyond our case study.

Since explanations of model behavior can be confused or non-rigorous, we used our knowledge of the circuit to design adversarial examples. Moreover, we formulated 3 quantitative criteria to test the validity of our circuit. These criteria partially validate our circuit but indicate that there are still gaps in our understanding.

This post is a companion to our paper, in which we share lessons we learned from doing this work and describe some of Redwood's interpretability research.
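To make the IOI task above concrete, here is a minimal sketch of the kind of measurement involved, using the Hugging Face transformers library and the public "gpt2" checkpoint (GPT-2 small). The ioi_logit_diff helper and the single example prompt are our illustration, not code from the paper, which works with many templated prompts rather than one sentence.

```python
# Minimal sketch of the IOI task on GPT-2 small (assumes `torch` and `transformers`).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def ioi_logit_diff(prompt: str, io_name: str, s_name: str) -> float:
    """Logit for the indirect object (IO) name minus the logit for the
    subject (S) name, measured at the final position of the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # In the continuation, names appear with a leading space (" Mary", " John").
    io_id = tokenizer.encode(" " + io_name)[0]
    s_id = tokenizer.encode(" " + s_name)[0]
    return (logits[io_id] - logits[s_id]).item()

prompt = "When John and Mary went to the store, John gave a drink to"
print(ioi_logit_diff(prompt, io_name="Mary", s_name="John"))
# A positive value means the model prefers the indirect object ("Mary").
```

Comparing the two name logits at the final position, rather than looking at accuracy alone, isolates the model's preference between the indirect object and the subject from everything else it might say next; this logit-difference style of measurement is the flavor of metric used throughout the paper.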