This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We’re excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel.
REMIX participants will work to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology to formalize and evaluate interpretability hypotheses. We hope to produce many more explanations of model behaviors akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks. We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems (more in Goals and FAQ).
Since mechanistic interpretability is currently a small sub-field of machine learning, we think it’s plausible that REMIX participants could make important discoveries that significantly further the field. We also think participants will learn skills valuable for many styles of interpretability research, and also for ML research more broadly.
Apply here by Sunday, November 13th [DEADLINE EXTENDED] to be a researcher in the program. Apply sooner if you’d like to start early (details below) or receive an earlier response.
Some key details:
- We expect to accept 30-50 participants.
- The research program will take place in Berkeley, CA.
- We plan to have the majority of researchers participate during the months of December and/or January (depending on availability) although we may invite some to begin earlier and are open to some starting as soon as possible.
- We expect researchers to participate for a month minimum, and (all else equal) will prefer applicants who are able to come for longer. We’ll pay for housing and travel, and also pay researchers for their time. We’ll clarify the payment structure prior to asking people to commit to the program.
- We’re interested in some participants acting as team leaders who would help on-board and provide research advice to other participants. This would involve arriving early to get experience with our tools and research directions and participating for a longer period (~2 months). You can indicate interest in this role in the application.
- We’re excited about applicants with a range of backgrounds; we’re not expecting applicants to have prior experience in interpretability research. Applicants should be comfortable working with Python, PyTorch/TensorFlow/Numpy (we’ll be using PyTorch), and linear algebra. We’re particularly excited about applicants with experience doing empirical science in any field.
- We’ll allocate the first week to practice using our interpretability tools and methodology; the rest will be spent doing research in small groups. See Schedule.
Feel free to email email@example.com with questions.
Why you should apply:
Research results. We are optimistic about the research progress you could make during this program (more below). Since mechanistic interpretability is currently a small sub-field of machine learning, we think it’s plausible that REMIX participants could make important discoveries that significantly further the field.
Skill-building. We think this is a great way to gain experience working with language models and interpreting/analyzing their behaviors. The skills you’ll learn in this program will be valuable for many styles of interpretability research, and also for ML research more broadly.
Financial support & community. This is a paid opportunity, and a chance to meet and connect with other researchers interested in interpretability.
Why we’re doing this:
Research output. We hope this program will produce research that is useful in multiple ways:
- We’d like stronger and more grounded characterizations of how language models perform a certain class of behaviors. For example, we currently have a variety of findings about how GPT-2-small implements indirect object identification (“IOI”, see next section for more explanation), but aren’t yet sure how often they apply to other models or other tasks. We’d know a lot more if we had a larger quantity of this research.
- For each behavior investigated, we think there’s some chance of stumbling across something really interesting. Examples of this include induction heads and the “pointer manipulation” result in the IOI paper: not only does the model copy information between residual streams, but it also copies “pointers”, i.e. the position of the residual stream that contains the relevant information.
- We’re interested in learning whether different language models implement the same behaviors in similar ways.
- We’d like a better sense of how good the current library of interpretability techniques is, and we’d like to get ideas for new techniques.
- We’d like to have more examples of this kind of investigation, to help us build infrastructure to support or automate this kind of research.
Training and hiring. We might want to hire people who produce valuable research during this program.
Experience running large collaborative research projects. It seems plausible that at some point it will be useful to run a huge collaborative alignment project. We’d like to practice this kind of thing, in the hope that the lessons learned are useful to us or others.
Why do this now?
We think our recent progress in interpretability makes it a lot more plausible for us to reliably establish mechanistic explanations of model behaviors, and therefore get value from a large, parallelized research effort.
A unified framework for specifying and validating explanations. Previously, a big bottleneck on parallelizing interpretability research across many people was the lack of a clear standard of evidence for proposed explanations of model behaviors (which made us expect the research produced to be pretty unreliable). We believe we’ve recently made some progress on this front, developing an algorithm called “causal scrubbing” which allows us to automatically derive an extensive set of tests for a wide class of mechanistic explanations. This algorithm is only able to reject hypotheses rather than confirming them, but we think that this still makes it way more efficient to review the research produced by all the participants.
- Our current plan is to require researchers in this program to do causal scrubbing on all serious research they produce and submit to the internal review process. (They can do exploratory analyses with whatever techniques they like.)
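To give a rough feel for the resampling idea behind this kind of test, here is a greatly simplified sketch on a toy balanced-parentheses classifier. It is not Redwood’s library or the full causal scrubbing algorithm: the functions `count_check`, `depth_check`, `model`, and `scrubbed_agreement` are hypothetical names invented for illustration, and agreement with the unscrubbed model stands in for a real performance metric. A hypothesis claims that one component depends only on some feature of the input; we swap that component’s input for every other input sharing the feature and check whether behavior is preserved.

```python
import itertools

def count_check(s):
    """Toy component: True when the string has equally many '(' and ')'."""
    return s.count("(") == s.count(")")

def depth_check(s):
    """Toy component: True when the running nesting depth never goes negative."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return True

def model(s, count_input=None):
    """Toy classifier. `count_input` feeds the count component a different string."""
    return count_check(s if count_input is None else count_input) and depth_check(s)

def scrubbed_agreement(dataset, feature):
    """Average agreement with the unscrubbed model when the count component's
    input is replaced by each string that shares `feature` with the original."""
    total = 0.0
    for s in dataset:
        allowed = [t for t in dataset if feature(t) == feature(s)]
        agree = sum(model(s, count_input=t) == model(s) for t in allowed)
        total += agree / len(allowed)
    return total / len(dataset)

# All strings of six parentheses (an assumed toy dataset).
dataset = ["".join(p) for p in itertools.product("()", repeat=6)]

# Hypothesis 1: the count component only cares whether the counts are equal.
good = scrubbed_agreement(dataset, count_check)
# Hypothesis 2 (wrong on purpose): the count component only cares about length.
bad = scrubbed_agreement(dataset, len)

print(good, bad)
```

In this toy setting the first hypothesis leaves behavior fully intact while the second degrades it, so the second is rejected. Surviving the test does not prove the first hypothesis, which matches the point above that this style of test can reject explanations but not confirm them.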
Improved proofs of concept. We now have several examples where we followed our methodology and were able to learn a fair bit about how a transformer was performing some behavior.
- We recently investigated a somewhat complex behavior in GPT-2-small, which we call “indirect object identification” (“IOI”). IOI is the model behavior where, given a sequence such as “Bob and Alice went to the store and Alice gave an apple to”, the model predicts “Bob” rather than “Alice”. Our research located and described the mechanism inside GPT-2-small that performs this behavior, i.e. which heads of the network are involved and what roles they play. Though our investigation into GPT-2-small was much less comprehensive than e.g. the description of how an image classification model detects curves in the Distill Circuits thread, we think our research here suggests that transformer language models do have some crisp circuits that can be located and described.
- We did causal-scrubbing-based analysis of a two-layer attention-only language model in order to get a detailed sense of the interactions in a model that were important for the performance of the “induction heads”. We also used causal scrubbing to assess hypotheses about a transformer trained to classify strings of parentheses as balanced or unbalanced. In both these cases, we found that we could apply our methodology pretty straightforwardly, yielding useful results.
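As a small illustration of how the IOI behavior can be turned into a number to measure, the sketch below builds a prompt of the form quoted above and scores it with a logit difference between the two candidate names. The helper names and the logit values are made up for illustration; they are not outputs of GPT-2-small and not part of Redwood’s tooling.

```python
def make_ioi_prompt(indirect_object, subject):
    """Build a prompt where `subject` appears twice and `indirect_object` once."""
    return (f"{indirect_object} and {subject} went to the store and "
            f"{subject} gave an apple to")

def logit_difference(next_token_logits, indirect_object, subject):
    """Positive when the indirect object is preferred over the repeated subject."""
    return next_token_logits[indirect_object] - next_token_logits[subject]

prompt = make_ioi_prompt("Bob", "Alice")

# Hypothetical next-token logits, invented for this example only.
fake_logits = {"Bob": 3.1, "Alice": 0.4, "the": 1.0}
score = logit_difference(fake_logits, "Bob", "Alice")

print(prompt)
print(score)
```

A model that performs IOI should give a positive score on prompts like this one; averaging the score over many name pairs and templates gives one way to track how much an intervention hurts the behavior.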
Tools that allow complicated experiments to be specified quickly. We’ve built a powerful library for manipulating neural nets (and computational graphs more generally) for doing intervention experiments and getting activations out of models. This library allows us to do experiments that would be quite error-prone and painful with other tools.
Who should apply?
We're most excited about applicants comfortable working with (basic) Python, any of PyTorch/TensorFlow/Numpy, and linear algebra. Quickly generating hypotheses about model mechanisms and testing them requires some competence in these domains.
If you don’t understand the transformer architecture, we’ll require that you go through preparatory materials, which explain the architecture and walk you through building one yourself.
We’re excited about applicants with a range of backgrounds; prior experience in interpretability research is not required. The primary activity will be designing, running, and analyzing results from experiments which you hope will shed light on how a model accomplishes some task, so we’re excited about applicants with experience doing empirical science in any field (e.g. economics, biology, physics). The core skill we’re looking for here, among people with the requisite coding/math background, is something like rigorous curiosity: a drive to thoroughly explore all the ways the model might be performing some behavior and narrow them down through careful experiments.
What is doing this sort of research like?
Mechanistic interpretability is an unusual empirical scientific setting in that controlled experimentation is relatively easy, but there’s relatively little knowledge about the kinds of structures found in neural nets.
Regarding the ease of experimentation:
- It’s easy to do complicated intervention experiments. If you’re curious about whether the network, or some internal component of it, would have produced a radically different output had some feature of the input been different, you can re-run it exactly and just change that feature. There’s little to no hassle involved in controlling for all the other features of the run.
- You can extract almost any metric from the internal state of the model at any time.
- You can quickly run lots of inputs through the model all at once if you want to characterize average behaviors.
- See Chris Olah’s note on neuroscience versus interpretability for more on this point.
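The kind of intervention experiment described in the points above can be sketched on a tiny made-up network. This is a minimal NumPy illustration, assuming a random two-layer network and a hypothetical `forward` function with a `patch` argument; it is not Redwood’s library. We run a clean input and a corrupted input that differs in one feature, then overwrite individual hidden activations in the corrupted run with their clean values to see how much each one matters.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # input -> hidden weights of a made-up network
w2 = rng.normal(size=4)        # hidden -> scalar output weights

def forward(x, patch=None):
    """Run the toy network; `patch` maps hidden-unit index -> value to overwrite."""
    hidden = np.maximum(W1 @ x, 0.0)          # ReLU hidden layer
    for idx, value in (patch or {}).items():
        hidden[idx] = value                   # intervene on one activation
    return float(w2 @ hidden), hidden

clean_x = np.array([1.0, 0.5, -0.3])
corrupt_x = clean_x.copy()
corrupt_x[0] = -1.0                           # change exactly one input feature

clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch one hidden unit at a time from the clean run into the corrupted run.
effects = {}
for i in range(len(clean_hidden)):
    patched_out, _ = forward(corrupt_x, patch={i: clean_hidden[i]})
    effects[i] = patched_out - corrupt_out

# Patching every hidden unit at once should recover the clean output.
fully_patched_out, _ = forward(corrupt_x, patch=dict(enumerate(clean_hidden)))

print(effects)
print(clean_out, corrupt_out, fully_patched_out)
```

Because everything except the patched activation is held fixed, each entry of `effects` isolates one unit’s contribution to the change in output. In this toy network the output is linear in the hidden layer, so the individual effects add up to the full clean-versus-corrupted difference; in a real transformer, downstream nonlinearities mean they generally will not.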
Regarding the openness of the field:
- Large and important questions, like “How much similarity is there in the internals of different models?” and “To what degree are model behaviors implemented as modular, human-comprehensible algorithms?” are still mostly unexplored.
- We’re early on in the process of discovering commonalities across models like induction heads - we don’t currently know how often simple, universal algorithms show up across models, versus each model being idiosyncratic.
REMIX participants will pursue interpretability research akin to the investigations Redwood has done recently into induction heads, indirect object identification (IOI) in small language models, and balanced parenthesis classification, all of which will be released publicly soon. You can read more about behavior selection criteria here.
The main activities will be:
- doing exploratory analyses to generate hypotheses about how a language model (probably GPT-2-small) performs some behavior
- evaluating your hypotheses with our causal scrubbing methodology
- iterating to make hypotheses more fine-grained and more accurate
The mechanisms for behaviors we’ll be studying are often surprisingly complex, so careful experimentation is needed to accurately characterize them. For example, the Redwood researchers investigating the IOI behavior found that removing the influence of the circuit they identified as primarily responsible had surprisingly little effect on the model’s ability to do IOI. Instead, other heads in the model substantially changed their behavior to compensate for the excision. As the researchers write, “Both the reason and the mechanism of this compensation effect are still unclear. We think that this could be an interesting phenomenon to investigate in future work.”
Here’s how a Redwood researcher describes this type of research:
A lot of the time it feels like you're cycling between: "this looks kind of weird and interesting, not sure what's up with this" and then "I have some vague idea about what maybe this part is doing, I should come up with a test to see if I understand it correctly" and then once you have a test "oh cool, I was kind of right but also kind of wrong, why was I wrong" and then the cycle repeats.
Often it's pretty easy to have a hunch about what some part of your model is doing, but finding a way to appropriately test that hunch is hard and often your hunch might be partially correct but incomplete so your test may rule it out prematurely if you're not careful/specific enough.
It feels like you're in a lab, with your model on the dissection table, and you're trying to pick apart what's going on with different pieces and using different tools to do so - this feels really cool to me, kind of like trying to figure out what's going on with this alien species and how it can do the things it does.
It's really fun to try and construct a persuasive argument for your results: "I think this is what's happening because I ran X, Y, Z experiments that show A, B, C, plus I was able to easily generate adversarial examples based on these hypotheses" - I often feel like there's some sort of imaginary adversary (sometimes not imaginary!) that I have to convince of my results and this makes it extremely important that I make claims that I can actually back up and that I appropriately caveat others.
This research also involves a reasonable amount of linear algebra and probability theory. Researchers will be able to choose how deep they want to delve into some of the trickier math we’ve used for our interpretability research (for example, it turns out that one technique we’ve used is closely related to Wick products and Feynman diagrams).
The program will start out with a week of training using our library for computational graph rewrites and investigating model behaviors using our methodology. This week will have a similar structure to MLAB (our machine learning bootcamp), with pair programming and a prepared curriculum. We’re proud to say that past iterations of MLAB have been highly reviewed – the participants in the second iteration gave an average score of 9.2/10 to the question “How likely are you to recommend future MLAB programs to a friend/colleague?”.
An approximate schedule for week one:
- Monday: Use our software to manipulate neural net computational graphs, e.g. get the output of different attention heads or virtual attention heads, or do various forms of intervention experiments.
- Tuesday: Using this software, replicate results about induction heads in a small language model.
- Wednesday: Use causal scrubbing to investigate induction heads in the same model.
- Thursday: Replicate some of our causal scrubbing results on a small model trained to classify whether sequences of parentheses are balanced.
- Friday: Replicate some of the “indirect object identification” results.
In future weeks, you’ll split your time between investigating behaviors in these models, communicating your findings to the other researchers, and reading/learning from/critiquing other researchers’ findings.
- During this program, we’ll try to be agile and respond to bottlenecks on the production of high-quality research as they appear. For example, we might have some Redwood staff or contractors work full time to maintain and improve our tools as we go. (We might also have REMIX researchers work on these tools if they’re interested and if we think it won’t be too chaotic.)
- We’ll probably try to build organizational infrastructure and processes according to demand, e.g. some kind of wiki for cross-referencing observations about various parts of models a la the OpenAI Microscope.
- We can imagine having training days later in the program (perhaps as optional events on the weekends).
- We might investigate algorithmic behaviors in small transformers not trained on language modeling.
- We will probably mostly focus on a single model, but we may have a few people looking at somewhat larger models.
What if I can’t make these dates?
We encourage you to submit an application even if you can’t make the dates; we have some flexibility, and might make exceptions for exceptional applicants. We’re planning to have some participants start as soon as possible to test drive our materials, practice in our research methodology, and generally help us structure this research program so it goes well.
Am I eligible if I’m not sure I want to do interpretability research long-term?
What’s the application process?
You fill out the form, complete some TripleByte tests that assess your programming abilities, then do an interview with us.
Can you sponsor visas?
Given this program is a research sprint rather than a purely educational program, and given the fact that we plan to offer stipends for participants, we can’t guarantee sponsorship of the right-to-work visas required for international participants to be in person. If you are international but studying at a US university, we are optimistic about getting a CPT for you to be able to participate.
However, we still encourage international candidates to apply. We’ll try to evaluate on a case-by-case basis, and for exceptional candidates there may be alternatives depending on your circumstances, like trying to sponsor a visa to have you join later or participating remotely for some period.
Is this research promising enough to justify running this program?
I would love to say that this project is paid for by the expected direct value of its research output. My inside view is that the expected direct value does in fact pay for the dollar cost of this project, and probably even the time cost of the organizers. However, there are strong reasons for skepticism–this is a pretty weird thing to do, and it’s sort of weird to be able to make progress on things by having a large group of people work together. So the decision to run this program is to some extent determined by more boring, capacity-building considerations, like training people and getting experience with large projects.
How useful is this kind of interpretability research for understanding models that might pose an existential risk?
This research might end up not being very useful. Here’s Buck’s description of some reasons why this might be the case:
My main concern is that the language model interpretability research we mentioned above was done on model behaviors which were specifically selected because we thought interpreting these behaviors would be easy. (I’ve recently been calling this kind of research “streetlight interpretability”, as in the classic fallacy where you only look for things in the place that’s easiest to look in.) These model behaviors are chosen to be unrepresentatively easy to interpret.
In particular, it’s not at all obvious how to use any existing interpretability techniques (or even how to articulate the goal of interpretability) in situations where the algorithm the model is using is poorly described by simple, human-understandable heuristics. I suspect that tasks like IOI or acronym generation, where the model implements a simple algorithm, are the exception rather than the rule, and models achieve good performance on their training task mostly by using huge piles of correlations and heuristics. Our preliminary attempts to characterize how the model distinguishes between outputting “is” and “was” indicate that it relies on a huge number of small effects; my guess is that this is more representative of what language models are mostly doing than e.g. the IOI work.
So my guess is that this research direction (where we try to explain model behaviors in terms of human-understandable concepts) is limited, and is more like diving into a part of the problem that I strongly suspect to be solvable, rather than tackling the biggest areas of uncertainty or developing the techniques that might greatly expand the space of model behavior that we can understand. We are also pursuing various research directions that might make a big difference here; I think that research on these improved strategies is quite valuable (plausibly the best research direction), but I think that streetlight interpretability still looks pretty good.
Another concern you might have is that it’s useful to have a few examples of this kind of streetlight interpretability, but there are steeply diminishing marginal returns from doing more work of this type. For what it’s worth, I have so far continued to find it useful/interesting to see more examples of research in this style, but it’s pretty believable that this will slow down after ten more projects of this form or something.
Overall, we think that mechanistic interpretability is one of the most promising research directions for helping prevent AI takeover. Our hope is that mature interpretability techniques will let us distinguish between two ML systems that each behave equally helpfully during training – even having exactly the same input/output behavior on the entire training dataset – but where one does so because it is deceiving us and the other does so “for the right reasons.”
Our experience has been that explaining model behaviors supports both empirical interpretability work – guiding how we engineer interpretability tools and providing practical knowhow – and theoretical interpretability work – for example, leading to the development of the causal scrubbing algorithm. We expect many of the practical lessons that we might learn would generalize to more advanced systems, and we expect that addressing the theoretical questions that we encounter along the way would lead to important conceptual progress.
Currently, almost no interesting behaviors of ML models have been explained – even for models that are tiny compared with frontier systems. We have been working to change this, and we’d like you to help.
Apply here by Sunday, November 13th to be a researcher in the program, and apply sooner if you want to start ASAP. Sooner applications are also more likely to receive sooner responses. Email firstname.lastname@example.org with questions.
The problem of distinguishing between models which behave identically on the training distribution is core to the ELK problem.