I have written down a long list of alignment ideas that I’d be interested in working on. The ideas roughly boil down to “To make progress on alignment, we need to understand Deep Learning models and the process by which they arrive at their final parameters in much more detail than we currently do”.
- I’m not the first person to think of most of these ideas, and the list builds on a lot of other people’s work. You should think of it more as a collection of existing resources and ideas than as a new agenda. I also didn’t come up with the term “Science of Deep Learning”; I have already heard it used by multiple people within the alignment community, and many researchers are already working on parts of this agenda.
- I obviously don’t own this agenda. The questions in this list are sufficient to keep hundreds of researchers busy for a while, so feel free to hop on. If you’re interested in collaborating, just reach out.
- Some of this research has the potential to increase capabilities more than alignment and the results should, in some cases, be kept private and only discussed with a small group of trusted peers. However, I think that most of the projects have a “defender’s advantage”, i.e. they increase alignment more than capabilities.
- Whenever possible, Science of DL projects should have a direct benefit for alignment but I think our current understanding of DL is so bad that just increasing our general understanding seems like a good start.
Here is the link to the full version (comments are on, please don’t abuse it): https://docs.google.com/document/d/1AyuTphQ31rLHDtpZoEwEPb4fWbZna1H3hGx_YUACxk4/edit?usp=sharing
The rest of this post is an overview copied from the doc. Feedback is welcome.
Overview - Science of Deep Learning
By Science of DL, I roughly mean getting better at “understanding DL systems and how they learn concepts”. The main goal is to propose a precise and testable hypothesis related to a phenomenon in DL and then test and refine it until we are highly confident in its truth or falsehood. This hypothesis could be about how NNs behave on the neuron level, the circuit level, during training, during fine-tuning, etc. This research will almost surely include mechanistic interpretability at some point, but it is not limited to it.
The refined statement after investigation can be, but doesn’t have to be, mathematical, as long as it is unambiguous and testable, i.e. two people could agree on an experiment that would provide evidence for or against the statement and then run it.
How this could look in practice
The details would obviously differ from project to project, but at a high level I imagine it looking roughly like this:
- Pick an interesting concept found in deep learning, e.g. grokking, the lottery ticket hypothesis, adversarial examples or the emergence of 2-digit addition in LLMs. Optimally, the concept is safety-related but especially in the beginning, just increasing general understanding seems more important than the exact choice of topic.
- Try to understand high-level features of the phenomenon, e.g. under which conditions this concept arises, which NNs show it and which ones don’t, in which parts of the networks it arises, when during training it arises, etc. This likely includes retraining the network under different conditions (different hyperparameters, number of parameters, etc.) and monitoring meaningful high-level statistics related to the concept, e.g. monitoring the validation loss to see when the model starts to grok.
- Zoom in: try to understand what happens on a low level, e.g. use mechanistic interpretability tools to investigate the neurons/activations or use other techniques to form a hypothesis of how this specific part of the network works. In the optimal case, we would be able to describe the behavior very precisely, e.g. “this circuit models addition” or “this is when the model starts to grok”.
- Form a testable hypothesis: Once we feel like we understand what’s going on for this particular part of the network, we form a testable hypothesis. This could be a hypothesis about how networks learn something, e.g. “when X happens during training, we will see more of this phenomenon”, or about concepts that relate to the part of the network, e.g. “this is a circuit related to animals, let’s see if it lights up when you talk about a ‘cuckoo clock’ (which is not an animal; just a specific kind of clock)”.
- Test and refine the hypothesis: Test the hypothesis and attack it from multiple angles. Try to find corner cases and actively play an adversary role, e.g. by suggesting alternative explanations for the phenomenon. Use the process to refine our understanding and propose a new testable hypothesis. Repeat until we’re sufficiently confident.
- Generalize: Make a speculative claim that might or might not be implied by the more narrow hypothesis we are relatively confident in. Theorycraft why this speculative claim relates to the narrow hypothesis. Once we have a plausible theory for why the speculative claim could relate to the previous hypothesis, we translate it into a new testable hypothesis (the theorycrafting is necessary so that we are forced to build mechanistic mental models of how DL works).
- Iterate: Repeat the above steps as long as it makes sense, e.g. for new concepts and settings.
- Get fast and automate: I think the goal should be to “understand” important components of a neural network very fast, e.g. so that it takes one human (with automated tools) less than 24 hours. For this to be successful, we need to train the skill of understanding a neural network and we need automatic (narrow/harmless) tools to assist us.
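As a small illustration of the “monitor high-level statistics” step above, here is a minimal sketch of how one might quantify when a model starts to grok from logged loss curves. Everything in it is an assumption for illustration: the helper names `first_step_below` and `generalization_delay` are hypothetical, the loss threshold is arbitrary, the curves are made up, and “delay between fitting the training set and generalizing” is only a crude proxy for grokking.

```python
from typing import Optional, Sequence


def first_step_below(losses: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first logged loss below `threshold`, or None."""
    for step, loss in enumerate(losses):
        if loss < threshold:
            return step
    return None


def generalization_delay(
    train_losses: Sequence[float],
    val_losses: Sequence[float],
    threshold: float,
) -> Optional[int]:
    """Logged steps between fitting the training set and generalizing.

    Returns None if either curve never drops below `threshold`.
    """
    fit_step = first_step_below(train_losses, threshold)
    gen_step = first_step_below(val_losses, threshold)
    if fit_step is None or gen_step is None:
        return None
    return gen_step - fit_step


# Made-up curves for illustration only (not measurements from a real run).
train = [2.0, 0.9, 0.05, 0.01, 0.01, 0.01, 0.01, 0.01]
val = [2.1, 1.8, 1.7, 1.7, 1.6, 1.2, 0.08, 0.02]

delay = generalization_delay(train, val, threshold=0.1)
print(delay)  # large positive delays would be candidate grokking runs
```

Computing a number like this for every run in a hyperparameter sweep would give one concrete statistic to form hypotheses about, e.g. “the delay grows when X changes”.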
The goal of this research is to understand DL systems as well as possible. This means there is not one clear goal by which we could judge our performance. However, I think there are some ways to test whether we actually increased our understanding of different parts of the system. These include:
- Can we predict with high accuracy whether a network will learn or not learn a specific property before we train it, e.g. from size, hyperparameters, data and compute alone?
- Can we predict with high accuracy whether a phenomenon has already been learned before we test it on a benchmark, e.g. can we tell whether it learned 2-digit addition after looking at a set of circuits?
- Can we predict with high accuracy which part of the network is “responsible” for a task with a very limited budget of forward passes, e.g. if we’re only allowed ten different prompts? Bonus: can we explain the respective behavior?
- Can we attribute a specific input-output behavior and explain it on a mechanistic level within a certain budget of time, e.g. can we find the “car circuit” in an LLM within 60 minutes using whatever method we want?
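To make “predict with high accuracy” concrete, one option is to write down predictions before running the experiment and score them afterwards. The sketch below is one possible way to do that, not a prescribed protocol: predictions are recorded as probabilities, and they are scored with plain accuracy and the Brier score (a standard scoring rule that I am choosing here as an example). The function names and the example numbers are hypothetical placeholders, not real results.

```python
from typing import Sequence


def _check(predicted_probs: Sequence[float], outcomes: Sequence[bool]) -> None:
    """Validate that both sequences are non-empty and of equal length."""
    if len(predicted_probs) != len(outcomes) or len(outcomes) == 0:
        raise ValueError("need two non-empty sequences of equal length")


def accuracy(predicted_probs: Sequence[float], outcomes: Sequence[bool]) -> float:
    """Fraction of cases where the thresholded prediction (p >= 0.5) was right."""
    _check(predicted_probs, outcomes)
    hits = sum((p >= 0.5) == o for p, o in zip(predicted_probs, outcomes))
    return hits / len(outcomes)


def brier_score(predicted_probs: Sequence[float], outcomes: Sequence[bool]) -> float:
    """Mean squared error between probability and outcome (0.0 is perfect)."""
    _check(predicted_probs, outcomes)
    errors = ((p - float(o)) ** 2 for p, o in zip(predicted_probs, outcomes))
    return sum(errors) / len(outcomes)


# Placeholder numbers for illustration only: the probability we assigned,
# before training, to "this network will learn the property", and whether
# it actually did.
predictions = [0.9, 0.2, 0.7, 0.4]
learned = [True, False, False, True]

print(accuracy(predictions, learned))
print(brier_score(predictions, learned))
```

Tracking such scores over time would give a rough signal of whether our understanding is actually improving, as opposed to merely feeling like it is.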
Understanding more parts of the DL pipeline can always also lead to an increase in dangerous capabilities. Essentially, whenever we understand technology better, we can use that knowledge to make it more efficient or powerful. However, I still think this research is worth doing, for the following reasons:
- I think that understanding the system better tends to favor alignment over capabilities, i.e. understanding seems more necessary for alignment than for capabilities (see e.g. my post on the defender’s advantage of interpretability),
- since people deploy ML systems in the real world at large scale, I don’t really see a way around “understanding the system better” and
- one can always choose not to publish the results or only share them among a trusted group of researchers. I expect at least some of this work to be private by default.
I’m currently excited about this agenda and will likely explore some of the project ideas in the long doc in the near future. However, I’m still uncertain how promising I find the agenda compared to other approaches to alignment. Feedback and considerations are welcome.