Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I have written down a long list of alignment ideas that I’d be interested in working on. The ideas roughly boil down to “To make progress on alignment, we need to understand Deep Learning models and the process by which they arrive at their final parameters in much more detail than we currently do”. 

  • I’m not the first person to think of most of these ideas, and the list builds on a lot of other people’s work. You should think of it more as a collection of existing resources and ideas than as a new agenda. I also didn’t coin the term “Science of Deep Learning”: I have heard it used by multiple people within the alignment community, and many researchers are already working on parts of this agenda.
  • I obviously don’t own this agenda. The questions in this list are sufficient to keep hundreds of researchers busy for a while, so feel free to hop on. If you’re interested in collaborating, just reach out.
  • Some of this research has the potential to increase capabilities more than alignment and the results should, in some cases, be kept private and only discussed with a small group of trusted peers. However, I think that most of the projects have a “defender’s advantage”, i.e. they increase alignment more than capabilities. 
  • Whenever possible, Science of DL projects should have a direct benefit for alignment but I think our current understanding of DL is so bad that just increasing our general understanding seems like a good start. 

Here is the link to the full version (comments are on, please don’t abuse it):

The rest of this post is an overview copied from the doc. Feedback is welcome. 

Overview - Science of Deep Learning

By Science of DL, I roughly mean “understanding DL systems and how they learn concepts” better than we currently do. The main goal is to propose a precise and testable hypothesis about a phenomenon in DL and then test and refine it until we are highly confident that it is true or false. This hypothesis could be about how NNs behave on the neuron level, the circuit level, during training, during fine-tuning, etc. This research will almost surely include mechanistic interpretability at some point, but it is not limited to it.
The refined statement after investigation can, but doesn’t have to, be mathematical, as long as it is unambiguous and testable, i.e. two people could agree on an experiment that would provide evidence for or against the statement and then run it.
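
To illustrate what such an agreed-upon experiment could look like, here is a minimal sketch. It assumes the hypothesis has the form “this unit responds more strongly to inputs from category A than to inputs from category B” and that per-input activations have already been recorded. The function name, the parameters, and the choice of a one-sided permutation test are illustrative assumptions, not something the agenda prescribes.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_test(
    group_a: Sequence[float],
    group_b: Sequence[float],
    n_permutations: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided permutation test for mean(group_a) > mean(group_b).

    Returns an estimated p-value: the fraction of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so two people get the same result
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

Both parties would fix the inputs, the unit, and the significance level before running it. A small p-value is evidence for the hypothesis, while a large one suggests the unit does not distinguish the two categories on these inputs.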

How this could look in practice 

The details would obviously differ from project to project, but at a high level I imagine it looking roughly like this:

  1. Pick an interesting concept found in deep learning, e.g. grokking, the lottery ticket hypothesis, adversarial examples or the emergence of 2-digit addition in LLMs. Optimally, the concept is safety-related but especially in the beginning, just increasing general understanding seems more important than the exact choice of topic.
  2. Try to understand high-level features of the phenomenon, e.g. under which conditions it arises, which NNs show it and which ones don’t, in which parts of the network it arises, when during training it arises, etc. This likely includes retraining the network under different conditions (different hyperparameters, parameter counts, etc.) and monitoring meaningful high-level statistics related to the concept, e.g. monitoring the validation loss to see when the model starts to grok.
  3. Zoom in: try to understand what happens on a low level, e.g. use mechanistic interpretability tools to investigate the neurons/activations or use other techniques to form a hypothesis of how this specific part of the network works. In the optimal case, we would be able to describe the behavior very precisely, e.g. “this circuit models addition” or “this is when the model starts to grok”. 
  4. Form a testable hypothesis: Once we feel like we understand what’s going on for this particular part of the network, we form a testable hypothesis. This could be a hypothesis about how networks learn something, e.g. “when X happens during training, we will see more of this phenomenon”, or about concepts that relate to the part of the network, e.g. “this is a circuit related to animals, let’s see if it lights up when you talk about a ‘cuckoo clock’ (which is not an animal; just a specific kind of clock)”. 
  5. Test and refine the hypothesis: Test the hypothesis and attack it from multiple angles. Try to find corner cases and actively play an adversary role, e.g. by suggesting alternative explanations for the phenomenon. Use the process to refine our understanding and propose a new testable hypothesis. Repeat until we’re sufficiently confident.
  6. Generalize: Make a speculative claim that might or might not be implied by the more narrow hypothesis we are relatively confident in. Theorycraft why this speculative claim relates to the narrow hypothesis. Once we have a plausible theory for why the speculative claim could relate to the previous hypothesis, we translate it into a new testable hypothesis (the theorycrafting is necessary so that we are forced to build mechanistic mental models of how DL works). 
  7. Iterate: Repeat the above steps as long as it makes sense, e.g. for new concepts and settings.
  8. Get fast and automate: I think the goal should be to “understand” important components of a neural network very fast, e.g. so that it takes one human (with automated tools) less than 24 hours. For this to be successful, we need to train the skill of understanding a neural network and we need automatic (narrow/harmless) tools to assist us.
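
The monitoring mentioned in step 2 can be made concrete with a small helper. The following is a minimal sketch, assuming we have logged training and validation losses at regular evaluation intervals and that “starting to grok” is operationalized as the validation loss staying below a chosen threshold well after the training loss did. The function names, the threshold, and the patience value are hypothetical choices for illustration only.

```python
from typing import Optional, Sequence


def first_step_below(
    losses: Sequence[float], threshold: float, patience: int = 3
) -> Optional[int]:
    """Return the first index at which `losses` stays below `threshold`
    for `patience` consecutive evaluations, or None if it never does."""
    run = 0
    for i, loss in enumerate(losses):
        run = run + 1 if loss < threshold else 0
        if run == patience:
            return i - patience + 1
    return None


def grokking_delay(
    train_losses: Sequence[float],
    val_losses: Sequence[float],
    threshold: float = 0.1,
    patience: int = 3,
) -> Optional[int]:
    """Number of evaluations between fitting the training set and generalizing.

    Returns None if either curve never stays below the threshold.
    """
    train_step = first_step_below(train_losses, threshold, patience)
    val_step = first_step_below(val_losses, threshold, patience)
    if train_step is None or val_step is None:
        return None
    return val_step - train_step


# Made-up loss curves: the training loss drops early, the validation loss late.
train = [2.0, 1.0, 0.05, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01, 0.01]
val = [2.0, 1.9, 1.8, 1.8, 1.7, 1.7, 0.08, 0.05, 0.04, 0.03]
print(grokking_delay(train, val))  # 4
```

Comparing this delay across retraining runs with different hyperparameters or model sizes is one simple way to turn “under which conditions does it arise” into a number that can be plotted and predicted.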


The goal of this research is to understand DL systems as well as possible. This means there is not one clear goal by which we could judge our performance. However, I think there are some ways to test whether we actually increased our understanding of different parts of the system. These include:

  1. Can we predict with high accuracy whether a network will learn or not learn a specific property before we train it, e.g. from size, hyperparameters, data and compute alone? 
  2. Can we predict with high accuracy whether a phenomenon has already been learned before we test it on a benchmark, e.g. can we tell whether it learned 2-digit addition after looking at a set of circuits? 
  3. Can we predict with high accuracy which part of the network is “responsible” for a task with a very limited budget of forward passes, e.g. if we’re only allowed ten different prompts? Bonus: can we explain the respective behavior? 
  4. Can we attribute a specific input-output behavior and explain it on a mechanistic level within a certain budget of time, e.g. can we find the “car circuit” in an LLM within 60 minutes using whatever method we want?
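
Tests 1 and 2 above ask for predictions “with high accuracy”. One way to score them is sketched below, assuming each prediction is written down as a probability before the network is trained or benchmarked and the outcome is recorded afterwards as true or false. The Brier score and the 0.5 cutoff are illustrative choices on my part, and the function names are hypothetical.

```python
from typing import Sequence


def brier_score(predicted_probs: Sequence[float], outcomes: Sequence[bool]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    0.0 is a perfect score; always predicting 0.5 scores 0.25.
    """
    if len(predicted_probs) != len(outcomes) or not outcomes:
        raise ValueError("need equally long, non-empty sequences")
    squared_errors = [(p - float(o)) ** 2 for p, o in zip(predicted_probs, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def accuracy(
    predicted_probs: Sequence[float], outcomes: Sequence[bool], cutoff: float = 0.5
) -> float:
    """Fraction of cases where thresholding the probability matches the outcome."""
    if len(predicted_probs) != len(outcomes) or not outcomes:
        raise ValueError("need equally long, non-empty sequences")
    hits = sum((p >= cutoff) == o for p, o in zip(predicted_probs, outcomes))
    return hits / len(outcomes)
```

Recording probabilities instead of yes/no guesses has the advantage that overconfident wrong predictions are penalized more heavily than cautious ones, which makes it harder to fool ourselves about how much we understand.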


Understanding more parts of the DL pipeline can also lead to an increase in dangerous capabilities. Essentially, whenever we understand a technology better, we can use that knowledge to make it more efficient or powerful. That said, I see three mitigating considerations:

  • I think that understanding the system better tends to favor alignment over capabilities, i.e. understanding seems more necessary for alignment than for capabilities (see e.g. my post on the defender’s advantage of interpretability).
  • Since people deploy ML systems in the real world at large scale, I don’t really see a way around “understanding the system better”.
  • One can always choose not to publish the results or only share them among a trusted group of researchers. I expect at least some of this work to be private by default.

Final words

I’m currently excited about this agenda and will likely explore some of the project ideas in the long doc in the near future. However, I’m still uncertain how promising I find the agenda compared to other approaches to alignment. Feedback and considerations are welcome. 


Comments

Very hot take [I would like to have my mind changed]. I think that studying the Science of Deep Learning is one of the least impactful areas that people interested in alignment could work on. To be concrete, off the top of my head, I think it is less impactful than: foundational problems (MIRI/Wentworth), prosaic theoretical work (ELK), studying DL (e.g. deep RL) systems for alignment failures (Langosco et al.), or mechanistic interpretability (Olah stuff). Some of these could involve the (very general) feedback loop mentioned here, but it wouldn't be the greatest description of any of these directions.

Figuring out why machine learning “works” is an important problem for several subfields of academic ML (Nakkiran et al., any paper that mentions “bias-variance tradeoff”, statistical learning theory literature, neural tangent kernel literature, lottery ticket hypothesis, …). Science of Deep Learning is an umbrella term for all this work, and more (loss landscape stuff is also under the umbrella, but has a less ambitious goal than figuring out how ML works). Why should it be a fruitful research direction when none of the mentioned research directions are settled, and all remain open and unresolved? Taking an outside view on the question it asks, Science of Deep Learning work is not a tractable research direction.

Additionally, everyone would like to understand how ML works, both those motivated by alignment and those motivated by capabilities. This problem is not neglected, and it is very unclear how any insight into why SGD works wouldn’t be directly a capabilities contribution. This doesn’t mean the work is definitely net-negative from an alignment perspective, but a case has to be made here to explain why the alignment gains are greater than the capabilities gains. This case is harder to make than the same case for interpretability.

“This problem is not neglected, and it is very unclear how any insight into why SGD works wouldn’t be directly a capabilities contribution.”

I strongly disagree! AFAICT SGD works so well for capabilities that interpretability/actually understanding models/etc. is highly neglected and there's low-hanging fruit all over the place.

To me, the label "Science of DL" is far more broad than interpretability. However, I was claiming that the general goal of Science of DL is not neglected (see my middle paragraph).

Got it, I was mostly responding to the third paragraph (insight into why SGD works, which I think is mostly an interpretability question) and should have made that clearer.

I think the situation I'm considering in the quoted part is something like this: research is done on SGD training dynamics, and researcher X finds a new way of looking at model component Y that shows only certain parts of it are important for performance. So they remove the unimportant parts, scale the model more, and the model is better. This to me meets the definition of "why SGD works" (the model uses the Y components to achieve low loss).

I think interpretability that finds ways models represent information (especially across models) is valuable, but this feels different from "why SGD works".

Got it, I see. I think of the two as really intertwined (e.g. a big part of my agenda at the moment is studying how biases/path-dependence in SGD affect interpretability/polysemanticity).

I'm not certain about it either but I'm less skeptical. However, I agree with you that some of this could be capabilities work and has to be treated with caution. 

Still, I think that to answer some of the important questions around Deep Learning, e.g. which concepts models learn and under which conditions, we just need to get a better understanding of the entire pipeline. I think it's plausible that this is very hard and progress is much slower than one would hope.