I’d like to thank Jérémy Scheurer and Ethan Perez for discussions about the post.
I recently published a post on the Science of Deep Learning. Many people before me have had similar ideas and I don’t claim to have invented this agenda; I’m merely excited about it. In this post, I want to explain why I think Science of DL is an important research direction in the current alignment landscape.
By Science of DL, I roughly mean “understanding DL systems and how they learn concepts”. A large component of such an agenda is interpretability (mechanistic and other) but it also tries to get a better understanding of how and under which conditions NNs learn specific concepts. For example, it would include questions like “how, when and why does a network show grokking?”, “Can we predict some capabilities of models from high-level knowledge about the training process before we look at the model?” or “Can we build a robust theory of what fine-tuning does to a NN on a mechanistic level?”. In general, the idea is to build a detailed understanding of how core aspects of DL work. Given that this is a safety agenda, the specific research questions would obviously be prioritized by how relevant they are to alignment.
Note that there are many definitions of Science of DL that include research questions that I don’t think are important for alignment. For example, trying to understand what differentiates Adam from SGD might be considered part of Science of DL but I think this question only very vaguely relates to alignment and is not neglected.
(Mechanistic) interpretability is a core part of Science of DL. There already exist resources on theories of impact for interpretability, most prominently “a longlist of theories of impact for interpretability” and “another list of theories of impact for interpretability”. Therefore, I will only briefly present a small selection of what Neel Nanda and Beth Barnes have already written:
There are many more reasons why interpretability is helpful for alignment and I recommend checking out the linked posts. I personally am very excited about interpretability and many of the recent developments.
Interpretability allows you to inspect a model after it has been trained and thus adds a layer of protection before it is deployed. However, there are many things we can say about a model even before it is trained. Architecture, task, dataset, hyperparameters, training compute, and so on all have inductive biases that make it more or less likely that a model develops a specific capability. We do have some understanding of what these inductive biases look like, e.g. convolutions model local correlations, RNNs model a relationship over time and RL induces an actor/agent.
However, there are still a lot of behaviors of which we have a very vague understanding at best, e.g. we can draw some lines in log-space and call them scaling laws but we have no good understanding of why these lines exist, whether they stop at some point or how a log-loss translates to a specific capability. Some of these questions are only distantly related to alignment, e.g. which inductive bias different initialization methods have, but some seem pretty directly relevant to alignment. For example,
Optimally, we could make specific predictions about DL systems with high accuracy. However, I think our current understanding of DL is so bad that we should aim for getting decent educated guesses first.
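To make the “lines in log-space” remark above concrete, here is a minimal sketch of how such a line is typically fitted. It assumes a pure power law of the form loss ≈ a · compute^(−b); the function name `fit_power_law` and all numbers are hypothetical and synthetic, not real measurements. Note that the fit only describes the trend: it says nothing about why the line exists, whether it continues, or which capability a given loss corresponds to.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y ≈ a * x**(-b), done as a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Ordinary least squares for log(y) = intercept + slope * log(x).
    cov = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    var = sum((lx - mean_x) ** 2 for lx in log_x)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free data from a made-up power law (an assumption for illustration).
compute = [1e3, 1e4, 1e5, 1e6]
loss = [5.0 * c ** -0.1 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolating two orders of magnitude beyond the data is exactly the step
# for which we currently lack a theoretical justification.
predicted_loss = a * 1e8 ** -b
```

On real training runs the points would be noisy and the power-law form itself would be an assumption to test, which is part of what a Science of DL agenda would try to explain.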
There are many different ways in which misaligned AIs could be harmful. Two specific ways are through emergent phenomena and deception. We have seen that if you increase the scale of AI systems new capabilities emerge. These capabilities are not just improved versions of previous capabilities but are qualitatively new. For example, small LLMs can’t do two-digit addition but larger LLMs can. One emergent phenomenon that seems especially relevant for safety and alignment is deception, i.e. that an AI system seems like it is aligned on the surface but pursues other goals in the background when given the chance.
The more relevant emergent phenomena and deception are for alignment, the more important it is to have a robust understanding of the AI system without having to rely on extensive testing. For example, if we expect a system with X parameters not to be deceptive but have strong reasons to believe that a system with 100X parameters is deceptive, then testing the larger system on input-output behavior is not very useful. The big model would pass all tests but still act misaligned when deployed.
Interpretability (mechanistic and other) is one way to address this failure mode, e.g. by investigating the internals of the model in detail. However, a much more detailed understanding of DL might also enable us to make more precise probabilistic predictions about capabilities, in addition to looking at internal beliefs. To some extent, we already have intuitive models for capability prediction. For example, we think it is extremely unlikely that a ResNet trained on ImageNet will ever be dangerous. We also expect current LLMs not to be agentic because we think that the task of next-word prediction doesn’t straightforwardly induce agents (but we aren’t sure). On the other hand, we expect RL training to lead to agents. With hybrid models like decision transformers, it is already unclear which inductive biases apply. Having a more detailed understanding of the inductive biases of different architectures, datasets, training regimes, hyperparameters, etc. might enable us to make much more precise predictions about when and how a relevant capability arises. This is analogous to how most other sciences have quantitative theoretical models that enable predictions, e.g. models of the trajectories of planets in astrophysics, models of prey and predators in biology and models of atoms in chemistry. Similarly, Science of DL could yield different predictive models from the micro to the macro scale.
A nuclear physicist doesn’t have to test what happens when they explode a nuclear bomb in the desert; they can get a relatively good prediction from their theoretical model. I think it would be helpful if the AI community had a better theoretical understanding of DL before we get to very dangerous capabilities. And the more disruptive you think emergent capabilities are, the more important it is to build a good understanding before deployment (even in controlled environments) because testing on input-output behavior might be insufficient.
Comment: by “theoretical understanding”, I don’t necessarily mean a mathematical equation; verbal models are fine as long as they have high predictive power. I think most of the Science of DL research is empirical and very pre-paradigmatic such that most research will be pattern matching and clarifying phenomena, e.g. “grokking happens when X, Y or Z are given; and during grokking this and this happens on a mechanistic level”.
I think Science of DL can be seen as a bet on NNs & SGD (or some of its many variants). More specifically, it is a bet on
In general, different approaches to alignment make assumptions of different specificity and I roughly think of them in the following hierarchy (I’m omitting many approaches/ideas, just want to pump the intuition)
Obviously, I don’t claim that everyone who works on mechanistic interpretability for transformers thinks that this is the only necessary component of alignment. I’m just trying to put Science of DL into the context of the assumptions other approaches make.
I personally believe that the Deep Learning paradigm is here to stay for at least another decade and is powerful enough to produce highly capable and potentially dangerous AI systems. Therefore, I currently think it is reasonable to focus on DL instead of more abstract approaches. On the other hand, I don’t have any strong belief that e.g. RLHF, adversarial training or similar training techniques are sufficient for alignment. Therefore, I prefer to not add more restrictive assumptions which puts me at the abstraction level of Science of DL. Of course, in practice, one has to start somewhere, and thus, people working on Science of DL will work on problems with more restrictive assumptions most of the time, e.g. mechanistic interpretability. I think of this more as a perspective/framing.
I think the current norms in ML strongly favor research that focuses on capabilities over research that focuses on understanding. For example, a lot of papers are along the lines of “we used a bigger model or slightly modified a previous model and improved the state-of-the-art performance”. This claim is then backed up by large tables of benchmark performances in which the authors’ model outperforms its competitors. Sometimes the modification is motivated by a theoretical idea but the paper rarely provides evidence for or against this hypothesis. Some papers do try to test their hypothesis but the majority don’t. I can also totally understand why this is the case. Reviewers will ask for large tables with comparisons on benchmarks, so researchers focus on providing them. From the perspective of the researchers, it’s totally rational to follow these incentives. Furthermore, from a short-sighted perspective, these norms could make a lot of sense, e.g. if you’re looking to deploy a model as soon as possible, you care more about the fact that it works well than about why it works.
However, all of this combined leads to an ever larger capability-to-understanding gap, i.e. the AI community improves at building high-performing models faster than it improves at understanding them. This trend would be fine if we had reasons to believe that models never become dangerous. However, given that there are a ton of reasons to assume something could go wrong with AI, I think this growing gap is dangerous.
I expect that there will be large-scale failures of AI systems in deployment that lead to large humanitarian and economic harm. I think when we look into how these failures arose, we will find the reasons eventually. However, it would be so much better if we could catch many of these failures before they lead to harm and were able to prevent this unfortunate awakening.
Besides the object-level reasons to work on Science of DL, I think there are multiple instrumental reasons why it could be impactful. Most of these are based on the fact that there already is a large research field outside of the alignment community that works on related questions. However, I want to emphasize that an agenda should primarily be chosen for object-level reasons and instrumental reasons should be secondary; they are nice to have but not necessary for a good alignment agenda.
First, I think one could fill multiple full-time alignment roles whose description is broadly “closely follow the Deep Learning research landscape and then think about and explain its implications for safety”. Many phenomena in DL such as double descent, the literature on SGD, some explainability and interpretability work, adversarial training, the lottery ticket hypothesis, symmetries in parameter space and many more possibly have implications for alignment. However, these implications are rarely discussed in the papers themselves. Thus, reading and summarizing the paper, detailing some implications for safety and proposing follow-up research for alignment researchers seems like an impactful job for the alignment community. In some sense, you only have to add “the last 10%” because many technical questions have already been answered in the original paper. Therefore, such a role might be very efficient in terms of information gained per unit of time invested. I think Quintin Pope’s paper roundups are a great example of how something like this could look in practice.
Secondly, I think researching Science of DL has a lot of synergies with academia and industry research. More academics have started to care about AI safety and alignment and most academics care about better understanding the system they are working with. Thus, the benefits of Science of DL include:
Once again, I want to emphasize that these instrumental reasons are nice but they should not be the primary reason for working on an agenda. The most important reason to work on an agenda is that it has a promising path to alignment; everything else is secondary.
There are multiple good reasons to be skeptical of Science of DL as an alignment agenda and I definitely don’t think everyone should work on it. These include:
I’m not super sure about the agenda but it currently feels promising to me. Several of the considerations that made me skeptical about other agendas seem to line up well here. For example, I think Science of DL has a defender’s advantage, i.e. working on it seems to favor alignment more than capabilities. Also, the bet on NNs seems plausible to me, i.e. TAI could just be a scaled-up version of today’s systems with no major novel insights or paradigm shifts. Lastly, I think if we want to have a plausible shot at getting alignment right, having an agenda that is accessible to the broader ML community and can absorb a lot of people who are not (yet) interested in alignment definitely helps. Science of DL seems technical and applied enough that many people outside of the AI safety bubble could be interested and contribute.