Theories of impact for Science of Deep Learning

Marius Hobbhahn

I’d like to thank Jérémy Scheurer and Ethan Perez for discussions about the post.

I recently published a post on the Science of Deep Learning. There have been many people before me who had similar ideas, and I don’t claim to have invented this agenda; I’m merely excited about it. In this post, I want to explain why I think Science of DL is an important research direction in the current alignment landscape.

By Science of DL, I roughly mean “understanding DL systems and how they learn concepts”. A large component of such an agenda is interpretability (mechanistic and other) but it also tries to get a better understanding of how and under which conditions NNs learn specific concepts. For example, it would include questions like “how, when and why does a network show grokking?”, “Can we predict some capabilities of models from high-level knowledge about the training process before we look at the model?” or “Can we build a robust theory of what fine-tuning does to a NN on a mechanistic level?”. In general, the idea is to build a detailed understanding of how core aspects of DL work. Given that this is a safety agenda, the specific research questions would obviously be prioritized by how relevant they are to alignment.

Note that there are many definitions of Science of DL that include research questions that I don’t think are important for alignment. For example, trying to understand what differentiates Adam from SGD might be considered part of Science of DL but I think this question only very vaguely relates to alignment and is not neglected.

Theories of impact

Interpretability

(Mechanistic) interpretability is a core part of Science of DL. There already exist resources on theories of impact for interpretability, most prominently “a longlist of theories of impact for interpretability” and “another list of theories of impact for interpretability”. Therefore, I will only briefly present a small selection of what Neel Nanda and Beth Barnes have already written:

A force-multiplier for alignment research: Understanding the model and why it might be misaligned is a core component of finding and constructing good alignment solutions.
Auditing: we can check if the model is safe before deployment. Auditing can be done in general or for specific properties like deception or at different points during training.
Intervening on training, either by modifying the training for the model to be more interpretable or to steer it away from undesired concepts.

There are many more reasons why interpretability is helpful for alignment and I recommend checking out the linked posts. I personally am very excited about interpretability and many of the recent developments.

Understanding of DL beyond interpretability

Interpretability allows you to inspect a model after it has been trained and thus adds a layer of protection before it is deployed. However, there are many things we can say about a model even before it is trained. Architecture, task, dataset, hyperparameters, training compute, and so on all have inductive biases that make it more or less likely that a model develops a specific capability. We do have some understanding of what these inductive biases look like, e.g. convolutions model local correlations, RNNs model a relationship over time and RL induces an actor/agent.

However, there are still a lot of behaviors of which we have a very vague understanding at best, e.g. we can draw some lines in log-space and call them scaling laws but we have no good understanding of why these lines exist, whether they stop at some point or how a log-loss translates to a specific capability. Some of these questions are only distantly related to alignment, e.g. which inductive bias different initialization methods have, but some seem pretty directly relevant to alignment. For example,

We don’t really know what happens during RLHF or adversarial training. Are the internal circuits of the LLM adapted to reflect our desired goal or do they just learn weird non-general proxies?
Can we reliably predict a specific capability before training a model, ranging from simpler capabilities like 2-digit addition to very hard capabilities like deception?
Can we understand how general circuits form, e.g. how they might change from random behavior to simple heuristics to sophisticated heuristics during training?

Optimally, we could make specific predictions about DL systems with high accuracy. However, I think our current understanding of DL is so bad that we should aim for getting decent educated guesses first.
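
To make the “lines in log-space” remark above a bit more concrete, here is a minimal sketch of what such a fit looks like. Everything in it is an assumption for illustration: the data is synthetic (generated from a made-up power law, not measured from any real model), and the names `fit_power_law`, `param_counts` and `losses` are hypothetical. It only shows the kind of educated guess a fitted line allows, not why the line exists or when it stops holding.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + b * log(x); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the straight line in log-log space.
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    var = sum((u - mean_x) ** 2 for u in log_x)
    b = cov / var
    # The intercept in log space gives the multiplicative constant.
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic, made-up data: loss = 10 * N^(-0.3). Not real measurements.
param_counts = [1e6, 1e7, 1e8, 1e9]
losses = [10.0 * n ** -0.3 for n in param_counts]

a, b = fit_power_law(param_counts, losses)

# Extrapolate one order of magnitude beyond the largest "observed" model.
predicted = a * 1e10 ** b
print(f"fitted exponent: {b:.3f}, extrapolated loss at 1e10 params: {predicted:.4f}")
```

Note that the extrapolation only yields a loss value. It says nothing about which capability shows up at that loss, which is exactly the gap described above.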

Framing - Emergent phenomena and Deception

There are many different ways in which misaligned AIs could be harmful. Two specific ways are through emergent phenomena and deception. We have seen that if you increase the scale of AI systems new capabilities emerge. These capabilities are not just improved versions of previous capabilities but are qualitatively new. For example, small LLMs can’t do two-digit addition but larger LLMs can. One emergent phenomenon that seems especially relevant for safety and alignment is deception, i.e. that an AI system seems like it is aligned on the surface but pursues other goals in the background when given the chance.

The more relevant emergent phenomena and deception are for alignment, the more important it is to have a robust understanding of the AI system without having to rely on extensive testing. For example, if we expect a system with X parameters not to be deceptive but have strong reasons to believe that an AI system with 100X parameters is deceptive, then testing the larger system on input-output behavior is not very useful. The big model will pass all tests but will still act misaligned when deployed.

Interpretability (mechanistic and other) is one way to address this failure mode, e.g. by investigating the internals of the model in detail. But a much more detailed understanding of DL might also enable us to make more precise probabilistic predictions about capabilities, in addition to looking at internal beliefs.

To some extent, we already have some intuitive models for capability prediction. For example, we think it is extremely unlikely that a ResNet trained on ImageNet will ever be dangerous. We also expect current LLMs not to be agentic because we think that the task of next-word prediction doesn’t straightforwardly induce agents (but we aren’t sure). On the other hand, we expect RL training to lead to agents. With hybrid models like decision transformers, it is already unclear which inductive biases apply. Having a more detailed understanding of the inductive biases of different architectures, datasets, training regimes, hyperparameters, etc. might enable us to make much more precise predictions about when and how a relevant capability arises. This is analogous to how most other sciences have quantitative theoretical models that enable predictions, e.g. models of the trajectory of planets in astrophysics, models of prey and predators in biology and models of atoms in chemistry. Similarly, Science of DL could also yield different predictive models from the micro to the macro scale.

A nuclear physicist doesn’t have to test what happens when they explode a nuclear bomb in the desert; they can get a relatively good prediction from their theoretical model. I think it would be helpful if the AI community had a better theoretical understanding of DL before we get to very dangerous capabilities. And the more disruptive you think emergent capabilities are, the more important it is to build a good understanding before deployment (even in controlled environments) because testing on input-output behavior might be insufficient.

Comment: by “theoretical understanding”, I don’t necessarily mean a mathematical equation; verbal models are fine as long as they have high predictive power. I think most of the Science of DL research is empirical and very pre-paradigmatic such that most research will be pattern matching and clarifying phenomena, e.g. “grokking happens when X, Y or Z are given; and during grokking this and this happens on a mechanistic level”.

Framing - A bet on NNs & SGD

I think Science of DL can be seen as a bet on NNs & SGD (or some of its many variants). More specifically, it is a bet on

the relevant AI risks coming from DL systems trained with SGD and
the relevant insights for alignment coming from understanding DL's functionality and training process better.

In general, different approaches to alignment make assumptions of different specificity, and I roughly think of them in the following hierarchy (I’m omitting many approaches/ideas and just want to pump the intuition):

Agent foundations and other MIRI-type work: assumes that the AI system has goals and acts rationally.
Selection theorems and other John Wentworth-type work: assumes that the AI system is selected according to some fundamental selection principles.
ELK and heuristic arguments: assumes that the AI system has some internal model that we can interpret if the model is trained correctly and/or we have the right tools.
Science of DL: assumes that the AI system is a NN or a combination of NNs and has been trained with SGD or one of its many variants.
RLHF: assumes that the system is DL-based and can be fine-tuned to show good behavior by training it with human feedback.
Transformer mechanistic interpretability: Assumes that the relevant models are transformers and that a mechanistic understanding of the model is the key to aligning them.

Obviously, I don’t claim that everyone who works on mechanistic interpretability for transformers thinks that this is the only necessary component of alignment. I’m just trying to put Science of DL into the context of the assumptions other approaches make.

I personally believe that the Deep Learning paradigm is here to stay for at least another decade and is powerful enough to produce highly capable and potentially dangerous AI systems. Therefore, I currently think it is reasonable to focus on DL instead of more abstract approaches. On the other hand, I don’t have any strong belief that e.g. RLHF, adversarial training or similar training techniques are sufficient for alignment. Therefore, I prefer to not add more restrictive assumptions which puts me at the abstraction level of Science of DL. Of course, in practice, one has to start somewhere, and thus, people working on Science of DL will work on problems with more restrictive assumptions most of the time, e.g. mechanistic interpretability. I think of this more as a perspective/framing.

Framing - the capability-to-understanding gap is getting bigger

I think the current norms in ML strongly favor research that focuses on capabilities over research that focuses on understanding. For example, a lot of papers are along the lines of “we used a bigger model or slightly modified a previous model and improved the state-of-the-art performance”. This claim is then backed up by large tables of benchmark performances in which the authors’ model outperforms its competitors. Sometimes the modification is motivated by a theoretical idea but the paper rarely provides evidence for or against this hypothesis. I think there are some papers that try to test their hypothesis but the majority of papers don’t. I can also totally understand why this is the case. Reviewers will ask for large tables with comparisons on benchmarks, so researchers focus on providing them. From the perspective of the researchers, it’s totally rational to follow these incentives. Furthermore, from a short-sighted perspective, these norms could make a lot of sense, e.g. if you’re looking to deploy a model as soon as possible, you care more about the fact that it works well than why it works.

However, all of this combined leads to an ever larger capability-to-understanding gap, i.e. the AI community is getting better at building high-performing models faster than it is getting better at understanding them. This trend would be fine if we had reasons to believe that models never become dangerous. However, given that there are a ton of reasons to assume something could go wrong with AI, I think this growing gap is dangerous.

I expect that there will be large-scale failures of AI systems in deployment that lead to large humanitarian and economic harm. I think when we look into how these failures arose, we will find the reasons eventually. However, it would be so much better if we could catch many of these failures before they lead to harm and were able to prevent this unfortunate awakening.

Instrumental reasons

Besides the object-level reasons to work on Science of DL, I think there are multiple instrumental reasons why it could be impactful. Most of these are based on the fact that there already is a large research field outside of the alignment community that works on related questions. However, I want to emphasize that an agenda should primarily be chosen on object-level reasons and instrumental reasons should be secondary--they are nice to have but not necessary for a good alignment agenda.

First, I think one could fill multiple full-time alignment roles whose description is broadly “closely follow the Deep Learning research landscape and then think about and explain its implications for safety”. Many phenomena in DL such as double descent, the literature on SGD, some explainability and interpretability work, adversarial training, the lottery ticket hypothesis, symmetries in parameter space and many more possibly have implications for alignment. However, these implications are rarely discussed in the papers themselves. Thus, reading and summarizing the paper, detailing some implications for safety and proposing follow-up research for alignment researchers seems like an impactful job for the alignment community. In some sense, you only have to add “the last 10%” because many technical questions have already been answered in the original paper. Therefore, such a role might be very efficient in terms of time invested per information gain. I think Quintin Pope’s paper roundups are a great example of how something like this could look in practice.

Secondly, I think researching Science of DL has a lot of synergies with academia and industry research. More academics have started to care about AI safety and alignment and most academics care about better understanding the system they are working with. Thus, the benefits of Science of DL include:

Collaborations: It allows for more collaborations within established academic disciplines and labs. This effectively increases the pool of people working on the topic and might give you an additional lever, e.g. if you motivate your research with safety, more academics might care about it for this reason.
Publications: While it might be getting easier, I think that publishing research on AI safety in academic venues is still harder than capabilities research, especially if the safety work is non-technical. Science of DL, however, is already a part of multiple academic disciplines and the justification of “we want to understand the system better to prevent future harm” seems uncontroversial. Thus, publishing it could be easier than other alignment work. Publications come with many instrumental benefits such as increased publicity, more follow-up work, increased reputation and so on.
Academic paths: I’m not sure about this but I think it’s plausible that the AI safety community is currently giving too little weight to academic paths as a means to impact. Professors educate most of the next generation of researchers, academics review most papers at large conferences, academics are consulted for important public decisions, and much more. Much of the more abstract work on AI safety is not seen as credible science in the current academic environment and thus important and prestigious professorships might not be achievable for researchers of some agendas. Science of DL, on the other hand, is already established enough that good work will be recognized as such by the academic community even if you have a focus on safety.
A gateway drug for AI safety: In my experience, academics care less about abstract arguments around AI safety or motivations through X-risks and much more about technical questions and understanding things in detail. Therefore, the more the Science of DL agenda is intertwined with AI safety, the more the safety aspect will become normalized and widely known. Once this is a common assumption in traditional academia, it is much easier to explain the rest of the arguments and arrive at questions about deceptive alignment, uncontrollability, specification gaming, etc. I think the fact that it is completely normalized and accepted that people train large black-box models without understanding at all what’s going on inside them is very weird and actually unexpected in a scientific community. Thus, pushing against this norm seems already valuable even if it doesn’t lead to more interest in alignment.

Once again, I want to emphasize that these instrumental reasons are nice but they should not be the primary reason for working on an agenda. The most important reason to work on an agenda is that it has a promising path to alignment; everything else is secondary.

Reasons against Science of DL

There are multiple good reasons to be skeptical of Science of DL as an alignment agenda and I definitely don’t think everyone should work on it. These include:

Other agendas might be more promising: There are a lot of alignment agendas out there. Most of them seem like they could solve some important subcomponent of alignment. I really wouldn’t want everyone to work on Science of DL and it’s good to take multiple bets and diversify as a community.
The main problem doesn’t come from NNs: If you think that the core of the alignment problem comes from something that is not really related to NNs or how they are trained, e.g. if you believe the core of the alignment problem is agency or reward specification, Science of DL just seems like the wrong approach.
It’s not neglected enough: There are already people in academia and industry working on many of the questions related to Science of DL. It’s not clear that the AI safety community or EAs should use their time to work on it.
Progress is too slow: Questions around interpretability and generalization properties of NNs have been around for a long time. Most people would have hoped for much faster progress but we still can’t answer many basic questions about why NNs generalize and what concepts they learn.

Final words

I’m not super sure about the agenda but it currently feels promising to me. I feel like it does well on a lot of the points that made me skeptical about some other agendas. For example, I think Science of DL has a defender’s advantage, i.e. working on it seems to favor alignment more than capabilities. Also, the bet on NNs seems plausible to me, i.e. TAI could just be a scaled-up version of today’s systems with no major novel insights or paradigm shifts. Lastly, I think if we want to have a plausible shot at getting alignment right, having an agenda that is accessible to the broader ML community and can absorb a lot of people that are not (yet) interested in alignment definitely helps. Science of DL seems technical and applied enough that many people outside of the AI safety bubble could be interested and contribute.
