Lots of agent foundations research is motivated by the idea that alignment techniques found by empirical trial-and-error will fail to generalize to future systems. While such threat models are plausible, agent foundations researchers have largely failed to make progress on addressing them because they tend to make very few assumptions about the gears-level properties of future AI systems. In contrast, the emerging field of "Science of Deep Learning" (SDL) assumes that the systems in question are neural networks, and aims to empirically uncover general principles about how they work. Assuming such principles exist and that neural nets will lead to AGI, insights from SDL will generalize to future systems even if specific alignment techniques found by trial-and-error do not. As a result, SDL can address the threat models that motivate agent foundations research while being much more tractable because it makes more assumptions about future AI systems, though it cannot help in the construction of an ideal utility function for future AI systems to maximize.

Thanks to Alexandra Bates for discussion and feedback.


Generally speaking, the "agent foundations" line of research in alignment is motivated by a particular class of threat models: that alignment techniques generated through empirical trial and error, such as RLHF or Constitutional AI, will suddenly break down as AI capabilities advance, leading to catastrophe. For example, Nate Soares has argued that AI systems will undergo a "sharp left turn", at which point the "shallow" motivational heuristics that caused them to be aligned at lower capability levels will cease to function. Eliezer Yudkowsky seems to endorse a similar threat model in AGI Ruin, arguing that rapid capabilities increases will cause models to go out of distribution and become misaligned.

I'm much more optimistic about finding robust alignment methods through trial and error than these researchers are, but I'm not here to debate that. While I think sharp-left-turn-style failure modes are relatively unlikely, they do not seem totally implausible. Moreover, because doom is much more likely in worlds where alignment techniques generated through trial and error fail to generalize, it makes sense to focus on those worlds when doing alignment research. What I want to argue instead is that the emerging field of the "science of deep learning" (SDL) is more promising for dealing with these threat models than agent foundations.

A Central Tradeoff in Alignment Research

AI alignment research agendas face a central tradeoff: robustness vs tractability. The goal of alignment research is to develop techniques which will allow us to control the AIs of the future. However, we can only observe presently existing AIs, leaving us uncertain about the properties of these future systems. Yet in order to make progress on alignment, we need to make certain assumptions about these properties. Thus, the question arises: what assumptions should we make? 

The more assumptions you make about a system, the more you can prove or empirically show about it, making research about it more tractable. However, the assumptions might turn out to be incorrect, invalidating your model and any engineering conclusions you drew from it. Thus, making more assumptions about future AI systems makes the resulting alignment research simultaneously more tractable and less robust. Framed in this way, I take the worry of agent foundations researchers to be that many current empirical research efforts rely on too many assumptions which will fail to generalize to future systems.

Agent Foundations Makes Too Few Assumptions

Maybe trial and error empirical work does make too many assumptions, but agent foundations still makes too few. Lots of historical agent foundations work assumed nothing about future AI systems other than that they will be expected utility maximizers (EUM) for some objective (e.g. Corrigibility). More recent work in agent foundations like Embedded Agency takes into account that true EUM is impossible in the real world. But my impression is that current lines of agent foundations research still make few if any assumptions about the gears-level details of future systems.

This makes sense given the pre-deep learning context in which agent foundations emerged. Unfortunately, such research seems likely to be intractable. Prima facie, it seems very challenging to come up with alignment methods which don't depend heavily on the details of the systems you are trying to align. Empirically, while lots of agent foundations style work has helped to frame the alignment problem, it has largely failed to make progress on solving it, and the main organization associated with agent foundations work (MIRI) has partially abandoned it due to this fact. Yet it anecdotally seems to me that many in the AI safety community (especially those in the rationality community) who are skeptical of trial and error empirical work are still interested in agent foundations.

Science of Deep Learning as the Happy Medium

If "future systems will be like the LLMs of 2023" assumes too much, and "future systems will be agents of some kind" assumes too little, what set of assumptions lies at the happy medium in between? While future AIs may be very different from current AIs at a high level of abstraction, they seem increasingly likely to share the low-level property of being neural networks, especially in the short-timelines worlds where current research is the most valuable. Indeed, given that there is lots of disagreement about whether and when future AI systems will be well-described by formalisms like EUM, the assumption that they will be neural nets is arguably more robust than the assumption that they will conform to any higher-level abstractions.

More importantly, adding the assumption that future systems will be neural nets seems likely to make the resulting alignment research much more tractable. At a high level of abstraction, the specific fact that future AIs will be the product of gradient descent demands that alignment solutions have the shape of training setups and creates the potential for inner misalignment, which is a key sub-problem. While this is a necessary formal property of neural nets, it also seems likely that there are at least some important empirical regularities about the way that neural nets learn and how their capabilities scale with size that will generalize to future systems. Given that we could observe these regularities in present systems if they exist, investigating them is plausibly tractable.

Fortunately, there is a line of research that studies neural nets as such: the science of deep learning (SDL). SDL is a relatively new "field", and doesn't have canonical boundaries. Moreover, SDL research is largely conducted by academics who are often only adjacent to or sometimes completely outside of the alignment space. However, it has already led to some interesting findings. In "The Quantization Model of Neural Scaling," researchers at Max Tegmark's lab found evidence that over the course of training, neural nets tend to learn a sequence of discrete computations known as "quanta" which grant them certain capabilities. Moreover, quanta seem to be distributed in their effects on loss according to a power law, and networks tend to learn the most useful quanta first; these facts together explain power-law loss curves. Similarly, in "The Lottery Ticket Hypothesis," the authors find that if you train a CNN, prune it, reset the remaining parameters to their initial values, and train them in isolation, the resulting network performs about as well as the original while being only 20% of its size. Other examples of SDL work that seems to have significant implications for alignment include Mingard et al.’s “Is SGD a Bayesian Sampler?”, which helps to explain the “simplicity prior” of neural nets, and more recent research into singular learning theory, which Nina Rimsky and Dmitry Vaintrob have in part used to uncover subcircuits within networks which implement different possible generalizations of the training data.
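To make the lottery ticket procedure described above a bit more concrete, here is a minimal sketch of just the prune-and-reset step in plain Python. It is a simplified illustration rather than the paper's actual code: the function names are hypothetical, the weights are a flat list instead of a real CNN, pruning is done in one shot by magnitude, and the training loops before and after this step are omitted.

```python
def magnitude_mask(trained_weights, keep_fraction):
    """Return a 0/1 mask that keeps the largest-magnitude trained weights."""
    num_keep = max(1, round(keep_fraction * len(trained_weights)))
    # Rank weight indices from largest to smallest absolute trained value.
    ranked = sorted(
        range(len(trained_weights)),
        key=lambda i: abs(trained_weights[i]),
        reverse=True,
    )
    kept = set(ranked[:num_keep])
    return [1 if i in kept else 0 for i in range(len(trained_weights))]


def rewind_to_init(initial_weights, mask):
    """Reset surviving weights to their initial values; pruned weights become 0."""
    return [w0 * m for w0, m in zip(initial_weights, mask)]


# Hypothetical stand-ins for a network's weights before and after training.
initial = [0.3, -0.1, 0.7, 0.2, -0.5]
trained = [0.1, -3.0, 0.5, 2.0, -0.2]

mask = magnitude_mask(trained, keep_fraction=0.4)
ticket = rewind_to_init(initial, mask)
```

In this sketch, `ticket` is the sparse subnetwork that would then be trained in isolation with the pruned positions held at zero. The mask is chosen using the trained weights, while the surviving values come from the initialization.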

Without more evidence, these findings are just plausible hypotheses, and other researchers have contested the conclusions they draw. However, they seem like the sorts of hypotheses that could probably be confirmed or disconfirmed with more research. 

Genuine confirmation of such hypotheses would have substantial implications for alignment. For example, if the quantization model of neural scaling is true, this would suggest that discontinuous capabilities increases late in training due to models learning some kind of “simple core of intelligence” are unlikely: if small (and therefore quickly learnable) sets of quanta existed for such capabilities, they would be learned earlier in training. Similarly, if singular learning theory could help us to understand the different ways in which networks could generalize from the training data and to control which generalizations they ultimately employ, this would obviously constitute major progress.
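The connection between power-law-distributed quanta and power-law loss curves can be illustrated with a toy calculation. The sketch below is an assumption-laden illustration, not the paper's code: it assumes quanta are used with Zipf-like frequencies proportional to k^-(alpha+1), that they are learned in order of frequency, and that each sample incurs a fixed loss unless its quantum has been learned. All function names and constants are hypothetical.

```python
import math


def zipf_frequencies(num_quanta, alpha):
    """Normalized use frequencies p_k proportional to k^-(alpha + 1)."""
    weights = [k ** -(alpha + 1.0) for k in range(1, num_quanta + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_loss(freqs, num_learned, loss_unlearned=1.0, loss_learned=0.0):
    """Mean loss once the num_learned most frequent quanta are learned."""
    learned_mass = sum(freqs[:num_learned])
    return loss_learned * learned_mass + loss_unlearned * (1.0 - learned_mass)


def loglog_slope(freqs, n1, n2):
    """Slope of log(loss) against log(number of quanta learned)."""
    l1 = expected_loss(freqs, n1)
    l2 = expected_loss(freqs, n2)
    return (math.log(l2) - math.log(l1)) / (math.log(n2) - math.log(n1))


alpha = 0.5  # assumed exponent, chosen only for illustration
freqs = zipf_frequencies(200_000, alpha)
slope = loglog_slope(freqs, 10, 100)
```

Under these assumptions, `slope` comes out close to `-alpha`, meaning the loss falls off roughly as a power law in the number of quanta learned. The toy also shows why the most broadly useful quanta matter most early on: the first few quanta account for most of the loss reduction.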

The fact that SDL has already made some progress suggests that it is more tractable than agent foundations. However, SDL is also particularly well suited to tackling inner alignment problems which are otherwise notoriously tricky to address. While agent foundations researchers have known about inner alignment problems for a while, they have generally focused more on trying to construct a utility function that we would want an AI system to optimize if we already had a robust way of getting said utility function into the AI. It seems obvious that agent foundations would not be very helpful for tackling inner alignment, because inner alignment is a problem that only comes up when you are building an AI in a particular way rather than being a problem with agents in general. In contrast, arguably the central sub-topic in SDL is neural network inductive biases, as seen in several of the papers I linked. If the pessimists are right and the vastness of mindspace makes it such that the prior odds that gradient descent lands on an agent that is optimizing a perfectly aligned goal are slim, then if we want to create such an agent, we must have a very good understanding of inductive biases in order to craft the architecture, dataset, and reward function that will actually result in the model generalizing in the way that we want. 

Altogether, SDL is more tractable than agent foundations while addressing the same central threat model in addition to inner alignment problems. As a result, I would encourage people who are worried about inner alignment failures, or that techniques discovered through trial and error will fail to generalize to future systems, to work on SDL instead of agent foundations.


2 comments

My guess at the agent foundations answer would be that, sure, human value needs to survive the transition from "the smartest thing alive is a human" to "the smartest thing alive is a transformer trained by gradient descent", but it also needs to survive the transition from "the smartest thing alive is a gradient descent transformer" to "the smartest thing alive is coded by an aligned transformer, but internally is meta-shrudlu / a hand-coded tree searcher / a transformer trained by the update rule used in human brains / a simulation of 60,000 copies of Obama wired together / etc", and stopping progress in order to prevent that second transition is just as hard as stopping now to prevent the first transition.

Upvoted with the caveat that I think no one has clearly articulated the "sharp left turn" failure mode, including Nate Soares, and so anything attempting to address it might not succeed.