There’s more and more interest in using AI to do a lot of useful things. And it makes sense: AI companies didn’t come this far just to come this far. Full automation might be underway, depending on a series of constraints. But what I want to talk about here is how to think about using automation for AI alignment.
A while ago, the following Zvi quote resonated:

“Automated alignment research is all we seem to have the time to do, so everyone is lining up to do the second most foolish possible thing and ask the AI to do their alignment homework, with the only more foolish thing being not to do your homework at all. Dignity levels continue to hit all-time lows.”
The way I read this, automated alignment is essentially equivalent to handing off the most crucial bits of science humanity will ever have to do to highly unreliable intelligent systems and hoping for the best.
I won’t try to assess whether automating alignment is a good idea per se in this post. To the extent that this kind of work is an explicit goal of AI companies and appears in AI safety agendas, I seek to clarify what automating alignment research means, treating this as a metascientific endeavor: theorizing about alignment as a science.
1. Is alignment research special?
Alignment can be understood as a capability that makes AI systems predictable and controllable. In that sense, it’s a prerequisite for deploying any system, not a special add-on or a feature to consider once the system has been deployed and diffused. Importantly, publicly releasing systems in the absence of robust alignment techniques carries a series of risks that scale concerningly with capabilities.
Plausibly, many of the tasks involved in alignment research are typical in software and machine learning engineering in that they require writing and debugging complex code bases, using compute, and securing high-quality training data. The question then is how to make sure that alignment-relevant work progresses proportionately to the rest of AI research that is typically focused on making systems generally capable (also known as differential technological development). It has been argued, for example, by OpenAI, Anthropic, and more recently by Carlsmith, that without the help of AI systems, human developers won’t be able to make the necessary progress in time to release systems that are beneficial for everyone.
What seems different in current deep learning systems, and therefore in current alignment work, is their scale: we now have highly complex code bases, large and costly amounts of compute, and vast training data. Automating parts of a team’s workflow would be instrumental to accelerating AI development and deployment, but would at the same time present higher-stakes challenges.
2. How is AI integrated into alignment research?
It’s not clear at the moment how automation could accelerate or improve alignment work. In particular, there’s a series of questions to think about and answer before being able to evaluate what automating alignment looks like. I group these questions into two clusters.
The first is about capabilities, i.e., what cognitive work AI systems can do. More specifically:
How useful are models for identifying algorithmic improvements?
How useful are these models in helping improve complicated code bases, especially within AI companies?
The second cluster is about testing and measurement, i.e., if models do assist with research, how do we effectively assess the ways in which they do so? Some questions to ask here are:
What are the most reliable methodologies to track the productivity of engineers and scientists?
What does tracking threat vectors from improving AI R&D look like? What are the metrics to do that?
What benchmarks can be designed for automating alignment research?
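As a toy illustration of this second cluster, one candidate metric could be sketched as the fraction of model-proposed research artifacts (say, code patches) that pass a held-out evaluation. The names below (`success_rate`, `passes_heldout_eval`) are hypothetical placeholders, not an established benchmark:

```python
# Sketch of one candidate metric for model-assisted research: the
# fraction of model-proposed artifacts (e.g., code patches) that pass
# a held-out evaluation. All names are illustrative, not a real benchmark.

def success_rate(proposals, passes_heldout_eval):
    """Crude proxy for 'how useful was the model's assistance?':
    the fraction of proposals that pass a held-out check."""
    if not proposals:
        return 0.0
    return sum(1 for p in proposals if passes_heldout_eval(p)) / len(proposals)

# Toy usage: two proposed "patches" scored 0.9 and 0.4; only scores
# above 0.5 pass the held-out evaluation.
print(success_rate([0.9, 0.4], lambda p: p > 0.5))  # 0.5
```

Any real metric would of course need to control for how hard the proposals were to produce and how much human correction they required; the point is only that the measurement question can be made concrete.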
With these questions in mind, it’s natural to wonder about alignment as a science (Anthropic notably has an Alignment Science blog). The blockers that appear in human scientific thinking are likely to come up when it’s time for AI agents to take on the role of the scientist.
3. Human science obstacles for non-human minds
Empirical work helps determine what the exemplary problems are, what counts as a solution, and what methods are legitimate. This is Kuhn’s sense of a paradigm. Recently, the discussion about whether parts of alignment research are more or less paradigmatic has received a lot more attention. For example, Lee Sharkey talks about how mechanistic interpretability is no longer pre-paradigmatic, and I have previously written about Artificial Intelligence Safety as an Emerging Paradigm.
There is one straightforward way to test whether the field is now paradigmatic: use currently available research data to train AI systems as alignment researchers. If the results are bad, then either the models are just not capable enough yet (though it is worth comparing their performance on other scientific tasks), or the data used are of low quality. I suspect that at the moment, it’s a combination of the two.
Before Kuhn, logical empiricists pointed to that same problem of epistemic bootstrapping, without, of course, having automated AI agents to do any science for them. Neurath's boat is the classic metaphor here: we are building (or rebuilding) the ship while sailing on it, and so we’re never able to put it in dry dock. This captures the AI alignment endeavor well — most of the time, it’s also pretty stormy.
[Image: “Ship at Sea”, Anton Otto Fischer, 1937.]
There’s another bottleneck in accelerating research progress in the idealized sense: there might not be a “logic of scientific discovery”, a formalized recipe for how to do science. This is a long debate in the history and philosophy of science. What matters for the purposes of thinking about science agents is that such formalizations might not exist, at least not in a way that can be given as input from one intelligent system (human) to another (artificial). Scientific processes can be messy and difficult to articulate cleanly. I especially expect that scientific intuitions and research taste are hard to compress, though I also expect that AI systems could develop fast and frugal heuristics and learn from their training data the way human scientists do. This goes more into a comparative cognitive science for human and artificial minds, but there’s more to consider at the theory of science level.
4. What counts as safe AI?
There’s a useful distinction in the philosophy of science and engineering: verification vs validation. Verification asks: “did we build the system right?”, whereas validation asks “did we build the right system?”
For safe AI, we most likely need both: the internal consistency of satisfying verification criteria and the external confirmation that those criteria capture what we actually need an aligned system to be. Verification can look like checking whether an RLHF-trained model satisfies a given reward specification within a test setting, i.e., whether the system behaves in an aligned way in that setting. Validation may require a more robust definition of what it means for the model to be aligned, outside of a suite of narrow experiments.
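To make the distinction concrete, here is a minimal sketch in which a policy passes verification (it satisfies the written spec on a fixed test suite) while failing validation (an external judge, looking across environments, rejects the objective it actually pursues). Every name here (`reward_spec`, `judge`, the toy policy) is a hypothetical stand-in, not a real alignment pipeline:

```python
# Toy sketch of the verification/validation split. All functions are
# illustrative stand-ins, not a real alignment pipeline.

def verify(policy, reward_spec, test_cases):
    """Verification: 'did we build the system right?'
    Internal check: does the policy satisfy the written spec
    on a fixed test suite?"""
    return all(reward_spec(x, policy(x)) for x in test_cases)

def validate(policy, environments, judge):
    """Validation: 'did we build the right system?'
    External check: across varied environments, does an outside
    judge agree the behavior reflects the objective we wanted?"""
    return all(judge(env, policy(env)) for env in environments)

# The written spec only demands output >= input, so a policy that
# blindly maximizes the proxy verifies; but the judge, who wanted
# outputs to stay close to the input, rejects it.
policy = lambda x: x * 10            # maximizes the proxy signal
reward_spec = lambda x, y: y >= x    # the spec as written
judge = lambda env, y: y <= env + 1  # what we actually wanted

print(verify(policy, reward_spec, [1, 2, 3]))  # True: spec satisfied
print(validate(policy, [1, 2, 3], judge))      # False: wrong objective
```

The gap between the two calls is exactly the gap the paragraph above describes: passing every narrow experiment is compatible with having the wrong objective.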
AI systems (perhaps even not completely aligned ones or more tool-like) could assist with verification work. But it seems unlikely that a partially aligned model could be useful for doing validation work. Validation typically requires zooming out, observing how a system operates in different environments, or how it generalizes, and being able to tell whether it has the right objective. In AI safety, the problem of validation is often described in terms of outer alignment, a question about whether the model does what the developer thinks they want the model to do.
Takeaways
Before evaluating automated alignment, we need to answer two sets of questions: 1) can models do the necessary cognitive work, and 2) how do we measure how good their work is?
It might be that key aspects of scientific processes cannot be cleanly formalized. AI systems could still pick up the mechanisms that make human scientists successful, since deep learning is good at learning patterns that resist explicit formalization.
AIs can probably help with verification work that requires internal consistency, but we should be very cautious about validation work that requires zooming out and assessing the system and its operations as a whole.
Crossposted from my Substack.
Epistemic status: in philosophy of science mode.