We skip over [..] where we move from the human-ish range to strong superintelligence[1]. [..] the period where we can harness potentially vast quantities of AI labour to help us with the alignment of the next generation of models
- Will MacAskill in his critique of IABIED
I want to respond to Will MacAskill's claim in his IABIED review that we may be able to use AI to solve alignment.[1] Will believes that recent developments in AI have made it more likely that takeoff will be relatively slow - "Sudden, sharp, large leaps in intelligence now look unlikely". Because of this, he and many others believe there will likely be a period of time, at some point in the future, when we can essentially direct AIs to align more powerful AIs. But it appears to me that a “slow takeoff” is not sufficient at all, and that a lot of things have to be true for this to work. Not only do we need a slow takeoff, we also need AIs that are great at alignment research during this period. For their research to be useful, we need verifiable metrics and well-specified objectives, worked out ahead of time, that we can give to the AIs. Even if all of that works out, the alignment problem still has to be solvable by this sort of approach. And this only helps us if no one else builds unaligned, dangerous AI in the meantime or uses AI for capabilities research. I think it is unlikely that all of this holds, and the plan itself is likely to have negative consequences.
TLDR: The necessary conditions for superalignment[2] are unlikely to be met, and the plan itself may well do more harm than good.
Fast takeoff is still possible, and nothing about LLMs shows that it is impossible or very unlikely. Will does not provide a full-length argument for why anything about LLMs rules out fast takeoff. The key load-bearing arguments for fast takeoff are simple and unchanged. Once AI becomes capable enough to meaningfully do its own AI research without humans, this will lead to a great speed-up, because computers are fast and we are building a lot of very fast parallel computers. And once AIs start improving capabilities, each improvement feeds back into making them faster and smarter. Empirically, we have evidence from games like Go that superhuman levels can be reached quickly (within days or hours) through RL and methods such as self-play. If fast takeoff happens, there will be no substantial period in which AIs can be directed to align their successors. Furthermore, Will himself describes slow takeoff as taking months to years, which is still very little time.
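To make the "improvements compound" point concrete, here is a toy back-of-the-envelope model (my own illustration, not from the post or from MacAskill; the numbers are arbitrary assumptions): if each AI generation needs a fixed amount of research to build, but every generation multiplies the effective research speed, then almost all the wall-clock time is spent on the first generation and the later ones arrive almost immediately.

```python
# Toy compounding model of automated AI research. All numbers are illustrative
# assumptions, not predictions.
human_years_per_generation = 2.0  # assumed research effort per capability step
speedup = 1.0                     # AI research speed relative to humans today
r = 3.0                           # assumed speed multiplier gained per generation

elapsed_years = 0.0
for generation in range(1, 7):
    elapsed_years += human_years_per_generation / speedup
    speedup *= r
    print(f"gen {generation}: wall-clock years elapsed ~ {elapsed_years:.2f}")
```

With these toy numbers, six generations land within about three years, and the last couple within days to weeks: whatever "period" exists for directing AI labour at alignment is concentrated at the very start and shrinks rapidly.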
In addition to a slow takeoff, strong AI capabilities for alignment research have to appear in a certain sequence, and long before AI is existentially dangerous. Humans have so far failed to solve the alignment problem and still struggle to even understand it, yet superalignment assumes that AIs can be trained to solve it - and ideally before they get very good at speeding up capabilities research or become directly dangerous. I think this is unlikely, because that is not how it works in humans and because capabilities research appears much easier to verify and specify. Many humans are good at capabilities research, which includes work such as optimizing performance, creating good datasets, and setting up high-quality RL environments. These humans have made rapid progress on AI capabilities, while practical progress on eliminating prompt injections, on interpretability, and on theoretical breakthroughs about agency appears to me much more limited. I'd similarly expect AIs to get good at capabilities research before alignment research. We already have many examples of AI being used for capabilities research, likely because it is easier to verify and specify than alignment research: optimizing matrix multiplications, chip design, generating data, and designing RL tasks, to name a few. Therefore, AI will likely accelerate capabilities research long before it can meaningfully help with alignment.
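To illustrate why capabilities objectives are easier to hand to an AI than alignment objectives, here is a minimal sketch (my own toy example, not from the post): a candidate "faster matmul" can be scored by a fully automated check for correctness and speed, so a search or RL loop can optimize it directly, whereas there is no known function we could put in the corresponding slot for "is this model aligned?".

```python
import time
import numpy as np

def reference_matmul(a, b):
    return a @ b

def candidate_matmul(a, b):
    # Stand-in for an AI-proposed "optimized" kernel.
    return np.dot(a, b)

def score_capabilities_candidate(candidate, trials=5, n=256):
    """Verifiable objective: correctness plus measured speedup over the reference."""
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    if not np.allclose(candidate(a, b), reference_matmul(a, b), atol=1e-6):
        return 0.0  # wrong results are worthless, and easy to detect automatically
    t0 = time.perf_counter()
    for _ in range(trials):
        reference_matmul(a, b)
    t_ref = time.perf_counter() - t0
    t0 = time.perf_counter()
    for _ in range(trials):
        candidate(a, b)
    t_new = time.perf_counter() - t0
    return t_ref / t_new  # > 1.0 means a real, measurable improvement

def score_alignment_candidate(model) -> float:
    # There is no known function to put here: we have no metric that, when
    # maximized, certifies "this model will not pursue unintended goals".
    raise NotImplementedError("no verifiable alignment objective is known")

print(f"capabilities score: {score_capabilities_candidate(candidate_matmul):.2f}")
```

The asymmetry is the point: the capabilities score is a number an automated loop can push up, while the alignment "score" is a placeholder we do not know how to fill in.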
There is still no agreed-upon specification of what we would actually have these AI alignment research agents do. Would we figure this all out only once we arrive at this barely specified period? In fairness, some proposals exist for interpretability, and it seems empirically possible to have AIs help us with interpretability work. However, interpretability is a helpful but not sufficient part of alignment. Currently proposed explanation metrics can be gamed and are not sufficient for verification. Without strong verifiability, AIs could easily give us misleading or false interpretability results. Furthermore, improvements in interpretability do not equal an alignment solution: being able to see that an AI is plotting to take over does not mean you can build an AI that isn't trying to take over (Chapter 11, IABIED). It is also not clear that interpretability could even work, or be useful, for something much smarter than humans. Is it even possible to understand or steer the thoughts of something much smarter and faster than you?
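As a concrete illustration of how an explanation metric can be gamed, consider "simulability", one proxy sometimes used in interpretability work: an explanation is scored by how well it predicts the model's outputs. The sketch below is my own toy construction, not a metric proposed in the post; it shows that a degenerate "explanation" that simply re-runs the model scores perfectly while conveying zero insight into why the model behaves as it does.

```python
import numpy as np

rng = np.random.default_rng(0)

def opaque_model(x):
    # Stand-in for the model whose behaviour we want explained.
    return (np.sin(3 * x) + 0.5 * x > 0).astype(int)

def simulability_score(explanation, inputs):
    """Fraction of inputs on which the explanation predicts the model's output."""
    return float(np.mean(explanation(inputs) == opaque_model(inputs)))

def honest_explanation(x):
    # An honest but shallow explanation: a single threshold rule.
    return (x > 0).astype(int)

def gamed_explanation(x):
    # A gamed "explanation": just query the model again. Perfect score, zero insight.
    return opaque_model(x)

xs = rng.uniform(-2.0, 2.0, size=10_000)
print("honest rule score:", simulability_score(honest_explanation, xs))
print("gamed copy score :", simulability_score(gamed_explanation, xs))
```

An AI optimized against this kind of score could hand back convincing-looking but uninformative "interpretations", which is exactly the verification gap described above.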
The alignment problem is like an excavation site where we don't yet know what lies beneath. It could be all sand - countless grains we can steadily move with shovels and buckets, each scoop a solved sub-problem. Or we might discover that after clearing some surface sand we hit solid bedrock - fundamental barriers requiring genius breakthroughs far beyond human capability. I think alignment is more likely to be sand over bedrock than pure sand: we may get lots of sand-shoveling (solving small aspects of interpretability) but fail to address the deeper questions about agency and decision theory. Even restricting attention to interpretability of LLMs, it is not clear the problem is solvable in principle. It may be fundamentally impossible for an LLM to fully interpret another LLM of similar capability - like asking a human to perfectly understand another human's thoughts. And while we do have some progress on interpretability and evaluations, critical questions such as guaranteeing corrigibility seem totally unsolved, with no known way to even approach them. We are very far from knowing how we could tell that we had solved the problem. Superalignment assumes that alignment just takes a lot of hard work - that the problem is like shoveling sand, a massive engineering project. But if it's bedrock underneath, no amount of human-level AI labor will help.
If that period really existed, the same AI systems could very likely also be used to accelerate capabilities and rush straight ahead to unaligned superintelligence. While Anthropic or OpenAI might be careful here, there are many other companies that will push ahead as soon as possible. The vast majority of AI labs are extremely irresponsible and have no stated interest in dedicating any resources to solving alignment.
The main impact of the superalignment plan may very well be that it gives the people advancing capabilities a story to tell worried people. “Let’s have the AIs do the alignment work for us at some unspecified point in the future” also sounds like the kind of thing you’d say if you had absolutely zero plans for how to align powerful AI. My overall impression is that the people championing superalignment are not putting out plans specific enough to be seriously critiqued; there just isn't much substance to engage with. Instead, they should clearly outline why they believe this strategy is likely to work: why do they believe these conditions will be met, in particular why do they think this “period” will exist, and why do they believe these things about the alignment problem?
Eliezer and Nate also discuss the superalignment plan in detail in chapter 11 of IABIED. In short, they think some interpretability work can likely be done with AIs, and that this is a good thing - but interpretability itself is not a solution to alignment, though it is helpful. As for the version where the AI does all the alignment work, they believe that solving the alignment problem would require a superhuman AI, and such an AI would already be too dangerous to trust.