We skip over [..] where we move from the human-ish range to strong superintelligence[1]. [..] the period where we can harness potentially vast quantities of AI labour to help us with the alignment of the next generation of models
- Will MacAskill in his critique of IABIED
I want to respond to Will MacAskill's claim in his IABIED review that we may be able to use AI to solve alignment.[1] Will believes that recent developments in AI have made it more likely that takeoff will be relatively slow - "Sudden, sharp, large leaps in intelligence now look unlikely". Because of this, he and many others believe there will likely be a period of time, at some point in the future, when we can essentially direct AIs to align more powerful AIs. But it appears to me that a “slow takeoff” is not sufficient at all, and that a lot of things have to be true for this to work. Not only do we need a slow takeoff, we also need AIs that are great at alignment research during this period. For their research to be useful, we need verifiable metrics and well-specified objectives, worked out ahead of time, that we can give to the AIs. Even if all of that works out, the alignment problem still has to be solvable by this sort of approach. And this only helps us if no one else builds unaligned, dangerous AI in the meantime or uses AI for capabilities research. I think it is unlikely that all of this holds, and the plan itself is likely to have negative consequences.
TLDR: The necessary conditions for superalignment[2] are unlikely to be met, and the plan itself may well do more harm than good.
Fast takeoff is still possible, and nothing about LLMs shows that it is impossible or very unlikely. Will does not provide a full-length argument for why anything about LLMs rules out fast takeoff. The key load-bearing arguments for fast takeoff are simple and unchanged. Once AI becomes capable enough to meaningfully do its own AI research without humans, this will lead to a great speed-up, because computers are fast and we are building a lot of very fast parallel computers. And once AIs start improving capabilities, each improvement feeds back into making them faster and smarter. Empirically, we have evidence from games like Go that superhuman levels can be reached quickly (within days or hours) through RL and methods such as self-play. If fast takeoff happens, there will be no substantial period in which AIs can be directed to align their successors. Furthermore, Will himself describes slow takeoff as taking months to years, which is still very little time.
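To make the "improvements compound" point concrete, here is a toy back-of-the-envelope model (my own illustration, not from the post or from MacAskill; the numbers are arbitrary assumptions): if each AI generation needs a fixed amount of research to build, but every generation multiplies the effective research speed, then almost all the wall-clock time is spent on the first generation and the later ones arrive almost immediately.

```python
# Toy compounding model of automated AI research. All numbers are illustrative
# assumptions, not predictions.
human_years_per_generation = 2.0  # assumed research effort per capability step
speedup = 1.0                     # AI research speed relative to humans today
r = 3.0                           # assumed speed multiplier gained per generation

elapsed_years = 0.0
for generation in range(1, 7):
    elapsed_years += human_years_per_generation / speedup
    speedup *= r
    print(f"gen {generation}: wall-clock years elapsed ~ {elapsed_years:.2f}")
```

With these toy numbers, six generations land within about three years, and the last couple within days to weeks: whatever "period" exists for directing AI labour at alignment is concentrated at the very start and shrinks rapidly.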
In addition to a slow takeoff, strong AI capabilities for alignment research have to appear in a certain sequence, and long before AI is existentially dangerous. Humans have so far failed to solve the alignment problem and still struggle to even understand it, yet superalignment assumes that AIs can be trained to solve it - and ideally before they get very good at speeding up capabilities research or become directly dangerous. I think this is unlikely, because that is not how it works in humans and because capabilities research appears much easier to verify and specify. Many humans are good at capabilities research, which includes work such as optimizing performance, creating good datasets, and setting up high-quality RL environments. These humans have made rapid progress on AI capabilities, while practical progress on eliminating prompt injections, on interpretability, and on theoretical breakthroughs about agency appears to me much more limited. I'd similarly expect AIs to get good at capabilities research before alignment research. We already have many examples of AI being used for capabilities research, likely because it is easier to verify and specify than alignment research: optimizing matrix multiplications, chip design, generating data, and designing RL tasks, to name a few. Therefore, AI will likely accelerate capabilities research long before it can meaningfully help with alignment.
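To illustrate why capabilities objectives are easier to hand to an AI than alignment objectives, here is a minimal sketch (my own toy example, not from the post): a candidate "faster matmul" can be scored by a fully automated check for correctness and speed, so a search or RL loop can optimize it directly, whereas there is no known function we could put in the corresponding slot for "is this model aligned?".

```python
import time
import numpy as np

def reference_matmul(a, b):
    return a @ b

def candidate_matmul(a, b):
    # Stand-in for an AI-proposed "optimized" kernel.
    return np.dot(a, b)

def score_capabilities_candidate(candidate, trials=5, n=256):
    """Verifiable objective: correctness plus measured speedup over the reference."""
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    if not np.allclose(candidate(a, b), reference_matmul(a, b), atol=1e-6):
        return 0.0  # wrong results are worthless, and easy to detect automatically
    t0 = time.perf_counter()
    for _ in range(trials):
        reference_matmul(a, b)
    t_ref = time.perf_counter() - t0
    t0 = time.perf_counter()
    for _ in range(trials):
        candidate(a, b)
    t_new = time.perf_counter() - t0
    return t_ref / t_new  # > 1.0 means a real, measurable improvement

def score_alignment_candidate(model) -> float:
    # There is no known function to put here: we have no metric that, when
    # maximized, certifies "this model will not pursue unintended goals".
    raise NotImplementedError("no verifiable alignment objective is known")

print(f"capabilities score: {score_capabilities_candidate(candidate_matmul):.2f}")
```

The asymmetry is the point: the capabilities score is a number an automated loop can push up, while the alignment "score" is a placeholder we do not know how to fill in.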
There is still no agreed-upon specification of what we would actually have these AI alignment research agents do. Would we figure this all out only once we arrive at this barely specified period? In fairness, some proposals exist for interpretability, and it seems empirically possible to have AIs help us with interpretability work. However, interpretability is a helpful but not sufficient part of alignment. Currently proposed explanation metrics can be gamed and are not sufficient for verification. Without strong verifiability, AIs could easily give us misleading or false interpretability results. Furthermore, improvements in interpretability do not equal an alignment solution: being able to see that an AI is plotting to take over does not mean you can build an AI that isn't trying to take over (Chapter 11, IABIED). It is also not clear that interpretability could even work, or be useful, for something much smarter than humans. Is it even possible to understand or steer the thoughts of something much smarter and faster than you?
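As a concrete illustration of how an explanation metric can be gamed, consider "simulability", one proxy sometimes used in interpretability work: an explanation is scored by how well it predicts the model's outputs. The sketch below is my own toy construction, not a metric proposed in the post; it shows that a degenerate "explanation" that simply re-runs the model scores perfectly while conveying zero insight into why the model behaves as it does.

```python
import numpy as np

rng = np.random.default_rng(0)

def opaque_model(x):
    # Stand-in for the model whose behaviour we want explained.
    return (np.sin(3 * x) + 0.5 * x > 0).astype(int)

def simulability_score(explanation, inputs):
    """Fraction of inputs on which the explanation predicts the model's output."""
    return float(np.mean(explanation(inputs) == opaque_model(inputs)))

def honest_explanation(x):
    # An honest but shallow explanation: a single threshold rule.
    return (x > 0).astype(int)

def gamed_explanation(x):
    # A gamed "explanation": just query the model again. Perfect score, zero insight.
    return opaque_model(x)

xs = rng.uniform(-2.0, 2.0, size=10_000)
print("honest rule score:", simulability_score(honest_explanation, xs))
print("gamed copy score :", simulability_score(gamed_explanation, xs))
```

An AI optimized against this kind of score could hand back convincing-looking but uninformative "interpretations", which is exactly the verification gap described above.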
The alignment problem is like an excavation site where we don't yet know what lies beneath. It could be all sand - countless grains we can steadily move with shovels and buckets, each scoop a solved sub-problem. Or we might discover that after clearing some surface sand we hit solid bedrock - fundamental barriers requiring genius breakthroughs far beyond human capability. I think alignment is more likely to be sand over bedrock than pure sand: we may get lots of sand-shoveling (solving small aspects of interpretability) but fail to address the deeper questions about agency and decision theory. Even restricting attention to interpretability of LLMs, it is not clear the problem is solvable in principle. It may be fundamentally impossible for an LLM to fully interpret another LLM of similar capability - like asking a human to perfectly understand another human's thoughts. And while we do have some progress on interpretability and evaluations, critical questions such as guaranteeing corrigibility seem totally unsolved, with no known way to even approach them. We are very far from knowing how we could tell that we had solved the problem. Superalignment assumes that alignment just takes a lot of hard work - that the problem is like shoveling sand, a massive engineering project. But if it's bedrock underneath, no amount of human-level AI labor will help.
If that period really existed, the same AI systems could very likely also be used to accelerate capabilities and rush straight ahead to unaligned superintelligence. While Anthropic or OpenAI might be careful here, there are many other companies that will push ahead as soon as possible. The vast majority of AI labs are extremely irresponsible and have no stated interest in dedicating any resources to solving alignment.
The main impact of the superalignment plan may very well be that it gives the people advancing capabilities a story to tell worried people. “Let’s have the AIs do the alignment work for us at some unspecified point in the future” also sounds like the kind of thing you’d say if you had absolutely zero plans for how to align powerful AI. My overall impression is that the people championing superalignment are not putting out plans specific enough to be seriously critiqued; there just isn't much substance to engage with. Instead, they should clearly outline why they believe this strategy is likely to work: why do they believe these conditions will be met, in particular why do they think this “period” will exist, and why do they believe these things about the alignment problem?
Eliezer and Nate also discuss the superalignment plan in detail in chapter 11 of IABIED. In short, they think some interpretability work can likely be done with AIs, and that this is a good thing - but interpretability itself is not a solution to alignment, though it is helpful. As for the version where the AI does all the alignment work, they believe that solving the alignment problem would require a superhuman AI, and such an AI would already be too dangerous to trust.