Eliezer frequently claims that AI cannot do our alignment homework for us. OpenAI disagrees and is pursuing Superalignment as their main alignment strategy.

Who is correct?




There doesn't seem to be any clear technical obstacle to this plan having a reasonable chance of success given substantial effort.

(That said, a reasonable chance of success might look like 90%, which isn't amazing. And this will depend on your probability of various issues like scheming.)

I think an important crux here is whether you think that we can build institutions which are reasonably good at checking the quality of AI safety work done by humans (at the point where we have powerful AIs). I think this level of checking should be reasonably doable, though could go poorly (see various fields).

However, note that if you think we would fail to sufficiently check human AI safety work even given substantial time, we would also fail to solve various issues given a substantial pause (as Eliezer thinks is likely the case).

It seems pretty relevant to note that we haven't found an easy way of writing software[1] or making hardware that, once written, can be easily evaluated to be bug-free and works as intended (see the underhanded C contest or cve-rs). The artifacts that AI systems will produce will be of similar or higher complexity.

I think this puts a serious dent into "evaluation is easier than generation"—sure, easier, but how much easier? In practice we can also solve a lot of pretty big SAT instances.
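For a concrete sense of the gap in the SAT case, here is a minimal sketch. The formula, the function names, and the brute-force search are all illustrative assumptions, not anything a real solver does: checking a proposed assignment is one pass over the formula, while the naive search tries up to 2^n assignments. Practical SAT solvers do far better than this naive search, which is the point about "how much easier?".

```python
from itertools import product

# A CNF formula is a list of clauses; each clause is a list of nonzero ints.
# Literal k means "variable k is true", -k means "variable k is false".

def check_assignment(clauses, assignment):
    """Evaluation: a single pass over the formula."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def brute_force_solve(clauses, num_vars):
    """Generation (naive): try up to 2**num_vars assignments."""
    for values in product([False, True], repeat=num_vars):
        assignment = {i + 1: value for i, value in enumerate(values)}
        if check_assignment(clauses, assignment):
            return assignment
    return None  # unsatisfiable

# Hypothetical toy instance: (x1 or not x2) and (x2 or x3) and (not x1 or not x3)
formula = [[1, -2], [2, 3], [-1, -3]]
solution = brute_force_solve(formula, num_vars=3)
print(solution, check_assignment(formula, solution))
```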

  1. Unless we get AI systems to write formal verifications for the cod

... (read more)
I think software companies do OK handling negligent and/or adversarial employees in practice via verification? Perhaps you disagree? And if this was a serious problem (e.g. if you could only trust a known 1/30th of employees but wanted to produce the same output), I think this would be possible to handle in practice as long as we could fire employees who are caught behaving badly? If I recall correctly, underhanded C contest entries are almost all pretty easily caught by decent software engineers with a combination of testing, syntax highlighting, and basic compiler linting. I don't think this is easy, but it's not as though the situation with human coders supervising human coders of comparable abilities looks that discouraging. (It seems basically fine?)
(I have not read your recent work on AI control, so feel free to refer me to any material that answers my questions there.)

At least with C, in my experience these kinds of mistakes are not easily caught by testing, syntax highlighting and basic compiler linting. (There's a reason why valgrind exists!) Looking over the winners for 2008, I have no idea what is going on, and think it'd take me quite a while to figure out what the hell is happening, and whether it's sketchy. I'd enjoy reading about experiments where people have to figure out whether a piece of C code is underhanded.

Another example that feels relevant, but where I'm not sure about the exact lesson, that comes to mind is the NSA modifying S-boxes for DES in order to make them more resistant to differential cryptanalysis.

  • The NSA hid a defense from a specific attack in the S-boxes
  • People figured this out only when the specific attack was found
  • It is unknown whether they hid anything that makes offense easier

Is it possible that they hid anything that makes offense easier? I don't know.

Edit: After some searching, I came upon the pRNG Dual_EC_DRBG, which did have a bunch of NSA involvement, and where they were pretty much caught biasing the numbers in a specific direction. So attacks here are definitely possible, though in this case it took more than five years to get caught.

As for the rest of your comment, I think we have different models, where you analogize negatively reinforcing AI systems to firing, which would be more applicable if we were training several systems. I'm pretty sure you've written an answer to "negatively reinforcing bad behavior can reinforce both good behavior and better hiding", so I'll leave it at that.
I'm not analogizing negative reinforcement to firing, I'm analogizing firing to no longer using some AI. See catching AIs red-handed for more discussion.
I agree that mistakes can be very hard to catch, but I still think it's hard to cause specific outcomes via carefully inserted bugs which aren't caught by careful auditing and testing. (See e.g. underhanded C where I think testing does pretty well.) This one is trivially resolved with unit testing I think. Though this breaks the analogousness of the problem. But I think my overall point stands, it just actually seems hard to cause specific outcomes IMO.

I think an important crux here is whether you think that we can build institutions which are reasonably good at checking the quality of AI safety work done by humans


Why is this an important crux? Is it necessarily the case that if we can reliably check AI safety work done by humans, we can also reliably check AI safety work done by AIs which may be optimising against us?

It's not necessarily the case. But in practice this tends to be a key line of disagreement.

However, note that if you think we would fail to sufficiently check human AI safety work given substantial time, we would also fail to solve various issues given a substantial pause

This does not seem automatic to me (at least in the hypothetical scenario where "pause" takes a couple of decades). The reasoning is that there is a difference between [automate a current form of an institution, and speed-run 50 years of it in a month] and [an institution, as it develops over 50 years].

For example, my crux[1] is that current institutions do not subscribe ... (read more)

I said "fail to sufficiently check human AI safety work given substantial time". This might be considerably easier than ensuring that such institutions exist immediately and can already evaluate things. I was just noting there was a weaker version of "build institutions which are reasonably good at checking the quality of AI safety work done by humans" which is required for a pause to produce good safety work. Of course, good AI safety work (in the traditional sense of AI safety work) might not be the best route forward. We could also (e.g.) work on routes other than AI like emulated minds.



Assuming that there is an "alignment homework" to be done, I am tempted to answer something like: AI can do our homework for us, but only if we are already in a position where we could solve that homework even without AI.

An important disclaimer is that perhaps there is no "alignment homework" that needs to get done ("alignment by default", "AGI being impossible", etc). So some people might be optimistic about Superalignment, but for reasons that seem orthogonal to this question - namely, because they think that the homework to be done isn't particularly difficult in the first place.

For example, suppose OpenAI can use AI to automate many research tasks that they already know how to do. Or they can use it to scale up the amount of research they produce. Etc. But this is likely to only give them the kinds of results that they could come up with themselves (except possibly much faster, which I acknowledge matters).
However, suppose that the solution to making AI go well lies outside of the ML paradigm. Then OpenAI's "superalignment" approach would need to naturally generate solutions outside of this new paradigm. Or it would need to cause the org to pivot to a new paradigm. Or it would need to convince OpenAI that way more research is needed, and they need to stop AI progress until that happens.
And my point here is not to argue that this won't happen. Rather, I am suggesting that whether this would happen seems strongly connected to whether OpenAI would be able to do these things even prior to all the automation. (IE, this depends on things like: Will people think to look into a particular problem? Will people be able to evaluate the quality of alignment proposals? Is the organisational structure set up such that warning signs will be taken seriously?)

To put it in a different way:

  • We can use AI to automate an existing process, or a process that we can describe in enough detail.
    (EG, suppose we want to "automate science". Then an example of a thing that we might be able to do would be to: Set up a system where many LLMs are tasked to write papers. Other LLMs then score those papers using the same system as human researchers use for conference reviews. And perhaps the most successful papers then get added to the training corpus of future LLMs. And then we repeat the whole thing. However, we do not know how to "magically make science better".)
  • We can also have AI generate solution proposals, but this will only be helpful to the extent that we know how to evaluate the quality of those proposals.[1]
    (EG, we can use AI to factorise numbers into their prime factors, since we know how to check whether the product of the proposed factors is equal to the original number. However, suppose we use an AI to generate a plan for how to improve an urban design of a particular city. Then it's not really clear how to evaluate that plan. And the same issue arises when we ask for plans regarding the problem of "making AI go well".)
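The factorisation case from the second bullet can be sketched as follows. This is a minimal illustration with hypothetical function names. It assumes the "proposal" is a list of claimed prime factors, and it adds a primality check on each factor, which the bullet does not mention but which the product test alone would miss (2 × 6 = 12, yet 6 is not prime). The trial-division generator stands in for whatever untrusted system produces the proposal.

```python
from math import prod

def is_prime(n):
    """Simple trial-division primality test (fine for small illustrative inputs)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def verify_factorization(n, factors):
    """Evaluation: accept a proposal only if it multiplies back to n
    and every claimed factor is itself prime."""
    return (
        len(factors) > 0
        and prod(factors) == n
        and all(is_prime(p) for p in factors)
    )

def factorize(n):
    """Generation (stand-in for an untrusted proposer): trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

proposal = factorize(360)
print(proposal, verify_factorization(360, proposal))
```

No comparably crisp `verify_*` function is available for the urban-design plan or for "making AI go well", which is the asymmetry the bullet points at.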

Finally, suppose you think that the problem with "making AI go well" is the relative speeds of progress in AI capabilities vs AI alignment. Then you need to additionally explain why the AI will do our alignment homework for us while simultaneously refraining from helping with the capabilities homework.[2]

  1. ^

    A relevant intuition pump: The usefulness of forecasting questions on prediction markets seems limited by your ability to specify the resolution criteria.

  2. ^

    The reasonable default assumption might be that AI will speed up capabilities and alignment equally. In contrast, arguing for disproportional speedup of alignment sounds like corporate b...cheap talk. However, there might be reasons to believe that AI will disproportionally speed up capabilities - for example, because we know how to evaluate capabilities research, while the field of "make AI go well" is much less mature.

Then you need to additionally explain why the AI will do our alignment homework for us while simultaneously refraining from helping with the capabilities homework.

I think there is an important distinction between "If given substantial investment, would the plan to use the AIs to do alignment research work?" and "Will it work in practice given realistic investment?".

The cost of the approach where the AIs do alignment research might look like 2 years of delay in median worlds and perhaps considerably more delay with some probability.

This is a substantial cost, but it's not an insanely high cost.

I feel a bit confused about your comment: I agree with each individual claim, but I feel like perhaps you meant to imply something beyond just the individual claims. (Which I either don't understand or perhaps disagree with.) Are you saying something like: "Yeah, I think that while this plan would work in theory, I expect it to be hopeless in practice (or unnecessary because the homework wasn't hard in the first place)."? If yes, then I agree --- but I feel that of the two questions, "would the plan work in theory" is the much less interesting one. (For example, suppose that OpenAI could in theory use AI to solve alignment in 2 years. Then this won't really matter unless they can refrain from using that same AI to build misaligned superintelligence in 1.5 years. Or suppose the world could solve AI alignment if the US government instituted a 2-year moratorium on AI research --- then this won't really matter unless the US government actually does that.)
I just think that these are important concepts to distinguish because I think it's useful to notice the extent to which problems could be solved by a moderate amount of coordination and which asks could suffice for safety. I wasn't particularly trying to make a broader claim, just trying to highlight something that seemed important. My overall guess is that people paying costs equivalent to 2 years of delay for existential safety reasons is about 50% likely. (Though I'm uncertain overall and this is possible to influence.) Thus, ensuring that the plan for spending that budget is as good as possible looks quite good. And not hopeless overall. By analogy, note that Google bears substantial costs to improve security (e.g. running 10% slower). I think that if we could ensure the implementation of our best safety plans which just cost a few years of delay, we'd be in a much better position.



Trivially, any AI smart enough to be truly dangerous is capable of doing our "alignment homework" for us, in the sense of having enough intelligence to solve the problem. This is something EY has also pointed out many times, but which often gets ignored. Any ASI that destroys humanity will have no problem whatsoever understanding that that's not what humanity wanted, and no difficulty figuring out what things we would have wanted it to do instead.

A very different, and less clear, claim is that we can safely use an AI with sufficient capabilities, but built before the "homework" was done, to do that homework (for likely/plausible definitions of "we" and "use").

Trivially, any AI smart enough to be truly dangerous is capable of doing our "alignment homework" for us, in the sense of having enough intelligence to solve the problem.

Is this trivial? People at least argue that the capability profile could be sufficiently unfortunate such that AIs are extremely dangerous prior to being extremely useful. (As a particularly extreme case, people often argue that AIs will be qualitatively wildly superhuman in dangerous skills (e.g. persuasion) prior to being merely qualitatively human level at doing AI safety research. ... (read more)

Fair enough, "trivial" overstates the case. I do think it is overwhelmingly likely. That said, I'm not sure how much we actually disagree on this? I was mostly trying to highlight the gap between an AI having a capability and us having the control to use an AI to usefully benefit from that capability.
I personally agree that on the default trajectory it's very likely that at the point where AIs are quite existentially dangerous (in the absence of serious countermeasures) they also are capable of being very useful (though misalignment might make them hard to use). However, I think this is a key disagreement I have with more pessimistic people who think that at the point where models become useful, they're also qualitatively wildly superhumanly dangerous. And this implies (assuming some rough notion of continuity) that there were earlier AIs which weren't very useful but which were still dangerous in some ways.
Yeah, there are lots of ways to be useful, and not all require any superhuman capabilities. How much is broadly-effective intelligence vs targeted capabilities development (seems like more the former lately), how much is cheap-but-good-enough compared to humans vs better-than-human along some axis, etc.

Petition to change the title of this post to "Can we get AIs to do our alignment homework for us?"




I'm no longer sure the question makes sense, and to the extent it makes sense I'm pessimistic. Things probably won't look like one AI taking over everything, but more like an AI economy that's misaligned as a whole, gradually eclipsing the human economy. We're already seeing the first steps: the internet is filling up with AI generated crap, jobs are being lost to AI, and AI companies aren't doing anything to mitigate either of these things. This looks like a plausible picture of the future: as the AI economy grows, the money-hungry part of it will continue being stronger than the human-aligned part. So it's only a matter of time before most humans are outbid / manipulated out of most resources by AIs playing the game of money with each other.



My mental model is that there is an entire space of possible AIs, each with some capability level and alignability level. Given the state of the alignment field, there is some alignability ceiling, below which we can reliably align AIs. Right now, this ceiling is very low, but we can push it higher over time.

At some capability level, the AI is powerful enough to solve alignment of a more capable AI, which can then solve alignment for even more capable AI, etc all the way up. However, even the most alignable AI capable of this is still potentially very hard to align. There will of course be more alignable and less capable AIs too, but they will not be capable enough to actually kick off this bucket chain.

Then the key question is whether there will exist an AI that is both alignable and capable enough to start the bucket chain. This is a function of both (a) the shape of the space of AIs (how quickly do models become unalignable as they become more capable?) and (b) how good we become at solving alignment. Opinions differ on this - my personal opinion is that probably this first AI is pretty hard to align, so we're pretty screwed, though it's still worth a try.

I wish you wouldn't use the term "align" if you actually just mean "safely use", or that you would make it clear that we don't necessarily need alignment, e.g. because we could apply something like control (perhaps combined with paying AIs for their labor like normal employees).

Sorry for the word policing.

Stephen McAleese


I wrote a blog post on whether AI alignment can be automated last year. The key takeaways:

  • There's a chicken-and-egg problem where you need the automated alignment researcher to create the alignment solution but the alignment solution is needed before you can safely create the automated alignment researcher. The solution to this dilemma is an iterative bootstrapping process where the AI's capabilities and alignment iteratively improve each other (a more aligned AI can be made more capable and a more capable AI can create a more aligned AI and so on).
  • Creating the automated alignment researcher only makes sense if it is less capable and general than a full-blown AGI. Otherwise, aligning it is just as hard as aligning AGI.

There's no clear answer to this question because it depends on your definition of "AI alignment" work. Some AI alignment work is already automated today such as generating datasets for evals, RL from AI feedback, and simple coding work. On the other hand, there are probably some AI alignment tasks that are AGI-complete such as deep, cross-domain, and highly creative alignment work.

The idea of the bootstrapping strategy is that as the automated alignment researcher is made more capable, it improves its own alignment strategies, which enables further capability and alignment improvements, and so on. So hopefully there is a virtuous feedback loop over time where more and more alignment tasks are automated.

However, this strategy relies on a robust feedback loop which could break down if the AI is deceptive, incorrigible, or undergoes recursive self-improvement and I think these risks increase with higher levels of capability.

I can't find the source but I remember reading somewhere on the MIRI website that MIRI aims to do work that can't easily be automated so Eliezer's pessimism makes sense in light of that information.

Further reading:



I am not so sure it will be possible to extract useful work towards solving alignment out of systems we do not already know how to carefully steer. I think that substantial progress on alignment is necessary before we know how to build things that actually want to help us advance the science. Even if we built something tomorrow that was in principle smart enough to do good alignment research, I am concerned we don’t know how to make it actually do that rather than, say, imitate more plausible-sounding but incorrect ideas. The fact that appending silly phrases like “I’ll tip $200” improves the probability of receiving correct code from current LLMs indicates to me that we haven’t succeeded at aligning them to maximally want to produce correct code when they are capable of doing so.

Bridgett Kay


As one scales up a system, any small misalignment within that system will become more apparent, more skewed. I use shooting an arrow as an example. Say you shoot an arrow at a target from only a few feet away. If you are only a few degrees off from being lined up with the bullseye, when you shoot the close target your arrow will land very close to the bullseye. However, if you shoot a target many yards away with the same degree of error, your arrow will land much, much farther from the bullseye.
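To put rough numbers on the arrow picture, here is a small sketch. The 2-degree error and the distances are illustrative assumptions, and the arrow is idealised as flying in a straight line (no gravity or wind). Under that idealisation the miss distance is the target distance times the tangent of the angular error, so it grows in direct proportion to distance.

```python
import math

def miss_distance(target_distance, error_degrees):
    """Offset from the bullseye for a straight-line shot that is
    error_degrees off-axis, in the same units as target_distance."""
    return target_distance * math.tan(math.radians(error_degrees))

# Same angular error, increasingly distant targets (distances in feet).
for distance in (3, 30, 300):
    offset = miss_distance(distance, 2.0)
    print(f"{distance:>4} ft away -> {offset:.2f} ft off-center")
```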

So if you get a less powerful AI aligned with your goals to a degree where everything looks fine, and then assign it the task of aligning a much more powerful AI, then any small flaw in the alignment of the less powerful AI will go askew far worse in the more powerful AI. What's worse, since you assigned the less powerful AI the task of aligning the larger AI, you won't be able to see exactly what the flaw was until it's too late, because if you'd been able to see the flaw, you would have aligned the larger AI yourself.



There's only one way to know!

</joking> <=========

O O

My intuition is that it is at least feasible to align a human level intelligence with the "obvious" methods that fail for superintelligence, and have them run faster to produce superhuman output.

Second, it is also possible to robustly verify the outputs of a superhuman intelligence without superhuman intelligence.

And third, there is a lot of value to be captured from narrow AIs that don't have deceptive capabilities but are very good at, say, solving math.

Second, it is also possible to robustly verify the outputs of a superhuman intelligence without superhuman intelligence.

Why do you believe that a superhuman intelligence wouldn't be able to deceive you by producing outputs that look correct instead of outputs that are correct?

Davidad's plan involves one plausible way of doing that

O O

I don’t have the specifics but this is just a natural tendency of many problems - verification is easier than coming up with the solution. Also maybe there are systems where we can require the output to be mathematically verified or reject solutions whose outcomes are hard to understand.

I for one would find it helpful if you included a link to at least one place that Eliezer had made this claim just so we can be sure we're on the same page. 

Roughly speaking, what I have in mind is that there are at least two possible claims. One is that 'we can't get AI to do our alignment homework' because by the time we have a very powerful AI that can solve alignment homework, it is already too dangerous to use the fact it can solve the homework as a safety plan. And the other is the claim that there's some sort of 'intrinsic' reason why an AI built by humans could never solve alignment homework.