I have a long, detailed, opinionated answer, which I have published as a separate post (one of my draft readers persuaded me that some readers skip Question posts, because they don’t expect to find long, extensively-researched answers there).
You should probably also go read Evan Hubinger’s excellent post Alignment remains a hard, unsolved problem, for his recent take on this question.
Eliezer Yudkowsky and Nate Soares’ bestselling book If Anyone Builds It, Everyone Dies is also an attempt to answer this question, aimed primarily at a lay audience.
I'm not one of the impossible crowd, but I had a long discussion with @Remmelt about his views on this, and a significant part of the argument (at least the part I understood and feel I can adequately convey) seems to be more about keeping an AI aligned once you have gotten it there. A vague outline goes something like this:
Perhaps this would be best phrased as "there is a capability level above which it is impossible to align an ASI", but I think that these dynamics obviously apply to modern day systems as well.
There are also some thresholds to the degree/accuracy of alignment:
1) the AI not killing/permanently disempowering everyone through misaligned actions (Not-Kill-Everyone Alignment) — you can't align ASI once you're all dead
2) the AI not being so corrigible/controllable/having such easily adjusted alignment that a small group of humans use the AI to massively concentrate power/resources to the point where almost everyone else is dead or permanently disempowered (theoretically humanity might be able to get back from this state, if the group grows and their descendants are more moral, but it's at least a generational-duration trap).
3) the AI is sufficiently aligned to be able to safely assist us with AI-Assisted Alignment
4) the AI is sufficiently aligned that it can and will successfully do Value Learning and align itself better, and will converge to a stable very-aligned state.
I would hope that 4) might be able to solve the problem you describe, and 3) might help us do so, but neither of these is guaranteed or necessarily quick.
So, which of these should we use for "Alignment is 100% done"? Clearly if we don't have both 1) and some solution to 2) (either a technical one, or a legal/societal one), we're not done. I'm inclined to say we're not "done" until we have either 3) or 4): but if I'm right that we're currently maybe 10% done, mapping out the exact end state now seems overambitious. Getting things to "this is no longer an existential risk emergency" is clearly required, but exactly what the equivalent of "an acceptable level of steam engine safety" is for AI is less clear: there probably isn't a single sharp cutoff, just a "we're mostly past the drastic risk" level.
Epistemic status: We really need to know. (I also posted an opinionated answer.)
There’s a well-known diagram from a tweet by Chris Olah of Anthropic, showing a spectrum of how difficult AI safety might turn out to be, with five labeled difficulty levels: trivial, steam engine, Apollo, P vs NP, and impossible.
It would be marvelous to know what the actual difficulty is, out of those five labeled difficulty categories (ideally, exactly where it lies on that spectrum). This is a major crux that explains a large part of the very wide variation in P(doom) found across experts in the field. Clear evidence that Alignment is something like an Apollo-sized problem would strongly motivate dramatically increased funding and emphasis for AI Safety research (Apollo cost roughly $200 billion in current money, i.e. only 40% of OpenAI’s current valuation: expensive, but entirely affordable if needed to safely build ASI). Clear evidence that it’s more like P vs NP or impossible would be a smoking-gun proof that enforcing an AI pause before AGI or ASI is the only rational course. This is the question for the near-term survival of our species (so no pressure!)
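As a sanity check on that parenthetical, here’s the back-of-envelope arithmetic as a minimal sketch (both inputs are rough assumed figures, not precise ones):

```python
# Fermi sketch of the Apollo-vs-OpenAI comparison above. Both inputs are
# rough assumptions: Apollo's ~$25.8B nominal cost is commonly quoted as
# roughly $200B after inflation, and OpenAI's reported valuation is taken
# here as ~$500B.
apollo_cost_usd = 200e9       # assumed inflation-adjusted Apollo cost
openai_valuation_usd = 500e9  # assumed current OpenAI valuation

print(f"Apollo cost / OpenAI valuation: {apollo_cost_usd / openai_valuation_usd:.0%}")
# -> Apollo cost / OpenAI valuation: 40%
```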
I think this is a vital discussion. I’m also going to link-post below to my own (rather long) opinion, which is a separate post, and also to a few other existing resources which are basically other people’s attempts to answer this question.
The first three labels on Chris’s diagram are pretty self-explanatory: the only interesting question is whether “Steam Engine” means just steam engine safety work (which would make the scale more logarithmic, and also might make doing a progress-so-far comparison more natural), or also covers what one might call “steam engine capabilities” work, since those are pretty separable.
For rocketry, I think it’s a lot more difficult to separate getting there and back with only a 10% fatality rate (as Apollo did) from getting there and back at all, so rather than trying to separate out just the safety work, I think it makes the most sense to compare against all of the rocketry work (beginning at the beginning) that led up to the Apollo program (probably excluding the parallel Russian program, and the more military-specific aspects of various other programs). However, the Apollo program itself was so enormous that where to start counting is a rather small quibble.
P vs NP
We’ve only spent about 3,000 to 6,000 person-years on P vs NP so far, so it’s still quite plausible that it will be proven (or disproven) in a lot fewer person-years of effort than the roughly 3.5 million person-years that were spent on the Apollo program. However, unless we have ASI to help us, it’s still unlikely to be solved any time soon, because it’s a far more abstract and challenging problem than Apollo engineering, so the people competent to work on it and interested in doing so are few and far between. It’s taking a long time because it’s hard in a conceptual rather than a detail-oriented way. Sadly, for AI Alignment we’re currently on a short time limit, though the problem is attracting increasing attention.
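For transparency, here is roughly where person-year figures like those could come from, as an assumption-laden Fermi sketch (all inputs are my own guesses, not sourced data):

```python
# Assumption-laden Fermi estimate behind the person-year comparison above.
# Apollo: a commonly cited peak workforce of ~400,000, assumed here as a
# rough average over the ~9 peak years of the program.
apollo_person_years = 400_000 * 9  # ~3.6 million person-years

# P vs NP: assume the equivalent of ~60-110 full-time researchers since the
# problem was formally posed in 1971 (~54 years ago) -- pure guesswork.
p_np_low, p_np_high = 60 * 54, 110 * 54  # ~3,200-5,900 person-years

print(f"Apollo: ~{apollo_person_years / 1e6:.1f}M person-years")
print(f"P vs NP: ~{p_np_low:,} to {p_np_high:,} person-years")
print(f"Effort ratio: ~{apollo_person_years // p_np_high:,}x to {apollo_person_years // p_np_low:,}x")
```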
Eliezer Yudkowsky, the most famous proponent of high P(doom),[2] is on record that he doesn’t believe alignment is an insoluble problem (that’s item -1 in his 2022 List of Lethalities: he seems to think that it might take us on the order of a hundred years, if we actually managed not to kill ourselves in the process, and that if we somehow had access now to a textbook from a hundred years into that future then that might well be all we needed) —[3] so that presumably makes him and Nate Soares able advocates for the viewpoint. I’d be absolutely delighted if they or anyone else wanted to chime in here for that viewpoint — otherwise I’ll take If Anyone Builds It, Everyone Dies as the lay-audience case for it.
Shades of Impossible
I think it might be useful to provide some more granularity on “Impossible”:
Mathematically Impossible
The Orthogonality Thesis clearly predicts that it’s not actually impossible for an aligned ASI to exist, so unless that’s wrong, an impossibility proof would have to be something like demonstrating that identifying or constructing an aligned ASI, even at a very low but nonzero risk level, was a worse-than-polynomially-hard problem (say in parameter count, or IQ). I’d be particularly interested to hear from anyone who genuinely thinks Alignment is impossible in this sense (not just centuries or millennia of work), if we’re willing to accept a very low but not zero risk level. (Of course, an actual impossibility proof is a higher bar than it actually being impossible but not provably so.)
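To make the shape of such a claim concrete, here is one purely illustrative way it might be formalized (my own framing, not any known result), where N is the parameter count, ε the accepted residual risk, and A ranges over candidate alignment procedures:

```latex
% Hypothetical form of a "mathematically impossible" result (illustration only):
% any procedure A that produces an ASI of size N with misalignment risk at most
% \varepsilon must have super-polynomial cost in N.
\[
\forall \varepsilon \in (0,1),\ \forall A:\quad
\Pr\bigl[\,A(N)\ \text{is misaligned}\,\bigr] \le \varepsilon
\;\Longrightarrow\;
\mathrm{Cost}(A, N) = \omega(N^{k})\ \text{for every fixed } k.
\]
```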
Kardashev II
Another alternative is that alignment is not mathematically impossible-in-theory in the above sort of sense, but that it’s just vastly harder than any of the other four labeled categories — perhaps even to the level where it’s currently impossible-in-practice. If any sapient species around our current development level, even one that proceeded forward from this point with millennia of caution before finally creating ASI, would still have a negligible chance of success, and also a negligible chance of surviving the attempt, then it could reasonably be said to be impossible in practice. I’d similarly be very interested to hear arguments for this viewpoint. I’m nominating “Kardashev II”[4] as a name for this difficulty level: possible in theory, but something that we’re very clearly not going to manage any time soon.
Too Hard for Us
If Aligning AI is anywhere past Trivial, yet we rush ahead and build ASI anyway before we’re ready, then (unless we manage to luck into alignment-by-default) we’re likely to end up extinct or permanently disempowered. Even if we proceed cautiously but then mess up our first critical try, we’re still all dead or enslaved. Obviously you can’t solve alignment if you’re dead.
Holding this viewpoint often says rather less about your opinion of how hard aligning AI actually is, and rather more about your opinion of the frailty and foolishness of humans and their institutions.
I would like to thank everyone who made suggestions, gave input, or commented on earlier drafts of this pair of posts, including (in alphabetical order): Hannes Whittingham, JasonB, Justin Dollman, Olivia Benoit, Scott Aaronson, Seth Herd, and the gears to ascension.
I originally wanted to post this simply as a Question post plus my long, detailed Answer, as one of (hopefully) several answers, but one of my draft commenters persuaded me that quite a few LessWrong / Alignment Forum readers typically don’t click on question posts, because they expect to find only quick-take answers rather than detailed, extensively-researched answers.
Nevertheless, I encourage detailed heavily researched answers to the Question version of this post, as well as quick takes.
Eliezer has very carefully not publicly stated his current P(doom), but it’s generally agreed by people familiar with him and his writings that it must be over 90%, very likely over 95% — he has, after all, written a list of 43 different reasons why we’re doomed if we don’t pause AI, and co-authored a best-selling book on the subject titled If Anyone Builds It, Everyone Dies. When interviewed on the subject in 2023 he summarized this as:
He appears rather certain about this.
However, as the title of the book he co-authored suggests, it’s more accurate to describe him as having a very high conditional P(doom): he has given a TED talk in which he says that any not-DOOM credence he has is predicated on society doing something that he is publicly advocating for, but unfortunately doesn’t expect to happen on the current trajectory: enforceably pausing AI.
For anyone curious, neither my Question nor my Answer was written by an LLM (though they were proofread by an LLM, and, as I mention in my answer, I did have LLMs do some Fermi estimates and even delve into some deep web research for me). I have been overusing the m-dash since well before transformers were invented: it’s slightly old-fashioned, a little informal, and has useful differences in emphasis from the colon, the semi-colon, or the full stop. I considered giving up the m-dash when LLMs started copying me — and I rejected it. (I am also fond of parenthetical n-dashes for emphasis – and use them on occasion in my writing – though that seems to be less of a hot-button issue.)
By this I intend to describe a civilization capable of things like using gathered solar power to lift the contents of all the gas giants out of their gravity wells and then fusing most of the hydrogen and helium into elements more useful for building Dyson-swarm computronium, such as carbon/oxygen/nitrogen, and then perhaps also doing some star-lifting: a mature Kardashev Type II civilization, not just one with a lot of solar panels orbiting the sun.