How Hard a Problem is Alignment?

RogerDearnaley

Epistemic status: We really need to know. (I also posted an opinionated answer.)

There’s a well-known diagram from a tweet by Chris Olah of Anthropic:

It would be marvelous to know what the actual difficulty is, out of those five labeled difficulty categories (ideally, exactly where it lies on that spectrum). This is a major crux that explains a large part of the very wide variation in found across experts in the field. Clear evidence that Alignment is something like an Apollo-sized problem would strongly motivate dramatically increased funding and emphasis for AI Safety research (Apollo cost roughly $200 billion in current money, i.e. only 40% of OpenAI’s current valuation: expensive, but entirely affordable if needed to safely build ASI). Clear evidence that it’s more like or impossible would be a smoking gun proof that enforcing an AI pause before AGI or ASI is the only rational course. This is the question for the near-term survival of our species (so no pressure!)

I think this is a vital discussion. I’m also going to link-post below to my own (rather long) opinion, which is a separate post, and also to a few other existing resources which are basically other people’s attempts to answer this question.

The first three labels on Chris’s diagram are pretty self-explanatory: the only interesting question is whether “Steam Engine” means just steam engine safety work (which would make the scale more logarithmic, and also might make doing a progress-so-far comparison more natural), or also covers what one might call “steam engine capabilities” work, since those are pretty separable.

For rocketry I think it’s a lot more difficult to separate getting there and back with only a 10% fatality rate (as Apollo did) from getting there and back at all, so rather than trying to separate just safety, I think it makes the most sense to do a comparison to all of rocketry (so as to begin at the beginning) that led up to the Apollo program (probably excluding the parallel Russian program, or the more military-specific aspects of various other programs). However, the Apollo program itself was so enormous that when to start from is rather a small quibble.

P vs NP

We’ve only spent about 3,000 to 6,000 person-years on so far, so it’s still quite plausible that it will be proven (or disproven) in a lot fewer person-years of effort than the about 3.5 million person-years that were spent on the Apollo program. However, unless we have ASI to help us, it’s still unlikely it will be solved any time soon, because it’s a far more abstract and challenging problem than Apollo engineering, so people competent to work on it and interested in doing so are very few and far between. Thus it’s taking a long time, because it’s hard in a more conceptual than detail-oriented way. Sadly for AI Alignment we’re currently on a short time limit, but then the problem is attracting increasing attention.

Eliezer Yudkowsky, the most famous proponent of high ,^[2] is on record that he doesn’t believe alignment is an insoluble problem (that’s item -1 in his 2022 List of Lethalities: he seems to think that it might take us of the order of a hundred years, if we actually managed not to kill ourselves in the process, and that if we somehow had access now to a textbook from a hundred years into that future then that might well be all we needed) —^[3] so that presumably makes him and Nate Soares able advocates for the viewpoint. I’d be absolutely delighted if they or anyone else wanted to chime in here for that viewpoint — otherwise I’ll take If Anyone Builds It, Everyone Dies as the lay-audience case for it.

Shades of Impossible

I think it might be useful to provide some more granularity on “Impossible”:

Mathematically Impossible

The Orthogonality Thesis clearly predicts that it’s not actually impossible for an aligned ASI to exist, so unless that’s wrong, an impossibility proof would have to be something like demonstrating that identifying or constructing an aligned ASI, even at very low but finite risk level, was a worse-than-polynomially-hard problem (say in parameter count, or IQ). I’d be particularly interested to hear from anyone who genuinely thinks Alignment is impossible in this sense (not just or years’ work), if we’re willing to accept a very low but not zero risk level. (Of course, an actual impossibility proof is a higher bar than it actually being impossible but not provably so.)

Kardashev II

Another alternative is that alignment is not mathematically impossible-in-theory in the above sort of sense, but that it’s just vastly harder than any of the other four labeled categories — perhaps even to the level where it’s currently impossible-in-practice. If, for any sapient species around our current development level, even those that proceeded forward from this point with millennia of caution before finally creating ASI, they still had negligible success chance, and also negligible chance of surviving the attempt, then it could reasonably be said to be impossible in practice. I’d similarly be very interested to hear arguments for this viewpoint. I’m nominating “Kardashev II”^[4] as a name for this difficulty level: possible in theory, but something that we’re very clearly not going to manage any time soon.

Too Hard for Us

If Aligning AI is anywhere past Trivial, yet we rush ahead and build ASI anyway before we’re ready, then (unless we manage to luck into alignment-by-default) we’re likely to end up extinct or permanently disempowered. Even if we proceed cautiously, but then mess up our first critical try, we’re still all dead or enslaved. Obviously you can’t solve alignment if you’re dead.

Holding this viewpoint often says rather less about your opinion about how hard aligning AI actually is, and rather more about your opinion of the frailty and foolishness of humans and their institutions.

I would like to thank everyone who made suggestions, gave input, or commented on earlier drafts of this pair of posts, including (in alphabetical order): Hannes Whittingham, JasonB, Justin Dollman, Olivia Benoit, Scott Aaronson, Seth Herd, and the gears to ascension.

^{^}
I originally wanted to post this simply as a Question post plus my long, detailed Answer, as one of (hopefully) several answers; one of my draft commenters persuaded me that quite a few LessWrong / Alignment Forum readers typically don’t click on question posts, because they expect to find only quick-take answers rather than detailed extensively-researched answers.
Nevertheless, I encourage detailed heavily researched answers to this Question, as well as quick takes.
^{^}
Eliezer has very carefully not publicly stated his current , but it’s generally agreed by people familiar with him and his writings that it must be over 90%, very likely over 95% — he has, after all, written a list of 43 different reasons why we’re doomed if we don’t pause AI, and co-authored a best-selling book on the subject titled If Anyone Builds It, Everyone Dies. When interviewed on the subject in 2023 he summarized this as:
“I think it gets smarter than us, I think we’re not ready, I think we don’t know what we’re doing, and I think we’re all going to die.”
He appears rather certain about this.
However, as the title of the book he co-authored suggests, it’s more accurate to describe him as having a very high : he has given a TED talk in which he says that any not-DOOM credence he has is predicated on society doing something that he is publicly advocating for, but unfortunately doesn’t expect to happen on the current trajectory: enforceably pausing AI.
^{^}
For anyone curious, none of my Question or Answer were written by an LLM (though they were proofread by an LLM, and, as I mention in my answer, I did have them do some Fermi estimates and even delve into some deep web research for me). I have been overusing the m-dash since well before transformers were invented: it’s slightly old-fashioned, a little informal, and has useful differences in emphasis from the colon, the semi-colon, or the full stop. I considered giving up the m-dash when LLMs started copying me — and I rejected it. (I am also fond of parenthetical n-dashes for emphasis – and use them on occasion in my writing – though that seems to be less of a hot-button issue.)
^{^}
I am here intending this to describe a civilization capable of things like using gathered solar power to lift the contents of all the gas giants out of their gravity wells and then fusing most of the hydrogen and helium into elements more useful for building Dyson swarm computronium, such as carbon/oxygen/nitrogen, and then perhaps also doing some star-lifting: a mature Kardashev Type II civilization, not just one with a lot of solar panels orbiting the sun.

Nobody knows. Not to within an order of magnitude. Seriously. I've burned a lot of time taking all the arguments seriously, and none on either side are very complete.

This is an unfortunate position to be in. The sane thing to do is slow down if you don't know how dangerous the road ahead is. But humans are short-sighted. I expect those in power to see this rather simple fact (experts disagree wildly) and realize they should slow down, but I fear that could happen too late.

This is one of my biggest topics of interest. I read everything on it. The arguments on both sides are strong. I have found absolutely no discussion or writing that comes even close to getting to the bottom of squaring the different models that produce arguments on each side. Arguments on both sides are very abstract.

Nobody on either side, nor in the middle, has gotten even close to the object level. That's because it's a really hard problem. It requires either solving alignment-in-general, for any sort of mind; the agent foundations people have mostly (and rightly I think) given up on getting that done any time soon. But we don't have to align all minds, just the first one(s) significantly smarter than we are. And we don't have to align them perfectly to human values, just get them to reliably follow instructions from a just minimally smart and kind (set of) human(s).

Figuring out how hard alignment is is a solvable problem. It gets easier as we get closer to building the first AGIs because it scopes the problem. Of course, time to do so and use that information to good effect is also terrifyingly short if we wait for anything like certainty on what the first AGI will actually look like. That's why I'm spending a lot of time predicting AGI architecture while working on the alignment problem.

I won't even mention my actual guess on how hard alignment is, because like everyone else's, it's just a guess. I've spent almost as much time on this as anyone, and I don't yet have much clue. (My technical work is based on my guesses, because what else can you do?)

And I'll just pitch here that your answer is among the current-best pieces at actually working through where the abstract arguments for doom meet the reality of what's happening now, and therefore the most likely path to first takeover-capable AGI.

Now, that's for technical alignment. The additional problems of societal alignment (whose values is it aligned to, and how does that all shake out) are a different ball of wax. The two intertwine; I personally suspect that technical alignment for LLM-based AGI is pretty do-able, but the lack of societal alignment makes it effectively devilishly difficult.

The uncertainty also means, from our current perspective: NO FATE. It's time to work, not celebrate or despair.

I expect those in power to see this rather simple fact (experts disagree wildly) and realize they should slow down, but I fear that could happen too late.

I don't expect that to happen until at least some experts are saying that the danger is imminent, rather than a few years away, and probably not until we get a moderately impressive near miss that supports this claim. Currently, basically everyone still agrees that models are not existentially dangerous yet.

Racing all they way to the edge of the precipice and then slamming the brakes on at the very last m... (read more)

This should probably be a recurring question, ala the Open Threads LW moderators make, but to put it in a short sentence, alignment has gotten easier, but humanity has gotten more incompetent and is unwilling to pay large costs for safety.

The reason I say alignment has gotten easier is that we have slowly started to realize that the original goal needed to be revised in part by lowering the capability target.

One of the insights of AI control is that we (probably) don't actually need to consider aligning super-intelligences in the limit of technological development, or anywhere close to that, and that the first AIs that are both massively useful and pose non-negligible risk of AI takeover are able to be controlled in a way that doesn't depend on AI alignment working.

To be clear, it's still quite daunting as a challenge and AI companies/governments have started to be more reckless in AI deployment/progress, so it's still easy for misalignment to occur, especially if we get more unfavorable paradigms (neuralese actually working would be the big one here, but even more prosaic continual learning/long-term memory could be a big problem for AI alignment)

My median/modal expectation conditional on AI being able to automate all of AI R&D is that we implement half-baked control/alignment, and things are very messy and lots of balls are dropped, but we ultimately survive the ordeal based on cheap strategies like satiating AI preferences working, but that we incur a terrifying amount of risk (as in for example taking on 1-5%, or even 10-90% risk of AI takeover) while attempting to solve AI alignment.

I have a long, detailed, opinionated answer, which I have published as a separate post (since one of my draft readers persuaded me that some readers skip Question posts, since they don’t expect to find long, extensively-researched answers).

You should probably also go read Evan Hubinger’s excellent post Alignment remains a hard, unsolved problem, for his recent take on this question.

Eliezer Yudkowski and Nate Soares’ bestselling book If Anyone Builds It, Everyone Dies is also an attempt to answer this question, aimed primarily at a lay audience.

I think I care less than you do about this question. My reasoning is: if there's high uncertainty about the difficulty of alignment, then we should behave as if alignment is hard. Therefore, we should behave as if alignment is hard.

There's an asymmetric payoff to being wrong. If you assume alignment is hard when it's easy, then you unnecessarily delayed the singularity, which has some opportunity cost (and runs some risk that we go extinct from something unrelated to AI in the meantime). If you assume alignment is easy when it's hard, then everyone dies. The downside to being wrong in the latter case is worst. Therefore, we should behave as if alignment is hard (or at least skew our behavior in that direction).

I agree we don't know for sure and need to allow for a range of possibilities, and that (in some cases) that means the right thing to do is to pessimize.

However, I think there is some utility here. The case I make at the end of my answer is that we're very likely not going to be done in time if your timelines are 5 years, and probably not even if they're 10 years, but that we are close enough that if we could increase the growth rate of the field from 20% per year to 50% a year, then we have at least some chance in 5 years, and probably would be OK at 10.

My conclusions may, or may not, in fact turn out to be roughly right, and that sort of information does require you to be able to make an estimate to within something like a factor of two or three, so it's quite easy to be wrong, especially this early — but it's also really valuable information for things like funding priorities: it tells us we need to drastically increase effort on Fieldbuilding. Now if, as some people argue, this is in fact a problem, then you'd reach a very different set of conclusions about funding priorities.

The question is ill defined.

There is a lot conclusion about this subject, not least because the word "alignment" is used to mean different things. For some, it's synonymous with safety, for others, it's a particular approach to safety. for some It's once-and-for-all, for others iterative. For some, only perfection will suffice, for others, good enough is good enough.

For those reasons, treating aligned/misaligned as a simple binary isn't helpful. Nonetheless it is very common.

The good news is that the semantic confusion can be fixed here and now , without any special equipment.

True, but I'm happy to let anyone answering it include defining what they mean.

FWIW, my definition was "getting to the point where existential and suffering risks from misalignment are, at least, significantly reduced, and we are sufficiently confident that AI is sufficiently well aligned that we can reasonable trust it to supply most of the effort in further alignment work. I'm also implicitly assuming that what we're aligning is LLMs, or at least something fairly similar to them for alignment purposes — partly because, as I discuss in my answer, if it's something else aligning LLMs may still be useful (e.g. if they get used by the other AI as a tool call to solve alignment-related problems), and also that that happening probably delays ASI enough to give us some extra time.

FWIW, my definition was “getting to the point where existential and suffering risks from misalignment are, at least, significantly reduced, and we are sufficiently confident that AI is sufficiently well aligned that we can reasonable trust it to supply most of the effort in further alignment work

For many non-doomers , including those working in AI, safety is a series incremental steps, gradual improvements that hopefully keep pace with improvements in capabilities. Their view is analogous to car or plane safety , where the process has no end point, and perfection, zero casualties, isn't really expected. On this view, safety is not a point , but a line that needs to keep ahead of the ever rising capabilities line. It would be odd to ask when car safety will be solved.

Why alignment at all?

There are a number of routes to AI safety.

Control means that it doesn't matter what the AI wants, if it wants anything, because we can make it do what we want.

Corrigibility means alignment that can be changed once an AI is up and running. Control could be considered extreme corrigibility.

Non agency. Alignment and Control are both responses to agency. A third approach is non-agentic "tool AI" which responds to a specific instruction or request. Current (2025) AI's are fairly tool like.

AI Control is fine below and perhaps even up to AGI. I think that approach genuinely does suffere from a Sharp Left turn once the AI's capabilities significantly exceed ours: that seems to me like an approach where your control startegies really do need to be as smart as the thing you're trying to control. In very simple situations, you can use cryptographically strong techniques, but in realistic AI Control tasks, the attack surface is so large and so complex that something that can understand it better than you can has a huge tactical advantage.

I see Corrigibility as very different from Control. Building a very corrigible AI is likely a feasible technical approach to AI Alignment. My issues with it are primarily:

a) corrigibility has a bigger problem with extrapolation out-of-distribution and thus Goodharting than a more value-learning based approach. This is not necessarily an insoluble problem, if the AI can distinguish what's out-of-distribution and act suitable cautiously: Seth Herd's "Do What I Mean (and Check)" Corrigible alignment is basically this.

b) it is very, very easy for multiple groups of humans each with access to corrigible ASI to get into a war or other form of conflict using ASI-powered weapons/technologies. It is also very easy for a small powerful group to use very Corrigible AI to greatly concentrate power. Both of these are separate sources of X-Risk/Suffering-Risk that simple misalignment, but also very serious risks. Dario Amodei 's writing, and indeed Claude's Constitution make it clear that Anthropic take this risk as seriously as they do misalignment X-risk, and I completely agree with them. I think people on LessWrong and in the Alignment Community generally need to consider this problem more than they often seem to. ASI generated technology is going to be very powerful, and is thus going to need to be used very wisely, even when it has appeared rapidly. Highly Corrigible AI is much less likely to push back on the imprudent ideas of whoever is operating/controlling it than Value Learning AI.

Tool AI isn't the direction that the market and demand is currently moving, and has exactly the same potential for empowering existing human conflicts and enhancing concentration of power as Corrigible AI, if not even more so.

So I see Corrigible AI and Tool AI as probably technically feasible, but as causing massive inherent sociotechnical risks. What we need is AI that is wiser and more ethical than humans, but actually aligned to what a very wide human would agree is in the general interests of all of humanity.

So I agree that what you describe are approaches often outline to AI Alignment: I just disagree with calling that AI Safety: I see creating highly Corrigible AI as solving the technical AI Alignment problem at the cost of producing a different major new of X-Risk/S-Risk from AI, so not solving AI Safety.

I think that approach genuinely does suffere from a Sharp Left turn once the AI’s capabilities significantly exceed ours: that seems to me like an approach where your control startegies really do need to be as smart as the thing you’re trying to control

There's a very basic difference between the people who believe in SLT'S, rapid RSI , etc and those who don't, and it affects their unspoken assumptions and semantics. The thing where it affects their senstuvs is a problem.

corrigibility has a bigger problem with extrapolation out-of-distribution and thus Goodharting than a more value-learning based approach

I don't se why.

b) it is very, very easy for multiple groups of humans each with access to corrigible ASI to get into a war or other form of conflict using ASI-powered weapons/technologies

Agreed. I didn't say so explicitly, but I was mainly concerned with Everybody Dies scenarios. I think a multipolar scenario where ASI are controllable and controlled by powerful interests is highly likely, but not completely fatal.

Tool AI isn’t the direction that the market and demand is currently moving, and has exactly the same potential for empowering existing human conflicts and enhancing concentration of power as Corrigible AI, if not even more so.

Ok, but that's a different complaint to "it's not even possible". Also , the market is for agents that work for you, but do your own things. That's a point against the standard Doom argument with a sovereign AI killing everyone for it's own reasons.

So I see Corrigible AI and Tool AI as probably technically feasible, but as causing massive inherent sociotechnical risks. What we need is AI that is wiser and more ethical than humans, but actually aligned to what a very wide human would agree is in the general interests of all of humanity.

And a.way of forcing people to use it. If merely controllable/corrigible AI is available, powerful interests are going to prefer it.

So I agree that what you describe are approaches often outline to AI Alignment: I just disagree with calling that AI Safety: I see creating highly Corrigible AI as solving the technical AI Alignment problem at the cost of producing a different major new of X-Risk/S-Risk from AI, so not solving AI Safety.

Neither alignment nor safety is a simple binary

Sounds like we're mostly in agreement!

I'm not one of the impossible crowd, but I had a long discussion with @Remmelt about his views on this, and a significant part of the argument (at least the part I understood and feel I can adequately convey) seems to be more about keeping an AI aligned once you have it there. A vague outline goes something like this:

Imagine you have an aligned ASI
A sufficiently powerful system will likely have subsystems
There are always going to be aspects of the subsystems which cannot be controlled for
Those aspects will provide natural variance within which natural selection will occur
Gradually, we end up with a system which wants to reproduce (or at the very least with capable subsystems which do, and are capable of it in a way the system cannot control)
Purely reproduction oriented AIs are misaligned

Perhaps this would be best phrased as "there is a capability level above which it is impossible to align an ASI", but I think that these dynamics obviously apply to modern day systems as well.

This is assuming a highly intelligent system can't adequately anticipate problems with its overall function. I've read them carefully, and I don't buy this argument. The "proofs" are that misalignment happens by the end of time. I think improvements in intelligence likely outpace this problem so the practical answer is that this takes longer than the universe lasts, at least for an ASI that strongly "wants" to maintain its goals/values (as is instrumentally convergent under many reasonable assumptions).

There are also some thresholds to the degree/accuracy of alignment:

1) the AI not killing/permanently disempowering everyone through misaligned actions (Not-Kill-Everyone Alignment) — you can't align ASI once you're all dead

2) the AI not being so corrigible/controllable/having such easily adjusted alignment that a small group of humans use the AI to massively concentrate power/resources to the point where almost everyone else is dead or permanently disempowered (theoretically humanity might be able get back from this state, if the group grows and their descendants are more moral, but it's at least a generational-duration trap).

3) the AI is sufficiently aligned to be safely able to assist us with AI assisted Alignment

4) the AI is sufficiently aligned that it can and will successfully do Value Learning and align itself better, and will converge to a stable very-aligned state.

I would hope that 4) might be able to solve the problem you describe, and 3) might help us do so, but neither of these is guaranteed or necessarily quick.

So, which of these should we use for "Alignment is 100% done"? Clearly if we don't have both 1) and some solution to 2) (either a technical one, or a legal/societal one), we're not done. I'm inclined to say we're not "done" until we have either 3) or 4): but if I'm right that we're currently maybe 10% done, mapping out the exact end state now seems overambitious. Getting things to "this is no longer an existential risk emergency" is clearly required, but exactly what the equivalent of "an acceptable level of steam engine safety" is for AI is less clear: there probably isn't a single sharp cutoff, just a "we're mostly past the drastic risk" level.

I agree we don't know for sure and need to allow for a range of possibilities, and that (in some cases) that means the right thing to do is to pessimize.

The question is ill defined.

For those reasons, treating aligned/misaligned as a simple binary isn't helpful. Nonetheless it is very common.

The good news is that the semantic confusion can be fixed here and now , without any special equipment.

FWIW, my definition was “getting to the point where existential and suffering risks from misalignment are, at least, significantly reduced, and we are sufficiently confident that AI is sufficiently well aligned that we can reasonable trust it to supply most of the effort in further alignment work

Why alignment at all?

There are a number of routes to AI safety.

Control means that it doesn't matter what the AI wants, if it wants anything, because we can make it do what we want.

Corrigibility means alignment that can be changed once an AI is up and running. Control could be considered extreme corrigibility.

I think that approach genuinely does suffere from a Sharp Left turn once the AI’s capabilities significantly exceed ours: that seems to me like an approach where your control startegies really do need to be as smart as the thing you’re trying to control

corrigibility has a bigger problem with extrapolation out-of-distribution and thus Goodharting than a more value-learning based approach

I don't se why.

b) it is very, very easy for multiple groups of humans each with access to corrigible ASI to get into a war or other form of conflict using ASI-powered weapons/technologies

Tool AI isn’t the direction that the market and demand is currently moving, and has exactly the same potential for empowering existing human conflicts and enhancing concentration of power as Corrigible AI, if not even more so.

So I see Corrigible AI and Tool AI as probably technically feasible, but as causing massive inherent sociotechnical risks. What we need is AI that is wiser and more ethical than humans, but actually aligned to what a very wide human would agree is in the general interests of all of humanity.

And a.way of forcing people to use it. If merely controllable/corrigible AI is available, powerful interests are going to prefer it.

So I agree that what you describe are approaches often outline to AI Alignment: I just disagree with calling that AI Safety: I see creating highly Corrigible AI as solving the technical AI Alignment problem at the cost of producing a different major new of X-Risk/S-Risk from AI, so not solving AI Safety.

Neither alignment nor safety is a simple binary

Sounds like we're mostly in agreement!

Imagine you have an aligned ASI
A sufficiently powerful system will likely have subsystems
There are always going to be aspects of the subsystems which cannot be controlled for
Those aspects will provide natural variance within which natural selection will occur
Gradually, we end up with a system which wants to reproduce (or at the very least with capable subsystems which do, and are capable of it in a way the system cannot control)
Purely reproduction oriented AIs are misaligned

Perhaps this would be best phrased as "there is a capability level above which it is impossible to align an ASI", but I think that these dynamics obviously apply to modern day systems as well.

3) the AI is sufficiently aligned to be safely able to assist us with AI assisted Alignment

4) the AI is sufficiently aligned that it can and will successfully do Value Learning and align itself better, and will converge to a stable very-aligned state.

21

[ Question ]

How Hard a Problem is Alignment?

21

P vs NP

Shades of Impossible

Mathematically Impossible

Kardashev II

Too Hard for Us

21

3 Answers sorted by
top scoring

Mar 11, 2026*

Mar 13, 2026

Mar 11, 2026

21

21

[ Question ]

How Hard a Problem is Alignment?

21

P vs NP

Shades of Impossible

Mathematically Impossible

Kardashev II

Too Hard for Us

21

3 Answers sorted by top scoring

Mar 11, 2026*

Mar 13, 2026

Mar 11, 2026

21

3 Answers sorted by
top scoring