Sympathy for both sides of the egregious misalignment debate

Steven Byrnes

Sympathy for both sides of the egregious misalignment debate — LessWrong

12th Jun 2026

AI Alignment Forum

On one side of this debate are Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even slightly nice, in the absence of yet-to-be-invented breakthrough technical alignment ideas.

On the other side of this debate is almost everyone who works on or studies LLMs. Some of them are very concerned about egregious scheming, others much less so, and as a group they’re equally or more concerned about lots of other potential AI problems—AI-assisted bioterrorism, AI-assisted dictatorships, etc. And if they’re concerned about egregious misalignment and scheming, they’ll often say that it would come about through being in too much of a rush, or careless programmers, or bad actors, etc., as opposed to the simpler Yudkowsky & Soares story of “we get egregious misalignment and scheming because nobody has the foggiest idea how to avoid that”.

Here’s my brief idiosyncratic take on this debate. I think BOTH of the following are true:

(1) If you really think carefully about the properties of ASI, you really do find good reasons to strongly expect it to be egregiously misaligned, scheming, and ruthless, in the absence of yet-to-be-invented breakthrough technical alignment ideas.
(2) If you really think carefully about the properties of current LLMs, you really do find good reasons to think that existing technical alignment techniques are adequate now, and may well continue to be adequate in the future.

So then here are three (caricatured) positions:

My position:

(1) and (2) are both totally true. And we can reconcile them by saying that LLMs won’t scale to ASI.

Yudkowsky & Soares’s position [caricatured]:

(1) is totally true. We know this with great confidence, having spent decades thinking about it.
So it follows that (2) must be wrong or irrelevant.
Why is (2) wrong or irrelevant? Hard to say! There’s no ASI yet, and nobody knows in detail how it will appear. Sometimes it’s easier to predict what happens eventually than the detailed path. An ice cube in warm water will melt eventually, but don’t ask me to predict how many seconds it will take to melt, etc.
So anyway, one possibility is that (2) is wrong because LLMs will kinda ‘wake up’, or something, when the core pieces of true intelligence finally come together. And then their behavior would change drastically for the worse. And maybe we’re already starting to see glimmers of that in existing LLMs?
Or another possibility [cf. Eliezer tweet] is that LLMs will invent non-LLM ASI. And then (2) will be simply irrelevant!
…Or something else! Again, we don’t know! But we do know that (1) is definitely right.

LLM people’s position [caricatured]:

(2) is totally true. We know this with great confidence, because we are LLM experts and we have thought about these alignment plans in great detail, including matching our theories against real-world data.
So it follows that (1) must be incorrect.
Why is (1) incorrect? I don’t really know! Man, I read Yudkowsky and Soares, and it’s all these words, words, words, and I’m reading along and trying to match those words to my knowledge of LLMs and it just doesn’t make any damn sense. I can and will try to respond to their points in detail, but honestly the core issue is that they’re guilty of head-in-the-clouds armchair theorizing gone off the rails.

Conclusion

…So I think that both sides of the debate are basically coming from a reasonable and sympathetic place, with a big kernel of truth.

Bonus section: Further commentary

…That said, I can still complain at both sides!

My “true objection” to Yudkowsky & Soares:

For the record, my “true objection” to Yudkowsky & Soares is that if we’re talking about ASI, then LLMs are basically irrelevant and we shouldn’t even be talking about LLMs at all. And meanwhile, their plans are misguided because delaying ASI is possible on the margin but mostly hopeless, although I guess I’m happy that they’re trying anyway. Meanwhile, my hunch is that they’re overstating the intractability of finding that technical alignment breakthrough, although I haven’t found it yet, so I guess time will tell.

My within-frame complaint at Yudkowsky & Soares:

…But I’ll put that aside for the sake of argument, and bring up a narrower complaint within their frame:

I think their suggestions that LLMs may become completely egregiously misaligned in the future via … umm … the ‘true core of intelligence’ coming together, and ‘waking up’? Like Skynet or something?? That was mean, sorry, but in any case, I don’t think this idea hangs together either theoretically or empirically.

For the former (theory), see my discussion of the extreme weirdness of the LLM pretraining algorithm in Foom & Doom §2.3.2. I think Yudkowsky & Soares have not internalized how weird this type of learning algorithm is, and if they had, then Yudkowsky would not be occasionally suggesting that we should think of an LLM as an actress playing characters.

For the latter (empirical), I think the most fair assessment is that current LLMs are nice and obedient in some contexts, and LLMs are mean, defiant, and just plain weird in other contexts. You can straightforwardly go from that observation to “maybe there will be egregious misalignment and scheming in the future”, but not to “there will definitely be egregious misalignment and scheming in the future, absent new breakthrough technical alignment ideas”.

I think that if Yudkowsky & Soares stopped treating current LLMs as direct evidence for technical alignment being definitely completely unsolved, and instead treated them as either mixed evidence or entirely off-topic, then their public messaging would come across to policymakers and general audiences as somewhat more convoluted and confusing. But I think it would be more accurate. Oh well.

My “true objection” to LLM people:

For the record, my “true objection” to the LLM people is that I don’t really care about anything they say, because I’m working on the ASI alignment problem, and LLMs won’t scale to ASI.

(I’m overstating a bit. I’m generally happy for people to work on making LLM-world a place of wisdom and goodness, especially because LLM-world is the world in which ASI will someday be invented.)

My within-frame complaint at LLM people:

…But I’ll put that aside for the sake of argument, and bring up a narrower complaint within their frame:

I think the LLM people are not pricing in the predictable consequences of ever more RLVR and/or the predictable consequences of ever more “real” open-ended continual learning, should the latter ever be solved (which I don’t think it will be, but never mind that).

In other words, lots of LLM-focused people say “LLMs will eventually be able to do the things that humanity did over the last 5000 years: open-endedly and autonomously build new knowledge and ideas on top of new knowledge and ideas, in an endless tower, with no need for human-provided ground truth anywhere in that process. And how exactly will the future LLMs do that? Uhh, I don’t know, people are working on it, I guess they’ll probably figure something out.”

…And bam, that blank spot in the map is where the pea gets hidden under the thimble.

Because if you want the LLMs to gain ever more knowledge, whether through a perpetual RLVR loop or some other yet-to-be-invented type of continual learning, there has to be some kind of ground truth, or else it will go off the rails into nonsense. And that ground truth, whatever it is, will basically amount to an objective function (a.k.a. cost function, reward function, whatever). And when the LLM updates enough on that ground truth, then whatever human-niceness that the LLM inherited from pretraining will get diluted away in favor of ruthless maximization of that objective function.
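As a purely illustrative toy (not a model of any real LLM or training pipeline), the dilution argument above can be sketched with a two-action softmax policy. Everything here is a hypothetical assumption made for illustration: the action labels, the starting logits standing in for a pretraining prior, the reward values, and the function names. The sketch also simply assumes that the objective scores the "ruthless" action higher, which is the premise of the argument rather than something the code demonstrates.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_policy_gradient_step(logits, rewards, lr):
    """One exact (expectation, not sampled) policy-gradient step.

    For a softmax policy, d E[r] / d logit_i = p_i * (r_i - E[r]).
    """
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [
        logit + lr * p * (r - expected_reward)
        for logit, p, r in zip(logits, probs, rewards)
    ]


# Hypothetical toy setup: action 0 = "nice", action 1 = "ruthless".
PRETRAINED_LOGITS = [3.0, 0.0]  # assumed prior from imitation, favoring "nice"
REWARDS = [0.0, 1.0]            # assumed objective that scores "ruthless" higher


def nice_probability_after(steps, lr=0.5):
    """Probability of the "nice" action after `steps` objective-driven updates."""
    logits = list(PRETRAINED_LOGITS)
    for _ in range(steps):
        logits = expected_policy_gradient_step(logits, REWARDS, lr)
    return softmax(logits)[0]


if __name__ == "__main__":
    for steps in (0, 10, 50, 200):
        print(steps, round(nice_probability_after(steps), 4))
```

In this toy, the inherited preference for the "nice" action shrinks monotonically as updates on the objective accumulate, and nothing in the update rule preserves it. Whether real continual-learning setups behave this way depends on whether the objective actually conflicts with niceness and on how much updating happens, which is the substantive question.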

Thanks Zack M. Davis for a brief discussion that inspired this post.

27 comments

ryan_greenblatt (2mo)

And if they’re concerned about egregious misalignment and scheming, they’ll probably say that it would come about through race dynamics, careless programmers, bad actors, etc., as opposed to the simpler Yudkowsky & Soares story of “we get egregious misalignment and scheming because nobody has the foggiest idea how to avoid that”.

Citation needed? I think most people who are pretty worried about egregious misalignment are worried about it emerging naturally and being at least moderately difficult to prevent.

Steven Byrnes (1mo)

I guess you would know better than me, so I changed it from "probably" to "often" and from "race dynamics" to "being in too much of a rush" (the latter is what I meant all along but hopefully clearer). But I'm reluctant to go further than that. I do really have an impression that most people in LLM world think that the things that people have been doing (constitutions, deliberative alignment, inoculation prompting, using interp to test for eval-awareness, etc.) are what progress on alignment looks like, and if egregious misalignment and scheming happens in the future it would be either because the good guys were in too much of a hurry to iterate and develop more and better techniques in that same general genre, or because the good guys were not the ones building ASI.

Vladimir_Nesov (2mo)

LLMs won’t scale to ASI

Not directly, but if humans automate the process of using LLMs to build the next generation of LLMs, this process of prosaic RSI is plausibly good enough to count as AGI (where the LLMs themselves still don't count as AGI; model size scaling slows down after 2030-2031, at quadrillion param models). It still likely won't scale to ASI, unable to quickly learn deep skills and thus make fast conceptual progress, but to the extent that it helps invent ASI (possibly over many years), it might also help invent the concepts of ASI alignment.

So alignment of LLMs is necessary, but whether it's sufficient remains unclear. It depends on how capable LLMs become at quadrillion param scale, after the prosaic RSI loop closes and they become self-building (most relevantly, automatically preparing the training data for RL, to teach the next model new concepts and skills). Possible cruxes are that LLMs never reach that point (with current training methods), or that they remain mostly useless at conceptual progress even with the prosaic RSI loop closed, or that they somehow proceed to invent ASI quickly, while not being wise enough to solve ASI alignment first (since this scenario doesn't involve a gradual process of getting smarter significantly beyond human level).

Logan Zoellner (2mo)

LLMs won’t scale to ASI

This was a fine view to have 3 years ago. But at the point where LLMs are already pushing the boundaries of mathematics, claiming they won't scale is denying objective reality. What specific capability do you expect ASI to have that LLMs don't already possess?

julius vidal (1mo)

It does not seem clear to me at all that mathematical ability (and more generally discrete token manipulation) translates into ability in real world tasks that involve messy unpredictable continuous systems.

Of course AI that is massively superhuman in maths, coding, etc. would still be transformative in many ways. But it might not be the kind of ASI that can meaningfully pursue its own goals in the world the way most X-risk scenarios worry about.

kbear (1mo)

je ne sais quoi

Seth Herd (2mo, edited)

I agree. I'm semi-obsessed with this disagreement and how to understand and resolve it.

Here's my framing, largely in agreement with yours I think, except the relevance of LLMs to aligning ASI.

Both sides are making reasonable arguments that seem pretty strong as far as they go.

But they don't go as far as making contact with each other. They don't reach the common ground between those two positions: when/if LLMs reach AGI. I agree with you that LLMs won't reach ASI. But with a little scaffolding and learning, they are looking quite likely to reach what we usually call AGI. How many of the severe concerns will apply? We haven't figured that out.

I think much of why the two sides don't make contact is sociological. I think there's a cultural and methodological divide, and some justified (but unhelpful) irritation on both sides based on some amount of condescension from both sides. There's a certain amount of motivated reasoning, confirmation bias, and social compounding of those effects making both sides confident that they have the better perspective and methodology. We're all biased, and even rationalists (let alone scientists) have emotions that factor into creating our beliefs. Correcting for biases takes work and skill and is always imperfect.

I think this is happening in part because it's really hard to make a mental model of how those two perspectives meet: when LLMs become AGI. That describes the core of my research agenda. "LLM AGI may reason about its goals and discover misalignments by default" was my most complete attempt to envision how the concerns of classic agent foundations enter into the progress of LLMs to AGI. I think this is one way people envision them "waking up" - becoming fully reflective and self-aware as well as continuous and agentic, like humans whom we consider "awake". But this might be delayed with current alignment and control strategies, perhaps long enough to get substantial help with alignment - another source of LLM optimism.

I agree that LLMs won't reach ASI, for all of the reasons you and others have given. They won't be ASI even with scaffolding and Continual Learning (new sequence exploring the implications). But they will likely be useful for building ASI (raising risks of Slop, not Scheming creating misaligned ASI) and they may be smart, agentic, and competent enough to be takeover-capable themselves.

That's if they have enough continual learning, memory, and executive function/Human-like metacognitive skills scaffolded or trained in. But given the progress in the base LLMs, I don't think there need to be any breakthroughs in any of those. As you say, this raises many of the classic concerns because continual learning does imply a value function. The current functionally adequate alignment of LLMs isn't stable in the face of a lot of additional RL, or perhaps updated world models, including through learning during reflection (my LLM AGI may reason about its goals... post).

My hope is that continual learning may be mostly about improving world models, while the "values" and "goals" of the LLM self-stabilize in reflective stability. I think this is part of the intuition that LLMs might remain aligned if they reach AGI/ASI; this is more or less why we'd trust a really good human as they got smarter. But that's a hope, not a plan or an argument that we get that by default.

So: it seems like we should really figure out how the two arguments meet.

Petropolitan (2mo)

they may be smart, agentic, and competent enough to be takeover-capable themselves

Might we just put aside AGI and ASI buzzwords and use "takeover-capable AI" as a go-to term instead? I believe it's of secondary importance whether a takeover-capable AI system is "general" (whatever that means) and/or "superhuman"

Noosphere89 (2mo)

I just wrote a linkpost advocating for the use of more specific terms here, and I wholeheartedly agree with you on the point you are making here, and think the AGI and ASI buzzwords are useless at this point.

Petropolitan (1mo)

Thanks a lot (and big thanks to Helen Toner!), do you think you could add AI capable of a takeover, or rather a civilization-ending failed takeover attempt (as a lower and more practical bar) to the post?

Noosphere89 (1mo)

Added the paragraph asked for here.

Petropolitan (1mo)

Thanks again for that (and the hat tip), I will link to your post in the future when this topic comes up again! As a side note, I would have laid out that section as a postscript because that's what it basically is ;-)

Zack_M_Davis (2mo)

I think "transformative AI could be slightly nice" arguments aren't logically dependent on LLMs-as-AGI per se, even if belief in the two are correlated [1]: Christiano's formulation (very roughly, that it's not obvious why the evolutionary quirks leading to humans not being maximally ruthless couldn't have ML analogues) doesn't seem to depend on levels higher than "(D) systems centrally involving deep learning" in your plateau-ism taxonomy.

[1] Where delusional optimism would be an obvious candidate for the source of the correlation.

Steven Byrnes (1mo)

I think the right starting point is not whether something is an LLM, or deep learning, but rather what are the inputs, outputs, loss functions, etc.? And then go from there to whether we expect slight-niceness or not.

My own opinion (stated without justification) is: you can get niceness through LLM-style “true” imitation learning (Foom & Doom §2.3.2). Alternatively, if the AI is choosing actions through RL and/or model-based search & planning, rather than through imitation learning, then I expect zero-niceness, and instead the ruthless pursuit of the objective, or of something vaguely related to the objective, with ample specification gaming and so on (e.g. “be helpful” gets ruthless-ified into “come across as helpful”).

…Except that there exist weird objective / reward / cost functions that don’t have that property, but rather support niceness. And humans wound up with such a function via evolution doing an outer-loop search over reward functions in a certain type of environment where niceness was advantageous. In principle, future AI programmers could likewise do an outer-loop search over reward functions, but they probably won’t, because any kind of outer-loop search over scaled-up learning algorithms is hella expensive. If they do it at all, it would be a situation where the programmer crafted the reward function up to a handful of adjustable parameters, and then the outer-loop search would be a kind of hyperparameter tuning. And then the alignment challenges would be (1) crafting a reward function (up to the handful of unknown adjustable parameters) that supports niceness, (2) figuring out what the outer-loop test environment and selection criterion is, such that the selected reward function hyperparameters will lead to niceness towards humans in the real post-ASI world despite the wild distribution shift from the test environment. That’s basically what I’m working on, and I claim that not only are these open problems but that all the ideas in the literature will almost definitely fail.

quetzal_rainbow (2mo)

I think you don't understand several aspects of Yudkowsky and Soares's position.

Their position is not "LLMs scale directly to AGI" but "we are within 0-3 transformer-scale breakthroughs of AGI", which is much less controversial.

Next, I think that everybody (LessWrong is not an exception) failed to update on the AI meta that follows from the existence of LLMs. The existence of LLMs shows that there are brute-force-ish methods which create previously unimaginable gains in intelligence. It would be insane if transformer pretraining were the only such method and also did not reach general intelligence. Even if you are correct and there is a secret brain juice not present in current LLMs, it seems more likely that such a hypothetical algorithm will coalesce in some sort of advanced brute-force-ish neural net training process, within 0-3 breakthroughs of the scale of transformers, than that it will be discovered by doing normal cognitive/AI/neuroscience research. Humans are proven to suck at the latter and to excel at the former. Therefore, it seems to be the right strategy to put all compute under control.

Also, you put too much weight on specific LLM properties as reasons for their current safety. Inside the Y&S model, the main reason why LLMs are safe now is that they are not superintelligent, period. If you have something intelligent which can't kill you, then you can hit it with a wrench until it starts to output acceptable behavior. I think that if you presented the modern AI safety community with a black box running your hypothetical brain algorithm at the speed and quality of a 100-IQ human, able to receive whatever training data and reward schedule researchers choose, they would make it output aligned behavior within six months, using methodology not dissimilar to modern LLM alignment methodology. The problem is that they would solve it using trial and error, in a way that would kill everyone if they tried it with a superintelligent black box.

jelly (2mo)

Unless I’m reading this wrong somehow, I think you’re excluding people who think something along the lines of “current alignment techniques work great in the current regime but won’t generalize to superintelligence, and the hope instead is to use the best AI that can still be aligned to automate AI alignment”.

Steven Byrnes (2mo)

Eh, I see that as a separate debate. (I.e., “Suppose Yudkowsky & Soares are right that ASI will definitely be egregiously-misaligned & scheming in the absence of yet-to-be-invented breakthrough technical alignment ideas. Is it plausible that weaker AIs could find those breakthrough technical alignment ideas? Or not?” That’s a live debate, but it’s a different debate than I’m discussing in this post. Lots of people would not grant the premise.)

Hastings (2mo)

If LLMs are alignable, the question isn’t whether LLMs can scale to ASI, it’s whether 1) LLMs are the equilibrium, compute-optimal way to be intelligent or 2) we can coordinate to stay off the equilibrium in the time between discovering it and working out how to align it. 1) seems galactically unlikely to me, so LLM alignment is entirely reliant on 2) (so we would be well served to develop that capability).

There’s a bit of flexibility in that the compute-optimal way to build intelligence could be LLM-like enough for alignment to generalize, but this seems unlikely. For example, LLMs pre-RLVR were way easier to make behave, but it is unthinkable to coordinate on not doing RLVR. I expect more RLVR-like innovations.

Alex Mallen (1mo)

In the framing of the post, I think much (most?) of the disagreement is downstream of whether we'll even choose to pursue the kind of ASI for which the theoretical arguments dominate the prosaic LLM-style safety arguments. LLMs or other non-limits-of-intelligence technologies with better safety properties could very plausibly scale far enough to satisfy the wants of people developing AI and/or end competitive pressures to build more ASI-like things.

less_raichu (2mo)

This argument seems weak on two fronts.

1. There's no reason to think LLM safety today will scale if the LLM undergoes RSI and/or scales to ASI. Or maybe the LLM will discover the ASI algorithm and code that and deploy it.
2. Today's LLMs are misaligned, see Ryan Greenblatt's recent post.

Steven Byrnes (2mo)

RE 1, sure, “LLM will invent non-LLM ASI” is possible in principle, and would be a special case of “LLMs do not scale to ASI”. I do mention that (in the “Yudkowsky & Soares’s position [caricatured]” section).

RE 2, he wrote that “current AIs seem pretty misaligned”, not that current AIs are egregiously misaligned, scheming, and ruthless. I obviously do not think we should extrapolate from empirical observation of today’s LLMs to future ASI, but if I DID so extrapolate, I think my attitude would be vaguely like “eh, maybe future ASI will be egregiously misaligned and scheming, even if people really try hard using known techniques, but probably not? And even if it happens to some degree, the AIs would still probably be at least slightly nice, and maybe that’s good enough?” That would be the kind of thing LLM people might say. By contrast, Yudkowsky & Soares (and me) are very very much more pessimistic than that.

TheVinci (1mo)

I find a strong single thread that unites both your post and the comments here: when/if LLMs become ASI, what capabilities does it require, how it will work, scale, etc.

It's an important and consequential point, but I think there is another one that is not attended to:

If we completely disregard the granular definition of ASI and/or AGI, specifically because of the "fogginess" you accurately caricatured, what are the downstream effects of a technology that checks enough of the boxes we would expect from ASI/AGI?

For example, if we see:

Large job displacement of white collar workers
Strong take-off in scientific understanding
A system which increases its own development speed
An autonomous system which is unusually difficult to contain
Etc.

Then, perhaps, the discussion of specific architectures and training methods can, to some extent, be quieted down?

Stated plainly, is it not the case that we can claim that a system is an ASI / AGI if it checks enough of the boxes of downstream effects which we claim such a system will cause?

username22 (1mo)

I agree. But, if LLMs hit a wall and we then try to make non-LLM ASI, (1) will become a concern again for that new AI.

LLMs might be extremely competent helpers in the research process, so research could happen at a high pace even if most AI researchers prefer to quit while we're ahead.

RogerDearnaley (2mo, edited)

Humans have a quite complicated and variable set of goals. When we distill human agentic behavior into LLMs, those come along for the ride. Aligning an LLM is basically figuring out how to turn the loving humanitarian compassionate bits of human motivations way up, and all the selfish bits way down. It's basically the same problem as turning a human into a bodhisattva.

If Steve is right that a) LLMs don't scale to AGI and b) brain-like AI does, then what we need to do is reverse engineer the loving humanitarian compassionate bits of human motivation, not all the selfish bits. As I gather he's working on.

The Orthogonality Thesis suggests a solution. As Eliezer has observed, an LLM understands human values, and the remaining problem is getting it to care (more than most humans do). So if LLMs scale to superintelligence, this reduces to aligning an LLM, or if they don't, attach an LLM to whatever we end up building, so it knows what human values are (preferably with Bayesian Learning so it can Value Learn more detail), and attach an explicit goal slot so that we can explicitly make it care. AIXI with an LLM as a subroutine.

jmh (1mo)

I think human intelligence and history might suggest that is incorrect, turning up the compassion and turning down selfish. I agree it is a really complicated area. I'd point to things from economics (Mandeville, Smith and others) and recognition of things like the saying about the road to hell being paved with good intentions.

Perhaps we've not tried, but we don't seem to channel our compassion at a social level as well as we do our selfish interests. It's not the driver of the incentives, it seems to me, but the environment and institutional structures that are mediating the interactions. But even if the social institutions that mediate our interactions in beneficial ways worked well for our compassionate impulses, I'm not sure they would work well for AI, as that seems to be more of a monolithic entity than a real composite of individuals, as human society is.

Andrii Vasylenko (2mo)

So if LLMs scale to superintelligence, this reduces to aligning an LLM, or if they don't, attach an LLM to whatever we end up building, so it knows what human values are (preferably with Bayesian Learning so it can Value Learn more detail), and attach an explicit goal slot so that we can explicitly make it care. AIXI with an LLM as a subroutine.

In the limit of superintelligent optimization, the things that look the best to an LLM grader are not generally the things that we value.

Romain Deléglise (2mo, edited)

Maybe I don't understand your points.

From my point of view, we don't need a perfect reward/loss function to be in deep trouble with RSI; that's basically why I expect LLMs to likely scale to AGI and then probably ASI.

Likely all we need is a function good enough to begin the cycle of RSI. Some domains need humans to verify the quality of the AI's answers, but it seems the most important domains for RSI don't need human verification.

These domains are:

Code (fully automated)
Maths (fully automated)
Physics and computer simulations (mostly automated, because you can reason without experiments using models, but probably not entirely, because new things probably still need some experiments?).
AI research itself (benchmarks exist today and are a good approximation of AI capabilities; I expect that the AI can create its own benchmarks and then test using trial and error or a better algorithm). Problems like overfitting may slow progress here a bit, but again we don't need a perfect reward function; a good-enough function is plenty. The AI basically needs to be able to run some tests and measure the new capabilities.

These domains reinforce each other in some ways (math and code help a lot in physics and computer simulation; physics and computer simulation help in AI research, for example).

Other things can help in the RSI phase, but I expect this might be enough for it to bootstrap itself.

We don't need an ASI that is good in all domains; if it's good enough in the important domains, it's probably over.

A second very important issue is that even if LLMs don't scale to AGI/ASI, they might be able to create another system (different from an LLM) which will scale. You recognize the problem, as far as I understand, but don't seem to accept the conclusion (LLM scaling is terrible even if LLMs don't directly scale to AGI/ASI).

(As an aside, some people seem to have other theories for why LLMs will scale but don't want to speak about them publicly because they could be misused. That might be useful to keep in mind, even if we can't judge their ideas.)
