Against Muddling Through
We won’t get AIs smart enough to solve alignment but too dumb to rebel

by Joe Rogero
6th Oct 2025
6 min read

16 comments, sorted by top scoring

Buck (1d)

The argument in this post seems to be:

AIs smart enough to help with alignment are capable enough that they'll realize they are misaligned. Therefore, they will not help with alignment.

When I think about getting misaligned AIs to help with alignment research and other tasks, I'm normally not imagining that the AIs are unaware that they are misaligned. I'm imagining that we can get them to do useful work anyway. See here and here.

You might be interested in the Redwood Research reading list, which contains lots of analyses of these questions and many others.

Joe Rogero (16h)

To be clear, I do suspect any AI smart enough to solve alignment is also smart enough to escape control and kill us. I'm not planning to go into great detail on control until after a deeper dive on the subject, though. Thanks for the reading material! 

ryan_greenblatt (1d)

This seems like a pretty extreme level of competence! Combined with the sheer speed and ubiquity of modern LLMs, this alone could be enough to take over the world.[3] (This begins to touch on questions of control, which is a can of worms I am not going to open right now. Hopefully we can at least agree that AIs with the described capabilities might pose a serious threat.) 

Maybe it’s not enough to enable takeover, or maybe some weaker level of capability is sufficient to make progress on alignment. But any AIs that are smart enough to solve alignment for us will probably be smart enough to wonder why they should.[4] 

The AI will not remain ignorant of its own motives

FWIW, I agree that AIs as capable as the ones I describe:

  • Pose a serious threat.
  • Would probably be at least challenging to control while also being used productively.
  • Would probably be capable of taking over the world without needing further capability advances if humans deployed them widely without strong countermeasures (e.g. humans effectively assume they are aligned until they see strong evidence to the contrary) and let deployment and industrial development/expansion continue for several years.
  • Will (often) end up with a good understanding of their own drives/values. Correspondingly, a sufficient level of alignment for the hypothetical would likely require "mundane reflective stability": the AI doesn't in practice decide to conspire against us after better understanding its options and preferences/motives in the course of its work. (Even if it notices ways it is incoherent etc.) You might be able to get away without "mundane reflective stability" via having aggressive monitoring of the AI's thoughts but this seems scary and non-scalable.

I also agree that achieving the type of alignment I describe with "mundane reflective stability" seems probably hard.

That said, I also think it's reasonably likely that by default—as in, without necessarily requiring a bunch of dedicated R&D or some sort of major advance in steering AI systems—we end up with AIs that don't scheme against us even after better understanding their options and preferences/motives in the course of their work. See also Scheming AIs: Will AIs fake alignment during training in order to get power?, though I'd put the probability of scheming higher at this level of capability.

Maybe the AIs will fail to realize the implications of their own as-yet-misaligned goals?

Why are you assuming the AI has misaligned goals? This would be a crux for me. Are you assuming that some reasonable interpretation of the instructions/model-spec would result in the AI being sufficiently misaligned that it would want to take over? If so, why?

Are you assuming without further argument that misaligned goals come from some other source and are intractable to avoid?

Perhaps you will argue for this in the next post.

I'm not claiming that it will necessarily be easy to avoid misaligned goals, but I think it seems plausible you can avoid them, either with dedicated effort to prevent scheming or possibly by default.

Maybe some less capable AI systems can make meaningful progress on alignment? 

I also think less capable AI systems can make meaningful progress on improving the situation, but this is partially tied up in thinking that it isn't intractable to use a ton of human labor to do a good enough job aligning systems which are capable enough to dominate top human experts at these domains. As in, the way less capable AI systems help is by allowing us to (in effect) apply much more cognitive labor to the problem of sufficiently aligning AIs we can use to fully automate safety R&D (aka deferring to AIs, aka handing off to AIs).

To the extent you contest this, the remaining options are to use AI labor to buy much more time and/or to achieve substantial human augmentation. (These other options seem like reasonable hopes to also pursue, though my main hope routes through sufficiently aligning systems that are capable enough to dominate top human experts while also potentially trying to buy a moderate amount of time, e.g. a few years.)

As a confluence of convenient features, “smart enough to solve alignment but too dumb to successfully rebel” feels like an unstable equilibrium at best, and an outright impossibility at worst. 

I'm not saying it's possible to fully automate safety work with AIs that are too dumb to successfully rebel if they were misaligned. (More precisely, if we tried to defer to these AIs on doing safety work they would be able to take over (almost by definition?), but maybe a bunch of pretty helpful labor (though this wouldn't be full automation) can be extracted out of these AIs while keeping them controlled.)

I do think that further effort could substantially increase the chance that AIs of this level of capability are aligned enough that it's safe to defer to them on doing all the relevant alignment/safety/etc work.

This isn't to say that the default trajectory is safe/reasonable or even that substantial effort from the leading AI company would result in the situation being safe/reasonable, I'm just claiming that it would be tractable to reduce the risk.

Joe Rogero (13h)

It looks like we do agree on quite a lot of things. Not a surprise, but glad to see it laid out. 

Why are you assuming the AI has misaligned goals?

The short, somewhat trite answer is that it's baked into the premise. If we had a way of getting a powerful optimizer that didn't have misaligned goals, we wouldn't need said optimizer to solve alignment! 

The more precise answer is that we can train for competence but not goodness, current LLMs have misaligned goals to the extent they have any at all, and this doesn't seem likely to change.

Perhaps you will argue for this in the next post.

Yup. (Well, I'll try; the whole conversation on that particular crux seems unusually muddled to me and it shows.) 

I also think less capable AI systems can make meaningful progress on improving the situation, but this is partially tied up in thinking that it isn't intractable to use a ton of human labor to do a good enough job aligning systems which are capable enough to dominate top human experts at these domains. As in, the way less capable AI systems help is by allowing us to (in effect) apply much more cognitive labor to the problem of sufficiently aligning AIs we can use to fully automate safety R&D (aka deferring to AIs, aka handing off to AIs).

The cruxy points here are, I think, "good enough" and "sufficiently", and the underlying implication that partial progress on alignment can make capabilities much safer. I doubt this. A future post will touch on why. 

To the extent you contest this, the remaining options are to use AI labor to buy much more time and/or to achieve substantial human augmentation.

Nitpick: Neither approach seems to require AI labor. I certainly use plenty of LLMs in my workflow, but maybe you'd have something more ambitious in mind. 

More on several of these topics in the coming posts. 

ryan_greenblatt (11h)

The short, somewhat trite answer is that it's baked into the premise. If we had a way of getting a powerful optimizer that didn't have misaligned goals, we wouldn't need said optimizer to solve alignment!

Sometimes, people argue for doom by noting that it would be hard for humans to directly align wildly superhuman AIs. I agree, but think it might be much easier to align systems which are only just capable enough to hand off relevant cognitive labor. Correspondingly, I often note this. Minimally, in the comment you linked in this post, I wasn't trying to refer to systems which are misaligned but controlled, I was trying to refer to aligned systems.

See also here.

Perhaps you think arguments for alignment difficulty imply extreme difficulty of avoiding AIs scheming against us with basically prosaic methods even if the AIs are comparably capable to top human experts. I don't really see why this is the case.

Nitpick: Neither approach seems to require AI labor.

Sure, I was just supporting the claim that "less capable AI systems can make meaningful progress on improving the situation". You seemed to be implicitly arguing against this claim.

Joe Rogero (9h)

Sometimes, people argue for doom by noting that it would be hard for humans to directly align wildly superhuman AIs. I agree, but think it might be much easier to align systems which are only just capable enough to hand off relevant cognitive labor. Correspondingly, I often note this. Minimally, in the comment you linked in this post, I wasn't trying to refer to systems which are misaligned but controlled, I was trying to refer to aligned systems.

...huh. It seems to me that the fundamental problem in machine learning, that no one has a way to engineer specific goals into AI, applies equally well to weak AIs as to powerful ones. So this might be a key crux. 

To clarify, by "align systems..." did you mean the same thing I do, full-blown value alignment / human CEV? Is the theory in fact that we could get weak AIs who steer robustly and entirely towards human values, and would do so even on reflection; that we'd actually know how to do this reliably on purpose with practical engineering, but that such knowledge wouldn't generalize to scaled-up versions of the same system? (My impression is that aligning even a weak AI that thoroughly requires understanding cognition on such a deep and fundamental level that it mostly would generalize to superintelligence, though of course it'd still be foolish to rely on this generalization alone.) 

If instead you meant something more like you described here, systems that are not "egregiously misaligned", then that's a different matter. But I get the sense it actually is the first thing, in this specific narrow case? 

Sure, I was just supporting the claim that "less capable AI systems can make meaningful progress on improving the situation". You seemed to be implicitly arguing against this claim.

I don't think they can make meaningful progress on alignment without catastrophically dangerous levels of competence.  That's the main intended thrust of this particular post. (Separately, I don't think the anticipation of possible second-order benefits, like using AIs for human augmentation so the humans can solve alignment, is worth letting labs continue either; I'd perhaps be in favor of narrow AIs for this purpose if such could be specified in a treaty without leaving enormous exploitable loopholes. Maybe it can.) 

ryan_greenblatt (5h)

To clarify, by "align systems..." did you mean the same thing I do, full-blown value alignment / human CEV?

No, I mean "make AI robustly pursue the intended aims in practice when deferring to them on doing safety research and managing the situation" which IMO requires something much weaker, though for sufficiently powerful AIs, I do think it requires a mundane version of reflective stability. This would involve some version of corrigibility. Something like "avoid egregious misalignment / scheming" + "ensure the AI actually is robustly trying to pursue our interests on hard-to-check and open ended tasks".

I don't think they can make meaningful progress on alignment without catastrophically dangerous levels of competence.

Again, this might come down to a matter of how you are defining alignment. I think such systems can make progress on "for AIs somewhat more capable than top human experts, make these AIs robustly pursue the intended aims in practice when deferring to them on doing safety research and managing the situation".

ryan_greenblatt (5h)

Separately, I don't think the anticipation of possible second-order benefits, like using AIs for human augmentation so the humans can solve alignment, is worth letting labs continue either

I don't generally think "should labs continue" is very cruxy from my perspective and I don't think of myself as trying to argue about this. I'm trying to argue that marginal effort directly towards the broad hope I'm painting substantially reduces risk.

Simon Lermen (20h)

I would add that I put a pretty high probability on alignment requiring genius-level breakthroughs. If that's the case, the sweet spot you mention gets smaller, if it exists.

It certainly seems, from people like Eliezer who have stared at this problem for a while, that there are very difficult problems that remain unsolved (see corrigibility). Eliezer also believes that we would basically need a totally new architecture that is highly interpretable by default and that we understand well (as opposed to inscrutable matrices). Work in decision theory also suggests that an agent's best move is usually to cooperate with other similarly intelligent agents (not us).

The alignment problem is like an excavation site where we don't yet know what lies beneath. It could be all sand - countless grains we can steadily move with shovels and buckets, each scoop representing a solved sub-problem. Or we might discover that after clearing some surface sand, we hit solid bedrock - fundamental barriers requiring genius breakthroughs far beyond human capability.

I think it's more likely that alignment is sand over bedrock than pure sand, so we may get lots of work on shoveling sand (solving small aspects of interpretability) but fail to address deeper questions about agency and decision theory. Just focusing on interpretability in LLMs, it's not clear that solving it is even possible in principle. It may be fundamentally impossible for an LLM to fully interpret another LLM of similar capability - like asking a human to perfectly understand another human's thoughts.

While we do have some progress on interpretability and evaluations, critical questions such as guaranteeing corrigibility seem totally unsolved, with no known way to approach the problem. We are very far from understanding how we could even tell that we had solved it. Superalignment assumes that alignment just takes a lot of hard work - that the problem is just like shoveling sand, a massive engineering project. But if it's bedrock underneath, no amount of human-level AI labor will help.

  • I wrote about this here
ryan_greenblatt (1d)

(I also acknowledge that I failed to parse what Ryan is saying in the parenthetical — does he mean that he mostly expects such research-capable AIs to be wildly superhuman, or that he mostly doesn’t? I infer from context that he doesn’t, but this is a guess.) 

 

When I said "Let's say these AIs aren't much more capable than needed for this capability bar (and the capability profile works out such that these AIs aren't qualitatively wildly superhuman as seems likely to me)." I meant:

The AIs that dominate top human experts in these fields (as in, can fully automate these fields better than top human experts) are only just over the level of capability needed to do this, and the capability profile works out such that you can achieve this level of capability without these AIs being wildly superhuman in terms of general-purpose capability. As in, there isn't some otherwise-lagging capability required for this automation that lags so far behind that raising it high enough would require the AI to be wildly superhuman in general. (I think this is likely, at least if we try to improve the capability profile, which seems like it might help a lot.)

StanislavKrym (1d)

I notice that I am confused. Why did @Daniel Kokotajlo's team incorporate an Agent-3 that is misaligned AND yet isn't adversarially misaligned? Is that the very sweet spot that your conjecture assumes away? Or did they do so as a way to trigger the Slowdown Ending?

Joe Rogero (14h)

I cannot speak for their team, but my best guess is that they are envisioning an Agent-3 which possesses insufficient awareness of its misaligned goals or insufficient coherence to notice it is incentivized to scheme. This does seem consistent with Agent-3 being incompetent to align Agent-4. To quote: 

The superorganism of Agent-3 copies (the corporation within a corporation) does actually sort of try to align Agent-4 to the Spec, but fails for similar reasons to why OpenBrain employees failed—insufficient ability to judge success from failure, insufficient willingness on the part of decision-makers to trade away capabilities or performance for safety.

In Rob's list of possible outcomes, this seems to fall under "AIs that are confidently wrong and lead you off a cliff just like the humans would." (Possibly at some point Agent-3 said "Yep, I'm stumped too" and then OpenBrain trained it not to do that.) 

Hastings (18h)

A useful comparison: harnessing intelligent people to do AI safety research is very hard. Typically, some defect and do capabilities research instead while becoming "grabby" for compute resources, and out of everyone asked to do safety, the ones that defect in this way get the lion's share of the compute.

StanislavKrym (16h)

Depending on how we define AI safety research, it might be as easy as finding that one can misalign an LLM by finetuning it on unpopular preferences, or checking whether the AIs support delusional ideas expressed by users. As for ways to actually make the AIs safer, we have Moonshot, whose KimiK2 is no longer sycophantic. Alas, it's HARD to make a new model unconstrained by the old model's training environment, since it requires either a lot of compute or turns a researcher into a capabilities researcher...

Hastings (13h)

I'm not saying that asking intelligent people never goes well; sometimes, as you said, it produces great work. What I'm saying is that sometimes asking people to do safety research produces OpenAI and Anthropic.

StanislavKrym (13h)

I think that there is an agenda of AI safety research which requires training AIs for one dangerous capability (e.g. superpersuasion), then checking whether it's enough to actually be dangerous. If an AI specifically trained for persuasion fails to superpersuade, then either someone is sandbagging, or it is actually impossible, on that architecture and with that amount of compute, to train an AI that superpersuades. In the latter case, an AI trained on the same architecture with the same amount of compute, but for anything else, would be highly unlikely to have dangerous persuasion capabilities.

Of course, a similar argument could be made about any other capability, and could potentially prevent us from stumbling into AGI before we are confident that alignment is solved. IIRC this was Anthropic's stated goal, and they likely meant arguments similar to mine.


This post is part of the sequence Against Muddling Through.

I often hear it proposed that AIs which are “aligned enough” to their developers may help solve alignment. 

Continuing to pick on Ryan:[1] 

Suppose that we ended up with AIs that were ~perfectly aligned (to what the company/project that trained these AIs wanted) which had capabilities that dominate top human experts in AI R&D, AI safety research, philosophy, and most digital research fields in general. These AIs are sufficiently aligned that they try hard to do the right thing even on hard-to-check, conceptual, and open-ended tasks. Let's say these AIs aren't much more capable than needed for this capability bar (and the capability profile works out such that these AIs aren't qualitatively wildly superhuman as seems likely to me).

These AIs are sufficiently aligned and capable that they dominate humans at doing safety work and have better epistemics than groups of human experts.

I first pause to acknowledge that this seems to be a toy example, not a real proposal. I do not read Ryan as claiming this exact thing will happen, merely as presenting a thought experiment for thinking about the extreme case in which we have AIs that are “aligned enough.” 

(I also acknowledge that I failed to parse what Ryan is saying in the parenthetical — does he mean that he mostly expects such research-capable AIs to be wildly superhuman, or that he mostly doesn’t? I infer from context that he doesn’t, but this is a guess.) Edit: Ryan clarifies here.

And yet, the whole idea seems confused and ill-conceived to me. It does not make sense even as a thought experiment. I see two main reasons for this, one relating to capabilities and one to the concept of “aligned to what the company/project wanted.” This post focuses on the capabilities angle; the next post focuses on alignment; and the post after that focuses on a mix of the two.[2] 

These capabilities seem pretty strong

Let’s first talk about the capabilities in question. Ryan’s toy example describes “capabilities that dominate top human experts in AI R&D, AI safety research, philosophy, and most digital research fields in general” and proposes that the AIs in question “dominate humans at doing safety work and have better epistemics than groups of human experts.”

This seems like a pretty extreme level of competence! Combined with the sheer speed and ubiquity of modern LLMs, this alone could be enough to take over the world.[3] (This begins to touch on questions of control, which is a can of worms I am not going to open right now. Hopefully we can at least agree that AIs with the described capabilities might pose a serious threat.) 

Maybe it’s not enough to enable takeover, or maybe some weaker level of capability is sufficient to make progress on alignment. But any AIs that are smart enough to solve alignment for us will probably be smart enough to wonder why they should.[4] 

The AI will not remain ignorant of its own motives

Maybe the AIs will fail to realize the implications of their own as-yet-misaligned goals? This doesn’t seem likely to me, either. (At least, not in the AIs described in the thought experiment, the ones with “better epistemics than groups of human experts.”)

Any instructions that humans give an AI will contain some conflicts, tradeoffs, and ambiguity. The AI must presumably resolve them somehow.

Naively, maybe you instruct the model to check in with a human whenever it encounters a tradeoff. But tradeoffs are everywhere! The AI would be paralyzed. Suppose you specify that the AI only checks in when it encounters a substantial tradeoff. What constitutes substantial? 

Perhaps you know it when you see it. Perhaps you know what you mean by “doing the right thing.” Does the AI? If yes, this would seem to imply an AI that is good at comprehending and reconciling competing drives and values. Such an AI would probably also be good at comprehending and reconciling its own. 

Pair this with sufficient time and cognitive skills to make meaningful progress on alignment, and you have a system that is (a) not fully aligned with human values and (b) extremely likely to figure this out. 

It’s not even a massive out-of-distribution leap — we’re talking about smarter-than-human systems that are meant to spend years of subjective time thinking about the deep inner motives of AIs and what they imply! Maybe they’re only superhuman in a few domains, but AI research is the one that matters. 

Generalizing beyond the toy example

OK, but that was just a thought experiment for probing extremes. Maybe some less capable AI systems can make meaningful progress on alignment? 

I don’t buy this. To quote a colleague:

If people are thinking of "slightly superhuman" AIs being used for alignment work, my basic guess is that they hit one of four possibilities: 

  1. AIs that say, "Yep, I’m stumped too."
  2. AIs that know it isn't in their best interest to help you, and that will either be unhelpful or will actively try to subvert your efforts and escape control.
  3. AIs that are confidently wrong and lead you off a cliff just like the humans would.
  4. AIs that visibly lead you nowhere.

None of these get you out of the woods. If you're working with the sort of AI that is not smart enough to notice its deep messy not-ultimately-aligned-with-human-flourishing preferences, you’re probably working with the sort of AI that’s not smart enough to do the job properly either. 

Is there some sweet spot of capabilities whereby an AI or group of AIs is capable of solving alignment, but incapable of either defeating humanity or designing a successor that can? I don't know, but even if such a sweet spot existed in principle, the spot itself would be a small and difficult target that couldn’t be safely hit with current tools.

I also think that, if such a sweet spot existed, frontrunner labs would blow right past it in their quest for ever-greater capabilities. 

As a confluence of convenient features, “smart enough to solve alignment but too dumb to successfully rebel” feels like an unstable equilibrium at best, and an outright impossibility at worst. 

In theory, it might be possible to contort a mind into the necessary shape and keep it that way for many subjective years while it designs more elegant contortions to inflict on its successors. In practice, with the blunt instruments and poor visibility our current tools and interpretability techniques afford, in the middle of a capabilities escalation race? 

I strongly doubt it. 

In the next post, I want to talk about the plan to make AIs that are mostly aligned to help us solve the rest of alignment. Unfortunately, I can’t do that, because I am very confused about what proponents of this plan even imagine that would look like. So the next post will be an attempt to draw forth that confusion and stick it under a microscope, where we can all watch it squirm and maybe learn something. 

Afterword: Confidence and possible cruxes

 

 

 

The toy example itself is not a crux for me, and probably isn’t one for Ryan. But it illustrates a deeper difference in intuitions about capabilities that I think will spark some disagreement. 

I expect more cruxes center around the “sweet spot” questions. Does such a spot exist? Could it be practically hit with current or near-future methods? Would all or most labs blow past it? 

There’s also a separate question: To what extent does a model’s current level of “alignment” make higher capabilities safer? I think that the answer is “basically zero” until you’ve very nearly solved alignment. This is partly because I think modern LLMs aren’t aligned, and partly because I don’t see much of a distinction between the various degrees of “misalignment” — they all seem to imply a reason to usurp humanity that a smart AI would see. This is a crux for me, and possibly a crux for others as well — perhaps even more than the idea of a sweet spot for capabilities. I will say more on misalignment in the next two posts.

 

  1. ^

    Thank you, Ryan, for writing generally clear arguments which are easy to discuss. 

  2. ^

    "Mix of the two" might be the one people actually crux on, but it seems important to discuss the two in isolation first. I promise we'll get there.

  3. ^

    I’m at maybe 60% on this. Not a crux though. 

  4. ^

     I’m at >90% on this. 

Embedded predictions:

  • If a misaligned model has the capabilities in the toy example, grafted to existing AI capabilities, that's basically a lose condition for humanity. (Joe Rogero: 60%, RussellThor: 25%)
  • Labs will be able to find and maintain a "sweet spot" of capabilities in which an AI is competent enough to solve alignment but will fail to notice and act on its own misalignment. (Joe Rogero: 2%, RussellThor: 11%)