"The MIRI types" were very explicit that they were doing security mindset thinking, trying to think about all the things that could possibly go wrong, in advance. This is entirely appropriate and reasonable when not only 8.3 billion lives, but also all their descendants for however long and far the human race would otherwise have got (at least several orders of magnitude more, quite possibly astronomically more), are on the line.
However, if you do it right, the most likely result of thinking long and hard about everything that could possibly go wrong and then publicly posting long lists of it is that, actually, nature doesn't throw you quite that many curveballs — and then people are surprised that things aren't turning out as badly as MIRI were concerned they might. Now, they did miss one or two (jailbreaks, for example: almost no-one saw that coming before it happened), but in general they managed to think of almost everything Reality has actually thrown at us, plus quite a bit more — and that was the goal.
I'm happy Reality hasn't been that sadistic with us so far. I try to remember this when updating my P(DOOM). But I'm still not going to rely on this going forwards — complacency is not an appropriate response to existential risk.
yeah I am very grateful for MIRI, and I don't think we should be complacent about existential risks (e.g. 50% P(doom) seems totally reasonable to me)
I think people should be jumping up and down and yelling if their P(DOOM) is even 0.01%. Extinction is far worse than just killing almost all of the 8.2 billion of us: it also destroys all our potential descendants. Even if you were absolutely certain that we were never going to go to the stars, the average mammalian species lasts O(1 million years), i.e. O(10,000) lifetimes. So a 0.01% P(DOOM) is at least as bad as a certainty of killing almost everyone alive now (but leaving a few to rebuild) — or far worse if there's any chance of us going to the stars and multiplying the loss astronomically by the volume of our forward lightcone.
There is a VAST difference in severity between merely "kill almost everyone" and "make the human species extinct". Like, at a minimum, 4 orders of magnitude, and quite possibly something more like 10-15 orders of magnitude. (That's why the DOOM is in capital letters.)
(This is a point that I think some people outside the LW/EA community miss about MIRI — they're techno-optimists. They're confident we're going to the stars, if we can just avoid killing ourselves first. That's why they keep talking about lightcones. So they believe it's at least 10^~14 times as bad, not 10^4: they're also multiplying by the number of habitable planets in the galaxy, and allowing for the fact that a more widespread species is less likely to get entirely wiped out, so is likely to last longer. Can't say as I disagree with them, and even if you think the chance of us going to the stars is only, say, 0.1%, that still utterly dominates the calculation.)
[P.S. Footnote added since Eli Tyre questioned my MIRI-hatted very-rough Fermi estimate: O(10^11) stars in the galaxy, say O(0.1%) have a habitable planet, but being widely distributed we last O(100) times as long = extra factor of O(10^~10). Yes, I am ignoring Dyson Swarms, Grabby Aliens, and a lot of other possibilities.]
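(For anyone who wants the arithmetic spelled out, here is that very-rough Fermi estimate as a few lines of Python; every constant is just the order-of-magnitude guess above, nothing more.)

```python
import math

# Very rough Fermi arithmetic for the factors above.
# Every constant is an order-of-magnitude guess, not a measurement.

species_lifetime_years = 1e6    # typical mammalian species lifespan
human_lifetime_years   = 1e2    # roughly one human lifetime
future_lifetimes = species_lifetime_years / human_lifetime_years    # ~1e4

stars_in_galaxy    = 1e11
habitable_fraction = 1e-3       # say ~0.1% have a habitable planet
longevity_bonus    = 1e2        # a widely-spread species lasts ~100x longer
galactic_factor = stars_in_galaxy * habitable_fraction * longevity_bonus  # ~1e10

total_multiplier = future_lifetimes * galactic_factor    # ~1e14

print(f"earth-bound factor: ~10^{math.log10(future_lifetimes):.0f}")
print(f"galactic factor:    ~10^{math.log10(galactic_factor):.0f}")
print(f"combined:           ~10^{math.log10(total_multiplier):.0f}")
```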
I think there's an implicit assumption of tiny discount rates here, which the majority of the human population probably doesn't share. If your utility function is such that you care very little about what happens after you die, and/or you mostly care about people in your immediate surroundings, your P(DOOM) needs to be substantially higher before you start caring significantly.
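(A toy sketch of what that discounting does to the far future; the 3% annual rate below is purely illustrative, not a claim about anyone's actual preferences.)

```python
# Toy illustration of how quickly exponential discounting erases the far future.
# The 3% annual rate is an arbitrary example, not anyone's actual number.
annual_rate = 0.03

for years in (10, 100, 1_000, 10_000):
    weight = (1 - annual_rate) ** years   # weight given to welfare `years` from now
    print(f"{years:>6} years out: weight {weight:.1e}")
```

With any everyday discount rate, everything past a few centuries rounds to zero, which is why astronomical-stakes arguments don't move someone with that utility function.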
This is not to mention Pascal's-mugging-type arguments, on which you shouldn't be moved to make significant life choices by a small and unreliable probability of some very large thing.
This is not to say that I'm against x-risk research – my P(DOOM) is about 60% or so. This is more just to say that I'm not sure people with a non-EA worldview should necessarily be convinced by your arguments.
Discounting is a cheap stand-in for three effects, none of which applies to P(DOOM):
a) difficulty of predicting the future. That extinction is forever is not a difficult prediction. (In other news, Generalissimo Francisco Franco is still dead.)
b) someone closer to the time (possibly even me) may handle that. But not if everybody is dead.
c) GDP growth rates. Which are zero if everybody is dead.
(Or to quote a bald ASI, even three million years into the future it remains true that: Everybody is dead, Dave.)
But yes, I should have pointed out that in this particular case, the normal assumption that you can safely ignore the far future and it will take care of itself does not apply.
Hmm, perhaps. My intuition behind discounting is different, but I'm not sure it's a crux here. I agree that extinction leads to 0 utility for everyone everywhere, but the point I was making was more that with a low discount rate the massive potential of humanity carries significant weight, while a high discount rate sends it to near zero.
In this worldview, near-extinction is no longer significantly better than extinction.
That aside, I think the stronger point is that if you only care about people near to you, spatially and temporally (as I think most people implicitly do), the thing you end up caring about is the death of maybe 10–1,000 people (discounted by your familiarity with them, so probably at most equivalent to ~100 deaths of nearby family) rather than 8 billion.
Some napkin maths as to how much someone with that sort of worldview should care: a 0.01% chance of doom in the next ~20 years, times those ~100 equivalent deaths, gives ~1% of an equivalent expected death in the next 20 years. 20 years is ~175,000 hours, which would make it about 7.5x less worrisome than driving according to this infographic.
Again, very napkin maths, but I think my basic point is that a 0.01% P(Doom) coupled with a non-longtermist, non-cosmopolitan view seems very consistent with "who gives a shit".
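(Here's that napkin maths spelled out; note the per-hour driving fatality rate is my own rough stand-in of ~4 × 10^-7, not a figure read off the infographic.)

```python
# Napkin maths, spelled out. The driving fatality rate per hour at the wheel
# is my own rough stand-in (~4e-7), not a figure taken from the infographic.
p_doom_20yr            = 1e-4           # 0.01% chance of doom in ~20 years
equivalent_near_deaths = 100            # "deaths of nearby family" equivalent
hours_in_20_years      = 20 * 365 * 24  # ~175,000

expected_near_deaths = p_doom_20yr * equivalent_near_deaths      # ~0.01
doom_risk_per_hour   = expected_near_deaths / hours_in_20_years  # ~6e-8

driving_risk_per_hour = 4e-7            # assumed, for comparison only
print(f"doom-equivalent risk per hour: {doom_risk_per_hour:.1e}")
print(f"driving is ~{driving_risk_per_hour / doom_risk_per_hour:.0f}x riskier per hour")
```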
Such a person is very badly miscalculating their evolutionary fitness — but then, what else is new?
Number of relations grows exponentially with distance, while genetic relatedness shrinks exponentially, so if you assume you have e.g. 1 sibling, 2 cousins, 4 second cousins, etc., each layer makes an equivalent fitness contribution and the total grows with the log of the population. log2(8 billion) ≈ 33. So a Fermi estimate of ~100 seems around right?
If anything, I get the impression this is overestimating how much people actually care, because there's probably an upper bound somewhere before this point.
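(The layer-by-layer version of the arithmetic above, under the same toy doubling/halving assumptions:)

```python
import math

# Layer-by-layer sketch of the relatedness arithmetic above. The "count doubles,
# relatedness halves, per layer" model is the same toy assumption as in the
# comment, not real demography.
population = 8e9
layers = math.ceil(math.log2(population))   # ~33 layers to cover everyone

total_sibling_equivalents = 0.0
for k in range(layers):
    relatives_at_layer = 2 ** k     # 1 sibling, 2 cousins, 4 second cousins, ...
    relatedness_vs_sib = 0.5 ** k   # relatedness relative to a sibling
    total_sibling_equivalents += relatives_at_layer * relatedness_vs_sib  # +1 per layer

print(f"layers needed: {layers}")
print(f"whole population ~ {total_sibling_equivalents:.0f} sibling-equivalents")
```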
If your species goes extinct, your genetic fitness just went to 0, along with everyone else's. Species-level evolution is also a thing.
Is the implication here that you should also be caring about genetic fitness as carried into the future? My basic calculation here was that in purely genetic terms, you should care about the entire earth's population ~33x as much as a sibling (modulo family trees are a bunch messier at this scale, so you probably care about it more than that).
I feel like at this scale the fundamental thing is that we are just straight up misaligned with evolution (which I think we agree on).
Indeed. I'm enough of a sociobiologist to sometimes put some intellectual effort into trying to be aligned with evolution, but I attempt not to overdo it.
Far more likely, they're not calculating their evolutionary fitness at all. Our having emotions and values that are downstream of evolution doesn't imply that we have a deeper goal of maximising fitness.
There were some old ideas about "tool AI", "oracle AI", and "myopic AI" as being "less dangerous" forms of AI. What we actually have, for now, is "AI that is bad at long-range tasks, and especially planning", plus a tremendous economic incentive to make that hole in its capabilities go away as fast as possible, and realistic graphs suggesting that hole may take another 2-5 years to close completely.
That's… not ideal, but better than the "no warning shots at all" worst case.
Yeah I mostly agree.
It's not that I expect the AIs of the next 2-5 years to be myopic in some strict sense, but rather that (relative to reasonable pre-LLM priors) I expect their capabilities to arise more out of (generalized) imitation, and to still be sort of globally incoherent (i.e. pursuing different conflicting objectives).
but this source of optimism gets weaker as RL becomes more important, and it sure does seem to be becoming more important.
Suppose AI assistants similar to Claude transform the economy. Now what? How is the risk of human extinction reduced?
Yes, alignment researchers have become more capable, and yes, the people trying to effect an end or a long pause of AI "progress" have become more capable, but so have those trying to effect more AI "progress".
Also, rapid economic change entails rapid changes in most human interpersonal relationships, which is hard on people and, according to some, is the main underlying cause of addictions. Addicts aren't likely to come to understand that AI "progress" is very dangerous even if they are empowered by AI assistants.
Your argument supports the assertion that if AI "progress" stopped now then we'd be better off than we would've been without any AI progress (in the past), but of course that is very different from being optimistic about the outcome of AI tech's continuing to "progress".
the hope is that
a) we get the transformative AI to do our alignment homework for us, and
b) that companies / society will become more concerned about safety (such that the ratio of safety to capabilities research increases a lot)
Increasing the volume of alignment research helps only if alignment research does not help capabilities researchers, but my impression is that most alignment research that has been done so far has helped capabilities researchers approximately as much as it has helped alignment researchers. Just because a line of research is described as "alignment research" does not automatically mean it helps alignment researchers more than capability researchers.
In summary, I don't consider your (a) a cause for hope, because the main problem is that increasing capabilities to the point of disaster (extinction or such) is easier than solving the alignment problem, and your (a) does not ameliorate that problem much, if at all, for the reason I just explained.
Slow takeoff implies that we'll get the stupidest possible transformative AI first.
I just want to say this is an amazing turn of phrase; I assume it's borrowed from the common phrasing that humans are the stupidest possible general intelligence?
Not quite. We're about 300,000 years of evolution past that, so call it 12,000 generations, under fairly strong evolutionary pressures.
...and I think this is the positive update. It feels very plausible, in a visceral way, that the first economically transformative AI systems could be, in many ways, really dumb.
I think this is right, but I think it's misleadingly encouraging.
I keep having to remind myself: Most of the risk does not come from the early transformative AIs. Most of the risk comes from the overwhelming superintelligences that come only a few years later.
Maybe we can leverage our merely transformative capabilities into a way to stick the landing with the overwhelming superintelligences, but that's definitely not a foregone conclusion.
yeah I agree, I think the update is basically just "AI control and automated alignment research seem very viable and important", not "Alignment will be solved by default"
Some people have been getting more optimistic about alignment. But from a skeptical / high p(doom) perspective, justifications for this optimism seem lacking.
"Claude is nice and can kinda do moral philosophy" just doesn't address the concern that lots of long horizon RL + self-reflection will lead to misaligned consequentialists (c.f. Hubinger)
So I think the casual alignment optimists aren't doing a great job of arguing their case. Still, it feels like there's an optimistic update somewhere in the current trajectory of AI development.
It really is kinda crazy how capable current models are, and how much I basically trust them. Paradoxically, most of this trust comes from lack of capabilities (current models couldn't seize power right now if they tried).
...and I think this is the positive update. It feels very plausible, in a visceral way, that the first economically transformative AI systems could be, in many ways, really dumb.
Slow takeoff implies that we'll get the stupidest possible transformative AI first. Moravec's paradox leads to a similar conclusion. Calling LLMs a "cultural technology" can be a form of AI denialism, but there's still an important truth there. If the secret of our success is culture, then maybe culture(++) is all you need.
Of course, the concern is that soon after we have stupid AI systems, we'll have even less stupid ones. But on my reading, the MIRI types were skeptical about whether we could get the transformative stuff at all without the dangerous capabilities coming bundled in. I think LLMs and their derivatives have provided substantial evidence that we can.