But is it appropriate to be ~98% sure that ASI-level capability will be reached in the coming years? If not, it seems reasonable to allow for more uncertainty.
To show that the forecasts are well calibrated, it would be worth making more verifiable statements. I have often seen claims that Yudkowsky’s probabilities are perfectly calibrated, but judging from his other public forecasts and his page on Manifold, that does not seem to be the case.
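As a rough illustration of what such a check could look like: once forecasts resolve, calibration is a mechanical computation over (stated probability, outcome) pairs. The sketch below is a minimal, hypothetical example with invented data; it is not pulled from Manifold or from any of Yudkowsky’s actual forecasts.

```python
# Minimal calibration check over resolved binary forecasts.
# Each entry is (stated probability, observed outcome); the data is invented.
forecasts = [
    (0.9, True), (0.8, True), (0.9, False), (0.95, True),
    (0.6, False), (0.8, True), (0.3, False), (0.2, True),
]

# Brier score: mean squared error between stated probability and outcome.
# 0.0 is perfect; always answering 0.5 scores 0.25.
brier = sum((p - float(o)) ** 2 for p, o in forecasts) / len(forecasts)

# Crude calibration table: within each probability bucket, compare the
# stated probability to the observed frequency of the event.
buckets = {}
for p, outcome in forecasts:
    buckets.setdefault(round(p, 1), []).append(outcome)
for p in sorted(buckets):
    outcomes = buckets[p]
    print(f"stated ~{p:.1f}: observed {sum(outcomes) / len(outcomes):.2f} "
          f"across {len(outcomes)} forecasts")

print(f"Brier score: {brier:.3f}")
```

A well-calibrated forecaster’s buckets track the observed frequencies; the harder part, as noted above, is having enough verifiable, resolved statements to feed into such a check.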
Overall, this reads as a highly speculative post, built on guesses with little supporting evidence or on faulty claims.
This post is part of the sequence Against Muddling Through.
Nate and Eliezer have written about why they think that even a mostly aligned system won’t safely generalize.
This seems true whether “mostly aligned” means “we’ve hard-coded in most of the terminal goals we care about” or “it’s mostly done what we wanted most of the time, so far” or even “we’ve almost got it pointed at our CEV.”
For myself, despite some lingering confusion about what various people mean by “mostly aligned” or “intent aligned”, I think it’d take a miracle to get us anything close to adequate using anything close to modern tools, even before we hit superintelligence. This is why I view a global halt as essential to our survival.
To summarize:
Good is a smaller target than smart. Human values are messy, complicated, and poorly understood even by humans. Intelligence seems easier to grow. This means we’re in trouble, because…
Goodness is harder to achieve than competence. Aiming a powerful optimizer that precisely seems extremely hard, probably much harder than a Moon landing, and much harder than getting a powerful optimizer in the first place. The modern science of machine learning has a long way to go, because…
LLMs are badly misaligned. We’re still alive because they’re not smart enough to kill us yet. But the labs keep trying to make them smarter, and…
We won’t get AIs smart enough to solve alignment but too dumb to rebel. Labs are pouring most of their resources, including the gains from AI assistance, into escalating capabilities. Insofar as we’re depending on hitting a “sweet spot” of capabilities and staying there for (subjectively) dozens or hundreds of AI-years, we will probably fail. Nor can we rely on AI niceness, because…
Intent alignment seems incoherent.[1] Insofar as we’re depending on current techniques making AIs “aligned enough” that they won’t try to kill us even when they plausibly could, this seems to require vastly more skill than the field can muster in time. Furthermore…
Alignment progress doesn’t compensate for higher capabilities. A smart, partially-aligned thing seems approximately as dangerous as a smart, misaligned thing, at least until the alignment gets extremely close to complete or robustly corrigible. It isn’t safe to scale capabilities at anything like current alignment levels. And once AIs start automating AI research…
Labs lack the tools to course-correct. A plan that has a reasonable chance of working would require a deep understanding of cognition in general and machine cognition in particular; it would include a precise trajectory plotted in advance and a redundant web of contingencies informed by the same gears-level understanding that enabled the plotting. Labs have neither.
It is this confluence of arguments that convinces me the race to superintelligence in general, and automated AI research in particular, needs to stop. It’s not a complete account of the arguments (that would take a whole book and more!) but instead a rough sketch of what seem to me to be the thorniest sticking points for many.
There aren’t really any new cruxes here, so I’ll focus on a higher-level view.
Some have accused Nate and Eliezer of being overconfident in predicting that superintelligent AIs kill everyone. I can’t speak for them, but my own intuitions are what they are, and they suggest this prediction is highly overdetermined.
In part, that’s because the individual pieces seem overdetermined. In this sequence, I’ve tried to include only headline claims I’m >90% confident in, with the possible exception of the “intent alignment” piece, on which my modal intuition is still “hell no” but with more confusion and higher variance. As for why they’re overdetermined, well, there are some notes about that in the individual posts, but probably not enough to be convincing to those who don’t buy the crux of a post. That would likely take a much more in-depth argument, which this margin is too small to contain.
In part, it’s because the pieces seem to overdetermine the conclusion. Alignment seems objectively much harder than a Moon landing; even if it isn’t, it seems practically hard; even if it’s relatively easy, the proposed plans don’t make much sense; even if they’re sound, labs don’t seem NASA-competent at execution; even if some are, the most reckless won’t be; even if they all were, they’re spending most of their energy on capabilities, etc.[2]
I may be wrong about some of it, but I don’t think, given the arguments and confidence levels involved, that ~98% confidence that ASI kills everyone is poorly calibrated.
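To make the arithmetic behind “overdetermined” concrete, here is a toy calculation. The escape-hatch framing mirrors the chain above, but the specific probabilities are invented for illustration; they are not estimates from this sequence.

```python
# Toy illustration of an overdetermined conclusion: the bad outcome is
# avoided only if every escape hatch in the chain comes through.
# The hatches and their probabilities are invented for illustration.
escape_hatches = {
    "alignment turns out much easier than it looks": 0.3,
    "the proposed plans are basically sound":        0.3,
    "labs execute with NASA-level competence":       0.3,
    "the most reckless lab also gets it right":      0.3,
}

# Treating the escapes as independent (a simplification), the chance that
# all of them come through is the product of the individual chances.
p_rescue = 1.0
for p in escape_hatches.values():
    p_rescue *= p

print(f"P(every escape hatch succeeds) = {p_rescue:.4f}")           # 0.0081
print(f"Implied confidence in the conclusion = {1 - p_rescue:.4f}")  # 0.9919
```

Even fairly generous individual escape probabilities multiply out to a small overall chance of rescue; the point is the structure of the argument, not these particular numbers.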
[1] In one sense, I’m less confident in this than in the others, because it’s hard to even articulate a counterexample. In another sense, I might be more confident, because it’s hard to even articulate a counterexample. I’m not going so far as to claim “intent alignment without full alignment is logically inconsistent,” because it may very well not be, but it sure is occupying a large chunk of my hypothesis-space.
[2] Not all of these earned their own post; I was trying to focus on the most cruxy ones.