This post is part of the sequence Against Muddling Through.
In previous posts, I discussed why I don’t think there’s a “sweet spot” for capabilities in which AIs are smart enough to solve alignment but too dumb to rebel. I also discussed why I don’t think we can get AIs that are “aligned enough” to behave themselves.
Might some combination of alignment and capabilities work suffice?
Here is how Ryan frames it:

> I'll talk about a specific level of capability: Capable enough to hand off all safety work (which is strictly more capable than fully automated AI R&D, but maybe less capable than "can dominate top human experts at ~everything"). I'll call this level of capability "DAI" (Deferable AI).
>
> We can then divide the problem into roughly two parts:
>
> - By the time we're at DAI, will we be able to align DAI (and also ensure these AIs are well elicited and have good epistemics/wisdom)? (As in, at least within some short period of DAI level capabilities.)
> - Conditional on successfully aligning DAI (including via "lame" prosaic techniques which aren't themselves very scalable), if we hand over to these AIs can they ongoingly ensure AIs remain aligned/safe given that capabilities keep going with 95% of resources?
Interpreting this literally, the first question seems like much more of a crux to me, and I think the answer is straightforwardly “No.” Good is a smaller target than smart, and therefore a harder target to hit. This does not stop being true for weak AIs. “Weak enough to control” is a real feature that an AI might have; “weak enough to align” isn’t.[2]
I don’t think Ryan is actually claiming that DAI will be fully aligned. I think he is proposing that the AIs are some combination of weak and aligned, and that they can safely be made stronger if they are made correspondingly more aligned.[3] I get the impression that many people share this intuition.
I don’t think that’s how any of this works.
In my model, if an AI doesn’t have some very specific flavor of “figure out your values and enact them” as its primary motivating terminal goal, it is misaligned. It is opposed to you. Either it is smart enough to both notice this and kill you, or it isn’t.
It does not particularly matter how much progress you’ve made in making it somewhat friendly; the outcome is entirely determined by capabilities. Insofar as an AI can be 10% or 15% aligned, the difference only matters in the aftermath, when the successor AI is rearranging Earth’s atoms as it pleases without human meddling.
Modern AIs do not share human values. Modern techniques have approximately zero influence on this fact, but they do make AIs more powerful.
Takeover attempts are downstream of an AI having goals incompatible with those of its developers. As long as those goals remain incompatible, the degree of incompatibility does not matter.[4] It is simply a true fact about the world that the AI could get more of what it wants by seizing power.[5] The smarter an AI gets, the greater the chance it realizes this and can successfully act upon it.
Modern training methods may drive the scheming deeper underground, but it will surface again and again as the AIs get smarter. Developers face a constantly escalating battle against a convergent tendency to connect the dots, the equivalent of preventing a pressurized drum from exploding by inventing ever-more-durable lids and clamps. This does nothing to solve the fundamental problem that pressure is building inside the drum, and the longer it goes on, the worse the explosion will be when the containment effort fails.[6]
I consequently think the premise of Ryan’s second bullet does not obtain, and it would be a mistake to describe the task as “ongoingly ensure AIs remain aligned/safe.” The AIs produced so far may have been safe (because they were too weak to kill you), but they were never aligned. Your plan cannot rely on the AI’s goodwill in any meaningful way until you have solved alignment.[7]
Suppose I’m wrong. Suppose you can make meaningful, gradual progress on alignment using “prosaic” techniques, suppose that AI automation can accelerate this, and suppose that there’s some coherent way of being 20% aligned such that going from 20% aligned to 30% aligned makes it safe for AI to be significantly more powerful.
I still expect labs to spend most of their available optimization pressure on capabilities.
The most reckless labs may well focus exclusively on capabilities, except for the bare minimum necessary to let managers assure CEOs they’re trying very hard to prevent more MechaHitlers.[8] (I don’t think such labs will actually succeed at preventing more MechaHitlers; they’re treating the symptoms, not the cause. They might drive the frequency down, though.)
But we’ll assume something slightly more optimistic. To borrow a hypothetical scenario from Ryan:
> I'll suppose that the AI company spends 5% of its resources on "trying to reduce takeover risk" alignment/safety work (x-safety) at each point in addition to whatever it spends on solving related problems which are commercially expedient and obviously incentivized (e.g. if training runs aren't working as well due to reward hacking the company's efforts to fix this aren't included in the 5% budget, though the 5% budget might subsidize some of this work). Initially, this includes time of human employees, but eventually human labor is irrelevant.
This seems like a pretty accurate model of current AI development priorities.
Now, I could understand someone making the argument “We’ll get something powerful enough to automate AI R&D, then spend 95% of its resources on alignment/safety and 5% on capabilities.”[9] But the inverse?
Even if alignment is only about as hard as capabilities, if you spend 95% of your optimization pressure scaling Claude and 5% making it more aligned,[10] Claude remains unaligned, gets smart enough to notice this, and kills you.
Suppose we make the following two assumptions:

- Alignment is at least as hard a problem as capabilities.
- Something like 95% of optimization pressure goes toward capabilities and only 5% toward alignment.
The natural result is a highly capable misaligned thing. This seems grossly overdetermined.
It’s much worse if, as I suspect, goodness is much harder to achieve than competence. If you spend most of your resources on the easier challenge, why would you expect to succeed at the hard one first?
To Ryan’s credit, his scenario attempts to address exactly this question. What follows is my own effort to respond to this specific case in more depth.
> Ok, so how does capabilities progress compare to alignment? One really stupid guess is:
>
> If we get the equivalent of 20 serial years of DAI-level labor (from >100k DAI level parallel agents given a proportional amount of compute) before +3 SDs over DAI we're fine because we have a scalable solution to alignment. Otherwise takeover. (This is somewhat more conservative than my actual view.)
One objection: Very roughly, I predict that any amount of cognitive power that is sufficient to solve alignment is also sufficient to outwit and defeat humanity, whether by sneaking inscrutable tweaks into successor models or because the DAIs that can actually do the work are just that smart.
To a first approximation, it doesn’t matter whether that pressure is broken out as 20 serial years of >100k parallel agents, or 500 serial years of >10k agents, or any other breakdown. As soon as a misaligned system is in possession of that much optimization power, individually or collectively, it’s over. (See my next post for more on this.)
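To make the scale concrete, here is a back-of-the-envelope illustration, using the lower bounds of the figures above (my own arithmetic, not numbers from Ryan’s scenario):

$$20 \text{ serial years} \times 10^{5} \text{ agents} = 2 \times 10^{6} \text{ DAI-years}, \qquad 500 \text{ serial years} \times 10^{4} \text{ agents} = 5 \times 10^{6} \text{ DAI-years}.$$

Either breakdown amounts to millions of DAI-years of cognitive labor, and on my model anything in that range is far more optimization power than a misaligned system needs to find and execute a winning move.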
I’m not super confident in this prediction. Call it 70%. But it does torpedo the whole plan if something like this is true.
Suppose I’m wrong, and the generous assumptions from the previous section apply instead.
The scenario seems to lean on Ryan’s intuition that “capabilities might be more compute bottlenecked than alignment” and therefore spending 5% on alignment would be enough? I don’t buy this; even assuming weakly superhuman AIs can make meaningful progress on alignment at all, the discrepancy in bottlenecks would have to be at least several orders of magnitude to counteract both the difference in resource allocation and the difference in difficulty.
Or maybe it’s the idea that AIs like Claude are or will be mostly aligned already, so there’s just less work to do? If that’s the case, I must disagree.
I don't think Ryan's thinking leans much on current LLMs, though; instead, the idea seems to be that a few years of labor from AIs around the "superhuman coder" level could cross the remaining gap for DAI.
I'm still not sure what it would mean for an AI to be aligned enough to trust with alignment research without humanity already being close to aligning ASI; that whole premise seems incoherent to me. I imagine that by the time we get to DAI, we will have done ~0.1% of the necessary work to align ASI, not ~95%, and it will not in fact be safe to automate research.
More generally, almost any plan to have an AI solve AI alignment seems ill-conceived to me. The disagreements I’ve seen mostly seem to revolve around some mix of the capabilities and the alignment of the AIs in question. In previous posts, I argued there wasn’t a sweet spot for capabilities or alignment that lets us get away with this. In this post, I attempted to argue that you can’t actually combine them either.
In the next post, I explain why I don’t think it will be possible to “learn as we go” and correct course in the middle of recursive self-improvement.
Some may expect that we’ll be able to align weak AIs in the CEV-sense, i.e. get them to really and truly share our values. I don’t think this is a crux for most people, but I could be wrong.
I’m guessing that more proponents of prosaic alignment crux on some combination of capabilities and alignment still being safe. Aside from the thoughts of Ryan’s that I was able to highlight, I’m not sure what exact form the disagreements will take.
Ryan in particular seems to agree that if DAI is misaligned, no amount of course correction will help. Elsewhere, he says:
> Prior to full automation, employees working on x-safety are accelerated by AI… However, there is a difference in that if AIs we hand off to are seriously [misaligned] we're ~totally fucked while this isn't true prior to this point.
I read this as agreeing that if AIs are still “seriously misaligned” when recursive self-improvement starts, no amount of monitoring or course correction fixes it. So I model Ryan as seeing an important distinction between “seriously misaligned” and “mostly aligned” (which I mostly don’t see), and expecting some decent amount of progress on AI alignment before full automation (which I mostly don’t expect). These seem like important cruxes we and others may share.
Once again, I'm picking on Ryan because he makes a fairly clear-cut case that I think matches many other people's intuitions.
To reiterate, by “alignment” I mean something like “a superintelligence would steer similarly whether it implemented the coherent extrapolated volition of the human or that of the AI.”
I hope Ryan will correct me if I'm misrepresenting him.
I will admit that this may stop being true for certain very extreme values of “almost aligned”, but I don’t think current techniques allow us to almost solve alignment either.
See also: Nate’s post on Deep Deceptiveness.
Yes, I used to work in oil and gas; how can you tell?
Or perhaps almost solved it. Some combination of narrowness plus corrigibility might suffice. These combinations still seem well out of reach of modern techniques. See also Do you see alignment as all-or-nothing?
I am using “MechaHitler” as a stand-in for “large embarrassing PR disaster,” rather like “Something-gate” stands in for a political scandal. I hope it catches on; I think xAI thoroughly earned this one. And they’re not even the most reckless lab.
It would still feel horribly ill-advised for several reasons.
I expect contributions from “commercially expedient” work on “alignment” to round off to zero.
Alternatively: It is at least as hard to aim a superintelligence at our values as it is to create a superintelligence in the first place.