My impression is that they tried for both corrigibility, and deontological rules which are directly opposed to corrigibility. So I see it as a fairly simple bug in Anthropic's strategy.
A significant part of why I continue to devote attention to my health is that it may be more important than usual over the next decade for my cognitive abilities to be near peak levels.
It sounds like a real phenomenon, but I have trouble imagining a scenario where it's important. I expect demand for human labor to decline faster than the number of people with investment income rises. That probably means declining wages for the median person, although maybe rising wages for a small number of people with unusual skills.
Business models will change significantly. I speculated here about one likely change. Robotics-related business models will probably become important by 2030.
That's not much of a proxy. I'm relying on my subjective impressions from many reports. A more precise phrasing of my claim is that I've seen numerous reports of what I consider to be open contempt for the rule of law among elected officials, but judges in newsworthy cases have almost always looked like they're trying to take the law seriously.
Some of my impressions come from a private mailing list where conservative lawyers have been expressing dismay at the Trump administration's lack of interest in whether its actions could plausibly be defended in court.
Did you know that Deng approved the 1989 crackdown on Tiananmen protesters?
Yes, I'm aware that he did a few things that I consider evil. Wanting to keep his party in power is common enough among politicians that it's not much evidence of psychopathy. His overall attitude toward independent thought was at least no worse than average for a political leader.
A lot of what I have in mind is that Deng allowed more freedom than can readily be explained by his self-interest, and that Xi seems more Maoist than Deng.
But I wouldn't be surprised if you have better information about their personalities than I do.
A darker interpretation is that the (subconscious, but more real or substantial in some sense) goals of nearly all humans are to gain power and status, and utopian ideologies are merely a tool for achieving this.
The ideologies are partly a tool for that, but they have more effects on the wielder than a mere tool does. My biggest piece of evidence for that is the mostly peaceful collapse of the Soviet Union. I was quite surprised that the leaders didn't use more force to suppress dissent.
I am also somewhat dissatisfied with the basin of attraction metaphor, but for a slightly different reason.
I am concerned that an AI that functions as mostly corrigible in environments resembling the training environment will be less corrigible when the environment changes significantly.
I'm guessing that a better metaphor would be based on evolutionary pressures. That would emphasize both the uncertainties about any given change, and the sensitivity to out-of-distribution environments.
Maybe a metaphor about how cats are sometimes selected for being friendly to humans? Or the forces that led to the peacock's tail?
Corrigibility would clearly be a nice property
Thinking of it as "a property" will mislead you about how Max's strategy works. It needs to become the AI's only top-level goal in order to work as Max imagines.
It sure looks like AI growers know how to instill some goals in AIs. I'm confused as to why you think they don't. Maybe you're missing the part where the shards that want corrigibility are working to overcome any conflicting shards?
I find it quite realistic that the AI growers would believe at the end of Red Heart that they probably had succeeded (I'll guess that they ended up 80% confident?). That doesn't tell us what probability we should put on it. I'm sure that in that situation Eliezer would still believe that the AI is likely not corrigible.
I don’t know what year the novel is actually set in,
It's an alternate timeline where AI capabilities have progressed faster than ours, likely by a couple of years.
Note this Manifold market on when the audiobook is released.
The belief that they can do both is very fixable. The solution that I recommend is to prioritize corrigibility.