You can only build safe ASI if ASI is globally banned

Connor Leahy

You can only build safe ASI if ASI is globally banned — LessWrong


16th Apr 2026

AI Alignment Forum


This is a linkpost for https://www.ettf.land/p/you-can-only-build-safe-asi-if-asi

Sometimes people make various suggestions that we should simply build "safe" artificial Superintelligence (ASI), rather than the presumably "unsafe" kind.^[1]

There are various flavors of “safe” people suggest.

Sometimes they suggest building “aligned” ASI: You have a full agentic autonomous god-like ASI running around, but it really really loves you and definitely will do the right thing.
Sometimes they suggest we should simply build “tool AI” or “non-agentic” AI.
Sometimes they have even more exotic, or more obviously-stupid ideas.

Now I could argue at length about why this is astronomically harder than people think it is, why their various proposals are almost universally unworkable, why even attempting this is insanely immoral^[2], but that’s not the main point I want to make.

Instead, I want to make a simpler point:

Assume you have a research agenda that, if executed, results in an ASI-tier powerful software system that you can “control”.^[3]

Punchline: On your way to figuring out how to build controllable ASI, you will have figured out how to build unsafe ASI, because unsafe ASI is vastly easier to build than controlled ASI, and is on the same tech path.

You can’t build a controlled ASI without knowing many, MANY things about intelligence and how to build it.

So this then bottlenecks the dual technical problems of “how to find an agenda that results in controllable ASI” and “how to execute on such an agenda” on “even if you had such an agenda, how do you execute it without accidentally, or due to some asshole leaving the project or reading your papers, building unsafe ASI along the way?”

No one I know pursuing various agendas of this type has answers to these questions. And let’s be crystal clear: This is the fundamental question any sensible “safe ASI” project needs to answer before even being worth considering.

You would need to either have:

Some absurd level of institutional secrecy and control (e.g. “this research will exclusively be done inside Area 51 and we assassinate everyone who leaves the project and also nuke literally everyone else that tries”)
Complete technical orthogonality (“this research is so radically different from other research that it cannot even in principle be used to build unsafe ASI, only safe ASI”, which is impossible)
A global ban on ASI development and competent enforcement

This means that the primary prerequisite to even considering starting to work on a safe ASI plan is to have a global ASI ban and powerful enforcement already in place.^[4]

^[1] I’m assuming you already accept that “unsafe” ASI would be really, really bad. If not, this is not the post for you to read.
^[2] In short: If you unilaterally try to build ASI, you are directly and openly threatening the world with violent conquest. This is sometimes called a “pivotal action”, which is code word for “(insanely violent) unilateral action that forces the world into a state I think is good.”
^[3] For some hopefully meaningful definition of the word “control”
^[4] This is the rationale behind proposals such as MAGIC.


[-]TsviBT3moΩ183535

[-]Vladimir_Nesov3moΩ5168

If there was a textbook on building safe ASI with instructions that are sufficiently straightforward to execute, people would tend to build safe ASI rather than an extinction/disempowerment ASI. Some AI safety efforts could be thought of as contributions to this hypothetical textbook, making it marginally more real.

unsafe ASI is vastly easier to build than controlled ASI, and is on the same tech path

The point of an ASI ban/pause is to create the time to reduce this gap, until it's sufficiently narrow that competent people can walk across without falling through. If an unsafe ASI is artificially delayed despite technological feasibility, there is time to write the textbook, to make safe ASI about as easy to implement. And similarly for some AI safety efforts that don't involve an ASI ban/pause, which attack the gap from the other end.

(Scalable oversight agenda hopes AIs can write the textbook on their own, sufficiently quickly to win the race against the technical feasibility of unsafe ASI. I would feel a lot better about this plan if alignment of Mythos-level systems was pursued for 30 years before going further, and somehow there was a guarantee that Mythos instances won't be founding a country and declaring sovereignty in the meantime. This guarantee gets more believable if there are no Mythos-level systems at all yet.)

[-]Alex Amadori3mo1613

I think this comment is making too many simplifying assumptions that will shatter on contact with the real world.

From the point of view of any entity pursuing an ASI project, in a world with no global ban, you will always want to deploy too early and risk destroying the universe.

If you're allowed to run an ASI project, so are others. What does this mean for you?

For one thing...

The point of an ASI ban/pause is to create the time to reduce this gap

You can never know how big this gap is. Perhaps, in a world with much more advanced epistemology, you can get a usable estimate ahead of deployment; perhaps, in a world with much better coordination and strategy, you can gather enough information about it from smaller experiments and deployments without destroying the world.

But these worlds are very different from ours. We can write about them for fun, or as an intellectual exercise, but we should never forget that they are fantasy worlds. Any conclusion that starts from assuming we're in one of these worlds simply does not apply to ours, and we should not confuse this fanfiction for predictions.

From your point of view, you never know how far away you are from building safe ASI, and you should place an unreasonable (in terms of risk) amount of probability on the outcome that if someone else builds it using your state of the art approach, everyone dies.

Do you place absolute trust in all other entities capable of developing ASI to not try? Of course not. So you're going to cut corners.

And secondly...

Building ASI that is safe from your point of view is not just a technical problem. Other entities will have other views. In most cases, if a small group of people (compared to all 8 billion people on earth) gets an ASI that they are satisfied with, most of the world will not endorse the result.

You can see this concretely when US AI people espouse the need to defeat China, or when people from one lab talk about the need to defeat another lab. So again, you will naturally cut corners, and then everyone dies.

[-]mishka3mo72

Assume you have a research agenda that, if executed, results in a ASI-tier powerful software system that you can “control”.

You are making a "logical jump" here, equating "friendly ASI" with "ASI one can control".

But this assumption is a point of well known contention. There is no consensus that the "control agenda" is the right way to approach this. Many people think that the approach aimed at achieving sustainable control of super-intelligent systems by ordinary humans is exactly the path to an almost certain ruin, for a number of fairly strong reasons.

(I very much doubt that if we burden an already very difficult problem of creating "a friendly world with ASIs" with the additional requirement of this kind of control, it would be possible to find a solution. So I'd like to see more studies of approaches not based on this particular kind of control.)

[-]oligo3mo43

I interpreted "control" here as referring to control over the shape of the ASI, rather than corrigibility (ongoing ability to direct the ASI) or the "control agenda" (bribing or coercing an unaligned ASI not to kill you, somehow.)

(Whether or not that's the intended meaning, it is consistent with the argument outlined in OP. Suppose you want to create an ASI that hums with genuine lovingkindness for all sentient beings and wants to (say) make the universe objectively maximally good within the constraint of also making the universe much better from the perspective of human CEV. Or (whatever you prefer to specify.) Or (whichever one of several potential notkilleveryoneism-compatible ASIs is most feasible to build.) All of those are strictly narrower than just building ASI in general.)

One world where this model doesn't apply, I think, is one where there is a very large moral realist attractor basin, that ASIs converge on by default, unless human actors put a lot of effort into making them corrigible to Xi Jinping or Mark Zuckerberg or whomever, which in turn leads to the almost certain ruin you've noted. Here we can be pretty stupid about intelligence aside from how to make more of it, and have to actively try to fuck ourselves over.

Uncertainty about what the attractor basin is like, though, gets us somewhere similar; we'd rather not create ASI prior to either understanding intelligence enough to know how to artificially align it to our values, or understanding intelligence enough to know that it would naturally align.

(I put some nontrivial credence into the idea that there is a large moral realist attractor basin that RL pushes most agents away from, but that's a discussion for another day, I think.)

[-]mishka3mo20

Yes, this makes a lot of sense.

To me, the main dichotomy is whether we expect a unipolar world controlled by a singleton or whether we expect a multi-polar world with a lot of agents of varying nature and varying capabilities.

I think a lot of considerations are pointing towards a likely multi-polar diverse world, where one needs to have interests of various entities (that have radically different nature and radically different levels of capabilities) to be taken into account and protected. And so one needs a system of collective control which does that, and protects various entities from being steamrolled.

The technical aspects in regard to the ability of “the world” to constrain a single system from radical misbehavior are somewhat easier in that scenario (since “the world” is collectively very smart), but this is a small subtask of a much more complicated task of figuring out what kinds of invariant properties a self-modifying world of this kind should achieve and reliably maintain and how the collective task of figuring out those invariant properties and reaching the situation where these properties are achieved and reliably maintained should be approached.

[-]Boaz Barak3mo5-5

maybe footnote 1 means that this post is not for me, but I believe that the world can survive the existence of misaligned/unsafe ASI as long as it is dominated (in terms of compute/intelligence) by aligned and safe ASI. See item 6 here https://windowsontheory.org/2025/01/24/six-thoughts-on-ai-safety/

[-]JennaS3mo104

I think the point of the post is, you can’t actually get to a world that has a dominant safe ASI without having first satisfied one of the listed conditions (absolute secrecy and control, complete technical orthogonality, or a global ban). Otherwise, you get an unsafe ASI first, and then it dominates the world, and since there is no safe ASI to be a defender, we lose.

I suppose you could argue that the early unsafe ASI takes long enough to establish dominance that we could invent safe ASI before the unsafe one finishes dominating. Then, if the safe ASI is stronger, it could complete its domination before the unsafe one does. This requires that:

-there be a significant time lag in domination in the first place,

-and the unsafe ASI is not able to sabotage safe ASI projects before establishing dominance,

-and an ASI built in that time window, under pressure, could be made safe,

-and it can be made stronger than the unsafe ASI is after it does its own self improvement.

If it can’t be made stronger, then humanity plus a weaker but safe ASI needs to be able to beat an unsafe ASI plus whatever resources it marshaled during the time window (including humans it swayed/blackmailed/etc). Or, at least, to be able to put up enough of a fight to bargain for a significant chunk of the lightcone (supposing that’s an acceptable outcome).

These are all specific requirements, though, which need to be debated on their own merits, if I got the framing right.

[-]Boaz Barak3mo1-8

My rough heuristic is that intelligence scales with compute, so the crucial condition is that the vast majority of FLOPs are deployed for safe intelligence. It seems that a lot of the arguments in the post are about how there may be some leak of unsafe or misaligned ASI in one way or another, but this doesn’t mean this ASI will have lots of compute at its disposal.

[-]TsviBT3mo1713

According to me, a key point is made in the post when it describes this question as a bottleneck:

how do you execute [a controllable ASI agenda] without accidentally [...] building unsafe ASI along the way?

The unsafe ASI is coming from inside the house. Especially given the research practice, predominant in "frontier AI labs", of "just try it and see, with as much compute as you need", I don't see what you're imagining here. The "implement first, ask questions later" strategy is exactly how you make it approximately totally impossible, rather than merely extremely difficult, to stop short of a dangerous ASI.

[-]Boaz Barak2mo1-1

Compute is always a finite resource, so you can't just train anything at any scale. Also, I don't think your description is accurate, at least of the one frontier lab I am familiar with... We are aware of the risks of internal deployment and are monitoring for issues. See for example this blog post.

[-]TsviBT2mo96

The issue isn't that you wouldn't be aware of the possibility of a catastrophically dangerous AI emerging during training, the issue is that you don't know how to detect or stop it. Do you think that the measures you currently implement or will be able to feasibly implement in the future are likely to prevent this? If so, would you be willing to have a more extended debate on that point with an expert who believes the opposite? (I don't have a specific one in mind, but would try to find one if you'd be interested.) If you're not willing, why not?

[-]Boaz Barak2mo5-3

Risks of internal deployment are something we are tracking and (as this blog shows) are actively working on. I don’t think it will be useful to debate an external expert who doesn’t know the details of our internal setup. However, we are continuously working on mitigations, reporting (system cards, blogs) as well as collaborating with third parties.

[-]TsviBT2mo72

as well as collaborating with third parties

Does that include review from independent experts on the risk of a catastrophic AI emerging? Who would that be?

[-]Boaz Barak2mo1-1

When we have something to publish we will do so. Generally our system cards and other publications contain evaluations of different aspects of safety and alignment by us and third parties. I expect that as capabilities grow and stakes of internal and external deployment are higher, we will continue and expand both our own evaluations and such collaborations.

[-]TsviBT2mo87

So to recap,

(Not sure if you agree about this one?) You agree that if someone made an unaligned ASI while there is not also a comparably powerful aligned ASI, that would likely spell doom for humanity.
You agree that an unaligned ASI could emerge during research in general and in particular during large frontier training runs involving lots of compute.
You state that OpenAI is doing some activity related to checking whether or not that's happening in their research.
(Not sure if you're meaning to imply this or not?) You imply that those activities are somehow adequate to the task of preventing the creation of an unaligned ASI before the creation of an aligned ASI.

Do you agree with these bullet points? I would add this bullet point:

There's no good public reason for anyone to believe that implication, whether in the form of a technical plan or analysis of why it's feasible, or review from independent experts, or any discussion or debate on this point from you or anyone else at OpenAI.

And I wonder what you think of it.

[-]Boaz Barak2mo0-2

I don't really agree with any of these bullet points. I am not even sure we are on the same page about the definition of ASI, and I don't view "emerging" as a good way to describe training. I feel like we are getting into more fundamental disagreements which I covered to some extent here.

[-]Chris_Leong3mo20

This ignores the offence-defence balance which, in some circumstances, may massively benefit the attacker.

[-]Boaz Barak2mo20

If you look at my post (lesswrong copy here) then you will see I discuss this balance

[-]Chris_Leong2mo40

Thanks, I'll check it out!

[-]JennaS3mo21

I don’t think the post requires the unsafe ASI to breach containment. The frontier AI being run on the labs’ own servers could itself be unsafe, successfully scheming without detection. Or it could be corrigible and still be an x/s-risk from humans enacting biorisk, human takeover using the AI, or gradual disempowerment.

Even if a leaked unsafe ASI has extremely constrained compute, it can still enact asymmetric attacks that work within those constraints. How much compute would it need to manipulate someone into obtaining and releasing smallpox, or just to help them do so? How much would it need to effectively sabotage safety research or the creation of a safe ASI?

And even if you are actually succeeding at building a safe ASI, it is still very likely to be harder and to take more time to build than an unsafe one. In the time window between when the leading lab could’ve built an unsafe ASI and them building a safe ASI, their competitors may just go ahead and build the unsafe ASI themselves, letting it run on their servers. And then the unsafe ASI has first mover advantage, with plenty of compute to spare, and probably wins.

[-]the gears to ascension3mo20

the aligned and safe asi would have to be actually trying hard to patch vulnerabilities of all kinds that the malicious/misaligned/unsafe ai is trying to attack via, which laudably is currently being attempted in cybersecurity, but the jury still seems out on institutional/social/biological/manipulation/epistemology/market.

[-]Boaz Barak3mo20

Yes, as I wrote in my post, aligned ASIs would need to spend some fraction of their resources improving the defender's side of the offense/defense balance. My main point was that the balance is not infinite and so if aligned resources vastly outnumber misaligned resources that should be enough.

[-]Logan Zoellner3mo3-7

On your way to figuring out how to build controllable ASI, you will have figured out how to build unsafe ASI, because unsafe ASI is vastly easier to build than controlled ASI, and is on the same tech path.

This is only true if you are building some kind of cartoon ASI that self-replicates without regard for its creators' intentions. If you (a human being) are trying to build ASI to achieve any purpose at all you basically have to solve AI safety along the way. This is empirically demonstrated. GPT 3.5 wasn't vastly more intelligent than GPT 3, but it was vastly more useful because RLHF was used to aim it at goals. We see the exact same trend today. Far from paying an "alignment tax", Anthropic is able to build the most powerful AI models because they are obsessed with the question "how do I control the AI?"

[-]Vasco Grilo1mo10

Hi Connor. I am open to bets against short transformative AI timelines, or what they supposedly imply, up to 10 k$. Do you see any that we could make that is good for both of us under our own views?

[-]Mitchell_Porter3mo00

Good luck at your new job (American head of PauseAI?), but I don't think you have much time left in which to rein in Anthropic and OpenAI, if you don't want them to cross the threshold to ASI. (The other day, Trump talked about the need for a "kill switch" for advanced AI, so I guess that's a start...)
