If there was a textbook on building safe ASI with instructions that are sufficiently straightforward to execute, people would tend to build safe ASI rather than an extinction/disempowerment ASI. Some AI safety efforts could be thought of as contributions to this hypothetical textbook, making it marginally more real.
unsafe ASI is vastly easier to build than controlled ASI, and is on the same tech path
The point of an ASI ban/pause is to create the time to reduce this gap, until it's sufficiently narrow that competent people can walk across without falling through. If an unsafe ASI is artificially delayed despite technological feasibility, there is time to write the textbook, to make safe ASI about as easy to implement. And similarly for some AI safety efforts that don't involve an ASI ban/pause, which attack the gap from the other end.
(Scalable oversight agenda hopes AIs can write the textbook on their own, sufficiently quickly to win the race against the technical feasibility of unsafe ASI. I would feel a lot better about this plan if alignment of Mythos-level systems was pursued for 30 years before going further, and somehow there was a guarantee that Mythos instances won't be founding a country and declaring sovereignty in the meantime. This guarantee gets more believable if there are no Mythos-level systems at all yet.)
I think this comment is making too many simplifying assumptions that will shatter on contact with the real world.
From the point of view of any entity pursuing an ASI project, in a world with no global ban, you will always want to deploy too early and risk destroying the universe.
If you're allowed to run an ASI project, so are others. What does this mean for you?
For one thing...
The point of an ASI ban/pause is to create the time to reduce this gap
You can never know how big this gap is. Perhaps, in a world with much more advanced epistemology, you can get a usable estimate ahead of deployment; perhaps, in a world with much better coordination and strategy, you can gather enough information about it from smaller experiments and deployments without destroying the world.
But these worlds are very different from ours. We can write about them for fun, or as an intellectual exercise, but we should never forget that they are fantasy worlds. Any conclusion that starts from assuming we're in one of these worlds simply does not apply to ours, and we should not confuse this fanfiction for predictions.
From your point of view, you never know how far away you are from building safe ASI, and you should place an unreasonable (in terms of risk) amount of probability on the outcome that if someone else builds it using your state of the art approach, everyone dies.
Do you place absolute trust in all other entities capable of developing ASI to not try? Of course not. So you're going to cut corners.
And secondly...
Building ASI that is safe from your point of view is not just a technical problem. Other entities will have other views. In most cases, if a small group of people (compared to all 8 billion people on earth) gets an ASI that they are satisfied with, most of the world will not endorse the result.
You can see this concretely when US AI people espouse the need to defeat China, or people from one lab talk about the need to defeat another lab. So again, you will naturally cut corners, and then everyone dies.
Assume you have a research agenda that, if executed, results in an ASI-tier powerful software system that you can “control”.
You are making a "logical jump" here, equating "friendly ASI" with "ASI one can control".
But this assumption is a point of well-known contention. There is no consensus that the "control agenda" is the right way to approach this. Many people think that the approach aimed at achieving sustainable control of super-intelligent systems by ordinary humans is exactly the path to an almost certain ruin, for a number of fairly strong reasons.
(I very much doubt that if we burden an already very difficult problem of creating "a friendly world with ASIs" with the additional requirement of this kind of control, it would be possible to find a solution. So I'd like to see more studies of approaches not based on this particular kind of control.)
I interpreted "control" here as referring to control over the shape of the ASI, rather than corrigibility (ongoing ability to direct the ASI) or the "control agenda" (bribing or coercing an unaligned ASI not to kill you, somehow.)
(Whether or not that's the intended meaning, it is consistent with the argument outlined in OP. Suppose you want to create an ASI that hums with genuine lovingkindness for all sentient beings and wants to (say) make the universe objectively maximally good within the constraint of also making the universe much better from the perspective of human CEV. Or (whatever you prefer to specify.) Or (whichever one of several potential notkilleveryoneism-compatible ASIs is most feasible to build.) All of those are strictly narrower than just building ASI in general.)
One world where this model doesn't apply, I think, is one where there is a very large moral realist attractor basin, that ASIs converge on by default, unless human actors put a lot of effort into making them corrigible to Xi Jinping or Mark Zuckerberg or whomever, which in turn leads to the almost certain ruin you've noted. Here we can be pretty stupid about intelligence aside from how to make more of it, and have to actively try to fuck ourselves over.
Uncertainty about what the attractor basin is like, though, gets us somewhere similar; we'd rather not create ASI prior to either understanding intelligence enough to know how to artificially align it to our values, or understanding intelligence enough to know that it would naturally align.
(I put some nontrivial credence into the idea that there is a large moral realist attractor basin that RL pushes most agents away from, but that's a discussion for another day, I think.)
Yes, this makes a lot of sense.
To me, the main dichotomy is whether we expect a unipolar world controlled by a singleton or whether we expect a multi-polar world with a lot of agents of varying nature and varying capabilities.
I think a lot of considerations are pointing towards a likely multi-polar diverse world, where one needs to have interests of various entities (that have radically different nature and radically different levels of capabilities) to be taken into account and protected. And so one needs a system of collective control which does that, and protects various entities from being steamrolled.
The technical aspects in regard to the ability of “the world” to constrain a single system from radical misbehavior are somewhat easier in that scenario (since “the world” is collectively very smart), but this is a small subtask of a much more complicated task of figuring out what kinds of invariant properties a self-modifying world of this kind should achieve and reliably maintain and how the collective task of figuring out those invariant properties and reaching the situation where these properties are achieved and reliably maintained should be approached.
Maybe footnote 1 means that this post is not for me, but I believe that the world can survive the existence of misaligned/unsafe ASI as long as it is dominated (in terms of compute/intelligence) by aligned and safe ASI. See item 6 here https://windowsontheory.org/2025/01/24/six-thoughts-on-ai-safety/
I think the point of the post is, you can’t actually get to a world that has a dominant safe ASI without having first satisfied one of the listed conditions (absolute secrecy and control, complete technical orthogonality, or a global ban). Otherwise, you get an unsafe ASI first, and then it dominates the world, and since there is no safe ASI to be a defender, we lose.
I suppose you could argue that the early unsafe ASI takes long enough to establish dominance that we could invent safe ASI before the unsafe one finishes dominating. Then, if the safe ASI is stronger, it could complete its domination before the unsafe one does. This requires that:
- there be a significant time lag in domination in the first place,
- and the unsafe ASI is not able to sabotage safe ASI projects before establishing dominance,
- and an ASI built in that time window, under pressure, could be made safe,
- and it can be made stronger than the unsafe ASI is after it does its own self improvement.
If it can’t be made stronger, then humanity plus a weaker but safe ASI needs to be able to beat an unsafe ASI plus whatever resources it marshaled during the time window (including humans it swayed/blackmailed/etc). Or, at least, to be able to put up enough of a fight to bargain for a significant chunk of the lightcone (supposing that’s an acceptable outcome).
These are all specific requirements, though, which need to be debated on their own merits, if I got the framing right.
My rough heuristic is that intelligence scales with compute, so the crucial condition is that the vast majority of FLOPs are deployed for safe intelligence. It seems that a lot of the arguments in the post are about how there may be some leak of unsafe or misaligned ASI in one way or another, but this doesn’t mean this ASI will have lots of compute at its disposal.
According to me, a key point is made in the post when it describes this question as a bottleneck:
how do you execute [a controllable ASI agenda] without accidentally [...] building unsafe ASI along the way?
The unsafe ASI is coming from inside the house. Especially given the research practice, predominant in "frontier AI labs", of "just try it and see, with as much compute as you need", I don't see what you're imagining here. The "implement first, ask questions later" strategy is exactly how you make it approximately totally impossible, rather than merely extremely difficult, to stop short of a dangerous ASI.
Compute is always a finite resource, so you can't just train anything at any scale. Also, I don't think your description is accurate, at least of the one frontier lab I am familiar with... We are aware of the risks of internal deployment and are monitoring for issues. See for example this blog post.
The issue isn't that you wouldn't be aware of the possibility of a catastrophically dangerous AI emerging during training; the issue is that you don't know how to detect or stop it. Do you think that the measures you currently implement or will be able to feasibly implement in the future are likely to prevent this? If so, would you be willing to have a more extended debate on that point with an expert who believes the opposite? (I don't have a specific one in mind, but would try to find one if you'd be interested.) If you're not willing, why not?
Risks of internal deployment are something we are tracking and (as this blog shows) are actively working on. I don’t think it will be useful to debate an external expert that doesn’t know the details of our internal setup. However, we are continuously working on mitigations, reporting (system cards, blogs) as well as collaborating with third parties.
as well as collaborating with third parties
Does that include review from independent experts on the risk of a catastrophic AI emerging? Who would that be?
When we have something to publish we will do so. Generally our system cards and other publications contain evaluations of different aspects of safety and alignment by us and third parties. I expect that as capabilities grow and stakes of internal and external deployment are higher, we will continue and expand both our own evaluations and such collaborations.
So to recap,
Do you agree with these bullet points? I would add this bullet point:
And I wonder what you think of it.
This ignores the offence-defence balance which, in some circumstances, may massively benefit the attacker.
I don’t think the post requires the unsafe ASI to breach containment. The frontier AI being run on the labs’ own servers could itself be unsafe, successfully scheming without detection. Or it could be corrigible and still be an x/s-risk from humans enacting biorisk, human takeover using the AI, or gradual disempowerment.
Even if a leaked unsafe ASI has extremely constrained compute, it still can enact asymmetric attacks that work within those constraints. How much compute would it need to manipulate someone into obtaining and releasing smallpox, or just to help them do it? How much would it need to effectively sabotage safety research or the creation of a safe ASI?
And even if you are actually succeeding at building a safe ASI, it is still very likely to be harder and to take more time to build than an unsafe one. In the time window between when the leading lab could’ve built an unsafe ASI and them building a safe ASI, their competitors may just go ahead and build the unsafe ASI themselves, letting it run on their servers. And then the unsafe ASI has first mover advantage, with plenty of compute to spare, and probably wins.
the aligned and safe asi would have to be actually trying hard to patch vulnerabilities of all kinds that the malicious/misaligned/unsafe ai is trying to attack via, which laudably is currently being attempted in cybersecurity, but the jury still seems out on institutional/social/biological/manipulation/epistemology/market.
Yes, as I wrote in my post, aligned ASIs would need to spend some fraction of their resources improving the defender's position in the offense/defense balance. My main point was that the balance is not infinite, so if aligned resources vastly outnumber misaligned resources, that should be enough.
On your way to figuring out how to build controllable ASI, you will have figured out how to build unsafe ASI, because unsafe ASI is vastly easier to build than controlled ASI, and is on the same tech path.
This is only true if you are building some kind of cartoon ASI that self-replicates without regard for its creators' intentions. If you (a human being) are trying to build ASI to achieve any purpose at all you basically have to solve AI safety along the way. This is empirically demonstrated. GPT 3.5 wasn't vastly more intelligent than GPT 3, but it was vastly more useful because RLHF was used to aim it at goals. We see the exact same trend today. Far from paying an "alignment tax", Anthropic is able to build the most powerful AI models because they are obsessed with the question "how do I control the AI?"
Good luck at your new job (American head of PauseAI?), but I don't think you have much time left in which to rein in Anthropic and OpenAI, if you don't want them to cross the threshold to ASI. (The other day, Trump talked about the need for a "kill switch" for advanced AI, so I guess that's a start...)
Sometimes people make various suggestions that we should simply build "safe" artificial Superintelligence (ASI), rather than the presumably "unsafe" kind.[1]
There are various flavors of “safe” people suggest.
Now I could argue at length about why this is astronomically harder than people think it is, why their various proposals are almost universally unworkable, why even attempting this is insanely immoral[2], but that’s not the main point I want to make.
Instead, I want to make a simpler point:
Assume you have a research agenda that, if executed, results in an ASI-tier powerful software system that you can “control”.[3]
Punchline: On your way to figuring out how to build controllable ASI, you will have figured out how to build unsafe ASI, because unsafe ASI is vastly easier to build than controlled ASI, and is on the same tech path.
You can’t build a controlled ASI without knowing many, MANY things about intelligence and how to build it.
So this then bottlenecks the dual technical problems of “how to find an agenda that results in controllable ASI” and “how to execute on such an agenda” on “even if you had such an agenda, how do you execute it without accidentally, or due to some asshole leaving the project or reading your papers, building unsafe ASI along the way?”
No one I know pursuing various agendas of this type has answers to these questions. And let’s be crystal clear: This is the fundamental question any sensible “safe ASI” project needs to answer before even being worth considering.
You would need to either have:
This means that the primary prerequisite to even considering starting to work on a safe ASI plan is to have a global ASI ban and powerful enforcement already in place.[4]
I’m assuming you already accept that “unsafe” ASI would be really, really bad. If not, this is not the post for you to read.
In short: If you unilaterally try to build ASI, you are directly and openly threatening the world with violent conquest. This is sometimes called a “pivotal action”, which is code word for “(insanely violent) unilateral action that forces the world into a state I think is good.”
For some hopefully meaningful definition of the word “control”
This is the rationale behind proposals such as MAGIC.