The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably

Alex Amadori

This post is based on my personal views, which mostly overlap with those of my employer, ControlAI, but do not necessarily fully reflect them. This applies in particular, but not exclusively, to technical opinions about AI development and geopolitical predictions.

You might’ve heard that superintelligent AI (ASI) poses extreme risks like human extinction and other comparably undesirable outcomes.

If you’re like me, you probably looked into possible solutions. And if so, you may have found a range of reassuringly tractable theories of change. To name a few:

Technical AI safety research agendas
Racing to ASI so your favorite company or country can get there first and prevent anyone else from building “bad” ASI
Building a good ASI and handing it control over the whole world (so that we don’t have to be subject to any evil human dictators)

If you think about it, all of these feel quite convenient, especially if you’re a tech-leaning person: you don’t need to change your career at all. Just keep working on your favorite ASI project, and things will work out.

It’s quite easy to come across theories that predict good outcomes without needing to change your strategy at all, even if you’re actively working to bring about ASI as soon as possible. I see these as being mostly semantic stopsigns. Most of them are about AI alignment being feasible:

AI alignment is easy and people are working hard on it, so it’ll probably be ok.
AI will help us do alignment research.
Iterative deployment will help us catch problems before AI gets too powerful.

In this post, I want to show you that even if the theories of change mentioned above were applied extremely successfully or if AI alignment actually turned out to be technically easy, all the value in the world is still on track to be destroyed because of AI development. This means, mostly, human extinction. It also includes scenarios that don’t literally qualify as human extinction but are still comparably undesirable. For example, the least-bad scenario I consider in this post is all-out war between nuclear superpowers, and the worst scenarios are suffering risks (s-risks).

There are many ways in which AI development can destroy the world. In this post I'll explain the three most likely pathways. Any plan for survival needs to address all of them and prevent those threats from being realized.

In my opinion, the only solution that addresses all the potential threats is to achieve two things together:

A level of global coordination sufficient to stop or slow down progress toward ASI, such that all parties can ensure the trajectory of AI development happens according to the consensus and interests of most parties.
Mass awareness across society of the implications of ASI and of the worst risks posed by AI development, so the various parties can correctly judge whether allowing development to proceed at a certain pace is in their best interest.

This is why I work at ControlAI, which, at the moment, I believe is the best bet for moving the world closer toward this state. However, in this post I won’t try very hard to sell my favorite theory of change (ControlAI's already got a post for that!).

Rather than arguing for international coordination, I will simply describe how common theories of change that don’t take this route don’t prevent the world from being destroyed.

Preamble: pressure to cut corners invalidates most theories of change

Before explaining the multiple ways in which AI development can destroy the world, I need to introduce this concept as it will come up over and over again. ASI would offer its creator an insurmountable competitive advantage, if it didn’t kill them. This means there is an extreme pressure to cut corners to be able to reap its benefits as soon as possible.

This topic has already been explored, so I won’t go into it too deeply. If you want to see an explanation of why ASI is so powerful, look at “Situational Awareness.” If you want to see my own game-theoretic analysis of an ASI race, read my paper: “Modeling the geopolitics of AI development.”

Suffice it to say, a large advantage in AI capabilities would allow its creator, or the rogue AI, to perform an extremely low-cost, low-risk takeover of all other countries and actors in the world. From that point on, they’d maintain a singleton: that is, a permanently unassailable total control over the world.

Once you understand this, it follows that you have to ensure no one else builds an AI capable of overpowering you. Assuming you don’t have the means to do this, then you have to be the first to gain this insurmountable advantage, before someone else does it and kills you.

First of all, let’s step into the shoes of a state actor, or any other powerful actor, and see what actions immediately come to mind after realizing the importance of ASI: “If any other actor has an ASI project more advanced than mine, I will try to steal, hijack, or otherwise take control of this ASI project.” Between states, this means espionage and sabotage, including extreme measures up to and including acts of war.

It also means that skilled actors, such as competent psychopaths or propagandists, will try really hard to gain control over the project. In the case of competent psychopaths, they may manipulate their way into the project’s leadership.

This also means that if you are a private company, there is not a chance in hell you will complete your ASI project and get to keep the ASI because:

Your government will take over the project!
If your government is sufficiently incompetent, other powerful actors (probably an adversary state) will infiltrate your project, steal your technology, and then sabotage you!^[1]

For whoever develops an ASI, there will be pressure to establish a singleton as soon as possible, so no one else can ever build an ASI or otherwise topple their regime.

Finally, race dynamics interact with AI alignment and control: there is extreme pressure to cut corners to speed up the development and deployment of powerful AI. At any given moment, deciding to cut corners just a little bit more is locally rational to each actor: the sacrifice probably won’t make the difference between catastrophe and success, and it gives a competitive advantage.

Presumably, at some point the perceived risk of catastrophe is so high that the least careful actor is not willing to cut any more corners, and an equilibrium is found. I have no reason to believe this equilibrium settles at a reasonable point! From a state’s perspective, the counterweight for the pressure to care about AI safety is the pressure to avoid total annihilation at the hands of an adversary.
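To make the "locally rational" logic concrete, here is a minimal toy sketch. It is not the model from the paper mentioned above, and every function and number in it is a hypothetical assumption chosen for illustration: two symmetric actors each pick a corner-cutting level between 0 and 1, cutting corners raises development speed linearly, perceived catastrophe risk grows as the square of the corner-cutting level, and losing the race is treated as being as bad as catastrophe. Each actor repeatedly picks its best response to the other until neither wants to change.

```python
# Toy race-to-the-bottom sketch. All functional forms and payoffs below are
# illustrative assumptions, not estimates of anything in the real world.

WIN = 1.0           # payoff for winning the race with a safe ASI (assumed)
CATASTROPHE = -1.0  # payoff if your own ASI turns out catastrophic (assumed)
LOSE = -1.0         # payoff if the rival wins; assumed as bad as catastrophe

# Candidate corner-cutting levels: 0.00, 0.01, ..., 1.00
GRID = [i / 100 for i in range(101)]


def win_probability(own_cut, rival_cut):
    """Chance of finishing first; speed is assumed to be 1 + corner-cutting."""
    own_speed = 1.0 + own_cut
    rival_speed = 1.0 + rival_cut
    return own_speed / (own_speed + rival_speed)


def catastrophe_risk(cut):
    """Perceived risk that your own ASI is catastrophic (assumed quadratic)."""
    return cut ** 2


def expected_payoff(own_cut, rival_cut):
    """Expected payoff for one actor, given both corner-cutting levels."""
    p_win = win_probability(own_cut, rival_cut)
    risk = catastrophe_risk(own_cut)
    payoff_if_win = (1.0 - risk) * WIN + risk * CATASTROPHE
    return p_win * payoff_if_win + (1.0 - p_win) * LOSE


def best_response(rival_cut):
    """Corner-cutting level on the grid that maximizes one's own payoff."""
    return max(GRID, key=lambda cut: expected_payoff(cut, rival_cut))


def find_equilibrium(max_rounds=100):
    """Iterate simultaneous best responses until neither actor changes."""
    a = b = 0.0  # both actors start out fully careful
    for _ in range(max_rounds):
        new_a, new_b = best_response(b), best_response(a)
        if (new_a, new_b) == (a, b):
            break
        a, b = new_a, new_b
    return a, b


a, b = find_equilibrium()
print("equilibrium corner-cutting:", a, b)
print("payoff at equilibrium:", expected_payoff(a, b))
print("payoff if both stay careful:", expected_payoff(0.0, 0.0))
```

Under these assumptions, both actors settle at a positive level of corner-cutting even though each would be better off if both stayed at zero. Where the equilibrium lands depends entirely on the assumed speed and risk curves, which is the point: nothing in the dynamics ties it to a level anyone would call reasonable.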

—

If you take only one thing from this post, take this: any theory of change that falls to one of these competitive pressures is completely useless.^[2] The only way to avoid these pressures is if we could build common knowledge, at any given time, that no one is trying to develop ASI.

This is why I’m going for international coordination: while it’s very difficult, it would address the problem at the source. After that, if someone wants to build ASI, it should be done under an extremely extensive degree of supervision by all parties, such that the other theories of change on how to safely build ASI become much more feasible.

If you try to address any of the other problems, for example by trying to solve AI alignment and control, before having removed competitive pressures, you are swimming against a strong current and will be swept over the falls.

First filter: all-out war between nuclear superpowers

I think that hawkish writings about China usually fail to take their reasoning to its logical conclusion. Examples include Leopold Aschenbrenner’s “Situational Awareness” and Dario Amodei’s essays, including “On DeepSeek and Export Controls” and some of “Machines of Loving Grace.”

People understand that the US and Chinese governments will wake up to the potential of ASI, and that when they do, absent strong international coordination (which Leopold and Dario assume is absent), the governments will be in an all-out race to see who can build it first. The mistake Leopold, Dario, and others make is to assume this is a restricted game, where most of what is happening is AI R&D and, at most, countries will engage in mutual espionage and sabotage.^[3]

If you take these views to their logical conclusion, you would see the ending of this story: all-out war between the US and China. When the superpowers try to sabotage each other’s ASI projects, they will not stop at grey-zone or covert sabotage. From a state’s perspective, if your adversary gets ASI, you are done. Your state will stop existing. You might as well have gotten all your major cities vaporized.

I am very confident that a superpower that knows it’s about to lose the race, or even considers a high risk of losing, will engage in unambiguous acts of war. The paper “Superintelligence Strategy” talks about possible kinetic strikes, but I think it will get much worse.

If states start building very hardened ASI projects, then stopping an opponent’s progress can be impossible without taking extreme measures that attempt to make the opponent’s country completely dysfunctional. For example:

Systematically attacking basic infrastructure (like the electrical grid) throughout the opponent’s territory
Sabotaging core functions of the opponent’s government, such as attempting or strongly supporting a coup
Launching an invasion, either to physically stop the ASI projects or to consume all the opponent’s resources through war

If we get to this point, I don’t see any reason to be confident that the situation won’t escalate all the way to a full-blown nuclear war between superpowers.

I think it would be a fool’s errand to try to predict the exact reaction of the national security establishment of the losing superpower. It will depend too much on unpredictable and opaque factors, from the structure of the natsec apparatus to whether the people responsible happen to be in a bad mood at some specific, decisive moment.

But I think it’s important to note that there are strong mechanisms pushing in the direction of arbitrary escalation, and no strong mechanisms holding it back.

And if all-out war between nuclear powers doesn’t sound bad enough to you, remember this: war would be waged with much more advanced AI than we have today, and the war itself would further shape the incentives around the AI race.

Contra “stable multipolar scenarios”

Stable multipolar scenarios can happen in one of two ways: if AI’s efficacy at war has reached the limits of physics, or if AIs have a way to enforce a consensus (like in the good ending of “AI 2027”).

AI advantages compound, and if the gap is wide enough, one of the competitors (potentially a rogue AI) wins. It seems unlikely that AI’s ability to wage war will climb all the way to the limits of physics while the gap between the various actors never gets wide enough to conclude the conflict.

About AIs enforcing a consensus, roughly, I think this would require AI to already be vastly smarter and more competent than any human or existing human organization. Which makes this proposed “solution” kind of tautological: you still need to pass all the filters and build an ASI that you can trust.

As an example, in “AI 2027,” the two ASIs strike a deal by building a “consensus AI” that will forever enforce, to some degree, the preferences of both AIs. To do this, you’d need to have developed an extremely deep fundamental understanding of how to program AI, the kind of understanding that lets you write an AI as lines of code rather than a neural network.^[4]

Due to the competitive pressures I talk about in this post, the plan would not unfold this way. Much, much earlier than when you’d be able to achieve such a deep understanding of AI, you’d achieve an understanding just barely good enough to build ASI. Then you would build it and thus destroy all value in the world, unless you already figured out a way past all the filters in this post.

An alternative proposal is to have AIs strike deals that are enforced through mutual monitoring. By the time AIs can strike such deals autonomously, they are already fairly superhuman and/or significantly in charge of running the world, and we need to have passed the filters.

To be clear, I don’t necessarily think it’s a bad idea to have weaker AIs help us enforce monitoring-based international agreements. But this needs to be done before AIs get too strong, at which point humanity would have to do it, even if aided by weaker AIs.

(My paper “Modeling the geopolitics of AI development” talks about this filter in more detail, but the thinking is less refined since it was written a while ago.)

Second filter: misaligned AI that kills everyone

It is probably very hard to build an ASI that doesn’t end up killing every human being simply by running it. The basic argument is that ASI would be so effective that any failure, even partial, would result in an ASI handling extreme amounts of power while not going out of its way to preserve human life and values.

ASI would kill us as a side effect of whatever it ends up doing, just like a human destroys an anthill without a second thought when it’s in the way of a construction project. The field of making sure that ASIs act in a desirable way is called “AI safety.”

The threat model of misaligned AI is the one that has already been explored the most, so I will assume that readers are at least passingly familiar with it and won’t try to convey the basic idea here. If you need an introduction, read the book “If Anyone Builds It, Everyone Dies” by Eliezer Yudkowsky and Nate Soares.^[5]

What I want to focus on here is how the pressure to cut corners I mentioned earlier makes it nearly impossible to solve alignment in time for when ASI will be developed. Think of the following competitive pressures:

Pressure to cut corners on safety methods
Pressure to deploy as fast as possible
Pressure to give AI as much autonomy as possible
Pressure to hand over existing decision loops to AI as quickly and thoroughly as possible

So what happens is, AI projects will develop and deploy AI that is as capable as possible given current capabilities techniques, while only being as safe as absolutely necessary to make them usable. The most important part here is: only as safe as absolutely necessary to make them usable. What does it mean? Well, the first instance of this pattern we’ll discuss is the commercial one.

Software engineers won’t use an AI that cheats to make the tests pass every time, but they’ll use an AI they can usually catch cheating, as long as the violations don’t fall through the cracks often enough for the engineer to get fired.

CEOs will not use AI employees that regularly take costly, irreversible actions to the point that the company loses a lot of money or it gets the CEO in trouble. But they will, for example, use AI that takes illegal actions as long as the company gets fined for less than the money it made, or the crime happens in a third-world country, etc.

So far it doesn’t sound like an extinction risk, but what is the “usability limit” when it comes to integrating AI in the military? What about tail risks, situations that are too rare and so haven’t yet appeared in the feedback loop of fixing AI bugs?

And most importantly, what happens when someone first gets to the capability level where they mostly hand over AI R&D to AIs themselves?

The AI will be just barely safe enough to profitably (not spotlessly!) do jobs that:

Are roughly as complicated as AI R&D^[6]
Have short-enough feedback loops that failures have already happened, such that AI companies already have bug tickets for these failures
Have already addressed these bug tickets

Of course, you will not be able to get this guarantee for novel tasks, such as AI R&D itself. Probably, you won’t even be able to get it for tasks that already exist but are not common enough for you to test the AI thoroughly on them during the (very brief) allotted time. You have to hope that whatever safety you have transfers from this small, nonrepresentative set of tasks to the ones that matter.

Why technical AI safety agendas do not address this problem

Technical AI safety agendas for addressing extinction risks usually focus on the “misaligned AI that kills everyone” filter, so I have to briefly address why, as a general rule, they don’t work. In fact, they make things worse.

This is because all alignment work is capabilities work.

Take RLHF (reinforcement learning from human feedback), for example.^[7] RLHF improved “alignment,”^[8] but it also improved capabilities a lot more: the AIs that we built after RLHF were more liable to do dangerous things than the ones we built before it existed. This is true even if you do your best to use RLHF to make the model safe.

To the degree that interpretability and scalable oversight work, I confidently predict that they will do exactly the same thing.

The underlying, fundamental reason for this is that capabilities are easier to formalize than safety. By this I mean that capabilities are easier to measure and easier to describe to other people, to AIs, and to code without loss of information.

Imagine that we get an interpretability breakthrough. You would have more readability into the internal algorithms of AIs, but those algorithms are very big and complicated: you wouldn’t automatically know which parts are helpful and which are harmful.

Some will be obviously harmful and removed right away. What then? Maybe you can do some manual searches for patterns you suspect exist? But humans are slow. You can get AIs to help you, but AIs are not (yet) smarter than you, and so they’d miss some stuff. Perhaps AIs already have some misaligned biases and so would sometimes actively hinder your efforts.

On the other hand, capabilities, oh how they’d skyrocket. Better interpretability would yield more powerful methods to modify AIs: it would allow engineers and learning algorithms to modify AIs in more targeted, deliberate, and understandable ways than can be done today.

Since capabilities are more formalized, you can quickly train a large team of engineers to make use of the novel techniques. Perhaps you can cut engineers out of this loop entirely, integrating the novel technique as part of automated learning algorithms.

If you want to modify AI to improve a quality that is hard to measure, like safety, you need a human to stand there and opine about each candidate modification. Worse still, the human needs to have good taste about the property you are trying to improve.

To summarize: capabilities can improve at machine speeds, while safety will always be bottlenecked by humans. The only way to solve this dilemma would be to describe our safety desiderata to the same level that we have described our capabilities desiderata. That way, we could potentially automate AI safety, or at least reliably train a team of engineers to do it. Good luck doing that during an all-out race to ASI!

I encourage you to think about this issue yourself, especially if you are a researcher at a major AI company working on a technical AI safety agenda. Your work may end up boosting capabilities more than most of the people over on the capabilities teams.

Third filter: nightmare singletons

Ok, imagine that the alignment problem is on track to get solved, such that a human being (or group of human beings) could operate an ASI without killing themselves and everyone else as a side effect. You and the rest of your team, the “responsible actors” in a world composed mostly of irresponsible ones, have the lead in AI development. You will build ASI first and then establish an eternal utopia, right? No.

Here’s what really happens: the government takes over your project before you get to ASI, by default as a military project, possibly top secret. You are questioned just enough that they know how to make use of the project’s assets (like code, documentation, hardware, etc.), and then you are thrown out the door.

Or maybe a softer version of this happens, where your AI company still technically exists. However, your CEO does not retain effective control of the company, and you have military personnel looking over your shoulder as you work.

If your government is asleep at the wheel, a foreign government will take over your project, or at least steal all the progress you’ve made so far and then pour their resources into going faster than you. Or if all governments are asleep at the wheel, another company will take over your project, or perhaps a charming psychopath CEO will manipulate their way into a top leadership position at the company where you work.

What then?

Whoever controls an ASI can establish a singleton. A singleton is a “world order in which there is a single decision-making agency at the highest level, capable of exerting effective control over its domain, and permanently preventing both internal and external threats to its supremacy.”

—

Let me spell out, for people who haven’t thought about this subject before, how nightmarish this scenario can get.

An individual in control of an ASI could establish a dictatorship that controls the entire earth, possibly the entire universe.

They could monitor every corner of their domain 24/7 and assign a virtually infinite amount of intelligence to analyze all of this information.

They could compel everyone to install brain implants (or forcibly upload them, etc.) and have complete oversight and control over their thoughts, actions, and experiences.

Eventually, they could shape the whole world to their preference until every atom is exactly as they want it, and do it as easily as a child shapes playdough.

—

In AI safety, some people’s strategy is to give power and resources to “good” or “responsible” actors, such as their favorite AI company.^[9] The theory of change for this strategy is that the “responsible” actor is the first to build ASI and establishes a utopian (or at least “good”) singleton.

I think that it is an enormous mistake to trust any one person or company with this. If your strategy is to use ASI to establish a “good” singleton, I will fight to prevent you from succeeding because I don’t trust you. But regardless, I hope this post makes you see that this strategy will break horribly.

If you are part of an ASI project and this is your plan, know this: someone more powerful than you will take your toys away before you get to ASI. Then, they will use them to race to ASI without you.

What happens later is fundamentally unpredictable. The result does not have to be as bad as the nightmare scenario I painted earlier. But from where we’re standing, it could easily get really bad.

I think what happens if any individual or small group of people obtains absolute power over the universe is an extremely dystopian scenario, potentially worse than death depending on your values. I think the same is true for scenarios in which we just barely make enough progress on alignment that ASI doesn’t kill us all as a side effect. ASI may want a future for us, but it could be a future that we find abhorrent, and it would have absolute power over us.^[10]

Even in the best case scenario, where the ASI project is taken over by a government with a very robust democratic process, the situation would most likely be considered a national security emergency. Such emergencies are dealt with by the military (or more generally, the executive branch), which needs to be able to act quickly. As a result, it has weaker democratic oversight compared to other government branches.

What will this government do after having declared an emergency situation, armed with proto-ASI? Would you feel safe if you thought your government was bound to establish a singleton?

How common theories of change fail trivially

Any solution or theory that focuses entirely on technical AI safety fails trivially by not taking into account the two other filters. For example, some people think AI alignment will be easy to solve. I think this view is most likely mistaken on a very deep level. But even if it were correct, it would not address the other two problems at all.

Furthermore, I think that no technical AI safety project will succeed in a world where actors are able to unilaterally push the frontier of AI development toward ASI. This is due to the pressure to cut corners on safety, and because any technique will accelerate capabilities much more than it accelerates progress in AI safety.

The philosophy of “iterative deployment” will simply not apply in a world where the pace of deployment depends entirely on competitive pressure and is entirely causally disconnected from any consideration of what may be a “responsible” pace for AI development.

There are some who try to acquire personal power or influence so they can exert it “when the time comes.” This can mean attaining influence inside of AI companies or in governments. As I pointed to in the third filter, I think power within AI companies is meaningless.

And I think the people who try to acquire unilateral power within governments are deeply misguided. When push comes to shove, they will fail at gaining enough power to steer the actions of governments.

If the majority of the government does not understand the meaning of ASI, these people will not be able to make massively expensive and complex asks to leadership. For example: “slow down AI development to improve safety” or “pressure other major powers to enter a hefty trust-but-verify regime capable of providing mutual assurances on AI development.” If these people try to push these asks without first building a broad support base (probably as broad as a decent voting bloc), then they will simply get purged.

Finally, there are people trying to get a major power to engage in a race to ASI, beat all their adversaries to it, and establish a singleton. I think these theories of change fail on literally all three filters:

The world will likely be consumed by war before any actor can get to ASI.
Even if we narrowly avoid all-out war, this theory of change leads to a race to the bottom on AI safety and to uncontrollable ASI that kills everyone.
Even if the ASI ends up being somewhat controllable, no country on Earth currently has such institutional robustness that it would not produce a dystopia if it acquired ASI.

Conclusion

These were the three main challenges that I think stand between us and surviving ASI. Even if we pass all three, I don’t think things automatically go well.

I have more intuition pumps that I would like to publish in a future post. They are mostly about how, in scenarios with AI that is strong but not as strong as I’ve been implying ASI is:

There is a strong tendency for power to concentrate and for the world to gravitate toward the three outcomes I’ve been describing.
There is a tendency for human preferences and behavior to mutate beyond recognition, to a degree that we might think of such people as essentially “dead.”

The main way that I envision humanity passing these filters is with deep awareness of what ASI entails and with international coordination.

Deep awareness is necessary so the relevant parties understand what their interests are with respect to ASI. Chiefly, they need to understand that ASI can become powerful enough to destroy the world, and that it is indeed extremely hard to deploy an ASI without destroying the world.

Coordination, backed by mutual monitoring and deterrence, is necessary so the major parties can avoid a race to the bottom over who builds ASI first. Without it, they will end up developing and deploying ASI in the most irresponsible way possible, and thus destroy the world.

Both deep awareness and coordination are necessary so countries can eventually get to work to figure out how to go through this transition while avoiding the horrific failure modes I’ve described, and others yet.

At the moment, my best bet for achieving these goals is to work at ControlAI. If you’re interested in learning more about ControlAI, feel free to read our funding pitch, which also goes into detail about ControlAI’s theory of change. Alternatively, feel free to shoot me a message.

[1] This includes things like stealing your weights and then sabotaging your ASI projects, but also trying to insert backdoors into your AI systems.
[2] And worse than useless if you consider that it absorbs funds and attention.
[3] Even when they acknowledge the possibility of war, it is treated as something that happens in the very endgame. Countries are not treated as being able to look ahead and strike preemptively.
[4] Even with such an understanding, code may not be the optimal way to build an AI, and you may choose to use neural networks or a new technique altogether. The point is that if you wanted to write it in code, you could.
[5] Some people criticize Yudkowsky and Soares’ arguments for not engaging properly with the peculiarities of LLMs and claim that LLMs make alignment easier. I have it on my to-do list to write about why the shape of current AI systems doesn’t make me particularly optimistic about alignment. Unfortunately, at the moment I don’t know of a good post to convey this; the best one I can point you to is: “Current AIs seem pretty misaligned to me.”
[6] In fact, I think this is quite optimistic. AI companies are prioritizing AI R&D over anything else, so it will be one of the first tasks (if not the first) that AIs will be able to perform at its level of complexity. There will not have been trial runs with similarly complex tasks.
[7] RLHF is the technique that enabled the creation of the first version of ChatGPT.
[8] Insofar as you could get LLMs to actually do the task you asked them to do, even when the task was not extremely simple and even if you weren’t an expert base-model prompter.
[9] This includes technical people who decide to work on capabilities at an AI company.
[10] The bad ending of “AI 2027” falls under this last category, and it was considered the most likely ending by the authors at the time of writing.

I've been critical about your "theory of change" in the past, but regardless of that, I think the arguments in this post are completely right. Things really are this bad. And yeah, joining big labs to work on alignment is especially not helping.

A lot of people have pointed out the CS Lewis quote which is perfect for the occasion:

"At first, of course," said Filostrato, "the power will be confined to a number - a small number - of individual men. Those who are selected for eternal life."

"And you mean," said Mark, "it will then be extended to all men?"

"No," said Filostrato. "I mean it will then be reduced to one man."

Ah, yes. That Hideous Strength. A strange novel, certainly.

I've long felt that it was about the Singularity, as seen from a primarily biological perspective. You see the dreams of power and transcendence, the early partial successes, and the inevitable corruption of nearly all the people involved. There's even a discussion of s-risks. The ending is positive because C.S. Lewis wanted to tell an entertaining story, and because his religion encouraged him to believe in impossible last-minute grace.

I may have referred to various AI labs as "the National Institute for Coordinated Experiments" in the group chat. I suspect that they employ more than one Filostrato, Wither or Frost.

Google once prided itself on "Don't be Evil," and they even sometimes lived up to it, to some degree. But money and power have corrupted them pretty thoroughly, over stakes no higher than some ad revenue. And one of the arguments of That Hideous Strength is that people who push for a Singularity are unusually corruptible, because the temptations are so great. And, frankly, because they would need to be unspeakably arrogant and reckless to try.

FWIW, I resonate with some parts of this post -- e.g., that a near-ASI endgame is deeply unstable from a risk-of-war perspective.

But I also think it overall has a code smell of what you might call "politics as done by physicists" (https://xkcd.com/793/): far too much confidence in the exhaustiveness of frameworks; far less respect for friction, uncertainty, and risk-averseness than exists in actual policymaking; and very little exploration of polities and societies as anything other than command-and-control actors.

And I note that, in general, politics as done by "physicists" has failed to work very well. (These are metaphorical "physicists" referencing the XKCD comic -- actual literal physicists can become good politicians, if they learn the actual affordances of the societies they live in, however. Metaphorical "physicists" include things like the writings of Karl Marx, in addition to some literal physicists).

This is a good analysis of the geopolitical situation but some claims seem very overconfident. For example:

The only way to avoid these pressures is if we could build common knowledge, at any given time, that no one is trying to develop ASI.

It is very possible one lab/country gains a decisive advantage before any others. The approach to ASI is likely to be chaotic and fraught with disagreement. It very well may not be obvious what is happening to other powers until a decisive advantage is gained. If the lab/country with a decisive advantage succeeds at technical alignment you may end up in a world which bypasses many of these concerns.

a foreign government will take over your project, or at least steal all the progress you’ve made so far and then pour their resources into going faster than you.

It took the Soviets years of espionage to steal atomic secrets. If a lab is approaching ASI, one can expect that the pre-ASI AI will be heavily woven into their security architecture for securing model weights which may effectively prevent nation states from stealing them.

Overall your post was quite high quality and explores a neglected area of AI discourse (And I agree we should pause AI!) but I also don’t think things are quite as hopeless as outlined.

Partially agree.

If AI development is more insight-driven than compute-driven, then there is more room for sudden progress that gives a lab a decisive advantage over other labs and governments before they notice (that is, before other entities suspect with non-negligible confidence that the lab is getting close to ASI) and react. This allows the lab, rather than the mainstream labs and governments, to control the singleton, and in this situation the lab might escape race dynamics.

However, this scenario results in a random lab controlling a singleton. While it's not as hopeless as a singleton built by a racing entity, it doesn't look really hopeful either.

If we're talking about the differential effect of a given lab joining the race, then they could have a positive effect, if we know they have good intentions to benefit humanity. However, it's still difficult to ensure the good intentions are still there when they actually get to ASI.

I think there is an interesting middle ground here too. You have the dichotomy of insight-driven versus compute-driven.

I would generally agree with your assessment.

Within the compute-driven paradigm I'd propose a sub-delineation: intelligence able to generate novel insights versus intelligence able to succeed only in highly specified domains.

It is possible that LLM intelligence only scales in verifiable domains, but not in "insight"-driven domains, where you end up with an AI that can solve math problems, identify vulnerabilities, and code in a vastly superhuman way, but is notably hobbled in other domains.

If we're talking about the differential effect of a given lab joining the race, then they could have a positive effect, if we know they have good intentions to benefit humanity.

FWIW, I think that this has mostly the effect of just adding fuel to the fire, because the government takes over the project regardless of the intentions of the company.

For this to be different, the "insight" would have to accelerate progress to ASI so much that the company can build ASI in a very short time while staying under the radar, including having very few employees and using little compute.

If you buy my arguments in the section "Why Technical AI safety agendas do not address this problem", the appearance of such an insight would actually be extremely bad: AI safety is always more bottlenecked on humans compared to capabilities, and this company has very few humans!

Inventing God while no one's watching? Are labs likely to keep a decisive advantage sufficiently obscure for long enough to take over the world? We saw with Mythos that Anthropic announced it very loudly, in close consultation with the US government. We have an executive order that "voluntarily" requires labs to let the US government review models before release even at this level of capability.

But even if one entity gets the decisive advantage, we're still just back in the "nightmare singleton" scenario.

Separately, "succeeds at technical alignment" here has to mean that technical alignment is so easy, you can do it under competitive pressures. Do even the "alignment is easy" folk believe that? Or do you mean that the singleton takes over first, then solves alignment? It doesn't seem like you can take over the world and still remain in control with a misaligned ASI...

Would pre-ASI AI be enough to stop every point of espionage, or at least enough of them? Anthropic is still at least a year from SL-4 - not even 5 - and who knows how the other labs are doing. Can it stop espionage that isn't purely cyber? Do we think other countries don't have spies in at least one of the major labs? Are AI secrets as well protected as nuclear ones were?

Ultimately, even granting that it could be physically possible that a) you could get a surprise decisive advantage, b) solve technical alignment before then, c) ward off all espionage attempts before then, and d) the resulting singleton acts to humanity's best interests - is the conjunction of all those a high enough probability to be termed "not quite as hopeless"?

Aside from whether the claims you are citing are true or well-calibrated, I want to point something out.

I actually think that your example scenarios here illustrate exactly the type of scenario that I wanted to disarm with this post. Or if not disarm them, at least give people the tools they need to disarm them.

It is very possible one lab/country gains a decisive advantage before any others. The approach to ASI is likely to be chaotic and fraught with disagreement. It very well may not be obvious what is happening to other powers until a decisive advantage is gained. If the lab/country with a decisive advantage succeeds at technical alignment you may end up in a world which bypasses many of these concerns.

This very specific scenario, that I think could technically happen but is extremely unlikely, falls to the third filter, "nightmare singleton": the winning AI company having to create a singleton in order to actually bypass the first and second filters.

You may believe this can be a good outcome, but you are still fundamentally trusting that the singleton established by a private company will be good for you / most people, which I don't qualify as a satisfying solution to the third filter.

(The reason I think this scenario is unlikely is that for this to work the AI company would've had to get a really, really big advantage over anyone else: how did they do this without cutting too many corners on safety and failing at the second filter? And of course, the thing about the government taking over the project.)

It took the Soviets years of espionage to steal atomic secrets. If a lab is approaching ASI, one can expect that the pre-ASI AI will be heavily woven into their security architecture for securing model weights which may effectively prevent nation states from stealing them.

I could spend time arguing why I think this scenario is unlikely, but I think this would miss the main point of the post: it doesn't address the fact that the winning AI company still needs to make a singleton, and we're failing at the third filter.

Suffice it to say, a large advantage in AI capabilities would allow its creator, or the rogue AI, to perform an extremely low-cost, low-risk takeover of all other countries and actors in the world.

A simple "large advantage" is not enough to get a low-cost, low-risk takeover. I think most people would say that frontier models have a large advantage over open-weight ones (e.g., Claude Mythos compared to Kimi K2.6), but keeping this gap into the future would not allow the US to take over China at low cost and low risk.

What advantage would you need? I'd surmise AGI/ASI at least a year ahead of anyone else. You'd probably need to cripple others without detection (or otherwise face retaliation), and still pay the high cost of developing the physical infrastructure to carry out such takeovers (drones and robots).

Your post does show that trying to maintain control over a superpowerful AI ends in disaster with high probability.

We do know that there are efforts which disclaim maintaining control over their AIs (which presumably involves different risks, but probably not the risks described in this post, at least if each AI in question is sufficiently distributed geographically, rather than locally concentrated).

Do you assume that those efforts are doomed to lose to efforts based on control in terms of speed of technical progress and therefore can be disregarded, or do you mean to analyze that class of efforts and their safety problems elsewhere?

You will build ASI first and then establish an eternal utopia, right?

Note that one needs control over the leading ASI if one wants to become a dictator, but if one is actually aiming for utopia, then control-based approaches are likely to be highly counter-productive. Humans don't have good track record in their utopia-building attempts, to say the least.

EDIT: I am aware of at least one project whose leader is disclaiming control and pushing for a different approach (Ben Goertzel) and of at least one project whose leader has a history of being very skeptical of the control approach and of pushing for different approaches not involving long-term human control over AI (Ilya Sutskever). It’s likely that there are more of those. With Ben, it’s difficult to say if his org has a chance, but they are specifically pushing for a very distributed architecture, which is not easy to fork or to take over.

We need smarter+wiser humans to actually have a shot at getting ASI right. The current people need to be shut down.

alignment

Because both are essentially the same thing: alignment is about AI doing what I want - and capabilities are about AI doing what I want?

Well, capabilities are more about having large effects, and alignment is about having effects that are desirable to the principal. You can imagine a 2x2 matrix with aligned/unaligned, highly capable / slightly capable. Aligned + highly capable is the best, unaligned + highly capable is the worst.

One way to think about it is, AI can be made to have enough short term alignment that we are incentivized to hand over long-term powers to it, before we've solved the problem of long term alignment. If those powers are long term enough, that handover would be irreversible.

This is because all alignment work is capabilities work.

This is not at all obvious and needs much more argument to be convincing. RLHF is a pretty weak example, as it's more or less a capabilities technique applied to alignment, and so it's unsurprising that it's differentially valuable for capabilities. Something like Constitutional AI does not clearly improve capabilities at all, let alone more so than alignment.

IMO the core reason alignment work is not automatically capabilities work is that ultimately in alignment we are interested in suppressing certain behaviors, while capabilities people are interested in eliciting or unlocking them, and there's no a priori reason why a technique that is useful for one of these must be useful for the other. Taking your interpretability example, suppose we get some great tool that allows us to conclude that there's some circuit or whatever in our model that corresponds to bomb-making knowledge, and no circuit for doing calculus. If we want the model to stop making bombs, this tool points directly at something we can remove, but if we want it to start doing calculus, it doesn't tell us what we should add or change. The work on activation steering is a good example of this; you can subtract the "unkind" vector to make the model nicer, but you can't add a "smart" vector to make it smarter.

I agree with most of the points presented in the post to various extents, but I don't think these arguments actually support the post's conclusion.

If you take only one thing from this post, take this: any theory of change that falls to one of these competitive pressures is completely useless. The only way to avoid these pressures is if we could build common knowledge, at any given time, that no one is trying to develop ASI.

Doesn't this plan (effectively an outright ASI ban) fall to competitive pressures between labs and nations just as easily as the others?

In fact, I'd say it's a lot more brittle. On a long enough timeline this plan would fail if any secret ASI project exists somewhere in the world, which due to the very competitive pressures outlined in the post, is almost certain to exist, with the bonus that the winners in this scenario would be guaranteed to be a group willing to start a secret ASI project presumably against international law.

In the meantime, the three filters that this post mentions all involve a smaller number of actors. In the first filter, only a few countries can start a global nuclear war, and in the other two filters (in the case where they occur) only one AI or group of people will decide the fate of humanity.

Also, a tangential observation: some of the arguments presented in the post seem to suggest that on the margin, it would be better to race faster in order to create aligned ASI before either your own government or their nuke-happy adversaries realize what's going on. I don't know what to make of this observation.

If governments are situationally aware, then they will be aware of the risk of secret AI projects and act on it, given that such projects are tantamount to declaring war (see filter 1). No one is going to allow the possibility of a small actor running such a project. The equilibrium can only either be actually solving the common knowledge problem for real (however costly this might be), or existential war. Though I suppose the alternative is, if the verification is not good enough, but the actors involved think it is, then there's room for a secret project to happen.

I agree, though I think the risk of a secret ASI project being successful with limited resources probably increases significantly the longer the timeline, based on a basic fragile-world extrapolation. The group of actors with capability to reach ASI will expand from major powers to middle powers and eventually small nations (some of which may be nuclear pariah states) and private groups. All it takes is one group to pull off a successful secret project (or possibly a not-so-secret one if it's run by a country with enough nukes) to break the equilibrium.

Assuming the above, then an AI pause is doable, but not an indefinite ASI ban. Eventually the risk of secret projects is large enough that the world as a whole needs to resume AI development to stay ahead of one if it existed.

FWIW, I think this is a problem for decades in the future. I agree that an ASI ban doesn't solve the problem indefinitely, but we'll have extra decades to figure out what to do.

I think the failure modes you're pointing to are valuable to point to. I'm left with the impression that you confidently believe everything other than your agenda -- which btw I can donate to -- fails trivially. Yet your arguments for this are kinda weak?!

I feel like at the heart of counterarguments to solutions that resemble those found in human history -- cooperative / multipolar / non-inner alignment --, there is a fallacy of comparing time scales but not shifting perspective to the new subjects of post-human history, AIs. It'll all go too fast for us, that's for sure -- but does that imply that one rogue AI will win everything? The speed of decision making will increase significantly, after all, possibly more than the speed of weapon development. The grounds for decision making will become more abstract, and a move by one ASI to simply secure computing resources beyond a certain limit could be regarded as a provocation by its fellow ASIs, who could intervene incredibly swiftly. If it's just one or two AIs who escape the peloton of the arms race: okay, that sounds bad, but if it's a couple? The very thing that OP is working on -- deep understanding of ASI risk -- would be widespread among all relevant players.

Consider human development of the last century or so. We have surpassed the limits of traditional weapons and produced enough to destroy the world many times over. None of this is comprehensible to chimps. There have also been times at which one player had enough lead to establish total military control, but they didn't really try, at least not by all means. Why not? Because there is multipolarity nested in multipolarity; people show signs of virtue to succeed inside their social environments and that is what saved everyone so far, including some of the underpowered players like, say, chimps. This behaviour is not a priori, it emerged under evolutionary pressure and has to be relearned by every specimen.

By comparing this to ASI timelines, I have probably committed a number of other fallacies, but so it goes.

Great post! Far too much writing about ASI geopolitics assumes uninformed, irrational, & neutered versions of state actors, and it's good to see some analysis that doesn't.

Max, it might be useful here to disclose your relationship with the same employer as Alex.

(I do agree with some parts of this post and think they are good, disagree with other parts that are weak arguments, but think LessWrong has an intense allergy against anything that could be misconstrued as astroturfing karma. I know you and think you're a nice dude, so it's to everyone's benefit to flag this.)

Hey! We do both work at ControlAI. I didn't know there was a strict norm about disclosing being colleagues with people when commenting on their posts (I'm fairly new to using LW): is that the case?

But just in case anyone is wondering, it's not like I was told to comment positively on this. Alex mentioned he wrote it in a personal capacity on Slack and I just read it and thought it was good.

It's helpful, I think, to do so; I don't know if the mods have declared it an explicit norm, but I do know some folks who've had negative experiences when folks thought they were trying to pull a fast one.

(And just to reaffirm -- don't think Max was doing that, that's why I wrote "misconstrued" above -- just trying to help everyone keep their trigger fingers nice and relaxed here)

This is probably not a useful comment, but I really want someone to reassure me that I am wrong, because I don't like the view I am presenting here.

Aren't we desperate enough for an AI-supported human dictatorship?

Many vices of a dictatorship seem to come from insecurity about keeping power and a god-king probably does not need concentration camps if his propaganda/education/genetic engineering are good enough. Of course, being brainwashed is still bad, but not as bad as whatever happens with an unaligned AI.
Humans are generally good or at least have aspirations to be good, so I would be surprised if a hypothetical dictator would not choose flourishing of a big group of people. When I try to remember worse dictators of the past, most come off as "Let those I like flourish and those I don't perish", not "Let those I don't like suffer more than those I like flourish". When I am thinking about it, what comes to mind is "Humans from countries that the leader likes, maybe being brainwashed to love the leader and maybe all having the eye color of his dead lover, but still falling in love, laughing, having friends and living more or less full lives, limited mostly in political and philosophical aspects".

Three counter-arguments that come to mind are:

Dictator would be operating well outside the distribution of normal human operation. I don't know if people with god-like power keep good intentions.
People in power are selected for negative characteristics. Most things like ruthlessness or hypocrisy don't seem to be relevant. Worst qualities like sadism are probably selected against. However, psychopathy is probably still selected for.
We have a better alternative of ASI used for the benefit of all humanity. If we really could make an international treaty, then I would be all for it. I hope that we can, but also the futures seem to be a) rogue ASI: very, very, very bad; b) best-case ASI: very, very, very good; c) ASI human dictatorship: still better than what we have now, due to unlimited technology. It seems to me that c is closer to b than to a, and working on alignment seems to mainly decrease P(a), by pumping the probability into P(b or c), while the treaty approach seems to try to increase P(b), which takes the cut from both P(c) and P(a), which might increase our expected value of the future less. Of course there is a quantitative question and it is entirely possible that alignment is simply that hard, so alignment work does not actually do much and only law can prevent extinction.
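
The expected-value comparison in the third counter-argument can be written out as a rough sketch. All symbols here are assumptions introduced for illustration and do not come from the comment: V_a < V_c < V_b are the values assigned to outcomes a), c) and b); delta is the probability mass that alignment work moves out of a), with a fraction lambda landing in b) and the rest in c); epsilon is the mass a treaty moves into b), with a fraction mu taken from a) and the rest from c).

```latex
% Illustrative symbols only (assumptions, not from the original comment):
% V_a < V_c < V_b, and p_a + p_b + p_c = 1.
\begin{align*}
\mathbb{E}[V] &= p_a V_a + p_b V_b + p_c V_c \\
% Alignment work: shift mass \delta from a) into b) and c).
\Delta_{\text{align}} &= \delta\,\bigl[\lambda V_b + (1-\lambda)\,V_c - V_a\bigr] \\
% Treaty: shift mass \varepsilon into b), drawn from a) and c).
\Delta_{\text{treaty}} &= \varepsilon\,\bigl[\mu\,(V_b - V_a) + (1-\mu)\,(V_b - V_c)\bigr]
\end{align*}
```

Under the comment's premise that c is close to b (V_c ≈ V_b), these reduce to roughly delta (V_b - V_a) for alignment work and epsilon mu (V_b - V_a) for the treaty, so the comparison turns on whether delta or epsilon mu is larger. That is the quantitative question flagged at the end of the comment: if alignment is hard enough that delta is tiny, the treaty approach can still come out ahead.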

