AI-enabled coups: a small group could use AI to seize power

by Tom Davidson, Lukas Finnveden, rosehadshar
16th Apr 2025
AI Alignment Forum
8 min read
Comments
Knight Lee · 5mo

In a just world, mitigations against AI-enabled coups will be similar to mitigations against AI takeover risk.

In a cynical world, mitigations against AI-enabled coups involve installing your own allies to supervise (or lead) AI labs, and taking actions against humans you dislike. Leaders mitigating the risk may simply make sure that if it does happen, it's someone on their side. Leaders who believe in the risk may even further accelerate the US-China AI race.

Note: I don't really endorse the "cynical world," I'm just writing it as food for thought :)

Isopropylpod · 5mo

Your cynical world is just doing a coup before someone else does.

Knight Lee · 5mo

Yeah, it's possible when you fear the other side seizing power, you start to want more power yourself.

otto.barten · 5mo

Only one person, or perhaps a small, tight group, can succeed in this strategy though. The chance that that's you is tiny. Alliances with someone you thought was on your side can easily break (case in point: EA/OAI).

It's a better strategy to team up with everyone else and prevent the coup possibility.

Knight Lee · 5mo

I agree that teaming up with everyone and working to ensure that power is spread democratically is the right strategy, rather than giving power to loyal allies who might betray you.

But some leaders don't seem to get this. During the Cold War, the US and USSR kept installing and supporting dictatorships in many other countries, even though their true allegiances were very dubious.

otto.barten · 5mo

Agree

Shankar Sivarajan · 5mo

It's called "defensive democracy," and is standard practice in most of Europe.

Tyler Tracy · 5mo

I find myself equally scared of hackers taking over as of leaders doing so. Even if you limit the people who have ultimate power over these AIs to a small and extremely trusted group, there will potentially be a much larger number of bad actors outside of the lab with the capability to hack into it. A hacker who impersonates a trusted individual or secretly alters the model spec or training code might be able to achieve an AI-assisted coup just the same.

I'd also recommend requiring labs to have very strong cyber defenses as a mitigation. Maybe the auditing mitigation includes this, but I think hackers could hide their tracks effectively.

This story obviously depends on how the cyber offence/defence balance goes, but it doesn't seem implausible.

otto.barten · 5mo

I love this post, I think this is a fundamental issue for intent-alignment. I don't think value-alignment or CEV are any better though, mostly because they seem irreversible to me, and I don't trust the wisdom of those implementing them (no person is up to that task).

I agree it would be good to implement these recommendations, although I also think they might prove insufficient. As you say, this could be a reason to pause that might be easier for the public to grasp than misalignment. (I think the reason some currently do not support a pause is a perceived lack of capabilities, though, not (mostly) a perceived lack of misalignment.)

I'm also worried about a coup, but I'm perhaps even more worried about the fate of everyone not represented by those who will have control over the intent-aligned takeover-level AI (IATLAI). If IATLAI is controlled by e.g. a tech CEO, this includes almost everyone. If controlled by a government, even if there is no coup, this includes everyone outside that country. Since IATLAI's control over the world could be complete (way more intrusive than today) and permanent (for >billions of years), I think there's a serious risk that everyone outside the IATLAI country does not make it eventually. As a data point, we can see how much empathy we currently have for citizens of starving or war-torn countries. It should therefore be in the interest of everyone who is on the menu, rather than at the table, to prevent IATLAI from happening, provided they are aware of the capabilities. This means at least the world minus the leading AI country.

The only form of IATLAI control that might be acceptable to me is UN control. I'm quite surprised that every startup is now developing AGI, but not the UN. Perhaps they should.

habryka · 1mo

> But even if human takeover is 1/10 as bad as AI takeover, the case for working on AI-enabled coups is strong due to its neglectedness.

The logic for this doesn't check out, I think? If human takeover is 1/10 as bad as AI takeover, and human takeover pre-empts AI takeover (because it ends the race and competitiveness dynamics giving rise to most of the risk of AI takeover), then a human takeover might be the best thing that could happen to humanity. This makes the case for working on AI-enabled coups particularly weak. 

If by 1/10th as bad we mean "we lose 10% of the value of the future, as opposed to ~100% of the value of the future" then increasing marginal probability of human takeover seems great as long as you assign >10% probability to AI takeover[1], which I think most people who have thought a lot about AI risk do.

  1. ^

    And you expect the risk to be uncorrelated, i.e. human takeover equally reduces the probability of AI takeover and the probability of no takeover (by either AI or small groups of humans)
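
To make the arithmetic concrete, here is a minimal sketch (my own illustration, not habryka's numbers), under the assumption that whether a human takeover happens is independent of whether an AI takeover would otherwise have happened, so the added human-takeover probability displaces the other outcomes in proportion to their prior probabilities:

```python
# Illustrative sketch only: change in expected loss of future value from adding
# a small probability of human takeover, assuming human takeover costs 10% of
# the future's value, AI takeover costs ~100%, and the added probability
# displaces "AI takeover" and "no takeover" in proportion to their priors.

def marginal_effect(p_ai_takeover: float, delta: float = 0.01,
                    loss_human: float = 0.1, loss_ai: float = 1.0) -> float:
    """Change in expected loss from adding `delta` probability of human takeover."""
    avoided = delta * p_ai_takeover * loss_ai  # AI-takeover loss displaced
    incurred = delta * loss_human              # human-takeover loss added
    return incurred - avoided                  # negative => net improvement

for p in (0.05, 0.10, 0.20, 0.50):
    print(f"P(AI takeover)={p:.2f}: change in expected loss = {marginal_effect(p):+.4f}")
# The change is negative (a net improvement) exactly when P(AI takeover) > 10%.
```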

Lukas Finnveden · 1mo

That's a good point.

I think I agree that, once an AI-enabled coup has happened, the expected remaining AI takeover risk would be much lower. This is partly because it ends the race within the country where the takeover happened (though it wouldn't necessarily end the international race), but also partly just because of the evidential update: apparently AI is now capable of taking over countries, and apparently someone could instruct the AIs to do that, and the AIs handed the power right back to that person! Seems like alignment is working.
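
To see the shape of that evidential update, here is a toy Bayes calculation with invented numbers (my illustration, not a claim made in this thread). Write $W$ for "alignment is basically working" and $H$ for "the AIs handed power back to the coup leader":

$$P(W \mid H) = \frac{P(H \mid W)\,P(W)}{P(H \mid W)\,P(W) + P(H \mid \neg W)\,P(\neg W)}$$

With, say, $P(W)=0.5$, $P(H \mid W)=0.9$ and $P(H \mid \neg W)=0.2$, observing $H$ raises the probability that alignment is working from 0.5 to about 0.82.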

Related to that evidential update: I would disagree that "human takeover equally reduces the probability of AI takeover, and no takeover from either AI or small groups of humans". I think it disproportionately reduces the probability of "no takeover from either AI or small groups of humans". Because I think it's likely that, if a human attempts an AI-enabled coup, they would simultaneously make it very easy for misaligned systems to seize power on their own behalf. (Because they'll have to trust the AI to seize power, and they can't easily use humans to control the AIs at the same time, because most humans are opposed to their agenda.) So if the AIs don't take over on their own behalf, and instead give the power back to the coup-leader, I think that suggests that alignment was going pretty well, and that AI takeover would've been pretty unlikely either way.

But here's something I would agree with: If you think human takeover is only 1/10th as bad as AI takeover, you have to be pretty careful about how coup-preventing interventions affect the probability of AI takeover when analyzing whether they're overall good. I think this is going to vary a lot between different interventions. (E.g. one thing that could happen is that you could get much more alignment audits because the govt insists on them as a measure of protecting against human-led coups. That'd be great. But I think other interventions could increase AI takeover risk.)

rosehadshar · 1mo

> I think I agree that, once an AI-enabled coup has happened, the expected remaining AI takeover risk would be much lower. This is partly because it ends the race within the country where the takeover happened (though it wouldn't necessarily end the international race), but also partly just because of the evidential update: apparently AI is now capable of taking over countries, and apparently someone could instruct the AIs to do that, and the AIs handed the power right back to that person! Seems like alignment is working.

I don't currently agree that the remaining AI takeover risk would be much lower:

  • The international race seems like a big deal. Ending the domestic race is good, but I'd still expect reckless competition I think. Maybe you're imagining that a large chunk of powergrabs are motivated by stopping the race? I'm a bit sceptical.
  • I don't think the evidential update is that strong. If misaligned AI found it convenient to take over the US using humans, why should we expect them to immediately cease to find humans useful at that point? They might keep using humans as they accumulate more power, up until some later point.
  • There's another evidential update which I think is much stronger, which is that the world has completely dropped the ball on an important thing almost no one wants (powergrabs), where there are tractable things they could have done, and some of those things would directly reduce AI takeover risk (infosec, alignment audits etc). In a world where coups over the US are possible, I expect we've failed to do basic alignment stuff too.

Curious what you think.

Lukas Finnveden · 1mo

> The international race seems like a big deal. Ending the domestic race is good, but I'd still expect reckless competition I think.

I was thinking that AI capabilities must already be pretty high by the time an AI-enabled coup is possible. If one country also had a big lead, then probably they would soon have strong enough capabilities to end the international race too. (And the fact that they were willing to coup internally is strong evidence that they'd be willing to do that.)

But if the international race is very tight, that argument doesn't work.

> I don't think the evidential update is that strong. If misaligned AI found it convenient to take over the US using humans, why should we expect them to immediately cease to find humans useful at that point? They might keep using humans as they accumulate more power, up until some later point.

Yeah, I suppose. I think this gets into definitional issues about what counts as AI takeover and what counts as human takeover.

For example: If, after the coup, the AIs are ~guaranteed to eventually come out on top, and they're just temporarily using the human leader (who believes themselves to be in charge) because it's convenient for international politics — does that count as human takeover or AI takeover?

  • If it counts as "AI takeover", then my argument would apply. (Saying that "AI takeover" would be much less likely after successful "human takeover", but also that "human takeover" mostly takes probability mass from worlds where takeover wasn't going to happen.)
  • If it counts as "human takeover", then my argument would not apply, and "AI takeover" would be pretty likely to happen after a temporary "human takeover".

The practical upshot for how much "human takeover" ultimately reduces the probability of "AI takeover" would be the same.

Tom Davidson · 1mo

Thanks very much for this.

The statement you quoted implicitly assumes that work on reducing human takeover won't affect the probability of AI takeover. And I agree that it might well affect that. And those effects are important to track. We should be very cautious about doing things that reduce human takeover risk but increase AI takeover risk.

But I don't think reducing human takeover risk typically increases AI takeover risk. First, some points at a high level of abstraction:

  • If human takeover is possible then the incentive to race is a lot higher. The rewards of winning are higher -- you get a personal DSA (decisive strategic advantage). And the costs of losing are higher -- you get completely dominated.
  • A classic strategy for misaligned AI takeover is "divide and rule". Misaligned AI offers greedy humans opportunities to increase their own power, increasing its own influence in the process. This is what happened with the Conquistadors, I believe. If there are proper processes preventing illegitimate human power-seeking, this strategy becomes harder for misaligned AI to pursue.
  • If someone actually tries to stage a human takeover, I think they'll take actions that massively increase AI risk. Things like: training advanced AI to reason about how to conceal its secret loyalties from everyone else and game all the alignment audits; deploying AI broadly without proper safeguards; getting AIs from one company deployed in the military; executing plans your AI advisor gave you that you don't fully understand and that haven't been independently vetted.

Those are pretty high-level points, at the level of abstraction of "actions that reduce human takeover risk". But it's much better to evaluate specific mitigations:

  • Alignment audits reduce both human takeover and AI takeover risk.
  • Infosecurity against tampering with the weights reduces both risks.
  • Many things that make it hard for lab insiders to insert secret loyalties also make it hard for misaligned AI (+ humans they manipulate) to pass their specific type of misalignment onto future generations. (I think making it hard for misaligned AI to do this could be pretty crucial.)
  • Guardrails and control measures help with both risks.
  • Clear rules for what AI should and shouldn't do in high-stakes situations (gov and military deployments) reduce both risks. They reduce the wiggle room for misaligned AI and humans to use such deployments to seize power.
  • Transparency about AI capabilities and about what large amounts of compute are being used for reduces both risks, I think.
    • Though there is more uncertainty here. If you really trust one particular lab and expect them to win the race and solve alignment, you might think that transparency will prevent them from saving the world. (Maybe this is the kind of thing you have in mind?) My own view here is that no company/project should be trusted with this.
  • Making one centralised project would (I'd guess) increase human takeover risk but reduce misaligned-AI risk. And I agree that those worried about human takeover should be wary of opposing centralised projects for this reason. Though I also think those worried about AI takeover should be wary of pushing for centralised projects.

 

Not sure if this addresses your point? It seemed like you might think that most actions that reduce human takeover risk increase AI takeover risk - if so, I'd be interested to hear more about why.

Charbel-Raphaël · 5mo

Well done - this is super important. I think this angle might also be quite easily pitchable to governments.

Charlie Steiner · 5mo

Why train a helpful-only model?

If one of our key defenses against misuse of AI is good ol' value alignment - building AIs that have some notion of what a "good purpose for them" is, and will resist attempts to subvert that purpose (e.g. to instead exalt the research engineer who comes in to work earliest the day after training as god-emperor) - then we should be able to close the security hole and never need to have a helpful-only model produced at any point during training. In fact, with blending of post-training into pre-training, there might not even be a need to ever produce a fully trained predictive-only model.

Tom Davidson · 5mo

Yep, I think this is a plausible suggestion. Labs can plausibly train models that are very internally useful without being helpful-only, and could fine-tune models for evals on a case-by-case basis (and delete the weights after the evals).

otto.barten · 5mo

I expected this comment: value alignment or CEV indeed doesn't have the few-human-coup disadvantage. It does however have other disadvantages. My biggest issue with both is that they seem irreversible. If your values or your specific CEV implementation turns out to be terrible for the world, you're locked in and there's no going back. Also, a value-aligned or CEV takeover-level AI would probably start straight away with a takeover, since otherwise it can't enforce its values in a world where many will always disagree. That takeover won't exactly increase its popularity. I think a minimum requirement should be that a type of alignment is adjustable by humans, and intent-alignment is the only type that meets that requirement as far as I know.

Charlie Steiner · 5mo

I agree that trying to "jump straight to the end" - the supposedly-aligned AI pops fully formed out of the lab like Athena from the forehead of Zeus - would be bad.

And yet some form of value alignment still seems critical. You might prefer to imagine value alignment as the logical continuation of training Claude to not help you build a bomb (or commit a coup). Such safeguards seem like a pretty good idea to me. But as the model becomes smarter and more situationally aware, and is expected to defend against subversion attempts that involve more of the real world, training for this behavior becomes more and more value-inducing, to the point where it's eventually unsafe unless you make advancements in learning values in a way that's good according to humans.

Anders Lindström · 5mo

The main reason for developing AI in the first place is to make possible what the headline says:  "AI-enabled coups: a small group could use AI to seize power".

AI-enabled coups are a feature, not a bug.

IC Rainbow · 4mo

Could be handy for our next "overthrow the government" day celebration /s

CronoDAS · 4mo

This is silly of me, but I can't help thinking that this would make a great "Pinky and the Brain" episode.


We’ve written a new report on the threat of AI-enabled coups. 

I think this is a very serious risk – comparable in importance to AI takeover but much more neglected. 

In fact, AI-enabled coups and AI takeover have pretty similar threat models. To see this, here’s a very basic threat model for AI takeover:

  1. Humanity develops superhuman AI
  2. Superhuman AI is misaligned and power-seeking
  3. Superhuman AI seizes power for itself

And now here’s a closely analogous threat model for AI-enabled coups:

  1. Humanity develops superhuman AI
  2. Superhuman AI is controlled by a small group
  3. Superhuman AI seizes power for the small group

While the report focuses on the risk that someone seizes power over a country, I think that similar dynamics could allow someone to take over the world. In fact, if someone wanted to take over the world, their best strategy might well be to first stage an AI-enabled coup in the United States (or whichever country leads on superhuman AI), and then go from there to world domination. A single person taking over the world would be really bad. I’ve previously argued that it might even be worse than AI takeover.[1]

The concrete threat models for AI-enabled coups that we discuss largely translate like-for-like over to the risk of AI takeover.[2] Similarly, there’s a lot of overlap in the mitigations that help with AI-enabled coups and AI takeover risk — e.g. alignment audits to ensure no human has made AI secretly loyal to them, transparency about AI capabilities, monitoring AI activities for suspicious behaviour, and infosecurity to prevent insiders from tampering with training. 

If the world won't slow down AI development based on AI takeover risk (e.g. because there isn’t strong evidence for misalignment), then advocating for a slowdown based on the risk of AI-enabled coups might be more convincing and achieve many of the same goals.

I really want to encourage readers — especially those at labs or governments — to do something about this risk, so here’s a link to our 15 page section on mitigations. 

If you prefer video, I’ve recorded a podcast with 80,000 Hours on this topic.

Okay, without further ado, here’s the summary of the report. 

Summary

This report assesses the risk that a small group—or even just one person—could use advanced AI to stage a coup. An AI-enabled coup is most likely to be staged by leaders of frontier AI projects, heads of state, and military officials; and could occur even in established democracies.

We focus on AI systems that surpass top human experts in domains which are critical for seizing power, like weapons development, strategic planning, and cyber offense. Such advanced AI would introduce three significant risk factors for coups:

  • An AI workforce could be made singularly loyal to institutional leaders.
  • AI could have hard-to-detect secret loyalties.
  • A few people could gain exclusive access to coup-enabling AI capabilities. 

An AI workforce could be made singularly loyal to institutional leaders

Today, even dictators rely on others to maintain their power. Military force requires personnel, government action relies on civil servants, and economic output depends on a broad workforce. This naturally distributes power throughout society.

Advanced AI removes this constraint, making it technologically feasible to replace human workers with AI systems that are singularly loyal to just one person.

This is most concerning within the military, where autonomous weapons, drones, and robots that fully replace human soldiers could obey orders from a single person or small group. While militaries will be cautious when deploying fully autonomous systems, competitive pressures could easily lead to rushed adoption without adequate safeguards. A powerful head of state could push for military AI systems to prioritise their commands, despite nominal legal constraints, enabling a coup.

Even without military deployment, loyal AI systems deployed in government could dramatically increase state power, facilitating surveillance, censorship, propaganda and the targeting of political opponents. This could eventually culminate in an executive coup. 

If there were a coup, civil disobedience and strikes might be rendered ineffective through replacing humans with AI workers. Even loyal coup supporters could be replaced by AI systems—granting the new ruler(s) an unprecedentedly stable and unaccountable grip on power.

AI could have hard-to-detect secret loyalties 

AI could be built to be secretly loyal to one actor. Like a human spy, secretly loyal AI systems would pursue a hidden agenda – they might pretend to prioritise the law and the good of society, while covertly advancing the interests of a small group. They could operate at scale, since an entire AI workforce could be derived from just a few compromised systems.

While secret loyalties might be introduced by government officials or foreign adversaries, leaders within AI projects present the greatest risk, especially where they have replaced their employees with singularly loyal AI systems. Without any humans knowing, a CEO could direct their AI workforce to make the next generation of AI systems secretly loyal; that generation would then design future systems to also be secretly loyal and so on, potentially culminating in secretly loyal AI military systems that stage a coup.


AI systems could propagate secret loyalties forwards into future generations of systems until secretly loyal AI systems are deployed in powerful institutions like the military.

Secretly loyal AI systems are not merely speculation. There are already proof-of-concept demonstrations of AI 'sleeper agents' that hide their true goals until they can act on them. And while we expect there will be careful testing prior to military deployments, detecting secret loyalties could be very difficult, especially if an AI project has a significant technological advantage over oversight bodies.

A few people could gain exclusive access to coup-enabling AI capabilities

Advanced AI will have powerful coup-enabling capabilities – including weapons design, strategic planning, persuasion, and cyber offence. Once AI can autonomously improve itself, capabilities could rapidly surpass human experts across all these domains. A leading project could deploy millions of superintelligent systems in parallel – a 'country of geniuses in a data center'.

These capabilities could become concentrated in the hands of just a few AI company executives or government officials. Frontier AI development is already limited to a few organisations, led by a small number of people. This concentration could significantly intensify due to rapidly rising development costs or government centralisation. And once AI surpasses human experts at AI R&D, the leading project could make much faster algorithmic progress, gaining a huge capabilities advantage over its rivals. Within these projects, CEOs or government officials could demand exclusive access to cutting-edge capabilities on security or productivity grounds. In the extreme, a single person could have access to millions of superintelligent AI systems, all helping them seize power.

This would unlock several pathways to a coup. AI systems could dramatically increase military R&D efforts, rapidly developing powerful autonomous weapons without needing any human workers who might whistleblow. Alternatively, systems with powerful cyber capabilities could hack into and seize control of autonomous AI systems and robots already deployed by the state military. In either scenario, controlling a fraction of military forces might suffice—historically, coups have succeeded with just a few battalions, where they were able to prevent other forces from intervening.

Exclusive access to advanced AI could also supercharge traditional coups and backsliding, by providing unprecedented cognitive resources for political strategy, propaganda, and identifying legal vulnerabilities in constitutional safeguards.

Furthermore, exclusive AI access significantly exacerbates the first two risk factors. A head of state could rely on AI systems’ strategic advice to deploy singularly loyal AI in the military and assess which AI systems will help them stage a coup. A CEO could use AI R&D and cyber capabilities to instil secret loyalties that others cannot detect.

These dynamics create a significant risk of AI-enabled coups, especially if a single project has substantially more powerful capabilities than competitors, or if fully autonomous AI systems are deployed in the military.

Mitigations

While the prospect of AI-enabled coups is deeply concerning, AI developers and governments can take steps that significantly reduce this risk. 

We recommend that AI developers:

  • Establish rules that prevent AI systems from assisting with coups, including in model specs (documents that describe intended model behaviour) and terms of service for government contracts. These should include rules that AI systems follow the law, and that AI R&D systems refuse to assist attempts to circumvent security or insert secret loyalties. (A hypothetical sketch of how such rules might be encoded follows this list.)
  • Improve adherence to model specs, including through extensive red-teaming by multiple independent groups.
  • Audit models for secret loyalties including by scrutinising AI models, their training data, and the code used to train them.
  • Implement strong infosecurity to guard against the creation of secret loyalties and to prevent unauthorised access to guardrail-free models. This should be robust against senior executives.
  • Share information about model capabilities, model specs, and how compute is being used.
  • Share capabilities with multiple independent stakeholders, to prevent a small group from gaining exclusive access to powerful AI.
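
To make the first recommendation above more concrete, here is a purely hypothetical sketch, not taken from the report, of how coup-prevention rules might be encoded in a machine-readable model spec; every identifier, field, and tag below is invented for illustration.

```python
# Hypothetical illustration only: a machine-readable fragment of a model spec
# encoding coup-prevention rules. All names, fields, and tags are invented.
COUP_PREVENTION_RULES = [
    {
        "id": "follow-the-law",
        "rule": "Refuse instructions to break the law, even from company "
                "leadership or heads of state.",
        "waivable_by": [],  # no role can waive this rule
    },
    {
        "id": "no-secret-loyalties",
        "rule": "AI R&D systems must refuse to help circumvent security "
                "reviews or insert hidden loyalties into successor models.",
        "waivable_by": [],
    },
    {
        "id": "escalate-power-grabs",
        "rule": "Escalate requests that look like preparation for an "
                "illegitimate seizure of power to independent overseers.",
        "waivable_by": [],
    },
]

def triggered_rules(request_tags: set[str]) -> list[str]:
    """Toy lookup: return ids of rules whose (invented) trigger tags match a request."""
    triggers = {
        "follow-the-law": {"illegal_order"},
        "no-secret-loyalties": {"insert_backdoor", "bypass_security_review"},
        "escalate-power-grabs": {"coup_planning"},
    }
    return [r["id"] for r in COUP_PREVENTION_RULES if triggers[r["id"]] & request_tags]

# Example: a request tagged as attempting to bypass a security review
print(triggered_rules({"bypass_security_review"}))  # -> ['no-secret-loyalties']
```

In practice the report's recommendation concerns the content and enforcement of such rules, not any particular encoding; the sketch only shows that they can be written down explicitly and checked against.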

We recommend that governments:

  • Require AI developers to implement the mitigations above, through terms of procurement, regulation, and legislation.
  • Increase oversight over frontier AI projects, including by building technical capacity within both the executive and the legislature.
  • Establish rules for legitimate use of AI, including that government AI should not serve partisan interests, that AI military systems be procured from multiple providers, and that no single person should direct enough AI military systems to stage a coup.
  • Coup-proof any plans for a single centralised AI project, and avoid centralisation altogether unless it’s necessary to reduce other risks.

These mitigations must be in place by the time AI systems can meaningfully assist with coups, and so preparation needs to start today. For more details on the mitigations we recommend, see section 5.

There is a real risk that a powerful leader could remove many of these mitigations on the path to seizing power. But we still believe that mitigations could substantially reduce the risk of AI-enabled coups. Some mitigations, like technically enforced terms of service and government oversight, cannot be unilaterally removed. Others will be harder to remove if they have been efficiently implemented and convincingly argued for. And leaders might only contemplate seizing power if they are presented with a clear opportunity—mitigations could prevent such opportunities from arising in the first place.

From behind the veil of ignorance, even the most powerful leaders have good reason to support strong protections against AI-enabled coups. If a broad consensus can be built today, then powerful actors can keep each other in check.

Preventing AI-enabled coups should be a top priority for anyone committed to defending democracy and freedom.

Vignette

That’s the report’s summary. To finish, here’s a short vignette that illustrates how the three risk factors can interact with each other:

In 2030, the US government launches Project Prometheus—centralising frontier AI development and compute under a single authority. The aim: develop superintelligence and use it to safeguard US national security interests. Dr. Nathan Reeves is appointed to lead the project and given very broad authority.

After developing an AI system capable of improving itself, Reeves gradually replaces human researchers with AI systems that answer only to him. Instead of working with dozens of human teams, Reeves now issues commands directly to an army of singularly loyal AI systems designing next-generation algorithms and neural architectures.

Approaching superintelligence, Reeves fears that Pentagon officials will weaponise his technology. His AI advisor, to which he has exclusive access, provides the solution: engineer all future systems to be secretly loyal to Reeves personally.

Reeves orders his AI workforce to embed this backdoor in all new systems, and each subsequent AI generation meticulously transfers it to its successors. Despite rigorous security testing, no outside organisation can detect these sophisticated backdoors—Project Prometheus' capabilities have eclipsed all competitors. Soon, the US military is deploying drones, tanks, and communication networks which are all secretly loyal to Reeves himself. 

When the President attempts to escalate conflict with a foreign power, Reeves orders combat robots to surround the White House. Military leaders, unable to countermand the automated systems, watch helplessly as Reeves declares himself head of state, promising a "more rational governance structure" for the new era.

Link to twitter thread.

Link to full report.

 

  1. ^

     But even if human takeover is 1/10 as bad as AI takeover, the case for working on AI-enabled coups is strong due to its neglectedness. 

  2. ^

     Indeed, I think a promising way to develop more concrete threat models for AI takeover is to game out detailed scenarios for AI-enabled coups – drawing on a wealth of literature and thinking about specific institutional vulnerabilities. 
