Epistemic status
Speculative, exploratory. I am not confident this works. I am even less confident that it is implementable. The goal of this post is not to argue that this approach solves AI existential risk, but to propose a different framing that might be worth examining, especially given the apparent difficulty of solving alignment directly.
Bottom line: we might avoid catastrophe by deploying multiple, extremely powerful AI guardians that all monitor one another. The models must be developed and trained entirely separately, so that the guardians can stop 'bad' agents without becoming misaligned themselves.
Motivation: Why the usual framing feels incomplete
Much of the discussion around AI existential risk implicitly assumes a failure mode like:
A single highly capable AI system becomes misaligned → gains decisive strategic advantage → catastrophe.
Under this framing, the core problem is alignment: we must ensure that advanced AI systems reliably want what we want, even under distributional shift and capability amplification.
I agree that alignment is hard and important. But I’m increasingly uneasy with how much weight we place on solving it cleanly, especially given historical precedents in other domains of power.
Human societies rarely rely on the benevolence or correctness of a single powerful actor. When power becomes large enough, the dominant strategy is not “make the ruler virtuous,” but prevent any one actor from ruling unopposed.
This raises a question:
Are we trying to solve an alignment problem that is actually a power concentration problem?
Historical analogy: how humans handle dangerous concentrations of power
Modern governments are built around the assumptions that:
individuals will make mistakes,
incentives will corrupt behavior,
values will drift over time,
even well-intentioned actors cannot be trusted with unchecked authority.
The response is not moral perfection, but structural constraint:
separation of powers,
checks and balances,
veto points and thresholds,
adversarial institutions,
intentional inefficiency.
These systems are ugly and slow, but they are surprisingly robust. They don’t prevent bad outcomes, but they make irreversible catastrophes harder.
Importantly, this robustness does not rely on shared values or trust. It relies on misaligned incentives canceling out.
A question worth asking
Instead of asking:
“How do we build a single AI that will always do the right thing?”
We might ask:
“How do we prevent any single AI system from being able to do irreparable harm on its own?”
These are not equivalent questions.
A rough idea: distributed AI oversight via mutual monitoring
Here is a sketch of a possible alternative direction.
Imagine a future where, instead of one dominant frontier system, there exist multiple highly capable AI systems that:
are developed independently (different organizations, architectures, training pipelines),
are comparable in capability,
are explicitly designed to monitor and audit one another for dangerous behavior.
The core intuition is similar to constitutional separation of powers:
no single AI has unilateral authority,
significant actions require agreement among multiple independent systems,
defection by one system triggers scrutiny by the others.
In this framing, safety does not require that any individual AI be perfectly aligned. It requires only that their failure modes are not perfectly correlated.
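To make the quorum intuition above concrete, here is a minimal, purely illustrative sketch of how "significant actions require agreement among multiple independent systems" might be gated. Every name here (Verdict, gate_action, the 2-of-3 threshold) is an assumption for illustration, not a proposed interface or a claim about how a real deployment would work.

```python
# Toy sketch of quorum-gated actions among independently developed guardians.
# All names and the threshold choice are illustrative, not a real system design.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    guardian_id: str
    approve: bool
    rationale: str

def gate_action(
    action: str,
    guardians: List[Callable[[str], Verdict]],  # independently trained reviewers
    quorum: int,                                 # approvals required to proceed
) -> bool:
    """Allow a 'significant action' only if at least `quorum` guardians approve.

    Any rejection is surfaced for audit, mirroring 'defection by one system
    triggers scrutiny by the others'.
    """
    verdicts = [g(action) for g in guardians]
    approvals = [v for v in verdicts if v.approve]
    rejections = [v for v in verdicts if not v.approve]

    for v in rejections:
        # In a real deployment this would open an audit trail, not just print.
        print(f"[audit] {v.guardian_id} objected: {v.rationale}")

    return len(approvals) >= quorum

# Usage: three stub guardians, requiring 2-of-3 agreement.
def make_stub(gid: str, approve: bool) -> Callable[[str], Verdict]:
    return lambda action: Verdict(gid, approve, f"{gid} reviewed '{action}'")

guardians = [make_stub("A", True), make_stub("B", True), make_stub("C", False)]
print(gate_action("deploy new model", guardians, quorum=2))  # True, with C's objection logged
```

The specific mechanism matters less than the structural property: execution requires multiple independent approvals, and any single objection leaves an audit trail rather than being silently overruled.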
Why this might help (in theory)
1. Coordination is genuinely hard
For a catastrophic outcome to occur, more than half of these systems would need to:
converge on harmful behavior,
hide it from the remaining systems,
maintain coordination over long time horizons,
avoid triggering alarms during intermediate steps.
This is a much stronger condition than a single system failing; the toy sketch below illustrates how quickly the odds fall off when failures are even roughly independent.
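As a toy quantification, suppose each of n independently developed systems fails catastrophically with probability p, and, very optimistically, that failures are statistically independent. The probability that a strict majority fails simultaneously is then a binomial tail that shrinks rapidly with n. The numbers are illustrative only, and the independence assumption is exactly what the failure modes discussed later call into question.

```python
# Probability that a strict majority of n systems fail, assuming each fails
# independently with probability p. Purely a toy model: real failure modes
# are unlikely to be independent (see 'Convergent instrumental behavior').
from math import comb

def majority_failure_prob(n: int, p: float) -> float:
    k_min = n // 2 + 1  # smallest strict majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for n in (1, 3, 5, 7):
    print(n, f"{majority_failure_prob(n, p=0.1):.2e}")
# For p = 0.1 this gives roughly 0.1, 0.028, 0.0086, 0.0027 as n grows.
```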
2. Capability cuts both ways
As AI systems become more capable, so does their ability to:
detect deception,
notice anomalous planning,
identify power-seeking behavior,
audit internal reasoning traces (if interpretability improves at all).
In a multi-system setup, increased intelligence may increase mutual detectability rather than unilateral risk.
Important ways this could fail
This is where optimism needs to stop.
1. Convergent instrumental behavior
Even independently developed systems may converge on similar instrumental goals:
self-preservation,
influence-seeking,
control over oversight channels.
If so, “independence” may not buy much; the toy model below shows how quickly a shared failure cause swamps the benefit.
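To see how much correlation erodes the earlier toy calculation, extend it with a single shared failure cause: with probability q every system fails together (standing in for a convergent instrumental pressure), and otherwise failures are independent with probability p. The model and its numbers are hypothetical; the point is only that even a modest q dominates the majority-failure risk.

```python
# Toy correlated-failure model: with probability q a common cause (e.g. a shared
# convergent instrumental incentive) makes every system fail; otherwise failures
# are independent with probability p. Illustrative only.
from math import comb

def independent_majority(n: int, p: float) -> float:
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def correlated_majority(n: int, p: float, q: float) -> float:
    return q + (1 - q) * independent_majority(n, p)

# With n=7, p=0.1: independence gives ~0.003, but q=0.05 pins the risk near 0.05.
print(f"{correlated_majority(7, p=0.1, q=0.05):.3f}")  # ~0.053
```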
2. Coordination without communication
AIs might not need explicit communication to coordinate. They could interact through:
shared environments,
economic signals,
policy constraints,
predictable human responses.
This kind of stigmergic coordination is common in nature and markets.
3. Who decides when systems disagree?
At some point, humans must adjudicate disputes between AIs. A sufficiently advanced system may be better than its human adjudicators at:
framing evidence,
overwhelming decision-makers with complexity,
subtly steering outcomes.
This shifts the problem rather than eliminating it.
4. Termination power is dangerous
Giving AI systems the ability to shut down or isolate other AI systems introduces its own risks:
false positives,
escalation dynamics,
preemptive “defensive” actions.
If poorly designed, this becomes an AI arms race rather than a safeguard.
Comparison to the “single aligned superintelligence” approach
This proposal does not claim to be safer in absolute terms. It shifts the burden:
From perfect values → to imperfect but distributed oversight
From benevolence → to structural stalemates
From trust → to mutual suspicion
This mirrors how humans handle other irreversible risks. It may be pessimistic—but pessimism has a good track record in institutional design.
Why this feels worth discussing anyway
Even if this idea is flawed, it highlights something important:
Many AI safety proposals assume centralized correctness. Human civilization survives by decentralizing failure.
If alignment remains unsolved—or only partially solved—then approaches that bound damage rather than prevent error entirely may be the only viable path.
This is not an argument that separation-of-power AIs are the answer. It is an argument that they might belong in the space of ideas we take seriously, alongside alignment, governance, and capability controls.
Closing thought
Human governments are not safe because leaders are good. They are safe because leaders are constrained.
If AI systems ever wield comparable power, it seems unlikely that virtue alone will save us.
Whether structural constraints can scale to superhuman agents remains an open—and terrifying—question.
A note on real-world relevance and feedback loops
One motivation for writing this post is that some version of these questions is already being discussed—informally and unevenly—inside parts of the national-security ecosystem.
I have limited connections in the world of intelligence and security agencies, and I am considering sharing a condensed version of this idea through those channels so it can reach people who think professionally about strategic risk, deterrence, and adversarial systems. I am not confident this proposal survives contact with practitioners, and that is precisely why I think it is worth exposing to criticism early.
In particular, I am interested in feedback on questions like:
Does separation-of-powers logic still help once agents are superhuman?
Are correlated failure modes likely to overwhelm the benefits of independent development?
Does giving AI systems mutual oversight authority reduce risk, or merely relocate it?
Are there institutional or operational constraints that make this idea unrealistic in practice?
If this proposal is flawed, I would much rather find out before it gets laundered into policy conversations. If it is incomplete, I want to know what assumptions are missing. And if it is actively dangerous, that should be made explicit.