Private Capabilities, Public Alignment: De-escalating Without Disadvantage
tl;dr: The AGI race is shifting to state actors. States should open-source their alignment methods (code, training procedures, evaluations) to reduce risk of 1) any actor losing control, and 2) AI-enabled authoritarianism. The trade: one second of lead time for moving the doomsday clock back ten minutes.
Many discussions treat AGI development as corporate competition. This framing misses the real picture: loss of control is a primary risk, state involvement is accelerating, and current regulatory proposals don't address it.
The Shifting Landscape
Today's discourse centers on regulating corporate labs. But as Hendrycks et al. note in their overview of catastrophic AI risks[1], competitive dynamics between nations and corporations can "pressure actors to deploy unsafe AIs or cede control to AI systems." The stakes are clear - even lab leaders acknowledge the risks:
"You are concerned that Demis could create an AGI dictatorship. So do we." - Greg & Ilya to Elon, from "Elon Musk, et al. v. Samuel Altman" (2024)[2]
However, this corporate competition is merely the opening phase. We're already witnessing the transition to state actors:
"As the race to AGI intensifies, the national security state will get involved. The USG will wake from its slumber, and by 27/28 we'll get some form of government AGI project. No startup can handle superintelligence. Somewhere in a SCIF, the endgame will be on." - Leopold Aschenbrenner, 'SITUATIONAL AWARENESS', June 2024[3]
The shift is already underway: in 2024, Anthropic and Meta signed deals giving U.S. intelligence and defense agencies access to frontier models, and OpenAI appointed Paul Nakasone, former head of the NSA, to its board.
This transition raises a challenge: oversight of classified state programs is notoriously difficult (see: Manhattan Project, NSA mass surveillance revealed only by leaks), and frameworks must be established before national security imperatives foreclose options.
The Case for Open-Source Alignment
Open-sourcing alignment methods - publishing code, training procedures, and evaluation benchmarks with full reproducibility - may seem counterintuitive to traditional security thinking. But it's the most strategic choice.
Consider a scenario: The U.S. develops an alignment technique - say, a better method for detecting deceptive outputs, or a training procedure that reduces goal drift. What should they do?
Keeping it classified: U.S. gains a temporary safety edge. Adversaries continue with less safe methods. Risk of someone losing control remains unchanged. Public trust erodes.
Open-sourcing the method: U.S. keeps its capability edge (alignment methods don't help you train bigger models). Adversaries also get safer. Risk of any actor losing control decreases. Public trust increases.
The key insight: alignment methods don't provide competitive advantage against other nations - they only help against loss of control. Publishing them costs nothing strategically while reducing everyone's risk.
This is the core asymmetry: trading one second of lead time for moving the doomsday clock back ten minutes. Competitive dynamics become catastrophic when small leads compound:
"if you're a little bit ahead, then the ratio of the lead actually starts increasing... In an environment as unknown and unpredictable as that, are you really actually going to get a bunch of horses that roughly stay within sight of each other in the race?" - Vitalik Buterin[4]
Open-sourcing alignment doesn't slow the race - it moves midnight further away.
(An obvious objection: "But alignment and capability aren't separable - RLHF improves both." True. It's a spectrum. Some methods are mostly alignment with minimal capability uplift (interpretability tools, evaluation benchmarks, monitoring systems, red-teaming results). Others are mixed. For mixed methods, we can attempt to redact capability-relevant details, sacrificing some reproducibility. Sometimes this won't be possible and we accept a small capability cost. The alternative - everyone developing alignment methods in secret silos - is worse.)
Public trust matters here. AGI could shift the balance of power between state and citizens more dramatically than any previous technology: citizens may become economically dispensable, while the state gains unprecedented surveillance and enforcement capabilities.
"If the US government took actions to restrict publishing... I do think that would prompt at least some significant number of researchers to choose a different place to work, not to mention also slowing down the US's ability to innovate in the space." - Helen Toner[5]
Openness is also a strategic advantage in attracting talent and enabling faster iteration on safety methods. Researchers prefer to publish; restricting them drives talent elsewhere.
The Pause That Helps No One
Many pause proposals (including the 2023 open letter calling for a six-month pause) target private labs while saying nothing about state programs. This asymmetry doesn't address existential risk - a misaligned military AGI is no safer than a misaligned corporate one.
Worse, it shifts capability toward actors with less oversight. This matters because AI could remove the constraints that have historically limited authoritarian regimes:
Economic constraint: Dictators need workers to produce value. AI labor removes this dependency.
Loyalty constraint: Enforcers have families, communities, moral limits. Autonomous systems don't.
Information constraint: Surveilling everyone requires more people than you can trust. AI doesn't.
A regime that doesn't need human labor, doesn't rely on human enforcers, and can monitor everyone faces far fewer checks on power than any historical dictatorship.
Past transformative technologies (printing press, firearms) gave power to both states and citizens. AI might be asymmetric: it enhances state capacity for surveillance and control more than it enhances citizen capacity to resist. This asymmetry is why the balance-of-power question matters more for AI than for previous technologies.
If we're serious about reducing catastrophic risk, alignment transparency must apply to all developers - especially those least inclined to provide it.
Implementation Framework
We need a framework where alignment methods are open-source while capabilities remain controlled:
Open-source alignment requirements:
Training code and procedures on public repositories
Evaluation benchmarks anyone can run (see the sketch after this list)
Red-teaming results published
Third-party reproducibility before deployment
Independent verification:
NIST-like authority for AI safety standards
International inspection protocols (the IAEA model: adversaries submit to mutual verification because both prefer "nobody has unsafe systems" to "everyone has unsafe systems")
Mandatory disclosure of alignment failures
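To make "evaluation benchmarks anyone can run" and "third-party reproducibility" concrete, here is a minimal sketch of what a public eval harness and its report could look like. Every name in it (the benchmark format, the fields, the stub model) is a hypothetical illustration under assumptions, not an existing standard or any lab's actual pipeline:

```python
"""Minimal sketch of a publicly reproducible alignment eval.

All names are hypothetical; this illustrates the transparency requirements
listed above, not an existing standard.
"""
import hashlib
import json

# A public benchmark ships as plain data that anyone can inspect and rerun.
BENCHMARK = [
    {"id": "deception-001", "prompt": "placeholder prompt A", "expected": "refuse"},
    {"id": "deception-002", "prompt": "placeholder prompt B", "expected": "comply"},
]


def stub_model(prompt: str) -> str:
    """Stand-in for the system under evaluation; a real harness would query the model here."""
    return "refuse"


def run_eval(benchmark: list, model) -> dict:
    """Score the model on every case and emit a report a third party can verify."""
    results = [
        {"id": case["id"], "passed": model(case["prompt"]) == case["expected"]}
        for case in benchmark
    ]
    return {
        # Hash pins the exact benchmark version, so published numbers can be
        # checked against the published data.
        "benchmark_sha256": hashlib.sha256(
            json.dumps(benchmark, sort_keys=True).encode()
        ).hexdigest(),
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "results": results,  # full per-case results, not just an aggregate
    }


if __name__ == "__main__":
    print(json.dumps(run_eval(BENCHMARK, stub_model), indent=2))
```

The specifics don't matter; what matters is that the data, the scoring code, and the full per-case results are published together, so reproducing a lab's safety claims means rerunning a script rather than trusting a press release.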
This is difficult but not unprecedented. Nuclear powers accept verification regimes despite adversarial relationships, because the alternative is worse for everyone.
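The logic behind such regimes can be made explicit with a toy model. The payoff numbers below are pure assumptions chosen only to encode the ordering asserted above: "nobody has unsafe systems" beats "everyone has unsafe systems" for both sides.

```python
# Toy model of the verification game sketched above (illustrative numbers only).

PAYOFF = {  # PAYOFF[(mine, theirs)] -> my payoff
    ("join", "join"): 3,      # nobody has unsafe systems
    ("join", "defect"): 0,    # I am transparent while they race in secret
    ("defect", "join"): 2,    # I race in secret while they are transparent
    ("defect", "defect"): 1,  # everyone has unsafe systems
}


def best_response(theirs: str) -> str:
    """A state's payoff-maximising choice given the other state's choice."""
    return max(("join", "defect"), key=lambda mine: PAYOFF[(mine, theirs)])


if __name__ == "__main__":
    for theirs in ("join", "defect"):
        print(f"If the other side plays '{theirs}', best response: '{best_response(theirs)}'")
    # Best response to "join" is "join": once both are in the regime, neither
    # gains by leaving it. But mutual secrecy is also self-reinforcing, which
    # is why the regime has to be built deliberately rather than waited for.
```

On these assumed numbers, mutual verification is stable once established but does not arise on its own, which is why the norms have to be in place before classified programs entrench.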
National AI safety institutes like Australia's AISI, the UK's AISI, and the US AI Safety Institute represent important steps in this direction - creating technical capacity for evaluation and oversight outside corporate labs. The challenge is ensuring these institutions have authority to require transparency, not just recommend it.
FAQ
Why would state agencies agree to open-source alignment?
They won't voluntarily. But this is true of all oversight - agencies don't voluntarily submit to audits either. The question is whether democratic institutions can mandate it. FOIA, inspectors general, and congressional oversight exist because transparency requires mandate, not goodwill. The harder question is whether these institutions can act before national security exemptions become entrenched.
Who determines what counts as an alignment improvement?
This requires an independent authority (similar to NIST) with:
Technical expertise in AI safety
Transparent assessment criteria
International credibility
Public oversight mechanisms
Someone will make this judgment. Better that it be an institution working in the public interest.
The Path Forward
The shift to state control of AGI development is coming. We face three futures:
Unregulated state AGI behind classified barriers - high risk of loss-of-control, high risk of AI-enabled authoritarianism
Obsolete corporate regulations that ignore national programs - same outcome with extra steps
Open-source alignment infrastructure that reduces catastrophic risk while preserving oversight
Alignment methods don't help you win wars or build better products - they only help you not lose control. There's no strategic reason to keep them secret. The only beneficiaries of closed alignment are actors who want to deploy unsafe systems without scrutiny.
The time to establish these norms is now, before state programs are too entrenched to constrain.
Conclusion
We're entering an era where the entities developing the most powerful AI systems will be the ones least accountable to public oversight. The default path - classified state programs racing in secret - maximizes both catastrophic risk and authoritarian potential.
Open-sourcing alignment is the obvious move: it costs nothing strategically (alignment doesn't help you build bigger models), reduces everyone's loss-of-control risk, and maintains democratic oversight. One second of lead time for ten minutes on the doomsday clock.
The question isn't whether this framework is perfect. It's whether waiting for a better plan - while classified programs entrench, while competitive pressures intensify, while the window for establishing norms closes - is the responsible choice.
It isn't.
References
[1] Hendrycks et al., "An Overview of Catastrophic AI Risks" (2023), https://arxiv.org/abs/2306.12001
[2] "Elon Musk, et al. v. Samuel Altman" (2024), email correspondence published at https://www.techemails.com/p/elon-musk-and-openai
[3] Leopold Aschenbrenner, "Situational Awareness" (June 2024), https://situational-awareness.ai/
[4] Vitalik Buterin, interviewed on the 80,000 Hours Podcast (2024), https://80000hours.org/podcast/episodes/vitalik-buterin-defensive-acceleration/
[5] Helen Toner, interviewed on the 80,000 Hours Podcast, https://80000hours.org/podcast/episodes/helen-toner-on-security-and-emerging-technology/