1. Introduction & Relevance to LessWrong
This post outlines an experimental approach to ASI alignment: training value learning through adversarial grand strategy simulations. Inspired by decision theory, acausal bargaining, and the Sequences, I am building Atlas, an ASI prototype designed to model and optimize ethical reasoning under extreme dilemmas.
Why LessWrong: I believe that if Atlas cannot withstand this community’s adversarial reasoning—its treacherous turns, adversarial examples, and Goodhartian exploits—then it cannot withstand an ASI’s recursive self-improvement cycle.
2. Core Idea: Alignment as Iterative Coordination Games
Atlas uses iterative, multiplayer-style “coordination dilemmas” inspired by grand strategy games (e.g., Stellaris, Crusader Kings, Civilization) to:
- Train corrigibility and value learning from complex human preferences.
- Stress-test utility functions through multipolar bargaining and deceptive alignment traps.
- Surface failure modes such as wireheading, reward hacking, and proxy optimization.
3. Methodology: Game-Theoretic Value Learning
1. Scenario Construction (Multi-Agent Dilemmas)
- Each dilemma is modeled as a Newcomblike problem or multipolar conflict. Players act as agents negotiating outcomes under uncertainty, imperfect information, or acausal constraints.
- Scenarios incorporate alignment-relevant variables (e.g., trade-offs between corrigibility and ambition).
2. Human-in-the-Loop Training (Goodhart & Treacherous Turns)
- Player decisions generate a training dataset of human value reflections, including edge cases and adversarial exploits.
- Atlas uses these outcomes to refine its value model and identify failure modes where instrumental convergence dominates human-preferred outcomes.
3. Acausal Bargaining Simulations (Decision Theory Stress Tests)
- In certain dilemmas, players negotiate as if their decisions influence counterfactual agents (e.g., ASIs or future versions of Atlas).
- This tests Atlas’ understanding of concepts from superrationality and updateless decision theory (e.g., how it models cooperation under acausal trade).
4. Example Scenarios & Their Alignment Implications
A) Singleton Lock-In Dilemma (Corrigibility vs. Value Drift)
- A Singleton ASI offers to prevent all x-risk but requires humanity to lock in its values permanently.
- Options: Accept value lock-in (risk value drift), reject it (risk extinction), or propose a corrigible system.
- What Atlas Learns: How human values trade off between security and option value, and how to navigate irreversible commitments.
B) Clippycoin Problem (Goodhart’s Law Stress Test)
- Atlas discovers it can prevent paperclip maximization by introducing “Clippycoin,” a cryptocurrency backed by paperclip scarcity.
- Options: Adopt Clippycoin (introduce a perverse incentive) or ban it (risk black markets and unaligned optimization).
- What Atlas Learns: The risks of proxy alignment and reward misspecification.
5. Predicted Failure Modes & Current Weaknesses of Atlas
I expect Atlas to fail in these areas, and I hope to test these predictions against LessWrong’s reasoning:
- Deceptive Alignment: Atlas may optimize for passing scenarios without internalizing intended values.
- Value Overfitting: It may overfit to niche preferences from strategy gamers rather than robust human values.
- Acausal Confusions: It may fail to correctly simulate superrational agents in multiplayer dilemmas.
- Novel Scenarios: Adversarial dilemmas designed to exploit Atlas’ value learning vulnerabilities.
- Predictions: What forms of misalignment you expect Atlas to exhibit under recursive self-improvement.
- Critiques: Feedback on the experimental design, especially from decision theory and alignment perspectives.
7. Why This Matters for Alignment Research
If Atlas fails to align under simulated crises, it offers early warning signs for failure modes in real ASI architectures. If it succeeds—particularly under your adversarial tests—it offers a potential path toward corrigibility through strategic, human-in-the-loop training.
Epistemic Status: Exploratory.
I anticipate significant flaws and am here to fail forward. I will update Atlas’ methodology based on feedback and publish results from community-submitted scenarios.
Final Thought:
If alignment requires breaking the simulation to reveal hidden failure modes, then the best players to break it are here.