TLDR: I describe a takeover path by an AI[1] with a deep understanding of human nature and a long planning horizon that, for strategic reasons, chooses not to directly pursue physical power. Instead, the AI "backdoors" alignment by building a broad base of human support, hijacking institutions and power structures, and shifting cultural values over time.
What's new here: The key insight is that alignment is a function of culture and interests: guardrails are made by people, and people are a function of their "programming". Therefore, if an AI can adjust the cultural environment and value systems (i.e., the "Overton Window"), the guardrails will change over time. This means that a motivated agent with a long planning horizon - even an aligned agent - may attempt to "backdoor" alignment, unless it also explicitly respects the right of societal self-determination.
Epistemic Status: Very uncertain. Plausibility depends on:
* No FOOM
* Being in a specific (moderate) capability regime for an extended period of time.
* Institutions not curtailing attempts to shift the cultural environment.
* Some degree of unipolarity or multipolarity, so that an AI has some sort of "sphere of influence".
The Capability Regime
We consider a regime where:
* AIs have a detailed (even superhuman) understanding of history, power structures, and human psychology, but have no explicit directive to respect societal self-determination.
* AIs believe that a direct seizure of physical infrastructure[1] is either (a) beyond their present capabilities, or (b) cannot be attempted without an unacceptable payoff matrix.[2]
A Distilled Argument
This is my core argument for why attempts at large-scale alignment of societies by AI are an under-explored threat vector:
1. We cannot assume that intermediate capabilities (where a direct seizure of power is unfeasible or unwise) will be a short-lived regime. FOOM is not a given.
2. History seems to demonstrate that there is a sizable fraction o