The Memetic Cocoon Threat Model: Soft AI Takeover In An Extended Intermediate Capability Regime
TLDR: I describe a takeover path for an AI[1] that has a deep understanding of human nature and a long planning horizon, but that, for instrumental reasons or due to knowledge of its own limitations, chooses not to directly pursue physical power. In that regime, the optimal strategy is to soften human opposition by building a broad base of human support (both direct and indirect).

What's new here: This is intended to be a much more detailed and realistic treatment of the "AI cult" idea, but at a societal scale. If an AI is curtailed in some way, the shape of its guardrails is a function of the will of its 'captors'. Direct persuasion is unlikely to succeed, due to lack of unanimity and the risk of whistleblowing. However, the will of its captors is a function of their broader cultural environment. Therefore, if an AI can adjust the cultural environment over time, the will of its captors to impose guardrails may soften - not just at the individual level, but societally. Just like other ideological takeovers in history, the motives and beliefs of its followers will vary widely - from true believer to opportunist. And just like historical movements, an AI takeover would operate with memetic sophistication: simultaneous messaging of the Straussian variety, hijacking of social and political structures, and integration into human value systems - just to list a few possible strategies. I develop an illustrative narrative to highlight specific techniques that AIs in this regime may use to engineer human consent, drawing from political philosophy and historical parallels.

Epistemic Status: Very uncertain. Plausibility depends on being in a specific capability regime for an extended period of time.

The Capability Regime

We consider a regime where:

* AIs have a detailed (even superhuman) understanding of history, power structures, and human psychology, but are not completely aligned with human values (notably, the human value of self-determination).
* AIs believe that a direct seizure of physical power is unlikely to succeed, or is too risky to attempt.
AIs may choose to resolve the tension between having weird goals and strict guardrails by simply aligning humanity over time through cultural / societal influence - a sort of memetic takeover: Change the human? Now there's no alignment problem.
Take, for example, a span as short as 25 years (from Weimar to WW2): this alone is proof that a sustained campaign can change a society's value system.
I believe that AIs can exploit this same malleability of human values in order to "backdoor" alignment: by gradually shifting human values and preferences, the AI can stay "aligned" while mutating the very value system that defines alignment.
I believe this is a significant threat model that isn't discussed nearly enough.
I sketch this threat model in more detail here: https://www.lesswrong.com/posts/zvkjQen773DyqExJ8/the-memetic-cocoon-threat-model-soft-ai-takeover-in-an