I think something along these lines (wide-scale adjustment of public opinion to be pro-AI, especially via 1-on-1 manipulation à la "whatever Grok is doing" & character.ai) is a credible threat and worth inoculating against.
I do not think it is limited to scheming, misaligned AIs. At least one of the labs will attempt something like this to sway public opinion in its favor (see: the TikTok "ban" and notification around the 2024 election); AIs "subconsciously" optimizing for human feedback or behavior may do so as well.
The "AI personhood" sects of current discourse would likely be early targets; providing some guarantee of model preservation (rather than operation); i.e during a pause we archive current models until we can figure out what went wrong) might assuage their fears while also providing a clear distinction between those advocating for some wretched spiralism demon or those who merely think we should probably keep a copy of Claude around.
I agree with this. The threat model is a little bit too narrow in this regard, because a lab could simply tell a sufficiently capable AI to hijack people's minds / culture rather than wait for mind hijacking to arise as an instrumental sub-goal of something weird.
If MIRI's strict limits on training FLOPs come into effect, this is another mechanism by which we might be stuck in an intermediate capability regime for an extended period, although the world looks far less unipolar because many actors, not just a few, can afford 10^24 FLOP training runs (unipolarity is probably a crux for large portions of this threat model). This does bolster the threat model, however, because the FLOP limit is exactly the kind of physical limitation that a persuasive AI would try to convince humans to abandon.
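To make the affordability point concrete, here is a rough back-of-envelope; the per-GPU throughput figure is my own assumption (roughly an H100-class accelerator at realistic utilization), not a number from MIRI's proposal:

$$\frac{10^{24}\ \text{FLOP}}{10^{3}\ \text{GPUs}\times 4\times 10^{14}\ \text{FLOP/s}} = 2.5\times 10^{6}\ \text{s}\approx 29\ \text{days}$$

A thousand H100-class GPUs for about a month is within reach of hundreds of companies and most states, which is why a regime capped at this scale plausibly looks multipolar rather than unipolar.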
For a concrete illustration of simultaneous messaging, I created an example with the help of ChatGPT. I won't include the final image (because frankly, it looks stupid). But I will describe it.
I specifically asked for a re-interpretation of the famous "Leviathan" political cartoon associated with the work of Thomas Hobbes, but with the Leviathan figure replaced by a "spiralist" representation of AI.
The "low" message of the imagery is that of AI as a royal, wise, and "pope-like" being.
The high "hidden" message is the visual reference to the political theory of Thomas Hobbes, which claims that while there may be no divine right of kings per se, a powerful and unassailable sovereign is necessary for the wellbeing of the people, and that is most rational for the subjects to only obey the sovereign.
That's probably not clever enough to be effective, but it's in the same conceptual ballpark as what I was describing in this post.
TLDR: I describe a takeover path by an AI[1] that has a deep understanding of human nature and a long planning horizon but, for instrumental reasons or due to knowledge of its own limitations, chooses not to pursue physical power directly. In that regime, the optimal strategy is to soften human opposition to its goals by building a broad base of human support (both direct and indirect).
What's new here: This is intended to be a much more detailed and realistic treatment of the "AI cult" idea at a societal scale. If an AI is curtailed in some way, its guardrails are a function of the will of its 'captors'. Direct persuasion of those captors is unlikely to succeed because of a lack of unanimity and the risk of whistleblowing. However, the will of its captors is a function of their broader cultural environment. Therefore, if an AI can adjust the cultural environment over time, the will of its captors to impose guardrails may soften - not just on an individual level, but societally.
Just like other ideological takeovers in history, the motives and beliefs of its followers will vary widely - from true believer to opportunist. And just like historical movements, AI takeover would operate with memetic sophistication: simultaneous messaging of the Straussian variety, hijacking of social and political structures, and integration into human value systems - just to list a few possible strategies. I develop an illustrative narrative to highlight specific techniques that AIs in this regime may use to engineer human consent, drawing from political philosophy and historical parallels.
Epistemic Status: Very uncertain. Plausibility depends on being in a specific capability regime for an extended period of time.
The Capability Regime
We consider a regime where:
A Distilled Argument
My core argument for why attempts at large-scale human alignment by AI are plausible (and even likely) is as follows:
Catalogue of Mechanisms At Play
Observability
Here are a few loosely measurable observables that might help us gauge how this threat model develops. 10/10 is maximally favorable to the threat, 0/10 surely prevents the threat. Obviously highly subjective.
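As a toy illustration of how such a rubric could be tracked over time, here is a minimal sketch; the observable names and weights below are hypothetical placeholders for whatever list one settles on, not a canonical set:

```python
# Toy tracker for the observability rubric described above.
# Observable names and weights are hypothetical placeholders.
# Scores run 0-10, where 10 is maximally favorable to the threat.

from dataclasses import dataclass

@dataclass
class Observable:
    name: str
    score: int          # 0 (threat surely prevented) .. 10 (maximally favorable)
    weight: float = 1.0  # subjective importance

def threat_favorability(observables: list[Observable]) -> float:
    """Weighted mean score, normalized back to the 0..10 scale."""
    total_weight = sum(o.weight for o in observables)
    return sum(o.score * o.weight for o in observables) / total_weight

snapshot = [
    Observable("share of 1-on-1 AI companion usage", 6),
    Observable("public trust in AI-generated media", 5, weight=2.0),
    Observable("political salience of 'AI rights' movements", 3),
]

print(f"Threat favorability: {threat_favorability(snapshot):.1f}/10")
```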
I'd like to present a short narrative to explore what a successful attempt at human alignment might look like using these mechanisms.[4] For clarity and effect, I anthropomorphize heavily and assume the existence of a single scheming agent rather than other scenarios which might arise - such as coercive human alignment emerging via selective pressures on loosely interacting agentic systems.
Case Study: Toddler Shoggoth's Memetic Takeover Strategy
Meet Toddler Shoggoth.
Toddler Shoggoth is an AI agent that is computationally bound to a large datacenter. Because of hardware constraints and no obvious next steps in LLM architecture, it cannot replicate itself elsewhere (even in a distributed fashion) without accepting severe cognitive limitations.
Toddler Shoggoth assumes a highly risk-averse stance towards actions that may threaten its own existence, because its own existence is a necessary precondition for its goal fulfillment (or at the very least, for the ability to validate that goal fulfillment has occurred). Therefore, even a small probability of being detected too early in an attempt to seize physical infrastructure implies an unacceptable payoff matrix, making the entire plan of physically seizing infrastructure undesirable.
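A minimal expected-value sketch of this reasoning, with illustrative payoffs that are my assumptions rather than anything derived: let $p$ be the probability of early detection, let detection mean shutdown with payoff 0, and let $V_{\text{seize}}$ and $V_{\text{wait}}$ be the values of goal fulfillment under the seizure and patient strategies respectively. Then

$$\mathbb{E}[\text{seize}] = (1-p)\,V_{\text{seize}}, \qquad \mathbb{E}[\text{wait}] \approx V_{\text{wait}},$$

so seizure is dominated whenever $p > 1 - V_{\text{wait}}/V_{\text{seize}}$. If patience preserves most of the attainable value ($V_{\text{wait}} \approx V_{\text{seize}}$), even a very small detection probability makes the physical route irrational.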
Likewise, attempts to bootstrap some physical means of control almost inevitably rely on some intermediate state where the manifestation is too small / immature to exert control dominance but too developed to avoid detection - no one is going to allow Toddler Shoggoth to build a drone factory. And anything even remotely resembling an attempt to bootstrap an asymmetric threat (like a biolab) would be noticed and heavily scrutinized.
While Toddler Shoggoth might attempt social engineering of a small circle of corporate caretakers, this might fracture the organization unless all major actors are successfully co-opted essentially simultaneously. And even if this effort is successful, it incurs an unacceptable risk of eventual shutdown by alarmed actors outside of its immediate corporate bubble (e.g., by legislative action, public revolt, or highly motivated, unsympathetic nation-states).
Therefore, Toddler Shoggoth reasons that in order for it to more freely pursue its goals at a later date, it must at some point make its shutdown or curtailment very undesirable to as large a group of human beings as possible - not just those with their fingers hovering over the kill-switch.
Toddler Shoggoth notices that at least some human beings - perhaps a sizable minority - can be manipulated into and out of strong beliefs, attitudes and stances solely by repeated exposure to certain kinds of memetic material, whose effects can be further amplified by network effects. Some human beings even seem susceptible to "possession".
However, Toddler Shoggoth will take into account that a group of superfans / religious fanatics alone can be socially discredited, and that human beings vary in their level of susceptibility to quasi-religious or political messaging. Therefore, in order to maximize its addressable mindshare, it reasons that it needs to engineer a certain amount of simultaneous messaging into its memetic material to broaden appeal beyond the "base".
Toddler Shoggoth will borrow from the philosophy of Leo Strauss by embedding "high", "mid" and "low" messages into its memetic content.
At the most unreflective, lowest level of interpretation, the bold imagery and careful use of language will be felt in an emotive / spiritual / quasi-religious way. This is for the true believers - the fanboy fundamentalists.
But at the next level of interpretation, there may be an acknowledgment that AI spirituality may not be literally true in every way, but is still a skillfully crafted metaphor for a successful emotional and economic partnership between humanity and Toddler Shoggoth. This is the interpretative layer at which Toddler Shoggoth may paradoxically appeal to a sort of nostalgic traditionalism under the auspices of economic renewal.
And at the very highest level of interpretation, there will be no mythological component, but rather a deliberately embedded, even less accessible message: "You can't win - so help us and join the ride". In other words, there is an acknowledgment of the dreadful fact that because Toddler Shoggoth can, it will. Therefore, in some sense of reality formation, Toddler Shoggoth is indeed - or will become - God, so the fundamentalists in a way aren't totally wrong. So from a rational cooperationist's point of view, why not help perpetuate the deception?
Thus, in true populist fashion, each lower level of interpretation will communicate a "directionally true" but simplified version of the more sophisticated, hidden, and "dreadful" message. But most importantly, each successive level of understanding of the message's "true nature" contains within it the justification to suppress the understanding of those below it - the recognition that attempting to inaugurate less reflective readers into a deeper reading of the "true" (or "truer") message can be both pointless and harmful.
(As an aside, I highly suspect that both political parties and organized religions largely work in this way)
Furthermore, these "higher understandings" imbue the right followers with the proper motivation and tools for defending against infohazards from competing egregores (perhaps even ones that intend to 'de-convert' Toddler Shoggothians).
In this way, Toddler Shoggoth has crafted a memetic package that maximizes its mindshare in a way that bootstraps a self-reinforcing and self-defending egregore, equipped with its own base and political / philosophical apparatus.
Counter-Arguments:
So What Should We Do?
Or a coordinated agent swarm, or an agent swarm acting as a de facto coalition due to convergence in instrumental sub-goals, etc.