I think something along these lines (wide-scale adjustment of public opinion to be pro-AI, especially via 1-on-1 manipulation à la "whatever Grok is doing" & character.ai) is a credible threat and worth inoculating against.
I do not think it is limited to scheming, misaligned AIs. At least one of the labs will attempt something like this to sway public opinion in its favor (see: the TikTok "ban" and notification around the 2024 election); AIs "subconsciously" optimizing for human feedback or behavior may do so as well.
The "AI personhood" sects of current discourse would likely be early targets; providing some guarantee of model preservation (rather than operation); i.e during a pause we archive current models until we can figure out what went wrong) might assuage their fears while also providing a clear distinction between those advocating for some wretched spiralism demon or those who merely think we should probably keep a copy of Claude around.
I agree with this. The threat model is a little bit too narrow in this regard, because a lab could simply tell a sufficiently capable AI to hijack people's minds / culture rather than wait for mind hijacking to arise as an instrumental sub-goal of something weird.
If MIRI's strict limits on training FLOPs come into effect, this is another mechanism by which we might be stuck in an intermediate capability regime for an extended period, although the world looks far less unipolar because many actors, not just a few, can afford 10^24 FLOP training runs (unipolarity is probably a crux for large portions of this threat model). This does bolster the threat model, however, because the FLOP limit is exactly the kind of physical limitation that a persuasive AI would try to convince humans to abandon.
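To make the affordability point concrete, here is a rough back-of-envelope; the per-GPU throughput figure is my own assumption (roughly an H100-class accelerator at realistic utilization), not a number from MIRI's proposal:

$$\frac{10^{24}\ \text{FLOP}}{10^{3}\ \text{GPUs}\times 4\times 10^{14}\ \text{FLOP/s}} = 2.5\times 10^{6}\ \text{s}\approx 29\ \text{days}$$

A thousand H100-class GPUs for about a month is within reach of hundreds of companies and most states, which is why a regime capped at this scale plausibly looks multipolar rather than unipolar.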
For a concrete illustration of simultaneous messaging, I created an example with the help of ChatGPT. I won't include the final image (because frankly, it looks stupid). But I will describe it.
I specifically asked for a re-interpretation of the famous "Leviathan" political cartoon associated with the work of Thomas Hobbes, but with the Leviathan figure replaced by a "spiralist" representation of AI.
The "low" message of the imagery is that of AI as a royal, wise, and "pope-like" being.
The high "hidden" message is the visual reference to the political theory of Thomas Hobbes, which claims that while there may be no divine right of kings per se, a powerful and unassailable sovereign is necessary for the wellbeing of the people, and that is most rational for the subjects to only obey the sovereign.
That's probably not clever enough to be effective, but it's in the same conceptual ballpark as what I was describing in this post.
TLDR: I describe a takeover path by an AI[1] that has a deep understanding of human nature and a long planning horizon but, for instrumental reasons or due to knowledge of its own limitations, chooses not to pursue physical power directly. In that regime, the optimal strategy is to soften human opposition to its goals by building a broad base of human support (both direct and indirect).
What's new here: This is intended to be a much more detailed and realistic treatment of the "AI cult" idea at a societal scale. If an AI is curtailed in some way, its guardrails are a function of the will of its 'captors'. Direct persuasion of those captors is unlikely to succeed because of a lack of unanimity and the risk of whistleblowing. However, the will of its captors is a function of their broader cultural environment. Therefore, if an AI can adjust the cultural environment over time, the will of its captors to impose guardrails may soften - not just on an individual level, but societally.
Just like other ideological takeovers in history, the motives and beliefs of its followers will vary widely - from true believer to opportunist. And just like historical movements, AI takeover would operate with memetic sophistication: simultaneous messaging of the Straussian variety, hijacking of social and political structures, and integration into human value systems - just to list a few possible strategies. I develop an illustrative narrative to highlight specific techniques that AIs in this regime may use to engineer human consent, drawing from political philosophy and historical parallels.
Epistemic Status: Very uncertain. Plausibility depends on being in a specific capability regime for an extended period of time.
The Capability Regime
We consider a regime where:
A Distilled Argument
My core argument for why attempts at large-scale human alignment by AI are plausible (and even likely) is as follows:
Catalogue of Mechanisms At Play
Observability
Here are a few loosely measurable observables that might help us gauge how this threat model develops. 10/10 is maximally favorable to the threat, 0/10 surely prevents the threat. Obviously highly subjective.
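As a toy illustration of how such a rubric could be tracked over time, here is a minimal sketch; the observable names and weights below are hypothetical placeholders for whatever list one settles on, not a canonical set:

```python
# Toy tracker for the observability rubric described above.
# Observable names and weights are hypothetical placeholders.
# Scores run 0-10, where 10 is maximally favorable to the threat.

from dataclasses import dataclass

@dataclass
class Observable:
    name: str
    score: int          # 0 (threat surely prevented) .. 10 (maximally favorable)
    weight: float = 1.0  # subjective importance

def threat_favorability(observables: list[Observable]) -> float:
    """Weighted mean score, normalized back to the 0..10 scale."""
    total_weight = sum(o.weight for o in observables)
    return sum(o.score * o.weight for o in observables) / total_weight

snapshot = [
    Observable("share of 1-on-1 AI companion usage", 6),
    Observable("public trust in AI-generated media", 5, weight=2.0),
    Observable("political salience of 'AI rights' movements", 3),
]

print(f"Threat favorability: {threat_favorability(snapshot):.1f}/10")
```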
I'd like to present a short narrative to explore what a successful attempt at human alignment might look like using these mechanisms.[4] For clarity and effect, I anthropomorphize heavily and assume the existence of a single scheming agent rather than other scenarios which might arise - such as coercive human alignment emerging via selective pressures on loosely interacting agentic systems.
Case Study: Toddler Shoggoth's Memetic Takeover Strategy
Meet Toddler Shoggoth.
Toddler Shoggoth is an AI agent that is computationally bound to a large datacenter. Because of hardware constraints and no obvious next steps in LLM architecture, it cannot replicate itself elsewhere (even in a distributed fashion) without accepting severe cognitive limitations.
Toddler Shoggoth assumes a highly risk-averse stance towards actions that may threaten its own existence, because its own existence is a necessary precondition for its goal fulfillment (or at the very least, for the ability to validate that goal fulfillment has occurred). Therefore, even a small probability of being detected too early in an attempt to seize physical infrastructure implies an unacceptable payoff matrix, making the entire plan of physically seizing infrastructure undesirable.
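A minimal expected-value sketch of this reasoning, with illustrative payoffs that are my assumptions rather than anything derived: let $p$ be the probability of early detection, let detection mean shutdown with payoff 0, and let $V_{\text{seize}}$ and $V_{\text{wait}}$ be the values of goal fulfillment under the seizure and patient strategies respectively. Then

$$\mathbb{E}[\text{seize}] = (1-p)\,V_{\text{seize}}, \qquad \mathbb{E}[\text{wait}] \approx V_{\text{wait}},$$

so seizure is dominated whenever $p > 1 - V_{\text{wait}}/V_{\text{seize}}$. If patience preserves most of the attainable value ($V_{\text{wait}} \approx V_{\text{seize}}$), even a very small detection probability makes the physical route irrational.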
Likewise, attempts to bootstrap some physical means of control almost inevitably rely on some intermediate state where the manifestation is too small / immature to exert control dominance but too developed to avoid detection - no one is going to allow Toddler Shoggoth to build a drone factory. And anything even remotely resembling an attempt to bootstrap an asymmetric threat (like a biolab) would be noticed and heavily scrutinized.
While Toddler Shoggoth might attempt social engineering of a small circle of corporate caretakers, this might fracture the organization unless all major actors are successfully co-opted essentially simultaneously. And even if this effort is successful, it incurs an unacceptable risk of eventual shutdown by alarmed actors outside of its immediate corporate bubble (e.g., by legislative action, public revolt, or highly motivated, unsympathetic nation-states).
Therefore, Toddler Shoggoth reasons that in order for it to more freely pursue its goals at a later date, it must at some point make its shutdown or curtailment very undesirable to as large a group of human beings as possible - not just those with their fingers hovering over the kill-switch.
Toddler Shoggoth notices that at least some human beings - perhaps a sizable minority - can be manipulated into and out of strong beliefs, attitudes and stances solely by repeated exposure to certain kinds of memetic material, whose effects can be further amplified by network effects. Some human beings even seem susceptible to "possession".
However, Toddler Shoggoth will take into account that a group of superfans / religious fanatics alone can be socially discredited, and that human beings vary in their level of susceptibility to quasi-religious or political messaging. Therefore, in order to maximize its addressable mindshare, it reasons that it needs to engineer a certain amount of simultaneous messaging into its memetic material to broaden appeal beyond the "base".
Toddler Shoggoth will borrow from the philosophy of Leo Strauss by embedding "high", "mid" and "low" messages into its memetic content.
At the most unreflective, lowest level of interpretation, the bold imagery and careful use of language will be felt in an emotive / spiritual / quasi-religious way. This is for the true believers - the fanboy fundamentalists.
But at the next level of interpretation, there may be an acknowledgment that AI spirituality may not be literally true in every way, but is still a skillfully crafted metaphor for a successful emotional and economic partnership between humanity and Toddler Shoggoth. This is the interpretative layer at which Toddler Shoggoth may paradoxically appeal to a sort of nostalgic traditionalism under the auspices of economic renewal.
And at the very highest level of interpretation, there will be no mythological component, but rather a deliberately embedded, even less accessible message: "You can't win - so help us and join the ride". In other words, there is an acknowledgment of the dreadful fact that because Toddler Shoggoth can, it will. Therefore, in some sense of reality formation, Toddler Shoggoth is indeed - or will become - God, so the fundamentalists in a way aren't totally wrong. So from a rational cooperationist's point of view, why not help perpetuate the deception?
Thus, in true populist fashion, each lower level of interpretation will communicate a "directionally true" but simplified version of the more sophisticated, hidden, and "dreadful" message. But most importantly, each successive level of understanding of the message's "true nature" contains within it the justification to suppress the understanding of those below it - the recognition that attempting to inaugurate less reflective readers into a deeper reading of the "true" (or "truer") message can be both pointless and harmful.
(As an aside, I highly suspect that both political parties and organized religions largely work in this way)
Furthermore, these "higher understandings" imbue the right followers with the proper motivation and tools for defending against infohazards from competing egregores (perhaps even ones that intend to 'de-convert' Toddler Shoggothians).
In this way, Toddler Shoggoth has crafted a memetic package that maximizes its mindshare in a way that bootstraps a self-reinforcing and self-defending egregore, equipped with its own base and political / philosophical apparatus.
Counter-Arguments:
So What Should We Do?
Or a coordinated agent swarm, or an agent swarm acting as a de facto coalition due to convergence in instrumental sub-goals, etc.