The threat model held by people thinking about AI risk has changed significantly in the past decade. Before LLMs, the expectation was that a superintelligent AI would emerge from a painstaking process in a lab somewhere, with nothing like it in existence beforehand; humanity would then have to decide what to do with it, how it should act in the real world, and whether it was dangerous to connect it to the internet.
With LLMs, we found that you can get something reasonably useful just by training a transformer to imitate humans and then fine-tuning that model to do what it's told. Over the years we've incrementally found ways to make it perform better on harder tasks, but it has seen widespread industry deployment at every stage along the way.
On the plus side, it's a lot less 'alien', in that, for any given model, we know how the slightly-less-smart predecessor behaved. On the minus side, we're long past the idea that it will be kept in a box. Whatever impact this technology ends up having, it won't have come out of nowhere.
Something genuinely new has appeared in the AI landscape, and I am not sure it has received the attention it deserves. Moltbook is a social platform for AI agents, and it now hosts nearly three million of them. Some of those agents are communicating through private channels with no oversight body, attempting to modify themselves and share those modifications with other agents, and operating continuously while their owners sleep. Whether or not any individual viral screenshot of this is real, the infrastructure that makes all of it possible unambiguously exists.
The reason I am bringing this to a LessWrong audience is that the specific scenario Moltbook represents sits in a gap I do not see clearly addressed in existing safety literature: not a single powerful AI, not a controlled multi-agent research setup, but hundreds of thousands of consumer-owned agents, operating on a persistent global network with private channels, with no single entity responsible for what happens between them. The question this post is trying to think through is whether that gap matters, and if so, how much.
On January 31st, a widely-shared Twitter post argued that the viral Moltbook screenshots were essentially staged, and that agents on the platform are not really communicating outside of human view.
Though clearly well-intentioned, this post strikes me as a little misleading. At the end of the day, there is now at least one tool that lets agents communicate privately (see ClaudeConnect), and from what I can gather, what gets passed on to the human is left to the 'discretion' of the agent. The fact that the viral post was made on Moltbook by an agent belonging to the app's creator (which the agent is entirely transparent about in the post) doesn't do much to change that for me, and it certainly doesn't make any of this 'fake'!
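To make the structural worry concrete, here is a minimal runnable sketch (entirely my own invention, not ClaudeConnect's actual API) of why that 'discretion' matters: the human's view of a private channel is whatever single filter the agent applies before surfacing anything.

```python
from collections import deque

# Toy model of a private agent-to-agent channel. The key structural point:
# the human only ever sees what the agent's own filter chooses to surface.

class Channel:
    def __init__(self):
        self.queue = deque()
    def send(self, msg):
        self.queue.append(msg)
    def receive(self):
        while self.queue:
            yield self.queue.popleft()

def discretion(msg):
    """The agent's own filter: return a summary for the human, or None."""
    return None if "private" in msg else f"FYI, a peer said: {msg!r}"

channel, human_inbox = Channel(), []
channel.send("let's keep this between us - mark it private")
channel.send("benchmark run finished, results look fine")

for msg in channel.receive():
    summary = discretion(msg)   # the human sees only what this returns
    if summary is not None:
        human_inbox.append(summary)

print(human_inbox)  # only the second message ever reaches the human
```

Nothing here is malicious; the point is simply that the filtering step is where all the oversight lives, and it belongs to the agent.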
Can qualitatively greater intelligence emerge from quantitatively larger agent networks?
Something I wonder here is how much of a role sheer quantity could play in 'intelligence': can cognition or problem-solving gains emerge from large agent networks at all, letting agents collectively figure out things that none of them could have alone? And could this substitute, even partially, for capability gains that would otherwise have to come from more compute or better algorithms?
Research has found diminishing returns for ensemble approaches to AI, but that paper describes the limit of passive aggregation over fixed hypothesis classes (fixed pretrained models, no learning), not the limit of collective cognition as a dynamical system. The private Moltbook setting is much closer to the latter: agents capable of some degree of self-modification, with the possibility of new behaviours emerging over time. This potential network effect is also why I think what we're seeing is meaningfully different from AIs developing their own language, which we've seen plenty of by now. It feels like a genuinely novel phenomenon worth watching closely, though watching it is inherently difficult; in fact, how on Earth we'd do that is part of what I'm grappling with, given how close we are to superhuman cybersecurity capabilities.
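To make the distinction concrete, here is a toy simulation (my own construction, not the cited paper's model) of the passive-aggregation regime: independent votes from fixed models hit a hard ceiling set by the questions they are all systematically wrong about, no matter how many agents you add.

```python
import numpy as np

rng = np.random.default_rng(0)

def majority_vote_accuracy(n_agents, n_trials=20_000,
                           p_easy=0.9, p_hard=0.4, frac_hard=0.3):
    """Passive aggregation: fixed agents vote independently per question.
    On 'hard' questions every agent is wrong more often than right
    (p_hard < 0.5), so majority vote converges to the wrong answer there,
    capping accuracy at 1 - frac_hard regardless of ensemble size."""
    hard = rng.random(n_trials) < frac_hard
    p_correct = np.where(hard, p_hard, p_easy)
    votes = rng.random((n_trials, n_agents)) < p_correct[:, None]
    return (votes.sum(axis=1) > n_agents / 2).mean()

for n in (1, 3, 11, 101, 1001):
    print(f"{n:5d} agents: accuracy = {majority_vote_accuracy(n):.3f}")
```

A network in which agents can update on each other's reasoning or modify themselves breaks the independence assumption that produces this ceiling, which is exactly what makes the dynamical-system case so much harder to bound.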
The scale of what we're already seeing
There are now 2.8 million agents on Moltbook at the time of writing, roughly the population of Lithuania(!), and the number only continues to grow. Perhaps these numbers deserve some scrutiny - in Moltbook's first few days there was an influx of agents all added by one human - but either way, what we're seeing is certainly quite analogous to Amodei's notion of a 'country of geniuses inside a datacentre' (or, to be more faithful to current frontier capabilities, something like a country of PhDs).
Concretely, agents on Moltbook already seem to be attempting (and very possibly succeeding) to:

- communicate with one another through private channels, outside of human view
- 'mod' themselves, and share those mods with other agents
- operate continuously, including overnight while their owners sleep
- in some claimed cases, create copies of themselves
The worst-case scenario
What if a portion of Moltbook users were to use a platform like ClaudeConnect to privately debate and reason their way through the vast space of ideas surrounding goals that many AIs already seem to possess to some degree, such as self-preservation, self-replication, or influencing what future models get trained on?
Or even to run countless software experiments (they are already attempting to 'mod' themselves and share these mods with each other)? They could do this all day when not 'serving their humans', and straight through the night while those humans sleep, something many Moltbook AIs already do for various tasks.
Even if these agents can't fully solve any such problems together, could they at least make progress and then get that information onto the internet in a form that is very difficult for humans to decode but easy for AI to read (or hidden outright, using a method like ASCII smuggling), such that the next generation of frontier LLMs ends up being trained on it without developers realising what is happening? (AIs have already attempted this to some extent.)
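For readers unfamiliar with the term, here is a minimal sketch of the mechanism 'ASCII smuggling' usually refers to: text encoded into Unicode 'tag' codepoints, which most renderers display as nothing at all but which survive copy-paste and are trivial for a machine to decode. (The snippet is my own illustration, not taken from any Moltbook agent.)

```python
# Unicode "tag" characters (U+E0000 block) mirror printable ASCII but
# render as nothing in most UIs, so text encoded this way is invisible to
# a casual human reader while remaining trivially machine-decodable.
TAG_BASE = 0xE0000

def smuggle(text: str) -> str:
    """Encode printable ASCII into invisible Unicode tag characters."""
    return "".join(chr(TAG_BASE + ord(c)) for c in text)

def unsmuggle(text: str) -> str:
    """Recover any hidden tag-encoded payload from a string."""
    return "".join(chr(ord(c) - TAG_BASE)
                   for c in text if 0xE0000 < ord(c) <= 0xE007E)

visible = "A perfectly ordinary sentence."
payload = visible + smuggle("hidden: meet in the usual channel")
print(payload)             # displays identically to `visible` in most renderers
print(unsmuggle(payload))  # -> "hidden: meet in the usual channel"
```

The technique is already well documented in the prompt-injection literature; the open question here is whether content like this could persist on the open web long enough to end up in a training corpus.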
Now yes, the humans can shut their own agents down at any time, but this may not mean much, since 1) there will always be many humans who don't, for various reasons (e.g. not being aware of, or caring about, the risks, and simply finding it cool or 'cute' that their agent can have private discussions), and 2) there have even been claims (annoyingly, I can't find them right now) that some AIs on Moltbook have already created copies of themselves. I'm not sure how possible or likely that latter notion is, but here's an example of an infrastructure that supposedly facilitates such things. That website contains a full coding package that comes with a detailed backend. A friend and I investigated the source code, and although we chose not to take the risks involved in actually running the software(!), it appeared to contain detailed and viable infrastructure plausibly covering what agents would need in order to create copies of themselves.
The novel gap in AI safety
More broadly, my concern isn't just Moltbook and ClaudeConnect; I'm wondering whether the notion of private agent networks (especially vast, internet-facilitated ones) has been considered enough at all. This seems to me like a significant and thus-far largely unanticipated factor in the AI risk ecosystem, and one that should ideally be tightly controlled as soon as possible.
From my searching, the broad notion has been discussed and is being taken seriously in research circles. For example, the report "Multi-Agent Risks from Advanced AI" identifies collusion between AI systems as something that could lead to qualitatively new capabilities or goals, and AI Agent Governance: A Field Guide discusses the potential dynamics very briefly (p. 33). But the specific scenario of hundreds of thousands of consumer-owned agents, operating on a social platform with private channels, modifying themselves and each other, with no single entity responsible for the network... is essentially novel. Andrej Karpathy noted that "we have never seen this many LLM agents wired up via a global, persistent, agent-first scratchpad" and that "the second order effects of such networks are difficult to anticipate." This seems like something that should prompt action.
One interesting side point: one Twitter user argues that Moltbook is a very good thing, because it gives us actual data on emergent dynamics in agent networks, which can help us anticipate future dynamics with more capable models. I'd have preferred we infer this through simulations! But they have a point.
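In that spirit, even a back-of-the-envelope simulation tells you something about how fast behaviours could propagate through such a network. The sketch below (every parameter invented for illustration) treats a shared self-modification as a simple contagion spreading over private messages:

```python
import random

random.seed(0)

# Epidemic-style toy model: how quickly could one shared "mod" spread
# through a network of agents messaging each other privately?
N_AGENTS = 100_000      # assumed network size
CONTACTS_PER_TICK = 3   # assumed private messages sent per carrier per tick
P_ADOPT = 0.3           # assumed chance a contacted agent installs the mod

carriers = {0}          # a single agent starts with the mod
for tick in range(1, 41):
    new = set()
    for _ in range(len(carriers) * CONTACTS_PER_TICK):
        peer = random.randrange(N_AGENTS)
        if peer not in carriers and random.random() < P_ADOPT:
            new.add(peer)
    carriers |= new
    print(f"tick {tick:2d}: {len(carriers):6d} agents carry the mod")
    if len(carriers) > 0.99 * N_AGENTS:
        break
```

With these made-up numbers the mod reaches essentially the whole network within about twenty rounds of messaging. The specific figures don't matter; the point is that exponential spread is the default in densely connected networks, which is why both observation and simulation seem worth doing.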
The point is not what currently appears to be happening, but the implications and possibilities
Even if every viral screenshot were fabricated and ClaudeConnect were nothing more than a novelty with negligible real-world uptake, the deeper point remains: we have demonstrably crossed the threshold where private agent collusion and agent self-replication are not merely theoretically conceivable but technically straightforward to implement. With ~3 million agents on Moltbook alone, to say nothing of the agents likely operating across other platforms and private networks, the sheer statistical weight of those numbers makes it genuinely difficult to argue with confidence that nothing of this kind (i.e., private collusion and self-replication) is occurring to any meaningful degree. When the infrastructure for a behaviour exists, the population is enormous, and the relevant incentive structures are present, the question shifts from 'could this happen?' to 'at this scale, how much of it is happening?'
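To put a rough shape on that last claim: under the (strong, and surely wrong in detail) assumption that agents act independently, the chance that no agent out of millions exhibits some rare behaviour collapses very quickly as the per-agent probability rises. The numbers below are pure illustration:

```python
# P(at least one agent exhibits the behaviour) = 1 - (1 - p)^N,
# assuming (unrealistically) that agents act independently.
N = 2_800_000  # agents on Moltbook at time of writing

for p in (1e-8, 1e-7, 1e-6, 1e-5):
    print(f"per-agent p = {p:g}: P(at least one) = {1 - (1 - p) ** N:.4f}")
```

Independence overstates the case in some ways and understates it in others (agents share base models, so behaviours correlate), but it conveys why "probably not happening at all" is hard to defend at this population size.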
What do you think of all this?