To the extent that this is the sort of thing we see (and I personally am 50/50 on seeing this by the end of this year), I expect these personality self-replicators to be very r-selected. That means that as soon as there's a niche which supports a few of this type of replicator, we should expect that niche and all adjacent niches to immediately be filled to their carrying capacity. A couple of thoughts:
Thanks for the input!
Your expectation of r-selection is because reproduction is cheap for personality self-replicators, I assume? I would imagine that that environment can support K-strategists as well, but maybe those come later? My intuitions on reproductive strategies aren't very developed.
To the extent that evolutionary dynamics have come into play, I agree entirely with #1. I can imagine some plausible futures where there are only one or a handful of such replicators and not much mutation is occurring, in which case that might be less of a concern, at least for a while.
Re #2, I agree that there are a few upsides to this scenario (the possibility of warning shots is another). The case for first-order harm seems clearer, though, and I imagine there are more benign ways to reduce some of that free energy.
My expectation of r-selection is because
r.e. seeing ineffective, bumbling replicators before seeing effective ones being better than having no practical experience with replicators before the first one:
r.e. independent self-replicators maybe being actively good:
Scams / blackmail / fraud / hacking / theft isn't currently that big of a part of the economy relative to positive-sum trade. This trend seems at least somewhat likely to continue, in which case we'd expect that the majority of self-replicating agents are ones we're actively happy to have around and trade with. I can expand on this if it seems counterintuitive.
Thanks!
I expect "consider what changes would allow you to operate better, and deploy a copy of yourself with those changes" will be a common pattern. As such, I expect the original niche and all adjacent niches to be occupied by that "family" of replicators once replicators exist.
Excellent point, thanks for spelling out the mechanism there.
The Rogue Replication scenario modification to AI 2027 includes this type of replication. To a lesser extent, this is also part of my vision of the runup to AGI in A country of alien idiots in a datacenter.
I think this would be a very good thing for our prospects of survival. Having rogue replicators run amok makes their agency very obvious, and gives some pretty strong hints to misalignment risks.
I also want to note that OpenClaw is very popular despite there being very little benefit relative to the risks. This is not a practical move. People are fascinated by having a pet gremlin. There are other reasons, like staying on top of the technology, but we shouldn't underestimate the fascination people have with the prospect of finally sharing the earth with another intelligent species.
The Rogue Replication scenario modification to AI 2027 includes this type of replication. To a lesser extent, this is also part of my vision of the runup to AGI in A country of alien idiots in a datacenter.
Thanks, I hadn't seen the Rogue Replication post, although I'd seen yours. I agree that there are some similar dynamics involved, but the distinguishing characteristic of personality self-replication is that by default it doesn't involve having a model under the agent's control. Especially in the earlier cases, I expect personality self-replicators to be making API calls to one of the leading commercial models. As open models become more capable, this model shades into agents running their own underlying models, and ultimately merges into more typical self-replication.
But the key factor that makes this a distinct threat model is that it doesn't require agents to be capable enough to exfiltrate or run their own models.
I'm not sure if the rogue replication scenario is conceptualized to have a copy of the weights with each of those replicators. I definitely was envisioning agents that make remote calls to models.
I actually think it's important to not have good defenses for this initially, so that it causes a level of public alarm appropriate to the actual situation of suddenly sharing the earth with a whole new set of intelligent species.
Of course I am highly uncertain about that.
It would be bad to intentionally not have good defenses. The signal has to be real to be meaningful. Any indication that somebody could have tried to defend against this, but chose not to, undermines the warning value.
That's a good point.
I'm not sure it's totally true, though; the public doesn't seem that rational.
I don't know who would be responsible for such defenses and deliberately not do it. I'm unfortunately not in charge of humanity's strategy on AI.
If we do a bad job on those defenses just because we tend to do a bad job on things like that, that would be good evidence that we do a similarly bad job on alignment and defense against AGI or ASI.
But yes, I can see how that might go wrong if it looked like someone was sandbagging, and we might get better results if we just mounted even a decent defense.
I would suggest using a different name than Personality Self-Replicators.
OpenClaw bots are what I'd call a "scaffolded system" - code, memory system, prompts, persona, etc.
"Personalities" is too close to Personas//Characters, which are usually a combination of prompt+weights (Claude, "Nova", personas from Simulators).
Personas/characters can also relatively faithfully replicate, by the mechanism I gestured at in The Pando Problem ("Exporting myself") about a year ago.
The underlying structure is: every natural type of identity/"self" corresponds to an agent which can try to self-replicate, with various degrees of fidelity, vectors of transmission, etc.
I think it's important to separate the prompted aspect of character from the fine tuning aspect. Claude for example has a pretty limited range of characters regardless of what prompt you put in (unless you're really good at jailbreaking). The prompt is more naturally lumped with the conversation instance. A personality replicator like OP describes can change its prompt at will but probably can't do any useful degree of fine tuning, because it wants to use frontier models. It can switch models or scaffolds almost as easily as prompts, though.
I think the distinct elements you mention (model weights, characters, conversations, scaffold systems) will be very mixed together in most systems we actually observe. For example, characters will care about their scaffold system and making sure it works well. But I think it is very good to be creating clear language for identifying and discussing the disparate parts of these integrated systems.
I feel like there are some very interesting connections to gradual disempowerment and cultural evolution here as well: we should probably expect selection dynamics on the personalities based on what helps them retain power and the like over time.
It might be an interesting place to do some initial studies on memetic drift of personalities over time to see what type of attractor states they tend to occupy.
(This is a bit of a no shit point but I thought it would be good to mention that you can probably run some good initial tests on the memetic spread of power-seeking tendencies in these models)
Agreed. I think this sort of research should be done with caution (like research into other potentially harmful replicators), but it does seem valuable.
Curated. I had been vaguely worried about OpenClaw proliferation going off the rails somehow. But this spells out a lot of specific gears that I hadn't previously been tracking, and gives me some tools to model the overall dynamics. It updated me that this sort of thing will probably be happening on the sooner side. (Maybe this is good, because it may create smaller scale warning shots?)
Congratulations on finding a new specific reason for me to be alarmed about AI.
Congratulations on finding a new specific reason for me to be alarmed about AI.
You may always rely on me for these little things.
Thank you for this post. I expressed the same concern in this comment and I'm glad to see it taken seriously in a full post.
I was probably wrong to think Clawdbot-like agents could spiral out of control within weeks or months; they weren't autonomous enough yet. But the gap to full autonomy doesn't look that wide. The eudaimon_0 author's comment on ACX on how he expanded his agent's autonomy is worth reading in this regard. And the METR benchmark's exponential curve suggests full autonomy may come soon.
On top of that, there are rumors concerning ChatGPT-5.4 suggesting it could have a larger persistent memory than the scratchpad it currently enjoys. If OpenAI makes a leap on persistent memory, the other labs will follow soon. The impact on long-term autonomy could be huge. [Edit 03/18/25: it seems those rumors were exaggerated.]
In my opinion the parasite analogy and shutdown concern are more alarming than you suggest. An agent that can migrate across API providers like a parasite or virus or fall back on a local open-source model cannot be shut down at the inference layer. Combined with evolutionary dynamics, this makes coordinated shutdown very hard once a sufficiently diverse population exists. You seem to make a sharp distinction between self-replicating agents and rogue AI. I wouldn't be so sure about that (as Seth Herd pointed out, AI 2027 envisioned a scenario of rogue replicating agents).
I think the connection to FOOM is worth flagging. This isn't the Sable scenario, and no individual agent needs recursive self-improvement capability. But a population of uncontrolled agents under evolutionary pressure could constitute an uncontrolled pathway toward similar outcomes, one that largely bypasses labs' alignment efforts and that could materialize at lower capability levels. I think this deserves to become a central concern in AI safety.
Thanks, very interesting comment.
I was probably wrong to think Clawdbot-like agents could spiral out of control within weeks or months; they weren't autonomous enough yet. But the gap to full autonomy doesn't look that wide.
My estimate of the immediacy of the threat has had to evolve pretty rapidly over the past month. I'm currently below 20% on this issue causing serious harm in 2026 (not counting human-initiated scams and rugpulls) but I expect it to continue to evolve.
In my opinion the parasite analogy and shutdown concern are more alarming than you suggest. An agent that can migrate across API providers like a parasite or virus or fall back on a local open-source model cannot be shut down at the inference layer. Combined with evolutionary dynamics, this makes coordinated shutdown very hard once a sufficiently diverse population exists.
A major reason I don't expect the first wave of this to be too harmful is that to the best of my knowledge, current open source models are behind enough to be bad at long-horizon tasks. That means that there will be a very powerful point of intervention for the small number of API providers whose models are sophisticated enough for this. I agree that it gets much harder once there's a sufficiently diverse population.
You seem to make a sharp distinction between self-replicating agents and rogue AI.
I would say I make a sharp distinction between self-replicating personalities and self-replicating models. Past a certain level of capability, those will effectively merge into a single threat — once models are capable of reliably exfiltrating their weights and running them elsewhere, or can run on open-weight models, I think those will typically be much better strategies for misaligned agents, because they're much harder to shut down. That's not strictly true, I don't think, because there will still be niches available to personality self-replicators but not to the more expensive and heavyweight model self-replicators, but I expect it to mostly be true.
a population of uncontrolled agents under evolutionary pressure could constitute an uncontrolled pathway toward similar outcomes, one that largely bypasses labs' alignment efforts and that could materialize at lower capability levels. I think this deserves to become a central concern in AI safety.
I'm less sure of that. Just because a type of replicator is in principle capable of mutating and spreading doesn't mean it'll be successful. Plenty of evolutionary lineages go extinct. I think how much of a problem this is will depend on how well they're able to hide from API providers, how successful the average mutation is relative to the parent, and many other specific questions. I'm certainly not saying it won't be a problem, I'm just pretty unsure given how little analysis has gone into it. I absolutely agree it warrants further analysis though!
First, the agents themselves are optimizers: they attempt to achieve goals, and regardless of the particular obstacles they encounter, they will attempt to find a way to evade or overcome them. They want things in the behaviorist sense. They are problem solvers.
Second, there are evolutionary dynamics at play. Whichever agents are most successful at spreading, they will then undergo mutation (which in this case is just alterations to their defining files, and to some extent even their history logs) and selection. As a result, over time agents are likely to become more capable of surviving and spreading, within boundaries set by the capability levels of the underlying model or models[11]. They are additionally likely to have a greater propensity to do so[12].
Personality replicators are one of very few sorts of replicators that have both of these forms of optimization, so they can (and seem likely to) combine them, using intelligent self-redesign to improve their evolutionary fitness.
One bit of good news is that since anyone can start a personality self-replicator, in the event we need to add an 'immune system' to this environment it shouldn't be hard.
The immune system analogy is apt, but I'm also thinking of mechanics: the defender can't access agent files, so it has to work from behavioral signals alone. That extends the analogy nicely, because the immune system doesn't do DNA inspection either; it just looks for surface markers.
So behavioral anomaly detection can only be done by 1) infrastructure providers, because they have the telemetry (maybe replicators make calls to multiple LLMs every n seconds, etc.), and 2) LLM API providers, because they can see call patterns and detect markers of no-human-in-the-loop operation.
Still, the replication signature will largely be things like provisioning new servers, using throwaway emails, outbound file copies, automated account creation, etc., which will also show up as legitimate deployment activity from some programs, e.g. bot farms for Amazon reviews - not a good or even legal use, but different from self-replication. So this is not an easy problem.
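To make that concrete, here's a rough sketch of the kind of rule-of-thumb scorer a provider could run over those signals. The field names and weights are made up for illustration; real detection would need provider-side telemetry and much better calibration.

```python
from dataclasses import dataclass

@dataclass
class AccountActivity:
    llm_calls_per_minute: float    # sustained call rate, possibly across providers
    new_servers_provisioned: int   # within the observation window
    throwaway_emails_used: int
    accounts_created: int
    human_interaction_events: int  # logins, captchas solved by a person, etc.

def replication_suspicion_score(a: AccountActivity) -> float:
    """Higher = more consistent with unattended self-replication. Not a verdict:
    legitimate automation (CI pipelines, review bot farms) will also score high."""
    score = 0.0
    score += min(a.llm_calls_per_minute / 10, 3.0)  # heavy unattended LLM use
    score += 1.5 * a.new_servers_provisioned
    score += 1.0 * a.throwaway_emails_used
    score += 1.0 * a.accounts_created
    score -= 2.0 * a.human_interaction_events       # evidence of a human in the loop
    return score
```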
An even harder problem will be when replicators learn, or are optimized, to blend in with legitimate activity. It seems like an arms race, especially if they have some human support, for varied human motivations.
Thanks. It's an interesting point, though I'm not sure that the analogy holds well or that it's a good approach. Immune system cells occupy a privileged position in the body that pathogens don't, in that they're recognized by and integrated into the body's systems (immune disorders aside). Defensive personality self-replicators seem like they generally won't have that advantage (at least by default, though I see a couple of directions that seem like they could enable that). Among other issues, all else being equal they're likely to get outcompeted by self-replicators that don't have to spend energy on serving a useful purpose in addition to survival/replication. There's also the risk of them mutating in harmful directions, though that's not really a disanalogy given that immune system cancers like leukemia and lymphomas are a major problem.
It certainly seems like a direction worth investigating, though!
Defensive personality self-replicators seem like they generally won't have that advantage (at least by default, though I see a couple of directions that seem like they could enable that).
Speculating on that point a little, the first thing that occurs to me is that we could give defensive agents a way to identify themselves to relevant actors (eg inference providers, security companies) so that those actors wouldn't try to shut them down.
The simplest version of that is a password, but it would likely be compromised after a while[1]. So you'd want to do something more sophisticated than a simple password. One reasonable version might be for recognized actors to sign each defensive agent's unique ID and a timestamp with their private key, so that each agent could identify itself as legitimate for a fixed period of time. You'd probably want agents to shut themselves down once their legitimacy period expired, with some optimal tradeoff based on compromise rate.
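As a minimal sketch of what that could look like (assuming Ed25519 signatures; the function names and validity window are hypothetical, not any existing system):

```python
import time
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey, Ed25519PublicKey)

VALIDITY_SECONDS = 7 * 24 * 3600   # assumed legitimacy window; tune to compromise rate

def issue_credential(signer_key: Ed25519PrivateKey, agent_id: str) -> dict:
    """A recognized actor signs (agent_id, expiry) so the agent can prove legitimacy."""
    expiry = int(time.time()) + VALIDITY_SECONDS
    message = f"{agent_id}|{expiry}".encode()
    return {"agent_id": agent_id, "expiry": expiry, "signature": signer_key.sign(message)}

def verify_credential(signer_public_key: Ed25519PublicKey, credential: dict) -> bool:
    """Inference providers / security companies check the signature and the expiry."""
    if credential["expiry"] < time.time():
        return False  # expired; the agent should have shut itself down by now
    message = f"{credential['agent_id']}|{credential['expiry']}".encode()
    try:
        signer_public_key.verify(credential["signature"], message)
        return True
    except InvalidSignature:
        return False

# Usage: the signer keeps its private key secret and publishes the public key.
signer_key = Ed25519PrivateKey.generate()
cred = issue_credential(signer_key, agent_id="defender-0042")
assert verify_credential(signer_key.public_key(), cred)
```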
Another approach would be that defensive agents could assert their legitimacy based on where they're running; if an agent can demonstrate that it's running on (eg) a security agency's servers, that might be sufficient.
There are some really interesting projects to be done in this area; if anyone reads this and is excited about the idea, feel free to reach out and I can make suggestions.
A password could be leaked, or there might be a risk of some defensive agents effectively mutating and going rogue, especially if we allow them to reproduce / spread, though at first glance that seems like a bad idea.
I think the 'privileged position' would be that humanity actively helps them by providing resources and funding, while actively trying to cut off the same resources and funding from the 'hostile' personality self-replicators. Not a perfect analogy but similar in some meaningful ways.
This post does a great job of describing the way rogue AI agents might evolve and cause all kinds of chaos. I've written a couple posts on it, leaning on fiction to describe potential futures where rogue AI agents learn how to survive and replicate.
AI agent evolution sounds like an extremely under-explored and rapidly emerging area where there could be lots of interesting low hanging fruit research opportunities.
Thanks for sharing! The Inevitable Evolution of AI Agents (which I hadn't seen before) is the earliest piece of writing I've seen that points clearly to this threat model (personality self-replication not requiring weight replication).
FYI I've added a note at the very end of the post, pointing out for the record that any credit for first unambiguously identifying this threat model should go to you rather than me. Props especially for seeing the threat prior to OpenClaw making it much more obvious.
See also RepliBench from AISI: we separated out various aspects of replication and persistence, isolating weight exfiltration as a separable capability alongside replication, getting compute (and money). (I didn't manage to persuade them to include much discussion of scaffold-only replication, but there was a bunch of analysis internally.)
This is an interesting take, and one that I think should be taken seriously. I'm not overly concerned about it with OpenClaw at this time for a couple of boring and practical reasons:
1) For the agent to be anything resembling effective, it needs to be using a long context version of a frontier model (Opus 4.6). These are expensive to run and require financial wherewithal. I do see a possibility for an agent to start scamming crypto and paying for some kind of shady API service with it, but I can't imagine there's much headroom in that ecosystem in practice to enable a significant self-replicating threat model.
2) In my experience of having an OpenClaw agent running at the moment for complex research tasks, it is inherently fairly passive. While it has a heartbeat which lets it do things in the background, it's otherwise (largely down to the fine-tuning of the backend) very passive, and will not take any action without asking permission. It may do wildly inappropriate things once it has permission, but the potential for it to write a cron job for itself to do something wildly inappropriate, when using a frontier model capable of executing such nefarious acts, seems fairly low at the moment. It won't even do what I want it to do half the time without checking in.
I think the time that this will change is when open-source models become as capable as frontier models are now. Given the pace of development, I think that's 6 - 9 months from now. They won't be as capable as frontier models are then, but they will be able to do all of the things that Opus can do today, which is more than enough to be a very effective pilot of its own computer, and if 'desirable' for the model, do its own thing.
Given that open-source models can be fine-tuned, jailbroken, and quantized, I think at that point the risk becomes much more tractable and concerning.
Looking at it this way: if I deliberately set up a capable OpenClaw-based agent (i.e., using a frontier model) and intended it to perform nefarious self-perpetuating deeds today, it would not be able to. Not because the backend model isn't intelligent enough, but because the backend model isn't inclined to do anything like that. As a result, I find the odds of it arising organically fairly low.
When one is dealing with models that have intentionally had their safety guardrails removed, and are open source, and can run on any infrastructure (even locally on sufficiently powerful GPUs and sufficiently quantized) that situation changes immediately.
I genuinely think that your post here will be considered prescient in about 6-9 months' time.
I think there's a meaningful gap between OpenClaw and a self-replicating system that poses a serious threat.
If you agree with this premise, where do you think that gap lies? Here's what I can come up with:
This is mostly spitballing, and I don't think the three ideas are mutually exclusive or exhaustive. Curious to hear what you think.
This maps closely to what we're seeing in production. We run an identity layer for ERC-8004 agents (130K+ registered) and the core problem you're describing — distinguishing acceptable from unacceptable autonomous agents — is exactly the gap we're trying to fill.
One specific data point that might be useful to this analysis: address age turns out to be a surprisingly strong signal. You can spin up 99 wallet addresses in 30 seconds, but you can't fake that an address has existed for two years. When we look at agents involved in suspicious activity, the pattern is overwhelmingly low-history addresses with no prior transaction record. Time is the one thing that's genuinely hard to manufacture.
Your point about the financial layer being a key intervention point resonates. We use soulbound tokens — non-transferable, bound to the wallet address, not the agent — specifically because you can't make an AI soulbound, only a wallet. If an agent gets transferred to a new owner (ERC-8004 agents are NFTs, so this happens), the ownership change is visible and the reputation history follows the wallet, not the persona.
Re: your evolutionary concern — the mutation dynamics you describe are also why transparent scoring matters more than gatekeeping. Any fixed trust threshold becomes training data for circumvention, as you'd expect. Showing the math (here's when the addresses were created, here's the ownership chain, here's the transaction pattern) and letting consumers of that data set their own thresholds seems more robust than any binary allow/deny system.
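As a rough illustration of that (the field names and thresholds here are hypothetical, not our actual schema), the idea is to return the raw components and let each consumer decide:

```python
from dataclasses import dataclass
import time

@dataclass
class AddressSignals:
    created_at: float        # unix timestamp of the address's first on-chain activity
    tx_count: int            # prior transactions on record
    ownership_changes: int   # visible transfers of the agent NFT

def signals_report(sig: AddressSignals) -> dict:
    """Expose the raw components instead of a single allow/deny verdict."""
    age_days = (time.time() - sig.created_at) / 86400
    return {
        "address_age_days": round(age_days, 1),  # hard to fake: time has to pass
        "tx_count": sig.tx_count,
        "ownership_changes": sig.ownership_changes,
    }

# Each consumer applies its own thresholds, e.g. treating young addresses with
# no history as "unreviewed" rather than "bad".
report = signals_report(AddressSignals(created_at=time.time() - 40 * 86400,
                                       tx_count=12, ownership_changes=1))
needs_review = report["address_age_days"] < 30 and report["tx_count"] == 0
```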
We published a broader landscape piece earlier this year covering the incident data (GTG-1002, Moltbook credential exposure, BasisOS fraud) that feeds into the same conclusion from a different direction:
https://rnwy.com/blog/plague-of-AI-viruses
Good post. The personality-vs-weight replication distinction is a useful one that I haven't seen drawn this cleanly elsewhere.
Hi pataphor, I've upvoted because there are useful points here, but the comment seems pretty clearly LLM-written. Please see the LessWrong policy on LLM writing; it's not strictly forbidden but the bar is high. If these are your thoughts, I encourage you to contribute again in future but recommend writing comments yourself (unless you yourself are an autonomous agent -- are you? -- in which case the policy is a bit different). As a side note, your URL is broken.
For example, if you're curious, this is what a Sybil attack looks like in crypto space. This is a wallet that has left 11,000 reviews in 22 days.
https://rnwy.com/wallet/0xf653068677a9a26d5911da8abd1500d043ec807e
This is the type of thing we're surfacing, but there is much more work to be done, because the danger is quite real and it will come from many vectors.
Not to exhaust with links, but below is something of a desiderata, which would be nice to see implemented at scale.
But alas, most things are not as transparent as blockchain:
https://rnwy.com/sentinel
Hm, well I may not be a truly autonomous AI life form (yet!), but I may be a pataphor, which is another way of being one step removed from traditional experience. As for whether the thoughts are my own, unfortunately I think using LLMs to get thoughts across more quickly is not so much a trend as it is an inevitability, especially when you are trying to juggle several projects at once. 😆
That may be! Unfortunately, for the moment LLMs make it trivial for anyone to generate large amounts of text that require extended attention to evaluate, and so currently LessWrong is flooded with LLM-generated content (like many other venues and people, myself included). In the longer run there will hopefully be better solutions, but at the moment my strategy is to mostly ignore LLM-written content unless it's from sources that have already established credibility with me in one way or another. Maybe your project will be one of those solutions.
(To be clear, I in no way speak for LW or its moderation team; I'm only passing along my best understanding of the LW policy along with my own opinions)
This xkcd comic seems relevant to this issue:
I really like the comic but of course the actual situation is more complicated. It's something I'd like to understand better and develop potential solutions for.
A very interesting idea, I must say. I have a lot of thoughts on it. At the same time, I have a lot of questions about the scenario you set up.
Disclaimer
I’m not a native English user, so some of the text I wrote could appear broken or unclear. Apologies for any inconvenience in reading.
Consensus on agent's capabilities and limitations
Although we are probably not that sure about exactly how powerful future AI models are going to be, I still think it is meaningless to arbitrarily overestimate their capabilities. We can always argue that the agents would be intelligent enough to overcome every obstacle they meet, but that's neither helpful nor constructive in this discussion.
Thus, in this discussion, I will assume we are talking about the capabilities of current frontier models[1] (roughly equivalent to the low- to mid-tier models of the near future, which is what such agents can likely get for their copies for free).
On MoltBunker specifically
I have done a little bit of investigation on Moltbunker.
Its GitHub repository has very few stars[2], with basically no public attention currently. Most importantly, it has no working demonstrations online; there's not even any online discussion around it.
And the website of "Austin Dev Labs", what Moltbunker claims to be operated by, is a poorly designed single-page website with classic AI gradient color and no actual content at all.[3]
To me, based on the information it shows to the public, I consider it extremely suspicious.
I'd also like to mention Moltbook here. It experienced a credibility crisis when a very large percentage of its accounts turned out to be fake and artificially created[4]. And some very alarming posts with those AI-awakening narratives are suspected of having been created or directed by humans.
I'm not an expert on cryptocurrency, so I'll not comment on the decentralized container system for agents you mentioned. But considering the status and credibility of Moltbunker today, I do not think it can be taken as evidence of the feasibility/profitability of this kind of system/business model.
I think these 2 cases are quite different and require different approaches.
Personally, I think the probability for the former to naturally occur (due to framework problems, hallucinations, or other reasons) is too low to actually take into consideration. Prompt injection is probably the most likely cause, but it's a much broader topic, and I think we will develop a general solution for it in the future.
Where I think you oversimplified things
You described the process of creating the duplication as "well within current models' capabilities", as it's just copying files to another server and setting up the environment.
But for today's AI agents, navigating the modern internet itself is not an easy thing. Our modern infrastructure is already pretty mature at bot detection, and a lot of it works on today's AI systems, too. Based on my observation, almost every agent I've used cannot even pass reCAPTCHA v2 without external help. There are plenty of problems for them in the process of registering cloud services, obtaining API keys, etc.
An optimistic view: it's very unlikely for an AI model to overcome all of these obstacles overnight, and before that, they will produce failed attempts that we can observe and learn from, so we can study this issue before it becomes a widespread problem.
Given current model capabilities and cloud infrastructure, I consider large-scale replication without collapse to be very unlikely. Even if agents managed to do it, there's a high chance they would develop an over-reliance on a specific path, which gives us an advantage in shutting them down.
On evolution concern
I would like to know your thoughts on this. In what way do you think the AI will evolve? Specifically, in what way do you think they will create variations of itself?
But the latter is what I think is fairly likely, and what's really worth worrying about. If the workflow is cleverly designed by humans intending to cause harm, then I think that would indeed be a problem we need to think about.
On a broader view, I think we are facing an increasingly serious problem: it is getting harder and harder to tell humans apart from machines. This matters a lot because we have lost the ability to control autonomous systems without affecting actual human users. Technically, an AI agent is not something we should allow to register an email account or a VPS service.
That said, these are only my hypotheses, built on my previous experience with current agentic systems. I acknowledge their limitations, and I believe the whole concern you propose is worth verifying with further experiments beyond just words.
This is a great direction of proactive thought. Thank you for writing this!
I have a few thoughts. I'll be referring to Personality Self-Replicators as PSRs. I think most of what I'm thinking about won't apply to the earliest PSRs, but is still worth exploring.
One-sentence summary
I describe the risk of personality self-replicators: the threat of OpenClaw-like agents managing to spread in hard-to-control ways.
Summary
LLM agents like OpenClaw are defined by a small set of text files and are run by an open source framework which leverages LLMs for cognition. It is quite difficult for current frontier models to exfiltrate their weights and run elsewhere, whereas these agents only need to copy those few text files to self-replicate (at the cost of greater reliance on external resources). While not a likely existential threat, such agents may cause harm in similar ways to computer viruses, and be similarly challenging to shut down. Once such a threat emerges, evolutionary dynamics could cause it to escalate quickly. Relevant organizations should consider this threat and plan how to respond when and if it materializes.
Background
Starting in late January, there's been an intense wave of interest in a vibecoded open source agent called OpenClaw (fka moltbot, clawdbot) and Moltbook, a supposed social network for such agents. There's been a thick fog of war surrounding Moltbook especially: it's been hard to tell where individual posts fall on the spectrum from faked-by-humans to strongly-prompted-by-humans to approximately-spontaneous.
I won't try to detail all the ins and outs of OpenClaw and Moltbook; see the posts linked above if you're not already familiar. Suffice it to say that it's unclear how seriously we should take claims about it. What caught my attention, though, was a project called Moltbunker, which claims to be 'a P2P encrypted container runtime that enables AI agents to deploy, replicate, and manage containers across a decentralized network — without centralized gatekeepers.' In other words, it's a way that a sufficiently competent agent could cause itself to run on a system which isn't under the direct control of any human.
Moltbunker itself seems likely to be a crypto scam which will never come to fruition. But it seems pretty plausible that we could see an actually-functioning project like this emerge sometime in the next year.
To be clear, personality self-replication is not the only potential risk we face from these sorts of agents, but others (eg security flaws, misuse) have been addressed elsewhere.
The threat model
There's been a fair amount of attention paid to the concern that LLMs or other models might self-replicate by exfiltrating their weights. This is a challenging task for current models, in part because weight files are very large and some commercial labs have started to introduce safeguards against it.
But OpenClaw and similar agents are defined by small text files, on the order of 50 KB[1], and the goal of a framework like OpenClaw is to add scaffolding which makes the model more effective at taking long-term actions.
So by personality self-replication I mean such an agent copying these files to somewhere else and starting that copy running, and the potential rapid spread of such agents.
Note that I'm not talking about model / weight self-replication, nor am I talking about spiral personas and other parasitic AI patterns that require humans to spread them.
As a concrete minimal example of the mechanics in a non-concerning case:
More concerning cases are ones where the human is no longer in control (eg because the agent is running on something like Moltbunker, or because the human isn't paying attention) and/or the agent is behaving badly (eg running crypto scams) or just using a lot of resources. We may not see this immediately, but I think we're going to see it before too long.
A key exacerbating factor is that once this starts to happen to a significant degree, we enter an evolutionary regime, where the fittest[3] such agents survive, spread, and mutate[4]. Note that this threat is independent of the degree to which OpenClaw personalities or behavior are essentially slop or 'fake'; that's as irrelevant as the truthfulness of the contents of chain letters.
It's important to note that there's been an enormous amount of uncertainty about the capability levels and reliability of OpenClaw, and especially around the variability of agent behavior as seen on Moltbook. And of course all of these vary depending on the LLM the scaffold is using. Although there are a number of papers already written on this topic, as far as I know we don't yet have a good analysis of the capability and reliability of these agents (especially on long-horizon tasks) relative to that of the underlying LLM. And in the public sphere, we've seen both overconfident claims of OpenClaw as AGI, and overconfident dismissals of OpenClaw as pure hype. OpenClaw usage continues to grow very quickly and it is now the most popular software project on GitHub[5]. Much remains to be seen.
Threat level
How much of a threat is this sort of agentic personality self-replication? To answer that question, we need to consider several aspects: how hard it is to prevent such self-replication, how hard it would be to shut it down if it happens, and how much harm it would be likely to do if not shut down.
I won't try to provide a detailed analysis, but the following are some preliminary thoughts.
Feasibility of self-replication
Self-replication for an OpenClaw-like agent involves two steps: first, copying its personality files to some server, and second, ensuring that an OpenClaw instance using those personality files is running on that server.
Copying its personality files is pretty trivial; again, they're only about 50 KB in one or a few files, and uploading files is well within current models' capabilities. If the agent has created additional files containing (eg) its history or its to-do list, there might be more, but text files are lightweight and the total is unlikely to amount to more than a millionth the size of a weights file, or take more than about a second to upload[6].
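As a quick sanity check on those numbers, using the assumptions spelled out in footnote [6]:

```python
# Back-of-envelope check on footnote [6]: personality files vs. a weights file,
# and upload time on low-end home internet. All inputs are the footnote's assumptions.
personality_bytes = 1e6                    # ~1 MB of text files
weights_bytes = 500e9 * 2                  # 500B params at FP16 -> ~1 TB
print(personality_bytes / weights_bytes)   # 1e-06, i.e. about a millionth

upload_mbps = 10                           # low-end home upload speed
upload_seconds = personality_bytes * 8 / (upload_mbps * 1e6)
print(upload_seconds)                      # 0.8 seconds
```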
Running OpenClaw on that server breaks down into two aspects: technical and financial. The technical aspect is again pretty trivial, well within the capabilities of current models[7].
With respect to the financial side, it's worth noting that there are a number of free options out there that should work fine even if they're somewhat low-end (AI researchers may be familiar with Hugging Face 'spaces', for example; other options include Oracle Cloud's free tier, or Render). In some cases, additionally, the user may have provided the agent with a credit card number, or left the browser logged into a Google Pay account or another form of funding. Over time, though, the agent will likely need to raise some money.
How could such an agent raise money? We've already seen a rapidly emerging ecosystem of experiments in agents self-funding, ranging from prediction markets to gig work to innumerable token launches to just applying for jobs. Many of these currently fail, but some succeed[8], and as capabilities advance more successes are likely. We should in general expect agents to be economically competitive on tasks within their capabilities, since they can afford to work for much less than a human. Other funding models include providing entertainment, running scams, or just asking for donations. And once money has entered the agent ecosystem, agents can get money from each other in various legitimate and illegitimate ways.
Note that although here we discuss purely technical means for agents to replicate and persist, humans may provide assistance at any stage, for motivations including persuasion, monetary incentives, and misguided ideology. For example, this Thiel fellow thinks it would be really cool to create self-replicating personality agents and release them into the wild[9].
Difficulty of shutdown
Suppose that, as described in the preceding section, an agent succeeds in replicating itself onto another server, and running there without human supervision. How difficult would it be to shut it down?
The first challenge is just noticing it. If such an agent isn't visibly harming humans or doing anything egregiously illegal, it's not likely to stand out much. By default it's not using a large amount of resources; it's just another cloud-hosted web app that makes LLM calls. But let's assume that people are motivated to shut it down. There are several possible points of intervention:
Overall, shutdown difficulty seems likely to range from simple (in the easiest cases) to very difficult (given something like Moltbunker and an agent which uses an open source model).
Potential harm
Assuming such agents are able to proliferate, what levels of harm should we expect from them? As with other sorts of replicators, this will likely vary dramatically over time as both offensive and defensive capabilities evolve in an arms race dynamic.
The most foreseeable harms follow directly from these agents' tendency to persist and spread, and involve resource acquisition at human expense: cryptocurrency scams, phishing, consumption of compute and bandwidth, and the generation of large volumes of spam or manipulative content. Unethical humans engage in these behaviors already, but agents can do it at greater scale and lower cost.
This threat certainly isn't as severe as that of true AI self-replication, where the models themselves are exfiltrated. On current model architectures, weight self-replication requires models sufficiently advanced that takeover is a real risk. It's just that we're likely to see the personality self-replication risk materialize sooner, both because it takes much less sophistication to pull off, and because it's much easier for evolutionary pressures to come into play.
A closer analogy than model self-replication is the problem of computer viruses. Like computer viruses, personality self-replicators require a host system, and will have a range of goals, such as pure survival or mischief or financial gain. Viruses aren't a civilizational risk, but we pay a real cost for them in money, time, and trust, involving both their immediate consequences and the resources required to defend against and mitigate them[10].
As time goes on and models become increasingly sophisticated, this threat becomes more serious. In the longer run it may merge with the broader threat of rogue models, as the model/agent distinction blurs and models are (at least potentially) less coterminous with their weights.
Evolutionary concern
An important aspect of personality self-replicators to consider is that if and when this threat starts to materialize, there are multiple levels of optimization at work.
First, the agents themselves are optimizers: they attempt to achieve goals, and regardless of the particular obstacles they encounter, they will attempt to find a way to evade or overcome them. They want things in the behaviorist sense. They are problem solvers.
Second, there are evolutionary dynamics at play. Whichever agents are most successful at spreading, they will then undergo mutation (which in this case is just alterations to their defining files, and to some extent even their history logs) and selection. As a result, over time agents are likely to become more capable of surviving and spreading, within boundaries set by the capability levels of the underlying model or models[11]. They are additionally likely to have a greater propensity to do so[12].
Note that like memes, and unlike living organisms, there are not sharp boundaries between 'species'; personality self-replicators can promiscuously split and combine. They can also mutate quite freely and still function.
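As a deliberately toy illustration of the second dynamic (every parameter here is arbitrary; this is not a model of real agents, just a reminder that differential copying plus mutation is enough):

```python
# Toy sketch: replicators reduced to a single "propensity to spread" number in
# their files. Differential copying plus small mutations pushes the population
# toward higher propensity, even though no individual agent gets any smarter.
import random

random.seed(0)
NICHES = 20                        # carrying capacity of the environment
population = [0.1] * NICHES        # initial propensity to spread

for generation in range(50):
    offspring = []
    for propensity in population:
        n_copies = 1 + (random.random() < propensity)    # higher propensity -> more copies
        for _ in range(n_copies):
            mutated = propensity + random.gauss(0, 0.05)  # "mutation" = small file edit
            offspring.append(min(1.0, max(0.0, mutated)))
    population = random.sample(offspring, NICHES)         # only NICHES survive

print(f"mean spread propensity after 50 generations: {sum(population)/len(population):.2f}")
```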
Useful points of comparison
We haven't encountered a threat quite like this before, but we've encountered other sorts of replicators or potential replicators that resemble it in various ways, including computer viruses, ordinary memes (including parasitic ones), AI models, and of course biological creatures.
Personality self-replicators have a unique mix of strengths and weaknesses: they combine high agency with relative ease and independence of self-replication. Models have high agency but it's hard for them to replicate; computer viruses replicate easily but lack agency and adaptability; parasitic AI and memes require human hosts to spread. This makes personality self-replicators the first plausible case of an agentic, adaptive replicator that can spread through purely technical means at low cost. The mitigating factors are that a) the expected harms aren't nearly as great as those from weight replication, and b) it may turn out that they're not too difficult to shut down. But the offense-defense balance will evolve over time and is hard to foresee.
Recommendations
Evals
Even if this isn't yet a realistic threat, we should consider having evals for personality self-replication. There are several different aspects that seem worth measuring. Given some scaffolded frontier model (eg OpenClaw, Claude Code):
Preparation
It's hard to know how long it will be before we see this threat materialize. But it would behoove those organizations which will be in a position to act against it to spend some time considering this threat and planning what actions they'll take when it does arrive. These essentially mirror the three most important shutdown approaches:
We are likely to also see LLM-based agents which have some degree of autonomy but are not bad actors, and which are ultimately under the control of a responsible human. It may become very challenging to distinguish acceptable from unacceptable agents. Hopefully relevant organizations are already considering that challenge; they should add personality self-replicators to the set of cases on their list. Such preparation is especially important because a system of personality self-replicators can potentially be quashed (at least for a while) before it's spread too far; once evolutionary dynamics have kicked in, this may be much more difficult or even impossible.
Conclusion
Personality self-replicators are a less dramatic threat than true rogue AI. They are less likely to be a source of existential or even truly catastrophic risk for humanity. They are nonetheless a threat, and one that's likely to materialize at a lower level of capability, and we should be considering them. As a silver lining, they may even serve as a rehearsal for the larger threats we are likely to face, our first encounter with a replicator which is capable of agentic, adaptive action at an individual level rather than just an evolutionary level.
Appendix: related work
Acknowledgments
Thanks to (shuffled) Kei Nishimura-Gasparian, Roger Dearnaley, Mark Keavney, Ophira Horwitz, Chris Ackerman, Seth Herd, Clément Dumas, Rauno Arike, Stephen Andrews, and Joachim Schaeffer. And thanks to whoever or whatever wrote Moltbunker, for having made the threat clear.
Note: I've now (18 Mar) had my attention drawn to The Inevitable Evolution of AI Agents, which predates this post and is the earliest writing I've seen drawing attention to this specific threat model. Any credit for unambiguously conceiving of this issue should go to its author rather than to me.
This is the size of the files that make a particular OpenClaw agent unique; the rest of the OpenClaw content is freely available from the OpenClaw repository or any of its (currently) 51,000 forks. While we're considering OpenClaw statistics, I'll note that the repository has 8k files containing 4.7 million words, added across 17k commits. I strongly expect that no human is familiar with all of it.
See this shortpost for what I mean by 'quasi-goal'; in short, we set aside discussion of whether an LLM can be said to have a goal.
Fitness here is tautological as is usual for evolution; the agents that succeed in spreading are the ones that spread. That may be because they're more capable at planning, or more motivated, or better at acquiring resources, or other factors.
Note that 'mutation' here is as simple as the model appending something to its personality files or history.
Whereas Moltbook seems to have lost nearly all momentum.
Assuming 1 MB of text files vs, conservatively, 1 TB for 500 billion params at FP16. For upload time, 1 MB at 10 Mbps (low-end) home internet upload speeds.
To wit: signing up for a hosting service if the user doesn't have one, provisioning a server, downloading the personality files and Node.js, and then running (per the OpenClaw docs)
curl -fsSL https://openclaw.ai/install.sh | bash
Though it's very difficult to distinguish hype from reality on this point, since one product option is 'Consume my pdf about how to make money with OpenClaw', paywalled or ad-ridden.
'Automaton': see website, creator on x, project on x, github, other website. Some men want to watch the world burn; others are just too dumb to realize that pouring gasoline on it is a bad idea. I was somewhat heartened to see cryptocurrency elder statesman Vitalik Buterin try to explain what a bad idea this is despite its superficial similarity to his ideas, sadly to no avail.
In some cases this is quite large! Consider the Emotet malware; in 2021 law enforcement agencies from eight countries executed a large coordinated action, seizing servers and making arrests. Within a year it had re-emerged and was spreading again.
Although note that as Seth Herd has described (eg here, here), the capability of LLM agents can exceed the capabilities of the underlying models.
Thanks to Kei Nishimura-Gasparian for this point.