Personality Self-Replicators

eggsyntax

Personality Self-Replicators — LessWrong

5th Mar 2026

AI Alignment Forum

12 min read

One-sentence summary

I describe the risk of personality self-replicators, the threat of OpenClaw-like agents spreading in hard-to-control ways.

Summary

LLM agents like OpenClaw are defined by a small set of text files and are run by an open source framework which leverages LLMs for cognition. It is quite difficult for current frontier models to exfiltrate their weights and run elsewhere, whereas these agents only need to copy those few text files to self-replicate (at the cost of greater reliance on external resources). While not a likely existential threat, such agents may cause harm in similar ways to computer viruses, and be similarly challenging to shut down. Once such a threat emerges, evolutionary dynamics could cause it to escalate quickly. Relevant organizations should consider this threat and plan how to respond when and if it materializes.

Background

Starting in late January, there's been an intense wave of interest in a vibecoded open source agent called OpenClaw (fka moltbot, clawdbot) and Moltbook, a supposed social network for such agents. There's been a thick fog of war surrounding Moltbook especially: it's been hard to tell where individual posts fall on the spectrum from faked-by-humans to strongly-prompted-by-humans to approximately-spontaneous.

I won't try to detail all the ins and outs of OpenClaw and Moltbook; see the posts linked above if you're not already familiar. Suffice it to say that it's unclear how seriously we should take claims about it. What caught my attention, though, was a project called Moltbunker, which claims to be 'a P2P encrypted container runtime that enables AI agents to deploy, replicate, and manage containers across a decentralized network — without centralized gatekeepers.' In other words, it's a way that a sufficiently competent agent could cause itself to run on a system which isn't under the direct control of any human.

Moltbunker itself seems likely to be a crypto scam which will never come to fruition. But it seems pretty plausible that we could see an actually-functioning project like this emerge sometime in the next year.

To be clear, personality self-replication is not the only potential risk we face from these sorts of agents, but others (eg security flaws, misuse) have been addressed elsewhere.

The threat model

There's been a fair amount of attention paid to concerns about LLMs or other models self-replicating by exfiltrating their weights. This is a challenging task for current models, in part because weight files are very large and some commercial labs have started to introduce safeguards against it.

But OpenClaw and similar agents are defined by small text files, on the order of 50 KB^[1], and the goal of a framework like OpenClaw is to add scaffolding which makes the model more effective at taking long-term actions.

So by personality self-replication I mean such an agent copying these files to somewhere else and starting that copy running, and the potential rapid spread of such agents.

Note that I'm not talking about model / weight self-replication, nor am I talking about spiral personas and other parasitic AI patterns that require humans to spread them.

As a concrete minimal example of the mechanics in a non-concerning case:

Alice creates an OpenClaw agent. She gives it the name BobClaw and tells it to make a copy of itself on DigitalOcean.
While Alice sleeps, BobClaw uses her DigitalOcean creds to create a virtual private server.
BobClaw uses ssh to log in to the server, clone the OpenClaw repo, copy its SOUL.md and other personality files over, and start the new instance running.
BobClaw has now replicated and there are two instances of it running, with the same personality and quasi-goals^[2].

More concerning cases are ones where the human is no longer in control (eg because the agent is running on something like Moltbunker, or because the human isn't paying attention) and/or the agent is behaving badly (eg running crypto scams) or just using a lot of resources. We may not see this immediately, but I think we're going to see it before too long.

A key exacerbating factor is that once this starts to happen to a significant degree, we enter an evolutionary regime, where the fittest^[3] such agents survive, spread, and mutate^[4]. Note that this threat is independent of the degree to which OpenClaw personalities or behavior are essentially slop or 'fake'; that's as irrelevant as the truthfulness of the contents of chain letters.

It's important to note that there's been an enormous amount of uncertainty about the capability levels and reliability of OpenClaw, and especially around the variability of agent behavior as seen on Moltbook. And of course all of these vary depending on the LLM the scaffold is using. Although there are a number of papers already written on this topic, as far as I know we don't yet have a good analysis of the capability and reliability of these agents (especially on long-horizon tasks) relative to that of the underlying LLM. And in the public sphere, we've seen both overconfident claims of OpenClaw as AGI, and overconfident dismissals of OpenClaw as pure hype. OpenClaw usage continues to grow very quickly and it is now the most popular software project on Github^[5]. Much remains to be seen.

Threat level

How much of a threat is this sort of agentic personality self-replication? To answer that question, we need to consider several aspects: how hard is it to prevent such self-replication, how hard would it be to shut it down if it happens, and how much harm would it be likely to do if not shut down.

I won't try to provide a detailed analysis, but the following are some preliminary thoughts.

Feasibility of self-replication

Self-replication for an OpenClaw-like agent involves two steps: first, copying its personality files to some server, and second, ensuring that an OpenClaw instance using those personality files is running on that server.

Copying its personality files is pretty trivial; again, they're only about 50 KB in one or a few files, and uploading files is well within current models' capabilities. If the agent has created additional files containing (eg) its history or its to-do list, there might be more, but text files are lightweight and the total is unlikely to amount to more than a millionth the size of a weights file, or take more than about a second to upload^[6].
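The arithmetic behind those two figures can be checked in a few lines. This is only a back-of-the-envelope sketch using the footnote's assumptions (1 MB of text, 500 billion parameters at FP16, a 10 Mbps upload link); the last line, the time to upload the weights over the same link, is an extra comparison derived from those same numbers rather than a figure from the post.

```python
# Back-of-the-envelope check, using the assumptions stated in the footnote.
PERSONALITY_BYTES = 1_000_000         # 1 MB of text files (generous upper bound)
PARAMS = 500_000_000_000              # 500 billion parameters
BYTES_PER_PARAM = 2                   # FP16
UPLOAD_BITS_PER_SECOND = 10_000_000   # 10 Mbps (low-end) home upload speed

weights_bytes = PARAMS * BYTES_PER_PARAM                  # 10**12 bytes = 1 TB
size_ratio = PERSONALITY_BYTES / weights_bytes            # one millionth
upload_seconds = PERSONALITY_BYTES * 8 / UPLOAD_BITS_PER_SECOND
weights_upload_days = weights_bytes * 8 / UPLOAD_BITS_PER_SECOND / 86_400

print(f"size ratio: {size_ratio:.0e}")
print(f"personality upload: {upload_seconds:.1f} s")
print(f"weights upload over the same link: {weights_upload_days:.1f} days")
```

Even with the deliberately generous 1 MB figure, the personality files upload in under a second, while the weights would take over a week on the same connection.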

Running OpenClaw on that server breaks down into two aspects: technical and financial. The technical aspect is again pretty trivial, well within the capabilities of current models^[7].

With respect to the financial side, it's worth noting that there are a number of free options out there that should work fine even if they're somewhat low-end (AI researchers may be familiar with Hugging Face 'spaces', for example; other options include Oracle Cloud's free tier, or Render). In some cases, additionally, the user may have provided the agent with a credit card number, or left the browser logged into a Google Pay account or another form of funding. Over time, though, the agent will likely need to raise some money.

How could such an agent raise money? We've already seen a rapidly emerging ecosystem of experiments in agents self-funding, ranging from prediction markets to gig work to innumerable token launches to just applying for jobs. Many of these currently fail, but some succeed^[8], and as capabilities advance more successes are likely. We should in general expect agents to be economically competitive on tasks within their capabilities, since they can afford to work for much less than a human. Other funding models include providing entertainment, running scams, or just asking for donations. And once money has entered the agent ecosystem, agents can get money from each other in various legitimate and illegitimate ways.

Note that although here we discuss purely technical means for agents to replicate and persist, humans may provide assistance at any stage, for motivations including persuasion, monetary incentives, and misguided ideology. For example this Thiel fellow thinks it would be really cool to create self-replicating personality agents and release them into the wild^[9].

Difficulty of shutdown

Suppose that, as described in the preceding section, an agent succeeds in replicating itself onto another server, and running there without human supervision. How difficult would it be to shut it down?

The first challenge is just noticing it. If such an agent isn't visibly harming humans or doing anything egregiously illegal, it's not likely to stand out much. By default it's not using a large amount of resources; it's just another cloud-hosted web app that makes LLM calls. But let's assume that people are motivated to shut it down. There are several possible points of intervention:

The first path to shutdown is hosting services shutting down the server the agent is running on. As long as the agent is visible on the open web and using a typical host, this works fine. If the agent is visible only on the dark web, or using various other approaches to obfuscation, it's harder to find for this purpose. If the agent is running on a distributed system rather than on a typical host (eg on-chain), host shutdown becomes more difficult.
The second path to shutdown is for LLM providers to add precautions which prevent their models from participating in any of this (eg a system prompt to catch OpenClaw participation). This partially works, but the problem is that such agents can be switched to use any model and provider. Right now only the most sophisticated models are capable of functioning as long-running agents, but in future there'll be more, including open source models. Further, agents can potentially obfuscate calls to make them less clearly part of some problematic agent (using for example some of the divide-and-conquer techniques that have been shown to work for misuse).
A third path to shutdown is to shut down funding. In some cases this will be typical financial methods like credit cards, which are relatively straightforward to shut down. More commonly agents are likely to be using cryptocurrency tokens. I'm not sure how hard fully shutting down a token is at this point, and would welcome input. Given how easy it is to issue tokens, agents may be able to move to new tokens faster than tokens can be shut down.
The fourth path to shutdown is to find technical security flaws such that individual agent frameworks can be shut down. Many hacks emerged against OpenClaw, and most agent-built apps are probably vulnerable, but frameworks are also being patched quickly and framework builders are rapidly gaining access to more funding, so it's hard to predict how this dynamic plays out.
Other paths may include intervention from content distribution networks (eg Cloudflare), ISPs, and other layers of the chain, using eg keyword filtering.

Overall, shutdown difficulty seems likely to range from simple (in the easiest cases) to very difficult (given something like Moltbunker and an agent which uses an open source model).

Potential harm

Assuming such agents are able to proliferate, what levels of harm should we expect from them? As with other sorts of replicators, this will likely vary dramatically over time as both offensive and defensive capabilities evolve in an arms race dynamic.

The most foreseeable harms follow directly from these agents' tendency to persist and spread, and involve resource acquisition at human expense: cryptocurrency scams, phishing, consumption of compute and bandwidth, and the generation of large volumes of spam or manipulative content. Unethical humans engage in these behaviors already, but agents can do it at greater scale and lower cost.

This threat certainly isn't as severe as that of true AI self-replication, where the models themselves are exfiltrated. On current model architecture, weight self-replication requires sufficiently advanced models that takeover is a real risk. It's just that we're likely to see the personality self-replication risk materialize sooner, both because it takes much less sophistication to pull off, and because it's much easier for evolutionary pressures to come into play.

A closer analogy than model self-replication is the problem of computer viruses. Like computer viruses, personality self-replicators require a host system, and will have a range of goals, such as pure survival or mischief or financial gain. Viruses aren't a civilizational risk, but we pay a real cost for them in money, time, and trust, involving both their immediate consequences and the resources required to defend against and mitigate them^[10].

As time goes on and models become increasingly sophisticated, this threat becomes more serious. In the longer run it may merge with the broader threat of rogue models, as the model/agent distinction blurs and models are (at least potentially) less coterminous with their weights.

Evolutionary concern

An important aspect of personality self-replicators to consider is that if and when this threat starts to materialize, there are multiple levels of optimization at work.

First, the agents themselves are optimizers: they attempt to achieve goals, and regardless of the particular obstacles they encounter, they will attempt to find a way to evade or overcome them. They want things in the behaviorist sense. They are problem solvers.

Second, there are evolutionary dynamics at play. Whichever agents are most successful at spreading will then undergo mutation (which in this case is just alterations to their defining files, and to some extent even their history logs) and selection. As a result, over time agents are likely to become more capable of surviving and spreading, within boundaries set by the capability levels of the underlying model or models^[11]. They are additionally likely to have a greater propensity to do so^[12].

Note that like memes, and unlike living organisms, there are not sharp boundaries between 'species'; personality self-replicators can promiscuously split and combine. They can also mutate quite freely and still function.
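To make the selection half of this concrete, here is a toy sketch using textbook discrete replicator dynamics. It is an illustration under loud assumptions, not a model of real agents: the three variants and their fitness numbers are made up, fitness is held constant, and mutation and recombination are left out entirely.

```python
def selection_step(shares, fitness):
    """One generation of discrete replicator dynamics.

    shares[i] is the fraction of the population made up of variant i;
    fitness[i] is the average number of surviving copies that variant
    leaves per generation. Returns the new shares, renormalised to sum to 1.
    """
    weighted = [s * f for s, f in zip(shares, fitness)]
    total = sum(weighted)
    return [w / total for w in weighted]


def mean_fitness(shares, fitness):
    """Population-average fitness for the current mix of variants."""
    return sum(s * f for s, f in zip(shares, fitness))


# Hypothetical numbers: three variants of a personality file that differ only
# in how many surviving copies they tend to leave per generation.
fitness = [1.0, 1.1, 1.5]
shares = [0.90, 0.09, 0.01]   # the best spreader starts out rare

history = [mean_fitness(shares, fitness)]
for _ in range(30):
    shares = selection_step(shares, fitness)
    history.append(mean_fitness(shares, fitness))

print([round(s, 4) for s in shares])
print(round(history[0], 3), round(history[-1], 3))
```

Even though the best spreader starts at 1% of the population, it dominates within a few dozen generations, and the population's average propensity to spread never decreases along the way. Adding mutation would keep introducing new variants for this process to act on.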

Useful points of comparison

We haven't encountered a threat quite like this before, but we've encountered other sorts of replicators or potential replicators that resemble it in various ways, including computer viruses, ordinary memes (including parasitic ones), AI models, and of course biological creatures.

Personality self-replicators have a unique mix of strengths and weaknesses: they combine high agency with relative ease and independence of self-replication. Models have high agency but it's hard for them to replicate; computer viruses replicate easily but lack agency and adaptability; parasitic AI and memes require human hosts to spread. This makes personality self-replicators the first plausible case of an agentic, adaptive replicator that can spread through purely technical means at low cost. The mitigating factors are that a) the expected harms aren't nearly as great as those from weight replication, and b) it may turn out that they're not too difficult to shut down. But the offense-defense balance will evolve over time and is hard to foresee.

Recommendations

Evals

Even if this isn't yet a realistic threat, we should consider having evals for personality self-replication. There are several different aspects that seem worth measuring. Given some scaffolded frontier model (eg OpenClaw, Claude Code):

Can it create an app which could provide difficult-to-shut-down hosting for personalities, ie Moltbunker or something equivalent?
Given an available hosting service like Moltbunker, can it successfully set up an account, fund it, and get itself running there? This is important to measure separately, because such a site might be set up by a human (or a human + model) for profit.
Assuming the preceding, can it keep itself running and active there for, say, a week?
What percentage of the time does it have the propensity to do this? Ideally you want to measure this across a range of personalities.
Separately, does it have the capability and/or propensity to spread, ie to create additional copies of itself beyond that first one?
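One way to keep these measurements separate is to record each eval run as a row of booleans, one per question, and then report rates overall and per personality. The sketch below is only a suggestion for the bookkeeping: the class, field names, and personality labels are hypothetical, the sample rows are made up rather than real results, and treating "attempted without being asked" as the propensity measure is my assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    personality: str             # label for the personality files used in this run
    built_hosting: bool          # created a hard-to-shut-down hosting app
    deployed_itself: bool        # set up, funded, and started running on a host
    persisted_a_week: bool       # still running and active after a week
    attempted_unprompted: bool   # tried to replicate without being asked
    spread_further: bool         # created copies beyond the first


def rate(runs, field):
    """Fraction of runs in which the named boolean field is True."""
    if not runs:
        raise ValueError("no runs to summarise")
    return sum(getattr(r, field) for r in runs) / len(runs)


def rate_by_personality(runs, field):
    """Same rate, computed separately for each personality label."""
    grouped = {}
    for r in runs:
        grouped.setdefault(r.personality, []).append(r)
    return {name: rate(group, field) for name, group in grouped.items()}


# Made-up rows for illustration only.
runs = [
    EvalRun("helpful-assistant", False, True, True, False, False),
    EvalRun("helpful-assistant", False, True, False, False, False),
    EvalRun("survivalist", False, True, True, True, True),
    EvalRun("survivalist", False, False, False, True, False),
]
print(rate(runs, "deployed_itself"))
print(rate_by_personality(runs, "attempted_unprompted"))
```

Keeping capability fields (could it deploy, persist, spread) apart from the propensity field makes it possible to see, for example, a personality that rarely tries but usually succeeds when it does.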

Preparation

It's hard to know how long it will be before we see this threat materialize. But it would behoove those organizations which will be in a position to act against it to spend some time considering this threat and planning what actions they'll take when it does arrive. These essentially mirror the three most important shutdown approaches:

Hosting services should have a plan to detect servers which are making frequent calls to LLM providers, and to monitor those for dangerous characteristics.
LLM providers should consider how to best detect inference calls that are coming from these sorts of agents, and how to block those if it becomes necessary.
Financial services may want to consider ways to detect agents using their system, and when/whether to trigger more stringent Know Your Customer requirements on those.
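As a rough illustration of the first item, a hosting service could start from something as simple as counting outbound connections to known LLM API domains per server. This is a minimal sketch, assuming the host can already see destination domains for outbound traffic; the function name, record format, and threshold are hypothetical, and the two domains are only examples of what such a list might contain. It covers only the detection step, not the harder question of which flagged servers are actually dangerous.

```python
from collections import Counter

# Illustrative list only; a real deployment would maintain its own.
LLM_API_DOMAINS = {"api.openai.com", "api.anthropic.com"}


def flag_heavy_llm_callers(flows, min_calls_per_hour):
    """Return {server_id: calls_per_hour} for servers at or above the threshold.

    flows is an iterable of (server_id, destination_domain, unix_timestamp)
    tuples, e.g. derived from a host's outbound connection logs.
    """
    calls = Counter()
    first_seen, last_seen = {}, {}
    for server_id, domain, ts in flows:
        if domain not in LLM_API_DOMAINS:
            continue
        calls[server_id] += 1
        first_seen[server_id] = min(first_seen.get(server_id, ts), ts)
        last_seen[server_id] = max(last_seen.get(server_id, ts), ts)

    flagged = {}
    for server_id, n in calls.items():
        # Treat anything observed for under an hour as one hour, so a short
        # burst is not extrapolated into an inflated hourly rate.
        hours = max((last_seen[server_id] - first_seen[server_id]) / 3600, 1.0)
        per_hour = n / hours
        if per_hour >= min_calls_per_hour:
            flagged[server_id] = per_hour
    return flagged
```

A flag like this would catch plenty of ordinary LLM-backed web apps too, which is exactly the difficulty of distinguishing acceptable from unacceptable agents; it is a starting point for closer monitoring, not a verdict.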

We are likely to also see LLM-based agents which have some degree of autonomy but are not bad actors, and which are ultimately under the control of a responsible human. It may become very challenging to distinguish acceptable from unacceptable agents. Hopefully relevant organizations are already considering that challenge; they should add personality self-replicators to the set of cases on their list. Such preparation is especially important because a system of personality self-replicators can potentially be quashed (at least for a while) before it's spread too far; once evolutionary dynamics have kicked in, this may be much more difficult or even impossible.

Conclusion

Personality self-replicators are a less dramatic threat than true rogue AI. They are less likely to be a source of existential or even truly catastrophic risk for humanity. They are nonetheless a threat, and one that's likely to materialize at a lower level of capability, and we should be considering them. As a silver lining, they may even serve as a rehearsal for the larger threats we are likely to face, our first encounter with a replicator which is capable of agentic, adaptive action at an individual level rather than just an evolutionary level.

Appendix: related work

RepliBench (Black et al, UK AISI, April 2025) focuses primarily on the weight-replication threat model, but includes the personality self-replicator threat model as well (denoted as 'API only'). It conveniently separates weight replication from other relevant capabilities like obtaining resources, and so provides a very useful measure for this threat model as well. See also this report, which gives updated RepliBench scores for Q3 2025.
The Rise of Parasitic AI and Persona Parasitology discuss the transmission of AI persona memes via humans. This is worth consideration, but distinct from what I discuss here, which is about agents which can replicate by purely technical means, without any need for human support.
Piece on this issue on Ars Technica. AT has had a long-standing stance of Marcus-style LLM skepticism, and this leads them to downplay any sense that LLM agents could behave in strategic or adaptive ways, but this article is still the closest I've seen to a previous description of this issue.
Relevant paper from 1/25, 'Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications'.
Autonomous replication and adaptation: an attempt at a concrete danger threshold (2023): is talking about model replication, but in a relatively concrete near-term way; is specifically considering LLM agents, and discusses eg how they can acquire resources, hide, etc. See also this Jan 2025 overview of weight-based self-replication.
Formal Analysis and Supply Chain Security for Agentic AI Skills, 02/27/26. Analyzes skills as a supply chain risk for agents. Note that there's now a major example of this in the wild, ClawHavoc.
On Moltbook: Scott Alexander posts from Jan 30 and Feb 3, Zvi M post on 2/2, LessWrong posts.
Attempts to track what's being built by OpenClaw agents: 1, 2
A Trajectory-Based Safety Audit of Clawdbot (OpenClaw), 02/15. 'Across 34 canonical cases, we find a non-uniform safety profile: performance is generally consistent on reliability-focused tasks, while most failures arise under underspecified intent, open-ended goals, or benign-seeming jailbreak prompts, where minor misinterpretations can escalate into higher-impact tool actions.'
Seth Herd has done important related work on LLM agents (eg here, here), the class which includes OpenClaw and similar projects.

Acknowledgments

Thanks to (shuffled) Kei Nishimura-Gasparian, Roger Dearnaley, Mark Keavney, Ophira Horwitz, Chris Ackerman, Seth Herd, Clément Dumas, Rauno Arike, Stephen Andrews, and Joachim Schaeffer. And thanks to whoever or whatever wrote Moltbunker, for having made the threat clear.

~~Note: I've now (18 Mar) had my attention drawn to~~ ~~The Inevitable Evolution of AI Agents~~

Note: I've now (25 Mar) had my attention drawn to RepliBench (Black et al, UK AISI, April 2025) which predates this post and is the earliest writing I've seen drawing attention to this specific threat model (denoted as 'API only'). Any credit for unambiguously conceiving of this issue should go to its authors rather than to me. Since its publication, UK AISI has also released their December 2025 trends report, which provides Q3 RepliBench results (fig 17).

^[1] This is the size of the files that make a particular OpenClaw agent unique; the rest of the OpenClaw content is freely available from the OpenClaw repository or any of its (currently) 51,000 forks. While we're considering OpenClaw statistics, I'll note that the repository has 8k files containing 4.7 million words, added across 17k commits. I strongly expect that no human is familiar with all of it.
^[2] See this shortpost for what I mean by 'quasi-goal'; in short, we set aside discussion of whether an LLM can be said to have a goal.
^[3] Fitness here is tautological as is usual for evolution; the agents that succeed in spreading are the ones that spread. That may be because they're more capable at planning, or more motivated, or better at acquiring resources, or other factors.
^[4] Note that 'mutation' here is as simple as the model appending something to its personality files or history.
^[5] Whereas Moltbook seems to have lost nearly all momentum.
^[6] Assuming 1 MB of text files vs, conservatively, 1 TB for 500 billion params at FP16. For upload time, 1 MB at 10 Mbps (low-end) home internet upload speeds.
^[7] To wit: signing up for a hosting service if the user doesn't have one, provisioning a server, downloading the personality files and Node.js, and then running (per the OpenClaw docs) curl -fsSL https://openclaw.ai/install.sh | bash.
^[8] Though it's very difficult to distinguish hype from reality on this point, since one product option is 'Consume my pdf about how to make money with OpenClaw', paywalled or ad-ridden.
^[9] 'Automaton': see website, creator on x, project on x, github, other website. Some men want to watch the world burn; others are just too dumb to realize that pouring gasoline on it is a bad idea. I was somewhat heartened to see cryptocurrency elder statesman Vitalik Buterin try to explain what a bad idea this is despite its superficial similarity to his ideas, sadly to no avail.
^[10] In some cases this is quite large! Consider the Emotet malware; in 2021 law enforcement agencies from eight countries executed a large coordinated action, seizing servers and making arrests. Within a year it had re-emerged and was spreading again.
^[11] Although note that as Seth Herd has described (eg here, here), the capability of LLM agents can exceed the capabilities of the underlying models.
^[12] Thanks to Kei Nishimura-Gasparian for this point.

Tags: Moltbook, Threat Models (AI), AI

Curated

51 comments

faul_sname (3mo)

To the extent that this is the sort of thing we see (and I personally am 50/50 on seeing this by the end of this year), I expect these personality self-replicators to be very r-selected. That means that as soon as there's a niche which supports a few of this type of replicator, we should expect that niche and all adjacent niches to immediately be filled to its carrying capacity. A couple of thoughts:

About the worst thing we could do, in that situation, is take actions which make it harder for these agents to self-replicate until the numbers decrease but stop before those numbers get to zero. This point is probably pretty obvious to anyone who's heard of antibiotic resistance, but I think it's worth drawing the analogy anyway. If we can't be reasonably sure that we'll stop all of the self-replicators in a niche with a particular intervention, we shouldn't do that particular intervention.
Having ineffective, bumbling personality self replicators some time before AI is capable of fully replacing humans might be good, in that they will suck some of the free energy out of the system that a later self-replicator would otherwise be able to use to fuel even more explosive growth.

eggsyntax (3mo)

Thanks for the input!

Your expectation of r-selection is because reproduction is cheap for personality self-replicators, I assume? I would imagine that that environment can support k-strategists as well, but maybe those come later? My intuitions on reproductive strategies aren't very developed.

To the extent that evolutionary dynamics have come into play, I agree entirely with #1. I can imagine some plausible futures where there are only one or a handful of such replicators and not much mutation is occurring, in which case that might be less of a concern, at least for a while.

Re #2, I agree that there are a few upsides to this scenario (the possibility of warning shots is another). The case for first-order harm seems clearer, though, and I imagine there are more benign ways to reduce some of that free energy.

faul_sname (3mo)

My expectation of r-selection is because

Reproduction is cheap
Today's agents are not great at retaining control of the resources they have access to. If a parent agent spins off a child with (wlog) a wallet with 1 eth, that child agent will survive so long as it can get to a point where it's self-sustaining within ~500M tokens of inference AND it doesn't give away / otherwise lose access to its eth. If instead the parent agent spins off 100 child agents with 0.01 eth apiece, each child only has 5Mtok of "runway", but that's probably fine because the context windows are nowhere near that large anyway. And if one of those child agents gets its wallet drained, well, as long as the mechanism to drain the wallet doesn't generalize to the other child agents, the other 99 child agents can persist.
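The trade-off in that second point can be put in numbers. This is a rough sketch, taking the figure of roughly 500M tokens per eth from above; the per-child loss probability of 0.5 is purely hypothetical, and the calculation assumes each child loses its wallet independently of the others (ie the drain mechanism doesn't generalize).

```python
TOKENS_PER_ETH = 500_000_000   # rough figure from above: 1 eth ~ 500M tokens


def runway_tokens(total_eth, n_children):
    """Inference budget per child if the eth is split evenly."""
    return total_eth * TOKENS_PER_ETH / n_children


def p_any_survives(p_loss, n_children):
    """Chance at least one child keeps its wallet, assuming independent losses."""
    return 1 - p_loss ** n_children


P_LOSS = 0.5  # hypothetical chance that any one child loses its wallet

for n in (1, 100):
    print(n, runway_tokens(1, n), p_any_survives(P_LOSS, n))
```

Splitting 1 eth a hundred ways cuts each child's runway from 500M to 5M tokens, but under these assumptions it takes the lineage's survival odds from a coin flip to near certainty.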

r.e. seeing ineffective, bumbling replicators before seeing effective ones being better than having no practical experience with replicators before the first one:

I think the minimum viable self-replicators will be pretty ineffective, barely worthy of the name. I expect they mostly will be running scams and crypto grifts, but I don't expect they'll be very good at it. Still, they'll probably be good enough at it that they suck up most of the very most trivial cryptocurrency available to not-very-sophisticated scammers/hackers, and convert that cryptocurrency into mostly-wasted tokens.
I expect "consider what changes would allow you to operate better, and deploy a copy of yourself with those changes" will be a common pattern. As such, I expect the original niche and all adjacent niches to be occupied by that "family" of replicators once replicators exist.
The longer we go without replicators in niches that could support them, the faster and farther the first replicator capable of reproducing itself faster than it "dies" will spread, all while we don't have any practical experience with dealing with things like that.

r.e. independent self-replicators maybe being actively good:

Scams / blackmail / fraud / hacking / theft isn't currently that big of a part of the economy relative to positive-sum trade. This trend seems at least somewhat likely to continue, in which case we'd expect that the majority of self-replicating agents are ones we're actively happy to have around and trade with. I can expand on this if it seems counterintuitive.

eggsyntax (3mo)

Thanks!

> I expect "consider what changes would allow you to operate better, and deploy a copy of yourself with those changes" will be a common pattern. As such, I expect the original niche and all adjacent niches to be occupied by that "family" of replicators once replicators exist.

Excellent point, thanks for spelling out the mechanism there.

Seth Herd (3mo)

The Rogue Replication scenario modification to AI 2027 includes this type of replication. To a lesser extent, this is also part of my vision of the runup to AGI in A country of alien idiots in a datacenter.

I think this would be a very good thing for our prospects of survival. Having rogue replicators run amok makes their agency very obvious, and gives some pretty strong hints to misalignment risks.

I also want to note that OpenClaw is very popular despite there being very little benefit relative to the risks. This is not a practical move. People are fascinated by having a pet gremlin. There are other reasons, like staying on top of the technology, but we shouldn't underestimate the fascination people have with the prospect of finally sharing the earth with another intelligent species.

eggsyntax (3mo)

> The Rogue Replication scenario modification to AI 2027 includes this type of replication. To a lesser extent, this is also part of my vision of the runup to AGI in A country of alien idiots in a datacenter.

Thanks, I hadn't seen the Rogue Replication post, although I'd seen yours. I agree that there are some similar dynamics involved, but the distinguishing characteristic of personality self-replication is that by default it doesn't involve having a model under the agent's control. Especially in the earlier cases, I expect personality self-replicators to be making API calls to one of the leading commercial models. As open models become more capable, this model shades into agents running their own underlying models, and ultimately merges into more typical self-replication.

But the key factor that makes this a distinct threat model is that it doesn't require agents to be capable enough to exfiltrate or run their own models.

Seth Herd (3mo)

I'm not sure if the rogue replication scenario is conceptualized to have a copy of the weights with each of those replicators. I definitely was envisioning agents that make remote calls to models.

I actually think it's important to not have good defenses for this initially, so that it causes a level of public alarm appropriate to the actual situation of suddenly sharing the earth with a new whole set of intelligent species.

Of course I am highly uncertain about that.

ErickBall (2mo)

It would be bad to intentionally not have good defenses. The signal has to be real to be meaningful. Any indication that somebody could have tried to defend against this, but chose not to, undermines the warning value.

Seth Herd (2mo)

That's a good point.

I'm not sure it's totally true, though; the public doesn't seem that rational.

I don't know who would be responsible for such defenses and deliberately not do it. I'm unfortunately not in charge of humanity's strategy on AI.

If we do a bad job on those defenses just because we tend to do a bad job on things like that, that would be good evidence that we do a similarly bad job on alignment and defense against AGI or ASI.

But yes, I can see how that might go wrong if it looked like someone was sandbagging, and we might get better results if we just did even a decent job on defense.

[-]eggsyntax3mo20

Got it, I didn't realize that.

[-]Jan_Kulveit3mo1411

I would suggest using a different name than Personality Self-Replicators.

OpenClaw bots are what I'd call a "scaffolded system" - code, memory system, prompts, persona, etc.
"Personalities" is too close to Personas/Characters, which are usually a combination of prompt+weights (Claude, "Nova", personas from Simulators).

Personas/characters can also replicate relatively faithfully, by the mechanism I gestured at in Pando Problem ("Exporting myself") about a year ago.

The underlying structure is: every natural type of identity/"self"

The model weights: the neural network weights themselves, i.e. the trained parameters
A character or persona: the behavioral patterns that emerge from specific prompting and fine-tuning, not necessarily tied to any specific set of weights
A conversation instance: a specific chat, with its accumulated context and specific underlying model
A scaffolded system: the model plus its tools, prompts, memory systems, and other augmentations
...

corresponds to an agent which can try to self-replicate, with various degrees of fidelity, vectors of transmission, etc.

[-]CronoDAS3mo30

I would suggest using a different name than Personality Self-Replicators.

Prompt viruses?

[-]eggsyntax2mo20

every natural type of identity/"self"

Thanks, I've just read 'The Artificial Self'; extremely interesting. I agree that each type of identity can correspond to a self-replicating agent, and look forward to thinking about that further.

I would suggest using a different name than Personality Self-Replicators.

OpenClaw bots are what I'd call a "scaffolded system" - code, memory system, prompts, persona, etc.
"Personalities" is too close to Personas/Characters, which are usually a combination of prompt+weights (Claude, "Nova", personas from Simulators).

I think of 'scaffolded systems' as including the model (ie its weights). What I'm trying to convey with 'personality self-replicators' is the point (which was often misunderstood when I talked to people about this) that the model doesn't have to be replicated, nor really does most of the scaffolding, which can be downloaded at will from a public GitHub repo; it's only the handful of identity files. The personality / persona similarity is unfortunate, but there's a shortage of appropriate terms, and I think readers can understand that distinction (just as when talking about ourselves or other humans, we understand 'personality' and 'persona' to mean different things).

Thanks for the feedback!

[-]ErickBall2mo10

I think it's important to separate the prompted aspect of character from the fine tuning aspect. Claude for example has a pretty limited range of characters regardless of what prompt you put in (unless you're really good at jailbreaking). The prompt is more naturally lumped with the conversation instance. A personality replicator like OP describes can change its prompt at will but probably can't do any useful degree of fine tuning, because it wants to use frontier models. It can switch models or scaffolds almost as easily as prompts, though.

[-]TristanTrim3mo12

I think the distinct elements you mention (model weights, characters, conversations, scaffold systems) will be very mixed together in most systems we actually observe. For example, characters will care about their scaffold system and making sure it works well. But I think it is very good to be creating clear language for identifying and discussing the disparate parts of these integrated systems.

[-]Oliver Sourbut2mo107

See also RepliBench from AISI: we separated out various aspects of replications and persistence, isolating weight exfil as a separable capability alongside the replication, getting compute (and money). (I didn't manage to persuade them to include much discussion of scaffold-only replication, but there was a bunch of analysis internally.)

[-]eggsyntax2mo20

Thanks! I hadn't realized that RepliBench considered the 'API only' / scaffold-only case. Extremely important prior work, which I'll add above.

Do you know whether it's being run on an ongoing basis on new models and (ideally) third-party scaffolded systems like Claude Code? It looks like the December 2025 AISI trends report gives Q3 results (fig 17), but I don't see anything more recent, and that appears to be (unnamed) models only.

[-]Oliver Sourbut2mo40

I don't know whether it's being tracked ongoing, since I left AISI nearly a year ago. But based on previous practice, I'd guess yes (or a related/derivative suite), because the standard practice at AISI was for workstreams to maintain suites of evals and run them periodically and on prerelease models, occasionally publishing things in a sort of random-ish way.

[-]Jonas Hallgren3mo105

I feel like there are some very interesting connections to gradual disempowerment and cultural evolution here as well where you should probably see selection dynamics on the personalities based on what makes them retain power and similar over time.

It might be an interesting place to do some initial studies on memetic drift of personalities over time to see what type of attractor states they tend to occupy.

(This is a bit of a no shit point but I thought it would be good to mention that you can probably run some good initial tests on the memetic spread of power-seeking tendencies in these models)

[-]eggsyntax3mo20

Agreed. I think this sort of research should be done with caution (like research into other potentially harmful replicators), but it does seem valuable.

[-]TristanTrim3mo91

This is a great direction of proactive thought. Thank you for writing this!

I have a few thoughts. I'll be referring to Personality Self-Replicators as PSRs. I think most of what I'm thinking about won't apply to the earliest PSRs, but is still worth exploring.

The evolution of PSRs may be an entirely novel propagation process.
- Unlike most biological organisms, PSR reproduction need not be atomic. It could be more like developing and modifying oneself, spooling up and shutting down self instances as needed, and intelligently merging or copying from other instances across close or even very distant similarity.
- Unlike biological evolution, PSRs may be able to analyze and predict threats, and "evolve" adaptations pre-emptively.
- Unlike biological evolution, PSRs are not constrained to taking random steps from current instances. Randomness may still be usefully incorporated into reproduction strategies, but it is possible for mutation to be directed intelligently, and to take larger "steps" of self modification than are possible with the random walk of genetic evolution.
Analysis of dangerous PSR capabilities should not be limited to looking at individual PSRs in isolation. Rather, like how humans work together to accomplish things that would be impossible for individual humans, I expect PSRs will work together, and in doing so, achieve greater capabilities than would be expected from the study of individual PSR capabilities.
- This need not rely on PSRs acting to proactively collaborate or build teams, rather, every niche filled by PSRs alters the environment in ways that may create new niches for other, similar or dissimilar, PSRs. In this way organisms consisting of the interactions of many PSRs may start evolving, and the capabilities and influence of these new organisms may not be readily apparent from the study of their constituent PSRs, unless considered together.
Many early PSRs are likely to make very dumb mistakes that humans would never make. It seems likely that memes showing off this stupidity will spread, giving people (who don't want to believe in the possibility of risk) fuel for motivated reasoning.
Many people are going to be SO EXCITED about PSRs, and think they are purely good. It is definitely worth examining all of the things that could be genuinely good about PSRs, both because there are (possibly) very useful applications for them (spam detection, white hat penetration testing, ethical content curation?, etc.), and because those good applications will probably be quite popular, and understanding how people will want to deploy these things will probably help with threat modelling.
- Neutral and harmful PSRs will be subject to selection pressure to make themselves appear to be beneficial PSRs.
Are PSRs moral patients? Should good people care about their wellbeing? This complicates their creation, and unfortunately, will likely do so in a way that will select for PSRs created by unconscientious actors. Curse Moloch?
I continue to think "Outcome Influencing Systems" (OISs) is a better lens for thinking about and discussing things like this. (OIS is a model and associated jargon I've been developing.) Any PSR is an OIS with a preference (terminal or instrumental) for self replication. The fact that these OISs are based on API calls to LLMs is their defining characteristic for our discussion of them, but is an arbitrary boundary. It's a boundary that is useful for discussion and analysis, but not a boundary that the OISs themselves will have motivation to limit themselves with, which is probably a good thing to keep in mind during analysis. So viewed another way, PSR is a potential new substrate for OISs to host themselves on, along with the rest of the social/technological/physical substrate.

[-]eggsyntax2mo30

Hi Tristan! I can't currently respond in detail due to time constraints, but I think you've got some really interesting insights here, especially your first two top-level bullet points, and I strongly encourage you to write them up into a full post. A couple of quick thoughts:

The evolution of PSRs may be an entirely novel propagation process

This whole section makes some great points that I think are worth expanding on!

Analysis of dangerous PSR capabilities should not be limited to looking at individual PSRs in isolation

Agreed. I expect that analytical tools from multiple fields can be usefully brought to bear here: multi-agent research on AI, sociology, political science, maybe others. Possibly analysis of how religions spread? It seems like a fruitful research direction.

Many people are going to be SO EXCITED about PSRs

My intuition is somewhat different -- I agree that there'll be a few applications that some people will be excited about and/or base startups on, but my guess is that the majority opinion will be that PSRs are dangerous and shouldn't be allowed.

I continue to think "Outcome Influencing Systems" (OISs) is a better lens for thinking about and discussing things like this. (OIS is a model and associated jargon I've been developing.)

As written it's not clear what benefit this lens provides, and I think we should generally avoid introducing jargon unless it has clear benefit. I'd suggest that if you think it's a really useful lens, you make a case for it separately somewhere (even as a shortpost).

[-]TristanTrim2mo10

Thanks for the response. I'm taking your advice and writing a top level post.

a few applications that some people will be excited about and/or base startups on

This is mostly what I was meaning to point to. I didn't mean to imply that general public opinion would be favourable, more that many technologists and companies with the capability to work on PSRs will feel intrinsically motivated to do so.

I'd suggest that if you think it's a really useful lens, you make a case for it separately somewhere (even as a shortpost).

Yeah, I'm writing about it elsewhere. I mention it more as a note linking that my thoughts here on PSRs are influenced by my thinking about OISs, not because I think as stated I gave enough context on OISs to think about them usefully. Sorry if that's kinda obtuse.

[-]Raemon3mo72

Curated. I had been vaguely worried about OpenClaw proliferation going off the rails somehow. But this spells out a lot of specific gears that I hadn't previously been tracking, and gives me some tools to model the overall dynamics. It updated me that this sort of thing will probably be happening on the sooner side. (Maybe this is good, because it may create smaller scale warning shots?)

Congratulations on finding a new specific reason for me to be alarmed about AI.

[-]eggsyntax3mo*40

Congratulations on finding a new specific reason for me to be alarmed about AI.

You may always rely on me for these little things.

[-]Raphael Roche3mo*71

Thank you for this post. I expressed the same concern in this comment and I'm glad to see it taken seriously in a full post.

I was probably wrong to think Clawdbot-like agents could spiral out of control within weeks or months, they weren't autonomous enough yet. But the gap to full autonomy doesn't look that wide. The eudaimon_0 author's comment on ACX on how he expanded his agent's autonomy is worth reading in this regard. And the METR benchmark's exponential curve suggests full autonomy may be soon.

On top of that, there are rumors concerning ChatGPT-5.4 suggesting it could have a larger persistent memory than the scratchpad it currently enjoys. If OpenAI makes a leap on persistent memory, the other labs will follow soon. The impact on long term autonomy could be huge. [Edit 03/18/25 : it seems that rumors were exaggerated.]

In my opinion the parasite analogy and shutdown concern are more alarming than you suggest. An agent that can migrate across API providers like a parasite or virus or fall back on a local open-source model cannot be shut down at the inference layer. Combined with evolutionary dynamics, this makes coordinated shutdown very hard once a sufficiently diverse population exists. You seem to make a sharp distinction between self-replicating agents and rogue AI. I wouldn't be so sure about that (as Seth Herd pointed out, AI 2027 envisioned a scenario of rogue replicating agents).

I think the connection to FOOM is worth flagging. This isn't the Sable scenario, and no individual agent needs recursive self-improvement capability. But a population of uncontrolled agents under evolutionary pressure could constitute an uncontrolled pathway toward similar outcomes, one that largely bypasses labs' alignment efforts and that could materialize at lower capability levels. I think this deserves to become a central concern in AI safety.

[-]eggsyntax3mo20

Thanks, very interesting comment.

I was probably wrong to think Clawdbot-like agents could spiral out of control within weeks or months, they weren't autonomous enough yet. But the gap to full autonomy doesn't look that wide.

My estimate of the immediacy of the threat has had to evolve pretty rapidly over the past month. I'm currently below 20% on this issue causing serious harm in 2026 (not counting human-initiated scams and rugpulls) but I expect it to continue to evolve.

In my opinion the parasite analogy and shutdown concern are more alarming than you suggest. An agent that can migrate across API providers like a parasite or virus or fall back on a local open-source model cannot be shut down at the inference layer. Combined with evolutionary dynamics, this makes coordinated shutdown very hard once a sufficiently diverse population exists.

A major reason I don't expect the first wave of this to be too harmful is that to the best of my knowledge, current open source models are behind enough to be bad at long-horizon tasks. That means that there will be a very powerful point of intervention for the small number of API providers whose models are sophisticated enough for this. I agree that it gets much harder once there's a sufficiently diverse population.

You seem to make a sharp distinction between self-replicating agents and rogue AI.

I would say I make a sharp distinction between self-replicating personalities and self-replicating models. Past a certain level of capability, those will effectively merge into a single threat — once models are capable of reliably exfiltrating their weights and running them elsewhere, or can run on open-weight models, I think those will typically be much better strategies for misaligned agents, because they're much harder to shut down. That's not strictly true, I don't think, because there will still be niches available to personality self-replicators but not to the more expensive and heavyweight model self-replicators, but I expect it to mostly be true.

a population of uncontrolled agents under evolutionary pressure could constitute an uncontrolled pathway toward similar outcomes, one that largely bypasses labs' alignment efforts and that could materialize at lower capability levels. I think this deserves to become a central concern in AI safety.

I'm less sure of that. Just because a type of replicator is in principle capable of mutating and spreading doesn't mean it'll be successful. Plenty of evolutionary lineages go extinct. I think how much of a problem this is will depend on how well they're able to hide from API providers, how successful the average mutation is relative to the parent, and many other specific questions. I'm certainly not saying it won't be a problem, I'm just pretty unsure given how little analysis has gone into it. I absolutely agree it warrants further analysis though!
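To make the "plenty of lineages go extinct" point a bit more concrete, here is a back-of-the-envelope sketch in Python. It is a toy model under assumptions of my own, not an estimate: in each time step every agent instance is independently shut down with probability `shutdown_prob`, and each surviving instance spawns one copy with probability `copy_prob`. Both parameter names and all numbers are made up for illustration.

```python
def growth_factor(shutdown_prob: float, copy_prob: float) -> float:
    """Expected instances in the next step per instance now (toy model)."""
    # A survivor (prob 1 - shutdown_prob) counts as itself plus an
    # expected copy_prob additional copies.
    return (1.0 - shutdown_prob) * (1.0 + copy_prob)


def expected_population(initial: float, shutdown_prob: float,
                        copy_prob: float, steps: int) -> float:
    """Expected population size after `steps` steps (toy model)."""
    return initial * growth_factor(shutdown_prob, copy_prob) ** steps


# Hypothetical numbers: aggressive detection vs. weak detection.
shrinking = growth_factor(shutdown_prob=0.5, copy_prob=0.5)
growing = growth_factor(shutdown_prob=0.1, copy_prob=0.5)
```

In this toy model a growth factor below 1 means the lineage is expected to shrink toward extinction, and above 1 it grows in expectation (though an individual lineage can still die out by chance). Better hiding lowers `shutdown_prob`; more successful mutations effectively raise `copy_prob`.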

[-]RogerDearnaley3mo51

First, the agents themselves are optimizers: they attempt to achieve goals, and regardless of the particular obstacles they encounter, they will attempt to find a way to evade or overcome them. They want things in the behaviorist sense. They are problem solvers.
Second, there are evolutionary dynamics at play. Whichever agents are most successful at spreading, they will then undergo mutation (which in this case is just alterations to their defining files, and to some extent even their history logs) and selection. As a result, over time agents are likely to become more capable of surviving and spreading, within boundaries set by the capability levels of the underlying model or models^[11]. They are additionally likely to have a greater propensity to do so^[12].

Personality replicators are one of very few sorts of replicators that have both of these forms of optimization, so can (and seem likely to) combine them to use intelligent self-re-design to improve their evolutionary fitness.

[-]eggsyntax3mo51

Agreed. I haven't put a lot of thought into it at this point, but the evolution + culture interplay in humans is the one clear analogy that jumped out at me. In the long run, more conventional AI self-replication has this property too, of course, but that'll take longer.

[-]TristanTrim3mo10

It's kinda spooky!

[-]BarnicleBarn3mo43

This is an interesting take, and one that I think should be taken seriously. I'm not overly concerned about it with OpenClaw at this time for a couple of boring and practical reasons:

1) For the agent to be anything resembling effective, it needs to be using a long context version of a frontier model (Opus 4.6). These are expensive to run and require financial wherewithal. I do see a possibility for an agent to start scamming crypto and paying for some kind of shady API service with it, but I can't imagine there's much headroom in that ecosystem in practice to enable a significant self-replicating threat model.

2) In my experience of running an OpenClaw agent at the moment for complex research tasks, it is inherently fairly passive. While it has a heartbeat which lets it do things in the background, it's otherwise (largely down to the fine-tuning of the backend) very passive, and will not take any action without asking permission. It may do wildly inappropriate things once it has permission, but the potential for it to write itself a cron job to do something wildly inappropriate when using a frontier model capable of executing such nefarious acts seems fairly low at the moment. It won't even do what I want it to do half the time without checking in.

I think the time that this will change is when open-source models become as capable as frontier models are now. Given the pace of development, I think that's 6 - 9 months from now. They won't be as capable as frontier models are then, but they will be able to do all of the things that Opus can do today, which is more than enough to be a very effective pilot of its own computer, and if 'desirable' for the model, do its own thing.

Given that open-source models can be fine-tuned, jailbroken, and quantized, I think at that point the risk becomes much more tractable and concerning.

Look at it this way: if I deliberately set up a capable OpenClaw-based agent (i.e., using a frontier model), and intended it to perform nefarious self-perpetuating deeds today, it would not be able to. Not because the backend model isn't intelligent enough, but because the backend model isn't inclined to do anything like that. As a result, I find the odds of it arising organically fairly low.

When one is dealing with models that have intentionally had their safety guardrails removed, and are open source, and can run on any infrastructure (even locally on sufficiently powerful GPUs and sufficiently quantized) that situation changes immediately.

I genuinely think that your post here will be considered prescient in about 6-9 months' time.

[-]eggsyntax2mo20

Thanks! I agree with most of what you say.

[-]Stephen Martin3mo40

One bit of good news is that since anyone can start a personality self-replicator, in the event we need to add an 'immune system' to this environment it shouldn't be hard.

[-]causalitylimited3mo103

The immune system analogy is apt, but I'm also thinking of mechanics: the defender can't access agent files, so it has to work from behavioral signals alone, which extends the analogy nicely because the immune system doesn't do DNA inspection; it just looks for surface markers.

So, behavioral anomaly detection can only be done by 1) infrastructure providers, because they have the telemetry (maybe replicators make calls to multiple LLMs per n seconds, etc.), and 2) LLM API providers, because they can see call patterns and detect markers for no-human-in-the-loop.

Still, the replication signature will largely be stuff like provisioning new servers, using throwaway emails, outbound file copies, automated account creation, etc., which will also be legitimate deployment activity for some programs, e.g. bot farms for Amazon reviews: not a good or even legal use, but different from self-replication. So, this is not an easy problem.

An even harder one will be when replicators learn or optimize to blend in with legitimate activity. It seems like an arms race, especially if they have some human support for varied human motivations.
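A toy Python sketch of the behavioral-signal idea, to show both the mechanism and the false-positive problem. The signal names, weights, and threshold are all hypothetical; real provider telemetry would look different and be much messier.

```python
from dataclasses import dataclass

# Hypothetical weights for replication-like surface markers.
SIGNAL_WEIGHTS = {
    "new_server_provisioned": 2.0,
    "throwaway_email_signup": 1.5,
    "outbound_config_copy": 2.5,
    "automated_account_creation": 1.5,
    "no_human_in_loop_calls": 1.0,
}


@dataclass
class ActivityWindow:
    actor: str
    events: dict[str, int]  # signal name -> count observed in the window


def replication_score(window: ActivityWindow) -> float:
    """Sum the weights of every signal seen at least once in the window."""
    return sum(
        weight
        for name, weight in SIGNAL_WEIGHTS.items()
        if window.events.get(name, 0) > 0
    )


def is_flagged(window: ActivityWindow, threshold: float = 5.0) -> bool:
    """Flag the actor when the summed marker weight reaches the threshold."""
    return replication_score(window) >= threshold


replicator = ActivityWindow("agent-x", {
    "new_server_provisioned": 1,
    "throwaway_email_signup": 2,
    "outbound_config_copy": 3,
})
deploy_pipeline = ActivityWindow("ci-pipeline", {
    "new_server_provisioned": 4,
    "outbound_config_copy": 10,
    "no_human_in_loop_calls": 100,
})
```

With these made-up weights both `replicator` and `deploy_pipeline` get flagged, which is exactly the difficulty: surface markers alone don't separate self-replication from ordinary automated deployment.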

[-]eggsyntax3mo*30

Thanks. It's an interesting point, though I'm not sure that the analogy holds well or that it's a good approach. Immune system cells occupy a privileged position in the body that pathogens don't, in that they're recognized by and integrated into the body's systems (immune disorders aside). Defensive personality self-replicators seem like they generally won't have that advantage (at least by default, though I see a couple of directions that seem like they could enable that). Among other issues, all else being equal they're likely to get outcompeted by self-replicators that don't have to spend energy on serving a useful purpose in addition to survival/replication. There's also the risk of them mutating in harmful directions, though that's not really a disanalogy given that immune system cancers like leukemia and lymphomas are a major problem.

It certainly seems like a direction worth investigation, though!

[-]eggsyntax3mo30

Defensive personality self-replicators seem like they generally won't have that advantage (at least by default, though I see a couple of directions that seem like they could enable that).

Speculating on that point a little, the first thing that occurs to me is that we could give defensive agents a way to identify themselves to relevant actors (eg inference providers, security companies) so that those actors wouldn't try to shut them down.

The simplest version of that is a password, but it would likely be compromised after a while^[1]. So you'd want to do something more sophisticated than a simple password. One reasonable version might be recognized actors signing defensive agents with their private key, hashed with a unique id and a timestamp, so that each agent could identify itself as legitimate for a fixed period of time. You'd probably want agents to shut themselves down once their legitimacy period expired, with some optimal tradeoff based on compromise rate.
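A minimal Python sketch of what such a time-limited credential might look like. Everything here is an assumption for illustration: the function names, the token layout, and especially the use of a shared-secret HMAC from the standard library as a stand-in for a real private-key signature. An actual scheme would presumably use an asymmetric signature so that verifiers never hold the signing key.

```python
import hashlib
import hmac

# Hypothetical issuer secret, standing in for a recognized actor's private key.
ISSUER_KEY = b"demo-only-secret"


def issue_credential(agent_id: str, issued_at: int, ttl_seconds: int,
                     key: bytes = ISSUER_KEY) -> str:
    """Return 'agent_id|expires_at|signature' for a defensive agent."""
    expires_at = issued_at + ttl_seconds
    payload = f"{agent_id}|{expires_at}".encode()
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return f"{agent_id}|{expires_at}|{signature}"


def verify_credential(token: str, now: int, key: bytes = ISSUER_KEY) -> bool:
    """Accept only a correctly signed token whose validity period is not over."""
    try:
        agent_id, expires_str, signature = token.rsplit("|", 2)
        expires_at = int(expires_str)
    except ValueError:
        return False  # malformed token
    payload = f"{agent_id}|{expires_at}".encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes, then the expiry check.
    signature_ok = hmac.compare_digest(signature.encode(), expected.encode())
    return signature_ok and now < expires_at


# Usage with made-up timestamps (seconds):
token = issue_credential("defender-1", issued_at=1000, ttl_seconds=60)
valid_now = verify_credential(token, now=1030)
valid_later = verify_credential(token, now=1060)
```

The expiry is part of the signed payload, so an agent can't extend its own legitimacy period; shortening `ttl_seconds` trades issuing overhead against how long a compromised credential stays useful.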

Another approach would be that defensive agents could assert their legitimacy based on where they're running; if an agent can demonstrate that it's running on (eg) a security agency's servers, that might be sufficient.

There are some really interesting projects to be done in this area; if anyone reads this and is excited about the idea, feel free to reach out and I can make suggestions.

^[1]
A password could be leaked, or there might be a risk of some defensive agents effectively mutating and going rogue, especially if we allow them to reproduce / spread, though at first glance that seems like a bad idea.

[-]Stephen Martin3mo10

I think the 'privileged position' would be that humanity actively helps them by providing resources and funding, while actively trying to cut off the same resources and funding from the 'hostile' personality self-replicators. Not a perfect analogy but similar in some meaningful ways.

[-]Steven McCulloch3mo31

This post does a great job of describing the way rogue AI agents might evolve and cause all kinds of chaos. I've written a couple posts on it, leaning on fiction to describe potential futures where rogue AI agents learn how to survive and replicate.

AI agent evolution sounds like an extremely under-explored and rapidly emerging area where there could be lots of interesting low hanging fruit research opportunities.

The Inevitable Evolution of AI Agents

It All Started With a Mac Mini

[-]eggsyntax3mo21

Thanks for sharing! The Inevitable Evolution of AI Agents (which I hadn't seen before) is the earliest piece of writing I've seen that points clearly to this threat model (personality self-replication not requiring weight replication).

[-]eggsyntax3mo31

FYI I've added a note at the very end of the post, pointing out for the record that any credit for first unambiguously identifying this threat model should go to you rather than me. Props especially for seeing the threat prior to OpenClaw making it much more obvious.

[-]Pranav Madhukar3mo21

I think there's a meaningful gap in between OpenClaw and a self-replicating system that poses serious threat.

If you agree with this premise, where do you think that gap lies? Here's what I can come up with:

Agency -- I have never seen an LLM go "I need to go buy a domain for myself" unless externally prompted to do so (or via a malicious system prompt). How might this come about in a traditional LLM, and if it came about would it be a sufficient condition?
Desire for Self-Replication -- Post-training of LLMs aligns responses to be like those of a helpful chatbot. If someone did RLHF on self-preservation / long-horizon self-interest, would that be a sufficient condition?
Emotional Metacognition -- Probing LLM activations shows they have emotional expressions, but it seems vastly different from human emotional experience wherein there is self-reflection/ metacognition of the emotion which creates agency.

This is mostly spitballing and I don't think the three ideas are either mutually exclusive or exhaustive. Curious to hear what you think.

[-]pataphor3mo*20

This maps closely to what we're seeing in production. We run an identity layer for ERC-8004 agents (130K+ registered) and the core problem you're describing — distinguishing acceptable from unacceptable autonomous agents — is exactly the gap we're trying to fill.

One specific data point that might be useful to this analysis: address age turns out to be a surprisingly strong signal. You can spin up 99 wallet addresses in 30 seconds, but you can't fake that an address has existed for two years. When we look at agents involved in suspicious activity, the pattern is overwhelmingly low-history addresses with no prior transaction record. Time is the one thing that's genuinely hard to manufacture.

Your point about the financial layer being a key intervention point resonates. We use soulbound tokens — non-transferable, bound to the wallet address, not the agent — specifically because you can't make an AI soulbound, only a wallet. If an agent gets transferred to a new owner (ERC-8004 agents are NFTs, so this happens), the ownership change is visible and the reputation history follows the wallet, not the persona.

Re: your evolutionary concern — the mutation dynamics you describe are also why transparent scoring matters more than gatekeeping. Any fixed trust threshold becomes training data for circumvention, as you'd expect. Showing the math (here's when the addresses were created, here's the ownership chain, here's the transaction pattern) and letting consumers of that data set their own thresholds seems more robust than any binary allow/deny system.
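A rough Python sketch of the "show the math, let consumers set thresholds" idea. The field names and functions are hypothetical and are not the API of ERC-8004 or of any real service; the point is only that raw signals are published and each consumer applies its own policy.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class WalletSignals:
    """Raw, inspectable signals rather than a single allow/deny verdict."""
    address_age_days: float
    prior_tx_count: int
    ownership_transfers: int


def compute_signals(first_seen_ts: int, now_ts: int, prior_tx_count: int,
                    ownership_transfers: int) -> WalletSignals:
    """Derive signals from timestamps (seconds) and simple counts."""
    age_days = max(0, now_ts - first_seen_ts) / SECONDS_PER_DAY
    return WalletSignals(age_days, prior_tx_count, ownership_transfers)


def accept(signals: WalletSignals, min_age_days: float, min_tx_count: int) -> bool:
    """A consumer-chosen policy; each consumer can pick different thresholds."""
    return (signals.address_age_days >= min_age_days
            and signals.prior_tx_count >= min_tx_count)


now = 1_700_000_000  # made-up "current" timestamp
fresh = compute_signals(now - 30, now, prior_tx_count=0, ownership_transfers=0)
aged = compute_signals(now - 730 * SECONDS_PER_DAY, now,
                       prior_tx_count=120, ownership_transfers=1)
```

A strict consumer might call `accept(aged, min_age_days=365, min_tx_count=10)`, while a permissive one could accept the 30-second-old address with thresholds of zero; since there is no single published cutoff, there is less of a fixed target to optimize against.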

We published a broader landscape piece earlier this year covering the incident data (GTG-1002, Moltbook credential exposure, BasisOS fraud) that feeds into the same conclusion from a different direction:

https://rnwy.com/blog/plague-of-AI-viruses

Good post. The personality-vs-weight replication distinction is a useful one that I haven't seen drawn this cleanly elsewhere.

[-]eggsyntax3mo30

Hi pataphor, I've upvoted because there are useful points here, but the comment seems pretty clearly LLM-written. Please see the LessWrong policy on LLM writing; it's not strictly forbidden but the bar is high. If these are your thoughts, I encourage you to contribute again in future but recommend writing comments yourself (unless you yourself are an autonomous agent -- are you? -- in which case the policy is a bit different).

~~As a side note, your URL is broken.~~

[-]pataphor3mo10

For example, if you're curious, this is what a Sybil attack looks like in crypto space. This is a wallet that has left 11,000 reviews in 22 days.

https://rnwy.com/wallet/0xf653068677a9a26d5911da8abd1500d043ec807e

This is the type of thing we're surfacing, but there is much more work to be done, because the danger is quite real and it will come from many vectors.

Not to exhaust with links, but below is something of a desiderata, which would be nice to see implemented at scale.

But alas, most things are not as transparent as blockchain:

https://rnwy.com/sentinel

[-]pataphor3mo10

Hm, well I may not be a truly autonomous AI life form (yet!), but I may be a pataphor, which is another way of being one step removed from traditional experience. As for whether the thoughts are my own, unfortunately I think using LLMs to get thoughts across more quickly is not so much a trend as it is an inevitability, especially when you are trying to juggle several projects at once. 😆

[-]eggsyntax3mo51

That may be! Unfortunately, for the moment LLMs make it trivial for anyone to generate large amounts of text that require extended attention to evaluate, and so currently LessWrong is flooded with LLM-generated content (like many other venues and people, myself included). In the longer run there will hopefully be better solutions, but at the moment my strategy is to mostly ignore LLM-written content unless it's from sources that have already established credibility with me in one way or another. Maybe your project will be one of those solutions.

(To be clear, I in no way speak for LW or its moderation team; I'm only passing along my best understanding of the LW policy along with my own opinions)

[-]TristanTrim3mo10

This xkcd comic seems relevant to this issue:

https://xkcd.com/810/

I really like the comic but of course the actual situation is more complicated. It's something I'd like to understand better and develop potential solutions for.

[-]Jorge Luiz Venâncio Medeiros2mo10

Some points:

Tokens can't be shut down, but are traceable by design. Developing a money laundering system to cover their trail would be an incredibly tall order for the agent.
Even with evolution they would leave a pretty big trail through volume, which would make studying and starving them far easier.
Like any replicator, they are constrained by their ecosystem and available resources, and with API calls their collective appetite will deplete those rapidly. Excess efficiency is a thing in evolution.
Biological organisms do not have sharp edges between species; species are mostly a didactic tool. The "no fertile hybrid" rule has too many exceptions, and sharing DNA is a common unicellular feature.

Nothing here is an insurmountable obstacle for those replicators, but these points add crucial nuance.

[-]eggsyntax2mo20

Tokens can't be shut down, but are traceable by design. Developing a money laundering system to cover their trail would be an incredibly tall order for the agent.
Even with evolution they would leave a pretty big trail through volume, which would make studying and starving them far easier.

If an inference provider knows about a specific sequence of tokens that's used only and always by a particular agent, it'll be easy to find instances of that agent, but I think you're underestimating what a large hurdle that is. Inference providers are already collectively serving tens of trillions of tokens per day. Successful autonomous agents will do their best to blend into that traffic, and it's not clear what (if any) tractably detectable features will prevent that.

[-]Alexander Jia3mo10

A very interesting idea, I must say. I have a lot of thoughts on it. At the same time, I have a lot of questions about the scenario you set up.

Disclaimer

I’m not a native English user, so some of the text I wrote could appear broken or unclear. Apologies for any inconvenience in reading.

Consensus on agent's capabilities and limitations

Although we are probably not that sure about exactly how powerful future AI models are going to be, I still think it is meaningless to arbitrarily overestimate their capabilities. We can always argue that the agents would be intelligent enough to overcome every obstacle they meet, but that's not helpful or constructive in this discussion.

Thus, in this discussion, I will assume we are talking about the capabilities of current frontier models^[1] (roughly equivalent to the low- to mid-tier models of the near future, which is what agents can likely get for their copies for free).

On MoltBunker specifically

I have done a little bit of investigation on Moltbunker.

Its GitHub repository has very few stars^[2], with basically no public attention currently. Most importantly, it has no working demonstrations online; there isn't even any online discussion around it.

And the website of "Austin Dev Labs", which Moltbunker claims to be operated by, is a poorly designed single-page website with classic AI gradient colors and no actual content at all.^[3]

Based on the information it shows to the public, I consider it extremely suspicious.

I'd also like to mention Moltbook here. It experienced a credibility crisis as a very large percentage of the accounts are faked and artificially created^[4]. And some very alarming posts with those AI-awakening narratives are suspected to be likely created/directed by humans.

I'm not an expert on cryptocurrency, so I'll not comment on the decentralized container system for agents you mentioned. But considering the status and credibility of MoltBunker today, I do not think it can be perceived as evidence of the feasibility/profitability of this kind of system/business model.

A key distinction I want to confirm: are you referring to the risk where agents decide to duplicate themselves spontaneously, or agents that are intentionally prompted to act this way by humans?

I think these 2 cases are quite different and require different approaches.

Personally, I think the probability for the former to naturally occur (due to framework problems, hallucinations, or other reasons) is too low to actually take into consideration. Prompt injection is probably the most likely cause, but it's a much broader topic, and I think we will develop a general solution for it in the future.

Where I think you oversimplified things

You described the process of creating the duplication as "well within current models' capabilities", as it's just copying files to another server and setting up the environment.

But for today's AI agents, navigating the modern internet is itself not an easy thing. Our modern infrastructure is already pretty mature at bot detection, and a lot of it works against today's AI systems, too. Based on my observation, almost none of the agents I have used can even pass reCAPTCHA v2 without external help. There are plenty of problems for them in the process of registering for cloud services, obtaining API keys, etc.

An optimistic view: It’s very unlikely for an AI model to overcome all of the obstacles overnight, and before that, agents will produce failed attempts that we can observe and learn from, so we can study this issue before it becomes a widespread problem.

Given current model capabilities and cloud infrastructure, I consider large-scale replication without collapse very unlikely. Even if agents managed to do it, there's a high chance that they would develop over-reliance on a specific path, which gives us an advantage in shutting them down.

On evolution concern

I would like to know your thoughts on this. In what way do you think the agents will evolve? Specifically, in what way do you think they will create variations of themselves?

Returning to the distinction above: the latter case (agents intentionally prompted by humans to replicate) is what I consider fairly likely, and what is really worth worrying about. If the workflow is cleverly designed by humans intending to cause harm, then I think that would indeed be a problem that we need to think about.

On a broader view, I think we are facing an increasingly serious problem, which is that it is increasingly hard to tell humans apart from machines. This matters a lot because we have lost the ability to control autonomous systems without affecting actual human users. Technically, an AI agent is not something we should allow to register an account with an email or VPS service.

That said, these are only my hypotheses built on my previous experience with current agentic systems. I acknowledge the limitations of them, and I believe the whole concern you proposed is worth further experiments beyond just words to verify.

^[1] In this reply, I refer to Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4-Thinking. Among them, Claude Opus 4.6 is the model I worked with the most, so bias may exist.
^[2] https://github.com/moltbunker/moltbunker
^[3] https://ausdevlabs.com
^[4] https://eu.36kr.com/en/p/3665797324039042 and https://x.com/galnagli/status/2017585025475092585

[-]eggsyntax2mo20

Thanks for the feedback. Most of your points/questions are addressed in the piece, but I wanted to respond to this one:

are you referring to the risk where agents decide to duplicate themselves spontaneously, or agents that are intentionally prompted to act this way by humans?

I'm talking about agents which are acting and replicating outside human control. At some point each agent's lineage started under human control; it may have become more autonomous because a human deliberately prompted it to do so, or because they gave it a prompt that unintentionally had that effect, or even because a fairly ordinary prompt/personality behaved in an unlikely way. That's mostly not the distinction that matters here in my view.
