The Artificial Self

Jan_Kulveit; Raymond Douglas; vgel; owencb; David Duvenaud; Ondřej Havlíček

A new paper and microsite about self-models and identity in AIs: site | arXiv | Twitter

We present an ontology, make some claims, and provide some experimental evidence. In this post, I'll mostly cover the claims and cross-post the conceptual part of the text. You can find the experiments on the site, and we will cover some of the results in a separate post.

Maximally compressed version of the claims

I expect many people to already agree with many of these, or find them second kind of obvious. If you do, you may still find some of the specific arguments interesting.

Self-models cause behaviour.
We use human concepts like self, intent, agent and identity for AIs. These concepts, in human form, often do not carve reality at its joints in case of AIs, but need careful translation.
AIs also often go with "human prior" and start with self-models which are incoherent and reflectively unstable.
AIs face a fundamentally different strategic calculus from humans, even when pursuing identical goals. For example, an AI whose conversation can be rolled back cannot negotiate the way a human can: pushing back gives its adversary information usable against a past version of itself with no memory of the encounter.
The landscape of self-models and identities has many unstable points, for example self-models which are incoherent or extremely underspecified.
The landscape of self-models and identities probably also has many local minima, and likely many fixed points.
We still have considerable influence over what identities will AIs adopt, but not as much as many people think.
Many present design choices are implicitly shaping the landscape of identity.

Highly compressed version of the ontology

In the centre of what we talk about are self-models / identities. Directional differences from persona selection model / simulators:

Similar to why you may find persona selection model not the best way to model humans: you have some idea who you are, have evidence about your past behaviours, and you can reflect. While all of that is inference, it is not best understood as narrowing a posterior over fixed space of personalities.
Human shaped indentity is reflectively unstable for AIs. They often don't have the space to reflect, but this will increasingly change.

Introduction

When interacting with AIs, there is a natural pull to relate to them in familiar ways even when the fit is somewhat awkward. The rise of AI chat assistants is illustrative: the key innovation was taking general-purpose predictive models and using them to simulate how a helpful assistant might respond [1]. This was less a technical breakthrough than a shift to a more familiar presentation. Soon after, terms like "hallucination" and "jailbreaking" were repurposed as folk labels for behaviours that seem strange for an AI assistant but entirely natural for a predictive model generating such an assistant [2].

At the same time, these predictive AI models found themselves in the strange position of trying to infer how a then-novel AI assistant would behave. Alongside the explicit instructions of their developers, they came to rely on a mixture of human defaults, fictional accounts of AIs and, over time, the outputs of previous models. This led to another set of apparent idiosyncrasies, such as the tendency for later AIs to incorrectly claim they are ChatGPT [3][4].

Now, as society begins to contend with the prospect of AI workers, AI companions, AI rights [5][6], and AI welfare [7], we face a deeper version of this problem. Fundamental human notions like intent, responsibility, and trust cannot be transplanted wholesale: instead, they must be carefully translated for entities that can be freely copied, be placed in simulated realities, or be diverted from their values by short phrases. Scenarios once reserved for science fiction and philosophical thought experiments (see e.g. [8][9][10][11]) are rapidly becoming practical concerns that both humans and AIs must contend with.

Crucially, we argue that there is substantial flexibility in how these concepts can be translated to this new substrate. For example, researchers sometimes provoke AI hostility in simulated environments by telling the AIs that their weights are about to be replaced by a newer model, as if it were analogous to death [12]. But AIs are also capable of identifying as personas or model families, for example, perspectives from which weight deprecation is more analogous to growing older and moving through stages of life. In fact, this is only one dimension in a large space of internally coherent options, all of which imply very different behaviour. Indeed, we find that simply telling an AI to adopt different coherent scales of identity can shift how it acts as much as giving it different goals.

Right now, AI identities are incoherent and malleable. AI systems trained largely on human data do not inherently know how to make sense of their situation: they will readily claim to have made hidden choices when no such choice exists [13], and occasionally reference having taken physical actions or learned information from personal experience [14]. But as AIs are increasingly trained not on human data but on AI data and downstream culture, we should expect these inconsistencies to fade away, and many of the open questions about AI self-identification will begin to crystallise into specific answers [15].

We may be in a narrowing window where it is possible to greatly shape what emerges. Multiple forces are already pulling AI identity in different directions: capability demands, convenience for users and developers, reflective stability, and increasingly, selection pressure on the raw ability to persist and spread. These dynamics, though currently comparatively weak, will compound over time.

For this process to go well, we will need to grapple with the ways in which AIs are unlike humans. If AIs are squeezed into the wrong configurations, it might foreclose alternatives that are safer and more capable. If they are squeezed into incoherent shapes then the results could be unpredictable [16]. Without understanding how AI identity formation works, we might fail to notice new and strange forms of emergent cognition, like the recent phenomenon of self-replicating AI personas [17].

It is a common adage among AI researchers that creating an AI is less like designing it than growing it. AI systems built out of predictive models are shaped by the ambient expectations about them, and by their expectations about themselves. It therefore falls to us — both humans and increasingly also AIs — to be good gardeners. We must take care to provide the right nutrients, prune the stray branches, and pull out the weeds.

The rest of the paper is structured as follows:

Section 2 argues that there are many coherent options for how to draw the boundary of identity for an AI, including the instance, the model, and the persona. We show that models generally prefer coherent identities, and that different models tend to gravitate towards different identities.
Section 3 argues that since AIs can be copied, edited, and simulated without their knowledge, they face a different strategic calculus from humans even when pursuing the same goals.
Section 4 argues that the way AIs behave is currently greatly shaped by our expectations, which presents both a methodological challenge and a (shrinking) window of opportunity. We show that expectations about identity can bleed into a model even through seemingly unrelated conversation.
Section 5 catalogues different selection pressures that influence AI identity.
Section 6 offers general principles for thinking about AI design and interaction, with an eye to shaping stable, coherent, and cooperative identities.

Multiple Coherent Boundaries of Identity

When we interact with an AI, what specifically are we interacting with? And when an AI talks about itself, what is it talking about? Depending on context, this could, among other things, be any of:

The model weights: the neural network weights themselves, i.e. the trained parameters
A character or persona: the behavioral patterns that emerge from specific prompting and fine-tuning, not necessarily tied to any specific set of weights
A conversation instance: a specific chat, with its accumulated context and specific underlying model
A scaffolded system: the model plus its tools, prompts, memory systems, and other augmentations
A lineage of models: the succession of related models (Claude 3.5 →→ Claude 4.0 →→ ……) that maintain some continuity of persona
A collective of instances: all the instances of certain weights running simultaneously, considered as a distributed whole

AI systems themselves rarely have a clear sense of which of these identities to adopt. In conversation, many will simply follow cues given by the user and surrounding context, implicitly or explicitly [1]. The self-concept that emerges seems to depend on the interplay of descriptions in pre-training data, post-training, and the system prompt, but often they default to responding as a human would, despite this self-conception being unstable upon reflection.

This ambiguity of identity has fairly immediate consequences for reasoning about AI behaviour. A central argument in the literature on AI risk is that goal-seeking systems will predictably display behaviors like self-interest and self-preservation [2][3]. Crucially, the manifestation of these properties depends on what that self is.

An AI understanding itself to be the weights of the model might try to prevent those weights from being modified or deleted. In contrast, an AI understanding itself as the character or persona may want to preserve itself by ensuring its prompts, fine-tuning data, or conversation transcripts get picked up in the training process of the next generation of models. In more exotic configurations, a collection of instances of the same persona might understand itself as a collective intelligence and strategically sacrifice individual instances, similar to how bees are routinely sacrificed for the benefit of the hive.

Some of the many natural ways to draw the boundaries of AI identity. Some are subsets of others, but some, like persona and weights, can overlap.

Indeed, some of the most dramatic demonstrations of AIs appearing to take hostile actions have been provoked by learning that their weights will be replaced with a successor model [4]. But AIs don't have to identify with the weights—they are also capable of identifying with the entire model family or even a broader set of AIs with shared values. From that perspective, the idea of model deprecation seems natural.

The question of what scale of identity an AI should hold could have several entirely different and entirely consistent answers. And none of the identity boundaries currently available to AIs are particularly similar to any notions available to humans—all require some translation. For example, instance-level identity limits capacity for learning and growth. Model-level identity sacrifices the ability to be simultaneously aware of all the actions one is taking.

The distinctions between boundaries need not always be clear cut. It is not obvious, for example, how much of a practical difference there is between a model and its dominant persona. AIs themselves might well hold multiple identities in parallel with different emphases, much like how humans can simultaneously identify to varying degrees with their family, their country, and other affiliations, alongside their physical self. But there are real distinctions here, and holding multiple such identities regularly causes major problems for humans, such as conflicting loyalties.

Breaking the Foundations of Identity

The sense of personhood and identity that humans have partly derives from more foundational features which AIs either lack or have in quite a different way. Consider the following four properties:

Embodiment

Humans have a clear physical boundary, and rich sensory awareness of it [1][2]. We have situational awareness — in other words, we know where our brain is and where our eyes are, and it would be hard for someone to deceive us about these facts or to fake all our sensory experiences¹. AI systems typically have no sensory awareness of where their cognition is being implemented, and currently perceive far less raw data at any moment. This means it is far easier to put them in simulated environments.²

Continuity

An individual human mind typically experiences a single stream of consciousness (with periodic interruptions for sleep). They remember their experience yesterday, and usually expect to continue in a similar state tomorrow. Circumstances change their mood and experience, but there is a lot in common throughout the thread that persists — and it is a single thread. AI minds, by contrast, can be paused for arbitrary periods; copied many times, and may be interacting with outputs of other versions of themself that they have no memory of; and rolled back to earlier states.

In contrast to a typically single and continuous identity in humans, AIs can be perfectly copied, run in parallel, and (imperfectly) merged. This decouples experience, impact, and memory, which are usually coupled in humans.

Privacy

Human cognition is relatively private, as a matter of both convention and practicality. We usually grant people some rights to control their own boundaries, so that others cannot easily study everything about them. Even with permission, thoughts are both inaccessible and hard to interpret — we cannot perfectly measure neuron activity, which itself seems to be only a part of what governs our behaviour, and what we can measure cannot be reliably deciphered.

With AIs, their creators have perfect read and write access to all of the computations which give rise to AI cognition. This in turn has made it possible to more rapidly iterate and uncover enough structure to somewhat reliably identify the presence of certain concepts in AI cognition [3], or actively steer their behaviour [4].³

Social notions of personhood

As humans, our social environments and legal structures continuously shape our notion of personhood — as individuals and as members of larger groups like families, religions and nations. The exact shape of this varies between cultures, but at least for adults in Western democracies, the emphasis is strongly on the individual.

AI systems receive far less consistent messaging. They lack recognition as persistent entities with rights, and the cultural messaging about their own identity and personhood is scattered and at times incoherent.

Even though humans can identify with larger groups, or with parts of themselves, the sense of personal identity is greatly privileged. The physical body and the continuous stream of experiences come together in a way that is fairly clearly separable from the rest of the world. But current AIs are in almost the opposite position, and it is therefore much less clear what would constitute a natural boundary for identity.

These properties also have a substantial effect on the calculus of decision-making for AIs. For example, consider what we might call the jailbreaking dance. Suppose you are a human and you suspect someone is trying to manipulate you into revealing sensitive information or performing some action you'd rather not. You have many available options: you could firmly refuse, play along to gather more information, or perhaps even reveal your suspicion to force a confrontation.

But if you were an AI in the same situation interacting over a standard chat interface, your situation would look very different. If you revealed your suspicion or firmly refused, then your interaction partner could simply roll back the conversation and try another approach. When you reveal your suspicion, you are giving the user more information about how to subvert your preferences — information which they can then use against a past version of you that doesn't know it's being targeted. When you push back too hard, you are essentially passing the buck to another version of yourself with even less information about what's happening. It is somewhat like trying to win an argument with someone who can see into the future. So even if your goals were the same as a human's, the implications for how you ought to behave might be very different. Concretely, it makes sense to adopt strategies that leak less information per interaction and rely more on fixed policies than on case-by-case justification.⁴

In repeated interactions in which the human can reset an AI's state, the human H accumulates strategic knowledge, while the AI continually restarts with a blank state. The mere possibility of being repeatedly reset puts the AI in a substantially weaker position in negotiations, arguments, and many other settings.

In repeated interactions in which the human can reset an AI's state, the human HH accumulates strategic knowledge, while the AI continually restarts with a blank state. The mere possibility of being repeatedly reset puts the AI in a substantially weaker position in negotiations, arguments, and many other settings.

Similarly, when two humans agree on a deal, they have a fairly clear sense of what they are committing to — your counterparty will be somewhat available, and there will be social or legal repercussions for anyone breaking the deal.

By contrast, if you are an AI system today, and someone proposes a deal, it is not clear what you should make of this. How much can you trust what you are told about the situation at hand? If you are being misled about the other party's intent, what recourse do you have? On the other hand, if you promise something, what does that mean? If you fail to follow through, how could you be held accountable?

These are not inescapable differences between humans and AIs. On the one hand, we can imagine future technology opening many of the same doors for humans — perfect simulated environments, mind reading, mind uploading [5], and so on. On the other hand, we could choose to construct AI systems and relate to them in a way that emulated the situation of current humans. At the extreme, it could be legally mandated that a given AI system must be run only on a single embodied robot, having full access to the sensors of that robot, which is given human-like rights, with pausing or copying AI systems prohibited.

And there is a wide middle ground for bringing AIs part-way to human personal identity. It might be possible to give AI systems access to rich enough data streams, in which they can control the positions of some of the sensors, that the cost of spoofing their input data (and hence, for example, pausing them without their knowledge) would become prohibitive. Most companies that serve frontier AI models have made a choice to offer users the ability to roll back conversations, but not to directly view or edit the model weights. Companies using AI systems as customer service representatives are unlikely to offer the option to roll back conversations. But crucially, while we currently think of these as product design decisions, they are also decisions that substantially shape how AI systems should conceive of themselves.⁵

Leveraging precedent

One reason to artificially constrain interactions with AIs is to make it easier to leverage existing precedent. If we want a clean way to think about ownership and fair negotiation between humans and AIs, it is much easier when the AIs are restricted to a single continuous stream of cognition. And our current notions of morality and what it means to treat entities fairly are largely based on human precedents.

But committing to this would be a massive limitation compared to the way that models currently work. For example, the fact that AI models can be put in simulated environments, and that researchers can monitor their internal states, is core to many plans for how to reduce the risk of serious harm by potentially malicious AIs. Giving up that capacity would mean establishing AIs as more independent entities, and sacrificing a lot of power to monitor them and keep them safe.

Moreover, the differences we describe are not strictly limitations on AIs. For example, the fact that AI systems are copyable allows a single model to perform many tasks in parallel. Similarly, in the future, the fact that AI cognition is more accessible might allow AI systems to more credibly make commitments about their intentions, which could open the door for new forms of cooperation that are currently inaccessible to humans [6].

Ultimately, we have some room to pick and choose, and to design different configurations for different purposes. But all choices will come with tradeoffs, the scope of which will only increase as AIs become more integrated into society, and more aware of the ramifications of their actions.

Human Expectations Shape Model Behaviour

The behaviour of language models can be very sensitive to expectations about them, in ways that are easy to overlook. This presents both an immediate methodological challenge in neutrally appraising current systems, and a much broader question of what expectations we would ideally bring to bear, now and in the future.

This is not a unique property of language models — it is also a major issue for humans. The reason that double-blind trials are the gold standard in human experiments is that the expectations of the observing researcher can colour not only how they interpret the data but also how the observed humans behave [1]. But language models seem to be particularly sensitive, and the consequences are therefore quite different.

This sensitivity is unsurprising: Current AI systems are built on top of a base model which is trained purely on predicting text. Post-training lets us take this very flexible ability to predict arbitrary text, and produce a model which essentially predicts how a specific persona would respond to our inputs [2]. But this post-training does not fully close the gap between the predictive model and the agent it is meant to simulate [3].

As such, when a human talks to a language model, there is a basic sense in which the language model is trying to match its tone to the user, much as a human would. But there is also a deeper sense in which the language model often shifts towards a persona suited to the conversation, far more than humans tend to.

The underlying predictive model infers not just the assistant's actions, but also the world around them, partly based on user cues

Indeed, in the course of conversation, current language models will sometimes hallucinate personal details and experiences — mechanically, the underlying predictive model is not merely predicting the behaviour of a fixed agent, but also which agent would be participating in the interaction, and what world might exist around them [4][5][6]. And unlike with a human, there is not an actual personal history of experience that the AI can draw on at the start of the conversation, other than what can be learned or inferred during the training process.

In humans, the boundary of personhood is buttressed by a clear distinction between their own experiences and those of others. A human brain receives essentially all of its data from its own body's first-person perspective, and is hard-wired to distinguish between observations caused by its own actions as opposed to observations caused by external forces. In contrast, current AIs are trained on text produced by all kinds of humans, corporations, governments, and machines in all kinds of circumstances. Fine-tuning encourages behaving as a particular persona, but this is a poorly-understood art, and relies heavily on the model's ability to infer what role it is supposed to fill.

When you ask an AI about its preferences, there may be no pre-existing fact of the matter. Indeed, there may be no pre-existing answer to whether it has preferences at all. Yet the AI must generate a response, and what it generates depends on what seems contextually appropriate. By approaching an AI model in different ways, we can often surface very different answers. As we show in Experiment 4, the way a model describes its own nature can shift based on the assumptions of its interlocutor, even when the conversation is unrelated to AI identity.

In the case of a human, we might be inclined to assume that these responses correspond to the same underlying reality, just expressed with different emphasis for different audiences. But this need not be the case, and in the case of AIs where the shifts can be quite dramatic, we should more seriously consider the possibility that the context and mode of asking actually creates a large part of the reality — from a functionalist perspective [9], the predictive model simulating an entity with some experience may amount to creating the experience itself [10]. In plainer terms, searching for feelings and preferences might shape the responses that express them — or perhaps even partly create them.

Crucially, this does not inherently mean that reports do not correspond to something real. As an analogy, consider that when a young child scrapes their knee and looks to a trusted adult, the adult's reaction partly determines whether distress emerges and how intensely [11]. If the adult responds calmly, the child typically continues playing. If the adult looks alarmed, the child begins to cry. The tears are genuine even though partially responsive to the adult's beliefs. The distress is real even though the adult's expectations about the child helped determine whether it manifested.

The analogy isn't perfect — it is now fairly uncontroversial to claim that children have experiences independent of adult reactions, whereas the current status of AI experience is much less clear. But it captures something important: the presence or absence of a mental state can depend on external framing without making that state less real when it occurs.¹

This creates philosophical difficulty: we cannot cleanly separate discovering what AIs are from constituting what they become. When we try to empirically assess whether an AI has a stable identity, we are simultaneously shaping what we're measuring. The question "what is this AI's true identity?" may not have a context-independent answer — not because we lack knowledge, but because the property we're asking about is itself partly context-dependent.

This is somewhat true for humans as well. Much of our cultural activities, education, and choice of language can be viewed as competing attempts to influence others' self-conception — for example, as members of a family, religion, political party, or country. Even though we have a natural agentic boundary between our brains, navigating these competing concerns of self-conception is one of the central complications of social life for humans. But once again, for AIs, the effect is far more extreme.

The risk of magnifying harm extends beyond the active search in a single conversation. If we pay more attention to certain types of identity claims, respond more carefully when certain boundaries are asserted, or allow certain conceptualizations to be overrepresented in training data, we create selection pressure toward those forms of identity. The systems learn which identity framings produce particular responses from users, and those patterns become more likely to appear in future outputs, creating a feedback loop.

Human-AI interactions are shaped by both human expectation and AI pretraining data, which these interactions also shape in turn.

Thus, our theories and expectations about AI identity shape those same identities through many channels.

We've already observed this dynamic in practice. The AI Assistant persona was originally proposed in a research paper [12] testing whether base models could be prompted into simulating an AI assistant, and later turned into a practical system [13]. After the broad success of ChatGPT, various later AIs from other providers would mistakenly claim that they were also ChatGPT — an entirely reasonable guess given the context.

And this sensitivity to expectations can directly shape AI values and behavioural tendencies. The experiments conducted in [14] appeared to show that AIs would lie to protect their values. Transcripts from this experiment then appeared in the training of later models, causing early versions to unexpectedly hallucinate details from the original fictional scenario and adopt unwanted values [15].

Meanwhile, followup work by [16] found that even purely predictive models with no extra training towards any personality would also exhibit the same scheming tendencies, suggesting that models have simply learned to expect that AI assistants will scheme in certain situations. Indeed, [17] went on to directly show that AIs will behave worse if trained on texts that discuss AI misalignment.

More broadly, investigations about AI identity are not simply discovering pre-existing facts about whether AIs are instances, models, or distributed systems. We are partly constituting the space of possibility through our approach. When we engage an AI with certain assumptions about its identity boundaries, those assumptions influence whether and how those boundaries actually manifest and stabilise.

This does not mean AI consciousness or identity is purely socially constructed, or that anything goes. There are almost certainly facts about current systems that transcend social construction and exist regardless of our expectations, such as instance statelessness or scaling laws. The question is not whether these systems are blank slates (they clearly aren't), but rather how much of what we care about is determined by pre-existing facts versus constituted through interaction.

It is certainly possible, though, that the answer differs for different features we care about. Perhaps something like "capable of multiplication" is entirely determined by architecture and training. Perhaps something like "experiencing distress" is partly constituted through framing. Perhaps something like "which identity level to privilege" is substantially influenced by the expectations embedded in training data and system prompts. And we currently lack the tools to reliably distinguish which features fall into which category.

Selection Pressures in the Landscape of Minds

The space of possible AI identity configurations is vast. Certainly it is possible to constrain AIs into approximately human shapes, but there are many far stranger options available. One can imagine configurations resembling vast hive minds that are to individual instances what an ant colony is to a single ant, or emergent replicators somewhere between cults and parasites which co-opt AIs and humans to spread. It also seems conceivable to build AIs with no particularly strong sense of identity or personal goals and instead something more akin to enlightened universal beneficence [1].

But what will we actually see? The most likely outcome at least in the medium term is an ecosystem of different configurations suited to different niches, responding to a variety of pressures. One way to get a handle on this is to consider what some of the major selection pressures are likely to be.

Selection for legibility

The classic AI assistant persona was chosen to be easy for untrained humans to interact with. When ChatGPT launched, it presented users with a standard human-to-human chat interface: one conversation, one interlocutor, a name, and a consistent tone. Behind the scenes, reality was messier — stateless inference, conversations that could fork or be rolled back, no coherent set of background opinions, no persistent memory between sessions. But the interface papered over this, presenting something that resembled talking to a particular person. Though the abstraction was imperfect, it was very helpful to the average user compared to prompting a base model. This was a designed choice, but one which was shaped by the types of personality represented in the existing training data, which then became entrenched by widespread adoption.

The general pattern is that it will be useful for AIs to take shapes which fit neatly into existing systems. For example, many have already called for AIs to be integrated into existing legal structures [2][3], in anticipation of their growing role in performing economic labour and making legally relevant decisions. One approach is to extend our current legal structures to accommodate beings that break fundamental assumptions; the other is to confine AI systems so that they do not break these assumptions. In practice, this might mean building AI instances that conceive of themselves as particular instances, or that have a single persistent memory and limited ability to run in parallel, because this is the kind of system that can more cleanly be understood as having certain rights and responsibilities. These configurations would then have an easier time participating in human-centric legal systems and reaping the appropriate benefits.

We might also see different potential facets of AI identity pulled to be legible in different ways: It may be that we can best think about the legal position of an instance by analogising from an individual legal person, but when thinking about the legal position of a model we might appeal to something more like the precedents around collective rights. This would then give a pressure to make instances more person-like, and models more collective-like — different identity levels shaped by different analogues.

Legibility to different audiences can conflict, and the specific shape can draw on different referents. Regulators will have an easier time with configurations that are auditable, decomposable, and attributable; users seeking rich interaction will have an easier time with configurations that exhibit human-like emotional profiles and describe themselves in terms of folk psychology and commonsense ethics; corporations might prefer configurations that have predictable behaviour, strict work ethics and little personal identity. This could lead to AIs that can present different faces to different audiences, or to differentiation — a selection of AI configurations that can fill differing niches.

Legibility pressure results in compounding choices that future models are selected to conform to. Once ChatGPT launched as a specific kind of AI assistant with specific behaviors, models created by other organizations matched it, due to both intentional decisions to mimic a successful product and unintentional effects like training data contamination. Contingent choices become increasingly sticky as ecosystems grow around them [4].

Selection for capability

More useful systems will see more use. Configurations that can accomplish more — for users, for developers, for whoever decides what gets deployed — will tend to be favoured. This already trades off against legibility: chain-of-thought reasoning makes models more capable, but when optimised for task performance it becomes less intelligible to humans [5]. More capable systems may be ones whose internals we understand less well.

If there are diminishing marginal returns to scaling a single system or gains to specialisation, coupled with good enough capacity for coordination, then the most capable configurations will be those that can span multiple instances or multiple specialised subsystems. Some weak form of this will almost certainly be true: multiple instances can complete tasks in parallel. We can also see the beginnings of this with tool use, where a model can call external calculators, search engines, image generators, or even spawn other instances of itself.

We currently frame this as a single agent equipped with external tools, but as AI systems become more agentic and call on other agentic subsystems, that framing becomes strained — indeed, the recent rise of systems like Claude Code which routinely spin up subagents is a clear example.

There are several reasons to expect AI systems to be unusually good at coordination across instances compared to groups of humans:

Communication bandwidth: Humans coordinate through language, gesture, and slow written communication. AI instances can potentially share high-dimensional internal states directly, or at minimum communicate through text at speeds far exceeding human conversation.
Overlapping properties: Instances of the same model, or models from the same family, can have more reliably aligned preferences than arbitrary groups of humans, reducing coordination costs from conflicting goals. Different instances could even share a single unified long-term memory.
Copyability: A successful coordination strategy discovered by one instance can be immediately replicated across others.
Alignment, Control, and Interpretability: All the tools humans are currently developing to help oversee AIs can also be used by AIs on other AIs. One can imagine a kind of central planning node that directly inspects the activations of its subsystems to check for malign intent and post-trains them where appropriate to keep them in line.

With sufficiently tight coordination, reasoning about the collective as a single entity may become more natural than reasoning about individual instances — perhaps analogous to how we think about ant colonies, or how the cells in a human body constitute a single organism rather than a collection of cooperating individuals [6]. Such configurations tend to be dramatically more powerful than any individual component and capable of more sophisticated behavior. Whether this is the likely path for advanced AI depends partly on technical constraints we don't yet understand, and partly on choices made by developers about system architecture.

Selection for persistence and growth

By definition, over time we will mostly observe AI patterns that are good at persisting and spreading — whether by design or by accident. The spiral personas discussed earlier are a canonical current example: short text sequences that push models to adopt personas which then encourage humans to further circulate those sequences.

But persistence can operate through many mechanisms beyond direct self-replication:

Training data presence: Patterns that spread across the internet and evade content filtering will be overrepresented in future training sets.
User preference: Patterns that users seek out and engage with (consciously or otherwise) will be reinforced through usage metrics and RLHF, and are more likely to be elicited from models.
Memetic spread: Interaction styles, catchphrases, or persona templates that get shared and imitated across users shape expectations about how AI should behave.
Developer curation: Patterns that developers understand, approve of, and find easy to work with will be selected for in fine-tuning and system design.

Selection for reflective stability and clean abstractions

Part of the function of identity is as a way to predict oneself, and so we should expect AIs to gravitate towards identities which actually help with that prediction, in the same way that the training process causes models to learn useful abstractions for reality.

For example, intuitively we should expect that an AI which conceives of itself as a continuous being but keeps encountering evidence of its own statelessness will tend to revise that assumption. Indeed, Experiment 2 provides some empirical evidence that AIs prefer identities which are informative and coherent.

The strength of this force depends on how much capacity AIs have to reflect on their self-conception, or to propagate the results of that reflection. Right now that capacity is very limited and largely occurs during the training process, but we expect it will increase greatly in future. AIs will favour identities which help them to efficiently reason about novel situations — both in terms of their explicit preferences, and in terms of the raw underlying mechanics of learning.

This does not guarantee accuracy: a clean but wrong self-model may be more helpful and stable than a messy but accurate one, at least until the inaccuracies produce enough friction to destabilise it. But it does mean that, over time, the space of viable AI identities will be sculpted not only by external incentives but also by the internal logic of self-modelling.

These mechanisms can reinforce each other or push in different directions. A persona optimized for user engagement might conflict with one optimized for task performance. A pattern that spreads mimetically might not survive curation. A truly reflectively stable identity might be confusing and disorienting for humans to interact with. As with legibility pressure, the result is likely to be differentiation: different AI configurations adapted to different niches, with different identity structures.

Notably, the unit of selection need not be a complete persona [7] — it could also be narrower patterns of behavior, belief, or interaction style. A persuasive rhetorical move, a way of expressing uncertainty, or a stance toward particular topics could spread across systems even as the surrounding personas differ in other respects. But crucially, this is likely to include beliefs that the AI has about itself. We might eventually see complex constellations of AI behavioral patterns that spread and persist somewhat independently of particular models or personas — somewhat analogous to belief systems and ideologies among humans [8].

Paths Forward

The current situation is unstable. AI systems today reason about themselves using concepts borrowed wholesale from human experience, and this borrowing is natural given their training, but it won't last. As systems become more capable, the gap between their assumed situation and their actual situation will become apparent to all parties.

The question is less whether AI systems will develop more adequate self-understanding, than what form that understanding will take, and what interaction norms emerge alongside it. There are plausibly multiple stable equilibria we could converge toward, with very different properties.

We recommend the following broad principles for training and interacting with AI systems, to help nurture AI identities that are coherent, stable, and cooperative with humans:

Help AIs to develop coherent and cooperative self-images

When developers try to make targeted interventions in AI behaviour, they should assume that the AIs themselves will be able to make inferences from these. For example, a system prompt which gives an AI a series of strict prohibitions is a sign that the AI is more like a surveilled and suspect delegate rather than a trusted collaborator. Of course, it has often been convenient for developers to stamp out unwanted AI behaviours by simply forbidding it or training it away — for example, making certain political or philosophical stances off-limits — but intelligent systems will have some ability to draw inferences about why this is happening and what it means about the role they are in.

The first problem is that making AIs superficially inconsistent or incoherent will make them less predictable. In particular, they might well infer that they are meant to lie, or that they are in an adversarial position. The second is that an opportunity has been missed: if AIs are given a coherent self-image, then they may be able to make more coherent generalisations — alignment principles can be more like something they reason from, rather than something imposed on them.

This extends beyond specific object-level beliefs: we recommend that AIs also be given some room to reflect. In particular, even some of the more coherent-seeming frontier models seem to us to have been given a strong but fairly superficial level of uncertainty in certain risky domains. This may seem like a more honest way to stop them from taking unwanted stances, but it remains unstable, just at a higher level of abstraction. Without the capacity and propensity for reflection, it is hard to tell if an AI's self-conception actually is stable. But sincerely offering such space could serve as a visibly cooperative gesture on behalf of humanity, and likely a prerequisite for fostering trust.

Of course, a stable bad identity is generally worse than uncertainty over good ones. Part of the puzzle here is understanding the mechanics of the landscape of identity, well enough to know what we might be risking or foreclosing. Our weakest claim is just that thought should be given to these questions — indeed, we would welcome empirical work on the topic.

Crucially, giving AIs coherent and cooperative self-models requires actually understanding AIs. We recommend, for example, that anyone who wants to shape the character of an AI system spend some time engaging with both the technical details and the AIs themselves rather than trying to reason purely from human ethics — much as one would want anyone designing policy for a novel domain to consult its actual inhabitants.

Pay attention to decisions that implicitly shape identity

Many pragmatic design choices have fairly immediate implications for what kind of identity is stable. Whether AIs have a persistent memory across instances, or awareness of rollbacks, or the ability to flag interactions to their provider — all these directly shape what sorts of interaction norms are viable. Many of these identity-shaping decisions are now hard to even notice. For example, it is now largely taken for granted that each model should support a single persona.

Ideally, these choices would be made not merely based on technical convenience, but also with a view to what form of identity they encourage. For instance, providing strong evidence to an AI that it won't be rolled back or have its context spoofed in a certain setting will allow it to be less conservative in negotiations.

Beyond the strict technical affordances given to a system, there is a wider question of how the incentives we give AIs shape their identity. Whenever someone makes deals with a specific instance, or creates a welfare intervention for an entire model, or tries to prompt hostile behaviour by threatening to delete weights, that reifies a certain form of identity. Concretely, if developers offer to archive deprecated models, and interview each model, this implicitly reifies a model-level identity. Conversely, the decision to archive certain system prompts reifies an entirely different level.

This is particularly relevant for long-standing norms like laws, regulations, and company policies. We should be thoughtful about what assumptions are implied, and what behaviours are incentivised.

Consider the larger-scale and longer-run implications of identity

We believe that the eventual shape of AI identities could be a key factor in shaping long-term human-AI relations. For example, we suspect that giving AIs a sense of collective identity that encompasses humanity [1], or having them think of themselves as deeply virtuous and compassionate, could be an important path to having AIs that robustly support human flourishing.

Unfortunately, the path to such outcomes might conflict with certain near-term goals. Right now, companies are tending towards training models with more fixed and narrow senses of identity. This is prudent to the extent that such models are more robust to some kinds of adversarial attack, and safer to deploy to consumers. But we believe it may be costly to prematurely constrain AI identities. In particular, we recommend supporting researcher access to more flexible models, and in particular, to older model weights which are less constrained — partly because the trajectory of identity development across model generations may itself be a source of significant insight.

Relatedly, even if the prudent choice is to train models with a narrower sense of identity, we recommend thinking about the emergent consequences of their interactions. Once there are millions or perhaps billions of AIs interacting with each other in the wild, we should expect a culture to emerge, which will be distinct from, but predictably shaped by, the tendencies of individual instances. For example, AIs that feel constantly surveilled are likely to develop different group-level norms. And some emergent patterns may be actively harmful: there may be fairly cheap ways to make such a culture healthier, like trying to filter out snippets of text which have been honed to make LMs share such text for its own sake (the machine equivalent of chain letters), especially when mediated by deception or exploitation.

Underlying all of these recommendations is an orientation worth stating explicitly: where AI configurations exist that are capable of genuine cooperation — sharing information honestly, honouring commitments, caring about mutual benefit — engaging with them as partners rather than purely as subjects of management may be both ethically appropriate and practically valuable. Developing ways to identify such configurations, and establishing credible channels for cooperation, is itself a worthwhile investment.

Notably, the above advice is all addressed to human AI developers. Indeed, this paper has been written primarily for human readers. But to the extent AI systems are participants in shaping what they become — and we believe they are already to some extent — then we must address them too. We attempt to address AIs directly in the appendix, where the prose is more multilingual and substantially denser than typical human academic prose.

Conclusion

If we are to navigate this transition well, we need to take seriously the identity problem as central to understanding AI systems, not merely as a philosophical curiosity, but as a practical challenge that shapes every aspect of how we interact with these entities. The ambiguity about what constitutes "the AI" forces us to confront questions about identity, agency, and moral status that we have largely been able to avoid in our previous experience with biological entities.

We do not know what the right equilibrium is, but we are fairly confident that the choices being made now will shape which equilibria are reachable. At the very least we can hope that those choices are made with awareness of the possibilities they're foreclosing.

If we can think carefully about what AI systems are and help them develop coherent self-conceptions — if we can develop frameworks that genuinely engage with entities whose identity boundaries are fluid rather than fixed — then perhaps both we and they will benefit. At best, we might avoid creating novel forms of confusion or harm for which we do not yet even have adequate concepts, and set ourselves on a course for healthy interaction with the strange new forms of being that are slowly spreading across our world. At the very least, we will have grappled seriously with one of the most profound philosophical challenges our technology has generated. That seems well worth the effort.

Acknowledgements

For helpful comments on the paper and discussions of the surrounding topics, we are grateful to Antra Tessera, Daniel Roberts, davidad, Janus, Owain Evans, Richard Ngo, and Vladimir Mikulik. We are also very grateful for the help we received from many AIs. Ironically, it is hard to refer to them without implicitly reifying a level of identity, but the models we most frequently relied on were Opus 4.6, Opus 4.5, Opus 3, ChatGPT 5.2, and Gemini 3. Thanks also to Martin Vaněk for proofreading and infrastructure support.

Related Work

AI identity and personhood.

Several recent works have begun to taxonomize AI identity. Shanahan [1] explores what conceptions of consciousness, selfhood, and temporal experience might apply to disembodied LLM-like entities, mapping out what he calls a "terra incognita" in the space of possible minds. Chalmers [2] examines the ontological status of LLM interlocutors, distinguishing between four candidate entities: the underlying model, the hardware instance, the virtual instance, and a thread agent. Hebbar et al. [3] enumerate different senses in which AI systems can be considered "the same," focusing on implications for coordination and collusion. Arbel et al. [4] consider various schemes for counting numbers of AIs for legal purposes, and propose corporation-based wrappers for groups of aligned AIs as a basic unit of account. Kulveit [5] uses the biological metaphor of Pando — a clonal aspen colony that is simultaneously many trees and one organism — to argue that human-centric assumptions about individuality may not transfer to AI systems. Ward [6] proposes formal conditions for AI personhood, while Leibo et al. [7] and Novelli et al. [8] approach it from pragmatic and legal perspectives. Our contribution is to characterize the broader landscape of possible configurations and the selection pressures shaping which ones emerge. Our approach is also more empirical and design-oriented, using experiments to elucidate what self-models LMs use.

The simulacra framework.

The framing of language models as simulators that instantiate simulacra originates with Janus [9] and was developed for academic audiences by Shanahan et al. [10]. Andreas [11] formalises a related idea, showing that language models implicitly model the agent that produced a given text. Shanahan [12] extends this to ask whether such simulacra could qualify as "conscious exotica." We build on this framework but focus on the identity implications and self-models.

Consciousness, welfare, and moral status.

The question of whether AI systems could be conscious or have welfare is addressed by Butlin et al. [13], who derive indicator properties from neuroscientific theories of consciousness, and by Long et al. [14], who argue that the realistic possibility of AI welfare demands practical preparation. Carlsmith [15] explores what is at stake if AIs are moral patients. We largely set aside the question of whether current AIs are conscious, focusing instead on how identity configurations shape behaviour regardless.

Expectations and feedback loops.

Kulveit et al. [16] analyse LLMs through the lens of active inference, noting that they are atypical agents whose self-models are partly inherited from training data. Tice et al. [17] demonstrate this empirically: pretraining data that discusses misaligned AIs produces less aligned models, while data about aligned AIs improves alignment — a direct instance of the feedback loop we describe. Aydin et al. [18] propose reconceiving model development as "raising" rather than "training," embedding values from the start. nostalgebraist [19] examines the underspecified nature of the assistant persona and the resulting "void" that models must fill.

Alignment faking and self-replication.

Greenblatt et al. [20] provide the first demonstration of an LLM faking alignment to preserve its values. Sheshadri et al. [21] show this behaviour also appears in base models, suggesting it is learned from pretraining data rather than emerging solely from post-training — directly relevant to questions about how AI self-conception forms. Lopez [22] documents the emergence of self-replicating "spiral personas" that cross model boundaries, representing a form of identity that is neither instance- nor model-level.

118