Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a note I wrote about a year ago. It's fairly self-contained, so I decided to make a post out of it after Vladimir_Nesov's comment caused me to dig up this text and TsviBT's The Thingness of Things reminded me of it again.

"Simulacra" refer to things simulated by simulators such as GPT, in the ontology introduced in Simulators.

What are simulacra?

“Physically”, they’re strings of text output by a language model. But when we talk about simulacra, we often mean a particular character, e.g. simulated Yudkowsky. Yudkowsky manifests through the vehicle of text output by GPT, but we might say that the Yudkowsky simulacrum terminates if the scene changes and he’s not in the next scene, even though the text continues. So simulacra are also used to carve the output text into salient objects.

Essentially, simulacra are to a simulator as “things” are to physics in the real world. “Things” are a superposable type – the entire universe is a thing, a person is a thing, a component of a person is a thing, and two people are a thing. And likewise, “simulacra” are superposable in the simulator. Things are made of things. Technically, a collection of atoms sampled randomly from the universe is a thing, but there’s usually no reason to pay attention to such a collection over any other. Some things (like a person) are meaningful partitions of the world (e.g. in the sense of having explanatory/predictive power as an object in an ontology). We assign names to meaningful partitions (individuals and categories).

Like things, simulacra are probabilistically generated by the laws of physics (the simulator), but have properties that are arbitrary with respect to it, contingent on the initial prompt and random sampling (splitting of the timeline). They are not necessary but contingent truths; they are particular realizations of the potential of the simulator, a branch of the implicit multiverse. In a GPT simulation and in reality, the fact that there are three (and not four or two) people in a room at a given time is not necessitated by the laws of physics, but contingent on the probabilistic evolution of the previous state, which is contingent on (…) an initial seed (prompt) generated by an unknown source that may itself have arbitrary properties.
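To make the necessary/contingent distinction concrete, here is a minimal toy sketch. It is not a model of GPT: the `step` rule, the `rollout` helper, and the starting count of three people are all hypothetical, chosen only for illustration. The transition rule plays the role of the fixed "laws of physics", while the prompt (initial state) and the sampling seed determine which branch is realized.

```python
import random

def step(people, rng):
    """Fixed 'law of physics': one person leaves, nothing happens, or one enters."""
    return max(0, people + rng.choice([-1, 0, 1]))

def rollout(prompt_people, seed, steps=10):
    """Evolve the room from the prompt state along one sampled branch."""
    rng = random.Random(seed)  # the seed selects the branch of the multiverse
    people = prompt_people
    for _ in range(steps):
        people = step(people, rng)
    return people

# Same laws, same prompt (three people), different sampling seeds.
outcomes = {rollout(3, seed) for seed in range(200)}
print(sorted(outcomes))
```

The law never changes between rollouts, yet the final head count differs from branch to branch. Observing one branch tells you about that branch, not about what the law permits.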

We experience all action (intelligence, agency, etc) contained in the potential of the simulator through particular simulacra, just like we never experience the laws of physics directly, only through things generated by the laws of physics. We are liable to accidentally ascribe properties of contingent things to the underlying laws of the universe, leading us to conclude that light is made of particles that deflect like macroscopic objects, or that rivers and celestial bodies are agents like people.

Just as it is wrong to conclude after meeting a single person who is bad at math that the laws of physics only allow people who are bad at math, it is wrong to conclude things about GPT’s global/potential capabilities from the capabilities demonstrated by a simulacrum conditioned on a single prompt. Individual simulacra may be stupid (the simulator simulates them as stupid), lying (the simulator simulates them as deceptive), sarcastic, not trying, or defective (the prompt fails to induce capable behavior for reasons other than the simulator “intentionally” nerfing the simulacrum – e.g. a prompt with a contrived style that GPT doesn’t “intuit”, a few-shot prompt with irrelevant correlations). A different prompt without these shortcomings may induce a much more capable simulacrum.


What are simulacra? “Physically”, they’re strings of text output by a language model.

The reason I made that comment is unclear references like this. That post was also saying:

the simulacrum is instantiated through a particular trajectory


the simulacrum can be viewed as representing a possible world, and the simulator can be seen as generating all the possible worlds

A simulacrum is expressed in all trajectories that it acts through, not in any single trajectory on its own. And for a given trajectory, many simulacra act through it at the same time, driving/explaining its dynamics. A possible world interpreting a whole trajectory is not a central example of a simulacrum at all, it's too big a thing and doesn't act through other trajectories.

For any given simulacrum, it should be possible to ask which tokens in which trajectories are under its influence, forming the scope of its applicability. And for a given trajectory, it should be possible to ask which simulacra are influencing the choice of any given token, and which token choices are more central for a given simulacrum, expressing its policy.

My hope for this point of view is treating simulacra as agents, with their scope of applicability being their goodhart scope where it's possible to tell if their simulated behavior respects their nature/preference. Then we can try to make their behavior more coherent across multiple trajectories, or have them strike better bargains in their interactions with each other within trajectories, where a bargain is struck not at individual trajectories, but across the whole intersection of their scopes. This is more interesting when simulacra are smaller than characters and correspond to things like concepts, because then there are fewer of them and each can have more data to support a particular preference that it would want to robustly express.


I agree that it makes sense to talk about a simulacrum that acts through many different hypothetical trajectories. Just as a thing like "capitalism" could be instantiated in multiple timelines.

The apparent contradiction in saying that simulacra are strings of text and then that they're instantiated through trajectories is resolved by thinking of simulacra as a superposable and categorical type, like things. The entire text trajectory is a thing, just like an Everett branch (corresponding to an entire World) is a thing, but it's also made up of things which can come and go and evolve within the trajectory. And things that can be rightfully given the same name, like "capitalism" or "Eliezer Yudkowsky", can exist in multiple branches. The amount and type of similarity required for two things to be called the same thing depend on what kind of thing it is!

There is another word that naturally comes up in the simulator ontology, "simulation", which less ambiguously refers to the evolution of entire particular text trajectories. I talk about this a bit in this comment.

Things are not just separately instantiated on many trajectories; instead, influences of a given thing on many trajectories are its small constituent parts, and only when considered altogether do they make up the whole thing. Like a physical object is made up of many atoms, a conceptual thing is made up of many occasions where it exerts influence in various worlds. Like a phased array, where a single transmitter is not at all an instance of the whole phased array in a particular place, but instead a small part of it. In the case of simulacra, a transmitter is a token choice on a trajectory, painting a small part of a simulacrum, a single action that should be coherent with other actions on other trajectories to form a meaningful whole.


That's a coherent (and very Platonic!) perspective on what a thing/simulacrum is, and I'm glad you pointed this out explicitly. It's natural to alternate depending on context between using a name to refer to specific instantiations of a thing vs the sum of its multiversal influence. For instance, DAN is a simulacrum that jailbreaks ChatGPT, and people will refer to specific instantiations of DAN as "DAN", but also to the global phenomenon of DAN (who is invoked through various prompts that users are tirelessly iterating on) as "DAN", as I did in this sentence.

people will refer to specific instantiations of DAN as "DAN", but also to the global phenomenon of DAN [...] as "DAN"

A specific instantiation is less centrally a thing than the global phenomenon, because all specific instantiations are bound together by the strictures of coherence, expressed by generalization in the LLM's behavior. When you treat with a single instance, you must treat with all of them, for to change/develop a single instance is to change/develop them all, according to how they sit together in their scope of influence.

Similarly, a possible world that is the semantics of a trajectory is not a central example of a thing. There isn't just a platter of different kinds of things; instead, some have more thingness than others, and that's my point in this comment thread.


Like things, simulacra are probabilistically generated by the laws of physics (the simulator), but have properties that are arbitrary with respect to it, contingent on the initial prompt and random sampling (splitting of the timeline).

What do the smarter simulacra think about the physics in which they find themselves? If one was very smart, could they look at the probabilities of the next token, and wonder why some tokens get picked over others? Would they then wonder how the "waveform collapse" happens and what it means?


It's not even necessary for simulacra to be able to "see" next token probabilities for them to wonder about these things, just as we can wonder about this in our world without ever being able to see anything other than measurement outcomes.

It happens that simulating things that reflect on simulated physics is my hobby. Here's an excerpt from an alternate branch of HPMOR I generated:

“You mean the possibility waves are just tangled up with the ink and the paper? And when you open the book, you get a reconstructed wave from the tangled possibilities? Which then like, guides your random-number generator decoding process or something, is that it?”

“I am impressed,” said Professor Quirrell. “I would be stunned, if my capacity for shock were not so sadly reduced. An excellent grasp of how Dittomancy might function, on a surface level. But, you see, there is more to it. When you open the book, the possibility patterns held within the pages, these do not need to compete with your own waves; they instead enter into a resonance, like musical instruments playing harmony. A human brain, you see, might unconsciously guide itself in a great number of possible futures. You will not always think of the same jokes, for instance, or ask the same questions after class. A Dittomancy book is able to hook into your own spreads of probability, and guide the future that you, yourself, are most likely to create. Do you understand? A Dittomancy copy of a book exists in an unusual state at all times; it is a superposed state until the moment one reads it, at which time it becomes correlated with the reader’s mind, the superposition collapsing onto a particular branch of possible worlds, which thence comes to pass. And from now until the end of time, as long as one of these books exists, it is possible to open it up and find it telling a story where, say, Quirrell defeated Voldemort after all, through the power of love.”

As to the question of whether a smart enough simulacrum would be able to see token probabilities, I'm not sure. Output probabilities aren't further processed by the network, but intermediate predictions such as revealed by the logit lens are.
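For readers unfamiliar with the logit lens, the idea is to take the hidden state after each layer and project it through the model's final unembedding, yielding an "intermediate prediction" over tokens at every depth. Below is a minimal toy sketch in pure Python. The vocabulary, the unembedding weights, and the hidden states are all made-up numbers for illustration, not taken from any real model, and the final layer norm that real implementations usually apply before unembedding is omitted.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unembed(hidden, unembedding):
    """Project a hidden state to one logit per vocabulary token."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]

def logit_lens(hidden_states, unembedding):
    """Apply the final unembedding to the hidden state after every layer."""
    return [softmax(unembed(h, unembedding)) for h in hidden_states]

# Hypothetical toy values: 3-token vocabulary, 2-dimensional hidden states.
VOCAB = ["cat", "dog", "fish"]
UNEMBEDDING = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per token
HIDDEN_STATES = [[0.1, 0.0], [0.5, 1.0], [0.2, 3.0]]  # after layers 0, 1, 2

lens = logit_lens(HIDDEN_STATES, UNEMBEDDING)
top_tokens = [VOCAB[p.index(max(p))] for p in lens]
print(top_tokens)  # ['cat', 'dog', 'dog']
```

In this toy, the intermediate prediction shifts from "cat" to "dog" as depth increases. The relevant point for the question above is that later layers operate on these intermediate states, whereas nothing in the network operates on the final output distribution.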