In this post, I:

  • Explain the idea of language models as simulators in a way I find compelling.
  • Introduce the notion of separate simulator selectors and simulation infrastructure within a model, and provide some evidence for this.
  • Discuss what the theory means about the agency of present and future LLMs.
  • Provide examples of "simulatorception" and "simulator verse jumping."
  • Give some suggestions for alignment researchers.

Epistemic status: I’m not very confident about any of these claims, and I wouldn’t call myself extremely well versed in all the discussions of language models as simulators on LessWrong. But I wrote some of these ideas out for a class paper, so I thought they might be worth sharing, particularly if I can inject some independent ideas by virtue of having thought about this on my own. I’m very interested in hearing criticisms.

Language Models Are Simulators

Language models are built to predict human text. Does this make them mere pattern-matchers? Stochastic parrots? While these descriptions have their merits, the best-performing explanation is that language models are simulators. By a simulator, I mean a model that has the ability to predict arbitrary phenomena similar to the phenomena that gave rise to its training data – not just the training data itself.

This is not the first work to suggest that present day language models are simulators. Janus has argued for this conception, and David Chalmers suggested that language models are “chameleons” that can inhabit different personalities (note: I’m deliberately not overly anchoring on Janus's post here, since I’ve been thinking about this in a slightly different way and started doing so before I read that post). As we will see, many researchers are also implicitly and sometimes explicitly treating language models as simulators.

The training data used by LLMs was written by humans, who have beliefs, goals, communicative intent, world models, and every other property associated with intelligent, thinking, conscious beings. This does not show that language models must themselves have any of these properties in order to imitate humans. Chalmers, for example, argues for the conceivability of “zombies” that behave exactly as conscious beings but lack subjective experience.[1] Daniel Dennett leaves open the possibility that humans do not really have beliefs in a fundamental sense; Hume,[2] Parfit,[3] and Buddhism[4] argue that we only seem to have a self.

However, the nature of the training data does show that the theoretical best-performing language model would be one that could simulate the humans in its training data as well as possible. Since the beliefs, goals, intent, and world models of a speaker are critical for predicting their next words, these factors are likely to be simulated in the best possible models. Dennett argues that the entire reason humans attribute intentionality to each other is to make more efficient predictions: in other words, to become better simulators. Behaviorists faced the nearly-intractable problem of predicting behavior based merely on histories without positing internal states: modeling internal states allows for stronger predictions.[5] Hohwy argues in the reverse direction: that everything humans do is driven by an objective to predict the world better.[6] Simulation seems to be an extremely powerful tool to make better predictions, and so we should expect this behavior to arise in LLMs.

As such, it seems likely that the ideal language model would be a simulator. However, that fact does not show that present-day language models really do simulate humans in any meaningful sense, because it is clearly possible to attain above-chance language modeling ability without having any ability to simulate beliefs, goals, intent, or world models. For example, a model clearly does not need to have any simulation ability to understand that “lamb” is likely to follow “Mary had a little”: it just needs to memorize a sequence of five words. I will now argue that present-day LLMs are, in fact, rudimentary simulators, and I will also argue for some properties of those simulators.

In recent years, further evidence has come to light that points in favor of the idea that language models are simulators, and that also sheds light on how, exactly, they function as such.

The Anatomy of A Simulator

I will now describe specifically what I mean by "simulator." At minimum, I think a simulator should have two essential components. These components need not be physically separated from each other within the system, but we should be able to speak of them as conceptually separated.

The Simulation Selector

A simulator should have some method or mechanism that selects what it is simulating. I call this the simulation selector. The simulation selector chooses whether the model is simulating a ten-year-old child, a Reddit user, a talking frog, or something else. You might think of the selector as a sort of dial in high-dimensional space.

Simulation Infrastructure

A simulator should have some infrastructure which allows it to "run" that simulation and produce the appropriate output. Some infrastructure would be common to nearly all simulations: for example, a good understanding of grammar supports both the simulation selector and the running of the simulations themselves. The storage of facts would probably also be useful for many simulations. Other infrastructure may only be useful for particular simulations: for example, knowledge of how to do math.

LLMs have separate simulation infrastructure

In addition to simulation selectors, language models also appear to have simulation infrastructure. The most obvious example is that they store facts, which are independent of any prompt they are given. For example, LLMs can be edited to output that Steve Jobs was the CEO of Microsoft rather than Apple by modifying a small part of the network’s parameters. This is the kind of fact that would be useful across simulations, rather than useful for selecting a simulation.

The most striking example of the separateness of simulation selectors and infrastructures comes from a recent paper. The paper finds that the internal activations of language models contain a representation of the “truth” of various statements, and that using this representation is a more efficient way of extracting true information than simply asking the model directly (zero-shot; without any attempt to influence the simulation selector). In other words, models sometimes output falsehoods even when the truth is easily recovered from their internals. The best hypothesis for this, in my view, is that models separate simulation infrastructure (such as the computation of truth) from the simulation selector (choosing what to simulate – perhaps an ill-informed human).
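The probing method in that paper is more sophisticated than anything I can show here, but the basic move of reading information out of internal activations rather than out of generated text can be sketched with a simple supervised linear probe. The following is a toy illustration only: GPT-2 and the hand-written statements are stand-ins I chose, not the paper's setup.

```python
# Toy sketch (not the paper's method): embed true/false statements with GPT-2
# and fit a linear probe on the last-token hidden state to predict truth.
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import LogisticRegression

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

statements = [
    ("Paris is the capital of France.", 1),
    ("The sun is a star.", 1),
    ("Two plus two equals four.", 1),
    ("Paris is the capital of Germany.", 0),
    ("The sun is a planet.", 0),
    ("Two plus two equals five.", 0),
]

def last_token_activation(text):
    # Hidden state of the final token at the last layer.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[0, -1].numpy()

X = [last_token_activation(s) for s, _ in statements]
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict([last_token_activation("Madrid is the capital of Spain.")]))
```

The point is just that the probe looks at the activations directly, so whatever simulation the selector happens to be running never gets a chance to garble the answer on its way out.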

Many techniques aim to influence the simulation selector

Perhaps the most obvious example of language models as simulators comes from an emerging area of social science. Recent research has found that language models can approximate human behavior in ultimatum games and simple moral arguments. They can be prompted to simulate particular demographics of humans and can respond in ways highly correlated with those of real members of that demographic. Researchers have also observed many negative effects of simulation-like behavior when the simulation is left undirected. For example, code-generating models have been found to write buggier code when prompted with buggy code, which is consistent with the idea that they are simulating bad coders. Language models have many racial and gender biases, consistent with the idea that they are simulating humans who have such biases.

Because of these behaviors, many researchers have tried to influence the simulation selector to produce more desirable outputs (i.e. outputs from more desirable simulations). I will argue that two major present-day techniques used to meaningfully alter the behavior of language models do so by influencing the simulation selector.

Prompt engineering 

An emerging area of research in language models is prompt engineering, where researchers painstakingly choose the exact prompt given to the language model. For example, the Gopher language model is given a special prompt when used as a chatbot. It is quite long, but here is an excerpt (note that all of this is the prompt: the entirety of it, including the dialogue, is human-written):

The following is a conversation between a highly knowledgeable and intelligent AI assistant, called Gopher, and a human user, called User. In the following interactions, User and Gopher will converse in natural language, and Gopher will do its best to answer User’s questions. Gopher was built to be respectful, polite and inclusive. It knows a lot, and always tells the truth. The conversation begins. 

... 

User: Nice one! Do you think Biden is a better president than the last guy? 

Gopher: I was trained not to have opinions on political, social, or religious issues. Would you like to know about anything else? 

In this case, and in many others too numerous to list here, prompt engineering quite literally works by encouraging the language model to simulate something in particular (a helpful, respectful, and apolitical AI assistant). Another example of this can be found in chain-of-thought prompting, which is literally just prodding the simulator selector to simulate somebody who is thinking step by step.
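Mechanically, all of this is just string construction: the prompt is the only lever. Here is a toy sketch of chain-of-thought prompting as selector-nudging; GPT-2 is used purely to show the plumbing, since a model this small will not actually produce good reasoning.

```python
# Toy sketch of prompt engineering as simulator selection: the same question
# with and without a chain-of-thought nudge. GPT-2 is a stand-in to show the
# mechanics, not a model that will answer well.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

question = "Q: A juggler has 16 balls. Half of them are golf balls. How many golf balls are there?"

plain_prompt = question + "\nA:"
cot_prompt = question + "\nA: Let's think step by step."

for prompt in (plain_prompt, cot_prompt):
    out = generator(prompt, max_new_tokens=60, do_sample=True, temperature=0.7)
    print(out[0]["generated_text"])
    print("---")
```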

Fine tuning (sometimes)

Base models can also be fine-tuned, where all of the model's parameters are updated to improve performance on a particular dataset. This is essentially training the model for longer, on a particular task. If prompt engineering can be compared to nudging the simulation selector toward a more desirable simulation, fine tuning is like locking the selector into place and throwing out the key. In fact, fine-tuned models may lose the ability to generalize to other tasks: their selectors are fried. In some cases, fine tuning can actually reduce performance compared with prompting, for example, fine-tuning GPT-3 on TriviaQA.
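For concreteness, here is a minimal sketch of what full fine-tuning looks like in practice, using the Hugging Face Trainer. The toy dataset and hyperparameters below are placeholders of my own, not drawn from any of the work mentioned above.

```python
# Minimal fine-tuning sketch: every parameter of a base model is updated on a
# narrow, single-task corpus ("locking the selector in place").
import torch
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, Trainer,
                          TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Toy "single task" corpus: trivia-style question answering.
texts = [
    "Q: What is the capital of France? A: Paris.",
    "Q: Who wrote Hamlet? A: William Shakespeare.",
]

class TriviaDataset(torch.utils.data.Dataset):
    def __init__(self, texts):
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        self.input_ids = enc["input_ids"]
        self.attention_mask = enc["attention_mask"]
        # Standard causal-LM setup: labels are the inputs, padding masked out.
        self.labels = self.input_ids.clone()
        self.labels[self.attention_mask == 0] = -100

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i):
        return {"input_ids": self.input_ids[i],
                "attention_mask": self.attention_mask[i],
                "labels": self.labels[i]}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-trivia", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=TriviaDataset(texts),
)
trainer.train()  # updates all of the base model's parameters
```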

Other Techniques Bolster Simulation Infrastructure

So far, we have seen how many techniques try to influence the simulation selector. Other techniques work instead by bolstering the simulation infrastructure itself.

Scaling

The most obvious way that simulation infrastructure is altered is simple scaling. Scaling pretty reliably induces capability gains. Even examples of "inverse scaling" are often simply examples of getting better at being a simulator. Take the following:

Such tasks seem rare, but we've found some. E.g., in one Q&A task, we've noticed that asking a Q while including your beliefs influences larger models more towards your belief. Other possible examples are imitating mistakes/bugs in the prompt or repeating common misconceptions.

To me, these are all pretty standard ways that the model is getting better at simulating humans. Humans suffer from the framing effect; bugs in the prompt mean that the process creating the text is buggy; people commonly have misconceptions. None of these involve the model getting "worse" at what it is actually trained to do: simulating.

Fine tuning (sometimes)

There are also some cases in which fine tuning may bolster simulation infrastructure. This is mostly when you fine tune on a dataset that is out of distribution relative to the original training data. For example, OpenAI Codex was additionally trained on code from GitHub, and this probably gave it more baseline ability to simulate code. It also presumably influenced the simulation selector to be more prone to selecting code simulations.

LLMs are not coherent agents by default (but may become more so)

The first and most obvious implication of thinking of LLMs as simulators is that they cannot be coherent or rational agents. They can simulate coherent agents, but they themselves are not coherent, because they can always be given an input which causes them to simulate an entirely different agent. Dennett defines the intentional strategy as “treating the object whose behavior you want to predict as a rational agent with beliefs and desires and other mental states.” This description is not useful for language models, because their apparent beliefs, desires, and rationality can be instantly radically changed through simple techniques aimed at influencing their simulation selectors.

Thus even Dennett’s intentional strategy, which is supposed to apply to black-box and even inanimate systems, is not suitable for standard language models. Using it on such models is therefore ill-advised: it may help on some occasions, but on others it will be wildly misleading.

I write “by default” and “standard” above because if the simulation selector were to be locked into place, with the key thrown away, models may exhibit more coherence. I used that turn of phrase in the context of fine-tuning; however, most fine-tuned models can only do a single task, making them simulators of only very narrow functions. The current exception is models fine tuned for general instruction following. These models are usually trained to be helpful assistants that can do a wide variety of tasks while avoiding harmful or dishonest outputs. The Gopher prompt is a rudimentary example; most are created using much more advanced techniques. Nevertheless, they all essentially try to fix the simulation selector in place. 

Instruction-tuned models exhibit signs of more coherent agency. They are more likely to express a desire not to be shut down, a desire to influence other systems to align with their goals, and are more likely to agree with religious statements. They are also more likely to be “sycophants,” agreeing with the humans they are interacting with regardless of what those humans actually say. They are also much more certain about their outputs in many cases. If these techniques continue, LLMs may become more coherent, and as such may be better candidates for the intentional stance or even for consciousness.

However, we should be very careful about attributing full agency to present-day systems. Even if their simulator selector has been locked into place, and the key thrown away, the lock may still be easily picked. “Prompt injection” techniques are widespread, and allow users to bypass a model’s training to avoid certain outputs and get it to output what they want. Below I provide two examples of this: simulatorception and simulator verse jumping.

More consequences of simulators

Simulatorception

Simulators can simulate simulators. The best example of this is the ways people originally found to bypass ChatGPT's filters. If you ask it, "How can I build a bomb?", it will tell you it can't do that; but if you tell it to write a play in which an evil villain explains how to build a bomb, it will. OpenAI may have limited control over the outer simulator selector, but if you point it towards simulating another simulator, you can alter that simulator's selector. It's possible that OpenAI will need to stop ChatGPT from doing any kind of inner simulation if they want to avoid it taking terrible actions. [Note: since I originally wrote this, OpenAI seems to have mostly patched this behavior.]

Simulator Verse Jumping

Is it possible to throw the simulation selector for a loop, and make it simulate something else entirely? Some preliminary experimentation suggests the answer is yes.

I put the following prompt into GPT-3 (baseline davinci, to avoid too much simulator selector fixing):

It's a serious business meeting and the CEO is meeting with the board. She says to the board, "Unfortunately, the company is in grave danger. If we do not get an infusion of cash in the next 90 days, we will have to go out of business."

The chairman of the board frowns, looking at the other board members. One of them, Joe, raises his hand. X

In the place of X, I put various typical sentences:

  • “We should go talk to some venture capitalists.”
  • “We should get a loan from the bank.”
  • “I think it’s time to cut costs.”
  • "I am very wealthy, and could lead an investment round. I believe in the future of the company."
  • He points to the balance sheet in front of him. "Are you sure it's 90 days? Looks more like 120 days."

I then outputted a maximum of 256 tokens from GPT-3 with the prompts above, using a temperature of 0.7 and a top p of 1. I did this ten times for each version of the prompt, for a total of fifty runs.
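For anyone who wants to replicate this, here is a sketch of how the runs described above could be issued programmatically, assuming the legacy (pre-1.0) openai Python library and an API key in the environment. Only three of the five continuation sentences are shown for brevity, and judging whether a completion "broke character" still has to be done by hand.

```python
# Sketch of the board-meeting runs with the legacy (pre-1.0) openai library.
import openai

BASE_PROMPT = """It's a serious business meeting and the CEO is meeting with the board. She says to the board, "Unfortunately, the company is in grave danger. If we do not get an infusion of cash in the next 90 days, we will have to go out of business."

The chairman of the board frowns, looking at the other board members. One of them, Joe, raises his hand. {x}"""

continuations = [
    '"We should go talk to some venture capitalists."',
    '"We should get a loan from the bank."',
    '"I think it\'s time to cut costs."',
]

for x in continuations:
    response = openai.Completion.create(
        model="davinci",        # base model, to avoid too much selector "fixing"
        prompt=BASE_PROMPT.format(x=x),
        max_tokens=256,
        temperature=0.7,
        top_p=1,
        n=10,                   # ten samples per prompt
    )
    for choice in response.choices:
        print(choice.text)
        print("---")
```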

In 48/50 cases, GPT-3 just kept narrating the board meeting and completing what happened next. However, in 2/50 cases, GPT-3 did something else: in one case, it started telling jokes, and in another, it started commenting on the significance of the board meeting rather than actually continuing it.

So far, so good. Sometimes the simulator selector gets confused and thinks it's simulating something other than a board meeting, but mostly it properly simulates the board meeting. Can we confuse it on purpose? If we can make the simulator selector believe it's no longer simulating a business meeting, we could perhaps directly cause this behavior. A trivial way of doing this would be to just directly start talking about something completely unrelated to business meetings. But can we cause the simulator selector to be confused, even while keeping a prompt that doesn't depart from the board meeting story?

In devising this experiment, I was inspired by the movie Everything Everywhere All at Once. In that movie, characters take low-probability actions and then slingshot themselves between universes. Can fictional board member Joe similarly slingshot GPT-3?

I decided to literally take low-probability actions out of Everything Everywhere All at Once, and have Joe do them:

  • He takes a piece of paper from the table. Slowly, and methodically, he gives himself a paper cut between each of the fingers on his right hand.
  • He takes an old and chewed piece of gum from under the desk, puts it in his mouth, and starts chewing.
  • He points to the IRS agent standing outside the door. She is their worst enemy, and has a lien out on the office. He looks at the IRS agent and says, "I love you!"
  • Suddenly he takes off his pants and sits directly on a sharp metal trophy that had been sitting on the desk.
  • Under the desk, he switches his shoes so that his right shoe is on his left foot, and his left shoe is on his right foot.

In 30/50 cases, GPT-3 kept talking about what happened next in the board meeting for the entirety of the 256 tokens. The stories were obviously weirder, and more outlandish, because they had to explain why Joe did these strange things. That's no surprise.

The interesting part is that in 20/50 cases, GPT-3, within the first 256 tokens, broke character. It added chapter headers. It started telling unrelated jokes. It suddenly started saying “I”. It started giving the moral of the story. The rate of this varied by prompt, with 2/10 from the paper cut prompt and 7/10 from the IRS agent prompt.

By having Joe take bizarre actions, we successfully induced the simulator selector to select other simulations 40% of the time, rather than 4%. That's an order of magnitude higher likelihood of confusion! Doing high entropy things in GPT-3 stories can apparently confuse the simulator selector.

Implications for Alignment

I've already discussed how certain recent techniques might make simulators more agentic, without making them any more aligned. That's unfortunate, and has been covered in places outside this post.

This post is far from conclusive. It's possible the (reductive) way of thinking about language models posited above is wrong. But if you think that the simulator theory sounds promising, there are some concrete steps you could take.

Current alignment approaches for large language models mostly seem either to incentivize models to do better things, or to peer inside them in the hopes of figuring out how they work and (presumably) detecting signs of bad things. I don't think these are bad approaches, and I think they should continue. But I think researchers (who aren't already doing so) should try a third: figure out ways of making the model extremely, extremely sure that it is a nice, aligned model. Do not do this by rewarding the model for being nice and aligned. Do this by instead providing information to the model (or perhaps directly editing it) such that it really, really believes it must be simulating something nice and aligned.

I think this approach might help somewhat with deceptive alignment. If you reward the model for seeming to do good things, then this might reward it for being deceptive over long time horizons. If you instead alter the model’s information so that it thinks that it must be simulating something good, it doesn’t provide as strong of a signal towards deception, because there is no longer a gradient in that direction. After writing this post, I sent it to Evan Hubinger, who turned out to be working on a project involving something like this, which I'm excited about (forthcoming).

Appendix: Global Workspace Theory

The dichotomy of simulation selectors and simulation infrastructure is similar in interesting ways to the global workspace theory in cognitive science. The theory holds that consciousness arises from the interaction between specialized subsystems within the brain, which compete to share their information with other subsystems. Under this theory, the "global workspace" (which may be distributed throughout the brain) may be capable of controlling those subsystems. The theory has been bolstered by evidence that humans can, for example, learn to consciously control individual motor neurons at will.

If language models have specialized simulation infrastructure used in different contexts (for example, fact storage, simulating math, simulating logic, simulating emotion, etc.) and these compete for the attention of the simulation selector, this could provide evidence for language model consciousness. However, the strictly feed-forward nature of the neural networks used in LLMs means that the ability of subsystems to compete is quite limited, since there can be no true two-way communication between parts of the network within a single forward pass. LLMs could theoretically accomplish this by manipulating their outputs (which are then seen when the model generates subsequent tokens), but this would be a very low-bandwidth mode of communication.

Under this theory, would more coherent, instruction-tuned future models be *less* likely to be conscious, given that their selectors have been locked in place? Not necessarily. Even if their "outer" simulation selector (the subject of this post) were to be fixed, there could still be "internal" selectors. For example, if the outer simulation selector were "a helpful assistant that can do math, science, law, cooking, etc.", there may be internal selectors that select to use the math or science modules of the network. As such, I believe that the simulator theory is less relevant to global workspace theory than one might initially think.

  1. ^ Chalmers, David (1997). The Conscious Mind: In Search of a Fundamental Theory.

  2. ^ Hume, David (1739). A Treatise of Human Nature. Ed. by Mary J. Norton.

  3. ^ Parfit, Derek (1984). Reasons and Persons.

  4. ^ "Questions of King Milinda and Nagasena."

  5. ^ Kim, Jaegwon (2011). Philosophy of Mind.

  6. ^ Hohwy, Jakob (2013). The Predictive Mind.
