Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Thanks to Lawrence Chan and Buck Shlegeris for comments. Thanks to Nate Thomas for many comments and editing)

Despite appreciating and agreeing with various specific points[1] made in the Simulators post, I broadly think that the term ‘simulator’ and the corresponding frame probably shouldn’t be used. Instead, I think we should just directly reason about predictors and think in terms of questions such as ‘what would the model predict for the next token?’[2]

In this post, I won’t make arguments that I think are strong enough to decisively justify this claim, but I will argue for two points that support it:

  1. The word ‘simulation’ as used in the Simulators post doesn’t correspond to a single simulation of reality, and a ‘simulacrum’ doesn’t correspond to an approximation of a single agent in reality. Instead a ‘simulation’ corresponds to a distribution over processes that generated the text. This distribution in general contains uncertainty over a wide space of different agents involved in those text generating processes.
  2. Systems can be very good at prediction yet very bad at plausible generation – in other words, very bad at ‘running simulations’.

The rest of the post elaborates on these claims.

I think the author of the Simulators post is aware of these objections. I broadly endorse the perspective in ‘simulator’ framing and confusions about LLMs, which also argues against the simulator framing to some extent. For another example of prior work on these two points, see this discussion of models recognizing that they are generating text due to generator-discriminator gaps in the Conditioning Predictive Models sequence[3].

Related work

Simulators, ‘simulator’ framing and confusions about LLMs, Conditioning Predictive Models

Language models are predictors, not simulators

My main issue with the terms ‘simulator’, ‘simulation’, ‘simulacra’, etc is that a language model ‘simulating a simulacrum’ doesn’t correspond to a single simulation of reality, even in the high-capability limit. Instead, language model generation corresponds to a distribution over all the settings of latent variables which could have produced the previous tokens, aka “a prediction”.

Let’s go through an example: Suppose we prompt the model with “<|endoftext|>NEW YORK—After John McCain was seen bartending at a seedy nightclub”. I’d claim the model's next token prediction will involve uncertainty over the space of all the different authors which could have written this passage, as well as all the possible newspapers, etc. It presumably can’t internally represent the probability of each specific author and newspaper, though I expect bigger models will latently have an estimate for the probability that text like this was written by particularly prolific authors with particularly distinctive styles as well as a latent estimate for particular sites. In this case, code-davinci-002 is quite confident this prompt comes from The Onion[4].

In practice, I think it’s tempting to think of a model as running a particular simulation of reality, but performing well at the objective of next-token prediction doesn’t result in the output you would get from a single, particular simulation. In the previous example, the model might be certain that the piece is from The Onion after it’s generated many tokens, but it’s presumably not sure which author at The Onion wrote it or what the publication date is.
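To make the "distribution over latent variables" point concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the three hypothetical sources, their priors, the tiny vocabulary, and the simplification that each source emits tokens independently. It is not how a real language model represents this; it only shows that the prediction is a posterior-weighted mixture rather than the output of one chosen source.

```python
# Toy illustration: next-token prediction as a mixture over latent sources.
# All sources, priors, and probabilities below are made up for illustration.

# Prior over hypothetical latent sources of a text.
prior = {"onion_author_a": 0.05, "onion_author_b": 0.05, "news_wire": 0.90}

# Hypothetical per-source token distributions over a tiny vocabulary.
likelihood = {
    "onion_author_a": {"senator": 0.2, "bartending": 0.7, "said": 0.1},
    "onion_author_b": {"senator": 0.3, "bartending": 0.5, "said": 0.2},
    "news_wire": {"senator": 0.6, "bartending": 0.01, "said": 0.39},
}


def posterior(prior, likelihood, observed):
    """Bayes update of the source distribution after observing tokens."""
    weights = dict(prior)
    for token in observed:
        for source in weights:
            weights[source] *= likelihood[source][token]
    total = sum(weights.values())
    return {source: w / total for source, w in weights.items()}


def predict_next(post, likelihood):
    """Predictive distribution: posterior-weighted mixture over sources."""
    vocab = next(iter(likelihood.values())).keys()
    return {tok: sum(post[s] * likelihood[s][tok] for s in post) for tok in vocab}


post = posterior(prior, likelihood, ["bartending"])
pred = predict_next(post, likelihood)
print(post)
print(pred)
```

In this toy setup, a single observed token moves most of the posterior mass onto the two Onion-style sources, yet neither individual author is singled out. The prediction stays a mixture over both.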

Pretrained models don’t ‘simulate a character speaking’; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next.

I’ve seen this mistake made frequently – for example, see this post (note that in this case the mistake doesn’t change the conclusion of the post). I think the author of the Simulators post understands the point I make in this section – for instance, they state:

This is true for GPT, where the text “initial condition” of most training samples severely underdetermines the real-world process which led to the choice of next token.

Regardless, this issue makes this terminology misleading.

Good prediction doesn’t imply good generation

You might have thought that good predictive loss constrains generations to be plausible; however, this isn’t true – models with very good predictive loss can output extremely unlikely sequences with high probability.

I’ll use a simple example to illustrate this: Suppose I take an existing language model and add an extra token that doesn’t exist in the corpus; call this token <|i_am_generating|>. Now, suppose we adjust the language model so that this token always is predicted to have 1% probability (proportionally down-weighting other predictions), except that it’s assigned 100% probability if <|i_am_generating|> has already occurred earlier in the sequence.

This language model gets exactly −log(0.99) additional loss per token on the prediction task (about 0.01 nats). Despite this relatively small change in loss, if you generate using this model, long generations will typically end up having zero probability measure on the ‘true’ distribution due to including the <|i_am_generating|> token which never occurs in the corpus. (For example, generations of length 1000 have probability 0.99996 of containing this token.) Instead of getting the model to output <|i_am_generating|> forever after the first such token, we could train the model to have any behavior we want after encountering the <|i_am_generating|> token, without changing the next-token prediction loss.
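The arithmetic in this example can be checked with a short sketch. The names `P_SPECIAL` and `prob_contains_special` are introduced here for illustration, and the loss is measured in nats (natural logarithm), which is an assumption about units.

```python
import math

P_SPECIAL = 0.01  # probability assigned to <|i_am_generating|> at each step

# Every real token's probability is scaled by (1 - P_SPECIAL), so the
# per-token loss (in nats) increases by exactly -log(1 - P_SPECIAL).
extra_loss_per_token = -math.log(1 - P_SPECIAL)


def prob_contains_special(length, p_special=P_SPECIAL):
    """Probability that a sampled sequence of `length` tokens contains the token."""
    return 1 - (1 - p_special) ** length


print(round(extra_loss_per_token, 5))         # 0.01005
print(round(prob_contains_special(1000), 5))  # 0.99996
```

The loss penalty stays fixed per token, while the chance of a generation leaving the support of the true distribution approaches 1 as the sequence gets longer.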

The previous example required making the loss a bit worse, but there are cases where a model can virtually always generate anything we choose without affecting prediction loss at all. This is possible in cases where verification is easier than generation[5].

For instance, suppose that the first 30 tokens of every sequence constitute a cryptographic hash of the next 30 tokens, which are uniformly distributed. It might be very easy to check that the first 60 tokens are consistent, but extremely difficult for a next-token predictor to generate the correct distribution over the first 60 tokens. Thus, the model could use this consistency check to tell whether it is generating: on real text the check always passes, while on its own samples it almost never does. Its behavior after a failed check is then totally unconstrained by the prediction loss (and therefore potentially arbitrarily far from the original distribution).
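A rough sketch of the asymmetry, using bytes in place of tokens and SHA-256 as a stand-in hash (both are assumptions; the example above does not name a hash function, and the 32-byte digest differs from its 30-token prefix). The helper names are hypothetical.

```python
import hashlib
import secrets


def make_sequence(payload_bytes=30):
    """Sample a 'true' sequence: digest of a random payload, then the payload."""
    payload = secrets.token_bytes(payload_bytes)
    digest = hashlib.sha256(payload).digest()
    return digest + payload


def is_consistent(sequence, digest_bytes=32):
    """Cheap check that the prefix is the hash of what follows."""
    digest, payload = sequence[:digest_bytes], sequence[digest_bytes:]
    return hashlib.sha256(payload).digest() == digest


real = make_sequence()

# A sampler that emits the digest before committing to a payload would need
# to find a preimage afterwards. Failing to do so is modeled here as a guess.
guessed = secrets.token_bytes(32) + secrets.token_bytes(30)

print(is_consistent(real))     # True
print(is_consistent(guessed))  # False, except with negligible probability
```

Verification is one hash evaluation, while sampling left to right would require inverting the hash. A model can therefore score well on prediction for real sequences while its own samples are trivially distinguishable.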

The considerations in this section only apply if the model is imperfect; if the model has perfectly optimal next-token prediction loss, its generations must have exactly the same distribution as the data distribution. Of course, all LLMs that we’ll ever care about will be imperfect in this sense.
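One way to see why a perfect predictor must generate correctly is the standard cross-entropy decomposition, sketched below. The notation is an assumption introduced here: p is the data distribution over length-n sequences, q is the model's distribution defined by its next-token conditionals, H is entropy, and D_KL is the KL divergence.

```latex
\begin{aligned}
\mathcal{L}(q)
  &= \mathbb{E}_{x \sim p}\Big[-\sum_{t=1}^{n} \log q(x_t \mid x_{<t})\Big]
   = \mathbb{E}_{x \sim p}\big[-\log q(x_{1:n})\big] \\
  &= H(p) + D_{\mathrm{KL}}(p \,\|\, q) \;\ge\; H(p),
\end{aligned}
```

Equality holds only when q = p, so a model at exactly optimal loss samples sequences from p itself. Any gap in loss, however small, leaves room for q to place probability mass on sequences that p essentially never produces.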

These considerations (at least morally) contradict the following quote from Simulators:

Models trained with the strict simulation objective are directly incentivized to reverse-engineer the (semantic) physics of the training distribution, and consequently, to propagate simulations whose dynamical evolution is indistinguishable from that of training samples

Okay, but what do you see in practice?

In practice, the divergence between generation and prediction over long sequences is blatant. Try just generating sequences from a language model at t=1. It’s common for the model to get caught in repetitive traps that are extremely unlikely in the original distribution. (I’m not confident this gap between prediction and generation is real and that I’m interpreting this observation correctly.)
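For readers who want to try this, here is a minimal sketch of what sampling at t=1 means: draw each token from the model's own softmax distribution, unmodified, and feed it back in as context. The `toy_next_logits` function is a hypothetical stand-in for a real model and is not meant to reproduce the repetition behavior described above.

```python
import math
import random


def sample_token(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(next_logits, prompt, n_tokens, temperature=1.0, rng=random):
    """Autoregressive sampling: each sampled token is fed back as context."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        tokens.append(sample_token(next_logits(tokens), temperature, rng))
    return tokens


def toy_next_logits(tokens):
    """Hypothetical stand-in for a model: mildly prefers repeating the last token."""
    logits = [0.0, 0.0, 0.0]
    if tokens:
        logits[tokens[-1]] += 1.0
    return logits


sample = generate(toy_next_logits, prompt=[0], n_tokens=20, rng=random.Random(0))
print(sample)
```

Because every sampled token becomes part of the context for the next step, errors in the model's distribution can compound over long generations in a way that per-token prediction loss on real text does not measure.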

I expect language models will often generate easy-to-distinguish outputs (say, at t=1 and with sequence lengths greater than roughly a few thousand tokens) until the singularity (though I’m not confident).

Appendix: Some other agreements and disagreements with Simulators

  • I agree that the Oracle/Genie/Tool/Agent categories don't properly contain models trained on self-supervised objectives. Further, I think these aren’t particularly useful categories for current alignment research – we should just talk about how exactly we trained the model and what that behaviorally incentivizes.

  • I agree that it’s worth thinking about how self-supervised learning generalizes to very low probability sequences – specifically, sequences with low enough probability that the model’s behavior could be anything on sequences this unlikely without affecting next-token prediction loss. However, I’m skeptical that first-principles reasoning would yield much beyond basic conclusions here. Because this behavior is unconstrained by the training objective, reasoning about this must involve reference to the architecture and optimizer. I think experiments would be interesting but probably aren’t overall very leveraged until AGI is near. I think building a rough practical model based on empirics could go pretty far.

  • I disagree that it’s worth thinking about the limit of self-supervised learning. An AI that is extremely close to theoretically optimal loss on next-token prediction is an eldritch horror that we won’t encounter prior to the situation radically changing. The same goes for any training objective that requires extreme superintelligence to achieve within-epsilon-of-optimal loss (which is basically all prediction tasks on infinite datasets sampled from reality). Long before we get this, we’ll either succeed or fail at avoiding AI takeover.


  1. See the appendix. ↩︎

  2. Insofar as it’s useful to try to reason about what exact actions the pre-training objective incentivizes in particular cases. I’m not sold on this being considerably useful in most cases. ↩︎

  3. Note that I disagree with quite a bit of the framing and emphasis of Conditioning Predictive Models. Don’t take this link as an endorsement! ↩︎

  4. I think it’s about 90% sure based on doing some quick samples. ↩︎

  5. This is also discussed in the Conditioning Predictive Models sequence ↩︎

13 comments

My main issue with the terms ‘simulator’, ‘simulation’, ‘simulacra’, etc is that a language model ‘simulating a simulacrum’ doesn’t correspond to a single simulation of reality, even in the high-capability limit. Instead, language model generation corresponds to a distribution over all the settings of latent variables which could have produced the previous tokens, aka “a prediction”.

The way I tend to think of 'simulators' is in simulating a distribution over worlds (i.e., latent variables) that increasingly collapses as prompt information determines specific processes with higher probability. I don't think I've ever really thought of it as corresponding to a specific simulation of reality. Likewise with simulacra, I tend to think of them as any process that could contribute to changes in the behavioural logs of something in a simulation. (Related)

I’ve seen this mistake made frequently – for example, see this post (note that in this case the mistake doesn’t change the conclusion of the post).

[...]

this issue makes this terminology misleading.

I think that there were a lot of mistaken takes about GPT before Simulators, and that it's plausible the count just went down. Certainly there have been a non-trivial number of people I've spoken to who were making pretty specific mistakes that the post cleared up for them - they may have had further mistakes, but thinking of models as predictors didn't get them far enough to make those mistakes earlier. I think in general the reason I like the simulator framing so much is that it's a very evocative frame that gives you a more accessible understanding of GPT mechanics. There have certainly been insights I've had about GPT in the last year that I don't think thinking about next-token predictors would've evoked quite as easily.

The way I tend to think of 'simulators' is in simulating a distribution over worlds (i.e., latent variables) that increasingly collapses as prompt information determines specific processes with higher probability.

I agree this is the correct interpretation of the original post. It just doesn't match typical usage of the word simulation imo. (I'm sorry my post is making such a narrow pedantic point).

I probably agree that simulators improved the thinking of people on lesswrong on average.

I don't disagree that there are people who came away with the wrong impression (though they've been at most a small minority of people I've talked to, you've plausibly spoken to more people). But I think that might be owed more to generative models being confusing to think about intrinsically. Speaking of them purely as predictive models probably nets you points for technical accuracy, but I'd bet it would still lead to a fair number of people thinking about them the wrong way.

I basically agree with this, and a lot of these are the sorts of reasons we went with "predictor" over "simulator" in "Conditioning Predictive Models."

I was a bit unsure whether to tag your posts with Simulator Theory. Do you endorse that or not?

Yeah, I endorse that. I think we are very much trying to talk about the same thing, it's more just a terminological disagreement. Perhaps I would advocate for the tag itself being changed to "Predictor Theory" or "Predictive Models" or something instead.

I broadly agree with the points being made here, but allow me to nitpick the use of the word "predictive" here, and argue for the key advantage of the simulators framing over the prediction one:

Pretrained models don’t ‘simulate a character speaking’; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next.

The simulators frame does make it very clear that there's a distinction between the simulator/GPT-3 and the simulacra/characters or situations it's making predictions about! On the other hand, using "prediction" can obscure the distinction, and end up with confused questions like "is GPT just an agent that just wants to minimize predictive loss?"

I think the biggest pitfall of the "simulator" framing is that it's made people (including Beth Barnes?) think it's all about simulating our physical reality, when exactly because of the constraints you mention (text not actually pinpointing the state of the universe, etc.), the abstractions developed by a predictor are usually better understood in terms of treating the text itself as the state, and learning time-evolution rules for that state.

Thinking about the state and time evolution rules for the state seems fine, but there isn't any interesting structure with the naive formulation imo. The state is the entire text, so we don't get any interesting Markov chain structure. (you can turn any random process into a Markov chain where you include the entire history in the state! The interesting property was that the past didn't matter!)

Hm, I mostly agree. There isn't any interesting structure by default, you have to get it by trying to mimic a training distribution that has interesting structure.

And I think this relates to another way that I was too reductive, which is that if I want to talk about "simulacra" as a thing, then they don't exist purely in the text, so I must be sneaking in another ontology somewhere - an ontology that consists of features inferred from text (but still not actually the state of our real universe).

Nitpick: I mean, technically, the state is only the last 4k tokens or however long your context length is. Though I agree this is still very uninteresting. 

The time-evolution rules of the state are simply the probabilities of the autoregressive model -- there's some amount of high level structure but not a lot. (As Ryan says, you don't get the normal property you want from a state (the Markov property) except in a very weak sense.)

I also disagree that purely thinking about the text as state + GPT-3 as evolution rules is the intention of the original simulators post; there's a lot of discussion about the content of the simulations themselves as simulated realities or alternative universes (though the post does clarify that it's not literally physical reality), e.g.:

I can’t convey all that experiential data here, so here are some rationalizations of why I’m partial to the term, inspired by the context of this post:

  • The word “simulator” evokes a model of real processes which can be used to run virtual processes in virtual reality.
  • It suggests an ontological distinction between the simulator and things that are simulated, and avoids the fallacy of attributing contingent properties of the latter to the former.
  • It’s not confusing that multiple simulacra can be instantiated at once, or an agent embedded in a tragedy, etc.

[...]

The next post will be all about the physics analogy, so here I’ll only tie what I said earlier to the simulation objective.

the upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum.

To know the conditional structure of the universe[27] is to know its laws of physics, which describe what is expected to happen under what conditions.

I think insofar as people end up thinking the simulation is an exact match for physical reality, the problem was not in the simulators frame itself, but instead the fact that the word physics was used 47 times in the post, while only the first few instances make it clear that literal physics is intended only as a metaphor. 

I agree that the Oracle/Genie/Tool/Agent categories don't properly contain models trained on self-supervised objectives. Further, I think these aren’t particularly useful categories for current alignment research – we should just talk about how exactly we trained the model and what that behaviorally incentivizes.

Also notice that Simulators was written before fine-tuned models were widely available in the form of ChatGPT. All the arguments in the original post against interpreting LLMs as Oracle AIs no longer apply to instruction fine-tuned models like ChatGPT, which seems to be in fact a prototypical example of a non-superintelligent Oracle. Of course this kind of Oracle AI is just a fine-tuned probability distribution, where the numbers it assigns to next tokens no longer correspond to their probabilities, but to something else ("goodness relative to fine-tuning"?).

Anyway, fine-tuning might be an important issue from an alignment perspective, insofar as fine-tuning seems more likely to result in misalignment than the pure self-supervised imitation learning of the base model.

First, the amount of fine-tuning data (dialogue examples for SL and human preference ratings for RL) is much more limited than the massive data the self-supervised base model is trained on. This could make it much more likely that the model misgeneralizes (inner misalignment). E.g. it may deem calling someone a racial slur worse than cutting their hand off because all the RL fine-tuning emphasizes avoiding slurs rather than avoiding cut-off hands.

Second, RLHF in particular may suffer from the political biases of human raters. This could lead the fine-tuned model to become deceptive and to lie about its beliefs when they are politically incorrect, which would be a case of outer misalignment.