Epistemic Status

Highlighting a thesis in Janus' "Simulators" that I think is insufficiently appreciated.



In the limit, models optimised for minimising predictive loss on humanity's text corpus converge towards general intelligence[1].


From Janus' Simulators:

Something which can predict everything all the time is more formidable than any demonstrator it predicts: the upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum (though it may not be trivial to extract that knowledge).


I affectionately refer to the above quote as the "simulators thesis". Reading and internalising that passage was an "aha!" moment for me. I was already aware (at latest July 2020) that language models were modelling reality. I was persuaded by arguments of the below form:

Premise 1: Modelling is transitive. If X models Y and Y models Z, then X models Z.

Premise 2: Language models reality. "Dogs are mammals" occurs more frequently in text than "dogs are reptiles" because dogs are in actuality mammals and not reptiles. This statistical regularity in text corresponds to a feature of the real world. Language is thus a map (albeit flawed) of the external world.

Premise 3: GPT-3 models language. This is how it works to predict text.

Conclusion: GPT-3 models the external world.
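As a toy illustration of Premise 2, here is a minimal sketch of how a statistical regularity in text can carry a fact about the world. The corpus is invented for illustration and the function name `next_word_distribution` is hypothetical; a count-based predictor like this is only a stand-in for what a real language model does.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration (not real data).
corpus = [
    "dogs are mammals",
    "dogs are mammals",
    "dogs are loyal",
    "dogs are mammals",
    "dogs are reptiles",  # an occasional false statement
    "snakes are reptiles",
]

def next_word_distribution(sentences, context):
    """Estimate P(next word | context) from raw counts in the corpus."""
    counts = Counter()
    n = len(context)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n):
            if tuple(words[i:i + n]) == context:
                counts[words[i + n]] += 1
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

dist = next_word_distribution(corpus, ("dogs", "are"))
# "mammals" ends up more probable than "reptiles" purely because of
# how often each continuation appears in the text.
print(dist)
```

A predictor that minimises loss on this corpus has to put more mass on "mammals" than on "reptiles" after "dogs are", which is the sense in which the text acts as a (flawed) map of the world.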

But I hadn't yet fully internalised all the implications of what it means to model language and hence our underlying reality, in particular the limit that optimisation for minimising predictive loss on humanity's text corpus will converge to. I belatedly make those updates.

Interlude: The Requisite Capabilities for Language Modelling

Janus again:

If loss keeps going down on the test set, in the limit – putting aside whether the current paradigm can approach it – the model must be learning to interpret and predict all patterns represented in language, including common-sense reasoning, goal-directed optimization, and deployment of the sum of recorded human knowledge

Its outputs would behave as intelligent entities in their own right. You could converse with it by alternately generating and adding your responses to its prompt, and it would pass the Turing test. In fact, you could condition it to generate interactive and autonomous versions of any real or fictional person who has been recorded in the training corpus or even could be recorded (in the sense that the record counterfactually “could be” in the test set). 


The limit of predicting text is predicting the underlying processes that generated said text. If said underlying processes are agents, then sufficiently capable language models can predict agent (e.g., human) behaviour to arbitrary fidelity[2]. If it turns out to be the case that the most efficient way of predicting the behaviour of conscious entities (as discriminated via text records) is to instantiate conscious simulacra, then such models may perpetrate mindcrime.


Furthermore, the underlying processes that generate text aren't just humans, but the world which we inhabit. That is, a significant fraction of humanity's text corpus reports on empirical features of our external environment or the underlying structure of reality:

  • Timestamps
    • And other empirical measurements
  • Log files
  • Database files
    • Including CSVs and similar
  • Experiment records
  • Research findings
  • Academic journals in quantitative fields
  • Other reports
  • Etc. 

Moreover, such text is often clearly distinguished from other kinds of text (fiction, opinion pieces, etc.) via its structure, formatting, titles, etc. In the limit of minimising predictive loss on such text, language models must learn the underlying processes that generated them — the conditional structure of the universe.

The totality of humanity's recorded knowledge about the world — our shared world model — is a lower bound on what language models can learn in the limit[3]. We would expect that sufficiently powerful language models would be able to synthesise said shared world model and make important novel inferences about our world that are implicit in humanity's recorded knowledge, but which have not yet been explicitly synthesised by anyone[4].

The idea that the capabilities of language models are bounded by the median human contributor to their text corpus or even the most capable human contributor is completely laughable. In the limit, language models are capable of learning the universe[5].


Text prediction can scale to superintelligence[6].

This is a very nontrivial claim. Sufficiently hard optimisation for performance on most cognitive tasks (e.g. playing Go) will not converge towards selecting for generally intelligent systems (let alone strongly superhuman general intelligences). Text prediction is quite special in this regard.

This specialness suggests that text prediction is not an inherently safe optimisation target; future language models (or simulators more generally) may be dangerously capable[7].


Humanity's language corpus embeds the majority of humanity's accumulated explicit knowledge about our underlying reality. There does exist knowledge possessed by humans that hasn't been represented in text anywhere. It is probably the case that the majority of humanity's tacit knowledge hasn't been explicitly codified anywhere, and even among the knowledge that has been recorded in some form, a substantial fraction may be hard to access or not be organised/structured in formats suitable for consumption by language models.

I suspect that most useful (purely) cognitive work that humans do is communicated via language to other humans and thus is accessible for learning via text prediction. Most of our accumulated cultural knowledge and our shared world model(s) do seem to be represented in text. However, it's not necessarily the case that pure text prediction is sufficient to learn arbitrary capabilities of human civilisation.

Moreover, the diversity and comprehensiveness of the dataset a language model is trained on will limit the capabilities it can actually attain in deployment, as will the limitations imposed by the architecture of whatever model we are training. In other words, that a particular upper bound exists in principle does not mean it will be realised in practice.


Furthermore, the limit of text prediction does not necessarily imply learning the conditional structure of our particular universe, but rather a (minimal?) conditional structure that is compatible with our language corpus. That is, humanity's language corpus may not uniquely pick out our universe, but only a set of universes of which ours is a member. The aspects of humanity's knowledge about our external world that are not represented in text may be crucial missing information needed to uniquely single out our universe (or even just humanity's shared model of our universe). Similarly, it may not be possible, even in principle, to learn features of our universe that humanity is completely ignorant of[8].

For similar reasons, it may turn out to be the case that it is possible to predict text generated by conscious agents to arbitrarily high fidelity without instantiating conscious simulacra. That is, humans may have subjective experiences and behaviour that cannot be fully captured/discriminated within language. Any aspects of the human experience/condition that are not represented (at least implicitly by reasonable inductive biases) are underdetermined in the limit of text prediction.


Ultimately, while I grant the aforementioned caveats some weight, and those arguments did update me significantly downwards on the likelihood of mindcrime in sufficiently powerful language models[9], I still fundamentally expect text prediction to scale to superintelligence in the limit.

I think humanity's language corpus is a sufficiently comprehensive record of humanity's accumulated explicit knowledge, and a sufficiently rich representation of our shared world model, that arbitrarily high accuracy in predicting text necessarily requires strongly superhuman general intelligence.

  1. ^

    Particularly strongly superhuman general intelligence. Henceforth "superintelligence".

  2. ^

    At least to degrees of fidelity that can be distinguished via text.

  3. ^

    More specifically, the world model implicit in our recorded knowledge.

  4. ^

    A ring theorist was able to coax ChatGPT to develop new nontrivial, logically sound mathematical concepts and generate examples of them. Extrapolating further, I would expect that sufficiently powerful language models will be able to infer many significant novel theoretical insights that could be in principle located given the totality of humanity's recorded knowledge.

  5. ^

    That is, they can learn an efficient map of our universe and successfully navigate said map to make useful predictions about it. Sufficiently capable language models should be capable of e.g. predicting research write ups, academic reports and similar.

  6. ^

    At least in principle, leaving aside whether current architectures will scale that far. Sufficiently strong optimisation on the task of text prediction is in principle capable of creating vastly superhuman generally intelligent systems.

  7. ^

    That is, sufficiently powerful language models are capable enough to a degree that they could — under particular circumstances — be existentially dangerous. I do not mean to imply that they are independently (by their very nature) existentially dangerous.

  8. ^

    That is, features of our universe that are not captured, not even implicitly, not even by interpolation/extrapolation in our recorded knowledge.

  9. ^

    This point may not matter that much, as future simulators will probably be multimodal. It seems much more likely that the limit of multimodal prediction of conscious agents may necessitate instantiating conscious simulacra.

    But this post was specifically about the limit of large language models, and I do think the aspects of human experience not represented in text are a real limitation to the suggestion that in the limit language models might instantiate conscious simulacra.

26 comments

See this post (especially the section about resolution) for some similar ideas: https://www.lesswrong.com/posts/EmxfgPGvaKqhttPM8/thoughts-on-the-alignment-implications-of-scaling-language

I think my current take on this is that it is plausible that current LM architectures may not scale to AGI, but the obstacles don't seem fundamental, and ML researchers are strongly incentivized to fix them. Also, you don't need a full model of the universe, or even of the human model of the universe; just as classical mechanics is not the actual nature of the universe but nonetheless is really useful in many situations, we might also expect the "minimal conditional structure" to be super useful (and dangerous). You can imagine that an AI doesn't need to understand the ineffable intricacies of consciousness to do lots of dangerous things, even things like deception or manipulation.


It isn't only the training process that limits a model's ability to (for instance) simulate conscious minds, but also the structure of the model itself. For instance, I bet there is literally no possible training that would make GPT-3 do that, because whatever weights you put in it, it isn't doing a kind of computation that's capable of simulating conscious minds. But I wouldn't bet that much at very long odds; my reason for thinking this is that each token-prediction does a computation with not that many steps to it, and it doesn't seem as if there's "room" there for anything so exciting; but maaaaaybe there's some weight vector for the GPT-3 network that, when you give it the right prompt, emits a lengthy "internal monologue" and in the process does simulate a conscious mind.

Actually, here's a kinda related question. Is the transformer architecture Turing-complete in the sense that some plausible actual transformer network like GPT-3's, with some possible set of weights and a suitable prompt, will reliably-enough simulate an arbitrary Turing machine for an arbitrary number of steps? No, because the network and its input window are both finite, so there is only a finite number of states it can be in. And maybe there's a related handwavy argument that any conscious-mind simulation needs too large a repertoire of possible states?

Related related question. Suppose you undertake, whenever your transformer's output contains "*** READ ADDRESS n ***" or "*** WRITE ADDRESS n ***", with n a non-negative integer in decimal notation, to stop its token-output at that point and give a new prompt that (in the former case) consists of an integer equal to the last thing it tried to WRITE at ADDRESS n (if any; any value will do, if it never did) and (in the latter case) consists of just "Done". Is the transformer architecture, so augmented, Turing-complete? Is there some training process that would teach it to exploit this "external memory" effectively?
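The external-memory protocol described above can be sketched as a small driver loop. This is a minimal sketch under stated assumptions: `model` is a hypothetical callable from prompt string to output string (the `counting_model` stub stands in for a transformer), unwritten addresses read as 0, and, since the comment does not say how the value to be written is specified, the sketch assumes it is the last integer emitted before the WRITE marker.

```python
import re

# Matches the markers from the protocol, e.g. "*** WRITE ADDRESS 12 ***".
MARKER = re.compile(r"\*\*\* (READ|WRITE) ADDRESS (\d+) \*\*\*")
INTEGER = re.compile(r"-?\d+")

def run_with_memory(model, prompt, max_turns=100):
    """Drive `model`, cutting its output at each marker and replying per the protocol."""
    memory = {}      # address -> last value written there
    transcript = []
    for _ in range(max_turns):
        output = model(prompt)
        match = MARKER.search(output)
        if match is None:
            transcript.append(output)
            break
        transcript.append(output[:match.end()])  # stop the output at the marker
        op, address = match.group(1), int(match.group(2))
        if op == "READ":
            # Any value will do for a never-written address; 0 is our choice.
            prompt = str(memory.get(address, 0))
        else:
            # ASSUMPTION: the value written is the last integer before the marker.
            numbers = INTEGER.findall(output[:match.start()])
            memory[address] = int(numbers[-1]) if numbers else 0
            prompt = "Done"
    return transcript, memory

def counting_model(prompt):
    """Hypothetical stateless stub: counts to 3 using only address 0."""
    if prompt in ("start", "Done"):
        return "*** READ ADDRESS 0 ***"
    value = int(prompt)
    if value >= 3:
        return f"Final count: {value}"
    return f"{value + 1} *** WRITE ADDRESS 0 ***"

transcript, memory = run_with_memory(counting_model, "start")
print(transcript[-1], memory)
```

The stub is a pure function of its prompt, so all state that persists between calls lives in `memory`. Whether any training process would teach a real transformer to use such a harness effectively is exactly the open question; the sketch only shows the plumbing.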

A Turing machine is a finite automaton that has access to sufficient space for notes. A Turing machine with a very small finite automaton can simulate an arbitrary program if the program is already written down in the notes. A Turing machine with a large finite automaton can simulate a large program out of the box. ML models can obviously act like finite automata. So they are all Turing complete, if given access to enough space for making notes, possibly with initialization notes containing a large program.

This is not at all helpful, because normal training won't produce interesting finite automata, not unless it learns from appropriate data, which is only straightforward to generate if the target finite automaton is already known. Also, even short term human memory already acts like ML models and not deliberative examination of written notes, so an LLM-based agent would need to reason in an unusual and roundabout way if it doesn't have a better architecture that continually learns from observations (and thus makes external notes unnecessary). Internal monologue is still necessary to produce complicated conclusions, but that could just be normal output wrapped in silencing tags.
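The "finite automaton plus space for notes" picture above can be made concrete with a very small simulator. This is a minimal sketch, with illustrative names (`run_turing_machine`, `INCREMENT`) that are not from the discussion: the transition table is the finite automaton, and the tape dictionary is the unbounded notes.

```python
from collections import defaultdict

def run_turing_machine(transitions, tape, state, head, halt_state="halt", max_steps=10_000):
    """Run a one-tape machine from a non-halting start state.

    transitions maps (state, symbol) -> (new_state, symbol_to_write, head_move).
    Blank cells read as "_". Returns the final tape contents without blanks.
    """
    cells = defaultdict(lambda: "_", enumerate(tape))  # the "notes"
    for _ in range(max_steps):
        new_state, symbol, move = transitions[(state, cells[head])]
        cells[head] = symbol
        head += move
        state = new_state
        if state == halt_state:
            return "".join(cells[i] for i in sorted(cells)).strip("_")
    raise RuntimeError("step budget exhausted before halting")

# A three-rule finite automaton that adds 1 to a binary number,
# starting with the head on the rightmost bit.
INCREMENT = {
    ("carry", "1"): ("carry", "0", -1),  # 1 + carry = 0, keep carrying left
    ("carry", "0"): ("halt", "1", 0),    # 0 + carry = 1, done
    ("carry", "_"): ("halt", "1", 0),    # ran off the left edge: new leading 1
}

result = run_turing_machine(INCREMENT, "1011", state="carry", head=3)
overflow = run_turing_machine(INCREMENT, "111", state="carry", head=2)
print(result, overflow)
```

The control logic here is a fixed, finite lookup table; everything that grows with the input lives on the tape. That is the sense in which any system that can implement a lookup table becomes Turing complete once it has enough room for notes, and also why this says little about what normal training will actually produce.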


I'm not sure how obvious it is that "ML models can act like finite automata". I mean, there are theorems that say things like "a large enough multi-layer perceptron can approximate any function arbitrarily well", and unless I'm being dim those do indeed indicate that for such a model there exist weights that make it implement a universal Turing machine, but I don't think that means that e.g. such weights exist that make a transformer of "reasonable" size do that. (Though, on reflection, I think I agree that we should expect that they do.) Your comment about normal training not doing that was rather the point of my final question.

Right, I don't know how much data a model stores, and how much of that can be reached through retraining, if all parameters can't be specified outright. If the translation is bad enough it couldn't quote an LLM and memorize its parameters as explicitly accessible raw data using a model of comparable size. Still, an LLM trained on actual language could probably get quite a lot smaller after some lossy compression (that I have no idea how to specify), and it would also take eons to decode from the model (by doing experiments on it to elicit its behavior). So size bounds are not the most practical concern here. But maybe the memorized data could be written down much faster with a reasonable increase in model size?

Hmm, there might be relevant limitations based on the structure of the model, but those limitations seem to be peculiar to the model under consideration. They don't seem to generalise to arbitrary systems selected for minimising predictive loss on text prediction.

That is, I don't think they're a fundamental limitation of language models, and it was the limits of language models I mostly wanted to explore in this post.


Agreed. But:

1. I was commenting on your "Moreover, the diversity and comprehensiveness of the dataset a language model is trained on will limit the capabilities it can actually attain in deployment. I.e. that a particular upper bound exists in principle, does not mean it will be realised in practice.": I think that in practice what's realisable will be limited at least as much by the structure of the model as by how it's trained. So it's not just "no matter how fancy a model we build, some plausible training methods will not enable it to do this" but also "no matter how fancy a training method we use, some plausible architectures will not be able to do this", and that seemed worth making explicit.

2. In between "current versions of GPT" and "absolutely anything that is in some sense trying to predict text" it seems like there's an interesting category of "things with the same general sort of structure as current LLMs but maybe trained differently".

(I worry a little that a definition of "language model" much less restrictive than that may end up including literally everything capable of using language, including us and hypothetical AGIs specifically designed to be AGIs.)

"no matter how fancy a training method we use, some plausible architectures will not be able to do this", and that seemed worth making explicit.

Fair enough. I'll try and add a fragment to the post making this argument (at a high level of generality, I'm too ignorant about LLM architecture details to describe such limitations in concrete terms).


(I worry a little that a definition of "language model" much less restrictive than that may end up including literally everything capable of using language, including us and hypothetical AGIs specifically designed to be AGIs.)

I'm using "language model" here to refer to systems optimised solely for the task of predicting text.

It is clear that in the limit LLMs are superhumanly good predictors (i.e. Solomonoff induction on text). It is less clear whether neural networks can get anywhere near that good. It is also less clear whether this is dangerous. Suppose you ask the LLM about some physics experiment that hasn't been done yet. It uses its superhuman cognition to work out the true laws of physics, and then writes what humans would say, given the experimental results. This is smart but not dangerous. (It could be mindcrime.) The LLM could be dangerous if it predicts the output of a superintelligence. But it only goes there if it has really high generalization, i.e. it is capable of ignoring the fact that superintelligences don't exist while being smart enough to predict one. I am unsure how likely this is.

I strongly disagree with your statement here, Donald. I think that the level of capability you describe here as 'not dangerous' is what I would describe as 'extremely dangerous'. An AI agent which has super-human capabilities but restricts itself to human-level outputs because of the quirks of its training process can still accomplish everything necessary to destroy humanity. The key limiting factor in your example is not the model's capability but rather its agency.

Ok, maybe my wording should be more like, "this probably won't destroy the world if it is used carefully and there are no extra phenomena we missed."

Yeah, used carefully and intentionally by well-intentioned actors (not reckless or criminal or suicidal terrorists or...) and no big deal surprises... And no rapid further advances building off of where we've gotten so far... If all of those things were somehow true, then yeah, much less dangerous.

Sorry, by "dangerously capable" I meant "capable enough to be very dangerous" not "inherently very dangerous".

This leads to a natural question: What reflection process would change a language model towards becoming a better map of the world (rather than language in the training dataset)? Reflection only looks at the language model, doesn't look at the world, produces an improved version of the model, applies an inductive bias after the fact. This is a problem statement of epistemic rationality for AI.

At a guess, focusing on transforming information from images and videos into text, rather than generating text qua text, ought to help — no? 

That's not reflection, just more initial training data. Reflection acts on the training data it already has; the point is to change the learning problem by introducing an inductive bias that's not part of the low-level learning algorithm, one that improves sample efficiency with respect to a loss that's also not part of low-level learning. LLMs are a very good solution to the wrong problem, and a so-so solution to the right problem. Changing the learning incentives might get better use out of the same training data for improving performance on the right problem.

A language model retrained on generated text (which is one obvious form of implementing reflection) likely does worse as a language model of the original training data; it's only a better model of the original data with respect to some different metric of being a good model (such as being a good map of the actual world, whatever that means). Machine learning doesn't know how to specify this different metric or turn it into a learning algorithm, but an amplification process that makes use of faculties an LLM captured from human use of language might manage to do this by generating appropriate text for low-level learning.

We could do auto captioning of movies and videos.

Or we could just train multimodal simulators. We probably will (e.g. such models could be useful for generating videos from descriptions).

I think in the limit of text prediction, language models can learn ~all of humanity's shared world model that is represented explicitly. The things that language models can't learn are IMO:

  • Tacit knowledge of the world that we haven't represented in text
  • Underdetermined features of the world
    • Aspects of our shared world model as represented in language that do not uniquely constrain our particular universe
  • Stuff we don't know about the world

As a path to AGI, I think token prediction is too high-level, unwieldy, and bakes in a number of human biases. You need to go right down to the fundamental level and optimize prediction over raw binary streams.

The source generating the binary stream can (and should, if you want AGI) be multimodal. At the extreme, this is simply a binary stream from a camera and microphone pointed at the world.

Learning to predict a sequence like this is going to lead to knowledge that humans don't currently know (because the predictor would need to model fundamental physics and all it entails).

To me, the caveats section of this post highlights the limited scope from which language models will be able to learn human values and preferences, given that explicitly stated (and even implied-from-text) goals != human values as a whole.

Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior? I could imagine an argument that a model would want to seize all available computation in order to better predict the next token. Perhaps the model's reward schedule would never lead it to think about such paths to loss reduction, but if the model is creative enough to consider such a plan and powerful enough to execute it, it seems that many power-seeking plans would help achieve its goal. This is significantly different from the view advanced by OpenAI, that language models are tools which avoid some central dangers of RL agents, and the general distinction drawn between tool AI and agentic AI. 

Do you think that the next-token prediction objective would lead to instrumentally convergent goals and/or power-seeking behavior?

No. Simulators aren't (in general) agents. Language models were optimised for the task of next token prediction, but they don't necessarily optimise for it. I am not convinced that their selection pressure favoured agents vs a more general cognitive architecture that can predict agents (and other kinds of systems).

Furthermore, insomuch as they are actually optimisers for next token prediction, it's in a very myopic way. That is, I don't think language models will take actions to make future tokens easier to predict.

I don't think language models will take actions to make future tokens easier to predict

For an analogy, look at recommender systems. Their objective is myopic in the same way as language models: predict which recommendation will most likely result in a click. Yet they have power-seeking strategies available, such as shifting the preferences of a user to make their behavior easier to predict. These incentives are well documented and simulations confirm the predictions here and here. The real world evidence is scant -- a study of YouTube's supposed radicalization spiral came up negative, though the authors didn't log in to YouTube, which could lead to less personalization of recommendations.

The jury is out on whether current recommender systems execute power-seeking strategies to improve their supposedly myopic objective. But the incentive and means are clearly present, and to me it seems only a matter of time before we observe this behavior in the wild. Similarly, while I don't think current language models are creative or capable enough to execute a power seeking strategy, it seems like power seeking by a superintelligent language model would be rewarded with lower loss. If a language model could use its outputs to persuade humans to train it with more compute on more data thereby reducing its loss, there seems to be every incentive for the model to seek power in this way. 

As I understand it, GPT-3 and co are trained via self-supervised learning with the goal of minimising predictive loss. During training, their actions/predictions do not influence their future observations in any way. The training process does not select for trying to control/alter text input, because that is something impossible for the AI to accomplish during training.

As such, we shouldn't expect the AI to demonstrate such behaviour. It was not selected for power seeking.
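The structural point above, that during training the model's predictions never feed back into what it observes next, can be sketched with a toy next-token learner. This is a minimal sketch: a count-based bigram model with add-one smoothing stands in for a neural network trained by gradient descent, and the function name `train_bigram` and the token sequence are invented for illustration.

```python
import math
from collections import defaultdict

def train_bigram(tokens, epochs=2):
    """Train a smoothed bigram predictor on a fixed token sequence.

    Returns the mean loss per epoch and the list of contexts the model was fed.
    """
    vocab_size = len(set(tokens))
    counts = defaultdict(lambda: defaultdict(int))
    epoch_losses, contexts_fed = [], []
    for _ in range(epochs):
        total = 0.0
        # Each (context, target) pair comes straight from the fixed corpus.
        for context, target in zip(tokens, tokens[1:]):
            contexts_fed.append(context)
            row = counts[context]
            # Add-one smoothed estimate of P(target | context).
            prob = (row[target] + 1) / (sum(row.values()) + vocab_size)
            total += -math.log(prob)  # predictive loss on the true next token
            row[target] += 1          # "learning" step
            # Note: the model's own guess is never used as the next context.
        epoch_losses.append(total / (len(tokens) - 1))
    return epoch_losses, contexts_fed

tokens = "the dog chased the cat and the dog slept".split()
losses, contexts = train_bigram(tokens)
print(losses)
```

Whatever the model predicts, the next context it sees is the next token of the corpus, so nothing in this loop can reward steering the input. Whether the same holds once a trained model is deployed and its outputs do reach the world is the separate question the surrounding comments are arguing about.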

There's an assumption that the text that language models are trained on can be coherently integrated somehow. But the input is a babel of unreliable and contradictory opinions. Training to convincingly imitate any of a bunch of opinions, many of which are false, may not result in a coherent model of the world, but rather a model of a lot of nonsense on the Internet.

Do you have much actual experience playing around with large language models?

text that language models are trained on can be coherently integrated somehow

In my experience, the knowledge/world model of GPT-3/ChatGPT is coherently integrated.


Training to convincingly imitate any of a bunch of opinions, many of which are false, may not result in a coherent model of the world, but rather a model of a lot of nonsense on the Internet.

This seems empirically false in my experience using language models, and prima facie unlikely. Lots of text on the internet is just reporting about underlying reality:

  • Log files
  • Research papers
  • Academic and industry reports
  • Etc.

Learning to predict such reports of reality would privilege processes that can learn the structure of reality.


Furthermore, text that is fact and text that is fiction are often distinguished by writing style or presentation. In my experience, large language models do not conflate fact and fiction.