Proof sketch: Left to the reader as an exercise.
You might want to formally state the thing you want proved in Proposition 2; right now I can't even tell what you are trying to claim. Some issues with the current formalization:
My best guess is that you want to relate the quantities $\max_{x_1, \dots, x_{B-1}} P(x_1, \dots, x_{B-1}, x_B \mid x_0)$ and $P(x_B \mid x_0)$, but I don't see why there would be any straightforward relation between these quantities (apart from the obvious one where the max sequence is one way to get the token and so is a lower bound on its probability, i.e. $\max_{x_1, \dots, x_{B-1}} P(x_1, \dots, x_{B-1}, x_B \mid x_0) \leq P(x_B \mid x_0)$).
EDIT: Maybe you want to say that $P(x_B \mid x_0)$ is "not much higher than" $\max_{x_1, \dots, x_{B-1}} P(x_1, \dots, x_{B-1}, x_B \mid x_0)$? If so, that seems false for LLMs; imagine a case where the probability of $x_B$ is spread over many different bridges, each of which is individually very unlikely.
Hi, thanks for the response! I apologize, the "Left as an exercise" line was mine, and written kind of tongue-in-cheek. The rough sketch of the proposition we had in the initial draft did not spell out sufficiently clearly what it was I wanted to demonstrate here and was also (as you point out correctly) wrong in the way it was stated. That wasted people's time and I feel pretty bad about it. Mea culpa.
I think/hope the current version of the statement is more complete and less wrong. (Although I also wouldn't be shocked if there are mistakes in there). Regarding your points:
Agreed. As a linguist, I looked at Proposition 2 and immediately thought "sketchy, shouldn't hold in a good model of a language model".
Proposition 1 is wrong. The coin flips that are eternally 0 0 0 0 are a counterexample. If all the transition probabilities are 1, which is entirely possible, the limiting probability is 1 and not 0.
A softmax can never emit a probability of exactly 0 or 1, so maybe they were implicitly assuming the model ends in a softmax (as is the common case)? Regardless, the proof is still wrong if a model is allowed unbounded context, as an infinite product of positive numbers less than 1 can still be nonzero. For example, if the probability of emitting another " 0" is even just as high as $1 - \frac1{n^{1.001}}$ after already having emitted $n$ copies of " 0", then the limiting probability is still nonzero.
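For completeness, here is the standard argument behind that claim (my own sketch of a textbook fact, not something from the post): for $n \geq 2$ we have $\frac{1}{n^{1.001}} \leq \frac12$, and $\log(1-a) \geq -2a$ for $a \in [0, \frac12]$, so

$$\log \prod_{n=2}^{\infty} \Bigl(1 - \frac{1}{n^{1.001}}\Bigr) = \sum_{n=2}^{\infty} \log \Bigl(1 - \frac{1}{n^{1.001}}\Bigr) \geq -2 \sum_{n=2}^{\infty} \frac{1}{n^{1.001}} > -\infty,$$

so the infinite product converges to a strictly positive number. More generally, for $a_n \in [0, 1)$ the product $\prod_n (1 - a_n)$ is positive exactly when $\sum_n a_n < \infty$.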
But if the model has a finite context and ends in a softmax then I think there is some minimum probability of transitioning to a given token, and then the proposition is true. Maybe that was implicitly assumed?
Yeah, I think it was implicitly assumed that there existed some $\epsilon > 0$ such that no token ever had probability $> 1 - \epsilon$.
Thanks for pointing this out! This argument made it into the revised version. I think because of finite precision it's reasonable to assume that such an $\epsilon$ always exists in practice (if we also assume that the probability gets rounded to something $< 1$).
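To illustrate the finite-precision point (a minimal sketch of my own, not from the post): a mathematically exact softmax never outputs 0 or 1, but in float32 a modest logit gap already rounds the top probability to exactly 1, which is why the "rounded to something $< 1$" caveat is doing real work here.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # standard numerical-stability shift
    e = np.exp(z)
    return e / e.sum()

# A logit gap of 30 is nothing unusual for a language-model head.
p = softmax(np.array([30.0, 0.0], dtype=np.float32))
print(p)            # approximately [1.0, 9.4e-14]
print(p[0] == 1.0)  # True: in float32 the larger probability rounds to exactly 1
```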
Technically correct, thanks for pointing that out! This comment (and the ones like it) was the motivation for introducing the "non-degenerate" requirement into the text. In practice, the proposition holds pretty well - although I agree it would be nice to have a deeper understanding of when to expect the transition rule to be "non-degenerate".
Calling individual tokens the 'State' and a generated sequence the 'Trajectory' is wrong/misleading IMO.
I would instead call a sequence as a whole the 'State'. This follows the meaning from Dynamical systems.
Then, you could refer to a Trajectory, which is a list of sequences, each with one more token.
(That said, I'm not sure thinking about trajectories is useful in this context for various reasons)
To elaborate somewhat, you could say that the token is the state, but then the transition probability is non-Markovian and all the math gets really hard.
Intuitively I would say that all the tokens in the token window are the state.
And when you run an inference pass, select a token and append that to the token window, then you have a new state.
The model looks a lot like a collection of nonlinear functions, each of them encoded using every parameter in the model.
Since the model is fixed after training, the only place an evolving state can exist has to be in the tokens, or more specifically the token window that is used as input.
The state seems to contain, for lack of a better word, a lot of entanglement. Likely due to attention heads, and how the nonlinear functions are encoded.
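A minimal sketch of this "token window is the state" picture (my own toy illustration; `next_token_probs` is just a deterministic stand-in for the frozen model, not anything real):

```python
import random

VOCAB = ["a", "b", "c"]
CONTEXT = 4  # finite context window

def next_token_probs(window):
    # Stand-in for the frozen model: a fixed function of the current window.
    rng = random.Random(hash(tuple(window)))
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def step(window):
    # One inference pass: sample a token and append it to the window.
    probs = next_token_probs(window)
    token = random.choices(VOCAB, weights=probs, k=1)[0]
    return (window + [token])[-CONTEXT:]  # the new state is the shifted window

state = ["a"]  # initial state: the prompt
for _ in range(8):
    state = step(state)
    print(state)
```

All of the "evolution" happens in the window; the function itself never changes.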
There is another way to view such a system, one that, while deeply flawed, at least to me intuits that whatever Microsoft and OpenAI are doing to "align(?)" something like Bing Chat is impossible (at least if the goal is bulletproof).
I would postulate:
- Alignment for such a system is impossible (assuming it has to be bulletproof)
- Impossibility is due to the architecture of such a system
Hmm, there was a bunch of back and forth on this point even before the first version of the post, with @Michael Oesterle and @metasemi arguing what you are arguing. My motivation for calling the token the state is that A) the math gets easier/cleaner that way and B) it matches my geometric intuitions. In particular, if I have a first-order dynamical system $\dot{x} = f(x)$, then $x$ is the state, not the trajectory of states $(x(s))_{s \leq t}$. In this situation, the dynamics of the system only depend on the current state (that's because it's a first-order system). When we move to higher-order systems, $\ddot{x} = f(x, \dot{x})$, then the state is still just $x$, but the dynamics of the system depend not only on the current state but also on the "direction from which we entered it". That's the first derivative (in a time-continuous system) or the previous state (in a time-discrete system).
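To spell out the time-discrete version of that distinction (generic textbook notation, not the post's):

$$\text{first order: } x_{t+1} = f(x_t), \qquad \text{second order: } x_{t+1} = f(x_t, x_{t-1}),$$

and the second-order system can be rewritten as a first-order system in the augmented state $s_t = (x_t, x_{t-1})$ via $s_{t+1} = \bigl(f(x_t, x_{t-1}),\, x_t\bigr)$. Whether one calls $x_t$ or the augmented window $s_t$ "the state" is exactly the choice under discussion here.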
At least I think that's what's going on. If someone makes a compelling argument that defuses my argument then I'm happy to concede!
My impression is that simulacra should be semantic objects that interact with interpretations of (sampled) texts, notably characters (agents), possibly objects and concepts. They are only weakly associated with particular texts/trajectories, the same simulacrum can be relevant to many different trajectories. Only many relevant trajectories, considered altogether, paint an adequate picture of a given simulacrum.
(This serves as a vehicle for discussing possible inductive biases that should move LLMs from token prediction and towards (hypothetical) world prediction.)
I agree. Here's the text of a short doc I wrote at some point titled 'Simulacra are Things'
What are simulacra?
“Physically”, they’re strings of text output by a language model. But when we talk about simulacra, we often mean a particular character, e.g. simulated Yudkowsky. Yudkowsky manifests through the vehicle of text outputted by GPT, but we might say that the Yudkowsky simulacrum terminates if the scene changes and he’s not in the next scene, even though the text continues. So simulacra are also used to carve the output text into salient objects.
Essentially, simulacra are to a simulator as "things" are to physics in the real world. "Things" are a superposable type – the entire universe is a thing, a person is a thing, a component of a person is a thing, and two people are a thing. And likewise, "simulacra" are superposable in the simulator. Things are made of things. Technically, a collection of atoms sampled randomly from the universe is a thing, but there's usually no reason to pay attention to such a collection over any other. Some things (like a person) are meaningful partitions of the world (e.g. in the sense of having explanatory/predictive power as an object in an ontology). We assign names to meaningful partitions (individuals and categories).
Like things, simulacra are probabilistically generated by the laws of physics (the simulator), but have properties that are arbitrary with respect to it, contingent on the initial prompt and random sampling (splitting of the timeline). They are not necessary but contingent truths; they are particular realizations of the potential of the simulator, a branch of the implicit multiverse. In a GPT simulation and in reality, the fact that there are three (and not four or two) people in a room at time $t$ is not necessitated by the laws of physics, but contingent on the probabilistic evolution of the previous state that is contingent on (…) an initial seed (prompt) generated by an unknown source that may itself have arbitrary properties.
We experience all action (intelligence, agency, etc) contained in the potential of the simulator through particular simulacra, just like we never experience the laws of physics directly, only through things generated by the laws of physics. We are liable to accidentally ascribe properties of contingent things to the underlying laws of the universe, leading us to conclude that light is made of particles that deflect like macroscopic objects, or that rivers and celestial bodies are agents like people.
Just as it is wrong to conclude after meeting a single person who is bad at math that the laws of physics only allow people who are bad at math, it is wrong to conclude things about GPT’s global/potential capabilities from the capabilities demonstrated by a simulacrum conditioned on a single prompt. Individual simulacra may be stupid (the simulator simulates them as stupid), lying (the simulator simulates them as deceptive), sarcastic, not trying, or defective (the prompt fails to induce capable behavior for reasons other than the simulator “intentionally” nerfing the simulacrum – e.g. a prompt with a contrived style that GPT doesn’t “intuit”, a few-shot prompt with irrelevant correlations). A different prompt without these shortcomings may induce a much more capable simulacrum.
This is certainly intriguing! I'm tentatively skeptical this is the right perspective though for understanding what LMs are doing. An important difference is that in physics and dynamical systems, we often have pretty simple transition rules and want to understand how these generate complex patterns when run forward. For language models, the transition rule is itself extremely complicated. And I have this sense that the dynamics that arise aren't that much more complicated in some sense. So arguably what we want to understand is the language model itself, not treat it as a black box and see what kinds of trajectories this black box induces. Though this does depend on where most of the "cognitive work" happens: right now it's inside the forward pass of the language model, but if we rely more and more on chain-of-thought reasoning, hierarchies of lots of interacting LLMs, or anything like that, maybe this will change. In any case, this objection doesn't rule out that stuff like the Lyapunov exponents could become nice tools, it's just why I think we probably won't get deep insights into AI cognition using an approach like this.
Trying to move towards semantic space instead of just token space seems like the right move not just because it's what's ultimately more meaningful to us, but also because transition dynamics should be simpler in semantic space in some sense. If you consider all possible dynamics on token space, the one a LM actually implements isn't really special in any way (except that it has very non-uniform next-token probabilities). In contrast, it seems like the dynamics should be much simpler to specify in the "right" semantic space.
Will leave high-level thoughts in a separate comment, here are just issues with the mathematical claims.
Proposition 1 seems false to me as stated:
For any given pair of tokens $x_0$ and $x_B$, the probability (as induced by any non-degenerate transition rule) of any given token bridge of length $B$ occurring decreases monotonically as $B$ increases,
Counterexample: the sequence (1, 2, 3, 4, 5, 6, 7, 10) has lower probability than (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) under most reasonable inference systems (including actual language models, I'd guess, especially if we increase the length of the pattern). The mistake in the proof sketch is that I think you just ignore the factor for the final token $x_B$? This claim seems potentially fixable "in the limit", e.g. a statement like: for any fixed pair of tokens, there is a natural number $B_0$ such that the probability is monotonically decreasing in $B$ for $B \geq B_0$. I'm not convinced even this weaker result is true though, because of the next problem:
Saying that the probability converges to zero is not the same as saying the probabilities decrease monotonically: the latter would also allow that they converge to some value greater than zero. And this stronger claim that the limit is zero is false in general: an infinite product where each of the factors is strictly between zero and one can still converge to a non-zero value if the factors approach one "quickly enough". If the "non-degenerate" assumption was meant to rule this out, you should specify what that assumption means precisely (I just read it as "no probabilities are exactly one").
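A concrete textbook instance of that phenomenon (my own illustration, not from the post):

$$\prod_{n=2}^{N} \Bigl(1 - \frac{1}{n^2}\Bigr) = \prod_{n=2}^{N} \frac{(n-1)(n+1)}{n^2} = \frac{N+1}{2N} \;\longrightarrow\; \frac12 \quad (N \to \infty),$$

so every factor is strictly between zero and one and the partial products decrease monotonically, yet the limit is $\tfrac12$ rather than zero.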
I'm also skeptical of proposition 2, though I might be misunderstanding that one. For large $B$, I'd often expect $P(x_B \mid x_0)$ to be approximately the monogram probability of $x_B$ in the context of language models. E.g. if $x_0$ is "This" and $B$ is 1000, then I think we have basically no information about $x_B$. So the LHS of your equation should asymptotically just be $P(x_B)$, right? So then proposition 2 would imply that the maximum probability of a sequence of length $B$ decays only as $\sim 1/(B+1)$, if I understand what you're claiming. I'd be a bit surprised if this was true for language models, though not entirely sure. If it is true, I think it would have to be because the most likely sequence is something silly like "this this this ...", which highlights that maybe the proposition is "asking the wrong question"? In any case, the $\sim 1/(B+1)$ decay clearly isn't true for more general "well-behaved" transition rules; you need pretty strong assumptions! Rounding those off to "sufficiently well-behaved" seems misleading.
Hi Erik! Thank you for the careful read, this is awesome!
Regarding proposition 1 - I think you're right, that counter-example disproves the proposition. The proposition we were actually going for was about $P(x_1, \dots, x_{B-1} \mid x_0)$, i.e. the probability without the end of the bridge! I'll fix this in the post.
Regarding proposition II - janus had the same intuition and I tried to explain it with the following argument: When the distance between tokens becomes large enough, then eventually all bridges between the first token and an arbitrary second token end up with approximately the same "cost". At that point, only the prior likelihood of the token will decide which token gets sampled. So Proposition II implies something like $\max_{x_1, \dots, x_{B-1}} P(x_1, \dots, x_{B-1}, x_B \mid x_0) \approx \frac{P(x_B)}{B+1}$, or that in the limit "the probability of the most likely sequence ending in $x_B$ will be (when appropriately normalized) proportional to the probability of $x_B$", which seems sensible? (assuming something like ergodicity). Although I'm now becoming a bit suspicious about the sign of the exponent, perhaps there is a "log" or a minus missing on the RHS... I'll think about that a bit more.
The proposition we were actually going for was about $P(x_1, \dots, x_{B-1} \mid x_0)$, i.e. the probability without the end of the bridge!
In that case, I agree the monotonically decreasing version of the statement is correct. I think the limit still isn't necessarily zero, for the reasons I mention in my original comment. (Though I do agree it will be zero under somewhat reasonable assumptions, and in particular for LMs)
So Proposition II implies something like $\max_{x_1, \dots, x_{B-1}} P(x_1, \dots, x_{B-1}, x_B \mid x_0) \approx \frac{P(x_B)}{B+1}$, or that in the limit "the probability of the most likely sequence ending in $x_B$ will be (when appropriately normalized) proportional to the probability of $x_B$", which seems sensible?
One crux here is the "appropriately normalized": why should the normalization be linear, i.e. just $B + 1$? I buy that there are some important systems where this holds, and maybe it even holds for LMs, but it certainly won't be true in general (e.g. sometimes you need exponential normalization). Even modulo that issue, the claim still isn't obvious to me, but that may be a good place to start (i.e. an explanation of where the normalization factor comes from would plausibly also clear up my remaining skepticism).
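As a toy example of why the normalization need not be linear (my own illustration, not anything from the post): take a transition rule that is i.i.d. uniform over a vocabulary of size $V$. Then

$$\max_{x_1, \dots, x_{B-1}} P(x_1, \dots, x_{B-1}, x_B \mid x_0) = V^{-B}, \qquad P(x_B \text{ after } B \text{ steps} \mid x_0) = \frac{1}{V},$$

so the factor relating the two is $V^{B-1}$: it grows exponentially in $B$ rather than like $B + 1$.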
I am a bit confused here and I would appreciate your thoughts!
Do you want to assume that the measure space is finite or not? Either way I am confused:
I. The measure space is finite
In this case, the notion of almost all/almost surely is vacuous. Anything which is true up to a finite set is true if your initial measure space has finite cardinality itself.
II. The measure space is infinite
While there is no immediate problem, I think your condition (that for almost all sequences, the probability of the next token stays bounded away from $1$ at every step) becomes too strong for a reasonable simulator.
Let $s^{\gg}$ denote a sufficient amount of repetitions of a finite sequence $s$. Consider the set $A$ of sequences that contain $s^{\gg}$ for some $s$ (I am being intentionally vague here, but I hope you get the idea). I have not empirically verified it, but it seems like the probability of repeating $s$ yet again might grow with the number of repetitions, i.e. the more often you repeat a string, the more likely it is that it will repeat itself. And I think $A$ is uncountable, so any reasonable measure should assign something greater than $0$ to it.
I think it is also worth mentioning that parts of this post reminded me of concepts introduced in information theory. In fact, if you go back to Shannon's seminal A Mathematical Theory of Communication, the second section already anticipates something like this (and then, for example, higher temperature = more noise?). It could be though that your post is more orthogonal to it.
Might be worthwhile to explore symbolic dynamics. I wrote a research notebook exploring similar ideas a few years ago.
Reiterating two points people already pointed out, since they still aren't fixed after a month. Please, actually fix them, I think it is important. (Reasoning: I am somewhat on the fence on how much weight to assign to the simulator theory, and I expect so are others. But as a mathematician, I would feel embarrassed to show this post to others and admit that I take it seriously when it contains such egregious errors. No offense meant to the authors, just trying to point at this as an impact-limiting factor.)
Proposition 1: This is false, and the proof is wrong. For the same reason that you can get an infinite series (of positive numbers) with a finite sum.
The terminology: I think it is a really bad idea to refer to tokens as "states", for several reasons. Moreover, these reasons point to fundamental open questions around the simulator framing, and it seems unfortunate to choose terminology which makes these issues confusing/hard to even notice. (Disclaimer: I point out some holes in the simulator framing and suggest improvements. However, I am well aware that all of my suggestions also have holes.)
(1) To the extent that a simulator fully describes some situation that evolves over time, a single token is a too small unit to describe the state of the environment. A single frame of a video (arguably) corresponds to a state. Or perhaps a sentence in a story might (arguably) correspond to a state. But not a single pixel (or patch) and not a single word.
(2) To the extent that a simulator fully describes some situation that evolves over time, there is no straightforward correspondence between the tokens produced so far and the current state of the environment. To give several examples: The process of tossing a coin repeatedly can be represented by a sequence such as "1 0 0 0 1 0 1 ...", where the current state can be identified with the latest token (and you do not want to identify the current state with the whole sequence). The process of me writing the digits of pi on a paper, one per second, can be described as "3 , 1 4 1 ..." --- here, you need the full sequence to characterize the current state. Or what if I keep writing different numbers, but get bored with them and switch to new ones after a while: " pi = 3 , 1 4 1 Stop, got bored. e = 2 , 7. Stop, got bored. sqrt(2) = ...". (I sketch this contrast a bit more formally after point (5) below.)
(3) It is misleading/false to describe models like GPT as "describing some situation that evolves over time". Indeed, fiction books and movies do crazy things like jumping from character to character, flashbacks, etc. Non-fiction books are even weirder (could contain snippets of stories, and then non-story things, etc). You could argue that in order to predict a text of a non-fiction book, GPT is simulating the author of that book. But where does this stop? What if the 2nd half of the book is darker because the author got sacked out of his day job and got depressed --- are you then simulating the whole world, to predict this thing? If (more advanced) GPT is a simulator in the sense of "evolving situations over time", then I would like this claim fleshed out in detail on the example of (a) non-fiction books, (b) fiction books, and perhaps (c) movies on TV that include commercial breaks.
(4) But most importantly: To the extent that a simulator describes some situation that evolves over time, it only outputs a small portion of the situation that it is "imagining" internally. (For example, you are telling a story about a princess, and you never mention the colour of her dress, despite the princess in your head having blue dress.) So it feels like a type-error to refer to the output as "state". At best, you could call it something like "rendering of a state".
Arguably, the output (+ the user input) uniquely determines the internal state of the simulator. So you could perhaps identify the output (+ the user input) with "the internal state of the simulator". But that seems dangerous and likely to cause reasoning errors.
(5) Finally, to make (4) even worse: To the extent that a simulator describes some situation that evolves over time, it is not internally maintaining a single fully fleshed out state that it (probabilistically) evolves over time. Instead, it maintains a set of possible states (macro-state?). And when it generates new responses, it throws out some of the possible states (refines the macro-state?). (For example, in your story about a princess, dress colour is not determined, could be anything. Then somebody asks about the colour, and you need to refine it to blue --- which could still mean many different shades of blue.)
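To make the contrast in point (2) explicit (a rough formalization of my own, writing $x_1, x_2, \dots$ for the emitted tokens):

$$\text{coin flips:}\quad P(x_{t+1} \mid x_1, \dots, x_t) = P(x_{t+1}), \qquad\quad \text{digits of } \pi\text{:}\quad P(x_{t+1} \mid x_1, \dots, x_t) = \mathbf{1}\bigl[x_{t+1} = \text{the } (t{+}1)\text{-th symbol of the expansion}\bigr].$$

In the first case the latest token (or even nothing at all) is a perfectly good state; in the second, the conditional depends on how much of the sequence has already been produced, which the latest token alone does not encode.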
---
However, even the explanation, given in (5), of what is going on with simulators, is missing some important pieces. Indeed, it doesn't explain what happens in cases such as "GPT tells the great story about the princess with blue dress, and suddenly the user jumps in and refers to the dress as red". At the moment, this is my main reason for scepticism about the simulator framing. As a result, my current view is that "GPT can act as a simulator" (in the sense of Simulators) but it would be "false" to say that "GPT is a simulator" (in the sense of Simulators).
Erik has already pointed out some problems in the math, but also
Formal definition: Same as for the attractor sequence, but for a positive Lyapunov coefficient.
I'm not sure this feels right. For the attractor sequence, it makes sense to think of the last part of the sequence as the attractor, the thing the sequence converges to, and to think of the "structural properties incentivizing attraction" lying there. On the contrary, it would seem like the "structural properties incentivizing chaos" should be found at the start of the sequence (after which different paths wildly diverge), instead of in one such divergent ending. Intuitively it seems like a sequence should be chaotic just when its Lyapunov exponent is high.
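For reference, by "Lyapunov exponent" I mean the standard dynamical-systems quantity (the post's formal definition may differ in detail):

$$\lambda = \lim_{t \to \infty} \frac{1}{t} \ln \frac{\lVert \delta x_t \rVert}{\lVert \delta x_0 \rVert},$$

where $\delta x_t$ is the separation at time $t$ between two trajectories that started infinitesimally close together. A positive $\lambda$ means nearby starting points diverge exponentially, which is why the "chaos" seems to live at the beginning of the sequence, where the divergence originates, rather than in any particular ending.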
On another note, I wonder whether such a conceptualization of language generation as a dynamical system can be fruitful even for natural, non-AI linguistics.