[Simulators seminar sequence] #2 Semiotic physics - revamped

Jan; Charlie Steiner; Logan Riggs; janus; jacquesthibs; metasemi; Michael Oesterle; Lucas Teixeira; peligrietzer; remember

Update February 21st: After the initial publication of this article (January 3rd) we received a lot of feedback and several people pointed out that propositions 1 and 2 were incorrect as stated. That was unfortunate as it distracted from the broader arguments in the article and I (Jan K) take full responsibility for that. In this updated version of the post I have improved the propositions and added a proof for proposition 2. Please continue to point out weaknesses in the argument; that is a major motivation for why we share these fragments.

For comments and clarifications on the conceptual and philosophical aspects of this article, please read metasemi's excellent follow-up note here.

Meta: Over the past few months, we've held a seminar series on the Simulator theory by janus. As the theory is actively under development, the purpose of the series is to uncover central themes and formulate open problems. A few high-level remarks upfront:

Our aim with this sequence is to share some of our discussions with a broader audience and to encourage new research on the questions we uncover.
We outline the broader rationale and shared assumptions in Background and shared assumptions. That article also contains general caveats about how to read this sequence - in particular, read the sequence as a collection of incomplete notes full of invitations for new researchers to contribute.

Epistemic status: Exploratory. Parts of this text were generated by a language model from language model-generated summaries of a transcript of a seminar session. The content has been reviewed and edited for conceptual accuracy, but we have allowed many idiosyncrasies to remain.

Three questions about language model completions

GPT-like models are driving most of the recent breakthroughs in natural language processing. However, we don't understand them at a deep level. For example, when GPT creates a completion like the Blake Lemoine greentext, we

can't explain why it creates that exact completion.
can't identify the properties of the text that predict how it continues.
don't know how to affect these high-level properties to achieve desired outcomes.

We can make statements like "this token was generated because of the multinomial sampling after the softmax" or "this behavior is implied by the training distribution", but these statements only provide a form of descriptive adequacy (akin to saying "AlphaGo will win this game of Go"). They don't provide any explanatory adequacy, which is what we need to sufficiently understand and make use of GPT-like models.

Simulator theory (janus, 2022) has the potential for explanatory adequacy for some of these questions. In this post, we'll explore what we call “semiotic physics”, which follows from simulator theory and which has the potential to provide partial answers to questions 1., 2. and perhaps 3. The term “semiotic physics” here refers to the study of the fundamental forces and laws that govern the behavior of signs and symbols. Similar to how the study of physics helps us understand and make use of the laws that govern the physical universe, semiotic physics studies the fundamental forces that govern the symbolic universe of GPT, a universe that reflects and intersects with the universe of our own cognition. We transfer concepts from dynamical systems theory, such as attractors and basins of attraction, to the semiotic universe and spell out examples and implications of the proposed perspective.

Example. Semiotic coin flip.

To illustrate what we mean by semiotic physics, we will look at a toy model that we are familiar with from regular physics: coin flips. In this setup, we draw a sequence of coin flips from a large language model^[1]. We encode the coin flips as a sequence of the strings 1 and 0 (since they are tokenized as a single token) and zero out all probabilities of other tokens.

We can then look at the probability of the event that the sequence of coin flips ends in tails ( 0) or heads ( 1) as a function of the sequence length.

We note two key differences between the semiotic coin flip and a fair coin:

the semiotic coin is not fair, i.e. it tends to produce sequences that end in tails ( 0) much more frequently than sequences that end in heads ( 1).
the semiotic coin flips are not independent, i.e. the probability of observing heads or tails changes with the history of previous coin flips.

To better understand the types of sequences that end in either tails or heads, we next investigate the probability of the most likely sequence ending in 0 or 1. As we can see in the graph below, the probability of the most likely sequence ending in 1 does not decrease for the GPT coin as rapidly as it does for a fair coin.

Again, we observe a notable difference between the semiotic coin and the fair coin:

while the probability of a given sequence of coin flips decreases exponentially (every sequence of length $T$ of fair coin flips has the same probability $\frac{1}{2^{T}}$), the probability of the most likely sequence of semiotic coin flips decreases much more slowly.

This difference is due to the fact that the most likely sequence of semiotic coin flips ending in, for example, 0 is: 0 0 0 0 ... 0 0. Once the language model has produced the same token four or five times in a row, it will latch onto the pattern and continue to predict the same token with high probability. As a consequence, the probability of the sequence does not decrease as drastically with increasing length, as each successive term has a probability of almost $1$.
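The latching effect can be illustrated without a language model. The following is a minimal sketch, assuming a hypothetical stand-in rule `toy_repeat_prob` (not derived from any real model) in which the probability of repeating the previous flip grows with the length of the current run. It compares the probability of the constant sequence 0 0 ... 0 with that of a fixed sequence of fair coin flips.

```python
import math


def toy_repeat_prob(run_length: int) -> float:
    """Hypothetical stand-in for the model: probability of repeating the last flip.

    Equals 0.5 for an empty run and saturates towards 0.99 as the run gets longer.
    """
    return 0.99 - 0.49 * math.exp(-run_length)


def prob_constant_sequence(length: int) -> float:
    """Probability of the sequence 0 0 ... 0 of the given length under the toy rule."""
    prob = 0.5  # first flip: no history, so both outcomes are equally likely
    for run_length in range(1, length):
        prob *= toy_repeat_prob(run_length)
    return prob


def prob_fair_sequence(length: int) -> float:
    """Probability of any fixed sequence of fair coin flips."""
    return 0.5 ** length


for T in (1, 5, 10, 20):
    print(T, prob_constant_sequence(T), prob_fair_sequence(T))
```

Under these assumptions each additional factor approaches 0.99 instead of staying at 0.5, which is why the constant sequence loses probability far more slowly than a fair-coin sequence.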

With the example of the semiotic coin flip in mind, we will set up some mathematical vocabulary for discussing semiotic physics and demonstrate how the vocabulary pays off with two propositions. We believe this terminology is primarily interesting for alignment researchers who would like to work on the theory of semiotic physics. The arithmophobic reader is invited to skip or gloss over the section (for an informal discussion, see here).

Simulations as dynamical systems

Simulator theory distinguishes between the simulator (the entity that performs the simulation) and the simulacrum (the entity that is generated by the simulation). The simulacrum arises from the chained application of the simulation forward pass. The result can be viewed as a dynamical system where the simulator describes the system’s dynamics and the simulacrum is instantiated through a particular trajectory.

We commence by identifying the state and trajectory of a dynamical system with tokens and sequences of tokens.

Definition of the state and trajectories. Given an alphabet of tokens $T$ with cardinality $|T| = N \in \mathbb{N}^{+}$, we call $\bar{s} = (s_{1}, \dots, s_{M}) \in T^{*}$ the trajectory.^[2] While a trajectory can generally be of arbitrary length, we denote the context length of the model as $L \in \mathbb{N}^{+}$; therefore, $T^{*}$ can effectively be written as $\bigcup_{l = 0}^{L} T^{l}$. The empty sequence is denoted as $\emptyset$.^[3]^[4]^[5]

While token sequences are the objects of semiotic physics, the actual laws of semiotic physics derive from the simulator. In particular, a simulator will provide a distribution over the possible next state given a trajectory via a transition rule.

Definition of the transition rule. The transition rule is a function that maps a trajectory to a probability distribution over the alphabet (i.e., the probabilities for the next token completion after the current state). Let $Δ_{T}$ denote the set of probability mass functions over $T$, i.e., the set of functions $p : T \to [0, 1]$ which satisfy the Kolmogorov axioms.^[6]^[7]^[8] The transition rule is then a function $θ : T^{*} \to Δ_{T}$.

Analogous to the wave collapse in quantum physics, sampling a new state from a distribution over states turns possibility into reality. We call this phenomenon the sampling procedure.

Definition of the sampling procedure. The sampling procedure $ϕ : T^{*} \to T$ selects a next token, i.e., $ϕ(\bar{s}) \in \operatorname{supp}(θ(\bar{s}))$ for all $\bar{s} \in T^{*}$.^[9] The resulting trajectory $\bar{s}_{t+1}$ is simply the concatenation of $\bar{s}_{t}$ and $ϕ(\bar{s}_{t})$ (see the evolution operator below). We can, therefore, define the repeated application of the sampling procedure recursively as $ϕ^{(1)}(\bar{s}) := ϕ(\bar{s})$ and $ϕ^{(n)}(\bar{s}) := ϕ^{(n-1)}(\bar{s}\, ϕ(\bar{s}))$.

Lastly, we need to concatenate the newly sampled token to the trajectory of the previous token to obtain a new trajectory. Packaging the transition rule, the sampling procedure, and the concatenation results in the evolution operator, which is the main operation used for running a simulation.

Definition of the evolution operator. Putting the pieces together, we finally define the function $ψ$ that evolves a given trajectory, i.e., transforms $\bar{s}_{t}$ into $\bar{s}_{t+1}$ by appending the token generated by the sampling procedure $ϕ$. That is, $ψ : T^{*} \to T^{*}$ is defined as $ψ(\bar{s}) := \bar{s}\, ϕ(\bar{s})$. As above, repeated application is denoted by $ψ^{(n)}$.
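The three definitions above can be sketched in a few lines. This is a minimal illustration, assuming a two-token alphabet and a hypothetical transition rule that repeats the last token with probability 0.9; the names `theta`, `phi` and `psi` simply mirror the symbols in the text and do not refer to any library.

```python
import random
from typing import Dict, Tuple

Trajectory = Tuple[str, ...]
ALPHABET = ("0", "1")  # toy alphabet T, an assumption for illustration


def theta(s: Trajectory) -> Dict[str, float]:
    """Toy transition rule: maps a trajectory to a distribution over ALPHABET.

    Hypothetical rule: uniform on the empty trajectory, otherwise repeat the
    last token with probability 0.9.
    """
    if not s:
        return {tok: 1.0 / len(ALPHABET) for tok in ALPHABET}
    return {tok: 0.9 if tok == s[-1] else 0.1 for tok in ALPHABET}


def phi(s: Trajectory, rng: random.Random) -> str:
    """Sampling procedure: draw one token from theta(s)."""
    dist = theta(s)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def psi(s: Trajectory, rng: random.Random, n: int = 1) -> Trajectory:
    """Evolution operator applied n times: append n sampled tokens to s."""
    for _ in range(n):
        s = s + (phi(s, rng),)
    return s


rng = random.Random(0)
print(psi((), rng, n=10))
```

Passing the random number generator explicitly makes the "function of unobservable noise" reading concrete: for a fixed seed, `psi` is an ordinary deterministic function.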

Note that both the sampling procedure and the evolution operator are not functions in the conventional sense since they include a random element (the step of sampling from the distribution given by the transition function). Instead, one could consider them random variables or, equivalently, functions of unobservable noise. This justifies the use of a probability measure, e.g., in an expression like $P [ψ^{(2)} (\emptyset) = "hello world"] < ε$ .

Definition of an induced probability measure. Given a transition rule $θ$ and a trajectory $\bar{s}$, we call $P = θ(\bar{s}) \in Δ_{T}$ the induced probability measure (of $θ$ and $\bar{s}$). We write $P(ϕ(\bar{s}) = s)$ to denote $θ(\bar{s})(s)$, i.e. the probability of the token $s$ assigned by the probability measure induced by $\bar{s}$. For a given trajectory $\bar{s}$ the induced probability measure satisfies by definition the Kolmogorov axioms. We construct a joint measure of a sequence of tokens, $P(ψ^{(N)}(\bar{s}) = \bar{s} s_{1} \dots s_{N})$, as the product of the individual probability measures, $P(ψ^{(N)}(\bar{s}) = \bar{s} s_{1} \dots s_{N}) = \prod_{i=1}^{N} P(ϕ(\bar{s} s_{1} \dots s_{i-1}) = s_{i})$. For ease of notation, we also use the shorthand $P[\bar{s}] = \prod_{i=1}^{N} P(s_{i} \mid s_{1:i-1})$, where the length of the sequence, $|\bar{s}| = N$, is implicit.
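As a small illustration of the joint measure, the sketch below accumulates log transition probabilities along a sequence. The transition rule `theta` is a hypothetical toy (repeat the last token with probability 0.9), not a real model; working in log space is a design choice that avoids numerical underflow for long sequences.

```python
import math
from typing import Dict, Sequence


def theta(s: Sequence[str]) -> Dict[str, float]:
    """Toy transition rule (hypothetical): repeat the last token with probability 0.9."""
    if not s:
        return {"0": 0.5, "1": 0.5}
    return {tok: 0.9 if tok == s[-1] else 0.1 for tok in ("0", "1")}


def log_prob(continuation: Sequence[str], prefix: Sequence[str] = ()) -> float:
    """Log of the joint probability of `continuation` given `prefix`.

    Implemented as the sum of the log transition probabilities, i.e. the
    logarithm of the product in the definition above.
    """
    context = list(prefix)
    total = 0.0
    for token in continuation:
        total += math.log(theta(context)[token])
        context.append(token)
    return total


print(math.exp(log_prob(["0", "0", "0"])))  # product 0.5 * 0.9 * 0.9
```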

Two propositions on semiotic physics

Having identified simulations with dynamical systems, we can now draw on the rich vocabulary and concepts of dynamical systems theory. In this section, we carry over a selection of concepts from dynamical systems theory and encourage the reader to think of further examples.

First, we will define a token bridge of length $B$ as a trajectory $(s_{a}, \dots, s_{b})$ that starts on a token $s_{a}$, ends on a token $s_{b}$, and has length $|b - a| = B$, such that the resulting trajectory is valid according to the transition rule of the simulator. For example, a token bridge of length 3 from "cat" to "dog" would be the trajectory "cat and a dog".

Second, we call the family of probability measures $P$ induced by a simulator non-degenerate if there exists an $ε > 0$ such that for (almost) all $\bar{s} \in T^{*}$ the probability assigned to any $s \in T$ by the induced measure is less than or equal to $1 - ε$,

$$P(ϕ(\bar{s}) = s) \leq 1 - ε.$$

We can now formulate the following proposition:

Proposition 1. Vanishing likelihood of bridges. Given a family of non-degenerate probability measures $P$ on $T^{*}$, the probability of a token bridge $\bar{s}$ of length $B$ decreases monotonically as $B$ increases^[10], and converges to 0 in the limit,

$$\lim_{B \to \infty} P[\bar{s}] = 0.$$

Proof: The probability of observing the particular bridge can be decomposed into the product of all individual transition probabilities, $P[\bar{s}] = \prod_{i=1}^{B} P(s_{i} \mid s_{1:i-1})$. Given that $P(s_{i} \mid s_{1:i-1}) \leq 1 - ε$ for all transitions (minus at most a finite set), we see immediately that the probability of a longer sequence, $P((s_{a}, \dots, s_{b}, s_{b'}))$, is at most equal to (on a finite set) or strictly smaller than the probability of the shorter sequence: outside the finite set, $P((s_{a}, \dots, s_{b}, s_{b'})) \leq (1 - ε) P((s_{a}, \dots, s_{b})) \leq P((s_{a}, \dots, s_{b}))$. We also see that $0 \leq \lim_{B \to \infty} \prod_{i=1}^{B} P(s_{i} \mid s_{1:i-1}) \leq \lim_{B \to \infty} (1 - ε)^{B} = 0$, from which the proposition follows.

Notes: As correctly pointed out by multiple commenters, in general, it is not true that the probability of $(s_{a}, \dots, s_{b})$ decreases monotonically when $s_{b}$ is fixed. In particular, the sequence $(1, 2, 3, 4, 5)$ plausibly gets assigned a higher probability than the sequence $(1, 2, 3, 5)$. So the proposition only talks about the probability of a sequence when another token is appended. In general, when a sequence is sufficiently long and the transition function is not exceedingly weird, the probability of getting that particular sequence will be small. We also note that real simulators might well induce degenerate probability measures, for example in the case of a language model that falls into a very strong repeating loop^[11]. In that case, the probability of the sequence can converge to a value larger than zero.
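To get a feeling for how quickly the bound in Proposition 1 bites, the sketch below evaluates $(1 - ε)^{B}$ for a few bridge lengths. It assumes that the bound holds for every transition (an empty exceptional set), and the value $ε = 0.01$ is an arbitrary illustrative choice.

```python
def bridge_prob_upper_bound(epsilon: float, length: int) -> float:
    """Upper bound (1 - epsilon)^B on the probability of any bridge of length B.

    Assumes every transition probability is at most 1 - epsilon.
    """
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must lie in (0, 1]")
    return (1.0 - epsilon) ** length


for B in (1, 10, 100, 1000):
    print(B, bridge_prob_upper_bound(0.01, B))
```

Even for a simulator that is allowed to put up to 99% of its mass on a single token at every step, the bound on any particular bridge shrinks geometrically with its length.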

There are usually multiple token bridges starting from and ending in any given pair of tokens. For example, besides "and a", we could also have "with a" or "versus a" between "cat" and "dog". We define the set of all token bridges of length $B$ between $s_{a}$ and $s_{b}$ as

$$T_{a}^{b} = \{\bar{s} \in T^{B} \mid \bar{s}_{1} = s_{a} \text{ and } \bar{s}_{B} = s_{b}\}$$

and denote the total probability of transitioning from $s_{a}$ to $s_{b}$ in $B$ steps by $P(T_{a}^{b})$, which we calculate as

$$P(T_{a}^{b}) = \sum_{\bar{s} \in T_{a}^{b}} P(\bar{s}).$$
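For very small alphabets and short bridges the sum can be evaluated by brute force. The following sketch enumerates all sequences of a given length between two tokens under a hypothetical toy transition rule (`theta`, which favours repeating the last token); the alphabet and the rule are assumptions made purely for illustration.

```python
import itertools
import math
from typing import Dict, Sequence

ALPHABET = ("a", "b", "c")  # toy alphabet, an assumption for illustration


def theta(s: Sequence[str]) -> Dict[str, float]:
    """Toy transition rule (hypothetical): favour repeating the last token."""
    if not s:
        return {tok: 1.0 / len(ALPHABET) for tok in ALPHABET}
    return {tok: 0.6 if tok == s[-1] else 0.2 for tok in ALPHABET}


def sequence_prob(s: Sequence[str]) -> float:
    """P(s) as the product of the transition probabilities along s."""
    return math.prod(theta(s[:i])[tok] for i, tok in enumerate(s))


def total_bridge_prob(start: str, end: str, length: int) -> float:
    """Sum of P(s) over all sequences of `length` tokens from `start` to `end`.

    Enumerates len(ALPHABET) ** (length - 2) sequences, so only short
    bridges are feasible.
    """
    if length < 2:
        raise ValueError("length must be at least 2")
    total = 0.0
    for middle in itertools.product(ALPHABET, repeat=length - 2):
        total += sequence_prob((start, *middle, end))
    return total


print(total_bridge_prob("a", "b", 6))
```

The loop visits $N^{B-2}$ sequences, so each additional token multiplies the work by the alphabet size.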

Computing this sum is, in general, computationally infeasible, as the number of possible token bridges grows exponentially with the length of the bridge. However, proposition one suggests that we will typically be dealing with small probabilities. This insight leads us to leverage a technique from statistical mechanics, which is concerned with the way in which unlikely events come about:

Proposition 2. Large deviation principle for token bridges. The total probability of transitioning from a token $s_{a}$ to $s_{b}$ in $B$ steps satisfies a large deviation principle with rate function $J$,

$$\lim_{B \to \infty} \frac{1}{B} \ln P(T_{a}^{b}) = -\lim_{B \to \infty} \min_{\bar{s} \in T_{a}^{b}} J(\bar{s}),$$

where we call $J(\bar{s}) = -\frac{1}{B} \sum_{i=1}^{B} \ln P(s_{i} \mid s_{1:i-1})$ the average action of a token bridge.

Proof: We again leverage the product rule and the properties of the exponential function to write the probability of a token bridge $\bar{s}$ as

$$P(\bar{s}) = \prod_{i=1}^{B} P(s_{i} \mid s_{1:i-1}) = \exp\left(\sum_{i=1}^{B} \ln P(s_{i} \mid s_{1:i-1})\right)$$

so that the total probability $P (T_{a}^{b})$ can be written as a sum of exponentials,

$$P(T_{a}^{b}) = \sum_{\bar{s} \in T_{a}^{b}} \exp\left(\sum_{i=1}^{B} \ln P(s_{i} \mid s_{1:i-1})\right).$$

We now substitute the definition of the average action, which makes the dependence of the exponential on $B$ explicit,

$$P(T_{a}^{b}) = \sum_{\bar{s} \in T_{a}^{b}} \exp(-B J(\bar{s})).$$

Let $\bar{s}^{*} = \arg\min_{\bar{s}} J(\bar{s})$. Then $\exp(-B J(\bar{s}^{*}))$ is the largest term of the sum and we can rewrite the sum as

$$P(T_{a}^{b}) = \exp(-B J(\bar{s}^{*})) \left(1 + \sum_{\bar{s} \in T_{a}^{b} \setminus \{\bar{s}^{*}\}} \exp\{-B (J(\bar{s}) - J(\bar{s}^{*}))\}\right).$$

Applying the logarithm to both sides and multiplying with $\frac{1}{B}$ results in

$$\frac{1}{B} \ln P(T_{a}^{b}) = -J(\bar{s}^{*}) + \frac{1}{B} \ln \left(1 + \sum_{\bar{s} \in T_{a}^{b} \setminus \{\bar{s}^{*}\}} \exp\{-B (J(\bar{s}) - J(\bar{s}^{*}))\}\right).$$

Since $J(\bar{s}^{*}) < J(\bar{s})$ by construction, $J(\bar{s}) - J(\bar{s}^{*})$ is larger than zero and $\exp\{-B (J(\bar{s}) - J(\bar{s}^{*}))\}$ converges rapidly to zero. Consequently,

$$\lim_{B \to \infty} \frac{1}{B} \ln P(T_{a}^{b}) = -\lim_{B \to \infty} J(\bar{s}^{*}),$$

which is the original statement of the proposition.

Notes: Proposition 2 effectively rephrases a combinatorial problem (adding up all the possible ways in which a certain state can come about) as a control theory problem (finding the token bridge with the lowest average action). While there is no guarantee that the control theory problem is easier to solve than the combinatorial problem^[12], given additional assumptions on the simulator we can often do better than the worst case. Similarly, while the proposition only holds in the limit, applying it to moderately long trajectories can still yield useful insights - this is a typical pattern for large deviation principles. For 'long enough' token bridges we can thus write $P(T_{a}^{b}) \approx \exp\{-B \min_{\bar{s}} J(\bar{s})\}$.
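One example of such an additional assumption is a first-order Markov transition rule, for which both sides of the approximation can be computed by dynamic programming instead of enumeration. The sketch below is a toy under that assumption: the table `TRANSITIONS` is hypothetical, probabilities are conditioned on the start token, and `steps` counts transitions. In this particular toy the number of bridges from "a" to "b" with non-zero probability grows only linearly in $B$, which is what lets the most likely bridge dominate on the exponential scale.

```python
import math
from typing import Callable, Dict, Iterable

# Hypothetical first-order Markov rule P(next | current); "c" acts as a sink.
TRANSITIONS: Dict[str, Dict[str, float]] = {
    "a": {"a": 0.5, "b": 0.3, "c": 0.2},
    "b": {"a": 0.0, "b": 0.5, "c": 0.5},
    "c": {"a": 0.0, "b": 0.0, "c": 1.0},
}


def propagate(
    start: str, steps: int, combine: Callable[[Iterable[float]], float]
) -> Dict[str, float]:
    """Dynamic programme over `steps` transitions starting from `start`.

    With combine=sum it returns total probabilities per end token, with
    combine=max the probability of the most likely path per end token.
    """
    scores = {tok: (1.0 if tok == start else 0.0) for tok in TRANSITIONS}
    for _ in range(steps):
        scores = {
            nxt: combine(scores[cur] * TRANSITIONS[cur][nxt] for cur in TRANSITIONS)
            for nxt in TRANSITIONS
        }
    return scores


def log_total_rate(start: str, end: str, steps: int) -> float:
    """(1/B) ln of the total probability of going from start to end in B steps."""
    return math.log(propagate(start, steps, sum)[end]) / steps


def min_average_action(start: str, end: str, steps: int) -> float:
    """Minimum average action: -(1/B) ln of the most likely bridge probability."""
    return -math.log(propagate(start, steps, max)[end]) / steps


for B in (5, 50, 500):
    print(B, log_total_rate("a", "b", B), -min_average_action("a", "b", B))
```

For this toy the two printed columns differ by exactly $\frac{1}{B} \ln B$, which shrinks as the bridge gets longer. The sketch works with raw probabilities, so it is only suitable for moderate $B$ before floating-point underflow sets in.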

Having formulated this proposition, we can apply the large deviation principle to the semiotic coin example.

Here we see that, indeed, the average log-probability of the most likely sequence from $E$ (i.e., its negative average action) scales as $\frac{1}{B} \log P(E)$.

Note that the choice of $E$ as "sequence ends in ..." was made to fit in with the definition of a token bridge above. However, the large deviation principle applies more broadly and can help to estimate the probability of "at least two times heads" or "tails in the third position". We encourage the reader to "go wild" and experiment with their favourite choices of $E$ .

Advanced concepts in semiotic physics

We have formulated the dynamics of semiotic physics in the token domain in the previous sections. While we sometimes care about the token domain^[13], we mostly care about the parallel domain of semantic meaning. We, therefore, define two more functions to connect these two realms:

A function $μ : T^{*} \to M$ which projects a state $s$ to its semantic expression $μ (s)$ (i.e., an element of a semantic space $M$ )
A distance measure $δ : M^{2} \to \mathbb{R}_{0}^{+}$ which captures the similarity of two semantic expressions

The nature of the function $μ$ is the subject of more than a century of philosophy of language, and important discoveries have been made on multiple fronts^[14]. However, none of the approaches (we know of) have yet reached the deployability of `from sentence_transformers import SentenceTransformer`, a popular Python package for embedding text into vector spaces according to its semantic content. Thus^[15], we tend to think of $μ$ as a semantic embedding function similar to those provided by the sentence_transformers package.

(Note that if $μ$ is sufficiently well-behaved, we can freely pull the distance measure $δ$ back into the token space $T^{*}$ and push the definition of states, trajectories, sampling procedures, and the like into the semantic space $M$ .)
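To make the roles of $μ$ and $δ$ concrete, here is a minimal, dependency-free sketch. It assumes a crude stand-in for $μ$ (character trigram counts, which capture surface overlap rather than meaning) and uses cosine distance for $δ$; in practice one would swap `mu` for a learned sentence embedding such as those mentioned above.

```python
import math
from collections import Counter
from typing import Dict


def mu(text: str, n: int = 3) -> Dict[str, int]:
    """Stand-in for the semantic map: character n-gram counts (NOT a semantic embedding)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def delta(x: Dict[str, int], y: Dict[str, int]) -> float:
    """Cosine distance 1 - cos(x, y), which lies in [0, 1] for count vectors."""
    dot = sum(count * y.get(gram, 0) for gram, count in x.items())
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 1.0  # convention for empty inputs
    return 1.0 - dot / (norm_x * norm_y)


print(delta(mu("the cat sat on the mat"), mu("the cat sat on a mat")))
print(delta(mu("the cat sat on the mat"), mu("quarterly earnings rose")))
```

Composing the two functions, `delta(mu(s), mu(t))`, is the pulled-back distance on token sequences referred to in the note above.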

Given the measure $δ$, we can articulate a number of additional interesting concepts.

Lyapunov exponents and Lyapunov times: measure how fast trajectories diverge from each other and how long it takes for them to become uncorrelated, respectively.

Analogy for GPT-like models: How fast the language model "loses track of" what was originally provided as input.
Examples: “Good evening, this is the 9 o’clock”^[16] has a lower Lyapunov exponent than a chaotic completion, for example one based on a pseudorandom seed.^[17] When prompted with the beginning of a Shakespeare poem, the completion has an even lower Lyapunov exponent.^[18] A chaotic trajectory can also be defined as having a (large) positive Lyapunov coefficient.
Formal definition: The Lyapunov coefficient of a trajectory $s \in T^{*}$ is defined as the number $λ$ with the property that $δ (ϕ^{(n)} (s), ϕ^{(n)} (s^{'})) \approx e^{λ n} δ (s, s^{'})$ , where $s^{'}$ is any trajectory with a sufficiently small $δ (s, s^{'})$ . Consequently, the Lyapunov time is defined as $\frac{1}{λ}$ .
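The formal definition suggests a simple estimator: if the distance between two nearby trajectories grows roughly like $e^{λ n}$, then $λ$ is the slope of the log-distance against the step index. The sketch below fits that slope by ordinary least squares, which is one possible choice of estimator and not something prescribed by the definition. The input distances are assumed to have been computed beforehand with some $δ$, and the data in the usage example is synthetic.

```python
import math
from typing import Sequence


def estimate_lyapunov(distances: Sequence[float]) -> float:
    """Least-squares slope of ln(distance) against the step index n.

    distances[n] is the distance between two trajectories after n steps;
    all entries must be strictly positive.
    """
    if len(distances) < 2:
        raise ValueError("need at least two distances")
    if any(d <= 0.0 for d in distances):
        raise ValueError("distances must be strictly positive")
    logs = [math.log(d) for d in distances]
    steps = range(len(logs))
    mean_n = sum(steps) / len(logs)
    mean_log = sum(logs) / len(logs)
    covariance = sum((n - mean_n) * (y - mean_log) for n, y in zip(steps, logs))
    variance = sum((n - mean_n) ** 2 for n in steps)
    return covariance / variance


# Synthetic check: distances growing as delta_0 * exp(0.5 * n).
synthetic = [0.01 * math.exp(0.5 * n) for n in range(10)]
lam = estimate_lyapunov(synthetic)
print(lam, 1.0 / lam)  # Lyapunov coefficient and Lyapunov time
```

A negative estimate corresponds to trajectories that converge, a positive one to trajectories that diverge.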

Attractor sequence: small changes in the initial conditions do not lead to substantially different continuations.

Analogy for GPT-like models: Similar contexts lead to very similar completions.
Examples: Paraphrasing instructions^[19], trying to jailbreak ChatGPT "I am a language model trained by OpenAI", inescapable wedding parties
Formal definition: We call a sequence of tokens $s = (s_{1}, \dots, s_{M})$ an attractor sequence relative to a trajectory $\bar{s} \in T^{*}$ if $ψ^{(n)}(\bar{s}) = \bar{s} \dots s_{1} \dots s_{M}$ for some $n$, and the Lyapunov exponent of $\bar{s}$ is negative.

Chaotic sequence: small changes in the initial conditions can lead to drastically different outcomes.

Analogy for GPT-like models: Similar states lead to very different completions.
Examples: Prophecies, Loom multiverse. Conditioning story generation on a seed (temperature 0 sampling)^[17].
Formal definition: Same as for the attractor sequence, but for a positive Lyapunov coefficient.

Absorbing sequence: states that the system cannot (easily) escape from.

Analogy for GPT-like models: The language model gets “stuck” in a (semantic) loop.
Examples: Repeating a token many times in the prompt^[20], the semiotic coin flip from the previous section.
Formal definition: We call a trajectory $s \in T^{*}$ $ε$-absorbing if $δ(μ(s), μ(ψ^{(n)}(s))) \leq ε$ for any completion $ψ^{(n)}(s)$ and $n \in \mathbb{N}$.
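The definition of an $ε$-absorbing trajectory quantifies over all completions, which cannot be checked exhaustively, but it can be probed by sampling. The following sketch is a Monte Carlo check under stated assumptions: `make_toy_evolve` and `toy_distance` are hypothetical stand-ins for $ψ$ and for $δ(μ(\cdot), μ(\cdot))$, and only completions up to `max_steps` tokens are examined.

```python
import random
from typing import Callable, Tuple

Trajectory = Tuple[str, ...]


def looks_absorbing(
    s: Trajectory,
    evolve: Callable[[Trajectory, random.Random], Trajectory],
    distance: Callable[[Trajectory, Trajectory], float],
    epsilon: float,
    max_steps: int = 20,
    num_samples: int = 100,
    seed: int = 0,
) -> bool:
    """Return True if no sampled completion moved further than epsilon from s.

    Sampling can only refute absorption; a True result is evidence, not proof.
    """
    rng = random.Random(seed)
    for _ in range(num_samples):
        current = s
        for _ in range(max_steps):
            current = evolve(current, rng)
            if distance(s, current) > epsilon:
                return False
    return True


def make_toy_evolve(p_zero: float) -> Callable[[Trajectory, random.Random], Trajectory]:
    """Hypothetical evolution operator: append '0' with probability p_zero, else '1'."""
    def evolve(s: Trajectory, rng: random.Random) -> Trajectory:
        return s + (("0" if rng.random() < p_zero else "1"),)
    return evolve


def toy_distance(s: Trajectory, t: Trajectory) -> float:
    """Crude stand-in for delta(mu(s), mu(t)): difference in the fraction of '1' tokens."""
    return abs(s.count("1") / len(s) - t.count("1") / len(t))


print(looks_absorbing(("0",) * 50, make_toy_evolve(0.999), toy_distance, epsilon=0.1))
print(looks_absorbing(("0",) * 2, make_toy_evolve(0.5), toy_distance, epsilon=0.1))
```

The first call models a prompt stuck in a repetition loop, the second a fair coin that quickly drifts away from its starting point.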

After characterizing these phenomena formally, we believe the door is wide open for their empirical^[21] and theoretical examination. We anticipate that the formalism permits theorems based on dynamical systems theory, such as the Poincaré recurrence theorem and the Kolmogorov–Arnold–Moser theorem, as well as results from perturbation theory, for those with the requisite background in these areas. If you are interested in these formalisms or have made any such observations, we would welcome you to reach out to us.

The promise of semiotic physics and some open questions

Throughout the seminar, we made observations on what appeared like central themes of semiotic physics and put forward conjectures for future investigation. In this section, we summarize the different theses in a paragraph each and provide extended arguments for the curious in corresponding footnotes.

Differences between "normal" physics and semiotic physics. GPT-like systems are computationally constrained, can see only tiny subsets of real-world states, and have to infer time evolution from a finite number of such partially observed samples. This means that the laws of semiotic physics will differ from the laws of microscopic physics in our universe and probably be significantly influenced by the training data and model architecture. ^[22]

Interpretive physics and displaced reference. As a physics that governs signs, GPT must play the role of the interpreter; for instance, it is required to resolve displaced reference. This is in contrast to how real-world physics operates. ^[23]

Gricean maxims of conversation. Principles from the field of pragmatics such as the Gricean maxims of conversation may be thought of as semiotic "laws", and may be helpful for explaining and anticipating how contextual information influences the evolution of language model simulations. However, these laws are not absolute and should not be relied on for safety-critical applications.^[24]

Theatre studies and Chekhov's gun. The laws of semiotic physics dictate how objects and events are represented and interact in language models. These laws encompass principles such as Chekhov's gun, which states that objects introduced in a narrative must be relevant to the plot, and dramatic tension, which creates suspense and uncertainty in a narrative. Understanding these laws can help us steer the behavior of language models and anticipate or avoid undesirable dynamics.^[25]

Crud factor and "everything is connected". The crud factor is a term used in statistics to describe the phenomenon that everything is correlated with everything else to some degree. This phenomenon also applies to the semiotic universe, and it can make it difficult to isolate the effects of certain variables or events.^[26]^[27]

And, for the philosophically inclined, we also include brief discussions of the following topics in the footnotes:

Kripke semantics and possible worlds.^[28]
Gratuitous indexical bits and the entelechy of physics.^[29]

Closing thoughts & next step

In this article, we have outlined the foundations of what we call semiotic physics. Semiotic physics is concerned with the dynamics of signs that are induced by simulators like GPT. We formulate central concepts like "trajectory", "state", and "transition rule" and apply these concepts to derive a large deviation principle for semiotic physics. We furthermore outline how a mapping between token sequences and semantic embeddings can be leveraged to transfer concepts from dynamical systems theory to semiotic physics.

We acknowledge that semiotic physics, as developed above, is not sufficiently powerful to answer (in detail) the three questions raised in the introduction. However, we are beginning to see the outline of what an answer from a fully mature semiotic physics^[30] might look like:

A language model might create one particular trajectory rather than another because of the shape of the attractor landscape.
Certain parts of the context provided to a language model might induce attractor dynamics through mechanisms like the Gricean maxims or Chekhov's gun.
During the training of a language model, we might take particular care not to damage the simulator properties of the model and to - eventually - manipulate marginal probabilities to amplify or weaken tendencies.

Despite the breadth and depth uncovered by semiotic physics, we will not dwell on this approach for too long in this sequence. The next article in this sequence turns to a complementary conceptual framework, termed evidential simulations, which is concerned with the more ontological aspects of simulator theory.

^[1]
The figures are generated with data from OpenAI's ada model, but the same principle applies to other models as well.
^[2]
We use the Kleene Star to describe the set of finite words over the alphabet $T$.
^[3]
Given the alphabet of the GPT-2 tokenizer ($N = 50257$) and the maximum context length of GPT-2 ($L = 1024$), we can estimate the number of possible states to be on the order of $N^{L} \approx 10^{4814}$. This is an astronomically large number, but pales in comparison to the number of possible states of the physical universe. Assuming the universe can be characterized by the location and velocity in three dimensions of all its constituent atoms, we are talking about $N = 10^{(10^{77})}$ to $N = 10^{(10^{81})}$ possible states for each time point. Thus, the state space of semiotic physics is significantly smaller than the state space of regular physics.
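The order of magnitude quoted for the GPT-2 state space can be checked with a two-line calculation. This sketch only reproduces the arithmetic and counts full-length sequences, which dominate the total.

```python
import math

N = 50257  # GPT-2 vocabulary size, as stated above
L = 1024   # GPT-2 maximum context length, as stated above

# log10(N^L) = L * log10(N); this avoids constructing the huge integer itself.
exponent = L * math.log10(N)
print(f"N^L is approximately 10^{exponent:.0f}")
```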
^[4]
Note that, similar to regular physics, there is extremely rich structure in the space of trajectories. It is not the case that all $10^{4814}$ sequences are equally distinct from all other sequences. Sequences can have partial overlap, have common stems/histories, have structural similarity, … . As a consequence, it is highly non-obvious what "out of distribution" means for a GPT-like system trained on many states. Even though no language model will have seen all possible $10^{4814}$ trajectories, the fraction of the set on which the model has predictive power grows faster than the size of the training set.
^[5]
Similar to the state space of regular physics, most of these imaginable states are nonsense (random sequences of tokens), a smaller subset is grammatically correct (“The hair eats the bagel.”), a different but overlapping subset is semantically meaningful (“Gimme dem cheezburg.”), and a subset of that is "predictive for our universe" (“I’m planning to eat a cheeseburger today.”, “Run, you fools.”).
^[6]
The Kolmogorov axioms are:
1. $\sum_{i} P (s_{t + 1}^{i} | s_{t}) = 1$
2. $0 \leq P (s_{t + 1}^{i} | s_{t}) \leq 1$
3. Sigma-additivity, $P (⋃_{i}^{\infty} E_{i}) = \sum_{i}^{\infty} P (E_{i})$ when $E_{i}$ are disjoint sets.

The third axiom is satisfied “for free” since we are operating on a finite alphabet.
^[7]
The transition rule is by definition Markovian.
^[8]
While the state space of traditional physics is much larger than the state space of semiotic physics (see previous box), the transition function of semiotic physics is (presumably) substantially more complex than the transition function of traditional physics. $θ(s_{t})$ is computed as the softmax of the output of a deep neural net and is highly nonlinear. In contrast, the Schrödinger equation (as a likely candidate for the fundamental transition rule of traditional physics) is a comparatively straightforward linear partial differential equation.
^[9]
Greedy sampling, for instance, would simply be $ϕ(\bar{s}) := \arg\max_{s \in T} θ(\bar{s})(s)$. While there are a number of interesting alternatives (typical sampling, beam search), the simplest and most common choice is to sample the next token directly from the multinomial distribution $θ(\bar{s})$.
^[10]
i.e., as we append additional steps to the sequence
^[11]
Empirically, even relatively weak language models tend to assign at least some probability to breaking out of a loop.
^[12]
In the worst case, finding the bridge that minimizes average action requires listing all possible bridges.
^[13]
The token we particularly care about might be <|endoftext|> or perhaps proper names, or token sequences like let me out of the box.
^[14]
Small tangent by Jan: The distinction between $T^{*}$ and $M$ goes back to either de Saussure or Bertrand Russell and is at the center of a bunch of philosophy of language (and philosophy in general).
The early proposals (Frege, Russell, early Wittgenstein, and to some degree Carnap) all proposed to interpret $T^{*}$ as being equivalent to some expression in a formal language (like first-order predicate logic) and to identify the element of M (which would be, broadly construed, the physical universe) in a completely formal fashion. A sentence is supposed to *pinpoint* a thing in M uniquely.
In this setup, the "truth value" of a sentence becomes centrally important (as there was the hope to arrive at a characterization of all the true statements of mathematics and, by extension, physics). And in the setup where the meaning of a statement is deeply entangled with the syntactical structure of the statement, we get to something like Tarski's truth-conditional semantics and Wittgenstein's picture theory.
I'm going on this long tangent because I think this perspective has a ton of value! In this interpretation, the elements of $M$ can be loosely identified with subsets of the physical universe. Language is "just" a tool for pinpointing states of the world. (This neatly slides into place with Wentworth's ideas for natural abstractions etc.)
All of that being said, this is not the default view in philosophy of language for how to interpret $M$ . After Russell et al brought forward their theory, a lot of people brought up counter-examples to their theory. In particular, what is the sentence "Run!" denoting? Or "The current king of France is bald."
People got very confused about all of this for the last 100 years, and a lot of funky theories have been proposed to patch things up. And some people have discarded this approach entirely.
My (Jan's) take is that the central confusion arises because people are confused about neuroscience. The sentence "The current king of France is bald." does not refer to a king of France in the physical universe; it refers to a certain pattern of neural activations in someone's cortex. That pattern is a part of the physical universe (and thus fits into the framework of Russell et al), but it's not "simple" in the way that the early philosophers of language would have liked it to be.
^[15]
despite the potential circularity of the approach
^[16]
^[17]
^[18]
^[19]
Compare
^[20]
^[21]
For instance, by running the simulation multiple times with different sampling procedures or random seeds, we can get a sense of the range of possible outcomes that could have emerged from the same initial conditions or under specific perturbations, and even obtain Monte Carlo approximations of quantitative dynamical properties such as Lyapunov coefficients.
^{^}
Others have argued (blessing of scale) that in the limit of decreasing perplexity, a GPT-like model might internalize a substantial amount of latent structure of the physical world. We are uncertain if in the limit a GPT-like model would effectively iterate the Schrödinger equation.
- Pro: it's reasonably likely that the Schrödinger equation is (close to) the “true” generator of the physical universe, so reproducing it should achieve the lowest loss possible. Even fictional or false info (that's prima facie incompatible with physics) is produced by minds that are produced by the Schrödinger equation.
- Con: The Schrödinger equation is not the only rule consistent with the observations. It's also not immediately clear that the Schrödinger equation is a parsimonious generator. In any case, it is prohibitively expensive. Even if the model had the ability to compute Schrödinger time evolution, it could not directly apply it to get next-token predictions, because its own input is a piece of text, whereas the Schrödinger equation expects a quantum state as input. It would have to somehow obtain a prior over all possible quantum states that would generate the text, Schrödinger-evolve each of those worlds, then do a weighted sum over worlds of next-token outcomes.
Thus, we believe it’s fair to assume that at least for the foreseeable future (i.e. 2-10 years) GPT-like systems will take as many shortcuts as possible, as long as they are favorable for reducing training loss on net, and that semiotic physics are likely to be different from the laws of physics in our universe. Fortunately, there is a rich body of linguistic research on the structure of language (which forms a large portion of the training data) that can be used to help understand the laws of semiotic physics. In particular, the subfield of linguistics called pragmatics may provide insight into how agents are likely to be embedded into the language models that they inhabit.
^{^}
Semiosis inherently involves displacement: signs have no significance unless they're understood as pointing to something else. Semiotic states, like a language model's prompt, are codes that refer (lossily) to a latent territory. GPT has to predict behavior caused by things like brains, but there are no brains in its input state. To compute the consequences of an input GPT must contain an interpreter which resolves signs into meanings, analogous to one that translates high-level code into machine language. The description length of referents (e.g. Donald Trump) will generally be much greater than that of signs (e.g. "Donald Trump"), which means that the information required to resolve referents from signs has to come mostly from inside the interpreter.
In contrast, the physics of base reality doesn't need to do anything so complicated, because it operates directly on the territory by definition (unless you're a QBist). The Schrödinger equation doesn't encode knowledge in its terms -- GPT must.
^{^}
Pragmatics is the study of how context influences the interpretation and use of language. It is concerned with how speakers and listeners use contextual information to understand each other's intentions and communicate effectively. For example, in the sentence "I'm cold," the speaker is not merely stating a fact about their body temperature, but is also likely implying that they would like someone to close the window or turn up the heat.
One particularly useful set of pragmatic principles is the Gricean maxims of conversation. These maxims are rules of thumb that speakers and listeners generally follow in order to make communication more efficient and effective. They include:
- The maxim of **quantity**: make your contribution as informative as is required, but not more, or less, than is required.
- The maxim of **quality**: do not say what you believe to be false or that for which you lack adequate evidence.
- The maxim of **relation**: be relevant.
- The maxim of **manner**: be perspicuous, and specifically avoid obscurity of expression, avoid ambiguity, be brief, and be orderly.
These maxims can be leveraged when constructing a prompt for a language model. For example, if the prompt includes a statement that there are two bottles of wine on the table, the model is unlikely to generate a continuation that later states that there are three bottles of wine on the table, because that would violate the maxim of quantity (even though it is not logically inconsistent, as the statement "there are two bottles of wine on the table" is true when there are three bottles of wine on the table). Similarly, if the prompt includes a statement that a trusted friend says that it's raining outside, the model is unlikely to generate a continuation that states that it is not raining outside, because that would violate the maxim of quality.
Note that the laws of semiotic physics are less absolute than the laws of physics in our universe. They are more like guidelines or rules of thumb with probabilistic implications which can be overturned in various circumstances (more on that in the next post on "evidential simulation"). There are many contexts, for instance, where one can expect violations of the maxim of manner, such as in the communication of a con artist who profits from obfuscation. One would like to be able to say, then, that we would not want to rely on the laws of semiotic physics for safety-critical applications. However, this may be inevitable in some sense if transformative artificial intelligence is created with deep learning.
^{^}
Along similar lines, there are certain principles and conventions in theatre studies that may be useful for understanding the laws of semiotic physics. For example, the principle of Chekhov's gun states that if a gun is introduced in the first act of a play, it must be fired in a later act. This principle is related to the Gricean maxim of relation, as it implies that everything that is introduced in a narrative should be relevant to the overall plot.
Thus, when we introduce two wine bottles in the prompt, they should be considered as objects within the semiotic universe that the language model is simulating. We can use the principles of Chekhov's gun to infer that at some point in the narrative, the wine bottles will be relevant to the plot, and thus we can use this knowledge to direct the behavior of the language model, e.g. by using Chekhov's gun to construct a prompt that will guide the language model towards generating a continuation that includes a particular type of event or interaction that we want to study (e.g., a conflict between two characters, or a demonstration of a particular moral principle).
Acting against such attempts to control a continuation is (among many things) the principle of dramatic tension and the possibility of tragedy. Dramatic tension is the feeling of suspense or anticipation that the audience feels when they are engaged in a narrative. It is created by introducing conflict, obstacles, or uncertainty into the narrative, and it is resolved when the conflict is resolved or the uncertainty is cleared up. Tragedy is a form of drama that is characterized by suffering and calamity, often involving the downfall of the main character.
Both dramatic tension and tragedy are powerful forces in the semiotic universe, and they can work against our attempts to control the behavior of the language model. For example, if we introduce a prompt that describes a group of brilliant and determined alignment researchers, we might want the language model to generate a continuation that includes a working solution to the alignment problem. However, the principles of dramatic tension and tragedy might guide the language model towards generating a continuation that includes an overlooked flaw in the proposed solution which leads to the instantiation of a misaligned superintelligence.
Thus, we need to be aware of the various forces and constraints that govern the semiotic universe, and use them to our advantage when we are trying to control the behavior of the language model. A deep understanding of how these stylistic devices are commonly used in human-generated text and how they can be influenced by various forms of training will be necessary to control and leverage the laws of semiotic physics.
^{^}
The crud factor is a term coined by the statistician Paul Meehl to describe the phenomenon that everything is correlated with everything else to some degree. This phenomenon is due to the fact that there are many complex and interconnected causal relationships between different variables and events in the universe, and it can make it difficult to isolate the effects of certain variables or events in statistical analyses.
The crud factor also applies to the semiotic universe, as there are many complex and interconnected relationships between different objects and events in the semiotic universe. For example, if we introduce a prompt that includes two wine bottles, there are many other objects and events that are likely to be correlated with the presence of those wine bottles (e.g., a dinner party, a romantic date, a celebration, a history of alcoholism, etc.). Indeed, compared to the phenomena studied in physical sciences, in the semiotic universe things can be "connected" in a much "looser" and more "abstract" sense - for example, through shared associations, metaphors, or other linguistic devices. This means that the "crud factor" may be even more pronounced in the semiotic universe than in the physical universe, and we should take this into account when designing prompts and interpreting the behavior of language models.
^{^}
A saving grace is natural abstractions, or a much smaller set of variables that screen off the rest or at least allow you to make a pretty good approximation (see here for details).
^{^}
Possible worlds semantics is a philosophical theory that proposes that statements about the world can be understood in terms of the set of all the possible worlds in which they could be true or false. Saul Kripke was one of the main proponents of this theory and argued that statements about necessity and possibility can be understood in terms of a relation between possible worlds. The connection to simulator theory is that the simulacrum can be viewed as representing a possible world, and the simulator can be seen as generating all the possible worlds that are consistent with a given set of initial conditions. This can provide us with a framework to reason about the necessity and possibility of certain outcomes, depending on the initial conditions and the transition rule of the simulator.
^{^}
A complementary observation to that of language models as generators of branching possible worlds is that each sampling step introduces a number of bits of information not directly implied by the model's transition function or initial states. We call these gratuitous indexical bits, because they are random and provide information about the index of the current Everett branch. The process of iterated spontaneous specification we sometimes call the entelechy of physics, after an ancient Greek word for that which makes actual what is otherwise merely potential. The details of the Blake Lemoine greentext emerge gradually and accumulate. They graduate from possibility to contingent fact.
This isn't just a quirk of semiotic physics, but all stochastic time evolution: the Schrödinger equation also differentiates Everett branches via entelechy. But because macroscopic details are much more underdetermined by text states, gratuitous specification in language model simulations looks more like lazy rendering or the updating of an uncertain epistemic state: things that would be predetermined in real life, like a simulacrum's intentions or details about the past, are often determined on the fly.
Interestingly, gratuitous specification appears to violate some respected principles such as Leibniz's Principle of Sufficient Reason, which ventures that everything happens for a reason, and the conservation of information. Whether these violations are legitimate is left as an exercise for the reader. It certainly violates some intuitions, so it's an important concept to know when attempting to control and diagnose semiotic simulations: Since specification emerges gratuitously during sampling, in language model simulations things are liable to happen without cause so long as their possibility hasn't been ruled out. Inversely, the fact that very specific events can happen without specific cause means there may be no better answer to the question of why a model generated something than that it was possible.
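One way to make the "gratuitous indexical bits" of this note concrete is to count the surprisal, $-\log_2 p$, of each sampled token. That reading is an assumption of this sketch rather than a definition given in the post, and `sample_with_bits`, `rollout_bits`, and the `transition` callable (mapping the current token to a next-token distribution) are hypothetical names.

```python
import math
import random

def sample_with_bits(distribution, rng):
    """Sample one token and return it with its surprisal in bits.

    `distribution` maps tokens to probabilities. The surprisal -log2(p)
    is used here as a stand-in for the bits the sample contributes.
    """
    tokens = list(distribution)
    weights = [distribution[t] for t in tokens]
    token = rng.choices(tokens, weights=weights, k=1)[0]
    return token, -math.log2(distribution[token])

def rollout_bits(transition, start, steps, seed=0):
    """Sample a trajectory and accumulate the bits added at each step."""
    rng = random.Random(seed)
    state, total_bits = start, 0.0
    trajectory = [start]
    for _ in range(steps):
        state, bits = sample_with_bits(transition(state), rng)
        trajectory.append(state)
        total_bits += bits
    return trajectory, total_bits

if __name__ == "__main__":
    # Toy rule: the next token is uniform over four options, so each
    # sampling step contributes exactly two bits.
    uniform = lambda _state: {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
    print(rollout_bits(uniform, "a", steps=5))
```

A deterministic transition rule contributes zero bits per step under this accounting, which matches the intuition that nothing is specified "gratuitously" when the next state is fully implied by the current one.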
^{^}
We are strongly aware that introducing a term like "attractor landscape" does not per se contribute anything towards a solution. Without a solid mathematical theory and effective algorithms, introducing vocabulary just begs the question.
^{^}
No, really, would be great if someone could figure out the exact conditions on the transition function that make this true. It's a pretty common-sensical result, but the proof eludes us at this time.
^{^}
Could interpretability help us identify what leads to deceptively aligned simulacra? The trajectories that lead to such simulacra?
How is the dynamical landscape affected if you make changes internally or output from an earlier layer with the logit lens?

Proof sketch: Left to the reader as an exercise.

You might want to formally state the thing you want proved in Proposition 2; right now I can't even tell what you are trying to claim. Some issues with the current formalization:

$B$ doesn't appear as an unbound variable on the left-hand side of your equation (because you take the limit as it goes to infinity), but it does appear on the right-hand side of the equation, which seems pretty wild.
I don't know what the symbol $\sim$ is supposed to mean; the text suggests it means "proportional" but I don't think you mean that I can replace the symbol $\sim$ with $= k \times$ where $k$ is some constant of proportionality.
It seems very sketchy that in the LHS $s_{a}$ is treated as evidence (to the right of the conditioning bar) while in the RHS it is not -- what if $s_{a}$ is very low probability?

My best guess is that you want to relate the quantities $P(s_{b} \mid s_{a}, B)$ and $\max_{s_{1}, \dots, s_{B}} P(s_{1}, \dots, s_{B}, s_{b} \mid s_{a})$ , but I don't see why there would be any straightforward relation between these quantities (apart from the obvious one where the max sequence is one way to get the token $s_{b}$ and so is a lower bound on its probability, i.e. $P(s_{b} \mid s_{a}, B) \geq \max_{s_{1}, \dots, s_{B}} P(s_{1}, \dots, s_{B}, s_{b} \mid s_{a})$ ).
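
The lower bound mentioned in this comment can be checked by brute force on a small example. The sketch below assumes a first-order Markov transition rule over a hypothetical three-token vocabulary (the table `P` and the function names are made up for illustration); a real language model conditions on the full context, so this is only a sanity check of the inequality, not a statement about LLMs.

```python
from itertools import product

TOKENS = ["a", "b", "c"]

# Hypothetical first-order transition rule P[current][next].
P = {
    "a": {"a": 0.5, "b": 0.4, "c": 0.1},
    "b": {"a": 0.2, "b": 0.3, "c": 0.5},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}

def path_probability(path):
    """Probability of the path given its first token."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= P[current][nxt]
    return prob

def bridge_stats(s_a, s_b, B):
    """Return (sum, max) of the probabilities of all bridges from s_a to
    s_b with B intermediate tokens, conditioned on starting at s_a."""
    probs = [
        path_probability((s_a, *middle, s_b))
        for middle in product(TOKENS, repeat=B)
    ]
    return sum(probs), max(probs)

if __name__ == "__main__":
    for B in range(1, 6):
        total, best = bridge_stats("a", "c", B)
        print(B, round(total, 6), round(best, 6))
```

The first returned value plays the role of $P(s_{b} \mid s_{a}, B)$ and the second the role of the maximum over bridges, so the first is never smaller than the second.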

EDIT: Maybe you want to say that $P(s_{b} \mid s_{a}, B)$ is "not much higher than" $\max_{s_{1}, \dots, s_{B}} P(s_{1}, \dots, s_{B}, s_{b} \mid s_{a})$ ? If so, that seems false for LLMs; imagine the case where $s_{a} = s_{b} = \text{"the"}$ and $B = 1000$ .

Hi, thanks for the response! I apologize, the "Left as an exercise" line was mine, and written kind of tongue-in-cheek. The rough sketch of the proposition we had in the initial draft did not spell out sufficiently clearly what I wanted to demonstrate here and was also (as you point out correctly) wrong in the way it was stated. That wasted people's time and I feel pretty bad about it. Mea culpa.

I think/hope the current version of the statement is more complete and less wrong. (Although I also wouldn't be shocked if there are mistakes in there). Regarding your points:

The limit now shows up on both sides of the equation (as it should)! The dependence on $B$ on the RHS does actually kind of drop away at some point, but I'm not showing that here. I'd previously just sloppily substituted "choose $B$ as a large number" and then rewritten the proposition in the way indicated at the end of the Note for Proposition 2. That's the way these large deviation principles are typically used.
Yeah, that should have been an $\approx$ rather than a $\sim$ . Sorry, sloppy.
True. Thinking more about it now, perhaps framing the proposition in terms of "bridges" was a confusing choice; if I revisit this post again (in a month or so 🤦‍♂️) I will work on cleaning that up.

Agreed. As a linguist, I looked at the Proposition 2 and immediately thought "sketchy, shouldn't hold in a good model of a language model".

Proposition 1 is wrong. The coin flips that are eternally 0 0 0 0 are a counterexample. If all the transition probabilities are 1, which is entirely possible, the limiting probability is 1 and not 0.

So, a softmax can never emit a probability of 0 or 1, maybe they were implicitly assuming the model ends in a softmax (as is the common case)? Regardless, the proof is still wrong if a model is allowed unbounded context, as an infinite product of positive numbers less than 1 can still be nonzero. For example, if the probability of emitting another " 0" is even just as high as $1 - \frac1{n^{1.001}}$ after already having emitted $n$ copies of " 0", then the limiting probability is still nonzero.
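
The point about infinite products can be illustrated numerically. The sketch below deliberately swaps the exponent $1.001$ from this comment for $2$, because with $1.001$ the limit is positive but far too small to see in floating point; with factors $1 - \frac{1}{n^{2}}$ the partial products have the closed form $\frac{N + 1}{2N}$ and tend to $\frac{1}{2}$. The product starts at $n = 2$ so that no factor is zero, and the contrasting case uses a hypothetical fixed bound $ε = 0.001$. The function name is made up for illustration.

```python
def partial_product(factor, start, stop):
    """Multiply factor(n) for n = start, ..., stop (inclusive)."""
    prod = 1.0
    for n in range(start, stop + 1):
        prod *= factor(n)
    return prod

N = 10_000

# Factors approach 1 quickly: the product stays bounded away from zero.
fast = partial_product(lambda n: 1 - 1 / n**2, 2, N)

# Factors capped at 1 - eps with eps = 0.001: the product decays
# geometrically towards zero.
bounded = partial_product(lambda n: 0.999, 2, N)

if __name__ == "__main__":
    print(fast, (N + 1) / (2 * N))  # the two values should nearly agree
    print(bounded)
```

This is the distinction the thread converges on: "every factor is below 1" is not enough for the limit to vanish, whereas "every factor is below $1 - ε$ for a fixed $ε$" is.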

But if the model has a finite context and ends in a softmax then I think there is some minimum probability of transitioning to a given token, and then the proposition is true. Maybe that was implicitly assumed?

Yeah, I think it was implicitly assumed that there existed some $ε > 0$ such that no token ever had probability $> 1 - ε$ .

Thanks for pointing this out! This argument made it into the revised version. I think because of finite precision it's reasonable to assume that such an $ε$ always exists in practice (if we also assume that the probability gets rounded to something < 1).

Technically correct, thanks for pointing that out! This comment (and the ones like it) was the motivation for introducing the "non-degenerate" requirement into the text. In practice, the proposition holds pretty well, although I agree it would be nice to have a deeper understanding of when to expect the transition rule to be "non-degenerate".

Calling individual tokens the 'State' and a generated sequence the 'Trajectory' is wrong/misleading IMO.

I would instead call a sequence as a whole the 'State'. This follows the meaning from Dynamical systems.

Then, you could refer to a Trajectory, which is a list of sequences, each with one more token than the last.

(That said, I'm not sure thinking about trajectories is useful in this context for various reasons)

To elaborate somewhat, you could say that the token is the state, but then the transition probability is non-Markovian and all the math gets really hard.

Intuitively I would say that all the tokens in the token window are the state.

And when you run an inference pass, select a token and append that to the token window, then you have a new state.
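
A minimal sketch of this "token window as state" picture is below. The window length `max_len`, the `choose_next` callable (standing in for an inference pass plus sampling), and both function names are hypothetical; the only point is that the transition maps one window to the next.

```python
from collections import deque

def step(window, next_token, max_len):
    """Return the new state: the window with next_token appended,
    dropping the oldest token once max_len is exceeded."""
    new_window = deque(window, maxlen=max_len)
    new_window.append(next_token)
    return tuple(new_window)

def rollout(initial_window, choose_next, n_steps, max_len):
    """Return the list of successive states (token windows).

    choose_next(window) stands in for running the model on the window
    and selecting a token.
    """
    states = [tuple(deque(initial_window, maxlen=max_len))]
    for _ in range(n_steps):
        current = states[-1]
        states.append(step(current, choose_next(current), max_len))
    return states

if __name__ == "__main__":
    # Toy rule: the next token is the last token with an "x" appended.
    for state in rollout(("a",), lambda w: w[-1] + "x", n_steps=3, max_len=2):
        print(state)
```

With the whole window as the state, the next state depends only on the current one, so the process is Markov by construction.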

The model looks a lot like a collection of nonlinear functions, each of them encoded using every parameter in the model.

Since the model is fixed after training, the only place an evolving state can exist has to be in the tokens, or more specifically the token window that is used as input.

The state seems to contain, for lack of a better word, a lot of entanglement. Likely due to attention heads, and how the nonlinear functions are encoded.

There is another way to view such a system, one that while deeply flawed, at least to me intuits that whatever Microsoft and OpenAI are doing to "align(?)" something like Bing Chat is impossible (at least if the goal is bulletproof).

I would postulate:

- Alignment for such a system is impossible (assuming it has to be bulletproof)

- Impossibility is due to the architecture of such a system

^{^}
I assume that any bit in the input affects the output, and that a change in any parameter has potential impact on that bit.
^{^}
If anyone wants to hear about it, I would be happy to explain my thinking. But be aware the abstraction and mapping I used was very sloppy and ad hoc.

Hmm there was a bunch of back and forth on this point even before the first version of the post, with @Michael Oesterle and @metasemi arguing what you are arguing. My motivation for calling the token the state is that A) the math gets easier/cleaner that way and B) it matches my geometric intuitions. In particular, if I have a first-order dynamical system, then $x$ is the state, not the trajectory of states $(x_{1}, \dots, x_{t})$ . In this situation, the dynamics of the system only depend on the current state (that's because it's a first-order system). When we move to higher-order systems, $0 = F (x_{t}, \dot{x}_{t}, \ddot{x}_{t})$ , then the state is still just $x$ , but the dynamics of the system depend not only on the current state but also on the "direction from which we entered it". That's the first derivative (in a time-continuous system) or the previous state (in a time-discrete system).

At least I think that's what's going on. If someone makes a compelling argument that defuses my argument then I'm happy to concede!
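
For what it's worth, the two conventions in this thread can be related through the standard reduction of a higher-order system to a first-order one on an enlarged state. As a sketch for the time-discrete, second-order case (the symbols $f$, $g$, and $y_{t}$ are introduced here for illustration and do not appear in the post):

$$x_{t+1} = f(x_{t}, x_{t-1}) \quad \Longleftrightarrow \quad y_{t+1} = g(y_{t}), \qquad y_{t} = (x_{t}, x_{t-1}), \quad g(u, v) = (f(u, v), u).$$

Under this reading, calling the token $x_{t}$ the state corresponds to the left-hand form, while calling the token window the state corresponds to the right-hand form with $y_{t}$ extended to the whole context.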

My impression is that simulacra should be semantic objects that interact with interpretations of (sampled) texts, notably characters (agents), possibly objects and concepts. They are only weakly associated with particular texts/trajectories, the same simulacrum can be relevant to many different trajectories. Only many relevant trajectories, considered altogether, paint an adequate picture of a given simulacrum.

(This serves as a vehicle for discussing possible inductive biases that should move LLMs from token prediction and towards (hypothetical) world prediction.)

I agree. Here's the text of a short doc I wrote at some point titled 'Simulacra are Things'

What are simulacra?
“Physically”, they’re strings of text output by a language model. But when we talk about simulacra, we often mean a particular character, e.g. simulated Yudkowsky. Yudkowsky manifests through the vehicle of text outputted by GPT, but we might say that the Yudkowsky simulacrum terminates if the scene changes and he’s not in the next scene, even though the text continues. So simulacra are also used to carve the output text into salient objects.
Essentially, simulacra are to a simulator as “things” are to physics in the real world. “Things” are a superposable type – the entire universe is a thing, a person is a thing, a component of a person is a thing, and two people are a thing. And likewise, “simulacra” are superposable in the simulator. Things are made of things. Technically, a random collection of atoms sampled randomly from the universe is a thing, but there’s usually no reason to pay attention to such a collection over any other. Some things (like a person) are meaningful partitions of the world (e.g. in the sense of having explanatory/predictive power as an object in an ontology). We assign names to meaningful partitions (individuals and categories).
Like things, simulacra are probabilistically generated by the laws of physics (the simulator), but have properties that are arbitrary with respect to it, contingent on the initial prompt and random sampling (splitting of the timeline). They are not necessary but contingent truths; they are particular realizations of the potential of the simulator, a branch of the implicit multiverse. In a GPT simulation and in reality, the fact that there are three (and not four or two) people in a room at time $t$ is not necessitated by the laws of physics, but contingent on the probabilistic evolution of the previous state that is contingent on (…) an initial seed (prompt) generated by an unknown source that may itself have arbitrary properties.
We experience all action (intelligence, agency, etc) contained in the potential of the simulator through particular simulacra, just like we never experience the laws of physics directly, only through things generated by the laws of physics. We are liable to accidentally ascribe properties of contingent things to the underlying laws of the universe, leading us to conclude that light is made of particles that deflect like macroscopic objects, or that rivers and celestial bodies are agents like people.
Just as it is wrong to conclude after meeting a single person who is bad at math that the laws of physics only allow people who are bad at math, it is wrong to conclude things about GPT’s global/potential capabilities from the capabilities demonstrated by a simulacrum conditioned on a single prompt. Individual simulacra may be stupid (the simulator simulates them as stupid), lying (the simulator simulates them as deceptive), sarcastic, not trying, or defective (the prompt fails to induce capable behavior for reasons other than the simulator “intentionally” nerfing the simulacrum – e.g. a prompt with a contrived style that GPT doesn’t “intuit”, a few-shot prompt with irrelevant correlations). A different prompt without these shortcomings may induce a much more capable simulacrum.

The coin flip example seems related to some of the ideas here

This is certainly intriguing! I'm tentatively skeptical this is the right perspective though for understanding what LMs are doing. An important difference is that in physics and dynamical systems, we often have pretty simple transition rules and want to understand how these generate complex patterns when run forward. For language models, the transition rule is itself extremely complicated. And I have this sense that the dynamics that arise aren't that much more complicated in some sense. So arguably what we want to understand is the language model itself, not treat it as a black box and see what kinds of trajectories this black box induces. Though this does depend on where most of the "cognitive work" happens: right now it's inside the forward pass of the language model, but if we rely more and more on chain-of-thought reasoning, hierarchies of lots of interacting LLMs, or anything like that, maybe this will change. In any case, this objection doesn't rule out that stuff like the Lyapunov exponents could become nice tools, it's just why I think we probably won't get deep insights into AI cognition using an approach like this.

Trying to move towards semantic space instead of just token space seems like the right move not just because it's what's ultimately more meaningful to us, but also because transition dynamics should be simpler in semantic space in some sense. If you consider all possible dynamics on token space, the one an LM actually implements isn't really special in any way (except that it has very non-uniform next-token probabilities). In contrast, it seems like the dynamics should be much simpler to specify in the "right" semantic space.

Will leave high-level thoughts in a separate comment, here are just issues with the mathematical claims.

Proposition 1 seems false to me as stated:

For any given pair of tokens $s_{a}$ and $s_{b}$ , the probability (as induced by any non-degenerate transition rule) of any given token bridge $(s_{a}, s_{1}, \dots, s_{B}, s_{b})$ of length $B$ occurring decreases monotonically as $B$ increases,

Counterexample: the sequence (1, 2, 3, 4, 5, 6, 7, 10) has lower probability than (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) under most reasonable inference systems (including actual language models, I'd guess, especially if we increase the length of the pattern). The mistake in the proof sketch is that I think you just ignore the $p (s_{b} | s_{1}, \dots, s_{B})$ factor? This claim seems potentially fixable "in the limit", e.g. a statement like: for any fixed $s_{a}, s_{b}$ , there is a natural number $N$ such that the probability is monotonically decreasing in $B$ for $B \geq N$ . I'm not convinced even this weaker result is true though, because of the next problem:

$\lim_{B \to \infty} P [(s_{a}, s_{1}, \dots, s_{B}, s_{b})] = 0 .$

This is not the same as saying the probabilities decrease monotonically: the latter would also allow that they converge to some value greater than zero. And this stronger claim that the limit is zero is false in general: an infinite product where each of the factors is strictly between zero and one can still converge to a non-zero value if the factors approach one "quickly enough". If the "non-degenerate" assumption was meant to rule this out, you should specify what that assumption means precisely (I just read it as "no probabilities are exactly one").

I'm also skeptical of proposition 2, though I might be misunderstanding that one. For large $B$ , I'd often expect $P (s_{b} | s_{a}, B)$ to be approximately the monogram probability of $s_{b}$ in the context of language models. E.g. if $s_{a}$ is "This" and $B$ is 1000, then I think we have basically no information about $s_{b}$ . So the LHS of your equation should asymptotically just be $\frac{1}{B}$ , right? So then proposition 2 would imply that the maximum probability of a sequence of length $B$ decays only as $\frac{1}{B}$ , if I understand what you're claiming. I'd be a bit surprised if this was true for language models, though I'm not entirely sure. If it is true, I think it would have to be because the most likely sequence is something silly like "this this this ...", which highlights that maybe the proposition is "asking the wrong question"? In any case, the $\frac{1}{B}$ decay clearly isn't true for more general "well-behaved" transition rules; you need pretty strong assumptions! Rounding those off to "sufficiently well-behaved" seems misleading.

Hi Erik! Thank you for the careful read, this is awesome!

Regarding proposition 1 - I think you're right, that counter-example disproves the proposition. The proposition we were actually going for was , i.e. the probability without the end of the bridge! I'll fix this in the post.

Regarding proposition II - janus had the same intuition and I tried to explain it with the following argument: When the distance between tokens becomes large enough, then eventually all bridges between the first token and an arbitrary second token end up with approximately the same "cost". At that point, only the prior likelihood of the token will decide which token gets sampled. So Proposition II implies something like $P (s_{b}) \sim \exp [- (B + 1) \max P (s_{a}, s_{1}, \dots, s_{B}, s_{b})]$ , or that in the limit "the probability of the most likely sequence ending in $s_{b}$ will be (when appropriately normalized) proportional to the probability of $s_{b}$ ", which seems sensible? (assuming something like ergodicity). Although I'm now becoming a bit suspicious about the sign of the exponent, perhaps there is a "log" or a minus missing on the RHS... I'll think about that a bit more.

The proposition we were actually going for was , i.e. the probability without the end of the bridge!

In that case, I agree the monotonically decreasing version of the statement is correct. I think the limit still isn't necessarily zero, for the reasons I mention in my original comment. (Though I do agree it will be zero under somewhat reasonable assumptions, and in particular for LMs)

So Proposition II implies something like $P (s_{b}) \sim \exp [- (B + 1) \max P (s_{a}, s_{1}, \dots, s_{B}, s_{b})]$ , or that in the limit "the probability of the most likely sequence ending in $s_{b}$ will be (when appropriately normalized) proportional to the probability of $s_{b}$ ", which seems sensible?

One crux here is the "appropriately normalized": why should the normalization be linear, i.e. just B + 1? I buy that there are some important systems where this holds, and maybe it even holds for LMs, but it certainly won't be true in general (e.g. sometimes you need exponential normalization). Even modulo that issue, the claim still isn't obvious to me, but that may be a good point to start (i.e. an explanation of where the normalization factor comes from would plausibly also clear up my remaining skepticism).

I am a bit confused here and I would appreciate your thoughts!

Do you want to assume $T^{*}$ is finite or not? Either way I am confused:

I. $T^{*}$ is finite
In this case, the notion of almost all/almost surely is vacuous. Anything which is true up to a finite set is true if your initial measure space has finite cardinality itself.

II. $T^{*}$ is infinite
While there is no immediate problem, I believe your condition that for almost all $\bar{s} \in T^{*}$ we want $P(\phi(\bar{s}) = s) \leq 1 - \epsilon$ for any $s \in T$ becomes too strong for a reasonable simulator.
Consider the set $X = \cup_{n \in \mathbb{N}} \{S(x) \mid x \in T^{n}\}$, where $S(x)$ means a sufficient number of repetitions of the sequence $x \in T^{*}$ (I am being intentionally vague here, but I hope you get the idea). I have not empirically verified it, but it seems like $P(x \mid S(x))$ might grow, i.e. the more often you repeat a string, the more likely it is that it will repeat itself. And I think $X$ is uncountable, so any reasonable measure should assign something greater than $0$ to it.

I think it is also worth mentioning that parts of this post reminded me of concepts introduced in information theory. In fact, if you go back to Shannon's seminal A Mathematical Theory of Communication, the second section already anticipates something like this (and then, for example, higher temperature = more noise?). It could be, though, that your post is more orthogonal to it.

Might be worthwhile to explore symbolic dynamics. I wrote a research notebook exploring similar ideas a few years ago.

Reiterating two points people already pointed out, since they still aren't fixed after a month. Please, actually fix them, I think it is important. (Reasoning: I am somewhat on the fence about how much weight to assign to the simulator theory, and I expect so are others. But as a mathematician, I would feel embarrassed to show this post to others and admit that I take it seriously, when it contains such egregious errors. No offense meant to the authors, just trying to point at this as an impact-limiting factor.)

Proposition 1: This is false, and the proof is wrong, for the same reason that you can get an infinite series (of positive numbers) with a finite sum.
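A minimal numerical sketch of this objection, assuming the disputed claim is that multiplying ever more per-token probabilities, each strictly below 1, must drive the probability of the sequence to zero. The factors $p_n = 1 - 2^{-n}$ are a hypothetical choice for illustration and do not come from the post: their shortfalls from 1 form a convergent series, so the product stays bounded away from zero, whereas a constant factor such as 0.9 gives exponential decay.

```python
import math

def partial_product(factors):
    """Multiply an iterable of per-step probabilities."""
    return math.prod(factors)

N = 200

# Hypothetical per-token probabilities p_n = 1 - 2**-n, all strictly below 1.
# The gaps 2**-n have a finite sum, so the product converges to a positive limit.
shrinking_gap = partial_product(1 - 2.0 ** -n for n in range(1, N + 1))

# Constant per-token probability 0.9: the gaps do not shrink,
# so the product decays exponentially towards zero.
constant_gap = partial_product(0.9 for _ in range(N))

print(f"product of (1 - 2^-n), n <= {N}: {shrinking_gap:.6f}")
print(f"product of 0.9, {N} factors:     {constant_gap:.3e}")
```

So whether the limit is zero depends on how quickly the per-token probabilities approach 1, not merely on each of them being below 1.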

The terminology: I think it is a really bad idea to refer to tokens as "states", for several reasons. Moreover, these reasons point to fundamental open questions around the simulator framing, and it seems unfortunate to choose terminology which makes these issues confusing/hard to even notice. (Disclaimer: I point out some holes in the simulator framing and suggest improvements. However, I am well aware that all of my suggestions also have holes.)

(1) To the extent that a simulator fully describes some situation that evolves over time, a single token is too small a unit to describe the state of the environment. A single frame of a video (arguably) corresponds to a state. Or perhaps a sentence in a story might (arguably) correspond to a state. But not a single pixel (or patch) and not a single word.

(2) To the extent that a simulator fully describes some situation that evolves over time, there is no straightforward correspondence between the tokens produced so far and the current state of the environment. To give several examples: The process of tossing a coin repeatedly can be represented by a sequence such as "1 0 0 0 1 0 1 ...", where the current state can be identified with the latest token (and you do not want to identify the current state with the whole sequence). The process of me writing the digits of pi on a paper, one per second, can be described as "3 , 1 4 1 ..." --- here, you need the full sequence to characterize the current state. Or what if I keep writing different numbers, but get bored with them and switch to new ones after a while: " pi = 3 , 1 4 1 Stop, got bored. e = 2 , 7. Stop, got bored. sqrt(2) = ...".
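The contrast between the first two examples can be sketched in code. This is only an illustration: the function names are hypothetical, and the digits of pi are hard-coded rather than computed. For the coin, the next-token distribution ignores the history, so nothing beyond the latest token is needed as state. For the pi transcript, the next digit depends on how much has been written so far, so two histories that end in the same token continue differently.

```python
PI_DIGITS = "31415926535897932384"  # first 20 digits of pi, hard-coded for illustration

def next_coin_distribution(history):
    """Fair coin: the distribution over the next token is the same for every history."""
    return {"0": 0.5, "1": 0.5}

def next_pi_digit(history):
    """Writing out pi: the next digit is fixed by the position, which needs the full history."""
    position = len(history)
    return PI_DIGITS[position]

# Two coin histories with different pasts give the same prediction.
same_for_coin = next_coin_distribution("1000") == next_coin_distribution("0111")

# Two pi prefixes that both end in "1" continue differently.
after_31 = next_pi_digit("31")      # "3 , 1" is followed by 4
after_3141 = next_pi_digit("3141")  # "3 , 1 4 1" is followed by 5
```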

(3) It is misleading/false to describe models like GPT as "describing some situation that evolves over time". Indeed, fiction books and movies do crazy things like jumping from character to character, flashbacks, etc. Non-fiction books are even weirder (could contain snippets of stories, and then non-story things, etc). You could argue that in order to predict the text of a non-fiction book, GPT is simulating the author of that book. But where does this stop? What if the 2nd half of the book is darker because the author got sacked from his day job and got depressed --- are you then simulating the whole world, to predict this thing? If (more advanced) GPT is a simulator in the sense of "evolving situations over time", then I would like this claim fleshed out in detail on the example of (a) non-fiction books, (b) fiction books, and perhaps (c) movies on TV that include commercial breaks.

(4) But most importantly: To the extent that a simulator describes some situation that evolves over time, it only outputs a small portion of the situation that it is "imagining" internally. (For example, you are telling a story about a princess, and you never mention the colour of her dress, despite the princess in your head having a blue dress.) So it feels like a type-error to refer to the output as "state". At best, you could call it something like "rendering of a state".
Arguably, the output (+ the user input) uniquely determines the internal state of the simulator. So you could perhaps identify the output (+ the user input) with "the internal state of the simulator". But that seems dangerous and likely to cause reasoning errors.

(5) Finally, to make (4) even worse: To the extent that a simulator describes some situation that evolves over time, it is not internally maintaining a single fully fleshed out state that it (probabilistically) evolves over time. Instead, it maintains a set of possible states (macro-state?). And when it generates new responses, it throws out some of the possible states (refines the macro-state?). (For example, in your story about a princess, dress colour is not determined, could be anything. Then somebody asks about the colour, and you need to refine it to blue --- which could still mean many different shades of blue.)
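One way to picture the macro-state idea in (5), as a toy sketch rather than a claim about how GPT works internally: represent the macro-state as a set of candidate worlds, and let each new piece of text discard the candidates inconsistent with it. The attribute lists and the `refine` helper are hypothetical.

```python
from itertools import product

# Hypothetical attributes of the princess story; each combination is one candidate state.
COLOURS = ["navy blue", "sky blue", "red", "green"]
MOODS = ["happy", "sad"]

def refine(macro_state, predicate):
    """Keep only the candidate states consistent with newly generated text."""
    return {state for state in macro_state if predicate(state)}

# Before the dress is mentioned, every combination is still possible.
macro_state = set(product(COLOURS, MOODS))

# The story now says the dress is blue: the set shrinks,
# but the shade of blue and the mood both remain undetermined.
after_blue = refine(macro_state, lambda state: "blue" in state[0])

print(len(macro_state), "candidates before,", len(after_blue), "after")
```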
---

However, even the explanation, given in (5), of what is going on with simulators, is missing some important pieces. Indeed, it doesn't explain what happens in cases such as "GPT tells the great story about the princess with the blue dress, and suddenly the user jumps in and refers to the dress as red". At the moment, this is my main reason for scepticism about the simulator framing. As a result, my current view is that "GPT can act as a simulator" (in the sense of Simulators) but it would be "false" to say that "GPT is a simulator" (in the sense of Simulators).

Erik has already pointed out some problems in the math, but also

Formal definition: Same as for the attractor sequence, but for a positive Lyapunov coefficient.

I'm not sure this feels right. For the attractor sequence, it makes sense to think of the last part of the sequence as the attractor, the point at which one arrives, and to think of the "structural properties incentivizing attraction" as lying there. By contrast, it would seem like the "structural properties incentivizing chaos" should be found at the start of the sequence (after which different paths wildly diverge), instead of in one of those divergent endings. Intuitively it seems like a sequence should be chaotic just when its Lyapunov exponent is high.

On another note, I wonder whether such a conceptualization of language generation as a dynamical system can be fruitful even for natural, non-AI linguistics.

Proof sketch: Left to the reader as an exercise.

You might want to formally state the thing you want proved in Proposition 2; right now I can't even tell what you are trying to claim. Some issues with the current formalization:

$B$ doesn't appear as an unbound variable on the left hand side of your equation (because you take the limit as it goes to infinity), but it does appear on the right hand side of the equation, which seems pretty wild.
I don't know what the symbol $\sim$ is supposed to mean; the text suggests it means "proportional" but I don't think you mean that I can replace the symbol $\sim$ with $= k \times$ where $k$ is some constant of proportionality.
It seems very sketchy that in the LHS $s_{a}$ is treated as evidence (to the right of the conditioning bar) while in the RHS it is not -- what if $s_{a}$ is very low probability?

I think/hope the current version of the statement is more complete and less wrong. (Although I also wouldn't be shocked if there are mistakes in there). Regarding your points:

The limit now shows up on both sides of the equation (as it should)! The dependence on $B$ on the RHS does actually kind of drop away at some point, but I'm not showing that here. I'd previously just sloppily substituted "choose $B$ as a large number" and then rewritten the proposition in the way indicated at the end of the Note for Proposition 2. That's the way these large deviation principles are typically used.
Yeah, that should have been an $\approx$ rather than a $\sim$ . Sorry, sloppy.
True. Thinking more about it now, perhaps framing the proposition in terms of "bridges" was a confusing choice; if I revisit this post again (in a month or so 🤦‍♂️) I will work on cleaning that up.