Update February 21st: After the initial publication of this article (January 3rd) we received a lot of feedback, and several people pointed out that propositions 1 and 2 were incorrect as stated. That was unfortunate, as it distracted from the broader arguments in the article, and I (Jan K) take full responsibility for that. In this updated version of the post I have improved the propositions and added a proof for proposition 2. Please continue to point out weaknesses in the argument; that is a major motivation for why we share these fragments.
For comments and clarifications on the conceptual and philosophical aspects of this article, please read metasemi's excellent follow-up note here.
Meta: Over the past few months, we've held a seminar series on the Simulator theory by janus. As the theory is actively under development, the purpose of the series is to uncover central themes and formulate open problems. A few high-level remarks upfront:
Our aim with this sequence is to share some of our discussions with a broader audience and to encourage new research on the questions we uncover.
We outline the broader rationale and shared assumptions in Background and shared assumptions. That article also contains general caveats about how to read this sequence - in particular, read the sequence as a collection of incomplete notes full of invitations for new researchers to contribute.
Epistemic status: Exploratory. Parts of this text were generated by a language model from language model-generated summaries of a transcript of a seminar session. The content has been reviewed and edited for conceptual accuracy, but we have allowed many idiosyncrasies to remain.
Three questions about language model completions
GPT-like models are driving most of the recent breakthroughs in natural language processing. However, we don't understand them at a deep level. For example, when GPT creates a completion like the Blake Lemoine greentext, we
1. can't explain why it creates that exact completion.
2. can't identify the properties of the text that predict how it continues.
3. don't know how to affect these high-level properties to achieve desired outcomes.
We can make statements like "this token was generated because of the multinomial sampling after the softmax" or "this behavior is implied by the training distribution", but these statements only offer a form of descriptive adequacy (akin to saying "AlphaGo will win this game of Go"). They don't provide any explanatory adequacy, which is what we need to sufficiently understand and make use of GPT-like models.
Simulator theory (janus, 2022) has the potential to provide explanatory adequacy for some of these questions. In this post, we'll explore what we call "semiotic physics", which follows from simulator theory and which has the potential to provide partial answers to questions 1, 2, and perhaps 3. The term "semiotic physics" here refers to the study of the fundamental forces and laws that govern the behavior of signs and symbols. Similar to how the study of physics helps us understand and make use of the laws that govern the physical universe, semiotic physics studies the fundamental forces that govern the symbolic universe of GPT, a universe that reflects and intersects with the universe of our own cognition. We transfer concepts from dynamical systems theory, such as attractors and basins of attraction, to the semiotic universe and spell out examples and implications of the proposed perspective.
Example. Semiotic coin flip.
To illustrate what we mean by semiotic physics, we will look at a toy model that we are familiar with from regular physics: coin flips. In this setup, we draw a sequence of coin flips from a large language model[1]. We encode the coin flips as a sequence of the strings "1" and "0" (since each is tokenized as a single token) and zero out the probabilities of all other tokens.
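Below is a minimal sketch of this setup, assuming the Hugging Face transformers library and GPT-2 (for which " 0" and " 1" with a leading space are single tokens); the prompt and the number of flips are illustrative, and this is not the exact code behind the figures below.

```python
# Sketch of the semiotic coin flip: sample only the tokens " 0" and " 1" from
# GPT-2, which is equivalent to zeroing out all other token probabilities and
# renormalising. Assumes the Hugging Face `transformers` library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

tails_id = tokenizer.encode(" 0")[0]  # assumed to be a single token
heads_id = tokenizer.encode(" 1")[0]  # assumed to be a single token

prompt = "A sequence of coin flips (1 = heads, 0 = tails):"  # illustrative prompt
ids = tokenizer.encode(prompt, return_tensors="pt")

log_prob = 0.0
for _ in range(20):  # draw 20 semiotic coin flips
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    # Softmax over just the two coin tokens = zero out the rest and renormalise.
    p = torch.softmax(torch.stack([logits[tails_id], logits[heads_id]]), dim=0)
    flip = torch.multinomial(p, 1).item()          # 0 -> tails, 1 -> heads
    log_prob += torch.log(p[flip]).item()
    next_id = heads_id if flip == 1 else tails_id
    ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)

print(tokenizer.decode(ids[0]))
print("log-probability of this trajectory of flips:", log_prob)
```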
We can then look at the probability of the event $E$ that the sequence of coin flips ends in tails ("0") or heads ("1") as a function of the sequence length.
We note two key differences between the semiotic coin flip and a fair coin:
1. the semiotic coin is not fair, i.e. it tends to produce sequences that end in tails ("0") much more frequently than sequences that end in heads ("1").
2. the semiotic coin flips are not independent, i.e. the probability of observing heads or tails changes with the history of previous coin flips.
To better understand the types of sequences that end in either tails or heads, we next investigate the probability of the most likely sequence ending in "0" or "1". As we can see in the graph below, the probability of the most likely sequence ending in "1" does not decrease for the GPT coin as rapidly as it does for a fair coin.
Again, we observe a notable difference between the semiotic coin and the fair coin:
while the probability of a given sequence of coin flips decreases exponentially for the fair coin (every sequence of $T$ fair coin flips has the same probability $1/2^T$), the probability of the most likely sequence of semiotic coin flips decreases much more slowly.
This difference is due to the fact that the most likely sequence of semiotic coin flips ending in, e.g., "0" is "0 0 0 0 ... 0 0". Once the language model has produced the same token four or five times in a row, it will latch onto the pattern and continue to predict the same token with high probability. As a consequence, the probability of the sequence does not decrease as drastically with increasing length, as each successive term has a probability of almost 1.
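A toy calculation illustrates this difference in decay rates; the "latching" repeat probabilities below are invented for illustration and are not measured from GPT.

```python
# Invented toy numbers: a fair coin halves the sequence probability at every
# flip, while a "latching" coin repeats its previous outcome with probability
# approaching 1, so its most likely sequence decays far more slowly.
def latching_repeat_prob(run_length: int) -> float:
    return min(0.5 + 0.1 * run_length, 0.95)  # hypothetical latching behaviour

fair_prob, latch_prob = 1.0, 1.0
for t in range(1, 21):
    fair_prob *= 0.5                       # every fair sequence has probability (1/2)**T
    latch_prob *= latching_repeat_prob(t)  # most likely latching sequence: all "0"s
    print(f"T={t:2d}  fair={fair_prob:.2e}  latching={latch_prob:.2e}")
```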
With the example of the semiotic coin flip in mind, we will set up some mathematical vocabulary for discussing semiotic physics and demonstrate how the vocabulary pays off with two propositions. We believe this terminology is primarily interesting for alignment researchers who would like to work on the theory of semiotic physics. The arithmophobic reader is invited to skip or gloss over the section (for an informal discussion, see here).
Simulations as dynamical systems
Simulator theory distinguishes between the simulator (the entity that performs the simulation) and the simulacrum (the entity that is generated by the simulation). The simulacrum arises from the chained application of the simulation forward pass. The result can be viewed as a dynamical system where the simulator describes the system’s dynamics and the simulacrum is instantiated through a particular trajectory.
We commence by identifying the state and trajectory of a dynamical system with tokens and sequences of tokens.
Definition of the state and trajectories. Given an alphabet of tokens $T$ with cardinality $|T| = N \in \mathbb{N}^+$, we call $\bar{s} = (s_1, \ldots, s_M) \in T^*$ a trajectory.[2] While a trajectory can in general be of arbitrary length, we denote the context length of the model as $L \in \mathbb{N}^+$; therefore, $T^*$ can effectively be written as $\bigcup_{l=0}^{L} T^l$. The empty sequence is denoted as $\emptyset$.[3][4][5]
While token sequences are the objects of semiotic physics, the actual laws of semiotic physics derive from the simulator. In particular, a simulator will provide a distribution over the possible next state given a trajectory via a transition rule.
Definition of the transition rule. The transition rule is a random function that maps a trajectory to a probability distribution over the alphabet (i.e., the probabilities for the next token completion after the current state). Let $\Delta T$ denote the set of probability mass functions over $T$, i.e., the set of functions $p: T \to [0,1]$ that satisfy the Kolmogorov axioms.[6][7][8] The transition rule is then a function $\theta: T^* \to \Delta T$.
Analogous to wave function collapse in quantum physics, sampling a new state from a distribution over states turns possibility into reality. We call this phenomenon the sampling procedure.
Definition of the sampling procedure. The sampling procedure $\phi: T^* \to T$ selects a next token, i.e., $\phi(\bar{s}) \in \operatorname{supp}(\theta(\bar{s}))$ for all $\bar{s} \in T^*$.[9] The resulting trajectory $\bar{s}_{t+1}$ is simply the concatenation of $\bar{s}_t$ and $\phi(\bar{s}_t)$ (see the evolution operator below). We can therefore define the repeated application of the sampling procedure recursively as $\phi^{(1)}(\bar{s}) := \phi(\bar{s})$ and $\phi^{(n)}(\bar{s}) := \phi^{(n-1)}(\bar{s}\,\phi(\bar{s}))$.
Lastly, we need to concatenate the newly sampled token to the previous trajectory to obtain a new trajectory. Packaging the transition rule, the sampling procedure, and the concatenation together yields the evolution operator, which is the main operation used for running a simulation.
Definition of the evolution operator. Putting the pieces together, we finally define the function $\psi$ that evolves a given trajectory, i.e., transforms $\bar{s}_t$ into $\bar{s}_{t+1}$ by appending the token generated by the sampling procedure $\phi$. That is, $\psi: T^* \to T^*$ is defined as $\psi(\bar{s}) := \bar{s}\,\phi(\bar{s})$. As above, repeated application is denoted by $\psi^{(n)}$.
Note that both the sampling procedure and the evolution operator are not functions in the conventional sense, since they include a random element (the step of sampling from the distribution given by the transition function). Instead, one could consider them random variables or, equivalently, functions of unobservable noise. This justifies the use of a probability measure, e.g., in an expression like $P[\psi^{(2)}(\emptyset) = \text{"hello world"}] < \varepsilon$.
Definition of an induced probability measure. Given a transition rule $\theta$ and a trajectory $\bar{s}$, we call $P = \theta(\bar{s}) \in \Delta T$ the induced probability measure (of $\theta$ and $\bar{s}$). We write $P(\phi(\bar{s}) = s)$ to denote $\theta(\bar{s})(s)$, i.e. the probability of the token $s$ assigned by the probability measure induced by $\bar{s}$. For a given trajectory $\bar{s}$, the induced probability measure by definition satisfies the Kolmogorov axioms. We construct a joint measure of a sequence of tokens, $P(\psi^{(N)}(\bar{s}) = \bar{s}s_1 \ldots s_N)$, as the product of the individual probability measures, $P(\psi^{(N)}(\bar{s}) = \bar{s}s_1 \ldots s_N) = \prod_{i=1}^{N} P(\phi(\bar{s}s_1 \ldots s_{i-1}) = s_i)$. For ease of notation, we also use the shorthand $P[\bar{s}] = \prod_{i=1}^{N} P(s_i \mid s_{1:i-1})$, where the length of the sequence, $|\bar{s}| = N$, is implicit.
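As a concrete (and purely illustrative) companion to these definitions, here is a minimal sketch in Python with a hand-specified transition rule over a five-token alphabet; a real simulator would of course be a language model rather than the hypothetical bigram-style rule used here.

```python
# Toy implementation of the objects defined above: a transition rule theta,
# a sampling procedure phi, an evolution operator psi, and the induced
# sequence probability P[s]. The alphabet and the transition probabilities
# are hypothetical and chosen only for illustration.
import random
from typing import Dict, Tuple

Token = str
Trajectory = Tuple[Token, ...]

ALPHABET = ("cat", "dog", "and", "a", ".")

def theta(s: Trajectory) -> Dict[Token, float]:
    """Transition rule: map a trajectory to a probability mass function over tokens."""
    last = s[-1] if s else None
    if last in ("cat", "dog"):
        return {"and": 0.4, ".": 0.4, "cat": 0.05, "dog": 0.05, "a": 0.1}
    if last in ("and", "a"):
        return {"cat": 0.3, "dog": 0.3, "a": 0.3, "and": 0.05, ".": 0.05}
    return {t: 1.0 / len(ALPHABET) for t in ALPHABET}  # uniform start

def phi(s: Trajectory) -> Token:
    """Sampling procedure: draw the next token from the induced measure theta(s)."""
    p = theta(s)
    return random.choices(list(p), weights=list(p.values()))[0]

def psi(s: Trajectory) -> Trajectory:
    """Evolution operator: append the sampled token to the trajectory."""
    return s + (phi(s),)

def sequence_prob(s: Trajectory) -> float:
    """P[s] = prod_i P(s_i | s_{1:i-1}), the joint measure built from theta."""
    prob = 1.0
    for i, token in enumerate(s):
        prob *= theta(s[:i]).get(token, 0.0)
    return prob

trajectory: Trajectory = ("cat",)
for _ in range(4):
    trajectory = psi(trajectory)
print(trajectory, sequence_prob(trajectory))
```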
Two propositions on semiotic physics
Having identified simulations with dynamical systems, we can now draw on the rich vocabulary and concepts of dynamical systems theory. In this section, we carry over a selection of concepts from dynamical systems theory and encourage the reader to think of further examples.
First, we will define a token bridge of length $B$ as a trajectory $(s_a, \ldots, s_b)$ that starts on a token $s_a$, ends on a token $s_b$, and has length $|b - a| = B$, such that the resulting trajectory is valid according to the transition rule of the simulator. For example, a token bridge of length 3 from "cat" to "dog" would be the trajectory "cat and a dog".
Second, we call the family of probability measures $P$ induced by a simulator non-degenerate if there exists an $\varepsilon > 0$ such that for (almost) all $\bar{s} \in T^*$ the probability assigned to any $s \in T$ by the induced measure is less than or equal to $1 - \varepsilon$,
$$P(\phi(\bar{s}) = s) \leq 1 - \varepsilon.$$
We can now formulate the following proposition:
Proposition 1. Vanishing likelihood of bridges. Given a family of non-degenerate probability measures $P$ on $T^*$, the probability of a token bridge $\bar{s}$ of length $B$ decreases monotonically as $B$ increases[10] and converges to 0 in the limit,
$$\lim_{B \to \infty} P[\bar{s}] = 0.$$
Proof: The probability of observing the particular bridge can be decomposed into the product of all individual transition probabilities, $P[\bar{s}] = \prod_{i=1}^{B} P(s_i \mid s_{1:i-1})$. Given that $P(s_i \mid s_{1:i-1}) \leq 1 - \varepsilon$ for all transitions (except at most a finite set), we see immediately that the probability of a longer sequence, $P((s_a, \ldots, s_b, s_{b'}))$, is at most equal to (on that finite set) or strictly smaller than the probability of the shorter sequence, $P((s_a, \ldots, s_b, s_{b'})) \leq (1 - \varepsilon)\, P((s_a, \ldots, s_b)) \leq P((s_a, \ldots, s_b))$. We also see that $0 \leq \lim_{B \to \infty} \prod_{i=1}^{B} P(s_i \mid s_{1:i-1}) \leq \lim_{B \to \infty} (1 - \varepsilon)^B = 0$, from which the proposition follows.
Notes: As correctly pointed out by multiple commenters, in general, it is not true that the probability of (sa,...,sb) decreases monotonically when sb is fixed. In particular, the sequence (1,2,3,4,5) plausibly gets assigned a higher probability than the sequence (1,2,3,5). So the proposition only talks about the probability of a sequence when another token is appended. In general, when a sequence is sufficiently long and the transition function is not exceedingly weird, the probability of getting that particular sequence will be small. We also note that real simulators might well induce degenerate probability measures, for example in the case of a language model that falls into a very strong repeating loop[11]. In that case, the sequence can converge to a probability larger than zero.
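A tiny numeric illustration of Proposition 1, using a hypothetical two-token transition rule in which no token ever receives probability above $1 - \varepsilon$ with $\varepsilon = 0.2$:

```python
# Even the greedily most likely continuation loses probability mass at every
# step, and the sequence probability stays below (1 - eps)**B. The transition
# probabilities are made up for illustration.
eps = 0.2
p_next = {"0": {"0": 0.8, "1": 0.2}, "1": {"0": 0.6, "1": 0.4}}

seq, prob = ["0"], 1.0
for step in range(1, 11):
    last = seq[-1]
    nxt = max(p_next[last], key=p_next[last].get)  # most likely next token
    prob *= p_next[last][nxt]
    seq.append(nxt)
    assert prob <= (1 - eps) ** step
    print(step, round(prob, 6), round((1 - eps) ** step, 6))
```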
There are usually multiple token bridges starting from and ending in any given pair of tokens. For example, besides "and a", we could also have "with a" or "versus a" between "cat" and "dog". We define the set of all token bridges of length B between sa and sb as
$$T_a^b = \{\bar{s} \in T^B \mid \bar{s}_1 = s_a \text{ and } \bar{s}_B = s_b\}$$
and denote the total probability of transitioning from $s_a$ to $s_b$ in $B$ steps as $P(T_a^b)$, which we calculate as
$$P(T_a^b) = \sum_{\bar{s} \in T_a^b} P(\bar{s}).$$
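For intuition, here is a brute-force computation of $P(T_a^b)$ for a hypothetical three-token bigram rule; in this toy setting enumerating all bridges is still tractable, though their number grows as $|T|^{B-1}$.

```python
# Enumerate every bridge of B steps from "cat" to "dog" under an invented
# bigram transition rule and sum their probabilities to get P(T_a^b).
import itertools
import math

TOKENS = ["cat", "dog", "and"]
P_NEXT = {  # hypothetical transition probabilities P(next | last)
    "cat": {"cat": 0.1, "dog": 0.3, "and": 0.6},
    "dog": {"cat": 0.4, "dog": 0.1, "and": 0.5},
    "and": {"cat": 0.5, "dog": 0.4, "and": 0.1},
}

def bridge_prob(bridge):
    return math.prod(P_NEXT[s][t] for s, t in zip(bridge, bridge[1:]))

B = 6  # number of transition steps
bridges = [("cat",) + mid + ("dog",)
           for mid in itertools.product(TOKENS, repeat=B - 1)]
total = sum(bridge_prob(br) for br in bridges)
print(f"{len(bridges)} bridges, P(T_a^b) = {total:.4f}")
```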
Computing this sum is, in general, computationally infeasible, as the number of possible token bridges grows exponentially with the length of the bridge. However, Proposition 1 suggests that we will typically be dealing with small probabilities. This insight leads us to borrow a technique from statistical mechanics that is concerned with the way in which unlikely events come about:
Proposition 2. Large deviation principle for token bridges. The total probability of transitioning from a token $s_a$ to $s_b$ in $B$ steps satisfies a large deviation principle with rate function $J$,
$$\lim_{B \to \infty} \frac{1}{B} \ln P(T_a^b) = -\lim_{B \to \infty} \min_{\bar{s} \in T_a^b} J(\bar{s}),$$
where we call $J(\bar{s}) = -\frac{1}{B} \sum_{i=1}^{B} \ln P(s_i \mid s_{1:i-1})$ the average action of a token bridge.
Proof: We again leverage the product rule and the properties of the exponential function to write the probability of a token bridge $\bar{s}$ as
$$P(\bar{s}) = \prod_{i=1}^{B} P(s_i \mid s_{1:i-1}) = \exp\left(\sum_{i=1}^{B} \ln P(s_i \mid s_{1:i-1})\right)$$
so that the total probability $P(T_a^b)$ can be written as a sum of exponentials,
$$P(T_a^b) = \sum_{\bar{s} \in T_a^b} \exp\left(\sum_{i=1}^{B} \ln P(s_i \mid s_{1:i-1})\right).$$
We now substitute the definition of the average action, which makes the dependence of the exponent on $B$ explicit,
$$P(T_a^b) = \sum_{\bar{s} \in T_a^b} \exp(-B\, J(\bar{s})).$$
Let $\bar{s}^* = \operatorname{argmin}_{\bar{s}} J(\bar{s})$. Then $\exp(-B\, J(\bar{s}^*))$ is the largest term of the sum and we can rewrite the sum as
$$P(T_a^b) = \exp(-B\, J(\bar{s}^*)) \left(1 + \sum_{\bar{s} \in T_a^b \setminus \{\bar{s}^*\}} \exp\{-B\,(J(\bar{s}) - J(\bar{s}^*))\}\right).$$
Applying the logarithm to both sides and multiplying by $\frac{1}{B}$ results in
$$\frac{1}{B} \ln P(T_a^b) = -J(\bar{s}^*) + \frac{1}{B} \ln\left(1 + \sum_{\bar{s} \in T_a^b \setminus \{\bar{s}^*\}} \exp\{-B\,(J(\bar{s}) - J(\bar{s}^*))\}\right).$$
Since $J(\bar{s}^*) < J(\bar{s})$ by construction, $J(\bar{s}) - J(\bar{s}^*)$ is larger than zero and $\exp\{-B\,(J(\bar{s}) - J(\bar{s}^*))\}$ goes to zero as $B$ grows.
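A quick numeric sanity check of this algebraic rewriting, using a hypothetical three-token bigram rule (not a real language model): summing the bridge probabilities directly and evaluating the factored form $\exp(-B J(\bar{s}^*))\left(1 + \sum \exp\{-B(J(\bar{s}) - J(\bar{s}^*))\}\right)$ give the same value.

```python
# Verify that the direct sum over bridges equals the factored expression
# built from the average action. The transition rule is invented for
# illustration; a real simulator would condition on the full trajectory.
import itertools
import math

TOKENS = ["a", "b", "c"]
P_NEXT = {  # hypothetical transition probabilities P(next | last)
    "a": {"a": 0.2, "b": 0.3, "c": 0.5},
    "b": {"a": 0.5, "b": 0.2, "c": 0.3},
    "c": {"a": 0.3, "b": 0.4, "c": 0.3},
}

def avg_action(bridge):
    """J(s) = -(1/B) * sum_i ln P(s_i | s_{i-1}) for a bridge with B transitions."""
    B = len(bridge) - 1
    return -sum(math.log(P_NEXT[s][t]) for s, t in zip(bridge, bridge[1:])) / B

B = 5
bridges = [("a",) + mid + ("b",) for mid in itertools.product(TOKENS, repeat=B - 1)]
actions = [avg_action(br) for br in bridges]

direct = sum(math.exp(-B * J) for J in actions)   # sum over bridges of exp(-B J(s))

J_star = min(actions)
i_star = actions.index(J_star)                    # one minimiser s*
factored = math.exp(-B * J_star) * (
    1 + sum(math.exp(-B * (J - J_star)) for i, J in enumerate(actions) if i != i_star)
)
print(direct, factored)  # identical up to floating-point rounding
```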