This post was created during the Dovetail Research Fellowship. Thanks to Alex, Alfred, José Pedro Faustino, everyone who read and commented on the draft, and everyone else in the fellowship for their ideas and discussions.
Introduction
Part of what makes a system intelligent is its ability to analyze data generated by some process (for example, the past observations it might have made about the world) and use it to predict what data the process is likely to generate in the future (such as what it might observe in the future). So to understand how systems do this, it would be interesting to study the underlying mechanics of making accurate predictions about and discovering patterns in sequences of data.
What is computational mechanics and why is it useful?
Here is the informal definition: Computational mechanics is a formalism (originating with ε-machines/causal states) for building minimal, maximally predictive models of stochastic processes, quantifying their structure, memory, and computation.
It is essentially a field of study that tries to answer the question: Given some observed data (a sequence of events), what does it mean to predict the future of that process as well as possible, and how can we build a minimal model that does so?[1]
The usefulness of answering this question should be apparent, since there are many applications (including in the field of AI) in which the goal is to predict a future sequence of events based on the events that happened so far, and it is convenient to be able to do that as accurately and efficiently as possible. And it could also be useful to model systems that are themselves predicting future events as having a minimal and maximally predictive internal model of the process that is generating the events. We will discuss what minimality and maximum predictiveness mean later.
One interesting and recent use case of computational mechanics (which I will call CM from now on) was shown in a paper called “Transformers Represent Belief State Geometry in their Residual Stream”, which had authors from AI safety research groups such as PIBBSS and Simplex. In the paper, they showed evidence that
“transformers construct predictive representations that surpass simple next-token prediction. Instead, these representations encode the complex geometry associated with belief updating over hidden states, capturing an inference process over the underlying structure of the data-generating process.”
This seems very significant, since it appears to indicate that current transformer-based AI systems do have internal models of the processes that generate their training data (for example human cognition, which generates text in natural language). (LessWrong post about the paper).
Example: Weather forecasting
The weather is an example of a process that can emit stochastic observations. For instance, imagine that, at some particular location, you have sensors that measure air pressure and humidity and are trying to predict what might be the resulting measurements before making them.
There are different types and amounts of information you could have about the weather. If you only had access to the current time and historical records, your best guess would be the average pressure and humidity for that particular location at that time of the year/day (assuming constant climate). But, you could get a more accurate prediction by making some measurements of those quantities across a timespan before predicting what you might measure next. For example, even though you can’t know the exact values you will get, they are probably not going to be very different from the last ones you got (assuming a short interval between measurements), and their rate of change will also be similar to the rate of change between your last 2 measurements.
But you could also infer other things, such as if you measure a sudden drop in air pressure, it is likely because of an upcoming storm, so you could use that information to predict an increase in humidity. By doing that, you are using information acquired by making observations (the measurements) of a stochastic process (the weather) to infer a hidden state of the process (a possible upcoming storm) which allows you to make better predictions about future observations (the increase in humidity).
So, in situations like this, as we'll see later, computational mechanics could be used to create models that use the least amount of information from the past measurements to make the most accurate predictions about what might be the results of future measurements.
Example: Text prediction
Imagine you are trying to predict the value of a bit in the binary version of an English text document (i.e. just the text translated to 0s and 1s). Without any information, your best guess would probably be that there is a 50% chance of it being either 0 or 1. Even if you had access to a few of the previous bits, say …010, you still wouldn’t have enough information to meaningfully improve your guess.
But if you had access to all previous bits, you could count them to figure out in which position of a byte you are, which would be useful information. Let’s say you were in position 7 and the previous bits were …00100000011100010111010 . Separating into bytes gives 00100000 01110001 0111010_ , the first two of which translate to “[space]” and “q”.
If the next bit is a 0, the full byte is 01110100 which translates to “t” and if it is a 1, it is 01110101, which translates to “u”. So, given that in English the string “ qu” is much more common than “ qt”, you can predict that the next bit is almost certainly a 1.
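The byte arithmetic above is easy to check in a few lines of Python:

```python
# Decode the two possible completions of the partial byte 0111010_.
prefix = "0111010"

for last_bit in "01":
    byte = prefix + last_bit
    char = chr(int(byte, 2))  # interpret the 8 bits as an ASCII code
    print(byte, "->", char)

# 01110100 -> t
# 01110101 -> u
```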
So you are using information acquired by looking at a history of data generated by a stochastic process (the text generation done by whoever wrote it, followed by the translation to binary) to infer a hidden state of the process (the current position within a byte, the position of the byte within a word and sentence, etc…) which allows you to make better predictions about future observations (the next bit being a 1).
This particular example of predicting the next piece of information by looking at all information contained in a specific context seems quite similar to the next token prediction performed by current LLMs, which is evidence for why computational mechanics could be useful for modeling their behavior.[2]
Causal states and ε-machines
Now let's start to see more formally what CM is. What we are interested in are stochastic processes that generate sequences of data in discrete timesteps.
Each element in the sequence is a random variable $X_t$, where $t$ is the time index, drawn from a countable set $\mathcal{A}$ (also called an alphabet)[3]. We will assume that the sequence is infinite both in its future and in its past, so $t$ is any integer and the sequence looks like this: $\ldots X_{-2} X_{-1} X_0 X_1 X_2 \ldots$. For example, if the alphabet is $\{1, 2, 3\}$, a sequence might look like …2313322311231…, in which each variable has a uniform distribution which is independent from all other variables (i.e. an IID process). We can also define a process (let’s call it $Z$) in which every 1 is followed by a 2, and the distribution is uniform otherwise, in which case a sequence might look like …312233212123… .
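As a concrete sketch, the process Z described above can be sampled with a few lines of Python (the function name and seed are just for illustration):

```python
import random

def sample_z(n, seed=0):
    """Sample n symbols from the process Z over {1, 2, 3}:
    every 1 is deterministically followed by a 2; otherwise
    the next symbol is uniform over the alphabet."""
    rng = random.Random(seed)
    seq, prev = [], None
    for _ in range(n):
        if prev == 1:
            x = 2                      # a 1 forces the next symbol to be 2
        else:
            x = rng.choice([1, 2, 3])  # otherwise uniform over the alphabet
        seq.append(x)
        prev = x
    return seq

seq = sample_z(10_000)
# Check the defining rule: a 1 is never followed by anything but a 2.
assert all(b == 2 for a, b in zip(seq, seq[1:]) if a == 1)
```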
Let’s say we are given a process (that is, the alphabet and the rules that provide the distributions of the variables based on the values of other variables and/or the timestep) and a particular history of observations, by which I mean an infinite string of symbols going from time $t = -\infty$ to $t = 0$ (for example …3323112321), and we want to predict what the next observation will be. If the value of the variable at $t = 1$ is independent of the values of all the variables in the history, the best we can do at predicting it is to guess based on its average distribution. But if the distribution changes depending on the values of the variables that came before it, then by knowing the history we can make a more accurate prediction, since we now have access to relevant information.
So then we can ask, just how much information from the past do we need to predict the future as accurately as possible (the amount of information required for maximum accuracy is called the statistical complexity of the process)? We might need the whole history or, depending on the process that generated the data, there might be a way to compress the information contained in the history.
One way to do that is to separate all possible histories into disjoint sets (sometimes referred to as equivalence classes or effective states of the partition) and then only look at which set a history is in, instead of at the history itself. This compresses information whenever there are fewer sets than histories. Since, to predict the future, we only want to know the set and not the history, we are interested in the partitions that preserve all of the information about the future contained in the histories (i.e. that are maximally predictive).
This means that the partition must be such that all histories within a set imply the same distribution over futures (otherwise, by knowing in which set a history is, you wouldn’t know which future distribution it implies, meaning that information wasn’t preserved). Formally, we say that if two histories $\overleftarrow{x}$ and $\overleftarrow{x}'$ are equivalent (are in the same set), then

$$P(\overrightarrow{X}^L \mid \overleftarrow{X} = \overleftarrow{x}) = P(\overrightarrow{X}^L \mid \overleftarrow{X} = \overleftarrow{x}') \quad \text{for all } L,$$

where $\overrightarrow{X}^L$ is a future sequence of length $L$. For example, in the $Z$ process, the histories …333231 and …123221 imply the same distribution over futures, since for this process each symbol can only depend on the previous one and both histories end in the same symbol.
But there are many ways to partition the histories in this way (the different ways are called prescient rivals, since they have the same predictive power) so which one do we choose? Well, the cut is made by Occam’s razor, which says that the simplest explanation is the best.
In this case, the simplest (or minimal) partition means the one with the fewest number of sets, which is the one in which not only all histories within a class imply the same future distribution but also all histories that imply the same future distribution are in the same class, i.e. there is a bijection between future distribution and equivalence classes.
We call the classes in this minimal and maximally predictive partition of histories the causal states of the stochastic process. They can be thought of as hidden states of the process that entirely determine the probability distribution for the next observation. And each observation uniquely determines the transition between causal states. The set of causal states, along with the transition probabilities between them, are called the ε-machine of the process.
As a more formal definition: The ε-machine of a process is the ordered pair {ε, T}, where ε is the causal state function and T is the set of the transition matrices for the states defined by ε.
The causal state function $\epsilon: \overleftarrow{\mathcal{X}} \to \mathcal{S}$, where $\mathcal{S}$ is the set of causal states, is the rule by which we partition the histories into the causal states, and the transition matrices are matrices $T^{(x)}$ associated with each possible observation $x$ that tell you the probabilities of going from each causal state to each other causal state and emitting $x$. The matrix for a particular symbol $x$ is defined as

$$T^{(x)}_{ij} = P\left(\epsilon(\overleftarrow{x}x) = s_j,\, X = x \mid \epsilon(\overleftarrow{x}) = s_i\right),$$

where $\overleftarrow{x}x$ means the past history $\overleftarrow{x}$ concatenated with the symbol $x$, and their sum $T = \sum_x T^{(x)}$ is called the stochastic connection matrix of the process and corresponds to the total probabilities of transition between states.
One important property of ε-machines is that they are unifilar, which means that the current causal state along with the next emitted symbol uniquely determines the next causal state (there is a proof of this fact in the appendix). We will talk more about unifilarity later.
All of this will probably become clearer with the next example, but it was essentially what we did in the examples seen so far. When predicting the weather, we created an equivalence class of all histories of observations which ended in a sudden drop in air pressure, and called it the “upcoming storm state”. By knowing we are in this state we can improve our future predictions without having to know any of the past measurements. And if, in the future, we detect a large increase in humidity, we’ll know that we transitioned to the “raining state”.[4]
Example: The Golden Mean process
This process is defined as follows: The set of possible observations is $\{0, 1\}$. After emitting a 0, it deterministically emits a 1. After emitting a 1, it emits either another 1 with probability $p$ or a 0 with probability $1-p$.
So it is easy to see that this process has 2 states: State A (free), corresponding to the last observation being a 1, so the next symbol is 1 with prob $p$ and 0 with prob $1-p$; State B (bound), corresponding to the last observation being a 0, so the next symbol is 1 with prob 1.
2-state hidden Markov model that generates the Golden Mean process.
Notice how, just by knowing which state the process is currently in, we can determine exactly the probability distribution for the next observation and the corresponding next state. Therefore the current state encodes the entire future distribution of the system, and you couldn’t get more information about it by looking at more of the past history of observations.
The transition matrices for this process are as follows (rows = current state, columns = next state, ordered A then B):

$$T^{(1)} = \begin{pmatrix} p & 0 \\ 1 & 0 \end{pmatrix}, \qquad T^{(0)} = \begin{pmatrix} 0 & 1-p \\ 0 & 0 \end{pmatrix}.$$
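A quick numerical check of this machine. The sketch below (assuming states are ordered A, B and that p is the probability of emitting a 1 from state A) builds the two labeled matrices, verifies their sum is a stochastic matrix, and confirms that samples from the machine never contain "00":

```python
import random
import numpy as np

p = 0.5  # probability of emitting 1 from state A

# Labeled transition matrices T[x][i, j] = P(next state j, emit x | state i),
# states ordered (A, B).
T1 = np.array([[p, 0.0],
               [1.0, 0.0]])
T0 = np.array([[0.0, 1.0 - p],
               [0.0, 0.0]])
T = T1 + T0  # stochastic connection matrix
assert np.allclose(T.sum(axis=1), 1.0)  # each row is a probability distribution

# Sample from the epsilon-machine and confirm "00" never occurs.
rng = random.Random(0)
state, out = 0, []  # start in state A
for _ in range(10_000):
    if state == 0:
        x = 1 if rng.random() < p else 0
    else:
        x = 1                       # state B always emits 1
    state = 0 if x == 1 else 1      # unifilar: the symbol determines the next state
    out.append(x)
assert "00" not in "".join(map(str, out))
```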
And those are the matrices that define the ε-machine for the Golden Mean process. We could also look at some of the prescient rivals of the states of the ε-machine. For example, we could define a machine that works like this: State X, which emits a 1 with prob $p$ and transitions to state Y, or emits a 0 with prob $1-p$ and transitions to state Z; State Y, which emits a 1 with prob $p$ and transitions to state X, or emits a 0 with prob $1-p$ and transitions to state Z; State Z, which emits a 1 with prob 1 and transitions to state X.
3-state hidden Markov model that generates the Golden Mean process.
This machine also predicts the future observations of the Golden Mean process, and it does so as accurately as the ε-machine, but it has one extra state (in fact this machine, by using the states X and Y, is keeping track of the parity of the number of 1s after the last 0 in the sequence, not just which symbol was the last observation, as the ε-machine does), so it isn’t a minimal model of the process despite being maximally predictive. Therefore, Occam’s razor tells us the ε-machine is, as is always the case, a better model for this process, because it is simpler.
Important quantities in computational mechanics
:
The stationary distribution over causal states, denoted π. It is a vector containing the long term average probabilities of being in each causal state. If the process is in a causal state $s_i$, then the vector $\eta$ containing the probability distribution of being in each state will be 1 at the coordinate corresponding to $s_i$ and 0 everywhere else. Then, the probabilities of being in each causal state at the next time step are given by $\eta T$. And the same goes for any timestep, so after $n$ steps the distribution is $\eta T^n$. This means that the stationary distribution is $\pi = \lim_{n \to \infty} \eta T^n$ (if the limit exists and the ε-machine is irreducible). That implies that $\pi T = \pi$, so π is an eigenvector of $T$ with eigenvalue 1, which gives us an easy way to find $\pi$ given $T$ using linear algebra. For the Golden Mean, if $\pi = (\pi_A, \pi_B)$, we have the system $\pi_A = p\,\pi_A + \pi_B$, $\pi_B = (1-p)\,\pi_A$, $\pi_A + \pi_B = 1$ (because the components of π always sum to 1, since they are probabilities). Solving it gives us $\pi = \left(\frac{1}{2-p}, \frac{1-p}{2-p}\right)$, and if $p = 1/2$, $\pi = (2/3, 1/3)$. There are also processes that oscillate between causal states in a cycle, which could make $\eta T^n$ oscillate with $n$ and make the limit not exist. But since all matrices whose rows (or columns) sum to 1 have an eigenvalue of 1, there still exists $\pi$ such that $\pi T = \pi$, and in this case $\pi$ represents the limit of the distribution averaged over time (called the Cesàro limit distribution) and can be defined as $\pi = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \eta T^n$.
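The eigenvector method described above can be sketched with NumPy; for the Golden Mean with p = 1/2 the result should be (2/3, 1/3):

```python
import numpy as np

p = 0.5
# Stochastic connection matrix of the Golden Mean epsilon-machine, states (A, B):
T = np.array([[p, 1 - p],   # from A: emit 1 and stay in A, or emit 0 and go to B
              [1.0, 0.0]])  # from B: emit 1 and go to A

# pi is the left eigenvector of T with eigenvalue 1, normalized to sum to 1.
vals, vecs = np.linalg.eig(T.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

print(pi)  # approximately [2/3, 1/3]
assert np.allclose(pi @ T, pi)  # pi really is stationary
```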
:
The statistical complexity of a partition, denoted $C_\mu$. It is just the entropy of the stationary probability distribution of the partition states, so in the case of an ε-machine, $C_\mu = H(\pi) = -\sum_i \pi_i \log_2 \pi_i$. This quantity can be interpreted as the long term average information necessary to predict the future history, or the level of structure of the process.
For the ε-machine of the Golden Mean with $p = 1/2$, $C_\mu = H(2/3, 1/3) \approx 0.918$ bits, slightly less than 1 bit. That makes sense, since there are only 2 possible future distributions, corresponding to the 2 possible values of the last observation, plus we already have some additional information since 1 is a more likely observation than 0.
For the 3 state partition of the Golden Mean with $p = 1/2$, the stationary distribution over states is $(\pi_X, \pi_Y, \pi_Z) = (4/9, 2/9, 1/3)$ and $H(\pi) \approx 1.53$ bits. So this partition is statistically more complex than the causal states and requires more information about the history to give predictions with the same accuracy.
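To compare the two partitions numerically, we can compute the entropy of each stationary distribution (values below are for p = 1/2, with the 3-state distribution derived from that machine's transition structure):

```python
import numpy as np

def entropy_bits(dist):
    """Shannon entropy in bits of a probability vector."""
    dist = np.asarray(dist, dtype=float)
    dist = dist[dist > 0]
    return float(-(dist * np.log2(dist)).sum())

# Stationary distributions for p = 1/2:
C_mu_2 = entropy_bits([2 / 3, 1 / 3])         # epsilon-machine (states A, B)
C_mu_3 = entropy_bits([4 / 9, 2 / 9, 1 / 3])  # prescient rival (states X, Y, Z)

print(C_mu_2, C_mu_3)  # ~0.918 vs ~1.530 bits
assert C_mu_2 < C_mu_3  # the epsilon-machine is the simpler partition
```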
It has been proven that the causal states, i.e. the partition with the fewest number of sets (or, in cases with infinite sets, the partition in which there is a bijection between sets and future distributions), are always unique and have the minimum $C_\mu$ amongst all of the prescient rivals. Two partitions can be different and both have the minimum $C_\mu$ if they differ only on a measure 0 set of histories.
:
The entropy rate of a process, denoted $h_\mu$. It is the average entropy per symbol given all of the past history (i.e. how much randomness remains on average even when you know the causal state). It is defined as $h_\mu = \lim_{L \to \infty} \frac{H[\overrightarrow{X}^L]}{L}$, where $\overrightarrow{X}^L$ is a future sequence of observations of length L. For irreducible ε-machines, the entropy rate also follows the relation $h_\mu = \sum_i \pi_i H[X_{t+1} \mid S_t = s_i]$, which is the weighted average of the entropies of the next observation given the possible current states. For the Golden Mean with $p = 1/2$,

$$h_\mu = \pi_A H(1/2) + \pi_B \cdot 0 = \frac{2}{3} \cdot 1 = \frac{2}{3} \text{ bits per symbol}.$$
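The weighted-average formula for the Golden Mean can be checked directly:

```python
import math

p = 0.5
pi_A, pi_B = 1 / (2 - p), (1 - p) / (2 - p)  # stationary distribution over (A, B)

def H2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

# Weighted average of the per-state next-symbol entropies:
# state A emits 1 with prob p (entropy H2(p)); state B is deterministic (entropy 0).
h_mu = pi_A * H2(p) + pi_B * 0.0
print(h_mu)  # 2/3 of a bit for p = 1/2
```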
E:
The excess entropy of a process, denoted E. It is the total information (mutual information) the past observations provide about the future observations, defined as

$$E = I[\overleftarrow{X}; \overrightarrow{X}]$$

or $E = I[S; \overrightarrow{X}]$, where $S$ is the current causal state. $E$ can be interpreted as a measure of the "predictive gain" of the process, i.e. how much better on average is our prediction about the future when knowing the past compared to not knowing it (completely random processes have $E = 0$ and deterministic processes have $E = C_\mu$). For the Golden Mean with $p = 1/2$, $E \approx 0.252$ bits, which is also the same as $I[X_0; X_1]$, but this isn’t the case for all processes, and only holds here because the Golden Mean is a process with a Markov order of 1, meaning that the causal state depends only on the last symbol.
𝜒:
The crypticity of a process, denoted 𝜒. It is defined as $\chi = C_\mu - E$, and represents the “information loss”, that is, the difference between how much information we need to make the best prediction and how much information that prediction contains about the future (or how much uncertainty remains on average about the current causal state given the whole future). If $\chi$ is 0, all causal states are completely determined by the future observations (and not just by the past ones). For the Golden Mean with $p = 1/2$, $\chi = 2/3$ bits, which makes sense, since if the first future observation is a 1, we don’t know whether the current causal state is A or B, and any more future observations wouldn’t help distinguish them.
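Putting the last few quantities together for the Golden Mean with p = 1/2, using E = I[X₀; X₁] (valid here because the process is Markov of order 1):

```python
import math

def H(dist):
    """Shannon entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in dist if q > 0)

p = 0.5
pi_A, pi_B = 2 / 3, 1 / 3              # stationary distribution over (A, B)

C_mu = H([pi_A, pi_B])                 # statistical complexity, ~0.918 bits
h_mu = pi_A * H([p, 1 - p])            # entropy rate, 2/3 bit

# E = I[X_0; X_1] = H[X_1] - H[X_1 | X_0] for this Markov-order-1 process.
p1 = pi_A * p + pi_B * 1.0             # stationary P(X = 1) = 2/3
E = H([p1, 1 - p1]) - h_mu             # excess entropy, ~0.252 bits
chi = C_mu - E                         # crypticity, equals 2/3 bit here

print(E, chi)
```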
Example: Non-unifilar generator
Earlier we saw how the Golden Mean process can be generated by two different Markov models, one of them being its ε-machine. But in both cases, the Markov models were unifilar, meaning that if you are in a state and emit a symbol, there is never ambiguity about which state you transition to. But that isn't always the case, as can be seen in this Markov model:
Here, if you are in state A and emit a 1 you don't know whether you transitioned to B or back to A, so any past that ends in a 1 implies uncertainty about the current state (which is why it is called a hidden Markov model, or HMM), which means that this Markov model isn't the ε-machine of the process it generates. This is the ε-machine:
It requires 3 states instead of 2. The causal state X corresponds to the HMM state A, Y corresponds to B and Z corresponds to the state of uncertainty between A and B. As can be seen, the ε-machine is unifilar: all arrows coming out of any causal state have different emissions. To make the partition of past histories explicit: any history that ends in a 0 followed by an even number of 2s corresponds to causal state X, any that ends in a 0 followed by an odd number of 2s is in Y, and any that ends in a 1 followed by any number of 2s is in Z, as shown in the image:
Something interesting about the different Markov models is that knowing whether you are in state A or B can give you more information about the future than knowing whether you are in X, Y or Z, but you can't access that information just by looking at the past symbols.
For this process, the aforementioned quantities ($\pi$, $C_\mu$, $h_\mu$, $E$ and $\chi$) can be computed in the same way from the ε-machine and the transition probabilities shown in the figure.
Example: The Mess3 process
This is the main process analyzed in the previously mentioned paper by the researchers from PIBBSS and Simplex. It is generated by the following HMM:
Hidden Markov model that generates the Mess3 process
This HMM also isn't unifilar, and in this case you can never know exactly which of its states you are in just by looking at the emissions, which implies that the ε-machine of Mess3 isn’t the same as the HMM. So then what are the causal states for this process? Is there a more intuitive way for us to understand them and refer to them other than just as a partition of histories? The answer is yes: each causal state is a probability distribution of being in each of the states of the HMM. Such probability distributions are called "mixed states" or "belief states". In the appendix there is a proof that the belief states are the causal states of Mess3.
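The belief-state idea can be sketched as a Bayesian filtering update. The matrices below are placeholders, not the actual Mess3 parameters (which are given in the paper); only the update rule matters here:

```python
import numpy as np

def belief_update(eta, T_x):
    """Bayes-update a belief state eta after observing symbol x,
    where T_x[i, j] = P(next hidden state j, emit x | hidden state i)."""
    new = eta @ T_x
    return new / new.sum()  # renormalize; new.sum() is P(x | eta)

# Placeholder 2-state, 2-symbol HMM (NOT the actual Mess3 matrices),
# chosen only to illustrate the update.
T = {0: np.array([[0.4, 0.1], [0.0, 0.2]]),
     1: np.array([[0.3, 0.2], [0.5, 0.3]])}
assert np.allclose((T[0] + T[1]).sum(axis=1), 1.0)  # valid HMM

eta = np.array([0.5, 0.5])      # initial uncertainty over hidden states
for x in [1, 0, 0, 1]:          # an example observation sequence
    eta = belief_update(eta, T[x])
print(eta)                      # the belief state after these observations
```

Each distinct reachable belief state corresponds to one causal state, which is exactly how the fractal set of states below is generated.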
Now, how many causal states are there? As it turns out, there is an uncountably infinite amount of causal states. Since each is a probability distribution over the 3 states of the HMM, the causal states can be arranged as points in an equilateral triangle such that the distance to a side represents the probability of being in the state associated with the opposite vertex (because the sum of the distances to the sides always adds up to 1). Doing that yields the following figure:
Geometry of the causal states of Mess3 as distributed in a triangle according to the probability distributions of being in each of the 3 states of the HMM.
As can be seen in the figure, the causal states are distributed in the triangle (2-simplex) forming a fractal pattern. Since there is an uncountable amount of (non measure zero) causal states, the statistical complexity is infinite[5], so in this case, if you want maximal prescience, you need the whole history and there is no way to compress the information contained in it. But CM can still be useful for compressing if we are willing to sacrifice some predictive power. Instead of partitioning histories by their exact future distribution, we could have a partition in which each set has histories with "similar" future distributions. We can quantify their similarity by the maximum error between their predictions, which serves as a distance metric $d$ between two histories/causal states, called the Total Variation distance (TV).[6] More precisely, $d(s, s')$ is the maximum difference between the conditional probabilities of any particular set of future symbols occurring when conditioning on the two histories.
So, as formalized in the appendix, if we stipulate a maximum allowed error $\delta$ (i.e. a resolution of the partition), we can find a partition with a finite number $N$ of sets such that if any two causal states $s$ and $s'$ are in the same set then $d(s, s') \le \delta$, and the smaller $\delta$ is, the larger $N$ needs to be. This corresponds to dividing the simplex into $N$ regions such that they cover all causal states.
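As a rough illustration (not the optimal partition construction), we can coarse-grain sampled belief states into grid cells of side δ and watch the number of occupied sets grow as δ shrinks; the TV distance between two finite next-symbol distributions is also easy to compute:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two finite distributions:
    the maximum difference in probability assigned to any event,
    which equals half the L1 distance."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def coarse_grain(beliefs, delta):
    """Map each belief state to the grid cell it falls in, so states
    roughly within delta of each other share a cell. Returns the cells."""
    return {tuple(np.floor(np.asarray(b) / delta).astype(int)) for b in beliefs}

# Random points on the 2-simplex stand in for sampled belief states.
beliefs = np.random.default_rng(0).dirichlet([1, 1, 1], size=1000)
n_coarse = len(coarse_grain(beliefs, 0.2))   # large delta, few sets
n_fine = len(coarse_grain(beliefs, 0.05))    # small delta, many sets
assert n_coarse < n_fine  # smaller delta requires a larger N, as stated above
```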
The optimal approximations of the ε-machine are those that, given some $N$, minimize $\delta$ or, given some $\delta$, minimize $N$, depending on what you are optimizing for.
For the Mess3 process, the stationary distribution $\pi$ is the probability density represented by the fractal, and the statistical complexity $C_\mu$ is infinite[5]; the remaining quantities ($h_\mu$, $E$ and $\chi$) are finite and can be computed from the HMM's parameters.
Conclusion
Computational mechanics is a useful framework because it allows us to infer hidden underlying dynamics of stochastic processes which can be used to improve our predictions about future data generated by them, and it provides a way to quantify how much information about past data is required to make the best possible predictions. And ε-machines can be thought of as the objective "targets of learning", as they are the simplest best possible models of stochastic processes.
It also seems potentially useful for understanding the inner workings and behavior of agents, since to be effective optimizers most agents would need to model their environment based on the inputs they receive, so it would be interesting to investigate if such internal models resemble ε-machines.
So far we haven’t talked about how, given a particular process, to reconstruct the ε-machine (or an optimal approximate machine given a limit on $\delta$ or $N$) associated with it, or about things like bidirectional ε-machines and ε-transducers. Those and other related topics will be explored if a part 2 of this explainer is ever written.
Appendix
Proof of the unifilarity of ε-machines:
Let's consider the bidirectional history to be composed of the past, concatenated with the present, concatenated with the future:

$$\overleftrightarrow{x} = \overleftarrow{x}\, x\, \overrightarrow{x}$$

We want to prove that if two pasts are in the same causal state and emit the same symbol, then they must end up in the same causal state, that is, $\epsilon(\overleftarrow{x}) = \epsilon(\overleftarrow{x}') \Rightarrow \epsilon(\overleftarrow{x}x) = \epsilon(\overleftarrow{x}'x)$.

Assuming $\epsilon(\overleftarrow{x}) = \epsilon(\overleftarrow{x}')$, we have:

$$P(\overrightarrow{X} \in F \mid \overleftarrow{X} = \overleftarrow{x}) = P(\overrightarrow{X} \in F \mid \overleftarrow{X} = \overleftarrow{x}') \quad \forall F \in \mathcal{B},$$

where $\mathcal{B}$ is the Borel σ-algebra generated by the cylinder sets of $\mathcal{A}^{\mathbb{N}}$ and $\mathcal{A}$ is the alphabet. So, in particular, considering the cylinder set defined by the first future symbol being $x$, we have that

$$P(X_1 = x \mid \overleftarrow{X} = \overleftarrow{x}) = P(X_1 = x \mid \overleftarrow{X} = \overleftarrow{x}').$$

But $P(X_1 = x, \overrightarrow{X}_2 \in F \mid \overleftarrow{X}) = P(X_1 = x \mid \overleftarrow{X})\, P(\overrightarrow{X}_2 \in F \mid \overleftarrow{X}, X_1 = x)$, so

$$P(\overrightarrow{X}_2 \in F \mid \overleftarrow{X} = \overleftarrow{x}, X_1 = x) = \frac{P(X_1 = x, \overrightarrow{X}_2 \in F \mid \overleftarrow{X} = \overleftarrow{x})}{P(X_1 = x \mid \overleftarrow{X} = \overleftarrow{x})}.$$

Now, taking the future into account (the event $\{X_1 = x, \overrightarrow{X}_2 \in F\}$ is itself a set in $\mathcal{B}$), we have

$$P(X_1 = x, \overrightarrow{X}_2 \in F \mid \overleftarrow{X} = \overleftarrow{x}) = P(X_1 = x, \overrightarrow{X}_2 \in F \mid \overleftarrow{X} = \overleftarrow{x}'),$$

and using the two equalities above we get:

$$P(\overrightarrow{X}_2 \in F \mid \overleftarrow{X} = \overleftarrow{x}, X_1 = x) = P(\overrightarrow{X}_2 \in F \mid \overleftarrow{X} = \overleftarrow{x}', X_1 = x).$$

This means that, no matter what the symbol $x$ is, the distribution over futures of the two different pasts concatenated with the same $x$ is the same, so $\epsilon(\overleftarrow{x}x) = \epsilon(\overleftarrow{x}'x)$.
Proof that the belief states are the causal states of Mess3:
The states of the HMM are called $h_i$, and let’s call the random variable associated with the current HMM state $H$. We know that the distribution over future sequences of symbols is completely determined by the current state $H$, so no matter what the past history is, we have

$$P(\overrightarrow{X} \mid H = h_i, \overleftarrow{X} = \overleftarrow{x}) = P(\overrightarrow{X} \mid H = h_i).$$

What we are interested in is the future distribution given the past, $P(\overrightarrow{X} \mid \overleftarrow{X} = \overleftarrow{x})$. We can open this distribution in terms of the states $h_i$:

$$P(\overrightarrow{X} \mid \overleftarrow{X} = \overleftarrow{x}) = \sum_i P(\overrightarrow{X} \mid H = h_i)\, P(H = h_i \mid \overleftarrow{X} = \overleftarrow{x}),$$

where $P(H = h_i \mid \overleftarrow{X} = \overleftarrow{x})$ is the probability of being in state $h_i$ given the past, so $\eta(\overleftarrow{x}) = \left(P(h_1 \mid \overleftarrow{x}), P(h_2 \mid \overleftarrow{x}), P(h_3 \mid \overleftarrow{x})\right)$ is the distribution of states given the past, which is also called the mixed state or belief state. Since $P(\overrightarrow{X} \mid H = h_i)$ is fixed for each $h_i$ and doesn’t depend on the past, $P(\overrightarrow{X} \mid \overleftarrow{X} = \overleftarrow{x})$ is entirely determined by $\eta(\overleftarrow{x})$, which is a function ($\eta: \overleftarrow{\mathcal{X}} \to \Delta$, where $\overleftarrow{\mathcal{X}}$ is the set of possible past histories and $\Delta$ is the set of possible $\eta$) of the past, so the partition induced by $\eta$ is a maximally predictive partition of the past histories, i.e. either the causal states or a prescient rival. To figure out which, we need to check bijectivity, that is: are there two distinct $\eta$ that give the same distribution over futures?
To answer that, it is easier to only look at the distributions over the next symbol. If we are in state $h_i$, the probabilities of each next symbol are given by the HMM's transition matrices. Arranging them in a vector, we have that the distribution over the next symbol for $h_1$ is some vector $v_1$, and by the symmetry of the Mess3 HMM, $v_2$ and $v_3$ are the corresponding permutations of $v_1$. The distribution over the next symbol given some mixed state $\eta$ is then $\sum_i \eta_i v_i$. Since this distribution is entirely determined by the $\eta_i$, if two mixed states imply the same distribution over futures, they must in particular imply the same distribution over the next symbol.

Let’s assume there are two different mixed states $\eta$ and $\eta'$ with the same next-symbol distribution. That means

$$\sum_i (\eta_i - \eta'_i)\, v_i = 0.$$

Since the $v_i$ are linearly independent, all of their coefficients must be 0, so $\eta_1 = \eta'_1$, $\eta_2 = \eta'_2$ and $\eta_3 = \eta'_3$, which contradicts the assumption that the mixed states were different. So two different mixed states must have different future distributions, and therefore there is a bijection between mixed states and future distributions, which means the mixed states are the causal states of Mess3.
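The linear-independence step can be illustrated numerically with a hypothetical full-rank set of next-symbol vectors v_i (not the actual Mess3 values):

```python
import numpy as np

# If the rows v_i of V are linearly independent, the map eta -> eta @ V
# is injective, so distinct mixed states give distinct next-symbol
# distributions. V here is a hypothetical full-rank emission matrix.
V = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.7, 0.2],
              [0.2, 0.1, 0.7]])
assert np.linalg.matrix_rank(V) == 3  # rows are linearly independent

eta1 = np.array([0.5, 0.3, 0.2])
eta2 = np.array([0.2, 0.5, 0.3])
# Distinct belief states map to distinct next-symbol distributions:
assert not np.allclose(eta1 @ V, eta2 @ V)
```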
Approximability of ε-machines:
One possible metric for the distance between causal states is the Total Variation distance, defined as:

$$d(s, s') = \sup_{F \in \mathcal{B}} \left| P(\overrightarrow{X} \in F \mid s) - P(\overrightarrow{X} \in F \mid s') \right|,$$

where $\mathcal{B}$ is the Borel σ-algebra generated by the cylinder sets of $\mathcal{A}^{\mathbb{N}}$ and $\mathcal{A}$ is the alphabet.
Using this metric, we can say that a process has an arbitrarily well approximable ε-machine iff, for every $\delta > 0$, there exists a partition of the causal states of the process into finitely many disjoint sets such that any two causal states in the same set are at distance at most $\delta$, that is, iff the closure of the set of causal states is a compact metric space.
This is the case for Mess3 since the causal states live in a finite region of a 2-dimensional space.
In essence, finding this model is a form of statistical inference with a very general hypothesis class: The set of unifilar hidden Markov models (including those with uncountably many states). By doing inference, you are trying to find the model that best fits the data. And by choosing a good prior, that prioritizes simpler models, you should converge to the minimal model in terms of its number of states and statistical complexity (which corresponds to the amount of information stored by the process for future data generation).
LLMs can be viewed as approximations of ε-machines for the process of natural language generation, in which the approximate causal states are the context vectors i.e. the activations of the final layer of the transformer, just before the "Unembedding" step.
The “upcoming storm" and "raining" states aren't actually causal states of the “weather process”, because they aren't maximally predictive as they don't contain all of the relevant information we could obtain from the past observations.
Actually, by the previously stated definition, the statistical complexity of Mess3 is undefined. So instead we can define it as the limit of the statistical complexities of the sets of states of the optimal approximations of the ε-machine, as $\delta$ goes to 0, which is infinite. There is however a better way to measure the complexity of processes with uncountable causal states, called the Information Dimension of the process, which as the name suggests measures the fractal dimension of the distribution of causal states. This dimension is a way of quantifying how well we can approximate the machine using finitely many states, as the lower the dimension, the fewer states we need for the same "quality" of approximation.
This metric is not the same as the Euclidean distance in the simplex. In this particular case, using this metric, the points at some constant distance from a central point form a regular hexagon, not a circle.
Example: Text prediction
Imagine you are trying to predict the value of a bit in the binary version of an English text document (i.e. just the text translated to 0s and 1s). Without any information, your best guess would probably be that there is a 50% chance of it being either 0 or 1. Even if you had access to a few of the previous bits, say …010, you still wouldn’t have enough information to meaningfully improve your guess.
But if you had access to all previous bits, you could count them to figure out in which position of a byte you are, which would be useful information. Let’s say you were in position 7 and the previous bits were …00100000011100010111010 . Separating into bytes gives 00100000 01110001 0111010_ , the first two of which translate to “[space]” and “q”.
If the next bit is a 0, the full byte is 01110100 which translates to “t” and if it is a 1, it is 01110101, which translates to “u”. So, given that in English the string “ qu” is much more common than “ qt”, you can predict that the next bit is almost certainly a 1.
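We can sanity-check this decoding with a few lines of Python (a minimal sketch, assuming standard ASCII encoding):

```python
# Sanity check of the byte example: decode the observed bits plus each candidate next bit.
def decode(bitstring):
    """Split a bit string into 8-bit chunks and decode each as an ASCII character."""
    return "".join(chr(int(bitstring[i:i + 8], 2)) for i in range(0, len(bitstring), 8))

prefix = "00100000" + "01110001" + "0111010"  # "[space]", "q", then the 7 known bits
print(decode(prefix + "0"))  # prints " qt"
print(decode(prefix + "1"))  # prints " qu"
```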
So you are using information acquired by looking at a history of data generated by a stochastic process (the text generation done by whoever wrote it, followed by the translation to binary) to infer a hidden state of the process (the current position within a byte, the position of the byte within a word and sentence, etc…) which allows you to make better predictions about future observations (the next bit being a 1).
This particular example of predicting the next piece of information by looking at all information contained in a specific context seems quite similar to the next token prediction performed by current LLMs, which is evidence for why computational mechanics could be useful for modeling their behavior.[2]
Causal states and ε-machines
Now let's start to see more formally what CM is. What we are interested in are stochastic processes that generate sequences of data in discrete timesteps.
Each element in the sequence is a random variable X_t, where t is the time index, drawn from a countable set 𝒜 (also called an alphabet)[3]. We will assume that the sequence is infinite both in its future and in its past, so t is any integer and the sequence looks like this: …X₋₂X₋₁X₀X₁X₂…. For example, if the alphabet is 𝒜 = {1, 2, 3}, a sequence might look like: …2313322311231…, in which each variable has a uniform distribution which is independent from all other variables (i.e. an IID process). We can also define a process in which every 1 is followed by a 2, and the distribution is uniform otherwise, in which case a sequence might look like: …312233212123….
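As a minimal sketch, the second example process (every 1 deterministically followed by a 2, uniform otherwise) can be sampled like this:

```python
import random

# Sample n symbols from the example process over {1, 2, 3}:
# every 1 is deterministically followed by a 2; otherwise the next symbol is uniform.
def generate(n, seed=0):
    rng = random.Random(seed)
    seq = [rng.choice([1, 2, 3])]
    for _ in range(n - 1):
        nxt = 2 if seq[-1] == 1 else rng.choice([1, 2, 3])
        seq.append(nxt)
    return seq

sample = generate(20)
# The defining constraint holds: every 1 in the sample is followed by a 2.
assert all(sample[i + 1] == 2 for i in range(len(sample) - 1) if sample[i] == 1)
```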
Let’s say we are given a process (that is, the alphabet and the rules that provide the distributions of the variables based on the values of other variables and/or the timestep) and a particular history of observations, by which I mean an infinite string of symbols going from time −∞ to 0 (for example …3323112321), and we want to predict what the next observation will be. If the value of the variable at t = 1 is independent of the values of all the variables in the history, the best we can do at predicting it is to guess based on its average distribution. But if the distribution changes depending on the values of the variables that came before it, then by knowing the history we can make a more accurate prediction, since we now have access to relevant information.
So then we can ask, just how much information from the past do we need to predict the future as accurately as possible (the amount of information required for maximum accuracy is called the statistical complexity of the process)? We might need the whole history or, depending on the process that generated the data, there might be a way to compress the information contained in the history.
One way to do that is to separate all possible histories into disjoint sets (sometimes referred to as equivalence classes or effective states of the partition) and then only look at which set a history is in, instead of at the history itself, which compresses information since there can only be as many sets as there are histories. Since, to predict the future, we only want to know the set and not the history, we are interested in the partitions that preserve all of the information about the future contained in the histories (i.e. that are maximally predictive).
This means that the partition must be such that all histories within a set imply the same distribution over futures (otherwise by knowing in which set a history is, you wouldn’t know which future distribution it implies, meaning that information wasn’t preserved). Formally, we say that if two histories h and h′ are equivalent (are in the same set), then

P(X₁…X_L = w | h) = P(X₁…X_L = w | h′) for every L and every w,

where w is a future sequence of length L. For example, in the second example process above (where every 1 is followed by a 2), the histories …333231 and …123221 imply the same distribution over futures, since for this process each symbol depends only on the previous one and both histories end in the same symbol.
But there are many ways to partition the histories in this way (the different ways are called prescient rivals, since they have the same predictive power) so which one do we choose? Well, the cut is made by Occam’s razor, which says that the simplest explanation is the best.
In this case, the simplest (or minimal) partition means the one with the fewest number of sets, which is the one in which not only all histories within a class imply the same future distribution but also all histories that imply the same future distribution are in the same class, i.e. there is a bijection between future distribution and equivalence classes.
We call the classes in this minimal and maximally predictive partition of histories the causal states of the stochastic process. They can be thought of as hidden states of the process that entirely determine the probability distribution for the next observation. And each observation uniquely determines the transition between causal states. The set of causal states, along with the transition probabilities between them, are called the ε-machine of the process.
As a more formal definition: The ε-machine of a process is the ordered pair {ε, T}, where ε is the causal state function and T is the set of the transition matrices for the states defined by ε.
The causal state function ε, which maps each history h to its causal state ε(h) ∈ S, where S = {σ₁, σ₂, …} is the set of causal states, is the rule by which we partition the histories into the causal states, and the transition matrices are matrices T^(a), associated with each possible observation a, that tell you the probabilities of going from each causal state to each other causal state and emitting a. The matrix for a particular symbol a is defined as

T^(a)_ij = P(ε(ha) = σ_j, X_next = a | ε(h) = σ_i),

where ha means the past history h concatenated with the symbol a, and their sum T = Σ_a T^(a) is called the stochastic connection matrix of the process and corresponds to the total probabilities of transition between states.
One important property of ε-machines is that they are unifilar, which means that the current causal state along with the next emitted symbol uniquely determines the next causal state (there is a proof of this fact in the appendix). We will talk more about unifilarity later.
All of this will probably become clearer with the next example, but it was essentially what we did in the examples seen so far. When predicting the weather, we created an equivalence class of all histories of observations which ended in a sudden drop in air pressure, and called it the “upcoming storm state”. By knowing we are in this state we can improve our future predictions without having to know any of the past measurements. And if in the future we detect a large increase in humidity, we’ll know that we transitioned to the “raining state”.[4]
Example: The Golden Mean process
This process is defined as follows: The set of possible observations is {0, 1}. After emitting a 0, it deterministically emits a 1. After emitting a 1, it emits either another 1 with probability p or a 0 with probability 1 − p.
So it is easy to see that this process has 2 states: State A (free), corresponding to the last observation being a 1, so the next symbol is 1 with prob p and 0 with prob 1 − p; State B (bound), corresponding to the last observation being a 0, so the next symbol is 1 with prob 1.
Notice how just by knowing in which state the process is currently in, we can determine exactly the probability distribution for the next observation and corresponding next state, therefore the current state encodes the entire future distribution of the system, and you couldn’t get more information about it by looking at more of the past history of observations.
The transition matrices for this process are as follows (rows = current state A, B; columns = next state A, B):

T^(1) =
[ p   0 ]
[ 1   0 ]

T^(0) =
[ 0   1 − p ]
[ 0   0     ]
And those are the matrices that define the ε-machine for the Golden Mean process. We could also look at some of the prescient rivals of the states of the ε-machine. For example, we could define a machine that works like this: State X, which emits a 1 with prob p and transitions to state Y, or emits a 0 with prob 1 − p and transitions to state Z; State Y, which emits a 1 with prob p and transitions to state X, or emits a 0 with prob 1 − p and transitions to state Z; State Z, which emits a 1 with prob 1 and transitions to state X.
This machine also predicts the future observations of the Golden Mean process, and it does it as accurately as the ε-machine, but it has one extra state (in fact this machine, by using the states X and Y, is keeping track of the parity of the number of 1s after the last 0 in the sequence, not just which symbol was the last observation, as the ε-machine does), so it isn’t a minimal model of the process despite being maximally predictive. Therefore, Occam’s razor tells us the ε-machine is, as is always the case, the better model for this process, because it is simpler.
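As a sketch, here is how sampling from the 2-state ε-machine looks in Python (using p = 1/2, the value used in the calculations below; the states and probabilities are the ones described above):

```python
import random

# Sample from the Golden Mean epsilon-machine: state A emits 1 w.p. p (staying in A)
# or 0 w.p. 1 - p (moving to B); state B always emits 1 and returns to A.
def golden_mean(n, p=0.5, seed=0):
    rng = random.Random(seed)
    state, out = "A", []
    for _ in range(n):
        if state == "A":
            if rng.random() < p:
                out.append(1)               # stay in A
            else:
                out.append(0); state = "B"  # move to the bound state
        else:
            out.append(1); state = "A"      # B must emit 1
    return out

seq = golden_mean(30)
# The defining constraint of the process: no two consecutive 0s ever occur.
assert "00" not in "".join(map(str, seq))
```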
Important quantities in computational mechanics
The stationary distribution over causal states, denoted π. It is a vector containing the long term average probabilities of being in each causal state. If the process is in a causal state σ_i, then the vector η containing the probability distribution of being in each state will be 1 at the coordinate corresponding to σ_i and 0 everywhere else. Then, the probabilities of being in each causal state at the next time step are given by ηT. And the same goes for any timestep, so after n steps the distribution is ηT^n. This means that the stationary distribution π = lim_{n→∞} ηT^n (if the limit exists and the ε-machine is irreducible). That implies that πT = π, so π is an eigenvector of T with eigenvalue 1, which gives us an easy way to find π given T using linear algebra. For the Golden Mean, if π = (π_A, π_B), we have the system π_A = p·π_A + π_B, π_B = (1 − p)·π_A, π_A + π_B = 1 (because the components of π always sum to 1, since they are probabilities). Solving it gives us π = (1/(2 − p), (1 − p)/(2 − p)), and if p = 1/2, π = (2/3, 1/3). There are also processes that oscillate between causal states in a cycle, which could make ηT^n oscillate with n as well and make the limit not exist. But since all matrices whose rows (or columns) sum to 1 have an eigenvalue of 1, there still exists π such that πT = π, and in this case π represents the limit of the distribution averaged over time (called the Cesàro limit distribution) and can be defined as π = lim_{N→∞} (1/N) Σ_{n=1}^{N} ηT^n.
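For the Golden Mean, this linear-algebra computation can be sketched in a few lines (a minimal example with p = 1/2, solving πT = π together with normalization):

```python
import numpy as np

# Stationary distribution of the Golden Mean epsilon-machine (p = 1/2):
# solve pi @ T = pi with the constraint that pi sums to 1.
p = 0.5
T = np.array([[p, 1 - p],   # from A: emit 1 -> A, emit 0 -> B
              [1.0, 0.0]])  # from B: emit 1 -> A
# (T - I)^T pi = 0 plus the normalization row, solved by least squares.
A = np.vstack([(T - np.eye(2)).T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)  # approximately [2/3, 1/3]
```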
The statistical complexity of a partition, denoted C_μ. It is just the entropy of the stationary probability distribution of the partition states, so in the case of an ε-machine, C_μ = H[π] = −Σ_i π_i log₂ π_i. This quantity can be interpreted as the long term average information necessary to predict the future history, or the level of structure of the process.
For the ε-machine of the Golden Mean with p = 1/2, C_μ = H[(2/3, 1/3)] ≈ 0.918, slightly less than 1 bit. That makes sense, since there are only 2 possible future distributions, corresponding to the 2 possible values of the last observation, plus we already have some additional information since 1 is a more likely observation than 0.
For the 3 state partition of the Golden Mean with p = 1/2, the stationary distribution over states is (π_X, π_Y, π_Z) = (4/9, 2/9, 1/3) and H[π] ≈ 1.53 bits. So this partition is statistically more complex than the causal states and requires more information about the history to give predictions with the same accuracy.
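As a sketch, we can compute both complexities numerically by finding each machine's stationary distribution (as the left eigenvector with eigenvalue 1) and taking its entropy:

```python
import numpy as np

def entropy_bits(dist):
    """Shannon entropy of a probability vector, in bits."""
    dist = np.asarray(dist, dtype=float)
    return float(-(dist * np.log2(dist)).sum())

def stationary(T):
    """Left eigenvector of T with eigenvalue 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(T.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return v / v.sum()

p = 0.5
T_eps = np.array([[p, 1 - p],      # epsilon-machine: states A, B
                  [1.0, 0.0]])
T_rival = np.array([[0, p, 1 - p],  # rival: X -> Y (emit 1) or Z (emit 0)
                    [p, 0, 1 - p],  # Y -> X or Z
                    [1, 0, 0]])     # Z -> X
C_eps = entropy_bits(stationary(T_eps))
C_rival = entropy_bits(stationary(T_rival))
print(C_eps, C_rival)  # the rival needs more memory: ~0.918 vs ~1.53 bits
```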
It has been proven that the causal states, i.e. the partition with the fewest number of sets (or, in cases with infinite sets, the partition in which there is a bijection between sets and future distributions), are always unique and have the minimum C_μ amongst all of the prescient rivals. Two partitions can be different and both have the minimum C_μ only if they differ on a measure 0 set of histories.
The entropy rate of a process, denoted hμ. It is the average entropy per symbol given all of the past history (i.e. how much randomness remains on average even when you know the causal state). It is defined as h_μ = lim_{L→∞} H[X₁…X_L]/L, where X₁…X_L is a future sequence of observations of length L. For irreducible ε-machines, the entropy rate also follows the relation h_μ = Σ_i π_i H[X | S = σ_i], which is the weighted average of the entropies of the next observation given the possible current states. For the Golden Mean,

h_μ = π_A·H[(p, 1 − p)] + π_B·0, which for p = 1/2 gives h_μ = (2/3)·1 = 2/3 bits per symbol.
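The weighted-average formula can be checked numerically for the Golden Mean (a small sketch with p = 1/2):

```python
import numpy as np

def H(dist):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    dist = np.asarray(dist, dtype=float)
    dist = dist[dist > 0]
    return float(-(dist * np.log2(dist)).sum())

pi = np.array([2 / 3, 1 / 3])          # stationary distribution over A, B
emissions = [np.array([0.5, 0.5]),     # state A: P(emit 0), P(emit 1)
             np.array([0.0, 1.0])]     # state B: always emits 1
h_mu = sum(w * H(e) for w, e in zip(pi, emissions))
print(h_mu)  # prints 0.6666666666666666
```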
The excess entropy of a process, denoted E. It is the total information (mutual information) the past observations provide about the future observations, defined as

E = I[past; future],

or E = I[S; future], where S is the current causal state. E can be interpreted as a measure of the "predictive gain" of the process, i.e. how much better on average is our prediction about the future when knowing the past compared to not knowing it (completely random processes have E = 0 and deterministic processes have E = C_μ). For the Golden Mean, E ≈ 0.252 bits, which is also the same as H[X₁] − h_μ, but this isn’t the case for all processes, and only holds here because the Golden Mean is a process with a Markov order of 1, meaning that the causal state depends only on the last symbol.
The crypticity of a process, denoted 𝜒. It is defined as 𝜒 = C_μ − E, and represents the “information loss”, that is, the difference between how much information we need to make the best prediction and how much information that prediction contains about the future (or how much uncertainty remains on average about the current causal state given the whole future). If 𝜒 is 0, all causal states are completely determined by the future observations (and not just by the past ones). For the Golden Mean, 𝜒 = 2/3 bits, which makes sense, since if the first future observation is a 1, we don’t know whether the current causal state is A or B, and any more future observations wouldn’t help distinguishing them.
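A quick numerical sketch of these two quantities for the Golden Mean, using the Markov-order-1 shortcut E = H[X₁] − h_μ (which, as noted, does not hold in general):

```python
import numpy as np

def H(dist):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    dist = np.asarray(dist, dtype=float)
    dist = dist[dist > 0]
    return float(-(dist * np.log2(dist)).sum())

pi = np.array([2 / 3, 1 / 3])      # stationary distribution over causal states A, B
C_mu = H(pi)                       # statistical complexity
h_mu = 2 / 3                       # entropy rate computed above (p = 1/2)
P_symbol = np.array([1 / 3, 2 / 3])  # stationary P(0), P(1) of a single observation
E = H(P_symbol) - h_mu             # excess entropy (valid because Markov order 1)
chi = C_mu - E                     # crypticity
print(E, chi)  # E ~0.252 bits, chi = 2/3 bits
```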
Example: Non-unifilar generator
Earlier we saw how the Golden Mean process can be generated by two different Markov models, one of them being its ε-machine. In both cases, the Markov models were unifilar, meaning that if you are in a state and emit a symbol, there is never ambiguity regarding which state you transition to. But that isn't always the case, as can be seen in this Markov model:
Here, if you are in state A and emit a 1 you don't know whether you transitioned to B or back to A, so any past that ends in a 1 implies uncertainty about the current state (which is why it is called a hidden Markov model, or HMM), which means that this Markov model isn't the ε-machine of the process it generates. This is the ε-machine:
It requires 3 states instead of 2. The causal state X corresponds to the HMM state A, Y corresponds to B and Z corresponds to the state of uncertainty between A and B. As can be seen, the ε-machine is unifilar: all arrows coming out of any causal state have different emissions. To make the partition of past histories explicit, any history that ends in a 0 followed by an even number of 2s corresponds to causal state X, any that ends in a 0 then an odd number of 2s is in Y, and any that ends in a 1 followed by any number of 2s is in Z, as shown in the image:
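The stated partition rule can be sketched as a small Python function mapping a (finite suffix of a) history to its causal state:

```python
# Map a history (string over {0, 1, 2}) to its causal state using the suffix rule:
# "...0" + even number of 2s -> X; "...0" + odd number of 2s -> Y; "...1" + any 2s -> Z.
def causal_state(history):
    trailing_2s = len(history) - len(history.rstrip("2"))
    before = history[:len(history) - trailing_2s]
    if before.endswith("0"):
        return "X" if trailing_2s % 2 == 0 else "Y"
    if before.endswith("1"):
        return "Z"
    return None  # suffix too short to determine the state

assert causal_state("10") == "X"
assert causal_state("102") == "Y"   # odd number of 2s after the last 0
assert causal_state("122") == "Z"
```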
Something interesting about the different Markov models is that knowing whether you are in state A or B can give you more information about the future than knowing whether you are in X, Y or Z, but you can't access that information just by looking at the past symbols.
For this process, the aforementioned quantities are: ; ; ; ; .
Example: The Mess3 process
This is the main process analyzed in the previously mentioned paper by the researchers from PIBBSS and Simplex. It is generated by the following HMM:
This HMM also isn't unifilar, and in this case you can never know exactly in which of its states you are just by looking at the emissions, which implies that the ε-machine of Mess3 isn’t the same as the HMM. So then what are the causal states for this process? Is there a more intuitive way for us to understand them and refer to them other than just as a partition of histories? The answer is yes: each causal state is a probability distribution of being in each of the states of the HMM. Such probability distributions are called "mixed states" or "belief states". In the appendix there is a proof that the belief states are the causal states of Mess3.
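In general, the belief state after each observation can be computed by a Bayesian update. Here is a sketch for a generic HMM given as per-symbol transition matrices; the 2-state matrices below are hypothetical placeholders for illustration, not Mess3's actual parameters:

```python
import numpy as np

# Belief-state (mixed-state) update for an HMM given as per-symbol matrices
# T[a][i, j] = P(next state j, emit a | current state i).
def update_belief(belief, T_a):
    """Bayes update: new belief is proportional to belief @ T_a."""
    unnormalized = belief @ T_a
    return unnormalized / unnormalized.sum()

# Hypothetical 2-state HMM (rows of T[0] + T[1] each sum to 1):
T = {0: np.array([[0.4, 0.1], [0.2, 0.1]]),
     1: np.array([[0.1, 0.4], [0.1, 0.6]])}
belief = np.array([0.5, 0.5])    # uniform prior over hidden states
for symbol in [1, 1, 0]:         # observe a few emissions
    belief = update_belief(belief, T[symbol])
print(belief)                    # the current mixed/causal state
```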
Now, how many causal states are there? As it turns out, there is an uncountably infinite amount of causal states. Since each is a probability distribution over the 3 states of the HMM, the causal states can be arranged as points in an equilateral triangle such that the distance to a side represents the probability of being in the state associated with the opposite vertex (because the sum of the distances to the sides always adds up to 1). Doing that yields the following figure:
As can be seen in the figure, the causal states are distributed in the triangle (2-simplex) forming a fractal pattern. Since there is an uncountable amount of (non measure zero) causal states, the statistical complexity is infinite[5], so in this case if you want maximal prescience, you need the whole history and there is no way to compress the information contained in it. But CM can still be useful for compressing if we are willing to sacrifice some predictive power. Instead of partitioning histories by their exact future distribution, we could have a partition in which each set has histories with "similar" future distributions. We can quantify their similarity by the maximum error between their predictions, which serves as a distance metric, d(s, s′), between two histories/causal states, called the Total Variation distance (TV).[6] More precisely, d(s, s′) is the maximum difference between the conditional probabilities of any particular set of future symbols occurring when conditioning on the two histories.
So, as formalized in the appendix, if we stipulate a maximum allowed error δ (i.e. a resolution of the partition), we can find a partition with a finite number n of sets such that if any two causal states s_i and s_j are in the same set then d(s_i, s_j) ≤ δ, and the smaller δ is, the larger n needs to be. This corresponds to dividing the simplex into n regions such that they cover all causal states.
The optimal approximations of the ε-machine are those that, given some δ, minimize n or, given some n, minimize δ, depending on what you are optimizing for.
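A minimal sketch of the TV metric, shown here on next-symbol distributions (for full futures one takes a supremum over sets of future sequences; the two distributions below are made-up illustrations):

```python
import numpy as np

# Total variation distance between two distributions over the same alphabet.
# For finite alphabets this equals half the L1 distance between the vectors.
def tv_distance(p, q):
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

p = np.array([0.50, 0.30, 0.20])  # next-symbol distribution of one belief state
q = np.array([0.52, 0.29, 0.19])  # next-symbol distribution of a nearby one
print(tv_distance(p, q))  # ~0.02, so these two states fit in a delta = 0.05 cell
```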
For the Mess3 process, the aforementioned quantities are: probability density represented by the fractal; ; ; ; .
Conclusion
Computational mechanics is a useful framework because it allows us to infer hidden underlying dynamics of stochastic processes which can be used to improve our predictions about future data generated by them, and it provides a way to quantify how much information about past data is required to make the best possible predictions. And ε-machines can be thought of as the objective "targets of learning", as they are the simplest best possible models of stochastic processes.
It also seems potentially useful for understanding the inner workings and behavior of agents, since to be effective optimizers most agents would need to model their environment based on the inputs they receive, so it would be interesting to investigate if such internal models resemble ε-machines.
So far we haven’t talked about, given a particular process, how to reconstruct the ε-machine (or an optimal machine given a limit on or ) associated with it, or about things like bidirectional ε-machines and ε-transducers. Those and other related topics will be explored if a part 2 of this explainer is ever written.
Appendix
Proof of the unifilarity of ε-machines:
Let's consider the bidirectional history to be composed of the past, concatenated with the present, concatenated with the future:

…X₋₂X₋₁ X₀ X₁X₂…

We want to prove that if two pasts are in the same causal state and emit the same symbol, then they must end up in the same causal state, that is, ε(h) = ε(h′) implies ε(ha) = ε(h′a) for any symbol a.

Assuming ε(h) = ε(h′), we have:

P(X₀X₁X₂… ∈ F | h) = P(X₀X₁X₂… ∈ F | h′) for all F ∈ ℬ,

where ℬ is the Borel σ-algebra generated by the cylinder sets of 𝒜^∞ and 𝒜 is the alphabet. So, in particular, considering the cylinder set defined by X₀ = a, we have that

P(X₀ = a | h) = P(X₀ = a | h′).

But P(X₁X₂… ∈ F | ha) = P(X₀ = a, X₁X₂… ∈ F | h) / P(X₀ = a | h), so the future distribution after observing a is determined by these two quantities. Now, taking the future into account, we have

P(X₀ = a, X₁X₂… ∈ F | h) = P(X₀ = a, X₁X₂… ∈ F | h′),

and using the equalities above we get:

P(X₁X₂… ∈ F | ha) = P(X₁X₂… ∈ F | h′a) for all F, and therefore ε(ha) = ε(h′a). ∎
Proof that the belief states are the causal states of Mess3:
The states of the HMM are called A, B and C, and let’s call the random variable associated with the causal states S. We know that the distribution over future sequences of symbols is completely determined by the current HMM state H, so no matter what the past history h is, we have

P(future | H, h) = P(future | H).

What we are interested in is the future distribution given the past, P(future | h), if the belief state implied by h is η = (η_A, η_B, η_C) = (P(A | h), P(B | h), P(C | h)). We can open this distribution in terms of the states:

P(future | h) = Σ_H P(H | h) P(future | H) = η_A P(future | A) + η_B P(future | B) + η_C P(future | C).

To answer whether different belief states give different predictions, it is easier to only look at the distributions over the next symbol. If we are in state A, the probabilities of emitting each symbol a, b and c are given by the HMM's emission parameters. Arranging them in a vector we have that the distribution over the next symbol for A is v_A = (P(a | A), P(b | A), P(c | A)). By symmetry (Mess3 treats its three states and symbols symmetrically), v_B and v_C are the corresponding cyclic permutations of v_A. And the distribution over the next symbol given some mixed state η is v_η = η_A v_A + η_B v_B + η_C v_C. Looking at one component of the vector,

P(X₁ = a | η) = η_A P(a | A) + η_B P(a | B) + η_C P(a | C).

Since the probability of the next symbol (and, iterating the same argument, of any future sequence) is entirely determined by the mixed state η, equal mixed states imply equal future distributions.

Let’s assume there are two different mixed states η and η′ with the same distribution v_η = v_{η′}. That means

(η_A − η′_A) v_A + (η_B − η′_B) v_B + (η_C − η′_C) v_C = 0.

Since the v_H are linearly independent, all of their coefficients must be 0, so η_A = η′_A, η_B = η′_B, η_C = η′_C, which contradicts the assumption that the mixed states were different. So two different mixed states must have different future distributions and therefore there is a bijection between mixed states and future distributions, which means the mixed states are the causal states of Mess3.
Approximability of ε-machines:
One possible metric for the distance between causal states is the Total Variation distance, defined as:

d(s, s′) = sup_F |P(future ∈ F | s) − P(future ∈ F | s′)|,

where the supremum is taken over measurable sets F of future sequences.

Using this metric, we can say that a process has an arbitrarily well approximable ε-machine iff for every δ > 0 there exists a partition of the causal states of the process into finitely many disjoint sets such that any two causal states in the same set are at distance at most δ, that is, iff the closure of the set of causal states is a compact metric space.
This is the case for Mess3 since the causal states live in a finite region of a 2-dimensional space.
Footnotes

[1] In essence, finding this model is a form of statistical inference with a very general hypothesis class: the set of unifilar hidden Markov models (including those with uncountably many states). By doing inference, you are trying to find the model that best fits the data. And by choosing a good prior, one that prioritizes simpler models, you should converge to the minimal model in terms of its number of states and statistical complexity (which corresponds to the amount of information stored by the process for future data generation).

[2] LLMs can be viewed as approximations of ε-machines for the process of natural language generation, in which the approximate causal states are the context vectors, i.e. the activations of the final layer of the transformer, just before the "Unembedding" step.

[3] We are considering only countable sets so as not to have to talk about probability densities and measure theory.

[4] The “upcoming storm" and "raining" states aren't actually causal states of the “weather process”, because they aren't maximally predictive, as they don't contain all of the relevant information we could obtain from the past observations.

[5] Actually, by the previously stated definition, the statistical complexity of Mess3 is undefined. So instead we can define it as the limit of the statistical complexities of the sets of states of the optimal approximations of the ε-machine as δ goes to 0, which is infinite. There is however a better way to measure the complexity of processes with uncountable causal states, called the Information Dimension of the process, which as the name suggests measures the fractal dimension of the distribution of causal states. This dimension is a way of quantifying how well we can approximate the machine using finitely many states, as the lower the dimension, the fewer states we need for the same "quality" of approximation.

[6] This metric is not the same as the Euclidean distance in the simplex. In this particular case, using this metric, the points at some constant distance from a central point form a regular hexagon, not a circle.