Context: I (Daniel C) have been working with Aram Ebtekar on various directions in his work on algorithmic thermodynamics and the causal arrow of time. This post explores some implications of algorithmic thermodynamics on the concept of optimization. All mistakes are my (Daniel's) own.
A typical picture of optimization is that of an agent causing its environment to have a convergent attractor: when an agent is trying to steer a system towards particular target configurations, you can predict that the system will likely end up in those target configurations even if you have significant uncertainty over the system's initial configuration.
Some examples: a builder turning a pile of raw materials, whatever state they started in, into a finished house; a chess player steering games from many different positions toward winning configurations.
This view of optimization can be framed as a form of entropy reduction. Initially, we have high uncertainty about which configuration the system occupies: There are many possible initial states the raw materials or chess pieces could be in. The optimizing process reduces this uncertainty, concentrating probability mass from a broad initial distribution into a tight final distribution.
However, note that this entropy reduction cannot occur globally for two related reasons: the second law of thermodynamics and the reversibility of the laws of physics. The second law directly states that the total entropy of an isolated system tends to increase over time, which forbids global entropy reduction. Similarly, reversibility of the microscopic dynamics requires that each initial microstate maps to a distinct final microstate, which is incompatible with our "convergent attractor" picture where many initial states funnel to the same final state. In fact, within the framework of stochastic thermodynamics, the derivation of the second law is closely linked to the reversibility of the underlying physics. Roughly speaking, the reversibility of the underlying laws of physics allows us to define coarse-grained macrostates of the system where the dynamics are approximately Markovian. Once we have this Markovian structure, we can derive the second law of thermodynamics as a mathematical consequence:
Second law from reversibility of physics[1]
Stochastic thermodynamics operates on coarse-grained macrostates rather than exact microstates. Let $\mathcal{X}$ denote the space of macrostates. We translate reversibility of the underlying physical laws into Markovian dynamics on these macrostates, allowing us to study thermodynamics through Markov processes.
For systems with discrete state spaces and discrete time evolution, the translation is straightforward. Macrostates form a partition over the microstates of an underlying system that evolves deterministically and reversibly. The transition probability $p(y \mid x)$ is simply the fraction of microstates in $x$ that evolve into macrostate $y$ after one time step. We define $\mu$ as the stationary measure of $p$—the unique measure satisfying $\mu(y) = \sum_{x \in \mathcal{X}} \mu(x)\, p(y \mid x)$, where $\mu(x)$ is the number of microstates in macrostate $x$.
For classical Hamiltonian mechanics, where we have a continuous phase space, we discretize the phase space into cells and assume the cell dynamics are Markovian (known as a Markovian coarse-graining). The transition probability $p(y \mid x)$ represents the fraction of phase space volume in cell $x$ that flows into cell $y$ under time evolution for a fixed duration. Liouville's theorem guarantees that phase space volume is preserved under Hamiltonian evolution, ensuring the stationary measure corresponds to the Liouville measure $\mu(x)$ (the phase space volume of cell $x$).
We define the dual probabilities as $\bar{p}(x \mid y) = \frac{\mu(x)\, p(y \mid x)}{\mu(y)}$, which represent the fraction of the microstates in $y$ that were mapped from microstates in $x$. By Bayes' rule, these are the reverse transition probabilities (only) when $x$ is sampled from the stationary measure $\mu$.
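To make the discrete construction concrete, here is a minimal Python sketch (my own toy example, not from the post or the paper): the microdynamics is an arbitrary permutation of eight microstates, the macrostates are an arbitrary three-cell partition, and the script checks that the microstate-counting measure $\mu$ is stationary and that the dual probabilities are normalized.

```python
import numpy as np

# A toy instance of the construction above (my own example): 8 microstates evolving by a
# fixed permutation (deterministic and reversible), partitioned into 3 macrostates.
perm = np.array([3, 0, 4, 1, 5, 2, 7, 6])      # reversible microdynamics
macro_of = np.array([0, 0, 0, 1, 1, 2, 2, 2])  # macrostate label of each microstate
n_macro = 3

# mu(x): number of microstates in macrostate x.
mu = np.array([(macro_of == x).sum() for x in range(n_macro)], dtype=float)

# p(y|x): fraction of microstates in macrostate x whose image lies in macrostate y.
p = np.zeros((n_macro, n_macro))
for m, m_next in enumerate(perm):
    p[macro_of[m], macro_of[m_next]] += 1
p /= mu[:, None]

# mu is stationary: sum_x mu(x) p(y|x) = mu(y) for every y.
assert np.allclose(mu @ p, mu)

# Dual probabilities p_bar(x|y) = mu(x) p(y|x) / mu(y) are normalized over x.
p_bar = (mu[:, None] * p) / mu[None, :]
assert np.allclose(p_bar.sum(axis=0), 1.0)

print("mu =", mu)
print("p(y|x) =\n", p)
```

Because the permutation is a bijection, the number of microstates landing in a macrostate equals the number already there, which is exactly why $\mu$ comes out stationary.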
We define the stochastic entropy of a state $x$ under a distribution $\rho$ as $s(x) = \log \frac{\mu(x)}{\rho(x)}$, which measures the entropy of an individual state. The generalized Gibbs-Shannon entropy $S(\rho) = \mathbb{E}_{x \sim \rho}\!\left[\log \frac{\mu(x)}{\rho(x)}\right]$ is the expectation of the stochastic entropy. Note that this differs from the standard Shannon entropy because we're operating on coarse-grained macrostates rather than uniform microstates. If the stationary measure $\mu$ is the counting measure where every macrostate is equally likely (e.g., if the discrete cells in our Markovian coarse-graining have equal volume), then we can omit $\mu$ and recover the usual Shannon entropy $S(\rho) = -\sum_{x} \rho(x) \log \rho(x)$.
Now, if we have $\rho_0$ as the initial distribution of the macrostate $X_0$, which evolves to $\rho_1(y) = \sum_{x} \rho_0(x)\, p(y \mid x)$ under our transition probabilities, then the stochastic entropy production that occurs when we transition from the state $X_0 = x$ to $X_1 = y$ is

$$\Delta s(x, y) = \log \frac{\mu(y)}{\rho_1(y)} - \log \frac{\mu(x)}{\rho_0(x)}.$$
We will attempt to prove that its expectation, the Gibbs-Shannon entropy production, is nonnegative:

$$\mathbb{E}[\Delta s(X_0, X_1)] = S(\rho_1) - S(\rho_0) \geq 0.$$
First, note that since we have $\mu(x)\, p(y \mid x) = \mu(y)\, \bar{p}(x \mid y)$, we can rewrite the stochastic entropy production as

$$\Delta s(x, y) = \log \frac{\mu(y)\, \rho_0(x)}{\mu(x)\, \rho_1(y)} = \log \frac{p(y \mid x)\, \rho_0(x)}{\bar{p}(x \mid y)\, \rho_1(y)}.$$
In addition, we will make use of the identity

$$\sum_{x} \bar{p}(x \mid y) = \frac{1}{\mu(y)} \sum_{x} \mu(x)\, p(y \mid x) = 1,$$

which follows from the stationarity of $\mu$.
Given these, we have:

$$\mathbb{E}\!\left[e^{-\Delta s} \,\middle|\, X_1 = y\right] = \sum_{x} \Pr[X_0 = x \mid X_1 = y]\; e^{-\Delta s(x, y)} = \sum_{x} \frac{\rho_0(x)\, p(y \mid x)}{\rho_1(y)} \cdot \frac{\bar{p}(x \mid y)\, \rho_1(y)}{p(y \mid x)\, \rho_0(x)} = \sum_{x} \bar{p}(x \mid y) = 1,$$

where $\mathbb{E}[\,\cdot \mid X_1]$ denotes conditional expectation given the random variable $X_1$.
Then, by the law of total expectation, we have:

$$\mathbb{E}\!\left[e^{-\Delta s}\right] = \mathbb{E}\!\left[\mathbb{E}\!\left[e^{-\Delta s} \,\middle|\, X_1\right]\right] = 1.$$
Applying Jensen's inequality (for the convex function $t \mapsto e^{-t}$) to the above equality, we conclude that the Gibbs-Shannon entropy production is non-negative:

$$e^{-\mathbb{E}[\Delta s]} \leq \mathbb{E}\!\left[e^{-\Delta s}\right] = 1 \quad \Longrightarrow \quad S(\rho_1) - S(\rho_0) = \mathbb{E}[\Delta s] \geq 0.$$
In summary, the reversible, deterministic laws of physics at the microscopic level permit a Markovian coarse-graining into macrostates, where transition probabilities preserve the underlying reversibility through the dual probability relationship. From this Markovian structure, we can derive the second law of thermodynamics in the form of nonnegative entropy production.
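Continuing the toy example from the earlier sketch, the following Python snippet numerically checks the two facts derived above for an arbitrary initial distribution $\rho_0$: the conditional averages of $e^{-\Delta s}$ multiply out so that $\mathbb{E}[e^{-\Delta s}] = 1$, and the expected entropy production $S(\rho_1) - S(\rho_0)$ comes out non-negative. The permutation, partition, and $\rho_0$ are arbitrary illustrative choices of mine.

```python
import numpy as np

# Numerical check of the derivation above, on an assumed toy system (same permutation
# dynamics and coarse-graining as in the earlier sketch; rho0 is an arbitrary choice).
perm = np.array([3, 0, 4, 1, 5, 2, 7, 6])      # reversible microdynamics on 8 microstates
macro_of = np.array([0, 0, 0, 1, 1, 2, 2, 2])  # partition into 3 macrostates
n_macro = 3

mu = np.array([(macro_of == x).sum() for x in range(n_macro)], dtype=float)
p = np.zeros((n_macro, n_macro))
for m, m_next in enumerate(perm):
    p[macro_of[m], macro_of[m_next]] += 1
p /= mu[:, None]

rho0 = np.array([0.7, 0.2, 0.1])  # initial macrostate distribution (arbitrary)
rho1 = rho0 @ p                   # evolved distribution

def S(rho):
    """Generalized Gibbs-Shannon entropy relative to the stationary measure mu."""
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(mu[mask] / rho[mask])))

# Average exp(-Delta s) and Delta s over the joint distribution of (X0, X1).
exp_neg_ds = 0.0
mean_ds = 0.0
for x in range(n_macro):
    for y in range(n_macro):
        w = rho0[x] * p[x, y]  # probability of the transition x -> y
        if w > 0:
            ds = np.log(mu[y] / rho1[y]) - np.log(mu[x] / rho0[x])
            exp_neg_ds += w * np.exp(-ds)
            mean_ds += w * ds

print("E[exp(-Delta s)]  =", round(exp_neg_ds, 6))          # 1.0 (fluctuation relation)
print("E[Delta s]        =", round(mean_ds, 6))             # >= 0
print("S(rho1) - S(rho0) =", round(S(rho1) - S(rho0), 6))   # equals E[Delta s]
```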
Given the reversibility of physics and the second law of thermodynamics, we cannot achieve global entropy reduction. Instead, optimization can only reduce the entropy of a subsystem, subject to the constraints that total entropy does not decrease and that the underlying microscopic dynamics remain reversible. This changes our "convergent attractor" picture of optimization: while an optimizing agent can funnel a subsystem from many initial configurations toward a narrow set of target states, it can only do so by carefully managing the entropy balance with its environment, ensuring that the information squeezed out of the optimized subsystem is properly accounted for in the larger system.
Given the constraints imposed by reversibility and the second law of thermodynamics, can we classify all the ways an optimizing agent can maintain these requirements while reducing the entropy of a subsystem? That is, how can an agent achieve the "convergent attractor" behavior of optimization while ensuring that global entropy production remains non-negative?
Suppose that we decompose our universe into the agent $A$, the subsystem $S$ that the agent is trying to optimize, and the rest of the environment $E$. Optimization results in entropy reduction in the subsystem $S$, and we must make sure that the global entropy doesn't decrease. There are three ways this can happen: the entropy removed from $S$ can be compensated by an entropy increase in the environment $E$, by an entropy increase in the agent $A$ (for example, in its memory), or by using up mutual information that the agent already shares with the subsystem. The third route is the least familiar, so let's spell it out. Suppose the agent holds a copy of some of the information in $S$:
We "erase" the copy of mutual information inside without making any changes to the agent or the environment. By erasing the mutual information in the subsystem, we have:
and . So there is entropy reduction in the subsystem.
The joint entropy becomes:

$$H(A, S') = H(A) + H(S') = H(A) + H(S) - I(A;S) = H(A, S),$$

which is exactly the same as before (after the erasure, the agent and subsystem are no longer correlated, so their joint entropy is just the sum of their individual entropies). The mutual information between the agent and the subsystem allows us to reduce the entropy of the subsystem without increasing entropy in either the agent or the environment.
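Here is a tiny worked example of this third route (my own illustration, with assumed one-bit systems): the agent $A$ and the subsystem $S$ are single, perfectly correlated bits, and the agent erases $S$'s copy of the shared information via the reversible update $s \mapsto s \oplus a$. The entropy of $S$ drops by one bit while $H(A)$ and the joint entropy are unchanged.

```python
import numpy as np
from collections import Counter

def entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical distribution given by a Counter."""
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum()) + 0.0  # "+ 0.0" avoids printing -0.0

# Joint states of (agent bit a, subsystem bit s): perfectly correlated, each pair equally likely.
joint_before = [(0, 0), (1, 1)]

# Reversible erasure of S's copy of the shared information: s -> s XOR a.
# The agent's bit is untouched, and the map is its own inverse (no microstates are merged).
joint_after = [(a, s ^ a) for (a, s) in joint_before]

for label, joint in [("before", joint_before), ("after ", joint_after)]:
    H_A = entropy_bits(Counter(a for a, _ in joint))
    H_S = entropy_bits(Counter(s for _, s in joint))
    H_AS = entropy_bits(Counter(joint))
    print(f"{label}: H(A)={H_A:.0f}  H(S)={H_S:.0f}  H(A,S)={H_AS:.0f}  I(A;S)={H_A + H_S - H_AS:.0f}")
```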
To recap, there are three ways that an embedded agent can optimize a subsystem under information conservation: it can export the subsystem's entropy into the environment, it can absorb that entropy into its own internal states (such as its memory), or it can spend mutual information it already shares with the subsystem.
Entropy is supposed to track "the capacity to do work" (the more entropy a system has, the less work can be extracted from it), and the capacity to do work is really the same as the capacity to perform optimization. Consider again the setup of Maxwell's demon: The demon stores information about the gas particle configurations inside its own memory, thereby increasing the entropy of its memory. If the demon's memory has a finite state space, then the difference between the size of the demon's memory and the memory's existing entropy represents the demon's remaining capacity to do work, as that's the amount of "free memory" that can be used to store further information about the room of ideal gas. A greater amount of entropy leaves a smaller amount of "free memory" available for optimization. In particular, the amount of free memory should be an objective feature of the memory state itself, since it's supposed to represent the demon's objective capacity to perform optimization on the room of ideal gas.
Details
To formalize this more rigorously, suppose that the state space of the demon's memory is $\{0,1\}^n$, that the demon stores information as a binary string $w$ inside its memory, and that we always mark the end of the binary string with a 1 to separate it from the "free memory".
In other words, the demon's memory contains a string of the form $w10^{n-k-1}$, where $w$ is a binary string of length $k$ and $0 \le k \le n-1$. We can think of $n-k-1$ as representing the size of the "free memory" that can be used to store information and therefore perform optimization, whereas $k$ represents the amount of used memory. In particular, for each $k$, we group all strings of the form $w10^{n-k-1}$ where $w \in \{0,1\}^k$ and the trailing $n-k-1$ bits are zero into the same macrostate (labeled by $k$).
Now, if we have a counting measure where all strings in $\{0,1\}^n$ are equally likely, then the entropy of the macrostate $k$ will be exactly $k$ bits, since there are exactly $k$ random bits given this macrostate (the other bits, taken up by the marker and the zeroed free memory, are deterministic).
As a result, when the demon stores new information inside its memory, it will have to use up more memory and therefore increase $k$, the entropy of its memory. Entropy production occurs because we're "forgetting" the exact microstate the memory is in when we condition on the macrovariable $k$.
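The following sketch implements the memory model just described for a small memory of $n = 6$ bits (the value of $n$ and the exact encoding are my own illustrative choices): it enumerates the microstates belonging to each macrostate $k$ and confirms that, under the counting measure, the entropy of macrostate $k$ is exactly $k$ bits.

```python
import math
from itertools import product

n = 6  # total memory size in bits (small enough for exhaustive enumeration)

# Microstates of macrostate k: strings of the form w 1 0^(n-k-1), where w is a k-bit string.
macrostates = {}
for k in range(n):  # one bit is always reserved for the end-of-string marker
    for w in product("01", repeat=k):
        microstate = "".join(w) + "1" + "0" * (n - k - 1)
        macrostates.setdefault(k, []).append(microstate)

# Under the counting measure, all microstates within a macrostate are equally likely,
# so the entropy of macrostate k is log2 of the number of its microstates: exactly k bits.
for k, states in sorted(macrostates.items()):
    assert len(states) == 2 ** k
    print(f"k={k}: {len(states):2d} microstates, entropy = {math.log2(len(states)):.0f} bits")
```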
However, this intuitive objectivity comes into conflict with existing definitions of entropy, as they rely on a subjective distribution. For instance, in stochastic thermodynamics, the stochastic entropy (or Shannon codelength) of a state $x$ given a distribution $\rho$ is defined as $-\log \rho(x)$ when the stationary measure is the counting measure, with our familiar Gibbs-Shannon entropy defined as the expectation of the Shannon codelength. In particular, the entropy of an individual state is undefined unless we also specify the distribution $\rho$. To translate this to the Maxwell's demon case, we would have to say that the demon's "capacity to do work" somehow depends on our subjective distribution over the memory states of the demon, which raises puzzling questions such as: "If our subjective distribution over the memory states has changed, does that affect the demon's actual capacity to do work?" It seems obvious that the demon's physical ability to perform useful work on the gas particles should not depend on an external observer's beliefs about its memory configuration. Yet standard formulations of entropy seem to make this objective physical capacity fundamentally observer-dependent.
In contrast with the subjectivity in traditional definitions of entropy, algorithmic thermodynamics defines the information content of a physical state as an objective feature of that state. Similar to stochastic thermodynamics, it relies on a Markovian coarse-graining of the system with coarse-grained state space $\mathcal{X}$. However, instead of defining a subjective distribution over $\mathcal{X}$, it simply assigns a binary string encoding to each coarse-grained state $x \in \mathcal{X}$. Under the counting measure, the algorithmic entropy of a state $x$ reduces to the Kolmogorov complexity of its binary encoding, which measures the length of the shortest program that can generate that encoding. This quantity represents a system's actual capacity to do work from an individual state; while algorithmic entropy depends on a choice of universal computer, it has been argued that the effect of this dependence is small for all realistic pairs of computers we might want to compare[3], allowing us to treat algorithmic entropy as a state function.
In algorithmic thermodynamics, we also have a version of the second law of thermodynamics that originates from Levin's principle of randomness conservation, giving probabilistic guarantees on the nondecrease of algorithmic entropy. Additionally, it offers a new interpretation of Gibbs-Shannon entropy: while algorithmic entropy captures the system's actual capacity to do work given an individual state, the Gibbs-Shannon entropy represents the expected capacity to do work when our state is sampled according to a prior distribution $\rho$, and when that distribution is known a priori.
When we analyze Maxwell's demon under the lens of algorithmic thermodynamics, we find that if the content of the demon's memory is somehow compressible, then the demon can in principle leverage universal computation to compress that content, thereby freeing up more available capacity to store information and perform optimization. On the other hand, if the content of its memory were incompressible, then the unused memory represents the demon's actual remaining capacity to do work, which is objective because there is no way to increase available memory without physically destroying information. This capacity does not depend on our subjective distribution over the demon's memory states; the compressibility of the memory state is an intrinsic property of the physical configuration itself, making the demon's optimization capacity observer-independent.
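Kolmogorov complexity itself is uncomputable, but any off-the-shelf compressor gives a computable upper bound, which is enough to illustrate the point. The Python sketch below (an illustration of mine, not a construction from the paper) compares a highly regular memory content with a typical random one: the regular content compresses to a short description, freeing up memory, while the random content is essentially incompressible, so no memory can be reclaimed without destroying information.

```python
import os
import zlib

n_bytes = 1024  # size of the demon's memory contents, in bytes (arbitrary)

# Highly regular contents (low algorithmic entropy): a repeating pattern.
regular = b"01" * (n_bytes // 2)
# Typical random contents (high algorithmic entropy): incompressible with overwhelming probability.
random_ = os.urandom(n_bytes)

for label, data in [("regular", regular), ("random ", random_)]:
    compressed = zlib.compress(data, 9)
    freed = len(data) - len(compressed)
    print(f"{label}: {len(data)} bytes -> {len(compressed)} bytes "
          f"(can free roughly {max(freed, 0)} bytes of memory)")
```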
What are the implications for our understanding of optimization now that we've refined our understanding of entropy through algorithmic thermodynamics? How does this foundation change our analysis of the three types of optimization we identified earlier?
Theorem 1 in Aram Ebtekar and Marcus Hutter. Foundations of algorithmic thermodynamics. arXiv preprint arXiv:2308.06927, 2024.
Engines of cognition explains the same idea
Page 5 of Aram Ebtekar and Marcus Hutter. Foundations of algorithmic thermodynamics. arXiv preprint arXiv:2308.06927, 2024.