I’d like to be able to apply more of the tools of statistical mechanics and thermodynamics outside the context of physics. For some pieces, that’s pretty straightforward - a large chunk of statistical mechanics is just information theory, and that’s already a flourishing standalone field which formulates things in general ways. But for other pieces, it’s less obvious. What’s the analogue of a refrigerator or a Carnot cycle in more general problems? How do “work” and “heat” generalize to problems outside physics? The __principle of maximum entropy__ tells us how to generalize temperature, and offers one generalization of work and heat, but it’s not immediately obvious why we can’t extract “work” from “heat” without subsystems at different temperatures, or how to turn that into a useful idea in non-physics applications.

This post documents my own exploration of these questions in the context of a relatively simple problem, with minimal reference to physics (other than by analogy). Specifically: we’ll talk about how to construct the analogue of a heat engine using biased coins.

## Intuition

The main idea I want to generalize here is that we can “move uncertainty around” without reducing uncertainty. This is exactly what e.g. a refrigerator or heat engine does.

Consider the viewpoint of a refrigerator-designer. All the microscopic dynamics of the (fridge + environment) system must be reversible, so the number of possible microscopic states will never decrease on its own as time passes. The only way to reduce uncertainty about the microscopic state is to observe it. But the fridge designer is *designing* the system, deciding in advance how it will behave. The designer has no direct access to the environment in which the fridge will run, no way to measure the exact positions the atoms will be in when the fridge first turns on. The designer, in short, cannot directly observe the system. So, from the designer’s perspective, there’s uncertainty which cannot be reduced.

(In statistical mechanics, there are several entirely different justifications for why observations can’t reduce microscopic uncertainty/entropy - for instance, in one approach, macroscopic variables are chosen in such a way that we can deterministically predict future macroscopic observations. Another comes from Maxwell’s demon-style arguments, where the demon’s memory has to be included as part of the system. I’ll use the designer viewpoint, since it’s conceptually simple and easy to apply in other areas - in particular, we can easily apply it to the design of AIs embedded in their environment.)

While we can’t reduce our total uncertainty, we *can* move it around. We design the machine to apply transformations to the system which leave us more certain about some subsystems (e.g. the inside of the refrigerator), but less certain about other subsystems (e.g. heat baths used to power the system).

## Setup

We’ll imagine two large sets of IID biased coins. One is the “cold pool”, in which each coin comes up 1 (i.e. heads) with probability 0.1 and 0 with probability 0.9. The other is the “hot pool”, in which each coin comes up 1 with probability 0.2. We’ll call the coins in the cold pool $X^C_1, \dots, X^C_n$, and the coins in the hot pool $X^H_1, \dots, X^H_n$.

We’re going to apply transformations to these coins. Each transformation replaces some set of coins with new values which are a function of their old values. For instance, one transformation might be

$$(X^H_1,\ X^C_1,\ X^C_2) \to \left(X^H_1,\ \ \bar{X}^H_1 X^C_1 + X^H_1 X^C_2,\ \ \bar{X}^H_1 X^C_2 + X^H_1 X^C_1\right)$$

(Here the bar denotes logical not - i.e. $\bar{X}$ means "not X".) This transformation swaps the cold coin $X^C_1$ with the cold coin $X^C_2$ if the hot coin $X^H_1$ is 1, and leaves everything unchanged if $X^H_1$ is 0.

We’ll mostly be able to use any transformations we want, but with two big constraints. First: **all transformations must be reversible**. If we know the final state of the coins and which transformations were applied, then we must be able to reconstruct the initial state of the coins. (This is the analogue of microscopic reversibility.) Our example transformation above is reversible - since it doesn’t change $X^H_1$, we can always tell whether $X^C_1$ and $X^C_2$ were swapped, and we can swap them back if they were (indeed, we can do so by simply reapplying the same transformation).

Second constraint: **all transformations must conserve the number of heads**; heads can be neither created nor destroyed on net. Here the number of heads is our analogue of energy, and heads-conservation is our analogue of microscopic energy conservation. (In physics, we’d probably describe this as some kind of spin system in an external magnetic field.) Our example transformation above conserves the number of heads: it either swaps two coins or leaves everything alone, so the total number of heads stays the same.
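Both constraints can be checked by brute force for a small transformation. The sketch below assumes the example is a controlled swap on three coins (one control coin, two coins that get swapped when the control is 1); the function name `controlled_swap` and the tuple layout are illustrative choices, not part of the setup above.

```python
from itertools import product

def controlled_swap(state):
    """Swap coins b and c when the control coin a is 1; otherwise do nothing."""
    a, b, c = state
    return (a, c, b) if a == 1 else (a, b, c)

# Enumerate all 2^3 joint values of the three coins and their images.
states = list(product((0, 1), repeat=3))
images = [controlled_swap(s) for s in states]

# Reversible: no two inputs map to the same output.
is_reversible = len(set(images)) == len(states)

# Heads-conserving: the number of 1s is the same before and after.
conserves_heads = all(sum(s) == sum(t) for s, t in zip(states, images))

# Applying the transformation twice returns the original state.
is_self_inverse = all(controlled_swap(controlled_swap(s)) == s for s in states)

print(is_reversible, conserves_heads, is_self_inverse)  # True True True
```

The same enumeration works for any candidate transformation on a handful of coins, which makes it a cheap way to rule out transformations that quietly create heads or merge states.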

One more key rule: while we will be able to choose what transformation to apply, **we do not get to look at the coins before choosing our transformation**. Physical analogy: if we’re building a heat engine or refrigerator or the like, we can’t just freely observe the microscopic state of the system. More generally, if we’re designing some machine (like a heat engine), we have to decide up-front how the machine will behave, before we have perfect information about the environment in which it will run. The machine itself can “observe” variables while running, but the machine is part of the system, so those “observations” need to be reversible and energy-conserving just like any other transformations.

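Writing it all out mathematically: we choose some transformation $f$, mapping the old coin values $X$ to new values $X' = f(X)$, for which

- $f$ is invertible
- $\sum_i f_i(X) = \sum_i X_i$ for every $X$ (heads are conserved)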

We’ll want to choose this to do something interesting, like reduce the uncertainty of particular coins.

## Extracting “Work”

General problem: choose a transformation to produce some coins which are 1 with near-zero uncertainty (i.e. asymptotically zero uncertainty). We’ll call these deterministic coins “work”, and use $w$ to denote the number of work-coins produced.

We’ll look at two subproblems of this problem. First, we’ll try to do it using just *one* of the two pools of coins (the hot one, though it doesn’t matter). This is the equivalent of “turning heat directly into work”, i.e. a __type-2 perpetual motion machine__; we’d expect it to be impossible. Second, we’ll tackle the problem using both pools, and figure out how much work we can extract. This is the equivalent of a heat engine.

## Extracting Work From One Heat Bath

The first key thing to notice is that this is inherently an information compression problem. I have $n$ random coins with heads-probability 0.2. I want to make $w$ of those coins near-certainly 1, while still making the transformation reversible - therefore the remaining transformed coins must contain *all* of the information from the original coins. In other words, I need to compress the info from the original $n$ coins into $n - w$ bits with near-certainty.

If we whip out our information theory, that compression is fairly straightforward. Our biased coins have entropy of $-(0.2 \log_2 0.2 + 0.8 \log_2 0.8) \approx 0.72$ bits per coin. So, with a reversible transformation we can compress all of the info into roughly $0.72n$ of the coins, and the remaining $0.28n$ coins can all be nearly-deterministic.

(We’re fudging a bit here - we may need to add one or two extra coins from outside to make the compression algorithm handle unlikely cases without loss - but for current purposes that’s not a big deal. I’ll be fudging this sort of thing throughout the post.)

However, we *also* need to conserve the number of heads. That’s a problem: fully compressed bits are 50/50 in general, so our $0.72n$ compressed bits include roughly $0.36n$ heads. We started with only $0.2n$ heads, so we have no way to balance the books - even if all of our deterministic bits are tails, we still end up with too many heads and too few tails.
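The bookkeeping here is short enough to check directly. This is a minimal sketch of the arithmetic only, assuming the hot pool's heads-probability of 0.2 from the setup; the helper name `binary_entropy` is my own.

```python
from math import log2

def binary_entropy(p):
    """Entropy in bits of a coin that comes up heads with probability p."""
    return -(p * log2(p) + (1 - p) * log2(1 - p))

p_hot = 0.2

# Bits of information per coin that must survive the compression.
bits_per_coin = binary_entropy(p_hot)

# Fully compressed bits are 50/50, so each bit costs half a head on average.
heads_needed_per_coin = bits_per_coin / 2

# Heads actually available per coin in the pool.
heads_available_per_coin = p_hot

print(round(bits_per_coin, 3), round(heads_needed_per_coin, 3), heads_available_per_coin)
# 0.722 0.361 0.2
```

Full compression would need about 0.36 heads per original coin, while the pool supplies only 0.2, which is the imbalance described above.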

This generalizes: we won’t be able to compress our information without producing more heads. Hand-wavy proof: the initial distribution of coins is maxentropic subject to a constraint on the total number of heads. So, we can’t compress it without violating that constraint.

Let’s spell this out a bit more carefully.

A maxentropic variable contains as much information as possible - there is no other distribution over the same outcomes with higher entropy. In general, mutual information is at most the entropy of one variable, $I(X; Y) \le H(X)$ - i.e. the information in $X$ about $Y$ is at most all of the information in $X$, so the higher the entropy $H(X)$, the more information $X$ can potentially contain about any other variable $Y$.

In our case, we have an initial state $X$ and a final state $X'$. We want to compress all the info in $X$ into $X'$, so we must have $I(X; X') = H(X)$. Initial state $X$ is maxentropic: its possible outcomes are all values of $n$ coin flips with a fixed number of heads, and $X$ has the highest possible entropy over those outcomes. Final state $X'$ we *choose* to be maxentropic - we need $I(X; X') = H(X)$, and $I(X; X') \le H(X')$, so we make $H(X')$ as large as possible. However, note that the possible outcomes of $X'$ are a strict subset of the possible outcomes of $X$: possible outcomes of $X'$ are all values of $n$ coin flips with a fixed number of heads AND the first $w$ coins are all heads. So, we choose $X'$ to be maxentropic on this set of outcomes, but it’s a strictly smaller set of outcomes than for $X$, so the maximum achievable entropy $H(X')$ will be less than $H(X)$. Thus: our condition $I(X; X') = H(X)$ cannot be achieved.

We cannot extract deterministic bits (i.e. work) from a single pool of maxentropic-subject-to-constraint random bits (i.e. heat), while still respecting the constraint.
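The counting behind this argument can be made concrete. The sketch below treats the number of heads as exactly fixed, as the argument above does, and compares the number of possible outcomes with and without the requirement that some coins are deterministic heads. The sizes `n`, `k`, and `w` are illustrative assumptions, chosen only to be small enough to compute exactly.

```python
from math import comb, log2

n = 1000  # coins in the pool (illustrative)
k = 200   # fixed total number of heads (illustrative, matches probability 0.2)
w = 10    # coins we want to force to be heads, i.e. "work" (illustrative)

# Outcomes of n coins with exactly k heads.
outcomes_before = comb(n, k)

# Outcomes with exactly k heads AND the first w coins all heads:
# the other n - w coins must hold the remaining k - w heads.
outcomes_after = comb(n - w, k - w)

# The maximum entropy over a finite set is log2 of its size (uniform distribution).
max_entropy_before = log2(outcomes_before)
max_entropy_after = log2(outcomes_after)

print(max_entropy_before > max_entropy_after)  # True
print(max_entropy_before - max_entropy_after)  # bits of capacity lost by fixing w coins
```

Because the second set is a strict subset of the first, the inequality holds for any choice of sizes with `0 < w <= k < n`, not just the ones used here.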

Even more generally: if we have a pool of random variables which are maxentropic subject to some constraint, we won’t be able to compress them without violating that constraint. If the constraint fixes a value of some function $g(X_1, \dots, X_n)$ of the variables, and we want to deterministically fix some of the variables, say $X_1, \dots, X_w$, then that reduces the number of possible values of the remaining variables $X_{w+1}, \dots, X_n$, and therefore reduces the amount of information which the remaining variables can contain. Since they didn’t have any “spare” entropy before (i.e. initial state is maxentropic subject to the constraint), we won’t be able to “fit” all the information into the remaining entropy.

That’s a very general analogue of the idea that we can’t extract work from a single-temperature heat bath. How about two heat baths?

## Extracting Work From Two Heat Baths

Now we have $2n$ coins to play with: $n$ with heads-probability 0.1, and $n$ with heads-probability 0.2. The entropy is roughly 0.72 bits per “hot” coin, and 0.47 bits per “cold” coin. So, we’d need about $1.19n$ coins with a roughly 50/50 mix of heads and tails to contain all the info. That’s still too many heads: full compression would require roughly $0.6n$ heads, and we only have about $0.3n$. But our initial distribution is no longer maxentropic given the overall constraint, so maybe it could work if we only partially compress the information?

Let’s set up the problem more explicitly, to maximize the work we can extract.

Our final distribution will contain $w$ deterministic bits and $2n - w$ information-containing bits. The information-containing bits must contain a total of $0.3n - w$ heads. In order to contain as much information as possible, the final distribution of those bits should be maxentropic subject to the constraint on the number of heads. So, they should be *roughly* (remember, large $n$) IID with probability $\frac{0.3n - w}{2n - w}$ of heads, with total entropy $(2n - w)\, h\!\left(\frac{0.3n - w}{2n - w}\right)$, where $h(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$ is the entropy of a single coin with heads-probability $p$. We set that equal to the amount of entropy we need (i.e. about $1.19n$ bits), and solve for $w$. In this case, I find $w \approx 0.011n$. Since we started with about $0.3n$ heads, we’re able to extract about 3.7% of them as “work” (or 5.5% of the “hot” heads).
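The equation for the amount of work has no tidy closed form, but it is easy to solve numerically. Here is a minimal sketch using bisection, working per coin so that `omega` stands for the work extracted divided by the number of coins in each pool. The function names and the bisection approach are my own choices, and the large-pool approximation (treating the information-containing coins as IID) is assumed throughout.

```python
from math import log2

P_COLD, P_HOT = 0.1, 0.2

def binary_entropy(p):
    """Entropy in bits of a coin with heads-probability p (0 at the endpoints)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

# Entropy we must preserve, per coin in each pool (one hot coin plus one cold coin).
REQUIRED_BITS = binary_entropy(P_COLD) + binary_entropy(P_HOT)

def entropy_capacity(omega):
    """Max entropy of the non-work coins, per pool coin, after extracting omega work."""
    coins = 2.0 - omega               # information-containing coins
    heads = (P_COLD + P_HOT) - omega  # heads left over for those coins
    return coins * binary_entropy(heads / coins)

def solve_work_fraction(tol=1e-12):
    """Bisect for the largest omega whose capacity still covers REQUIRED_BITS."""
    lo, hi = 0.0, P_COLD + P_HOT      # capacity decreases from lo to hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if entropy_capacity(mid) >= REQUIRED_BITS:
            lo = mid
        else:
            hi = mid
    return lo

omega = solve_work_fraction()
print(omega)                     # work per pool coin
print(omega / (P_COLD + P_HOT))  # fraction of all heads extracted as work
print(omega / P_HOT)             # fraction of the "hot" heads extracted as work
```

With unrounded entropies the solution comes out at roughly 0.01 work-coins per pool coin, in the same range as the figures quoted above; the last digit or two depends on how the per-coin entropies are rounded before solving.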

So we can indeed extract work from two heat baths at different temperatures.

Notably, the “efficiency” we calculated is *not* the usual theoretical optimal efficiency from thermodynamics. That “optimal efficiency” comes from a slightly different problem - rather than converting *all* our bits into as much work as possible, that problem considers the optimal conversion of random bits into work *at the margin*, assuming our heat baths don’t run out. In particular, that means we usually wouldn’t be using equal numbers of bits from the hot and cold pools.

This post is already plenty long, so I’ll save further discussion of thermodynamic efficiency and temperatures for another day.

## Takeaway

The point of this exercise is to cast core ideas of statistical mechanics - especially the more thermo-esque ideas - in terms which are easier to generalize beyond physics. To that end, the key ideas are:

- Thermo-like laws apply when we can't gain information about a system (e.g. because we're designing a machine to operate in an environment which we can't observe directly at design time), can't lose information about a system at a low level (either due to physical reversibility constraints or because we don't want to throw out info), and the system has some other constraints (like energy conservation).
- We can operate on the system in ways which move uncertainty around, without decreasing it.
- If we want to move uncertainty around in a way which makes certain variables nearly deterministic (i.e. "extract work"), that's a compression problem.
- We can't compress a maxentropic distribution, so we can't extract work from a single maxentropic-subject-to-constraint pool of variables without violating the constraint.
- We can extract work from two pools of variables which are initially maxentropic under different constraints, while still respecting the full-system constraint.

*The follow-up post on thermodynamic efficiency and temperatures is **here**.*

This is absolutely beautiful. Bravo.

This is super interesting!

Quick typo note (unless I'm really misreading something): in your setups, you refer to coins that are biased towards tails, but in your analyses, you talk about the coins as though they are biased towards heads.

Oh shit, that's actually a pretty serious oversight. I'm effectively missing negative signs all over the place. Thanks for catching it!

Fixed now, and it did change the numbers.

There's something basic about thermo that continues to elude me: what exactly does the reversibility criterion buy us?

I am trying to get my head around maximum caliber, which isn't about heat engines but because it is another one of Jaynes' ideas there is a lot of discussion about the statistical mechanics intuitions and why they do or do not apply. One entry in the macroscopic prediction paper rejects reversibility on the grounds that knowledge of the microstates may not be available, only that of the macrostate.

The reversibility of transformations here mostly seems in service of the engine being an ideal and general example for reasoning purposes; is that correct, or does it provide some other benefit I am missing?

To the best of my current understanding, (microscopic) reversibility is crucial to get something which *looks like* classical thermodynamics - i.e. second law, thermal efficiency limit, etc. Without reversibility, we could still apply similar reasoning and get analogous results, but there would be extra steps and the end result would look qualitatively different. Roughly speaking, we'd need to separate out the steps which reduce the number of microstates from the steps which move around our uncertainty about the microstate.

So, the assumption here is in service of reproducing classical thermodynamics, which is in turn a way to test that I'm setting things up right before moving on to more general applications.