I’d like to be able to apply more of the tools of statistical mechanics and thermodynamics outside the context of physics. For some pieces, that’s pretty straightforward - a large chunk of statistical mechanics is just information theory, and that’s already a flourishing standalone field which formulates things in general ways. But for other pieces, it’s less obvious. What’s the analogue of a refrigerator or a Carnot cycle in more general problems? How do “work” and “heat” generalize to problems outside physics? The principle of maximum entropy tells us how to generalize temperature, and offers one generalization of work and heat, but it’s not immediately obvious why we can’t extract “work” from “heat” without subsystems at different temperatures, or how to turn that into a useful idea in non-physics applications.

This post documents my own exploration of these questions in the context of a relatively simple problem, with minimal reference to physics (other than by analogy). Specifically: we’ll talk about how to construct the analogue of a heat engine using biased coins.


The main idea I want to generalize here is that we can “move uncertainty around” without reducing uncertainty. This is exactly what e.g. a refrigerator or heat engine does.

Consider the viewpoint of a refrigerator-designer. All the microscopic dynamics of the (fridge + environment) system must be reversible, so the number of possible microscopic states will never decrease on its own as time passes. The only way to reduce uncertainty about the microscopic state is to observe it. But the fridge designer is designing the system, deciding in advance how it will behave. The designer has no direct access to the environment in which the fridge will run, no way to measure the exact positions the atoms will be in when the fridge first turns on. The designer, in short, cannot directly observe the system. So, from the designer’s perspective, there’s uncertainty which cannot be reduced.

(In statistical mechanics, there are several entirely different justifications for why observations can’t reduce microscopic uncertainty/entropy - for instance, in one approach, macroscopic variables are chosen in such a way that we can deterministically predict future macroscopic observations. Another comes from Maxwell’s demon-style arguments, where the demon’s memory has to be included as part of the system. I’ll use the designer viewpoint, since it’s conceptually simple and easy to apply in other areas - in particular, we can easily apply it to the design of AIs embedded in their environment.)

While we can’t reduce our total uncertainty, we can move it around. We design the machine to apply transformations to the system which leave us more certain about some subsystems (e.g. the inside of the refrigerator), but less certain about other subsystems (e.g. heat baths used to power the system).


We’ll imagine two large sets of IID biased coins. One is the “cold pool”, in which each coin comes up 1 (i.e. heads) with probability 0.1 and 0 with probability 0.9. The other is the “hot pool”, in which each coin comes up 1 with probability 0.2. We’ll call the coins in the cold pool $X^C_1, \dots, X^C_n$, and the coins in the hot pool $X^H_1, \dots, X^H_n$.

We’re going to apply transformations to these coins. Each transformation replaces some set of coins with new values which are a function of their old values. For instance, one transformation on the cold coin $X^C_1$ and the hot coins $X^H_1$, $X^H_2$ might be

$(X^C_1,\ X^H_1,\ X^H_2) \to (X^C_1,\ \bar{X}^C_1 X^H_2 + X^C_1 X^H_1,\ \bar{X}^C_1 X^H_1 + X^C_1 X^H_2)$

(Here the bar denotes logical not - i.e. $\bar{X}$ means "not X".) This transformation swaps $X^H_1$ with $X^H_2$ if $X^C_1$ is 0, and leaves everything unchanged if $X^C_1$ is 1.

We’ll mostly be able to use any transformations we want, but with two big constraints. First: all transformations must be reversible. If we know the final state of the coins and which transformations were applied, then we must be able to reconstruct the initial state of the coins. (This is the analogue of microscopic reversibility.) Our example transformation above is reversible - since it doesn’t change $X^C_1$, we can always tell whether $X^H_1$ and $X^H_2$ were swapped, and we can swap them back if they were (indeed, we can do so by simply reapplying the same transformation).

Second constraint: all transformations must conserve the number of heads; heads can be neither created nor destroyed on net. Here the number of heads is our analogue of energy, and heads-conservation is our analogue of microscopic energy conservation. (In physics, we’d probably describe this as some kind of spin system in an external magnetic field.) Our example transformation above conserves the number of heads: it either swaps two coins or leaves everything alone, so the total number of heads stays the same.
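Both constraints can be checked by brute force for a small transformation. The sketch below assumes a controlled swap on three coins (one control coin and two target coins, swapped when the control is 0); the function name `controlled_swap` and the tuple layout are illustrative choices, not notation from the post.

```python
from itertools import product

def controlled_swap(state):
    """Swap the two target coins when the control coin is 0; otherwise do nothing."""
    control, a, b = state
    return (control, b, a) if control == 0 else (control, a, b)

# All 8 possible joint values of the three coins.
states = list(product([0, 1], repeat=3))
images = [controlled_swap(s) for s in states]

# Reversible: the map is a bijection on the set of states.
is_reversible = sorted(images) == sorted(states)
# Self-inverse: applying it twice returns the original state.
is_self_inverse = all(controlled_swap(controlled_swap(s)) == s for s in states)
# Heads-conserving: the number of 1s never changes.
conserves_heads = all(sum(controlled_swap(s)) == sum(s) for s in states)

print(is_reversible, is_self_inverse, conserves_heads)
```

The same exhaustive check works for any candidate transformation on a handful of coins, though the state space doubles with each coin added.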

One more key rule: while we will be able to choose what transformation to apply, we do not get to look at the coins before choosing our transformation. Physical analogy: if we’re building a heat engine or refrigerator or the like, we can’t just freely observe the microscopic state of the system. More generally, if we’re designing some machine (like a heat engine), we have to decide up-front how the machine will behave, before we have perfect information about the environment in which it will run. The machine itself can “observe” variables while running, but the machine is part of the system, so those “observations” need to be reversible and energy-conserving just like any other transformations.

Writing it all out mathematically: we choose some transformation $f$ for which

  • $f$ is invertible
  • $\sum_i f_i(X) = \sum_i X_i$ for every possible $X$ (heads are conserved)

We’ll want to choose this $f$ to do something interesting, like reduce the uncertainty of particular coins.

Extracting “Work”

General problem: choose a transformation to produce some coins which are 1 with near-zero uncertainty (i.e. asymptotically zero uncertainty). We’ll call these deterministic coins “work”, and use $w$ to denote the number of work-coins produced.

We’ll look at two subproblems to this problem. First, we’ll try to do it using just one of the two pools of coins (the hot one, though it doesn’t matter). This is the equivalent of “turning heat directly into work”, i.e. a type-2 perpetual motion machine; we’d expect it to be impossible. Second, we’ll tackle the problem using both pools, and figure out how much work we can extract. This is the equivalent of a heat engine.

Extracting Work From One Heat Bath

The first key thing to notice is that this is inherently an information compression problem. I have $n$ random coins with heads-probability 0.2. I want to make $w$ of those coins near-certainly 1, while still making the transformation reversible - therefore the remaining $n - w$ transformed coins must contain all of the information from the original $n$ coins. In other words, I need to compress the info from the original $n$ coins into $n - w$ bits with near-certainty.

If we whip out our information theory, that compression is fairly straightforward. Our biased coins have entropy of $-(0.2 \log_2 0.2 + 0.8 \log_2 0.8) \approx 0.72$ bits per coin. So, with a reversible transformation we can compress all of the info into $0.72n$ of the coins, and the remaining $0.28n$ coins can all be nearly-deterministic.

(We’re fudging a bit here - we may need to add one or two extra coins from outside to make the compression algorithm handle unlikely cases without loss - but for current purposes that’s not a big deal. I’ll be fudging this sort of thing throughout the post.)

However, we also need to conserve the number of heads. That’s a problem: fully compressed bits are 50/50 in general, so our $0.72n$ compressed bits include roughly $0.36n$ heads. We started with only $0.2n$ heads, so we have no way to balance the books - even if all of our $0.28n$ deterministic bits are tails, we still end up with too many heads and too few tails.
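The bookkeeping above is quick to reproduce. This is a minimal sketch, working per original coin (so every quantity is implicitly multiplied by the number of coins); the helper name `binary_entropy` is an illustrative choice.

```python
from math import log2

def binary_entropy(p):
    """Entropy in bits of a coin with heads-probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

p_hot = 0.2
bits_per_coin = binary_entropy(p_hot)   # fraction of coins still needed after full compression
heads_needed = 0.5 * bits_per_coin      # fully compressed bits are roughly 50/50
heads_available = p_hot                 # expected heads per original coin

print(f"entropy per hot coin:   {bits_per_coin:.3f} bits")
print(f"heads needed per coin:  {heads_needed:.3f}")
print(f"heads available:        {heads_available:.3f}")
```

The heads needed exceed the heads available, which is the imbalance the paragraph describes.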

This generalizes: we won’t be able to compress our information without producing more heads. Hand-wavy proof: the initial distribution of coins is maxentropic subject to a constraint on the total number of heads. So, we can’t compress it without violating that constraint.

Let’s spell this out a bit more carefully.

A maxentropic variable contains as much information as possible - there is no other distribution over the same outcomes with higher entropy. In general, mutual information $I(X; Y)$ is at most the entropy of one variable $H(X)$ - i.e. the information in $X$ about $Y$ is at most all of the information in $X$, so the higher the entropy $H(X)$ the more information $X$ can potentially contain about any other variable $Y$.

In our case, we have an initial state $X = (X_1, \dots, X_n)$ and a final state $X' = (X'_1, \dots, X'_n)$. We want to compress all the info in $X$ into $X'$, so we must have $I(X; X') = H(X)$. Initial state $X$ is maxentropic: its possible outcomes are all values of $n$ coin flips with a fixed number of heads, and $X$ has the highest possible $H(X)$ over those outcomes. Final state $X'$ we choose to be maxentropic - we need $I(X; X') = H(X)$, and $I(X; X')$ is at most $H(X')$, so we make $H(X')$ as large as possible. However, note that the possible outcomes of $X'$ are a strict subset of the possible outcomes of $X$: possible outcomes of $X'$ are all values of $n$ coin flips with a fixed number of heads AND the first $w$ coins are all heads. So, we choose $X'$ to be maxentropic on this set of outcomes, but it’s a strictly smaller set of outcomes than for $X$, so the maximum achievable entropy $H(X')$ will be less than $H(X)$. Thus: our condition $I(X; X') = H(X)$ cannot be achieved.
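The strict-subset argument can be made concrete by counting outcomes. The sketch below treats the heads-conservation constraint as an exact count and compares the log-number of coin sequences with a fixed number of heads against the log-number of those sequences whose first `w` coins are all heads. The pool size and head count are illustrative values, not figures from the post.

```python
from math import comb, log2

def log2_outcomes(n, k, w=0):
    """log2 of the number of length-n sequences with exactly k heads
    whose first w coins are all heads (w=0 means no coins are forced)."""
    # Forcing the first w coins to heads leaves n-w free coins carrying k-w heads.
    return log2(comb(n - w, k - w))

n, k = 1000, 200  # hypothetical pool: 1000 coins, exactly 200 heads
capacities = {w: log2_outcomes(n, k, w) for w in (0, 1, 10, 50)}

for w, bits in capacities.items():
    print(f"w = {w:3d}: at most {bits:.1f} bits of entropy available")
```

The available entropy falls strictly as `w` grows, so even a single forced coin leaves too little room to hold everything the uniform distribution over the unconstrained set contained.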

We cannot extract deterministic bits (i.e. work) from a single pool of maxentropic-subject-to-constraint random bits (i.e. heat), while still respecting the constraint.

Even more generally: if we have a pool of random variables which are maxentropic subject to some constraint, we won’t be able to compress them without violating that constraint. If the constraint fixes a value of $g(X_1, \dots, X_n)$, and we want to deterministically fix $X_1$, then that reduces the number of possible values of $(X_2, \dots, X_n)$, and therefore reduces the amount of information which the remaining variables can contain. Since they didn’t have any “spare” entropy before (i.e. initial state is maxentropic subject to the constraint), we won’t be able to “fit” all the information into the remaining entropy.

That’s a very general analogue of the idea that we can’t extract work from a single-temperature heat bath. How about two heat baths?

Extracting Work From Two Heat Baths

Now we have $2n$ coins to play with: $n$ with probability 0.1, and $n$ with probability 0.2. The entropy is roughly 0.72 bits per “hot” coin, and 0.47 bits per “cold” coin. So, we’d need $(0.72 + 0.47)n \approx 1.19n$ coins with a roughly 50/50 mix of heads and tails to contain all the info. That’s still too many heads: full compression would require roughly $0.6n$ heads, and we only have about $0.3n$. But our initial distribution is no longer maxentropic given the overall constraint, so maybe it could work if we only partially compress the information?

Let’s set up the problem more explicitly, to maximize the work we can extract.

Our final distribution will contain $w$ deterministic bits and $2n - w$ information-containing bits. The information-containing bits must contain a total of $0.3n - w$ heads. In order to contain as much information as possible, the final distribution of those $2n - w$ bits should be maxentropic subject to the constraint on the number of heads. So, they should be roughly (remember, large $n$) IID with probability $p = \frac{0.3n - w}{2n - w}$ of heads, with total entropy $-(2n - w)\left(p \log_2 p + (1 - p) \log_2 (1 - p)\right)$. We set that equal to the amount of entropy we need (i.e. $1.19n$ bits), and solve for $w$. In this case, I find $w \approx 0.011n$. Since we started with about $0.3n$ heads, we’re able to extract about 3.7% of them as “work” (or 5.5% of the “hot” heads).
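The equation for the work can be solved numerically. This is a minimal sketch, assuming everything is measured per $n$ (so `omega` stands for $w/n$) and using plain bisection; the function names are illustrative. With unrounded entropies the solver lands a little below the roughly 3.7% quoted above, presumably because that figure rests on rounded inputs.

```python
from math import log2

def binary_entropy(p):
    """Entropy in bits of a coin with heads-probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def capacity(omega, heads=0.3, coins=2.0):
    """Max entropy (bits per n) of the non-work coins when w = omega * n."""
    remaining_coins = coins - omega
    remaining_heads = heads - omega   # every work coin uses up one head
    return remaining_coins * binary_entropy(remaining_heads / remaining_coins)

# Entropy that must be preserved: one cold coin plus one hot coin, per n.
required = binary_entropy(0.1) + binary_entropy(0.2)

# capacity() decreases as omega grows on [0, 0.3], so bisect for capacity == required.
lo, hi = 0.0, 0.3
for _ in range(60):
    mid = (lo + hi) / 2
    if capacity(mid) > required:
        lo = mid
    else:
        hi = mid
omega = (lo + hi) / 2

print(f"w/n = {omega:.4f}")
print(f"fraction of all heads extracted: {omega / 0.3:.3f}")
print(f"fraction of hot heads extracted: {omega / 0.2:.3f}")
```

Setting both pools to the same bias (for example `required = 2 * binary_entropy(0.15)`) drives `omega` to zero, matching the single-bath result.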

So we can indeed extract work from two heat baths at different temperatures.

Notably, the “efficiency” we calculated is not the usual theoretical optimal efficiency from thermodynamics. That “optimal efficiency” comes from a slightly different problem - rather than converting all our bits into as much work as possible, that problem considers the optimal conversion of random bits into work at the margin, assuming our heat baths don’t run out. In particular, that means we usually wouldn’t be using equal numbers of bits from the hot and cold pools.

This post is already plenty long, so I’ll save further discussion of thermodynamic efficiency and temperatures for another day.


The point of this exercise is to cast core ideas of statistical mechanics - especially the more thermo-esque ideas - in terms which are easier to generalize beyond physics. To that end, the key ideas are:

  • Thermo-like laws apply when we can't gain information about a system (e.g. because we're designing a machine to operate in an environment which we can't observe directly at design time), can't lose information about a system at a low level (either due to physical reversibility constraints or because we don't want to throw out info), and the system has some other constraints (like energy conservation).
  • We can operate on the system in ways which move uncertainty around, without decreasing it.
  • If we want to move uncertainty around in a way which makes certain variables nearly deterministic (i.e. "extract work"), that's a compression problem.
  • We can't compress a maxentropic distribution, so we can't extract work from a single maxentropic-subject-to-constraint pool of variables without violating the constraint.
  • We can extract work from two pools of variables which are initially maxentropic under different constraints, while still respecting the full-system constraint.

The follow-up post on thermodynamic efficiency and temperatures is here.


This is absolutely beautiful. Bravo.

This is super interesting!

Quick typo note (unless I'm really misreading something): in your setups, you refer to coins that are biased towards tails, but in your analyses, you talk about the coins as though they are biased towards heads.

One is the “cold pool”, in which each coin comes up 1 (i.e. heads) with probability 0.1 and 0 with probability 0.9. The other is the “hot pool”, in which each coin comes up 1 with probability 0.2

$n$ random coins with heads-probability 0.2

We started with only  tails

full compression would require roughly  tails, and we only have about 

Oh shit, that's actually a pretty serious oversight. I'm effectively missing negative signs all over the place. Thanks for catching it!

Fixed now, and it did change the numbers.

There's something basic about thermo that continues to elude me: what exactly does the reversibility criterion buy us?

I am trying to get my head around maximum caliber, which isn't about heat engines but because it is another one of Jaynes' ideas there is a lot of discussion about the statistical mechanics intuitions and why they do or do not apply. One entry in the macroscopic prediction paper rejects reversibility on the grounds that knowledge of the microstates may not be available, only that of the macrostate.

The reversibility of transformations here mostly seems in service of the engine being an ideal and general example for reasoning purposes; is that correct, or does it provide some other benefit I am missing?

To the best of my current understanding, (microscopic) reversibility is crucial to get something which looks like classical thermodynamics - i.e. second law, thermal efficiency limit, etc. Without reversibility, we could still apply similar reasoning and get analogous results, but there would be extra steps and the end result would look qualitatively different. Roughly speaking, we'd need to separate out the steps which reduce the number of microstates from the steps which move around our uncertainty about the microstate.

So, the assumption here is in service of reproducing classical thermodynamics, which is in turn a way to test that I'm setting things up right before moving on to more general applications.

Reversible/Carnot cycles in heat engines are a theoretical model that describes a system with perfect efficiency within each of the cycles. The Carnot heat engine is a model used in Thermodynamics 1 to introduce heat engines to students. The point of this is to allow students to focus on the four constituent cycles of the heat engine without worrying about tracking inefficiencies. It is, of course, impossible to design a heat engine that operates at perfect efficiency with perfect reversibility.


You are correct in your assumption that the Carnot cycle is just a distillation of the core principles of heat engines. Because of this, the Carnot model also helps at a higher level by helping students understand that:

  1. The efficiency of a reversible heat engine is always greater than that of an irreversible one
  2. Any reversible heat engines operating on the same two reservoirs have the same efficiency

Violating either of these statements violates the second law: order cannot be restored to a system, so the only possible movement is an increase in disorder and a subsequent lowering of efficiency.

Is this the best thing you wrote?

It's certainly the most technically beautiful thing.

nit: "This transformation swaps $X^H_1$ with $X^H_2$ if $X^C_1$ is 1, and leaves everything unchanged if $X^C_1$ is 0."

I think it actually swaps if $X^C_1$ is 0 and leaves unchanged if it's 1.


nit: "swaps to coins"

missing 'w'

Fixed, thanks.
