Why Are Maximum Entropy Distributions So Ubiquitous?


In practice, smoothness interacts with measurement: we can usually measure the higher-order bits without measuring lower-order bits, but we can’t easily measure the lower-order bits without the higher-order bits. Imagine, for instance, trying to design a thermometer which measures the fifth bit of temperature but not the four highest-order bits. Probably we’d build a thermometer which measured them all, and then threw away the first four bits! Fundamentally, it’s because of the informational asymmetry: higher-order bits affect everything, but lower-order bits mostly don’t affect higher-order bits much, so long as our functions are smooth. So, measurement in general will favor higher-order bits.

There are examples of measuring lower-order bits without measuring higher-order bits. If something is valuable to measure, there's a good chance that someone has figured out a way to measure it. Here is the most common example of this that I am familiar with:

When dealing with lasers, it is often useful to pass the laser through a beam splitter, so part of the beam travels along one path and part of the beam travels along a different path. These two beams are often brought back together later. The combination might have either constructive or destructive interference. It has constructive interference if the difference in path lengths is an integer multiple of the wavelength, and destructive interference if the difference is a half-integer multiple of the wavelength. This allows you to measure changes in the difference in path lengths without knowing how many wavelengths long either path is.

One place this is used is in LIGO. LIGO is an interferometer with two multi-kilometer arms. It measures extremely small ($10^{-19}$ m) changes in the difference between the two arm lengths caused by passing gravitational waves.
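As a rough illustration of the interferometer point above, here is a minimal sketch of the textbook two-beam interference formula, assuming equal-amplitude, perfectly coherent beams. The function name and the 1064 nm wavelength are illustrative assumptions, not taken from the comment.

```python
import math

def recombined_intensity(path_difference_m, wavelength_m, input_intensity=1.0):
    """Intensity at one output of an idealized two-beam interferometer.

    Assumes equal-amplitude, perfectly coherent beams (an idealization).
    """
    phase = 2.0 * math.pi * path_difference_m / wavelength_m
    return 0.5 * input_intensity * (1.0 + math.cos(phase))

WAVELENGTH = 1064e-9  # metres; illustrative infrared wavelength (assumption)

small_offset = 0.1 * WAVELENGTH
base = recombined_intensity(small_offset, WAVELENGTH)
# Adding a whole number of wavelengths leaves the output unchanged:
# the "high-order bits" of the path difference are invisible.
shifted = recombined_intensity(small_offset + 1000 * WAVELENGTH, WAVELENGTH)

print(base, shifted)
```

The output depends only on the path difference modulo one wavelength, which is why the measurement sees the low-order part of the length without the high-order part.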

Every distribution (that agrees with the base measure about null sets) is a Boltzmann distribution. Simply define $E(x) := -\log p(x)$, where $p$ is the density relative to the base measure, and presto, $p(x) = e^{-E(x)}$.

This is a very useful/important/underrated fact, but it does somewhat trivialize “Boltzmann” and “maximum entropy” as *classes of distributions*, rather than as certain *ways of looking at* distributions.
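A minimal sketch of this observation for a finite distribution, assuming units where $k_B = 1$ and that every outcome has nonzero probability. The function names are hypothetical.

```python
import math

def energies_from_distribution(probs, temperature=1.0):
    """Define E(x) = -T * log p(x); requires every p(x) > 0."""
    return [-temperature * math.log(p) for p in probs]

def boltzmann(energies, temperature=1.0):
    """Normalized Boltzmann weights exp(-E / T) (units with k_B = 1)."""
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

arbitrary = [0.5, 0.2, 0.2, 0.07, 0.03]
recovered = boltzmann(energies_from_distribution(arbitrary))

print(recovered)
```

Any choice of temperature works, as long as the same value is used to define the energies and to evaluate the Boltzmann weights.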

A related important fact is that temperature is not really a physical quantity, but $1/(k_B T)$ is: it’s known as inverse temperature or $\beta$. (The nonexistence of zero-temperature systems, the existence of negative-temperature systems, and the fact that negative-temperature systems intuitively seem extremely high energy bear this out.)

I am a little confused about this. It was my understanding that exponential families are a distinguished class of families of distributions. For instance, they are regular (rather than singular).

The family of mixed Gaussians is not an exponential family I believe.

So my conclusion would be that while "being Boltzmann" for a distribution is trivial as you point out, "being Boltzmann" (= exponential) for a family is nontrivial.

the Boltzmann distribution is the maximum entropy distribution subject to a constraint

E for both expectation and energy doesn't lend itself to fast reading. is sometimes standard for expectation

The energy at any later time should be exactly equal to the initial energy, not merely equal on average. So why is the Boltzmann distribution subject to only a constraint on expected energy?

I mean, that's just the distinction between microcanonical and canonical ensemble, no? In the canonical ensemble, there is some outside thermal bath with which the system can exchange energy. In the microcanonical one, there is indeed no exchange of energy, and you get exactly a distribution constrained on total energy.

If we measure the distribution of particle velocities in a thin gas, we’ll find that they’re roughly normally distributed. Specifically, the probability density of velocity $v$ will be proportional to $e^{-\frac{1}{2}mv^2/(k_B T)}$ - or, written differently, $e^{-E(v)/(k_B T)}$, where $E(v)$ is the kinetic energy of a particle of the gas with velocity $v$, $T$ is temperature, and $k_B$ is Boltzmann’s constant. The latter form, $e^{-E/(k_B T)}$, generalizes even beyond thin gasses - indeed, it generalizes even to solids, fluids, and plasmas. It applies to the concentrations of chemical species in equilibrium solutions, or the concentrations of ions around an electrode. It applies to light emitted from hot objects. Roughly speaking, it applies to microscopic states in basically any physical system in thermal equilibrium where quantum effects aren't significant.

It’s called the Boltzmann distribution; it’s a common sub-case of a more general class of relatively-elegant distributions called maximum entropy distributions.

Even more generally, maximum entropy distributions show up remarkably often. The normal distribution is another good example: you might think of normal distributions mostly showing up when we add up lots of independent things (thanks to the Central Limit Theorem), but then what about particle velocities in a gas? Sure, there’s *conceptually* lots of little things combining together to produce gas particle velocities, but it’s not *literally* a bunch of numbers adding together; the Central Limit Theorem doesn’t directly apply. Point is: normal distributions show up surprisingly often, even when we’re not adding together lots of numbers.

Same story with lots of other maximum entropy distributions - Poisson, geometric/exponential, uniform, Dirichlet… most of the usual named distributions in a statistical library are either maximum entropy distributions or near relatives. Like the normal distribution, they show up surprisingly often.

What’s up with that? Why this particular class of distributions?

If you have a Bayesian background, there’s kind of a puzzle here. Usually we think of probability distributions as epistemic states, descriptions of our own uncertainty. Probabilities live “in the mind”. But here we have a class of distributions which are out there “in the territory”: we look at the energies of individual particles in a gas or plasma or whatever, and find that they have not just any distribution, but a relatively “nice” distribution, something simple. Why? What makes a distribution like that appear, not just in our own models, but out in the territory?

## What Exactly Is A Maximum Entropy Distribution?

Before we dive into *why* maximum entropy distributions are so ubiquitous, let’s be explicit about *what* maximum entropy distributions are.

Any (finite) probability distribution has some information-theoretic entropy, the “amount of information” conveyed by a sample from the distribution, given by Shannon’s formula^{[1]}:

$$-\sum_i p_i \log(p_i)$$

As the name suggests, a maximum entropy distribution is the distribution with the highest entropy, subject to some constraints. Different constraints yield different maximum entropy distributions.

Conceptually: if a distribution has maximum entropy, then we gain the largest possible amount of information by observing a sample from the distribution. On the flip side, that means we know as little as possible about the sample before observing it. Maximum entropy = maximum uncertainty.

With that in mind, you can probably guess one maximum entropy distribution: what’s the maximum entropy distribution over a finite number of outcomes (e.g. heads/tails, or 1/2/3/4/5/6), without any additional constraints?

(Think about that for a moment if you want.)

Intuitively, the “most uncertain” distribution is uniform, and indeed that’s the unconstrained maximum entropy distribution over a finite number of outcomes.
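A quick numerical check of that claim, as a minimal sketch: it computes Shannon's formula (in bits) for a fair die, a loaded die, and a die that always lands on one face. The specific loaded probabilities are made up for illustration.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; outcomes with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # illustrative values
certain_die = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

for name, dist in [("uniform", uniform_die), ("loaded", loaded_die), ("certain", certain_die)]:
    print(name, shannon_entropy(dist))
```

The uniform die attains the upper bound of log2(6) bits, and the certain die has zero entropy.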

Other maximum entropy distributions are generated by adding constraints. For instance, we might constrain the mean and variance of a distribution over real numbers; then the maximum entropy distribution is the normal distribution with the appropriate mean and variance.
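To illustrate the mean-and-variance case, here is a small sketch comparing standard closed-form differential entropies (in nats) of three distributions with the same standard deviation. The choice of uniform and Laplace as comparison distributions is ours, purely for illustration.

```python
import math

def normal_entropy(sigma):
    """Differential entropy of a normal distribution with std sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def uniform_entropy(sigma):
    """Uniform on an interval of width w has variance w^2 / 12."""
    width = sigma * math.sqrt(12)
    return math.log(width)

def laplace_entropy(sigma):
    """Laplace with scale b has variance 2 * b^2."""
    scale = sigma / math.sqrt(2)
    return 1 + math.log(2 * scale)

sigma = 1.0
print("normal :", normal_entropy(sigma))
print("laplace:", laplace_entropy(sigma))
print("uniform:", uniform_entropy(sigma))
```

With the variance held fixed, the normal distribution has the largest entropy of the three, consistent with it being the maximum entropy distribution under that constraint.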

The maximum entropy distributions we use in practice are simple and relatively elegant mainly because the constraints are simple and relatively elegant. In particular, the constraints which generate most maximum entropy distributions used in practice have the form

$$E[f(X)] = \sum_X f(X) P[X] = \mu$$

… for some (possibly vector) function f of the outcome X. Under constraints with this form, the maximization problem is mathematically very clean^{[2]}, and the maximum entropy distribution has the form

$$P[X] = \frac{1}{Z} e^{\lambda f(X)}$$

… for some (possibly vector) λ^{[3]}. Conceptually, λ quantifies the trade-off between the constraint and the entropy - i.e. how much would the entropy increase/decrease if we change the constraint value μ by a little bit? Different μ values yield different λ values - e.g. different average energies (μ) correspond to different temperatures (λ).

With that in mind, let’s look at the Boltzmann distribution again. It has the form

$$P[X] = \frac{1}{Z} e^{-\frac{1}{k_B T} E(X)}$$

… where the “outcome” X is the system state and E(X) is its energy. That pattern-matches to our maximum entropy form; the Boltzmann distribution is the maximum entropy distribution subject to a constraint

$$E[E(X)] = \mu$$

… i.e. a constraint on the expected energy (sorry for the notation). The factor $-\frac{1}{k_B T}$ then corresponds to λ; conceptually, the temperature T tells us how much the entropy would increase/decrease if we change the expected energy by a little bit. And indeed, that’s how temperature is often defined in physics.
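To make the role of λ concrete, here is a minimal sketch that solves for λ in the maxent form (probability proportional to exp(λ·f(X))) for a six-sided die with f(X) = X and a constrained mean. The target mean of 4.5, the bisection solver, and all function names are illustrative assumptions. It also checks numerically that the slope of the maximum entropy with respect to μ is −λ, which is the "trade-off" reading of λ under this sign convention.

```python
import math

def maxent_distribution(values, lam):
    """P[x] = exp(lam * x) / Z over a finite list of outcome values."""
    shift = max(lam * v for v in values)  # subtract max exponent for stability
    weights = [math.exp(lam * v - shift) for v in values]
    z = sum(weights)
    return [w / z for w in weights]

def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def entropy_nats(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def solve_lambda(values, target_mean, lo=-50.0, hi=50.0, iterations=200):
    """Bisection; relies on the mean increasing monotonically with lam."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean(values, maxent_distribution(values, mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

die_faces = [1, 2, 3, 4, 5, 6]
target = 4.5  # illustrative constraint value (mu)
lam = solve_lambda(die_faces, target)
probs = maxent_distribution(die_faces, lam)

def maxent_entropy_at(mu):
    return entropy_nats(maxent_distribution(die_faces, solve_lambda(die_faces, mu)))

# Central finite difference for d(entropy)/d(mu); should be close to -lam.
h = 1e-4
slope = (maxent_entropy_at(target + h) - maxent_entropy_at(target - h)) / (2 * h)

print("lambda:", lam)
print("d(entropy)/d(mu):", slope)
```

For the Boltzmann case, where λ = −1/(k_B T) and f is the energy, the same relation reads: the entropy (in nats) changes by 1/(k_B T) per unit change in expected energy.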

So we have some rough idea of what a maximum entropy distribution is: it’s “maximally uncertain” subject to some constraints. The maximum-entropy distributions which are ubiquitous in practice are simple mainly because the constraints are simple. But now we return to our main question: why are those distributions so ubiquitous? Or, to frame it differently: why are those relatively-simple constraints so common? What physical processes generate those simple constraints?

## Chaos And Conserved Quantities

In the context of dynamic systems, chaos provides (most of) an answer.

Remember that, in a chaotic system, information about the approximate system state (i.e. leading-order bits of the system state) is lost over time, assuming we had even a little uncertainty initially. The approximate state becomes dependent on less and less significant bits of the initial conditions, eventually reaching bits whose initial values we do not know, making our predictions of the approximate state increasingly uncertain over time.

“Increasingly uncertain” sure does sound like entropy of the distribution over the approximate state increases over time, doesn’t it?
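A toy illustration of that intuition, as a sketch rather than a model of any physical system: an ensemble of points starts inside a tiny interval, evolves under the chaotic logistic map x → 4x(1−x), and we track the entropy of the coarse-grained (16-bin) state. The map, the bin count, and the initial interval are all assumptions chosen for illustration.

```python
import math
import random

def logistic_step(x):
    """One step of the chaotic logistic map on [0, 1]."""
    return 4.0 * x * (1.0 - x)

def coarse_entropy_bits(points, bins=16):
    """Entropy (bits) of the histogram of points over equal-width bins."""
    counts = [0] * bins
    for x in points:
        counts[min(int(x * bins), bins - 1)] += 1
    total = len(points)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

rng = random.Random(0)
# Small initial uncertainty: every point sits in the same coarse bin.
ensemble = [0.3 + 1e-6 * rng.random() for _ in range(5000)]

entropy_history = [coarse_entropy_bits(ensemble)]
for _ in range(40):
    ensemble = [logistic_step(x) for x in ensemble]
    entropy_history.append(coarse_entropy_bits(ensemble))

print(entropy_history[0], entropy_history[20], entropy_history[-1])
```

This toy map has no conserved quantity, so nothing constrains the growth; the entropy plateaus somewhat below the 4-bit maximum only because the map's long-run distribution over [0, 1] is not uniform.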

… but also remember that not all information about the approximate system state is lost over time. There are typically also some conserved quantities, most notably energy. That information is not lost over time.

Put those two together: entropy of approximate system state increases over time, but subject to a constraint on conserved quantities (e.g. energy). Assuming there are no other constraints unaccounted-for, we’d intuitively expect the entropy of the approximate system state to approach its maximum value subject to the constraints on conserved quantities. And that sure does sound like the Boltzmann distribution: maximum entropy, subject to a constraint on the conserved quantity energy.

There are still some important details to sort out, though.

## Value vs Expected Value

If energy is our only conserved quantity over the approximate state, and entropy of the approximate state increases over time, then intuitively we’d expect the distribution to maximize entropy subject to a constraint on energy $E(X)$, not on *expected* energy $E[E(X)]$. The energy at any later time should be exactly equal to the initial energy, not merely equal on average. So why is the Boltzmann distribution subject to only a constraint on *expected* energy?

For the kinds of systems where the Boltzmann distribution shows up, it turns out that the difference doesn’t matter very much, at least for the kinds of questions which we’re interested in. The key reason is that the Boltzmann distribution is “very sharply peaked” near the expected energy μ - i.e. samples from the Boltzmann distribution have energy extremely close to μ with very high probability. Nearly all the probability mass is therefore very close to the surface with energy μ, and is roughly maxentropic on that surface. A constraint directly on the energy would put all of the probability mass exactly on that same surface, and also spread out maxentropically on the surface. So the two distributions are “roughly the same” in some sense.

Let’s ask one more “why?”: why is the Boltzmann distribution very sharply peaked near the expected energy, e.g. for a gas?

Conceptually: given the expected energy (or, equivalently, the temperature) and any other conserved quantities, the approximate motions in one little region of a gas are roughly-independent of the approximate motions in another little region, assuming the regions aren’t too close together. Chaos wipes out all the information between the two - not just over time, but between spatially-separated regions as well. The total energy is a *sum* over the energies of all those mostly-roughly-independent regions, so the central limit theorem kicks in, and our uncertainty over the sum (i.e. the total energy) ends up relatively small as long as there are a lot of roughly-independent regions. Thus, the distribution is very tightly peaked around the average energy.

... Note that that story relied heavily on chaos wiping out information across space, not just over time, a phenomenon which I listed under "open problems" in the chaos post. I'm surprised that I haven't seen much work on that topic before, especially since the conceptual story above is quite standard in statistical mechanics.
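The central-limit argument above can be checked with a toy simulation. This sketch assumes, purely for illustration, that each region's energy is an independent exponential random variable with mean 1; the relative spread of the total energy should then shrink like 1/√N as the number of regions N grows.

```python
import math
import random

def relative_energy_spread(num_regions, trials, rng):
    """Std / mean of total energy summed over independent regions."""
    totals = []
    for _ in range(trials):
        # Toy assumption: each region's energy ~ Exponential(mean 1).
        totals.append(sum(rng.expovariate(1.0) for _ in range(num_regions)))
    mean_total = sum(totals) / trials
    variance = sum((t - mean_total) ** 2 for t in totals) / (trials - 1)
    return math.sqrt(variance) / mean_total

rng = random.Random(0)
spreads = {n: relative_energy_spread(n, trials=1000, rng=rng) for n in (10, 100, 1000)}

for n, spread in spreads.items():
    print(n, spread, 1 / math.sqrt(n))
```

For a macroscopic gas the number of roughly-independent regions is astronomically large, so the same scaling makes the relative spread in total energy negligible.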

## From Distributions "In The Mind" To Distributions "In The Territory"

We still haven’t answered the original question: why does the Boltzmann distribution (and maximum entropy distributions more generally) show up so often in the territory, as opposed to just in our minds?

What we *have* argued so far is that the Boltzmann distribution should be *convergent*: it doesn’t matter what distribution we start with, as long as there’s a little uncertainty in the initial conditions, our uncertainty will quickly evolve into a Boltzmann distribution as we forecast the approximate state forward over time. That’s quite powerful: it says that e.g. even a smart alien or a superintelligence will end up predicting from roughly the same distribution, even if they have very different priors or a lot more information than we do.

Let’s extend that story to talk about distributions in the territory, not just minds.

Suppose we have an experiment where we initialize some system somehow, let it run for a while, then measure its approximate final state. Empirically, we find that the approximate final states are roughly Boltzmann distributed. Then the question is: what distributions of initial states would have resulted in a roughly-Boltzmann distribution of approximate final states? And how/why did our experimental setup produce one of the initial distributions which results in a roughly-Boltzmann distribution of approximate final state?

… well, almost any initial distribution which is smooth-ish over the lower-order bits will produce a Boltzmann distribution. So, our experimental setup should produce a roughly-Boltzmann distribution as long as it’s not producing any particular special pattern in the lower-order bits of the initial conditions. Kind of anticlimactic, but that's what our empirical observations of Boltzmann distributions imply.
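A toy sketch of this convergence claim, using the chaotic logistic map x → 4x(1−x) as a stand-in for the system dynamics (an assumption for illustration; it has no conserved energy, so the limiting distribution is the map's own long-run distribution rather than a Boltzmann distribution). Two quite different initial ensembles, each smooth over the low-order bits, end up with nearly identical coarse-grained histograms.

```python
import random

def logistic_step(x):
    """One step of the chaotic logistic map on [0, 1]."""
    return 4.0 * x * (1.0 - x)

def histogram(points, bins=16):
    """Fraction of points in each of `bins` equal-width bins on [0, 1]."""
    counts = [0] * bins
    for x in points:
        counts[min(int(x * bins), bins - 1)] += 1
    return [c / len(points) for c in counts]

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def evolve(points, steps):
    for _ in range(steps):
        points = [logistic_step(x) for x in points]
    return points

rng = random.Random(1)
# Two different initial ensembles: different locations and different shapes,
# but neither has any special pattern in its low-order bits.
ensemble_a = [0.2 + 1e-4 * rng.random() for _ in range(20000)]
ensemble_b = [0.7 + 1e-4 * rng.gauss(0.5, 0.1) for _ in range(20000)]

initial_distance = total_variation(histogram(ensemble_a), histogram(ensemble_b))
final_distance = total_variation(
    histogram(evolve(ensemble_a, 40)), histogram(evolve(ensemble_b, 40))
)

print(initial_distance, final_distance)
```

The remaining distance after evolution is at the level of sampling noise for ensembles of this size.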

Note that we could also ask why *parts* of a system - e.g. individual particles in a gas - are Boltzmann distributed, even for a single run of our hypothetical experiment. We could probably tell a similar story, but we’d probably want to use a version of chaos which talks about loss of information between parts of the system (e.g. particles) rather than over time… which is the main theme of the Open Problems below.

## Aside: Subjectivity Of Coarse-Graining

At first glance, there’s a degree of freedom in the chaos story, where we decide to focus on the most significant bits of the system state rather than some other part of the system state. That’s a choice of “coarse graining” - a choice of ways to take an approximate system state. If we chose a different coarse graining, presumably we could get different results. That messes up the “universality” of our story - e.g. maybe an alien or a superintelligence would use a different coarse graining and thereby end up with a very different distribution.

In practice, I don’t think the choice of coarse graining is as much of a degree of freedom as it might seem at first glance.

The basic feature-of-the-world which makes certain coarse grainings strongly preferred is smoothness: lots of stuff is a reasonably-smooth function of other stuff. In particular, a core part of what makes chaos interesting in the first place is that future states are a nice smooth function of past states *over short times*. And when we have nice smooth functions, the lower-order bits usually depend on much higher-order bits, *but mostly not vice-versa*.

Consider, for instance, computing f(x) = x^2 for a real number x = 1.731417…. We can compute the first few digits of the result without needing to look at digits very far back in the decimal expansion of x. But if we want to compute digits further back in the result, then we need to start from the front and work our way backwards; digits further back in the result depend on the first few digits of x.
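The squaring example can be checked directly with exact decimal arithmetic. In this sketch the truncation of x to seven digits and the two perturbed inputs are illustrative choices.

```python
from decimal import Decimal

def square(x_text):
    """Exact square of a short decimal string (fits in default precision)."""
    return Decimal(x_text) ** 2

def leading(value, n_chars=6):
    """First few characters of the decimal expansion, e.g. '2.9978'."""
    return str(value)[:n_chars]

original = square("1.731417")
low_order_changed = square("1.731418")   # last digit of x perturbed
high_order_changed = square("2.731417")  # first digit of x perturbed

print(original, low_order_changed, high_order_changed)
print(leading(original), leading(low_order_changed), leading(high_order_changed))
```

Perturbing the last digit of x leaves the leading digits of x^2 untouched and only alters digits further back, while perturbing the first digit of x changes every digit of the result.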

In practice, smoothness interacts with measurement: we can usually measure the higher-order bits without measuring lower-order bits, but we can’t easily measure the lower-order bits without the higher-order bits. Imagine, for instance, trying to design a thermometer which measures the fifth bit of temperature but not the four highest-order bits. Probably we’d build a thermometer which measured them all, and then threw away the first four bits! Fundamentally, it’s because of the informational asymmetry: higher-order bits affect everything, but lower-order bits mostly don’t affect higher-order bits much, so long as our functions are smooth. So, measurement in general will favor higher-order bits.

That, in turn, means that agents’ observations will favor higher-order bits, which makes higher-order bits the relevant coarse-graining for predicting observations.

Now, there are still *some* degrees of freedom in the details of the coarse-graining - e.g. "how many higher-order bits?" or "bits vs digits?" or "what if we coarse-grain in some other way than bits/digits, but still tracking the general notion of 'big/small' implied by our smooth functions?". But so long as the different coarse grainings correspond to the same smooth functions (specifically the smooth system dynamics over short times), they should still converge to the same distributions via chaos.

## Open Problems

The open problems from the chaos post generally carry over to here:

This is the sort of thing I’ve worked on over the past couple years, but my answers are still pretty incomplete.

^{[1]}For distributions over infinite outcome-spaces, especially continuous spaces, the right way to generalize Shannon's formula is usually to use "relative entropy", a.k.a. KL-divergence. Then "maxent distributions" are generalized to "maxent relative to some prior", usually called minimum KL-divergence distributions. Conceptually, basically all of the intuitions in this post generalize to minimum KL-divergence distributions.

In fact, for the continuous examples at the start of this post like particle velocities in a gas, the "maxent distribution" is implicitly a minimum KL-divergence distribution. It's relative to an (improper) uniform prior on particle velocities/energies. The non-relative entropy would be infinite.

^{[2]}It’s a convex optimization problem.

^{[3]}λ is the Lagrange multiplier on the maximization problem.