When we think of “optimization” as compressing some part of the universe into a relatively small number of possible states, it’s very natural to quantify that compression in terms of “bits of optimization”. Example: we have a marble which could be in any of 16 different slots in a box (assume/approximate uniform probability on each). We turn the box on its side, shake it, and set it down so the marble will end up in one of the 4 slots on the downward side (again, assume/approximate uniform probability on each). Then we’ve compressed the marble-state from 16 possibilities to 4, cut the state-space in half twice, and therefore performed two bits of optimization.
In the language of information theory, this quantity is the KL-divergence between the initial and final distributions.
In this post, we’ll prove our first simple theorem about optimization at a distance: the number of bits of optimization applied can only decrease over a distance. In particular, in our optimization at a distance picture:
… the number of bits of optimization applied to the far-away optimization target cannot be any larger than the number of bits of optimization applied to the optimizer’s direct outputs.
The setup: first, we’ll need two distributions to compare over the optimizer’s direct outputs. You might compare the actual output-distribution to uniform randomness, or independent outputs. If the optimizer is e.g. a trained neural net, you might compare its output-distribution to the output-distribution of a randomly initialized net. If the optimizer has some sort of explicit optimization loop (like e.g. gradient descent), then you might compare its outputs to the initial outputs tested in that loop. These all have different interpretations and applications; the math here will apply to all of them.
Let’s name some variables in this setup:
- Optimizer’s direct outputs: (for “actions”)
- ’th intermediating layer: (with )
- Reference distribution over everything:
- Actual distribution over everything:
By assumption, only the optimizer differs between the reference and actual distributions; the rest of the Bayes net is the same. Mathematically, that means (and of course both distributions factor over the same underlying graph).
Once we have two distributions over the optimizer’s direct outputs, they induce two distributions over each subsequent layer of intermediating variables, simply by propagating through each layer:
At each layer, we can compute the number of bits of optimization applied to that layer, i.e. how much that layer’s state-space is compressed by the actual distribution relative to the reference distribution. That’s the KL-divergence between the distributions: .
To prove our theorem, we just need to show that . To do that, we’ll use the chain rule of KL divergence to expand in two different ways. First:
Recall that and are the same, so , and our first expression simplifies to . Second:
KL-divergence is always nonnegative, so we can drop the second term above and get an inequality:
Now we just combine these two expressions for and find
… which is what we wanted to prove.
So: if we measure the number of bits of optimization applied to the optimizer’s direct output, or to any particular layer, that provides an upper bound on the number of bits of optimization applied further away.