Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

When we think of “optimization” as compressing some part of the universe into a relatively small number of possible states, it’s very natural to quantify that compression in terms of “bits of optimization”. Example: we have a marble which could be in any of 16 different slots in a box (assume/approximate uniform probability on each). We turn the box on its side, shake it, and set it down so the marble will end up in one of the 4 slots on the downward side (again, assume/approximate uniform probability on each). Then we’ve compressed the marble-state from 16 possibilities to 4, cut the state-space in half twice, and therefore performed two bits of optimization.

In the language of information theory, this quantity is the KL-divergence of the final distribution from the initial distribution: $D_{KL}\!\left(P_{\text{final}} \,\|\, P_{\text{initial}}\right)$.
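To make the marble example concrete, here's a minimal sketch in Python (not from the original post; the slot count and distributions are just the ones from the example above), computing the KL-divergence of the shaken distribution from the reference distribution:

```python
import numpy as np

# Reference distribution: marble uniform over all 16 slots.
p_ref = np.full(16, 1 / 16)

# Final distribution after shaking: uniform over the 4 downward slots, zero elsewhere.
p_final = np.zeros(16)
p_final[:4] = 1 / 4

# Bits of optimization = D_KL(final || reference), in bits (log base 2).
# Outcomes with zero final probability contribute nothing, so sum over the support only.
support = p_final > 0
bits = np.sum(p_final[support] * np.log2(p_final[support] / p_ref[support]))
print(bits)  # 2.0
```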

In this post, we’ll prove our first simple theorem about optimization at a distance: the number of bits of optimization applied can only decrease over a distance. In particular, in our optimization at a distance picture:

… the number of bits of optimization applied to the far-away optimization target cannot be any larger than the number of bits of optimization applied to the optimizer’s direct outputs.

The setup: first, we’ll need two distributions over the optimizer’s direct outputs to compare. You might compare the actual output-distribution to uniform randomness, or to a distribution in which the outputs are independent. If the optimizer is e.g. a trained neural net, you might compare its output-distribution to the output-distribution of a randomly initialized net. If the optimizer has some sort of explicit optimization loop (e.g. gradient descent), then you might compare its outputs to the initial outputs tested in that loop. These all have different interpretations and applications; the math here applies to all of them.

Let’s name some variables in this setup:

  • Optimizer’s direct outputs: $A$ (for “actions”)
  • $i$’th intermediating layer: $M_i$ (with $M_0 = A$)
  • Reference distribution over everything: $P[\cdot \mid \text{ref}]$
  • Actual distribution over everything: $P[\cdot \mid \text{opt}]$

By assumption, only the optimizer differs between the reference and actual distributions; the rest of the Bayes net is the same. Mathematically, that means $P[M_{i+1} \mid M_i, \text{ref}] = P[M_{i+1} \mid M_i, \text{opt}]$ for all $i$ (and of course both distributions factor over the same underlying graph).

Once we have two distributions over the optimizer’s direct outputs, they induce two distributions over each subsequent layer of intermediating variables, simply by propagating through each layer:

$$P[M_{i+1} \mid \text{ref}] = \sum_{M_i} P[M_{i+1} \mid M_i] \; P[M_i \mid \text{ref}], \qquad P[M_{i+1} \mid \text{opt}] = \sum_{M_i} P[M_{i+1} \mid M_i] \; P[M_i \mid \text{opt}]$$

At each layer, we can compute the number of bits of optimization applied to that layer, i.e. how much that layer’s state-space is compressed by the actual distribution relative to the reference distribution. That’s the KL-divergence between the distributions: $D_{KL}\!\left(P[M_i \mid \text{opt}] \,\|\, P[M_i \mid \text{ref}]\right)$.

To prove our theorem, we just need to show that $D_{KL}\!\left(P[M_{i+1} \mid \text{opt}] \,\|\, P[M_{i+1} \mid \text{ref}]\right) \le D_{KL}\!\left(P[M_i \mid \text{opt}] \,\|\, P[M_i \mid \text{ref}]\right)$. To do that, we’ll use the chain rule of KL divergence to expand $D_{KL}\!\left(P[M_i, M_{i+1} \mid \text{opt}] \,\|\, P[M_i, M_{i+1} \mid \text{ref}]\right)$ in two different ways. First:

$$D_{KL}\!\left(P[M_i, M_{i+1} \mid \text{opt}] \,\|\, P[M_i, M_{i+1} \mid \text{ref}]\right) = D_{KL}\!\left(P[M_i \mid \text{opt}] \,\|\, P[M_i \mid \text{ref}]\right) + D_{KL}\!\left(P[M_{i+1} \mid M_i, \text{opt}] \,\|\, P[M_{i+1} \mid M_i, \text{ref}]\right)$$

Recall that $P[M_{i+1} \mid M_i, \text{opt}]$ and $P[M_{i+1} \mid M_i, \text{ref}]$ are the same, so $D_{KL}\!\left(P[M_{i+1} \mid M_i, \text{opt}] \,\|\, P[M_{i+1} \mid M_i, \text{ref}]\right) = 0$, and our first expression simplifies to $D_{KL}\!\left(P[M_i \mid \text{opt}] \,\|\, P[M_i \mid \text{ref}]\right)$. Second:

$$D_{KL}\!\left(P[M_i, M_{i+1} \mid \text{opt}] \,\|\, P[M_i, M_{i+1} \mid \text{ref}]\right) = D_{KL}\!\left(P[M_{i+1} \mid \text{opt}] \,\|\, P[M_{i+1} \mid \text{ref}]\right) + D_{KL}\!\left(P[M_i \mid M_{i+1}, \text{opt}] \,\|\, P[M_i \mid M_{i+1}, \text{ref}]\right)$$

KL-divergence is always nonnegative, so we can drop the second term above and get an inequality:

$$D_{KL}\!\left(P[M_i, M_{i+1} \mid \text{opt}] \,\|\, P[M_i, M_{i+1} \mid \text{ref}]\right) \ge D_{KL}\!\left(P[M_{i+1} \mid \text{opt}] \,\|\, P[M_{i+1} \mid \text{ref}]\right)$$

Now we just combine these two expressions for $D_{KL}\!\left(P[M_i, M_{i+1} \mid \text{opt}] \,\|\, P[M_i, M_{i+1} \mid \text{ref}]\right)$ and find

$$D_{KL}\!\left(P[M_i \mid \text{opt}] \,\|\, P[M_i \mid \text{ref}]\right) \ge D_{KL}\!\left(P[M_{i+1} \mid \text{opt}] \,\|\, P[M_{i+1} \mid \text{ref}]\right)$$

… which is what we wanted to prove.

So: if we measure the number of bits of optimization applied to the optimizer’s direct output, or to any particular layer, that provides an upper bound on the number of bits of optimization applied further away.
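To see the theorem numerically, here's a minimal sketch (again not from the original post; the state count, layer count, and transition matrices are arbitrary placeholders). Both the reference and actual distributions over $A = M_0$ are pushed through the same stack of stochastic transition matrices, and the bits of optimization are computed at each layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_bits(p, q):
    """D_KL(p || q) in bits, assuming q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

n_states, n_layers = 8, 5

# Same causal structure under both distributions: one stochastic transition
# matrix per layer (rows index M_i, columns index M_{i+1}; each row sums to 1).
transitions = [rng.dirichlet(np.ones(n_states), size=n_states)
               for _ in range(n_layers)]

# Two distributions over the optimizer's direct outputs A = M_0.
p_ref = np.full(n_states, 1 / n_states)          # reference: uniform
p_opt = rng.dirichlet(np.ones(n_states) * 0.1)   # actual: concentrated on a few states

for i, T in enumerate(transitions):
    print(f"bits of optimization at M_{i}: {kl_bits(p_opt, p_ref):.3f}")
    p_ref = p_ref @ T   # propagate the reference distribution one layer
    p_opt = p_opt @ T   # propagate the actual distribution through the same layer
print(f"bits of optimization at M_{n_layers}: {kl_bits(p_opt, p_ref):.3f}")
```

Whatever the seed or transition matrices, the printed values never increase from one layer to the next, which is exactly the inequality proved above.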

15 comments

To check my noob understanding:

Suppose you are playing a MMO video game which accepts about 10,000 binary inputs per second, in the form of various keyboard buttons being pressed or not. You want to cause something crazy and meme-worthy to happen in the game, like everyone in the game falling over dead simultaneously. You have an hour or so of playtime total. What's the craziest thing you can make happen?

Answer: Well, whatever it is, it has to be at least 1/2^(10,000x3600) likely to happen already. If it was less likely than that, you wouldn't be able to make it likely even if you played perfectly.

That's a correct intuitive understanding.

I think this assumes implicitly that P(A|ref) is uniformly distributed over all the 10,000 options. In a video game I'd think more that the "reference" is always to output 0s, since the player isn't interacting. Then the KL divergence could be arbitrarily large. But it's not really clear in general how to interpret the reference distribution; perhaps someone can clarify?

I agree on the "reference" distribution in Daniel's example. I think it generally means "the distribution over the random variables that would obtain without the optimiser". What exactly that distribution is / where it comes from I think is out-of-scope for John's (current) work, and I think is kind of the same question as where the probabilities come from in statistical mechanics.

What exactly that distribution is / where it comes from I think is out-of-scope for John's (current) work, and I think is kind of the same question as where the probabilities come from in statistical mechanics.

Yup.

My problem is that A is defined as the output of the optimizer, M0 is defined as A, so P(A|ref) is central to the entire inequality. However, what is the output of an optimizer if we are without the optimizer? The given examples (Daniel's and John's) both gloss over the question of P(A|ref) and implicitly treat it as uniform over the possible choices the optimizer could have made. In the box-with-slots examples, what happens if there is no optimizer? I don't know.

In the MMO example, what is the output without a player-optimizer? I don't think it's a randomly chosen string of 10,000-bit inputs. No MMO I've ever played chooses random actions if you walk away from it. Yet Daniel's interpretation assumes that that's the distribution. With any other reference, a player choosing the least likely reference outcome can beat the bound in Daniel's answer. I.e. his example makes it clear that the bits-of-optimization applied by the player do not correspond to bits-of-input, unless the reference is a randomly chosen string of inputs. And in that case, the bound feels trivial and uninsightful. If every possible action I can choose has a p chance of happening without me, then the output that I choose will have had a chance of p by definition. And the distribution of outcomes I selected will then always have had at least a p chance of having been selected without me (plus some chance that it happened through other possible output choices). No math needed to make me believe that!

None of this applies to the equation itself; it works for any choice of P(A|ref). But I think it changes the interpretations given (such as Daniel's), and without a clear choice of reference I'm not sure that this builds intuition for anything in the way that I think it's trying to do. Is "uniformly choose an output" really a useful reference? I don't think it is useful for intuition. And in the useful reference I'd choose (constant output), the bound becomes trivial (infinite KL divergence). So what is a useful choice?

Good point!

It seems like it would be nice in Daniel's example for P(A|ref) to be the action distribution of an "instinctual" or "non-optimising" player. I don't know how to recover that. You could imagine something like an n-gram model of player inputs across the MMO.

Suppose I send a few lines of code to a remote server. Those lines are enough to bootstrap a superintelligence which goes on to strongly optimize every aspect of the world. This counts as a fairly small amount of optimization power, as the chance of me landing on those lines of code by pure chance isn't that small. 

Correct, though note that this doesn't let you pick a specific optimization target for that AGI. You'd need more bits in order to specify an optimization target. In other words, there's still a meaningful sense in which you can't "steer" the world into a specific target other than one it was likely to hit anyway, at least not without more bits.

Is this assuming all optimization happens at "the optimizer" (and thus implicitly assuming that no optimization can take place in any intermediate step?)

That wording is somewhat ambiguous, but the answer to the question which I think you're trying to ask is "yes". We're assuming that any differences between "optimized" and "not optimized" states/distributions are driven by differences in the optimizer, in the sense that the causal structure of the world outside the optimizer remains the same.

Intuitively, this corresponds to the idea that the optimizer is "optimizing" some far-away chunk of the world, and we're quantifying the optimization "performed by the optimizer", not whatever optimization might be done by whatever else is in the environment.

Stepping into a real-world example, consider a text file, and three cases, illustrating different things:

First case: Entirely ineffective ZIP compression, (some processes), effective ZIP compression.  If we treat the ineffective ZIP compression as "the optimizer", then it is clear that some compression happens later in the sequence of processes; the number of bits of optimization increased.  However, the existence or non-existence of the first ineffective ZIP compression has no effect on the number of bits of optimization, so maybe this isn't quite appropriate as a way of thinking about this.

Second case: Anti-effective ZIP compression, (some processes), effective ZIP compression.  The anti-effective ZIP compression, instead of compressing the file, maybe just copies the file twenty times.  Then the effective ZIP compression takes that, and possibly compresses it nearly as effectively as the first case.  The existence or non-existence of the anti-effective ZIP compression does matter in this case - in particular, if we compare the state of its output compared to the final state, the anti-effective ZIP compression creates a wider optimization discrepancy; it would appear to "optimize at a distance."

Third case: Garbage data cleanup, (some processes), effective ZIP compression.  Here we enter an interesting case, because, if we treat the garbage data cleanup as our optimizer, it both optimizes the immediate output, and, in the event that the garbage is significantly higher entropy than the rest of the data, might make the effective ZIP compression more effective.

I think the third case really matters here, but in general, once you introduce a second optimizer, you can create emergent optimizations which don't strictly belong to either optimizer.  Additionally, once we submit the existence of emergent optimizations, it may entirely be possible for an optimizer to optimize "at a distance", even without a specific second optimization taking place.  Consider any given optimization as a sequence of operations each of which is necessary; if we treat the first operation as "the optimizer", then the optimization as a whole happens "at a distance", given that without the first operation, the optimization fails, and it is only optimized at the final step in the process (given that each step is necessary).

I'm intrigued by these examples but I'm not sure they translate. It sounds like you are interpreting "difference in file size, in bits, between the reference and optimized versions" as the thing the KL divergence is measuring, but I don't think that's true. I'm assuming here that the reference is where the first step does nothing and outputs the input file unchanged (effectively just case 1).  Let's explicitly assume that the input file is a randomly chosen English word.

Suppose a fourth case where our "optimizer" outputs the file "0" regardless of input. The end result is a tiny zip file. Under the "reference" condition, the original file is zipped and is still a few bytes, so we have reduced the file size by a few bytes at most. However, the KL divergence is infinite! After all, "0" is not an English word, so its zip never appears in the output distribution of the reference, but it occurs with probability 1 under our optimizer. So the KL divergence is not at all equal to the number of bits of filesize reduced. Obviously this example is rather contrived, but it suffices to show that we can't directly translate intuition about filesizes to intuition about bits-of-optimization as measured by KL divergence.
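For what it's worth, here is a minimal sketch of that point (the output labels and probabilities are made-up placeholders, not anything from the thread): whenever the actual distribution puts probability on an outcome the reference assigns zero probability, the KL divergence is infinite, no matter how small the files involved are.

```python
import numpy as np

# Illustrative distributions over possible zipped outputs (labels are arbitrary).
outputs = ["zip(cat)", "zip(dog)", "zip(tree)", "zip(0)"]

# Reference: zip a randomly chosen English word -- never produces zip(0).
p_ref = np.array([1 / 3, 1 / 3, 1 / 3, 0.0])

# "Optimizer" that writes the file "0" regardless of input, then zips it.
p_opt = np.array([0.0, 0.0, 0.0, 1.0])

def kl_bits(p, q):
    """D_KL(p || q) in bits; infinite if p puts mass where q has none."""
    if np.any((p > 0) & (q == 0)):
        return np.inf
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

print(kl_bits(p_opt, p_ref))  # inf
```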

Were you going for a different intuition with these examples?

It isn't the thing that the KL divergence is measuring, it is an analogy for it.  The KL divergence is measuring the amount of informational entropy; strictly speaking, zipping a file has no effect in those terms.

However, we can take those examples more or less intact and place them in informational-entropy terms; the third gets a little weird in the doing, however.

So, having an intuition for what the ZIP file does, the equivalent "examples":

Example 1: KLE(Reference optimizer output stage, ineffective optimizer output) is 0; KLE(Reference final stage, ineffective optimizer final stage) is also 0.  Not appropriate as a way of thinking about this, but helpful to frame the next two examples.

Example 2: KLE(Reference optimizer output stage, antieffective optimizer output) is -N; KLE(Reference final stage, antieffective optimizer final stage) is 0.  Supposing an antieffective optimizer and a later optimizer can both exist, such that the later optimizer optimizes away the inefficiency introduced by the antieffective one, we may observe a violation of the conclusion that the KL distance at step N+1 must be less than or equal to the KL distance at step N, relative to their reference cases.

Example 3: The central conceit to this example is harder to translate; the basic idea is that one optimization can make a future optimization more efficient.  Critically for this, the future optimization can vary in effectiveness depending on the input parameters.  Suppose, for a moment, a state-reduction algorithm on a set of states n, using something like the opening example:

For n < 64, each state n maps to the state {ceiling(n / 4)} [2 bits of optimization]

For n > 64, each state n maps to the state {ceiling(n / 2)} [1 bit of optimization]

Then, for a probability distribution such that the number of states is greater than 64 but less than 128, a precursor optimization which halves the number of states (1 bit of optimization) will create an additional bit of optimization at this second optimizer.

Or, for a very contrived example, the first "optimizer" does minimal optimization, and mostly just encodes a control sequence into the output which enables the second optimizer to actually optimize.

There's something here about process-entropy versus data-entropy which I think merits examination, however.

[Also, an observation: Arithmetic is an optimizer.  Consider the number of input states and the number of output states of addition.]

One thing that I had to remind myself while reading this post is that "far away" is across space-time, emphasis on time. So "far away" can be about optimizing the future.