Toward Interoperability of Minimal Programs

johnswentworth

Assumed background: Kolmogorov complexity and Solomonoff induction.

Suppose I have some data , and I go looking for the models (i.e. programs) which best compress that data. I find two different programs, and , which both reproduce the data using approximately the same number of bits, and that seems to be roughly the best compression possible. On examination, I find that the two models do totally different things internally.

It would be really nice if I could provably construct a third program, , which in some sense "combines the internal structure" of the two programs and , while still achieving approximately the same compression. This would be a result in the general cluster of natural abstraction and interoperable semantics. Very roughly speaking, it would say that if a human and an alien both have approximately-best-compressing models in some domain, but their models have totally alien internal structure, then we can construct a new model which finds both of the original models intelligible, while still achieving basically-optimal compression.

I don't have a perfect theorem like that with all the kinks worked out. But I can give some math which seems like it would allow a result along those lines, with the right setup.

Some K-Complexity Math

Suppose I have two approximate best compressions of data . Let's give them just a little internal structure:

The first compression consists of a self-contained program for generating an intermediate string , followed by a second self-contained program for generating the data from .
Second compression consists of a self-contained program for generating an intermediate string , followed by a second self-contained program for generating the data from .

Quantitatively, using the notation for Kolmogorov complexity (K-complexity), this means:

The existence of our two approximately-best compressions means that both of these approximations must hold.

From there, we do a little math, just using standard properties of K-complexity. (The main properties we use here derive straightforwardly from the Chain Rule. Note that the approximations therefore suppress terms of order .)

implying

Intuitively, this says: since is part of an approximate minimal compression of the data, it has approximately zero K-complexity given the data.

... and together, and given the data could be specified by just appending the two program which generates from the data and the program which generates from the data, both of which are very short (approximately zero, to within terms). So:

Then we chain rule back the other way:

implying

In other words, there exists an approximately-shortest program for the data which consists of a self-contained program to generate and , and then a second self-contained program to generate the data from and . That's the interesting result.

If the intermediates and are independent of each other, i.e. neither is more compressible given the other, then we can go one more step: . In that case, there exists a shortest program which consists of a self-contained program to generate , another self-contained program to generate , and a third self-contained program to generate the data from and . In that case, our new program can straight-up reuse the and programs from the two original models; the new program directly shares structure with the originals and could even interoperate with them.

Summary

We start with two different approximate best compressions of some data, and .
Each compression consists of a self-contained program to generate an intermediate string , followed by another self-contained program to generate the data given .
Then: there exists another approximate best compression consisting of a self-contained program to generate both and , followed by another self-contained program to generate the data given and .
If and are independent (i.e. no nontrivial joint compression), then the new best compression can directly reuse the and generating programs from the two original models.

By itself, this result is kind of toy. Approximate best compressions of lots of real-world data would probably start by defining whole new languages or libraries, which would then be used by component programs later on, so the component programs would not be standalone. On the other hand, there might be other tricks to handle that sort of structure, like e.g. Solomonoff natural latents. The hope would be that we could figure out a few such foundational tricks, and then compose them to handle more complicated programs.

Acknowledgement: David wasn't involved in this particular post, but it did bubble out of stuff I've done with him.

Could you come up with some examples of this, where a system is well-compressed by two completely different models?

Imagine a modeling a human. Given infinite compute (which is already assumed, because we're talking about Kolmogorov complexity), we could model the entire human at the level of quantum fields. Or, we could model the human at the level of preferences and beliefs (including inconsistencies and irrationalities), planning (including imperfectly), etc. So, a low abstraction level and a high abstraction level.

Suppose each of those yield an approximately-shortest program for our observations of the human. We might reasonably expect those programs to look like:

"low-level": a program which generates a low-level physics engine library (the string ), followed by a program which generates the data using that library.
"high-level": a program which generates an agent-modeling library (the string ), followed by a program which generates the data using that library.

Then the theorem says that there exists a third approximately-shortest program which "uses" both levels: it first generates both the low-level physics library and the agent-modeling library, then uses both of them to generate the data. The part which generates the data from the two libraries gains the compression benefits of both: unless one of the strings or has approximately-zero complexity given the other, the program to generate the data from both will be significantly shorter than the programs to generate the data from either or alone.

Not exactly what you're asking for, but maybe in the right direction?

This seems like a somewhat trivial construction that relies on the fact that S_1 and S_2 are both small compared to the data. This, to me, seems like saying "If you can drive from San Francisco to Seattle using a Tesla car key (by driving a Tesla), and also do it in a Honda car key (by getting in a Hyundai), then you can drive there almost as fast (modulo the mass of the second key slowing you down) by using both keys and driving there in the Tesla." Am I missing something?

There is no requirement, either explicit or implicit, that S_1 or S_2 be small relative to the data. Either or both could be e.g. a third the size of the data, or almost as large as the data itself.

I was thinking of "small" in terms of K-complexity, but I had still misunderstood it. Is the rough intuition behind this result the following:

Suppose has two components, which generates , and which generates the data given . We can "combine" and to form a program which first uses to generate and then uses to generate given . Then either and have some shared structure, in which case so is shorter than or and have no shared structure, in which so is shorter either than or . Or some combination of the two.

Ok after going through the proof of the chain rule, here is my sense for what is going on. Please correct me if I am mistaken.

An approximately-shortest program outputting is very similar to an approximately-shortest program that outputs just using as an intermediate output. This means that, given , there are few programs that output for some that have a length approximately , so has a small index in the enumeration of . This means that .

We get by searching for programs with an appropriate K-complexity bound that output triples , and we describe the triple we want using an efficient enumeration scheme. The length of this program is close to for the reason outlined above. This is split into producing and by splitting up the enumeration, effectively currying the program.

However I don't think this necessarily means that there is an approximately-shortest program producing from both and in a meaningful way. One approximately-shortest program is given by generating from , keeping , and then searching for via program enumeration. For some given , it's possible that all approximately-shortest programs are of this form.

How much compute is your decompression program allowed to use? Is it allowed to generate S1 then data using algorithm 1, then iterate every S2 until it finds the S2 of appropriate length that expands to data using the second step of algorithm 2? (potentially with a hash check, or a condition that the second step is injective)

ah, this gives K(data) < K(S1) + K(data | s2) + K(data| s1) which isnt interesting

In this example the program generating the data from both S1 and S2 is the more general abstraction, but can we really think of it as natural?

Maybe we have M3, M4 and so forth and if we try use the corresponding strings S1, ... Sn to produce the data, we get something longer (each Si is small but there are many of them). In that case, there will be no single abstraction which is the most "natural". I'm handwaving a lot, but does this make sense?

It seems to me that the "approximately-shortest program" here could just be generating S_1 and S_2, then throwing away S_2 and generating the data from S_1 (or vice versa)?

One class of cases where that definitely won't work: S_1 and S_2 independent, so K(data|S_1, S_2) is roughly K(data) - K(S_1) - K(S_2) (as shown in the post at the end of the main section). In that case, the program for data given S_1, S_2 has to be significantly shorter than either the program for data given S_1 or the program for data given S_2 (assuming S_1 and S_2 themselves have significantly more than zero K-complexity).

It would be really nice if I could provably construct a third program, , which in some sense "combines the internal structure" of the two programs and , while still achieving approximately the same compression.

I don't think that would fit the definition, since it would be over 2x as complex as either of the original programs. It doesn't seem like it would solve the spirit of the problem either.

I'm not talking about the spirit of the problem, I'm talking about the actual program corresponding to the derivation in this article. I'm not super familiar with the field though, so I could be wrong.

The math you're doing here implies that if 0 ≈ K(A) then 0 ≈ K(A) + K(A) = 2K(A), so constant multipliers seem to be allowed in your approximations.