Could you come up with some examples of this, where a system is well-compressed by two completely different models?
Imagine a modeling a human. Given infinite compute (which is already assumed, because we're talking about Kolmogorov complexity), we could model the entire human at the level of quantum fields. Or, we could model the human at the level of preferences and beliefs (including inconsistencies and irrationalities), planning (including imperfectly), etc. So, a low abstraction level and a high abstraction level.
Suppose each of those yield an approximately-shortest program for our observations of the human. We might reasonably expect those programs to look like:
Then the theorem says that there exists a third approximately-shortest program which "uses" both levels: it first generates both the low-level physics library and the agent-modeling library, then uses both of them to generate the data. The part which generates the data from the two libraries gains the compression benefits of both: unless one of the strings
This seems like a somewhat trivial construction that relies on the fact that S_1 and S_2 are both small compared to the data. This, to me, seems like saying "If you can drive from San Francisco to Seattle using a Tesla car key (by driving a Tesla), and also do it in a Honda car key (by getting in a Hyundai), then you can drive there almost as fast (modulo the mass of the second key slowing you down) by using both keys and driving there in the Tesla." Am I missing something?
There is no requirement, either explicit or implicit, that S_1 or S_2 be small relative to the data. Either or both could be e.g. a third the size of the data, or almost as large as the data itself.
I was thinking of "small" in terms of K-complexity, but I had still misunderstood it. Is the rough intuition behind this result the following:
Suppose
Ok after going through the proof of the chain rule, here is my sense for what is going on. Please correct me if I am mistaken.
An approximately-shortest program outputting
We get
However I don't think this necessarily means that there is an approximately-shortest program producing
How much compute is your decompression program allowed to use? Is it allowed to generate S1 then data using algorithm 1, then iterate every S2 until it finds the S2 of appropriate length that expands to data using the second step of algorithm 2? (potentially with a hash check, or a condition that the second step is injective)
ah, this gives K(data) < K(S1) + K(data | s2) + K(data| s1) which isnt interesting
In this example the program generating the data from both S1 and S2 is the more general abstraction, but can we really think of it as natural?
Maybe we have M3, M4 and so forth and if we try use the corresponding strings S1, ... Sn to produce the data, we get something longer (each Si is small but there are many of them). In that case, there will be no single abstraction which is the most "natural". I'm handwaving a lot, but does this make sense?
It seems to me that the "approximately-shortest program" here could just be generating S_1 and S_2, then throwing away S_2 and generating the data from S_1 (or vice versa)?
One class of cases where that definitely won't work: S_1 and S_2 independent, so K(data|S_1, S_2) is roughly K(data) - K(S_1) - K(S_2) (as shown in the post at the end of the main section). In that case, the program for data given S_1, S_2 has to be significantly shorter than either the program for data given S_1 or the program for data given S_2 (assuming S_1 and S_2 themselves have significantly more than zero K-complexity).
It would be really nice if I could provably construct a third program,
, which in some sense "combines the internal structure" of the two programs and , while still achieving approximately the same compression.
I don't think that would fit the definition, since it would be over 2x as complex as either of the original programs. It doesn't seem like it would solve the spirit of the problem either.
I'm not talking about the spirit of the problem, I'm talking about the actual program corresponding to the derivation in this article. I'm not super familiar with the field though, so I could be wrong.
The math you're doing here implies that if 0 ≈ K(A) then 0 ≈ K(A) + K(A) = 2K(A), so constant multipliers seem to be allowed in your approximations.
Assumed background: Kolmogorov complexity and Solomonoff induction.
Suppose I have some data , and I go looking for the models (i.e. programs) which best compress that data. I find two different programs, and , which both reproduce the data using approximately the same number of bits, and that seems to be roughly the best compression possible. On examination, I find that the two models do totally different things internally.
It would be really nice if I could provably construct a third program, , which in some sense "combines the internal structure" of the two programs and , while still achieving approximately the same compression. This would be a result in the general cluster of natural abstraction and interoperable semantics. Very roughly speaking, it would say that if a human and an alien both have approximately-best-compressing models in some domain, but their models have totally alien internal structure, then we can construct a new model which finds both of the original models intelligible, while still achieving basically-optimal compression.
I don't have a perfect theorem like that with all the kinks worked out. But I can give some math which seems like it would allow a result along those lines, with the right setup.
Some K-Complexity Math
Suppose I have two approximate best compressions of data . Let's give them just a little internal structure:
Quantitatively, using the notation for Kolmogorov complexity (K-complexity), this means:
The existence of our two approximately-best compressions means that both of these approximations must hold.
From there, we do a little math, just using standard properties of K-complexity. (The main properties we use here derive straightforwardly from the Chain Rule. Note that the approximations therefore suppress terms of order .)
implying
Intuitively, this says: since is part of an approximate minimal compression of the data, it has approximately zero K-complexity given the data.
... and together, and given the data could be specified by just appending the two program which generates from the data and the program which generates from the data, both of which are very short (approximately zero, to within terms). So:
Then we chain rule back the other way:
implying
In other words, there exists an approximately-shortest program for the data which consists of a self-contained program to generate and , and then a second self-contained program to generate the data from and . That's the interesting result.
If the intermediates and are independent of each other, i.e. neither is more compressible given the other, then we can go one more step: . In that case, there exists a shortest program which consists of a self-contained program to generate , another self-contained program to generate , and a third self-contained program to generate the data from and . In that case, our new program can straight-up reuse the and programs from the two original models; the new program directly shares structure with the originals and could even interoperate with them.
Summary
By itself, this result is kind of toy. Approximate best compressions of lots of real-world data would probably start by defining whole new languages or libraries, which would then be used by component programs later on, so the component programs would not be standalone. On the other hand, there might be other tricks to handle that sort of structure, like e.g. Solomonoff natural latents. The hope would be that we could figure out a few such foundational tricks, and then compose them to handle more complicated programs.
Acknowledgement: David wasn't involved in this particular post, but it did bubble out of stuff I've done with him.