K-complexity of everyday things

by cousin_it1 min read4th Dec 201117 comments

19

Personal Blog

What can we say about the K-complexity of a non-random string from our universe, e.g. the text of Finnegans Wake? It contains lots of patterns making it easy to compress using a regular archiver, but can we do much better than that?

On one hand, the laws of physics in our universe seem to be simple, and generating the text is just a matter of generating our universe then pointing to the text. On the other hand, our evolution involved a lot of quantum randomness, so pointing to humans within the universe could require a whole lot of additional bits above and beyond the laws of physics. So does anyone have good arguments whether the K-complexity of Finnegans Wake is closer to 10% or 0.1% of its length?

17 comments, sorted by Highlighting new comments since Today at 11:40 AM
New Comment

If anyone is curious about regular archivers, Joyce became less compressible throughout his life. The compression ratios (bytes per 100 characters) for Dubliners, Portrait, Ulysses, and Wake are: by gzip -9: 38, 38, 42, 47; for paq8l -7: 24, 24, 26, 33. LZMA and PPMd interpolate these numbers in unsurprising ways. Dubliners and Portrait seem about as compressible as other fiction in English.

Of course, I performed these calculations using a server in Australia, where Finnegans Wake is in the public domain.

This comment was prompted by Finnegans Wake seeming like an odd choice of a novel. War and Peace is a more prototypical novel, so you probably didn't mean anything by the choice.

Can anyone suggest other hard to compress novels?

Thanks for the pointer to paq8l. And it won the Hutter Prize too! That's funny because my post can be viewed as a comment on the relevance of the Hutter Prize.

Finnegans Wake was just my first idea for "novel that's hard to compress".

So does anyone have good arguments whether the K-complexity of Finnegans Wake is closer to 10% or 0.1% of its length?

Pointing at our world in the multiverse looks as though it would probably take a lot of bits - which would go against the "0.1%" interpretation - assuming that the reference machine lacks access to the world, of course.

[-][anonymous]10y 4

One interesting compression of Finnegans Wake that goes beyond a .zip file is a plot summary. Obviously it is lossy compression, but how lossy?

One way to think about it is to imagine you are given a plot summary of Finnegans Wake, and asked to reconstruct it. What additional information would you want? A reasonably extensive knowledge of the English language and its grammar, certainly. Most likely a description of Joyce's writing style. Knowledge of human psychology and of the setting.

Obviously, we only need to include as much information as is actually used. If Finnegans Wake never contains the word "indubitably" then we don't need its definition. Also, ideally, all of this is written in some sort of natural representation with no redundancy, rather than in English, but we can think about writing the above in English as an approximation.

Knowing the algorithm, we can then add corrections. Suppose, when we take the above information, and try to reproduce the text, we end up putting a semicolon instead of a period 600 characters in (perhaps semicolons are usually more consistent with Joyce's style, but here he was feeling capricious). We could add a note to the effect of "600 characters: period, not semicolon". A bunch of these notes (which don't really take up much space) together with the information above make up our perfectly compressed string.

I cannot see how you could reconstruct a novel from a plot summary, regardless of additional information provided. Do you mean a text such that if a person read it, then read the actual Finnigan's Wake say a year later, 95% of the time he would not notice the difference? In any case, this scheme clearly has greater K-complexity than a .zip file.

Even if a computer simulated the universe, and even if it located earth, and even if it found a text titled "Finnegans Wake," there a very large number of things that, throughout the universe, might be titled Finnegans Wake! So the computer would still need some algorithm to recognize the right Finnegans Wake.

So, universe-independent compression is where it's at.

What do you think about the conditional K-complexity of everyday things? For example, does knowing Finnegans Wake help a lot with compressing War and Peace?

If I got to call substrings of Finnegans Wake when writing a computer program that generated War and Peace, I can see how that would make it shorter. Every time they shared a phrase longer than the call to Finnegans Wake, I could just call Finnegan's wake, for example. But it wouldn't save very much, I'd think.

On the other hand, a program that spat out both books could be much shorter than two programs for one book each, by some amount related to how much repetition there was.

Yes, definitely. A simple example: one tool that will be useful to compress FW is just a dictionary (or index) of English words. Instead of encoding the letters of a word, you encode the index of the word in the list, and save bits by doing so. You have to pay an up-front cost to encode the dictionary itself, but it should still be worthwhile overall, even for a single novel. Now when you compress two novels together, you get the benefit of the dictionary for the second novel without having to repay the upfront cost.

Yes, of course. But I was thinking of a more substantial savings. The question is more like, does Finnegans Wake represent a sort of pointer to our branch of the multiverse, which you could use to compress War and Peace down to a couple kilobytes? How much "entanglement" is there?

[-][anonymous]10y 0

Deleted

[This comment is no longer endorsed by its author]Reply

It contains lots of patterns making it easy to compress using a regular archiver, but can we do much better than that?

We can do a little bit better but not much better. We can use some tricks like dictionaries, PCFG grammars, and so on, but there's just a hard limit to how much we will be able to achieve, because we need to pay a model-specification cost to encode the dictionary or grammar, and a single novel just isn't all that long.

What I think is interesting is what happens when we consider larger datasets. What is the K-complexity of all the books in the library of Congress? Here we should be able to do much better by using a specialized compressor than by using a regular archiver. Because now we can afford to use all kinds of high-concept tricks, because the cost of encoding those tricks will be amortized over a far larger dataset.

Simulating the universe seems to be the ultimate "high-concept trick". I wonder how much data you need before it starts to pay off. How do I even go about estimating the answer to such a question?

Simulating the universe seems to be the ultimate "high-concept trick". I wonder how much data you need before it starts to pay off.

It would seem that such a trick would usually only provide minimal assistance. After all, more can be said of a single library of congress than of all libraries of congress in a Tegmark level 1 simulation!

I confess I'm a little confused by the question here. Doesn't K-complexity have to be relative to some base language? Normally that's a trivial detail but when we get to the level of deciding whether or not to use a specification of the universe as an optimization technique it becomes rather salient.

Doesn't K-complexity have to be relative to some base language? Normally that's a trivial detail but when we get to the level of deciding whether or not to use a specification of the universe as an optimization technique it becomes rather salient.

Why? As long as the translator between two different base languages is much smaller than Finnegans Wake, I don't see the problem.

It would seem that such a trick would usually only provide minimal assistance. After all, more can be said of a single library of congress than of all libraries of congress in a Tegmark level 1 simulation!

That's true. I'm trying to understand if simulating our universe is ever the easiest way to recreate complex things, and from the comments it's seeming less and less likely. In particular, UDASSA presupposes that the easiest way to generate the state of a human mind is to simulate the universe and point to the mind within it, which might easily turn out to be false.

Why? As long as the translator between two different base languages is much smaller than Finnegans Wake, I don't see the problem.

Because if we are deciding whether there are gains to be made by including a simulation of the universe in the compression algorithm. In that case the comparisons to be made are between the representation of a universe simulation, what efficiency this can gain compared to the base language and the translation cost between languages. Since we can expect any benefit to using a universe sim to be rather minimal this matters a lot!

I was actually only allowing the possibility that simulating the universe could be an "ultimate high concept trick" if you were going to be compressing things a whole heap more arbitrary than Finnegans wake (or any human produced data). If you are just talking about compressing human works and lets say using any old language like "ruby" then including a simulation of the universe as part of the message is an ultimately terrible concept. Even narrowing things down from "something represented in the universe" to "something conveniently expressed in ruby" provides an enormous amount of information already.

If our hypothesis is correct, the gains from simulating the universe are not just minimal, they're negative...