The relevant argument is the equivalence of SI on different universal Turing machines, up to a constant. Briefly: if we have a short program on machine M1 (e.g. Python), then in the worst case we can write an equivalent program on M2 (e.g. LISP) by writing an M1-simulator and then using the M1-program (e.g. writing a Python interpreter in LISP and then running the Python program on it). The key thing to notice here is that the M1-simulator may be long, but its length is completely independent of what we're predicting - thus, the M2-Kolmogorov-complexity of a string is at most the M1-Kolmogorov-complexity plus a constant (where the constant is the length of the M1-simulator program).
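To make the construction concrete, here is a minimal Python sketch (a toy illustration, not a real interpreter; the interpreter text and function names are made up): the M2-program is just a fixed interpreter prefixed to the M1-program, so the overhead is a constant that does not depend on the string being predicted.

```python
# Toy sketch of the invariance argument. The "interpreter" below is a stand-in
# for the (long but fixed) text of an M1-interpreter written for M2.

INTERPRETER_FOR_M1_ON_M2 = "<imagine the full source of a Python interpreter written in LISP here>"

def m2_program_from_m1_program(m1_program: str) -> str:
    """Build an M2-program equivalent to a given M1-program by prefixing the fixed interpreter."""
    return INTERPRETER_FOR_M1_ON_M2 + m1_program

m1_program = "print('predict the next bit: 1')"
m2_program = m2_program_from_m1_program(m1_program)

# The overhead is the same for every M1-program, whatever we're predicting:
overhead = len(m2_program) - len(m1_program)
assert overhead == len(INTERPRETER_FOR_M1_ON_M2)
# Hence K_M2(x) <= K_M1(x) + len(INTERPRETER_FOR_M1_ON_M2) for every string x.
```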
Applied to English: we could simulate an English-speaking human. This would be a lot more complicated than a Python interpreter, but the program length would still be constant with respect to the prediction task. Given the English sentence, the simulated human should then be able to predict anything a physical human could predict given the same English sentence. Thus, if something has a short English description, then there exists a short (up to a constant) code description which contains all the same information (i.e. can be used to predict all the same things).
Two gotchas to emphasize here:
the M1-simulator may be long, but its length is completely independent of what we're predicting - thus, the M2-Kolmogorov-complexity of a string is at most the M1-Kolmogorov-complexity plus a constant (where the constant is the length of the M1-simulator program).
I agree with this, but I don't think it answers the question. (i.e. it's not a relevant argument[1])
Given the English sentence, the simulated human should then be able to predict anything a physical human could predict given the same English sentence.
There's a large edge case where the over...
One somewhat silly reason: for any simple English hypothesis, we can convert it to code by running a simulation of a human and giving them the hypothesis as input, then asking them to predict what will happen next. Therefore the English-complexity and the code-complexity can differ by at most a constant.
This gives a very loose bound, since it will probably take a lot of bits to specify a human mind. In practice I think the two complexities will usually not differ by too much, because coding languages were designed to be understandable by humans and have syntax similar to human languages. There are some difficult cases like descriptions of visual objects, but even here neural networks should be able to bridge the gap to an extent (in the limit this just becomes the 'encode a human brain' strategy).
Regarding 'levels of abstraction', I'm not sure if this is a big obstacle, as most programming languages have built-in mechanisms for changing levels of abstraction. e.g. functional programming languages allow you to treat functions as objects.
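For instance, here is a small Python sketch (the names are illustrative only) of how treating functions as objects lets you move between a low-level, spelled-out loop and a higher-level reusable operation:

```python
# Functions are ordinary values, so a low-level loop can be wrapped once
# behind a higher-level name and then reused.

from typing import Callable, Iterable, List

def apply_to_all(f: Callable[[int], int], xs: Iterable[int]) -> List[int]:
    """Higher-level operation: hides the explicit loop behind one name."""
    return [f(x) for x in xs]

def square(x: int) -> int:
    return x * x

# Lower level of abstraction: spell out the loop each time.
squares_low = []
for x in [1, 2, 3]:
    squares_low.append(square(x))

# Higher level of abstraction: pass the function around as an object.
squares_high = apply_to_all(square, [1, 2, 3])

assert squares_low == squares_high == [1, 4, 9]
```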
One somewhat silly reason: for any simple English hypothesis, we can convert it to code by running a simulation of a human and giving them the hypothesis as input, then asking them to predict what will happen next.
Some things are quick for people to do and some things are hard. Some ideas have had people arguing about them continuously for centuries. I think this either means you can't apply a simulation of a person like this, or some inputs have unbounded overhead.
...because coding languages were designed to be understandable by humans and have syntax similar to human languages.
Solomonoff Induction has a free variable, namely the choice of the Universal Turing Machine (UTM) on which to run it, which is equivalent to the choice of language (e.g. whether to choose Python or C or some formalization of the English language). How to choose that variable is indeed kind of an open question, and not super obvious, but so are many other things about how to actually use SI in practice (obviously the same is true for less formal versions of Occam's Razor).
However, you can still prove a substantial number of things about SI without knowing the UTM on which it is run because, as johnswentworth says, there is at most a fixed constant of difference in the length of the resulting programs. And in many inference problems it makes sense to think about the convergent behavior over long periods of time/many pieces of evidence, for which all SI machines will converge to the same beliefs.
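As a rough illustration of that convergence (a toy Bayesian sketch, not SI itself; the hypotheses and priors are made up), two agents whose priors over the same hypotheses differ by a bounded factor, analogous to the constant-length penalty from choosing a different UTM, end up with nearly identical posteriors once enough evidence has come in:

```python
# Toy illustration: bounded prior disagreement washes out with enough evidence.
import random

random.seed(0)

hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]   # candidate coin biases
prior_a = [0.2, 0.2, 0.2, 0.2, 0.2]      # "machine A" prior
prior_b = [0.05, 0.05, 0.4, 0.4, 0.1]    # "machine B" prior, each weight within a factor of ~4 of A's

def posterior(prior, flips):
    """Bayesian update for i.i.d. coin flips, then normalize."""
    post = list(prior)
    for flip in flips:
        post = [p * (h if flip == 1 else 1 - h) for p, h in zip(post, hypotheses)]
    total = sum(post)
    return [p / total for p in post]

true_bias = 0.7
flips = [1 if random.random() < true_bias else 0 for _ in range(300)]

print([round(p, 3) for p in posterior(prior_a, flips)])
print([round(p, 3) for p in posterior(prior_b, flips)])
# Both posteriors concentrate on the hypothesis closest to the true bias 0.7,
# despite starting from different (but boundedly different) priors.
```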
I do concretely think that this choice of open variable is a really big one, and not one that we should forget when reasoning about SI.
Brevity of code and English can correspond via abstraction.
I don't know why brevity in low- and high-abstraction programs/explanations/ideas would correspond (I suspect it wouldn't). If brevity in low- and high-abstraction stuff corresponded, wouldn't that be sort of contradictory? If a simple explanation at a high level of abstraction is also simple at a low level, then abstraction feels broken; typically ideas only become simple after abstraction. Put another way: the reason to use abstraction is to turn ideas/things that are highly complex into things that are less complex.
I think Occam's Razor makes sense only if you take into account abstractions (note: O.R. itself is still a rule of thumb regardless). Occam's Razor doesn't make sense if you think about all the extra stuff an explanation invokes - partially because that body of knowledge grows as we learn more, and good ideas become more consistent with the population of other ideas over time.
When people think of short code they think of doing complex stuff with a few lines of code, e.g. cat asdf.log | cut -d ',' -f 3 | sort | uniq
When people think of (good) short ideas they think of ideas which are made of a few well-established concepts that are widely accessible and easy to talk about, e.g. we have seasons because energy from sunlight fluctuates ~sinusoidally through our annual orbit.
One of the ways SI can use abstraction is via the abstraction being encoded in the program, the program inputs, and the observation data.
(I think) SI uses an arbitrary alphabet of instructions (for both programs and data), so you can design particular abstractions into your SI instruction/data language. Of course, in that case the program would be a bit useless for any problem other than the one you designed it for.
Is there literature arguing that code and English brevity usually or always correspond to each other?
I don't know of any.
If not, then most of our reasons for accepting Occam’s Razor wouldn’t apply to SI.
I think some of the reasoning makes sense in a pointless sort of way. e.g. the hypothesis 1100 corresponds to the program "output 1 and stop". The input data is from an experiment, and the experiment was "does the observation match our theory?", and the result was 1. The program 1100 gets fed into SI pretty early, and it matches the predicted output. The reason this works is that SI found a program which has info about 'the observation matching the theory' already encoded, and we fed in observation data with that encoding. Similarly, the question "does the observation match our theory?" is short and elegant like the program. The whole thing works out because all the real work is done elsewhere (in the abstraction layer).
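As a toy sketch of that "the real work is done in the encoding" point (the bit-string programs and the stand-in interpreter below are made up, not a real universal machine), one can enumerate short programs in length order and weight the consistent ones by 2^-length; the shortest program that happens to encode the observation convention is found almost immediately and carries the single largest weight:

```python
# Toy Solomonoff-style weighting over bit-string "programs".
from itertools import product

def toy_interpreter(program: str, n_outputs: int) -> str:
    """Stand-in machine: repeat the program's bits to produce n_outputs output bits."""
    return (program * n_outputs)[:n_outputs] if program else ""

def solomonoff_style_weights(observed: str, max_len: int = 6):
    """Weight each program consistent with the observations by 2**-length."""
    weights = {}
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            program = "".join(bits)
            if toy_interpreter(program, len(observed)) == observed:
                weights[program] = 2.0 ** -length
    return weights

# If the observation stream is "theory matched" encoded as 1s, the one-bit
# program "1" is found first and carries the single largest weight.
print(solomonoff_style_weights("1111"))
```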
On the literature that addresses your question: here is a classic LW post on this sort of question.
You point out that length of a description in English and length in code don't necessarily correlate. I think for English sentences that are actually constraining expectations, there is a fairly good correlation between length in English and length in code.
There's the issue that the high-level concepts we use in English can be short, but if we were writing a program from scratch using those concepts, the expansion of the concepts would be large. When I appeal to the concept of a buffer overflow when explaining how someone knows secrets from my email, the invocatory phrase "buffer overflow" is short, but the expansion out in terms of computers and transistors and semiconductors and solid state physics is rather long.
But I'm in the game of trying to explain all of my observations. I get to have a dictionary of concepts that I pay the cost for, and then reuse the words and phrases in the dictionary in all my explanations nice and cheaply. Similarly, the computer program that I use to explain the world can have definitions or a library of code, as long as I pay the cost for those definitions once.
So, I'm already paying the cost of the expansion of "buffer overflow" in my attempt to come up with simple explanations for the world. When new data has to be explained, I can happily consider explanations using concepts I've already paid for as rather simple.
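Here is a toy way to see that amortization in code (the "buffer overflow" strings and lengths are made up, and description length is just counted in characters of source text): defining the concept once and reusing its short name beats re-expanding it in every explanation, once there are enough uses.

```python
# Toy amortization of a one-time "dictionary" cost, measured in characters of source text.

definition = (
    "def buffer_overflow_story():\n"
    "    return 'long expansion in terms of memory layout, stacks, transistors, ...'\n"
)
short_use = "buffer_overflow_story()\n"                  # reuse the concept by name
full_expansion = "'long expansion in terms of memory layout, stacks, transistors, ...'\n"

def total_length(n_uses: int, with_dictionary: bool) -> int:
    """Total description length for n_uses explanations."""
    if with_dictionary:
        return len(definition) + n_uses * len(short_use)   # pay for the definition once
    return n_uses * len(full_expansion)                     # pay the full expansion every time

for n in (1, 5, 50):
    print(n, total_length(n, with_dictionary=True), total_length(n, with_dictionary=False))
# For a single use the dictionary costs more; with many reuses it is far cheaper.
```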
On the literature that addresses your question: here is a classic LW post on this sort of question.
The linked post doesn't seem to answer it, e.g. in the 4th paragraph EY says:
Why, exactly, is the length of an English sentence a poor measure of complexity? Because when you speak a sentence aloud, you are using labels for concepts that the listener shares—the receiver has already stored the complexity in them.
I also don't think it fully addresses the question - or even partially in a useful way, e.g. EY says:
...It’s enormously easier (as it turns out)
Solomonoff Induction (SI) focuses on short code. What’s short in English is not necessarily short in code, and vice versa. Most of our intuition in favor of short, simple explanations is from our experience with English, not code. Is there literature arguing that code and English brevity usually or always correspond to each other? If not, then most of our reasons for accepting Occam’s Razor wouldn’t apply to SI.
Another way to think of the issue may be that coding is a low-level or reductionist way to deal with an issue, while English is a high-level approach that uses high-level tools like explanations. Ideas can be represented in many ways, including at different levels of abstraction. It's unclear that length or simplicity is consistent for the same idea across different levels of abstraction. That is, if you have two ideas X and Y, and X is simpler and shorter when you compare both ideas at one level of abstraction, it may be unsafe to assume that X will also be simpler and shorter than Y when you compare them at a different level of abstraction. Is there any literature which addresses this?