In case anyone else didn't know what it meant for a set of binary strings to be "prefix-free", here's Claude's explanation, which I found helpful:
A set of binary strings is prefix-free if no string in the set is a prefix of any other string in the set.
Example:
- ✅ Prefix-free: {0, 10, 110, 111}
- ❌ Not prefix-free: {0, 01, 10} (because "0" is a prefix of "01")
Why does this matter for Turing machines?
The key is in how universal Turing machines work. A universal machine U simulates any other machine T by receiving input of the form i′q, where:
- i′ = prefix-free encoding of machine T's description
- q = the actual input to feed to machine T
U must parse this concatenated input to figure out: "Where does the machine description end and the actual input begin?"
Without prefix-free encoding: Given input "00101", U can't tell if the machine description is "0", "00", "001", etc. - there's ambiguity.
With prefix-free encoding: Once U reads a complete machine description, it knows exactly where that description ends and the input begins. No ambiguity, no delimiters needed.
This unique parseability is essential for universal machines to correctly simulate other machines, and it's crucial for Kolmogorov complexity theory where we need to measure program lengths precisely without parsing ambiguity.
Thanks for adding!
Not sure if you were aware, but in the glossary at the top-right of the post, there is also an explanation (albeit shorter) of "prefix-free code". I'm just mentioning this in case you weren't aware of the glossary functionality.
In this post, I give a self-contained treatment, including a proof, of the coding theorem.[1] Let $x$ be a finite binary string and $U$ a universal prefix-free Turing machine. The universal probability $m(x)$ of $x$ is defined as the probability that a program sampled uniformly at random computes $x$ when it is run via $U$. The coding theorem says that $m(x) = 2^{-K(x)}$ up to a multiplicative constant, where $K(x)$ is the prefix-free Kolmogorov complexity of $x$.[2] Thus, more complex binary strings become exponentially less likely and vice versa. This can be interpreted as a form of Occam's razor.
This statement has found some prior interest on LessWrong. Alexander Gietelink Oldenziel called the argument behind one inequality the padding argument, after it was independently rediscovered by Matthias Dellago and Lucius Bushnaq.[3] Nate Soares wrote a whole post about his rediscovery of some of these ideas, for the case where the strings are themselves computable functions. A similar result was discussed by Joar Skalse in the context of the complexity of functions learned by neural networks. Note that my usage of the term padding is unrelated to a different type of padding argument.
The motivation of this post is to provide a didactic and elementary proof for readers, avoiding the typical machinery of "lower semicomputable discrete semimeasures", and to have a reference for future posts. In particular, Alexander and I are currently writing a post on the complexity of functions learned by neural networks, in which we would like to refer to this result.
Audience: This post assumes basic knowledge of Turing machines, the Church-Turing thesis, and Kolmogorov complexity for prefix-free Turing machines (often called prefix Turing machines; I find it clearer to append "-free"). We will briefly recall some basic definitions as a reminder.
Contributions: It is possible that all parts of the proofs can already be found identically in the literature, but I did not try to find them. OpenAI's o3 found the idea for the dovetailing procedure. The proof of the efficient algorithmic Kraft coding in the appendix is mine. The entire post was written by me, except the last paragraph of the following section, which was first drafted by GPT-5.
Acknowledgments: I thank Alexander Gietelink Oldenziel and Matthias Dellago for engaging discussions about the coding theorem and the padding argument.
Let $\mathbb{B} = \{0, 1\}$ be our alphabet. Denote by $\mathbb{B}^*$ the set of binary strings of arbitrary finite length. A prefix-free Turing machine is a Turing machine $T$ that halts on a prefix-free set of programs $\mathrm{dom}(T) \subseteq \mathbb{B}^*$. Let $U$ be a universal prefix-free Turing machine, meaning that for any prefix-free Turing machine $T$, there exists $i \in \mathbb{B}^*$ such that for all $q \in \mathbb{B}^*$ we have $U(i'q) = T(q)$,[4] where $i'$ is a chosen prefix-free encoding of $i$.
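As a concrete illustration (this particular encoding is just one standard choice, not the one fixed in this post): one can encode a description $i = i_1 \cdots i_n$ in a prefix-free way by announcing its length in unary first,
$$i' := \underbrace{1 \cdots 1}_{n}\, 0\, i_1 \cdots i_n.$$
A machine reading $i'q$ from left to right knows after the first $0$ exactly how many description bits follow, so it can tell where the description ends and the input $q$ begins.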
We construct $m(x)$, the universal a priori probability of $x$, as follows: First, by abuse of notation, consider the uniform distribution on infinite binary sequences $s \in \mathbb{B}^\infty$, which are sampled by sampling each bit in sequence with probability $1/2$. Now for a particular sampled sequence $s$, we observe whether it starts with a finite sequence $p$ on which $U$ halts, and if it does, then we write $\overline{U}(s) := U(p)$, where $\overline{U}$ is an adapted version of $U$ that operates on infinite sequences. There can only be one such $p$ since $\mathrm{dom}(U)$ is prefix-free. Then, define the universal a priori probability of $x$ to be
$$m(x) := P_s\big(\overline{U}(s) = x\big),$$
i.e., the probability that $\overline{U}$ maps a randomly sampled sequence to $x$, i.e., that $s$ starts with a program $p$ on which $U$ halts with output $x$. As an aside, note that this is only a semiprobability mass function since some infinite sequences do not contain any prefix on which $U$ halts, meaning that $\overline{U}(s)$ is undefined. Thus, we have
$$\sum_{x \in \mathbb{B}^*} m(x) \;\le\; 1.$$
This total probability is also called the halting probability; it is the probability that a randomly selected program produces an output and halts.
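To make the definition concrete, here is a minimal Python sketch (an illustration only, not needed for the argument) that estimates $m(x)$ by sampling random bit sequences for a toy stand-in of a prefix-free machine, given by a hard-coded halting table; the table and all names are hypothetical.

```python
import random

# Toy stand-in for a prefix-free machine: a hard-coded map from halting
# programs to outputs. The key set {"00", "010", "011", "110"} is prefix-free.
TOY_MACHINE = {"00": "a", "010": "a", "011": "b", "110": "b"}
MAX_LEN = max(len(p) for p in TOY_MACHINE)

def run_on_sequence(bits):
    """Return the output if the sampled sequence starts with a halting program, else None."""
    for l in range(1, MAX_LEN + 1):
        prefix = "".join(bits[:l])
        if prefix in TOY_MACHINE:
            return TOY_MACHINE[prefix]
    return None

def estimate_m(x, samples=100_000, seed=0):
    """Monte Carlo estimate of m(x): count how often a uniformly random sequence
    starts with a program that outputs x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        bits = [rng.choice("01") for _ in range(MAX_LEN)]
        if run_on_sequence(bits) == x:
            hits += 1
    return hits / samples

print(estimate_m("a"))  # close to 1/4 + 1/8 = 0.375 for this toy table
print(estimate_m("b"))  # close to 1/8 + 1/8 = 0.25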
For a program $p$, denote by $|p|$ its length. Then the distribution $m$ also has an alternative and more common formulation:
Lemma 1 (Deadcode/padding argument). We have
$$m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}.$$
Proof. We first claim
$$\big\{s \in \mathbb{B}^\infty : \overline{U}(s) = x\big\} \;=\; \bigcup_{p \,:\, U(p) = x} \big\{s \in \mathbb{B}^\infty : s \text{ starts with } p\big\}.$$
The inclusion from left to right is clear: Every infinite $s$ with $\overline{U}(s) = x$ by definition starts with a finite program $p$ such that $U(p) = x$. The other direction is the padding or deadcode argument: If a program $p$ computes $x$ (i.e., $U(p) = x$), then every extension of $p$ to an infinite sequence $s$ will compute $x$ via $\overline{U}$: The extensions/paddings are dead code that is ignored by $\overline{U}$. Note also that the union is disjoint since $\mathrm{dom}(U)$ is prefix-free.
Thus, we obtain
$$m(x) \;=\; P\big(\overline{U}(s) = x\big) \;=\; \sum_{p \,:\, U(p) = x} P\big(s \text{ starts with } p\big) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|}.$$
In the last step, we used that the likelihood of sampling an infinite sequence that starts with $p$ is exactly $2^{-|p|}$ since each specific bit is sampled with probability $1/2$ under the uniform distribution.
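Continuing the toy illustration from above, Lemma 1 can be checked directly for a small hypothetical prefix-free table: summing $2^{-|p|}$ over the programs that output $x$ reproduces the values that the sampling estimate approaches.

```python
from fractions import Fraction

# Same kind of toy prefix-free halting table as above (hypothetical).
TOY_MACHINE = {"00": "a", "010": "a", "011": "b", "110": "b"}

def m_exact(x):
    """m(x) via Lemma 1: sum of 2^{-|p|} over all programs p with output x."""
    return sum(Fraction(1, 2 ** len(p)) for p, out in TOY_MACHINE.items() if out == x)

print(m_exact("a"), m_exact("b"))  # 3/8 and 1/4, matching the sampling estimates above
```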
Now, let $K(x)$ be the Kolmogorov complexity of a binary string $x$ computed via the universal prefix-free Turing machine $U$, i.e., the length of the shortest program that computes $x$:
$$K(x) \;:=\; \min\big\{|p| \,:\, U(p) = x\big\}.$$
We can now formulate the coding theorem:
Theorem 2 (Coding Theorem). There is a constant $c \ge 1$ such that for all $x \in \mathbb{B}^*$, we have
$$2^{-K(x)} \;\le\; m(x) \;\le\; c \cdot 2^{-K(x)}.$$
In other words, up to at most a multiplicative constant, $m(x)$ coincides with $2^{-K(x)}$.
Proof (first inequality). Choose a shortest program $p$ with $U(p) = x$. Then we have $|p| = K(x)$ and we obtain $m(x) \ge 2^{-|p|} = 2^{-K(x)}$ by Lemma 1.
The rest of the post will be about proving the second inequality, which is much harder than the first. It roughly states that if $x$ is likely under the universal probability, then its Kolmogorov complexity is small, i.e., $x$ can be compressed. To prove this claim, we first reformulate it slightly. Up to a relabeling of $c$, and by taking the binary logarithm, the inequality is equivalent to
$$K(x) \;\le\; -\log_2 m(x) + c.$$
The rough outline is as follows: Our proof of this inequality constructs, for each string $x$, a short prefix-free program whose length is close to $-\log_2 m(x)$, using an online version of Kraft's inequality: We simulate all programs in parallel (dovetailing) and maintain running lower bounds $\tilde m(x)$ of $m(x)$. Whenever $\tilde m(x)$ first crosses a threshold $2^{-n}$, we assign a fresh codeword of length $n + 1$ via an algorithmic Kraft coding procedure that guarantees prefix-freeness without knowing future codelengths. That this is accomplished by an online algorithm ensures that the decoding procedure is computable by a fixed prefix-free Turing machine, adding only a constant to the code lengths. Since $\tilde m(x)$ eventually comes within a factor of $2$ of $m(x)$, the resulting shortest codeword length found by the procedure is at most $-\log_2 m(x) + 2$, establishing the desired bound with a constant $c$ that additionally absorbs the description length of the decoder.
We need an algorithmic version of the Kraft inequality as a tool. If you are not interested in the details, feel free to only read the statement of Theorem 4 and then move on to the next section.
For a binary string $\sigma = \sigma_1 \cdots \sigma_{|\sigma|}$, we denote by $0.\sigma$ the corresponding real number in base $2$, with $0.\sigma := \sum_{i=1}^{|\sigma|} \sigma_i 2^{-i}$. Furthermore, $[\sigma] := \big[0.\sigma,\ 0.\sigma + 2^{-|\sigma|}\big)$ denotes the corresponding half-open dyadic interval. We say that $\sigma$ is a prefix of $\tau$ if $\tau$ starts with the bits of $\sigma$.
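For example, for $\sigma = 011$ we get
$$0.\sigma \;=\; 0 \cdot 2^{-1} + 1 \cdot 2^{-2} + 1 \cdot 2^{-3} \;=\; 0.375, \qquad [\sigma] \;=\; [0.375,\ 0.5).$$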
Lemma 3. Let $\sigma, \tau \in \mathbb{B}^*$ be binary strings. Then $[\tau] \subseteq [\sigma]$ if and only if $\sigma$ is a prefix of $\tau$.
Proof. Assume $\sigma$ is a prefix of $\tau$. Then
$$0.\sigma \;\le\; 0.\tau \;<\; 0.\tau + 2^{-|\tau|} \;\le\; 0.\sigma + 2^{-|\sigma|}.$$
All inequalities except the last are trivial. The last inequality follows since adding $2^{-|\tau|}$ to $0.\tau$ increments only the last bit of $\tau$, which cannot give more than incrementing an earlier bit, which is what happens when forming $0.\sigma + 2^{-|\sigma|}$, given that $\sigma$ is a prefix of $\tau$. Overall, this shows $[\tau] \subseteq [\sigma]$.
For the other direction, assume we have $[\tau] \subseteq [\sigma]$, i.e., the inequalities in the first equation in the proof all hold. If $0.\sigma = 0.\tau$, then the third inequality shows $2^{-|\tau|} \le 2^{-|\sigma|}$, which shows $|\sigma| \le |\tau|$, and implies $\tau = \sigma 0^{|\tau| - |\sigma|}$. Thus, $\sigma$ is a prefix of $\tau$ and we are done. Thus, we can assume $0.\sigma < 0.\tau$. For any natural number $i$, define $\sigma_i$ and $\tau_i$ as the $i$'th bit in the binary expansions of $0.\sigma$ and $0.\tau$, respectively. Let $i_0$ be the earliest index where $\sigma_{i_0} \neq \tau_{i_0}$ (which exists since $0.\sigma \neq 0.\tau$); since the bits before $i_0$ agree and $0.\sigma < 0.\tau$, we have $\sigma_{i_0} = 0$ and $\tau_{i_0} = 1$. We are done if we can show $i_0 > |\sigma|$. And indeed, if this were not the case, then
$$0.\sigma + 2^{-|\sigma|} \;\le\; 0.\tau_{1:i_0 - 1} + 2^{-i_0} \;\le\; 0.\tau \;<\; 0.\tau + 2^{-|\tau|},$$
a contradiction to the assumed inequalities. Thus, $i_0 > |\sigma|$, i.e., $\sigma$ is a prefix of $\tau$.
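As a quick sanity check of Lemma 3: $01$ is a prefix of $011$, and indeed $[011] = [0.375,\ 0.5) \subseteq [0.25,\ 0.5) = [01]$, whereas, say, $[10] = [0.5,\ 0.75)$ is disjoint from $[01]$.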
Theorem 4 (Algorithmic Kraft Coding). Let $\ell_1, \ell_2, \ell_3, \dots$ be a sequence of natural numbers. Assume they satisfy Kraft's inequality:
$$\sum_{k} 2^{-\ell_k} \;\le\; 1.$$
Then there exist codewords $\sigma_1, \sigma_2, \sigma_3, \dots \in \mathbb{B}^*$ with the two properties that:
1. the set of codewords $\{\sigma_1, \sigma_2, \sigma_3, \dots\}$ is prefix-free, and
2. $|\sigma_k| = \ell_k$ for all $k$.
Furthermore, there is an algorithm which successively maps the lengths $\ell_k$ to the codewords $\sigma_k$ without looking at future codelengths and without revising earlier codewords along the way.
Proof. Here, we present a simple proof that only achieves the slightly less efficient codelengths $|\sigma_k| = \ell_k + 1$. In the appendix, we present a more complicated algorithm that achieves precisely the stated result.
The algorithm proceeds as follows: Upon receiving the lengths $\ell_1, \dots, \ell_k$, form the number $L_k := \sum_{j=1}^{k} 2^{-\ell_j}$, and analogously $L_{k-1} := \sum_{j=1}^{k-1} 2^{-\ell_j}$. Consider the interval $I_k := [L_{k-1}, L_k)$.
To construct the codeword $\sigma_k$, go through all binary codewords $\sigma$ of length $\ell_k + 1$, lexicographically, until you find the very first one for which $0.\sigma \ge L_{k-1}$. This is $\sigma_k$.
Now we prove the first property (the second one holds by construction, though with lengths $\ell_k + 1$). Due to the minimality of $\sigma_k$, we have $0.\sigma_k < L_{k-1} + 2^{-(\ell_k + 1)}$. Consequently, we have
$$L_{k-1} \;\le\; 0.\sigma_k \;<\; 0.\sigma_k + 2^{-(\ell_k + 1)} \;<\; L_{k-1} + 2^{-\ell_k} \;=\; L_k.$$
Consequently, the dyadic interval $[\sigma_k]$ satisfies $[\sigma_k] \subseteq I_k$. Since all $I_k$ are pairwise disjoint, all $[\sigma_k]$ are pairwise disjoint, too. By Lemma 3, no $\sigma_k$ is a prefix of any other. That proves property 1. Finally, notice that $\sigma_k$ was chosen algorithmically without knowledge of future codelengths and without revising any earlier codewords, finishing the proof.
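For illustration, here is a minimal Python sketch of this simple variant (function names are mine, and exact arithmetic via fractions is used to avoid rounding issues). It maps each incoming length $\ell$ to the first codeword of length $\ell + 1$ whose dyadic interval starts at or after the running sum $L_{k-1}$, as in the proof above.

```python
from fractions import Fraction

def online_kraft_encoder():
    """Online Kraft coding, simple variant: each requested length l is mapped to a
    codeword of length l + 1. The codewords are prefix-free as long as the requested
    lengths satisfy Kraft's inequality."""
    L = Fraction(0)  # running sum 2^{-l_1} + ... + 2^{-l_{k-1}}

    def encode(l):
        nonlocal L
        step = Fraction(1, 2 ** (l + 1))
        grid_index = -((-L) // step)          # first multiple of 2^{-(l+1)} at or above L
        codeword = format(grid_index, "b").zfill(l + 1)
        L += Fraction(1, 2 ** l)              # move past the interval reserved for this length
        return codeword

    return encode

encode = online_kraft_encoder()
print([encode(l) for l in (1, 2, 3)])  # ['00', '100', '1100'], a prefix-free set
```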
Remark: The statement of the theorem is stronger than its typical formulation since I require there to be an algorithm that selects one codeword at a time. This is what makes my proof of the efficient codewords in the appendix so complicated. Other proofs I found in the literature assume that the codelengths can be ordered in a non-decreasing way, which isn't possible when receiving one codelength at a time.
The converse of this theorem is also true, i.e., the codelengths of any prefix-free code satisfy Kraft's inequality; see the Wikipedia page.
We now prove the second inequality of Theorem 2. Recall that we want to prove
$$K(x) \;\le\; -\log_2 m(x) + c.$$
We now explain how to algorithmically encode a binary string $x$ with a codeword of length roughly $-\log_2 m(x)$ to achieve this. Crucially, we can't compute $m(x)$ precisely and can only approximate it, leading to a moving target for the codelength of $x$, and so we have to code $x$ in several ways since we can never be sure whether we have already found the correct codelength.
We proceed as follows: For all $x \in \mathbb{B}^*$, initialize $\tilde m(x) := 0$ as an approximation to $m(x)$. Initialize an empty sequence of codelengths. We approximate $m(x)$ from below. We do this by dovetailing: Run all inputs in parallel through the universal Turing machine $U$, one algorithmic step at a time. Assume our list of codelengths already contains $\ell_1, \dots, \ell_k$ at some algorithmic step. Whenever any program $p$ halts and computes $x$ (i.e., $U(p) = x$) for any binary string $x$, update the estimate $\tilde m(x)$:
$$\tilde m(x) \;\leftarrow\; \tilde m(x) + 2^{-|p|}.$$
When this happens, check whether $\tilde m(x)$ just crossed a threshold $2^{-n}$ for any natural number $n$ for the first time. If it did, then for the smallest such $n$ (if it crosses several at once), append $\ell_{k+1} := n + 1$ to the sequence of codelengths to obtain $\ell_1, \dots, \ell_{k+1}$, and assign to $x$ a codeword of length $n + 1$ via the algorithm from Theorem 4.
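The following Python sketch mimics this procedure for a toy stand-in of $U$, given as a hypothetical table mapping programs to (halting step, output); it only emits the codelengths $n + 1$, which Theorem 4 would then turn into actual codewords. All names and numbers are made up for illustration.

```python
from fractions import Fraction

# Hypothetical stand-in for the dovetailed machine: program -> (halting step, output).
# The key set is prefix-free; the real procedure would run U itself.
MACHINE = {"00": (3, "a"), "010": (7, "a"), "011": (2, "b"), "110": (5, "b")}

def emit_codelengths(machine, max_steps=10):
    """Dovetail all programs; whenever the smallest threshold 2^{-n} below the running
    lower bound for m(x) is crossed for the first time, emit the codelength n + 1."""
    estimate = {}            # running lower bounds for m(x)
    crossed = set()          # pairs (x, n) whose threshold 2^{-n} has been crossed
    emitted = []             # (x, codelength) in the order they would be sent to Theorem 4
    for step in range(1, max_steps + 1):
        for program, (halting_step, x) in machine.items():
            if halting_step == step:                       # this program halts right now
                estimate[x] = estimate.get(x, Fraction(0)) + Fraction(1, 2 ** len(program))
                n = 1
                while Fraction(1, 2 ** n) > estimate[x]:   # smallest n with 2^{-n} <= estimate
                    n += 1
                if (x, n) not in crossed:
                    crossed.add((x, n))
                    emitted.append((x, n + 1))
    return emitted

print(emit_codelengths(MACHINE))  # [('b', 4), ('a', 3), ('b', 3)] for this toy table
```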
For this to be well-defined, we need to check that all the codelengths together satisfy Kraft's inequality. Notice that each codelength is of the form $n + 1$ for an $n$ with $2^{-n} \le m(x)$ for some binary string $x$, and that no $n$ appears twice for a given $x$ since we only update the list if a threshold is crossed for the first time. For a given $x$, let $n_x$ be the minimal $n$ that eventually appears in the algorithm for $x$; it satisfies $2^{-n_x} \le m(x)$. Thus, we have
$$\sum_{k} 2^{-\ell_k} \;\le\; \sum_{x} \sum_{n \ge n_x} 2^{-(n+1)} \;=\; \sum_{x} 2^{-n_x} \;\le\; \sum_{x} m(x) \;\le\; 1,$$
where the sums over $x$ run over all $x$ that are assigned at least one codelength.
Thus, algorithmic Kraft coding can indeed be applied.
Now, if you have a codeword $\sigma$ that arose from this procedure, you can decode $\sigma$ to $x$ algorithmically as follows: Just run the same procedure again and observe which binary string $x$ gets coded as $\sigma$. Then, halt and output $x$. Since this procedure halts only on outputs of the above encoding, and these outputs form a prefix-free set by Theorem 4, the decoding procedure described here halts on a prefix-free set, and so it is given by a prefix-free Turing machine (where we make use of the Church-Turing thesis). Let $D$ be this Turing machine. Thus, there exists a Turing machine index $i$ with prefix-free encoding $i'$ such that $U(i'\sigma) = D(\sigma)$ for all $\sigma$. Consequently, we have $U(i'\sigma) = x$ for every codeword $\sigma$ that the procedure assigns to $x$, and thus
$$K(x) \;\le\; |i'| + |\sigma| \;=\; |\sigma| + c_D,$$
where $c_D := |i'|$ is a constant independent of $x$ and $\sigma$.
Now, note that for a given $x$, in the encoding procedure, we eventually reach the minimal $n_x$, which then satisfies $2^{-(n_x - 1)} \ge m(x)$ since otherwise $\tilde m(x)$ would eventually cross the threshold $2^{-(n_x - 1)}$, contradicting the minimality of $n_x$. Thus, there exists some minimal codeword $\sigma_x$ for $x$ such that
$$|\sigma_x| \;=\; n_x + 1 \;\le\; -\log_2 m(x) + 2.$$
Combining with the previous equation, we obtain
$$K(x) \;\le\; |\sigma_x| + c_D \;\le\; -\log_2 m(x) + 2 + c_D \;=:\; -\log_2 m(x) + c,$$
finishing the proof of Theorem 2.
Here, we present a more complicated algorithm that achieves the efficient codelengths $|\sigma_k| = \ell_k$ for Theorem 4. Start with the set $R_0 := \{0, 10, 110, 1110, \dots\}$ of codewords, containing exactly one codeword of each length.[5] Intuitively, we think of them as a "resource" from which we choose the codewords of length $\ell_k$. Whenever this resource does not have a codeword of the correct size available, we fractally "break up" the longest available shorter string into longer strings in a prefix-free way, assign one of them, and add the new remaining "pieces" back into the resource in a way that does not create any "waste".
We now explain the details of how to do this. Initialize the set of "found codewords" as the empty set $F_0 := \emptyset$. Note that $R_0$ and $F_0$ have the following properties for $k = 0$:
1. $R_k \cup F_k$ is prefix-free.
2. $F_k = \{\sigma_1, \dots, \sigma_k\}$ with $|\sigma_j| = \ell_j$ for all $j \le k$.
3. No two elements of $R_k$ have the same length.
4. $\sum_{\rho \in R_k} 2^{-|\rho|} + \sum_{j=1}^{k} 2^{-\ell_j} = 1$.
In the course of this proof, we will progressively replace elements of $R_k$ with extensions and build up the desired codewords algorithmically while preserving all four properties. The fact that $R_k \cup F_k$ is prefix-free by property 1 and that $F_k = \{\sigma_1, \dots, \sigma_k\}$ by property 2 then shows that also $\{\sigma_1, \dots, \sigma_k\}$ is prefix-free, i.e., the code is prefix-free, establishing the result.
We start with $k = 1$. As $\sigma_1$, choose the unique $\rho \in R_0$ with $|\rho| = \ell_1$. This exists since $R_0$ contains one codeword for each codelength. Set $F_1 := \{\sigma_1\}$ and $R_1 := R_0 \setminus \{\sigma_1\}$.
Note that all four properties still hold for $k = 1$.
Now, assume by induction that $R_k$ and $F_k$ are already constructed and that all four properties hold. If there is a $\rho \in R_k$ with $|\rho| = \ell_{k+1}$, then simply choose $\sigma_{k+1} := \rho$, $F_{k+1} := F_k \cup \{\rho\}$ and $R_{k+1} := R_k \setminus \{\rho\}$. The choice of $\rho$ is unique by property 3. Then trivially, all four properties are preserved.
Finally, consider the case that there is no $\rho \in R_k$ with $|\rho| = \ell_{k+1}$. Define $\rho$ as the longest element $\rho \in R_k$ with $|\rho| < \ell_{k+1}$. It is unique by property 3. It exists since otherwise $R_k$ would not represent any of the codelengths $1, \dots, \ell_{k+1}$, which by induction and properties 4 and 3 (in that order) would imply
$$\sum_{j=1}^{k} 2^{-\ell_j} \;=\; 1 - \sum_{\rho' \in R_k} 2^{-|\rho'|} \;\ge\; 1 - \sum_{n > \ell_{k+1}} 2^{-n} \;=\; 1 - 2^{-\ell_{k+1}}.$$
This would result in
$$\sum_{j=1}^{k+1} 2^{-\ell_j} \;\ge\; \big(1 - 2^{-\ell_{k+1}}\big) + 2^{-\ell_{k+1}} \;=\; 1,$$
in contradiction to the assumed Kraft's inequality. Thus, $\rho \in R_k$ with $|\rho| < \ell_{k+1}$ (and maximally long with these properties) exists.
Now, set $d := \ell_{k+1} - |\rho| \ge 1$ and $\sigma_{k+1} := \rho\, 0^{d}$. Define
$$N := \big\{\rho 1,\ \rho 0 1,\ \rho 0^2 1,\ \dots,\ \rho 0^{d-1} 1\big\}.$$
Define $F_{k+1} := F_k \cup \{\sigma_{k+1}\}$ and $R_{k+1} := \big(R_k \setminus \{\rho\}\big) \cup N$.
We now show that all four properties remain true with these constructions. The newly added elements $\{\sigma_{k+1}\} \cup N$ are clearly not prefixes of each other. Since they are all extensions of $\rho$ and $R_k \cup F_k$ was prefix-free, the prefix-free property then follows for all of $R_{k+1} \cup F_{k+1}$, proving property 1. For property 2, note that $F_{k+1} = F_k \cup \{\sigma_{k+1}\} = \{\sigma_1, \dots, \sigma_{k+1}\}$, and since $|\sigma_{k+1}| = |\rho| + d = \ell_{k+1}$ by definition, we have $|\sigma_j| = \ell_j$ for all $j \le k+1$, too. This proves property 2.
For property 3, note that
$$\big\{|\nu| \,:\, \nu \in N\big\} \;=\; \big\{|\rho| + 1,\ |\rho| + 2,\ \dots,\ \ell_{k+1}\big\}.$$
Thus, codelength $|\rho|$ is removed from $R_k$ and codelengths $|\rho| + 1, \dots, \ell_{k+1}$ are added to $R_{k+1}$. By construction of $\rho$ (i.e., its maximality), none of the newly added codelengths were present in $R_k$, and so no codelength is represented more than once in $R_{k+1}$, proving property 3.
Property 4 follows from the displayed definition of $N$ above and the induction assumption for $R_k$:
$$\sum_{\rho' \in R_{k+1}} 2^{-|\rho'|} + \sum_{j=1}^{k+1} 2^{-\ell_j} \;=\; \sum_{\rho' \in R_k} 2^{-|\rho'|} - 2^{-|\rho|} + \sum_{n=|\rho|+1}^{\ell_{k+1}} 2^{-n} + \sum_{j=1}^{k} 2^{-\ell_j} + 2^{-\ell_{k+1}} \;=\; \sum_{\rho' \in R_k} 2^{-|\rho'|} + \sum_{j=1}^{k} 2^{-\ell_j} \;=\; 1.$$
As we explained before, this concludes the proof.
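For completeness, here is a small Python sketch of the resource-splitting step (an illustration only; to keep the resource finite it caps codeword lengths at a fixed `max_len`, so it is merely a toy version of the construction above, and all names are mine).

```python
def exact_kraft_encoder(max_len=20):
    """Online Kraft coding with exact lengths, mirroring the appendix construction:
    keep at most one available codeword per length and split a shorter one when needed."""
    # initial resource: one codeword per length, here 0, 10, 110, 1110, ... up to max_len
    resource = {n: "1" * (n - 1) + "0" for n in range(1, max_len + 1)}

    def encode(length):
        if length in resource:                  # a codeword of the right length is available
            return resource.pop(length)
        shorter = [n for n in resource if n < length]
        if not shorter:
            raise ValueError("no suitable codeword available for this length")
        rho = resource.pop(max(shorter))        # longest available codeword shorter than `length`
        while len(rho) < length:                # split [rho] into pieces, keeping no waste
            resource[len(rho) + 1] = rho + "1"  # piece returned to the resource
            rho = rho + "0"                     # continue splitting the other half
        return rho                              # the piece of exactly the requested length

    return encode

encode = exact_kraft_encoder()
print([encode(l) for l in (1, 3, 3, 4)])  # ['0', '110', '100', '1110'], prefix-free, exact lengths
```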
See Theorem 4.3.3 in Li and Vitányi's "An Introduction to Kolmogorov Complexity and Its Applications". This was first proved by L.A. Levin in 1974.
Importantly, this result is only correct up to a logarithmic error for plain, or descriptional, Kolmogorov complexity. Thus, the assumption that we work with prefix-free Turing machines is crucial. Thanks to Daniel Filan for pointing that out.
The term "padding" is also used in this context in the book "An Introduction to Universal Artificial Intelligence" by Hutter, Quarel, and Catt.
In particular, $U$ halts on $i'q$ whenever $T$ halts on $q$.
My guess is one could also start with the resource consisting of only the empty binary string, which would possibly be somewhat cleaner than the choice I made.