The Coding Theorem — A Link between Complexity and Probability

by Leon Lang
10th Aug 2025
10 min read
Comments
Matt Dellago:

Excellent! Great to have a cleanly formulated article to point people to!

jacob_drori:

In case anyone else didn't know what it meant for a set of binary strings to be "prefix-free", here's Claude's explanation, which I found helpful:
 

A set of binary strings is prefix-free if no string in the set is a prefix of any other string in the set.

Example:

  • ✅ Prefix-free: {0, 10, 110, 111}
  • ❌ Not prefix-free: {0, 01, 10} (because "0" is a prefix of "01")

Why does this matter for Turing machines?

The key is in how universal Turing machines work. A universal machine U simulates any other machine T by receiving input of the form i′q, where:

  • i′ = prefix-free encoding of machine T's description
  • q = the actual input to feed to machine T

U must parse this concatenated input to figure out: "Where does the machine description end and the actual input begin?"

Without prefix-free encoding: Given input "00101", U can't tell if the machine description is "0", "00", "001", etc. - there's ambiguity.

With prefix-free encoding: Once U reads a complete machine description, it knows exactly where that description ends and the input begins. No ambiguity, no delimiters needed.

This unique parseability is essential for universal machines to correctly simulate other machines, and it's crucial for Kolmogorov complexity theory where we need to measure program lengths precisely without parsing ambiguity.
 

Leon Lang:

Thanks for adding!

Not sure if you were aware, but in the glossary at the top-right of the post, there is also an explanation (albeit shorter) of "prefix-free code". I'm just mentioning this in case you weren't aware of the glossary functionality.

jacob_drori:

Ah I hadn't noticed that, very nice. Great post!


In this post, I give a self-contained treatment, including a proof, of the coding theorem.[1] Let x be a finite binary string and U a universal prefix-free Turing machine. The universal probability P(x) of x is defined as the probability that a program p sampled uniformly at random computes x when p is run via U. The coding theorem says that P(x) ≈ 2^{−K(x)} up to a multiplicative constant, where K(x) is the prefix-free Kolmogorov complexity of x.[2] Thus, more complex binary strings x become exponentially less likely and vice versa. This can be interpreted as a form of Occam's razor.

This statement has found some prior interest on LessWrong. Alexander Gietelink Oldenziel called the argument behind one inequality the padding argument, after it was independently rediscovered by Matthias Dellago and Lucius Bushnaq.[3] Nate Soares wrote a whole post about his rediscovery of some of these ideas, for the case in which the strings x themselves encode computable functions. A similar result was discussed by Joar Skalse in the context of the complexity of functions learned by neural networks. Note that my usage of the term padding is unrelated to a different type of padding argument.

The motivation of this post is to provide a didactic and elementary proof for readers, avoiding the typical machinery of "lower semicomputable discrete semimeasures", and to have a reference for future posts. In particular, Alexander and I are currently writing a post on the complexity of functions learned by neural networks, in which we would like to refer to this result.

Audience: This post assumes basic knowledge of Turing machines, the Church-Turing thesis, and Kolmogorov complexity for prefix-free Turing machines (often called prefix Turing machines; I find it clearer to append "free"). We briefly recall some basic definitions along the way.

Contributions: It is possible that all parts of the proofs can already be found in identical form in the literature, but I did not try to track them down. OpenAI's o3 found the idea for the dovetailing procedure. The proof of the efficient algorithmic Kraft coding in the appendix is mine. The entire post was written by me, except for the last paragraph of the following section, which was first drafted by GPT-5.

Acknowledgments: I thank Alexander Gietelink Oldenziel and Matthias Dellago for engaging discussions about the coding theorem and the padding argument. 

Formulating the Coding Theorem

Let {0,1} be our alphabet. Denote by {0,1}^* the set of binary strings of arbitrary finite length. A prefix-free Turing machine is a Turing machine that halts on a prefix-free set of programs p ∈ {0,1}^*. Let U be a universal prefix-free Turing machine, meaning that for any prefix-free Turing machine T, there exists i ∈ {0,1}^* such that for all q ∈ {0,1}^* we have T(q) = U(i′q),[4] where i′ is a chosen prefix-free encoding of i.

We construct P(x) = P_U(x), the universal a priori probability of x, as follows: First, by abuse of notation, consider the uniform distribution P on infinite binary sequences s ∈ {0,1}^∞, which are sampled by sampling each bit in sequence with probability 1/2. Now for a particular sampled sequence s, we observe whether it starts with a finite sequence p on which U halts, and if it does, then we write U′(s) = U(p), where U′ is an adapted version of U that operates on infinite sequences. There can only be one such p since U is prefix-free. Then, define the universal a priori probability of x ∈ {0,1}^* to be

P(x) := P_U(x) := P(U′^{−1}(x)) = P({s ∈ {0,1}^∞ ∣ U′(s) = x}),

i.e., the probability that U′ maps a randomly sampled sequence s ∈ {0,1}^∞ to x, i.e., that s starts with a program on which U halts with output x. As an aside, note that this is only a semiprobability mass function, since some infinite sequences s do not contain any prefix on which U halts, meaning that U′(s) is undefined. Thus, we have

∑_{x∈{0,1}^*} P(x) < 1.

This total probability is also called the halting probability; it is the probability that a randomly selected program p produces an output and halts. 

For a program p ∈ {0,1}^*, denote by l(p) its length. Then the distribution P also has an alternative and more common formulation:

Lemma 1 (Deadcode/padding argument). We have

P(x) = ∑_{p∈{0,1}^* : U(p)=x} 2^{−l(p)}.

Proof. We first claim

U′^{−1}(x) = {s ∈ {0,1}^∞ ∣ U′(s) = x} = ⋃_{p∈{0,1}^* : U(p)=x} {p} × {0,1}^∞.

The inclusion from left to right is clear: Every infinite s with U′(s) = x by definition starts with a finite program p such that U(p) = x. The other direction is the padding or deadcode argument: If a program p computes x (i.e., U(p) = x), then every extension of p to an infinite sequence s will compute x via U′: The extensions/paddings are dead code that is ignored by U′.

Thus, we obtain

P(x) = P(U′^{−1}(x)) = ∑_{p∈{0,1}^* : U(p)=x} P({p} × {0,1}^∞) = ∑_{p∈{0,1}^* : U(p)=x} 2^{−l(p)}.

In the last step, we used that the likelihood of sampling an infinite sequence that starts with p is exactly 2^{−l(p)}, since each specific bit is sampled with probability 1/2 under the uniform distribution. □

Now, let K(x) be the Kolmogorov complexity of a binary string x ∈ {0,1}^*, computed via the universal prefix-free Turing machine U, i.e., the length of the shortest program that computes x:

K(x) := min{ l(p) ∣ p ∈ {0,1}^*, U(p) = x }.

We can now formulate the coding theorem:

Theorem 2 (Coding Theorem). There is a constant C > 0 such that for all x ∈ {0,1}^*, we have

2^{−K(x)} ≤ P(x) ≤ C · 2^{−K(x)}.

In other words, up to at most a multiplicative constant, P(x) coincides with 2^{−K(x)}.

Proof (first inequality). Choose a shortest program p ∈ {0,1}^* with U(p) = x. Then we have l(p) = K(x), and we obtain 2^{−K(x)} = 2^{−l(p)} ≤ P(x) by Lemma 1. □
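
To see the bookkeeping concretely, here is a toy Python sanity check of Lemma 1 and the first inequality. It replaces the universal machine U by a hypothetical finite lookup table with a prefix-free domain, so that P(x) becomes a finite sum and K(x) a minimum over a finite set (for a genuine universal machine, P(x) is an infinite sum and K(x) is uncomputable):

```python
# Toy stand-in for U: a finite map from programs to outputs whose
# domain {0, 10, 110} is prefix-free. Purely illustrative.
TOY_MACHINE = {
    "0": "x",
    "10": "x",
    "110": "y",
}

def universal_probability(x):
    """P(x) = sum of 2^{-l(p)} over all programs p with U(p) = x (Lemma 1)."""
    return sum(2.0 ** -len(p) for p, out in TOY_MACHINE.items() if out == x)

def kolmogorov_complexity(x):
    """K(x) = length of a shortest program computing x."""
    return min(len(p) for p, out in TOY_MACHINE.items() if out == x)

for x in ["x", "y"]:
    P, K = universal_probability(x), kolmogorov_complexity(x)
    assert 2.0 ** -K <= P       # first inequality of the coding theorem
    print(x, P, 2.0 ** -K)      # x: P = 0.75 vs 0.5; y: P = 0.125 vs 0.125
```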

The rest of the post will be about proving the second inequality, which is much harder than the first. It roughly states that if x is likely under the universal probability, then its Kolmogorov complexity is small, i.e., x can be compressed. To prove this claim, we first reformulate it slightly. Up to a relabeling of C, and by taking the binary logarithm, the inequality is equivalent to

K(x) ≤ −log P(x) + C.

The rough outline is as follows: Our proof of this inequality constructs, for each string x, a short prefix-free program whose length is close to −log P(x), using an online version of Kraft's inequality: We simulate all programs in parallel (dovetailing) and maintain running lower bounds Q(x) of P(x). Whenever Q(x) first crosses a threshold 2^{−r}, we assign x a fresh codeword of length r + 1 via an algorithmic Kraft coding procedure that guarantees prefix-freeness without knowing future codelengths. That this is accomplished by an online algorithm ensures that the decoding procedure is computable by a fixed prefix-free Turing machine, adding only a constant C to the code lengths. Since r is eventually within a distance of 1 of −log P(x), the resulting shortest program length found by the procedure is at most −log P(x) + C + 2, establishing the desired bound with constant C + 2.

Algorithmic Kraft Coding

We need an algorithmic version of the Kraft inequality as a tool. If you are not interested in the details, feel free to only read the statement of Theorem 4 and then move on to the next section. 

For a binary string x ∈ {0,1}^*, we denote by 0.x the corresponding real number in base 2, with 0 ≤ 0.x < 1. Furthermore, D(x) := [0.x, 0.x + 2^{−l(x)}) denotes the corresponding half-open dyadic interval. We say that y is a prefix of x if x starts with the bits of y.

Lemma 3. Let x, y ∈ {0,1}^* be binary strings. Then D(x) ⊆ D(y) if and only if y is a prefix of x.

Proof. Assume y is a prefix of x. Then 

0.y ≤ 0.x < 0.x + 2^{−l(x)} ≤ 0.y + 2^{−l(y)}.

All inequalities except the last are trivial. The last inequality follows since adding 2^{−l(x)} to 0.x increments only the last bit of 0.x, which cannot get larger than incrementing an earlier bit — which is what happens by forming 0.y + 2^{−l(y)}, given that y is a prefix of x. Overall, this shows D(x) ⊆ D(y).

For the other direction, assume we have D(x) ⊆ D(y), i.e., the inequalities in the first equation of the proof all hold. If 0.y = 0.x, then the third inequality shows 2^{−l(x)} ≤ 2^{−l(y)}, which shows l(y) ≤ l(x), and 0.y = 0.x then implies x = y0^{l(x)−l(y)}. Thus, y is a prefix of x and we are done. Hence, we can assume 0.y < 0.x. For any natural number k, define x(k) and y(k) as the k-th bit in the binary expansions of 0.x and 0.y, respectively. Let k be the earliest index where x(k) > y(k) (which exists since 0.x > 0.y). We are done if we can show k ≥ l(y) + 1. And indeed, if this were not the case, then 0.x ≥ 0.y + 2^{−l(y)}, a contradiction to the assumed inequalities. Thus, k ≥ l(y) + 1, i.e., y is a prefix of x. □
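
Lemma 3 is easy to check exhaustively for short strings. Here is a small Python sketch that represents D(x) with exact rational endpoints and confirms that D(x) ⊆ D(y) holds exactly when y is a prefix of x:

```python
from fractions import Fraction
from itertools import product

def dyadic_interval(x):
    """D(x) = [0.x, 0.x + 2^{-l(x)}) as a pair of exact endpoints."""
    low = Fraction(int(x, 2), 2 ** len(x)) if x else Fraction(0)
    return low, low + Fraction(1, 2 ** len(x))

def contained(x, y):
    """Is D(x) a subset of D(y)?"""
    (a, b), (c, d) = dyadic_interval(x), dyadic_interval(y)
    return c <= a and b <= d

# Check Lemma 3 on all binary strings of length <= 6 (including the empty string):
strings = ["".join(bits) for n in range(7) for bits in product("01", repeat=n)]
for x in strings:
    for y in strings:
        assert contained(x, y) == x.startswith(y)
```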

Theorem 4 (Algorithmic Kraft Coding). Let l_1, l_2, … ∈ {1, 2, …} be a sequence of natural numbers. Assume they satisfy Kraft's inequality:

∑_{i=1}^∞ 2^{−l_i} ≤ 1.

Then there exist codewords C(k) ∈ {0,1}^* with the following two properties:

  1. C forms a prefix-free code, i.e., no C(k) is a prefix of any other C(l).
  2. The codelengths are given by l(C(k)) = l_k.

Furthermore, there is an algorithm which successively maps the lengths l_k to codewords C(k) without looking at future codelengths and without revising earlier codewords along the way.

Proof. Here, we present a simple proof that only achieves the slightly less efficient codelengths l(C(k)) = l_k + 1. In the appendix, we present a more complicated algorithm that achieves precisely the stated result.

The algorithm proceeds as follows: Upon receiving the sequence (l_1, …, l_k), form the number s_k = ∑_{i=1}^{k−1} 2^{−l_i}, and analogously s_{k+1}. Consider the interval I_k = [s_k, s_{k+1}). To construct the codeword C(k), go through all binary codewords x of length l_k + 1, lexicographically, until you find the very first one for which 0.x ≥ s_k. This is C(k).

Now we prove the first property (the second one holds by construction, though with lengths l(C(k)) = l_k + 1). Due to the minimality of C(k), we have 0.C(k) < s_k + 2^{−(l_k+1)}. Consequently, we have

0.C(k) + 2^{−l(C(k))} = 0.C(k) + 2^{−(l_k+1)} < s_k + 2^{−l_k} = s_{k+1}.

Consequently, the dyadic interval satisfies D(C(k)) ⊆ I_k. Since all I_k are pairwise disjoint, all D(C(k)) are pairwise disjoint, too. By Lemma 3, no C(k) is a prefix of any other. That proves property 1. Finally, notice that C(k) was chosen algorithmically, without knowledge of future codelengths and without revising any earlier codewords, finishing the proof. □
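
The procedure from this proof is simple enough to implement directly. Below is a Python sketch that assigns codewords online: next_codeword(l) returns the first string of length l + 1 (in lexicographic order) with 0.x ≥ s_k, where s_k is tracked exactly as a rational number. The class name and interface are my own choices:

```python
import math
from fractions import Fraction

class OnlineKraftCoder:
    """Online coding as in the simple proof of Theorem 4: the k-th requested
    length l_k receives a codeword of length l_k + 1, and earlier codewords
    are never revised."""

    def __init__(self):
        self.s = Fraction(0)  # s_k = sum of 2^{-l_i} over lengths seen so far

    def next_codeword(self, l):
        # The first codeword x of length l + 1 with 0.x >= s_k corresponds to
        # the smallest multiple of 2^{-(l+1)} that is >= s_k.
        numerator = math.ceil(self.s * 2 ** (l + 1))
        self.s += Fraction(1, 2 ** l)       # advance s_k to s_{k+1}
        return format(numerator, f"0{l + 1}b")

coder = OnlineKraftCoder()
codes = [coder.next_codeword(l) for l in (1, 2, 3, 3)]  # lengths satisfy Kraft
print(codes)  # ['00', '100', '1100', '1110'] -- lengths l_k + 1
assert all(not a.startswith(b) for a in codes for b in codes if a != b)
```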

Remark: The statement of the theorem is stronger than its typical formulation since I require there to be an algorithm that selects one codeword at a time. This is what makes my proof of the efficient codewords in the appendix so complicated. Other proofs I found in the literature assume that the codelengths can be ordered in a non-decreasing way, which isn't possible when receiving one codelength at a time. 

The converse of this theorem is also true, i.e., the codelengths of any prefix-free code satisfy Kraft's inequality; see the Wikipedia page.

A Proof of the Coding Theorem

We now prove the second inequality of Theorem 2. Recall that we want to prove

K(x) ≤ −log P(x) + C.

We now explain how to algorithmically code binary strings x with codes of length roughly −log P(x), which achieves this bound. Crucially, we cannot compute −log P(x) exactly and can only approximate it from below, so the target codelength for x keeps moving; consequently, we have to code each x ∈ {0,1}^* in several ways, since we can never be sure whether we have already found the correct codelength.

We proceed as follows: For all x ∈ {0,1}^*, initialize Q(x) := 0 as an approximation to P(x), and initialize an empty sequence () of codelengths. We approximate P(x) from below by dovetailing: Run all inputs p in parallel through the universal Turing machine U, one algorithmic step at a time (for example, in stage t, run each of the first t programs for one further step). Assume our list of codelengths already contains (l_1, …, l_{k−1}) at some algorithmic step. Whenever any program halts and computes U(p) = x for some binary string x, update the estimate Q(x):

Q(x) ← Q(x) + 2^{−l(p)}.

When this happens, check whether Q(x) just crossed a threshold 2^{−r} for some natural number r for the first time. If it did, then for the smallest such r (in case it crossed several at once), append l_k := r + 1 to the sequence (l_1, …, l_{k−1}) of codelengths to obtain (l_1, …, l_k), and assign to x a codeword of length l_k via the algorithm from Theorem 4.

For this to be well-defined, we need to check that the codelengths l_1, l_2, … together satisfy Kraft's inequality. Notice that each l_i is of the form r + 1 for an r with 2^{−r} ≤ Q(x) ≤ P(x) for some binary string x, and that no r appears twice for a given x, since we only update the list when a threshold is crossed for the first time. For a given x, let r_x be the minimal r that eventually appears in the algorithm and satisfies 2^{−r} ≤ Q(x). Thus, we have

∑_{i=1}^∞ 2^{−l_i} ≤ ∑_{x∈{0,1}^*} ∑_{k=1}^∞ 2^{−(r_x+k)} = ∑_{x∈{0,1}^*} 2^{−r_x} ∑_{k=1}^∞ 2^{−k} = ∑_{x∈{0,1}^*} 2^{−r_x} ≤ ∑_{x∈{0,1}^*} P(x) ≤ 1.

Thus, algorithmic Kraft coding can indeed be applied.
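
To make the procedure concrete, here is a heavily simplified Python sketch. Instead of actually dovetailing a universal machine, it uses a hypothetical finite table that records when each toy program halts and with which output; the online coder from the simple proof of Theorem 4 is inlined, so codewords come out one bit longer than l_k = r + 1, which only changes the constant:

```python
import math
from fractions import Fraction

# Hypothetical toy programs: program -> (stage at which it halts, output).
# A real implementation would instead step U on all programs in parallel.
TOY_PROGRAMS = {"0": (3, "x"), "10": (1, "x"), "110": (5, "y")}

s = Fraction(0)  # running Kraft sum of the inlined online coder

def next_codeword(l):
    """First codeword of length l + 1 with 0.x >= s (simple proof of Theorem 4)."""
    global s
    code = format(math.ceil(s * 2 ** (l + 1)), f"0{l + 1}b")
    s += Fraction(1, 2 ** l)
    return code

Q = {}          # running lower bounds Q(x) <= P(x)
assigned = {}   # per x: thresholds r that already received a codeword
codebook = {}   # x -> list of codewords; later entries are shorter

for stage in range(1, 11):                      # dovetailing stages
    for p, (halts_at, x) in TOY_PROGRAMS.items():
        if halts_at != stage:
            continue
        Q[x] = Q.get(x, Fraction(0)) + Fraction(1, 2 ** len(p))
        r = 0                                   # smallest r with 2^{-r} <= Q(x)
        while Fraction(1, 2 ** r) > Q[x]:
            r += 1
        if r not in assigned.setdefault(x, set()):
            assigned[x].add(r)
            codebook.setdefault(x, []).append(next_codeword(r + 1))

print(codebook)  # {'x': ['0000', '001'], 'y': ['01100']}
```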

Now, if you have a codeword p that arose from this procedure, you can decode p to x algorithmically as follows: Just run the same procedure again and observe which binary string x will be coded as p. Then, halt and output x. Since this procedure halts only on outputs of the above encoding, and these outputs form a prefix-free set by Theorem 4, the decoding procedure described here halts on a prefix-free set, and so it is given by a prefix-free Turing machine (where we make use of the Church-Turing thesis). Let T be this Turing machine. Thus, there exists a Turing machine index i ∈ {0,1}^* such that U(i′q) = T(q) for all q ∈ {0,1}^*. Consequently, we have U(i′p) = T(p) = x, and thus

K(x) ≤ l(i′p) = l(p) + l(i′) = l(p) + C,

where C = l(i′) is a constant independent of x and p.

Now, note that for a given x, the encoding procedure eventually reaches the minimal r_x, which then satisfies 2^{−r_x} ≤ P(x) ≤ 2^{−(r_x−1)}: otherwise, Q(x) would eventually cross the threshold 2^{−(r_x−1)}, contradicting the minimality of r_x. Thus, the shortest codeword p that the procedure assigns to x satisfies

l(p) = r_x + 1 = (r_x − 1) + 2 ≤ −log P(x) + 2.

Combining this with the previous inequality, we obtain

K(x) ≤ −log P(x) + C + 2,

finishing the proof of Theorem 2. □

Appendix: Proof of Efficient Algorithmic Kraft Coding

Here, we present a more complicated algorithm that achieves the efficient codelengths l(C(k)) = l_k for Theorem 4. Start with the set R_0 = {0, 10, 110, 1110, …} ⊆ {0,1}^* of codewords.[5] Intuitively, we think of them as a "resource" from which we choose the codewords of length l_k. Whenever this resource does not have a codeword of the correct size available, we fractally "break up" the longest available shorter string into longer strings in a prefix-free way, assign one of them, and add the new remaining "pieces" back into the resource in a way that does not create any "waste".

We now explain the details of how to do this. Initialize the set of "found codewords" F_0 as the empty set, F_0 = ∅. Note that R_k and F_k have the following properties for k = 0:

  1. R_k is prefix-free.
  2. F_k = {C(1), …, C(k)} ⊆ R_k, and the codelengths are given by l(C(i)) = l_i for i = 1, …, k.
  3. No codelength is represented more than once in R_k ∖ F_k.
  4. ∑_{r∈R_k} 2^{−l(r)} = 1.

In the course of this proof, we will progressively replace elements of R_k with extensions and build up the desired codewords F_k algorithmically, while preserving all four properties. The fact that R_k is prefix-free (property 1) and that F_k ⊆ R_k (property 2) then shows that F_k is prefix-free as well, which implies that the code C is prefix-free, establishing the result.

Let k = 1. As C(1), choose the unique r ∈ R_0 with l(r) = l_1. This exists since R_0 contains exactly one codeword of each codelength. Set R_1 := R_0 and F_1 := {C(1)}. Note that all four properties still hold for k = 1.

Now, assume by induction that R_{k−1} and F_{k−1} = {C(1), …, C(k−1)} are already constructed and that all four properties hold. If there is an r ∈ R_{k−1} ∖ F_{k−1} with l(r) = l_k, then simply choose C(k) = r, F_k = F_{k−1} ∪ {C(k)}, and R_k = R_{k−1}. The choice of r is unique by property 3. Then, trivially, all four properties are preserved.

Finally, consider the case that there is no r ∈ R_{k−1} ∖ F_{k−1} with l(r) = l_k. Define r ∈ R_{k−1} ∖ F_{k−1} as the longest element with l(r) < l_k. It is unique by property 3. It exists since otherwise R_{k−1} ∖ F_{k−1} does not represent any of the codelengths 1, 2, …, l_k, which by induction and properties 2, 4, and 3 (in that order) would imply

∑_{i=1}^{k−1} 2^{−l_i} = ∑_{r∈F_{k−1}} 2^{−l(r)} = 1 − ∑_{r∈R_{k−1}∖F_{k−1}} 2^{−l(r)} ≥ 1 − ∑_{i=l_k+1}^∞ 2^{−i} = 1 − 2^{−l_k}.

This would result in

∑_{i=1}^∞ 2^{−l_i} > ∑_{i=1}^{k} 2^{−l_i} = ∑_{i=1}^{k−1} 2^{−l_i} + 2^{−l_k} ≥ 1 − 2^{−l_k} + 2^{−l_k} = 1,

in contradiction to the assumed Kraft's inequality. Thus, r ∈ R_{k−1} ∖ F_{k−1} with l(r) < l_k (and maximally long with these properties) exists.

Now, set n := l_k − l(r) > 0. Define

R_k := (R_{k−1} ∖ {r}) ∪ {r0, r10, …, r1^{n−2}0, r1^{n−1}0, r1^n}.   (∗)

Define C(k) := r1^n and F_k := F_{k−1} ∪ {C(k)}.

We now show that all four properties remain true with these constructions. The newly added elements in R_k are clearly not prefixes of each other. Since they are all extensions of r ∈ R_{k−1} and R_{k−1} was prefix-free, the prefix-free property then follows for all of R_k, proving property 1. For property 2, note that F_{k−1} ⊆ R_{k−1}, and since r ∉ F_{k−1} by definition, we have F_{k−1} ⊆ R_k, too. C(k) ∈ R_k by construction, and so F_k ⊆ R_k. We have l(C(k)) = l(r) + l(1^n) = l(r) + n = l_k, proving property 2.

For property 3, note that

R_k ∖ F_k = ((R_{k−1} ∖ F_{k−1}) ∖ {r}) ∪ {r0, r10, …, r1^{n−2}0, r1^{n−1}0}.

Thus, codelength l(r) is removed from R_{k−1} ∖ F_{k−1}, and codelengths l(r) + 1, …, l(r) + n = l_k are added. By the maximality of r, and since the case assumption rules out any element of length l_k, none of the newly added codelengths were present in R_{k−1} ∖ F_{k−1}, and so no codelength is represented more than once in R_k ∖ F_k, proving property 3.

Property 4 follows from Equation (∗) and the induction assumption for k − 1:

∑_{r′∈R_k} 2^{−l(r′)} = ∑_{r′∈R_{k−1}} 2^{−l(r′)} − 2^{−l(r)} + 2^{−l(r)−n} + ∑_{i=1}^{n} 2^{−l(r)−i} = 1 + 2^{−l(r)} · (−1 + 2^{−n} + ∑_{i=1}^{n} 2^{−i}) = 1.

As we explained before, this concludes the proof. □
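
For completeness, here is a Python sketch of this appendix algorithm. It keeps a truncated copy of the resource R_0 = {0, 10, 110, …}, serves each requested length from the resource when possible, and otherwise splits the longest available shorter element as in Equation (∗). The truncation of R_0 and the example lengths are my own choices; as for the infinite sequences in Theorem 4, the running Kraft sums must leave room for future codelengths:

```python
def efficient_kraft_coder(lengths):
    """Assign prefix-free codewords with l(C(k)) exactly l_k, online,
    following the appendix construction. `lengths` must satisfy Kraft's
    inequality with strict running sums, as guaranteed for the infinite
    sequences in Theorem 4."""
    # Finite chunk of R_0 = {0, 10, 110, ...}; elements longer than
    # max(lengths) are never touched by the algorithm.
    R = {"1" * i + "0" for i in range(max(lengths))}
    F = []
    for l in lengths:
        free = R - set(F)
        exact = [r for r in free if len(r) == l]
        if exact:
            F.append(exact[0])      # unique by property 3
            continue
        # Split the longest free element r with l(r) < l_k, cf. Equation (*):
        r = max((s for s in free if len(s) < l), key=len)
        n = l - len(r)
        R = (R - {r}) | {r + "1" * i + "0" for i in range(n)} | {r + "1" * n}
        F.append(r + "1" * n)       # C(k) = r 1^n
    return F

codes = efficient_kraft_coder([2, 2, 3, 3])   # 1/4 + 1/4 + 1/8 + 1/8 < 1
print(codes)  # ['10', '01', '110', '001']
assert [len(c) for c in codes] == [2, 2, 3, 3]
assert all(not a.startswith(b) for a in codes for b in codes if a != b)
```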

  1. See Theorem 4.3.3 in Li and Vitányi's "An Introduction to Kolmogorov Complexity and Its Applications". This was first proved by L.A. Levin in 1974.

  2. Importantly, this result is only correct up to a logarithmic error for plain (also called descriptional) Kolmogorov complexity. Thus, the assumption that we work with prefix-free Turing machines is crucial. Thanks to Daniel Filan for pointing that out.

  3. The term "padding" is also used in this context in the book "An Introduction to Universal Artificial Intelligence" by Hutter, Quarel, and Catt.

  4. In particular, T halts on q whenever U halts on i′q.

  5. My guess is that one could also start with the resource R_0 = {ϵ}, consisting of only the empty binary string, which would possibly be somewhat cleaner than the choice I made.
