Shannon's Surprising Discovery

[-]Lao Mein4mo70

I would have expected early information theory, at least the concept of the parity bit, to have been invented alongside the telegraph, or even the heliograph.

It feels like information theory was an idea behind its time. What was blocking its discovery?

[-]Brendan Long3y40

Thanks, this was really well written and I learned some things. Now I want to try writing my own adaptive Huffman coding algorithm.

[-]Antoine de Scorraille2y30

There is one catch: in principle, there could be multiple codes/descriptions which decode to the same message. The obvious thing to do is then to add up the implied probabilities of each description which produces the same message. That indeed works great. However, it turns out that just taking the minimum description length - i.e. the length of the shortest code/description which produces the message - is a good-enough approximation of that sum in one particularly powerful class of codes: universal Turing machines.

Is this about "K-complexity is silly; use cross-entropy instead"?

[-]johnswentworth2y20

Yes.

[-]darius3y30

Nit: 0.36 bits/letter seems way off. I suspect you only counted the contribution of the letter E from the above table (-p log2 p for E's frequency value is 0.355).

[-]johnswentworth3y30

Wow, I really failed to sanity-check that calculation. Fixed now, and thank you!

[-]evand3y20

I think you missed a follow-on edit:

"Let’s unpack what that 0.36 bits means,"

[-]johnswentworth3y20

Ah, yup, thank you!

I’m sweeping under the rug the fact that not all programs halt, which means that the implied “probabilities” don’t sum to 1. The obvious way to handle this is to sum up the implied probabilities of all the programs which do halt, then normalize by that sum. In information-theoretic terms, just directly using a Turing machine as a decoder is inefficient (because codes corresponding to non-halting programs are unused), but we can make it more efficient by just removing the unused codes. Of course the normalizer is uncomputable, but that’s par for the course when dealing with Kolmogorov complexity.
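
As a rough illustration of the normalization described in this footnote, here is a minimal sketch on a toy decoder. Everything in it is an assumption made for illustration: `TOY_MACHINE` is a hypothetical, hand-written prefix-free lookup table standing in for a Turing machine, and `None` marks a program that "never halts". For a real universal machine the halting set cannot be tabulated like this, which is exactly why the normalizer is uncomputable.

```python
from fractions import Fraction

# Hypothetical toy "machine": a prefix-free set of programs (bitstrings).
# Each program either halts with an output message, or never halts (None).
TOY_MACHINE = {
    "0": "A",
    "10": None,    # stands in for a non-halting program
    "110": "A",    # a second, longer description of the same message
    "1110": "B",
    "1111": None,  # another non-halting program
}

def implied_probability(program):
    """A program of length n bits has implied probability 2^-n."""
    return Fraction(1, 2 ** len(program))

def normalized_message_probabilities(machine):
    """Sum implied probabilities over halting programs, then normalize."""
    halting = {p: out for p, out in machine.items() if out is not None}
    # Normalizer: total implied probability of the programs that halt.
    z = sum(implied_probability(p) for p in halting)
    probs = {}
    for program, message in halting.items():
        share = implied_probability(program) / z
        probs[message] = probs.get(message, Fraction(0)) + share
    return z, probs

z, probs = normalized_message_probabilities(TOY_MACHINE)
print("normalizer:", z)
for message, p in sorted(probs.items()):
    print(message, p)
```

In this toy table the message "A" has two descriptions, and the shorter one contributes most of its total probability. That is the sense in which the shortest description can approximate the full sum.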

Letter	Frequency
E	0.1127819549
T	0.08458646617
A	0.07518796992
I	0.07518796992
N	0.07518796992
O	0.07518796992
S	0.07518796992
H	0.06015037594
R	0.05827067669
D	0.04135338346
L	0.03759398496
U	0.03195488722
C	0.02819548872
M	0.02819548872
F	0.0234962406
W	0.01879699248
Y	0.01879699248
G	0.01597744361
P	0.01597744361
B	0.01503759398
V	0.01127819549
K	0.007518796992
Q	0.00469924812
J	0.003759398496
X	0.003759398496
Z	0.001879699248
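
The per-letter entropy discussed in the comments can be sanity-checked directly from this table. The sketch below is only an illustration: the names `FREQ` and `entropy_bits` are hypothetical, and the frequencies are copied from the table as-is. It computes the full sum of -p log2 p over all letters, and separately the single term for E, which is the roughly 0.355 figure darius points out.

```python
import math

# Letter frequencies copied from the table above.
FREQ = {
    "E": 0.1127819549, "T": 0.08458646617, "A": 0.07518796992,
    "I": 0.07518796992, "N": 0.07518796992, "O": 0.07518796992,
    "S": 0.07518796992, "H": 0.06015037594, "R": 0.05827067669,
    "D": 0.04135338346, "L": 0.03759398496, "U": 0.03195488722,
    "C": 0.02819548872, "M": 0.02819548872, "F": 0.0234962406,
    "W": 0.01879699248, "Y": 0.01879699248, "G": 0.01597744361,
    "P": 0.01597744361, "B": 0.01503759398, "V": 0.01127819549,
    "K": 0.007518796992, "Q": 0.00469924812, "J": 0.003759398496,
    "X": 0.003759398496, "Z": 0.001879699248,
}

def entropy_bits(freq):
    """Shannon entropy in bits: the sum of -p * log2(p) over all symbols."""
    return -sum(p * math.log2(p) for p in freq.values() if p > 0)

# The contribution of the letter E alone (the figure mistaken for the total).
e_term = -FREQ["E"] * math.log2(FREQ["E"])

print(f"E term alone: {e_term:.3f} bits")
print(f"Full entropy: {entropy_bits(FREQ):.3f} bits/letter")
```

The full sum comes out at a little over 4 bits per letter, below the log2(26) ≈ 4.70 bits that a uniform distribution over 26 letters would give.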
