Entropy, and Short Codes

Eliezer Yudkowsky

Entropy, and Short Codes — LessWrong

A Human's Guide to Words

93 Entropy, and Short Codes

by Eliezer Yudkowsky

23rd Feb 2008

5 min read

93

Suppose you have a system X that's equally likely to be in any of 8 possible states:

{X₁, X₂, X₃, X₄, X₅, X₆, X₇, X₈.}

There's an extraordinarily ubiquitous quantity—in physics, mathematics, and even biology—called entropy; and the entropy of X is 3 bits. This means that, on average, we'll have to ask 3 yes-or-no questions to find out X's value. For example, someone could tell us X's value using this code:

X₁: 001 X₂: 010 X₃: 011 X₄: 100

X₅: 101 X₆: 110 X₇: 111 X₈: 000

So if I asked "Is the first symbol 1?" and heard "yes", then asked "Is the second symbol 1?" and heard "no", then asked "Is the third symbol 1?" and heard "no", I would know that X was in state 4.

Now suppose that the system Y has four possible states with the following probabilities:

Y₁: 1/2 (50%) Y₂: 1/4 (25%) Y₃: 1/8 (12.5%) Y₄: 1/8 (12.5%)

Then the entropy of Y would be 1.75 bits, meaning that we can find out its value by asking 1.75 yes-or-no questions.

What does it mean to talk about asking one and three-fourths of a question? Imagine that we designate the states of Y using the following code:

Y₁: 1 Y₂: 01 Y₃: 001 Y₄: 000

First you ask, "Is the first symbol 1?" If the answer is "yes", you're done: Y is in state 1. This happens half the time, so 50% of the time, it takes 1 yes-or-no question to find out Y's state.

Suppose that instead the answer is "No". Then you ask, "Is the second symbol 1?" If the answer is "yes", you're done: Y is in state 2. Y is in state 2 with probability 1/4, and each time Y is in state 2 we discover this fact using two yes-or-no questions, so 25% of the time it takes 2 questions to discover Y's state.

If the answer is "No" twice in a row, you ask "Is the third symbol 1?" If "yes", you're done and Y is in state 3; if "no", you're done and Y is in state 4. The 1/8 of the time that Y is in state 3, it takes three questions; and the 1/8 of the time that Y is in state 4, it takes three questions.

(1/2 * 1) + (1/4 * 2) + (1/8 * 3) + (1/8 * 3)
= 0.5 + 0.5 + 0.375 + 0.375
= 1.75.

The general formula for the entropy of a system S is the sum, over all S_i, of -p(S_i)*log₂(p(S_i)).

For example, the log (base 2) of 1/8 is -3. So -(1/8 * -3) = 0.375 is the contribution of state S₄ to the total entropy: 1/8 of the time, we have to ask 3 questions.

You can't always devise a perfect code for a system, but if you have to tell someone the state of arbitrarily many copies of S in a single message, you can get arbitrarily close to a perfect code. (Google "arithmetic coding" for a simple method.)

Now, you might ask: "Why not use the code 10 for Y₄, instead of 000? Wouldn't that let us transmit messages more quickly?"

But if you use the code 10 for Y₄, then when someone answers "Yes" to the question "Is the first symbol 1?", you won't know yet whether the system state is Y₁ (1) or Y₄ (10). In fact, if you change the code this way, the whole system falls apart—because if you hear "1001", you don't know if it means "Y₄, followed by Y₂" or "Y₁, followed by Y₃."

The moral is that short words are a conserved resource.

The key to creating a good code—a code that transmits messages as compactly as possible—is to reserve short words for things that you'll need to say frequently, and use longer words for things that you won't need to say as often.

When you take this art to its limit, the length of the message you need to describe something, corresponds exactly or almost exactly to its probability. This is the Minimum Description Length or Minimum Message Length formalization of Occam's Razor.

And so even the labels that we use for words are not quite arbitrary. The sounds that we attach to our concepts can be better or worse, wiser or more foolish. Even apart from considerations of common usage!

I say all this, because the idea that "You can X any way you like" is a huge obstacle to learning how to X wisely. "It's a free country; I have a right to my own opinion" obstructs the art of finding truth. "I can define a word any way I like" obstructs the art of carving reality at its joints. And even the sensible-sounding "The labels we attach to words are arbitrary" obstructs awareness of compactness. Prosody too, for that matter—Tolkien once observed what a beautiful sound the phrase "cellar door" makes; that is the kind of awareness it takes to use language like Tolkien.

The length of words also plays a nontrivial role in the cognitive science of language:

Consider the phrases "recliner", "chair", and "furniture". Recliner is a more specific category than chair; furniture is a more general category than chair. But the vast majority of chairs have a common use—you use the same sort of motor actions to sit down in them, and you sit down in them for the same sort of purpose (to take your weight off your feet while you eat, or read, or type, or rest). Recliners do not depart from this theme. "Furniture", on the other hand, includes things like beds and tables which have different uses, and call up different motor functions, from chairs.

In the terminology of cognitive psychology, "chair" is a basic-level category.

People have a tendency to talk, and presumably think, at the basic level of categorization—to draw the boundary around "chairs", rather than around the more specific category "recliner", or the more general category "furniture". People are more likely to say "You can sit in that chair" than "You can sit in that recliner" or "You can sit in that furniture".

And it is no coincidence that the word for "chair" contains fewer syllables than either "recliner" or "furniture". Basic-level categories, in general, tend to have short names; and nouns with short names tend to refer to basic-level categories. Not a perfect rule, of course, but a definite tendency. Frequent use goes along with short words; short words go along with frequent use.

Or as Douglas Hofstadter put it, there's a reason why the English language uses "the" to mean "the" and "antidisestablishmentarianism" to mean "antidisestablishmentarianism" instead of antidisestablishmentarianism other way around.

Information theoryEntropyPhilosophy of LanguageLogic & Mathematics

Frontpage

93

Where to Draw the Boundary?

53 comments91 karma

Mutual Information, and Density in Thingspace

28 comments73 karma

New Comment

27 comments, sorted by

oldest

Click to highlight new comments since: Today at 11:55 AM

[-]Roland218y00

erratum: So -(1/8 * 3) = 0.375 is the contribution

[-]Eliezer Yudkowsky18y20

Roland, I remember an erratum where I had that quantity equal to 0.475 instead of 0.375, but I also remember fixing it, and the web text seems to agree - are you perchance seeing 0.475 in an RSS feed?

[-]tcpkac18y-30

Firstly, saying "you can define a word any way you want" is not the same thing as "any way which is meaningful to you". Secondly, I don't believe the development on entropy has anything to do with the convenience of using short words for often used concepts. "chair" is a meaningful piece of jointed reality not because of its intrinsic physical properties but because of its handiness for humans. A dolphin would find "chair" as a significant piece of jointed reality absurd. Thirdly, there is an obvious distinction bewteen using language descriptively and using it predictively. I would agree with you that mistakes often arise when moving from the descriptive to the predictive incautiously. That doesn't, however, make the descriptive use of language invalid, or even unwise, 98% of the use of language is descriptive. (I have proof of that statistic, but it won't fit in this margin).

[-]Anonymous2518y70

"antidisestablishmentarianism way around" should be "antidisestablishmentarianism other way around".

[-]Eliezer Yudkowsky18y10

Fixed.

[-]Ron_Hardin18y10

It's called entropy because somebody told Shannon that the same mathematical quantity already existed in thermodynamics, when the question what to call it came up.

I don't know that there's any other operational connection.

Like there's no entropy gradient giving the arrow of time, or system-wide increase.

[-]Frank_Hirsch18y570

Okay, now let's code those factory objects! 1 bit for blue not red 1 bit for egg not cube 1 bit for furred not smooth 1 bit for flexible not hard 1 bit for opaque not translucent 1 bit for glows not dark 1 bit for vanadium not palladium

Nearly all objects we encounter code either 1111111 or 0000000. So we compress all objects into two categories and define: 1 bit for blegg (1111111) not rube (0000000). But, alas, the compression is not lossless, because there are objects which are neither perfect bleggs nor rubes: A 1111110 object will be innocently accused of containing vanadium, because it is guilty by association with the bleggs, subjected to unfair kin liability! Still, in an enviroment where our survival depends on how faithfully we can predict unobserved features of those objects we stand good chances:

Nature: "I have here an x1x1x1x object, what is at it's core?" We suspect a blegg and guess Vanadium - and with 98% probability we are right, and nature awards us a pizza and beer.

Now the evil supervillain, I-can-define-any-way-I-like-man (Icdawil-man, for short), comes by and says: "I will define my categories thus: 1 bit for regg (0101010) not blube (1010101)" While he will achieve the same compression ratio, he looses about 1/2 of the information in the process. He has failed to carve at the joint. So much the worse for Icdawil-man.

Nature: "I have here an x1x1x1x object, what is at it's core?" Icdawil-man suspects a regg, guesses Palladium, and with 98% probability starts coughing blood...

Next along comes the virtuous and humble I-refuse-to-compress-man:

Nature: "I have here an x1x1x1x object, what is at it's core?" Irtc-man refuses to speculate and is awarded a speck in his eye.

Next along comes the brainy I-have-all-probabilities-stored-here-because-I-can-man:

Nature: "I have here an x1x1x1x object, what is at it's core?" Ihapshbic-man also gets a pizza and beer, but will sooner be hungry again than we will. That's because of all the energy he needs for his humongous brain which comes in an extra handcart.

Any more contenders? =)

[-]atorm14y50

I love the rewards.

[-][anonymous]18y40

There is different-base-concepts-alien who views the world with radar, x rays and sonar

Dbc-alien says "I don't understand half of your coding (colour, translucency, glowingness) and the other things are high level concepts I don't tend to bother with. It is like trying to ask me to identify a cat by density, I can always tell between what has vanadium and palladium via the way they refract x rays".

There is also not-important-for-me ant who gives not a fig whether something is a blegg or a rube.

Nifm ant goes to eats remnants of the pizza and beer given to ihapshbic and us, and out can survive because it hasn't even bothered to try and build a big brain to remember anything.

[-]A._Denis18y00

Make attention to use the entropy . The entopy is based on the idea that the amount of information in binary string S is Log(S) , the number of bit to directly code the string . This is wrong , the correct information is the Kolmogorov complexity of S . Nowaday the scientific literature don't focus on this difference and very often use Log(S) instead of K(S) ( K is the Kolmogorov complexity function ) and justify this becouse K(X) is uncomputable and becouse for the major part of K(X) value it is approximable by Log(X) in a mathematical context . This wrong assumption is done becouse people think that every object , every binary string can happen . This is wrong , only few bit string with lenght 1000000 can happen becouse a system that produce 2^1000000 object , string can not exist. What this mean is that the object are always small ! The object stay in that value smaller than Log(X) in the function K(X) .

[-][anonymous]18y10

"This wrong assumption is done becouse people think that every object , every binary string can happen . This is wrong , only few bit string with lenght 1000000 can happen becouse a system that produce 2^1000000 object , string can not exist."

So these people are frauds?

[-]Roland218y20

erratum: So -(1/8 * 3) = 0.375 is the contribution

Eliezer, the problem is with the sign. You wrote: -(a) = a

[-]Eliezer Yudkowsky18y10

Durr... fixed.

[-]Tom_McCabe218y00

"Prosody too, for that matter - Tolkien once observed what a beautiful sound the phrase "cellar door" makes; that is the kind of awareness it takes to use language like Tolkien."

Out of curiosity, did you ever study Tolkien's languages.

[-]AshwinV12y00

I personally found that after reading that one line, my estimation of Tolkien just shot through the roof. It's also kind of inspiring, in the sense, you want to now start writing because it just somehow seems.......... cooler.

[-]A._Denis18y00

"So these people are frauds?"

To answer your question , I think the card can be one of the better random generator existing, so it is absolutely not a frauds ( and pheraps I will buy one ... thank you for the link ) . But there are many definition of random . The only one theoretical random definition I accept as true is that a bit string S is random only if this string has K(S) >=Len(S) . My strong deterministic opinion is that in the real world these random objects exist only if we take short object , also for quantum field ( like Wolfram think ). I read articles where people reasoning on the quantum field and the relation on exponential or polynomial world , I don't know what is the answer but I don't think that quantum filed open the door for an exponential world. This opinion come to me from many discrepance I find in the mathematical description and what happen in practical . For example for the K function ( Kolmogorov complexity ) there are proof say that for major part of value we have K(X)>=Log(X) and if you watch on the function you can say it is absolutely correct , not only but for very very few case we have K(X)<Log(X) and also from a statistical point of view all is coherent. The probability to compress X using the optimal K function with a dimension of N bits into a Y with a dimension of N/2 bits is (2^(N/2))/(2^N) . This probability is exponential low! So for increasing value X the probability to have a small K(X) is very low. What this mean for practical point of view? This mean that if I take a file from my pc and then I try to compress this file using a very very bad compressor ( becouse the K function is an idealized compressor with an infinite power ) it is crazy to hope to compress it . But I absolutely sure that I am able to compress it and with high probability of 50% ! . Why? What happen ? There are many explanation we can give to this phenomena but the follow is my opinion. When we get an object , when we receive an input this object is only a representation of the information of the object computed by a program , every program has limited power , we can define this power as a limit on the size of the bitstring that this program can do . I call this limit M and using this limit something change , for examples the number of available bit string of lenght N in a mathematical view are 2^N , now become min( 2^N , M/N ) . The compression probability change ... The universal distribution change ... But the interesting property of this vision is that it make pactical aspecative coherent with theretical function . In the standard mathematical/statistical view is more easy to compress small string and become more difficult to compress large string in absolutely opposition in what happen in the real world! When I have a small file I think will be difficult to compress it and when I have a big file I think that I will get a big compression ratio.

If you watch this function min( 2^N , M/N ) what happen is this! for small string we have exponential behaviour and this is coherent with mathematical classical view but after a limit M what happen change and the probability to compress for example increase! .

Another important characteristic is that big variation on the parameter M make small variation on the behaviour of the function , so is more important to assume the existence of this behaviour also if we don't know M also if it is impossible to compute M !.

This cause a discrepance in the entropy , in the assumption of Log as function of information measurement , etc ... becouse this theories suppose an exponential world ! a very big world! ( exponential functions are very big and we can not underrate them ! )

There are many consequence of this simple observation and many to be investigate .

I don't know if quantum field will open the exponential door but the world behaviour seem to me polynomial.

Denis.

[-]Ben_Jones18y-10

Frank, how about the way it actually happened; guy #5 bashes you over the head with a rube, and steals your beer, your pizza and your woman?

[-]anonymous2618y00

Suppose you have a system X that's equally likely to be in any of 8 possible states [...] on average, we'll have to ask 3 yes-or-no questions to find out X's value.

Eliezer: is there a formal connection between entropy and binary search? A binary search of that system would also terminate in three operations, right?

[-]michael0110017y20

sorry to butt in, but he's written so much good stuff, i have to imagine he's happy whenever anyone else can expound on a point.

now, the binary search algorithm has to have a way to order the things it is searching through in order to work, and in the article Eliezer was just using the numbers as symbols for what state a system could be in (in other words, we are mentally imputing the ordering of the symbols because of our familiar concept of numbers), but i know what you were getting at, and yes, there is an intimate connection between entropy and binary search.

one of the ways to think about how much "information" or "entropy" (the two words do not really mean the same thing, but they both get used from so many different angles that no one can keep them straight - see John von Neumann's infamous quotation about why Claude Shannon ought to name the ubiquitous formula "entropy") - anyway one of the ways to think about how much information is in something, is to ask how many yes/no questions you have to ask to fully specify it. you might even notice Eliezer going through that process in the text of the article, e.g. "and we learned this with a total of 4 questions." the number of yes/no questions is the number of bits of information/entropy. in other words a bit of information is what allows you to distinguish in a choice between 2 alternatives, or, if you have a whole lot of alternatives, a bit of information is what allows you to eliminate half of them from consideration.

when you perform a binary search, you are basically doing the same thing, asking a series of yes/no questions, but with a special case of question - remember to use the binary search algorithm your input has to be ordered. so you are asking a series of questions: "is it in this half or in that half?", the "halves" being decided by the middle value in the ordering, and of course each time you eliminate half of the choices from consideration.

so your intuition was correct, it is no coincidence that a binary search would also terminate in 3 steps.

michael redman champaign, il, usa michael01100@gmail.com www.locative.me/~michael

[-]Zareon16y20

Very interesting article! It seemed very surprising to me that the information entropy can be interpreted as the (minimum average) number of yes/no-questions needed to fully determine the state of a system. Especially since the value of the entropy depends on the units (i.e. what logarithm base you use). Is the use of base two related to the fact that the answers can be either 'yes' or 'no' (binary)?

Another question. Suppose you have a system X in 4 possible states {X1,X2,X3,X4} with probabilities {3/8,2/8,2/8,1/8} respectively. The information entropy is about 1.91 bits. But I couldnt find any series of yes/no questions to ask that would give me 1.91 questions on average. An average of 2 questions is the best I could do. That is the same as the situation with probabilities {1/4,1/4,1/4,1/4} even though we have more information than that in our present case. Am I not looking hard enough?

I`m really trying to get a grasp on the concept of information entropy so I hope someone can answer my questions.

Great blog in any case. I`m glad I found it!

[-]wnoise16y130

EDIT: Welcome to LessWrong. You may want introduce yourself in the welcome thread

Especially since the value of the entropy depends on the units (i.e. what logarithm base you use). Is the use of base two related to the fact that the answers can be either 'yes' or 'no' (binary)?

Precisely.

{3/8,2/8,2/8,1/8} ... 1.91 bits ... 2 questions. Am I not looking hard enough?

No, there really aren't. A simpler example is just taking two options, with unequal probabilities (take {1/8, 7/8} for concreteness). Again, you have to ask one question even though there is less than one bit (0.54 bits if I did the math right). However, if you have many copies of this system, you can ask questions about the entire subset that can (on average over all possible states of the entire system) describe the entire system with less than one question per system, and in the limit of an infinite number of subsystems, this approaches the entropy.

E.g, for two systems, labeling the first state as 0, and the second as 1, you have the following tree:

Are they both 1?
a. Yes: stop, state is (1,1) and 1 question asked with probability 49/64 = 49/64.
b. No: continue
Is the first one 1?
a: Yes: stop (1, 0), 2 questions asked with probability 7/64
b: no: continue
Is the second one in 1?
a: Yes: stop (0,1), 3 questions asked with probability 7/64
b: No: stop (0,0), 3 questions asked with probability 1/64

Total average questions = 49 + 2*7 + 3*8 = 87/64 < 2, but greater than 2*0.54.

[-]jooyous13y10

Is there a short, pronounceable word for the underscore symbol?

[-]Mass_Driver12y10

I'm confused. What makes "chair" the basic category? I mean, obviously more basic categories will have shorter words -- but who decided that "solid object taking up roughly a cubic meter designed to support the weight of a single sitting human" was a basic category?

[-]Nornagest12y30

It probably comes down to frequency of use, as Eliezer alludes in the next couple of sentences. Shorter words are easier to use and likely to be preferred, but independently of that, the need to refer to an object sized and designed for one person to comfortably sit on it will likely come up more often than the need for a word for "personal recuperation armature, padded interface surfaces, overstuffed, with position-activated leg supports". There's no Platonic ideal of chairness of which reclinerness is a subclass, but there are facts of the social and physical environment in which language evolves.

The category breakdown is arbitrary at some level, but the tendency to prefer more general to more specific categories is real, and so is the association with length. Japanese aoi covers more ground than English blue, but both languages have analogs of "sky blue" -- and they're both longer than the base word.

[-]Mass_Driver12y10

OK, but why is "chair" shorter than "furniture"? Why is "blue" shorter than "color"? Furniture and color don't strike me as words that are so abstract as to rarely see use in everyday conversation.

[-]Nornagest12y10

We're venturing into wild speculation territory here, but I suspect that there's a sort of sweet spot of specificity, between adding extraneous details and talking in terms so general that they're only useful for accounting headers or philosophy papers, and that the shortest nouns will fall into the center of it. "We need seventy pieces of furniture for the banquet" is a sentence I'd expect to come up less often than "we need sixty chairs and ten tables".

"Furniture" and "color" do show up in everyday conversation, but often in contexts like "what furniture needs repairs?" or "what color did you paint the kitchen?"

[-]lolbifrons8y30

The encoding scheme you're talking about is Huffman Coding, and the ambiguity you're trying to avoid explicitly occurs when one symbol is the prefix of another. The mechanism to build an optimal prefix(-free) code is called the Huffman Tree, and it uses a greedy algorithm that builds a full binary tree from the bottom up based on the frequencies of the symbols. Leaves are symbols, and the code for a symbol is the sequence of left or right branches you must traverse the reach that symbol.

To get more specific, you add all the symbols to a heap based on their frequency of appearance, lowest-at-the-front. You extract two symbols, foo and bar. These become children of a common parent node, the pseudosymbol baz. baz.freq = foo.freq + bar.freq. Add baz to the heap. Loop until the heap contains one node at the end of this process. That node is the root of the Huffman Tree.

Moderation Log

X₁: 001	X₂: 010	X₃: 011	X₄: 100
X₅: 101	X₆: 110	X₇: 111	X₈: 000