Mutual Information, and Density in Thingspace

We have a thousand words for sorrow http://rhhardin.home.mindspring.com/sorrow.txt

I don't know if that affects the theory.

(computer clustering a short distance down paths of a thesaurus)

Including: "twitter", "altruism", "trust", "start" and "curiosity" apparently?

You've forgotten one important caveat in the phrase "And the way to carve reality at its joints, is to draw your boundaries around concentrations of unusually high probability density in Thingspace." The important caveat is: 'boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief'. All the imperfections in categorisation in existing languages come from that limitation. Other problems in categorisation, like those of Antonio in 'The Merchant of Venice', or those of the founding fathers who wrote that it is 'self evident that all men were created equal' but at the same time were slave owners, do not come from language problems in categorisation (they would have acknowledged that Shylock or the slaves were human) but from different types of cognitive compromise. Apart from that, it's an intellectually satisfying approach, and you might, if you persevere, end up with a poor relation to an existing language. Why a poor relation? Because it would lack nuance, ambiguity, and redundancy, which are the roots of poetry. It would also lack words for the surprising but significant improbable phenomenon. Like genius, or albino. Then again, once you get around to saying you will have words for significant low hills of probability, the whole argument blows away. Bon courage.

[-]ksvanhorn15y40

"The important caveat is : 'boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief' ."

I would call the above an instance of the Mind Projection Fallacy, as you seem to be assuming a probability density that is a property of the physical world, and which we are trying to ascertain. But probabilities are properties of our minds (or ideal, perfectly rational minds), not of the exterior world, and a probability distribution is simply an entity to describe our state of information; it is "the best of our knowledge and belief".

[-]Frank_Hirsch18y00

tcpkac: The important caveat is : 'boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief' . All the imperfections in categorisation in existing languages come from that limitation.

This strikes me as a rather bold statement, but "to the best of our knowledge and belief" might be fuzzy enough to make it true. Some specific factors that distort our language (and consequently our thinking) might be:

Probability shifts in thingspace invalidating previously useful clusterings. Natural languages need time to adapt, and dictionary writers tend to be conservative.
Cognitive biases that distort our perception of thingspace. Very on topic here, I suppose. ^_^
Manipulation (intended and unintended). Humans treat articulations from other humans as evidence. That can go so far that authentic contrary evidence is explained away using confirmation bias.

Other problems in categorisation, [...] do not come from language problems in categorisation, [...] but from different types of cognitive compromise.

Well, lack of consistency in important matters seems to me to be a rather bad sign.

It would also lack words for the surprising but significant improbable phenomenon. Like genius, or albino. Then again, once you get around to saying you will have words for significant low hills of probability, the whole argument blows away.

I don't think so. Once the most significant hills have been named, we go on and name the next significant hills. We just choose longer names.

[-][anonymous]18y20

Since we are resting all our language construction and reasoning on thingspace there are a few things that need to be defined.

What is the distance metric for thingspace? How is thingspace extended?

[-]Peter_de_Blanc18y120

Even an optimal language would not be one designed to minimize average message length, because some messages are more urgent than others, even if relatively uncommon; e.g., messages about tigers.

[-]Richard_Kennaway18y00

tcpkac wrote: 'boundaries around where concentrations of unusually high probability density lie, to the best of our knowledge and belief'

The "probability density" is already the best of our knowledge and belief, unless Eliezer has converted to frequentism.

[-]Eliezer Yudkowsky18y00

Will, thingspace may not need a distance metric depending on how you draw your boundaries, which are not necessarily surfaces containing volumes of constant density. For example, a class in Naive Bayes / neural network of type 2 also slices up thingspace. More about this shortly. But if you're interested in the general topic, I believe that in the field of statistical learning, for algorithms that actually do depend on distance metrics, the standard cheap trick is to "sphere" the space by making the standard deviation equal 1 in all directions. An ad-hoc technique but apparently a useful one, though it has all the flaws you would expect.

tcpkac, see Kennaway's response.
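As a rough illustration of the "sphering" trick mentioned above, here is a minimal sketch that rescales each axis to zero mean and unit standard deviation. Note the assumption: this is per-axis standardization only, which matches full sphering (whitening) only when the axes are uncorrelated. The function name `standardize_axes` and the sample points are hypothetical.

```python
from statistics import mean, pstdev

def standardize_axes(points):
    """Rescale every coordinate axis to zero mean and unit standard deviation.

    Assumes no axis has zero variance (that would divide by zero).
    """
    columns = list(zip(*points))
    means = [mean(c) for c in columns]
    stdevs = [pstdev(c) for c in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(p, means, stdevs))
        for p in points
    ]

# Hypothetical data: the second axis uses much larger units than the first.
points = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0), (4.0, 400.0)]
scaled = standardize_axes(points)
```

After rescaling, no axis dominates a distance computation merely because of its units, which is the point of the trick; correlations between axes are left untouched.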

[-][anonymous]18y00

While it is true that you don't need a metric to draw a boundary, I personally need a metric to be able to envision high concentrations of probability density.

A concentration implies a region, which implies a metric space. While your sphering of the space normalises it somewhat and deals with part of the trouble, it still skips over the question of metric space. For example, is (2, 2, 2) closer to (1, 1, 1) than (4, 1, 1) is? If these were co-ordinates of positions in three dimensional space you would want to use the euclidean metric, i.e. d = ((x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2)^(1/2), or that might not be appropriate and you would have to use city block distances and put them equally far away (if they were average energy usage, weight and how many copies of the gene for green eyes it had).

See this page for more possible metrics http://www.cut-the-knot.org/do_you_know/far_near.shtml.
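The comparison in the example above can be checked directly. This is a small sketch with hypothetical helper names (`euclidean`, `city_block`); the three points are the ones from the comment.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

origin = (1, 1, 1)
p = (2, 2, 2)
q = (4, 1, 1)

# Euclidean: p is closer to origin than q is.
euclid_p, euclid_q = euclidean(origin, p), euclidean(origin, q)
# City block: p and q are equally far from origin.
block_p, block_q = city_block(origin, p), city_block(origin, q)
```

Under the euclidean metric (2, 2, 2) is at distance sqrt(3) and (4, 1, 1) at distance 3, while under city block distance both are at distance 3, so which point counts as "closer" depends on the metric chosen.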

[-]Gordon_Worley18y00

I believe you made a slight typo, Eli.

You said: "Since there's an "unusually high" probability for P(Z1Y2) - defined as a probability higher than the marginal probabilities would indicate by default - it follows that observing Z1 is evidence which increases the probability of Y2. And by a symmetrical argument, observing Y2 must favor Z1."

But I think what you meant was "Since there's an "unusually high" probability for P(Z1Y2) - defined as a probability higher than the marginal probabilities would indicate by default - it follows that observing Y2 is evidence which increases the probability of Z1. And by a symmetrical argument, observing Z1 must favor Y2."

Nothing you said was untrue, but the implication of what you wrote doesn't match up with the example you actually gave just above that text.
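The symmetry of the evidence can be checked numerically. The sketch below assumes the distribution under discussion is the post's second joint table (the one with entries 12/64, 8/64, 1/64, 3/64 in the Z1 row); the helper name `marginal` is hypothetical.

```python
from fractions import Fraction as F

# Assumed joint distribution P(Z, Y), taken from the post's second table.
joint = {
    ("Z1", "Y1"): F(12, 64), ("Z1", "Y2"): F(8, 64),
    ("Z1", "Y3"): F(1, 64),  ("Z1", "Y4"): F(3, 64),
    ("Z2", "Y1"): F(20, 64), ("Z2", "Y2"): F(8, 64),
    ("Z2", "Y3"): F(7, 64),  ("Z2", "Y4"): F(5, 64),
}

def marginal(position, value):
    """Sum the joint over every outcome whose key matches value at position."""
    return sum(p for key, p in joint.items() if key[position] == value)

p_z1 = marginal(0, "Z1")                       # 3/8
p_y2 = marginal(1, "Y2")                       # 1/4
p_z1_given_y2 = joint[("Z1", "Y2")] / p_y2     # 1/2
p_y2_given_z1 = joint[("Z1", "Y2")] / p_z1     # 1/3
```

Because P(Z1Y2) = 8/64 exceeds the product of the marginals (6/64), observing Y2 raises the probability of Z1 from 3/8 to 1/2, and observing Z1 raises the probability of Y2 from 1/4 to 1/3.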

[-]Gordon_Worley18y00

Hopefully not taking away anyone's fun here, but to reconcile Raven(x)->Black(x) but not vice versa, what this statement wants to say, letting P(R) and P(B) be the probabilities of raven and black, respectively, is P(R|B)=0 and P(B|R)=1, which gives us that

P(R|B) = 0  =>  P(RB)/P(B) = 0  =>  P(RB) = 0

and

P(B|R) = 1  =>  P(BR)/P(R) = 1  =>  P(BR) = P(R)

But of course this leads to a contradiction (together they force P(R) = 0), so it can't really be true that Black(x)-/->Raven(x), can it? Sure it can, because what is really meant by "does not imply" (-/->) is not P(R|B) = 0 but P(R|B) < 1. But in logic we often forget this because anything with a probability less than 1 is assigned a truth value of false.

Logic has its value, since sometimes you want to prove something is true 100% of the time, but this is generally only possible in pure mathematics. If you try to do it elsewhere you'll get exceptions (e.g. albino ravens). So leave logic to mathematicians; you should use Bayesian inference.
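To make the resolution concrete, here is a toy population in which every raven is black but most black things are not ravens. The counts are invented purely for illustration, and the helper name `prob` is hypothetical.

```python
from fractions import Fraction

# Hypothetical counts of (kind, colour) pairs; no non-black ravens.
counts = {
    ("raven", "black"): 10,
    ("raven", "not_black"): 0,
    ("not_raven", "black"): 90,
    ("not_raven", "not_black"): 900,
}
total = sum(counts.values())

def prob(predicate):
    """Probability that a uniformly drawn item satisfies predicate."""
    return Fraction(sum(n for key, n in counts.items() if predicate(key)), total)

p_raven = prob(lambda k: k[0] == "raven")
p_black = prob(lambda k: k[1] == "black")
p_both = prob(lambda k: k == ("raven", "black"))

p_black_given_raven = p_both / p_raven   # 1: Raven(x) -> Black(x) holds
p_raven_given_black = p_both / p_black   # 1/10: neither 0 nor 1
```

Here P(B|R) = 1 while P(R|B) = 1/10, so the failure of Black(x)->Raven(x) shows up as P(R|B) < 1, not as P(R|B) = 0.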

[-]Ben_Jones18y00

I've no doubt got the wrong end of the stick here, but why P(R|B)=0? Surely the probability that a black thing is a raven is nonzero?

[-]Nick_Tarleton18y00

"Vice versa" would be the contrapositive, which is NonBlack(x)->NonRaven(x), which is true iff R(x)->B(x) is true, no?

[-]Eliezer Yudkowsky18y00

Gordon, I fixed the Z1/Y2 swap.

"Vice versa" seems to have been interpreted ambiguously so I substituted "doesn't mean you're allowed to reason Black(x)->Raven(x)" which was what I meant.

Gordon, the whole business about P(R|B) = 0 makes no sense to me, and I suspect that it makes no sense even in principle. "If we learn that something is black, we know it cannot possibly be a raven"?

[-]Gordon_Worley18y00

I agree that it makes no sense, but as I was writing the comment I figured I would take you down the wrong path of what someone might naively think and then correct it. I think that someone who was overly trained in logic and not in probability might reason that, since Raven(x)-->Black(x) being true leads to P(B|R) = 1, the reverse implication Black(x)-->Raven(x) being false leads to P(R|B) = 0. But based on the comments above, maybe only an ancient Greek philosopher would be inclined to make such a mistake.

[-]Ben_Jones18y00

Gordon,

I'd hope they weren't so hopelessly 'overtrained' that they wouldn't be able to step back from their P's and parentheses and ask themselves whether they really think that a black object cannot be a raven.

If it's a raven, it's black. If it ain't black, it ain't a raven.
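The two sentences above are an implication and its contrapositive. A brute-force truth table, sketched below with a hypothetical `implies` helper, confirms that they are equivalent and that the converse (Black implies Raven) is not.

```python
from itertools import product

def implies(a, b):
    """Material implication: false only when a is true and b is false."""
    return (not a) or b

# Every combination of (is_raven, is_black).
rows = list(product([False, True], repeat=2))

# Raven -> Black versus NonBlack -> NonRaven (contrapositive).
contrapositive_equivalent = all(
    implies(raven, black) == implies(not black, not raven)
    for raven, black in rows
)

# Raven -> Black versus Black -> Raven (converse).
converse_equivalent = all(
    implies(raven, black) == implies(black, raven)
    for raven, black in rows
)
```

The first check comes out true and the second false, which is the distinction the thread was circling around.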

[-]Caledonian218y-10

We'll ignore the existence of albino ravens for the sake of argument.

[-]Ben_Jones18y10

Have a look at the caption here

That's what happens to you when you insist on being the exception to the rule!

[-]Rolf_Nelson218y10

Green-eyed people are more likely than average to be black-haired (and vice versa), meaning that we can probabilistically infer green eyes from black hair or vice versa

There is nothing in the mind that is not first in the census.

[-]Ender14y10

Just so you know, there are two columns of Y subscript 3s in the first joint distribution.

[-][anonymous]12y10

This typo is still there.

Then if and only if the joint distribution of Y and Z is as follows, there is zero mutual information between Y and Z:

Z1Y1: 3/16 Z1Y2: 3/32 Z1Y3: 3/64 Z1Y3: 3/64

Z2Y1: 5/16 Z2Y2: 5/32 Z2Y3: 5/64 Z2Y3: 5/64

Fourth column has misnumbered subscripts.

[-]royf13y00

Having a word [...] is a more compact code precisely in those cases where we can infer some of those properties from the other properties. (With the exception perhaps of very primitive words, like "red" [...]).

Remember that mutual information is symmetric. If some things have the property of being red, then "red" has the property of being a property of those things. Saying "blood is red" is really saying "remember that visual experience that you get when you look at certain roses, apples, peppers, lipsticks and English buses and phone booths? The same happens with blood." If I give you the list above, can you find ("infer") more red things? Then "red" is a good word.

But do note that this is a dual sense to the one in which "human" is a good word. Most of the properties of humans are statistically necessary for being human: remove any one of them, and the thing is much less likely to be human. "Human" is a good word because these properties are positively correlated. On the other hand, most of the red things are statistically sufficient for being red: take any one of them, and the thing is much more likely to be red. "Red" is a good word because these things are negatively correlated - they are a bunch of distinct things with a shared aspect.

[-][anonymous]13y00

Erratum: In the first example of YZ joint distribution, last column should list Z1Y4 and Z2Y4 instead of Z1Y3 and Z2Y3.

[This comment is no longer endorsed by its author]

[-][anonymous]10y20

So, hold on, if you wrote this in 2008, why the hell did you keep writing this blog instead of publishing at least one of what were eventually numerous papers on information-theoretic clustering with mutual-information measurements? Some of those didn't even come out until 2012 or 2014 or so, so it's not like you wouldn't have had time to publish a solid revision to MI-clustering if you came up with a good algorithm.

[-]Capla8y00

This is a brilliant essay. One of the best in the sequences, I think.

[-]Khal7y10

I’m wondering if a combination is so rare as to be odd, is it worth naming? E.g. wigger, or wangster. Wouldn’t it be useful precisely because we don’t expect it?

[-]Ian Televan5y*30

Fascinating subject indeed!

I wonder how one would need to modify this principle to take into account risk-benefit analysis. If quickly identifying wiggins meant gaining a great benefit or avoiding great harm, you would still need a nice short word for them. This seems obvious; the only question is how much shorter the word would need to be.
Labels that are both short and phonetically consistent with a given language are in short supply, so we would predict that sometimes even unrelated things share labels, provided they occupy sufficiently different contexts that there is no risk of confusing them. This is what we see in the case of professional jargon, for example. I also wonder whether one could actually quantify such a prediction.
If labels that are both short and phonetically consistent with a given language are really in such short supply, why aren't they all already occupied? Why were you able to come up with a word like 'wiggin', which seems consistent with English phonetics but doesn't already mean something? This introduces the concept of phonetic redundancy in languages. It would actually be impractical to occupy all the shortest syllable combinations, because that would make correcting errors impossible or too effortful. People in radiocommunications recognized this phenomenon and devised a number of spelling alphabets, the most commonly known being the NATO phonetic alphabet.
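One possible way to put a number on the first question is to weight each word by probability times urgency and compute idealized code lengths from those weights. This is a sketch under strong assumptions: lengths are real-valued bits, phonetics are ignored, and the words, probabilities, and urgency factors below are invented for illustration. Minimizing the weighted sum of lengths subject to the Kraft inequality gives length = -log2(weight / total weight).

```python
from math import log2

# Hypothetical vocabulary: (probability of use, urgency cost per unit length).
words = {
    "tree":  (0.50, 1.0),
    "rock":  (0.49, 1.0),
    "tiger": (0.01, 50.0),
}

def ideal_lengths(weights):
    """Real-valued code lengths minimizing sum(weight * length) under Kraft."""
    total = sum(weights.values())
    return {w: -log2(x / total) for w, x in weights.items()}

# Ignoring urgency: weights are just the probabilities.
plain = ideal_lengths({w: p for w, (p, _) in words.items()})
# Accounting for urgency: weights are probability times urgency.
weighted = ideal_lengths({w: p * u for w, (p, u) in words.items()})

# How many bits shorter the urgent word becomes once urgency is counted.
tiger_saving = plain["tiger"] - weighted["tiger"]
```

In this toy model a rare but urgent word gets a shorter ideal length, at the price of slightly longer lengths for the common, non-urgent words.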

Z₁Y₁: 3/16	Z₁Y₂: 3/32	Z₁Y₃: 3/64	Z₁Y₄: 3/64
Z₂Y₁: 5/16	Z₂Y₂: 5/32	Z₂Y₃: 5/64	Z₂Y₄: 5/64

Z₁Y₁: 12/64	Z₁Y₂: 8/64	Z₁Y₃: 1/64	Z₁Y₄: 3/64
Z₂Y₁: 20/64	Z₂Y₂: 8/64	Z₂Y₃: 7/64	Z₂Y₄: 5/64
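The mutual information of the two joint distributions above can be computed directly from the definition. This is a minimal sketch; the function name `mutual_information` and the row/column layout (rows indexed by Z, columns by Y) are choices made here for illustration.

```python
from fractions import Fraction as F
from math import log2

def mutual_information(joint):
    """Mutual information in bits; joint[i][j] = P(Z = i, Y = j)."""
    pz = [sum(row) for row in joint]            # marginal over Y
    py = [sum(col) for col in zip(*joint)]      # marginal over Z
    return sum(
        float(p) * log2(float(p / (pz[i] * py[j])))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

# First table: every entry equals the product of its marginals.
independent = [
    [F(3, 16), F(3, 32), F(3, 64), F(3, 64)],
    [F(5, 16), F(5, 32), F(5, 64), F(5, 64)],
]
# Second table: same marginals, but the entries no longer factorize.
dependent = [
    [F(12, 64), F(8, 64), F(1, 64), F(3, 64)],
    [F(20, 64), F(8, 64), F(7, 64), F(5, 64)],
]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The first table yields zero mutual information, and the second yields a small positive value, even though both tables share the same marginal distributions for Y and Z.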
