Superexponential Conceptspace, and Simple Words

Eliezer Yudkowsky

Thingspace, you might think, is a rather huge space. Much larger than reality, for where reality only contains things that actually exist, Thingspace contains everything that could exist.

Actually, the way I "defined" Thingspace to have dimensions for every possible attribute—including correlated attributes like density and volume and mass—Thingspace may be too poorly defined to have anything you could call a size. But it's important to be able to visualize Thingspace anyway. Surely, no one can really understand a flock of sparrows if all they see is a cloud of flapping cawing things, rather than a cluster of points in Thingspace.

But as vast as Thingspace may be, it doesn't hold a candle to the size of Conceptspace.

"Concept", in machine learning, means a rule that includes or excludes examples. If you see the data 2:+, 3:-, 14:+, 23:-, 8:+, 9:- then you might guess that the concept was "even numbers". There is a rather large literature (as one might expect) on how to learn concepts from data... given random examples, given chosen examples... given possible errors in classification... and most importantly, given different spaces of possible rules.

Suppose, for example, that we want to learn the concept "good days on which to play tennis". The possible attributes of Days are:

Sky: {Sunny, Cloudy, Rainy} AirTemp: {Warm, Cold} Humidity: {Normal, High} Wind: {Strong, Weak}

We're then presented with the following data, where + indicates a positive example of the concept, and - indicates a negative classification:

+ Sky: Sunny; AirTemp: Warm; Humidity: High; Wind: Strong. - Sky: Rainy; AirTemp: Cold; Humidity: High; Wind: Strong. + Sky: Sunny; AirTemp: Warm; Humidity: High; Wind: Weak.

What should an algorithm infer from this?

A machine learner might represent one concept that fits this data as follows:

Sky: ?; AirTemp: Warm; Humidity: High; Wind: ?

In this format, to determine whether this concept accepts or rejects an example, we compare element-by-element: ? accepts anything, but a specific value accepts only that specific value.

So the concept above will accept only Days with AirTemp=Warm and Humidity=High, but the Sky and the Wind can take on any value. This fits both the negative and the positive classifications in the data so far—though it isn't the only concept that does so.

We can also simplify the above concept representation to {?, Warm, High, ?}.

Without going into details, the classic algorithm would be:

Maintain the set of the most general hypotheses that fit the data—those that positively classify as many examples as possible, while still fitting the facts.
Maintain another set of the most specific hypotheses that fit the data—those that negatively classify as many examples as possible, while still fitting the facts.
Each time we see a new negative example, we strengthen all the most general hypotheses as little as possible, so that the new set is again as general as possible while fitting the facts.
Each time we see a new positive example, we relax all the most specific hypotheses as little as possible, so that the new set is again as specific as possible while fitting the facts.
We continue until we have only a single hypothesis left. This will be the answer if the target concept was in our hypothesis space at all.

In the case above, the set of most general hypotheses would be {?, Warm, ?, ?} and {Sunny, ?, ?, ?}, while the set of most specific hypotheses is the single member {Sunny, Warm, High, ?}.

Any other concept you can find that fits the data will be strictly more specific than one of the most general hypotheses, and strictly more general than the most specific hypothesis.

(For more on this, I recommend Tom Mitchell's Machine Learning, from which this example was adapted.)

Now you may notice that the format above cannot represent all possible concepts. E.g. "Play tennis when the sky is sunny or the air is warm". That fits the data, but in the concept representation defined above, there's no quadruplet of values that describes the rule.

Clearly our machine learner is not very general. Why not allow it to represent all possible concepts, so that it can learn with the greatest possible flexibility?

Days are composed of these four variables, one variable with 3 values and three variables with 2 values. So there are 3*2*2*2 = 24 possible Days that we could encounter.

The format given for representing Concepts allows us to require any of these values for a variable, or leave the variable open. So there are 4*3*3*3 = 108 concepts in that representation. For the most-general/most-specific algorithm to work, we need to start with the most specific hypothesis "no example is ever positively classified". If we add that, it makes a total of 109 concepts.

Is it suspicious that there are more possible concepts than possible Days? Surely not: After all, a concept can be viewed as a collection of Days. A concept can be viewed as the set of days that it classifies positively, or isomorphically, the set of days that it classifies negatively.

So the space of all possible concepts that classify Days is the set of all possible sets of Days, whose size is 2²⁴ = 16,777,216.

This complete space includes all the concepts we have discussed so far. But it also includes concepts like "Positively classify only the examples {Sunny, Warm, High, Strong} and {Sunny, Warm, High, Weak} and reject everything else" or "Negatively classify only the example {Rainy, Cold, High, Strong} and accept everything else." It includes concepts with no compact representation, just a flat list of what is and isn't allowed.

That's the problem with trying to build a "fully general" inductive learner: They can't learn concepts until they've seen every possible example in the instance space.

If we add on more attributes to Days—like the Water temperature, or the Forecast for tomorrow—then the number of possible days will grow exponentially in the number of attributes. But this isn't a problem with our restricted concept space, because you can narrow down a large space using a logarithmic number of examples.

Let's say we add the Water: {Warm, Cold} attribute to days, which will make for 48 possible Days and 325 possible concepts. Let's say that each Day we see is, usually, classified positive by around half of the currently-plausible concepts, and classified negative by the other half. Then when we learn the actual classification of the example, it will cut the space of compatible concepts in half. So it might only take 9 examples (2⁹ = 512) to narrow 325 possible concepts down to one.

Even if Days had forty binary attributes, it should still only take a manageable amount of data to narrow down the possible concepts to one. 64 examples, if each example is classified positive by half the remaining concepts. Assuming, of course, that the actual rule is one we can represent at all!

If you want to think of all the possibilities, well, good luck with that. The space of all possible concepts grows superexponentially in the number of attributes.

By the time you're talking about data with forty binary attributes, the number of possible examples is past a trillion—but the number of possible concepts is past two-to-the-trillionth-power. To narrow down that superexponential concept space, you'd have to see over a trillion examples before you could say what was In, and what was Out. You'd have to see every possible example, in fact.

That's with forty binary attributes, mind you. 40 bits, or 5 bytes, to be classified simply "Yes" or "No". 40 bits implies 2^40 possible examples, and 2^(2^40) possible concepts that classify those examples as positive or negative.

So, here in the real world, where objects take more than 5 bytes to describe and a trillion examples are not available and there is noise in the training data, we only even think about highly regular concepts. A human mind—or the whole observable universe—is not nearly large enough to consider all the other hypotheses.

From this perspective, learning doesn't just rely on inductive bias, it is nearly all inductive bias—when you compare the number of concepts ruled out a priori, to those ruled out by mere evidence.

But what has this (you inquire) to do with the proper use of words?

It's the whole reason that words have intensions as well as extensions.

In yesterday's post, I concluded:

The way to carve reality at its joints, is to draw boundaries around concentrations of unusually high probability density.

I deliberately left out a key qualification in that (slightly edited) statement, because I couldn't explain it until today. A better statement would be:

The way to carve reality at its joints, is to draw simple boundaries around concentrations of unusually high probability density in Thingspace.

Otherwise you would just gerrymander Thingspace. You would create really odd noncontiguous boundaries that collected the observed examples, examples that couldn't be described in any shorter message than your observations themselves, and say: "This is what I've seen before, and what I expect to see more of in the future."

In the real world, nothing above the level of molecules repeats itself exactly. Socrates is shaped a lot like all those other humans who were vulnerable to hemlock, but he isn't shaped exactly like them. So your guess that Socrates is a "human" relies on drawing simple boundaries around the human cluster in Thingspace. Rather than, "Things shaped exactly like [5-megabyte shape specification 1] and with [lots of other characteristics], or exactly like [5-megabyte shape specification 2] and [lots of other characteristics]", ..., are human."

If you don't draw simple boundaries around your experiences, you can't do inference with them. So you try to describe "art" with intensional definitions like "that which is intended to inspire any complex emotion for the sake of inspiring it", rather than just pointing at a long list of things that are, or aren't art.

In fact, the above statement about "how to carve reality at its joints" is a bit chicken-and-eggish: You can't assess the density of actual observations, until you've already done at least a little carving. And the probability distribution comes from drawing the boundaries, not the other way around—if you already had the probability distribution, you'd have everything necessary for inference, so why would you bother drawing boundaries?

And this suggests another—yes, yet another—reason to be suspicious of the claim that "you can define a word any way you like". When you consider the superexponential size of Conceptspace, it becomes clear that singling out one particular concept for consideration is an act of no small audacity—not just for us, but for any mind of bounded computing power.

Presenting us with the word "wiggin", defined as "a black-haired green-eyed person", without some reason for raising this particular concept to the level of our deliberate attention, is rather like a detective saying: "Well, I haven't the slightest shred of support one way or the other for who could've murdered those orphans... not even an intuition, mind you... but have we considered John Q. Wiffleheim of 1234 Norkle Rd as a suspect?"

I'll second the recommendation for Tom Mitchell's book (although it has been a long time since I have read it and I have moved away from the machine learning philosophy since).

Are you going to go on to mention that the search in a finite concept space can be seen as a search the space of regular languages and therefore a search in the space of FSM? And then move onto Turing Machines and the different concepts they can represent e.g. the set of strings that have exactly the same number of 0s and 1s in.

Hmm, lets fast forward to where I think the disagreement might lie in our philosophies.

Let us say I have painstakingly come up with a Turing machine that represents a concept, e.g. the even 0s and 1s I mentioned above. We shall call this the evenstring concept, I want to give this concept to a machine as it I have found it useful for something.

Now I could try and teach this to a machine by giving it a series of of positive and negative examples

0011 +, 0101 +, 000000111111 +, 1 -, 0001 - etc...

It would take infinite bits to fully determine this concept, agreed? You might get there early if you have a nice short way of describing evenstring in the AIs space of TM.

Instead if we had an agreed ordering of Turing machines I could communicate the bits of the Turing machine corresponding to evenstring first and ignore the evidence about evenstring entirely, instead we are looking at evidence for what is the TM of! That is I am no longer doing traditional induction. It would only take n bits of evidence to nail down the Turing Machine, where n is the length of the turing machine I am trying to communicate to the AI. I could communicate partial bits and it could try and figure out the rest, if I specified the length or a bound on it.

If you want to add evidence back into the picture, I could communicate the evenstring concept initially to the AI and make it increase the prior of the evenstring concept in some fashion. Then it collect evidence in the normal way, in case we were wrong when we communicated evenstring in some fashion, or it had a different coding for TMs.

However this is still not enough for me. The concepts that this could deal with would only be to do with the outside world, that is the communicated TM would be a mapping from external senses to thing space. I'm interested in concepts that map the space of turing machines to another space of turing machines ("'et' is the french word for 'and'"), and other weird and wonderful concepts.

I'll leave it at that for now.

Actually I'll skip to the end. In order to be able to represent all the possible concepts (e.g. concepts about concept formation, concepts about languages) I would like to be able to represent I need a general purpose, stored program computer. A lot like a PC, but slightly different.

Of course, once someone has defined the word "wiggin" in that way, this gives us a reason to attend to that concept, and additionally, presumably the definer would never have thought of it in the first place without SOME reason for it, even if an extremely weak one.

As I was reading about that learning algorithm, I was sure it looked familiar (I've never even heard of it before). Then suddenly it hit me - this is why I always win at Cluedo!

Say there's no correlation between the size of a rock and how much Vanadium it contains. I collect Vanadium you see. However, I can only carry rocks under a certain size back to my Vanadium-Extraction Facility. I bring back rocks I can carry, and test them for Vanadium. However, for ease of reference, I call all carrying-size, Vanadium-containing rocks 'Spargs'. This is useful since I can say 'I found 5 Spargs today', instead of having to say 'I found 5 smallish rocks containing Vanadium today'. I have observed no correlation between rock size and Vanadium content. Is 'Sparg' a lie?

edit sorry looks like this actually belongs a bit farther down the comment stream.

Ben, to put that point more generally, Eliezer seems to be neglecting to consider the fact that utility is sometimes a reason to associate several concepts, even apart from their probability of being associated with one another or with other things. An example from another commenter would be "I want a word for red flowers because I like red flowers"; this is entirely reasonable.

A nice point. But it there is something a bit weird.

We all agree that utility is subjective - you like red flowers, I like blue. As Bayesians, we also know that empirical information is somewhat subjective as well. We don't all have access to the same empirical observations.

So, it appears that we cannot really "carve nature at the (objective) joints. The best we can do is to carve where we (subjectively) estimate the joints to be. Now, that is fine if we are using words only to carry out private inferences. But we also frequently need to use words to communicate.

It appears that even Bayesian rationalists who understand cluster analysis must sometimes argue about definitions. There may be arguments regarding how to delimit the population. There may be arguments about how best to quantize the variables. It is not completely clear to me that it is always possible to distinguish communication problems from single-person inference problems.

I think that the usefulness of "Sparg" is that it has a stronger correlation with "usefulness to Ben" than is indicated by the seperitate correlations of "usefulness to Ben" with "carrying size", "contains Vanduim", and "rock".

this reminds me of The Dumpster.

http://www.flong.com/projects/dumpster/

also, i can't help but think of the concept of the Collective Unconscious, and the title of an old Philip K Dick story, Do Androids Dream of Electric Sheep?

Unknown - pretty much, yeah. I just wanted to use the words 'Vanadium' and 'Sparg'.

Obvious counter: utility is subjective, the joints in reality (by definition!) aren't. So categorisation based on 'stuff we want' or 'stuff we like' can go any way you want and doesn't have to fall along reality's joints. There is a marked distinction between these and the type of (objective?) categories that fit in with the world.

If I am searching for a black-haired, green-eyed person to be in my movie, I have a motive for using the word Wiggin. However, the existence of the word Wiggin doesn't reflect a natural group of Things in Thingspace, and hence doesn't have any bearing on my expectations. Just as the coining of a word meaning 'red flower' wouldn't be a reflection of any natural grouping in Thingspace - flowers can be lots of colours, and lots of things are red. Sound good?

Not every regularity of categorization is derived from physical properties local to the object itself. I watch Ben Jones returning to the facility, and notice that all the rocks he carries are vanadium-containing and weighing less than ten pounds, at a surprisingly high rate relative to the default propensity of vanadium-containing and less-than-ten-pound rocks. So I call these rocks "Spargs", and there's no reason Ben himself can't do the same.

Remember, it's legitimate to have a word for something with properties A, B where A, B do not occur in conjunction at greater than default probability, but which, when they do occur in conjunction, have properties C, D, and E at greater than default probability.

It's legitimate for us to have a word for any reason. Words have more than one purpose - they're not just for making predictions about likely properties.

In the second block indention, shouldn't either the first or the last entry for 'Sky' be 'Rainy'?

edit: or 'Cloudy'?

Erratum? I'm pretty sure your machine learning example set should have Cloudy as the sky attribute in the third line; otherwise your machine learner won't match any value for the sky attribute.

This complete space includes all the concepts we have discussed so far. But it also includes concepts like "Positively classify only the examples {Sunny, Warm, High, Strong} and {Sunny, Warm, High, Weak} and reject everything else"

Isn't that the same as {Sunny, Warm, High, ?} in the notation used above?

I deliberately left out a key qualification in that (slightly edited) statement, because I couldn't explain it until today.

I might be missing something crucial because I don't understand why this addition is necessary. Why do we have to specify "simple" boundaries on top of saying that we have to draw them around concentrations of unusually high probability density? Like, aren't probability densities in Thingspace already naturally shaped in such a way that if you draw a boundary around them, it's automatically simple? I don't see how you run the risk of drawing weird, noncontiguous boundaries if you just follow the probability densities.

The best way to draw a boundary around the high-probability things, without worrying about simplicity, is to just write down all your observations; they have probability 1 of having been observed, and everything else has probability 0.
This boundary is way too complicated; you've seen many things.