Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Eliezer has talked about thingspace - an abstract space in which objects lie, defined by their properties on some scales. In thingspace, there are some more-or-less natural categories, such as the "bird" category, that correspond to clusters in this thingspace. Drawing the boundary, or carving reality at the joints, can be thought of as using definitions that efficiently separate these clusters[1].

He said about the thingspace of birds:

The central clusters of robins and sparrows glowing brightly with highly typical birdness; satellite clusters of ostriches and penguins [further from the center] glowing more dimly with atypical birdness, and Abraham Lincoln a few megaparsecs away and glowing not at all.

That, however, is how things stand in the world as it is. What if, for some reason, we had a lot of power in the world and wanted to break these neat-ish clusters? How would we go about it?

Remaking the clusters

Suppose first we wanted to break some of those satellites away. What if we wanted penguins to be clearly distinct from birds? Penguins are land-waddlers and good swimmers, living in cold climates. So let's fill out this category, call it, maybe, Panguens.

Panguens all live in cold climates, waddle on land, and are good swimmers. Most are mammals, and give birth to live young. Some are more like the duck-billed platypus, and lay eggs (though they suckle their young). And all of them, including penguins, can reproduce with each other (how? genetic engineering magic). Thus there are a lot of intermediate species, including some that are almost-penguins except differing on a few details of their anatomy. All have fins of some sort, but only some have penguin-like wing-fins.

Panguens are clearly a cluster, quite a tight one as animals go (given they can reproduce with each other). And they are clearly not birds. And penguins clearly belong inside them, rather than with birds; at most, some pedantic classifier would label penguins as "bird-like panguens".
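The move can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than a measurement: the two-dimensional coordinates, the seal standing in for "some swimming mammal", the numbered panguens, and the choice of single-linkage clustering as the way of finding clusters.

```python
from itertools import combinations
from math import dist


def single_linkage(points, k):
    """Merge the two closest clusters repeatedly until only k remain."""
    parent = {name: name for name in points}

    def find(name):
        # Follow parent links up to the representative of name's cluster.
        while parent[name] != name:
            name = parent[name]
        return name

    # All pairs of things, closest pairs first.
    pairs = sorted(combinations(points, 2),
                   key=lambda p: dist(points[p[0]], points[p[1]]))
    remaining = len(points)
    for a, b in pairs:
        if remaining == k:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            remaining -= 1

    clusters = {}
    for name in points:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


def cluster_of(name, clusters):
    """Return the cluster (a set of names) that contains name."""
    return next(c for c in clusters if name in c)


# Made-up coordinates: (distance from a typical bird body plan, cold-water swimming).
actual_world = {
    "robin": (0.0, 0.0), "sparrow": (0.5, 0.0), "ostrich": (2.0, 0.5),
    "penguin": (3.0, 3.0), "seal": (8.0, 3.0),
}
# Fill the gap between penguins and a swimming mammal with intermediate species.
panguens = {f"panguen_{i}": (3.0 + i, 3.0) for i in range(1, 5)}
remade_world = {**actual_world, **panguens}

before = cluster_of("penguin", single_linkage(actual_world, k=2))
after = cluster_of("penguin", single_linkage(remade_world, k=2))
print(sorted(before))  # penguin grouped with the birds
print(sorted(after))   # penguin grouped with the panguens and the seal
```

Nothing about the penguin's own coordinates changes between the two runs; only the points around it do. Under these assumed numbers, that is enough to flip which cluster it lands in.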

So, by filling in extra possibilities in thingspace, we have moved penguins to a new category. Graphically, it looks like starting with:

and moving to:

What if we wanted to make Abraham Lincoln into a bird[2]? We can connect the categories of birds and humans by creating all sorts of bird-human hybrids. But this is not quite enough. "Bird-humans" is certainly a reasonable category in this world. But humans are very close to other mammals, especially great apes; Lincoln is clearly a bird-human, but also clearly a mammal. How can we excise him from this category?

Well, we might just go around killing all the apes and monkeys, making the bird-human category more natural. But we don't need to do that. We could instead introduce mammal-lizard hybrids: all mammal species (except humans) have a smooth transition to lizards, filled with intermediate species. Now the new natural definition of "mammal" includes "must have a hybrid version with lizards". All mammals fit this definition - except for humans, who have their own thing with birds. It is much more natural to divide the "bird-humans" cleanly from the "mammal-lizards", and note that Lincoln is a bird-human that shares some features with some mammal-lizards.

The more we want to separate humans from mammals, the more we increase the category "mammals" in directions away from humans, and increase the category "bird-humans" in directions away from other mammals. Graphically we start with:

and move to:

That image, however, underestimates the distance between humans and other mammals, since we have added an extra feature - "has a hybrid version with lizards" - that other mammals have and humans don't. So humans and mammals have moved further apart in thingspace, along this feature.
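A minimal sketch of that last point, assuming Euclidean distance and made-up coordinates (the two existing scales and the species values are hypothetical): appending a scale on which two things differ can only push them further apart.

```python
from math import dist

# Made-up coordinates on two existing scales (say, body size and hairiness).
human = (1.0, 0.2)
chimp = (0.9, 0.9)

# Append a third scale: 1.0 if the species has a hybrid version with lizards.
human_remade = human + (0.0,)
chimp_remade = chimp + (1.0,)

before = dist(human, chimp)
after = dist(human_remade, chimp_remade)
print(before < after)  # the extra feature can only add distance
```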

Why we should care: AIs, examples, and definitions

This is an amusing mental exercise, but what's the point? The point is that this is another variant of the standard failure mode of powerful AIs.

Let's go with that old chestnut, and assume that we have a superintelligence dedicated to "making humans happy". For that to work, we have to define "human" and "happy". We could give an intrinsic definition of these, by saying that humans are "featherless bipeds" (with broad flat nails) and happy humans are those that are "smiling, with higher serotonin, dopamine, oxytocin, and endorphins".

Think of these in terms of thingspace. These carve it up, separating the featherless bipeds from the other animals and things, and the happy ones from those less happy. In our world as it stands, these are pretty natural cuts: almost all humans are featherless bipeds, and almost all featherless bipeds (with broad flat nails) are humans. The happiness definition is similarly decent.

But all that falls apart as soon as the AI starts to optimise things. Consider tiny toy robots, each with a painted smile and some hormones around its CPU. These fit the definitions we've made, but are clearly not "happy humans". That's because the definitions are decent cuts in the space of actual things, but terrible cuts in the space of potential things. So the more powerful the AI becomes, the more it can enter potential space, and the more useless the definitions become.
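As a hedged illustration, here are the two intrinsic definitions written as predicates. The `Thing` fields, the single `hormone_level` number, and the threshold are hypothetical simplifications, and the toy robot is assumed to have glued-on nails as well as a painted smile.

```python
from dataclasses import dataclass


@dataclass
class Thing:
    name: str
    legs: int
    feathers: bool
    broad_flat_nails: bool
    smiling: bool
    hormone_level: float  # stand-in for serotonin, dopamine, oxytocin, endorphins


def is_human(thing):
    """Intrinsic definition: a featherless biped with broad flat nails."""
    return thing.legs == 2 and not thing.feathers and thing.broad_flat_nails


def is_happy(thing, threshold=1.0):
    """Intrinsic definition: smiling, with hormone levels above a threshold."""
    return thing.smiling and thing.hormone_level > threshold


# Two things from the actual world, and one from the space of potential things.
alice = Thing("Alice", 2, False, True, True, 1.5)
robin = Thing("robin", 2, True, False, False, 0.0)
toy_robot = Thing("toy robot", 2, False, True, True, 9.9)

for thing in (alice, robin, toy_robot):
    print(thing.name, is_human(thing) and is_happy(thing))
```

On the first two entries the predicates behave sensibly. The toy robot passes both, and with a higher hormone score than Alice, so an optimiser ranking by these definitions would prefer it.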

Relative definitions also suffer: if you define what a human is in terms of collections of cells, for instance, you incentivise the AI to change what cells correspond to.

The approach of reward extrapolation/model splintering is to rely more on extrinsic definitions: the category of humans is the set of humans, the category of happy humans is the set of happy humans. Then the AI is supposed to extend both of these categories in ways similar to how we would do so, dealing conservatively with the inevitable problems as the defining features break down.


  1. This is quite similar to unsupervised learning in ML. ↩︎

  2. This is among my favourite sentences I have ever written. ↩︎

6 comments

Yes - nice post - feels to me like another handle on the pointers problem.

Thanks for the link.

Talk of clusters and distributions brings to mind an inherently correlational/statistical definition of classes. However, one can also think of it in a more causal sense; penguins are birds because they descend from birds (and so automatically end up having the sorts of traits that birds have).

This causal definition would be invariant to the sorts of cluster-remaking that you suggest, because you can't change the past.

But then how do we make sense of the relationships of belonging that you describe?

Basically, by reversing the hierarchy: penguins are not panguens, instead panguens are penguins. Or more specifically, they are penguin-mammal hybrids, descending from both classes through genetic engineering. So the genetic engineering lets you violate some things that have been invariant (well, approximately, ignoring horizontal gene transfer, which AFAIK has played a relatively smallish role?) from evolution, such as species always descending from a single species, rather than merging multiple distinct species together.
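A rough sketch of what membership-by-descent might look like, assuming ancestry is recorded as a simple parent map (the species names and the `parents` table are hypothetical and heavily simplified):

```python
def descends_from(species, ancestor, parents):
    """True if ancestor is species itself or reachable via parent links."""
    stack, seen = [species], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return False


# Hypothetical, heavily simplified ancestry records.
parents = {
    "robin": ["bird"],
    "penguin": ["bird"],
    "seal": ["mammal"],
    # A genetically engineered hybrid has two parent lineages.
    "panguen": ["penguin", "seal"],
}

print(descends_from("penguin", "bird", parents))     # True
print(descends_from("panguen", "penguin", parents))  # True
print(descends_from("penguin", "panguen", parents))  # False
```

Adding the panguen entry does not change the answer for penguins, which is the invariance in question; the hybrid just shows up as a node with two parents.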

In the case of lizard-mammals, I guess the causal perspective is somewhat trickier. In a way the selection procedure introduces a class of "non-human mammal" that didn't exist before? And then the mammal-lizard hybrids descend from both of these.

I don't know whether preserving the categories against declustering/reclustering like this is at all useful, but I thought it should be mentioned that it can be done.

It seems like this would have issues once you want the AI to e.g. influence the world in a way that increases the number of happy humans - then you can't just say "and also 'happy humans' have to have an origin causally separated from AI intervention."

You wouldn't say "and also 'happy humans' have to have an origin causally separated from AI intervention", you'd say "and also 'happy humans' have to have an origin causally downstream from humans".

Hm. But that doesn't seem to stop the AI from creating molecular smiley faces whose precise form is causally dependent on what humans are like. The reason the causal origin argument helps for birds is that we can specify a model of the causal origin, and it's just evolution with no room for AI intervention.