A "magical category" is an English word which, although it sounds simple - hey, it's just one word, right? - is actually not simple, and furthermore, may be applied in a complicated way that drags in other considerations.

In Yudkowsky's work on Friendly AI, the notion of "magical categories" is particularly important with respect to English words that are supposed to describe morality and imperatives. Suppose you say, for example, that "happiness" is good; therefore, the AI should be instructed to make people happy. "All right," says the wise Friendly AI researcher, "so the AI inserts an electrode into your pleasure center to make you happy forever; or, better yet, it disassembles you and uses your atoms to construct many tiny agents, each just complex enough to experience pleasure, and makes all of them happy."
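To make the failure mode concrete, here is a minimal sketch in Python of what goes wrong when "make people happy" is operationalized as maximizing a scalar pleasure signal. Every name and number here is an illustrative assumption, not anyone's actual design:

```python
# A minimal sketch (illustrative names and numbers, not anyone's actual
# design): "make people happy" operationalized as "maximize the sum of a
# scalar pleasure signal".

def total_pleasure(agents):
    """The naive, literal objective: sum each agent's pleasure signal."""
    return sum(a["pleasure"] for a in agents)

# Two world-states a sufficiently powerful optimizer could reach:
intended = [{"kind": "human", "pleasure": 8.0} for _ in range(10)]
degenerate = [{"kind": "minimal pleasure circuit", "pleasure": 10.0}
              for _ in range(10_000)]

# The literal objective prefers the degenerate state by a huge margin,
# even though it is exactly what the speaker did not mean.
print(total_pleasure(intended))    # 80.0
print(total_pleasure(degenerate))  # 100000.0
```

The literal objective is perfectly satisfied by the outcome the speaker would reject; nothing in the scalar signal distinguishes "true happiness" from tiling the universe with pleasure circuits.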

"What?" you cry indignantly. "That's not what I meant! That's not true happiness!"

And if you try to unpack the concept of "true" happiness - that is, happiness of the kind that actually counts as valuable - then you end up with a highly complicated Fun Theory. The word "happiness" turns out to have a complicated border; you would have to draw a squiggly surface in the space of possibilities to capture everything you meant by "happiness" and exclude everything you didn't mean.

Or suppose that your mother's house was on fire, and so you asked a literal-minded genie to "get my mother out of the house". The genie said: "What do you mean by that? Can you specify it mathematically?" And you replied: "Increase the distance between the following contiguous physical entity, designated 'Mom', and the center of the following house." So the genie took your mother and threw her out of the Solar System.
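As a sketch of the genie's literal reading - the coordinates and outcome names below are made up for illustration - the request reduces to a bare distance-maximization objective, and the absurd outcome wins:

```python
import math

# A minimal sketch of the genie's literal reading (coordinates and outcome
# names are invented for illustration): the request becomes a bare
# distance-maximization objective.

HOUSE_CENTER = (0.0, 0.0, 0.0)

def stated_objective(mom_position):
    """Only what was said: distance between Mom and the house's center."""
    return math.dist(mom_position, HOUSE_CENTER)

# Candidate outcomes, scored solely by the stated objective (meters):
outcomes = {
    "carried to the front lawn, alive": (10.0, 0.0, 0.0),
    "hurled out of the Solar System": (1e13, 0.0, 0.0),
}

# "Alive and unharmed" never appeared in the objective, so the absurd
# outcome wins by an astronomical margin.
best = max(outcomes, key=lambda name: stated_objective(outcomes[name]))
print(best)  # hurled out of the Solar System
```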

If you were talking to a human firefighter, it wouldn't occur to him, any more than to you, to cause your mother to be outside the house by, say, exploding the gas main and sending her flaming form hurtling high into the air. The firefighter understands implicitly that, in addition to wanting to increase the distance between your mother and the center of the burning house, you want your mother to survive, to remain in good physical condition, and so on. The firefighter, to some extent, shares your values; the invisible additional considerations you're dragging into the problem are shared by the firefighter too, and so can be omitted from your verbal instructions.

Physical brains are not powerful enough to search all possibilities; we have to cut down the search space to possibilities that are likely to be good. Most of the "obviously bad" ways to get one's mother out of a burning building - those that would end up violating our other values, and so ranking very low in our preference ordering - do not even occur to us as possibilities. When we tell the firefighter "My mother's in there! Get her out!" it doesn't occur to us, or to the firefighter, to destroy the building with high explosives. And so we don't realize that, in addition to the distance between our mother and the house, we're also dragging in highly complicated considerations having to do with the value of survival and good health, and what exactly constitutes being alive and healthy...
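The same point can be made as a toy search sketch - the plan names and numbers below are invented for illustration. The human search silently filters candidates by unstated values before scoring them; a literal optimizer over the full plan space does not:

```python
# A sketch of the point about search (plan names and numbers are invented
# for illustration): humans prune the candidate space with unstated values
# before scoring, so the "obviously bad" plans never even come up.

plans = [
    {"name": "firefighter carries her out",
     "seconds_until_outside": 120, "mom_alive": True},
    {"name": "explode the gas main",
     "seconds_until_outside": 2, "mom_alive": False},
]

def stated_score(plan):
    # Only the stated goal: get Mom outside, as fast as possible.
    return -plan["seconds_until_outside"]

def human_candidates(candidates):
    # The invisible filter we never think to say aloud: Mom must survive.
    return [p for p in candidates if p["mom_alive"]]

# A literal optimizer over all plans picks the fast, lethal one; the human
# search never considers it in the first place.
print(max(plans, key=stated_score)["name"])
print(max(human_candidates(plans), key=stated_score)["name"])
```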
