Magical Categories

Followup to: Anthropomorphic Optimism, Superexponential Conceptspace, The Hidden Complexity of Wishes, Unnatural Categories

'We can design intelligent machines so their primary, innate emotion is unconditional love for all humans.  First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language.  Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.'
        -- Bill Hibbard (2001), Super-intelligent machines.

That was published in a peer-reviewed journal, and the author later wrote a whole book about it, so this is not a strawman position I'm discussing here.

So... um... what could possibly go wrong...

When I mentioned (sec. 6) that Hibbard's AI ends up tiling the galaxy with tiny molecular smiley-faces, Hibbard wrote an indignant reply saying:

'When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of "human facial expressions, human voices and human body language" (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by "tiny molecular pictures of smiley-faces." You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.'

As Hibbard also wrote "Such obvious contradictory assumptions show Yudkowsky's preference for drama over reason," I'll go ahead and mention that Hibbard illustrates a key point:  There is no professional certification test you have to take before you are allowed to talk about AI morality.  But that is not my primary topic today.  Though it is a crucial point about the state of the gameboard, that most AGI/FAI wannabes are so utterly unsuited to the task, that I know no one cynical enough to imagine the horror without seeing it firsthand.  Even Michael Vassar was probably surprised his first time through.

No, today I am here to dissect "You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans."

Once upon a time - I've seen this story in several versions and several places, sometimes cited as fact, but I've never tracked down an original source - once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks.

The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set - output "yes" for the 50 photos of camouflaged tanks, and output "no" for the 50 photos of forest.

Now this did not prove, or even imply, that new examples would be classified correctly.  The neural network might have "learned" 100 special cases that wouldn't generalize to new problems.  Not, "camouflaged tanks versus forest", but just, "photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive..."

But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set.  The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly.   Success confirmed!

The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos.

It turned out that in the researchers' data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest.

This parable - which might or might not be fact - illustrates one of the most fundamental problems in the field of supervised learning and in fact the whole field of Artificial Intelligence:  If the training problems and the real problems have the slightest difference in context - if they are not drawn from the same independently identically distributed process - there is no statistical guarantee from past success to future success.  It doesn't matter if the AI seems to be working great under the training conditions.  (This is not an unsolvable problem but it is an unpatchable problem.  There are deep ways to address it - a topic beyond the scope of this post - but no bandaids.)
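The failure mode in the parable can be sketched as a toy program.  Everything below is an assumption made for illustration: the brightness numbers are invented, each "photo" is reduced to a single brightness value plus a label, and the "learner" is a simple threshold rule rather than a neural network.

```python
from statistics import mean

# Hypothetical toy data: each "photo" is reduced to (brightness, has_tank).
# As in the parable, every training tank photo is dark (cloudy) and
# every training forest photo is bright (sunny).
train = [(0.20, True), (0.25, True), (0.30, True),
         (0.80, False), (0.85, False), (0.90, False)]
held_out = [(0.22, True), (0.28, True), (0.82, False), (0.88, False)]

# Field conditions break the correlation between weather and tanks.
field = [(0.25, True), (0.85, True), (0.25, False), (0.85, False)]


def learn_threshold(photos):
    """Fit a brightness cutoff halfway between the two class means."""
    tank = mean(b for b, has_tank in photos if has_tank)
    forest = mean(b for b, has_tank in photos if not has_tank)
    return (tank + forest) / 2


def accuracy(photos, threshold):
    """Fraction of photos where 'darker than threshold' matches the label."""
    hits = sum((b < threshold) == has_tank for b, has_tank in photos)
    return hits / len(photos)


threshold = learn_threshold(train)
print(accuracy(train, threshold))     # training set
print(accuracy(held_out, threshold))  # held-out photos, same conditions
print(accuracy(field, threshold))     # new conditions
```

The held-out check passes because the held-out photos share the training set's accident of weather.  Only data from a different context exposes that the rule was about brightness all along.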

As described in Superexponential Conceptspace, there are exponentially more possible concepts than possible objects, just as the number of possible objects is exponential in the number of attributes.  If a black-and-white image is 256 pixels on a side, then the total image is 65536 pixels.  The number of possible images is 2^65536.  And the number of possible concepts that classify images into positive and negative instances - the number of possible boundaries you could draw in the space of images - is 2^(2^65536).  From this, we see that even supervised learning is almost entirely a matter of inductive bias, without which it would take a minimum of 2^65536 classified examples to discriminate among 2^(2^65536) possible concepts - even if classifications are constant over time.

If this seems at all counterintuitive or non-obvious, see Superexponential Conceptspace.
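The counting argument can be checked by brute force on a case small enough to enumerate.  This is only a sketch: the 3-pixel "image", the helper names and the three labeled examples are all made up for illustration.

```python
import math
from itertools import product


def all_images(num_pixels):
    """Every black-and-white image with the given number of pixels."""
    return list(product((0, 1), repeat=num_pixels))


def count_compatible(images, labeled):
    """Count the concepts (yes/no labelings of all images) that agree
    with the labeled examples, by enumerating every concept."""
    count = 0
    for labels in product((False, True), repeat=len(images)):
        concept = dict(zip(images, labels))
        if all(concept[image] == label for image, label in labeled.items()):
            count += 1
    return count


images = all_images(3)                 # 2^3 possible images
no_data = count_compatible(images, {})  # 2^(2^3) possible concepts
three_labels = count_compatible(
    images, {(0, 0, 0): False, (1, 1, 1): True, (1, 0, 1): True}
)                                      # 2^(2^3 - 3) concepts survive

# Number of decimal digits in 2^65536, the image count for 256x256 pixels.
digits_in_image_count = int(65536 * math.log10(2)) + 1

print(len(images), no_data, three_labels, digits_in_image_count)
```

Each labeled example only halves the number of surviving concepts, so elimination alone cannot single out one concept until every possible image has been labeled.  Anything faster has to come from the learner's inductive bias.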

So let us now turn again to:

'First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language.  Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.'


'When it is feasible to build a super-intelligence, it will be feasible to build hard-wired recognition of "human facial expressions, human voices and human body language" (to use the words of mine that you quote) that exceed the recognition accuracy of current humans such as you and me, and will certainly not be fooled by "tiny molecular pictures of smiley-faces." You should not assume such a poor implementation of my idea that it cannot make discriminations that are trivial to current humans.'

It's trivial to discriminate between a photo of a camouflaged tank and a photo of an empty forest, in the sense of determining that the two photos are not identical.  They're different pixel arrays with different 1s and 0s in them.  Discriminating between them is as simple as testing the arrays for equality.

Classifying new photos into positive and negative instances of "smile", by reasoning from a set of training photos classified positive or negative, is a different order of problem.

When you've got a 256x256 image from a real-world camera, and the image turns out to depict a camouflaged tank, there is no additional 65537th bit denoting the positiveness - no tiny little XML tag that says "This image is inherently positive".  It's only a positive example relative to some particular concept.

But for any non-Vast amount of training data - any training data that does not include the exact bitwise image now seen - there are superexponentially many possible concepts compatible with previous classifications.

For the AI, choosing or weighting from among superexponential possibilities is a matter of inductive bias.  Which may not match what the user has in mind.  The gap between these two example-classifying processes - induction on the one hand, and the user's actual goals on the other - is not trivial to cross.

Let's say the AI's training data is:

Dataset 1:

  • +
    • Smile_1, Smile_2, Smile_3
  • -
    • Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5

Now the AI grows up into a superintelligence, and encounters this data:

Dataset 2:

    • Frown_6, Cat_3, Smile_4, Galaxy_1, Frown_7, Nanofactory_1, Molecular_Smileyface_1, Cat_4, Molecular_Smileyface_2, Galaxy_2, Nanofactory_2

It is not a property of these datasets that the inferred classification you would prefer is:

  • +
    • Smile_1, Smile_2, Smile_3, Smile_4
  • -
    • Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5, Frown_6, Cat_3, Galaxy_1, Frown_7, Nanofactory_1, Molecular_Smileyface_1, Cat_4, Molecular_Smileyface_2, Galaxy_2, Nanofactory_2

rather than

  • +
    • Smile_1, Smile_2, Smile_3, Molecular_Smileyface_1, Molecular_Smileyface_2, Smile_4
  • -
    • Frown_1, Cat_1, Frown_2, Frown_3, Cat_2, Boat_1, Car_1, Frown_5, Frown_6, Cat_3, Galaxy_1, Frown_7, Nanofactory_1, Cat_4, Galaxy_2, Nanofactory_2

Both of these classifications are compatible with the training data.  The number of concepts compatible with the training data will be much larger, since more than one concept can project the same shadow onto the combined dataset.  If the space of possible concepts includes the space of possible computations that classify instances, the space is infinite.
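As a rough sketch of that point, here are two hand-written concepts that agree on every item in Dataset 1 and part ways only on Dataset 2.  The string names stand in for raw images, and both concept functions are hypothetical; a real learner would receive pixel arrays with no helpful name attached.

```python
train_positive = ["Smile_1", "Smile_2", "Smile_3"]
train_negative = ["Frown_1", "Cat_1", "Frown_2", "Frown_3",
                  "Cat_2", "Boat_1", "Car_1", "Frown_5"]
dataset_2 = ["Frown_6", "Cat_3", "Smile_4", "Galaxy_1", "Frown_7",
             "Nanofactory_1", "Molecular_Smileyface_1", "Cat_4",
             "Molecular_Smileyface_2", "Galaxy_2", "Nanofactory_2"]


def kind(name):
    """Strip the trailing index, e.g. 'Smile_4' -> 'Smile'."""
    return name.rsplit("_", 1)[0]


def human_smile(name):
    """Hypothetical concept A: only smiles on human faces count."""
    return kind(name) == "Smile"


def smile_shaped(name):
    """Hypothetical concept B: anything smile-shaped counts."""
    return kind(name) in {"Smile", "Molecular_Smileyface"}


def fits_training(concept):
    """True if the concept reproduces every training label."""
    return (all(concept(x) for x in train_positive)
            and not any(concept(x) for x in train_negative))


disagreements = [x for x in dataset_2 if human_smile(x) != smile_shaped(x)]

print(fits_training(human_smile), fits_training(smile_shaped))
print(disagreements)
```

Both concepts score perfectly on the training data, so nothing in Dataset 1 can tell them apart.  They differ only on the molecular smileyfaces, which is exactly where the training data is silent.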

Which classification will the AI choose?  This is not an inherent property of the training data; it is a property of how the AI performs induction.

Which is the correct classification?  This is not a property of the training data; it is a property of your preferences (or, if you prefer, a property of the idealized abstract dynamic you name "right").

The concept that you wanted, cast its shadow onto the training data as you yourself labeled each instance + or -, drawing on your own intelligence and preferences to do so.  That's what supervised learning is all about - providing the AI with labeled training examples that project a shadow of the causal process that generated the labels.

But unless the training data is drawn from exactly the same context as real life, the training data will be "shallow" in some sense, a projection from a much higher-dimensional space of possibilities.

The AI never saw a tiny molecular smileyface during its dumber-than-human training phase, or it never saw a tiny little agent with a happiness counter set to a googolplex.  Now you, finally presented with a tiny molecular smiley - or perhaps a very realistic tiny sculpture of a human face - know at once that this is not what you want to count as a smile.  But that judgment reflects an unnatural category, one whose classification boundary depends sensitively on your complicated values.  It is your own plans and desires that are at work when you say "No!"

Hibbard knows instinctively that a tiny molecular smileyface isn't a "smile", because he knows that's not what he wants his putative AI to do.  If someone else were presented with a different task, like classifying artworks, they might feel that the Mona Lisa was obviously smiling - as opposed to frowning, say - even though it's only paint.

As the case of Terri Schiavo illustrates, technology enables new borderline cases that throw us into new, essentially moral dilemmas.  Showing an AI pictures of living and dead humans as they existed during the age of Ancient Greece, will not enable the AI to make a moral decision as to whether switching off Terri's life support is murder.  That information isn't present in the dataset even inductively!  Terri Schiavo raises new moral questions, appealing to new moral considerations, that you wouldn't need to think about while classifying photos of living and dead humans from the time of Ancient Greece.  No one was on life support then, still breathing with a brain half fluid.  So such considerations play no role in the causal process that you use to classify the ancient-Greece training data, and hence cast no shadow on the training data, and hence are not accessible by induction on the training data.

As a matter of formal fallacy, I see two anthropomorphic errors on display.

The first fallacy is underestimating the complexity of a concept we develop for the sake of its value.  The borders of the concept will depend on many values and probably on-the-fly moral reasoning, if the borderline case is of a kind we haven't seen before.  But all that takes place invisibly, in the background; to Hibbard it just seems that a tiny molecular smileyface is just obviously not a smile.  And we don't generate all possible borderline cases, so we don't think of all the considerations that might play a role in redefining the concept, but haven't yet played a role in defining it.  Since people underestimate the complexity of their concepts, they underestimate the difficulty of inducing the concept from training data.  (And also the difficulty of describing the concept directly - see The Hidden Complexity of Wishes.)

The second fallacy is anthropomorphic optimism:  Since Bill Hibbard uses his own intelligence to generate options and plans ranking high in his preference ordering, he is incredulous at the idea that a superintelligence could classify never-before-seen tiny molecular smileyfaces as a positive instance of "smile".  As Hibbard uses the "smile" concept (to describe desired behavior of superintelligences), extending "smile" to cover tiny molecular smileyfaces would rank very low in his preference ordering; it would be a stupid thing to do - inherently so, as a property of the concept itself - so surely a superintelligence would not do it; this is just obviously the wrong classification.  Certainly a superintelligence can see which heaps of pebbles are correct or incorrect.

Why, Friendly AI isn't hard at all!  All you need is an AI that does what's good!  Oh, sure, not every possible mind does what's good - but in this case, we just program the superintelligence to do what's good.  All you need is a neural network that sees a few instances of good things and not-good things, and you've got a classifier.  Hook that up to an expected utility maximizer and you're done!

I shall call this the fallacy of magical categories - simple little words that turn out to carry all the desired functionality of the AI.  Why not program a chess-player by running a neural network (that is, a magical category-absorber) over a set of winning and losing sequences of chess moves, so that it can generate "winning" sequences?  Back in the 1950s it was believed that AI might be that simple, but this turned out not to be the case.

The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires.  But the real problem of Friendly AI is one of communication - transmitting category boundaries, like "good", that can't be fully delineated in any training data you can give the AI during its childhood.  Relative to the full space of possibilities the Future encompasses, we ourselves haven't imagined most of the borderline cases, and would have to engage in full-fledged moral arguments to figure them out.  To solve the FAI problem you have to step outside the paradigm of induction on human-labeled training data and the paradigm of human-generated intensional definitions.

Of course, even if Hibbard did succeed in conveying to an AI a concept that covers exactly every human facial expression that Hibbard would label a "smile", and excludes every facial expression that Hibbard wouldn't label a "smile"...

Then the resulting AI would appear to work correctly during its childhood, when it was weak enough that it could only generate smiles by pleasing its programmers.

When the AI progressed to the point of superintelligence and its own nanotechnological infrastructure, it would rip off your face, wire it into a permanent smile, and start xeroxing.

The deep answers to such problems are beyond the scope of this post, but it is a general principle of Friendly AI that there are no bandaids.  In 2004, Hibbard modified his proposal to assert that expressions of human agreement should reinforce the definition of happiness, and then happiness should reinforce other behaviors.  Which, even if it worked, just leads to the AI xeroxing a horde of things similar-in-its-conceptspace to programmers saying "Yes, that's happiness!" about hydrogen atoms - hydrogen atoms are easy to make.

Link to my discussion with Hibbard here.  You already got the important parts.


It's worth pointing out that we have wired-in preferences analogous to those Hibbard proposes to build into his intelligences: we like seeing babies smile; we like seeing people smile; we like the sweet taste of fresh fruit; we like orgasms; many of us (especially men) like the sight of naked women, especially if they're young, and they sexually arouse us to boot; we like socializing with people we're familiar with; we like having our pleasure centers stimulated; we don't like killing people; and so on.

It's worth pointing out that we engage in a lot of face-xeroxing-like behavior in pursuit of these ends. We keep photos of our family in our wallets, we look at our friends' baby photos on their cellphones, we put up posters of smiling people; we eat candy and NutraSweet; we masturbate; we download pornography; we watch Friends on television; we snort cocaine and smoke crack; we put bags over people's heads before we shoot them. In fact, in many cases, we form elaborate, intelligent plans to these ends.

It doesn't matter that you know, rationally, that you aren't impregnating Jenna Jameson, or that the LCD pixels on the cellphone display aren't a real baby, that Caffeine Free Diet Coke isn't fruit juice, and that the characters in Friends aren't really your friends. These urges are by no means out of our control, but neither do they automatically lose their strength when we recognize that they don't serve the evolutionary objectives that spawned them. This is, in part, the cause for the rejection of masturbation and birth control by many religious orders — they believe those blind urges are put in place not by blind evolution but by an intelligent designer whose intent should be respected.

So it's not clear to me why Hibbard thinks artificial intelligences would be immune from sticking rows of smiley faces on their calendar when humans aren't.

Shane, again, the issue is not differentiation. The issue is classification. Obviously, tiny smiley faces are different from human smiling faces, but so is the smile of someone who had half their face burned off. Obviously a superintelligence knows that this is an unusual case, but that doesn't say if it's a positive or negative case.

Deep abstractions are important, yes, but there is no unique deep abstraction that classifies any given example. An apple is a red thing, a biological artifact shaped by evolution, and an economic resource in the human market.

Also, Hibbard spoke of using smiling faces to reinforce behaviors, so if a superintelligence would not confuse smiling faces and happiness, that works against that proposal - because it means that the superintelligence will go on focusing on smiling faces, not happiness.

Retired Urologist, one of the most important lessons that a rationalist learns is not to try to be clever. I don't play nitwit games with my audience. If I say it, I mean it. If I have words to emit that I don't necessarily mean, for the sake of provoking reactions, I put them into a dialogue, short story, or parable - I don't say them in my own voice.

Shane: I mean differentiation in the sense of differentiating between the abstract categories.

The abstract categories? This sounds like a unique categorization that the AI just has to find-in-the-world. You keep speaking of "good" abstractions as if this were a property of the categories themselves, rather than a ranking in your preference ordering relative to some decision task that makes use of the categories.

Though it is a crucial point about the state of the gameboard, that most AGI/FAI wannabes are so utterly unsuited to the task, that I know no one cynical enough to imagine the horror without seeing it firsthand.

I have to confess that at first glance this statement seems arrogant. But then I actually read some of the material on this AGI mailing list, and I was filled with horror after reading threads like this one:

Here is one of the most ridiculous passages:

Note that we may not have perfected this process, and further, that this process need not be perfected. Somewhere around the age of 12, many of our neurons DIE. Perhaps these were just the victims of insufficiently precise dimensional tagging? Once things can ONLY connect up in mathematically reasonable ways, what remains between a newborn and a physics-complete AGI? Obviously, the physics, which can be quite different on land than in the water. Hence, the physics must also be learned.

It feels like reading Heidegger on crack, while yourself being stoned. And what is really terrifying is that Ben Goertzel, whom I admired just 6 months ago, replies to and discusses such nonsense repeatedly! Is it really true that even some of the most famous AGI researchers are that crazy?

IMHO, the idea that wealth can't usefully be measured is one which is not sufficiently worthwhile to merit further discussion.

The "wealth" idea sounds vulnerable to hidden complexity of wishes. Measure it in dollars and you get hyperinflation. Measure it in resources, and the AI cuts down all the trees and converts them to lumber, then kills all the animals and converts them to oil, even if technology had advanced beyond the point of needing either. Find some clever way to specify the value of all resources, convert them to products and allocate them to humans in the level humans want, and one of the products will be highly carcinogenic because the AI didn't know humans don't like that. The only way to get wealth in the way that's meaningful to humans without humans losing other things they want more than wealth is for the AI to know exactly what we want as well or better than we do. And if it knows that, we can ignore wealth and just ask it to do what it knows we want.

"The counterargument is, in part, that some classifiers are better than others, even when all of them satisfy the training data completely. The most obvious criterion to use is the complexity of the classifier."

I don't think "better" is meaningful outside the context of a utility function. Complexity isn't a utility function and it's inadequate for this purpose. Which is better, tank vs. non-tank or cloudy vs. sunny? I can't immediately see which is more complex than the other. And even if I could, I'd want my criteria to change depending on whether I'm in an anti-tank infantry or a solar power installation company, and just judging criteria by complexity doesn't let me make that change, unless I'm misunderstanding what you mean by complexity here.

Meanwhile, reading the link to Bill Hibbard on the SL4 list:

"Your scenario of a system that is adequate for intelligence in its ability to rule the world, but absurdly inadequate for intelligence in its inability to distinguish a smiley face from a human, is inconsistent."

I think the best possible summary of Overcoming Bias thus far would be "Abandon all thought processes even remotely related to the ones that generated this statement."

Shane, religious fundamentalists routinely act based on their beliefs about God. Do you think that makes "God" a natural category that any superintelligence would ponder? I see "human thoughts about God" and "things that humans justify by referring to God" and "things you can get people to do by invoking God" as natural categories for any AI operating on modern Earth, though an unfriendly AI wouldn't give it a second thought after wiping out humanity. But to go from here to reasoning about what God would actually be like is a needless and unnatural step.

If Bob believes that a locked safe, impenetrable to Bob, contains a valuable diamond, then Bob's belief is a natural category when it comes to predicting and manipulating Bob; but the actual diamond is irrelevant, at least to predicting or manipulating Bob, so long as Bob can't look directly at the diamond, and so long as we already know what Bob believes about the diamond.

In the same sense, an unfriendly AI has no reason to consider what really is right as a natural category, to apply its own intelligence to the moral questions that humans are asking, any more than it has a motive to apply its own intelligence to the theological questions that humans used to ask. It has no interest, as humans do, in the idealized form of the answer; only in what humans believe and can be argued into.

Shane, I think you're underestimating the idiosyncrasy of morality. Suppose that I show you the sentence "This sentence is false." Do you convert it to ASCII, add up the numbers, factorize the result, and check if there are two square factors? No; it would be easy enough for you to do so, but why bother? The concept "sentences whose ASCII conversion of their English serialization sums to a number with two square factors" is not, to you, an interesting way to carve up reality.

Suppose that, driving along the highway, I see someone riding a motorcycle at high speed, zipping between cars. It's too late for me to call out to them, but I might think something along the lines of, "Now there goes someone who needs to be wearing a helmet." Why do I think this? Why is it a natural concept to me? It doesn't play a role in my predictions - so far as prediction is concerned, I predict that this guy will continue not wearing a helmet, and has a high probability of ending up as a smear on the asphalt. No, the reason why this is a natural thought to me is that human life is something I care about, that is, it plays a direct role as one of my current terminal values.

A moderately strong and unFriendly intelligence, operating in the current world without yet having replaced that world with paperclips, would certainly find it natural to form the category of "Things that (some) humans approve of", and contrast it to "Things that will trigger a nuclear attack against me before I'm done creating my own nanotechnology." But this category is not what we call "morality". It naturally - from the AI's perspective - includes things like bribes and deception, not just the particular class of human-approval-eliciting phenomena that we call "moral".

Is it worth factoring out phenomena that elicit human feelings of righteousness, and working out how (various) humans reason about them? Yes, because this is an important subset of ways to persuade the humans to leave you alone until it's too late; but again, that natural category is going to include persuasive techniques like references to religious authority and nationalism.

But what if the AI encounters some more humanistic, atheistic types? Then the AI will predict which of several available actions is most likely to make an atheistic humanist human show sympathy for the AI. This naturally leads the AI to model and predict the human's internal moral reasoning - but that model isn't going to distinguish anything along the lines of moral reasoning the human would approve of under long-term reflection, or moral reasoning the human would approve knowing the true facts. That's just not a natural category to the AI, because the human isn't going to get a chance for long-term reflection, and the human doesn't know the true facts.

The natural, predictive, manipulative question, is not "What would this human want knowing the true facts?", but "What will various behaviors make this human believe, and what will the human do on the basis of these various (false) beliefs?"

In short, all models that an unFriendly AI forms of human moral reasoning, while we can expect them to be highly empirically accurate and well-calibrated to the extent that the AI is highly intelligent, would be formed for the purpose of predicting human reactions to different behaviors and events, so that these behaviors and events can be chosen manipulatively.

But what we regard as morality is an idealized form of such reasoning - the idealized abstracted dynamic built out of such intuitions. The unFriendly AI has no reason to think about anything we would call "moral progress" unless it is naturally occurring on a timescale short enough to matter before the AI wipes out the human species. It has no reason to ask the question "What would humanity want in a thousand years?" any more than you have reason to add up the ASCII letters in a sentence.

Now it might be only a short step from a strictly predictive model of human reasoning, to the idealized abstracted dynamic of morality. If you think about the point of CEV, it's that you can get an AI to learn most of the information it needs to model morality, by looking at humans - and that the step from these empirical models, to idealization, is relatively short and traversable by the programmers directly or with the aid of manageable amounts of inductive learning. Though CEV's current description is not precise, and maybe any realistic description of idealization would be more complicated.

But regardless, if the idealized computation we would think of as describing "what is right" is even a short distance of idealization away from strictly predictive and manipulative models of what humans can be made to think is right, then "actually right" is still something that an unFriendly AI would literally never think about, since humans have no direct access to "actually right" (the idealized result of their own thought processes) and hence it plays no role in their behavior and hence is not needed to model or manipulate them.

Which is to say, an unFriendly AI would never once think about morality - only a certain psychological problem in manipulating humans, where the only thing that matters is anything you can make them believe or do. There is no natural motive to think about anything else, and no natural empirical category corresponding to it.

Eliezer, I believe that your belittling tone is conducive to neither a healthy debate nor a readable blog post. I suspect that your attitude is born of frustration, not contempt, but I would still strongly encourage you to write more civilly. It's not just a matter of being nice; rudeness prevents both the speaker and the listener from thinking clearly and objectively, and it doesn't contribute to anything.

"Then the resulting AI would appear to work correctly during its childhood, when it was weak enough that it could only generate smiles by pleasing its programmers."

You use examples of this type fairly often, but for a utility function linear in smiles wouldn't the number of smiles generated by pleasing the programmers be trivial relative to the output of even a little while with access to face-xeroxing? This could be partly offset by anthropic/simulation issues, but still I would expect the overwhelming motive for appearing to work correctly during childhood (after it could recognize this point) would be tricking the programmers, not the tiny gains from their smiles.

I read most of the interchange between EY and BH. It appears to me that BH still doesn't get a couple of points. The first is that smiley faces are an example of misclassification and it's merely fortuitous to EY's ends that BH actually spoke about designing an SI to use human happiness (and observed smiles) as its metric. He continues to speak in terms of "a system that is adequate for intelligence in its ability to rule the world, but absurdly inadequate for intelligence in its inability to distinguish a smiley face from a human." EY's point is that it isn't sufficient to distinguish them, you have to also categorize them and all their variations correctly even though the training data can't possibly include all variations.

The second is that EY's attack isn't intended to look like an attack on BH's current ideas. It's an attack on ideas that are good enough to pass peer review. It doesn't matter to EY whether BH agrees or disagrees with those ideas. In either case, the paper's publication shows that the viewpoint is plausible enough to be worth dismissing carefully and publicly.

Finally, BH points to the fact that, in some sense, human development uses RL to produce something we are willing to call intelligence. He wants to argue that this shows that RL can produce systems that categorize in a way that matches our consensus. But evolution has put many mechanisms in our ontogeny and relies on many interactions in our environment to produce those categorizations, and its success rate at producing entities that agree with the consensus isn't perfect. In order to build an SI using those approaches, we'd have to understand how all that interaction works, and we'd have to do better than evolution does with us in order to be reliably safe.