Today's post, Mirrors and Paintings, was originally published on 23 August 2008. A summary (taken from the LW wiki):


There is a proposal for programming a friendly AI, called CEV. Essentially, this strategy consists of teaching a computer to look at human brains and deduce, from that, morality. This should work better than trying to program morality "by hand", since we really aren't smart enough to solve that problem with an acceptable degree of accuracy.

Discuss the post here (rather than in the comments to the original post).

This post is part of the Rerunning the Sequences series, where we'll be going through Eliezer Yudkowsky's old posts in order so that people who are interested can (re-)read and discuss them. The previous post was Invisible Frameworks, and you can use the sequence_reruns tag or RSS feed to follow the rest of the series.

Sequence reruns are a community-driven effort. You can participate by re-reading the sequence post, discussing it here, posting the next day's sequence reruns post, or summarizing forthcoming articles on the wiki. Go here for more details, or to have meta discussions about the Rerunning the Sequences series.


So you build the AI with a kind of forward reference: "You see those humans over there? ..."

Building on a prior comment of mine, right in the moral sense is not evenly distributed. The there in "over there" matters a great deal. Not stating that leads to telling an AI 'pretend like you know what I mean when I talk about moral right.' Since the Coherent Extrapolated Volition model is built on human behavior, it matters that human behavior is... diverse.

For this human, who I have decided to not emulate (including myself in the past) has mattered perhaps more than who I decided to emulate (including imagined future selves). Refraining from doing harm works right away; trying to do good includes missteps and takes time. I am starting to understand that what works for me, or for humans, might not be optimal or even functional in an AI. But starting with what I know, I'd tell a potentially friendly AI 'you see those humans over there? Tend toward not doing what they do.' I'd aim it where physical violence is more common, for starters. I tend toward Popper's piecemeal social change rather than utopian social change. Again, what works well for people may not work well for AIs. To that end, aim AIs away from utopia and they'll bumble their way (in nanoseconds) towards something more humane.

For further study: a 37-second documentary clip on the pursuit of utopia, described here as "what is best in life."

"Tend toward not doing what they do differently from them."

Every person who committed deliberate genocide was breathing at the time they made the decision. That does not make breathing evil.


We are in agreement. An AI that tended toward not doing what that group did would, therefore, tend to not commit genocide and tend to not breathe. So far, my AI program is tending to refrain from harming humans (perhaps also other AIs) and that is the goal.

OK, now let's consider that fusion bombs are not used in warfare by evil people. Therefore, in order to be as different as possible from evil people, one should use fusion bombs liberally.



I think you're doing some special pleading, but I could be wrong. In the face of your proposed possible not-evil liberal use of fusion bombs, I prefer (a) turning my AI off and being wrong about turning it off to (b) your strange example being right. My AI made it through one round, which is pretty good for a first try.

But... you've just proved my point. As I said back in July, questions of correct and incorrect are not of equal utility in minimizing error. There are a few ways for an airplane to fly correctly and many, many ways to fly incorrectly (crash!). So plan for reducing those wrong ways with greater weight than enhancing the right ways. My AI made it all the way through one whole round before I turned it off, and when I turned it off it was because of a way it could go wrong.

There are other ways for my AI to be sub-optimal. By not imitating those people over there (my preferred term, although I think we're talking about the same group when you call them evil) it is likely that my AI will not have enough behaviors left to do anything, much less anything awesome. But once again my point is proven. My AI would rather turn itself into a lump than do harm. Useless, but here's a case where useless is the better option.

I (probably unfairly) gave myself some wiggle room by saying my AI tends to do or not do something, rather than does or doesn't do something. Unfortunately for me, that means my AI might tend in a harmful way sometimes. Fortunately for me, it means an opportunity for some feedback along the way. "Hey, AI, I know you never saw those people over there exhibiting non-evil liberal use of fusion bombs, so you didn't know any better, but... add that one to the 'no' list. Thanks big guy."

The Hippocratic Oath is another version of what I'm talking about. "I will prescribe regimens for the good of my patients according to my ability and my judgment and never do harm to anyone." I've seen a misquote I like better: "First, do no harm." You may not know how to heal, but make sure you don't harm and make sure of that first.

Thank you for helping me test my AI! Your refutations to my conjectures are most welcome and encouraged. Glad we turned it off before it started in with that not-evil liberal use of fusion bombs.

Oh, what I'm saying is the right way to do it is "See those people over there? Figure out what makes them different from people in general, and don't do that." Or better yet "... and tell me what you think makes them different, so we can learn about it together. Later on, I expect that I will tell you to refrain from that behavior."
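To make the difference between the two instructions concrete, here is a rough toy sketch. It assumes, purely for illustration, that behaviors can be written down as strings and that each group is just a set of them; the data and the function names `naive_avoid` and `contrastive_avoid` are made up and are not part of any actual proposal.

```python
# Hypothetical toy data: behaviors are plain strings, groups are sets of them.
people_in_general = {"breathing", "eating", "talking"}
those_people = {"breathing", "eating", "committing genocide"}


def naive_avoid(group):
    """'Don't do what they do': avoid every behavior the group exhibits."""
    return set(group)


def contrastive_avoid(group, baseline):
    """'Don't do what makes them different': avoid only behaviors
    the group exhibits that the baseline population does not."""
    return group - baseline


# The naive rule rules out breathing along with genocide.
print(sorted(naive_avoid(those_people)))

# The contrastive rule keeps breathing and rules out only the difference.
print(sorted(contrastive_avoid(those_people, people_in_general)))
```

Note that in this sketch both rules only produce a list of things to avoid. A behavior missing from that list is simply not ruled out, which is not the same as being recommended.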


This is the first time you've mentioned people in general. I think that increases the chances of confusion, not decreases.

My initial point was simply saying "Those people are better than those people," pointing out two different groups. What is confusing about making one of those groups "all people", other than defining "people"?

Because if I can't get an AI to discriminate between people and not-people for non-edge cases, I'm not going to try to get it to figure out moral and immoral.