Friendly to who?

by TimFreeman, 16th Apr 2011 (3 min read, 38 comments)

Personal Blog

At http://lesswrong.com/lw/ru/the_bedrock_of_fairness/ldy Eliezer mentions two challenges he often gets, "Friendly to who?" and "Oh, so you get to say what 'Friendly' means."  At the moment I see only one true answer to these questions, which I give below.  If you can propose alternatives in the comments, please do.

I suspect morality is in practice a multiplayer game, so talking about it needs multiple people to be involved.  Therefore, let's imagine a dialogue between A and B.

A: Okay, so you're interested in Friendly AI.  Who will it be Friendly toward?

B: Obviously the people who participate in making the system will decide how to program it, so they will decide who it is Friendly toward.

A: So the people who make the system decide what "Friendly" means?

B: Yes.

A: Then they could decide that it will be Friendly only toward them, or toward White people.  Isn't that sort of selfishness or racism immoral?

B: I can try to answer questions about the world, so if you can define morality so I can do experiments to discover what is moral and what is immoral, I can try to guess the results of those experiments and report them.  What do you mean by morality?

A: I don't know.  If it doesn't mean anything, why do people talk about morality so much?

B: People often profess beliefs to label themselves as members of a group.  So far as I can tell, the belief that some things are moral and other things are not is one of those beliefs.  I don't have any other explanation for why people talk so much about something that isn't subject to experimentation.

A: So if that's what morality is, then it's fundamentally meaningless unless I'm planning out what lies to tell in order to get positive regard from a potential ingroup, or better yet I manage to somehow deceive myself so I can truthfully conform to the consensus morality of my desired ingroup.  If that's all it is, there's no constraint on how a Friendly AI works, right?  Maybe you'll build it and it will only be Friendly toward B.

B: No, because I can't do it by myself.  Suppose I approach you and say "I'm going to make a Friendly AI that lets me control it and doesn't care about anyone else's preference."  Would you help me?

A: Obviously not.

B: Nobody else would either, so the only way I can unilaterally run the world with an FAI is to create it by myself, and I'm not up to that.  There are a few other proposed notions of Friendliness that are nonviable for similar reasons.  For example, suppose I approached you and said "I'm going to make a Friendly AI that treats everyone fairly, but I don't want to let anybody inspect how it works."  Would you help me?

A: No, because I wouldn't trust you.  I'd assume that you plan to really make it Friendly only toward yourself, lie about it, and then drop the lie once the FAI had enough power that you didn't need the lie any more.

B: Right.  Here's an ethical system that fails another way: "I'll make an FAI that cares about every human equally, no matter what they do."  To keep it simple, let's assume that engineering humans to have strange desires for the purpose of manipulating the FAI is not possible.  Would you help me build that?

A: Well, it fits with my intuitive notion of morality, but it's not clear what incentive I have to help.  If you succeed, I seem to win equally at the end whether I help you or not.  Why bother?

B: Right.  There are several possible fixes for that.  Perhaps if I don't get your help, I won't succeed, and the alternative is that someone else builds it poorly and your quality of life decreases dramatically.  That gives you an incentive to help.

A: Not much of one.  You'll surely need a lot of help, and maybe if all those other people help I won't have to.  Everyone would make the same decision and nobody would help.

B: Right.  I could solve that problem by paying helpers like you money, if I had enough money.  Another option would be to tilt the Friendliness in the direction of helpers in proportion to how much they help me.

A: But isn't tilting the Friendliness unfair?

B: Depends.  Do you want things to be fair?

A: Yes, for some intuitive notion of "fairness" I can't easily describe. 

B: So if the AI cares what you want, that will cause it to figure out what you mean by "fair" and tend to make it happen, with that tendency increasing as it tilts more in your favor, right?

A: I suppose so.  No matter what I want, if the AI cares enough about me, it will give me more of what I want, including fairness. 

B: Yes, that's the best idea I have right now.  Here's another alternative: What would happen if we only took action when there's a consensus about how to weight the fairness?

A: Well, 4% of the population are sociopaths.  They, and perhaps others, would make ridiculous demands and prevent any consensus.  Then we'd be waiting forever to build this thing and someone else who doesn't care about consensus would move while we're dithering and make us irrelevant.  Thus we'll have to take action and do something reasonable without having a consensus about what that is.  Since we can't wait for a consensus, maybe it makes sense to proceed now.  So how about it?  Do you need help yet?

B: Nope, I don't know how to make it.

A: Damn.  Hmm, do you think you'll figure it out before everybody else?

B: Probably not.  There are a lot of everybody else.  In particular, business organizations that optimize for profit have a lot of power and have fundamentally inhuman value systems.  I don't see how I can take action before all of them.

A: Me either.  We are so screwed.

There is surprisingly little incentive for selfish AI writers to tilt the friendliness towards themselves. Consider these four outcomes: an AI is created that isn't friendly to anyone; no AI is created; an AI is created that's friendly to all humans; or an AI is created that's friendly only to its creator. A selfish programmer would rank these in increasing order of preference.

The differences among the first three outcomes are huge: death and extinction versus the status quo versus superoptimization. But the difference between an AI friendly to all humans and an AI friendly to just the creator is small; normal human preferences are mostly compatible, and don't require astronomical resources, so making everyone else happy too would cost very little. Meanwhile, making an AI that's friendly only to its creator is riskier and less likely to succeed than making one that's friendly to everyone: the part where it distinguishes the creator from other humans may have bugs (especially if the creator tries to self-modify later), the creator can't recruit help, and other humans may try to stop it. It also creates a time window during which, if the creator dies or suffers brain damage, the AI ends up unfriendly to everyone (including the creator).

So making a selectively-friendly AI just seems like a stupid idea, even before you get to the moral arguments. And the moral arguments point the same way. I'm much less worried about someone making an AI selfishly than I am about someone making an AI stupidly or carelessly, which is a real danger and one that can't be defused by any philosophical argument.

There is surprisingly little incentive for selfish AI writers to tilt the friendliness towards themselves.

For normal humans and at the end of the game, I agree with you. However, there are two situations where people may want tilt:

  • Narcissists seem to have an unlimited appetite for adoration from others. That might translate to a desire to get the AI tilted as much as possible in their favor. They are 1% of the population according to the abnormal psych literature, but in my experience I see a much larger fraction of the population being subclinically narcissistic enough to be a problem.

  • If there's a slow takeoff, the AI will be weak for some period of time. During this time the argument that it controls enough resources to satisfy everyone doesn't hold. If the organization building it has no other available currency to pay people for help, it might pay in tilt. If the tilt decays toward zero at some rate we could end up with something that is fair. I don't know how to reconcile that with the scheme described in another comment for dealing with utility monsters by tilting away from them.

It also creates a time window during which, if the creator dies or suffers brain damage, the AI ends up unfriendly to everyone (including the creator).

I agree that there will be windows like that. To avoid that, we would need a committee taking the lead with some well-defined procedures that allow a member of the committee to be replaced if the others judge him to be insane or deceptive. Given how poorly committee decision making works, I don't know if that presents more or less risk than simply having one leader and taking the risk of him going insane. The size of the window depends on whether there's a hard or soft takeoff, and I don't know which of those to expect.

B: ... "I'll make an FAI that cares about every human equally, no matter what they do." ... Would you help me build that?

A: Well, it fits with my intuitive notion of morality, but it's not clear what incentive I have to help.

At this stage, I think the dialog goes astray by missing the real practical and political reason for CEV. The correct question is "Would you actively oppose me?" The correct answer is, "Well, I don't see how I could reasonably expect anything much better than that, so ..., no, I suppose I won't actively oppose you." And the difficult problem is convincing a rather large fraction of mankind to give the correct answer.

The correct question is "Would you actively oppose me?" The correct answer is, "Well, I don't see how I could reasonably expect anything much better than that, so ..., no, I suppose I won't actively oppose you."

The rich and powerful won't care for CEV. It pays no attention to their wealth. They might as well have wasted their time accruing it.

Since the rich and powerful are high on the list for funding the R&D behind intelligent machines, they are likely to find a way to fund something that pays more attention to their preferences.

The "I don't see how I could reasonably expect anything much better" seems likely to be a failure of the imagination.

The rich and powerful won't care for CEV. It pays no attention to their wealth.

Not necessarily so. Quoting Eliezer: "A minor, muddled preference of 60% of humanity might be countered by a strong, unmuddled preference of 10% of humanity." So any good Marxist will be able to imagine the rich and powerful getting their way in the computation of CEV just as they get their way today: by inducing muddle in the masses.

The "I don't see how I could reasonably expect anything much better" seems likely to be a failure of the imagination.

And here I was considering it a victory of reason. :)

Quoting Eliezer: "A minor, muddled preference of 60% of humanity might be countered by a strong, unmuddled preference of 10% of humanity." So any good Marxist will be able to imagine the rich and powerful getting their way in the computation of CEV just as they get their way today: by inducing muddle in the masses.

There's little reason for them to bother with such nonsense - if they are building and paying for the thing in the first place.

CEV may be a utilitarian's wet dream - but it will most-likely look like a crapshoot to the millionaires who are actually likely to be building machine intelligence.

The "I don't see how I could reasonably expect anything much better" seems likely to be a failure of the imagination.

And here I was considering it a victory of reason. :)

It seemed as though you were failing to foresee opposition to CEV-like schemes. There are implementation problems too - but even without those, such scenarios do not seem very likely to happen.

Thanks, I agree. It's good to see that this multiplayer game notion of morality leads to a new insight that I didn't build into it.

As I write this, this topic has a score of -1. Did it get downvoted because it's perceived as an uninteresting topic, or because it appears to be poorly done? I'm not perceiving a lot of "poorly done" signals in the comments; the conversations seem to be interesting and constructive. If it's uninteresting, where should people go to talk about Friendly AI? The SL4 list is moribund.

People often profess beliefs to label themselves as members of a group. So far as I can tell, the belief that some things are moral and other things are not is one of those beliefs. I don't have any other explanation for why people talk so much about something that isn't subject to experimentation.

Well, of course it's subject to experimentation, or at least real-world testing: do other agents consider you sufficiently trustworthy to deal with? In the indefinitely-iterated prisoner's dilemma we call society, are you worth the effort of even trying to deal with?

(This is not directly relevant to your topic, but it jumped out at me.)

If you accept morality being a statement about strategies in the multiplayer game, then I agree. However, if you take the usual stand that "action X is moral" is a statement that is either true or false, no matter what anyone else thinks, then whether your moral system leads other people to trust you is irrelevant.

Here's an environment where it makes a practical difference. At one point my dad tried (and failed) to get me to have "racial consciousness". I didn't pay much attention, but I gather he meant that I should not be color-blind in my social interactions so racist white people would trust me. He's not stupid, so I assume there really were enough whites with that flavor of racism, somewhere, to form an in-group of meaningful size. Thus, if you accept that morality is about getting a specific in-group to trust you, racism is moral for the purposes of signalling membership in that specific in-group.

That conclusion just seems too repugnant to me. I'd rather stop using the word "moral" than use it with that definition. I won't argue definitions with people, but I do want to point out that that definition leads to some odd-looking patterns of words in true statements.

Er, your second paragraph appears to say "morality is part of signaling therefore signaling is part of morality therefore the repugnance of a given use of signaling disproves your thesis." (Please correct me if I've misparsed it.) I'm not sure any of those "therefores" work, particularly the first one (which is a simple "A in B therefore B in A" fallacy).

I've probably just failed to explain what I'm saying particularly well. I've been trying to sharpen the idea in discussions elsewhere and I'm discovering how LessWronged I've become, because I've had to go down two levels and I'm now explaining the very concept of cognitive biases and how provably gibberingly delusional humans are about themselves. I just had to supply a reference to show that people tend to pick up their beliefs from the people they associate with, and that priming exists ... I can see why EY wrote the sequences.

Er, your second paragraph appears to say "morality is part of signaling therefore signaling is part of morality therefore the repugnance of a given use of signaling disproves your thesis."

If the phrase "your thesis" refers to your claim:

Well, of course [statements about morality are] subject to experimentation, or at least real-world testing: do other agents consider you sufficiently trustworthy to deal with? In the indefinitely-iterated prisoner's dilemma we call society, are you worth the effort of even trying to deal with?

then we're failing to communicate. For me to agree or disagree with your statement requires me to guess what you mean by "morality", and for my best guess, I agree with your statement. I said what my best guess was, and pointed out that it leads to some odd-looking true statements, such as "racism is moral" with all those qualifications I said.

Of course, if "your thesis" meant something else, then we're failing to communicate in a less interesting way because I have no clue WTF you mean.

(Please correct me if I've misparsed it.)

I think you have. I was guessing that you're saying morality is a useful concept and you're defining it to be a type of signaling. If you meant something else please clarify. That's a fine definition, and we can use it if you want, but it leads to the odd conclusion that racism is moral if you're trying to signal to a group of racists. If you accept that conclusion, that's great, we have a definition of morality that has odd consequences but it has the advantage of being empirically testable.

If you don't like to admit the assertion "racism is moral" with the abovementioned qualifications, we need a different definition of morality. Ideally that different definition would still let us empirically test whether "X is moral". I don't know what that definition would be.

No, you've just explicitly clarified that you are in fact making an "A is a subset of B, therefore B is a subset of A" fallacy, with A=morality and B=signaling. Moralities being a subset of signaling (and I'm not saying it's a strict subset anyway, but a combination of practical game theory and signaling; I'd be unsurprised, of course, to find there was more) does not, in logic, imply that all signaling (e.g. racism, to use your example) is therefore a subset of morality. That's a simple logical fallacy, though the Latin name for it doesn't spring to mind. It's only not a fallacy if the two are identical or being asserted to be identical (or, for practical discussion, substantially identical), and I'm certainly not asserting that - there is plenty of signaling that has nothing to do with moralities.

Remember: if you find yourself making an assertion that someone else's statement that A is a subset of B therefore implies that B is a subset of A, you're doing it wrong, unless A is pretty much all of B (such that if you know something is in B, it's very likely to be in A). If you still think that in the case you're considering A⊂B => B⊂A, you should do the numbers.
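To make "do the numbers" concrete, here is a minimal sketch using Python sets. The set members are made-up placeholders rather than claims about what actually counts as morality or signaling, and the helper name `fraction_of_b_in_a` is hypothetical. It only illustrates that A being a subset of B says nothing about B being a subset of A unless A fills most of B.

```python
# Hypothetical, illustrative sets: every "morality" item is also a "signaling" item.
signaling = {"keeps promises", "condemns theft", "team jersey", "accent", "slang", "fashion"}
morality = {"keeps promises", "condemns theft"}


def fraction_of_b_in_a(a: set, b: set) -> float:
    """Fraction of B's members that are also in A: a rough stand-in for P(A | B)."""
    return len(a & b) / len(b)


a_subset_of_b = morality <= signaling   # A is a subset of B
b_subset_of_a = signaling <= morality   # the converse, which does not follow
share = fraction_of_b_in_a(morality, signaling)

print(a_subset_of_b, b_subset_of_a, share)
```

With these placeholder sets the converse only becomes plausible as `share` approaches 1, which is the "A is pretty much all of B" condition mentioned above.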

I proposed abandoning the word "morality" because it's too muddled. You want to use it. I have repeatedly tried to guess what you mean by it, and you've claimed I'm wrong every time. Please define what you mean by "morality" if you wish to continue.

I'm not sure I understand the hypothesis. Surely you are not suggesting that people signal their adherence to consequentialism rather than deontological versions of ethics as a way of convincing rational agents to trust them.

I think they signal deontological ethics (cached rules), whatever their internal moral engine actually uses. "I am predictable, you can trust me not to defect!" I suspect it ties into ingroup identification as well.

I need to write up my presently half-baked notions as a discussion post, probably after rereading the metaethics and ethical injunctions sequences in case it's already covered.

If, as you say:

the people who make the system decide what "Friendly" means

...I figure that makes the term of little practical use to anyone else.

Ah. I interpreted the challenge "Oh, so you get to say what 'Friendly' means" to mean that the speaker is objecting that the listener seems to feel that he's entitled to say what the AI is supposed to do. You seem to be asking a different question: what the term is intended to mean in general.

I had taken as given that a "Friendly" AI is, by definition, one that you trust to do what you want, even if it's much smarter and more powerful than you. If your desires are contradictory, it should still do something reasonable.

This leaves "reasonable" and "what you want" undefined. Filling in those definitions with something technically well-defined and getting to the point where you can reasonably expect it to be stable when it's much smarter than its designers is the crux of the problem.

Wikipedia's definition at http://en.wikipedia.org/wiki/Friendly_artificial_intelligence is: "A Friendly Artificial Intelligence or FAI is an artificial intelligence (AI) that has a positive rather than negative effect on humanity." Of course, this leaves "positive" undefined, which is not any better than leaving "what you want" or "reasonable" undefined.

At the end of the day, you don't need the concept. The question the creators of the AI will ask themselves will be "Do we want to run this AI?"

At the end of the day, you don't need the concept [of Friendly AI]. The question will be "Do we want to run this AI?"

I worry that the question will be "What do you mean we, Tonto?"

I agree that that's a valid worry, but I intended to look at things from the point of view of the people making the AI so I was expressing a different worry. I edited my post to clarify.

I agree, that's the "whom" in CEV.

I have an issue with CEV, which is that I don't think we should extrapolate. We should give the current crop of humans what they want, not what we imagine they might want if they were extrapolated into some finished version of themselves. In the example where Fred wants to kill Steve, CEV says the FAI shouldn't help because Fred aspires to give up hatred someday and the extrapolated Fred wouldn't want to kill Steve. On the contrary, I say the FAI shouldn't help because Steve wants to live more than Fred wants to kill him.

For example, in the original post, if our AI cares about an extrapolated version of speaker A, then it's possible that that extrapolated A will want different things from the actual present A, so the actual A would be wise to withhold any assistance until he clearly understood the extrapolation process.

On the contrary, I say the FAI shouldn't help because Steve wants to live more than Fred wants to kill him.

Doesn't this fall victim to utility monsters, though? If there's some actor who wants to kill you more than you want not to die, then the FAI would be obliged to kill you. That's a classic utility monster: an entity that wants harder than anyone else.

One solution is to renorm everyone's wants, such that the desires of any sentient being don't count any more than any other. But this leads directly to Parfit's Repugnant Conclusion¹, or at least to some approximation thereof: maximizing the number of sentient beings, even at the expense of their individual wants or pleasure.


¹ Parfit's repugnant conclusion, Level 7 Necromancer spell. Material components include the severed head of a utility monster, or a bag of holding filled with orgasmium, which is consumed during casting. Fills a volume of space with maximally dense, minimally happy (but barely nonsuicidal) sentient minds, encoded onto any available material substrate.
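The utility-monster worry and the renorming fix can be sketched in a few lines. This is only one possible reading, under loud assumptions: wants are non-negative numeric intensities, the FAI picks the option with the largest summed support, and "renorming" means scaling each person's wants so they sum to one unit. The function `decide`, the option names, and all the numbers are hypothetical.

```python
def decide(preferences, normalize=False):
    """Pick the option with the highest summed support.

    preferences: dict mapping person -> dict mapping option -> intensity (>= 0).
    With normalize=True, each person's intensities are rescaled to sum to 1,
    so nobody's wants count for more than one unit in total (an assumed reading
    of "renorming").
    """
    totals = {}
    for person, wants in preferences.items():
        scale = sum(wants.values()) if normalize else 1.0
        if scale == 0:
            continue  # a person with no wants contributes nothing
        for option, intensity in wants.items():
            totals[option] = totals.get(option, 0.0) + intensity / scale
    winner = max(totals, key=totals.get)
    return winner, totals


# Fred wants to kill Steve, but Steve wants to live more (made-up intensities).
fred_vs_steve = {"Steve": {"steve_lives": 10.0}, "Fred": {"steve_dies": 3.0}}

# A utility monster simply wants harder than anyone else.
monster_vs_steve = {
    "Steve": {"steve_lives": 10.0},
    "Monster": {"steve_dies": 1e9, "monster_gets_cake": 1.0},
}

print(decide(fred_vs_steve)[0])
print(decide(monster_vs_steve)[0])
print(decide(monster_vs_steve, normalize=True)[0])
```

Under these assumptions raw intensities protect Steve from Fred but not from the monster, while renorming caps the monster's total influence at one unit. It does not settle ties between two single-issue people, and it does nothing about the incentive to add more minds, which is the route to the Repugnant Conclusion described above.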

I agree that you have to renorm everyone's wants for this to work. I also agree that if you can construct broken minds for the purpose of manipulating the FAI, we need provisions to guard against that. My preferred alternative at the moment follows:

  • Before people become able to construct broken minds, the FAI cares about everything that's genetically human.

  • After we find the first genetically human mind deliberately broken for the purpose of manipulating the FAI, we guess when the FAI started to be influenced by that, and retroactive to just before that time we introduce a new policy: new individuals start out with a weight of 0, and can receive weight transferred from their parents, so the total weight is conserved. I don't want an economy of weight-transfer to arise, so it would be a one-way irreversible transfer.

This might lead to a few people running around with a weight of 0 because their parents never made the transfer. This would be suboptimal, but it would not have horrible conclusions because the AI would care for the parents who probably care for the new child, so the AI would in effect care some for the new child.

Death of the parents doesn't break this. Caring about the preference of dead people is not a special case.
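The weight-transfer policy above can be sketched as a small ledger. This is a rough illustration under stated assumptions, not a specification: the class `WeightLedger`, its method names, and the example people are all hypothetical, and the sketch ignores the retroactive cutoff and how weights feed into the FAI's decisions. It only shows the three properties named above: new individuals start at weight 0, weight flows one way from parent to child, and the total is conserved.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    weight: float = 0.0  # new individuals start at 0 under the post-cutoff policy


class WeightLedger:
    """Hypothetical record of how much the FAI cares about each person."""

    def __init__(self, founders):
        # founders: dict of name -> weight for people who existed before the cutoff
        self.people = {name: Person(name, w) for name, w in founders.items()}
        self.children_of = {}  # parent name -> set of child names

    def add_child(self, child, parents):
        # A new individual is recorded with weight 0 and linked to its parents.
        self.people[child] = Person(child, 0.0)
        for parent in parents:
            self.children_of.setdefault(parent, set()).add(child)

    def transfer(self, parent, child, amount):
        # One-way only: the recipient must be a recorded child of the giver,
        # so a child can never send weight back up to a parent.
        if child not in self.children_of.get(parent, set()):
            raise ValueError("weight may only flow from a parent to their child")
        if amount <= 0 or amount > self.people[parent].weight:
            raise ValueError("amount must be positive and no more than the parent's weight")
        self.people[parent].weight -= amount
        self.people[child].weight += amount

    def total(self):
        return sum(p.weight for p in self.people.values())


ledger = WeightLedger({"alice": 1.0, "bob": 1.0})
ledger.add_child("carol", ["alice", "bob"])
ledger.transfer("alice", "carol", 0.25)
print(ledger.total(), ledger.people["carol"].weight)
```

Because a dead parent's entry simply stays in the ledger, caring about the preferences of dead people needs no special case here either. A child whose parents never transfer anything keeps weight 0, which is the suboptimal case described above.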

I encourage people to reply to this post with bugs in this alternative or with other plausible alternatives.

I agree completely that the extrapolation process as envisioned in CEV leads to the system doing all manner of things that the original people would reject.

It is also true that maturation often leads to adult humans doing all manner of things that their immature selves would reject. And, sure, it's possible to adopt the "Peter Pan" stance of "I don't wanna grow up!" in response to that, though it's hard to maintain in the face of social expectations and biological imperatives.

It is not, however, clear to me that a seven-year-old would be wise to reject puberty, or that we would, if offered a pill that ensured that we would never come to prefer anything different from what we prefer right now, be wise to collectively take it.

That extrapolation leads to something different isn't clearly a reason to reject it.

It is not, however, clear to me that a seven-year-old would be wise to reject puberty

The difference between a seven year old and an adult is a transition into a well understood state that many have been in. The most powerful people in society are already in that state. In contrast, the transition of the AI's extrapolation is going into a completely new state that nobody knows anything about, except possibly the AI. The analogy isn't valid.

That extrapolation leads to something different isn't clearly a reason to reject it.

It's a wildcard with unknown and unknowable consequences. That's not a good thing to have in a Friendly AI. The burden of proof should be on the people who want to include it. As I mentioned earlier, it's not the best solution to the Fred-wants-to-murder-Steve problem, since it's more reliable to look at Steve's present desire to live than to hope that extrapolated-Fred doesn't want to murder. So it isn't needed to solve that problem. What problem does it solve?

  • I agree with you that puberty is a familiar, well-understood transition, whereas extrapolation is not. It's not clear to me that reasoning from familiar cases to novel ones by analogy is invalid, but I certainly agree that reasoning by analogy doesn't prove much of anything, and you're entirely justified in being unconvinced by it.

  • I agree with you that anyone who wants to flip the switch on what I consider a FOOMing CEV-implementing AI has an enormous burden of proof to shoulder before they get my permission to do so. (Not that I expect they will care very much.)

  • I agree with you that if we simply want to prevent the AI from killing people, we can cause it to implement people's desire to live; we don't need to extrapolate Fred's presumed eventual lack of desire-to-kill to achieve that.

  • My one-sentence summary of the problem CEV is intended to solve (I do not assert that it does so) is "how do we define the target condition for a superhuman environment-optimizing system in such a way that we can be confident that it won't do the wrong thing?"

  • That is expanded on at great length in the Metaethics and Fun Theory sequences, if you're interested. Those aren't the clearest conceivable presentation, but I doubt I will do better in a comment and am not highly motivated to try.

My one-sentence summary of the problem CEV is intended to solve (I do not assert that it does so) is "how do we define the target condition for a superhuman environment-optimizing system in such a way that we can be confident that it won't do the wrong thing?"

My question was meant to be "What problem does extrapolation solve?", not "What problem is CEV intended to solve?" To answer the former question, you'd need some example that can be solved with extrapolation that can't easily be solved without it. I can't presently see a reason the example should be much more complicated than the Fred-wants-to-kill-Steve example we were talking about earlier.

That is expanded on at great length in the Metaethics and Fun Theory sequences, if you're interested.

I might read that eventually, but not for the purpose of getting an answer to this question. I have no reason to believe the problem solved by extrapolation is so complex that one needs to read a long exposition to understand the problem. Understanding why extrapolation solves the problem might take some work, but understanding what the problem is should not. If there's no short description of a problem that requires extrapolation to solve it, it seems likely to me that extrapolation does not solve a problem.

For example, integral calculus is required to solve the problem "What is the area under this parabola?", given enough parameters to uniquely determine the parabola. Are you seriously saying that extrapolation is necessary but its role is more obscure than that of integral calculus?

Are you seriously saying that extrapolation is necessary but its role is more obscure than that of integral calculus?

What I said was that the putative role of extrapolation is avoiding optimizing for the wrong thing.

That's not noticeably more complicated a sentence than "the purpose of calculus is to calculate the area under a parabola", so I mostly think your question is rhetorically misleading.

Anyway, as I explicitly said, I'm not asserting that extrapolation solves any problem at all. I was answering (EDIT: what I understood to be) your question about what problem it's meant to solve, and providing some links to further reading if you're interested, which it sounds like you aren't, which is fine.

Ah, I see. I was hoping to find an example, about as concrete as the Fred-wants-to-kill-Steve example, that someone believes actually motivates extrapolation. A use-case, as it were.

You gave the general idea behind it. In retrospect, that was a reasonable interpretation of my question.

I'm not asserting that extrapolation solves any problem at all.

Okay, so you don't have a use case. No problem, I don't either. Does anybody else?

I realize you haven't been online for a few months, but yes, I do.

Humanity's desires are not currently consistent. An FAI couldn't satisfy them all because some of them contradict each other, like Fred's and Steve's in your example. There may not even be a way of averaging them out fairly or meaningfully. Either Steve lives or dies: there's no average or middle ground and Fred is just out of luck.

However, it might be the case that human beings are similar enough that if you extrapolate everything all humans want, you get something consistent. Extrapolation is a tool to resolve inconsistencies and please both Fred and Steve.

I have an issue with CEV, which is that I don't think we should extrapolate.

Amen brother!

It would be good if Eliezer (or someone who understands his thinking) could explain just why it is so important to extrapolate - rather than, for example, using current volition of mankind (construed as a potentially moving target). I worry that extrapolation is proposed simply because Eliezer doesn't much care for the current volition of mankind and hopes that the extrapolated volition will be more to his taste. Of course, another explanation is that it has something to do with the distaste for discounting.

This is the "Knew more ... Thought faster ... Were more the people we wished we were ..." section of CEV.

Yes, that 'poetry' explains what extrapolation is, but not why we need to risk it. To my mind, this is the most dangerous aspect of the whole FAI enterprise. Yet we don't have anything approaching an analysis of a requirements document - instead we get a poetic description of what Eliezer wants, a clarification of what the poetry means, but no explanation of why we should want that. It is presumed to be obvious that extrapolating can only improve things. Well, let's look more closely.

... if we knew more, ...

An AI is going to tell us what we would want, if only we knew more. Apparently, there is an assumption here that the AI knows things we don't. Personally, I worry a bit that an AI will come to believe things that are not true. In fact, I worry about it most when the AI claims to know something that mankind does not know - something dealing with human values. Why do I worry about that? Something someone wrote somewhere presumably. But maybe that is not the kind of superior AI 'knowledge' that Eliezer is talking about here.

Knew more: Fred may believe that box A contains the diamond, and say, "I want box A." Actually, box B contains the diamond, and if Fred knew this fact, he would predictably say, "I want box B."

And instead of extrapolating, why not just inform Fred where the diamond is? At this point, the explanation becomes bizarre.

If Fred would adamantly refuse to even consider the possibility that box B contains a diamond, while also adamantly refusing to discuss what should happen in the event that he is wrong in this sort of case, and yet Fred would still be indignant and bewildered on finding that box A is empty, Fred's volition on this problem is muddled.

Am I alone in preferring, in this situation, that the AI not diagnose a 'muddle', and instead give Fred box A after offering him the relevant knowledge?

Thought faster: Suppose that your current self wants to use an elaborate system of ropes and sticks to obtain a tasty banana, but if you spent an extra week thinking about the problem, you would predictably see, and prefer, a simple and elegant way to get the banana using only three ropes and a teddy bear.

Again, if the faster thinking allows the AI to serve as an oracle, making suggestions that even our limited minds can appreciate once we hear them, then why should we take the risk of promoting the AI from oracle to king? The AI should tell us things rather than speaking for us.

Were more the people we wished we were: Any given human is inconsistent under reflection. We all have parts of ourselves that we would change if we had the choice, whether minor or major.

When we have a contradiction between a moral intuition and a maxim codifying our system of moral standards, there are two ways we can go - we can revise the intuition or we can revise the maxim. It makes me nervous having an AI make the decisions leading to 'reflective equilibrium' rather than making those decisions myself. Instead of an extrapolation, I would prefer a dialog that leads me to my own choice of equilibrium, not a machine that picks one for me. Again, my slogan is "Speak to us, don't speak for us."

Where our wishes cohere rather than interfere

I'm not sure what to make of this one. Is there a claim here that extrapolation automatically leads to coherence? If so, could we have an argument justifying that claim? Or, is the point that the extrapolation specification has enough 'free play' to allow the AI to guide the extrapolation to coherence? Coherence is certainly an important issue. A desideratum? Certainly. A requirement? Maybe. But there are other ways of achieving accommodation without trying to create an unnatural coherence in our diverse species.

These are topics that really need to be discussed in a format other than poetry.

An AI is going to tell us what we would want, if only we knew more. Apparently, there is an assumption here that the AI knows things we don't. Personally, I worry a bit that an AI will come to believe things that are not true. In fact, I worry about it most when the AI claims to know something that mankind does not know - something dealing with human values. Why do I worry about that? Something someone wrote somewhere presumably. But maybe that is not the kind of superior AI 'knowledge' that Eliezer is talking about here.

Rebuttal: Most people in the world believe in a religion that is wrong. (This conclusion holds regardless of which, if any, world religion happens to be true.) Would we want an AI that enforces the laws of a false religion because people want the laws of their religion enforced? (Assume that people would agree that the AI shouldn't enforce the laws of false religions.)

If Fred would adamantly refuse to even consider the possibility that box B contains a diamond, while also adamantly refusing to discuss what should happen in the event that he is wrong in this sort of case, and yet Fred would still be indignant and bewildered on finding that box A is empty, Fred's volition on this problem is muddled.

Am I alone in preferring, in this situation, that the AI not diagnose a 'muddle', and instead give Fred box A after offering him the relevant knowledge?

What if box A actually contains a bomb that explodes when Fred opens it? Should the AI still give Fred the box?

Is there a claim here that extrapolation automatically leads to coherence? If so, could we have an argument justifying that claim?

As I understand CEV, the hope is that it will, and if it doesn't, CEV is said to fail. Humanity may not have a CEV.

Here's one attempt at a definition:

"Friendly AI"…: the challenge of creating an AI that, e.g., cures cancer, rather than wiping out humanity.

Presumably "curing cancer" would usually be relatively simple if you had a superintelligence and some humans - so the issue seems mostly down to the "wiping out humanity" business.

A practical superintelligence with so little instrumental concern about its origins and history that it fails to preserve some humans seems pretty unlikely to me.

The idea is that the superintelligence will preserve humans as part of a general instrumental strategy of preserving the past - provided its goal is a long-term, open-ended one.

Why would such a superintelligence care about preserving its origins? Simply because that is a critically-important clue to the form of any aliens it might meet in the alien race - and it needs to be as prepared for that as possible. Not only is it likely to preserve humans, it will probably simulate and experiment with many close variants of us.

One possible screw-up is excessive discounting - but that seems fairly simple to avoid - we already know not to build a myopic machine.

So, going by this definition of the term, friendliness seems practically automatic - and we should probably be setting our sights on something higher.