by [anonymous]

An agent is composed of two components: a predictive model of the world, and a utility function for evaluating world states.

I would say that the 'intelligence' of an agent corresponds to the sophistication and accuracy of its world model. The 'friendliness' of an agent depends on how closely its utility function aligns with human values.
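To make this decomposition concrete, here is a minimal sketch; the names (Agent, world_model, utility, choose_action) are my own illustrations rather than anything from the post, and a real agent would of course need much more (search, pruning, learning):

```python
# Minimal sketch of the two-component picture above: an agent is a
# predictive world model plus a utility function over world states.
from typing import Any, Callable, Iterable

class Agent:
    def __init__(self,
                 world_model: Callable[[Any, Any], Any],
                 utility: Callable[[Any], float]):
        self.world_model = world_model  # 'intelligence': predictive accuracy
        self.utility = utility          # 'friendliness': alignment with human values

    def choose_action(self, state: Any, actions: Iterable[Any]) -> Any:
        # Predict the state each action leads to, score it, pick the best.
        return max(actions,
                   key=lambda a: self.utility(self.world_model(state, a)))
```

On this picture, a paperclip maximizer and a friendly AI could share the same world_model and differ only in utility.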

What is baffling to me is the vague idea that developing the theory of friendliness has any significant synergy with developing the theory of intelligence. (Such an idea has surfaced in discussions of the plausibility of SIAI developing friendly AI before unfriendly AI is developed by others.)

One argument for such a synergy (and the only one I can think of) is that:

a) A perfect world model is not possible: one must make the best tradeoff given limited resources

b) The more relevant a certain feature of the world is to the utility function, the more accurately we want to model it in the world model
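As a toy illustration of (a) and (b), imagine a fixed modeling budget split across features of the world in proportion to how sensitive the utility function is to each. The feature names and numbers below are invented purely for illustration:

```python
# Toy sketch: spend more modeling effort on features the utility
# function cares more about, given a limited total budget.
def allocate_modeling_budget(utility_sensitivity: dict, budget: float) -> dict:
    total = sum(utility_sensitivity.values())
    return {feature: budget * weight / total
            for feature, weight in utility_sensitivity.items()}

# Both hypothetical agents end up spending most of their budget on
# modeling humans, who dominate the environment either one acts in.
paperclipper = allocate_modeling_budget(
    {"human behavior": 0.7, "steel supply chains": 0.3}, budget=100.0)
friendly_ai = allocate_modeling_budget(
    {"human behavior": 0.6, "human values in detail": 0.3, "economy": 0.1},
    budget=100.0)
```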

However, for the most part, the world model would differ very little between a paperclip maximizer and a friendly AI. While the friendly AI certainly has to keep track of more things that are irrelevant to the paperclip maximizer, both AIs would need world models capable of modeling human behavior in order to be effective, and one would expect that to account for the bulk of the world model's complexity in the first place.

13 comments

An agent is composed of two components: a predictive model of the world, and a utility function for evaluating world states.

...and a tree-pruner - at the very least!

However, for the most part, the world model would differ very little between a paperclip maximizer and a friendly AI. While the friendly AI certainly has to keep track of more things that are irrelevant to the paperclip maximizer, both AIs would need world models capable of modeling human behavior in order to be effective, and one would expect that to account for the bulk of the world model's complexity in the first place.

This is, as I understand it, the crux of the argument. Perhaps it takes an AI of complexity 10 to model the world well enough to interact with it and pursue simple values, but an AI of complexity 11 to model the world well enough to understand and preserve human values. If fooming is possible, that means any AIs of complexity 10 will take over the world and not preserve human values, and the only way to get a friendly AI is for no one to make an AI of complexity 10 and the first AI to be complexity 11 (and human-friendly).

What is baffling to me is the vague idea that developing the theory of friendliness has any significant synergy with developing the theory of intelligence.

It is not clear what discussions you are referring to - but there is a kind of economic synergy the other way around - machine intelligence builders will need to give humans what they want initially, or their products won't sell.

So, for example, there are economic incentives for automobile makers to figure out if humans prefer to be puked through windscreens or have airbags exploded in their faces.

To give a more machine-intelligence-oriented example, Android face recognition faces some privacy-related issues before it can be marketed - because it goes close to the "creepy line". Without a keen appreciation of the values-related issues, it can't be deployed or marketed.

It is not clear what discussions you are referring to

Yeah, though it seems I've seen statements by some people here that "Friendliness IS the AI"; I couldn't take them at face value, due to the same obvious question the OP raises.

"What is baffling to me is the vague idea that developing the theory of friendliness has any significant synergy with developing the theory of intelligence. "

Makes perfect sense to me. 'Friendliness' requires being able to specify a very precise utility function to an optimiser. 99% of the job of figuring that out is learning how to specify any utility function precisely. That's a job that will have to be done by any AI developer.

Most of the arguments I've seen that SIAI will be effective at developing AI focus more on identifying common failings of other AI research (and a bit on "look at me I'm really smart" :P ). Maybe an argument like "if you haven't figured out that friendliness is important you probably haven't put in enough thought to make a self-improving AI" inspired this post, rather than the idea that figuring out friendliness would have some causal benefit?

Tried to edit the article for clarity: last paragraph should be "However, for the large part, the world model would differ, in totality, very little between a paperclip maximizer and a friendly AI. It is true the Friendly AI certainly has to keep track of more things which are irrelevant to the paperclip maximizer. But this extra complexity is nothing compared to the fact that for either AI to be effective, their world models would have to be able to model human behavior on a global scale."

Tried to edit the article for clarity: last paragraph should be [...]

Given that the current state of the last paragraph of the post doesn't match what you write in this comment, do you mean that you couldn't actually edit it for some reason? What's the purpose of the comment?

Yes: this site is quite buggy for me for some reason.

The obvious answer I can think of is that having a utility function that closely corresponds to a human's values is going to help an AI predict humans. This is perhaps analogous to mirror neurons in humans.

Probably not so much. You have to figure out what agents in your environment want if you are going to try to understand them and deal with them - but you don't really have to want the same things as them to be able to do that.

It's of course not necessary. But we humans model other humans by putting ourselves in someone else's shoes and asking what we would do in that situation. I don't necessarily agree with the argument that it is necessary for an AI to have the same utility function as a human in order to predict humans. But if you did write an AI with an identical utility function, that would give it an easy way to make some predictions about humans (although you'd have problems with things like biases that prevent humans from achieving their goals, etc.).

Some truth - but when you put yourself in someone else's shoes, "goal substitution" often takes place, to take account of the fact that they want different things from you.

Machines may use the same trick, but again, they seem likely to be able to imagine quite a range of different intentional agents with different goals.

The good news is that they will probably at least try and understand and represent human goals.
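A rough sketch of the shoes-swapping-with-goal-substitution idea discussed above (the function and its names are hypothetical, purely for illustration):

```python
# Predict another agent by reusing your own predictive machinery,
# but with that agent's (estimated) utility function swapped in.
from typing import Any, Callable, Iterable

def predict_action(world_model: Callable[[Any, Any], Any],
                   their_utility: Callable[[Any], float],
                   state: Any,
                   actions: Iterable[Any]) -> Any:
    # Same planning loop the predictor would use for itself,
    # evaluated with the other agent's goals (goal substitution).
    return max(actions,
               key=lambda a: their_utility(world_model(state, a)))
```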