Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The mathematical result is clear: you cannot deduce human preferences merely by observing human behaviour (even with simplicity priors).
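To make the result concrete, here is a toy sketch (an invented example, not the formal proof): a single observed behaviour is reproduced exactly by two opposite (planner, reward) decompositions, so behaviour alone cannot say which reward the agent "really" has.

```python
# Toy illustration of non-identifiability: the same observed policy is
# consistent with opposite preferences, depending on the assumed planner.
# (Invented example; the formal result is in the linked post.)

actions = ["left", "right"]
observed_policy = "right"  # all we ever get to see

# Candidate explanation 1: reward +1 for "right", rational planner (maximises)
reward_a = {"left": 0, "right": 1}
planner_a = lambda r: max(actions, key=lambda a: r[a])

# Candidate explanation 2: reward -1 for "right", anti-rational planner (minimises)
reward_b = {"left": 0, "right": -1}
planner_b = lambda r: min(actions, key=lambda a: r[a])

# Both (planner, reward) pairs reproduce the observed behaviour exactly,
# so observation alone cannot tell us which reward is the "real" one.
assert planner_a(reward_a) == observed_policy
assert planner_b(reward_b) == observed_policy
```

A simplicity prior doesn't break the tie either, since the two decompositions are of essentially equal complexity.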

Yet many people instinctively reject this result; even I found it initially counter-intuitive. And you can make a very strong argument that it's wrong. It would go something like this:

"I, a human H, can estimate what another human A wants, just by observing their behaviour. And these estimates have evidence behind them: A will often agree that I've got their values right, and I can use this estimate to predict A's behaviour. Therefore, it seems I've done the impossible: gone from behaviour to preferences."

Evolution and empathy modules

This is how I interpret what's going on here. Humans (roughly) have empathy modules, call them E, which allow them to estimate the preferences of other humans, and prediction modules, call them P, which use the output of E to predict their behaviour. Since evolution is colossally lazy, these modules don't vary much from person to person.

So, for a history h_A of human A's behaviour in typical circumstances, the empathy modules E_1 and E_2 of two humans H_1 and H_2 will give similar answers:

  • E_1(h_A) ≈ E_2(h_A).

Moreover, when humans turn their modules on their own behaviour, they get similar results. A human has privileged access to their own deliberations; so define ĥ_A as the internal history of A. Thus:

  • E_A(ĥ_A) ≈ E_1(h_A).

This idea connects with partial preferences/partial models in the following way: a human's internal history gives them access to their own internal models and preferences; so the approximate equalities above mean that, by observing the behaviour of other humans, we have approximate access to their internal models.

The prediction module P then just takes the output of the empathy module E to predict future behaviour; since E and P have co-evolved, it's no surprise that P has a good predictive record.

So, given the empathy module E, it is true that a human can estimate the preferences of another human, and, given the prediction module P, it is true that they can use this knowledge to predict behaviour.

The problems

So, what are the problems here? There are three:

  1. The modules E and P only function well in typical situations. If we allow humans to self-modify arbitrarily or create strange new beings (such as AIs themselves, or merged human-AIs), then our empathy and predictions will start to fail[1].
  2. The argument needs E and P to be given; but defining these for AIs is very tricky. Time and time again, we've found that tasks that are easy for humans to do are not easy for humans to program into AIs.
  3. The empathy and prediction modules are similar, but not identical, from person to person and culture to culture[2].

So both are correct: my result (without assumptions, you cannot go from human behaviour to preferences) and the critique (given these assumptions that humans share, you can go from human behaviour to preferences).

And when it comes to humans predicting humans, the critique is more valid: listening to your heart/gut is a good way to go. But when it comes to programming potentially powerful AIs that could completely transform the human world in strange and unpredictable ways, my negative result is more relevant than the critique is.

A note on assumptions

I've had some disagreements with people that boil down to me saying "without assuming A, you cannot deduce B", and them responding "since A is obviously true, B is true". I then go on to say that I am going to assume A (or define A to be true, or whatever).

At that point, we don't actually have a disagreement. We're saying the same thing (accept A, and thus accept B), with a slight difference of emphasis - I'm more of a "moral anti-realist" (we choose to accept A, because it agrees with our intuition), while they are more of a "moral realist" (A is true, because it agrees with our intuition). It's not particularly productive to dig further.

In practice: debugging and injecting moral preferences

There are some interesting practical consequences to this analysis. Suppose, for example, that someone is programming a clickbait detector. They then gather a whole collection of clickbait examples, train a neural net on them, and fiddle with the hyperparameters till the classification looks decent.

But neither "gathering a whole collection of clickbait examples" nor "the classification looks decent" is a fact about the universe: they are judgements of the programmers. The programmers are using their own empathy and prediction modules to establish that certain articles are a) likely to be clicked on, but b) not what the clicker would really want to read. So the whole process is entirely dependent on programmer judgement - it might feel like "debugging", or "making reasonable modelling choices", but it's actually injecting the programmers' judgements into the system.
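As a sketch of how that judgement injection looks in practice (all headlines, cue words, and thresholds below are invented for illustration, not a real detector):

```python
# Minimal sketch of the clickbait-detector workflow: every step marked
# "judgement" is a programmer choice, not a fact about the universe.

# Judgement 1: gathering examples means deciding what counts as clickbait.
labelled = [
    ("You won't BELIEVE what happened next", True),
    ("10 secrets doctors don't want you to know", True),
    ("Central bank raises interest rates by 0.25%", False),
    ("City council approves new transit budget", False),
]

# Judgement 2: the features encode the programmers' intuitions about what
# the clicker "would not really want to read".
CLICKBAIT_CUES = ("you won't believe", "secrets", "tricks", "shocking")

def looks_clickbait(headline: str) -> bool:
    h = headline.lower()
    return any(cue in h for cue in CLICKBAIT_CUES)

# Judgement 3: when "the classification looks decent" - a threshold
# chosen by eye, not derived from anything.
correct = sum(looks_clickbait(h) == y for h, y in labelled)
accuracy = correct / len(labelled)
assert accuracy >= 0.75  # what counts as "decent" is itself a judgement
```

Swapping the rule-based cues for a trained neural net changes the mechanics but not the epistemics: the labels and the acceptance criterion remain programmer judgements.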

And that's fine! We've seen that different people have similar judgements. But there are two caveats: first, not everyone will agree, because there is not perfect agreement between the empathy modules. The programmers should be careful as to whether this is an area of very divergent judgements or not.

And second, these results will likely not generalise well to new distributions. That's because having implicit access to categorisation modules that themselves are valid only in typical situations... is not a way to generalise well. At all.

Hence we should expect poor generalisation from such methods, to other situations and (sometimes) to other humans. In my opinion, if programmers are more aware of these issues, they will have better generalisation performance.
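A minimal illustration of that failure mode (invented headlines and cues): a cue that tracks the programmers' judgement on typical data fires wrongly on a sober headline from a different distribution.

```python
# Sketch: a rule that matches the programmers' intuitions on typical
# headlines can fail badly off-distribution. (All examples invented.)

def cue_rule(headline: str) -> bool:
    # Heuristic tuned on the "typical" headlines the programmers had in mind.
    return any(cue in headline.lower() for cue in ("secrets", "shocking"))

# In-distribution: the rule agrees with the intended judgement.
assert cue_rule("10 shocking secrets about celebrities") is True
assert cue_rule("Council approves new transit budget") is False

# Off-distribution: a sober headline trips the same cue - a false positive
# the original judgement would never have made.
assert cue_rule("Official Secrets Act amendment passes") is True
```

The rule was never a model of "what the clicker really wants"; it was a proxy for the programmers' judgements, valid only where those judgements were sampled.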

  1. I'd consider the Star Trek universe to be much more typical than, say, 7th-century China. The Star Trek universe is filled with beings that are slight variants or exaggerations of modern humans, while people in 7th-century China had very alien ways of thinking about society, hierarchy, good behaviour, and so on. But even that is still very typical compared with the truly alien beings that could exist in the space of all possible minds. ↩︎

  2. For instance, Americans will typically explain a certain behaviour by intrinsic features of the actor, while Indians will give more credit to the circumstance (Miller, Joan G. "Culture and the development of everyday social explanation." Journal of personality and social psychology 46.5 (1984): 961). ↩︎


11 comments

I would add that people overestimate their ability to guess others' preferences. "He just wants money" or "She just wants to marry him". Such oversimplified models might be not just useful simplifications but blatantly wrong.

I agree we're not as good as we think we are. But there are a lot of things we do agree on that seem trivial: e.g. "this person is red in the face, shouting at me, and punching me; I deduce that they are angry and wish to do me harm". We have far, far more agreement than random agents would.

I agree, and I think much of the difficulty people have in accepting the result comes from not seeing the implicitly assumed norms we are always applying to understand things. I think this runs much deeper than saying humans have something like an empathy module, though. It is a general problem of humans not seeing reality clearly: we think we see it, when in fact what we see (and especially what we interpret and infer) is tainted by prior evidence, in ways that make everything humans do conditional on those priors, such that no seeing is truly free and independent of the conditions in which it arises.

That's fairly abstract, so another way to put it is that we're constantly seeing the world on the assumption that we already know what the world looks like. We can learn to assume less, but the natural, adaptive state is to assume a lot, because it has led to greater reproductive fitness, probably specifically because it made hard-to-make inferences possible only by making strong assumptions about the world that were often true (or true enough for our ancestors' purposes).

(I think this ties into the story I've been telling for a long time about developmental psychology, and the newer story I've been telling about human brains doing minimization of prediction error with additional homeostatic set points, but I also think it stands on its own without that, so I write here without reference to them other than this comment.)

The problem with the maths is that it does not correlate 'values' with any real-world observable. You give all objects a property, and you say that that property is distributed by simplicity priors. But you have not yet specified how these 'values' relate to any real-world phenomenon in any way. Under this model, you could never see any evidence that humans don't 'value' maximizing paperclips.

To solve this, we need to understand what values are. The values of a human are much like the filenames on a hard disk. If you run a quantum field theory simulation, you don't have to think about either; you can make your predictions directly. If you want to make approximate predictions about how a human will behave, you can think in terms of values and get somewhat useful predictions. If you want to predict approximately how a computer system will behave, instead of simulating every transistor, you can think in terms of folders and files.

I can substitute words in the 'proof' that humans don't have values and get a proof that computers don't have files. It works the same way: you turn your uncertainty about the relation between the exact and the approximate into a confidence that the two are uncorrelated. Making a somewhat naive and not formally specified assumption along the lines of "the real action taken optimizes human values better than most possible actions" will get you a meaningful but not perfect definition of 'values'. You still need to say exactly what a "possible action" is.

Making a somewhat naive and not formally specified assumption along the lines of, "the files are what you see when you click on the file viewer" will get you a meaningful but not perfect definition of 'files'. You still need to say exactly what a "click" is. And how you translate a pattern of photons into a 'file'.

We see that if you were running a quantum simulation of the universe, then getting values out of a virtual human is the same type of problem as getting files off a virtual computer.

I like this analogy. Probably not best to put too much weight on it, but it has some insights.

I wonder if part of the messiness might stem from confusing various domains and ranges. For example, for humans, we have a complex of wants -- some driven very much by physiological factors, some by cultural factors, and some by individual factors (including things like what I did yesterday or five hours ago). We might call these our preference domain.

Then we need some function mapping the preferences into the range of behaviors that are observable -- assuming that there is something approximating a function here (caveat: not a math guy, so maybe that term is misused/loaded). From that we have some hope of tracing the behavior back to the preference.

However, we should not consider the above three sources as coming from the same domain, or as mapping to the same range. Confusion may come both from the fuzziness of the "correct" function (I'm implicitly agreeing with the general proposition that we cannot infer preferences from behavior all that well) and from associating a behavior with the wrong one of the three ranges, and then attempting to deduce the preference.

If I see A doing x, ascribe x to the physiological range, and then attempt to deduce the preference (in the physiological domain) when x is actually in the individual range for A, I will probably see a lot of errors. But maybe not 100% error.

I do think there is something to the we're all human so can recognize a lot of meaning in action from others -- but things like culture (as mentioned) does influence performance here. So, what is an acceptable accuracy rate? Is the goal mathematical certainty or something else?

Your title seems clickbaity, since its question is answered "no" in the post, and the article would have been more surprising had you answered "yes". (And my expectation was that if you ask that question in the title, you no longer know the answer.)

having implicit access to categorisation modules that themselves are valid only in typical situations... is not a way to generalise well

How do you know this? Should we turn this into one of those concrete ML experiments?

PS: the other title I considered was "Why do people feel my result is wrong", which felt too condescending.

Your title seems clickbaity

Hehe - I don't normally do this, but I feel I can indulge once ^_^

having implicit access to categorisation modules that themselves are valid only in typical situations... is not a way to generalise well

How do you know this?

Moravec's paradox again. Chessmasters didn't easily program chess programs; and those chess programs didn't generalise to games in general.

Should we turn this into one of those concrete ML experiments?

That would be good. I'm aiming to have a lot more practical experiments from my research project, and this could be one of them.

Chessmasters didn't easily program chess programs; and those chess programs didn't generalise to games in general.

I'd say a more relevant analogy is whether some ML algorithm could learn to play Go teaching games against a master, from examples of a master playing teaching games against a student, without knowing what Go is.

And whether those programs could then perform well if their opponent forces them into a very unusual situation, such as would not have ever appeared in a chessmaster game.

If I sacrifice a knight for no advantage whatsoever, will the opponent be able to deal with that? What if I set up a trap to capture a piece, relying on my opponent not seeing the trap? A chessmaster playing another chessmaster would never play a simple trap, as it would never succeed; so would the ML be able to deal with it?
