Humans interpreting humans

13th Feb 2019

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In a previous post, I showed how, given certain normative assumptions, one could distinguish agents for whom anchoring was a bias, from those for which it was a preference.

But the agent for which anchoring is a preference looks clearly ridiculous - how could anchoring be a preference? It makes no sense. And I agree with that assessment! That agent's preferences make no sense - if we think of it as a human.

Humans model each other in very similar ways

This is another way in which I think we can extract human preferences: using the fact that human models of each other, and self-models, are all incredibly similar. Consider the following astounding statements:

• If somebody turns red, shouts at you, then punches you in the face, they are probably angry at you.
• If somebody is drunk, they are less rational at implementing long-term plans.
• If somebody close to you tells you an intimate secret, then they probably trust you.

Most people will agree with all those statements, to a large extent - including the "somebody" being talked about. But what is going on here? Have I not shown that you can't deduce preferences or rationality from behaviour? It's not like we've put the "somebody" in an fMRI scanner to construct their internal model, so how do we know?

The thing is that natural selection is lazy: a) different humans use the same type of cognitive machinery to assess each other, and b) individual humans tend to use their own self-assessment machinery to assess other humans. Consequently, there tends to be large agreement between our own internal self-assessment models, our models of other people, other people's models of other people, and other people's self-assessment models of themselves.

This agreement is not perfect, by any means - I've mentioned that it varies from culture to culture, individual to individual, and even within the same individual. But even so, we can add the normative assumption:

• If H is a human and G another human, then G's models of H's preferences and rationality are informative of H's preferences and rationality.

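That explains why I said that the agent for whom anchoring was a bias was a human, while the agent for which it was a preference was not: my model of what a human would prefer in those circumstances was correct for the former but not for the latter.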
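One way to picture what "informative" means in the normative assumption above is as a toy Bayesian update, in which one human's judgement of another counts as evidence between hypotheses that behaviour alone cannot separate. This is only a sketch: the function name, the hypothesis labels and the 90% reliability figure are all hypothetical, chosen for illustration rather than taken from the post.

```python
# Toy sketch: treat one human's model of another as evidence.
# Every name and number below is a hypothetical illustration.

def update_on_human_model(prior, likelihood, judgement):
    """Bayes update over hypotheses about an agent, given a human's judgement.

    prior:      dict mapping hypothesis -> P(hypothesis)
    likelihood: dict mapping hypothesis -> {judgement: P(judgement | hypothesis)}
    judgement:  the judgement the observing human actually made
    """
    unnormalised = {h: prior[h] * likelihood[h][judgement] for h in prior}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}


# Two hypotheses that fit the observed behaviour equally well.
prior = {"anchoring_is_bias": 0.5, "anchoring_is_preference": 0.5}

# ASSUMPTION: the observer's model of the agent is right 90% of the time.
# The normative assumption only requires these rows to differ.
informative = {
    "anchoring_is_bias": {"bias": 0.9, "preference": 0.1},
    "anchoring_is_preference": {"bias": 0.1, "preference": 0.9},
}

# Without the normative assumption, the judgement carries no information.
uninformative = {
    "anchoring_is_bias": {"bias": 0.5, "preference": 0.5},
    "anchoring_is_preference": {"bias": 0.5, "preference": 0.5},
}

posterior = update_on_human_model(prior, informative, "bias")
unchanged = update_on_human_model(prior, uninformative, "bias")
```

With the informative likelihood, the judgement "bias" shifts probability towards the bias hypothesis; with the flat likelihood, the posterior equals the prior. The work is done entirely by the assumption that the rows of the likelihood table differ, which is the normative step and not something derived from behaviour.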

Implicit models

Note that this modelling is often carried out implicitly, through selecting the scenarios, and tweaking the formal model, so as to make the agent being assessed more human-like. With many variables to play with, it's easy to restrict to a set that seems to demonstrate human-like behaviour (for example, using almost-rationality assumptions for agents with small action spaces but not for agents with large ones).

There's nothing wrong with this approach, but it needs to be made clear that, when we are doing that, we are projecting our own assessments of human rationality onto the agent; we are not making "correct" choices as if we were dispassionately improving the hyperparameters of an image recognition program.