In a previous post, I showed how, given certain normative assumptions, one could distinguish agents for whom anchoring was a bias from those for whom it was a preference.
But one of these agents looks clearly ridiculous - how could anchoring be a preference? It makes no sense. And I agree with that assessment! That agent's preferences make no sense - if we think of it as a human.
Humans model each other in very similar ways
This is another way in which I think we can extract human preferences: using the fact that human models of each other, and self-models, are all incredibly similar. Consider the following astounding statements:
- If somebody turns red, shouts at you, then punches you in the face, they are probably angry at you.
- If somebody is drunk, they are less rational at implementing long-term plans.
- If somebody close to you tells you an intimate secret, then they probably trust you.
Most people will agree with all those statements, to a large extent - including the "somebody" being talked about. But what is going on here? Have I not shown that you can't deduce preferences or rationality from behaviour? It's not like we've put the "somebody" in an fMRI scan to construct their internal model, so how do we know?
The thing is that natural selection is lazy, and a) different humans use the same type of cognitive machinery to assess each other, and b) individual humans tend to use their own self-assessment machinery to assess other humans. Consequently, there tends to be large agreement between our own internal self-assessment models, our models of other people, other people's models of other people, and other people's self-assessment models of themselves:
This agreement is not perfect, by any means - I've mentioned that it varies from culture to culture, individual to individual, and even within the same individual. But even so, we can add the normative assumption:
- If one human models another human, then the first human's models of the second's preferences and rationality are informative of the second's actual preferences and rationality.
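A minimal sketch of how this normative assumption might be used in practice (the agents, the preference scale, and the numbers here are all hypothetical): treat each human's model of another human's preferences - including that human's self-model - as a noisy but informative observation, and pool them.

```python
import statistics

# Hypothetical setup: several humans each report a model of how much
# some human agent values, say, honesty, on a -1..1 scale. Under the
# normative assumption above, each report is a noisy but informative
# signal of the agent's true preference, so we can pool them.
reports = {
    "agent's self-model": 0.8,
    "friend's model of agent": 0.7,
    "colleague's model of agent": 0.6,
    "stranger's model of agent": 0.4,
}

# Simple pooling: the mean of all reports as the estimate, with the
# spread as a rough measure of how much the models disagree.
estimate = statistics.mean(reports.values())
disagreement = statistics.pstdev(reports.values())

print(f"pooled estimate: {estimate:.3f}")
print(f"disagreement:    {disagreement:.3f}")
```

The pooling rule itself (a plain mean) is a stand-in; the point is only that the assumption licenses treating these models as evidence at all, which mere behavioural data does not.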
That explains why I said that one agent was a human while the other was not: my model of what a human would prefer in those circumstances was correct for the one, but not for the other.
Note that this modelling is often carried out implicitly, through selecting the scenarios and tweaking the formal model so as to make the agent being assessed more human-like. With many variables to play with, it's easy to restrict to a set that seems to demonstrate human-like behaviour (for example, using almost-rationality assumptions for agents with small action spaces but not for agents with large ones).
There's nothing wrong with this approach, but it needs to be made clear that, when we are doing that, we are projecting our own assessments of human rationality onto the agent; we are not making "correct" choices as if we were dispassionately improving the hyperparameters of an image recognition program.
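To make the "tweaking the formal model" point concrete, here is a sketch (the utilities and observations are invented) of fitting an almost-rational, Boltzmann-style model of an agent's choices. The rationality parameter is exactly the kind of knob the text describes: the range of values we are willing to consider is a normative, human-centric choice, not something the data dictates.

```python
import math

def boltzmann_prob(utilities, chosen, beta):
    """Probability that an agent with rationality beta picks `chosen`,
    under a Boltzmann policy: P(a) proportional to exp(beta * U(a))."""
    weights = [math.exp(beta * u) for u in utilities]
    return weights[chosen] / sum(weights)

# Observed choices: (utilities of the available actions, index chosen).
# The third observation is a "mistake" relative to the candidate utilities.
observations = [([1.0, 0.0], 0), ([0.5, 1.5], 1), ([2.0, 1.0], 1)]

def log_likelihood(beta):
    return sum(math.log(boltzmann_prob(u, c, beta))
               for u, c in observations)

# Restricting the search to "human-plausible" betas is itself a normative
# choice: as beta grows, the agent is modelled as fully rational and the
# mistaken third choice becomes nearly impossible; at beta = 0 all
# behaviour is noise and every candidate utility function fits equally.
for beta in [0.1, 1.0, 5.0]:
    print(f"beta={beta}: log-likelihood={log_likelihood(beta):.3f}")
```

With these particular numbers, a moderate beta fits the data better than a near-perfectly-rational one - but which betas we even entertain is our projection of human-likeness onto the agent.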
The human ability to model other humans' preferences may be evidence that alignment is possible: we evolved to present and predict each other's (and our own) goals. So our goals are expressed in ways that can be reconstructed by another agent.
However, "X is not about X" could be true here. What humans take to be their "goals" or "rationality" could be not that at all, but just signals. For example, being angry at someone, and being the one someone is angry at, is a very clear situation for both humans - but what does it actually mean to an outside, non-human observer? Is it a temporary tantrum of a friend, or a precommitment to kill? Is it a joke, theatre, an expression of love, or an act of war?