Using lying to detect human values




Stuart_Armstrong

In my current research, I often find myself re-discovering things that are trivial and obvious, but that suddenly become mysterious. For instance, it's blindingly obvious that the anchoring bias is a bias, and almost everyone agrees on this. But this becomes puzzling when we realise that there is no principled way of deducing the rationality and reward of irrational agents.

Here's another puzzle. Have you ever seen someone try to claim that they have certain values that they manifestly don't have? Have you seen their facial expressions, their grimaces, their hesitation, and so on?

There's an immediate and trivial explanation: they're lying, and they're doing it badly (which is why we can actually detect the lying). But remember that there is no principled way of deducing the preferences of an irrational agent. How can someone lie about something that is essentially non-existent, namely their values? And even if someone knew their own values, why would the tell-tale signs of lying surface, since there's no way that anyone else could ever check those values, even in principle?

But here evolution is helping us. Humans have a self-model of their own values; indeed, this is what we use to define what those values are. And evolution, being lazy, re-uses the self-model to interpret others. Since these self-models are broadly similar from person to person, people tend to agree about the rationality and values of other humans.

So, because of these self-models, our own values "feel" like facts. And because evolution is lazy, lying or telling the truth about our own values triggers the same responses as lying or telling the truth about facts.

This suggests another way of accessing the self-model of human values: train an AI to detect human lying and misdirection on factual matters, then feed that AI a whole corpus of human moral/value/preference statements. Given the normative assumption that lying about facts resembles lying about values, this is another avenue by which AIs can learn human values.
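To make the shape of this proposal concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the two "tell" features (hesitation and grimace scores), the made-up training data, and the function names are hypothetical, and a tiny logistic regression stands in for whatever lie-detecting AI would actually be used. The point is only the structure: the detector is trained purely on factual statements with known truth labels, and is then applied unchanged to value statements.

```python
import math

# Hypothetical features per statement: (hesitation, grimace), each in [0, 1].
# Toy, made-up data for illustration only; label 1 = lie, 0 = truth.
# These are statements about FACTS, where the truth can be checked.
FACTUAL_TRAINING = [
    ((0.9, 0.8), 1),
    ((0.8, 0.6), 1),
    ((0.7, 0.9), 1),
    ((0.2, 0.1), 0),
    ((0.1, 0.3), 0),
    ((0.3, 0.2), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_lie_detector(data, epochs=2000, lr=0.5):
    """Fit a logistic regression with per-example gradient steps."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - y  # gradient of the log-loss w.r.t. the logit
            w[0] -= lr * err * x[0]
            w[1] -= lr * err * x[1]
            b -= lr * err
    return w, b


def lie_probability(model, x):
    """Estimated probability that a statement with features x is a lie."""
    w, b = model
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


def infer_endorsed_values(model, value_statements, threshold=0.5):
    """Keep the value claims that the factual lie detector does not flag.

    This step relies on the normative assumption that lying about values
    produces the same tells as lying about facts.
    """
    return [
        claim
        for claim, feats in value_statements
        if lie_probability(model, feats) < threshold
    ]


model = train_lie_detector(FACTUAL_TRAINING)

# Hypothetical value statements with their observed tells.
VALUE_STATEMENTS = [
    ("I care about honesty", (0.15, 0.1)),
    ("I love doing my taxes", (0.85, 0.9)),
]
endorsed = infer_endorsed_values(model, VALUE_STATEMENTS)
print(endorsed)
```

Note that no value statement ever receives a truth label during training; all the information about values enters through the assumption that the tells transfer from facts to values.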

So far, I've been assuming that human values are a single, definite object. In my next post, I'll look at the messy reality of under-defined and contradictory human values.