Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a toy model, inspired by the previous post's argument that the biases and preferences of a human carry more information than their behaviour.

This is about a human purchasing chocolate, so I am fully justified in illustrating it with this innocuous image:

Bidding on chocolate

A human is bidding to purchase some chocolate. It's a second-price auction, so the human is motivated to bid their true valuation.

The human either prefers milk chocolate (described as 'sweet') or dark chocolate (described as 'sour'). Whichever one they prefer, they will price at , and the other one at . Call these two possible preferences (prefers milk chocolate) and (prefers dark chocolate).

The human is also susceptible to the anchoring bias. Before announcing the type of chocolate, the announcer will also randomly mention or . This will push the human's declared bid up or down. Because of their susceptibility to anchoring bias, we will call them .

Then the following table illustrates the human's bid, dependent on the announcement and their own preference:
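As a rough illustration of how such a bid table could be generated, here is a minimal Python sketch. Every number in it (`PREFERRED_PRICE`, `OTHER_PRICE`, `ANCHOR_SHIFT`) and the anchor labels `"high"` and `"low"` are placeholder assumptions chosen for the example, not the post's figures; only the structure (valuation set by preference, then shifted by the anchor) follows the description above.

```python
from itertools import product

# Hypothetical amounts, chosen only for illustration.
PREFERRED_PRICE = 3.0  # valuation of the chocolate the human prefers
OTHER_PRICE = 2.0      # valuation of the other chocolate
ANCHOR_SHIFT = 1.0     # how far the announced number moves the bid

CHOCOLATES = ("milk", "dark")
ANCHORS = ("high", "low")


def valuation(preference, chocolate):
    """True valuation: the preferred chocolate is worth more."""
    return PREFERRED_PRICE if preference == chocolate else OTHER_PRICE


def anchored_bid(preference, chocolate, anchor):
    """Bid of the anchoring-biased human: valuation pushed up or down."""
    shift = ANCHOR_SHIFT if anchor == "high" else -ANCHOR_SHIFT
    return valuation(preference, chocolate) + shift


def bid_table():
    """Map (preference, chocolate on offer, anchor) to the declared bid."""
    return {
        (pref, choc, anchor): anchored_bid(pref, choc, anchor)
        for pref, choc, anchor in product(CHOCOLATES, CHOCOLATES, ANCHORS)
    }


if __name__ == "__main__":
    for key, bid in sorted(bid_table().items()):
        print(key, bid)
```

Under these assumed numbers, a milk-preferring human bids more for milk chocolate after the high anchor than after the low one, and the same shift applies to dark chocolate.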

Now let's introduce (the 'rational' human). This human derives satisfaction from naming prices that are closer to numbers it has heard recently. This satisfaction is worth - you guessed it - the equivalent of . Call this reward .

Then with reward will have the same behaviour as with reward : they will also upbid when is announced and downbid when is announced, but this is because they derive reward from doing so, not because they are biased. Similarly, with reward will have the same behaviour as with reward .

So is a human with less bias; now let's imagine a human with more bias. I'll introduce a so-called 'connotation bias'. This human may value milk and dark chocolate as given by their reward, but the word 'sweet' has positive connotations, independent of its descriptive properties; this increases their bids by . The word 'sour' has negative connotations; this decreases their bids by .

A human with connotation bias and reward will behave like a human without connotation bias and reward . That's because the human prices milk chocolate at , but will add because it's described as 'sweet'; and the converse, for the 'sour' dark chocolate.

Let describe a human that has the connotation bias, and describe a human that has both the connotation bias and the anchoring bias. Then the following four pairs of humans and rewards will behave the same in the bidding process:

We might also have a reward version of the connotation bias, where the human enjoys the chocolate more or less depending on the description (this is similar to the way that the placebo effect can still work, to some extent, if the subject is aware of it). Call this reward . Then we can add two more agents that have the same behaviour as those above:

This is how preferences and biases carry more information than the policy does: we have six possible pairs, all generating the same policy. Figuring out which one is 'genuine' requires more bits of information.
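The degeneracy can be checked mechanically. The sketch below is one way to encode the six pairs, under loud assumptions: the amounts are placeholders, the connotation shift is assumed equal to the gap between the two valuations (which is what the argument above needs), and the encoding of each pair as a set of biases plus a reward with optional extra terms is my reading of the text rather than a definitive formalisation.

```python
from itertools import product

# Hypothetical amounts, chosen only for illustration.
PREFERRED_PRICE = 3.0
OTHER_PRICE = 2.0
ANCHOR_SHIFT = 1.0
# Assumption: the connotation effect equals the gap between the two valuations.
CONNOTATION_SHIFT = PREFERRED_PRICE - OTHER_PRICE

# Every announcement the human can face: (chocolate on offer, anchor mentioned).
SITUATIONS = list(product(("milk", "dark"), ("high", "low")))


def bid(biases, reward, chocolate, anchor):
    """Declared bid for one (planner, reward) pair in one situation."""
    value = PREFERRED_PRICE if reward["prefers"] == chocolate else OTHER_PRICE
    # The anchoring effect moves the bid whether it is a bias or a reward term.
    if "anchoring" in biases or "anchoring" in reward["extras"]:
        value += ANCHOR_SHIFT if anchor == "high" else -ANCHOR_SHIFT
    # 'Sweet' (milk) raises the bid and 'sour' (dark) lowers it, again either way.
    if "connotation" in biases or "connotation" in reward["extras"]:
        value += CONNOTATION_SHIFT if chocolate == "milk" else -CONNOTATION_SHIFT
    return value


def policy(biases, reward):
    """The full observable behaviour: one bid per situation."""
    return tuple(bid(biases, reward, c, a) for c, a in SITUATIONS)


PAIRS = [
    ({"anchoring"}, {"prefers": "milk", "extras": set()}),
    (set(), {"prefers": "milk", "extras": {"anchoring"}}),
    ({"anchoring", "connotation"}, {"prefers": "dark", "extras": set()}),
    ({"connotation"}, {"prefers": "dark", "extras": {"anchoring"}}),
    ({"anchoring"}, {"prefers": "dark", "extras": {"connotation"}}),
    (set(), {"prefers": "dark", "extras": {"anchoring", "connotation"}}),
]

distinct_policies = {policy(biases, reward) for biases, reward in PAIRS}

if __name__ == "__main__":
    print(len(distinct_policies))
```

Note that `bid` treats a bias and the matching reward term identically. That is the point: the observable bids never depend on which label is attached, so the distinction has to be supplied from outside the behavioural data.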

More data won't save you

At this point, you're probably starting to think of various disambiguation experiments you could run to distinguish the various possibilities. Maybe allow the human to control the random number that is announced (maybe the with would prefer that be named, as that would give it satisfaction for naming a lower price), or the description of the chocolate ('savoury' or 'full-bodied' having better connotations than 'sour').

But recall that, per the Occam's razor paper's results, modelling the human as fully rational will be simpler than modelling them as having a varying mix of biases and preferences. So full rationality will always be a better explanation for the human behavioural data.

Since we have some space between the simplicity of full rationality and the complexity of more genuine human preferences and (ir)rationality, there will also be completely erroneous models of human preferences and biases, that are nonetheless simpler than the genuine ones.

For example, an almost fully rational human with an anti-anchoring bias (one that names quantities far from the suggestions it's been primed with) will most likely be simpler than a genuine explanation, which also has to take into account all the other types of human biases.

Why disambiguation experiments fail

Ok, so the previous section gave a high-level explanation of why running more disambiguation experiments won't help. But it's worth being narrower, and zooming in a bit. Specifically, if we allow the human to control the random number that is announced, the fully rational human, , would select , while the human-with-anchoring bias, , would either select the same number they would have bid otherwise ( or ), to remove anchoring bias, or would give a number at random (if they're unaware of anchoring bias).

To make behave in that way, we would have to add some kind of 'reward for naming numbers close to actual internal valuation, when prompted to do so, or maybe answering at random'. Call this reward ; then seems clearly more complicated than , so what is going on here?
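To make the patched-reward move concrete, here is a small sketch of the 'choose your own anchor' experiment. The numbers, the candidate list, and both reward functions are hypothetical stand-ins: `anchoring_reward` assumes the rational human simply prefers a lower announced number, and `patched_reward` is one possible form of a 'reward for naming numbers close to the internal valuation'. The random-answer variant is left out.

```python
# Hypothetical numbers, for illustration only.
VALUATION = 3.0    # what the human would bid with no anchor
LOW_ANCHOR = 0.01  # the low number the announcer could mention
CANDIDATES = [LOW_ANCHOR, 1.0, 2.0, VALUATION, 60.0]


def anchoring_reward(number):
    """Assumed reward of the rational human who likes low anchors."""
    return -number


def patched_reward(number):
    """Assumed patch: pays for naming a number close to the valuation."""
    return -abs(number - VALUATION)


def rational_choice(reward):
    """A fully rational planner names the number that maximises its reward."""
    return max(CANDIDATES, key=reward)


def debiasing_choice():
    """An anchoring-biased human who knows about the bias names their valuation."""
    return VALUATION


if __name__ == "__main__":
    print(rational_choice(anchoring_reward))  # the unpatched rational human
    print(rational_choice(patched_reward))    # the patched rational human
    print(debiasing_choice())                 # the biased, self-aware human
```

In this sketch the experiment separates the biased human from the unpatched rational one, but not from the rational one with the patched reward. Telling those two apart relies on a judgement that the patched reward is the more complicated hypothesis, which is not something the chosen number itself reveals.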

What's going on is that we are making use of a lot of implicit assumptions about how humans (or quasi-humans) work. We're assuming that humans treat money as fungible, that we desire more of it, and are roughly efficient at doing so. We're assuming that humans are either capable of identifying the anchoring bias and removing it, or believe the initial number announced is irrelevant. There are probably a lot of other implicit assumptions that I've missed myself, because I too am human, and it's hard to avoid using these assumptions.

But, in any case, it is only when we've added these implicit assumptions that ' seems clearly more complicated than '. If we had the true perspective that Thor is much more complicated than Maxwell's equations, then might well be the simpler of the two (and the more situations we consider, the simpler the 'humans are fully rational' model becomes, relative to other models).

Similarly, if we wanted to disambiguate , then we would be heavily using the implicit knowledge that 'sour' is a word with negative connotations, while 'savoury' is not.

Now, the implicit assumptions we've used are 'true', in that they do describe the preference/rationality/bias features of humans that we'd want an AI to use to model us. But we don't get them for free, from mere AI observation; we need to put them into the AI's assumptions, somehow.
