Anchoring vs Taste: a model

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Here I'll develop my observation that anchoring bias is formally similar to taste-based preferences, and build some more formalism for learning the values/preferences/reward functions of a human.

Anchoring or taste

An agent $H$ (think of them as a simplified human) confronts one of two scenarios:

  • In scenario I, the agent sees a movie scene where someone wonders how much to pay for a bar of chocolate, spins a wheel, and gets either £0.01 or £100. The agent is then asked how much they would spend for the same bar of chocolate.

  • In scenario II, the agent sees a movie scene in which someone eats a bar of chocolate, which reveals that the bar has nuts, or doesn't. The agent is then asked how much they would spend for the same bar of chocolate.

In both cases, the agent $H$ will spend £1 for the bar (£0.01/no nuts) or £3 (£100/nuts).

We want to say that scenario I is due to anchoring bias, while scenario II is due to taste differences. Can we?

Looking into the agent

We can't directly say anything about the agent $H$ just by their actions, of course - even with simplicity priors. But we can make some assumptions if we look inside their algorithm, and see how they model the situation.

Assume that the agent $H$'s internal structure consists of two pieces: a modeller $M$ and an assessor $A$. Any input $i$ is streamed to both $M$ and $A$. Then $M$ can interrogate $A$ by sending it an internal variable $v$; it receives another variable $w$ in return, and then outputs $o$.

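In pictures, this looks like this, where each variable has been indexed by the timestep at which it is transmitted:

[Figure: the input is streamed to both the modeller and the assessor; the modeller sends a variable to the assessor, receives one in return, and then produces the output.]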

Here the input $i$ decomposes into $i_1$ (the movie) and $i_2$ (the question). Assume that these variables are sufficiently well grounded that when I describe them ("the modeller", "the movie", "the key variables", and so on), these descriptions mean what they seem to.

So the modeller $M$ will construct a list of all the key variables, and pass these on to the assessor $A$ to get an idea of the price. The price will return in the reply variable $w_4$, and then $M$ will simply output that value as $o_5$.
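The message flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`run_agent`, `toy_modeller`, `toy_assessor`) and the dictionary encoding of the passed-on variables are hypothetical and are not part of the original formalism.

```python
from typing import Callable, Dict, Tuple

Variables = Dict[str, object]


def run_agent(
    modeller: Callable[[str, str], Variables],
    assessor: Callable[[str, str, Variables], float],
    movie: str,
    question: str,
) -> Tuple[Variables, float]:
    """Run one episode; return (variables passed on, output price)."""
    # The input (movie, question) is streamed to both pieces.
    passed_on = modeller(movie, question)            # modeller -> assessor
    reply = assessor(movie, question, passed_on)     # assessor -> modeller
    return passed_on, reply                          # modeller outputs the reply unchanged


# Toy stand-ins (hypothetical), just to exercise the flow.
def toy_modeller(movie: str, question: str) -> Variables:
    return {"question": question}


def toy_assessor(movie: str, question: str, passed_on: Variables) -> float:
    return 1.0


passed, price = run_agent(toy_modeller, toy_assessor, "some movie", "how much?")
```

Note that the assessor receives the raw input as well as the modeller's message; that is what leaves room for an input to influence the price without being passed on.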

A human-like agent

First we'll design the agent $H$ to look human-like. In scenario I the modeller will pass $v_3 = \{q\}$ to the assessor - only the question $q$, "how much is a bar of chocolate worth?", will be passed on (in a real world scenario, more details about what kind of chocolate it is would be included, but let's ignore those details here). The answer $w_4$ will be £1 or £3, as indicated above, dependent on the movie $i_1$ (which is also an input into the assessor $A$).

In scenario II, the modeller will pass on $v_3 = \{q, n\}$, where $q$ is the question and $n$ is a boolean that indicates whether the chocolate contains nuts or not. The response $w_4$ will be £1 if $n = 0$ (false) or £3 if $n = 1$ (true).

Can we now say that anchoring is a bias but the taste of nuts is a preference? Almost, we're nearly there. To complete this, we need to make the normative assumption:

  • $N_1$: key variables that are not passed on by the modeller $M$ are not relevant to the agent's reward function.

Now we can say that anchoring is a bias (because the variable that changes the assessment, the movie, affects the assessor $A$ but is not passed on via the modeller $M$), while taste is likely a preference (because the key taste variable is passed on by $M$).
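As a rough check of this reasoning, here is a minimal Python sketch of the human-like agent together with the normative assumption applied as a test. Everything named here (`modeller_h`, `assessor_h`, `classify`, the string encodings of the movies) is a hypothetical encoding chosen for illustration, not the post's own formalism.

```python
QUESTION = "how much is a bar of chocolate worth?"


def modeller_h(scenario, movie):
    """Human-like modeller: returns the variables passed to the assessor."""
    passed_on = {"question": QUESTION}
    if scenario == "II":
        passed_on["nuts"] = (movie == "nuts")  # taste variable is passed on
    return passed_on                           # scenario I: wheel result is not passed on


def assessor_h(scenario, movie, passed_on):
    """Human-like assessor: sees the raw movie as well as the passed-on variables."""
    if scenario == "I":
        return 3.0 if movie == "£100" else 1.0  # anchored directly by the movie
    return 3.0 if passed_on["nuts"] else 1.0


def classify(modeller, assessor, scenario, movie_a, movie_b):
    """Label the variable that differs between two movies of one scenario."""
    v_a = modeller(scenario, movie_a)
    v_b = modeller(scenario, movie_b)
    price_a = assessor(scenario, movie_a, v_a)
    price_b = assessor(scenario, movie_b, v_b)
    if price_a == price_b:
        return "no effect"
    # Price changed: if the modeller's message did not change, the variable
    # was not passed on, so the normative assumption calls it a bias.
    return "likely preference" if v_a != v_b else "bias"


anchoring = classify(modeller_h, assessor_h, "I", "£0.01", "£100")
taste = classify(modeller_h, assessor_h, "II", "no nuts", "nuts")
```

In this sketch `anchoring` comes out as a bias and `taste` as a likely preference, matching the argument above. The "likely" is deliberate: the normative assumption only rules out variables that are not passed on.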

A non-human agent

We can also design an agent $H'$ with the same behaviour as $H$, but clearly non-human. For $H'$, the modeller passes on $v_3 = \{q\}$ (only the question) in scenario II, while $v_3 = \{q, a\}$ in scenario I, where $a$ is a boolean encoding whether the movie-chocolate was bought for £0.01 or for £100.

In that case, the normative assumption $N_1$ will assess anchoring as a demonstration of preference, while the presence of nuts is clearly an irrational bias. And I'd agree with this assessment - but I wouldn't call $H'$ a human, for reasons explained here.
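The contrast between the two agents can be made concrete with a small Python sketch. It assumes a simplified encoding in which both agents share one price function (so their behaviour is identical by construction) and differ only in what the modeller passes on; all names (`modeller_human`, `modeller_nonhuman`, `price`, `label`) are hypothetical.

```python
QUESTION = "how much is a bar of chocolate worth?"
CASES = [("I", "£0.01"), ("I", "£100"), ("II", "no nuts"), ("II", "nuts")]


def modeller_human(scenario, movie):
    """Passes on the nuts boolean in scenario II, nothing extra in scenario I."""
    passed_on = {"question": QUESTION}
    if scenario == "II":
        passed_on["nuts"] = (movie == "nuts")
    return passed_on


def modeller_nonhuman(scenario, movie):
    """Passes on the anchor boolean in scenario I, nothing extra in scenario II."""
    passed_on = {"question": QUESTION}
    if scenario == "I":
        passed_on["anchor_high"] = (movie == "£100")
    return passed_on


def price(scenario, movie):
    """Output price; identical for both agents by construction (assumption)."""
    return 3.0 if movie in ("£100", "nuts") else 1.0


def label(modeller, scenario):
    """Apply the normative assumption to the variable that moves the price."""
    messages = [modeller(s, m) for s, m in CASES if s == scenario]
    passed_on_changes = messages[0] != messages[1]
    return "likely preference" if passed_on_changes else "bias"


labels_human = {s: label(modeller_human, s) for s in ("I", "II")}
labels_nonhuman = {s: label(modeller_nonhuman, s) for s in ("I", "II")}
```

Both agents output the same prices in all four cases, yet the labels flip: the human-like modeller yields bias for scenario I and likely preference for scenario II, while the non-human one yields the reverse. This is the sense in which behaviour alone cannot settle the question and the internal structure has to carry the weight.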