I've shown that it is, theoretically, impossible to deduce the preferences and rationality of an agent by looking at their actions or policy.
That argument is valid, but feels somewhat abstract, talking about "fully anti-rational" agents, and other "obviously ridiculous" preferences.
In this post, I'll present a simple, realistic example of human behaviour in which the human's preferences cannot be deduced. The example was developed by Xavier O'Rourke.
The motivations and beliefs of a poker player
In this example, Alice is playing Bob at poker, and they are on their last round. Alice might believe that Bob has a better hand, or a worse one. She may be maximising her expected income, or minimising it (why? read on to see). Even under questioning, it is impossible to distinguish an Alice who believes Bob has the worse hand and is maximising her income from an Alice who believes Bob has the better hand and is minimising her income. Similarly, an Alice who believes Bob has the worse hand and is minimising her income is indistinguishable from an Alice who believes Bob has the better hand and is maximising her income.
If we want to be specific, imagine that we are observing Alice playing a game of Texas hold 'em. Before the river (the final round of betting), everyone has folded besides Alice and Bob. Alice is holding , and the board (the five cards both players have in common) is .
Alice is looking at four-of-a-kind in 10s, and can only lose if Bob holds the exact two cards that would give him a straight flush. For simplicity, assume Bob has raised, and Alice can only call or fold -- assume she's out of money to re-raise -- and Bob cannot respond to either, so his actions are irrelevant. He has been playing this hand, so far, with great confidence.
Alice can have two heuristic models of Bob's hand. In the first model, she assumes that the probability of Bob holding exactly those two cards is very low, so she almost certainly has the better hand. In the second model, she notes Bob's great confidence, and concludes he is quite likely to have that pair.
What does Alice want? Well, one obvious goal is to maximise money, with a reward that is linear in money. However, it's possible that Alice doesn't care about how much money she's taking home -- she'd prefer to take Bob home instead, so her reward is about Bob rather than money -- and she thinks that putting Bob in a good mood by letting him win at poker will make him more receptive to her advances later in the evening. In this case Alice wants to lose as much money as she can in this hand, so, in this specific situation, her reward is the negative of the money-maximising reward.
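Then the following table represents Alice's action, as a function of her model and reward function:

| | Wants to maximise money | Wants to lose to Bob |
|---|---|---|
| Believes Bob has the worse hand | Call | Fold |
| Believes Bob has the better hand | Fold | Call |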
Thus, for example, if she wants to maximise money and believes Bob doesn't have the winning hand, she should call. Similarly, wanting to lose to Bob while believing he has the winning hand results in Alice calling (because she believes she will lose if both players show their cards, and wants to lose). Conversely, the other two combinations (maximising money while believing Bob has the winning hand, and wanting to lose while believing he doesn't) result in Alice folding.
Thus observing Alice's behaviour constrains neither her beliefs nor her preferences -- though it does constrain the combination of the two.
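The four-way ambiguity can be made concrete with a small sketch. Everything numerical here is an illustrative assumption rather than something from the example itself: the pot size, the cost of calling, and the two probabilities Alice might assign to Bob holding the winning hand. The names (`MODELS`, `REWARDS`, `best_action`) are likewise hypothetical. Each reward is modelled as plus or minus Alice's expected change in money, measured relative to folding.

```python
from itertools import product

# Hypothetical stakes (not from the post).
POT = 100.0        # money Alice gains if she calls and wins
CALL_COST = 20.0   # extra money Alice loses if she calls and loses

# Two belief models: Alice's probability that Bob holds the winning hand.
# The specific values are illustrative assumptions.
MODELS = {
    "bob_worse": 0.01,
    "bob_better": 0.90,
}

# Two reward functions, expressed as a sign applied to Alice's money.
REWARDS = {
    "maximise_money": +1.0,
    "lose_to_bob": -1.0,
}

ACTIONS = ("call", "fold")


def expected_money(action, p_bob_wins):
    """Expected change in Alice's money, relative to folding right now."""
    if action == "fold":
        return 0.0
    return (1.0 - p_bob_wins) * POT - p_bob_wins * CALL_COST


def best_action(model, reward):
    """Action that maximises Alice's expected reward under a (model, reward) pair."""
    p = MODELS[model]
    sign = REWARDS[reward]
    return max(ACTIONS, key=lambda a: sign * expected_money(a, p))


# Decision table: (belief model, reward) -> action.
table = {(m, r): best_action(m, r) for m, r in product(MODELS, REWARDS)}


def explanations_for(action):
    """All (model, reward) pairs that would produce the observed action."""
    return sorted(pair for pair, act in table.items() if act == action)


if __name__ == "__main__":
    for action in ACTIONS:
        print(action, "is explained by", explanations_for(action))
```

Under these assumed numbers, each observed action is produced by two different (belief, reward) pairs, so seeing Alice call or fold rules out two of the four combinations but cannot separate the remaining two.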
Alice's overall actions
Can we really not figure out what Alice wants here? What if we looked at her previous or subsequent behaviour? Or if we simply asked her what she wanted?
Unfortunately, neither of these may suffice. Even if Alice is mainly a money maximiser, it's possible she might take Bob as a consolation prize; even if she was mainly interested in Bob, it's possible that she previously played aggressively to win money, reasoning that Bob is more likely to savour a final victory against a worthy-seeming opponent.
As for asking Alice -- well, sexual preferences and poker strategies are areas where humans are incredibly motivated to lie and mislead. Why confess to a desire when confessing might make it impossible to achieve? Or reveal how you analyse poker hands in an unduly honest way? Conversely, honesty or double-bluffs are also options.
Thus, it is plausible that Alice's total behaviour could be identical whether she believes Bob has the worse hand and wants to maximise money, or believes Bob has the better hand and wants to lose to him (and likewise identical across the two remaining combinations), not allowing us to distinguish these. Or at least, not allowing us to distinguish them with much confidence.
Adding more details
It might be objected that the problem above is overly narrow, and that if we expanded the space of actions, Alice's preferences would become clear.
That is likely to be the case; but the space of beliefs and rewards was also narrow. We could allow Alice to raise as well (maybe with the goal of tricking Bob into folding); with three actions, we may be able to distinguish better between the four possible pairs. But we can then give Alice more models as to how Bob would react, increasing the space of possibilities. We could also consider more possible motives for Alice -- she might have a risk-averse, money-loving utility, and/or some mix between wanting money and wanting Bob.
It's therefore not clear that "expanding" the problem, or making it more realistic, would make it any easier to deduce what Alice wants.
Since beliefs/values combinations can be ruled out, would it then be possible to learn values by asking the human about their own beliefs?
In the higher-dimensional belief/reward space, do you think that it would be possible to significantly narrow down the space of possibilities (so this argument is saying "be Bayesian with respect to rewards/beliefs, picking policies that work over a distribution")? Or are you more pessimistic than that, thinking that the uncertainty would be so great in higher-dimensional spaces that it would not be possible to pick a good policy?
I think we need to add other assumptions to narrow down the search space.
Here it is assumed that Alice knows her preferences. But sometimes humans are unsure about what they actually want, especially in the case of two almost equal desires like Bob and money. They update their preferences later via rationalisation. So if Alice gets Bob, she will decide that she wanted Bob.
Indeed, that adds an extra layer of complication.