In this penultimate post in the "learning human values" series, I want to address some human values/preferences/rewards that don't fit neatly into the (p, R) model, where p is the planning algorithm and R the actual reward.
Most people have preferences over their own preferences - and those of others. For example, consider someone who has an incorrect religious faith. They might believe something like:
"I want to always continue believing. I flinch away from certain sceptical arguments, but I'm sure my deity would protect me from doubt if I ever decided to look into them".
I hope this doesn't sound completely implausible as a description of someone. Here they have beliefs, preferences over their future beliefs, and beliefs about their future beliefs. This doesn't seem to be easily captured in the (p, R) framework. We can also see that asking them the equivalent questions "Do you want to doubt your deity?" and "Do you want to learn the truth?" will get very different answers.
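To make the strain on the framework concrete, here is a minimal sketch (all names and numbers are hypothetical, not from the series) of the problem: a flat reward R is defined over world-states, but the believer's meta-preference is over their own future belief-state, so capturing it seems to require extending the reward's domain.

```python
# Hypothetical illustration: a flat (p, R) reward over world-states cannot
# distinguish "Do you want to doubt?" from "Do you want to learn the truth?",
# because the preference in question is over the agent's own belief-state.

from typing import Dict, Tuple

WorldState = str    # e.g. "deity_exists", "deity_absent"
BeliefState = str   # e.g. "faithful", "doubting"

# Standard flat reward: depends only on the world-state.
flat_reward: Dict[WorldState, float] = {
    "deity_exists": 1.0,
    "deity_absent": 0.0,
}

# One way to represent the meta-preference: extend the reward's domain to
# (world-state, belief-state) pairs, making the agent's own beliefs part
# of the state the reward sees.  Numbers are purely illustrative.
meta_reward: Dict[Tuple[WorldState, BeliefState], float] = {
    ("deity_exists", "faithful"): 1.0,
    ("deity_exists", "doubting"): -0.5,  # doubt is penalised...
    ("deity_absent", "faithful"): 0.8,   # ...even when the belief is false
    ("deity_absent", "doubting"): -0.5,
}

# The signature of a meta-preference: reward varies with the belief-state
# column, holding the world-state row fixed.
for world in ("deity_exists", "deity_absent"):
    assert meta_reward[(world, "faithful")] > meta_reward[(world, "doubting")]
```

The point of the sketch is only that the extra belief-state argument is doing real work: collapsing it back out loses exactly the distinction between the two questions above.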
But it's not just theism, an example that is too easy to pick on. I have preferences over knowledge, as do most people. I would prefer that people had accurate information, for instance. I would also prefer that, when choosing between possible formalisations of preferences, people went with the less destructive and less self-destructive options. These are not overwhelmingly strong preferences, but they certainly exist.
Consider the following scenario: someone believes that roller-coasters are perfectly safe, but enjoys riding them for the feeling of danger it gives them. The challenge here is clearly not reconciling the belief of safety with the alief of danger (which is simple: roller-coasters are safe), but somehow transforming the feeling of danger into another form that preserves the initial enjoyment.
The theism argument might suggest that tribalism will be a major problem, as various groups pressure adherents to conform to certain beliefs and preferences.
But actually that need not be such a problem. There is clearly a strong desire to remain part of that group (or, sometimes, just of a group). Once that desire is identified, all the rest becomes instrumental: the human will either take the actions needed to remain part of the group, without changing their beliefs or preferences (just because evolution doesn't allow us to separate those two easily doesn't mean an AI can't help us do it), or will rationally sacrifice beliefs and preferences to the cause of remaining part of the group.
Most signalling cases can be dealt with in the same way. So, though tribalism is a major reason people can end up with contingent preferences, it doesn't in itself pose problems to the (p, R) model.
The problem of personal identity is a tricky one. I would like to remain alive, happy, curious, having interesting experiences, doing worthwhile and varied activities, etc...
Now, this is partially preferences about future preferences, but there's the implicit identity: I want this to happen to me. Even when I'm being altruistic, I want these experiences to happen to someone, not just to happen in some abstract sense.
But the concept of personal identity is a complicated one, and it's not clear if it can be collapsed easily into the (p, R) format.
Finally, even if personal identity is defined, it remains the case that people judge situations differently depending on how those situations are achieved. Being forced or manipulated into a situation will make them resent it much more than if they reach it through "natural" means. Of course, what counts as acceptable or unacceptable manipulation changes over time, and is filled with biases, inconsistencies, and incorrect beliefs (in my experience, far too many people think themselves immune to advertising, for instance).
People react strongly to situations getting worse or better, not so much to the absolute quality of the situation.
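This reference-dependence is itself awkward for a flat reward over states. A toy sketch (hypothetical functions and numbers, in the spirit of prospect theory rather than anything from the series) of the contrast:

```python
# Toy contrast between a flat state-based reward and a reference-dependent
# one, where reactions track the change in a situation, not its level.

def absolute_reward(quality: float) -> float:
    """Flat (p, R)-style reward: depends only on the current situation."""
    return quality

def reference_dependent_reward(quality: float, previous: float) -> float:
    """Reward driven by improvement or decline relative to a reference
    point, with losses weighted more heavily than gains (loss aversion)."""
    change = quality - previous
    return 2.0 * change if change < 0 else change

# Same absolute quality (5.0), opposite reactions depending on history:
improving = reference_dependent_reward(5.0, previous=3.0)  # got better: +2.0
declining = reference_dependent_reward(5.0, previous=8.0)  # got worse:  -6.0
assert improving > 0 > declining
```

The flat reward assigns the two histories identical value; the reference-dependent one does not, which is the asymmetry the paragraph above describes.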
AIs would radically reshape the world and society. And yet humans have deeply held values that only make sense in narrow contexts - sometimes, they already no longer make sense. For instance, in my opinion, of the five categories in moral foundations theory, one no longer makes sense and three only make partial sense. (It seems to me that having these values in a world where it's literally impossible to satisfy them is part of the problem people have with the modern world.)
This can be seen as a subset of the whole "underdefined human values", but it could also be seen as an argument for preserving or recreating certain contexts, in which these values make sense.
These are just some of the challenges to the (p, R) format, and there are certainly others. It's not clear how much that format needs to be complicated in order to usefully model all these extra types of preferences.