I've shown that, even with simplicity priors, we can't figure out the preferences or rationality of a potentially irrational agent (such as a human $H$).
But we can get around that issue with 'normative assumptions'. These can allow us to zero in on a 'reasonable' reward function $R_H$.
We should however note that:
- Even if $R_H$ is highly complex, a normative assumption need not be complex to single it out.
This post gives an example of that for general agents, and discusses how a similar idea might apply to the human situation.
An agent takes actions ($\mathcal{A}$) and gets observations ($\mathcal{O}$), and together these form histories, with $\mathcal{H}$ the set of histories (I won't present all the details of the formalism here). The policies $\Pi = \{\pi : \mathcal{H} \to \mathcal{A}\}$ are maps from histories to actions. The reward functions $\mathcal{R} = \{R : \mathcal{H} \to \mathbb{R}\}$ are maps from histories to real numbers, and the planners $\mathcal{P} = \{p : \mathcal{R} \to \Pi\}$ are maps from reward functions to policies.
By observing an agent, we can deduce (part of) their policy $\pi$. Then a reward-planner pair $(p, R)$ is compatible with $\pi$ if $p(R) = \pi$. Further observations cannot distinguish between different compatible pairs.
Then a normative assumption $\alpha$ is something that distinguishes between compatible pairs. It could be a prior on $\mathcal{P} \times \mathcal{R}$, or an assumption of full rationality (which removes all but the rational planner from $\mathcal{P}$), or something that takes in more details about the agent or the situation.
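To make the compatibility idea concrete, here is a minimal toy sketch in Python. Everything in it is an illustrative assumption rather than part of the formalism: the two-action world, the names `rational_planner`, `anti_rational_planner` and `assume_rational`, and the choice to model histories as short tuples of past actions. It shows two different planner-reward pairs producing the same observed policy, and a simple normative assumption (full rationality) picking one of them.

```python
from itertools import product

ACTIONS = ("left", "right")
# Assumption: histories are just tuples of past actions, up to length 2.
HISTORIES = [h for n in range(3) for h in product(ACTIONS, repeat=n)]

def reward(history):
    """Toy reward: +1 for every 'right' in the history."""
    return sum(1.0 for a in history if a == "right")

def neg_reward(history):
    """The negation of the toy reward."""
    return -reward(history)

def rational_planner(R):
    """Planner that greedily maximises R over the next action."""
    return lambda h: max(ACTIONS, key=lambda a: R(h + (a,)))

def anti_rational_planner(R):
    """Planner that greedily minimises R over the next action."""
    return lambda h: min(ACTIONS, key=lambda a: R(h + (a,)))

def compatible(planner, R, policy):
    """(planner, R) is compatible with policy if planner(R) agrees with it on every history."""
    planned = planner(R)
    return all(planned(h) == policy(h) for h in HISTORIES)

def observed_policy(history):
    """The behaviour we observe: the agent always goes right."""
    return "right"

pairs = {
    "rational + R": (rational_planner, reward),
    "anti-rational + (-R)": (anti_rational_planner, neg_reward),
}
# Both pairs explain the observed behaviour equally well.
compatible_pairs = {
    name: pair for name, pair in pairs.items() if compatible(*pair, observed_policy)
}

def assume_rational(candidates):
    """Toy normative assumption: keep only pairs whose planner is the rational one."""
    return {n: (p, R) for n, (p, R) in candidates.items() if p is rational_planner}

selected = assume_rational(compatible_pairs)
```

In this sketch no amount of further observation of `observed_policy` separates the two pairs; only the extra assumption does.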
Assumptions that use a lot of information
Assume that the agent's algorithm is written in some code, as $C_\pi$, and that the normative assumption $\alpha$ will have access to this. Then suppose that $\alpha$ scans $C_\pi$, looking for the following: an object $C_R$ that takes a history as an input and has a real number as an output, an object $C_p$ that takes $C_R$ and a history as inputs, and outputs an action, and a guarantee that $C_\pi$ chooses actions by running $C_p$ on $C_R$ and the input history.
The $\alpha$ need not be very complex to do that job. Because of Rice's theorem and obfuscated code, it will be impossible for $\alpha$ to check those facts in general. But, for many examples of $C_\pi$, it will be able to check that those things hold. In that case, let $\alpha$ return the reward function $R$ computed by $C_R$; otherwise, let it return the trivial reward $0$.
So, for a large set of possible algorithms, $\alpha$ can return a reasonable reward function estimate. Even if the complexity of $\pi$ and $R$ is much, much higher than the complexity of $\alpha$ itself, there are still examples of these where $\alpha$ can successfully identify the reward function.
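As a rough illustration of how short such a scanning assumption can be, here is a toy Python sketch. It is not a real code analyser: the class `TransparentAgent`, the function `alpha` and the helper names are all hypothetical, and the 'guarantee' is modelled very crudely by checking that the agent is exactly an instance of a known class whose `act` method only runs the planner on the reward and the history. Anything else is treated as unverifiable and gets the trivial reward.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

History = Tuple[str, ...]
Reward = Callable[[History], float]

def zero_reward(history: History) -> float:
    """The trivial reward, returned when the structure cannot be verified."""
    return 0.0

@dataclass(frozen=True)
class TransparentAgent:
    """Hypothetical agent whose code visibly factors into a reward and a planner."""
    reward: Reward
    planner: Callable[[Reward, History], str]

    def act(self, history: History) -> str:
        # The guarantee: actions come only from running the planner
        # on the reward and the input history.
        return self.planner(self.reward, history)

def alpha(agent: object) -> Reward:
    """Toy normative assumption: extract the reward if the structure is verifiable."""
    # Exact type check (not isinstance), so a subclass that overrides
    # act() is not trusted to respect the guarantee.
    if (
        type(agent) is TransparentAgent
        and callable(agent.reward)
        and callable(agent.planner)
    ):
        return agent.reward
    return zero_reward

def complicated_reward(history: History) -> float:
    """Stand-in for a reward that could be arbitrarily more complex than alpha."""
    return sum((i + 1) * (1.0 if a == "right" else -0.5) for i, a in enumerate(history))

def greedy_planner(R: Reward, history: History) -> str:
    """Pick the next action that maximises R on the extended history."""
    return max(("left", "right"), key=lambda a: R(history + (a,)))

transparent = TransparentAgent(complicated_reward, greedy_planner)
opaque = lambda history: "right"  # a policy with no visible internal structure

extracted = alpha(transparent)   # the agent's own reward function
fallback = alpha(opaque)         # the trivial reward
```

The point of the sketch is only that `alpha` stays the same few lines however elaborate `complicated_reward` becomes, since it returns the reward object it finds rather than reconstructing it.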
Of course, if we run $\alpha$ on a human brain, it would return the trivial reward $0$. But what I am looking for is not $\alpha$, but a more complicated $\alpha_H$ that, when run on the set of human agents, will extract some 'reasonable' $R_H$. It doesn't matter what $\alpha_H$ does when run on non-human agents, so we can load it with assumptions about how humans work. When I talk about extracting preferences through looking at internal models, this is the kind of thing I had in mind (along with some method for synthesising those preferences into a coherent whole).
So, though my desired $\alpha_H$ might be complex, there is no a priori reason to think that it need be as complex as the output.