Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Methods like cooperative inverse reinforcement learning assume that the human knows their "true" reward function, and that the human and the robot then cooperate to figure out and maximise this reward.

This is fine as far as the model goes, and can allow us to design many useful systems. But it has a problem: the assumption is not true, and, moreover, its falsity can have major detrimental effects.

Contrast two situations:

  1. The human knows the true reward function.
  2. The human has a collection of partial models in which they have clearly defined preferences. Because they are a bounded, limited agent whose internal symbols are only well grounded in standard situations, their stated preferences will be a simplification of their mental model at the time. The true reward function is constructed from some process of synthesis.

Now imagine the following conversation:

  • AI: What do you really want?
  • Human: Money.
  • AI: Are you sure?
  • Human: Yes.

Under most versions of hypothesis 1, this will end in disaster. The human has expressed their preferences and, when offered the opportunity for clarification, didn't give any. The AI will become a money-maximiser, and things will go pear-shaped.

Under hypothesis 2, however, the AI will attempt to get more details out of the human by suggesting hypothetical scenarios and checking what happens when money and the other things in money's web of connotations come apart, e.g. "What if you had a lot of money, but couldn't buy anything, and everyone despised you?" The synthesis may fail but, at the very least, the AI will investigate further.
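To make the contrast concrete, here is a toy sketch of the two learners. It is only an illustration under strong assumptions: the `ToyHuman` class, its `judge` method, the feature names and both `learn_*` functions are hypothetical and not taken from any existing system, and a real synthesis process would be far messier than pulling apart one feature at a time.

```python
class ToyHuman:
    """Hypothetical human with a stated preference and scenario-level judgements."""

    stated = "money"

    def judge(self, scenario):
        # Assumption: this human only endorses a scenario when money comes
        # with its usual connotations (being able to spend it, being respected).
        return all(
            scenario.get(feature, True)
            for feature in ("money", "can_buy_things", "respected")
        )


def learn_under_hypothesis_1(human):
    # The stated preference is taken to be the true reward; nothing is probed.
    return {human.stated}


def learn_under_hypothesis_2(human, connotations):
    # Start from the stated preference, then test each connotation in turn.
    needed = {human.stated}
    for feature in connotations:
        # Build a scenario where everything holds except this one feature:
        # "What if you had a lot of money, but not this?"
        scenario = {name: True for name in (human.stated, *connotations)}
        scenario[feature] = False
        if not human.judge(scenario):
            # The human rejects the scenario, so the feature matters to them.
            needed.add(feature)
    return needed
```

With connotations such as `["can_buy_things", "respected", "paper_banknotes"]`, the first learner ends up with money alone, while the second also keeps the ability to buy things and being respected, and drops the banknotes that this toy human does not care about.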

Thus, assuming that the AI is learning a truth that humans already know is a harmless assumption in many circumstances, but it will result in disaster if pushed to the extreme.
