Comments

If you had a defense of the idea, or a link to one I could read, I would be very interested to hear it. I wasn't trying to be dogmatically skeptical. 

Answer by VesaVanDelzig · Jun 24, 2022

His hostility to the program, as I understand it, is that CIRL doesn't answer the question of how to specify a learning procedure that goes from observations of a human being to a correct model of that human being's utility function. That is the hard part of the problem. This is why he says that "specifying an update rule which converges to a desirable goal is just a reframing of the problem of specifying a desirable goal, with the "uncertainty" part a red herring".
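To make that concrete, here is a minimal sketch of the kind of Bayesian update a CIRL-style agent would run. Everything in it is my own illustrative assumption (the two reward hypotheses, the Boltzmann-rational human model, the temperature `BETA`); the point is just that the likelihood term *is* a human model, and choosing it correctly is the goal-specification problem in disguise:

```python
import numpy as np

# Illustrative sketch only: the hypotheses, the Boltzmann human model,
# and BETA are assumptions I made up, not part of any published spec.

# Two hypotheses about what the human values, as rewards over three actions.
REWARD_HYPOTHESES = {
    "likes_coffee": np.array([1.0, 0.0, 0.0]),
    "likes_tea":    np.array([0.0, 1.0, 0.0]),
}

BETA = 2.0  # assumed human "rationality" temperature


def likelihood(action: int, rewards: np.ndarray) -> float:
    """P(human takes `action` | reward function), under a Boltzmann model.

    This likelihood IS the human model. Choosing it amounts to deciding,
    in advance, how human behavior relates to human values -- which is
    exactly the hard part the update rule was supposed to solve for us.
    """
    exp_r = np.exp(BETA * rewards)
    return exp_r[action] / exp_r.sum()


def update(prior: dict, action: int) -> dict:
    """One Bayesian update on an observed human action."""
    unnorm = {h: p * likelihood(action, REWARD_HYPOTHESES[h])
              for h, p in prior.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}


prior = {"likes_coffee": 0.5, "likes_tea": 0.5}
posterior = update(prior, action=0)  # human reaches for coffee
print(posterior)  # mass shifts toward "likes_coffee"
```

If the assumed likelihood is wrong (humans aren't Boltzmann-rational in the hypothesized reward space), the posterior converges confidently to the wrong values, which is why the "uncertainty" framing doesn't buy safety by itself.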

One of the big things CIRL was claimed to have going for it is that the agent's uncertainty about the true reward function would make it deferential, and hence more corrigible (it would let you shut it down, for example). This doesn't seem to hold up: a CIRL agent would probably stop treating you as a source of new information once it had learned enough from you, and at that point it would stop being deferential.
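A toy calculation in the spirit of Hadfield-Menell et al.'s off-switch game shows the dynamic. All the numbers below, and the two-point utility model, are illustrative assumptions of mine, not anyone's published design: the agent treats a shutdown press as evidence its action is bad, but once its prior confidence gets high enough, the press is explained away as human error and the agent proceeds anyway.

```python
# Toy off-switch model (my own illustrative numbers). The agent proposes an
# action whose true utility is either +1 or -1, and the human presses the
# off switch. Under a noisy-but-rational human model, the press is evidence
# the action is bad -- but only while the agent is still uncertain.

P_PRESS_IF_BAD = 0.9   # assumed: human usually presses when the action is bad
P_PRESS_IF_GOOD = 0.1  # assumed: human rarely presses when the action is good


def expected_utility_after_press(p_good: float) -> float:
    """E[utility of proceeding | off switch pressed], given prior P(good)."""
    unnorm_good = p_good * P_PRESS_IF_GOOD
    unnorm_bad = (1.0 - p_good) * P_PRESS_IF_BAD
    post_good = unnorm_good / (unnorm_good + unnorm_bad)
    return post_good * (+1.0) + (1.0 - post_good) * (-1.0)


for p_good in (0.5, 0.7, 0.9, 0.95, 0.99):
    eu = expected_utility_after_press(p_good)
    # Shutting down has utility 0, so the agent complies only if eu < 0.
    outcome = "shuts down" if eu < 0 else "ignores the press"
    print(f"P(good)={p_good:.2f}  E[u|press]={eu:+.3f}  -> {outcome}")
```

At low confidence the agent shuts down when asked; past roughly P(good) = 0.9 under these made-up numbers, it ignores the off switch. The deference was never a stable property of the agent, just a transient artifact of its early uncertainty.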