Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Inverse reinforcement learning is the challenge of constructing a value system that "explains" the behaviour of another agent. Part of the idea is to have algorithms deduce human values from human behaviour.

It struck me that this could be used by an agent on itself. Imagine we had a diamond-maximising agent who believed something like classical Greek "science" and acted to accumulate as many of these shiny crystals as possible. Then they undergo an ontology change and learn quantum physics, which completely messes up their view of what a "diamond" is.

However, what if they replayed their previous behaviour, and tried to deduce what possible utility function, in a quantum world, could explain what they had done? They would be trying to fit a quantum-world-aware utility to the decisions of a non-quantum-world-aware being.
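To make the replay-and-refit step concrete, here is a minimal toy sketch (not from the original post). It assumes the agent's logged decisions are (available options, chosen option) pairs, that candidate utilities are linear in hand-chosen features of the new ontology, and that the old agent's choices are modelled as Boltzmann-rational; every name in it (`new_ontology_features`, the fake data, etc.) is a hypothetical stand-in.

```python
# Toy sketch: fit a "new-ontology" utility to decisions logged under the old ontology.
# Hypothetical setup: each logged decision is (available_options, chosen_index), and each
# option is re-described by features of the NEW ontology (e.g. quantum-level properties
# of whatever the old agent called a "diamond").
import numpy as np

rng = np.random.default_rng(0)

def new_ontology_features(option):
    # Placeholder featurisation: in a real system this would map an old-world option
    # onto quantities defined in the new (quantum) ontology.
    return np.asarray(option, dtype=float)

def fit_utility(decisions, n_features, steps=2000, lr=0.1):
    """Maximum-likelihood fit of a linear utility under a Boltzmann (softmax) choice model."""
    w = np.zeros(n_features)
    for _ in range(steps):
        grad = np.zeros(n_features)
        for options, chosen in decisions:
            phi = np.stack([new_ontology_features(o) for o in options])
            logits = phi @ w
            p = np.exp(logits - logits.max())
            p /= p.sum()
            # Gradient of log P(chosen): features of the choice minus expected features.
            grad += phi[chosen] - p @ phi
        w += lr * grad / len(decisions)
    return w

# Fake "replayed" behaviour: the old agent always picked the option with the most of
# feature 0 (its old notion of "diamond-ness"); feature 1 was irrelevant to it.
decisions = []
for _ in range(200):
    options = rng.normal(size=(3, 2))
    decisions.append((options, int(np.argmax(options[:, 0]))))

w = fit_utility(decisions, n_features=2)
print("inferred utility weights in the new ontology:", w)
```

If most of the recovered weight lands on whatever new-ontology features best track the old "diamond" concept, that is the hoped-for extension of the original motivation; if many different weight vectors explain the replayed behaviour equally well, the fit is underdetermined in exactly the way the challenges below describe.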

This could possibly result in a useful extension of the original motivation to the new setup (at least, it would guarantee similar behaviour in similar circumstances). There are many challenges - most especially that a quantum-aware being has far more knowledge about how to affect the world, and thus far more options - but they seem to be the usual sort of inverse reinforcement learning challenges (partial knowledge, noise, etc.).


2 comments

It seems like this is moving the complexity of the ontology mapping problem into the behavior model. In order to explain the agent's strangely correlated errors, the behavior model will probably need to say something about the agent's original ontology and how that relates to their goals in the new ontology.

I'm hoping there would be at least some gain that is an extension of the old preferences and not just a direct translation into the old ontology.