A putative new idea for AI control; index here.
A simple way of thinking that I feel clarifies a lot of issues (related to the Blue-Minimising Robot):
Suppose you have an entity that follows algorithm $Q$. Then define: the entity *wants* $R$ if $Q$ can be interpreted as optimising $R$.
The interpretation part of *wants* is crucial, but it is often obscured in practice in value learning. That's because we often start with statements like "$H$ is a boundedly rational agent that maximises $R$...", or we lay out the agent in such a way that this is clearly the case.
What we're doing there is writing the entity as $P(R)$: an algorithm $P$ with a special variable $R$ that tracks what the entity wants. In the case of cooperative inverse reinforcement learning, this is explicit, as the human's values are given by a reward parameter $\theta$, known to the human. Thus the human's true algorithm is $P$, the human observes $\theta$, meaning that $\theta$ is an objective fact about the universe. And then the human follows $P(\theta)$.
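To make the framing concrete, here is a toy sketch (nothing in it comes from the CIRL paper itself; the action set, the softmax planner, and all the names are illustrative assumptions): a one-variable algorithm $P(\theta)$, and the zero-variable algorithm you get by plugging in a particular $\theta$.

```python
import math
import random

ACTIONS = ["left", "right", "stay"]  # illustrative action set

def planner(theta):
    """One-variable algorithm P(theta): a boundedly rational (softmax)
    chooser whose values are tracked by the explicit variable theta."""
    def policy():
        weights = [math.exp(theta[a]) for a in ACTIONS]
        return random.choices(ACTIONS, weights)[0]
    return policy

# The human's values: in the CIRL framing, an objective fact about the
# universe that the human observes directly.
theta_human = {"left": 0.0, "right": 2.0, "stay": -1.0}

# The zero-variable algorithm: P with theta already plugged in.
human = planner(theta_human)

print(human())  # pure behaviour; theta no longer appears as a variable
```

The values variable is explicit inside `planner`, but once it has been plugged in, `human` is just behaviour: nothing in it is labelled as "what the entity wants".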
Note here that knowing what the human is in the one-variable sense (i.e. knowing $P$) helps with correctly deducing what they want, while simply knowing the joint algorithm $P(\theta)$, as a single composed object, does not.
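Continuing the toy sketch above (and reusing `ACTIONS`, `planner`, `human`, and `theta_human` from it, all of which are my own illustrative assumptions), this is roughly what the deduction looks like: if you know $P$, you can score candidate values by how well $P(\theta)$ reproduces the observed behaviour; if all you have is the composed behaviour, there is no such handle to grab.

```python
from collections import Counter

def infer_theta(behaviour, known_planner, candidates, n=10_000):
    """Given knowledge of the planner P, score each candidate theta by how
    closely P(theta) reproduces the observed action frequencies."""
    observed = Counter(behaviour() for _ in range(n))

    def mismatch(theta):
        policy = known_planner(theta)
        simulated = Counter(policy() for _ in range(n))
        return sum(abs(observed[a] - simulated[a]) for a in ACTIONS)

    return min(candidates, key=mismatch)

candidates = [
    {"left": 2.0, "right": 0.0, "stay": -1.0},
    {"left": 0.0, "right": 2.0, "stay": -1.0},  # the true theta_human
]
print(infer_theta(human, planner, candidates))  # should recover the second candidate
```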
In contrast, an interpretation starts with a zero-variable algorithm and attempts to construct a one-variable one. Therefore, given $Q$, it constructs (one or more) $P$ and $R$ such that

$$P(R) = Q.$$
This illustrates the crucial role of interpretation, especially if $Q$ is highly complex.
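One way to see why the interpretation step does real work, sketched as a continuation of the toy example above (again purely illustrative, reusing `math`, `ACTIONS`, and `theta_human`): the same zero-variable behaviour admits several $(P, R)$ decompositions that attribute very different values to the entity, so something has to choose between them.

```python
def softmax_dist(theta, rationality):
    """Action distribution of a noisily rational chooser: rationality > 0
    tends to maximise theta, rationality < 0 tends to minimise it."""
    weights = {a: math.exp(rationality * theta[a]) for a in ACTIONS}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

flipped_theta = {a: -v for a, v in theta_human.items()}

# Two (P, R) readings of the same behaviour:
print(softmax_dist(theta_human, +1.0))    # a noisy maximiser of theta_human
print(softmax_dist(flipped_theta, -1.0))  # an anti-rational agent that 'wants' flipped_theta
# The two distributions are identical, yet they attribute opposite values
# to the entity; choosing between them is exactly the interpretation step.
```

The behaviour alone doesn't distinguish a noisy maximiser of $\theta$ from an anti-rational minimiser of $-\theta$; the interpretation is what breaks the tie.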