Ambitious value learning aims to achieve superhuman performance by figuring out the underlying latent "values" that humans have, and evaluating new situations according to these values. In other words, it is trying to infer the criteria by which we judge situations to be good. This is particularly hard because in novel situations that humans haven't seen yet, we haven't even developed the criteria by which we would evaluate them. (This is one of the reasons why we need to model humans as suboptimal, which causes problems.)
Instead of this, we can use narrow value learning, which produces behavior that we want in some narrow domain, without expecting generalization to novel circumstances. The simplest form of this is imitation learning, where the AI system simply tries to imitate the supervisor's behavior. This limits the AI’s performance to that of its supervisor. We could also learn from preferences over behavior, which can scale to superhuman performance, since the supervisor can often evaluate whether a particular behavior meets our preferences even if she can’t perform it herself. We could also teach our AI systems to perform tasks that we would not want to do ourselves, such as handling hot objects.
Nearly all of the work on preference learning, including most work on inverse reinforcement learning (IRL), is aimed at narrow value learning. IRL is often explicitly stated to be a technique for imitation learning, and early algorithms phrase the problem as matching the feature counts of the demonstrations, not as exceeding the demonstrator's performance. The few algorithms that try to generalize to different test distributions, such as AIRL, are only aiming for relatively small amounts of generalization.
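One common way to make "matching the features" precise is the feature-expectation condition used in apprenticeship-learning-style IRL. The block below is a sketch under an assumption the post does not state: that the reward is linear in some state features, $R(s) = w^{\top}\phi(s)$. The symbols $\phi$, $w$, $\gamma$, $\mu_E$ (the demonstrator's feature expectations) and $\epsilon$ are notation chosen here for illustration.

```latex
\mu(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\,\phi(s_t) \,\middle|\, \pi\right],
\qquad
\left\lVert \mu(\pi) - \mu_E \right\rVert_2 \le \epsilon
\;\Longrightarrow\;
\left| w^{\top}\mu(\pi) - w^{\top}\mu_E \right| \le \epsilon
\quad \text{for all } \lVert w \rVert_2 \le 1 .
```

The implication follows from the Cauchy-Schwarz inequality. The bound is two-sided: it says the learned policy's value is close to the demonstrator's under any such reward, and nothing in the objective pushes the policy to do better than the demonstrator.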
(Why use IRL instead of behavioral cloning, where you mimic the actions that the demonstrator took? The hope is that IRL gives you a good inductive bias for imitation, allowing you to be more sample efficient and to generalize a little bit.)
You might have noticed that I talk about narrow value learning in terms of actual observed behavior from the AI system, as opposed to any sort of “preferences” or “values” that are inferred. This is because I want to include approaches like imitation learning, or meta-learning for quick task identification and performance. These approaches can produce behavior that we want without having an explicit representation of “preferences”. In practice, any method that scales to human intelligence is going to have to infer preferences, though perhaps implicitly.
Since any instance of narrow value learning is defined with respect to some domain or input distribution on which it gives sensible results, we can rank them according to how general this input distribution is. An algorithm that figures out what food I like to eat is very domain-specific, whereas one that determines my life goals and successfully helps me achieve them in both the long and short term is very general. When the input distribution is “all possible inputs”, we have a system that has good behavior everywhere, reminiscent of ambitious value learning.
(Annoyingly, I defined ambitious value learning to be about the definition of optimal behavior, such as an inferred utility function, while narrow value learning is about the observed behavior. So really the most general version of narrow value learning is equivalent to “ambitious value learning plus some method of actually obtaining the defined behavior in practice, such as by using deep RL”.)
IRL can also produce different actions at equilibrium (given finite capacity); it's not merely an inductive bias.
E.g. suppose the human does X half the time and Y half the time, and the agent can predict the details of X but not Y. Behavioral cloning then does X half the time, and half the time does some crazy thing where it's trying to predict Y but can't. IRL will just learn that it can get OK reward by outputting X (since otherwise the human wouldn't do it) and will avoid trying to do things it can't predict.
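A minimal sketch of the X/Y example in Python, to make the arithmetic concrete. Everything here is a toy assumption rather than a real IRL algorithm: the success probabilities, the reward values, and the helper names (`expected_reward`, `behavioral_cloning_policy`, `reward_maximizing_policy`) are all hypothetical. The "IRL" agent is modeled as having already inferred that both X and Y are fine when done correctly, and then picking whatever it can actually execute.

```python
# Toy numbers, all assumed for illustration.
P_HUMAN = {"X": 0.5, "Y": 0.5}          # how often the human does each behavior
P_AGENT_SUCCESS = {"X": 1.0, "Y": 0.1}  # chance the agent reproduces it correctly
REWARD_IF_DONE_RIGHT = {"X": 1.0, "Y": 1.0}  # assumed inferred reward: the human does both
REWARD_IF_BOTCHED = 0.0                 # assumed reward for a failed attempt


def expected_reward(policy):
    """Expected reward of a policy given as {behavior: probability of attempting it}."""
    total = 0.0
    for behavior, p_attempt in policy.items():
        p_ok = P_AGENT_SUCCESS[behavior]
        value = p_ok * REWARD_IF_DONE_RIGHT[behavior] + (1 - p_ok) * REWARD_IF_BOTCHED
        total += p_attempt * value
    return total


def behavioral_cloning_policy():
    """Copy the human's mixture over behaviors, including the one the agent cannot do."""
    return dict(P_HUMAN)


def reward_maximizing_policy():
    """Always attempt the single behavior with the highest expected reward for this agent."""
    best = max(P_AGENT_SUCCESS, key=lambda b: expected_reward({b: 1.0}))
    return {b: (1.0 if b == best else 0.0) for b in P_HUMAN}


if __name__ == "__main__":
    print("behavioral cloning:", expected_reward(behavioral_cloning_policy()))
    print("reward maximizing: ", expected_reward(reward_maximizing_policy()))
```

With these assumed numbers, the cloned policy spends half its time on failed attempts at Y, while the reward-maximizing policy always does X. The gap comes from the agent's limited capacity, not from how much data it has seen.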
To head off a possible confusion come tomorrow, it seems like your definition of "narrow value learning" is a bit different from Paul's. You define it as learning to produce desired behavior in some domain, while Paul defined it as learning instrumental goals and values. I think this means that under your definition, behavioral cloning and approval-directed agents are subsets of narrow value learning, whereas under Paul's definition they are disjoint from narrow value learning. Does this seem right to you, and if so was this overloading of the term intentional?
Hmm, I agree that Paul's definition is different from mine, but it feels to me like they are both pointing at the same thing.
I'm not sure that they are disjoint under Paul's definition. I would have included them, because sufficiently good behavioral cloning/approval-directed agents would need to learn instrumental goals and values in order to work effectively in a domain.
It was intentional, in that I thought that these were different ways of pointing at the same thing.