Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Narrow value learning is a huge field that people are already working on (though not by that name) and I can’t possibly do it justice. This post is primarily a list of things that I think are important and interesting, rather than an exhaustive list of directions to pursue. (In contrast, the corresponding post for ambitious value learning did aim to be exhaustive, and I don’t think I missed much work there.)

You might think that since so many people are already working on narrow value learning, we should focus on more neglected areas of AI safety. However, I still think it’s worth working on because long-term safety suggests a particular subset of problems to focus on; that subset seems quite neglected.

For example, a lot of work is about improving current algorithms in a particular domain, with solutions that encode domain knowledge in order to succeed; this is not very relevant to long-term concerns. Some work assumes that a handcoded featurization is given (so that the true reward is linear in the features), which is not an assumption we could make for more powerful AI systems.

I will speculate a bit on the neglectedness and feasibility of each of these areas, since for many of them there isn’t a person or research group championing them to whom I could defer about the arguments for success.

The big picture

This category of research is about how you could take narrow value learning algorithms and use them to create an aligned AI system. Typically, I expect this to work by having the narrow value learning enable some form of corrigibility.

As far as I can tell, nobody outside of the AI safety community works on this problem. While it is far too early to stake a confident position one way or the other, I am slightly less optimistic about this avenue of approach than one in which we create a system that is directly trained to be corrigible.

Avoiding problems with goal-directedness. How do we put together narrow value learning techniques in a way that doesn’t lead to the AI behaving like a goal-directed agent at each point? This is the problem with keeping a reward estimate that is updated over time: at any given moment, the AI is optimizing its current estimate, just as a goal-directed agent would. While reward uncertainty can help avoid some of the resulting problems, it does not seem sufficient by itself. Are there other ideas that can help?
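As a toy illustration (a minimal sketch with made-up candidate rewards and a made-up deferral threshold, not a proposal from this post), compare an agent that always optimizes its current point estimate of the reward with one that keeps a posterior and defers to the human when its hypotheses disagree too much about the best action:

```python
import numpy as np

# Three made-up candidate reward functions over four actions, plus a posterior over them.
candidate_rewards = np.array([
    [1.0, 0.0, 0.5, 0.2],   # hypothesis A
    [0.1, 1.0, 0.4, 0.2],   # hypothesis B
    [0.2, 0.1, 0.9, 0.8],   # hypothesis C
])
posterior = np.array([0.4, 0.35, 0.25])

def act_point_estimate(posterior, candidate_rewards):
    # Goal-directed at each step: commit to the MAP reward estimate and maximize it.
    map_hypothesis = np.argmax(posterior)
    return int(np.argmax(candidate_rewards[map_hypothesis]))

def act_with_uncertainty(posterior, candidate_rewards, defer_threshold=0.9):
    # Maximize expected reward under the posterior, but defer to the human when no
    # single action is near-optimal under every hypothesis with non-trivial probability.
    expected = posterior @ candidate_rewards
    best = int(np.argmax(expected))
    regrets = candidate_rewards.max(axis=1) - candidate_rewards[:, best]
    if np.max(regrets[posterior > 0.05]) > defer_threshold:
        return "ask the human"
    return best

print(act_point_estimate(posterior, candidate_rewards))    # 0
print(act_with_uncertainty(posterior, candidate_rewards))  # 2
```

The second agent avoids some failures of the first, but it is still optimizing (an expectation over) a reward at every step, which is one way of seeing why uncertainty alone does not remove goal-directedness.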

Dealing with the difficulty of “human values”. Cooperative IRL makes the unrealistic assumption that the human knows her reward function exactly. How can we make narrow value learning systems that deal with this issue? In particular, what prevents them from updating on our behavior that’s not in line with our “true values”, while still letting them update on other behavior? Perhaps we could make an AI system that is always uncertain about what the true reward is, but how does this mesh with epistemics, which suggest that you can get to arbitrarily high confidence given sufficient evidence?
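To see why this is a real tension rather than a vague worry: under a standard Bayesian treatment with a fixed observation model, confidence in one reward hypothesis over another grows without bound as evidence accumulates. For independent observations $d_1, \dots, d_n$,

$$\log \frac{P(\theta_1 \mid d_{1:n})}{P(\theta_2 \mid d_{1:n})} = \log \frac{P(\theta_1)}{P(\theta_2)} + \sum_{i=1}^{n} \log \frac{P(d_i \mid \theta_1)}{P(d_i \mid \theta_2)},$$

and the sum grows roughly linearly in $n$ (in expectation, at a rate given by a KL divergence) whenever the data actually favors $\theta_1$. Any scheme that keeps the AI permanently uncertain about the true reward has to depart from this picture somewhere.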

Human-AI interaction

This area of research aims to figure out how to create human-AI systems that successfully accomplish tasks. For sufficiently complex tasks and sufficiently powerful AI, this overlaps with the big-picture concerns above, but there is also work to be done with subhuman AI, with an eye towards more powerful systems.

Assumptions about the human. In any feedback system, how the AI updates on human feedback depends on the assumptions it makes about the human. In Inverse Reward Design (IRD), the AI system assumes that the reward function provided by a human designer leads to near-optimal behavior in the training environment, but may be arbitrarily bad in other environments. In IRL, the typical assumption is that the demonstrations are produced by a human behaving Boltzmann-rationally, though recent research aims to also correct for any suboptimalities the human might have, and so no longer assumes away the problem of systematic biases. (See also the discussion in Future directions for ambitious value learning.) In Cooperative IRL, the AI system assumes that the human models the AI system as approximately rational. COACH notes that when you ask a human to provide a reward signal, they provide a critique of current behavior rather than a reward signal that can be maximized.
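To make the role of these assumptions concrete, here are (roughly stated) the observation models behind two of the examples above. Boltzmann rationality treats a demonstrated action as noisily optimal for the true reward parameters $\theta$:

$$P(a \mid s, \theta) \propto \exp\big(\beta \, Q^*_\theta(s, a)\big),$$

while IRD treats the designer’s proxy reward $\tilde{r}$ as likely to the extent that optimizing it in the training environment $\tilde{M}$ produces behavior that is good according to the true reward $r^*$:

$$P(\tilde{r} \mid r^*, \tilde{M}) \propto \exp\Big(\beta \, \mathbb{E}\big[r^*(\xi) \mid \xi \sim \pi_{\tilde{r}, \tilde{M}}\big]\Big).$$

Everything the AI infers about the true reward is filtered through a model like this, which is why the choice of assumption matters so much.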

Can we weaken the assumptions that we have to make, or get rid of them altogether? Barring that, can we make our assumptions more realistic?

Managing interaction. How should the AI system manage its interaction with the human to learn best? This is the domain of active learning, which is far too large a field for me to summarize here. I’ll throw in a link to Active Inverse Reward Design, because I already talked about IRD and I helped write the active variant.
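Here is a generic active-learning sketch (not the algorithm from Active Inverse Reward Design; the hypotheses, answer models, and numbers are all made up): pick the query whose answer is expected to shrink the posterior over reward hypotheses the most.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def expected_info_gain(posterior, answer_likelihoods):
    # answer_likelihoods[h, a] = P(human gives answer a | reward hypothesis h)
    p_answer = posterior @ answer_likelihoods                   # marginal over answers
    gain = entropy(posterior)
    for a, pa in enumerate(p_answer):
        if pa > 0:
            post_a = posterior * answer_likelihoods[:, a] / pa  # Bayes update for this answer
            gain -= pa * entropy(post_a)
    return gain

def choose_query(posterior, queries):
    # queries: one answer-likelihood matrix per possible question to ask the human
    return int(np.argmax([expected_info_gain(posterior, q) for q in queries]))

# Illustrative: two reward hypotheses, two candidate binary-answer queries.
posterior = np.array([0.5, 0.5])
uninformative = np.array([[0.5, 0.5], [0.5, 0.5]])  # answer is independent of the hypothesis
informative   = np.array([[0.9, 0.1], [0.2, 0.8]])  # answer correlates with the hypothesis
print(choose_query(posterior, [uninformative, informative]))    # -> 1
```

The hard part in practice is not this computation but building a good model of how the human answers, which is the same “assumptions about the human” problem discussed above.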

Human policy. The utility of a feedback system is going to depend strongly on the quality of the feedback given by the human. How do we train humans so that their feedback is most useful for the AI system? So far, most work is about how to adapt AI systems to understand humans better, but it seems likely there are also gains to be had by having humans adapt to AI systems.

Finding and using preference information

New sources of data. So far, preferences are typically learned from demonstrations, comparisons, or rankings, but there are likely other useful ways to elicit them. Inverse Reward Design gets preferences from a stated proxy reward function. Another obvious source is what people say, but natural language is notoriously hard to work with, so not much work has been done on it so far, though there is some. (I’m pretty sure there’s a lot more in the NLP community that I’m not yet aware of.) We recently showed that there is even preference information in the state of the world that can be extracted.

Handling multiple sources of data. We could infer preferences from behavior, from speech, from given reward functions, from the state of the world, etc. but it seems quite likely that the inferred preferences would conflict with each other. What do you do in these cases? Is there a way to infer preferences simultaneously from all the sources of data such that the problem does not arise? (And if so, what is the algorithm implicitly doing in cases where different data sources pull in different directions?)

Acknowledging Human Preference Types to Support Value Learning talks about this problem and suggests some aggregation rules but doesn’t test them. Reward Learning from Narrated Demonstrations learns from both speech and demonstrations, but they are used as complements to each other, not as different sources for the same information that could conflict.

I’m particularly excited about this line of research -- it seems like it hasn’t been explored yet and there are things that can be done, especially if you allow yourself to simply detect conflicts, present the conflict to the user, and then trust their answer. (Though this wouldn’t scale to superintelligent AI.)
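As a minimal sketch of that conflict-detection idea (a discrete hypothesis set, made-up likelihoods, and a deliberately crude notion of “conflict”; none of this comes from the papers above), one could combine per-source likelihoods into a joint posterior and flag cases where individually confident sources point at different hypotheses:

```python
import numpy as np

def combine_sources(prior, source_likelihoods):
    # source_likelihoods[k][h] = P(data from source k | reward hypothesis h).
    # Assumes sources are conditionally independent given the hypothesis, which is
    # itself a strong (and possibly wrong) modeling assumption.
    posterior = prior.copy()
    for lik in source_likelihoods:
        posterior = posterior * lik
    return posterior / posterior.sum()

def detect_conflict(prior, source_likelihoods, strength=5.0):
    # Flag a conflict if two sources each strongly favor different hypotheses,
    # in which case we would present the disagreement to the user.
    favorites = []
    for lik in source_likelihoods:
        post = prior * lik
        post = post / post.sum()
        best = int(np.argmax(post))
        second = np.partition(post, -2)[-2]
        if post[best] / second > strength:
            favorites.append(best)
    return len(set(favorites)) > 1

prior = np.ones(3) / 3
behavior_lik = np.array([0.80, 0.10, 0.10])  # demonstrations favor hypothesis 0
speech_lik   = np.array([0.05, 0.90, 0.05])  # stated preferences favor hypothesis 1
print(combine_sources(prior, [behavior_lik, speech_lik]))
print(detect_conflict(prior, [behavior_lik, speech_lik]))  # -> True
```

The interesting research questions are what the per-source likelihoods should actually be, and what to do when the human’s resolution of a flagged conflict is itself unreliable.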

Generalization. Current deep IRL algorithms (or deep anything algorithms) do not generalize well. How can we infer reward functions that transfer well to different environments? Adversarial IRL is an example of work pushing in this direction, but my understanding is that it had limited success. I’m less optimistic about this avenue of research because, in general, function approximators do not extrapolate well. On the other hand, I and everyone else have the strong intuition that a reward function should take fewer bits to specify than the full policy, and so should be easier to infer (though this intuition is not based on a Kolmogorov complexity argument).
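To illustrate what “infer the reward, then re-optimize” buys you in the simplest possible setting (a made-up bandit-style example, not any of the algorithms above): a linear reward over features can be reused in a new environment by re-planning, whereas naively reusing the old policy need not make sense there.

```python
import numpy as np

# Each environment offers a few options, described by feature vectors.
# The features and the "learned" reward weights below are made up for illustration.
train_env_features = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
test_env_features  = np.array([[0.0, 1.0], [0.2, 0.2], [1.0, 0.0]])
theta = np.array([1.0, 0.1])   # reward weights inferred in the training environment

def best_option(features, theta):
    # Re-plan: pick the option with the highest inferred reward in this environment.
    return int(np.argmax(features @ theta))

train_choice = best_option(train_env_features, theta)                       # option 0
# Naive policy transfer: pick the same option index in the test environment.
print((test_env_features @ theta)[train_choice])                            # 0.1
# Reward transfer: re-optimize the inferred reward in the test environment.
print((test_env_features @ theta)[best_option(test_env_features, theta)])   # 1.0
```

Whether learned reward functions actually generalize this way when represented by neural networks is exactly the open question; the example only shows why the decomposition is appealing.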

Comments

I am slightly less optimistic about this avenue of approach than one in which we create a system that is directly trained to be corrigible.

I'm confused about the difference between these two. Does "directly trained to be corrigible" correspond to hand-coded rules for corrigible/incorrigible behavior?

(Though this wouldn’t scale to superintelligent AI.)

Why's that? Some related thinking of mine.

I'm confused about the difference between these two. Does "directly trained to be corrigible" correspond to hand-coded rules for corrigible/incorrigible behavior?

"Directly trained to be corrigible" could involve hardcoding a "core of corrigible reasoning", or imitating a human overseer who is trained to show corrigible behavior (which is my story for how iterated amplification can hope to be corrigible).

In contrast, with narrow value learning, we hope to say something like "learn the narrow values of the overseer and optimize them" (perhaps by writing down a narrow value learning algorithm to be executed), and we hope that this leads to corrigible behavior. Since "corrigible" means something different from narrow value learning (in particular, it means "is trying to help the overseer"), we are hoping to create a corrigible agent by doing something that is not-exactly-corrigibility, which is why I call it "indirect" corrigibility.

Why's that?

It seems likely that there will be contradictions in human preferences that are sufficiently difficult for humans to understand that the AI system can't simply present the contradiction to the human and expect the human to resolve it correctly, which is what I was proposing in the previous sentence.

It seems likely that there will be contradictions in human preferences that are sufficiently difficult for humans to understand that the AI system can't simply present the contradiction to the human and expect the human to resolve it correctly, which is what I was proposing in the previous sentence.

How relevant do you expect this to be? It seems like the system could act pessimistically, under the assumption that either answer might be the correct way to resolve the contradiction, and only do actions that are in the intersection of the set of actions that each possible philosophy says is OK. Also, I'm not sure the overseer needs to think directly in terms of some uber-complicated model of the overseer's preferences that the system has; couldn't you make use of active learning and ask whether specific actions would be corrigible or incorrigible, without the system trying to explain the complex confusion it is trying to resolve?

How relevant do you expect this to be? It seems like the system could act pessimistically, under the assumption that either answer might be the correct way to resolve the contradiction, and only do actions that are in the intersection of the set of actions that each possible philosophy says is OK.

It seems plausible that this could be sufficient; I didn't intend to rule out that possibility. I do think that we eventually want to resolve such contradictions, or have some method for dealing with them, or else we are stuck without making much progress (since I expect that creating very different conditions, e.g. through space colonization, will take humans "off-distribution", leading to lots of contradictions that could be very difficult to resolve).

I'm not sure the overseer needs to think directly in terms of some uber-complicated model of the overseer's preferences that the system has; couldn't you make use of active learning and ask whether specific actions would be corrigible or incorrigible, without the system trying to explain the complex confusion it is trying to resolve?

I didn't mean that the complexity/confusion arises in the model of the overseer's preferences. Even specific actions can be hard to evaluate -- you need to understand the (agent's expectation of the) long-term outcomes of that action, and then to evaluate whether those long-term outcomes are good (which could be very challenging, if the future is quite different from the present). Or alternatively, you need to evaluate whether the agent believes those outcomes are good for the overseer.