Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Posted as part of the AI Alignment Forum sequence on Value Learning.

Rohin’s note: In this post (original here), Paul Christiano analyzes the ambitious value learning approach. He considers a more general view of ambitious value learning where you infer preferences more generally (i.e. not necessarily in the form of a utility function), and you can ask the user about their preferences, but it’s fine to imagine that you infer a utility function from data and then optimize it. The key takeaway is that in order to infer preferences that can lead to superhuman performance, it is necessary to understand how humans are biased, which seems very hard to do even with infinite data.

One approach to the AI control problem goes like this:

  1. Observe what the user of the system says and does.
  2. Infer the user’s preferences.
  3. Try to make the world better according to the user’s preferences, perhaps while working alongside the user and asking clarifying questions.

This approach has the major advantage that we can begin empirical work today — we can actually build systems which observe user behavior, try to figure out what the user wants, and then help with that. There are many applications that people care about already, and we can set to work on making rich toy models.

It seems great to develop these capabilities in parallel with other AI progress, and to address whatever difficulties actually arise, as they arise. That is, in each domain where AI can act effectively, we’d like to ensure that AI can also act effectively in the service of goals inferred from users (and that this inference is good enough to support foreseeable applications).

This approach gives us a nice, concrete model of each difficulty we are trying to address. It also provides a relatively clear indicator of whether our ability to control AI lags behind our ability to build it. And by being technically interesting and economically meaningful now, it can help actually integrate AI control with AI practice.

Overall I think that this is a particularly promising angle on the AI safety problem.

Modeling imperfection

That said, I think that this approach rests on an optimistic assumption: that it’s possible to model a human as an imperfect rational agent, and to extract the real values which the human is imperfectly optimizing. Without this assumption, it seems like some additional ideas are necessary.

To isolate this challenge, we can consider a vast simplification of the goal inference problem:

The easy goal inference problem: Given no algorithmic limitations and access to the complete human policy — a lookup table of what a human would do after making any sequence of observations — find any reasonable representation of any reasonable approximation to what that human wants.

I think that this problem remains wide open, and that we’ve made very little headway on the general case. We can make the problem even easier, by considering a human in a simple toy universe making relatively simple decisions, but it still leaves us with a very tough problem.

It’s not clear to me whether or exactly how progress in AI will make this problem easier. I can certainly see how enough progress in cognitive science might yield an answer, but it seems much more likely that it will instead tell us “Your question wasn’t well defined.” What do we do then?

I am especially interested in this problem because I think that “business as usual” progress in AI will probably lead to the ability to predict human behavior relatively well, and to emulate the performance of experts. So I really care about the residual — what do we need to know to address AI control, beyond what we need to know to build AI?

Narrow domains

We can solve the very easy goal inference problem in sufficiently narrow domains, where humans can behave approximately rationally and a simple error model is approximately right. So far this has been good enough.

But in the long run, humans make many decisions whose consequences aren’t confined to a simple domain. This approach can work for driving from point A to point B, but probably can’t work for designing a city, running a company, or setting good policies.

There may be an approach which uses inverse reinforcement learning in simple domains as a building block in order to solve the whole AI control problem. Maybe it’s not even a terribly complicated approach. But it’s not a trivial problem, and I don’t think it can be dismissed easily without some new ideas.

Modeling “mistakes” is fundamental

If we want to perform a task as well as an expert, inverse reinforcement learning is clearly a powerful approach.

But in the long term, many important applications require AIs to make decisions which are better than those of available human experts. This is part of the promise of AI, and it is the scenario in which AI control becomes most challenging.

In this context, we can’t use the usual paradigm — “more accurate models are better.” A perfectly accurate model will take us exactly to human mimicry and no farther.

The possible extra oomph of inverse reinforcement learning comes from an explicit model of the human’s mistakes or bounded rationality. It’s what specifies what the AI should do differently in order to be “smarter,” what parts of the human’s policy it should throw out. So it implicitly specifies which of the human behaviors the AI should keep. The error model isn’t an afterthought — it’s the main affair.
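As a toy illustration of why the error model does the real work, here is a minimal sketch under assumptions that are not in the post: the human's behavior is summarized as choice frequencies over three hypothetical actions, and the two "error models" are deliberately crude stand-ins. The same behavior leads to different "improved" policies depending on which parts of it we decide are mistakes.

```python
# Hypothetical observed choice frequencies for one situation (illustrative only).
human_policy = {"A": 0.7, "B": 0.2, "C": 0.1}

def mimic(policy):
    """A perfectly accurate model of behavior just reproduces the human policy."""
    return dict(policy)

def improve_assuming_noise(policy):
    """Error model 1 (assumed): deviations from the most frequent action are mistakes."""
    best = max(policy, key=policy.get)
    return {a: 1.0 if a == best else 0.0 for a in policy}

def improve_assuming_habit(policy, habitual):
    """Error model 2 (assumed): the habitual action is a bias to throw out;
    the remaining choices reflect the real preference."""
    kept = {a: p for a, p in policy.items() if a != habitual}
    best = max(kept, key=kept.get)
    return {a: 1.0 if a == best else 0.0 for a in policy}

print(mimic(human_policy))
print(improve_assuming_noise(human_policy))
print(improve_assuming_habit(human_policy, "A"))
```

Neither error model is more "accurate" as a predictor of the observed frequencies. They differ only in which behaviors they keep and which they discard, and that choice determines what the AI ends up doing.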

Modeling “mistakes” is hard

Existing error models for inverse reinforcement learning tend to be very simple, ranging from Gaussian noise in observations of the expert’s behavior or sensor readings, to the assumption that the expert’s choices are randomized with a bias towards better actions.
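For concreteness, the second kind of error model (choices randomized with a bias towards better actions) is often written as a softmax over rewards. The sketch below is a minimal version under stated assumptions: the candidate reward vectors, the observed actions, and the rationality parameter `beta` are all hypothetical, and the prior over candidates is uniform.

```python
import math

def boltzmann_policy(rewards, beta):
    """P(action) proportional to exp(beta * reward); beta controls how reliably
    the modeled expert picks better actions."""
    m = max(rewards)
    weights = [math.exp(beta * (r - m)) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def posterior_over_rewards(candidates, observed_actions, beta):
    """Posterior over candidate reward vectors given observed actions,
    starting from a uniform prior."""
    log_post = []
    for rewards in candidates:
        probs = boltzmann_policy(rewards, beta)
        log_post.append(sum(math.log(probs[a]) for a in observed_actions))
    m = max(log_post)
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical setup: three actions, two candidate reward vectors.
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
observed = [0, 0, 1, 0, 0]  # the expert mostly picks action 0
post = posterior_over_rewards(candidates, observed, beta=2.0)
print(post)
```

Under this model the occasional choice of action 1 is explained away as noise, and most of the posterior mass lands on the first candidate. The model has no way to represent a systematic bias: every deviation is treated as random.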

In fact humans are not rational agents with some noise on top. Our decisions are the product of a complicated mess of interacting processes, optimized by evolution for the reproduction of our children’s children. It’s not clear there is any good answer to what a “perfect” human would do. If you were to find any principled answer to “what is the human brain optimizing?” the single most likely bet is probably something like “reproductive success.” But this isn’t the answer we are looking for.

I don’t think that writing down a model of human imperfections, which describes how humans depart from the rational pursuit of fixed goals, is likely to be any easier than writing down a complete model of human behavior.

We can’t use normal AI techniques to learn this kind of model, either — what is it that makes a model good or bad? The standard view — “more accurate models are better” — is fine as long as your goal is just to emulate human performance. But this view doesn’t provide guidance about how to separate the “good” part of human decisions from the “bad” part.

So what?

It’s reasonable to take the attitude “Well, we’ll deal with that problem when it comes up.” But I think that there are a few things that we can do productively in advance.

  • Inverse reinforcement learning / goal inference research motivated by applications to AI control should probably pay particular attention to the issue of modeling mistakes, and to the challenges that arise when trying to find a policy better than the one you are learning from.
  • It’s worth doing more theoretical research to understand this kind of difficulty and how to address it. This research can help identify other practical approaches to AI control, which can then be explored empirically.

This reminds me of what I like best about what had been Paul's approach (now it's more people's): a better acknowledgement of the limitations of humans and the difficulty of building models of them that are any less complex than the original human itself. I realize there are many reasons people would want not to worry about these things, the main one being that additional, stronger constraints make the problem easier to solve, but I think for actually solving AI alignment we're going to have to face these more general challenges with weaker assumptions. I think Paul's existing writing probably doesn't go far enough, but he does call this the "easy" problem, so we can address the epistemological issues surrounding one agent learning about what exists in another agent's experience in the "hard" version.

Perhaps you should model humans as some kind of cognitively bounded agent. An example algorithm would be AIXI-tl. Have the AI assume that humans are an AIXI-tl with an unknown utility function, and try to optimize that function. This means that your AI assumes that we get sufficiently trivial ethical problems correct, and have no clue about sufficiently hard problems.

A person is given a direct choice of shoot their own foot off, or don't. They choose not to. The AI reasons that our utility function values having feet.

A person is asked whether the Nth digit of pi is even (with N large), with their foot being shot off if they get it wrong. They get it wrong. The AI reasons that the human didn't have enough computing power to solve that problem. By contrast, an AI that assumes humans always behave optimally will deduce that humans like having their foot shot off when asked maths questions.

In practice you might want to use some other type of cognitively bounded algorithm, as AIXI-tl probably makes different types of mistakes from humans. This simple model at least demonstrates that decisions in more understandable situations are a stronger indicator of goals.
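The foot example can be sketched as a tiny Bayesian comparison. This is a simplification under stated assumptions, not AIXI-tl: there are only two hypotheses about the human's utility (values feet or not), a bounded agent is assumed to guess at random on hard problems, and `EPS` is a hypothetical slip rate.

```python
EPS = 0.01  # hypothetical small probability that the agent slips on an easy choice

def likelihood(values_feet, episode, bounded):
    """P(observed outcome | hypothesis about utility, assumed agent model)."""
    kept_foot, hard = episode
    if hard and bounded:
        # A bounded agent cannot compute the answer, so it guesses:
        # the outcome carries no information about its utility.
        return 0.5
    # Otherwise the agent acts on its utility, up to a small slip rate.
    p_keep = 1 - EPS if values_feet else EPS
    return p_keep if kept_foot else 1 - p_keep

def posterior_values_feet(episodes, bounded):
    """Posterior that the human values feet, from a uniform prior over two hypotheses."""
    like_yes = like_no = 1.0
    for ep in episodes:
        like_yes *= likelihood(True, ep, bounded)
        like_no *= likelihood(False, ep, bounded)
    return like_yes / (like_yes + like_no)

# Each episode is (kept_foot, hard_problem): one easy direct choice where the
# foot is kept, then three hard pi-digit questions answered wrongly.
episodes = [(True, False), (False, True), (False, True), (False, True)]
p_bounded = posterior_values_feet(episodes, bounded=True)
p_optimal = posterior_values_feet(episodes, bounded=False)
print(p_bounded, p_optimal)
```

With the bounded-agent model, the hard episodes are uninformative and the easy choice dominates, so the AI stays confident that the human values feet. With the optimal-agent model, the lost feet on hard questions count as deliberate choices and swamp the easy case.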

If you want me to write this out formally in symbols, with an agent with priors over all the tl limitations that "humans" might have and all the goals they might have (low-complexity goals favored), I can do that.

I agree you should model the human as some kind of cognitively bounded agent. The question is how.

Let $W$ be the set of worlds, $U$ the set of all utility functions $u : W \to \mathbb{R}$, $O$ the set of human observations, and $A$ the set of human actions. Let $B$ be the set of bounded optimization algorithms, so that an individual $b \in B$ is a function from (utility, observation) pairs to actions, $b : U \times O \to A$. Examples of $b$ include AIXI-tl with specific time and length limits, and existing deep RL models. $B$ consists of the AI's idea about what kind of bounded agent we might be. There are various conditions of approximate correctness on $B$.

Let $O'$ and $A'$ be the AI's observation and action spaces.

The AI is only interacting with one human, and has a prior where stands for the rest of the world. Note that parameters not given are summed over,

The AI performs Bayesian updates on as normal. Gathering part of an observation

If is the AI's action space, it chooses

Of course, a lot of the magic here is happening in the prior. This can work only if you can find a prior that favours fast and approximately correct optimization algorithms over slow or totally defective ones, and that favours simplicity in each term.

Basically the humans utility function is

Where is the set of all things the human could have seen, is whatever policy the human implements, and focuses on that are simple, stochastic, bounded maximization algorithms.

If you don't find it very clear what I'm doing, that's ok. I'm not very clear what I'm doing. This is a bit of a point in the rough direction.

A lot of magic is happening in the prior over utility functions and optimization algorithms, removing that magic is the open problem.

(I'm pessimistic about making progress on that problem, and instead try to define value by using the human policy to guide a process of deliberation rather than trying to infer some underlying latent structure.)

I think this is important, but I'd take it further.

In addition to computational limits for the class of decisions where you need to compute to decide, there are clearly some heuristics being used by humans that give implicitly incoherent values. In those cases, you might want to apply the idea of computational limits as well. This would allow you to say that picking X over Y at time 1 for time 2, but Y over X at time 2, reflects the cost of thinking about what their future self will want.

Attempted Summary: 

The post is about the project of how an AI might infer human goals in whatever representation (i.e., ambitious value learning). This is different from how to imitate human behavior because in that case "behave like the human" is the goal, whereas in the case of ambitious value learning, the goal is "figure out what the human wants and then do it better."

The fundamental problem is just the messiness of human values. The assumption of infinite data corresponds to the idea that we can place a human with an arbitrary memory in an arbitrary situation as often as we want and then observe her actions (because whatever representation of goals we have is allowed to be a function of the history). This is called the "easy goal inference problem" and is still hard. Primarily (this comes back to the difference between imitation and value learning), you need to model human mistakes, i.e., figure out whether an action was a mistake or not.

(I'm already familiar with the punchline that, for any action, there are infinitely many (rationality, goal) pairs that lead to that action, so this can't be solved without making assumptions about rationality -- but we also know it's possible to make such assumptions that give reasonable performance because humans can infer other humans' goals better than random.)
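A minimal sketch of that punchline, assuming a softmax choice model that is not part of the post: a "rational" agent with reward R, an "anti-rational" agent with reward -R, and a less reliable agent with reward 2R all produce exactly the same behavior. The reward values and `beta` settings are hypothetical.

```python
import math

def policy(rewards, beta):
    """Softmax choice model: P(action) proportional to exp(beta * reward).
    beta > 0 models an agent that tends to pick better actions,
    beta < 0 one that tends to pick worse actions."""
    logits = [beta * r for r in rewards]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

rewards = [1.0, 0.2, -0.5]  # hypothetical rewards for three actions

rational = policy(rewards, beta=3.0)
anti_rational = policy([-r for r in rewards], beta=-3.0)  # opposite goal, opposite competence
rescaled = policy([2 * r for r in rewards], beta=1.5)     # stronger preference, noisier agent

print(rational)
print(anti_rational)
print(rescaled)
```

Behavior alone cannot distinguish these three (rationality, goal) pairs, so any inference has to lean on an assumption about which kinds of rationality are plausible.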

In this post (original here), Paul Christiano analyzes the ambitious value learning approach.

I find it a little bit confusing that Rohin's note refers to the "ambitious value learning approach", while the title of the post refers to the "easy goal inference problem". I think the note could benefit from clarifying the relationship of these two descriptors.

As it stands, I'm asking myself -- are they disagreeing about whether this is easy or hard? Or is "ambitious value learning" the same as "goal inference" (such that there's no disagreement, and in Rohin's terminology this would be the "easy version of ambitious value learning")? Or something else?

The easy goal inference problem is the same thing as ambitious value learning under the assumption of infinite compute and data about human behavior (which is the assumption that we're considering for most of this sequence).

The previous post was meant to outline the problem; all subsequent posts are about that problem. Ambitious value learning is probably the best name for the problem now, but not all posts use the same terminology even though they're talking about approximately the same thing.

Yeah, I said "goal inference" instead of "value learning" but I mean the same thing. The "ambitious" part is that we are trying to do much better than humans, which I was taking for granted in this post (it's six months older than ambitious vs. narrow value learning).

I don't think you can learn an agent's desires from policy, because an agent can have "loose wires" - faulty connections between the desire part and the policy part. Extreme case: imagine agent A with desires X and Y, locked inside a dumb unfeeling agent B which only allows actions maximizing X to affect behavior, while actions maximizing Y get ignored. Then desire Y can't be learned from the policy of agent B. Humans could be like that: we have filters to stop ourselves from acting on, or even talking about, certain desires. Behind these filters we have "private desires", which can be learned from brain structure but not from policy. Even if these desires aren't perfectly "private", the fastest way to learn them still shouldn't rely on policy alone.

You mean that you can ask the agent if it wants just X, and it will say "I want Y also," but it will never act to do those things? That sounds like what Robin Hanson discusses in Elephant in the Brain - and he largely dismisses the claimed preferences, in favor of caring about the actual desire.

I'm confused about why we think this is a case that would occur in a way that Y is a real goal we should pursue, instead of a false pretense. And if it was the case, how would brain inspection (without manipulation) allow us to know it?

(There was a longer comment here but I found a way to make it shorter)

I think people can be reluctant to reveal some of their desires, by word or deed. So looking at policy isn't the most natural way to learn these desires; looking inside the black box makes more sense.

Fair point, but I don't think that addresses the final claim, which is that even if you are correct, analyzing the black box isn't enough without actually playing out counterfactuals.

Just to make sure I understand: You're arguing that even if we somehow solve the easy goal inference problem, there will still be some aspect of values we don't capture?

Yeah. I think a creature behaving just like me doesn't necessarily have the exact same internal experiences. Across all possible creatures, there are degrees of freedom in internal experiences that aren't captured by actions. Some of these might be value-relevant.

Yeah, in ML language, you're describing the unidentifiability problem in inverse reinforcement learning -- for any behavior, there are typically many reward functions for which that behavior is optimal.

Though another way this could be true is if "internal experience" depends on what algorithm you use to generate your behavior, and "optimize a learned reward" doesn't meet the bar. (For example, I don't think a giant lookup table that emulates my behavior is having the same experience that I am.)

Ben Smith

I guess this falls into the category of "Well, we’ll deal with that problem when it comes up", but I'd imagine that when a human preference in a particular dilemma is undefined or even just highly uncertain, one can often defer to other rules. For example, rather than maximizing an uncertain preference, default to maximizing the human's agency in scenarios where preference is unclear, even if this predictably leads to less-than-optimal preference satisfaction.

Typo: "But in in the long-term"

I would expect human feedback to work for clarifying or flagging mistakes, since we are more precise about such matters on reflection than in the moment of action.


I agree that modelling all human mistakes seems about as hard as modelling all of human values, so straightforward IRL is not a solution to the goal inference problem, only a reshuffling of complexity.

However, I don't think modelling human mistakes is fundamental to the goal inference problem in the way this post claims.

For example, you can imagine goal inference being solved along the lines of extrapolated volition: we give humans progressively more information about the outcomes of their actions and time to deliberate, and let the AI try to generalize to the limit of a human with infinite information and deliberation time (including time to deliberate about what information to attend to). It's unclear whether this limiting generalization would count as a sufficiently "reasonable representation" to solve the easy goal inference problem, but it's quite possible that it solves the full goal inference problem.

Another way we can avoid modelling all human mistakes is if we don't try to model all of human values, just the ones that are relevant to catastrophic / disempowering actions the AI could take. It seems plausible that there's a pretty simple description of some human cognitive limitations which if addressed, would eliminate the vast majority of risk, even if it can't help the AI decide whether the human would prefer to design a new city (to use Paul's example) more like New York or more like Jakarta. This would also count as a good-enough solution to the goal inference problem that doesn't require solving the "easy goal inference problem" in the full generality stated here.