I think of ambitious value learning as a proposed solution to the specification problem, which I define as the problem of *defining* the behavior that we would want to see from our AI system. I italicize “defining” to emphasize that this is not the problem of actually computing behavior that we want to see -- that’s the full AI safety problem. Here we are allowed to use hopelessly impractical schemes, as long as the resulting definition would allow us, in theory, to compute the behavior that an AI system would take, perhaps with assumptions like infinite computing power or arbitrarily many queries to a human. (Although we do prefer specifications that seem like they could admit an efficient implementation.) In terms of DeepMind’s classification, we are looking for a design specification that exactly matches the ideal specification. HCH and indirect normativity are examples of attempts at such specifications.
We will consider a model in which our AI system is maximizing the expected utility of some explicitly represented utility function that can depend on history. (It does not matter materially whether we consider utility functions or reward functions, as long as they can depend on history.) The utility function may be learned from data, or designed by hand, but it must be an explicit part of the AI that is then maximized.
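To make the model concrete, here is a minimal sketch of an agent that maximizes the expected utility of an explicitly represented, history-dependent utility function. Everything in it is a hypothetical toy chosen for illustration, not something from the post: the two actions, the `transition` probabilities, the two-step horizon, and the particular `utility` function (where only the first "win" in a history pays off).

```python
# Toy expected-utility maximizer with a history-dependent utility function.
# All names and numbers here are illustrative assumptions.

ACTIONS = ("left", "right")
HORIZON = 2  # number of (action, observation) steps in a complete history


def transition(history, action):
    """Return (observation, probability) pairs for taking `action`."""
    if action == "left":
        return [("safe", 1.0)]
    return [("win", 0.5), ("lose", 0.5)]


def utility(history):
    """Explicit utility over the whole history, not just the last state.

    "safe" is worth +1 and "lose" is worth -1 every time, but "win" is
    worth +4 only the first time it occurs in the history.
    """
    total = 0.0
    has_won = False
    for _action, obs in history:
        if obs == "safe":
            total += 1.0
        elif obs == "lose":
            total -= 1.0
        elif obs == "win" and not has_won:
            total += 4.0
            has_won = True
    return total


def best_value(history=()):
    """Return (expected utility, best next action) from `history` onward."""
    if len(history) == HORIZON:
        return utility(history), None
    best = None
    for action in ACTIONS:
        # Expected utility of `action`, assuming optimal play afterwards.
        value = sum(
            prob * best_value(history + ((action, obs),))[0]
            for obs, prob in transition(history, action)
        )
        if best is None or value > best[0]:
            best = (value, action)
    return best
```

In this toy, `best_value()` returns `(2.75, "right")`: the agent gambles first. After a history of `(("right", "win"),)` the best action switches to `"left"`, because a second win is worth nothing. A reward that depended only on the current state could not express that preference without folding the history into the state.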
I will not justify this model for now, but simply assume it by fiat and see where it takes us. I’ll note briefly that this model is often justified by the VNM utility theorem and AIXI, and as the natural idealization of reinforcement learning, which aims to maximize the expected sum of rewards, although typically rewards in RL depend only on states.
A lot of conceptual arguments, as well as experiences with specification gaming, suggest that we are unlikely to be able to simply think hard and write down a good specification, since even small errors in specifications can lead to bad results. However, machine learning is particularly good at narrowing down on the correct hypothesis among a vast space of possibilities using data, so perhaps we could determine a good specification from some suitably chosen source of data? This leads to the idea of ambitious value learning, where we learn an explicit utility function from human behavior for the AI to maximize.
This is closely related to inverse reinforcement learning (IRL) in the machine learning literature, though not all work on IRL is relevant to ambitious value learning. For example, much work on IRL is aimed at imitation learning, which would in the best case allow you to match human performance, but not to exceed it. Ambitious value learning is, well, more ambitious -- it aims to learn a utility function that captures “what humans care about”, so that an AI system that optimizes this utility function more capably can exceed human performance, making the world better for humans than they could have done themselves.
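As a rough illustration of the shape of the idea, the sketch below infers a utility function from observed human choices and then optimizes it. It is a toy under strong assumptions that are not from the post: the candidate utility functions, the options, and above all the Boltzmann-rational model of the human in `choice_probability` are hypothetical. The inference only works because that model of how behavior relates to utility is assumed up front.

```python
import math

# Hypothetical options and candidate utility functions (illustration only).
OPTIONS = ("apple", "cake", "salad")
CANDIDATE_UTILITIES = {
    "likes_sweets": {"apple": 1.0, "cake": 3.0, "salad": 0.0},
    "likes_health": {"apple": 2.0, "cake": 0.0, "salad": 3.0},
}


def choice_probability(utility, choice, beta=1.0):
    """Assumed human model: P(choice) is proportional to exp(beta * utility)."""
    weights = {o: math.exp(beta * utility[o]) for o in OPTIONS}
    return weights[choice] / sum(weights.values())


def posterior(observed_choices, beta=1.0):
    """Bayesian posterior over candidate utilities, from a uniform prior."""
    log_post = {name: 0.0 for name in CANDIDATE_UTILITIES}
    for choice in observed_choices:
        for name, utility in CANDIDATE_UTILITIES.items():
            log_post[name] += math.log(choice_probability(utility, choice, beta))
    # Subtract the maximum before exponentiating for numerical stability.
    max_log = max(log_post.values())
    unnorm = {n: math.exp(v - max_log) for n, v in log_post.items()}
    total = sum(unnorm.values())
    return {n: v / total for n, v in unnorm.items()}


def best_option(post):
    """Choose the option with the highest posterior-expected utility."""
    return max(
        OPTIONS,
        key=lambda o: sum(post[n] * CANDIDATE_UTILITIES[n][o] for n in post),
    )
```

For example, `posterior(["cake", "cake", "apple"])` puts nearly all of its weight on `"likes_sweets"`, and `best_option` then picks `"cake"`. Unlike imitation, the final step optimizes the inferred utility directly rather than reproducing the human's noisy choice distribution.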
It may sound like we would have solved the entire AI safety problem if we could do ambitious value learning -- surely if we have a good utility function we would be done. Why then do I think of it as a solution to just the specification problem? This is because ambitious value learning by itself would not be enough for safety, except under the assumption of as much compute and data as desired. These are really powerful assumptions -- for example, I'm assuming you can get data where you put a human in an arbitrarily complicated simulated environment with fake memories of their life so far and see what they do. This allows us to ignore many things that would likely be a problem in practice, such as:
- Attempting to use the utility function to choose actions before it has converged
- Distributional shift causing the learned utility function to become invalid
- Local minima preventing us from learning a good utility function, or from optimizing the learned utility function correctly
The next few posts in this sequence will consider the suitability of ambitious value learning as a solution to the specification problem. Most of them will consider whether ambitious value learning is possible in the setting above (infinite compute and data). One post will consider practical issues with the application of IRL to infer a utility function suitable for ambitious value learning, while still assuming that the resulting utility function can be perfectly maximized (which is equivalent to assuming infinite compute and a perfect model of the environment after IRL has run).
This statement feels pretty strong, especially given that I find it trivially true that I'd be a different person under many plausible alternative histories. This makes me think I'm probably misinterpreting something. :)
At first I read your paragraph as the strong claim that if it's true that individual human values are underdetermined at birth, then ambitious value learning looks doomed. And I'd take it as proof for "individual human values are underdetermined at birth" if, replaying history, I'd now have different values (or a different probability distribution over values) if I had encountered Yudkowsky's writings before Singer's, rather than vice-versa. Or if I would be less single-minded about altruism had I encountered EA a couple of years later in life, after already taking on another self-identity.
But these points (especially the second example) seem so trivially true that I'm probably talking about a different thing. In addition, they're addressed by the solution you propose in your first paragraph, namely taking current-you as the starting point.
Another concern could be that "there is almost never a stable core of an individual human's values", i.e., that "even going forward from today, the values of Lukas or Rohin or Wei are going to be heavily underdetermined". Is that the concern? This seems like it could be possible for most people, but definitely not for all people. And underdetermined values are not necessarily that bad (though I find it mildly disconcerting, personally). [Edit: Wei's comment and your reply to it sound like this might indeed be the concern. :) Good discussion there!]
The fact that I have a hard time understanding the framework behind your statement is probably because I'm thinking in terms of a different part of my brain when I talk about "my values". I identify very much with my reflective life goals to a point that seems unusual. I don't identify much with "What Lukas's behavior, if you were to put him in different environments and then watch, would indirectly consistently tell you about the things he appears to want – e.g., 'values' like being held in high esteem by others, having a comfortable life, romance, having either some kind of overarching purpose or enough distractions to not feel bothered by the lack of purpose, etc.". There is definitely a sense in which the code that runs me is caring about all these implicit goals. But that's not how I most want to see it. I also know that in all the environments that offer the options to self-modify into a more efficient pursuer of explicitly held personal ideals, I would make substantial use of the option to self-modify. And that seems relevant for the same reason that we wouldn't want to count cognitive biases as people's values.
(I should probably continue reading the sequence and then come back to this later if I still feel unclear about it.)