This post summarizes the sequence on value learning. While it doesn’t introduce any new ideas, it does shed light on which parts I would emphasize most, and the takeaways I hope that readers get. I make several strong claims here; interpret these as my impressions, not my beliefs. I would guess many researchers disagree with the (strength of the) claims, though I do not know what their arguments would be.
Over the last three months we’ve covered a lot of ground. It’s easy to lose sight of the overall picture over such a long period of time, so let's do a brief recap.
The “obvious” approach
Here is an argument for the importance of AI safety:
- Any agent that is much more intelligent than us should not be exploitable by us, since if we could find some way to exploit the agent, the agent could also find the exploit and patch it.
- Anything that is not exploitable must be an expected utility maximizer; since we cannot exploit a superintelligent AI, it must look like an expected utility maximizer to us.
- Due to Goodhart’s Law, even “slightly wrong” utility functions can lead to catastrophic outcomes when maximized.
- Our utility function is complex and fragile, so getting the “right” utility function is difficult.
This argument implies that by the time we have a superintelligent AI system, there is only one part of that system that could still have been influenced by us: the utility function. Every other feature of the AI system is fixed by math. As a result, we must necessarily solve AI alignment by influencing the utility function.
So of course, the natural approach is to get the right utility function, or at least an adequate one, and have our AI system optimize that utility function. Besides fragility of value, which you might hope that machine learning could overcome, the big challenge is that even if we assume full access to the entire human policy, we cannot infer the human’s values without making an assumption about how their preferences relate to their behavior. In addition, any misspecification can lead to bad inferences. And finally, the entire project of having a single utility function that captures optimal behavior in all possible environments seems quite hard -- it seems necessary to have some sort of feedback from humans, or you end up extrapolating in some strange way that is not necessarily what we “would have” wanted.
So does this mean we’re doomed? Well, there are still some potential avenues for rescuing ambitious value learning, though they do look quite difficult to me. But I think we should actually question the assumptions underlying our original argument.
Problems with the standard argument
Consider the calculator. From the perspective of someone before the time of calculators, this device would look quite intelligent -- just look at the speed with which it can do arithmetic! Nonetheless, we can all agree that a standard calculator is not dangerous.
It also seems strange to ascribe goals to the calculator -- while this is not wrong per se, we certainly have better ways of predicting what a calculator will and will not do than by modelling it as an expected utility maximizer. If you model a calculator as aiming to achieve the goal of “give accurate math answers”, problems arise: what if I take a hammer to the calculator and then try to ask it 5 + 3? The utility maximizer model here would say that it answers 8, whereas with our understanding of how calculators work we know it probably won’t give any answer at all. Utility maximization with a simple utility function is only a good model for the calculator within a restricted set of environmental circumstances and a restricted action space. (For example, we don’t model the calculator as having access to the action, “build armor that can protect against hammer attacks”, because otherwise utility maximization would predict it takes that action.)
Of course, it may be that something that is generally superintelligent will work in as broad a set of circumstances as we do, and will have as wide an action space as we do, and must still look to us like an expected utility maximizer since otherwise we could Dutch book it. However, if you take such a broad view, then it turns out that all behavior looks coherent. There’s no mathematical reason that an intelligent agent must have catastrophic behavior, since any behavior that you observe is consistent with the maximization of some utility function.
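To make the claim that any behavior is consistent with maximizing some utility function concrete, here is a minimal sketch. It assumes a tiny deterministic setting in which a trajectory is just a sequence of actions; the names `twitchy_policy`, `rollout`, and `rationalizing_utility` are made up for illustration. The construction simply assigns utility 1 to whatever the agent actually does and 0 to everything else.

```python
from itertools import product

ACTIONS = ("left", "right")
HORIZON = 3  # assumption: short, fixed-length episodes


def twitchy_policy(history):
    """An arbitrary, seemingly pointless behavior: alternate actions."""
    return ACTIONS[len(history) % 2]


def rollout(policy, horizon=HORIZON):
    """Run a deterministic policy and return the trajectory of actions."""
    history = ()
    for _ in range(horizon):
        history += (policy(history),)
    return history


def rationalizing_utility(policy):
    """Build a utility function over trajectories that `policy` maximizes.

    Utility is 1 for exactly the trajectory the policy produces, 0 otherwise.
    """
    target = rollout(policy)
    return lambda trajectory: 1.0 if trajectory == target else 0.0


utility = rationalizing_utility(twitchy_policy)
all_trajectories = list(product(ACTIONS, repeat=HORIZON))

# The utility-maximizing trajectory is the one the policy already takes.
best = max(all_trajectories, key=utility)
print(best == rollout(twitchy_policy))
```

Nothing about `twitchy_policy` was special: the same construction works for any deterministic policy, which is why "it maximizes some utility function" places no constraint on behavior by itself.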
To be clear, while I agree with every statement in “Optimized agent appears coherent”, I am making the strong claim that these statements are vacuous and by themselves tell us nothing about the systems that we will actually build. Typically, I do not flat-out disagree with a common argument. I usually think that the argument is important and forms a piece of the picture, but that there are other arguments that push in other directions that might be more important. That’s not the case here: I am claiming that the argument that “superintelligent agents must be expected utility maximizers by virtue of coherence arguments” provides no useful information, with almost the force of a theorem. My uncertainty here is almost entirely caused by the fact that other smart people believe that this argument is important and relevant.
I am not claiming that we don’t need to worry about AI safety since AIs won’t be expected utility maximizers. First of all, you can model them as expected utility maximizers, it’s just not useful. Second, if we build an AI system whose internal reasoning consisted of maximizing the expectation of some simple utility function, I think all of the classic concerns apply. Third, it does seem likely that humans will build AI systems that are “trying to pursue a goal”, and that can have all of the standard convergent instrumental subgoals. I propose that we describe these systems as goal-directed rather than as expected utility maximizers, since the latter is vacuous and implies a level of formalization that we have not yet reached. However, this risk is significantly different. If you believed that superintelligent AI must be goal-directed because of math, then your only recourse for safety would be to make sure that the goal is good, which is what motivated us to study ambitious value learning. But if the argument is actually that AI will be goal-directed because humans will make it that way, you could try to build AI that is not goal-directed but can still do the things that goal-directed AI can do, and have humans build that instead.
Now that we aren’t forced to influence just a utility function, we can consider alternative designs for AI systems. For example, we can aim for corrigible behavior, where the agent is trying to do what we want. Or we could try to learn human norms, and create AI systems that follow these norms while trying to accomplish some task. Or we could try to create an AI ecosystem akin to Comprehensive AI Services, and set up the services such that they are keeping each other in check. We could create systems that learn how to do what we want in particular domains, by learning our instrumental goals and values, and use these as subsystems in AI systems that accelerate progress, enable better decision-making, and are generally corrigible. If we want to take such an approach, we have another source of influence: the human policy. We can train our human overseers to provide supervision in a particular way that leads to good behavior on the AI’s part. This is analogous to training operators of computer systems, and can benefit from insights from Human-Computer Interaction (HCI).
Not just value learning
This sequence is somewhat misnamed: while it is organized around value learning, there are many ideas that should be of interest to researchers working on other agendas as well. Many of the key ideas can be used to analyze any proposed solution for alignment (though the resulting analysis may not be very interesting).
The necessity of feedback. The main argument of Human-AI Interaction is that any proposed solution that aims to have an AI system (or a CAIS glob of services) produce good outcomes over the long term needs to continually use data about humans as feedback in order to “stay on target”. Here, “human” is shorthand for “something that we know shares our values”; e.g., idealized humans, uploads, or sufficiently good imitation learning would all probably count.
(If this point seems obvious to you, note that ambitious value learning does not clearly satisfy this criterion, and approaches like impact measures, mild optimization, and boxing are punting on this problem and aiming for not-catastrophic outcomes rather than good outcomes.)
Mistake models. We saw that ambitious value learning has the problem that even if we assume perfect information about the human, we cannot infer their values without making an assumption about how their preferences relate to their behavior. This is an example of a much broader pattern: given that our AI systems necessarily get feedback from us, they must be making some assumption about how to interpret that feedback. For any proposed solution to alignment, we should ask what assumptions the AI system is making about the feedback it gets from us.
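The point that behavior alone does not pin down values can be illustrated with a small sketch. Everything here is a toy assumption: the states, actions, and the three hypothetical "planner" models (`rational`, `anti_rational`, `habitual`) are invented for illustration. Each pairs a different reward function with a different assumption about how preferences relate to behavior, yet all three reproduce exactly the same observed policy.

```python
STATES = ("hungry", "tired")
ACTIONS = ("eat", "sleep", "work")

# Hypothetical observed human behavior: one action per state.
observed_policy = {"hungry": "eat", "tired": "sleep"}


def reward_a(state, action):
    """Candidate values: the human likes what they are seen doing."""
    return 1.0 if observed_policy[state] == action else 0.0


def reward_b(state, action):
    """Opposite candidate values: the human dislikes what they are seen doing."""
    return -reward_a(state, action)


def reward_zero(state, action):
    """Candidate values: the human is indifferent to everything."""
    return 0.0


def rational(reward):
    """Assumption: the human picks the reward-maximizing action."""
    return {s: max(ACTIONS, key=lambda a: reward(s, a)) for s in STATES}


def anti_rational(reward):
    """Assumption: the human picks the reward-minimizing action."""
    return {s: min(ACTIONS, key=lambda a: reward(s, a)) for s in STATES}


def habitual(reward):
    """Assumption: the human ignores reward and acts out of fixed habit."""
    return dict(observed_policy)


explanations = [
    rational(reward_a),
    anti_rational(reward_b),
    habitual(reward_zero),
]

# All three (assumption, reward) pairs predict the same behavior.
print(all(policy == observed_policy for policy in explanations))
```

Since the data cannot distinguish between these explanations, whichever one an AI system settles on comes from the assumption built into it, which is the assumption the question above asks us to make explicit.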