Conclusion to the sequence on value learning

Rohin Shah

This post summarizes the sequence on value learning. While it doesn’t introduce any new ideas, it does shed light on which parts I would emphasize most, and the takeaways I hope that readers get. I make several strong claims here; interpret these as my impressions, not my beliefs. I would guess many researchers disagree with the (strength of the) claims, though I do not know what their arguments would be.

Over the last three months we’ve covered a lot of ground. It’s easy to lose sight of the overall picture over such a long period of time, so let's do a brief recap.

The “obvious” approach

Here is an argument for the importance of AI safety:

Any agent that is much more intelligent than us should not be exploitable by us, since if we could find some way to exploit the agent, the agent could also find the exploit and patch it.
Anything that is not exploitable must be an expected utility maximizer; since we cannot exploit a superintelligent AI, it must look like an expected utility maximizer to us.
Due to Goodhart’s Law, even “slightly wrong” utility functions can lead to catastrophic outcomes when maximized.
Our utility function is complex and fragile, so getting the “right” utility function is difficult.

This argument implies that by the time we have a superintelligent AI system, there is only one part of that system that could still have been influenced by us: the utility function. Every other feature of the AI system is fixed by math. As a result, we must necessarily solve AI alignment by influencing the utility function.

So of course, the natural approach is to get the right utility function, or at least an adequate one, and have our AI system optimize that utility function. Besides fragility of value, which you might hope that machine learning could overcome, the big challenge is that even if we assume full access to the entire human policy, we cannot infer the human’s values without making an assumption about how their preferences relate to their behavior. In addition, any misspecification can lead to bad inferences. And finally, the entire project of having a single utility function that captures optimal behavior in all possible environments seems quite hard: it seems necessary to have some sort of feedback from humans, or else the system ends up extrapolating in some strange way that is not necessarily what we “would have” wanted.

So does this mean we’re doomed? Well, there are still some potential avenues for rescuing ambitious value learning, though they do look quite difficult to me. But I think we should actually question the assumptions underlying our original argument.

Problems with the standard argument

Consider the calculator. From the perspective of someone before the time of calculators, this device would look quite intelligent -- just look at the speed with which it can do arithmetic! Nonetheless, we can all agree that a standard calculator is not dangerous.

It also seems strange to ascribe goals to the calculator -- while this is not wrong per se, we certainly have better ways of predicting what a calculator will and will not do than by modelling it as an expected utility maximizer. If you model a calculator as aiming to achieve the goal of “give accurate math answers”, problems arise: what if I take a hammer to the calculator and then try to ask it 5 + 3? The utility maximizer model here would say that it answers 8, whereas with our understanding of how calculators work we know it probably won’t give any answer at all. Utility maximization with a simple utility function is only a good model for the calculator within a restricted set of environmental circumstances and a restricted action space. (For example, we don’t model the calculator as having access to the action, “build armor that can protect against hammer attacks”, because otherwise utility maximization would predict it takes that action.)

Of course, it may be that something that is generally superintelligent will work in as broad a set of circumstances as we do, and will have as wide an action space as we do, and must still look to us like an expected utility maximizer since otherwise we could Dutch book it. However, if you take such a broad view, then it turns out that all behavior looks coherent. There’s no mathematical reason that an intelligent agent must have catastrophic behavior, since any behavior that you observe is consistent with the maximization of some utility function.
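To make the last claim concrete, here is a minimal Python sketch (not part of the original sequence). It assumes a deterministic policy over observation histories and, as a simplification, a utility defined on (history, action) pairs rather than on whole trajectories. The names `rationalizing_utility`, `greedy_maximizer`, and `twitchy_policy` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Hashable, Sequence

Action = Hashable
Policy = Callable[[tuple], Action]
Utility = Callable[[tuple, Action], float]


def rationalizing_utility(policy: Policy) -> Utility:
    """Build a utility function that the given policy maximizes.

    It assigns 1 to whatever the policy does in each history and 0 to
    everything else, so it contains no information beyond the policy.
    """
    def utility(history: tuple, action: Action) -> float:
        return 1.0 if action == policy(history) else 0.0
    return utility


def greedy_maximizer(utility: Utility, actions: Sequence[Action]) -> Policy:
    """Return the policy that picks the utility-maximizing action."""
    def act(history: tuple) -> Action:
        return max(actions, key=lambda a: utility(history, a))
    return act


ACTIONS = ("left", "right", "twitch")


def twitchy_policy(history: tuple) -> Action:
    # An arbitrary, seemingly purposeless behavior: cycle through the
    # actions depending only on how many observations have been seen.
    return ACTIONS[len(history) % len(ACTIONS)]


utility = rationalizing_utility(twitchy_policy)
maximizer = greedy_maximizer(utility, ACTIONS)

# The "utility maximizer" reproduces the arbitrary policy on every history.
histories = [(), ("o1",), ("o1", "o2"), ("o1", "o2", "o3")]
agreement = [maximizer(h) == twitchy_policy(h) for h in histories]
```

The constructed utility function is just a relabeling of the policy, which is why, at this level of generality, describing a system as an expected utility maximizer places no constraint on what it actually does.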

To be clear, while I agree with every statement in Optimized agent appears coherent, I am making the strong claim that these statements are vacuous and by themselves tell us nothing about the systems that we will actually build. Typically, I do not flat out disagree with a common argument. I usually think that the argument is important and forms a piece of the picture, but that there are other arguments that push in other directions that might be more important. That’s not the case here: I am claiming that the argument that “superintelligent agents must be expected utility maximizers by virtue of coherence arguments” provides no useful information, with almost the force of a theorem. My uncertainty here is almost entirely caused by the fact that other smart people believe that this argument is important and relevant.

I am not claiming that we don’t need to worry about AI safety since AIs won’t be expected utility maximizers. First of all, you can model them as expected utility maximizers, it’s just not useful. Second, if we build an AI system whose internal reasoning consisted of maximizing the expectation of some simple utility function, I think all of the classic concerns apply. Third, it does seem likely that humans will build AI systems that are “trying to pursue a goal”, and that can have all of the standard convergent instrumental subgoals. I propose that we describe these systems as goal-directed rather than expected utility maximizers, since the latter is vacuous and implies a level of formalization that we have not yet reached. However, this risk is significantly different. If you believed that superintelligent AI must be goal-directed because of math, then your only recourse for safety would be to make sure that the goal is good, which is what motivated us to study ambitious value learning. But if the argument is actually that AI will be goal-directed because humans will make it that way, you could try to build AI that is not goal-directed that can do the things that goal-directed AI can do, and have humans build that instead.

Alternative solutions

Now that we aren’t forced to influence just a utility function, we can consider alternative designs for AI systems. For example, we can aim for corrigible behavior, where the agent is trying to do what we want. Or we could try to learn human norms, and create AI systems that follow these norms while trying to accomplish some task. Or we could try to create an AI ecosystem akin to Comprehensive AI Services, and set up the services such that they are keeping each other in check. We could create systems that learn how to do what we want in particular domains, by learning our instrumental goals and values, and use these as subsystems in AI systems that accelerate progress, enable better decision-making, and are generally corrigible. If we want to take such an approach, we have another source of influence: the human policy. We can train our human overseers to provide supervision in a particular way that leads to good behavior on the AI’s part. This is analogous to training operators of computer systems, and can benefit from insights from Human-Computer Interaction (HCI).

Not just value learning

This sequence is somewhat misnamed: while it is organized around value learning, there are many ideas that should be of interest to researchers working on other agendas as well. Many of the key ideas can be used to analyze any proposed solution for alignment (though the resulting analysis may not be very interesting).

The necessity of feedback. The main argument of Human-AI Interaction is that any proposed solution that aims to have an AI system (or a CAIS glob of services) produce good outcomes over the long term needs to continually use data about humans as feedback in order to “stay on target”. Here, “human” is shorthand for “something that we know shares our values”, e.g. idealized humans, uploads, or sufficiently good imitation learning would all probably count.

(If this point seems obvious to you, note that ambitious value learning does not clearly satisfy this criterion, and approaches like impact measures, mild optimization, and boxing are punting on this problem and aiming for not-catastrophic outcomes rather than good outcomes.)

Mistake models. We saw that ambitious value learning has the problem that even if we assume perfect information about the human, we cannot infer their values without making an assumption about how their preferences relate to their behavior. This is an example of a much broader pattern: given that our AI systems necessarily get feedback from us, they must be making some assumption about how to interpret that feedback. For any proposed solution to alignment, we should ask what assumptions the AI system is making about the feedback it gets from us.
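As a rough illustration of why some assumption is unavoidable, the following Python sketch (an addition, not from the original sequence) shows two different explanations that produce exactly the same observed behavior. It assumes a toy one-step choice problem; `rational_planner`, `anti_rational_planner`, and the reward tables are hypothetical names used only for this example.

```python
from typing import Callable, Dict

Reward = Dict[str, float]
Planner = Callable[[Reward], str]


def rational_planner(reward: Reward) -> str:
    """Pick the option with the highest reward."""
    return max(reward, key=reward.get)


def anti_rational_planner(reward: Reward) -> str:
    """Pick the option with the lowest reward (a systematic 'mistake')."""
    return min(reward, key=reward.get)


# Hypothesis A: the human likes tea and chooses well.
likes_tea: Reward = {"tea": 1.0, "coffee": 0.0, "nothing": -1.0}

# Hypothesis B: the human dislikes tea and chooses as badly as possible.
hates_tea: Reward = {option: -value for option, value in likes_tea.items()}

choice_a = rational_planner(likes_tea)
choice_b = anti_rational_planner(hates_tea)

# Both hypotheses predict the same behavior, so behavior alone cannot
# distinguish them. Only an assumption about the planner can.
same_behavior = choice_a == choice_b
```

Observing the choice by itself cannot tell us which reward table is the right one; the inference depends entirely on which planner (that is, which mistake model) we assume.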

It seems to me that perhaps your argument about expected utility maximization being a trivial property extends back one step previous in the argument, to non-exploitability as well.

AlphaZero is better than us at chess, and so it is non-exploitable at chess (or you might say that being better at chess is the same thing as being non-exploitable at chess). If that's true, then it must also appear to us to be an expected utility maximizer. But notably the kind of EU-maximizer that it must appear to be is: one whose utility function is defined in terms of chess outcomes. AlphaZero *is* exploitable if we're secretly playing a slightly different game, like how-many-more-pawns-do-I-have-than-my-opponent-after-twenty-moves, or the game don't-get-unplugged.

Going the other direction, from EU-maximization to non-exploitability, we can point out that any agent could be thought of as an EU-maximizer (perhaps with a very convoluted utility function), and if it's very competent w.r.t. its utility function, then it will be non-exploitable by us, w.r.t. outcomes related to its utility function.

In other words, non-exploitability is only meaningful with respect to some utility function, and is not a property of "intelligence" or "competence" in general.

Would you agree with this statement?

Yes, I agree that's a corollary.

However, this risk is significantly different. If you believed that superintelligent AI must be goal-directed because of math, then your only recourse for safety would be to make sure that the goal is good, which is what motivated us to study ambitious value learning. But if the argument is actually that AI will be goal-directed because humans will make it that way, you could try to build AI that is not goal-directed that can do the things that goal-directed AI can do, and have humans build that instead.

I'm curious about the extent to which people have felt like "superintelligent AI must be goal-directed" has been the primary problem. Now that I see it expressed in this form, I realize that there have for a long time been lots of papers and comments which seem to suggest that this might be people's primary concern. But I always kind of looked at it from the perspective of "yeah, this is one concern, but even assuming that we could make a non-goal-directed AI, that doesn't solve the problem of other people having an incentive to make goal-directed AI (and that's the much more pressing problem)". So since we seemed to agree on goal-directed superintelligence being a big problem, maybe I overestimated the extent of my agreement with other people concerned about goal-directed superintelligence.

I also am unsure about how much people think that's the primary problem. I feel fairly confident that Eliezer thinks (or thought at some recent point) that this was the primary problem. I came into the field thinking of this as the primary problem.

It certainly seems that many people assume that a superintelligent AI system has a utility function. I don't know their reasons for this assumption.

The standard rebuttal here is that even if a superintelligent AI system is not goal-directed, we should be concerned that the AI will spontaneously develop goal-directed behavior because it is instrumentally valuable to doing whatever it is doing. (And isn't "doing whatever it is doing" itself a "goal", even if the AI does not conceive of it as one, in the same way that a calculator has a "goal" or purpose even if the calculator is unaware of it?) This is of course contingent on it being "superintelligent".

For what it's worth this is also the origin, as I recall it, of concerns about paperclip maximizers: you won't build an AI that sets out to tile the universe with paperclips, but through a series of unfortunate misunderstandings it will, as a subagent or an instrumental action, end up optimizing for paperclips anyway because it seemed like a good idea at the time.

It sounds to me like you're requiring "superintelligent" to include "has a goal" as part of the definition. If that's part of the definition, then I would rephrase my point as "why do we have to build something superintelligent? Let's instead build something that doesn't have a goal but is still useful, like an AI system that follows norms."

See also this comment, which answers a related question.

Does anyone have an incentive to make a non-goal directed AI they can't control?

(did you mean to ask goal-directed?)

Related: Gwern wrote a post arguing that people have an incentive to build a goal-directed AI over a non-goal directed AI. See the references here.

I'm not seeing the "can't control". Sure, agent AI is more powerful than tool AI -- and more powerful things need more control to make them do what you want.

The majority of people choose to make non-goal-directed uncontrolled natural-intelligence agents. It seems likely that as general AI becomes feasible, this drive to procreate will motivate at least some to create such a thing.

It doesn't seem likely to me. People don't procreate in order to fulfil the abstract definition you gave, they procreate to fulfil biological urges and cultural mores.

You have more faith in your model of people's motivation than I do in mine. But that doesn't mean you're right. There are tons of examples in literature and in daily life of mis-/re-directed biological drives, and making an AGI "child" seems so mundane a motive that I hadn't considered until your comment that it might NOT be a strong enough motive.

I have to admit I've seen this as a strong motive for creating AGI in both myself and others. Maybe it's because I just don't get along with other humans very well (or specifically I fail to model them properly), or because I feel as if I would understand AGI better than them, but it just seems much more appealing to me than having an actual child, at least right now. Specifically, my goal is (assuming I understand correctly) non-goal-directed bounded artificial intelligence agents, so... it's pretty similar, at least. It's certainly a strong enough motive for some people.

Promoted to curated: I wanted to curate this post for a long time, but didn't find a good space in the curation schedule until now. I think the whole sequence is worth reading and this post summarizes a lot of its core insights quite well.

I think this sequence clarified a lot of discussion around AI Alignment for me in a way that has changed my thinking about the whole topic quite a bit. I don't necessarily agree with everything in the sequence, but it helped me understand a perspective that previously seemed foreign to me, and I think it did the same for many others. I also hope that curating this might encourage some others to give reading the sequence a try.

Just a few comments

In the abstract, one open problem about "not-goal directed agents" is "when they turn into goal directed?"; this seems to be similar to the problem of inner optimizers, at least in the direction that solutions which would prevent the emergence of inner optimizers could likely work for non-goal directed things
From the "alternative solutions", in my view, what is under-investigated are attempts to limit capabilities - make "bounded agents". One intuition behind it is that humans are functional just because goals and utilities are "broken" in a way compatible with our planning and computational bounds. I'm worried that efforts in this direction got bucketed with "boxing", and boxing got some vibe as being uncool. (By making something bounded I mean for example making bit-flips costly in a way which is tied to physics, not naive solutions like "just don't connect it to the internet")
I'm particularly happy about your points on the standard claims about expected utility maximization. My vague impression is too many people on LW kind of read the standard texts, take note that there is a persuasive text from Eliezer on a topic, and take the matter as settled.

In the abstract, one open problem about "not-goal directed agents" is "when they turn into goal directed?"; this seems to be similar to the problem of inner optimizers, at least in the direction that solutions which would prevent the emergence of inner optimizers could likely work for non-goal directed things

I agree that inner optimizers are a way that non-goal directed agents can become goal directed. I don't see why solutions to inner optimizers would help align non goal-directed things. Can you say more about that?

From the "alternative solutions", in my view, what is under-investigated are attempts to limit capabilities - make "bounded agents". One intuition behind it is that humans are functional just because goals and utilities are "broken" in a way compatible with our planning and computational bounds. I'm worried that efforts in this direction got bucketed with "boxing", and boxing got some vibe as being uncool. (By making something bounded I mean for example making bit-flips costly in a way which is tied to physics, not naive solutions like "just don't connect it to the internet")

I am somewhat worried about such approaches, because it seems hard to make such agents competitive with unaligned agents. But I agree that it seems under-investigated.

I'm particularly happy about your points on the standard claims about expected utility maximization.

Thanks!

Seems as if there's some sleight-of-hand going on here. Yes, we can show that any policy that is invulnerable to Dutch-booking is equivalent to optimizing some utility function. But you've also shown earlier that "equivalent to optimizing some utility function" is a nearly-vacuous concept. There are plenty of un-Dutch-bookable policies which still don't end up paving the universe in utilitronium, for ANY utility function.

Furthermore, I find it easy to imagine human-like value systems which are in fact Dutch-bookable; e.g., "I like to play peekaboo with babies" is Dutch-bookable between "eyes covered" and "eyes uncovered". So the generalization at the outset of this chapter seems over-broad.

I feel like you are trying to critique something I wrote, but I'm not sure what? Could you be a bit more specific about what you think I think that you disagree with?

(In particular, the first paragraph sounds like a statement that I myself would make, so I'm not sure how it is a critique.)

Hmm. I went back and reread you carefully, and I cannot find the part where you said the thing that I was "responding" to above. So I think I'm probably actually responding to my poor model of what you would say, not to what you actually did say. Sorry. I'll leave my above comment but strike out the parts where it refers to what "you" say.

