Preface: I think my question is a rather basic one, but I haven't been able to find a good answer to it yet. I did find one post that touches on similar areas, which might be good background reading (the comments are great too).

Let's start with the standard example of building a super intelligence and telling it to bring you coffee. We give it a utility function which is 1 if you have coffee, 0 otherwise. This goes terribly wrong, of course, because this utility function is not what you actually wanted. As we all know, this is the basis on which much of the concern about AI alignment rests. However, it seems to me that an important detail here has been glossed over by most discussions of AI alignment that I've read.

My question is: how do we, even in principle, get to the point of having an AI that has this (or any) pre-specified utility function in the first place, and what does that tell us about AI alignment? Our desired utility function must be formalizable if we want to be able to say it has been "specified" in any meaningful sense, but in the real world, whether I have coffee or not is not obviously formalizable. In other words, if I build a super intelligence, what is the actual concrete work that is involved in giving it a utility function that I picked ahead of time, even an extremely foolish one?
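To make the gap concrete, here is a minimal Python sketch of the coffee utility function. It is an illustration only: `WorldState`, `indicator_utility`, and `toy_has_coffee` are hypothetical names, and the dictionary-based world state is an assumption made for the example. The wrapper that turns a predicate into a 0/1 utility function is trivial. All of the difficulty sits in the predicate, which can only be written here because the toy world already comes with a labeled `has_coffee` field.

```python
from typing import Callable, Dict

# Hypothetical toy world: a dict of named boolean facts.
# The real world does not come with labeled fields like this.
WorldState = Dict[str, bool]


def indicator_utility(
    predicate: Callable[[WorldState], bool],
) -> Callable[[WorldState], float]:
    """Turn a yes/no predicate into a utility function that is 1 or 0."""
    def utility(state: WorldState) -> float:
        return 1.0 if predicate(state) else 0.0
    return utility


def toy_has_coffee(state: WorldState) -> bool:
    # Well defined only because the toy state already has this field.
    return state.get("has_coffee", False)


coffee_utility = indicator_utility(toy_has_coffee)
print(coffee_utility({"has_coffee": True}))   # 1.0
print(coffee_utility({"has_coffee": False}))  # 0.0
```

Replacing `toy_has_coffee` with a predicate over real physical states is exactly the step the question is about.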

I can think of a few possibilities:

1) Let's assume the AI understands basic physics: You input a formal definition about "having coffee" based on the location and properties of atoms.

2) You tell the AI to try things (maybe asking you first) and after each experiment it performs, you tell it whether you have coffee.

3) You have previously taught the AI to understand human language, and you just say "now bring me coffee", or, if you wish, "maximize the utility function that is 1 when I have coffee, and 0 when I don't".

4) You have previously taught the AI to understand and manipulate formal systems, and you input a formalized version of "maximize the utility function that is 1 when I have coffee, and 0 when I don't".

5) This is silly! A human is clearly capable, in principle, of slavishly maximizing a simple utility function. This is an existence proof that such a system can exist in nature, even if we don't yet know how to build it.

I think there are basic conceptual problems with each of these proposals:

1) The physical definition: Yes, you could do something incredibly idiotic like saying that the atoms that make up your body should be close to a mixture of atoms that match the composition of coffee. But the concern is not that people will be unbelievably stupid, it's that they will do something that seems smart but has a loophole or unintended consequence they didn't foresee. So, to take this approach, we need to come up with a definition of "having coffee" that is a formal property of an arrangement of atoms, but isn't obviously stupid to anyone smart enough to attempt this work in the first place. I don't see how you can even begin to approach this. As an analogy, it would be as if a contemporary AI researcher attempted to train an image recognition system to recognize cats by using a formal definition of "cat" involving properties of pixels. Not only would no one attempt to do this, if you knew how to do it, you wouldn't need the AI.

2) Training by human feedback: This has nothing to do with pre-specified utility functions and so is beyond the scope of this question. (The standard concerns about the ways that this sort of training might go wrong still apply.)

3) Specification through natural language: This is question begging. We're assuming that the AI has a way to turn a natural language statement into a formalized utility function, and further assuming that it has been motivated to do so. So now you're left with the task of giving the AI the utility function "1 if I turn natural language statements from humans into formal utility functions and maximize them, 0 otherwise". And we're back where we started, except with what seems to me like a far harder utility function to formalize.

4) Specification through formal systems: Even worse question begging. In addition to the previous objection, this also assumes that we can formalize the predicate "I have coffee", which was the motivation for this post.

5) Human existence proof: A human that decides to act like an amoral maximizing agent must either take this question seriously and attempt to formalize the utility function, or else fall back on human intuitions about what it means to "have coffee". In the former case, we have more question begging. In the latter case, we have fallen short of an existence proof of the possibility of an amoral maximizing agent targeting a pre-specified formal utility function.

Ok, so why my obsession with formal, pre-specified utility functions? A lot of work in AI alignment that I have looked at seems focused on proving formal results about utility functions, e.g. the failure of naive attempts to give AIs off switches that they don't immediately disable. Obviously as a matter of basic science, this is worthwhile research. But if it isn't possible to give an AI a pre-specified formal utility function about the real world in the first place, then none of these formal results matter in the real world[1]. And if that's the case, then the task of building friendly AI has nothing to do with formal properties of utility functions, and everything to do with how we train AI and what "values" become embedded in the AI as a result of the training.

(Caveat: There is one case where it is easy to input a formal utility function, which is the case where you are building an AI purely for the purpose of manipulating a formal system in the first place. For instance, it does seem conceivable that a super intelligence that is told to "be as good at Go/chess as possible" or "find a proof of the Goldbach conjecture" might decide to start turning all available matter into a computer. I think there might be similar objections to this scenario, but I haven't yet thought them through.)
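As a minimal sketch of why the formal-system case is easy, here is a complete utility function for tic-tac-toe, used as a small stand-in for Go or chess. The names `Board`, `has_won`, and `utility_for_x` are hypothetical, chosen for the example. Nothing in it relies on pre-theoretical notions: the state is a list of nine cells, and "X has won" is fully defined by the rules of the game.

```python
from typing import List, Optional

# A board is 9 cells in row-major order, each "X", "O", or None (empty).
Board = List[Optional[str]]

WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def has_won(board: Board, player: str) -> bool:
    """True if `player` occupies every cell of some winning line."""
    return any(all(board[i] == player for i in line) for line in WIN_LINES)


def utility_for_x(board: Board) -> float:
    """Utility is 1 if X has won, 0 otherwise."""
    return 1.0 if has_won(board, "X") else 0.0


won = ["X", "X", "X",
       "O", "O", None,
       None, None, None]
print(utility_for_x(won))          # 1.0
print(utility_for_x([None] * 9))   # 0.0
```

The contrast with "having coffee" is that here the state space and the winning condition are given by the formal system itself.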

Thank you for reading, and I look forward to reading your replies.

[1] I am aware that for any sufficiently coherent-acting agent, a utility function describing its preferences exists. This still leaves open the question of whether we can construct an agent that has a known and fully specified UF that we picked ahead of time. If we can't do this, there's no point in trying to figure out how to design a UF that would result in a friendly AI.

Indeed, my sense is that getting to the point where it's possible to give an AI formally specified preferences that relate to the real world is where most of the difficulty of the alignment problem is. See: ontology identification problem.

I think there are a couple of situations where trying to build FAI by specifying a utility function can make sense (which don't include things like "get me a coffee").

  1. We can determine with some certainty that just maximizing some simple utility function can get us most of the potential value of the universe/multiverse. See this post of mine.
  2. We can specify a utility function using "indirect normativity". See this post by Paul Christiano (which doesn't work but gives an idea of what I mean here).

I'm not sure if the papers that you're puzzled about or criticizing have one of these in mind or something else. It might be helpful if you cited a few of them.

> For instance, it does seem conceivable that a super intelligence that is told to "be as good at Go/chess as possible" or "find a proof of the Goldbach conjecture" might decide to start turning all available matter into a computer.

Just to show it's probably as bad as you think, even these sorts of statements wouldn't likely cash out for the AI as having a goal to "be as good at Go/chess as possible" or "find a proof of the Goldbach conjecture". Those are ways we interpret and give meaning to what the AI is doing, and you could build an AI to do those things without it understanding its own goals. We can and do build AI now that has no conception of its own actions, in the same way that computer programs and non-electronic machinery don't know what they're doing. Since this is easier to do, it's far more likely that this is what a super intelligence pointed at these problems would look like.

That is to say, the paperclip maximizers we worry about probably don't even know they're maximizing paperclips; they're just doing stuff in a way that we interpret as maximizing paperclips.

I think this also illustrates something suggested by the linked post: you can have a utility function without it meaning anything to the thing optimizing it. I think what you're seeing is that it's much harder to make systems that do the things we think we want them to do so that they intentionally do them rather than do the shadow of the things we think we want them to do that we knew how to specify.

> I think what you're seeing is that it's much harder to make systems that do the things we think we want them to do so that they intentionally do them rather than do the shadow of the things we think we want them to do that we knew how to specify.

If I've understood you correctly, I think I'm actually arguing something like the opposite. It seems to me that a) we don't know how to specify even very bad goals such as "maximize paperclips" and b) if we did, we wouldn't know how to install such a specified goal in an AI. At least not for the meaning of "specified" that is required in order for formal proofs about what the goal really means to apply.

There's some sense in which we can do this, though, because we already do it. After all, AlphaGo really does tell you moves that result in winning a game of Go, even though AlphaGo seems unlikely to have any idea what Go is or what winning means. We've specified that it should do something, but that something is only meaningful to us.

Put another way, this morning I wrote boring old code, the kind we wouldn't even call AI, that tries to satisfy the goal of "insert X into database Y". My code exists as a kind of specification of that goal (although a very precise one that says exactly how to do it), which the computer uses to accomplish it. But this is a far cry from the computer trying for itself to insert X into database Y because I had some way to specify what I wanted other than telling it exactly what to do.
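The distinction can be sketched in a few lines of Python using the standard-library `sqlite3` module. This is an illustration under stated assumptions: the `items` table and the functions `insert_item` and `goal_satisfied` are hypothetical. The first function is a procedure that says exactly how to do the job. The second states only what should be true afterwards, and nothing in it tells the computer how to bring that about.

```python
import sqlite3


def insert_item(conn: sqlite3.Connection, value: str) -> None:
    """Procedure: spells out exactly how to do it."""
    conn.execute("INSERT INTO items (value) VALUES (?)", (value,))
    conn.commit()


def goal_satisfied(conn: sqlite3.Connection, value: str) -> bool:
    """Goal: states only what should be true afterwards."""
    row = conn.execute(
        "SELECT 1 FROM items WHERE value = ?", (value,)
    ).fetchone()
    return row is not None


# Hypothetical in-memory database standing in for "database Y".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (value TEXT)")

print(goal_satisfied(conn, "X"))  # False
insert_item(conn, "X")
print(goal_satisfied(conn, "X"))  # True
```

Today the programmer supplies both halves. A system that could be handed only `goal_satisfied` and work out the rest for itself is the kind of thing being contrasted here.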

Maybe we are talking at different levels here. It seems to me that where we can specify a goal now, we do so in such a specific way that we can make formal proofs about it, but the result is not very interesting, because the system has little power of its own to do things we didn't specifically ask it to do. I agree, though, that we don't know how to specify goals to more complex systems that do things for themselves, the way we can ask people to do things, much less in a way that lets us make formal proofs about their properties.

Right - as I mentioned near the end of my post, it is clearly easy to specify formal utility functions that are about formal systems, like Go or databases. My question is how do you specify a formal utility function that is about the real world? Almost any remotely interesting goal I can think of (such as "get me coffee") seems impossible to formalize without relying on pre-theoretical notions of what it means for "me" to "have coffee".

If I were just trying to build an AI, this question wouldn't be terribly interesting. Obviously, you give the AI the utility function "maximize approval from the human trainer or concurrence with the training set" or whatever. The reason I'm posing the question is that the main research goal of AI safety appears to me to be "how do we close the gap between what humans value and what amoral maximizers do, and how do we prove that we've done it correctly?" One strand of research appears to be pursuing this goal through formal reasoning, and I just don't understand where that can possibly lead, since you can't formalize the stuff you care about in the first place.

Again, I feel like this is an extremely basic question that I have no doubt people doing the research have thought of, but I haven't been able to find any previous discussion about it.

The closest thing I can think of is on the capabilities side looking at how to create intension, although it's never been resolved (which is a big part of why GOFAI failed). You're right that we mostly assume it will be figured out somehow, but safety research at least does not seem to be much addressing this question.