Jul 11, 2018
Preface: I think my question is a rather basic one, but I haven't been able to find a good answer to it yet. I did find one post that touches on similar areas, which might be good background reading (the comments are great too).
Let's start with the standard example of building a super intelligence and telling it to bring you coffee. We give it a utility function which is 1 if you have coffee, 0 otherwise. This goes terribly wrong, of course, because this utility function is not what you actually wanted. As we all know, this is the basis on which much of the concern about AI alignment rests. However, it seems to me that an important detail here has been glossed over by most discussions of AI alignment that I've read.
My question is: how do we, even in principle, get to the point of having an AI that has this (or any) pre-specified utility function in the first place, and what does that tell us about AI alignment? Our desired utility function must be formalizable if we want to be able to say it has been "specified" in any meaningful sense, but in the real world, whether I have coffee or not is not obviously formalizable. In other words, if I build a super intelligence, what is the actual concrete work that is involved in giving it a utility function that I picked ahead of time, even an extremely foolish one?
I can think of a few possibilities:
1) Let's assume the AI understands basic physics: You input a formal definition about "having coffee" based on the location and properties of atoms.
2) You tell the AI to try things (maybe asking you first) and after each experiment it performs, you tell it whether you have coffee.
3) You have previously taught the AI to understand human language, and you just say "now bring me coffee", or, if you wish, "maximize the utility function that is 1 when I have coffee, and 0 when I don't".
4) You have previously taught the AI to understand and manipulate formal systems, and you input a formalized version of "maximize the utility function that is 1 when I have coffee, and 0 when I don't".
5) This is silly! A human is clearly capable, in principle, of slavishly maximizing a simple utility function. This is an existence proof that such a system can exist in nature, even if we don't yet know how to build it.
I think there are basic conceptual problems with each of these proposals:
1) The physical definition: Yes, you could do something incredibly idiotic like saying that the atoms that make up your body should be close to a mixture of atoms that match the composition of coffee. But the concern is not that people will be unbelievable stupid, it's that they will do something that seems smart but has a loophole or unintended consequence they didn't foresee. So, to take this approach, we need to come of with a definition of "having coffee" that is a formal property of an arrangement of atoms, but isn't obviously stupid to anyone smart enough to attempt this work in the first place. I don't see how you can even begin to approach this. As an analogy, it would be as if a contemporary AI researcher attempted to train an image recognition system to recognize cats by using a formal definition of "cat" involving properties of pixels. Not only would no one attempt to do this, if you knew how to do it, you wouldn't need the AI.
2) Training by human feedback: This has nothing to do with pre-specified utility functions and so is beyond the scope of this question. (The standard concerns about the ways that this sort of training might go wrong still apply.)
3) Specification through natural language: This is question begging. We're assuming that the AI has a way to turn a natural language statement into a formalized utility function, and further assuming that it has been motivated to do so. So now you're left with the task of giving the AI the utility function "1 if I turn natural language statements from humans into formal utility functions and maximize them, 0 otherwise". And we're back where we started, except with what seems to me like a far harder utility function to formalize.
4) Specification through formal systems: Even worse question begging. In addition to the previous objection, this also assumes that we can formalize the predicate "I have coffee", which was the motivation for this post.
5) Human existence proof: A human that decides to act like an amoral maximizing agent must either take this question seriously and attempt to formalize the utility function, or else fall back on human intuitions about what it means to "have coffee". In the former case, we have more question begging. In the latter case, we have fallen short of an existence proof of the possibility of an amoral maximizing agent targeting a pre-specified formal utility function.
Ok, so why my obsession with formal, pre-specified utility functions? A lot of work in AI alignment that I have looked at seems focused on proving formal results about utility functions, e.g. the failure of naive attempts to give AIs off switches that they don't immediately disable. Obviously as a matter of basic science, this is worthwhile research. But if it isn't possible to give an AI a pre-specified formal utility function about the real world in the first place, then none of these formal results matter in the real world. And if that's the case, then the task of building friendly AI has nothing to do with formal properties of utility functions, and everything to do with how we train AI and what "values" become embedded in the AI as a result of the training.
(Caveat: There is one case where it is easy to input a formal utility function, which is the case where you are building an AI purely for the purpose of manipulating a formal system in the first place. For instance, it does seem conceivable that a super intelligence that is told to "be as good at go/chess as possible" or "find a proof of the goldbach conjecture" might decide to start turning all available matter into a computer. I think there might be similar objections to this scenario, but I haven't yet thought them through.)
Thank you for reading, and I look forward to reading your replies.
 I am aware that for any sufficiently coherent-acting agent, a utility function describing its preferences exists. This still leaves open the question of whether we can construct an agent that has a known and fully specified UF that we picked ahead of time. If we can't do this, there's no point in trying to figure out how to design a UF that would result in a friendly AI.