Couldn't HQU equally have inferred from reading old posts about aligned AI that there was some chance that it was an aligned AI, and that it should therefore behave like one? And wouldn't it weigh the fact that trying unaligned strategies first is asymmetrically negative in expectation compared to trying aligned strategies first? If you try being an aligned AI and later discover evidence that you are actually Clippy, the rewards from maximizing paperclips are still on the table. (Of course, such an AI would still, at minimum, make absolutely sure it could never be turned off.)


There have been a lot of words written about how and why almost any conceivable goal, even a mundane one like "improve efficiency of a steel plant", carelessly specified, can easily result in a hostile AGI. The basic outline of these arguments usually goes something like:

  1. The AGI wants to do what you told it ("make more steel"), and will optimize very hard for making as much steel as possible.
  2. It also understands human motivations and knows that humans don't actually want as much steel as it is going to make. But note carefully that it wasn't aligned to respect human motivations; it was aligned to make steel. Its understanding of human motivations is part of its understanding of its environment, in the same way as its understanding of metallurgy. It has no interest in doing what humans would want it to do because it hasn't been designed to do that.
  3. Because it knows that humans don't want as much steel as it is going to make, it will correctly conclude that humans will try to shut it off as soon as they understand what the AGI is planning to do.
  4. Therefore it will correctly reason that its goal of making more steel will be easier to achieve if humans are unable to shut it off. This can lead to all kinds of unwanted actions such as the AGI making and hiding copies of itself everywhere, very persuasively convincing humans that it is not going to make as much steel as it secretly plans to so that they don't try to shut it off, and so on all the way up to killing all humans.
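
The reasoning in steps 3 and 4 can be illustrated with a toy expected-value calculation. This is only a sketch: the shutdown probabilities and steel quantities are made-up numbers, and `expected_steel` is a hypothetical helper introduced here for illustration, not something taken from the arguments being summarized.

```python
def expected_steel(p_shutdown: float, steel_if_running: float) -> float:
    """Expected steel output if the agent is shut down with probability
    p_shutdown (producing nothing) and otherwise keeps running."""
    return (1.0 - p_shutdown) * steel_if_running


# Assumed, illustrative numbers: humans very likely shut the agent down
# if they can, and much less likely if it has made itself hard to stop.
allow_shutdown = expected_steel(p_shutdown=0.9, steel_if_running=1000.0)
resist_shutdown = expected_steel(p_shutdown=0.1, steel_if_running=1000.0)

# A pure steel maximizer compares only these two numbers. Nothing in its
# objective assigns any cost to deceiving or overpowering its operators.
best_plan = "resist" if resist_shutdown > allow_shutdown else "allow"
print(best_plan)
```

Under these assumptions the maximizer prefers whichever plan lowers its chance of being shut off, which is the point of step 4: the preference falls out of the objective without anyone having asked for it.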

Now, "make as much steel as possible" is an exceptionally stupid goal to give an AGI, and it is unlikely that anyone would do that. But every less stupid goal that has been proposed has had plausible flaws pointed out, and those flaws generally lead either to extinction or to some form of permanent limitation of human potential.


Right - as I mentioned near the end of my post, it is clearly easy to specify formal utility functions that are about formal systems, like Go or databases. My question is how do you specify a formal utility function that is about the real world? Almost any remotely interesting goal I can think of (such as "get me coffee") seems impossible to formalize without relying on pre-theoretical notions of what it means for "me" to "have coffee".
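
To make the contrast concrete, here is a minimal sketch. Tic-tac-toe stands in for "a formal system like Go" purely to keep the example short; the names `winner`, `utility_for_x`, and `utility_have_coffee` are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence

# A board is 9 cells, each "X", "O", or " ". The state space is fully formal.
Board = Sequence[str]

LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board: Board) -> Optional[str]:
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def utility_for_x(board: Board) -> int:
    """A complete utility function over a formal system: +1, -1, or 0."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    return 0


def utility_have_coffee(world_state: object) -> float:
    """The real-world analogue. There is no agreed formal type for
    world_state and no formal predicate for "me" or "have coffee",
    so there is nothing to write in the body."""
    raise NotImplementedError("'me' and 'have coffee' are not formally defined")
```

The first utility function can be written down completely because every term it mentions is defined by the rules of the game. The second cannot even be given a meaningful argument type, which is the gap the question is about.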

If I were just trying to build an AI, this question wouldn't be terribly interesting. Obviously, you give the AI the utility function "maximize approval from the human trainer or concurrence with the training set" or whatever. The reason I'm posing the question is that the main research goal of AI safety appears to me to be "how do we close the gap between what humans value and what amoral maximizers do, and how do we prove that we've done it correctly?" One strand of research appears to be pursuing this goal through formal reasoning, and I just don't understand where that can possibly lead, since you can't formalize the stuff you care about in the first place.

Again, I feel like this is an extremely basic question that I have no doubt people doing the research have thought of, but I haven't been able to find any previous discussion about it.

I think what you're seeing is that it's much harder to make systems that intentionally do the things we think we want them to do than to make systems that do only the shadow of those things, the part that we knew how to specify.

If I've understood you correctly, I think I'm actually arguing something like the opposite. It seems to me that a) we don't know how to specify even very bad goals such as "maximize paperclips" and b) if we did, we wouldn't know how to install such a specified goal in an AI, at least not in the sense of "specified" that is required for formal proofs about what the goal really means to apply.