The Friendly AI problem is complicated enough that it can be divided into a large number of subproblems. Two such subproblems could be:
Let's call an AI which does not suffer from these problems a pseudofriendly AI. Would this be a useful type of AI to produce? Maybe, maybe not. But even if it fails to be useful in and of itself, solving the pseudofriendly AI problem may be a helpful step toward developing the mode of thinking needed to solve the Friendly AI problem.
It's also possible that pseudofriendliness might be able to interact usefully with Eliezer's Coherent Extrapolated Volition (CEV - see here for more details). Eliezer has expressed CEV as follows:
However, an FAI is not to be given the CEV as its goal; rather, a superintelligence is to use our CEV to determine what goals an FAI should be given. This means, though, that there will be a point where a superintelligence exists that is not friendly. Could a pseudofriendly AI fill a gap here? Probably not - pseudofriendliness is not friendliness, nor should it be confused with it. However, it might be part of a solution that helps the CEV approach to be implemented safely.
Why all this hassle though? We seem to have exchanged one very important problem for two less important ones. Well, part of the benefit of pseudofriendliness is that it seems like it should be easier to formalise. First, let us introduce the concept of an interpretation system.
An interpretation system takes a partially specified world state (called a goal) and outputs a triple (Wx, Sx, Cx), where Wx is a partially specified world state, Sx is a set of alternative sets of subgoals, and Cx is the chosen set of subgoals.
What does all of this mean? Well, the input could be thought of as a goal (stop the humans on that island from being drowned by rising sea waters) which is expressed as a partial world state (i.e. the world state where the humans on the island remain undrowned). The interpretation system then outputs a partially specified world state which may be the same or different. In humans, various aspects of our cognitive system would make us interpret this goal as a particular world state. For example, we may implicitly rule out tying all of the humans to giant stakes so that they were above the level of the water but unable to move or act. So we would output one world state while an AI may well output another. This is enough to specify the problem of goal interpretation as follows:
The problem of goal interpretation is as follows. An interpretation system Ix, given a goal G, outputs Wx. A second interpretation system Iy outputs Wy on receiving the same goal. Systems Ix and Iy suffer from the goal interpretation problem if Wx ≠ Wy.
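As a concreteness check, the definition can be sketched in code. Everything here is a toy assumption of mine, not anything from the post: world states and subgoal sets are modelled as frozen sets of propositions, and `human` and `literal_ai` are hypothetical interpretation systems for the rising-sea-water example.

```python
from typing import Callable, FrozenSet, Set, Tuple

# Toy encoding: a partially specified world state (or a set of subgoals)
# is a frozen set of propositions. The post leaves these notions abstract.
WorldState = FrozenSet[str]
SubgoalSet = FrozenSet[str]

# An interpretation system maps a goal (a partial world state) to the
# triple (Wx, Sx, Cx): its interpreted world state, the alternative
# subgoal sets, and the chosen subgoal set.
InterpretationSystem = Callable[
    [WorldState], Tuple[WorldState, Set[SubgoalSet], SubgoalSet]
]

def goal_interpretation_problem(
    ix: InterpretationSystem, iy: InterpretationSystem, goal: WorldState
) -> bool:
    """Ix and Iy suffer from the goal interpretation problem if, given
    the same goal G, their interpreted world states differ (Wx != Wy)."""
    wx, _, _ = ix(goal)
    wy, _, _ = iy(goal)
    return wx != wy

# Hypothetical systems for the rising-sea-water example.
goal = frozenset({"islanders undrowned"})

def human(g: WorldState):
    # A human implicitly enriches the goal with unstated constraints.
    w = g | frozenset({"islanders free to move and act"})
    s = {frozenset({"build a levee"}), frozenset({"evacuate by boat"})}
    return w, s, frozenset({"evacuate by boat"})

def literal_ai(g: WorldState):
    # A literal-minded AI takes the goal exactly as given.
    s = {frozenset({"tie islanders to stakes above the waterline"})}
    return g, s, frozenset({"tie islanders to stakes above the waterline"})
```

Under these toy definitions, `goal_interpretation_problem(human, literal_ai, goal)` comes out `True` (Wx ≠ Wy), while comparing a system with itself comes out `False`.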
The interpretation systems also output a chosen set of subgoals to be used to bring about the world state and a set of subgoal sets which could alternatively be used to bring it about. Going back to our rising sea water example, even if Wx = Wy, these are only partially specified world states and hence do not determine whether every aspect of the AI's actions would produce outcomes that we want. This means that the subgoals used to reach a goal may still be undesirable. We can now specify the problem of innate drives as follows:
System Ix suffers from the weak problem of innate drives from the perspective of system Iy if Cx ≠ Cy (where Cy is Iy's chosen set of subgoals). It suffers from the strong problem of innate drives if Cx is not a member of Sx.
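These two conditions can be sketched the same way. This assumes, as before, that an interpretation system is a function returning a (W, S, C) triple, where S is a set of alternative subgoal sets and C is the chosen one; `drifting_ai` and the function names are my own illustrative inventions.

```python
def weak_innate_drives_problem(ix, iy, goal) -> bool:
    # Ix suffers the weak problem from Iy's perspective if their chosen
    # subgoal sets differ (Cx != Cy), even when Wx and Wy agree.
    _, _, cx = ix(goal)
    _, _, cy = iy(goal)
    return cx != cy

def strong_innate_drives_problem(ix, goal) -> bool:
    # Ix suffers the strong problem if its chosen subgoal set is not
    # even among the alternatives it itself enumerated (Cx not in Sx).
    _, sx, cx = ix(goal)
    return cx not in sx

# Hypothetical example: a system whose drives pull it toward a subgoal
# set outside the ones it listed.
def drifting_ai(goal):
    s = {frozenset({"evacuate by boat"})}
    return goal, s, frozenset({"maximise compute"})  # chosen C not in S
```

Here `strong_innate_drives_problem(drifting_ai, goal)` is `True` for any goal, since the chosen subgoal set is not among the enumerated alternatives.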
If these definitions stand up, then pseudofriendly AI is certainly more formally specified than Friendly AI. However, even if not, it seems plausible that it is likely to be easier to formalise pseudofriendliness than friendliness. If you buy that, then the questions remaining are: