Thank you for the thoughtful reply!
Deducing the correct utility of a utility maximiser is one thing (which has a low level of uncertainty, higher if the agent is hiding stuff).
In the white-box approach it can't really hide. But I guess it's rather tangential to the discussion.
Assigning a utility to an agent that doesn't have one is quite another... Humans don't follow anything like a utility function, which is a first problem, so you're asking the AI to construct something that isn't there.
What do you mean by "follow a utility function"? Why do you think humans don't do it? If it isn't there, what does it mean to have a correct solution to the FAI problem?
The robot is a behavior-executor, not a utility-maximizer.
The main problem with Yvain's thesis is in the paragraph:
Again, give the robot human level intelligence. Teach it exactly what a hologram projector is and how it works. Now what happens? Exactly the same thing - the robot executes its code, which says to scan the room until its camera registers blue, then shoot its laser.
What does Yvain mean by "give the robot human level intelligence"? If the robot's code remained the same, in what sense does it have human level intelligence?
Then you have to knit this together into a humanity utility function, which is very non-trivial.
This is the part of the CEV proposal which always seemed redundant to me. Why should we do it? If you're designing the AI, why wouldn't you use your own utility function? At worst, an average utility function of the group of AI designers? Why do we want / need the whole of humanity there? Btw, I would obviously prefer my utility function in the AI but I'm perfectly willing to settle on e.g. Yudkowsky's.
Suppose the AI decides to kill everyone, then replay, in an endless loop, the one upload it has, having a marvellous experience... the AI does something stupid in this model (eg: replaces everyone with chatterbots that describe their ever increasing happiness and fulfilment)...
It seems that you're identifying my proposal with something like "maximize pleasure". The latter is a notoriously bad idea, as was discussed endlessly. However, my proposal is completely different. The AI wouldn't do something the upload wouldn't do because such an action is opposed to the upload's utility function.
You may object that these problems won't happen - but you can't be confident of this, as you haven't defined your solution formally...
Actually, I'm not far from it (at least I don't think I'm further than CEV). Note that I have already defined formally I(A, U) where I=intelligence, A=agent, U=utility function. Now we can do something like "U(A) is defined to be U s.t. the probability that I(A, U) > I(R, U) for random agent R is maximal". Maybe it's more correct to use something like a thermal ensemble with I(A, U) playing the role of energy: I don't know, I don't claim to have solved it all already. I just think it's a good research direction.
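Written out in symbols, the definition sketched above might read as follows. This is only one possible reading: the distribution \mu over "random" agents R, the candidate class \mathcal{U} of utility functions, and the inverse temperature \beta are all assumptions that the comment does not specify.

```latex
% Maximum-probability version: pick the utility function under which
% A most often beats a random agent R drawn from \mu.
U(A) \;=\; \operatorname*{arg\,max}_{U \in \mathcal{U}}
      \; \Pr_{R \sim \mu}\bigl[\, I(A, U) > I(R, U) \,\bigr]

% Thermal-ensemble version: a distribution over utility functions rather
% than a single maximiser. The sign is chosen so that a higher I(A, U)
% is more probable; \beta > 0 is a hypothetical free parameter.
P(U \mid A) \;\propto\; \exp\bigl(\beta \, I(A, U)\bigr)
```

In the second form, letting \beta grow large concentrates the distribution on the utility functions that maximise I(A, U), while a small \beta spreads weight across many candidate utility functions.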
What do you mean by "follow a utility function"? Why do you think humans don't do it?
Humans are neither independent nor transitive. Human preferences change over time, depending on arbitrary factors, including how choices are framed. Humans suffer because of things they cannot affect, and humans suffer because of details of their probability assessment (eg ambiguity aversion). That bears repeating - humans have preferences over their state of knowledge. The core of this is that "assessment of fact" and "values" are not disc...
To construct a friendly AI, you need to be able to make vague concepts crystal clear, cutting reality at the joints when those joints are obscure and fractal - and then implement a system that implements that cut.
There are lots of suggestions on how to do this, and a lot of work in the area. But having been over the same turf again and again, it's possible we've got a bit stuck in a rut. So to generate new suggestions, I'm proposing that we look at a vaguely analogous but distinctly different question: how would you ban porn?
Suppose you're put in charge of some government and/or legal system, and you need to ban pornography, and see that the ban is implemented. Pornography is the problem, not eroticism. So a lonely lower-class guy wanking off to "Fuck Slaves of the Caribbean XIV" in a Pussycat Theatre is completely off. But a middle-class couple experiencing a delicious frisson when they see a nude version of "Pirates of Penzance" at the Met is perfectly fine - commendable, even.
The distinction between the two cases is certainly not easy to spell out, and many are reduced to saying the equivalent of "I know it when I see it" when defining pornography. In terms of AI, this is equivalent to "value loading": refining the AI's values through interactions with human decision makers, who answer questions about edge cases and examples and serve as "learned judges" for the AI's concepts. But suppose that approach was not available to you - what methods would you implement to distinguish between pornography and eroticism, and ban one but not the other? Sufficiently clear that a scriptwriter would know exactly what they need to cut or add to a movie in order to move it from one category to the other? What if the nude "Pirates of Penzance" was at a Pussycat Theatre and "Fuck Slaves of the Caribbean XIV" was at the Met?
To get maximal creativity, it's best to ignore the ultimate aim of the exercise (to find inspirations for methods that could be adapted to AI) and just focus on the problem itself. Is it even possible to get a reasonable solution to this question - a question much simpler than designing a FAI?