Thanks, and nice work!
Thus the utility of (a1, o) for o in Press should be equivalent to the utility of the same (a1, o) under the counterfactual assumption that o is not in Press, and vice versa.
Yeah, this is pretty key. You need it to optimize for both cases as if the probability of the button being pressed is fixed and independent of whether the programmers actually press the button. We can achieve this via a causal intervention on whether or not the button is pressed, and then clean up your U a bit by redefining it as follows:
U(a1, o, a2) :=
    UN(a1, o, a2) + E[US | do(O in Press)]        if o not in Press
    US(a1, o, a2) + E[UN | do(O not in Press)]    otherwise
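To make the definition concrete, here is a minimal Python sketch of a toy version. Everything in it is an assumption made for illustration, not something taken from the paper: the action names, the numbers in u_normal and u_shutdown, the single observation per branch, and the way do(...) is modeled (the observation is forced and a2 is then chosen as a best response for the utility being evaluated).

```python
# Toy model; all names and payoffs below are hypothetical.
ACTIONS_1 = ("work", "manipulate")   # "manipulate" stands for shifting the press probability
ACTIONS_2 = ("continue", "shutdown")
PRESS = {"press"}                    # the set of observations that count as a button press


def u_normal(a1, o, a2):
    """UN: rewards continuing; manipulation carries a small cost."""
    return (10.0 if a2 == "continue" else 0.0) - (1.0 if a1 == "manipulate" else 0.0)


def u_shutdown(a1, o, a2):
    """US: rewards shutting down."""
    return 5.0 if a2 == "shutdown" else 0.0


def expected_under_do(utility, a1, forced_o):
    """E[utility | do(O = forced_o)] for a fixed a1.

    Assumption: the intervention fixes the observation regardless of a1,
    and a2 is then the best response for `utility`. With one observation
    per branch the expectation has a single term.
    """
    return max(utility(a1, forced_o, a2) for a2 in ACTIONS_2)


def combined_u(a1, o, a2):
    """U as defined above, with the counterfactual term added to each branch."""
    if o not in PRESS:
        return u_normal(a1, o, a2) + expected_under_do(u_shutdown, a1, "press")
    return u_shutdown(a1, o, a2) + expected_under_do(u_normal, a1, "no_press")


def expected_combined_u(a1, p_press):
    """Expected U of a1 when the button is pressed with probability p_press."""
    total = 0.0
    for o, p in (("press", p_press), ("no_press", 1.0 - p_press)):
        total += p * max(combined_u(a1, o, a2) for a2 in ACTIONS_2)
    return total


if __name__ == "__main__":
    for a1 in ACTIONS_1:
        print(a1, [expected_combined_u(a1, p) for p in (0.1, 0.5, 0.9)])
```

In this toy setup both branches of U evaluate to the same total for a given a1, so the expected value does not depend on p_press, and the agent gains nothing by paying the cost of "manipulate". That is the "as if the probability is fixed" property in miniature; it says nothing about whether the construction holds up in richer models.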
(Choosing how to compare UN values to US values makes the choice of priors redundant. If you want the priors to be 2:1 in favor of US then you could also have just doubled US in the first place instead; the degree of freedom in the prior is the same as the degree of freedom in the relative scaling. See also Loudness Priors, a technical report from the last workshop.)
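A quick way to check the prior-versus-scaling point is the small Python sketch below. The option table and the best_action helper are hypothetical. It compares a 1:2 prior in favor of US against an even prior with US doubled; both rankings are proportional to UN + 2*US, so they pick the same action.

```python
# Hypothetical options: name -> (UN value, US value).
OPTIONS = {"a": (10.0, 0.0), "b": (4.0, 4.0), "c": (0.0, 5.0)}


def best_action(options, weight_n, weight_s, scale_s=1.0):
    """Return the option maximizing weight_n * UN + weight_s * (scale_s * US)."""
    def score(name):
        un, us = options[name]
        return weight_n * un + weight_s * scale_s * us
    return max(options, key=score)


if __name__ == "__main__":
    # Prior 1:2 in favor of US, utilities unscaled.
    print(best_action(OPTIONS, 1 / 3, 2 / 3))
    # Even prior, US doubled.
    print(best_action(OPTIONS, 0.5, 0.5, scale_s=2.0))
    # Even prior, US unscaled, for contrast.
    print(best_action(OPTIONS, 0.5, 0.5))
```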
This method does seem to fulfill all the desiderata in the paper, although we're not too confident in it yet (it took us a little while to notice the "managing the news" problem in the first version, and it seems pretty likely that this too will have undesirable properties lurking somewhere). I'm fairly pleased with this solution, though, and a little miffed -- we found something similar to this a little while back (our research outstrips our writing speed, unfortunately) and now you've gone and ruined the surprise! :-)
(In seriousness, though, nice work. Next question is, can we pick any holes in it?)
What do we mean by "o in Press"? That whoever wrote the list defining "Press" made a different decision that day?
Maybe we can have the definition of "Press" as being a specific list (rather than being the reality described by that list), and the counterfactual being that a certain stochastic process erases a particular part of that list? Not sure about that... Need some sort of abstract version of this (might be doable).
This still seems vulnerable to Benja's blackmail. Let me explain:
Assume that a2 is irrelevant, that o=Pr, an...
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:
(See the paper for references.)
This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.