Stuart_Armstrong

Having done a lot of work on corrigibility, I believe that it can't be implemented in a value agnostic way; it needs a subset of human values to make sense. I also believe that it requires a lot of human values, which is almost equivalent to solving all of alignment; but this second belief is much less firm, and less widely shared.

Instead, you could have a satisficer which tries to maximize the probability that the utility is above a certain value. This leads to different dynamics than maximizing expected utility. What do you think?

If U is the utility and u is the value that it needs to be above, define a new utility V, which is 1 if and only if U>u and is 0 otherwise. This is a well-defined utility function, and the design you described is exactly equivalent with being an expected V-maximiser.

Thanks! Corrected (though it is indeed a good hard problem).

Pre-training and domain specific knowledge are not needed.

Run them on examples such as frown-with-red-bar and smile-with-blue-bar.

Which problems are you thinking of?