I saw a talk earlier this year that recommended this 2015 Corrigibility paper as a good starting point for someone new to alignment research. Assuming that's still true, I've started writing up some thoughts on a possible generalization of the method in that paper.

Anyway, I'm submitting this draft early in the hope of getting feedback on whether I'm on the right track:

GeneralizedUtilityIndifference_Draft_Latest.pdf (edited)

The new version handles sub-agent shutdown better and eliminates the "managing the news" problem.

(Let me know if someone already thought of this approach!)

EDIT 2017-11-09: filled in the section on the -action model.