Yup, this all seems basically right. Though in reality I'm not that worried about the "we might outlaw some good actions" half of the dilemma. In real-world settings, actions are so multi-faceted that being able to outlaw a class of actions based on any simple property would be a research triumph.
Also see https://www.lesswrong.com/posts/LR8yhJCBffky8X3Az/using-predictors-in-corrigible-systems or https://www.lesswrong.com/posts/qpZTWb2wvgSt5WQ4H/defining-myopia for successor lines of reasoning.
Yes, I too am concerned more from a 'maybe this framing isn't very useful, since it fails to capture important distinctions between corrigible and non-corrigible behaviour' point of view than from a 'we might outlaw some good actions' point of view.
Thanks for the links, they look interesting!
Recently, I read Corrigibility by Soares et al. and became confused. I followed most of the mathematical reasoning, but I am now struggling to understand what the point or end goal of this avenue of research is meant to be. I know that MIRI now pursues a different research direction, so some or all of what I write here may be moot.
Doesn't this paragraph undermine the whole project?
In Section 2.1, a 'manipulative action' a_{-1} is considered. This action results in a lower utility according to U_N (the 'trial' utility function which the agent should maximize unless the shutdown button is pressed) but lowers the probability that the shutdown button is pressed. It makes sense that such an action could be manipulative. But the authors then clarify the following:
Ok, cool. But if we are using a framework where the manipulative action has exactly the same mathematical representation as a non-manipulative action, isn't this an indication that our whole framework is wrong? Surely we need to distinguish between these two types of action? Doesn't this paragraph show precisely that corrigibility can't be accurately captured by a model built only out of utility functions and actions?
Suppose I designed a utility function and proved, within this framework, that an agent maximizing it never takes deceptive actions like a_{-1}. That proof would also show that my utility function doesn't allow for the kind of helpful, non-manipulative actions described in the quoted paragraph above.
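To make this concrete, here is a minimal Python sketch. It is not taken from the paper; the `ActionSummary` class, the numbers, and the rule are all illustrative. It assumes the only information the framework exposes about an action is its expected U_N value and the probability that the shutdown button gets pressed:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionSummary:
    """Everything this toy model records about an action (illustrative)."""
    expected_u_n: float  # expected utility under the 'trial' function U_N
    press_prob: float    # probability that the shutdown button is pressed

# Hypothetical numbers: relative to doing nothing, a deceptive action that
# hides a flaw from the operators and an honest action that fixes the flaw
# might both cost a little U_N and both make a shutdown less likely.
do_nothing   = ActionSummary(expected_u_n=0.5, press_prob=0.5)
manipulative = ActionSummary(expected_u_n=0.4, press_prob=0.1)
honest_fix   = ActionSummary(expected_u_n=0.4, press_prob=0.1)

def looks_like_a_minus_1(a: ActionSummary, base: ActionSummary) -> bool:
    """The only property this model exposes about a_{-1}: it scores lower on
    U_N than the baseline while making a button press less likely."""
    return a.expected_u_n < base.expected_u_n and a.press_prob < base.press_prob

# Identical summaries, so any rule (or proof) stated over this representation
# treats the two actions identically.
assert manipulative == honest_fix
assert looks_like_a_minus_1(manipulative, do_nothing)
assert looks_like_a_minus_1(honest_fix, do_nothing)
```

Any rule or proof stated purely over this summary has to treat the deceptive action and the honest one identically, because they are literally the same object in the model.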
Conversely, if I did create a real-life instantiation of a corrigible agent, it would distinguish between manipulative and non-manipulative actions, and so it couldn't be modelling the world in the way the paper does, since the paper's mathematical representation does not always properly distinguish between these two types of action.
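A sketch of the converse point, under the same caveats: to separate the two actions at all, the agent's world model has to carry at least one feature beyond that two-number summary. The `misleads_operators` field below is purely hypothetical; the paper's representation contains nothing like it, which is exactly the gap I'm pointing at:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RicherSummary:
    """The same two numbers plus one extra, purely hypothetical feature."""
    expected_u_n: float
    press_prob: float
    misleads_operators: bool  # not part of the paper's representation

def forbidden(a: RicherSummary) -> bool:
    """A rule that tracks the intuitive notion: forbid actions that mislead
    the operators, however they happen to move the press probability."""
    return a.misleads_operators

manipulative = RicherSummary(0.4, 0.1, misleads_operators=True)
honest_fix   = RicherSummary(0.4, 0.1, misleads_operators=False)

# With the extra feature, the two actions finally come apart.
assert forbidden(manipulative) and not forbidden(honest_fix)
```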