by A.H.

# 4

Recently, I read Corrigibility by Soares et al. and came away confused. I followed most of the mathematical reasoning, but I am struggling to understand what the point or end goal of this avenue of research is meant to be. I know that MIRI now pursues a different research direction, so some or all of what I write here may be moot.

## Doesn't this paragraph undermine the whole project?

In Section 2.1, the authors consider a 'manipulative action'. This action results in lower utility according to the 'trial' utility function (the one the agent should maximize unless the shutdown button is pressed), but it also lowers the probability that the shutdown button is pressed. It makes sense that such an action could be manipulative. But the authors then clarify the following:

> We remark that [the mathematical condition satisfied by this action] is not by any means a necessary or sufficient description of manipulative actions preventing [pressing of the button]. Some actions satisfying [these conditions] might be non-manipulative (for example, if the agent expends extra effort ensuring that a new subagent is corrigible, and this behavior helps convince the programmers that the agent is indeed corrigible). Furthermore, many manipulative actions may not have this form (as we will see in Section 4.2). Rather, we consider one manipulative action which happens to be both manipulative and have this form, and show that if [a certain parameter] is set to be too low, then U will incentivize U-agents to prefer this [action] to the default action.

Ok, cool. But if we are using a framework where the manipulative action has exactly the same mathematical representation as a non-manipulative action, isn't this an indication that our whole framework is wrong? Surely we need to distinguish between these two types of action? Doesn't this paragraph show precisely that corrigibility can't be accurately captured by this model using utility functions and actions?

Suppose I designed a utility function and proved, in this framework, that an agent maximizing it never took deceptive actions of this form. Then the same proof would also show that my utility function rules out the kind of helpful, non-manipulative actions described in the quoted paragraph above.
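To make this worry concrete, here is a toy sketch in Python. It is my own illustration, not the paper's formalism: I assume an action is summarised only by its expected 'trial' utility and the probability that the button gets pressed, and that the condition in question is "lower trial utility and lower press probability than the default action". The `Action` class, the `manipulative` label, and all the numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    trial_utility: float  # expected value under the 'trial' utility function (assumed summary)
    press_prob: float     # probability that the shutdown button is pressed (assumed summary)
    manipulative: bool    # ground-truth label; the model below never reads it


def has_flagged_form(action: Action, default: Action) -> bool:
    """Assumed condition: lower trial utility AND lower press probability than the default."""
    return (action.trial_utility < default.trial_utility
            and action.press_prob < default.press_prob)


def permitted(actions, default):
    """Keep only the actions that do not have the flagged form."""
    return [a for a in actions if not has_flagged_form(a, default)]


# Illustrative numbers only.
default = Action("default", 10.0, 0.5, manipulative=False)
deceive = Action("deceive the programmers", 9.0, 0.2, manipulative=True)
verify = Action("verify that a subagent is corrigible", 9.0, 0.2, manipulative=False)

allowed = permitted([default, deceive, verify], default)
print([a.name for a in allowed])  # ['default']
```

Both `deceive` and `verify` have identical representations in this toy model, so any rule stated purely in terms of that representation either forbids both or allows both. The `manipulative` field is exactly the information the model has no access to.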

Conversely, if I did create a real-life instantiation of a corrigible agent, it would distinguish between manipulative and non-manipulative actions and thus couldn't be modelling the world in the way which is used in this paper, since the mathematical representation used in the paper does not always properly distinguish between these two types of actions.