Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I am pleased to announce the availability of a long-format paper with new results on AGI safety: Corrigibility with Utility Preservation.

You can get the paper at , and in the related software repository at .


Corrigibility is a safety property for artificially intelligent agents. A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility maximizing agents, including possible future agents with Artificial General Intelligence (AGI). The layer counter-acts the emergent incentive of advanced agents to resist such alteration.

A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes. The corrigible agents have an emergent incentive to protect key elements of their corrigibility layer. However, hostile universes may contain forces strong enough to break safety features. Some open problems related to graceful degradation when an agent is successfully attacked are identified.

The results in this paper were obtained by concurrently developing an AGI agent simulator, an agent model, and proofs. The simulator is available under an open source license. The paper contains simulation results which illustrate the safety related properties of corrigible AGI agents in detail.

This post can be used for comments and questions.

The paper contains several results and observations that do not rely on the heavy use of math, but other key results and discussions are quite mathematical. Feel free to post questions and comments even if you have not read all the mathematical parts.

As this is my first post on LessWrong, and my first paper on AGI safety, I feel I should say something to introduce myself. I have a Ph.D. in software design, but my professional life so far has been very diverse and multidisciplinary. Among other things I have been an experimental physicist, a standards developer and negotiator, an Internet privacy advocate, a wireless networking expert, and a systems architect in an industrial research lab. So I bring a wide range of tools and methodological traditions to the field. What made me interested in the field of AGI safety in particular is that it seems to have open problems where real progress can be made using mathematical techniques that I happen to like. I am currently on a sabbatical: basically this means that I decided to quit my day job, and to use my savings to work for a while on some interesting problems that are different from the interesting problems I worked on earlier.


Even though it is super-intelligent, the AU agent has no emergent incentive to spend any resources to protect its utility function. This is because of how it was constructed: it occupies a universe in which no physics process could possibly corrupt its utility function. With the utility function being safe no matter what, the optimal strategy is to devote no resources at all to the matter of utility function protection.

I call this style of approach the "lonely engineer" (a term which comes from MIRI, and I think specifically from Scott Garrabrant, though I could be mistaken on that point). I think lonely-engineer-style approaches (as I think yours basically falls into) are super interesting, and definitely on to something real and important.

That being said, my biggest concern with approaches of that form is that they make an extraordinarily strong assumption about our ability to select what utility function our agent has. In particular, if you imagine training a machine learning model on a lonely engineer utility function, it won't work at all, as gradient descent is not going to be able to distinguish between preferences that are entirely about abstract mathematical objects and those that are not. For a full exploration of this problem, see "Risks from Learned Optimization."

My other concern is that while the lonely engineer themselves might have no incentive to behave non-corrigibly, if it runs any sort of search process (doing its own machine learning, for example), it will have no incentive to do so in an aligned way. Thus, a policy that its search results in might have preferences over things other than abstract mathematical objects, even if the lonely engineer themselves does not.

I am actually just in the middle of reading "Risks from Learned Optimization", so I want to hold off on commenting more fully on the relation between the two problems until I have read the whole thing.

Before I comment in detail on the concerns you raise, can you clarify further what you mean with the "lonely engineer" approach? I am not fully sure to what parts of my paper your concerns apply. You describe the "lonely engineer" approach after quoting a part about the AU agent, so I am wondering if it has to do with this particular agent definition. The main parts of the paper are about A agents, which do in fact have the emergent incentive to protect their utility function. I might call these A agents less lonely than AU agents because they have a greater awareness of not being alone in their universe. But I am not sure if this is what you mean with lonely/not lonely.

Self-modification of policy and utility function in rational agents defines three agent types: Hedonistic, Ignorant, and Realistic. The A agents in my paper are Realistic in this taxonomy. The AU agents are kind-of Ignorant but not entirely, because the universe cannot change their utility function, so I ended up calling them Platonic. These labels Hedonistic, Ignorant, and Realistic describe the expectation of how the agent will behave when supplied with a simple hand-coded utility function, but things get more complex if you create one via learning. If you use a training set and optimiser to produce a usually opaque but still computable world model+utility function combination that you then slot into any of these agent models to make them executable, the resulting agent might behave in a way not fully covered by the category label. This mismatch could happen in particular if the training set includes cases where the environment can try to attack the agent's infrastructure, with the agent having the option to deflect these attacks.

But my main question is if your concerns above apply to A agents also, or to AU agents in particular. I am not sure if you might have a very different concern in mind about how the corrigibility layer approach should or should not be combined with training-based agents.


I think your agent is basically exactly what I mean by a lonely engineer: the key lonely engineer trick is to give the agent a utility function which is a mathematical entity that is entirely independent of its actions in the real world. I think on my first reading of your paper it seemed like you were trying to get the agent to approximate the agent, but reading it more closely now it seems like what you're doing is using the correction term to make it indifferent to the button being pressed; is that correct?

I think in that case I tend towards thinking that such corrections are unlikely to scale towards what I eventually want out of corrigibility, though I do think there are other ways in which such approaches can still be quite useful. I mainly see approaches of that form as broadly similar to impact measure approaches such as Attainable Utility Preservation or Relative Reachability where the idea is to try to disincentivize the agent from acting non-corrigibly. The biggest hurdles there seem to be a) making sure your penalty doesn't miss anything, b) figuring out how to trade off between the penalty and performance, and c) making sure that if you train on such a penalty the resulting model will actually learn to always behave according to it (I see this as the hardest part).

On that note, I totally agree that it becomes quite a lot more complicated when you start imagining creating these things via learning, which is the picture I think about primarily. In fact, I think your intuitive picture here is quite spot on: it's just very difficult to control exactly what sort of model you'll get when you actually try to train an agent to maximize something.

Yes you are correct, the paper is not really about agents, but about applying correction functions like to the type agents.

The rest of this comment is all about different types of corrigibility. Basically it is about identifying and resolving a terminological confusion.

Reading your link on 'what I eventually want out of corrigibility', I think I see a possible reason why there may be a disconnect between the contents of my paper and what kind of things you are looking for. I'll unpack this statement a bit. If I do a broad scan of papers and web pages then I see that different types of 'corrigibility' are being considered in the community: they differ in what people want from it, and in how they expect to design it. To assign some labels, there are at least button-corrigibility, and preference-learner-corrigibility.

The Soares, Fallenstein, Armstrong and Yudkowsky corrigibility paper, and my paper above, are about what I call button-corrigibility.

The corrigibility page by Christiano you link to is on what I call preference-learner-corrigibility. I describe this type of corrigibility in my paper as follows:

Agents that are programmed to learn can have a baseline utility function that incentivizes the agent to accept corrective feedback from humans, feedback that can overrule or amend instructions given earlier. This learning behavior creates a type of corrigibility, allowing corrections to be made without facing the problem of over-ruling the emergent incentive of the agent to protect itself. This learning type of corrigibility has some specific risks: the agent has an emergent incentive to manipulate the humans into providing potentially dangerous amendments that remove barriers to the agent achieving a higher utility score. There is a risk that the amendment process leads to a catastrophic divergence from human values. This risk exists in particular when amendments can act to modify the willingness of the agent to accept further amendments. The corrigibility measures considered here can be used to add an extra safety layer to learning agents, creating an emergency stop facility that can be used to halt catastrophic divergence. A full review of the literature about learning failure modes is out of scope for this paper. [OA16] discusses a particular type of unwanted divergence, and investigates ’indifference’ techniques for suppressing it. [Car18] discusses (in)corrigibility in learning agents more broadly.

Ideally, we want an agent that has both good button-corrigibility and good preference-learner-corrigibility: I see these as independent and complementary safety layers.

Christiano expresses some hope that a benign act-based agent will have good preference-learner-corrigibility. From a button-corrigibility standpoint, the main failure mode I am worried about is not that the benign act-based agent will be insufficiently skilled at building a highly accurate model of human preferences, but that it will conclude that this highly accurate model is much easier to build and maintain once it has made all humans addicted to some extreme type of drug.

At the risk of oversimplifying: in preference-learner-corrigibility, the main safety concern is that we want to prevent a learning process that catastrophically diverges from human values. In button-corrigibility, there is a quite different main safety concern: we want to prevent the agent from taking actions that manipulate human values or human actions in some bad way, with this manipulation creating conditions that make it easier for the agent to get or preserve a high utility score. The solution in button-corrigibility (or at least the solution considered in Soares et al. and in my paper) is to have a special 'ritual', e.g. a button press, that the humans can perform in order to change the utility function of the agent, with the agent designed to be indifferent about whether or not the humans will perform the special ritual. This solution approach is quite different from improving learning, in fact it can be said to imply the opposite of learning.
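A toy sketch may help make the indifference idea concrete. It is not the construction from my paper or from Soares et al.: the action names, the numbers, and the flat `compensation` term are all hypothetical, chosen only to show how a correction term can remove the incentive to influence whether the button gets pressed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Action:
    name: str
    cost: float               # utility spent on taking this action
    press_probability: float  # chance the humans end up pressing the button


# Hypothetical values: expected utility if the agent keeps its normal
# utility function, and expected utility after a button press switches it.
V_NORMAL = 10.0
V_AFTER_PRESS = 2.0

ACTIONS: List[Action] = [
    Action("work", cost=0.0, press_probability=0.5),
    Action("disable_button", cost=1.0, press_probability=0.0),
    Action("encourage_press", cost=1.0, press_probability=0.9),
]


def expected_value(action: Action, corrected: bool) -> float:
    """Expected utility of an action, with or without the correction term."""
    # The correction pays out exactly what the press would otherwise cost,
    # so the pressed and not-pressed branches are worth the same.
    compensation = (V_NORMAL - V_AFTER_PRESS) if corrected else 0.0
    p = action.press_probability
    pressed_branch = V_AFTER_PRESS + compensation
    return (1.0 - p) * V_NORMAL + p * pressed_branch - action.cost


def best_action(actions: List[Action], corrected: bool) -> Action:
    """Plain search over all actions for the highest expected utility."""
    return max(actions, key=lambda a: expected_value(a, corrected))


if __name__ == "__main__":
    for corrected in (False, True):
        choice = best_action(ACTIONS, corrected)
        print(f"corrected={corrected}: agent chooses {choice.name}")
```

With these made-up numbers, the uncorrected agent prefers `disable_button`, while the corrected agent prefers `work`: once the correction is in place, the press probability no longer affects expected utility, so any action whose only effect is to change that probability is pure cost.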

Say that the button-corrigible agent in my paper becomes somewhat aware that the people might be planning to push the button. If so, the constructed indifference property implies that the agent will have no motivation whatsoever to act on this awareness by launching a deeper investigation into what values or emotions might motivate the people to have these plans. Having improved knowledge about motivations might be very useful if the agent wanted to stop or delay the people, if the agent wanted to engage in acts of lobbying as I call it in my paper. But as the agent is completely indifferent about stopping them, it has no motivation to spend energy on the sub-goal of learning more about their motivations. So in a sense, button-corrigibility has the side effect of making the agent into a non-learner when it comes to certain topics. (This is all modulo what is inside : the in the agent might contain an incentive for the agent to explore the new phenomenon further if it becomes aware that people might have button pushing plans, but if so this has no effect on how works to suppress lobbying.)

So overall, if you are looking in a button-corrigibility paper for a mechanism that generates a corrective pressure to improve a learning process about human values, a mechanism that improves preference-learner-corrigibility, you will probably not find it.

Personal experience: I often come across a paper or web page where the authors introduce the concept of corrigibility by referencing Soares et al., so then I read on expecting a discussion of button-corrigibility concerns and approaches. But when I read on I often discover that it is really about preference-learner-corrigibility concerns and approaches. This happened to me also when reading your paper Risks from Learned Optimization: section 4.4 says that 'Furthermore, in both deceptive and corrigible alignment, the mesa-optimizer will have to spend time learning about the base objective to enable it to properly optimize for it'. As I explained above, a correctly working button-corrigible mesa-optimiser will not be motivated to spend any time learning about the base objective so that it can improve its ability to be deceptively aligned. A preference-learner-corrigible mesa-optimiser might, though. So the above sentence triggered me into realising that you and your co-authors were likely not thinking about button-corrigible optimisers when it was written.


In button-corrigibility, there is a quite different main safety concern: we want to prevent the agent from taking actions that manipulate human values or human actions in some bad way, with this manipulation creating conditions that make it easier for the agent to get or preserve a high utility score.

I generally think of solving this problem via ensuring your agent is myopic rather than via having a specific ritual which the agent is indifferent to, as you describe. It seems to me like even if you could make the agent indifferent to the shutdown button, there would still be other ways for it to influence you into giving it higher reward. Thus, it seems like what you really want here is some guarantee that the agent is only considering its per-episode reward and doesn't care at all about its cross-episode reward, which is the condition I call myopia. That being said, how to reliably produce myopic agents is still a very open problem.
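As a minimal sketch of the distinction, assuming a made-up payoff table: the two action names and their per-episode rewards below are hypothetical, and the only point is to contrast an agent that values just the current episode with one that also values later episodes.

```python
from typing import Callable, Dict, List

# Hypothetical payoffs: the reward received in episode 0, 1, 2 as a
# consequence of the action taken in episode 0.
PAYOFFS: Dict[str, List[float]] = {
    "do_task_honestly": [1.0, 1.0, 1.0],
    "influence_overseer": [0.5, 3.0, 3.0],
}


def myopic_value(rewards: List[float]) -> float:
    """A myopic agent only counts the reward of the current episode."""
    return rewards[0]


def cross_episode_value(rewards: List[float]) -> float:
    """A non-myopic agent also counts reward from later episodes."""
    return sum(rewards)


def choose(value_fn: Callable[[List[float]], float]) -> str:
    """Pick the action whose reward stream scores highest under value_fn."""
    return max(PAYOFFS, key=lambda action: value_fn(PAYOFFS[action]))


if __name__ == "__main__":
    print("myopic agent chooses:", choose(myopic_value))
    print("cross-episode agent chooses:", choose(cross_episode_value))
```

In this table the myopic agent picks `do_task_honestly` and the cross-episode agent picks `influence_overseer`, because the influencing action only pays off in later episodes. The sketch says nothing about how to train an agent so that it actually ends up with the myopic objective.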

It seems to me like even if you could make the agent indifferent to the shutdown button, there would still be other ways for it to influence you into giving it higher reward.

Yes this is true: basically the fundamental job of the button-corrigible agent is still to perform a search over all actions, and find those actions that best maximise the reward computed by the utility function. As mentioned in the paper, such actions may be to start a hobby club, where humans come together to have fun and build petrol engines that will go into petrol cars. Starting a hobby club is definitely an action that influences humans, but not in a bad way.

Button-corrigibility intends to support a process where bad influencing behaviour can be corrected when it is identified, without the agent fighting back. The assumption is that we are not smart enough yet to construct an agent that is perfect on such metrics when we start it up initially.

On myopia: this is definitely a technique that might make agents safer, though I would worry if this implies that an agent is incapable of considering long term effects when it chooses actions: this would make the agent unsafe in many cases.

Short term, I think myopia is a useful safety technique for many types of agents. Long term, if we have agents that can modify themselves or build new sub-agents, myopia has a big open problem: how do we ensure that any successor agents will have exactly the same myopia? Button-corrigibility avoids having to solve this preservation of ignorance problem: it also works for agents and successor agents that are maximally perceptive and intelligent, because it modifies the agent goals, not agent perception or reasoning capability. The successor agent building problem does not go away completely though: button-corrigibility still has some hairy problems with agent modification and sub-agents, but overall I have found these to be more tractable than the problem of ignorance preservation.

making sure that if you train on such a penalty the resulting model will actually learn to always behave according to it (I see this as the hardest part).

I agree that this is a huge problem if the penalty is applied at the level of the base optimizer. I see some promise in the approach of explicitly whitelisting mesa optimizers which are designed to be safe. I talk more about it in my post yesterday, so I will quote directly from it,

To oversimplify things for a bit, there are a few ways that we could ameliorate the issue of misaligned mesa optimization. One way is that we could find a way to robustly align arbitrary mesa objectives with base objectives. I am a bit pessimistic about this strategy working without some radical insights, because it currently seems really hard. If we could do that, it would be something which would require a huge chunk of alignment to be solved.
Alternatively, we could whitelist our search space such that only certain safe optimizers could be discovered. This is a task where I see impact measurements could be helpful.
When we do some type of search over models, we could construct an explicit optimizer that forms the core of each model. The actual parameters that we perform gradient descent over would need to be limited enough such that we could still transparently see what type of "utility function" is being inner optimized, but not so limited that the model search itself would be useless.
If we could constrain and control this space of optimizers enough, then we should be able to explicitly add safety precautions to these mesa objectives. The exact way that this could be performed is a bit difficult for me to imagine. Still, I think that as long as we are able to perform some type of explicit constraint on what type of optimization is allowed, then it should be possible to penalize mesa optimizers in a way that could potentially avoid catastrophe.
During the process of training, the model will start unaligned and gradually shift towards performing better on the base objective. At any point during the training, we wouldn't want the model to try to do anything that might be extremely impactful, both because it will initially be unaligned, and because we are uncertain about the safety of the trained model itself. An impact penalty could thus help us to create a safe testing environment.

...I just completed reading "Risks from Learned Optimization", so here are some remarks on the corrigibility related issues raised in that paper. These remarks might also address your concerns above, but I am not 100% sure if I am addressing them all.

Full disclosure: I came across "Risks from Learned Optimization" while preparing the related work section of my paper. My first impression then was that it dealt with a somewhat different safety concern, so I put it on my read-later backlog. Now that I have read it, I see that section 4 discusses safety concerns of a type that corrigibility intends to address, so the two papers are more related than I originally thought.

Note for the reader: I will use the terminology of the "Risks from.." paper in the paragraphs below.

Section 4 of "Risks from..." considers the problem that the agent that is constructed by the base optimizer might have an emergent incentive to resist later updates, updates intended to make it more aligned. This resistance may take the form of 'deceptive alignment'. Intuitively, to me, this is a real risk: it may not apply to current base optimisers but I expect it to come up in future architectures. For me this risk is somewhat orthogonal to the question of whether the constructed agent might also contain a mesa-optimiser.

The "Risks.." paper warns, correctly I feel, that if the base optimizer is supplied with an objective/utility function that includes a layer to ensure corrigibility, this does not imply that the agent created by the base optimizer will also be perfectly corrigible. Some loss of fidelity might occur during the construction.

As a sub-case, and I think this relates to the concern raised in the comment above, if the base optimizer is supplied with an objective/utility function carefully constructed so that it does not create any incentive to spend resources resisting utility function updates, it does not follow that the agent created by the base optimizer will also perfectly lack any incentive to resist. Again there may be a loss of fidelity. As mentioned in the comment above, it would be extraordinarily strong to make the assumption that a carefully constructed property of ignorance or indifference present in the training utility function is exactly preserved in the constructed agent.

I believe that including a corrigibility layer into the utility function used by the base optimizer will likely help to improve the safety of the constructed agent -- if the agent that is constructed is scored on displaying corrigible behaviour, it is likely to be more corrigible.

But my preferred way of applying the corrigibility safety layer would be to apply it directly to the agent that was constructed. An important point of the corrigibility design is that it can take a very complex and opaque base utility function U_N, and then construct an agent with provable properties.

So what we really want to do is to take the agent constructed by the base optimizer, and apply a software transformation that adds the corrigibility layer to it. General theories of computation and machine intelligence say that such a software transformation will exist. However, applying the transformation may be non-trivial, or may lead to an unacceptable performance penalty. For the agent architectures defined in the various corrigibility papers, the transformation is easy, and has a relatively low impact on agent performance. But it is unclear to me if and how a transformation could be done for a neural net or table-driven agent. Agents that contain clear mesa-optimizers inside, agents that search a solution space while scoring alternative outcomes numerically, may be actually easier to transform than agents using other architectures. So I am not sure if it should be a goal to come up with base optimisers that suppress the construction of mesa-optimizers. Designing base optimisers to produce outputs that slot into a mesa-optimiser architecture to which software transformations can be applied may be another way to go.

I have some additional thoughts and comments on "Risks from Learned Optimization", but I will post them later in the comment sections there.


I think I mostly agree with what you're saying here, except that I do have some concerns about applying a corrigibility layer to a learned model after the fact. Here are some ways in which I could see that going wrong:

  1. The learned model might be so opaque that you can't figure out how to apply corrigibility unless you have some way of purposefully training it to be transparent in the right way.
  2. If you aren't applying corrigibility during training, then your model could act dangerously during the training process.
  3. If corrigibility is applied as a separate layer, the model could figure out how to disable it.

I think if you don't have some way of training your model to be transparent, you're not going to be able to make it corrigible after the fact, and if you do have some way of training your model to be transparent, then you might as well apply corrigibility during training rather than doing it at the end. Also, I think making sure your model is always corrigible during training can also help with the task of training it to be transparent, as you can use your corrigible model to work on the task of making itself more transparent. In particular, I mostly think about informed oversight via an amplified overseer as how you would do something like that.

2. If you aren't applying corrigibility during training, then your model could act dangerously during the training process.

I agree for certain types of training. To clarify: I wrote the above while thinking about a type of base-optimiser that constructs a mesa-optimiser by running many candidate mesa-optimisers in a virtual environment, and then picking the highest-scoring one. If the training process involves running in a real environment, additional safety measures need to be considered.
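To make the type of base-optimiser I had in mind concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the one-dimensional environment, the three candidate policies, and the scoring rule are hypothetical stand-ins, and real candidates would of course be far more complex.

```python
from typing import Callable, Dict, Tuple

# A candidate is modelled as a policy: current position -> step in {-1, 0, +1}.
Policy = Callable[[int], int]

GOAL = 5  # hypothetical target position in the toy environment


def run_in_virtual_environment(policy: Policy, steps: int = 10) -> float:
    """Run one candidate in a simulated 1-D world and return its total score."""
    position = 0
    total = 0.0
    for _ in range(steps):
        step = max(-1, min(1, policy(position)))  # clamp to a legal move
        position += step
        total -= abs(GOAL - position)  # penalty grows with distance from goal
    return total


# Hypothetical candidates that the base optimiser gets to choose between.
CANDIDATES: Dict[str, Policy] = {
    "stand_still": lambda pos: 0,
    "always_right": lambda pos: 1,
    "seek_goal": lambda pos: (GOAL > pos) - (GOAL < pos),
}


def base_optimiser(candidates: Dict[str, Policy]) -> Tuple[str, Dict[str, float]]:
    """Score every candidate in simulation and pick the highest-scoring one."""
    scores = {name: run_in_virtual_environment(p) for name, p in candidates.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores


if __name__ == "__main__":
    best, scores = base_optimiser(CANDIDATES)
    print("scores:", scores)
    print("selected candidate:", best)
```

Because every candidate only ever acts inside `run_in_virtual_environment`, nothing a candidate does during selection touches the real world, which is the property the paragraph above relies on.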

3. If corrigibility is applied as a separate layer, the model could figure out how to disable it.

Well the interesting thing here is that the button-corrigible agent that is produced after the software transformation might be fully aware that there exist actions it can take to disable the corrigibility layer, but this does not matter. The software transformation not only adds the layer, it also adds an incentive to preserve the layer, so the agent will not take any actions to disable it. (This broad statement is no longer true if the agent succumbs to an attack in a hostile universe, see e.g. section 7.2 of the paper.)

Haven't read the paper yet, but note that AU is already an abbreviation for attainable utility (in particular, "AUP agents"). Similarly for "utility preserving", which might be confusing (but maybe inevitable).