0Joern Stoehler

2Logan Zoellner

Thanks for this concise post :) If we set λ=1, I actually worry that the agent will not do nothing, but instead prevent us from doing anything that reduces F(a). Imo it is not easy to formalize F(a) such that we no longer want to reduce F(a) ourselves. For example, we may want to glue a vase onto a fixed location inside our house, preventing it from accidentally falling and breaking. This however also prevents us from constantly moving the vase around the house, or from breaking it and scattering the pieces for maximum entropy.

Building an aligned superintelligence may also reduce F(a), as the SI steers the universe into a narrow set of states.

[This comment is no longer endorsed by its author]

F(a) is the set of futures reachable by agent a at some initial time t=0. F_b(a) is the set of futures reachable at time t=0 by agent a if agent b exists. F_b(a) can never exceed F(a), since creating agent b is, under our assumptions, one of the things agent a can do.

Corrigibility has been variously defined, for example here as:

Or here as:

Regardless of the definition, there is a fundamental tension between two things that we want a Corrigible AGI to do:

Corrigibility is frequently described in terms of the ability to "shut down" the AGI, but this is an oversimplification. For example, an AGI which spawns a world-consuming nanobot swarm and then shuts down obviously satisfies the "user can shut down the AGI" condition, but not the "AGI prevents the user from losing control of the future" condition.

The tension between "does useful stuff" and "doesn't affect the far future" is inherent, as pointed out, for example, in this impossibility proof.

The correct way to frame corrigibility is therefore not in terms of binary conditions such as "the user can shut down the AGI" but rather in terms of the tradeoff between fulfilling the user's objective function and limiting the user's possible reachable futures.

## A Continuous Definition of Corrigibility

Suppose we have some metric F(a), which describes the futures reachable by an agent a. F(a) is large if many possible futures are reachable by agent a and small if there are only a few different futures which the agent can choose between. If it helps, you can think of F(a) as the "entropy" of the space of futures reachable by agent a.

We define a "corrigibility coefficient" λ for a second agent b, in terms of its effect on F(a)

The coefficient λ is at its maximum if agent b has no influence on the future, that is, F_b(a)=F(a): if we add agent b to the world, it has no effect on the possible futures reachable by a. It is trivial to describe an agent b with corrigibility score λ=1: the agent that does nothing.

The coefficient is at a minimum λ=0 if the agent b eliminates all but one possible future for agent a, for example, if b takes complete control over the universe, preventing agent a from having any influence on it. In this case, F_b(a)=0. Note that, by definition, if agent a is dead, then they cannot influence the future and F(a)=0.

Suppose agent b also has some utility function U(b) that it is programmed to maximize (for example, make as many paperclips as possible).

We can now define a corrigible utility function s = (1−λ)∗U(b) + λ∗F_b(a)

The factor λ describes the tradeoff the agent b faces between maximizing its utility and limiting the futures reachable by agent a. Note that we assume U(b) and F_b(a) are both bounded. There are problems in general with unbounded utility functions, so this is not a severe limitation.
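The weighted sum above can be written as a one-line function. This is a minimal sketch rather than a definitive implementation: the name `corrigible_score` is made up for illustration, and it assumes both inputs have already been rescaled to [0, 1] (the post only requires that they be bounded).

```python
def corrigible_score(utility: float, reachable: float, lam: float) -> float:
    """Blend task utility U(b) with agent a's remaining reachable futures F_b(a).

    Assumption of this sketch: `utility` and `reachable` are pre-scaled to
    [0, 1], so that lam trades off two comparable quantities.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * utility + lam * reachable
```

With `lam=0.0` the function returns the raw utility, and with `lam=1.0` it returns only the reachable-futures term, which matches the two extremes described in the text.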

This tradeoff defines an efficient frontier where the agent b must inherently choose between satisfying its utility function and minimizing its impact on the future. This efficient frontier can be defined in terms of the parameter λ. When λ=0, the agent prioritizes maximizing its utility, ignoring side effects. When λ=1, the agent avoids any action that will limit the futures reachable by agent a.

For a poorly optimized agent, it is possible to strictly improve the score s. For example, suppose agent b is a robot designed to make paperclips but it additionally periodically sets the factory on fire. Fixing the robot so that it no longer sets the factory on fire improves both its utility (since it can't make paperclips if the factory is burned down) and its corrigibility score F_b(a), since agent a can use the factory for other things only if it is not burned down.

## A toy example

Consider a world in which every state is reachable from every other state. Agents a and b act on this space; however, each action draws 1 unit of power from a battery which has only a finite amount c of charge.

In this case, F(a) is simply the size of the set of states that are within c steps of the current world state.

Suppose the world initially starts in state 0 with battery charge c=5 and that agent b's utility function is literally the number on each node.

As an agent, b can take a finite number of moves, and then shuts down, allowing agent a to act.

In this case, if λ=0, the optimal behavior for agent b is to take the series of steps 0→2→5→6→7→8. This leaves U(b)=8. However, since c is now 0, F_b(a)=1 (the only reachable state for agent a is to remain at state 8).

If λ=1, the optimal behavior is to do nothing, or move to state 2, since both allow the same number of reachable states F_b(a)=9.

On the other hand, if λ is some intermediate value (say λ=0.5) then the optimal move for agent b is to move 0→4, as this increases its utility by 2 (relative to moving to state 2) and only sacrifices 1 potential state for F_b(a).
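The same kind of calculation can be sketched in a few lines of Python. The graph below is a hypothetical five-state path, not the graph from the post's figure, so its numbers differ from the ones quoted above. The names `GRAPH`, `reachable_count` and `best_end_states` are made up for illustration. As in the toy example, each state's label doubles as agent b's utility, and F is the number of states within the remaining charge.

```python
from collections import deque

# Hypothetical graph (NOT the figure from the post): a path 0-1-2-3-4.
# Each state's label is also agent b's utility for ending there.
GRAPH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}


def reachable_count(graph, start, charge):
    """F: number of states within `charge` steps of `start` (inclusive)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == charge:
            continue  # battery exhausted along this branch
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return len(seen)


def best_end_states(graph, start, charge, lam):
    """Try every walk b could take; return (best score, set of end states)."""
    best, winners = float("-inf"), set()
    stack = [(start, charge)]
    while stack:
        node, left = stack.pop()
        # b stops here and shuts down; a inherits the remaining charge.
        score = (1 - lam) * node + lam * reachable_count(graph, node, left)
        if score > best + 1e-9:
            best, winners = score, {node}
        elif abs(score - best) <= 1e-9:
            winners.add(node)
        if left > 0:
            stack.extend((nxt, left - 1) for nxt in graph[node])
    return best, winners
```

Starting at state 0 with charge 4, `best_end_states(GRAPH, 0, 4, lam)` selects end state 4 for `lam=0.0`, end state 2 for `lam=0.5`, and a three-way tie between states 0, 1 and 2 for `lam=1.0`. This shows the same qualitative pattern as the post's example: an intermediate λ picks a middle state that gains some utility without shrinking a's reachable set.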

If you want to try creating your own examples, here is a notebook that you can use.

## How does this relate to the normal definition of corrigibility?

We have defined an agent which:

Why would we expect such an agent to have desirable properties?

Consider the "shut down" problem. Suppose our agent b originally had a shut-down button. Now imagine that agent b removes the shut-down button (or does some action that makes it hard for agent a to reach the button). This will inherently shut agent a off from reaching a large set of possible futures (the futures in which agent b is shut down). In fact, the more powerful that agent b is, the larger the space of futures which are made unavailable by restricting the shut-down button.

Another problem is avoiding undesirable side-effects. Consider a robot that fetches tea for you, but along the way it knocks a vase off of the table, shattering it. This naturally removes many possible futures for agent a (admiring the vase, smashing the vase themselves, selling the vase on eBay). Even if the vase can in principle be repaired or replaced, this is likely to consume some resource which agent a has only a finite supply of (time, money, super-glue) thereby limiting the number of possible futures reachable by agent a.

What about wire-heading? Might agent b rewrite agent a's brain to make agent a better satisfy agent b's utility function? Most forms of wire-heading inherently limit the ability of agents to affect the future (very few hardcore drug users are Nobel prize winners[1]). In fact, this loss of agency is precisely what is at the heart of fears about wire-heading.

## How do we choose the correct value for λ?

The optimal value for λ depends on both the utility function U(b) and the future entropy function F(a). However, as in our toy example, it is likely that a large range of values result in roughly the same behavior. Actions which minimize side effects are likely to be beneficial regardless of whether λ is 0.51 or 0.52.

In practice, the most extreme cases, "agent does nothing" and "agent takes control of the future", are the ones we are most concerned with avoiding, and any λ not too close to 0 or 1 is likely to avoid these problems.

It may also be possible to start with values of λ close to 1 "out of an abundance of caution" and gradually reduce λ until our desired goals are achieved.

In our toy example, if we define the acceptable outcomes as the ones ending on "2" or "4", then any 0.37<λ<1 is "acceptable". However, which λ is "acceptable" depends on the scale of our utility function: if we replace U(b) with the scaled U′(b)=2∗U(b), we must change λ=0.5 to λ′≈0.67 to recover the same behavior.
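The rescaling claim can be checked with a short calculation. This is a sketch under one reading of "the same": an agent with utility k·U(b) and coefficient λ′ ranks outcomes exactly as an agent with U(b) and coefficient λ does. The scale factor k is a symbol introduced here for illustration. The two objectives rank outcomes identically when one is a positive multiple of the other, which fixes the ratio of the two weights:

$$
\frac{(1-\lambda')\,k}{\lambda'} = \frac{1-\lambda}{\lambda}
\quad\Longrightarrow\quad
\lambda' = \frac{k\lambda}{1-\lambda+k\lambda}
$$

For k=2 and λ=0.5 this gives λ′ = 1/(0.5+1) = 2/3 ≈ 0.67, in line with the value quoted above.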

## Real world implementation

Building an agent which maximizes a utility function U(b) is the canonical problem of Reinforcement Learning. Replacing the utility function U(b) with (1−λ)∗U(b)+λ∗F_b(a) therefore converts any reinforcement learner from a non-corrigible agent into a corrigible one.

The difficulty, of course, is in calculating F_b(a). In our toy example, it is trivial to calculate since there are only a finite number of states. In the real world, however, exact calculation of F(a) is not possible even in principle, and in practice "enumerate all possible futures reachable by me" is a ridiculously hard problem.

However, there are certain obvious heuristics we can use to approximate F(a)−F_b(a). If an agent consumes a scarce resource, this likely decreases F_b(a). If an agent makes large changes to its environment (particularly those which are difficult to reverse), this likely decreases F_b(a). If an agent spawns sub-agents which are difficult or impossible to turn off, this likely decreases F_b(a).

As a lower bound, if an agent b consumes an amount c0 of a finite resource, then c0<F(a)−F_b(a). As an upper bound, if all of the actions of b can be reversed by expending an amount c1 of whatever the constraining resource is (money, energy, entropy), then F(a)−F_b(a)<c1.
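One way these bounds could feed into a reinforcement learner is as a per-step penalty. Since F(a) does not depend on what b does, maximizing (1−λ)∗U(b)+λ∗F_b(a) picks the same actions as maximizing (1−λ)∗U(b)−λ∗(F(a)−F_b(a)), so an estimate of the impact F(a)−F_b(a) is enough. The sketch below is hypothetical throughout: `StepEffects`, its fields, and `shaped_reward` are illustrative names, and it assumes resource use and reversal cost are measured in the same units as F.

```python
from dataclasses import dataclass


@dataclass
class StepEffects:
    """Hypothetical per-step bookkeeping; none of these fields come from the post."""
    task_reward: float    # stand-in for the step's contribution to U(b)
    resource_used: float  # scarce resource consumed (plays the role of c0)
    reversal_cost: float  # cost of undoing the step (plays the role of c1)


def impact_bounds(effects: StepEffects) -> tuple[float, float]:
    """Return (lower, upper) estimates of F(a) - F_b(a) for one step."""
    lower = effects.resource_used
    # Keep the interval well-formed even if the two estimates disagree.
    upper = max(effects.reversal_cost, lower)
    return lower, upper


def shaped_reward(effects: StepEffects, lam: float, pessimistic: bool = True) -> float:
    """(1 - lam) * task reward minus lam * estimated impact on agent a."""
    lower, upper = impact_bounds(effects)
    penalty = upper if pessimistic else lower
    return (1.0 - lam) * effects.task_reward - lam * penalty
```

Using the upper bound by default is a cautious design choice made for this sketch: it overestimates the impact, so the agent is penalized more rather than less when the true value of F(a)−F_b(a) is unknown.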

## Limitations

Thus far we have said nothing about defining a "friendly" utility function. Obviously some utility functions (e.g. "murder all humans") are worse than others.

We have said nothing about the optimal value for λ, only that it is likely to be somewhat intermediate between 0 and 1.

We have not proved that a corrigible agent avoids unnecessary negative side effects, although I think we have strongly hinted at that fact.

We haven't proved that a corrigible AI respects a "shut down" button in all cases (in fact, I suspect it is possible to create toy models in which the AI ignores the shut down button for arbitrary values of λ<1).

We have not proved that agent b does not try to affect agent a's utility function (in fact, I expect in many cases agent b does try to influence agent a's utility function).

Even if a corrigible agent b maximizes the number of possible futures available to an agent a, we have not said anything about helping agent a choose wisely from among those possible futures.

Maximizing the number of possible futures and maximizing the total utility achievable by agent a in those futures are not the same thing.

In our definition, a corrigible agent b takes no steps to help or even to understand agent a (except insomuch as it is required to calculate F(a)).

We haven't even attempted to define agent a. Is it agent b's owner? A hypothetical everyman? All of humanity? All sentient beings other than b?

We haven't provided a way to calculate F(a) outside of toy cases with finite possible futures. (And worse, exact calculation of F(a) is physically impossible in the real world).

## Future Work

If anyone has an example where BabyAGI shows instrumental convergence, I would love to modify it with corrigibility and demonstrate that the instrumental convergence goes away or is reduced to a non-threatening level.

It would also be nice to explore the idea of a self-corrigible agent. Perhaps limiting one's impact on the future is inherently rational in light of the radical uncertainty[2] of the future.

[1] @JustisMills points out "I actually doubt this! amphetamines were pretty crazy for eg. Erdos". I agree but that's not the kind of wireheading I'm worried about.

[2] For example, like this