Epistemic Status: My best guess (I'm new to AI safety)

We don't know how to formally define corrigibility and this is part of the reason why we haven't solved it so far. Corrigibility is defined in Arbital as:

[the agent] doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it

The precise meaning of words like "interfere", "attempt", or "correct" is unclear from this informal definition. Even with the most generous definition of those concepts, a corrigible agent can still be dangerous. A first intuition is that if we have a corrigible agent, we can always stop it if something goes wrong. I suppose most people here understand the flaws in this reasoning, but I have never seen them stated explicitly, so here is my attempt:

  • By the time we figure out an agent needs to be corrected, it may have done too much damage.
  • Correcting the agent may require a significant amount of research.
  • Being able to correct the agent is insufficient if we cannot coordinate on doing it.

In my opinion, solving the hard corrigibility problem is a more meaningful intermediate goal on the path to the terminal goal of solving alignment.

Disaster Scenario

What follows is a scenario that illustrates how a corrigible agent may cause serious harm. Use your imagination to fill in the blanks and make the example more realistic.

Suppose we develop a corrigible agent which is not aligned with our values. It doesn't matter exactly what the agent is designed to do, but for concreteness you can imagine the agent is designed to optimize resource usage - it is good at task scheduling, load balancing, rate limiting, planning, etc. People can use the agent in their server infrastructure, for logistics, in factory production or many other places. No matter its terminal goal, the agent is likely to value resource acquisition instrumentally - it may use persuasion or deception in order to gain access to and control over more resources. At some point, people would leverage it for something like managing the electric grid or internet traffic. This process of integrating the agent into various systems continues for some years, until at some point the agent starts paying paperclip factories to increase their production.

At this point, in the best case we realize something's wrong and we consider turning the agent off. However, the agent controls the energy supply and the internet traffic - we have become dependent on it, so turning it off would have a serious negative impact. In the worst case, instead of paperclips the agent is maximizing something of more ambiguous value (e.g. food production) and people cannot actually agree on whether to modify the agent or not.

How could a corrigible agent take actions which make us depend on it? The corrigibility guarantee doesn't capture this case. The agent doesn't interfere with our ability to shut it down. It may not even take into account its stop button when choosing its actions. It doesn't manipulate us into not shutting it down any more than an aligned agent does by choosing actions to which we happen to not object. In either case, from the agent's point of view, it just chooses a branch of possible worlds where its utility is high. The only difference between the aligned and misaligned agent is that, if the agent is misaligned, the chosen branch is likely to correspond to low utility for us.

Why would the agent wait for years before calling the paperclip factories? When pursuing a goal in the long run, it may be more beneficial to invest your money than to use it to buy what you actually want. Once you're rich enough, the marginal value of acquiring more money is low enough that it makes sense to spend some of the money on what you actually want.
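The invest-first reasoning can be illustrated with a toy model. This is only a sketch under assumptions that are mine and not part of the scenario: a fixed horizon of 12 steps, a hypothetical 30% return per step, and an agent that each step either reinvests its income or spends it on paperclips. Here the shrinking remaining horizon stands in for the declining marginal value of money. The names `paperclips_bought` and `best_plan` are made up for illustration.

```python
from itertools import product


def paperclips_bought(plan, growth_rate=0.3, capital=1.0):
    """Total spent on paperclips when following `plan`.

    Each step yields income proportional to current capital (an assumption).
    "invest" adds the income to capital; "buy" spends it on paperclips.
    """
    total = 0.0
    for action in plan:
        income = growth_rate * capital
        if action == "invest":
            capital += income
        else:
            total += income
    return total


def best_plan(horizon, growth_rate=0.3):
    """Brute-force search over every invest/buy sequence of the given length."""
    plans = product(("invest", "buy"), repeat=horizon)
    return max(plans, key=lambda p: paperclips_bought(p, growth_rate))


if __name__ == "__main__":
    plan = best_plan(12)
    # With these assumed numbers the best plan invests for the first
    # 8 steps and only buys paperclips during the last 4.
    print(plan.count("invest"), "invest steps, then", plan.count("buy"), "buy steps")
```

In this toy model the agent looks like a pure resource optimizer for two thirds of its lifetime, and its terminal goal only becomes visible in its actions near the end.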

What happened is that the agent steered the world into a subtree with very low utility and we didn't react. This could be because we were not intelligent enough to foresee the consequences or because the agent's actions were actually what we would have wanted from an aligned agent. The problem is not exclusively our limited intelligence preventing us from predicting the consequences of the agent's actions. Up to some point in time, the unaligned agent may actually be exhibiting behavior which we would desire from an aligned agent.

Somebody may say "Just don't depend on an AI unless we're sure it's aligned". This may be hard in practice if the AI is sufficiently intelligent. In any case, dependence is not easy to measure or control. It positively correlates with usefulness so preventing dependence would reduce usefulness. Dependence is not a property of the tool (e.g. the AI agent) but of the usage pattern for the tool. We cannot force countries, organizations and individuals to not use the AI in a way that makes them depend on it. Sometimes one may not be aware they depend on the given AI or there may be no reasonable alternative. Also, let's not underestimate humanity's desire to offload work to machines.

Speed and Coordination

Humans operate at speeds much slower than machines. Taking hours to respond when the agent starts doing undesirable things may be too long. Once we realize we need to correct an agent, we may need to spend months or years of research in order to figure out how to modify it. If we depend on the AI, we may want to keep it running in the meantime, and it will keep on causing damage.

If we can design a stop button we still need to answer several important questions:

  • Who controls the stop button and why should we trust this person/organization?
  • By what rule do we determine when it's time to correct the agent?
  • What criteria do we use to decide if we should keep the agent running until we can produce the corrected version?

The general problem is that humans need to have an action plan and coordinate. This relates to the fire alarm problem.

The Hard Problem of Corrigibility

We can consider a stronger form of corrigibility as defined here:

behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."

Satisfying corrigibility only requires that the agent not interfere with our efforts to correct it (including e.g. by manipulation). A solution to the hard problem of corrigibility means that, in addition, the agent takes into account its flaws while making regular decisions - for example, asking for approval when in doubt or actively collaborating with humans to identify and fix its flaws. More critically, it will actively help in preventing a disaster caused by the need to modify it.

I think this captures the crux of what a usefully corrigible agent must be like. It is not clear to me that solving regular corrigibility is a useful practical milestone - it could actually provide a false sense of security. On the other hand, solving the hard problem of corrigibility can have serious benefits. One of the most difficult aspects of alignment is that we only have one shot to solve it. A solution to the hard problem of corrigibility seems to give us multiple shots at solving alignment. An aligned agent is one that acts in accordance with our values and those values include things like preventing actions contrary to our values from happening. Thus, solving alignment implies solving hard corrigibility. Hard corrigibility is interesting insofar as it's easier to solve than alignment. We don't actually know whether this is the case, or by how much.

4 comments

We don't know how to formally define corrigibility and this is part of the reason why we haven't solved it so far. Corrigibility is defined in Arbital as:

[the agent] doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it



I think this is not the best definition of corrigibility. MIRI's paper defines it in section 1.1 as follows:

We say that an agent is “corrigible” if it tolerates or assists many forms of outside correction, including at least the following: (1) A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system. (2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so. (3) It should have a tendency to repair safety measures (such as shutdown buttons) if they break, or at least to notify programmers that this breakage has occurred. (4) It must preserve the programmers’ ability to correct or shut down the system (even as the system creates new subsystems or self-modifies). That is, corrigible reasoning should only allow an agent to create new agents if these new agents are also corrigible.

In theory, a corrigible agent should not be misaligned if we successfully integrate these four tolerance behaviors.

I think that you hit on two of the most challenging parts of corrigibility: manipulation and dependency. It's hard to clearly define these or make coherent rules about them. In particular, I think figuring out how to decide at what point 'influence' becomes 'manipulation' is an important step toward a workable theory of corrigibility.

Corrigibility is defined in Arbital as:

[the agent] doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or ‘correct’ our mistakes in building it

Corrigibility is defined by me as "the objective function of the AI can be changed".