OP (please comment there): https://medium.com/@lucarade/issues-with-iterated-distillation-and-amplification-5aa01ab37173

I will show that, even under the most favorable assumptions about the feasibility of Iterated Distillation and Amplification (IDA) and the resolution of the currently open problems necessary for implementing it, IDA fails to produce an aligned agent in the sense of corrigibility.

Part 1: The assumptions

Class 1: There are no problems with the human overseer.

1.1: Human-generated vulnerabilities are completely eliminated through security amplification. (See this post for a lengthy overview and intuition, and this post for a formalization.) In short, security amplification converts the overseer in IDA from high-bandwidth (receiving the full input in one piece) to low-bandwidth (receiving inputs divided into small pieces), making it impossible to craft an input that exploits human vulnerability to manipulation. See this post for a good explanation of high-bandwidth vs. low-bandwidth overseers.
My critique applies equally to high-bandwidth and low-bandwidth overseers, so I make no assumption on that front.
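
To make the low-bandwidth picture concrete, here is a minimal sketch (my own illustration, not Paul's actual construction; the token budget and helper name are assumed for the example): the query is chopped into pieces small enough that no single overseer copy ever reads a long, potentially manipulative input in one piece.

```python
# Illustrative only: a "low-bandwidth" overseer copy sees at most MAX_TOKENS tokens
# at once, so an attack that depends on delivering a long manipulative message to a
# single human cannot reach any one overseer copy intact.

MAX_TOKENS = 8  # assumed per-copy bandwidth budget, for illustration

def split_for_low_bandwidth_overseer(query: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Divide a query into pieces small enough for a low-bandwidth overseer copy."""
    tokens = query.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

query = "a long input that might embed a manipulative payload somewhere deep inside it"
for i, piece in enumerate(split_for_low_bandwidth_overseer(query)):
    print(f"overseer copy {i}: {piece!r}")  # each piece goes to a different copy
```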

1.2: There is no moral hazard in the human overseers. This eliminates one of Stuart’s critiques. Furthermore, the human overseer displays corrigible behaviors without error.

1.3: The relevant experts are willing to put in a substantial amount of time for the training process. This is a non-trivial assumption which I have not yet seen discussed.

Class 2: The framework and its auxiliary components function as intended.

2.1: Reliability amplification functions as intended. In summary, reliability amplification uses a voting ensemble of agents at each stage of amplification to avoid error amplification, the process by which an initially small probability of error grows with each iteration.
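
A toy calculation (the error rates and ensemble size below are my own assumptions, chosen only for illustration) shows why voting helps against errors of this kind: without it, a small per-step error probability compounds across iterations, whereas a simple majority vote drives the effective per-step error rate down sharply, provided the copies' errors are independent and detectable.

```python
from math import comb

def majority_error(eps: float, k: int) -> float:
    """Probability that a majority of k independent copies err, each with probability eps."""
    need = k // 2 + 1
    return sum(comb(k, j) * eps**j * (1 - eps)**(k - j) for j in range(need, k + 1))

eps, T = 0.01, 100  # assumed per-step error rate and number of iterations

single = 1 - (1 - eps) ** T                       # P(at least one error) with a lone agent
ensemble = 1 - (1 - majority_error(eps, 3)) ** T  # same, with a 3-way vote at each step

print(f"single agent:   P(any error over {T} steps) ~ {single:.3f}")    # ~ 0.63
print(f"3-way ensemble: P(any error over {T} steps) ~ {ensemble:.4f}")  # ~ 0.03
```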

2.2: Corrigibility, not optimal value-aligned performance, is our goal. All we care about is that our agent “is trying to do what its operator wants it to do.” It may be bad at actually figuring out what its operator wants or at carrying out those wants, but the point is that it cares about improving, and will never intentionally carry out an action it knows is contrary to what its operator would want it to do (see this post and this post for a clarification of Paul’s approach to AI alignment by achieving corrigibility).

Stuart has pointed out problems with corrigibility, which I agree with. Essentially, the concept is ill-defined given the fuzziness of human values, and to properly implement corrigibility an agent must completely understand human values, thus reducing to the much harder value learning problem. However, we will assume that an agent which understands and implements the general concept of corrigibility, even if it accidentally misbehaves in many cases and causes widespread harm upon initial implementation as Stuart’s argument suggests, will still avoid existential risk and allow us to improve it over time, and is thus satisfactory. I think this is Paul’s approach to the matter.

Even a fully corrigible agent can be catastrophically misaligned, as detailed in this post. As addressed in the comments of that post, however, if we assume humans are smart enough to avoid a corrigible AI causing existential risk in this manner then the issue goes away.

2.3: There is no coordination possible among any of the A[n]s, eliminating another of Stuart’s critiques.

2.4: The informed oversight problem is solved. In summary, the problem is that it is difficult for a more powerful aligned overseer to fully understand the decision-making process of a weaker agent in a way that allows the overseer to push the weaker agent towards alignment. Note, however, that even granting this, it does not follow that a weaker aligned overseer can understand the decision-making process of a more powerful agent; that seems impossible almost by definition, even if full transparency were achievable.

2.5: Worst-case scenario techniques are effective at eliminating the risk of a treacherous turn, in which an agent performs well on training and test data but behaves catastrophically on some input in the real world which was not accounted for.

Part 2: Corrigibility is unlikely to be trained

The crux of my argument is that, even given the very optimistic assumptions above, IDA will fail to train a corrigible agent. I will list the problems sequentially, but my argument is that all five are necessary for failure. It is possible a subset is sufficient but I will not explore that possibility in this post.

Problem 1: Limited training data. Even given assumption 1.3, carrying out the process of training each new iteration is extremely time-intensive. It would take years of many experts’ full-time commitment as overseers to produce anything close to a reasonable amount of data.
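
A back-of-envelope calculation (every number below is my own assumption, not a figure from the IDA literature) illustrates the scale problem: even a generous commitment of expert time yields a dataset that is tiny by the standards of modern machine learning.

```python
MINUTES_PER_ANSWER = 5          # assumed time for an expert overseer to handle one query
EXPERT_HOURS_PER_YEAR = 2_000   # roughly one full-time year per expert
N_EXPERTS = 10                  # assumed number of experts willing to commit full-time

examples_per_year = N_EXPERTS * EXPERT_HOURS_PER_YEAR * 60 // MINUTES_PER_ANSWER
print(f"~{examples_per_year:,} overseer-labeled examples per year")  # ~240,000
```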

Problem 2: Hidden error amplification. If an error is identifiable by an aligned agent, then reliability amplification will almost certainly remove it. However, a small initial hidden error will be rapidly amplified: in the next iteration it will be manifested in various ways across many of the thousands of copies of the error-containing agent. Thus, in the distillation step, the initial error will be propagated in many different forms, leading to a multiplication of errors. Informed oversight does not help, since the human is not more powerful than the subroutines; and even if transparency is achieved, requiring the human to verify all output would permit only very limited capability amplification.
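
The following toy model (entirely my own; the copy counts and noise scales are arbitrary) captures the mechanism I have in mind: an error that every copy of A[n] shares is invisible to voting, because aggregating over copies that all inherit the same bias simply returns that bias, and each distillation step layers a fresh hidden perturbation on top of it.

```python
import random

random.seed(0)

N_COPIES = 1000        # assumed number of copies of A[n] used during amplification
N_ITERATIONS = 20      # assumed number of amplify-and-distill rounds
DISTILL_NOISE = 0.001  # assumed fresh hidden error introduced by each distillation

bias = 0.001  # small initial hidden error in A[0]
for n in range(N_ITERATIONS):
    # Every copy inherits the same bias plus its own independent noise, so
    # aggregating across copies cancels the noise but preserves the shared bias.
    copies = [bias + random.gauss(0, 0.0005) for _ in range(N_COPIES)]
    inherited = sum(copies) / N_COPIES
    # Distillation then adds its own hidden error on top of the inherited one.
    bias = inherited + random.gauss(0, DISTILL_NOISE)
    print(f"iteration {n + 1:2d}: hidden deviation ~ {abs(bias):.4f}")
```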

I will show in Problem 4 that such hidden errors are likely to arise.

Problem 3: Information loss. The human overseer has access to information, in the form of memories (broadly construed as information stored in the human's mind from before the current training process), that is not accessible to the agent A[n+1] training on (H+A[n]), which sees only (H+A[n])'s actions. A[n+1] will thus erroneously attribute decisions the human made using that unavailable information to the information that is available. This information loss and the resulting distortion are greatly exacerbated by the same process described in Problem 2, in two ways: the information loss of previous overseers is amplified, and each current overseer introduces further information loss of their own.
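
As a minimal illustration (my own toy example; the numbers and rule are invented for the sketch), suppose the overseer's action depends on both an observable input x and a private memory bit that never appears in the training data. An imitator that sees only (x, action) pairs can do no better than averaging over the hidden bit, so it attributes the memory-driven part of the decision to x and learns a distorted rule.

```python
import random

random.seed(2)

def overseer_action(x: float, memory: int) -> float:
    """H's true rule uses a private memory bit the learner never observes."""
    return x + (2.0 if memory else -2.0)

# The training data exposes only the input and the action, not the memory.
data = []
for _ in range(10_000):
    x = random.uniform(-1.0, 1.0)
    m = random.randint(0, 1)
    data.append((x, overseer_action(x, m)))

# From (x, action) alone, the best fit for the offset is its conditional mean,
# roughly 0: the memory-dependent component of H's behavior simply averages out.
learned_offset = sum(a - x for x, a in data) / len(data)
print(f"learned offset ~ {learned_offset:.3f} (true offset is +2 or -2, depending on memory)")
```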

Paul thinks this is a general problem of learning and thus falls outside the scope of things that specifically cause IDA to fail, as pointed out in this post and in the comments on this post, but I disagree. One can certainly imagine (and some have experienced) a human robustly learning another human's decision-making heuristics over time without direct access to the other human's memories, and can by extension also imagine an artificial agent extracting information from a human to robustly understand that human's decision-making process. The problem exists not in all forms of learning but in the class of training techniques that do not involve a direct and adaptive extraction of information from a human in some form.

Problem 4: No prior concept of corrigibility. Because of information loss, an agent has no way of extracting the concept of corrigibility from its training data, only the behavior of corrigibility. The way the agent implements corrigibility will thus necessarily be an approximation, even if an extremely good one, and will not necessarily be robust to drastic changes in context. This produces the small hidden errors that are then amplified through the hidden error amplification of Problem 2, making reliability amplification ineffective. Without hidden error amplification this would probably not be a problem, since agents which successfully approximate corrigibility behaviorally will be able to detect all but the tiniest deviations from optimal corrigibility (i.e., understanding the concept the way you and I do). However, hidden error amplification causes a nontrivial corrosion of corrigibility across iterations: as each newly distilled agent approximates an increasingly corrupted behavioral corrigibility that deviates from our ideal conceptual corrigibility, reliability amplification keeps us close to each successively deviated behavioral corrigibility, but not close to the ideal conceptual corrigibility. The process behaves essentially as a high-dimensional random walk with extremely small steps, but with thousands of steps per iteration, one for each copy of A[n].
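
A rough numerical sketch of this random-walk picture (my own toy model; the dimensionality, step size, and copy count are assumptions) shows the qualitative behavior: consecutive iterations stay very close to each other, while the cumulative distance from the ideal keeps growing roughly like the square root of the total number of steps.

```python
import math
import random

random.seed(1)

DIM = 50               # assumed dimensionality of "behavior space"
STEP = 1e-4            # assumed size of each tiny deviation
STEPS_PER_ITER = 2000  # one tiny step per copy of A[n], also assumed

origin = [0.0] * DIM    # ideal conceptual corrigibility sits at the origin
position = [0.0] * DIM  # current behavioral corrigibility
for iteration in range(1, 11):
    start = position[:]
    for _ in range(STEPS_PER_ITER):
        axis = random.randrange(DIM)
        position[axis] += random.choice((-STEP, STEP))
    step_from_previous = math.dist(start, position)  # what reliability amplification "sees"
    drift_from_ideal = math.dist(origin, position)   # what actually matters
    print(f"iter {iteration:2d}: step from previous ~ {step_from_previous:.4f}, "
          f"drift from ideal ~ {drift_from_ideal:.4f}")
```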

Problem 5: Temporal inconsistency of proxy dynamics (TIPD). Any incomplete simulation is not robust over time without an adaptive capacity. There are certain underlying processes which are time-invariant, such as the laws of physics and the mathematics of evolution, but clearly we can never completely simulate any non-trivial situation purely in terms of these processes. An agent must therefore rely on proxy dynamics for decision-making: emergent properties of the fundamental processes that fairly reliably approximate cause-and-effect relationships between actions and outcomes. However, because of the complexity of the underlying dynamics and their interactions, these proxy dynamics change over time, and often quite drastically over short periods (see the literature on chaos theory, critical transitions, and bifurcation points). Thus, an agent which performs robustly at one point in time may behave catastrophically at another. The only solution is for the agent to be capable of adapting its policy to changes in the proxy dynamics it relies on.
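
Here is a toy illustration of TIPD (my own construction, not drawn from the post's sources): an agent learns the proxy rule "pushing the lever by u moves the outcome by +u" during training, the sign of the underlying dynamics later flips, and the same fixed policy that used to stabilize the system now drives it away from the target, while a policy that re-estimates the proxy dynamics keeps behaving safely.

```python
def step(state: float, action: float, gain: float) -> float:
    """One step of the true dynamics; 'gain' is the proxy relationship of the moment."""
    return state + gain * action

def fixed_policy(state: float, target: float) -> float:
    # Learned under gain = +1: apply half of the apparent correction each step.
    return 0.5 * (target - state)

target = 0.0
state_fixed = state_adaptive = 5.0

for t in range(15):
    gain = 1.0 if t < 5 else -1.0  # the proxy dynamics flip sign at t = 5

    # Fixed policy: keeps assuming the training-time dynamics.
    state_fixed = step(state_fixed, fixed_policy(state_fixed, target), gain)

    # Adaptive policy: probes the current dynamics, re-estimates the gain, then corrects.
    probe = step(state_adaptive, 0.1, gain)
    estimated_gain = (probe - state_adaptive) / 0.1
    state_adaptive = step(probe, (target - probe) / estimated_gain, gain)

    print(f"t={t:2d}: fixed policy -> {state_fixed:8.3f}, adaptive policy -> {state_adaptive:6.3f}")
```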

This sounds like the treacherous turn problem, but it is distinct, and harder. In the treacherous turn problem, we have an agent that is not sufficiently well trained given the input-output relationships of the world. This can probably be solved by worst-case scenario techniques like adversarial training. In TIPD, even if we succeed in training a robust policy, the proxy dynamics used to inform decisions will change such that an action in response to an input which previously would have produced a safe behavior now produces a catastrophic behavior.

As a result, behavioral corrigibility, whether corrupted or not, is not robust over time since it does not adapt to changing input-output relationships. An agent must possess conceptual corrigibility for such adaptation to occur, which is extremely hard, and may reduce to the value learning problem.

Part 3: Achieving alignment in this process through anything but corrigibility is doomed

This is fairly obvious, and mostly follows from Part 2. Because of information loss, this method will never capture anything more than a proxy of the human's decision-making process, and any such proxy will fail without an adaptive capacity, which we do not know how to implement robustly.

Conclusion

I have argued that even under the most optimistic assumptions about the human overseer and the successful operation of the framework, IDA will fail to produce a corrigible agent. This failure is a result of the interplay between hidden error amplification, information loss, the ability to learn behavioral corrigibility but not conceptual corrigibility, and the temporal inconsistency of proxy dynamics (TIPD). The solution to these problems seems very hard, and may reduce to the value learning problem, in which case the IDA framework does not provide us with any advantage.