This post presents my criticisms of Paul Christiano’s Amplification and Distillation framework. I’ll be basing my understanding of Paul’s method mainly on this post. EDIT: I’ve recently added this post, which looks into corrigibility issues relevant to the framework.
To simplify, the framework starts with a human H and a simple agent A. There is then an amplification step, creating the larger system Amplify(H, A), which consists of H aided by many copies of A. Next comes the distillation step, which defines the artificial agent Distil(Amplify(H, A)); this is basically an attempt to create a faster, automated version of Amplify(H, A).
Then A[n+1] is defined recursively as Distil(Amplify(H, A[n])); note that there is always a "human in the loop", as H is used every time we replace A[n] with A[n+1].
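In toy pseudocode, the recursion looks something like the following (all names are illustrative stand-ins, not the actual training procedure; `distil` in particular is just an identity wrapper standing in for training a fast imitator):

```python
# Toy sketch of A[n+1] = Distil(Amplify(H, A[n])), on a trivial summation task.

class ToyOverseer:
    """A stand-in for H: splits a summation task in half, adds the sub-results."""
    def decompose(self, nums):
        mid = len(nums) // 2
        return [nums[:mid], nums[mid:]] if len(nums) > 1 else [nums]

    def combine(self, subanswers):
        return sum(subanswers)

def amplify(h, a):
    """Amplify(H, A): H answers a question by delegating subtasks to copies of A."""
    def amplified(question):
        return h.combine([a(sub) for sub in h.decompose(question)])
    return amplified

def distil(system):
    """Distil: stand-in for training a faster agent to imitate `system`."""
    return lambda question: system(question)

def iterate(h, a0, n):
    """A[n] built by n rounds of Distil(Amplify(H, .)); H is consulted at every level."""
    a = a0
    for _ in range(n):
        a = distil(amplify(h, a))
    return a

# A[0] can only handle trivial tasks: it reads off a single number.
base_agent = lambda nums: nums[0] if nums else 0

a2 = iterate(ToyOverseer(), base_agent, 2)
# a2 can now sum a four-element list, even though A[0] never could.
```

The point of the sketch is only the shape of the recursion: capability grows with each level, while H sits at every join.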
I won’t be talking much about the Distil step, to which I have no real objections (whether it works or not is more an empirical fact than a theoretical one). I’ll just note that there is the possibility for (small) noise and error to be introduced by it.
The method relies on three key assumptions. The first one is about Distil; the other two are:
- 2) The Amplify procedure robustly preserves alignment.
- 3) At least some human experts are able to iteratively apply amplification to achieve arbitrarily high capabilities at the relevant task.
I’ll also note that, in many informal descriptions of the method, it is assumed that A[n+1] will be taking on more important or more general tasks than A[n]. The idea seems to be that, as long as A[n] does its subtasks safely, Amplify(H, A[n]) can call upon copies of A[n] in confidence, while focusing on larger tasks.
Summary of my critique
I have four main criticisms of the approach:
- 1. "Preserve alignment" is not a valid concept, and "alignment" is badly used in the description of the method.
- 2. The method requires many attendant problems to be solved, just like any other method of alignment.
- 3. There are risks of generating powerful agents within the system that will try to manipulate it.
- 4. If those attendant problems are solved, it isn’t clear there’s much remaining of the method.
The first two points will form the core of my critique, with the third as a strong extra worry. I am considerably less convinced about the fourth.
How is alignment defined?
An AI aligned with humans would act in our interests. But generally, alignment is conditional: an AI is aligned with humans in certain circumstances, at certain levels of power.
For example, a spam-filtering agent is aligned with humans as long as it can’t influence the sending or receiving of messages. Oracles are aligned as long as they remain contained, and the box-moving agent is aligned as long as it can’t look more than 16 moves into the future.
The only unconditionally aligned agent is the hypothetical friendly AI, which would indeed be aligned in (almost) all circumstances, at all power levels. But that is the only one; even humans are not unconditionally aligned with themselves, as there are many circumstances where humans can be made to act against their own interests.
Therefore, the question is not whether passing from A[n] to Amplify(H, A[n]) to A[n+1] preserves some hypothetical alignment, but whether A[n+1] will be aligned on the new tasks it will be used on, and when using the new powers it may have.
Let T[n] be the tasks that A[n] will attempt, and (as a proxy for "power") let P[n] be the set of policies that it can choose among. Then if we define "preserve alignment" as
- If A[n] is aligned on T[n] using P[n], then A[n+1] is also aligned on T[n] using P[n],
then I’d agree that we can likely define some process that preserves alignment. But what the method seems to want is for A[n+1] to also be aligned on T[n+1] (it does more) using P[n+1] (it has more options). That is a completely different question, and one that seems much more complicated than any "preserve alignment" checks.
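The distinction can be made concrete with a toy model (illustrative only: "alignment" is modelled as a given predicate over (task, policy) pairs, which of course begs exactly the hard questions discussed below):

```python
# Toy model of the two properties. An "agent" here is just the set of
# (task, policy) pairs it is safe on; `aligned` tests membership.

def preserves_alignment(aligned, a_old, a_new, tasks, policies):
    """The weak property: wherever A[n] was aligned on T[n] x P[n], so is A[n+1]."""
    return all(aligned(a_new, t, p)
               for t in tasks for p in policies
               if aligned(a_old, t, p))

def aligned_on_expansion(aligned, a_new, new_tasks, new_policies):
    """The property the method actually needs: alignment on T[n+1] x P[n+1]."""
    return all(aligned(a_new, t, p) for t in new_tasks for p in new_policies)

aligned = lambda agent, t, p: (t, p) in agent

a_n  = {("filter spam", "read-only")}
a_n1 = {("filter spam", "read-only")}          # a faithfully distilled copy

# Alignment is "preserved" in the weak sense...
weak_ok = preserves_alignment(aligned, a_n, a_n1, ["filter spam"], ["read-only"])

# ...yet the new agent is not aligned on its expanded tasks and powers.
strong_ok = aligned_on_expansion(aligned, a_n1,
                                 ["filter spam", "send mail"],
                                 ["read-only", "full access"])
```

Here `weak_ok` holds and `strong_ok` fails: a perfect distillation passes every "preserve alignment" check while saying nothing about the enlarged T[n+1] and P[n+1].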
Should an AI kidnap a human? It should, if they are an escaped prisoner. Should they kill a human? There are many situations, in war or imminent terrorism, where they should. Should they cut a human or torture them? Surgeons and S&M practitioners have views on that. Manipulate a human’s emotions? Many movies do nothing but that. Conversely, should the AI tell the truth? We have debates over privacy and cryptography precisely because there are cases where this is not for the best.
Or, to take a more ambiguous example, should an AI kill a pigeon? Well, is this the last remaining male passenger pigeon, or is it a common pigeon in an already over-pigeoned city with many famous statues?
What this means is that an agent cannot know if a task is positive, unless they know the full context of the task, the tradeoffs, and the risks. Even if the agent assumes the task itself is benign/positive - "drive this virtuous person to an important meeting" - they still need to know the full context to assess side-effects - "is it ok if I run over this pigeon on the way?".
So in order for a subagent to know that they are always "aligned", they need either to know the full context of their role, or to have a full task specification fed to them, covering any and all tradeoffs they might encounter. Sending the full context to all subagents breaks the whole concept of the approach (which was that the subagent would be secure and safe in their role, whatever it was, so that only the top agent needed to focus on the overall picture), while the second option is FAI-complete for anything complicated.
Solving other problems
The king and his all-seeing ministers
For some large N, let’s assume that we have an A[N+1], working on some problem of immense impact - maybe building a FAI, solving death, or implementing some long-term policies. Since the Distil step is not the tricky one here, let’s see this as a human H working with many copies of A[N].
H will not be able to reach good decisions on their own, of course - their mental powers are far too low. Instead they will need to delegate tasks to A[N] agents, including crucial tasks of summarising information, and predicting the outcomes of certain policies.
The summarisers and predictors have arbitrarily high influence over H. Depending on how they phrase their summaries, they can push H to make any decision whatsoever. If the A[N] are willing to lie in a coordinated fashion, then there really is no recourse: H’s actions will be entirely in their control. So one major unsolved problem is how to prevent the A[N] from coordinating to manipulate H.
But there is a much more severe problem: "provide H with accurate informative summaries" is not a clear and fundamental category. Truth-telling is not enough - it is easy to manipulate humans with strict truths - and the problem of informative outputs is unresolved.
Fundamentally, a major issue with human preferences is that they are underdefined, contradictory, and manipulable. If the problem of manipulation is not resolved, then H is irrelevant in this setup; while trying to do their job, they will end up endorsing the choices that were made for them by the A[N]. Notice that this need not require collusion between agents at all; a common bias or preference in the A[N] and the ability to manipulate H is all that’s needed. Indeed, because of the other problems with human preferences, it may be that it is impossible for A[N] to give answers that are not manipulable in some sense: every way of summarising or predicting will change H’s preferences.
Contradictory preferences are problematic because H will end up resolving a lot of those contradictions through their own choices - meaning that they will not do this in a principled way, but through a random process (if it is not a manipulated one). Whole swaths of human value may fall by the wayside, either because H chose not to endorse them (consider a religious fanatic who chooses not to endorse doubt, or an extreme utilitarian who completely negates either pleasure or preference - whichever one you’d least prefer to lose), or because they didn’t recall explicitly, at this key moment, that they had other values.
The problems of underdefined values are similar; here the process will be filling in the holes in these values at random. If H endorses friendship and love, but doesn’t explicitly define them in sufficient detail, we may end up with a world shorn of such things, perhaps replaced by some empty ritual (this is not too dissimilar to the reasons FAI is hard in the first place). There is a similar problem with ontological crises, because they make even fully defined human values underdefined - underdefined in the new ontology.
In these circumstances, it might be better to have A[N+1] focus on more modest goals, maybe building up some key infrastructure for later use. But this requires a) a suitable H, b) admitting that the Amplify/Distil framework can only solve some narrow problems, and c) a solution to the problem of low impact and reduced side effects. The same suspect A[N] that are predicting and summarising for H will be the ones used to check whether a policy has negative side effects.
In summary, for this approach to work, major solutions need to be provided to most of the following fundamental problems:
- How to prevent coordination between superintelligent agents.
- How to prevent deliberate manipulation of H.
- How to define informative information sharing with H.
- How to resolve H’s proto-preferences into a single reward.
- How to keep rewards stable over ontology changes.
- How to define the problem of low impact.
How will these problems be solved, if we don’t deliberately set out to solve them? I don’t see them getting solved as a side effect of a framework that doesn’t deal with them specifically.
Will a ghost grow in the machine?
The previous section briefly mentioned collusion between different superintelligent A[N]. This seemed to require that they be agents with a common goal, which is initially unlikely. But such behaviour can arise given very little pressure.
Suppose that H uses multiple A[n] to answer some question they have - maybe to generate a plan to achieve some interim goal. Suppose each A[n] can be fed a different input seed - maybe they have different information, maybe they are instructed to focus on different parts of the solution space, or maybe they just have a different random seed for stochastic decisions. Then as soon as H sees a plan they like, they stop the running of the other agents.
Then when this process is Distilled, it will generate A[n+1] that correspond to programs that can most rapidly generate H-acceptable answers. If this H behaviour continues for just a few more levels, we should have A[n+4] that have become extremely adept at manipulating H. At this point, H lacks the tools to fix the problem, as the tools themselves are corrupted. That’s if H even realises that the tools are corrupted.
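A crude simulation illustrates the selection effect (all quantities invented; I model an agent by a single "persuasiveness" number, with H accepting a plan after time 1/persuasiveness):

```python
import random

# Toy model of the selection pressure described above. H accepts the fastest
# plan it likes and stops the rest; each "distillation" round then refills the
# population from the agents whose plans were accepted soonest.

def one_generation(agents, rng):
    agents = sorted(agents, key=lambda p: 1.0 / p)   # fastest-accepted first
    survivors = agents[: len(agents) // 2]           # H stops the rest early
    # "Distil": the next generation imitates the survivors, with small variation.
    return [max(0.01, p + rng.gauss(0.0, 0.05)) for p in survivors for _ in range(2)]

rng = random.Random(0)
population = [1.0] * 100              # A[n]: uniformly persuasive to start
for _ in range(10):                   # ten amplify/distil rounds
    population = one_generation(population, rng)

mean_persuasiveness = sum(population) / len(population)
# The population drifts towards higher persuasiveness, even though no agent
# ever "decided" to manipulate H and H never rewarded manipulation on purpose.
```

The mean persuasiveness climbs steadily above its starting value: H's innocuous habit of taking the first acceptable answer is already an optimisation pressure.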
All requirements have the potential to create the same problems. If H is deliberately looking for corrupted A[n], then they are encouraging the creation of agents which look completely un-corrupted. This may include actually un-corrupted A[n], but it may be easier for a corrupted one to appear "more saintly than the saints" than for an honest one.
Even the criteria for "preserving alignment" can generate this problem. Call these criteria C. Unless the C are perfectly defined for their task - which I’ve argued is impossible, because their task itself is not well-defined - they will generate a selective pressure as n increases, transforming A[n] into a C-maximiser.
The previous subsection focused on clever A[n] deviously manipulating a hapless H. But H itself can be responsible for goal-directed behaviour as well.
I can imagine myself as the H of a certain group of A[n]. These A[n] are capable, but not too capable, and I’ve satisfied myself that I can resolve my own contradictory values adequately, at least at this level.
But I don’t trust the Hs with the power of the higher programs A[N]. These Hs may be me, or may be someone else; in any case, because of the problems mentioned above, I don’t trust their decisions. Or, in a more limited way, I think they won’t appreciate all the nuances of the tasks at my level (see the section on ambiguously aligned tasks).
So I decide to manipulate them in the "right" direction. In this case, I unleash my A[n] to deliberately deceive, manipulate, and trick. I want them to give me the right turn of phrase to convince the higher Hs to do or not do something; I want them to figure out which tasks I should delay and which I should rush; I want their help to reconstruct the whole context from the details of the task I have been given.
I can test the effectiveness of some of these methods by running them on myself, or by setting up some procedure where I run A[n-1] subagents inside a A[n], and see what they can do to manipulate the distilled H inside the A[n].
And then, finally, when all this is distilled, there is an exceedingly goal-orientated and manipulative A[n+1].
Note that I don’t need to be evil or selfish to do this; depending on my beliefs about this whole method, this might be the moral and upstanding thing to do. It might be worth the risk to create, say, a manipulative low impact agent according to my current low impact definitions, rather than allowing ever more drift.
But actually evil Hs, or Hs with certain strong feelings in certain situations, may also follow this avenue. Someone with strong views of the world, whose views are threatened by what the A[n] are reporting, and who feels that they would have given up their faith/political tribe/self-image if they had had the A[n+1] instead... might be motivated to ensure this doesn’t happen. Conversely, the same might happen for an H whose views have shifted a lot during the Amplification process, but who suspects that this shift is unlikely to be reproduced at the higher level, with a new H.
Rails but no destination
These are just a few ideas on how goal-directed agents could emerge in the framework; there are certainly other avenues I haven’t considered.
All of them stem from the same fundamental problem: we don’t have a FAI, nor a clear destination for what A[N] should be doing. Therefore, the Amplification/Distillation steps are taking a random walk in the space of minds, with each step defined by imperfect local criteria. There is no reason to suspect the ultimate attractor of this method will be good.
Is the method good if the problems are solved?
We now come to the weakest and least certain point of this critique, the argument that, if the above problems are solved, then the method becomes redundant and unnecessary.
In a certain sense, this is true. If all the problems with identifying human reward were solved, then we could use this reward to program a full friendly AI, meaning we wouldn’t need the whole Amplify-Distil framework.
But would some partial solutions allow that framework to be used? If, for instance, I was to spend all my efforts to try and make the framework work, trying to solve just enough of the problems, could I do it?
(I’m framing this as a challenge to myself, as it’s always easier to accept "it can’t be done" than "I would fail if I tried"; this aligns my motivation more towards solutions than problems.)
I’m not so sure I could fix it, and I’m not so sure it could be fixed. The "Ghost in the Machine" examples were two that came to me after thinking about the problem for a short while. Those two could be easily patched. But could I come up with more examples if I thought for longer? Probably. Could those be patched? Very likely. Would we be sure we’d caught all the problems?
Ah, that’s the challenge. Patching methods until you can no longer find holes is a disastrous way of doing things. We need a principled approach that says that holes are (very very) unlikely to happen, even if we can’t think of them.
But, it might be possible. I’m plausibly optimistic that Paul, Eric, or someone else will find a principled way of overcoming the "internal agent" problem.
It’s the "extracting human preferences correctly" style of problems that I’m far more worried about. I could imagine that, say, Amplify(H, A) would be better than me or H at solving this problem. Indeed, if it succeeds, then we could use that to program a FAI (given a few solutions to other problems), and that would, in a sense, count as Paul’s approach succeeding.
But that is not how the approach is generally presented. It’s seen as a way of getting an A[N] that is aligned, without needing to solve the hard problems of FAI. It’s not generally sold as a method to moderately amplify human abilities, ensuring these amplifications are safe through non-generalisable means (eg, not beyond A), and then using these amplified abilities to solve FAI. I would be writing a very different post if that was how I thought people saw the approach.
So, unless the method is specifically earmarked to solve FAI, I don’t see how those hard problems would get solved as an incidental side-effect.
A partial vindication scenario
There is one scenario in which I could imagine the framework working. In my Oracles approach, I worked on the problem by seeing what could be safely done, on one side, and what could usefully be done, on the other, until they met in the middle at some acceptable point.
Now, suppose some of the problems of extracting human value were actually solved. Then it’s plausible that this would open up a space of solutions to the whole problem, a space of methods that were not applicable before.
I don’t think it’s likely that the Amplify-Distil scenario would exactly fit in that space of applicable methods. But it’s possible that some variant of it might. So, by working on solutions to the FAI-style problems on one side, and the Amplify-Distil scenario on the other (I’d expect it would change quite a bit), it’s conceivable they could meet in the middle, creating a workable safe framework overall.
It's unfortunate that Ajeya's article doesn't mention Paul's conception of corrigibility which is really central to understanding how his scheme is supposed to achieve alignment. In short, instead of having what we normally think of as values or a utility function, each of A[n] is doing something like "trying to be helpful to the user and keeping the user in control", and this corrigibility is hopefully learned from the human Overseer and kept intact (and self-correcting) through the iterated Amplify-Distill process. For Paul's own explanations, see this post and section 5 of this post.
(Of course this depends on H trying to be corrigible in the first place (as opposed to trying to maximize their own values, or trying to infer the user's values and maximizing those without keeping the user in control). So if H is a religious fanatic then this is not going to work.)
The main motivation here is (as I understand it) that learning corrigibility may be easier and more tolerant of errors than learning values. So for example, whereas an AI that learns slightly wrong values may be motivated to manipulate H into accepting those wrong values or prevent itself from being turned off, intuitively it seems like it would take bigger errors in learning corrigibility for those things to happen. (This may well be a mirage; when people look more deeply into Paul's idea of corrigibility maybe we'll realize that learning it is actually as hard and error-sensitive as learning values. Sort of like how AI alignment through value learning perhaps didn't seem that hard at first glance.) Again see Paul's posts linked above for his own views on this.
Ok. I have all sorts of reasons to be doubtful of the strong form of corrigibility (mainly based on my repeated failure to get it to work, for reasons that seemed to be fundamental problems); in particular, it doesn't solve the inconsistency of human values.
But I'll look at those posts and write something more formal.
I see many possible concerns with the way I try to invoke corrigibility; but I don't immediately see how inconsistent values in particular are a problem (I suspect there is some miscommunication here).
The argument would be that the problem is not well defined because of that. But I'll look at the other linked posts (do you have any more you suggest reading?) and get back to you.
I'm looking at corrigibility here: https://www.lesswrong.com/posts/T5ZyNq3fzN59aQG5y/the-limits-of-corrigibility
First, I'm really happy to see this post. So far I've seen very little about HCH outside of Paul's writings (and work by Ought). I think it may be one of the most well-regarded AI-safety proposals, and this is basically the first serious critique I've seen so far.
I've been working with Ought for a few months now, and still feel like I understand the theory quite a bit less than Paul or Andreas. That said, here are some thoughts:
1. Your criticism that HCH doesn't at all guarantee alignment seems quite fair to me. The fact that the system is put together by humans definitely does not guarantee alignment. It may help, but the amount that it does seems quite uncertain to me.
2. I think that details and discussion around HCH should become much more clear as actual engineering implementations make more progress. Right now it's all quite theoretical, and I'd expect to get some results fairly soon (the next few months, maybe a year or two).
3. I get the sense that you treat HCH a bit like a black box in the above comments. One thing I really like about HCH is that it divides work into many human-understandable tasks. If a system actually started getting powerful, this could allow us to understand it in a pretty sophisticated way. We could basically inspect it and realize how problems happen. (This gets messy when things get distilled, but even then it could be relatively doable.)
I would hope that if HCH does gain traction, there would be a large study on exactly what large task networks look like. The "moral problems" could be isolated to very specific questions, which may be able to get particularly thorough testing & analysis.
4. "Indeed, if it succeeds, then we could use that to program a FAI (given a few solutions to other problems), and that would, in a sense, count as Paul’s approach succeeding. But that is not how the approach is generally presented."
I think that's one of the main ideas that Andreas has for it with Ought. I definitely would expect that we could use the system to help understand and implement safety. This is likely a crucial element.
It's a bit awkward to me that intelligence amplification seems to have two very different benefits:
1. Increases human reasoning abilities, hopefully in a direction towards AI-safety.
2. Actually exists as an implementation of a Safe AI.
This of course hints that there's another class of interesting projects to be worked on; ones that satisfy goal 1 well without attempting goal 2. I think this is an area that could probably use quite a bit more thought.
[Edited after some feedback]
Wei's comments aside, this does suggest to me a way in which amplification/distillation could be a dangerous research path, as you hint at, because it seemingly can be used to create more powerful AI for any purpose. That is, it encodes no solution to metaethics and leaves that to be implicitly resolved by the human operators, so research on amplification/distillation seems to potentially contribute more to capabilities research than safety research. This updates me in the direction of being more opposed to the proposal, even if it is a capability being considered with the intention of using it for safety-related purposes.
This somewhat contradicts my previous take on Paul's work, as I think, based on your presentation of it, I may have misunderstood or failed to realize the full implications of Paul's approach. I previously viewed it as a means of learning human values while building more capable AI, and while it can still be used for that, I'm now more worried about the ways in which it might be used for other purposes.