HCH, introduced in "Humans Consulting HCH", is a computational model in which a human answers questions with the help of questions answered by other humans, who can in turn consult further humans, and so on. Each step of the process consists of a human taking in a question, optionally asking one or more subquestions of other humans, and returning an answer based on the responses to those subquestions. HCH can be used as a model of what Iterated Amplification would be able to do in the limit of infinite compute. HCH can also be used to decompose the question "is Iterated Amplification safe?" into "is HCH safe?" and "if HCH is safe, will Iterated Amplification approximate the behaviour of HCH in a way that is also safe?".
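The recursive structure above can be made concrete with a toy sketch. This is purely illustrative (the `decompose`, `combine`, and `hch` names and the string-splitting "human" are my own stand-ins, not anything from the HCH literature): a "human" is modelled by one function that splits a question into subquestions and another that combines sub-answers into an answer.

```python
# Toy sketch of the HCH recursion. A real HCH step involves a human;
# here the "human" is caricatured by two illustrative functions.

def decompose(question):
    # A real human would choose meaningful subquestions; here we just
    # split on " and " purely for demonstration.
    parts = question.split(" and ")
    return parts if len(parts) > 1 else []

def combine(question, sub_answers):
    # With no subquestions, the human answers directly (stubbed here).
    if not sub_answers:
        return f"answer({question})"
    return "; ".join(sub_answers)

def hch(question, depth=3):
    """One HCH step: ask subquestions, answer them recursively, combine."""
    subquestions = decompose(question) if depth > 0 else []
    sub_answers = [hch(q, depth - 1) for q in subquestions]
    return combine(question, sub_answers)

print(hch("what is X and what is Y"))
# → "answer(what is X); answer(what is Y)"
```

The `depth` bound stands in for the fact that any real instantiation can only recurse finitely; HCH as a limit object allows unbounded consultation.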
I think there is a way of interpreting HCH that leads to incorrect intuitions about why we would expect it to be safe. Here, I describe three models of how HCH could work, and the reasons we might expect each to be safe.
Mechanical Turk: A human, Bob, is hired on Mechanical Turk to act as a component of HCH. Bob takes in a reasonable-length natural language question, formulates subquestions to ask other Turkers, and turns the responses from those Turkers into an answer to the original question. Bob only sees the question he is asked, and thinks for only a short period of time before asking subquestions or returning an answer. The question "is HCH corrigible?" here becomes "how does the corrigibility of Bob translate into corrigibility of the overall system?". To claim that HCH is safe in this scenario, we could point to Bob being well-intentioned, having human-like concepts, and reasoning in a human-like way. Also, since Bob has to communicate with other humans in natural language, those communications could be monitored or reflected upon. We could then claim that the reasoning that produces the answer stays within the space of reasoning that humans use, and so is more likely to reflect our values and less likely to yield unexpected outcomes that misinterpret them.
Lookup Table: An AI safety research team led by Alice writes down a set of 100 million possible queries that they claim capture all human reasoning. For each of these queries, they then write out the subquestions that would need to be asked, along with simple computer code that combines the answers to the subquestions into an answer to the original question. This produces a large lookup table, and the "human" in HCH is just a call to this lookup table. The question "is HCH corrigible?" here becomes "has Alice's team successfully designed a set of rules that perform corrigible reasoning?". To justify this, we point to Alice's team having a large body of AI safety knowledge, proofs of properties of the system, demonstrations of the system working in practice, etc.
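To make the Lookup Table variant concrete, here is a minimal sketch (the table contents and function names are hypothetical examples of mine, not from the scenario): each entry maps a query to its subquestions plus a small piece of combiner code, and answering a query is just recursive table lookup.

```python
# Illustrative sketch of the Lookup Table scenario: the "human" is a
# finite table from queries to (subquestions, combiner code).

LOOKUP = {
    "population of A and B?": (
        ["population of A?", "population of B?"],
        lambda answers: str(sum(int(a) for a in answers)),
    ),
    # Leaf queries have no subquestions; the combiner returns a constant.
    "population of A?": ([], lambda answers: "1000"),
    "population of B?": ([], lambda answers: "2000"),
}

def hch_lookup(query):
    """Answer a query using only the precomputed table, recursively."""
    subquestions, combiner = LOOKUP[query]  # KeyError if query is outside the table
    sub_answers = [hch_lookup(q) for q in subquestions]
    return combiner(sub_answers)

print(hch_lookup("population of A and B?"))  # → "3000"
```

Note that the system fails loudly (here, a `KeyError`) on any query outside the 100 million the team anticipated, which is part of why the claimed coverage of "all human reasoning" is doing so much work in this scenario.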
Overseer's Manual: An AI safety research team led by Alice has written a manual on how to corrigibly answer questions by decomposing them into subquestions. This manual is handed to Bob, who has been hired to decompose tasks. Bob carefully studies the manual and applies its rules when performing his task (and the quality of his work is monitored by the team). Alice's team has carefully thought about how to decompose tasks, and has run many experiments with people like Bob trying to decompose tasks, so they understand the space of strategies and outputs that Bob will produce given the manual. The "human" in HCH is actually a human (Bob), but in effect Bob is acting as a compressed lookup table, and is only necessary because the lookup table is too large to write down. By analogy: it would take too much space and time to write down translations of all possible 10-word English sentences into German, but it is possible to train humans who, given any 10-word English sentence, can produce its German translation. The safety properties are caused by Alice's team's preparations, which include modelling how Bob would produce answers after reading the manual. To justify the safety of the system, we again point to Alice's team having a large body of AI safety knowledge, proofs of properties of the system, demonstrations of the system working in practice, etc.
I claim that the Mechanical Turk scenario gives an incomplete picture of why we might hope for an HCH system to be safe. Though it might be safer than a computation without human involvement, I would find it hard to trust that such a system would continue to scale without running into problems, such as deliberately or accidentally handing over control to some unsafe computational process. The Mechanical Turk scenario leaves out the design process that Alice's team carries out in the Lookup Table and Overseer's Manual scenarios, which can include at least some consideration of AI safety issues (though how much of this is necessary is an open question). I think this design process, if done right, is what could give the system the ability to avoid these problems as it scales. So we should keep the stronger Lookup Table and Overseer's Manual scenarios in mind when considering whether HCH might be safe.
(Thanks to Andreas Stuhlmüller and Owain Evans for feedback on a draft of this post)