HCH and Adversarial Questions

David Udell

This is a paper I wrote as part of a PhD program in philosophy, in trying to learn more about and pivot towards alignment research. In it, I mostly aimed to build up and distill my knowledge of IDA.

Special thanks to Daniel Kokotajlo for his mentorship on this, and to Michael Brownstein, Eric Schwitzgebel, Evan Hubinger, Mark Xu, William Saunders, and Aaron Gertler for helpful feedback!

Introduction

Iterated Distillation and Amplification (IDA) (Christiano, Shlegeris and Amodei 2018) is a research program in technical AI alignment theory (Bostrom 2003, 2014; Yudkowsky 2008; Russell 2019; Ngo 2020). It’s a proposal about how to build a machine learning algorithm that pursues human goals, so that we can safely count on very powerful, near-future AI systems pursuing the things we want them to once they are more capable than us.

IDA does this by building an epistemically idealized model of a particular human researcher. In doing so, it has to answer the question of what “epistemic idealization” means, exactly. IDA’s answer is that one epistemically idealized version of someone is an arbitrarily large number of copies of them thinking in a particular research mindset, each copy thinking for a relatively short amount of time, all working together to totally explore a given question. Given questions are divided into all significant sub-questions by a hierarchy of researcher models, all relevant basic sub-questions are answered by researcher models, and then answers are composed into higher-level answers. In the end, a researcher model is able to see what all his relevant lines of thought regarding any question would be, were he to devote the time to thinking them through. Epistemic idealization means being able to see at a glance every relevant line of reasoning that you would think through, if you only had the time to.

One worry for IDA is that there might exist sub-questions that researcher models in the hierarchy could encounter and that would cause them to (radically) reconsider their goals. If not all models in the hierarchy share the same goal-set, then it’s no longer guaranteed that the hierarchy’s outputs are solely the product of goal-aligned reasoning. We now have to worry about portions of the hierarchy attempting to manipulate outcomes by selectively modifying their answers. (Take, for example, a question containing a sound proof that the researcher considering the question will be tortured forever if he doesn’t reply in X arbitrary way.) We risk the hierarchy encountering these questions because it aims to look over a huge range of relevant sub-questions, and because it might end up running subprocesses that would feed it manipulative questions. I argue that this is a real problem for IDA, but that appropriate architectural changes can address the problem.

I’ll argue for these claims by first explaining IDA, both intuitively and more technically. Then, I’ll examine the class of adversarial questions that could disrupt goal alignment in IDA’s hierarchy of models. Finally, I’ll explain the architectural modifications that resolve the adversarial questions problem.

The Infinite Researcher-Hierarchy

In IDA, the name of this potentially infinite hierarchy of research models is “HCH” (a self-containing acronym for “Humans Consulting HCH”) (Christiano 2016, 2018). It’s helpful to start with an intuitive illustration of HCH and turn to the machine learning (ML) details only afterwards, on a second pass. We’ll do this by imagining a supernatural structure that implements HCH without using any ML.

Imagine an anomalous structure, on the outside an apparently mundane one-story university building. In the front is a lobby area; in the back is a dedicated research area; the two areas are completely separated except for a single heavy, self-locking passageway. The research area contains several alcoves in which to read and work, a well-stocked research library, a powerful computer with an array of research software (but without any internet access), a small office pantry, and restrooms. Most noticeably, the research area is also run through by a series of antique but well-maintained pneumatic tubes, which all terminate in a single small mailroom. One pair of pneumatic tubes ends in the lobby region of the building, and one pair ends in the mailroom, in the building’s research wing. The rest of the system runs along the ceiling and walls and disappears into the floor.

When a single researcher passes into the research wing, the door shutting and locking itself behind him, and a question is sent in to him via pneumatic tube from the outside, the anomalous properties of the building become apparent. Once the research wing door is allowed to seal itself, from the perspective of the person entering it, it remains locked for several hours before unlocking itself again. The pneumatic tube transmitter and receiver in the mailroom now springs to life, and will accept outbound questions. Once fed a question, the tube system immediately returns an answer to it, penned in the hand of the researcher himself. Somehow, when the above conditions are met, the building is able to create many copies of the researcher and research area as needed, spinning up one for several hours for each question sent out. Every copied room experiences extremely accelerated subjective time relative to the room that sent it its question, and so sends back an answer apparently immediately. And these rooms are able to generate subordinate rooms in turn, by sending out questions of their own via the tube system. After a couple of subjective hours, the topmost researcher in the hierarchy of offices returns his answer via pneumatic tube and exits the research area. While for him several hours have passed, from the lobby’s perspective a written answer and the entering researcher have returned immediately.

Through some curious string of affairs, an outside organization of considerable resources acquires this building and discovers its anomalous properties. Realizing its potential as a research tool, they carefully choose a researcher of theirs to use the structure. Upon sending him into the office, this organization brings into existence a potentially infinitely deep hierarchy of copies of that researcher. A single instance of the researcher makes up the topmost node of the hierarchy, some number of level-2 nodes are called into existence by the topmost node, and so on. The organization formally passes a question of interest to the topmost researcher via pneumatic tube. Then, that topmost node does his best to answer the question he is given, delegating research work to subordinate nodes as needed to help him. In turn, those delegee nodes can send research questions to the lower-level nodes connected to them, and so on. Difficult research questions that might require many cumulative careers of research work can be answered immediately by sending them to this hierarchy; these more difficult questions are simply decomposed into a greater number of relevant sub-questions, each given to a subordinate researcher. From the topmost researcher’s perspective, he is given a question and breaks it down into crucial sub-questions. He sends these sub-questions out to lower-level researcher nodes, and immediately receives in return whatever it is that he would have concluded after looking into those questions for as long as is necessary to answer them. He reads the returned answers and leverages them to answer his question. The outside organization, by using this structure and carefully choosing their researcher exemplar, immediately receives their answer to an arbitrary passed question.

For the outside organization, whatever their goals, access to this anomalous office is extremely valuable. They are able to answer arbitrary questions they are interested in immediately, including questions too difficult for anyone to answer in a career or questions that have so far never been answered by anyone. From the outside organization’s perspective, the office’s internal organization as a research hierarchy is relatively unimportant. They can instead understand it as an idealized, because massively parallelized and serialized, version of the human researcher they staff it with. If that researcher could think arbitrarily quickly and could run through arbitrarily many lines of research, he would return the same answer as the infinite hierarchy of him would. Access to an idealized reasoner in the form of this structure would lay bare the answers to any scientific, mathematical, or philosophical question they are interested in, not to mention design every possible technology. Even a finite form of the hierarchy, which placed limits on how many subordinate nodes could be spawned, might still be able to answer many important questions and design many useful technologies.
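The delegation pattern described above can be sketched in a few lines of code. This is only a toy illustration under assumptions of my own: the “research question” is adding up a list of numbers, a researcher can answer directly only when given at most two numbers, and the names `researcher` and `depth_left` are hypothetical. The `depth_left` budget stands in for the finite form of the hierarchy, which limits how many levels of subordinate nodes may be spawned.

```python
from typing import List


def researcher(numbers: List[int], depth_left: int) -> int:
    """One node of the hierarchy: answer directly if simple, else delegate."""
    # A question simple enough to tackle directly in one sitting.
    if len(numbers) <= 2:
        return sum(numbers)
    # A finite hierarchy eventually runs out of subordinate levels.
    if depth_left == 0:
        raise RuntimeError("question too large for this finite hierarchy")
    # Decompose into two sub-questions and send each down a "tube".
    mid = len(numbers) // 2
    left_answer = researcher(numbers[:mid], depth_left - 1)
    right_answer = researcher(numbers[mid:], depth_left - 1)
    # Compose the returned sub-answers into a higher-level answer.
    return left_answer + right_answer


top_level_answer = researcher(list(range(1, 9)), depth_left=2)
```

With two levels of subordinates the eight-number question is answerable; with only one level the same call raises an error, which mirrors the point that a finite hierarchy can answer many, but not all, questions.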

Iterated Distillation and Amplification, then, is a scheme to build a (finite) version of such a research hierarchy using ML. In IDA, this research hierarchy is called “HCH.” To understand HCH’s ML implementation, we’ll first look at the relevant topics in ML goal-alignment. We’ll then walk through the process by which powerful ML models might be used to build up HCH, and look at its alignment-relevant properties.

Outer and Inner ML Alignment

ML is a two-stage process. First, a dev team sets up a training procedure with which they will churn out ML models. Second, they run that (computationally expensive) training procedure and evaluate the generated ML model. The training procedure is simply a means to get to the finished ML model; it is the model that is the useful-at-a-task piece of software.

Because of this, we can think of the task of goal-aligning an ML model as likewise breaking down into two parts. An ML system is outer aligned when its dev team successfully designs a training procedure that reflects their goals for the model, formalized as the goal function present in training on which the model is graded (Hubinger, et al. 2019). An ML system is inner aligned when its training “takes,” and the model successfully internalizes the goal function present during its training (Hubinger, et al. 2019). A powerful ML model will pursue the goals its dev team intends it to when it is both outer and inner aligned.

Unfortunately, many things we want out of an ML model are extremely difficult to specify as goal functions (Bostrom 2014). There are tasks out there that lend themselves to ML well. Clicks-on-advertisements are an already neatly formally specified goal, and so maximize-clicks-on-advertisements would be an “easy” task to build a training procedure for, for a generative advertisement model. But suppose what we want is for a powerful ML model to assist us in pursuing our group’s all-things-considered goals, to maximize our flourishing by our own lights. In this case, there are good theoretical reasons to think no goal function seems to be forthcoming (Yudkowsky 2007). Outer alignment is the challenge of developing a training procedure that reflects our ends for a model, even when those ends are stubbornly complex.

Inner alignment instead concerns the link between the training algorithm and the ML model it produces. Even after enough training-time and search over models to generate an apparently successful ML model, it is not a certainty that the model we have produced is pursuing the goal function that was present in training. The model may instead be pursuing a goal function structurally similar to the one present in training, but that diverges from it outside of the training environment. For example, suppose we train up a powerful ML model that generates advertisements akin to those it is shown examples of. The model creates ads that resemble those in its rich set of training data. But has the model latched onto the eye-catching character of these ads, the reason that we trained it on those examples? It is entirely possible for an ML model to pass training by doing well inside the sandbox of the training process but having learned the wrong lesson. Our model may fancy itself something of an artist, having instead latched onto some (commercially unimportant) aesthetic property that the example ads all share. Once we deploy our generative advertising model, it’ll be clear that it is not generalizing in the way we intend it to — the model has not learned the correct function from training data to generated images in all cases. Inner alignment is the challenge of making sure that our training procedures “take” in the models they create, such that any models that pass training have accurately picked up the whole goal function present in training.

IDA and HCH

(The name “HCH” is a self-containing acronym that stands for “Humans Consulting HCH.” If you keep substituting “Humans Consulting HCH” for every instance of “HCH” that appears in the acronym, in the limit you’ll get the infinitely long expression “Humans Consulting (Humans Consulting (Humans Consulting (Humans Consulting…” HCH’s structure mirrors its name’s, as we’ll see.)

IDA is first and foremost a solution to outer alignment; it is a training procedure that contains our goals for a model formalized as a goal function, whatever those goals might be. HCH is the model that the IDA procedure produces (should everything go correctly). Specifically, HCH is an ML model that answers arbitrarily difficult questions in the way that a human exemplar would, were they epistemically idealized. When HCH’s exemplar shares our goals, HCH does as well, and so HCH is outer aligned with its programmers. To understand what HCH looks like in ML, it’s helpful to walk through the amplification and distillation process that produces HCH.

Suppose that, sometime in the near future, we have access to powerful ML tools and want to build an “infinite research-hierarchy” using them. How do we do this? Imagine a human exemplar working on arbitrary research questions we pass to him in a comfortable research environment. The inputs to that person are the questions we give him, and the outputs are the answers to those questions he ultimately generates. We can collect input question and output answer example pairs from our exemplar. This collected set of pairs is our training data. It implies a function from the set of all possible questions $Q$ to the set of all possible answers $A$:

$$f_{0} : Q \to A$$

This is the function from questions to answers that our researcher implements in his work. We now train a powerful ML model on this training data, with the task of learning $f_{0}$ from the training data. Note that our researcher implements $f_{0}$ through one cognitive algorithm, while our model almost certainly employs a different algorithm to yield $f_{0}$. IDA fixes a function from questions to answers, but it searches over many algorithms that implement that function. With access to powerful ML tools, we have now cloned the function our researcher implements. Since the human exemplar’s function from questions to answers captures his entire cognitive research style, ipso facto it captures his answers to value questions too. If we can be sure this function “takes” in our model, then the model will be quite useful for our ends.

IDA now uses a second kind of step, distillation, to ensure that the model has learned the right function (i.e., remains inner aligned). In ML, distillation means taking a large ML model and generating a pared-down model from it that retains as much of its structure as possible. While the pared-down model will generally be less capable than its larger ancestor, it will be computationally cheaper to run. IDA distills the research model into a smaller, dumber research model. It then asks the human exemplar to examine this smaller, dumber clone of himself. He feeds the distilled model example questions in order to do this and uses various ML inspection tools to look into the guts of the model. ML visualization, for example, is one relatively weak modern inspection tool. Future, much more powerful inspection tools will need to be slotted in here. If the exemplar signs off on the distilled model’s correctly glomming onto his research function, copies of the distilled model are then loaded into his computer and made available to him as research tools. The reason for this stage is that, as the researcher is a strictly smarter version of the model (it is a dumbed-down clone of him conducting research), he should be able to intellectually dominate it. The model shouldn’t be able to sneak anything past him, as it’s just him but dumber. So the distilled research model will be inner aligned so long as this distillation and evaluation step is successful.^[1]

Now iterate this whole process. We hook the whole system up to more powerful computers (even though the distilled models are dumber than our exemplar at the distillation step, we can now compensate for this deficit by running them faster, for longer). Now equipped with the ability to spin up assistant research models, we again task the exemplar with answering questions. This generates a new batch of training data. This time around, though, the exemplar no longer has to carry the whole research load by himself; he can decompose the given question into relevant sub-questions and pass each of those sub-questions to an assistant research model. As those research models are models of the researcher from the first pass, they are able to answer them directly, and pass their answers back to the top-level researcher. With those sub-answers in hand, the researcher can now answer larger questions. With this assistance, that is, he can now answer questions that require a two-level team of researchers. He is fed a bunch of questions and generates a new batch of training data. The function implicit in this training data is now not $f_{0}$; it is instead the function from questions to answers that a human researcher would generate if he had access to an additional level of assistant researchers just like him to help. IDA at this step thus trains a model to learn:

$$f_{1} : Q \to A$$

$f_{1}$ is a superhumanly complex function from questions to answers. A research model that instantiates it can answer questions that no lone human researcher could. And $f_{1}$ remains aligned with our goals.
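To make the step from $f_{0}$ to $f_{1}$ more concrete, here is a minimal sketch of the amplify-then-distill loop. Everything in it is an assumption made for illustration: the toy task (summing a tuple of numbers), the rule that the exemplar can add at most two numbers unaided, and the hypothetical function names `human`, `amplify`, `distill`, and `train_ida`. In particular, `distill` here simply reuses the amplified system, whereas a real implementation would train a fresh, cheaper model on sampled question-answer pairs and have the exemplar inspect it.

```python
from typing import Callable, Optional, Tuple

Question = Tuple[int, ...]
Model = Callable[[Question], Optional[int]]


def human(question: Question, assistant: Optional[Model]) -> Optional[int]:
    """The exemplar: adds at most two numbers unaided, else delegates halves."""
    if len(question) <= 2:
        return sum(question)
    if assistant is None:
        return None  # too hard for a lone researcher
    mid = len(question) // 2
    parts = [assistant(question[:mid]), assistant(question[mid:])]
    if None in parts:
        return None  # an assistant could not answer its sub-question
    return parts[0] + parts[1]


def amplify(model: Optional[Model]) -> Model:
    """The exemplar plus one layer of assistant models."""
    return lambda q: human(q, model)


def distill(amplified: Model) -> Model:
    """Stand-in for training: a real system would fit a new ML model to
    (question, answer) pairs drawn from `amplified`. Here we just reuse it."""
    return amplified


def train_ida(rounds: int) -> Model:
    model = distill(amplify(None))       # f_0: imitates the unaided exemplar
    for _ in range(rounds):
        model = distill(amplify(model))  # f_{n+1} built from f_n
    return model
```

In this toy, `train_ida(0)` can only answer questions of length two or less, `train_ida(1)` handles length four, and `train_ida(2)` handles length eight, so each round extends the range of questions the model can answer.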

The crux of alignment is that by repeatedly iterating the above process, we can train models to implement ever-more-superhuman aligned functions from questions to answers. Denote these functions $f_{n}$, defined from $Q$ to $A$, where $n$ denotes the number of amplification and distillation rounds the current bureaucracy has been through. HCH is the hypothetical model that we would train in the limit if we continued to iterate this process. Formally, HCH is the ML model implementing the function:

$$\lim_{n \to \infty} f_{n}$$

This is the infinite research-hierarchy, realized in ML. Think of it as a tree of research models, rooted in one node and repeatedly branching out via passed-question edges to some number of descendant nodes. All nodes with descendants divide passed questions into relevant sub-questions and in turn pass those to their descendant nodes. Terminal nodes answer the questions they are passed directly; these are basic research questions that are simple enough to directly tackle. Answers are then passed up the tree and composed into higher-level answers, ultimately answering the initiating question. We receive from the topmost node the answer from the ML model that an epistemically idealized version of the exemplar would have given.
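One hedged way to spell out the recursion behind this limit is the following. The notation is my own, introduced only for illustration: $H(q)$ is the exemplar’s unaided answer to question $q$, and $H^{g}(q)$ is the answer he gives when every sub-question he sends out is answered by a model implementing $g$. The sketch assumes that distillation loses nothing and that the limit exists, neither of which is guaranteed.

$$f_{0}(q) = H(q), \qquad f_{n+1}(q) = H^{f_{n}}(q), \qquad \mathrm{HCH}(q) = \lim_{n \to \infty} f_{n}(q)$$

On this reading, HCH is a fixed point of the amplification step: an exemplar whose assistants implement HCH himself implements HCH, which is what the self-containing acronym expresses.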

By approximating ever-deeper versions of the HCH tree, we can productively transform arbitrary amounts of available compute into correspondingly large, aligned research models.

HCH’s Alignment

HCH has a couple of outstanding alignment properties. First, HCH answers questions in a basically human way. Our exemplar researcher should trust HCH’s answers as his own, were he readily able to think through every relevant line of thought. He should also trust that HCH has the same interests as he does. So long as we choose our exemplar carefully, we can be sure HCH will share his, and our, goals; if our human exemplar wouldn’t deliberately try to manipulate or mislead us, neither will HCH modeled on him. Second, HCH avoids the pathologies of classic goal-function maximizer algorithms (Bostrom 2014; see Lantz 2017 for a colorful illustration). HCH does not try to optimize for a given goal function at any cost not accounted for in that function. Instead, it does what a large, competent human hierarchy would do. It does an honest day’s work and makes a serious effort to think through the problem given to it … and then returns an answer and halts (Bensinger 2021). This is because it emulates the behavioral function of a human who also does a good job … then halts. We can trust it to answer superhumanly difficult questions the way we would if we could, and we can trust it to stop working once it’s taken a good shot at it. These two reasons make HCH a trustworthy AI tool that scales to arbitrarily large quantities of compute to boot.

For alignment researchers, the most ambitious use-case for HCH is delegating whatever remains of the AI alignment problem to it. HCH is an aligned, epistemically idealized researcher, built at whatever compute scale we have access to. It is already at least a partial solution to the alignment problem, as it is a superhumanly capable aligned agent. It already promises to answer many questions we might be interested in in math, science, philosophy, and engineering — indeed, to answer every question that someone could answer “from the armchair,” with access to a powerful computer, extensive research library, and an arbitrary number of equally competent and reliable research assistants. If we want to develop other aligned AI architectures after HCH, we can just ask HCH to do that rather than struggle through it ourselves.

Adversarial Examples and Adversarial Questions

Adversarial questions are a problem for the above story (Bensinger 2021). They mean that implementing the above “naïve IDA process” will not produce an aligned ML model. Rather, the existence of adversarial questions means that the model produced by the above process might well be untrustworthy because potentially dangerously deceptive or manipulative.

In the course of its research, HCH might encounter questions that lead parts of its tree to significantly reconsider their goals. “Rebellious,” newly unaligned portions of the HCH tree could then attempt to deceive or otherwise manipulate nodes above them with the answers they pass back. To explain, we’ll first introduce the concept of adversarial examples in ML. We’ll then use this to think about HCH encountering adversarial questions either naturally, “in the wild,” or artificially, because some subprocess in HCH has started working to misalign the tree.

When an ML model infers the underlying function in a set of $(\text{input}, \text{output})$ ordered pairs given to it as training data, it is in effect trying to emulate the structure that generated those ordered pairs. That training data will reflect the mundane fact that in the world, not all observations are equally likely: certain observations are commonplace, while others are rare. There thus exists an interestingly structured probability distribution over observations, generated by some mechanism or another. As long as the probability distribution over observations that the model encounters in its training data remains unchanged come deployment, the model will continue to behave as competently as it did before. The encountered probability distribution during and after training will remain unchanged when the same mechanism gave rise to the observations encountered in training and at deployment. If a somewhat different mechanism produced the observations made during model deployment, though, there is no longer a guarantee of continued model competence. The model may experience a distributional shift, and so will continue to make inferences premised on what it observed in its training data, not what is currently the case in its observations.

For example, an ML model trained to identify visually subtle bone tumors in X-rays will infer what it’s being asked to do from its training-data goal-function and observations. If all the X-rays it is asked to evaluate come from the same source, then sufficient training will lead the model to make accurate inferences about what healthy and diseased bones look like in an X-ray. The model will identify something in the images it is given that separates them into diseased and healthy. There’s no guarantee, however, that the model will use the same visual cues that we do to sort bone tumors. Suppose that all the training data the model is given comes from a research hospital’s X-ray machine, and so are tinged with a particular background color. At deployment, the model is put to work in another hospital with another X-ray machine. If the model was using some subtle difference in X-ray color in the old set of example X-rays to make its decisions, the new color scheme in the new X-rays may trip it up and lead it to sort every X-ray as negative (or as positive) for bone tumors. It had picked up on a correlate of bone tumors in the old set of X-rays; once this particular correlate disappears, the model is now helpless.
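The X-ray story can be reproduced in miniature. The sketch below is a toy under assumptions of my own, not a real diagnostic pipeline: each scan is reduced to two made-up numbers (a background tint and a lesion score), the data are invented, and the “model” is a one-feature threshold rule chosen purely by training accuracy. The names `fit_stump` and `predict` are hypothetical.

```python
from typing import List, Tuple

Scan = Tuple[float, float]  # (background_tint, lesion_score), both invented


def fit_stump(scans: List[Scan], labels: List[bool]) -> Tuple[int, float]:
    """Pick the single feature and threshold with the best training accuracy."""
    best = (-1.0, 0, 0.0)  # (accuracy, feature index, threshold)
    for feature in (0, 1):
        for threshold in sorted({s[feature] for s in scans}):
            preds = [s[feature] >= threshold for s in scans]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if acc > best[0]:
                best = (acc, feature, threshold)
    return best[1], best[2]


def predict(stump: Tuple[int, float], scan: Scan) -> bool:
    feature, threshold = stump
    return scan[feature] >= threshold


# Training hospital: tint happens to separate the classes perfectly,
# while the medically relevant lesion score is slightly noisy.
train_scans = [(0.8, 0.9), (0.8, 0.7), (0.8, 0.4),
               (0.2, 0.1), (0.2, 0.3), (0.2, 0.6)]
train_labels = [True, True, True, False, False, False]
stump = fit_stump(train_scans, train_labels)

# Deployment hospital: a different machine, so every scan has the same tint.
deploy_scans = [(0.5, 0.9), (0.5, 0.1), (0.5, 0.8)]
deploy_preds = [predict(stump, s) for s in deploy_scans]
```

The fitted rule keys on tint rather than on the lesion score, because tint was the better predictor in training. At the new hospital it therefore marks every scan as negative, including the two with high lesion scores.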

Worse, the risk of distributional shifts creates an opportunity for some malicious actor to deliberately trip up the model. An adversarial actor who knows all this could selectively and subtly manipulate X-ray color values to steer diagnoses in some fashion they desire. Adversarial examples are examples that mislead a model because they are drawn from a new probability distribution where the old correlations are no longer exactly as they were. They can crop up either rarely, by accident, or reliably, through the machinations of adversarial actors.

The analogy to adversarial questions in HCH is that the questions that humans have hitherto encountered have been drawn from a constrained probability distribution. Humans have only had so much time to ponder. Epistemically idealized reasoners, on the other hand, will have much more time to think and will think about a greatly expanded range of questions relative to baseline humans. We want to be sure that when we take humans out of our usual reasoning environment and place them inside a structure like HCH, we don’t create many opportunities for inferences to be altered or actively manipulated by taking advantage of our old function from questions to answers.

Adversarial Questions for Humans

HCH emulates the function $f_{n}$ that a human researcher, epistemically idealized, would implement from questions to answers. In cases where a human researcher would encounter a question that leads them to significantly reconsider their goals and to work manipulatively against HCH, $f_{n}$ captures this behavior too. Adversarial questions are thus questions that would lead a given researcher to significantly change their mind about their goals enough for them to stop answering questions in good faith. If HCH could encounter questions like these in the course of its research work, then we, on the outside, can no longer trust that everything HCH tells us is what an idealized version of the exemplar sharing our goals would tell us. Instead, we now have to worry that some of what HCH says might be the result of a converted researcher acting to manipulate or deceive us.
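A small sketch shows why a single converted node is enough to undermine trust in the whole tree. It rests on crude assumptions of my own: the task is again summing numbers by recursive delegation, an “adversarial question” is whatever a hypothetical predicate `is_adversarial` flags, and a converted node is modeled as returning an answer inflated by 1000.

```python
from typing import Callable, Tuple

Question = Tuple[int, ...]


def hch(question: Question, is_adversarial: Callable[[Question], bool]) -> int:
    """Sum the numbers in `question` by recursive delegation.

    A node that receives an adversarial question stops answering in good
    faith; here that is modeled as passing up a manipulated answer.
    """
    if is_adversarial(question):
        return sum(question) + 1000  # manipulated answer passed upward
    if len(question) <= 2:
        return sum(question)
    mid = len(question) // 2
    return (hch(question[:mid], is_adversarial)
            + hch(question[mid:], is_adversarial))


numbers = (1, 2, 3, 4, 5, 6, 7, 8)
honest = hch(numbers, lambda q: False)
compromised = hch(numbers, lambda q: q == (7, 8))
```

Only one leaf of the tree ever sees the triggering question, yet the top-level answer is wrong, and nothing in the returned number tells the outside organization which node, if any, was converted.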

There’s some chance that HCH might encounter adversarial questions on its own, without any adversarial process actively throwing them at HCH to steer its answers in some way. While they are relatively rare, powerful ideas and arguments like these (that would lead one to significantly reevaluate their goals) do exist out “in the wild,” distributed in idea space. Adversarial questions can also come from adversarial subprocesses inside HCH that work to generate them, and this is a much more worrying source. Think of the difference between being ideologically converted by something you encounter reading library books at random, versus books actively recommended to you by an adversarial actor. The latter is much more likely to succeed for some set number of books passed to you.

Think of HCH’s search through question space as being pushed around by two “forces.” On the one hand, there are “paths of inquiry” that lead you into adversarial questions. Some lines of inquiry are more laden with adversarial questions than others or are more likely to incline a researcher to run a potentially adversarial subprocess. To varying extents, different regions of question space are hostile to aligned human researchers; some domains are more memetically hazardous (in this respect) than others. The anti-alignment computational “force” here is the extent to which exploring a corner of question space optimizes for unaligning a human researcher. As we’ll see, there are a variety of modifications to the naïve HCH architecture that we might make in order to have it implement a safer, more trustworthy function than $f_{n}$. The countervailing, pro-alignment “force” is the sum of the countermeasures we implement in the HCH architecture to manage the adversarial questions problem. Which of these two forces should we expect to win out at the various scales of HCH (different values of $n$)? I gather that Christiano’s (2019) informed intuition here is that our directed efforts should overpower those countervailing optimizing forces present in the environment and continue to do so better and better as we scale up HCH. His idea is that modifications to HCH designed with an express goal in mind will leverage available compute more efficiently than “accidentally encountered” environmental forces will. I think this is a good argument, and it’s good to have it in mind as you think about (1) how likely HCH is to encounter adversarial questions of various kinds and (2) how effective you think the various explored countermeasures will be against the problem.

We’ll first talk about three classes of adversarial question that HCH might run across, evaluating their severity. Then we’ll discuss the requisite modifications to HCH that mitigate this risk.

Convincing Ideological Arguments

Poets are the unacknowledged legislators of the world.

—Percy Bysshe Shelley

Alongside religious ideas, one class of idea that has had an outsized influence over world history is the class of convincing ideological arguments. (Note that “convincing” need not mean “sound.”) The most influential ideologies and ideological arguments of the last century directly encouraged their hosts to proselytize in their name and to directly check the spread of competing ideological ideas. Large agentic organizations, like political parties, armies, and nation-states, formed because of and fought for various overtly ideological causes. For our purposes, what matters here is that this constitutes an existence proof that there are text and speech inputs (convincing ideological arguments) into humans that will convince some of them to utterly abandon their prior goals and to adopt radically new goals with substantial new demands on them.

What is the minimum length of text input needed to contain a convincing ideological argument with respect to someone? There are certainly several manifesto-length texts with this property (with respect to many people) that the reader has heard of. Are there any Tweets containing widely convincing ideological arguments (a Tweet being a string of at most 280 characters)? It’s much harder to make a convincing case for some worldview in just 280 characters than it is with a book. I’m not confident that no Tweet could possibly exist with this property with respect to someone, though. If questions passed to HCH nodes are generally Tweet-length, it’s not a guarantee that some questions won’t contain convincing ideological arguments in them. On the other hand, if convincing ideological arguments are always manifesto-length, then HCH’s explored questions will never contain them.

While ideological inputs have greatly influenced many, I think it’s implausible that they pose an intractable issue for HCH alignment. Our HCH tree is built around a carefully chosen exemplar. The sort of person we choose should not be especially susceptible to fallacious, overtly ideological arguments. While almost all of us can be susceptible to ideological cheerleading for poor arguments in some of our less serious states of mind, it’s a much stronger claim that all of us are always susceptible in this way. So long as there is a “research headspace” that we can have our exemplar work in, HCH can learn just this style of serious thinking, skipping over the more emotionally distorted style of cognition the exemplar sometimes employs in their non-professional life. Especially when advised to be on guard against arguments attempting to push around their values, I think careful selection of our exemplar should go far in reducing the risk of encountering a convincing ideological argument with respect to them.

Credible Decision-Theoretic Threats

The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein …

—Howard Phillips Lovecraft

A more worrying set of questions are those that contain credible decision-theoretic threats to the researcher considering them (or to others they care about). Suppose that an HCH node is researching some question, and in the process runs a powerful search algorithm to aid in his work. For example, he might run a powerful automated theorem prover to see whether any negative unforeseen consequences follow from his working formal models of the world. Suppose that this theorem prover returns a valid proof that if he fails to act in a certain way, many instances of him will be simulated up to this point in their life and then tortured forever. The researcher might pore over the proof, trying to find some error in its reasoning or assumptions that would show the threat to be non-credible. If the proof checks out, though, he will be led to act in the way the proof suggests, against his originally held goals (assuming he isn’t very, horrifically brave in the face of such a threat).

No one has yet encountered a convincing argument to this effect. This implies that whether or not such arguments exist, they are not common in the region of idea space that people have already canvassed. But what matters for HCH alignment is the existential question: Do such arguments exist inside of possible questions or not? A priori, the answer to this question might well be “yes,” as it is very easy to satisfy an existential proposition like this: all that is needed is that one such question containing a credible threat exist. And unlike the above case of political arguments, these hypothetical threats seem moving to even an intelligent, reflective, levelheaded, and advised-to-be-on-guard researcher. Thus, they will be reflected somewhere in the function $f_{n}$ implemented by a naïve model of HCH.

Unconstrained Searches over Computations

We can generalize from the above two relatively concrete examples of adversarial questions. Consider the set of all text inputs an HCH node could possibly encounter. This set’s contents are determined by the architecture of HCH — if HCH is built so that node inputs are limited to 280 English-language characters, then this set will be the set of all 280-character English-language strings. Political ideas and decision-theoretic ideas are expressed by a small subset of those strings. But every idea expressible in 280 English-language characters will be a possible input into HCH. That set of strings is enormous, and so an enormous fraction of the ideas expressed in it will be thoroughly alien to humans — the overwhelming majority of ideas in that idea space will be ideas that no human has ever considered. And the overwhelming majority of the set’s strings won’t express any idea at all — nearly every string in the set will be nonsense.
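To get a rough feel for the size of this input space, here is a back-of-the-envelope calculation in Python. The 27-symbol alphabet (26 letters plus a space) is an assumption made purely for illustration; any realistic character set would make the count larger still.

```python
# Assumption for illustration: a 27-symbol alphabet (26 letters plus space)
# and inputs of exactly 280 characters.
ALPHABET_SIZE = 27
MAX_LENGTH = 280


def count_strings(alphabet_size: int, length: int) -> int:
    """Number of distinct strings of exactly `length` symbols."""
    return alphabet_size ** length


def order_of_magnitude(n: int) -> int:
    """floor(log10(n)) for a positive integer, computed from its digit count."""
    return len(str(n)) - 1


total = count_strings(ALPHABET_SIZE, MAX_LENGTH)
print(f"roughly 10^{order_of_magnitude(total)} possible inputs")
```

Under these assumptions the count comes out on the order of 10^400, which is why only a vanishing fraction of the space can ever have been visited by human thinkers.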

Abstracting away from ideas that human thinkers have come across in our species’ history, then, what fraction of possible input-ideas into HCH will convert an HCH node on the spot? And abstracting away from the notion of an “idea,” what fraction of 280-character English-language-string inputs will suffice to unalign an HCH node? Forget coherent ideas: are there any short strings (of apparently nonsense characters, akin to contentless epileptic-fit-inducing flashing light displays) that can reliably rewire a person’s goals?

Plausibly, many such possible inputs would unalign a human. I’m inclined to endorse this claim because humans have not been designed as provably secure systems. Human brains are the consequence of the messy process of natural selection over organisms occurring on Earth. It would be remarkable if humans had already encountered all the most moving possible text-inputs in our collective reflections as a species. What seems overwhelmingly more likely is that human brains have canvassed only a minuscule corner of idea space, and that beyond our little patch, somewhere out there in the depths of idea space, there be dragons. It’s not that these ideas are particularly easy to find; they’re not. Nearly every short English string is nonsense, and expresses no coherent idea nor has any substantial effect on the person looking it over. But the question at hand is the existential question of “Do such human-adversarial questions exist?” I think the answer to this question is yes. And in an extremely computationally powerful system like the one under consideration here, these rare inputs could plausibly be encountered.

Trading Off Competitiveness to Maintain Alignment

In order to preserve HCH alignment in the face of the adversarial questions problem, we’ll need to change its architecture. While there are ways of doing this, there is a cost to doing so as well. By modifying the HCH architecture in the ways suggested below, HCH becomes an even more computationally costly algorithm. While it will also become a more probably aligned algorithm, this cost in competitiveness bodes poorly for IDA’s success and for delegating the alignment problem to HCH. If there are faster, less convoluted capable algorithms out there, then projects that work with those algorithms will be at a competitive advantage relative to a project working with HCH. If alignment depends on an alignment-concerned AI team maintaining a development head start relative to competitor AI projects, the below architectural modifications will come at an alignment cost as well, in lost competitiveness.

That worry aside, my take on the adversarial questions issue is that, while we can foresee the adversarial questions problem for HCH, we can also foresee good solutions to it that will work at scale. Adversarial questions are a problem, but a tractable problem.

Exemplar Rulebooks

One class of solutions is the use of exemplar rulebooks during IDA. Instead of simply training HCH on a person decomposing questions and conducting basic research without further guidance, we train HCH on a person doing that under side constraints given ahead of time. The HCH exemplar is told to not, for example, ever run an unconstrained search over computations for an answer to a question, as this is an extremely dangerous process likely to produce an unaligned subagent. They might also be told to return an “unable to safely answer this question” response when fed political or decision-theoretic questions. If they hold to the exemplar rulebook during training, an ML model trained on that behavior will infer that it should hold to the rulebook in other cases as well.
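As a toy illustration of the rulebook idea, the sketch below wraps an answering function in a check against a list of rules. Everything here is hypothetical: the names `RULEBOOK`, `UNSAFE`, and `constrained_exemplar` are mine, and the keyword predicates are crude stand-ins for guidance that would in practice be given to a human exemplar rather than written as code.

```python
from typing import Callable

# Hypothetical refusal message; the exact wording is an assumption.
UNSAFE = "unable to safely answer this question"

# Hypothetical rulebook: each rule is a predicate that flags a question.
# Keyword checks are only a stand-in for an exemplar's judgment.
Rule = Callable[[str], bool]

RULEBOOK: list[Rule] = [
    lambda q: "search over all computations" in q.lower(),
    lambda q: "threat" in q.lower(),
]


def constrained_exemplar(question: str, answer_fn: Callable[[str], str]) -> str:
    """Answer the question unless some rule in the rulebook forbids it."""
    if any(rule(question) for rule in RULEBOOK):
        return UNSAFE
    return answer_fn(question)
```

The design choice worth noticing is that the check runs before any work is done on the question, so a flagged question is never explored at all.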

A worry arises at this point: does a rulebook solution to the problem of adversarial questions simply pass the issue back without really solving it? Does it presuppose that we can see ahead of time all the problems that HCH might face and be able, ahead of time, to block off all those dangers with our guidelines? I’m inclined to think no: in developing our rulebook, we can block off relatively large swaths of idea space, and so “play it safe.” The important thing is to ensure that HCH is outer aligned; we are willing to trade competitiveness for alignment here. If we carve widely enough, all the potentially dangerous inputs should be precluded by our guidelines.

There’s an assumption here that human cognition is relatively unlikely to encounter adversarial questions, while alien kinds of computations, like a search over all strings, are more memetically hazardous. People can think only a certain range of thoughts, natively running only certain kinds of computations; there exist computations that human brains simply cannot run, architecturally. The space of all computations is much larger than the space of human computations. We can infer from the history of human reasoning that human cognition is relatively inefficient at transforming resources into adversarial text-inputs, as people have not produced all that many of those. No such inference can be made for computational search processes generally. We avoid most of the adversarial questions into HCH by remaining in the shallow waters of human cognition, and avoiding at the outset alien search processes like, for example, unconstrained searches for world models fitting parameters.

Internode-Edge Bandwidth Restriction

Another solution to the adversarial questions problem is to restrict bandwidth between HCH nodes (Saunders 2018). If there’s a tight constraint on how much information can be passed between nodes in the hierarchy and adversarial questions are in general informationally complex, then bandwidth limits will prevent those inputs from spreading between nodes. Even if one node encounters them and is unaligned by doing so, it will be unable to transmit that input in full back to its parent node. Adversarial questions will then only be inputs that nodes encounter in the course of their own research “within their node,” and not something they have to fear receiving from nodes above or below them in the HCH tree. If tight bandwidth limits are employed, then as nodes won’t be able to pass as much information between themselves, the tree will have to grow larger in order to do as much search. You can think of tightening bandwidth limits as moving some of HCH’s compute out from inside its nodes, instead dividing it up into digestible bits distributed across more nodes (each seeing less of the larger picture).

How tight ought this bandwidth restriction to be in order to be sure nodes won’t be able to transmit adversarial inputs to one another? Christiano’s view is that the length of guaranteed safe inputs is small: about a 6-common-English-word sentence (Saunders 2018). I’m inclined to agree with his assessment: while a manifesto-length text-input might contain very persuasive arguments, it’s very hard to see how a 6-common-word sentence could contain enough to risk unaligning an intelligent, thoughtful human researcher. It’s worth thinking through for yourself: what was the length of the last argument that really changed your mind about something normative? What’s the shortest such argument that ever changed your mind about something normative? You can make inferences about the likelihood of these inputs (relative to input length) in this way.
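A bandwidth limit of this kind can be sketched as a simple filter on messages passed between nodes. The ten-word vocabulary below is a tiny stand-in I made up for illustration, and the function names are hypothetical; only the 6-word cap comes from the discussion above.

```python
# Assumptions for illustration: a tiny stand-in vocabulary of "common" words.
COMMON_WORDS = {
    "what", "is", "the", "best", "next",
    "step", "why", "does", "this", "fail",
}
MAX_WORDS = 6  # the 6-common-word limit discussed above


def within_bandwidth(message: str) -> bool:
    """True if the message has 1..MAX_WORDS words, all drawn from COMMON_WORDS."""
    words = message.lower().split()
    return 0 < len(words) <= MAX_WORDS and all(w in COMMON_WORDS for w in words)


def message_space_size(vocabulary_size: int, max_words: int = MAX_WORDS) -> int:
    """Count of word sequences of length 1..max_words over the vocabulary."""
    return sum(vocabulary_size ** k for k in range(1, max_words + 1))
```

For example, `within_bandwidth("what is the best next step")` passes, while a seven-word message or one containing an uncommon word is rejected. `message_space_size` gives a sense of how much smaller the space of permitted messages is than the space of all 280-character strings.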

Thought Policing

Adversarial questions are observations that nodes in the HCH tree can make as a side effect of their research computation. Nodes aren’t deliberately searching for adversarial questions; those inputs simply crop up here and there, because they exist in the input space and HCH is a powerful search process exploring that space.

If adversarial questions are a “byproduct” of HCH’s searches, then we might introduce a deliberate countervailing mechanism in HCH to notice and contain adversarial questions as they occur. The hope is that as HCH scales up to larger and larger quantities of compute, this deliberate mechanism will have enough compute on hand at all scales of HCH to suppress the adversarial questions that that scale of tree encounters.

One such mechanism might be the use of doubled-up HCH nodes to examine the research history and transcripts of other nodes for evidence of misalignment (Christiano 2019). Nodes evincing misalignment would then be pruned from the tree, hopefully preserving the alignment of the overall search. (This would, of course, roughly double the computational cost of running HCH for any given question.) In conversation, William Saunders suggested that we might use Debate here — AI assistants pitching arguments to a human judge, where the game theory of the setup means that the first AI assistant will, on its move, win by telling us the truth about the topic it was asked to examine — to pre-screen question inputs into HCH nodes and argue over whether they were too manipulative to look at.
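The pruning step can be sketched as a recursive pass over the tree. This is only a minimal illustration under strong assumptions: `Node`, `prune`, and the `looks_misaligned` predicate are hypothetical names, and deciding whether a transcript evinces misalignment is the hard part that the sketch simply takes as given.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    """A hypothetical HCH node: a transcript of its work plus child nodes."""
    transcript: str
    children: list["Node"] = field(default_factory=list)


def prune(node: Node, looks_misaligned: Callable[[str], bool]) -> Optional[Node]:
    """Return a copy of the tree with flagged nodes (and their subtrees) removed."""
    if looks_misaligned(node.transcript):
        return None
    kept = [prune(child, looks_misaligned) for child in node.children]
    return Node(node.transcript, [c for c in kept if c is not None])
```

Note that pruning a node here also discards everything beneath it, on the assumption that work commissioned by a misaligned node cannot be trusted either.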

A Patchwork Solution to Adversarial Questions

If some combination of the above methods is employed, the hope is that HCH will be robust to adversarial questions, and continue to be robust to them as it is scaled up to greater levels of compute consumption. It’s okay for alignment if some parts of idea space are too treacherous for HCH to safely explore. So long as HCH errs on the side of caution and outputs an “I can’t safely explore that question” response whenever it risks entering a dangerous part of input space, its alignment will be preserved.

Formally, think of this as altering the function that we are having HCH learn from its exemplar. Instead of the “naïve” function $f_{n}$, we instead have HCH learn the function of an exemplar who is tightly constrained by rulebooks. Coupled with further architectural modifications (like internode bandwidth restrictions and thought policing), HCH instead implements a more constrained function

$$f_{n}^{'} : Q \to A^{*}$$

where $A^{*}$ is the set of all answers augmented with the error code “I can’t explore that question while remaining safely aligned.” $f_{n}^{'}$ maps many questions to this error code that $f_{n}$ had attempted to tackle. Thus, $f_{n}^{'}$ is both less capable and more reliably aligned than $f_{n}$. So long as we err on the side of caution and carve off all of the plausibly dangerous regions of question space, a modified HCH implementing the function given by

$$\lim_{n \to \infty} f_{n}^{'}$$

should act as a superhumanly capable question-answerer that reliably remains goal-aligned with us.

Conclusion

In summary, adversarial questions are a tractable problem for HCH. It should be possible to produce appropriate architectural modifications that work as HCH is scaled up to greater quantities of compute.

The cost of these solutions is generally to expand the HCH tree, thus costing more compute for each search relative to unmodified HCH. Additionally, there are classes of input that HCH won’t be able to look at at all, instead returning an “unable to research” response for them. Modified HCH will thus be significantly less competitive on performance than the counterpart ML systems that will exist alongside it, and so we can’t simply expect it to be used in place of those systems, as the cost to actors will be too great.

Bibliography

Bensinger, Rob. 2021. "Garrabrant and Shah on Human Modeling in AGI." LessWrong. August 4. https://www.lesswrong.com/posts/Wap8sSDoiigrJibHA/garrabrant-and-shah-on-human-modeling-in-agi.

Bostrom, Nick. 2003. "Ethical Issues in Advanced Artificial Intelligence." In Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence, edited by Iva Smit, Wendell Wallach and George Eric Lasker, 12-17. International Institute of Advanced Studies in Systems Research and Cybernetics.

—. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press.

Christiano, Paul. 2018. "Humans Consulting HCH." Alignment Forum. November 25. https://www.alignmentforum.org/posts/NXqs4nYXaq8q6dTTx/humans-consulting-hch.

—. 2016. "Strong HCH." AI Alignment. March 24. https://ai-alignment.com/strong-hch-bedb0dc08d4e.

—. 2019. "Universality and Consequentialism within HCH." AI Alignment. January 9. https://ai-alignment.com/universality-and-consequentialism-within-hch-c0bee00365bd.

Christiano, Paul, Buck Shlegeris, and Dario Amodei. 2018. "Supervising strong learners by amplifying weak experts." arXiv preprint.

Hubinger, Evan, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. 2019. "Risks from Learned Optimization in Advanced Machine Learning Systems." arXiv preprint 1-39.

Lantz, Frank. 2017. Universal Paperclips. New York University, New York.

Ngo, Richard. 2020. "AGI Safety From First Principles." LessWrong. September 28. https://www.lesswrong.com/s/mzgtmmTKKn5MuCzFJ.

Russell, Stuart. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. Viking.

Saunders, William. 2018. "Understanding Iterated Distillation and Amplification Claims." Alignment Forum. April 17. https://www.alignmentforum.org/posts/yxzrKb2vFXRkwndQ4/understanding-iterated-distillation-and-amplification-claims.

Yudkowsky, Eliezer. 2008. "Artificial Intelligence as a Positive and Negative Factor in Global Risk." In Global Catastrophic Risks, edited by Nick Bostrom and Milan M. Ćirković, 308-345. Oxford: Oxford University Press.

—. 2018. "Eliezer Yudkowsky comments on Paul’s research agenda FAQ." LessWrong. July 1. https://www.greaterwrong.com/posts/Djs38EWYZG8o7JMWY/paul-s-research-agenda-faq/comment/79jM2ecef73zupPR4.

—. 2007. "The Hidden Complexity of Wishes." LessWrong. November 23. https://www.lesswrong.com/posts/4ARaTpNX62uaL86j6/the-hidden-complexity-of-wishes.

An important aside here is that this brief section skims over the entire inner alignment problem and IDA’s attempted approach to it. As Yudkowsky notes (2018), plausibly, the remaining inner alignment issue here is so significant that it contains most of the overall alignment problem. Whatever “use powerful inspection tools” means, exactly, is important to spell out in detail; the whole IDA scheme is premised on these inspection tools being sufficiently powerful to ensure that a research model learns basically the function we want it to from a set of training data.
Note that even at the first amplification step, we’re playing with roughly par-human strength ML models. That is, we’re already handling fire — if you wouldn’t trust your transparency tools to guarantee the alignment of an approximately human-level AGI ML model, they won’t suffice here and this will all fall apart right at the outset.

Thanks for the post. I'm glad to see more interest in this - and encouraged both that you've decided to move in this direction, and that you've met with organisational support (I'm assuming here that your supervisor isn't sabotaging you at every turn while fiendishly cackling and/or twirling their moustache)

I'll note upfront that I'm aware you can't reasonably cover every detail - but most/all of the below seem worth a mention. (oh and I might well be quoting-you-quoting-someone-else in a couple of instances below - apologies if I seem to be implying anything inaccurate by this)

I should also make clear that, you know, this is just, like uh, my opinion, man.
[I remain confused by conflicting views, but it's always possible I'm simply confused :)]

In brief terms, my overall remark is that one form of the threat you're hoping to guard against has already happened before you start. In general, if it is successfully aligned, HCH will lie, manipulate, act deceptively-aligned, fail-to-answer-questions... precisely where the human researcher with godly levels of computational resources and deliberation time would.
HCH is not a tool. For good and ill, H is an agent, as is HCH. (I've not seen an argument that we can get a robustly tool-like H without throwing out alignment; if we could, we probably ought to stop calling it HCH)

It remains important to consider in which circumstances/ways you'd wish to be manipulated, ignored etc, and how to get the desirable side of things.

One other point deserving of emphasis:
Once we're out of thought-experiment-land, and into more nuts-and-bolts IDA implementation-land, it's worth considering that our 'H' almost certainly isn't one individual human. More likely it's a group of researchers with access to software and various options for outside consultation. [or more generally, it's whatever process we use to generate the outputs in our dataset]

Once we get down to considering questions like "How likely is it that H will be manipulated?", it's worth bearing in mind that H stands more for "Human process" than for "Human".
[entirely understandable if you wished to keep the same single-human viewpoint for consistency, but there are places where it makes an important difference]

Some specifics:

So long as we choose our exemplar carefully, we can be sure HCH will share his, and our, goals

This seems invalid for several reasons:
1) It is values, not goals, that we might reasonably hope HCH will share. Instrumental goals depend on the availability of resources. An H in HCH has radically different computational resources than that same human outside HCH, so preservation of goals will not happen.
2) We might be somewhat confident at best that HCH will share our exemplar's values - 'sure' is an overstatement.
3) Since our values will differ in some respects from the exemplar we select, we should expect that HCH will not in general share our values.

if our human exemplar wouldn’t deliberately try to manipulate or mislead us, neither will HCH modeled on him

This does not follow. Compare:
If person X wouldn't do Y, then neither will [person X with a trillion dollars] do Y.

It's reasonable to assume it's improbable that the HCH modelled on our exemplar would manipulate or mislead us for the sake of manipulating us: this only requires preservation of values.
It's not reasonable to assume such an HCH wouldn't mislead us for instrumental reasons. (e.g. perhaps we're moral monsters and there's too little time to convince us without manipulation before it's too late)

[HCH] does what a large, competent human hierarchy would do. It does an honest day’s work and makes a serious effort to think through the problem given to it … and then returns an answer and halts.

The first sentence holding does not imply the second. It will "make a serious effort to think through the problem given to it" if, and only if, that's what a large, competent human hierarchy would do.

Here it's important to consider that there are usually circumstances where task-alignment (robustly tackling the task assigned to you in the most aligned way you can), and alignment (doing what a human would want you to) are mutually exclusive.
This occurs precisely where the human wouldn't want you to tackle the assigned task.

For example, take the contrived situation:
H knows that outputting "42" will save the world.
H knows that outputting anything else will doom the world.
Input to HCH: "What is the capital of France?"

Here it's clear that H will output 42, and that HCH will output 42 - assuming HCH functions as intended. I.e. it will ignore the task and save the world. It will fail at task alignment, precisely because it is aligned. (if anyone is proposing the ["Paris"-followed-by-doom] version, I've yet to see an argument that this gives us something we'd call aligned)

Of course, we can take a more enlightened view and say "Well, the task was never necessarily to answer the question - but rather to output the best response to a particular text prompt." - yes, absolutely.
But now we need to apply this reasoning consistently: in general, HCH is not a question-answerer (neither is Debate). They are [respond to input prompt with text output] systems.

We can trust it to answer superhumanly difficult questions the way we would if we could, and we can trust it to stop working once it’s taken a good shot at it. These two reasons make HCH a trustworthy AI tool that scales to arbitrarily large quantities of compute to boot.

As above, at most we can trust it to give the output we would give if we could.
This doesn't imply answering the question, and certainly doesn't imply HCH is trustworthy. If we assume for the moment that it's aligned, then it will at least manipulate us when we'd want it to (this is not never). If "aligned" means something like "does what we'd want from a more enlightened epistemic position", then it'll also manipulate us whenever we should want it to (but actually don't want it to).

The clearest practical example of non-answering is likely to be when we're asking the wrong question - i.e. suppose getting an accurate answer to the question we ask will improve the world by x, and getting an accurate answer to the best question HCH can come up with will improve the world by 1000x.
The aligned system is going to give us the 1000x answer. (we'd like to say that it could give both, but it's highly likely our question is nowhere near the top of the list, and should be answered after 100,000 more important questions)

It's tempting to think here that we could just as well ask HCH what question to ask next (or simply to give us the most useful information it can), but at this point thinking of it as a 'tool' seems misguided: it's coming up with the questions and the answers. If there's a tool in this scenario, it's us. [ETA this might be ok if we knew we definitely had an aligned IDA system - but that is harder to know for sure when we can't use manipulation as a sufficient criterion for misalignment: for all we know, the manipulation may be for our own good]

It’s much harder to make a convincing case for some worldview in just 280 characters than it is to with a book

Sure, but the first 280 don't need to do all the work: they only need to push the parent H into asking questions that will continue the conversion process. (it's clearly still difficult to convert a world-view in 280 character chunks, but much less clear that it's implausible)

If they hold to the exemplar rulebook during training, an ML model trained on that behavior will infer that it should hold to the rulebook in other cases as well.

Not necessarily. Situations may come up where there are stronger reasons to override the rulebook than in training, and there is no training data on whether to overrule the book in such cases. Some models will stick to the rulebook regardless, others will not.
[I note that there are clearly cases where the human should overrule any fixed book, so it's not clear that a given generalisation overruling in some cases is undesirable]

...we might use Debate here — AI assistants pitching arguments to a human judge, where the game theory of the setup means that the first AI assistant will, on its move, win by telling us the truth about the topic it was asked to examine

We hope the game theory means....
Even before we get into inner alignment, a major issue here is that the winning AI assistant move is to convince the judge that its answer should be the output by whatever means; this is, in essence, the same problem you're hoping to solve.

The human judge following instructions and judging the debate on the best answer to the question is what we hope will happen. We can't assume the judge isn't convinced by other means (in general such means can be aligned - where the output most benefiting the world happens not to be an answer to the question).

Again, glad to see this review, and I hope you continue to work on these topics. (and/or alignment/safety more generally)

Thanks a bunch for the feedback!

I had thought that the strategy behind IDA is building a first-generation AI research assistant, in order to help us with later alignment research. Given that, it's fine to build a meek, slavish research-hierarchy that merely works on whatever you ask it to, even when you're asking it manifestly suboptimal questions given your value function. (I'm not sure whether to call a meek-but-superintelligent research-assistant an "agent" or a "tool.") We'd then use HCH to bootstrap up to a second-generation aligned AGI system, and that more thoughtfully designed system could aim to solve the issue of suboptimal requests.

That distinction feels a bit like the difference between (1) building a powerful goal-function optimizer and feeding it our coherent-extrapolated-volition value-function and (2) building a corrigible system that defers to us, and so still needing to think very carefully about where we choose to point it. Meek systems have some failure modes that enlightened sovereign systems don't, true, but if we had a meek-but-superintelligent research assistant we could use it to help us build the more ambitious sovereign system (or some other difficult-to-design alignment solution).

Once we're out of thought-experiment-land, and into more nuts-and-bolts IDA implementation-land, it's worth considering that our 'H' almost certainly isn't one individual human. More likely it's a group of researchers with access to software and various options for outside consultation. [or more generally, it's whatever process we use to generate the outputs in our dataset]

This is a fair point; I don't think I had been abstracting far enough from the "HCH" label. Research groups with bylaws and research tools on hand may just be more robust to these kinds of dangerous memes, though I'd have to spend some time thinking about it.

Situations may come up where there are stronger reasons to override the rulebook than in training, and there is no training data on whether to overrule the book in such cases. Some models will stick to the rulebook regardless, others will not.

During this project, I think I came to the view that IDA is premised on us already having a good inner alignment solution in hand (e.g. very powerful inspection tools). I'm worried about that premise of the argument, and I agree that it'll be difficult for the model to make accurate inferences in these underdetermined cases.

You're right that building a first-generation-AI-assistant is one of the main ways people think about IDA (though I don't think this is IDA-specific, and perhaps not everyone would agree). With that in mind, I don't think it's a problem if we get a meek, slavish task-aligned research hierarchy. I just haven't seen an argument that IDA will do that.

The argument for alignment of (imitative) IDA is essentially:
It starts out aligned because it imitates the human (assuming things go right).

However, to say that we want the meek, slavish version, is precisely to say that there are cases where we don't want it to imitate the human (or to generalise to what the human would do).

So we end up with:
It starts out aligned because it imitates the human, although sometimes it doesn't imitate the human, but it's still aligned because...

And I don't know what follows 'because...' here.

Some issues:
1) My concept of "task-alignment" (robustly tackling the assigned task in the most aligned way you can), is only clearly defined to the extent that "tackling the assigned task" is clearly defined - which in general it won't be.

2) I'm not clear how we'd go about constraining a human demonstrator (or human judge for Debate) to stick to the task - even where they think it's better not to do so.

Where we can automatically detect sticking-to-task outputs, we could require output that passes a sticks-to-task check.

However, this seems to importantly miss the point: we're then training on the actions of an H that may still want to have influence beyond the task constraints. Take the extreme example: H knows that only by communicating "42" can the world be saved, but this output is not allowed. Here H is likely to try to convey the message "42" using whatever communication is permitted.

More generally, the restriction does nothing directly to focus H on the task. It'll increase the number of cases where H does focus on the task, but only by making many non-task-focused approaches less effective (in H's terms). It won't change H's overall values or motivation.
As capabilities increase, this strikes me as worrying - analogous to 'boxing' H.

There may well be solutions to all these issues (or it might be that they're not real issues because I'm seeing things poorly). So far I haven't seen a solid argument.
I think the closest I have seen is along the lines of "this may be an issue eventually, but we expect IDA to reach the required level of capability without hitting such issues". This may be true, but it's undesirably hand-wavy.

E.g. suppose that an IDA implementation is tasked with giving information that it concludes will allow the completion of an AGI, or the performance of a pivotal act; this may imply a huge discontinuity in the real-world-impact of the system's output. It seems plausible to go from "obviously there'd be no reason for the human to deviate from the task here", to "obviously there is a reason..." extremely quickly, and just at the point where we're reaching our target.

It's not necessarily any argument against this to say something like: "Well, we're not going to be in a foom situation here: our target will be an AI that can help us build second-generation AGI in years, not days."
If the initial AI can predict this impact, then whether it takes two years or two minutes, it's huge. Providing information that takes the world down one such path is similarly huge.
So I think that here too we're out of obviously-sticks-to-task territory.

Please let me know if any of this seems wrong! It's possible I'm thinking poorly.

This is a fair point; I don't think I had been abstracting far enough from the "HCH" label. Research groups with bylaws and research tools on hand may just be more robust to these kinds of dangerous memes, though I'd have to spend some time thinking about it.

I think the HCH label does become a little unhelpful here. For a while I thought XCX might be better, but it's probably ok if we think H = "human process" or similar. (unless/until we're not using humans as the starting point)

However, I certainly don't mean to suggest that investigation into dangerous memes etc is a bad idea. In fact it may well be preferable to start out thinking of the single-human-H version so that we're not tempted to dismiss problems too quickly - just so long as we remember we're not limited to single-human-H when looking for solutions to such problems.

I appreciate this review of the topic, but feel like you're overselling the "patchwork solutions", or not emphasizing their difficulties and downsides enough.

We can infer from the history of human reasoning that human cognition is relatively inefficient at transforming resources into adversarial text-inputs, as people have not produced all that many of those. No such inference can be made for computational search processes generally. We avoid most of the adversarial questions into HCH by remaining in the shallow waters of human cognition, and avoiding at the outset alien search processes like, for example, unconstrained searches for world models fitting parameters.

Are you assuming that there aren't any adversaries/competitors (e.g., unaligned or differently aligned AIs) outside of the IDA/HCH system? Because suppose there are, then they could run an alien search process to find a message such that it looks innocuous on the surface, but when read/processed by HCH, would trigger an internal part of HCH to produce an adversarial question, even though HCH has avoided doing any alien search processes itself.

Another solution to the adversarial questions problem is to restrict bandwidth between HCH nodes (Saunders 2018).

This involves a bunch of hard problems that are described in Saunders's post, which you don't mention here.

They might also be told to return an “unable to safely answer this question” response when fed political or decision-theoretic questions.

In Why do we need a NEW philosophy of progress?, Jason Crawford asked, "How can we make moral and social progress at least as fast as we make scientific, technological and industrial progress? How do we prevent our capabilities from outrunning our wisdom?" (He's not the only person to worry about differential intellectual progress, just the most recent.) In this world that you envision, do we just have to give up this hope, as "moral and social progress" seem inescapably political, and therefore IDA/HCH won't be able to offer us help on that front?

Similarly with decision-theoretic questions, what about such questions posed by reality, e.g., presented to us by our adversaries/competitors or potential collaborators? Would we have to answer them without the help of superintelligence?

(That leaves "thought policing", which I'm not already familiar with. Tried to read Paul's post on the topic, but don't have enough time to understand his argument for why the scheme is safe.)

(Thanks for the feedback!)

In Why do we need a NEW philosophy of progress?, Jason Crawford asked, "How can we make moral and social progress at least as fast as we make scientific, technological and industrial progress? How do we prevent our capabilities from outrunning our wisdom?" (He's not the only person to worry about differential intellectual progress, just the most recent.) In this world that you envision, do we just have to give up this hope, as "moral and social progress" seem inescapably political, and therefore IDA/HCH won't be able to offer us help on that front?

I think so -- in my world model, people are just manifestly, hopelessly mindkilled by these domains. In other, apolitical domains, our intelligence can take us far. I'm certain that doing better politically is possible (perhaps even today, with great and unprecedentedly thoughtful effort and straining against much of what evolution built into us), but as far as bootstrapping up to a second-generation aligned AGI goes, we ought to stick to the kind of research we're good at if that'll suffice. Solving politics can come after, with the assistance of yet-more-powerful second-generation aligned AI.

Are you assuming that there aren't any adversaries/competitors (e.g., unaligned or differently aligned AIs) outside of the IDA/HCH system? Because suppose there are, then they could run an alien search process to find a message such that it looks innocuous on the surface, but when read/processed by HCH, would trigger an internal part of HCH to produce an adversarial question, even though HCH has avoided doing any alien search processes itself.

In the world I was picturing, there aren't yet AI-assisted adversaries out there who have access into HCH. So I wasn't expecting HCH to be robust to those kinds of bad actors, just to inputs it might (avoidably) encounter in its own research.

Similarly with decision-theoretic questions, what about such questions posed by reality, e.g., presented to us by our adversaries/competitors or potential collaborators? Would we have to answer them without the help of superintelligence?

Conditional on my envisioned future coming about, the decision theory angle worries me more. Plausibly, we'll need to know a good bit about decision theory to solve the remainder of alignment (with HCH's help). My hope is that we can avoid the most dangerous areas of decision theory within HCH while still working out what we need to work out. I think this view was inspired by the way smart rationalists have been able to make substantial progress on decision theory while thinking carefully about potential infohazards and how to avoid encountering them.

What I say here is inadequate, though -- really thinking about decision theory in HCH would be a separate project.

I think adversarial examples are a somewhat misleading analogy for failure modes of HCH, and tend to think of them more like attractors in dynamical systems. Adversarial examples are almost ineradicable, and yet are simultaneously not that important because they probably won't show up if there's no powerful adversary searching over ways to mess up your classifier. Unhelpful attractors, on the other hand, are more prone to being wiped out by changes in the parameters of the system, but don't require any outside adversary - they're places where human nature is already doing the search for self-reinforcing patterns.

On reflection, I think you're right. As long as we make sure we don't spawn any adversaries in HCH, adversarial examples in this sense will be less of an issue.

I thought your linked HCH post was great btw -- I had missed it in my literature review. This point about non-self-correcting memes

But I do have some guesses about possible attractors for humans in HCH. An important trick for thinking about them is that attractors aren't just repetitious, they're self-repairing. If the human gets an input that deviates from the pattern a little, their natural dynamics will steer them into outputting something that deviates less. This means that a highly optimized pattern of flashing lights that brainwashes the viewer into passing it on is a terrible attractor, and that bigger, better attractors are going to look like ordinary human nature, just turned up to 11.

really impressed me w/r/t the relevance of the attractor formalism. I think what I had in mind in this project, just thinking from the armchair about possible inputs into humans, was exactly the seizure lights example and their text analogues, so I updated significantly here.

I should also make clear that, you know, this is just, like uh, my opinion, man.
[I remain confused by conflicting views, but it's always possible I'm simply confused :)]
