(*Note: In the past I have referred to this process as ‘bootstrapping’ or ‘policy amplification,’ but those terms are too broad — there are other dimensions along which policies can be amplified, and ‘bootstrapping’ is used all over the place.*)

Defining the “intended behavior” of a powerful AI system is a challenge.

We don’t want such systems to simply imitate human behavior — we want them to improve upon human abilities. And we don’t want them to only take actions that look good to humans — we want them to improve upon human judgment.

We also don’t want them to pursue simple goals like “minimize the probability that the bridge falls down” or “pick the winning move.” A precise statement of our real goals would be incredibly complicated, and articulating them precisely is itself a massive project. Moreover, we often care about consequences over years or decades. Such long-term consequences would have little use as a *practical* problem definition in machine learning, even if they could serve as a *philosophical* problem definition.

So: what else can we do?

Instead of defining what it means for a policy to be “good,” we could define a transformation which turns one policy into a “better” policy.

I call such a transformation *capability amplification* — it “amplifies” a weak policy into a strong policy, typically by using more computational resources and applying the weak policy many times.

## Motivation

I am interested in capability amplification because I think it is the most plausible route to defining the goals of powerful AI systems, which I see as a key bottleneck for building aligned AI. The most plausible alternative approach is probably inverse RL, but I think that there are still hard philosophical problems to solve, and that in practice IRL would probably need to be combined with something like capability amplification.

More directly, I think that capability amplification might be a workable approach to training powerful RL systems when combined with semi-supervised RL, adversarial training, and informed oversight (or another approach to reward engineering).

## Example of capability amplification: answering questions

Suppose that we would like like to amplify one question-answering system **A**into a “better” question-answering system **A⁺**.

We will be given a question *Q* and an implementation of **A**; we can use **A**, or any other tools at our disposal, to try to answer the question *Q*. We have some time limit; in reality it might be eight hours, but for the purpose of a simple example suppose it is twenty seconds. The amplification **A⁺**(*Q*) is defined to be whatever answer we come up with by the end of the time limit. The goal is for this answer to be “better” than the answer that **A** would have given on its own, or to be able to answer harder questions than **A** could have answered directly.

For example, suppose that *Q* = “Which is more water-soluble, table salt or table sugar?” Suppose further that **A** can’t answer this question on its own: **A**(“Which is more water-soluble…”) = “I don’t know.”

I could start by computing **A**(“How do you quantify water-solubility?”); say this gives the answer “By measuring how much of the substance can dissolve in a fixed quantity of water.” Then I ask **A**(“How much table salt will dissolve in a liter of water?”) and get back the answer “360 grams.” Then I ask **A**(“How much sugar will dissolve in a liter of water?”) and get back the answer “2 kilograms.” Then I reply “Sugar is about six times more soluble than salt.”

Thus **A⁺**(“Which is more water-soluble, table salt or table sugar?”) = “Sugar is about six times more soluble than salt.” This is better than the answer that **A**gave — in some sense, we’ve successfully amplified **A** into something smarter.

## The general problem

The capability amplification problem is to use one policy **A** to implement a new policy **A⁺** which is strictly “better” than **A**. (Recall that a policy is a mapping from inputs to outputs.)

We’ll discuss the definition of “better” in the next section, but for now you can use an intuitive definition. Note that “better” *does not* mean that we can implement **A⁺** using fewer computational resources than **A** — in fact we will implement **A⁺** by using a huge amount of computation and time.

What does it mean to “implement” the amplified policy **A⁺**? It means that we have some process that takes as input an observation *o*[1] and produce an action *a*[1]. It then takes as input the next observation *o*[2] and produces the next action *a*[2], and so on.

The process that implements **A⁺** may instantiate any number of agents who use the policy **A** and interact with them. The process might make copies of any of these agents. And the process can involve us personally thinking about the problem, or using any other tools that we have available — having access to **A**may be a useful resource, but we can also do things from scratch if that’s easier.

The capability amplification problem comes with a time limit — we need to provide an implementation that runs within that time limit. (When we submit one action *a*[k], we immediately see the next observation *o*[k+1].) Once the time limit runs out, we automatically output a nil action in response to each additional observation. One way to be “better” is to be able to handle longer sequences of observations.

The time limit could be arbitrary, but I’ll pick one day for concreteness. I’ll assume that we have an implementation of **A** that runs in one second per episode.

Note that we can start from the trivial policy **∅** which always outputs nil. In this case, **∅⁺** is a policy that we can implement “from scratch.”

## Reachability

To measure how well we can solve capability amplification, we’ll introduce the concept of *reachability*.

Reachability is defined with respect to a class of policies **𝒜** and a preference ordering ⪰. Intuitively:

**𝒜**is a class of policies that we are capable of implementing efficiently. For example,**𝒜**might be the set of policies that can be implemented by a ten layer neural network.- We say that
**A**⪰**B**if we are*at least as happy*with policy**A**as with policy**B**(in any situation that we think might arise in practice).

We say that **C** is *reachable* from **A** if:

**A**⁺ ⪰**C**, where**A**⁺ is the amplification as described in the last section; or- There is an intermediate
**B**∈ 𝓐 which is reachable from**A**and which can reach**C**.

Equivalently:

**C**is reachable from**A**if there is a chain of policies in**𝒜**which starts at**A**and ends at**C**, and where each policy in the chain is no better than the amplification of the previous policy.

The better we are at capability amplification, the more policies will be reachable from any given starting point. Our goal is to have as many policies as possible be reachable from the trivial policy **∅** — ideally, *every* policy in 𝓐 would be reachable from **∅**.

## Obstructions

An *obstruction* to capability amplification is a partition of the policy class 𝓐 into two parts 𝓛 and 𝓗, such that we cannot amplify *any* policy in 𝓛 to be at least as good as *any* policy in 𝓗.

Obstructions are dual to reachability in a natural sense. If there are any non-reachable policies, then there is some corresponding obstruction. The desired output of research on capability amplification are a *matching* amplification strategy and obstruction — a way to reach many policies, and an obstruction that implies that we can’t reach any more.

Analogously, we say that a function *L* : 𝓐 → ℝ is an obstruction if our amplification procedure cannot always increase *L*. That is, *L* is an obstruction if there exists a threshold ℓ such that the two sets { **A** ∈ 𝓐 : *L*(**A**) ≤ ℓ } and { **A**∈ 𝓐 : *L*(**A**) > ℓ} are an obstruction, or such that { **A** ∈ 𝓐 : *L*(**A**) < ℓ } and { **A**∈ 𝓐 : *L*(**A**) ≥ ℓ} are an obstruction.

If we could find a convincing argument that some partition was an obstruction, then that would help further our understanding of value alignment. The next step would be to ask: can we sensibly define “good behavior” for policies in the inaccessible part 𝓗? I suspect this will help focus our attention on the most philosophically fraught aspects of value alignment.

In the appendices I give an example of an obstruction in a particular simple model.

# Relationship to value alignment

## Why capability amplification seems feasible

Capability amplification is a special case of the general problem of “building an AI that does the right thing.” It is easier in two respects:

- In the general problem we need to construct a “good” policy from scratch. In capability amplification we need to construct a good policy
**A**⁺ starting from a slightly weaker policy**A**. - In the general problem we must
*efficiently implement*a good policy. In capability amplification our implementation of**A⁺**is allowed to take up to a day, even though the goal is to improve upon a policy**A**that runs in one second.

Intuitively, these seem like large advantages.

Nevertheless, it may be that capability amplification contains the hardest aspects of value alignment. If true, I think this would change our conception of the value alignment problem and what the core difficulties are. For example, if capability amplification is the “hard part,” then the value alignment problem is essentially orthogonal to the algorithmic challenge of building an intelligence.

## Why capability amplification seems useful

Capability amplification can be combined with reward engineering in a natural way:

- Define
**A0**=**∅** - Apply capability amplification to obtain
**A0⁺** - Apply reward engineering to define a reward function, and use this to train an agent
**A1**which is better than**A0** - Apply capability amplification to obtain
**A1⁺** - Repeat to obtain a sequence of increasingly powerful agents

This is very informal, and actually carrying out such a process requires resolving many technical difficulties. But it suggests that capability amplification and reward engineering might provide a foundation for training an aligned AI.

# What to do?

## Theory

The best approach seems to be to work from both sides, simultaneously searching for challenging obstructions and searching for amplification procedures that address those obstructions.

There are at least two very different angles on capability amplification:

- Collaboration: figure out how a bunch of agents using
**A**can break a problem down into smaller pieces and attack those pieces separately, allowing them to solve harder problems than they could solve independently. - Philosophy: try to better understand what “good” reasoning is, so that we can better understand how good reasoning is composed of simpler steps. For example, mathematical proof is a technique which relates hard problems to long sequences of simple steps. There may be more general ideas along similar lines.

In the appendices, I describe some possible amplification schemes and obstructions, along with some early ideas about capability amplification in general.

## Experiment

Today, it is probably most worthwhile to study capability amplification when **A** is a *human’s* policy.

In this setting, we are given some weak human policy **A** — say, a human thinking for an hour. We would like to amplify this to a strong collaborative policy **A⁺**, by invoking a bunch of copies of **A** and having them interact with each other appropriately.

In some sense this is the fully general problem of organizing human collaborations. But we can focus our attention on the most plausible obstructions for capability amplification, and try to design collaboration frameworks that let us overcome those obstructions.

In this context, I think the most interesting obstruction is working with concepts that are (slightly) too complicated for any individual copy of **A** to understand on its own. This looks like a hard problem that is mostly unaddressed by usual approaches to collaboration.

This post lays out a closely related problem — quickly evaluating arguments by experts — which gets at most of the same difficulties but may be easier to study. Superficially, evaluating arguments may seem easier than solving problems from scratch. But because it is so much easier to collaboratively *create* arguments once you have a way to evaluate them, I think the gap is probably only superficial.

# Conclusion

The capability amplification problem may effectively isolate the central *philosophical* difficulties of value alignment. It’s not easy to guess how hard it is — we may already have “good enough” solutions, or it may effectively be a restatement of the original problem.

Capability amplification asks us to implement a powerful policy that “behaves well,” but it is easier than value alignment in two important respects: we are given access to a slightly weaker policy, and our implementation can be extremely inefficient. It may be that these advantages are not significant advantages, but if so that would require us to significantly change our understanding of what the value alignment problem is about.

Capability amplification appears to be less tractable than the other research problems I’ve outlined. I think it’s unlikely to be a good research direction for machine learning researchers interested in value alignment. But it may be a good topic for researchers with a philosophical focus who are especially interested in attacking problems that might otherwise be neglected.

*(This research was supported as part of the* *Future of Life Institute* *FLI-RFP-AI1 program, grant #2015–143898.)*

## Appendix: iterating amplification

Let **H** be the input-output behavior of a human + all of the non-**A** tools at their disposal. Then an amplification procedure defines **A**⁺ as a simple computation that uses **H** and **A** as subroutines.

In particular, **∅⁺** is a computation that uses **H** as a subroutine. If we amplify again, we obtain **∅⁺⁺**, which is a computation that uses **H** and **∅⁺** as subroutines. But since **∅⁺** is a simple computation that uses **H** as a subroutine, we can rewrite **∅⁺⁺** as a simple computation that uses only **H** as a subroutine.

We can go on in this way, reaching **∅⁺⁺⁺**, **∅⁺⁺⁺⁺** and so on. By induction, all of these policies are defined by simple computations that use as **H** as a subroutine. (Of course these “simple computations” are exponentially expensive, even though they are easy to specify. But they have a simple form and can be easily written down in terms of the amplification procedure.)

Under some simple ergodicity assumptions, this sequence converges to a fixed point **Ω** (very similar to HCH). So a capability amplification procedure essentially uniquely defines an “optimal” policy **Ω**; this policy is uncomputable, but has a concise representation in terms of **H**.

If there is anything that **Ω** can’t do, then we have found an unreachable policy. This perspective seems useful for identifying the hard part of the capability amplification problem.

Specifying an amplification strategy also specifies a way to set up an interaction between a bunch of copies of **H** such that they implement **Ω**. Indeed, designing such an interaction is *easier* than designing an amplification procedure that converges to **Ω**. So if we can’t design a procedure for a bunch of copies of **H** to collaboratively accomplish some task T, then we also can’t design an amplification procedure that can reach policies that can accomplish T.

But in some cases this seems extremely challenging. For example, suppose that we would like to write a very good textbook about general relativity. Recall that **H** is the policy implemented by a human over one day. To make the game hard, suppose that **H** has never encountered calculus.

The most natural approaches require at least one copy of **H** to understand general relativity. But **H** only lasts a single day, and there many not be *any* sequence of observations that makes **H** understand general relativity within a day.

So in order for **Ω** to succeed at this task, it needs to be able to write a textbook without *any* single copy of **H** understanding general relativity. The understanding must instead be distributed across a large number of copies of **H**. Yet they must somehow collaborate to write individual sentences that correctly communicate the theory.

This sounds like a hard problem.

## Appendix: examples of capability amplification

ETA: my current best proposal is meta-execution, which is essentially a formalization of the “universal” scheme below.

**Warmup I: Preprocessing**

We instantiate a “leader” agent who uses the policy **A**. For each observation *o*[*k*]:

- We instantiate a new agent using
**A**. We give them observation: “think about the observation {*o*[*k*]} and output a useful hint for a copy of yourself who has to respond to it.” Let*h*[*k*] be the agent’s action. Then we destroy the agent. - We give the leader the observation “The observation is {
*o*[*k*]}, and a copy of you prepared the hint {*h*[*k*]}.” Let*a*[*k*] be its action.

We might hope that this composite is smarter than **A**. After all, it has twice as long to think — even if the copy can’t come up with a very helpful hint, it doesn’t seem so hard to come up with a hint that is better than nothing.

**Warmup II: Asking questions**

We instantiate a “leader” agent who uses the policy **A**. We give each observation *o*[*k*] to the leader, and ask them to reply with an action *a*[*k*].

But we also allow the leader to instantiate subagents and communicate with those agents.

That is, if the leader takes an action of the special form “Instantiate a new agent” then we create a new agent using **A**, assign it a unique ID, and give the leader the observation: “A new agent has been created with ID [X].” Similarly, the leader can take the special action “Copy the agent with ID [X].”

If the leader takes an action of the special form “Ask [Q] to the agent with ID [X]” then we give Q as an observation to the agent with ID X. That agent will reply with some action A, and we give the leader the next observation: “The agent replied with [A].”

Eventually, the leader takes an action of the special form “Take action [A]”. Then we set *a*[*k*] = A, and move on to the next observation.

The scheme from the last section is essentially the special case where the leader instantiates a single subagent and asks the single question “What is a useful hint for responding to observation *o*[*k*]?”

**Warmup III: Recursion**

We improve the scheme from the last section by allowing *every* agent to instantiate new subagents, rather than restricting this capability to the leader.

The messages between agents can contain not only words, but pointers to other agents. This allows the agents to pass arbitrarily large messages. For example, they can build a data structure out of agents, and just pass a “pointer” to the root of that data structure.

We can also slightly improve the scheme by allowing the leader to copy *itself*. This may be helpful if the leader has built up some useful context for the current situation.

**Universal amplification**

We improve the scheme further by giving the agents a large external memory. Rather than directly giving the leader the observation *o*[*k*], we write the observation to the external memory and tell the leader where the observation is written.

In addition to allowing agents to instantiate new agents and communicate with them, any agent can take the action “execute [P]”, where P is a program to be run against the external memory; the agent sees the observation “the program returned [*x*]” where *x* is P’s return value (or “nothing”).

Rather than having the leader directly return an action, the leader can take the special action “output the action returned by program [P].”

Messages between agents can now contain pointers to this external memory. For example, the leader could instantiate a subagent and ask it the question “Can you distinguish [*x*] from an array of random bytes?” where *x* is a pointer to an array in external memory.

We can easily generalize this setup to a parallel model of computation. We can also replace the shared memory by a more natural model for interprocess communication.

## Appendix: knowledge about humans

Human values are complex. If you are only able to interact with a human for a day, it may be completely impossible to figure out what they value, no matter how smart you are. Understanding what someone values may require giving them a large amount of time to reflect on their values, doing neuroscience, or carrying out other processes that take longer than a day.

This may imply an obstruction to capability amplification — we can’t reach policies that have more knowledge about humans than can be acquired by interacting with **H.**

However, even if this is a real obstruction, it does not seem to be an important one, for the following reason.

Suppose that we are able to train a very good policy, which does not reflect any complex facts about human values-upon-reflection. This optimal policy still can reflect many basic facts about human preferences:

- We don’t want anything terrible to happen.
- We want to “stay in control” of the agents we build.
- We don’t want our agent to get left behind by its competitors; it should fight as hard as it can to retain influence over the world, subject to #1 and #2.

Moreover, all of these concepts are relatively easy to understand even if you have minimal understanding of human values.

So an excellent agent with a minimal understanding of human values seems OK. Such an agent could avoid getting left behind by its competitors, and remain under human control. Eventually, once it got enough information to understand human values (say, by interacting with humans), it could help us implement our values.

In the worst case the agent would lack a nuanced understanding of what we consider terrible, and so would have to either be especially conservative or else risk doing terrible things in the short term. In the scheme of things, this is not a catastrophic problem.

## Appendix: an example obstruction

Suppose that my brain encodes a random function *f*: {0, 1}* → {0, 1} in the following sense: you can give me a sequence of bits, one per second, and then I can tell you the value of *f* on that sequence. There is no way to evaluate *f*other than to ask me.

Let N be the length of our capability amplification procedure, in seconds.

Let 𝓛 ⊆ 𝓐 be the set of policies that can be implemented using an oracle for *f*, restricted to inputs of length N.

Then it’s easy to see that 𝓛 forms an obstruction:

- We can simulate access to any policy in 𝓛 using an oracle for
*f*restricted to inputs of length N. And we can simulate my role in the amplification procedure using an oracle for*f*restricted to inputs of length N. So policies in 𝓛 can only be amplified to other policies in 𝓛. - We cannot evaluate
*f*on even a single input of length N+1 using an oracle for*f*on inputs of length N. Most interesting classes 𝓐 will contain some policies not in 𝓛.

Whether this is a real obstruction depends on what the information is about:

- If it’s just random bits, then we don’t care at all — any other random bits would be “just as good.”
- If the random function encodes important information about my values, then we are in the situation described in the previous section, which doesn’t seem so bad.
- The worst case is when the function
*f*encodes important information about how to behave effectively. For example, it encodes information about how to make accurate predictions. In this case we may actually be in trouble, since a policy that doesn’t know*f*may be outcompeted by one which does.

*This was originally posted here on 2nd October 2016.*

*The next post in this sequence will be 'Learning with catastrophes' by Paul Christiano.*

*Tomorrow's post will be 'Following Human Norms' in the sequence Value Learning by Rohin Shah.*

I was very surprised to read this part of the conclusion:

Is there a good explanation somewhere of why this is true? Based on the rest of this sequence, I would have expected capability amplification to be relatively tractable, and an excellent research direction for value alignment.

Certainly if you just have access to a weaker policy, this doesn't make the problem any easier. If you could take a weak policy and amplify it into a stronger policy efficiently, then you could just repeatedly apply this policy-improvement operator to some very weak base policy (say, a neural net with random weights) to solve the full problem. (If you have a much stronger aligned base policy, eg. the human policy with short inputs and over a short time horizon; in that case this assumption is more powerful.) The more interesting assumption is that you have lots of time and compute, which does seem to have a lot of potential. I feel pretty optimistic that a human thinking for a long time could reach "superhuman performance" by our current standards; though of course capability amplification asks for a stronger guarantee: can we can do this in a particular structured way.

This sounds like a partial order to me. But then:

I interpret this as saying: If (A,B) form part of the chain, then B≯A+. But I believe that the property we want is A+≥B, which is a different condition if ≥ defines a partial ordering. Does that seem right to you?

I might rephrase it as "where the amplification of each policy in the chain is at least as good as the subsequent policy".

It took me a while to realize why you went with this definition. I thought you were going for a simple recursive definition, in which case you could define C to be reachable from A if A≥C, or if C is reachable from A+. Equivalently, there is a chain of amplifications of A such that the resulting policy dominates C. The problem with this definition is that there isn't a corresponding notion of obstructions for my definition, because it isn't transitive. It is possible to have B reachable from A, and C reachable from B, but not C reachable from A.

On the other hand, I believe your definition is the transitive closure of the relation R, where (A,C)∈R iff A+≥C, and so a notion of obstructions comes out naturally.

... to its maximal value in 𝓐. (Obvious, but worth saying.)

An easy way to deal with this difficulty is to replace '

at least as happywith policyAas with policyB(in any situation that we think might arise in practice)' with 'at least as happywith policyAas with policyB(when averaged over the distribution of situations that we expect to arise)', though this is clearly much weaker.To me it seems that the reason this stronger sense of ordering is used is because we expect this amplification procedure to be of a sort that produces results such that A+ is strictly better than A but that even if this wasn't the case, the concept of an obstruction would still be a useful one. Perhaps it would be reasonable to take the more relaxed definition but expect that amplification would produce results that are strictly better.

I also agree with Chris below that defining an obstruction in terms of this 'better than' relation brings in serious difficulty. There are exponentially many policies Bthat are no better than A+ and there may well be a subset of these can be amplified beyond A+ but as far as I can tell there's no clear way to identify these. We thus have an exponential obstacle to progress even within a partition, necessitating a stronger definition.

The concept of reachability lets you amplify a policy, select a policy worse than the amplification, then amplify this worse policy. The problem is that a policy that is worse may actually amplify better than a policy that is better. So I don't find this definition very useful unless we also have an algorithm for selecting the optimal worse policy to amplify. Unfortunately, that problem doesn't seem very tractable at all.

"the agent would lack a nuanced understanding of what we consider terrible" - isn't it the whole narrative for Eliezer's genie tales? While having #2 as a separate request is good, failure to follow #1 can still be catastrophic enough because computers think faster, so our formal "staying in control" may not matter enough.

This seems to be circular, since determining 𝓛 and 𝓗 depends on good behaviour being defined.. I guess what is meant is that we amplify policies in 𝓛 until we hit a ceiling (a fixed point?). Then 𝓗 = {A ∈ 𝓐 | ¬ A reachable from 𝓛}. We suspect that 𝓗 is non-empty, but we don't know how the policies in there look.

(Deleted.)

Mentioned before, but I wonder if progressive/iterative summarization would increase peer review between AI researchers a bunch. I.e. marginal efforts towards summarizing the either adversarial or generative ideas a post is a response to, which gives the original authors and outsiders more surfaces to offer ideas along. I think this would also lead to more settling on shared terminology, which is a significant fraction of progress in a field as far as I can tell from the history of science. If it increased engagement then it would be directly incentivized as soon as people knew/experienced that.