Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

AlphaGo Zero is an impressive demonstration of AI capabilities. It also happens to be a nice proof-of-concept of a promising alignment strategy.

How AlphaGo Zero works

AlphaGo Zero learns two functions (which take as input the current board):

  • A prior over moves p is trained to predict what AlphaGo will eventually decide to do.
  • A value function v is trained to predict which player will win (if AlphaGo plays both sides).

Both are trained with supervised learning. Once we have these two functions, AlphaGo actually picks its moves by running 1600 steps of Monte Carlo tree search (MCTS), using p and v to guide the search. It trains p to bypass this expensive search process and directly pick good moves. As p improves, the expensive search becomes more powerful, and p chases this moving target.
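To make this concrete, here is a minimal, self-contained sketch of the improve-then-distill loop on a toy game (a Nim variant where players alternately take 1 or 2 stones and whoever takes the last stone wins), with lookup tables standing in for the network that computes p and v. The game, the simulation budget, and the update rules are all my own simplifications for illustration, not DeepMind's implementation; but the division of labor is the same: p steers which branches the search explores, v scores leaf positions so lines needn't be played out during search, and the search's visit counts plus the final game results become the supervised targets that p and v chase.

```python
import math
import random

# Toy stand-ins for illustration (not DeepMind's code): a Nim-like game where
# players alternately take 1 or 2 stones and whoever takes the last stone wins.
# A "board" is just the number of stones left; values are always from the
# perspective of the player to move.

def legal_moves(n):
    return [1, 2] if n >= 2 else [1]

# Lookup tables standing in for the learned network that computes p and v.
prior = {}   # p: board -> distribution over moves
value = {}   # v: board -> predicted outcome in [-1, 1]

def p(n):
    moves = legal_moves(n)
    return prior.get(n, {m: 1.0 / len(moves) for m in moves})  # uniform before training

def v(n):
    return value.get(n, 0.0)  # "no idea who wins" before training

def mcts(root, simulations=200, c_puct=1.5):
    """One amplification step: improve on p(root) by searching, guided by p and v."""
    N, Q, expanded = {}, {}, set()

    def search(n):
        if n == 0:
            return -1.0            # the player to move has already lost
        if n not in expanded:
            expanded.add(n)
            return v(n)            # leaf: back up v's estimate instead of a rollout
        moves = legal_moves(n)
        total = sum(N.get((n, m), 0) for m in moves)

        def puct(m):               # p proposes branches to explore, Q tracks search results
            q, visits = Q.get((n, m), 0.0), N.get((n, m), 0)
            return q + c_puct * p(n)[m] * math.sqrt(total + 1) / (1 + visits)

        m = max(moves, key=puct)
        outcome = -search(n - m)   # sign flips because the opponent moves next
        N[(n, m)] = N.get((n, m), 0) + 1
        Q[(n, m)] = Q.get((n, m), 0.0) + (outcome - Q.get((n, m), 0.0)) / N[(n, m)]
        return outcome

    for _ in range(simulations):
        search(root)
    visits = {m: N.get((root, m), 0) for m in legal_moves(root)}
    total = sum(visits.values())
    return {m: c / total for m, c in visits.items()}  # the improved ("amplified") policy

def self_play(start=7, games=300, lr=0.2):
    """Distillation: train p toward the search's choices and v toward who actually won."""
    for _ in range(games):
        n, trajectory = start, []
        while n > 0:
            target = mcts(n)
            trajectory.append((n, target))
            moves = list(target)
            n -= random.choices(moves, weights=[target[m] for m in moves])[0]
        outcome = 1.0              # the player who just moved took the last stone and won
        for board, target in reversed(trajectory):
            prior[board] = target                                # p chases the searched policy
            value[board] = v(board) + lr * (outcome - v(board))  # v chases the game result
            outcome = -outcome

self_play()
print({n: max(p(n), key=p(n).get) for n in range(2, 8)})  # preferred move at each board size
```

The same structure scales up: replace the toy game with Go, the lookup tables with a single deep network, and the memorizing updates with gradient steps toward the same targets.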

Iterated capability amplification

In the simplest form of iterated capability amplification, we train one function:

  • A “weak” policy A, which is trained to predict what the agent will eventually decide to do in a given situation.

Just like AlphaGo doesn’t use the prior p directly to pick moves, we don’t use the weak policy A directly to pick actions. Instead, we use a capability amplification scheme: we call A many times in order to produce more intelligent judgments. We train A to bypass this expensive amplification process and directly make intelligent decisions. As A improves, the amplified policy becomes more powerful, and A chases this moving target.

In the case of AlphaGo Zero, A is the prior over moves, and the amplification scheme is MCTS. (More precisely: A is the pair (p, v), and the amplification scheme is MCTS + using a rollout to see who wins.)

Outside of Go, A might be a question-answering system, which can be applied several times in order to first break a question down into pieces and then separately answer each component. Or it might be a policy that updates a cognitive workspace, which can be applied many times in order to “think longer” about an issue.
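As a toy illustration of the question-answering version (entirely my own construction, not something from the post): below, "questions" are fully parenthesized sums, the weak policy A is a lookup table that initially only knows single digits, Amplify answers a harder question by breaking it into subquestions, answering the pieces with A (decomposing further whenever A doesn't yet know a piece), and recombining the answers, and Distill trains A to give those answers directly. A real Distill step would be supervised learning with a model that generalizes, whereas this table only memorizes, so the point is the shape of the loop rather than a genuine capability gain.

```python
import random

# A toy domain (my invention, not from the post): questions are fully
# parenthesized sums such as "((3+5)+(2+9))".

def make_question(depth):
    if depth == 0:
        return str(random.randint(0, 9))
    return "(" + make_question(depth - 1) + "+" + make_question(depth - 1) + ")"

def split_top(question):
    """Split '(<left>+<right>)' at its top-level '+'."""
    depth = 0
    for i, ch in enumerate(question):
        depth += (ch == "(") - (ch == ")")
        if ch == "+" and depth == 1:
            return question[1:i], question[i + 1:-1]

# The "weak" policy A: a lookup table that initially answers only single-digit questions.
A = {str(d): d for d in range(10)}

def amplify(question):
    """Call A many times: decompose the question, answer the pieces, recombine."""
    if question in A:
        return A[question]           # A already answers this piece directly
    left, right = split_top(question)
    return amplify(left) + amplify(right)

def distill(max_depth=4, samples_per_depth=50):
    """Train A to reproduce the amplified answers directly (here: just memorize them)."""
    for depth in range(1, max_depth + 1):
        for _ in range(samples_per_depth):
            q = make_question(depth)
            A[q] = amplify(q)        # A chases the amplified system's behavior

distill()
deep = [q for q in A if q.count("+") >= 7]   # depth-3+ questions A now answers in one step
print(len(deep), "deep questions distilled, e.g.", deep[0], "->", A[deep[0]])
```

As A absorbs more questions, Amplify bottoms out sooner, which is the "A chases a moving target" dynamic in miniature.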

The significance

Reinforcement learners take a reward function and optimize it; unfortunately, it’s not clear where to get a reward function that faithfully tracks what we care about. That’s a key source of safety concerns.

By contrast, AlphaGo Zero takes a policy-improvement-operator (like MCTS) and converges towards a fixed point of that operator. If we can find a way to improve a policy while preserving its alignment, then we can apply the same algorithm in order to get very powerful but aligned strategies.
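In symbols (my notation, not the post's), write the policy improvement operator as Amplify and the supervised training step as Distill; then the training dynamics and the object they converge to are roughly:

$$\pi_{t+1} = \mathrm{Distill}\big(\mathrm{Amplify}(\pi_t)\big) \qquad\longrightarrow\qquad \pi^{*} = \mathrm{Distill}\big(\mathrm{Amplify}(\pi^{*})\big)$$

In AlphaGo Zero, Amplify is MCTS plus playing the game out to see who wins, and Distill is the supervised update of (p, v). The hope is that if Amplify and Distill each preserve alignment, the fixed point inherits it while being far more capable than the starting policy.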

Using MCTS to achieve a simple goal in the real world wouldn’t preserve alignment, so it doesn’t fit the bill. But “think longer” might. As long as we start with a policy that is close enough to being aligned — a policy that “wants” to be aligned, in some sense — allowing it to think longer may make it both smarter and more aligned.

I think designing alignment-preserving policy amplification is a tractable problem today, which can be studied either in the context of existing ML or human coordination. So I think it’s an exciting direction in AI alignment. A candidate solution could be incorporated directly into the AlphaGo Zero architecture, so we can already get empirical feedback on what works. If by good fortune powerful AI systems look like AlphaGo Zero, then that might get us much of the way to an aligned AI.


This was originally posted here on 19th October 2017.

Tomorrow's AI Alignment Forum sequences will continue with a pair of posts, 'What is narrow value learning' by Rohin Shah and 'Ambitious vs. narrow value learning' by Paul Christiano, from the sequence on Value Learning.

The next post in this sequence will be 'Directions for AI Alignment' by Paul Christiano on Thursday.

Comments
Using MCTS to achieve a simple goal in the real world wouldn’t preserve alignment, so it doesn’t fit the bill.

Also, an arbitrary supervised learning step that updates A is not safe. Generally, making that Distill step safe seems to me like the hardest challenge of the iterated capability amplification approach. Are there already research directions for tackling that challenge? (if I understand correctly, your recent paper did not focus on it).

I think Techniques for optimizing worst-case performance may be what you're looking for.

Thank you.

I see how the directions proposed there (adversarial training, verification, transparency) can be useful for creating aligned systems. But if we use a Distill step that can be trusted to be safe via one or more of those approaches, I find it implausible that Amplification would yield systems that are competitive relative to the most powerful ones created by other actors around the same time (i.e. actors that create AI systems without any safety-motivated restrictions on the model space and search algorithm).

Paul's position in that post was:

All of these approaches feel very difficult, but I don’t think we’ve run into convincing deal-breakers.

I think this is meant to include the difficulty of making them competitive with unaligned ML, since that has been his stated goal. If you can argue that we should be even more pessimistic than this, I'm sure a lot of people would find that interesting.

In this 2017 post about Amplification (linked from OP) Paul wrote: "I think there is a very good chance, perhaps as high as 50%, that this basic strategy can eventually be used to train benign state-of-the-art model-free RL agents."

The post you linked to is more recent, so either the quote in your comment reflects an update or Paul has other insights/estimates about safe Distill steps.

BTW, I think Amplification might currently be the most promising approach for creating aligned and powerful systems; what I argue is that in order to save the world it will probably need to be complemented with governance solutions.

BTW, I think Amplification might currently be the most promising approach for creating aligned and powerful systems; what I argue is that in order to save the world it will probably need to be complemented with governance solutions.

How uncompetitive do you think aligned IDA agents will be relative to unaligned agents, and what kinds of governance solutions do you think that would call for? Also, I should have made this clearer last time, but I'd be interested to hear more about why you think Distill probably can't be made both safe and competitive, regardless of whether you're more or less optimistic than Paul.

How uncompetitive do you think aligned IDA agents will be relative to unaligned agents

For the sake of this estimate I'm using a definition of IDA that is probably narrower than what Paul has in mind: in the definition I use here, the Distill steps are carried out by nothing other than supervised learning + what it takes to make that supervised learning safe (but the implementation of the Distill steps may be improved during the Amplify steps).

This narrow definition might not include the most promising future directions of IDA (e.g. maybe the Distill steps should be carried out by some other process that involves humans). Without this simplifying assumption, one might define IDA as broadly as: "iteratively create stronger and stronger safe AI systems by using all the resources and tools that you currently have". Carrying out that Broad IDA approach might include efforts like asking AI alignment researchers to get into a room with a whiteboard and come up with ideas for new approaches.

Therefore, this estimate uses my narrow definition of IDA. If you like, I can also answer the general question: "How uncompetitive do you think aligned agents will be relative to unaligned agents?".

My estimate:

Suppose it is the case that if OpenAI decided to create an AGI agent as soon as they could, it would have taken them X years (assuming an annual budget of $10M, that the world around them stays the same, that OpenAI doesn't do neuroscience, and that no unintentional disasters happen).

Now suppose that OpenAI decided to create an aligned IDA agent with AGI capabilities as soon as they could (same conditions). How much time would it take them? My estimate follows; each entry is in the format:

[years]: [my credence that it would take them at most that many years]

(consider writing down your own credences before looking at mine)

1.0X: 0.1%
1.1X: 3%
1.2X: 3%
1.5X: 4%
2X: 5%
5X: 10%
10X: 30%
100X: 60%

Generally, I don't see why we should expect that the most capable systems that can be created with supervised learning (e.g. by using RL to search over an arbitrary space of NN architectures) would perform similarly to the most capable systems that can be created, at around the same time, using some restricted supervised learning that humans must trust to be safe. My prior is that the former is very likely to outperform by a lot, and I'm not aware of strong evidence pointing one way or another.

So for example, I expect that an aligned IDA agent will be outperformed by an agent that was created by that same IDA framework when replacing the most capable safe supervised learning in the Distill steps with the most capable unrestricted supervised learning available at around the same time.

How uncompetitive do you think aligned IDA agents will be relative to unaligned agents

I think they will probably be uncompetitive enough to make some complementary governance solutions necessary (this line replaced an attempt at a quantitative answer which turned out long; let me know if you want it).

what kinds of governance solutions do you think that would call for?

I'm very uncertain. It might be the case that our world must stop being a place in which anyone with $10M can purchase millions of GPU hours. I'm aware that most people in the AI safety community are extremely skeptical about governments carrying out "stabilization" efforts etcetera. I suspect this common view fails to account for likely pivotal events (e.g. some advances in narrow AI that might suddenly allow anyone with sufficient computation power to carry out large scale terror attacks). I think Allan Dafoe's research agenda for AI Governance is an extremely important and neglected landscape that we (the AI safety community) should be looking at to improve our predictions and strategies.

Generally, I don’t see why we should expect that the most capable systems that can be created with supervised learning (e.g. by using RL to search over an arbitrary space of NN architectures) would perform similarly to the most capable systems that can be created, at around the same time, using some restricted supervised learning that humans must trust to be safe. My prior is that the former is very likely to outperform by a lot, and I’m not aware of strong evidence pointing one way or another.

This seems similar to my view, which is that if you try to optimize for just one thing (efficiency) you're probably going to end up with more of that thing than if you try to optimize for two things at the same time (efficiency and safety) or if you try to optimize for that thing under a heavy constraint (i.e., safety).

But there are people (like Paul) who seem to be more optimistic than this based on more detailed inside-view intuitions, which makes me wonder if I should defer to them. If the answer is no, there's also the question of how do we make policy makers take this problem seriously (i.e., that safe AI probably won't be as efficient as unsafe AI) given the existence of more optimistic AI safety researchers, so that they'd be willing to undertake costly preparations for governance solutions ahead of time. By the time we get conclusive evidence one way or another, it may be too late to make such preparations.

If the answer is no, there's also the question of how do we make policy makers take this problem seriously (i.e., that safe AI probably won't be as efficient as unsafe AI) given the existence of more optimistic AI safety researchers (so that they'd be willing to undertake costly preparations for governance solutions ahead of time).

I'm not aware of any AI safety researchers that are extremely optimistic about solving alignment competitively. I think most of them are just skeptical about the feasibility of governance solutions, or think governance related interventions might be necessary but shouldn't be carried out yet.

In this 80,000 Hours podcast episode, Paul said the following:

In terms of the actual value of working on AI safety, I think the biggest concern is this, “Is this an easy problem that will get solved anyway?” Maybe the second biggest concern is, “Is this a problem that’s so difficult that one shouldn’t bother working on it or one should be assuming that we need some other approach?” You could imagine, the technical problem is hard enough that almost all the bang is going to come from policy solutions rather than from technical solutions.
And you could imagine, those two concerns maybe sound contradictory, but aren’t necessarily contradictory, because you could say, “We have some uncertainty about this parameter of how hard this problem is.” Either it’s going to be easy enough that it’s solved anyway, or it’s going to be hard enough that working on it now isn’t going to help that much and so what mostly matters is getting our policy response in order. I think I don’t find that compelling, in part because one, I think the significant probability on the range … like the place in between those, and two, I just think working on this problem earlier will tell us what’s going on. If we’re in the world where you need a really drastic policy response to cope with this problem, then you want to know that as soon as possible.
It’s not a good move to be like, “We’re not going to work on this problem because if it’s serious, we’re going to have a dramatic policy response.” Because you want to work on it earlier, discover that it seems really hard and then have significantly more motivation for trying the kind of coordination you’d need to get around it.

I’m not aware of any AI safety researchers that are extremely optimistic about solving alignment competitively.

I'm not sure what you'd consider "extremely" optimistic, but I gathered some quantitative estimates of AI risk here, and they all seem overly optimistic to me. Did you see that?

Paul: I just think working on this problem earlier will tell us what’s going on. If we’re in the world where you need a really drastic policy response to cope with this problem, then you want to know that as soon as possible.

I agree with this motivation to do early work, but in a world where we do need drastic policy responses, I think it's pretty likely that the early work won't actually produce conclusive enough results to show that. For example, if a safety approach fails to make much progress, there's not really a good way to tell if it's because safe and competitive AI really is just too hard (and therefore we need a drastic policy response), or because the approach is wrong, or the people working on it aren't smart enough, or they're trying to do the work too early. People who are inclined to be optimistic will probably remain so until it's too late.

but I gathered some quantitative estimates of AI risk here, and they all seem overly optimistic to me. Did you see that?

I only now read that thread. I think it is extremely worthwhile to gather such estimates.

I think all the three estimates mentioned there correspond to marginal probabilities (rather than probabilities conditioned on "no governance interventions"). So those estimates already account for scenarios in which governance interventions save the world. Therefore, it seems we should not strongly update against the necessity of governance interventions due to those estimates being optimistic.

Maybe we should gather researchers' credences for predictions like:
"If there will be no governance interventions, competitive aligned AIs will exist in 10 years from now".

I suspect that gathering such estimates from publicly available information might expose us to a selection bias, because very pessimistic estimates might be outside the Overton window (even for the EA/AIS crowd). For example, if Robert Wiblin would have concluded that an AI existential catastrophe is 50% likely, I'm not sure that the 80,000 Hours website (which targets a large and motivationally diverse audience) would have published that estimate.

I agree with this motivation to do early work, but in a world where we do need drastic policy responses, I think it's pretty likely that the early work won't actually produce conclusive enough results to show that. For example, if a safety approach fails to make much progress, there's not really a good way to tell if it's because safe and competitive AI really is just too hard (and therefore we need a drastic policy response), or because the approach is wrong, or the people working on it aren't smart enough, or they're trying to do the work too early.

I strongly agree with all of this.

I think all the three estimates mentioned there correspond to marginal probabilities (rather than probabilities conditioned on "no governance interventions"). So those estimates already account for scenarios in which governance interventions save the world. Therefore, it seems we should not strongly update against the necessity of governance interventions due to those estimates being optimistic

I normally give ~50% as my probability we'd be fine without any kind of coordination.

Upvoted for giving this number, but what does it mean exactly? You expect "50% fine" through all kinds of x-risk, assuming no coordination from now until the end of the universe? Or just assuming no coordination until AGI? Is it just AI risk instead of all x-risk, or just risk from narrow AI alignment? If "AI risk", are you including risks from AI exacerbating human safety problems, or AI differentially accelerating dangerous technologies? Is it 50% probability that humanity survives (which might be "fine" to some people) or 50% that we end up with a nearly optimal universe? Do you have a document that gives all of your quantitative risk estimates with clear explanations of what they mean?

(Sorry to put you on the spot here when I haven't produced anything like that myself, but I just want to convey how confusing all this is.)

MCTS works as amplification because you can evaluate future board positions to get a convergent estimate of how well you're doing - and then eventually someone actually wins the game, which keeps p from departing reality entirely. Importantly, the single thing you're learning can play the role of the environment, too, by picking the opponents' moves.

In trying to train A to predict human actions given access to A, you're almost doing something similar. You have a prediction that's also supposed to be a prediction of the environment (the human), so you can use it for both sides of a tree search. But A isn't actually searching through an interesting tree - it's searching for cycles of length 1 in its own model of the environment, with no particular guarantee that any cycles of length 1 exist or are a good idea. "Tree search" in this context (I think) means spraying out a bunch of outputs and hoping at least one falls into a fixed point upon iteration.

EDIT: Big oops, I didn't actually understand what was being talked about here.

I agree there is a real sense in which AGZ is "better-grounded" (and more likely to be stable) than iterated amplification in general. (This was some of the motivation for the experiments here.)

Oh, I've just realized that the "tree" was always intended to be something like task decomposition. Sorry about that - that makes the analogy a lot tighter.

Isn't A also grounded in reality by eventually giving no A to consult with?

This is true when getting training data, but I think it's a difference between A (or HCH) and AlphaGo Zero when doing simulation / amplification. Someone wins a simulated game of Go even if both players are making bad moves (or even random moves), which gives you a signal that A doesn't have access to.

I don't suppose you could explain how it uses p and v? Does it use p to decide which path to go down and v to avoid fully playing it out?

In the simplest form of iterated capability amplification, we train one function:
A “weak” policy A, which is trained to predict what the agent will eventually decide to do in a given situation.
Just like AlphaGo doesn’t use the prior p directly to pick moves, we don’t use the weak policy A directly to pick actions. Instead, we use a capability amplification scheme: we call A many times in order to produce more intelligent judgments. We train A to bypass this expensive amplification process and directly make intelligent decisions. As A improves, the amplified policy becomes more powerful, and A chases this moving target.

This is totally wild speculation, but the thought occurred to me whether the human brain might be doing something like this with identities and social roles:

A lot of (but not all) people get a strong hit of this when they go back to visit their family. If you move away and then make new friends and sort of become a new person (!), you might at first think this is just who you are now. But then you visit your parents… and suddenly you feel and act a lot like you did before you moved away. You might even try to hold onto this “new you” with them… and they might respond to what they see as strange behavior by trying to nudge you into acting “normal”: ignoring surprising things you say, changing the topic to something familiar, starting an old fight, etc. [...]
For instance, the stereotypical story of the worried nagging wife confronting the emotionally distant husband as he comes home really late from work… is actually a pretty good caricature of a script that lots of couples play out, as long as you know to ignore the gender and class assumptions embedded in it.
But it’s hard to sort this out without just enacting our scripts. The version of you that would be thinking about it is your character, which (in this framework) can accurately understand its own role only if it has enough slack to become genre-savvy within the web; otherwise it just keeps playing out its role. In the husband/wife script mentioned above, there’s a tendency for the “wife” to get excited when “she” learns about the relationship script, because it looks to “her” like it suggests how to save the relationship — which is “her” enacting “her” role. This often aggravates the fears of the “husband”, causing “him” to pull away and act dismissive of the script’s relevance (which is “his” role), driving “her” to insist that they just need to talk about this… which is the same pattern they were in before. They try to become genre-savvy, but there (usually) just isn’t enough slack between them, so the effort merely changes the topic while they play out their usual scene.

If you squint, you could kind of interpret this kind of a dynamic to be a result of the human brain trying to predict what it expects itself to do next, using that prediction to guide the search of next actions, and then ending up with next actions that have a strong structural resemblance to its previous ones. (Though I can also think of maybe better-fitting models of this too; still, seemed worth throwing out.)

How do you know MCTS doesn't preserve alignment?

As I understand it, MCTS is used to maximize a given computable utility function, so it is not alignment-preserving in the general sense that sufficiently strong optimization of an imperfect utility function does not preserve alignment.