There is an Alignment Forum post by Paul Christiano titled "AlphaGo Zero and capability amplification", part of the Iterated Amplification sequence, that bugged me when I came across it a few years ago. I had skimmed "Mastering the game of Go without human knowledge" (henceforth "the AGZ paper") shortly before, and felt that Christiano's post contained a few weird inaccuracies.
Since I'm not an ML person, I assumed that I was probably just missing some context or lingo, or misremembering. So I went back to the AGZ paper, but that failed to resolve my confusion. I took a few notes and left it at that. When I came across those notes by chance today, I decided to look into it a bit more.
A few pretty minor things bug me but are really not worth the trouble; there are two larger ones, though, which I do want to highlight.
The post begins:
"AlphaGo Zero is an impressive demonstration of AI capabilities. It also happens to be a nice proof-of-concept of a promising alignment strategy."
with "AlphaGo Zero" being a link to a blog post by DeepMind. Makes sense.
This is followed by a section titled "How AlphaGo Zero works":
"AlphaGo Zero learns two functions (which take as input the current board):
- A prior over moves p is trained to predict what AlphaGo will eventually decide to do.
- A value function v is trained to predict which player will win (if AlphaGo plays both sides)
Both are trained with supervised learning. Once we have these two functions, AlphaGo actually picks it[s] moves by using 1600 steps of Monte Carlo tree search (MCTS), using p and v to guide the search. It trains p to bypass this expensive search process and directly pick good moves. As p improves, the expensive search becomes more powerful, and p chases this moving target."
This section sounds wrong to me. In isolation, "Both are trained with supervised learning" can be read as referring to the inner loop, meaning the step in which the "weak" policy is improved. [1] That would be fine.
The problem is the sentence right after: "Once we have these two functions..."
In AlphaGo Zero, henceforth "AGZ", there is no point at which we do not have p and v! We start with random play! It is ~the defining feature of AlphaGo Zero, that it starts from Zero!
Thus, to me, the passage reads naturally as something like "p and v are first trained using supervised learning followed by RL self-play using MCTS", which is not the case.
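To make the "inner loop" reading concrete: as I understand the AGZ paper, the network's parameters are updated to minimise a loss of the form (z - v)^2 - π^T log p plus an L2 penalty, where π is the search output and z is the self-play winner. The sketch below computes that loss for a single position in plain Python. The function and argument names are mine, not DeepMind's, and the L2 term is simply passed in as a precomputed number, which is a simplification.

```python
import math

def agz_style_loss(p, v, pi, z, l2_penalty=0.0):
    """Loss on one position: (z - v)^2 - sum_a pi[a] * log p[a] + l2_penalty.

    p  : network move probabilities (list of floats)
    v  : network value estimate in [-1, 1]
    pi : search probabilities used as the policy target
    z  : game outcome (+1 or -1) used as the value target
    All names are illustrative assumptions, not taken from any codebase.
    """
    value_term = (z - v) ** 2
    # Cross-entropy between the search target pi and the network output p.
    # Moves with zero target weight contribute nothing and are skipped.
    policy_term = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    return value_term + policy_term + l2_penalty

# Network already agrees with both targets: loss is zero.
perfect = agz_style_loss(p=[1.0, 0.0], v=1.0, pi=[1.0, 0.0], z=1.0)

# Uniform prior and neutral value against a confident target: loss is 1 + log 2.
uninformed = agz_style_loss(p=[0.5, 0.5], v=0.0, pi=[1.0, 0.0], z=1.0)
```

Taken on its own this looks like an ordinary supervised objective. The difference is where the targets come from: π and z are produced by the system's own search and self-play, starting from random play, rather than from a fixed dataset.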
My next thought was that "Trained by SL, then amplified at game time via MCTS" would have been a reasonable fit for its predecessor AlphaGo Fan [2] (see the first page of the AGZ paper):
The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte-Carlo Tree Search (MCTS) to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte-Carlo rollouts using a fast rollout policy) to evaluate positions in the tree.
Maybe the distinction between the two does not really matter for the analogy the post is making? [3]
Maybe those two got mixed up but no harm no foul since they work equally well?
That is not the case, because in AlphaGo Fan MCTS comes in only at play time. AlphaGo Fan lacks lookahead via MCTS inside the training loop, which was first introduced, in the AlphaGo lineage, with AGZ. From the AGZ paper (page 1):
To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning.
See also Reinforcement Learning: An Introduction (second edition, chapter 16, page 447):
A significant difference between AlphaGo Zero and AlphaGo is that AlphaGo Zero used MCTS to select moves throughout self-play reinforcement learning, whereas AlphaGo used MCTS for live play after—but not during—learning.
AlphaGo Fan thus lacks the whole "moving target" idea that makes AGZ interesting to Christiano in the first place!
Even setting all this aside, Christiano's choice of words clashes with the terminology used by DeepMind and others, who insist on describing AGZ as "trained solely by self-play reinforcement learning". Again from the AGZ paper:
Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data.
This insistence on only reinforcement learning is repeated in many places:
Where in addition to reinforcement learning, AlphaGo relied on supervised learning from a large database of expert human moves, AlphaGo Zero used only reinforcement learning and no human data or guidance beyond the basic rules of the game (hence the Zero in its name).
Later on, in the section "Iterated capability amplification", he writes:
In the case of AlphaGo Zero, A is the prior over moves, and the amplification scheme is MCTS. (More precisely: A is the pair (p, v), and the amplification scheme is MCTS + using a rollout to see who wins.)
"MCTS + using a rollout to see who wins" sounds an awful lot like the description of AlphaGo Fan I quoted earlier:
Once trained, these networks were combined with a Monte-Carlo Tree Search (MCTS) to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte-Carlo rollouts using a fast rollout policy) to evaluate positions in the tree.
It does not work well as a description of AGZ. Here is a description taken from Reinforcement Learning: An Introduction (second edition, chapter 16, page 447):
AlphaGo Zero’s MCTS was simpler than the version used by AlphaGo in that it did not include rollouts of complete games, and therefore did not need a rollout policy. Each iteration of AlphaGo Zero’s MCTS ran a simulation that ended at a leaf node of the current search tree instead of at the terminal position of a complete game simulation.
Or another taken from the AGZ paper:
Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts.
If he uses "rollout" to refer to something different, that is fine, but the DeepMind blog post that he himself links uses the term in the same sense as the quote above:
AlphaGo Zero does not use “rollouts” - fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high quality neural networks to evaluate positions.
There it is listed as one of the notable differences from prior versions. So at the very least he is breaking with the terminology as it is used in this context.
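The difference between the two kinds of leaf evaluation can be sketched on a toy game. Everything here is a hypothetical stand-in: the game is "take 1 or 2 stones, whoever takes the last stone wins", `toy_value_net` plays the role of a learned v, and the equal mixing weight in the rollout variant is an assumption for illustration, not a claim about either program's actual settings.

```python
import random

def legal_moves(n):
    """Stones that may be taken from a pile of size n."""
    return [m for m in (1, 2) if m <= n]

def random_rollout(n, rng):
    """Play uniformly random moves to the end of the game.

    Returns +1 if the player to move at the start takes the last stone,
    otherwise -1 (including the case where the pile is already empty).
    """
    player = 0  # 0 = the player to move at the leaf
    while n > 0:
        n -= rng.choice(legal_moves(n))
        if n == 0:
            return 1 if player == 0 else -1
        player = 1 - player
    return -1

def toy_value_net(n):
    """Hypothetical stand-in for a learned value function v(s)."""
    return -1.0 if n % 3 == 0 else 1.0

def evaluate_leaf_without_rollout(n):
    """Rollout-free style: the leaf value comes from the value function alone."""
    return toy_value_net(n)

def evaluate_leaf_with_rollout(n, rng, mix=0.5):
    """Rollout style: blend the value function with one random playout."""
    return (1 - mix) * toy_value_net(n) + mix * random_rollout(n, rng)
```

In the first evaluator nothing is played out past the leaf, which is the sense in which the sources quoted above say AGZ performs no rollouts.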
A thought that occurred to me was that he might take "rollout" to refer to the fact that the games generated by self-play are played to some end. [4] This sense of "rollout" belongs to what the AGZ paper calls the policy evaluation operator, whose output is then used in the distillation/optimization step:
Self-play with search – using the improved MCTS-based policy to select each move, then using the game winner z as a sample of the value – may be viewed as a powerful policy evaluation operator.
The problem is that this interpretation is ruled out by the immediate context of the word "rollout" in Christiano's post:
In the case of AlphaGo Zero, A is the prior over moves, and the amplification scheme is MCTS. (More precisely: A is the pair (p, v), and the amplification scheme is MCTS + using a rollout to see who wins.)
The word occurs in a description of the amplification scheme which, in the Iterated Amplification sequence, is the part where a weak policy A is amplified by some means (see for instance the post Capability amplification).
The AGZ paper uses the term policy improvement operator:
The MCTS search outputs probabilities π of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network fθ(s); MCTS may therefore be viewed as a powerful policy improvement operator.
And it is this part which does not use rollouts in AGZ!
(In AlphaGo Fan, by contrast, rollouts are used in the policy improvement operator/amplification scheme.)
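To keep the two operators apart, here is a toy self-play loop on the game "take 1 or 2 stones, whoever takes the last stone wins". It is only a sketch under loud assumptions: `search_policy` is a one-ply lookahead standing in for MCTS, `value_estimate` is a hypothetical stand-in for v, and all names are mine. The point is only where "playing to the end" happens: in generating the value target z (evaluation), not inside the step that produces π (improvement).

```python
import random

def legal_moves(n):
    """Stones that may be taken from a pile of size n."""
    return [m for m in (1, 2) if m <= n]

def value_estimate(n):
    """Hypothetical stand-in for v(s), from the mover's perspective."""
    return -1.0 if n % 3 == 0 else 1.0

def search_policy(n):
    """Stand-in for the improvement operator: one-ply lookahead, no rollouts.

    A move's score is minus the opponent's value in the resulting position.
    Probability mass is split evenly over the best-scoring moves.
    """
    moves = legal_moves(n)
    scores = {m: -value_estimate(n - m) for m in moves}
    best = max(scores.values())
    winners = [m for m in moves if scores[m] == best]
    return {m: (1.0 / len(winners) if m in winners else 0.0) for m in moves}

def self_play_game(n, rng):
    """Play one game with search_policy; return (state, pi, z) examples.

    Evaluation operator: the game is played to the end and the winner
    becomes the value target z for every position that was visited.
    """
    history, player, winner = [], 0, None
    while n > 0:
        pi = search_policy(n)
        moves = list(pi)
        move = rng.choices(moves, weights=[pi[m] for m in moves])[0]
        history.append((n, pi, player))
        n -= move
        winner = player  # whoever moved last took the final stone
        player = 1 - player
    return [(s, pi, 1 if p == winner else -1) for s, pi, p in history]

examples = self_play_game(4, random.Random(0))
```

Each returned triple pairs a position with the search output π and the eventual outcome z, which are the kinds of targets a distillation step would then train on.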
Both of these seem like weird things to get wrong, and weirder still to leave uncorrected, even if this sort of thing is not relevant to the point the post is trying to make!
But maybe that is because I am simply misunderstanding something? Would love to know!
[1] in the case of AGZ, the way the neural network's parameters are updated to make
[2] though this is also not really a perfect fit!
[3] this seemed all the more likely at first given the question of rollouts below
[4] "both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length" (AGZ paper)