This is a linkpost for: Acquisition of Chess Knowledge in AlphaZero.
What is being learned by superhuman neural network agents such as AlphaZero? This question is of both scientific and practical interest. If the representations of strong neural networks bear no resemblance to human concepts, our ability to understand faithful explanations of their decisions will be restricted, ultimately limiting what we can achieve with neural network interpretability. In this work we provide evidence that human knowledge is acquired by the AlphaZero neural network as it trains on the game of chess. By probing for a broad range of human chess concepts we show when and where these concepts are represented in the AlphaZero network. We also provide a behavioural analysis focusing on opening play, including qualitative analysis from chess Grandmaster Vladimir Kramnik. Finally, we carry out a preliminary investigation looking at the low-level details of AlphaZero's representations, and make the resulting behavioural and representational analyses available online.
I think this work represents a significant improvement in interpretability for reinforcement learning agents and makes me moderately more optimistic about interpretability in general. However, the paper relies on hand-coded representations of chess concepts to probe the network. I worry that it will be hard to extend such an approach to more realistic or fuzzier domains where we can't hand-code concepts reliably. Section 7 does contain some results for unsupervised concept-level detection, though this wasn't the focus of the paper.
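To make the dependence on hand-coded concepts concrete, here is a toy sketch of concept probing in general. It is not the paper's code or setup: the "activations" are synthetic random vectors rather than AlphaZero layer outputs, the concept is a made-up linear function of them, and every name (`fit_linear_probe`, `r_squared`, `encoded_concept`, `control_concept`) is hypothetical. The probe is a plain linear regression with a small L2 penalty, fit by gradient descent; I chose that for brevity, not because it matches the paper's regulariser.

```python
import random

def fit_linear_probe(X, y, lr=0.05, epochs=500, l2=1e-3):
    """Fit y ~ w.x + b by full-batch gradient descent with an L2 penalty on w."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, y):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - target
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def r_squared(X, y, w, b):
    """Fraction of variance in y explained by the probe on the given data."""
    mean_y = sum(y) / len(y)
    ss_res = sum((sum(wi * xi for wi, xi in zip(w, x)) + b - t) ** 2
                 for x, t in zip(X, y))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)
dim, n_train, n_test = 8, 300, 100

# Stand-in for network activations on a set of positions (assumption: synthetic).
acts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_train + n_test)]

# A "hand-coded concept" that happens to be linearly readable from the activations...
encoded_concept = [2.0 * a[0] - a[3] + rng.gauss(0, 0.1) for a in acts]
# ...and a control concept that is unrelated to them.
control_concept = [rng.gauss(0, 1) for _ in acts]

train, test = slice(0, n_train), slice(n_train, None)

w, b = fit_linear_probe(acts[train], encoded_concept[train])
r2_encoded = r_squared(acts[test], encoded_concept[test], w, b)

w, b = fit_linear_probe(acts[train], control_concept[train])
r2_control = r_squared(acts[test], control_concept[test], w, b)

print(f"held-out R^2, encoded concept: {r2_encoded:.3f}")
print(f"held-out R^2, control concept: {r2_control:.3f}")
```

The probe only tells you something about concepts you can already compute a label for. Both `encoded_concept` and `control_concept` had to be written down by hand before any probing could happen, which is exactly the step I expect to be hard in fuzzier domains.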
There's already some discussion here.