This is a linkpost for https://arxiv.org/abs/2310.16410
Using contrast pairs, the authors extract linear directions in AlphaZero's activation space that correspond to concepts. By studying AlphaZero's play in positions that exercise these concepts, human grandmasters can improve their own play.
This is related to the following recent research:
Collin Burns has argued that unsupervised methods for concept discovery should scale to superhuman systems, offering an empirical average-case approach to Eliciting Latent Knowledge (ELK).
Section 4.1 of the paper describes the method for constructing contrast pairs and finding linear directions that represent concepts. The full paper is at https://arxiv.org/abs/2310.16410.
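To make the idea of a "linear direction from contrast pairs" concrete, here is a minimal sketch. It is not the procedure from Section 4.1, which may differ substantially; it swaps in a plain difference-of-means estimate, which is about the simplest way to get a direction from two sets of activations. The activations are synthetic Gaussian vectors rather than anything from AlphaZero, and every name here (`concept_direction`, `concept_score`, `CONCEPT_AXIS`, and so on) is hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean negative activation to the
    mean positive activation (difference of means, an assumed stand-in
    for the paper's method)."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means coincide")
    return [d / norm for d in diff]


def concept_score(activation, direction):
    """Projection of one activation vector onto the concept direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Synthetic contrast sets: the "concept" shifts a single coordinate.
random.seed(0)
DIM, N, CONCEPT_AXIS, SHIFT = 16, 200, 3, 4.0

neg = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N)]
pos = [
    [random.gauss(0.0, 1.0) + (SHIFT if j == CONCEPT_AXIS else 0.0)
     for j in range(DIM)]
    for _ in range(N)
]

direction = concept_direction(pos, neg)
mean_pos_score = sum(concept_score(a, direction) for a in pos) / N
mean_neg_score = sum(concept_score(a, direction) for a in neg) / N
```

On this toy data the recovered direction should line up closely with the shifted coordinate, and positive examples should score higher along it than negative ones. Real activations are unlikely to separate this cleanly, which is presumably part of why the paper uses a more careful procedure.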