This is a linkpost for

Using contrast pairs, the authors extract linear directions in the activation space of AlphaZero that correspond to concepts. By observing AlphaZero's play in situations where these concepts apply, human grandmasters can improve their own play. 

This is related to the following recent research:

  • Burns et al. (2022) found directions which correlate with truth in a language model’s latent space by using unsupervised methods applied to contrast pairs. 
  • Research by Alex Turner's SERI MATS stream used contrast pairs to identify directions representing various concepts in the latent space of GPT-2 and RL agents. By adding or subtracting these vectors during the models’ forward passes, they controlled model outputs in sophisticated ways. (1, 2, 3)
  • Zou et al. (2023) describe and motivate the research direction of representation engineering. They empirically evaluate variants of these techniques, showing that representations can be effectively used to monitor and control various AI behaviors. 
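The contrast-pair technique shared by these works can be sketched as a difference-of-means over activations. The sketch below is illustrative, not the papers' exact method: the activations are synthetic stand-ins (in practice they would be recorded from a model's forward passes), and the shift magnitude and dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each activation is a d-dimensional vector from one
# layer of a network. We synthesize them so that "concept present" examples
# are shifted along a hidden direction, mimicking a linearly represented concept.
d = 64
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

def fake_activations(concept_present: bool, n: int = 200) -> np.ndarray:
    base = rng.normal(size=(n, d))
    if concept_present:
        # Presence of the concept shifts activations along the hidden direction.
        base = base + 3.0 * true_direction
    return base

# Contrast pairs: matched inputs with and without the concept.
acts_pos = fake_activations(True)
acts_neg = fake_activations(False)

# Difference-of-means estimate of the concept direction.
concept_direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
concept_direction /= np.linalg.norm(concept_direction)

# The estimate should align closely with the hidden direction.
alignment = abs(concept_direction @ true_direction)

# Steering (as in the activation-addition work above): add the direction
# to a fresh activation to push the model's state toward the concept.
new_act = fake_activations(False, n=1)[0]
steered = new_act + 3.0 * concept_direction
```

With enough pairs, the noise in the two means averages out and the estimated direction aligns closely with the underlying concept direction; adding or subtracting it then moves an activation toward or away from the concept.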

Collin Burns has argued that unsupervised methods for concept discovery should scale to superhuman systems, offering an empirical average-case approach to ELK.

AlphaZero has superhuman performance and may not use the same ontology as human players. This creates an empirical opportunity to study the problem of ontology mismatch.

Section 4.1 describes the method for constructing contrast pairs and finding linear directions representing concepts. The full paper can be found here.
