Searching for a model's concepts by their shape – a theoretical framework
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort

Introduction

I think that Discovering Latent Knowledge in Language Models Without Supervision (DLK; Burns, Ye, Klein, & Steinhardt, 2022) is a very cool paper – it proposes a way to do unsupervised mind reading[1] – diminished only by not making its conceptual coolness evident enough in the paper writeup. This is in large part corrected in Collin Burns's more conceptual companion post. I'm rooting for their paper to sprout more research that finds concepts / high-level features in models by searching for their shape[2].

The aim of this post is to present a conceptual framework for this kind of interpretability, which I hope will facilitate turning concepts into structures to look for in ML models. Understanding this post does not require having read DLK, as the reader will be walked through the relevant parts of the result in parallel with developing this framework.

Epistemic status: This post contains a theoretical framework that suggests a bunch of concrete experiments, but none of these experiments have been run yet (well, except for ones that were run prior to the creation of the framework), so it's quite possible the framework ends up leading to nothing in practice. Treat this as an interim report, and as an invitation to action to run some of these experiments yourself.

This post is written by Kaarel Hänni on behalf of a team consisting of Georgios Kaklamanos, Kay Kozaronek, Walter Laurito, June Ku, Alex Mennen, and myself.

The rough picture of the approach

The idea is to start with a list of related concepts (or possibly just a single concept), write down a list of relations that the concepts satisfy, and then search for a list of features inside the model satisfying these relations. The hope is that this gives a way to find concepts inside models which requires us to make fewer assumptions about the model's thinking than if we were e.g. training a probe on activations