Agentic Interpretability: A Strategy Against Gradual Disempowerment
Authors: Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord

Full paper on arXiv.

We propose a research direction called agentic interpretability. The idea of agentic interpretability stems from the observation that AI systems are becoming increasingly adept at communicating with us, verbalizing their thoughts, and providing explanations, raising...
We hope to borrow many ideas from cognitive science, where mental models between people (e.g., in co-working or teacher/student situations) are well studied. The following papers give a good sense of the flavor of this work: https://langcog.stanford.edu/papers_new/goodman-2016-tics.pdf and https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2022.0048. In other words, cognitive scientists have studied how humans build models of each other in order to collaborate more effectively or to enable better education, and agentic interpretability advocates doing something similar (though it may look very different) with machines.