Agentic Interpretability: A Strategy Against Gradual Disempowerment
Authors: Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord

Full paper on arXiv.

We propose a research direction called agentic interpretability. The idea of agentic interpretability stems from the observation that AI systems are becoming increasingly adept at communicating with us, verbalizing their thoughts, and providing explanations, raising...
We hope to borrow many ideas from cognitive science, where mental models between people (e.g., in co-working or teacher/student situations) are well studied. The following papers give a good sense of the flavor of this work: https://langcog.stanford.edu/papers_new/goodman-2016-tics.pdf and https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2022.0048. In other words, cognitive scientists have studied how humans build models of each other in order to collaborate more effectively or to enable better education, and agentic interpretability advocates doing something similar (though it may look very different) with machines.