Adding some clarifications re my personal take on this from an AGI safety perspective: I see these ideas as Been's brainchild; I largely just helped out with the wording and framing. I do not currently plan to work on agentic interpretability myself, but I still think the ideas are interesting and plausibly useful, and I'm glad the perspective is written up! I still see one of my main goals as working on robustly interpreting potentially deceptive AIs, and my guess is that this is not the comparative strength of agentic interpretability.
Why care about it? From a scientific perspective, I'm a big fan of baselines and doing the simple things first. "Prompt the model and see what happens" or "ask the model what it was doing" are the obvious things you should do first when trying to understand a behaviour. In internal experiments, we often find that we can just solve a problem with careful and purposeful prompting, with no need for anything fancy like SAEs or transcoders. But it seems kinda sloppy to "just do the obvious thing": I'm sure there's a bunch of nuance to doing this well, and to training models so that it's easy to do. I would be excited for there to be a rigorous science of when and how well these kinds of simple black box approaches actually work. This is only part of what agentic interpretability is about (there's also white box ideas, more complex multi-turn stuff, an emphasis on building mental models of each other, etc), but it's a direction I find particularly exciting – if nothing else, we need to answer this question to know where other interpretability methods can add value.
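As a rough illustration of the "ask the model what it was doing" baseline (not from the post), here is a minimal sketch; `query_model` is a hypothetical stand-in for whatever chat API you have, and the exact follow-up prompt is an assumption:

```python
def query_model(messages: list[dict]) -> str:
    """Hypothetical wrapper around a chat model: message history in, assistant reply out."""
    raise NotImplementedError("plug in a real chat API here")

def ask_model_what_it_was_doing(task_prompt: str) -> tuple[str, str]:
    # Turn 1: elicit the behaviour we want to understand.
    history = [{"role": "user", "content": task_prompt}]
    answer = query_model(history)

    # Turn 2: the simple black-box baseline -- just ask for a self-explanation.
    history += [
        {"role": "assistant", "content": answer},
        {"role": "user", "content": "Explain, step by step, how you arrived at that answer "
                                     "and what information you relied on."},
    ]
    explanation = query_model(history)
    return answer, explanation
```

The point of having this as an explicit baseline is that any fancier interpretability method should be compared against it: if the self-explanation already answers the question reliably, the fancier method needs to show where it adds value.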
It also seems that, if we're trying to use any kind of control or scalable oversight scheme where weak trusted models oversee strong untrusted models, then the better we are at high-fidelity communication with the weaker models, the better. And if the model is aligned, I feel much more excited about a world where the widely deployed systems are doing things users understand, rather than being inscrutable autonomous agents.
Naturally, it's worth thinking about negative externalities. In my opinion, helping humans have better models of AI psychology seems robustly good. AIs having better models of human psychology could be good for the reasons above, but there's the obvious concern that it will make models better at being deceptive, and I would be hesitant to recommend that such techniques become standard practice without better solutions to deception. But I expect companies to eventually do things vaguely along the lines of agentic interpretability regardless, so either way I would be keen to see research on how such techniques affect models' propensity and capability for deception.
I don't understand how this is an answer to gradual disempowerment.
In the GD scenario, both individual humans and AIs already understand the dynamics. The problem is not in the understanding; it is in the incentives. The threat model was never "oops, we did not understand the dynamics, and now we are disempowered… if only we had known". We already know before it happens; that's the point of the original GD paper. Having an LLM understand and explain those dynamics does not help avert that scenario.
I wouldn't go as far as calling it an answer, but I think it helps. The mechanism is lowering the gap between agent-powered things and things where a human understands what is being done, or is otherwise in the loop.
Do you have ideas about how to do this?
I can't think of much besides trying to get the AI to richly model itself, and build correspondences between that self-model and its text-production capability.
But this is, like, probably not a thing we should just do first and think about later. I'd like it to be part of a premeditated plan to handle outer alignment.
Edit: after thinking about it, that's too cautious. We should think first, but some experimentation is necessary. The thinking first should plausibly be more like having some idea about how to bias further work towards safety rather than building self-improving AI as fast as possible.
We hope to borrow many ideas from the cogsci work, where mental models between people (e.g., co-working, teacher/student situations) are well studied. The work we cite may give a good idea of the flavor: https://langcog.stanford.edu/papers_new/goodman-2016-tics.pdf or https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2022.0048. In other words, cogsci folks have been studying how humans come to understand each other in order to work better together or to enable better education, and agentic interpretability advocates doing something similar (though it may look very different) with machines.
I was asking more "how does the AI get a good model of itself", but your answer was still interesting, thanks. I'm still not sure whether you think there are straightforward ways future AIs will get such a model, all of which come out more or less at the starting point of your proposal. (Or not.)
Here's another take for you: this is like Eliciting Latent Knowledge (with extra hope placed on cogsci methods), except where I take ELK to be asking "how do you communicate with humans to make them good at RL feedback," you're asking "how do you communicate with humans to make them good at participating in verbal chain of thought?"
Authors: Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord
We propose a research direction called agentic interpretability. The idea of agentic interpretability stems from the observation that AI systems are becoming increasingly adept at communicating with us, verbalizing their thoughts, and providing explanations, raising the question of whether we could ask, and help, AI systems to build mental models of us that in turn help us build mental models of the LLMs.
The core idea of agentic interpretability is for a method to "proactively assist human understanding in a multi-turn interactive process by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM". In other words, enable machines to help us understand them.
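As a toy sketch of this multi-turn loop (an illustration of the idea, not the paper's method), one could keep an explicit user model that conditions each explanation and is refined after every exchange; `query_model` and the text-summary user model are simplifying assumptions:

```python
def query_model(system: str, messages: list[dict]) -> str:
    """Hypothetical chat wrapper: system prompt plus message history -> assistant reply."""
    raise NotImplementedError("plug in a real chat API here")

def interactive_session(user_turns: list[str]) -> list[str]:
    # The model's mental model of the user, kept as a running text summary (a simplification).
    user_model = "Unknown background; assume no prior knowledge."
    history, replies = [], []
    for turn in user_turns:
        history.append({"role": "user", "content": turn})
        # Explain in terms the current user model suggests will land.
        reply = query_model(
            system=f"Tailor your explanation to this model of the user: {user_model}",
            messages=history,
        )
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
        # Refine the user model from the latest exchange.
        user_model = query_model(
            system="Summarise what this exchange reveals about the user's background, "
                   "goals, and current understanding of the AI system.",
            messages=history[-2:],
        )
    return replies
```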
In the AGI safety community, interpretability is primarily considered a means of detecting deception and other forms of misalignment. In contrast, here we use interpretability in the broad sense of techniques that help us build a deeper understanding of models, what they are doing, and why. Agentic interpretability is not primarily intended to be robust to adversarial systems (although we suggest an idea of "open-model surgery" in the paper, summarized below), and we will likely also need solutions to deceptive misalignment via different approaches. We instead believe that one of the main forms of AGI safety relevance is increasing human empowerment and helping reduce bad outcomes like gradual disempowerment. What is the optimal human-AI communication framework that enables human users to understand AI operations and reasoning, rather than simply deferring all decision-making power to the AI?
The idea of open-model surgery: While agentic interpretability is not primarily intended for adversarial systems, we suggest some interesting ways to use agentic interpretability to help detect deception. Imagine an "open-model surgery" where researchers actively converse with the model while intervening in its internal mechanisms—ablating connections, amplifying activations, or injecting specific inputs into identified circuits. The model, guided by its understanding of the researchers' goal (to understand a specific component's function), is then encouraged to explain the resulting behavioral or internal state changes. This interactive probing, akin to neurosurgeons conversing with patients during awake brain surgery to map critical functions, offers a dynamic way to test hypotheses and build understanding. A model whose internal states are being directly manipulated and inspected, yet which is simultaneously engaged in an explanatory dialogue, faces a stringent test of coherence, analogous to an interrogation where a suspect's claims are immediately cross-referenced with physical evidence or observed physiological responses.
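A rough sketch of what the intervention side of this could look like, using a standard Hugging Face model and a PyTorch forward hook; the model choice, layer index, and scaling factor are illustrative assumptions, and a real study would target specific circuits in a chat-tuned model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER, SCALE = 6, 3.0  # which MLP output to amplify, and by how much (assumed values)

def amplify_mlp(module, inputs, output):
    # A forward hook that returns a value replaces the module's output.
    return output * SCALE

def converse(prompt: str) -> str:
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=60, do_sample=False)
    return tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

probe = "Explain how you decide what to say when asked for the capital of France."

baseline_reply = converse(probe)  # behaviour with intact internals
handle = model.transformer.h[LAYER].mlp.register_forward_hook(amplify_mlp)
perturbed_reply = converse(probe)  # behaviour while the intervention is active
handle.remove()

# In the full setup, a follow-up turn would ask the model to explain the difference
# between the two behaviours, cross-checking its account against the known intervention.
print("baseline:", baseline_reply)
print("perturbed:", perturbed_reply)
```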
Increasingly powerful AI systems pose increasing challenges to our ability to understand them. Agentic interpretability recognizes that these systems also provide the opportunity to enlist them in service of our understanding of them. We believe methods that pursue this direction will have complementary benefits to the existing interpretability landscape.