Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

fbarez

12th Jan 2026

I've released a research agenda (in development since May with collaborators) proposing that intervention outcomes should be the ground truth for interpretability. I'm publishing it now so that others can see the ideas, build on them, or work on pieces of it if interested.

Rather than optimizing for plausible explanations or proxy task performance, the proposed system is optimized for actionability: can domain experts use explanations to identify errors, and can automated tools successfully edit models to fix them?

The agenda sits alongside the recent pragmatic (Nanda/GDM), ambitious (Gao), and curiosity-driven (Bau) framings, but argues that interpretability without human empowerment is incomplete. It outlines 8 research questions forming a pipeline from query to verified intervention, plus applications to CoT faithfulness verification, emergent capability prediction, and capability composition mapping.

Full agenda here: https://aigi.ox.ac.uk/publications/automated-interpretability-driven-model-auditing-and-control-a-research-agenda/
