I've released a research agenda (in development since May with collaborators) proposing that intervention outcomes should be the ground truth for interpretability. I'm publishing it now so others can see the ideas, build on them, or pick up individual pieces if interested.
Rather than optimizing for plausible explanations or proxy task performance, the proposed system is optimized for actionability: can domain experts use explanations to identify errors, and can automated tools successfully edit models to fix them?
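To make the actionability criterion concrete, here is a minimal sketch (all names and the toy harness are hypothetical, not taken from the agenda) of scoring an explanation by whether the edit it proposes actually fixes the errors it identifies:

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Explanation:
    """A candidate explanation of a model error, paired with a proposed fix."""
    text: str
    proposed_edit: Callable[[Any], Any]  # model -> edited model


def actionability_score(model, error_cases, explanation, run_eval):
    """Score an explanation by the intervention it enables, not by plausibility.

    `run_eval(model, cases)` is an assumed helper returning accuracy on `cases`.
    A positive score means the proposed edit repaired the identified errors.
    """
    edited = explanation.proposed_edit(model)
    return run_eval(edited, error_cases) - run_eval(model, error_cases)


if __name__ == "__main__":
    # Toy illustration: the "model" is a lookup table with one wrong entry.
    model = {"2+2": "5", "3+3": "6"}
    errors = [("2+2", "4")]

    def run_eval(m, cases):
        return sum(m.get(q) == a for q, a in cases) / len(cases)

    fix = Explanation(
        text="The table stores the wrong value for '2+2'.",
        proposed_edit=lambda m: {**m, "2+2": "4"},
    )
    print(actionability_score(model, errors, fix, run_eval))  # 1.0 -> actionable
```

The only point of the sketch is that the score is computed from intervention outcomes; how convincing the explanation reads never enters the evaluation.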
The agenda situates itself alongside the recent pragmatic (Nanda/GDM), ambitious (Gao), and curiosity-driven (Bau) framings, but argues that interpretability without human empowerment is incomplete. It outlines eight research questions forming a pipeline from query to verified intervention, plus applications to CoT faithfulness verification, emergent capability prediction, and capability composition mapping.
Full agenda here: https://aigi.ox.ac.uk/publications/automated-interpretability-driven-model-auditing-and-control-a-research-agenda/