Automating Interpretability with Agents
This work was produced as part of the SPAR Program - Fall 2025 Cohort, with support from Georg Lange.
Kseniya Parkhamchuk, Jack Payne
TL;DR: Automated feature explanations from Delphi fail 38% of the time. The failures are sensitivity issues, output feature misidentification, factual incorrectness, and poor human readability. We built...
May 18