Jack Payne — LessWrong

Jack Payne

17

1y

Automating Interpretability with Agents

This work was produced as part of the SPAR Program - Fall 2025 Cohort, with support from Georg Lange.

Kseniya Parkhamchuk, Jack Payne

TL;DR: Automated feature explanations from Delphi fail 38% of the time. The failures fall into four categories: sensitivity issues, output feature misidentification, factual incorrectness, and poor human readability. We built...