Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
* Authors sorted alphabetically.

Summary: This post introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations. We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies whether a sequence of parentheses is balanced.

1 Introduction

A question that all mechanistic interpretability work must answer is, “how well does this interpretation explain the phenomenon being studied?” In many recent papers in mechanistic interpretability, researchers have generally relied on ad hoc methods to evaluate the quality of interpretations.[1]

The ad hoc nature of existing evaluation methods poses a serious challenge for scaling up mechanistic interpretability. Currently, to evaluate the quality of a particular research result, we need to deeply understand both the interpretation and the phenomenon being explained, and then apply researcher judgment. Ideally, we’d like to find the interpretability equivalent of property-based testing: automatically checking the correctness of interpretations, instead of relying on grit and researcher judgment. More systematic procedures would also help us scale up interpretability efforts to larger models, to behaviors with subtler effects, and to larger teams of researchers. To help with these efforts, we want a procedure that is both powerful enough to finely distinguish better interpretations from worse ones, and general enough to be applied to complex interpretations.

In this work, we propose causal scrubbing, a systematic ablation method for testing precisely stated hypotheses about how a particular neural network[2] implements a behavior on a dataset. Specifically, given an informal hypothesis about which parts of a model implement the intermediate calculations required for a b
Training Attack Policies for Control Evaluations via Prover-Verifier Games (aka Control with Training)
By @Ansh Radhakrishnan and @Buck
Credit goes to Buck for most of these ideas – most of this is written by Ansh, with some contributions from Buck. Thanks to John Schulman, Fabien Roger, Akbir Khan, Jiaxin Wen, Ryan Greenblatt, Aengus Lynch, and Jack Youstra for their comments.
The original AI control paper evaluates the robustness-to-intentional-subversion of safety techniques by having a red team construct an attack policy (a drop-in replacement for the untrusted model) and then running a simulation to see the probability that that attack policy would cause a catastrophe before getting caught. They constructed their attack policies entirely using prompting and...
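
As a rough illustration of the kind of simulation described above, here is a minimal Monte Carlo sketch that estimates the probability that an attack policy causes a catastrophe before it gets caught. Everything in it is an assumption made for illustration and is not taken from the control paper: the function names, the per-task parameters `p_attack`, `p_caught`, and `p_success`, and the simplification that every task is independent and identically distributed.

```python
import random


def simulate_episode(num_tasks, p_attack, p_caught, p_success, rng):
    """Simulate one deployment; return 'catastrophe', 'caught', or 'safe'.

    All parameters are hypothetical toy quantities:
      p_attack  - chance the attack policy attempts an attack on a given task
      p_caught  - chance an attempted attack is caught by the safety technique
      p_success - chance an attempted attack causes a catastrophe
    An attempt that is neither caught nor successful fails silently.
    """
    if p_caught + p_success > 1:
        raise ValueError("p_caught + p_success must not exceed 1")
    for _ in range(num_tasks):
        if rng.random() >= p_attack:
            continue  # the policy behaves benignly on this task
        roll = rng.random()
        if roll < p_caught:
            return "caught"  # the blue team wins and the episode ends
        if roll < p_caught + p_success:
            return "catastrophe"  # the red team wins and the episode ends
        # otherwise the attempt failed unnoticed and the deployment continues
    return "safe"


def estimate_catastrophe_probability(
    num_episodes, num_tasks, p_attack, p_caught, p_success, seed=0
):
    """Estimate P(catastrophe before getting caught) by repeated simulation."""
    rng = random.Random(seed)
    catastrophes = sum(
        simulate_episode(num_tasks, p_attack, p_caught, p_success, rng)
        == "catastrophe"
        for _ in range(num_episodes)
    )
    return catastrophes / num_episodes


if __name__ == "__main__":
    estimate = estimate_catastrophe_probability(
        num_episodes=10_000, num_tasks=1_000,
        p_attack=0.002, p_caught=0.6, p_success=0.3,
    )
    print(f"Estimated P(catastrophe before caught): {estimate:.3f}")
```

In this toy model the attack policy is summarized by three fixed numbers, so a red team can only tune how often it attacks. A real control evaluation would replace these constants with an actual attack policy and an actual safety protocol run on real tasks.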