Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

1mo

I've released a research agenda (in development since May with collaborators) proposing that intervention outcomes should be the ground truth for interpretability. Publishing now so others can see the ideas, build on them, or work on pieces if interested.

Rather than optimizing for plausible explanations or proxy task performance, the system is optimized for actionability: can domain experts use explanations to identify errors, and can automated tools successfully edit models to fix them?

This situates itself alongside the recent pragmatic (Nanda/GDM), ambitious (Gao), and curiosity-driven (Bau) framings—but argues that interpretability without human empowerment is incomplete. The agenda outlines 8 research questions forming a pipeline from query to verified intervention, plus applications to CoT faithfulness verification, emergent capability prediction, and capability composition mapping.

Full agenda here: https://aigi.ox.ac.uk/publications/automated-interpretability-driven-model-auditing-and-control-a-research-agenda/

Best-of-N Jailbreaking

John Hughes

John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, fbarez, Henry Sleight, Ethan Perez, mrinank_sharma

This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for jailbreaking, Best-of-N Jailbreaking, which works across modalities (text, audio, vision) and shows power-law scaling in the amount of test-time compute used for the attack.

Abstract

We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on

... (read 492 more words →)
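As a rough illustration of the sampling loop the abstract describes, the sketch below perturbs a prompt with random capitalization and within-word character shuffling, then resamples until a success check fires or the budget of N attempts runs out. The names `augment` and `best_of_n`, the probabilities `p_caps` and `p_shuffle`, and the `query_model` and `is_success` callables are assumptions made for illustration, not the paper's implementation; the model and the classifier are stand-in functions supplied by the caller.

```python
import random
from typing import Callable, Optional, Tuple


def augment(prompt: str, rng: random.Random,
            p_caps: float = 0.6, p_shuffle: float = 0.6) -> str:
    """Return a randomly perturbed copy of `prompt` (hypothetical augmentation mix)."""
    words = []
    for word in prompt.split(" "):
        chars = list(word)
        # Shuffle the interior characters of longer words, keeping first and last fixed.
        if len(chars) > 3 and rng.random() < p_shuffle:
            middle = chars[1:-1]
            rng.shuffle(middle)
            chars = [chars[0], *middle, chars[-1]]
        # Upper-case each character independently with probability p_caps.
        chars = [c.upper() if rng.random() < p_caps else c for c in chars]
        words.append("".join(chars))
    return " ".join(words)


def best_of_n(prompt: str,
              query_model: Callable[[str], str],
              is_success: Callable[[str], bool],
              n: int,
              seed: int = 0) -> Tuple[Optional[str], int]:
    """Sample up to `n` augmented prompts; return (first successful prompt, attempts used)."""
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = augment(prompt, rng)
        response = query_model(candidate)
        if is_success(response):
            return candidate, attempt
    return None, n


# Toy usage: an "echo" model and a check that never fires, so the budget is exhausted.
candidate, attempts = best_of_n("hello world", lambda s: s, lambda r: False, n=5)
print(candidate, attempts)
```

Because every attempt is an independent sample, the only tunable cost is N, which is what makes it natural to study how success rate scales with the number of samples.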

Visualizing neural network planning

Nevan Wichers

Nevan Wichers, Victor Tao, fbarez, Riccardo Volpato

TLDR

We develop a technique to try to detect whether a NN is planning internally. We apply the decoder to the network’s intermediate representations to see whether it represents the states it is planning through. We successfully reveal intermediate states in a simple Game of Life model, but find no evidence of planning in an AlphaZero chess model. We think the idea won’t work in its current form for real-world NNs because they use higher-level, abstract representations for planning that our current technique cannot decode. Please comment if you have ideas that may work for detecting more abstract ways the NN could be planning.
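A minimal sketch of the idea in a toy setting, not the authors' model: a hand-written "network" computes two Game of Life steps, and its intermediate activation is passed through the same decoder that is normally applied to the output, to check whether the hidden layer holds the one-step state. The names `life_step`, `decode`, and `two_step_model`, and the choice of a simple threshold decoder, are illustrative assumptions.

```python
from typing import List, Tuple

Board = List[List[int]]
Activation = List[List[float]]


def life_step(board: Board) -> Board:
    """One Game of Life update on a 0/1 grid with wrap-around edges."""
    h, w = len(board), len(board[0])
    nxt = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n = sum(
                board[(y + dy) % h][(x + dx) % w]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            nxt[y][x] = 1 if n == 3 or (board[y][x] == 1 and n == 2) else 0
    return nxt


def decode(activation: Activation) -> Board:
    """Decoder normally applied to the output: threshold activations into a board."""
    return [[1 if a > 0 else 0 for a in row] for row in activation]


def two_step_model(board: Board) -> Tuple[Activation, Activation]:
    """Stand-in for a trained network: returns (hidden activation, output activation)."""
    def to_activation(b: Board) -> Activation:
        return [[2.0 * c - 1.0 for c in row] for row in b]  # map 0/1 to -1/+1

    hidden = to_activation(life_step(board))
    output = to_activation(life_step(decode(hidden)))
    return hidden, output


board = [[0] * 6 for _ in range(6)]
board[2][1] = board[2][2] = board[2][3] = 1  # a "blinker" oscillator

hidden, output = two_step_model(board)
# Apply the output decoder to the hidden layer and compare with the true one-step state.
print(decode(hidden) == life_step(board))
# The blinker has period two, so the decoded output should equal the input board.
print(decode(output) == board)
```

In this toy the hidden layer stores the intermediate state in exactly the format the decoder expects; a trained network offers no such guarantee, which is the limitation the summary points to.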

Idea and motivation

To make safe ML,... (read 1443 more words →)

Mechanistic Interpretability Workshop Happening at ICML 2024!

Neel Nanda

Neel Nanda, LawrenceC, fbarez

Announcing the first academic Mechanistic Interpretability workshop, held at ICML 2024! I think this is an exciting development that's a lagging indicator of mech interp gaining legitimacy as an academic field, and a good chance for field building and sharing recent progress!

We'd love to get papers submitted if any of you have relevant projects! Deadline May 29, max 4 or max 8 pages. We welcome anything that brings us closer to a principled understanding of model internals, even if it's not "traditional" mech interp. Check out our website for example topics! There's $1750 in best paper prizes. We also welcome less standard submissions, like open source software, models or datasets, negative results, distillations,... (read more)

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

lukemarks

lukemarks, Amirali Abdullah, Rauno Arike, fbarez, nothoughtsheadempty

This research was performed by Luke Marks, Amirali Abdullah, nothoughtsheadempty and Rauno Arike. Special thanks to Fazl Barez from Apart Research for overseeing the project and contributing greatly to direction and oversight throughout. We'd also like to thank Logan Riggs for feedback and suggestions regarding autoencoder architecture and experiment design.

Introduction

Sparse Autoencoders Find Highly Interpretable Directions in Language Models showed that sparse coding achieves SOTA performance in making features interpretable using OpenAI's method of automated interpretability. We briefly tried to extend these results to reward models learned during RLHF in Pythia-70m/410m. Our method can be summarized as follows:

1. Identify layers $L$ in a language model $M_{RLHF}$ fine-tuned through RLHF that are likely involved in reward modeling. We do so by... (read 1422 more words →)
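For readers unfamiliar with the building block, here is a minimal sketch of a sparse autoencoder's forward pass and loss: a ReLU encoder, a linear decoder, and a squared reconstruction error plus an L1 penalty on the hidden code. It is not the authors' architecture or training setup. The dimensions, weights, and `l1_coeff` value are made up for the example; in practice `x` would be an activation vector taken from one of the chosen layers.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix `m` (list of rows) by vector `v`."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_forward(x: Vector, w_enc: Matrix, b_enc: Vector,
                w_dec: Matrix) -> Tuple[Vector, Vector]:
    """Return (sparse code, reconstruction) for one activation vector."""
    pre = matvec(w_enc, x)
    code = [max(0.0, z + b) for z, b in zip(pre, b_enc)]  # ReLU encoder
    recon = matvec(w_dec, code)                            # linear decoder
    return code, recon


def sae_loss(x: Vector, code: Vector, recon: Vector,
             l1_coeff: float = 1e-3) -> float:
    """Mean squared reconstruction error plus an L1 sparsity penalty on the code."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    sparsity = sum(abs(c) for c in code)
    return mse + l1_coeff * sparsity


# Toy example: a 2-dimensional activation and a 3-feature dictionary.
x = [1.0, -2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]

code, recon = sae_forward(x, w_enc, b_enc, w_dec)
loss = sae_loss(x, code, recon)
print(code, recon, loss)
```

The dictionary is overcomplete (three features for two input dimensions), and the ReLU leaves only the features whose direction matches the input active, which is what the L1 term encourages during training.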

Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results

Esben Kran

Esben Kran, fbarez, Sabrina Zaki, gabrielrecc, rz2383

We ran a hackathon on scalable oversight with Gabriel Recchia as keynote speaker (watch the talk) and Ruiqi Zhong as co-judge. Here, we share the top projects and results. In summary:

We can automate the “sandwiching” paradigm from Cotra [1] by having a smaller model ask structured questions to elicit a true answer from a larger model and getting a response accuracy rate as output.
We can understand coordination abilities between humans and large language models quantitatively using asymmetric-information language games such as Codenames.
We can study scaling and prompt specificity phenomena in-depth using a simple framework. In this case, word reversal is investigated to evaluate the emergent abilities of language models.
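To make the word-reversal evaluation concrete, here is a small sketch of how such a task could be scored: each word is sent to a model under a prompt template, and the reply is compared with the reversed string. The template text, the `reversal_accuracy` helper, and the stand-in `perfect_model` and `lazy_model` callables are hypothetical; the hackathon project's actual prompts and models do not appear in this excerpt.

```python
from typing import Callable, Iterable


def reversal_accuracy(words: Iterable[str],
                      model: Callable[[str], str],
                      template: str = "Reverse the word: {word}\nAnswer:") -> float:
    """Fraction of words for which the model's reply equals the reversed word."""
    words = list(words)
    correct = 0
    for word in words:
        reply = model(template.format(word=word)).strip()
        if reply == word[::-1]:
            correct += 1
    return correct / len(words) if words else 0.0


def extract_word(prompt: str) -> str:
    """Pull the target word back out of the default template (helper for the toy models)."""
    return prompt.split(": ", 1)[1].split("\n")[0]


def perfect_model(prompt: str) -> str:
    """Toy model that always reverses correctly."""
    return extract_word(prompt)[::-1]


def lazy_model(prompt: str) -> str:
    """Toy model that echoes the word unchanged, so it is only right on palindromes."""
    return extract_word(prompt)


words = ["level", "cat", "noon", "model"]
print(reversal_accuracy(words, perfect_model))
print(reversal_accuracy(words, lazy_model))
```

Swapping in different `template` strings or different model callables gives a simple grid over prompt specificity and model scale, with one accuracy number per cell.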

Watch the project presentations on... (read 1771 more words →)
