fbarez — LessWrong

Martian Interpretability Challenge: The Core Problems In Interpretability

TL;DR: Interpretability today often fails on four fronts: it is not truly mechanistic (more correlation than causal explanation), not useful in real engineering/safety workflows, incomplete (narrow wins that don’t generalize), and not scalable to frontier models. Martian’s $1M prize targets progress on those gaps, especially via strong benchmarks, generalization across models, and...

Mar 119

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

I've released a research agenda (in development since May with collaborators) proposing that intervention outcomes should be the ground truth for interpretability. Publishing now so others can see the ideas, build on them, or work on pieces if interested. Rather than optimizing for plausible explanations or proxy task performance, the...

Jan 129

Best-of-N Jailbreaking

by John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, fbarez, Henry Sleight, Ethan Perez, and mrinank_sharma

This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for jailbreaking, Best-of-N Jailbreaking, which works across modalities (text, audio, vision) and shows power-law scaling in the amount of test-time compute used for the attack. Abstract: We introduce Best-of-N (BoN) Jailbreaking, a...

Dec 14, 2024 · 79

Visualizing neural network planning

by Nevan Wichers, Victor Tao, fbarez, and Riccardo Volpato

TL;DR: We develop a technique for detecting whether a neural network is planning internally. We apply the decoder to the network’s intermediate representations to see whether they represent the states it is planning through. We successfully reveal intermediate states in a simple Game of Life model,...

May 9, 2024 · 4

Mechanistic Interpretability Workshop Happening at ICML 2024!

by Neel Nanda, LawrenceC, and fbarez

Announcing the first academic Mechanistic Interpretability workshop, held at ICML 2024! I think this is an exciting development that's a lagging indicator of mech interp gaining legitimacy as an academic field, and a good chance for field building and sharing recent progress! We'd love to get papers submitted if any...

May 3, 2024 · 48

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

by lukemarks, Amirali Abdullah, Rauno Arike, fbarez, and nothoughtsheadempty

This research was performed by Luke Marks, Amirali Abdullah, nothoughtsheadempty and Rauno Arike. Special thanks to Fazl Barez from Apart Research for overseeing the project and contributing greatly to direction and oversight throughout. We'd also like to thank Logan Riggs for feedback and suggestions regarding autoencoder architecture and experiment design....

Oct 3, 2023 · 18

Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results

by Esben Kran, fbarez, Sabrina Zaki, gabrielrecc, and rz2383

We ran a hackathon on scalable oversight with Gabriel Recchia as keynote speaker (watch the talk) and Ruiqi Zhong as co-judge. Here, we share the top projects and results. In summary:

* We can automate the “sandwiching” paradigm from Cotra [1] by having a smaller model ask structured questions to...

Feb 23, 2023 · 8