Mechanistic Anomaly Detection Research Update

by Nora Belrose, David Johnston
6th Aug 2024

This is a linkpost for https://blog.eleuther.ai/mad_research_update/

Over the last few months, the EleutherAI interpretability team developed novel mechanistic methods, based on Neel Nanda's attribution patching technique, for detecting anomalous behavior in language models. Unfortunately, none of these methods consistently outperforms non-mechanistic baselines that look only at activations.
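For readers unfamiliar with attribution patching: it approximates the effect of patching activations from one run into another with a first-order Taylor expansion, so a single forward and backward pass yields scores for every activation at once. Below is a minimal PyTorch sketch of that approximation, not the actual cupbearer implementation; the hook setup and function names are our own illustration.

```python
# Minimal attribution patching sketch (illustrative, not the cupbearer
# implementation). The effect on the loss of patching clean activations
# into a test run is approximated to first order as
# (clean_acts - test_acts) * d(loss)/d(test_acts).
import torch

def attribution_scores(model, layer, clean_inputs, test_inputs, loss_fn):
    """Per-activation attribution scores for one layer.

    Assumes clean_inputs and test_inputs have the same batch shape.
    """
    acts, grads = {}, {}

    def fwd_hook(module, inputs, output):
        acts["value"] = output

    def bwd_hook(module, grad_input, grad_output):
        grads["value"] = grad_output[0]

    fwd_handle = layer.register_forward_hook(fwd_hook)
    bwd_handle = layer.register_full_backward_hook(bwd_hook)
    try:
        # One forward + backward pass on the (possibly anomalous) test inputs.
        loss_fn(model(test_inputs)).backward()
        test_acts = acts["value"].detach()
        test_grads = grads["value"].detach()

        # Forward pass on trusted reference inputs; no gradients needed.
        with torch.no_grad():
            model(clean_inputs)
        clean_acts = acts["value"]
    finally:
        fwd_handle.remove()
        bwd_handle.remove()

    # First-order estimate of how the loss would change if clean
    # activations were patched into the test run.
    return (clean_acts - test_acts) * test_grads
```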

We find that we achieve better anomaly detection performance with methods that evaluate entire batches of test data rather than considering test points one at a time. We achieve very good performance on many, but not all, of the tasks we looked at.
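As a toy illustration of the pointwise-versus-batch distinction, here is a Mahalanobis-distance detector (a standard activation-only baseline, not our mechanistic method) scored both ways:

```python
# Toy pointwise vs. batch-level anomaly scoring on activation vectors,
# using a Mahalanobis-distance detector (a standard activation-only
# baseline, not the method from this post).
import numpy as np

def fit_gaussian(trusted_acts):
    """Fit a Gaussian to activations from trusted (normal) data."""
    mean = trusted_acts.mean(axis=0)
    prec = np.linalg.pinv(np.cov(trusted_acts, rowvar=False))
    return mean, prec

def pointwise_scores(test_acts, mean, prec):
    """One squared Mahalanobis distance per test point."""
    diff = test_acts - mean
    return np.einsum("ij,jk,ik->i", diff, prec, diff)

def batch_score(test_acts, mean, prec):
    """A single score for the whole batch: distance of the batch mean."""
    diff = test_acts.mean(axis=0) - mean
    return float(diff @ prec @ diff)
```

A batch-level detector like this gives up per-point localization in exchange for sensitivity: averaging before scoring suppresses per-point noise, which is one reason evaluating whole batches can expose shifts too subtle to flag individually.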

We also find that it is relatively easy to detect adversarial examples in image classifiers with off-the-shelf techniques, although we did not test whether our anomaly detectors are themselves adversarially robust.
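As a hypothetical end-to-end example of the kind of off-the-shelf pipeline we mean: craft adversarial images with FGSM, then score their features with a detector like the Gaussian one sketched above. `feature_extractor` is a placeholder for e.g. the classifier's penultimate layer; none of this is our actual experimental code.

```python
# Hypothetical end-to-end check: craft FGSM adversarial images, then
# score their features with the Gaussian detector sketched above.
import torch
import torch.nn.functional as F

def fgsm(model, images, labels, eps=8 / 255):
    """Fast gradient sign method: one-step adversarial perturbation."""
    images = images.clone().requires_grad_(True)
    F.cross_entropy(model(images), labels).backward()
    return (images + eps * images.grad.sign()).clamp(0, 1).detach()

# Usage sketch: flag images whose features score above, say, the 99th
# percentile of clean-image scores.
# adv = fgsm(classifier, images, labels)
# feats = feature_extractor(adv).cpu().numpy()   # placeholder extractor
# scores = pointwise_scores(feats, mean, prec)   # from the sketch above
```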

Thanks to @David Johnston and Arkajyoti Chakraborty for all their hard work on this project, as well as @Erik Jenner for fruitful discussion, ideas, and code!

Code: https://github.com/EleutherAI/cupbearer/tree/attribution_detector
