The Engineer’s Interpretability Sequence

Are you concerned about AI risk from narrow systems of this kind?

No. Am I concerned about risks from methods that work for this in narrow AI? Maybe. 

This seems quite possibly useful, and I think I see what you mean. My confusion is largely from my initial assumption that the focus of this specific point directly involved existential AI safety and from the word choice of "backbone" which I would not have used. I think we're on the same page. 

Thanks for the post. I'll be excited to watch what happens. Feel free to keep me in the loop. Some reactions:

We must grow interpretability and AI safety in the real world.

Strong +1 to working on more real-world-relevant approaches to interpretability. 

Regulation is coming – let’s use it.

Strong +1 as well. Working on incorporating interpretability into regulatory frameworks seems neglected by the AI safety interpretability community in practice. This does not seem to be the focus of work on internal eval strategies, but AI safety seems unlikely to be something that has a once-and-for-all solution, so governance seems to matter a lot in the likely case of a future with highly prolific TAI. And because governance moves slowly, work now to establish concern, offices, precedent, case law, etc. seems uniquely key.

Speed potentially transformative narrow domain systems. AI for scientific progress is an important side quest. Interpretability is the backbone of knowledge discovery with deep learning, and has huge potential to advance basic science by making legible the complex patterns that machine learning models identify in huge datasets.

I do not see the reasoning or motivation for this, and it seems possibly harmful.

First, developing basic insights is clearly not just an AI safety goal. It's an alignment/capabilities goal. And as such, the effects of this kind of thing are not robustly good. They are heavy-tailed in both directions. This seems like possible safety washing. But to be fair, this is a critique I have of a ton of AI alignment work including some of my own. 

Second, I don't know of any examples of gaining particularly useful domain knowledge from interpretability-related work in deep learning, other than maybe the predictiveness of nonrobust features. Another possible example could be using deep learning to find new algorithms for things like matrix multiplication, but this isn't really "interpretability". Do you have other examples in mind? Progress in the last 6 years on reverse-engineering nontrivial systems has seemed to be tenuous at best.

So I'd be interested in hearing more about whether/how you expect this one type of work to be robustly good and what is meant by "Interpretability is the backbone of knowledge discovery with deep learning."

I do not worry a lot about this. It would be a problem. But some methods are model-agnostic and would transfer fine. Some other methods have close analogs for other architectures. For example, ROME is specific to transformers, but causal tracing and rank one editing are more general principles that are not. 
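To make the "more general principle" concrete, here is a minimal sketch of a rank-one edit on a plain linear map, with nothing transformer-specific in it. It is not ROME's exact update, and the names `rank_one_edit`, `k`, and `v` are illustrative assumptions: the idea is only that a single outer-product update can force a chosen key `k` to map to a chosen value `v` while leaving inputs orthogonal to `k` untouched.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' k == v.

    Simplified illustration of rank-one editing, not ROME's actual update.
    """
    kk = sum(x * x for x in k)
    if kk == 0:
        raise ValueError("key vector must be nonzero")
    # How far the current output for k is from the desired value v.
    residual = [v_i - wk_i for v_i, wk_i in zip(v, matvec(W, k))]
    # Add the outer product residual * k^T / (k^T k) to W.
    return [[w_ij + r_i * k_j / kk for w_ij, k_j in zip(row, k)]
            for row, r_i in zip(W, residual)]


# Toy 2x3 layer (hypothetical numbers chosen only for illustration).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
k = [1.0, 2.0, 0.0]   # key whose output we want to change
v = [5.0, -3.0]       # desired output for that key
W_new = rank_one_edit(W, k, v)
```

Because the update only involves a weight matrix, a key, and a value, the same operation applies to any architecture with linear layers, which is the sense in which it transfers.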

Thanks for the comment. I appreciate how thorough and clear it is. 

Knowing "what deception looks like" - the analogue of knowing the target class of a trojan in a classifier - is a problem.

Agreed. This totally might be the most important part of combatting deceptive alignment. I think of this as somewhat separate from what diagnostic approaches like MI are equipped to do. Knowing what deception looks like seems more of an outer alignment problem, while knowing what will make the model behave badly even if it seems to be aligned is more of an inner one.

Training a lot of models with trojans and a lot of trojan-free models, then training a classifier, seems to work pretty well for detecting trojans.

+1, but this seems difficult to scale. 
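For readers unfamiliar with the setup being discussed, a toy sketch of the meta-classifier idea follows. Everything in it is an assumption made for illustration: each "model" is replaced by a synthetic feature vector standing in for summary statistics of its weights, and trojaned models are simply assumed to shift one of those statistics. A real version would need to train every one of those models first, which is where the scaling concern comes from.

```python
import math
import random


def make_model_features(trojaned, rng, dim=4, shift=3.0):
    """Synthetic stand-in for summary statistics of one trained model."""
    feats = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    if trojaned:
        feats[0] += shift  # assumption: a trojan shifts one statistic
    return feats


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.1, epochs=200):
    """Full-batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            gw = [g + err * xi for g, xi in zip(gw, x)]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


rng = random.Random(0)
labels = [i % 2 for i in range(600)]            # 1 = trojaned, 0 = clean
feats = [make_model_features(bool(t), rng) for t in labels]

# Center features using training-set means only.
means = [sum(col) / 400 for col in zip(*feats[:400])]
feats = [[xi - m for xi, m in zip(x, means)] for x in feats]

X_train, y_train = feats[:400], labels[:400]
X_test, y_test = feats[400:], labels[400:]
w, b = train_logistic(X_train, y_train)
accuracy = sum(predict(w, b, x) == t
               for x, t in zip(X_test, y_test)) / len(y_test)
```

The classifier here is cheap; the expensive part in practice is producing the hundreds of labeled models that play the role of `feats`.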

Sometimes the features related to the trojan have different statistics than the features a NN uses to do normal classification.

+1. It seems like trojans inserted crudely via data poisoning may be easy to detect using heuristics that may not be useful for other insidious flaws.

(e.g. detecting an asteroid heading towards the earth)

This would be anomalous behavior triggered by a rare event, yes. I agree it shouldn't be called deceptive. I don't think my definition of deceptive alignment applies to this because my definition requires that the model does something we don't want it to. 

Rather than waiting for one specific trigger, a deceptive AI might use its model of the world more gradually.

Strong +1. This points out a difference between trojans and deception. I'll add this to the post. 

This seems like just asking for trouble, and I would much rather we went back to the drawing board, understood why we were getting deceptive behavior in the first place, and trained an AI that wasn't trying to do bad things.



Thanks. See also EIS VIII.

Could you give an example of a case of deception that is quite unlike a trojan? Maybe we have different definitions. Maybe I'm not accounting for something. Either way, it seems useful to figure out the disagreement.  

Thanks! I am going to be glad to have this post around to refer to in the future. I'll probably do it a lot. Glad you have found some of it interesting. 

Yes, it does show the ground truth.

The goal of the challenge is not to find the labels, but to find the program that explains them using MI tools. In the post, when I say labeling "function", I really mean labeling "program" in this case.

The MNIST CNN was trained only on the 50k training examples. 

I did not guarantee that the models had perfect train accuracy. I don't believe they did. 

I think that any interpretability tools are allowed. Saliency maps are fine. But to 'win,' a submission needs to come with a mechanistic explanation and sufficient evidence for it. It is possible to beat this challenge by using non mechanistic techniques to figure out the labeling function and then using that knowledge to find mechanisms by which the networks classify the data. 

At the end of the day, I (and possibly Neel) will have the final say in things. 

Thanks :)
