The Engineer’s Interpretability Sequence



A good critical paper about potentially risky industry norms is this one


To use your argument, what does MI actually do here?

The inspiration, I would suppose. Analogous to the type claimed in the HHH and hyena papers. 


And yes to your second point.

Nice post. I think it serves as a good example of how the hand-waviness about how interpretability can help us do good things with AI cuts both ways.

I'm particularly worried about MI people studying instances of when LLMs do and don't express types of situational awareness and then someone using these insights to give LLMs much stronger situational awareness abilities.


On the other hand, interpretability research is probably crucial for AI alignment. 

I don't think this is true and I especially hope it is not true because (1) mechanistic interpretability still fails to do impressive things by trying to reverse engineer networks and (2) it is entirely fungible from a safety standpoint with other techniques that often do better for various things.

Several people seem to be coming to similar conclusions recently (e.g., this recent post). 

I'll add that I have as well and wrote a sequence about it :)

Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to unsolid ones. 

Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been making similar arguments with little change for 7+ years, and I just don't think the wins from this research have been commensurate with the effort put into it. I assume this is expected under your views, so it is probably not a crux.

I get the impression of a certain kind of motte and bailey in this comment and similar arguments. At a high level, better understanding what neural networks are doing would be great. The problem, though, is that most SOTA interpretability research does not seem to be doing a good job of this in a way that seems useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.

I think this is very exciting, and I'll look forward to seeing how it goes!

Thanks, we will consider adding each of these. We appreciate that you took a look and took the time to help suggest these!

No, I don't think the core advantages of transparency are really unique to RLHF, but in the paper, we list certain things that are specific to RLHF which we think should be disclosed. Thanks.

Thanks, and +1 to adding the resources. Also, Charbel-Raphael, who authored the in-depth post, is one of the authors of this paper! That post in particular was something we paid attention to during the design of the paper.
