I think not. Maybe circuits-style mechanistic interpretability is though. I generally wouldn't try dissuading people from getting involved in research on most AIS things.
We talked about this over DMs, but I'll post a quick reply for the rest of the world. Thanks for the comment.
A lot of how this is interpreted depends on what the exact definition of superposition that one uses and whether it applies to entire networks or single layers. But a key thing I want to highlight is that if a layer represents a certain set amount of information about an example, then they layer must have more information per neuron if it's thin than if it's wide. And that is the point I think that the Huang paper helps to make. The fact that deep and thin networks tend to be more robust suggests that representing information more densely w.r.t. neurons in a layer does not make these networks less robust than wide shallow nets.
Thanks, +1 to the clarification value of this comment. I appreciate it. I did not have the tied weights in mind when writing this.
Thanks for the comment.
In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways.
This seems completely plausible to me. But I think that it's a little hand-wavy. In general, I perceive the interpretability agendas that don't involve applied work to be this way. Also, few people would argue that basic insights, to the extent that they are truly explanatory, can be valuable. But I think it is at least very non-obvious that it would be differentiably useful for safety.
there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.
No qualms here. But (1) the point about program synthesis/induction/translation suggests that the toy problems are fundamentally more tractable than real ones. Analogously, imagine saying that having humans write and study simple algorithms for search, modular addition, etc. to be part of an agenda for program synthesis. (2) At some point the toy work should lead to competitive engineering work. think that there has not been a clear trend toward this in the past 6 years with the circuits agenda.
I can kinda see the intuition here, but could you explain why we shouldn't expect this to generalize?
Thanks for the question. It might generalize. My intended point with the Ramanujan paper is that a subnetwork seeming to do something in isolation does not mean that it does that thing in context. The Ramanujan et al. weren't interpreting networks, they were just training the networks. So the underlying subnetworks may generalize well, but in this case, this is not interpretability work any more than just gradient-based training of a sparse network is.
I just went by what it said. But I agree with your point. It's probably best modeled as a predictor in this case -- not an agent.
In general, I think not. The agent could only make this actively happen to the extent that their internal activation were known to them and able to be actively manipulated by them. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms such as ERM or solving MDPs, the model's internals are not modeled as a part of the actual algorithm. They're just implementational substrate.
One idea that comes to mind is to see if a chatbot who is vulnerable to DAN-type prompts could be made to be robust to them by self-distillation on non-DAN-type prompts.
I'd also really like to see if self-distillation or similar could be used to more effectively scrub away undetectable trojans. https://arxiv.org/abs/2204.06974
I agree with this take. In general, I would like to see self-distillation, distillation in general, or other network compression techniques be studied more thoroughly for de-agentifying, de-backdooring, and robistifying networks. I think this would work pretty well and probably be pretty tractable to make ground on.
There are existing tools like lucid/lucent, captum, transformerlens, and many others that make it easy to use certain types of interpretability tools. But there is no standard, broad interpretability coding toolkit. Given the large number of interpretability tools and how quickly methods become obsolete, I don't expect one.