TL;DR: More work to study interactions and synergies between different AI interpretability tools seems important, tractable, and neglected. 

Interpretable AI research tends to focus on one technique at a time

There is a lot of interpretable/explainable AI research. But one thing that almost all research papers seem to have in common is a focus on individual techniques, one at a time. From a practical perspective, this makes sense – introducing one new thing is usually the simplest way to make progress. Studying interactions between multiple techniques can also quickly become complicated. Progress in ML tends to work this way.

Studying combinations of interpretability tools may be valuable 

However, great work can sometimes be done by combining multiple tools to study networks, and such work often finds complementary interactions between them. Here are a few examples.

In engineering disciplines, progress is rarely made by the introduction of single methods that make a huge splash. Instead, game-changing innovations usually come from multiple techniques being applied at once. Progress on ImageNet is an excellent example of this. It took combinations of many new techniques – batch normalization, residual connections, inception modules, deeper architectures, hyperparameter engineering, etc – to improve the performance of CNNs so much so quickly. There was no silver bullet.

Progress on ImageNet has been driven by combining many techniques together over the years.

An example of “rainbowing” from RL

There is some precedent for valuable, influential papers whose only novel contribution is combining old techniques. A somewhat-famous example comes from reinforcement learning. Hessel et al. (2017) improved greatly on the previous state-of-the-art in deep Q-learning by simply combining a bunch of existing tricks. Because their technique combined many compatible old ones, they called it “Rainbow.”

In RL, the "Rainbow" paper set a new state-of-the-art with no new contributions at all.

It is particularly easy to combine intrinsic and post-hoc interpretability tools

In AI interpretability research, interesting combinations of methods are especially easy to come by. Interpretability tools can be divided into two kinds: intrinsic interpretability tools, which are applied before or during training and are meant to make the model easier to analyze, versus post-hoc tools, which are applied after training and are meant to produce explanations of how the model works. There are many tools of each type, and they almost never conflict with each other. So almost any combination of an intrinsic and a post-hoc interpretability tool is something we could study for useful synergy. And interestingly, with some exceptions (e.g. Elhage et al. (2022)), most work from the AI safety interpretability community is on post-hoc methods, so there seems to be a lot of room for more of this.
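To make the intrinsic/post-hoc pairing concrete, here is a minimal toy sketch (my own illustration, not from any particular paper): an L1 sparsity penalty applied during training plays the role of an intrinsic tool, and then inspecting weight magnitudes on the trained model plays the role of a crude post-hoc attribution. The model, data, and hyperparameters are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: only the first 2 of 10 features actually determine the label.
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Train a tiny logistic-regression model with an L1 sparsity penalty.
# The penalty is the "intrinsic" interpretability intervention: it is
# applied during training to make the final model easier to analyze.
w = np.zeros(10)
lr, l1 = 0.1, 0.01
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))        # predicted probabilities
    grad = X.T @ (p - y) / len(y)       # cross-entropy gradient
    w -= lr * (grad + l1 * np.sign(w))  # gradient step + L1 penalty

# Post-hoc step: rank features by |weight|. Because the intrinsic
# penalty pushed irrelevant weights toward zero, this simple post-hoc
# attribution is cleaner -- a small example of the synergy in question.
top = sorted(np.argsort(-np.abs(w))[:2].tolist())
print(top)
```

The same pattern scales up: swap the L1 penalty for any intrinsic method (e.g. architectural constraints) and the weight inspection for any post-hoc method (e.g. probing or feature visualization), and the question of whether they help each other is an empirical one.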
