Part 10 of 12 in the Engineer’s Interpretability Sequence.
The previous post focused in-depth on how research on interpretability and adversaries are inseparably connected. This post is dedicated to discussing how this is not itself a complete story. There is a much larger, richer one about the connections between interpretability, adversaries, continual learning, modularity, and biological brains – likely some other things too. These connections may be a useful mine for insight and inspiration.
Below are discussions of my understanding of each of these topics and how they relate to others. I’ll include some citations here, but see the Toward Transparent AI survey (Räuker et al., 2022) survey for full discussions.
Continual learning is a fairly large subfield of deep learning that focuses on finding ways to help neural networks learn new information without forgetting old information. This is also described as the goal of avoiding “catastrophic forgetting.” Notably, biological brains are good at this, but artificial neural networks are not by default. Sections 2A and 3A of the Toward Transparent AI survey (Räuker et al., 2022) both focus entirely on how continual learning methods are interpretability tools. Please see the survey for the full discussion.
Methods for continual learning are based on replay, regularization, or parameter isolation (De Lange et al., 2019). Methods taking the latter two strategies are based on the broader principle of getting neural networks to have some weights or neurons that specialize in particular types of data. In other words, they encourage specialized task-defined modules inside the network. Thus, these can be used as intrinsic interpretability tools that help us train models that are more easy or natural to interpret out of the box.
Modularity is a common property of engineered systems, and separating neural networks into distinct, specialized modules is very appealing for interpreting them. The weights in neural network layers are typically initialized and updated according to uniform rules, and all neurons in one layer are typically connected to all neurons in the previous and next layers. Unfortunately, this does not help networks develop specialized modules. Meanwhile, neurons in biological brains come in multiple types and can only communicate with nearby ones. This has contributed to modularity in brains in which different brain regions specialize in processing information for distinct tasks.
See Sections 4B-4C of the Toward Transparent AI survey (Räuker et al., 2022) for a full discussion on modularity. In artificial neural networks, neural networks can be trained to be modular using either “hard” architectural constraints or “soft” modularity aided by initialization, regularization, a controller, or sparse attention. Meanwhile, Serra et al. (2018) found that soft modularity via sparse attention helped with continual learning. And even when networks are not trained to be explicitly modular, one can still interpret them post hoc in terms of modules.
Some neurons and weights are frivolous, meaning that they are either redundant with others or are simply not useful to the network’s performance at all. Frivolous components of the network can be understood as useless modules that can be adapted for continual learning. Networks that contain frivolous weights or neurons can also be compressed by removing them which makes the interpretation of circuits inside of the network simpler. Meanwhile, compression can guide interpretations (e.g. Li et al. (2018) or causal scrubbing), and interpretations can guide compression (e.g. Kaixuan et al. (2021)).
Biological brains have many nice properties including adversarial robustness, continual learning, modularity, and a high degree of redundancy (Glassman, 1987) – implying compressibility. Meanwhile, network architectures that emulate biological visual cortex are more adversarially robust (Dapello et al., 2020), and adversarially robust networks do a better job of modeling representations in biological visual cortex (Schrimpf et al., 2020; Berrios and Deza, 2022)
My knowledge of the broader machine learning and neuroscience fields is limited, and I strongly suspect that there are connections to other topics out there – perhaps some that have already been studied, and perhaps some which have yet to be. For example, there are probably interesting connections between interpretability and dataset distillation (Wang et al., 2018). I’m just not sure what they are yet.
Research spanning the 6 different fields discussed here is much sparser than research within each. So in the future, more work to better understand these connections, gain insights, and refine methods from each of them may be highly valuable for interpretability. This point will be elaborated on in the next post.
Wanted to chime in and say that I've been thoroughly enjoying this sequence so far, and that it deserves far more traction that it's currently getting. Hopefully people will glance back and realise how many useful and novel thoughts/directions are packed into these posts.
Thanks! I am going to be glad to have this post around to refer to in the future. I'll probably do it a lot. Glad you have found some of it interesting.
The dataset condensation (distillation) may be seen as the global explanation of the training dataset. And, we could also utilize DC to extract some spurious correlations like backdoor triggers. We made a naive attempt in this direction: https://openreview.net/forum?id=ix3UDwIN5E .
cool post. It could be a little more detailed, but the pointers are there.