Biological neural networks (i.e. brains) and artificial neural networks have sufficient commonalities that it's often reasonable to treat our knowledge about one as a good starting point for reasoning about the other. So one way to predict how the field of neural network interpretability will develop is by looking at how neuroscience interprets the workings of human brains. I think there are several interesting things to be learned from this, but the one I'll focus on in this post is the concept of modularity: the fact that different parts of the brain carry out different functions. Neuroscientists have mapped many different skills (such as language use, memory consolidation, and emotional responses) to specific brain regions. Note that this doesn’t always give us much direct insight into how the skills themselves work - but it does make follow-up research into those skills much easier. I’ll argue that, for the purposes of AGI safety, this type of understanding may also directly enable important safety techniques.
What might it look like to identify modules in a machine learning system? Some machine learning systems are composed of multiple networks trained on different objective functions - which I’ll call architectural modularity. But what I’m more interested in is emergent modularity, where a single network develops modularity after training. Emergent modularity requires that the weights of a network give rise to a modular structure, and that those modules correspond to particular functions. We can think about this both in terms of high-level structure (e.g. a large section of a neural network carrying out a broad role, analogous to the visual system in humans) and in terms of lower-level structure, where a smaller module carries out a more specific function. (Note that this is a weaker definition than the one defended in philosophy by Fodor and others - for instance, the sets of neurons don’t need to contain encapsulated information.)
In theory, the neurons which make up a module might be distributed in a complex way across the entire network with only tenuous links between them. But in practice, we should probably expect that if these modules exist, we will be able to identify them by looking at the structure of connections between artificial neurons, similar to how it’s done for biological neurons. The first criterion is captured in a definition proposed by Filan et al. (2021): a network is modular to the extent that it can be partitioned into sets of neurons where each set is strongly internally connected, but only weakly connected to other sets. They measure this by pruning the networks, then applying graph-clustering algorithms, and provide empirical evidence that multi-layer perceptrons are surprisingly modular.
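To make the "strongly internally connected, weakly connected to other sets" idea concrete, here is a minimal sketch of how one might score a candidate partition of a small multi-layer perceptron. It is an illustration under my own assumptions, not a reproduction of Filan et al.'s procedure: the function names, the magnitude-based pruning rule, the toy weights, and the use of a normalized-cut score are all choices made for this example, and the partition is supplied by hand rather than found by a clustering algorithm.

```python
import numpy as np

def mlp_adjacency(weights, keep_fraction=1.0):
    """Turn MLP weight matrices into a symmetric neuron-to-neuron graph.

    weights[i] has shape (n_in, n_out) for layer i. Edge strength is the
    absolute weight. Assumption: pruning keeps roughly the largest
    `keep_fraction` of edges by magnitude and zeroes the rest.
    """
    sizes = [weights[0].shape[0]] + [w.shape[1] for w in weights]
    offsets = np.cumsum([0] + sizes)
    n = int(offsets[-1])
    adj = np.zeros((n, n))
    for i, w in enumerate(weights):
        rows = slice(offsets[i], offsets[i + 1])
        cols = slice(offsets[i + 1], offsets[i + 2])
        adj[rows, cols] = np.abs(w)
    nonzero = adj[adj > 0]
    if nonzero.size:
        threshold = np.quantile(nonzero, 1.0 - keep_fraction)
        adj[adj < threshold] = 0.0
    return adj + adj.T  # undirected graph

def normalized_cut(adj, labels):
    """Sum over clusters of (edge weight leaving the cluster) / (cluster degree).

    Lower values mean the partition is more modular.
    """
    labels = np.asarray(labels)
    degree = adj.sum(axis=1)
    score = 0.0
    for c in np.unique(labels):
        inside = labels == c
        cut = adj[inside][:, ~inside].sum()
        volume = degree[inside].sum()
        if volume > 0:
            score += cut / volume
    return score

# Toy network (hypothetical weights): 4 inputs -> 4 hidden -> 2 outputs,
# built as two strongly connected halves joined by weak cross-links.
w1 = np.full((4, 4), 0.01)
w1[:2, :2] = 1.0
w1[2:, 2:] = 1.0
w2 = np.full((4, 2), 0.01)
w2[:2, 0] = 1.0
w2[2:, 1] = 1.0

# Node order: inputs 0-3, hidden 4-7, outputs 8-9.
modular_labels = [0, 0, 1, 1, 0, 0, 1, 1, 0, 1]  # follows the two halves
mixed_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]    # ignores the structure

unpruned = mlp_adjacency([w1, w2], keep_fraction=1.0)
pruned = mlp_adjacency([w1, w2], keep_fraction=0.5)

modular_unpruned = normalized_cut(unpruned, modular_labels)
modular_pruned = normalized_cut(pruned, modular_labels)
mixed_pruned = normalized_cut(pruned, mixed_labels)
print(modular_unpruned, modular_pruned, mixed_pruned)
```

In this toy case the partition that follows the two halves scores much lower than the one that ignores them, and pruning the weak cross-links removes the remaining cut entirely. A real analysis would have to search for the partition, which is where graph-clustering algorithms come in.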
The next question is whether those modules correspond to internal functions. Although it’s an appealing and intuitive hypothesis, the evidence for this is currently mixed. On one hand, Olah et al.’s (2020) investigations find circuits which implement human-comprehensible functions. And insofar as we expect artificial neural networks to be similar to biological neural networks, the evidence from brain lesions in humans and other animals is compelling. On the other hand, Olah et al. also find evidence for polysemanticity in artificial neural networks: some neurons fire for multiple reasons, rather than having a single well-defined role.
If it does turn out to be the case that structural modules implement functional modules, though, that has important implications for safety research: if we know what types of cognition we’d like our agents to avoid, then we might be able to identify and remove the regions responsible for them. In particular, we could try to find modules responsible for goal-directed agency, or perhaps even ones which are used for deception. This seems like a much more achievable goal for interpretability research than the goal of “reading off” specific thoughts that the network is having. Indeed, as in humans, very crude techniques for monitoring neural activations may be sufficient to identify many modules. But doing so may be just as useful for safety as precise interpretability, or more so, because it allows us to remove underlying traits that we’re worried about merely by setting the weights in the relevant modules to zero - a technique which I’ll call module pruning.
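As a rough illustration of module pruning, the sketch below zeroes every weight going into and out of a chosen set of hidden neurons in a tiny two-layer network. Everything here is an assumption made for the example: the `prune_module` helper, the toy weights, and above all the premise that we already know which neurons make up the module (in practice, identifying them is the hard part).

```python
import numpy as np

def forward(x, w1, b1, w2, b2):
    """Two-layer ReLU network: x -> hidden -> output."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

def prune_module(w1, b1, w2, module_neurons):
    """Return copies of the parameters with the given hidden neurons zeroed.

    Both the incoming weights (and biases) and the outgoing weights are
    set to zero, so the module can no longer influence the output.
    """
    w1, b1, w2 = w1.copy(), b1.copy(), w2.copy()
    w1[:, module_neurons] = 0.0
    b1[module_neurons] = 0.0
    w2[module_neurons, :] = 0.0
    return w1, b1, w2

# Hypothetical network with two cleanly separated modules:
# hidden neurons 0-1 map input 0 to output 0,
# hidden neurons 2-3 map input 1 to output 1.
w1 = np.array([[1.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0]])
b1 = np.zeros(4)
w2 = np.array([[1.0, 0.0],
               [1.0, 0.0],
               [0.0, 1.0],
               [0.0, 1.0]])
b2 = np.zeros(2)

x = np.array([[2.0, 3.0]])
before = forward(x, w1, b1, w2, b2)

# Suppose hidden neurons 2-3 are the module we want to remove.
p_w1, p_b1, p_w2 = prune_module(w1, b1, w2, module_neurons=[2, 3])
after = forward(x, p_w1, p_b1, p_w2, b2)
print(before, after)
```

Because the two modules in this toy network are perfectly separated, removing one leaves the other output untouched. Real networks are unlikely to be this clean, so some loss of wanted functionality should be expected.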
Of course, removing significant chunks of a neural network will affect its performance on the tasks we do want it to achieve. But it’s possible that retraining it from that point will allow it to regain the functionality we’re interested in without fully recreating the modules we’re worried about. This would be particularly valuable in cases where extensive pre-training is doing a lot of work in developing our agents’ capabilities, because that pre-training tends to be hard to control. For instance, it’s difficult to remove all offensive content from a large corpus of internet data, and so language models trained on such a corpus usually learn to reiterate that offensive content. Hypothetically, though, if we were able to observe small clusters of neurons which were most responsible for encoding this content, and zeroed out the corresponding parameters, then we could subsequently continue training on smaller corpora with more trustworthy content. While this particular example is quite speculative, I expect the general principle to be more straightforwardly applicable for agents that are pre-trained in multi-agent environments, in which they may acquire a range of dangerous traits like aggression or deception.
Module pruning also raises a counterintuitive possibility: that it may be beneficial to train agents to misbehave in limited ways, so that they develop specific modules responsible for those types of misbehaviour, which we can then remove. Of course, this suggestion is highly speculative. And, more generally, we should be very uncertain about whether advanced AIs will have modules that correspond directly to the types of skills we care about. But thinking about the safety of big neural networks in terms of emergent modules does seem like an interesting direction - both because the example of humans suggests that it’ll be useful, and also because it will push us towards lower-level and more precise descriptions of the types of cognition which our AIs carry out, and the types which we’d like to prevent.