Tl;dr We are attempting to make neural networks (NN) modular, have GPT-N interpret each module for us, in order to catch mesa-alignment and inner-alignment failures.
Train a neural net with an added loss term that enforces the sort of modularity that we see in well-designed software projects. To use this paper's informal definition of modularity
a network is modular to the extent that it can be partitioned into sets of neurons where each set is strongly internally connected, but only weakly connected to other sets.
Example of a “Modular” GPT. Each module should be densely connected w/ relatively larger weights. Interfaces between modules should be sparsely connected w/ relatively smaller weights.
Once we have a Modular NN (for example, a GPT), we will use a normal GPT to map each module into a natural language description. Notice that there are two different GPT’s at work here.
GPT-N reads in each “Module” of the “Modular GPT”, outputting a natural language description for each module.
If successful, we could use GPT-N to interpret any modular NN in natural language. Not only should this help our understanding of what the model is doing, but it should also catch mesa-alignment and inner-alignment failures.
There are a few intuitions we have that go counter to other’s intuitions. Below is an elaboration of our thoughts and why we think this project could work.
Finding a Loss function that Induces Modularity
We currently think a Gomory-Hu Tree (GH Tree) captures the relevant information. We will initially convert a NN to a GH Tree to calculate the new loss function. This conversion will be computationally costly, though more progress can be made to calculate the loss function directly from the NN. See Appendix A for more details
Small NN’s are Human Interpretable
We’re assuming humans can interpret small NN’s, given enough time. A “Modular” NN is just a collection of small NN’s connected by sparse weights. If humans could interpret each module in theory, then GPT-N could too. If humans can interpret the interfaces between each, then GPT-N could too.
Examples from NN Playground are readily interpretable (such as the above example).
GPT-3 can already turn comments into code. We don't expect the reverse case to be fundamentally harder, and neural nets can be interpreted as just another programming language.
Microscope AI has had some success in interpreting large NN’s. These are NN’s that should be much harder to interpret than modular NN’s that we would be interpreting.
First question: Capabilities will likely be lost by adding a modularity loss term. Can we spot-check capability of GPT by looking at the loss of the original loss terms? Or would we need to run it through NLP metrics (like Winograd Schema Challenge questions)?
To create a modular GPT, we have two paths, but I'm unsure of which is better.
- Train from scratch with modified loss
- Train OpenAI’s gpt-2 on more data, but with added loss term. The intuition here is that it’s already capable, so optimizing for modularity starting here will preserve capabilities.
If you are interested in the interpretability of GPT (even unrelated to our project), I can add you to a discord server full of GPT enthusiasts (just DM me). If you're interested in helping out our project specifically, DM me and we'll figure out a way to divvy up tasks.
Gomory-Hu Tree Contains Relevant Information on Modularity
Some readily accessible insights:
- The size of the minimum cut between two neurons can be used to measure the size of the interface between their modules.
- Call two graphs G and G’ on the same vertices equivalent if for every two u,v, the sizes of their minimum cuts are the same in G and G’. It turns out that there always exists a G’ which is a tree! (The Gomory-Hu tree.)
- It turns out that the minimum cut between two neurons within a module never needs to expose the innards of another module.
Therefore, the Gomory-Hu tree probably contains all the information needed to calculate the loss term and the hierarchy of software modules.