I think this is an interesting direction and I've been thinking about pretty similar things (or more generally, "quotient" interpretability research). I'm planning to write much more in the future, but not sure when that will be, so here are some unorganized quick thoughts in the meantime:
Hi, great post. Modularity was for a while at the top of my list of things-missing-for-capabilities, and it is interesting to see its relevance to safety too.
Some points about the hypothesised modules:
ML models, like all software, and like the NAH would predict, must consist of several specialized "modules".
After reading the source code of MySQL's InnoDB for 5 years, I doubt it. I think it is perfectly possible - and actually, what I would expect to happen by default - to have a huge piece of working software with no clear module boundaries.
Take a look at this case in point: the row_search_mvcc()
function https://github.com/mysql/mysql-server/blob/8.0/storage/innobase/row/row0sel.cc#L4377-L6019 which has 1500+ lines of code and references hundreds of variables. This function is called in the inner loop of almost every SELECT
query you run, so on the one hand it probably works quite correctly; on the other, it was subject to "optimization pressure" for over 20 years, and this is what you get. I think this is because Evolution is not Intelligent Design: it simply uses the cheapest locally available hack to get things done, and that is usually to reuse the existing variable #135, or, more realistically, a combination of variables #135 and #167, to do the trick. See how many of the if
statements have conditions which use more than a single atom, for example:
if (set_also_gap_locks && !trx->skip_gap_locks() &&
prebuilt->select_lock_type != LOCK_NONE &&
!dict_index_is_spatial(index)) {
(Speculation: I suspect that unless you chisel your neural network architecture to explicitly disallow connecting the neuron in question directly to neurons #145 and #167, it will connect to them as soon as it discovers they provide useful bits of information. I suspect this is why figuring out what layers you need, and the connectivity between them, is difficult. Also, I suspect this is why ensuring the right high-level wiring between parts of the brain, and how to wire them to input/output channels, might be the most important part to encode in DNA, as the inner connections and weights can be figured out later relatively easily.)
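To make the speculation concrete, here is a minimal sketch (my own construction, not from the comment) of "chiseling" an architecture: a dense layer whose connectivity is restricted by a fixed binary mask, so that no amount of training can create a direct connection between forbidden neuron pairs. The specific indices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 8, 4
weights = rng.normal(size=(n_out, n_in))

# mask[j, i] = 0 forbids output neuron j from reading input neuron i.
mask = np.ones((n_out, n_in))
mask[0, 5] = 0  # e.g. disallow output neuron 0 from seeing input neuron 5
mask[0, 7] = 0

def masked_forward(x):
    # The mask is applied to the weights on every forward pass, so any
    # update to a forbidden weight has no effect on the output.
    return (weights * mask) @ x

x = rng.normal(size=n_in)
y = masked_forward(x)

# Changing a masked weight provably cannot change the output:
weights[0, 5] += 100.0
assert np.allclose(masked_forward(x), y)
```

Without the mask, gradient descent would indeed be free to exploit any input that carries useful bits, which is the point the commenter is making.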
I'm very interested in examples of non-modular systems, but I'm not convinced by this one, for multiple reasons:
I wouldn't be surprised to learn we just have somewhat different notions of what "modularity" should mean. For sufficiently narrow definitions, I agree that lots of computations are non-modular, I'm just less interested in those.
Last point: if we change the name of "world model" to "long-term memory", we may notice the possibility that much of what you think of as shard-work may be programs stored in memory, and executed by a general-program-executor, or by a bunch of modules that specialize in specific sorts of programs, functioning like modern CPUs/interpreters (hopefully stored in an organised way that preserves modularity). What will be in the general memory and what will be in the weights themselves is non-obvious, and we may want to intervene at this point too (not sure in which direction).
(continuation of the same comment - submitted by mistake and cannot edit...) Assuming modules A and B are already "talking" and module C tries to talk to B, C would probably find it easier to learn a similar protocol than to invent a new one and teach it to B.
tl;dr: ML models, like all software, and like the NAH would predict, must consist of several specialized "modules". Such modules would form interfaces between each other, and exchange consistently-formatted messages through these interfaces. Understanding the internal data formats of a given ML model should let us comprehend an outsized amount of its cognition, and allow us to flexibly intervene in it as well.
1. A Cryptic Analogy
Let's consider three scenarios. In each of them, you're given the source code of a set of unknown programs, and you're tasked with figuring out their exact functionality. Details vary:
In the first scenario, the task is all but trivial. You read the source code, make notes on it, run parts of it, and comprehend it. It may not be quick, but it's straightforward.
The second scenario is a nightmare. There must still be some structure to every program's implementation: something like the natural abstraction hypothesis must still apply, there must be modules in this mess of code that can be understood separately, and so on. There is some high-level structure that you can parcel out into pieces tiny enough to fit into a human mind. The task is not impossible.
But suppose you've painstakingly reverse-engineered one of the programs this way. You move on to the next one, and... Yup, you're essentially starting from zero. Well, you've probably figured out something about natural abstractions when working on your first reverse-engineering, so it's somewhat easier, but still a nightmare. Every program is structured in a completely different way, you have to start from the fundamentals every time.
The third scenario is a much milder nightmare. You don't focus on reverse-engineering the programs here — first, you reverse-engineer the programming language. It's a task that may be as complex as reverse-engineering one of the individual programs from (2), but once you've solved it, you're essentially facing the same problem as in (1) — a problem that's comparably trivial.
The difference between (2) and (3) is that in (3), all the programs share a single underlying language: reverse-engineer that language once, and every program becomes legible.
(2) is a parallel to the general problem of interpretability, different programs being different ML models. Is there some interpretability problem that's isomorphic to (3), however?
I argue there is.
2. Interface Theory
Suppose that we have two separate entities with different specializations: they both can do some useful "work", but there are some types of work that only one of them can perform. Suppose that they want to collaborate: combine their specializations to do tasks neither entity can carry out alone. How can they do so?
For concreteness, imagine that the two entities are a Customer, which can provide any resource from the set of resources R, and an Artist, which can make any sculpture from some set S given some resource budget. They want to "trade": there's some sculpture s_x the Customer wants made, and there are some resources r_x the Artist needs to make it. How can they carry out such exchanges?
The Customer needs to send some message M_{s_x} to the Artist, such that it uniquely identifies s_x among the members of S. The Artist, in turn, would need to respond with some message M_{r_x}, which uniquely identifies r_x in the set R.
That implies an interface: a data channel between the two entities, such that every message passed along this channel from one entity to another uniquely identifies some action the receiving entity must take in response.
This, in turn, implies the existence of interpreters: some protocols that take in a message received by an entity, and map it onto the actions available to that entity (in this case, making a particular sculpture or passing particular resources).
For example, if the sets of sculptures and resources are sufficiently small, both entities can agree on some orderings of these sets, and then just pass numbers. "1" would refer to "sculpture 1" and "resource package 1", and so on. These protocols would need to be obeyed consistently: they'd have to always use the same ordering, such that the Customer saying "1" always means the same thing. Otherwise, the system falls apart.
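The "agree on an ordering, then pass numbers" protocol can be sketched in a few lines. All the concrete names and the resource-requirements table below are invented for illustration.

```python
SCULPTURES = ["discus thrower", "thinker", "pietà"]   # the set S, in agreed order
RESOURCES = ["marble", "bronze", "clay"]              # the set R, in agreed order

def customer_request(sculpture):
    # The Customer encodes a sculpture as its index in the agreed ordering.
    return SCULPTURES.index(sculpture)

def artist_interpret(message):
    # The Artist's interpreter maps the number back onto an action, then
    # replies with the index of the resource it needs.
    sculpture = SCULPTURES[message]
    needed = {"discus thrower": "bronze", "thinker": "bronze", "pietà": "marble"}
    return RESOURCES.index(needed[sculpture])

msg = customer_request("pietà")   # -> 2
reply = artist_interpret(msg)     # -> 0, i.e. "marble"
```

The entire protocol lives in the two shared lists: change either ordering on one side only, and the messages silently stop meaning what the sender intends.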
Now, suppose that a) the sets are really quite large, such that sorting them would take a very long time, yet b) the members of the sets are modular, in the sense that each member of a set is a composition of members of smaller sets. E.g., each sculpture can be broken down into the number of people to be depicted in it, the facial expression each person has, what height each person is, etc.
In this case, we might see this modularity reflected in the messages. Instead of specifying sculptures/resources holistically, they'd specify modules: "number of people: N, expression on person 1: type 3...", et cetera.
To make use of this, the Artist's interpreter would need to know what "expression on person 1" translates to, in terms of specifications-over-S. And as we've already noted, this meaning would need to be stable across time: "expression on person 1" would need to always mean the same thing, from request to request.
That, in turn, would imply data formats. Messages would have some consistent structure, such that, if you knew the rules by which these structures are generated, you'd be able to understand any message exchanged between the Customer and the Artist at any point.
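As a hedged sketch of such a data format (the field names and expression list are my inventions): once members of S decompose into modules, a message becomes a structured record rather than a flat index, and a single interpreter can decode any message that follows the format, including ones it has never seen.

```python
EXPRESSIONS = ["neutral", "smiling", "grimacing", "serene"]  # agreed sub-ordering

def encode_sculpture(n_people, expressions):
    # A consistently-formatted message: same fields, same meanings, every time.
    return {"n_people": n_people,
            "expressions": [EXPRESSIONS.index(e) for e in expressions]}

def artist_interpreter(message):
    # Because the format is stable, decoding is mechanical: each field
    # translates to the same specification-over-S from request to request.
    people = [EXPRESSIONS[i] for i in message["expressions"]]
    assert len(people) == message["n_people"]
    return "sculpture of %d people: %s" % (message["n_people"], ", ".join(people))

msg = encode_sculpture(2, ["smiling", "serene"])
decoded = artist_interpreter(msg)
```

The key property is combinatorial coverage: a handful of stable fields spans a set of sculptures far too large to ever enumerate and sort.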
3. Internal Interfaces
Suppose that we have an ML model with two different modules, each specialized for performing different computations. Inasmuch as they work together, we'll see the same dynamics between them as between the Customer and the Artist: they'll arrive at some consistent data formats for exchanging information. They'll form an internal interface.
Connecting this with the example from Section 1, I think this has interesting implications for interpretability. Namely: internal interfaces are the most important interpretability targets.
Why? Two reasons:
Cost-efficiency. Understanding a data format would allow us to understand everything that can be specified using this data format, which may shed light on entire swathes of a model's cognition. And as per the third example in Section 1, it should be significantly easier than understanding an arbitrary feature, since every message would follow the same high-level structure.
Editability. Changing an ML model's weights while causing the effects you want, and only the effects you want, is notoriously difficult: often, a given feature is only responsible for a part of the effect you're ascribing to it, or it has completely unexpected side-effects. An interface, however, is a channel specifically built for communication between two components. It should have minimal side-effects, and intervening on it should have predictable results.
As far as controlling the AI's cognition goes, internal interfaces are the most high-impact points.
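A toy illustration of the editability point (numpy only, not a real interpretability tool): if module A talks to module B through a known message format, we can read or overwrite the message in flight, steering B predictably without touching either module's weights.

```python
import numpy as np

rng = np.random.default_rng(1)
W_a = rng.normal(size=(3, 5))   # "module A": encoder producing the message
W_b = rng.normal(size=(2, 3))   # "module B": consumer of the message

def module_a(x):
    return W_a @ x              # the message on the internal interface

def module_b(m):
    return W_b @ m

x = rng.normal(size=5)
message = module_a(x)           # read off the interface: interpretability

# Intervention: spoof the interface by substituting a message of our
# choosing. B's response is exactly determined by the known format.
spoofed = np.array([1.0, 0.0, 0.0])
steered = module_b(spoofed)
assert np.allclose(steered, W_b[:, 0])  # B reads out exactly column 0 of W_b
```

In a real network the analogous move is intervening on the activations passing between two circuits, which is predictable only to the extent that the message format on that channel has actually been decoded.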
4. What Internal Interfaces Can We Expect?
This question is basically synonymous with "what modules can we expect?". There are currently good arguments for the following:
Here's a breakdown of the potential interfaces as I see them:
"Cracking" any of these interfaces should give us a lot of control over the ML model: the ability to identify specific shards, spoof or inhibit their activations, literally read off the AI's plans, et cetera.
I'm particularly excited about (7) and (5), though: if the world-model is itself an interface, it must be consistently-formatted in its entirety, and getting the ability to arbitrarily navigate and edit it would, perhaps, allow us to solve the entire alignment problem on the spot. (A post elaborating on this is upcoming.)
(6) is also extremely high-value, but might be much more difficult to decode.
5. Research Priorities