Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

One pretty major problem with today’s interpretability methods (e.g. work by Chris Olah & co) is that we have to redo a bunch of work whenever a new net comes out, and even more work when a new architecture or data modality comes along (e.g. transformers and language models).

This breaks one of the core load-bearing pillars of scientific progress: scientific knowledge is cumulative (approximately, in the long run). Newton saw far by standing on the shoulders of giants, etc. We work out the laws of mechanics/geometry/electricity/chemistry/light/gravity/computation/etc, and then reuse those laws over and over again in millions of different contexts without having to reinvent the core principles every time. Economically, that’s why it makes sense to pump lots of resources into scientific research: it’s a capital investment. The knowledge gained will be used over and over again far into the future.

But for today’s interpretability methods, that model fails. Some principles and tools do stick around, but most of the knowledge gained is reset when the next hot thing comes out. Economically, that means there isn’t much direct profit incentive to invest in interpretability; the upside of the research is mostly the hope that people will figure out a better approach.

One way to frame the Selection Theorems agenda is as a strategy to make interpretability transferable and reusable, even across major shifts in architecture. Basic idea: a Selection Theorem tells us what structure is selected for in some broad class of environments, so operating such theorems “in reverse” should tell us what environmental features produce observed structure in a trained net. Then, we run the theorems “forward” to look for the analogous structure produced by the same environmental features in a different net.

How would Selection Theorems allow reuse?

An example story for how I expect this sort of thing might work, once all the supporting pieces are in place:

  • While probing a neural net, we find a circuit implementing a certain filter.
  • We’ve (in this hypothetical) previously shown that a broad class of neural nets learn natural abstractions, so we work backward through those theorems and find that the filter indeed corresponds to a natural abstraction in the dataset/environment.
  • We can then work forward through the theorems to identify circuits corresponding to the same natural abstraction in other nets (potentially with different architectures) trained on similar data.

In this example, the hypothetical theorem showing that a broad class of neural nets learn natural abstractions would be a Selection Theorem: it tells us about what sort of structure is selected for in general when training systems in some class of environments.

More generally, the basic use-case of Selection-Theorem-based interpretability would be:

  • Identify some useful interpretable structure in a trained neural net (similar to today’s work).
  • Work backward through some Selection Theorems to find out what features of the environment/architecture selected for that structure.
  • Work forward through the theorems to identify corresponding internal structures in other nets.

Assuming that the interpretable structure shows up robustly, we should expect a fairly broad class of environments/architectures will produce similar structure, so we should be able to transfer structure to a fairly broad class of nets - even other architectures which embed the structure differently. (And if the interpretable structure doesn’t show up robustly, i.e. the structure is just an accident of this particular net, then it probably wasn’t going to be useful for future nets anyway.)
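
To make that loop concrete, here is a minimal sketch in code. Every callable in it is a hypothetical stand-in: neither the theorems nor the probes exist yet, so treat this as a diagram of the workflow rather than an implementation.

```python
# Entirely hypothetical sketch of the use-case above; the probes and theorems
# are passed in as stand-in callables, since none of this tooling exists yet.
def selection_theorem_loop(net_a, net_b, dataset,
                           find_structure,    # step 1: probe net_a for interpretable structure
                           invert_theorem,    # step 2: theorem run "in reverse": structure -> env feature
                           apply_theorem,     # step 3: theorem run "forward": env feature -> predicted structure
                           check_structure):  # empirical check in net_b
    structure = find_structure(net_a)
    env_feature = invert_theorem(structure, dataset)
    predicted = apply_theorem(env_feature, net_b)
    return check_structure(net_b, predicted)
```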

Other Connections

This picture connects closely to the convergence leg of the Natural Abstraction Hypothesis: “a wide variety of cognitive architectures learn and use approximately-the-same internal summaries”. The Selection Theorem approach to interpretability looks for those convergent summaries, and tries to directly prove that they are convergent. The approach is also potentially more general than that; for instance, we might characterize the conditions under which an internal optimization process, or a self-model, is selected for. But that sort of high-level structure is probably a ways out yet; I expect theorems about simple classes of natural abstractions to come along first.

It also connects closely to the Pointers Problem. Once we can locate convergent “concepts” in a trained system, and know what real-world things they correspond to, we can potentially “point” systems directly at particular real-world things.

Why Not Just…

One could imagine pursuing a similar strategy without the Selection Theorems. For instance, why not just directly look for isomorphic structures in two neural nets?

… ok, that question basically answers itself. “Look for isomorphic structures in two neural nets” immediately brings to mind superposed images of a grad student sobbing and a neon sign flashing “NP COMPLETE”.

(As with any problem which immediately brings to mind superposed images of a grad student sobbing and a neon sign flashing “NP COMPLETE”, someone will inevitably suggest that we Use Neural Networks to solve it. In response to that suggestion, I will severely misquote James Mickens:

This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. The point of interpretability is that we don’t trust the magic black boxes without directly checking what’s going on inside of them, so asking a magic black box to interpret another magic black box is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.

)

Point is, to translate structure from one net to another, especially across an architectural shift, we need some way to know what to look for and where to look for it. Selection Theorems will, I expect, give us that.

The other problem with just looking for similar structures directly is that we have no idea when an architectural shift will break our methods. I expect Selection Theorems will be general enough to basically solve that problem.

The Gap

There’s still a wide gap to cross before we can directly use Selection Theorems for interpretability. We need the actual theorems.

The good news is, interpretability has relatively good feedback loops, and we can potentially carry those over to research on Selection Theorems. We can look for structure in a net, and empirically test how sensitive that structure is to features in the environment, in order to provide data and feedback for our theorem-development.
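
A minimal sketch of that feedback loop, with the training and probing routines left as stand-in parameters (hypothetical, not existing tooling):

```python
# Hypothetical sketch of the empirical feedback loop: perturb a feature of the
# training environment, retrain, and check whether the structure of interest persists.
def sensitivity_experiment(base_env, perturbations, train, probe_for_structure):
    baseline_net = train(base_env)
    baseline = probe_for_structure(baseline_net)  # e.g. "does the filter circuit appear?"
    results = {}
    for name, perturb in perturbations.items():
        net = train(perturb(base_env))            # retrain on the perturbed environment
        results[name] = probe_for_structure(net)  # did the structure survive the perturbation?
    return baseline, results
```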

Comments

As another "why not just" which I'm sure there's a reason for:

In the original circuits thread, they made a number of parameterized families of synthetic images to which certain nodes in the network responded strongly, in a way that varied smoothly with the orientation parameter; these nodes detected e.g. boundaries between high-frequency and low-frequency regions at different orientations.

Given another such network of generally the same kind of architecture, if you fed it the same images and it also had analogous nodes, I'd expect those nodes to have much more similar responses to those images than any other nodes in the network. I'd expect that cosine similarity of the "how strongly does this node respond to this image" vectors would be able to pick out the node(s) in question fairly well? Perhaps I'm wrong about that.
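
A minimal sketch of that matching procedure (hypothetical names; assumes per-node responses to the synthetic image family have already been collected, e.g. via forward hooks):

```python
import numpy as np

def match_node(reference_responses, candidate_responses):
    """Find the node in a new network whose response profile over a family of
    synthetic images is most similar (by cosine similarity) to a known node's.

    reference_responses: shape (n_images,), responses of the understood node.
    candidate_responses: shape (n_nodes, n_images), responses of every node
        in the new network to the same images.
    """
    ref = reference_responses / (np.linalg.norm(reference_responses) + 1e-12)
    cand = candidate_responses / (
        np.linalg.norm(candidate_responses, axis=1, keepdims=True) + 1e-12
    )
    similarities = cand @ ref          # cosine similarity for each candidate node
    best = int(np.argmax(similarities))
    return best, similarities[best]
```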

And, of course, this idea seems only directly applicable to feed-forward convolutional networks that take an image as input, and so probably not so applicable when trying to understand how an agent works.

(Well, maybe it would work in things that aren't just convolutions-and-pooling-and-dilation-etc., but it seems like it would be hard to make the analogous synthetic inputs which exemplify the sort of thing a node responds to, for inputs other than images, especially if the inputs are from a particularly discrete space, like sentences.)

But this leaves me a bit unclear about why the "NP-HARD" lights start blinking.

Of course, "find isomorphic structure", sure.
But, if we have a set of situations which exemplify when a given node does and does not fire (rather, when it activates more and when it activates less) in one network, searching another network for a node that does/doesn't activate in those same situations hardly seems NP-hard. Just check all the nodes for whether they do or don't light up. And then, if you also have similar characterizations of what causes activation in the nodes that came before the given node in the first network, apply the same process to those characterizations, on the nodes in the second network that come before the nodes that matched most closely.
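
A rough sketch of that matching loop (hypothetical setup: each node is summarized by its activation pattern over a shared set of probe inputs; for brevity this version matches each layer independently rather than restricting attention to nodes upstream of already-matched ones):

```python
import numpy as np

def greedy_layerwise_match(old_patterns, new_patterns):
    """Greedily match understood nodes to nodes in a new network, layer by layer.

    old_patterns / new_patterns: dicts mapping layer index -> array of shape
        (n_nodes_in_layer, n_probe_inputs) of activation patterns over the same probe inputs.
    Returns a dict mapping (layer, old_node_index) -> best-matching new node index.
    """
    matches = {}
    for layer in sorted(old_patterns):
        old = old_patterns[layer]
        new = new_patterns[layer]
        # Normalize so the comparison is by activation *pattern*, not overall scale.
        old_n = old / (np.linalg.norm(old, axis=1, keepdims=True) + 1e-12)
        new_n = new / (np.linalg.norm(new, axis=1, keepdims=True) + 1e-12)
        sims = old_n @ new_n.T            # (n_old, n_new) cosine similarities
        for i in range(sims.shape[0]):
            matches[(layer, i)] = int(np.argmax(sims[i]))
    return matches
```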

I suppose if you want to give an overall score for each combination of "this sub-network of nodes in the new network corresponds to this other sub-network of nodes in the old-and-understood network", and find the sub-network that gives the best score, then, sure, there could be exponentially many sub-networks to consider. But, if each well-understood node in the old network generally has basically only one plausibly corresponding node in the new network, then this seems like it might not really be an issue in practice?

But, I don't have any real experience with this kind of thing, and I could be totally off.

You've correctly identified most of the problems already. One missing piece: it's not necessarily node activations which are the right thing to look at. Even in existing work, there are other ways interpretable information is embedded, e.g. directions in activation space of a bunch of neurons, or rank-one updates to matrices.
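
For concreteness, a toy sketch of the first of those: reading a feature as a projection onto a direction in activation space rather than as a single node's activation (the direction vector here is hypothetical, e.g. found by probing or dictionary learning):

```python
import numpy as np

def feature_strength(activations, direction):
    """activations: shape (n_examples, n_neurons), one layer's activations.
    direction: shape (n_neurons,), a hypothetical feature direction."""
    unit = direction / (np.linalg.norm(direction) + 1e-12)
    # One scalar per example: how strongly the feature direction is active,
    # regardless of whether any single neuron fires strongly.
    return activations @ unit
```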

Doesn't this strategy necessarily depend on having a way to quantify the structure of the environment? That is, to run the theorems backwards from circuit to environmental structure, you need some representation of the environmental structure.

And quantifying the structure of the environment seems to me to be the difficult problem that neural network models solve in the first place. So it seems like this strategy wouldn't be sufficient without an alternative to neural network models?

Indeed. One would need to have some whole theory of what kinds of relevant structures are present in the environment, and how to represent them.

And that's why I started with abstraction.

Hi John. One could run useful empirical experiments right now, before fleshing out all these structures and how to represent them, if one can assume that a proxy for human representations (crude: ConceptNet; less crude: similarity judgments on visual features and classes collected from humans) is a good enough proxy for "relevant structures" (or at least that these representations capture the natural abstractions more faithfully than the best machines do in, say, vision tasks, where human performance is the benchmark), right?

I had a similar idea about ontology-mismatch identification via checking for isomorphic structures, and also realized I had no idea how to make it work. Through some discussions with Stephen Casper and Ilia Sucholutsky, we kind of pivoted the above idea into the regime of interpretability/adversarial robustness, where we are hunting for interesting properties given that we can identify the biggest ways that humans and machines represent things differently (and that humans, for now, are doing it "better"/more efficiently/more like the natural abstraction structures that exist).

I think I am working in the same building this summer (caught a split-second glance at you yesterday); I would love a chance to discuss how selection theorems might relate to an interpretability/adversarial robustness project I have been thinking about.

Those structures would likely also be represented with neural nets, though, right? So in practice it seems like it would end up quite similar to looking for isomorphic structures between neural networks, except you specifically want to design a highly interpretable kind of neural network and then look for isomorphisms between this interpretable neural network and other neural networks.

They would not necessarily be represented with neural nets, unless you're using "neural nets" to refer to circuits in general.

I think by "neural nets" I mean "circuits that get optimized through GD-like optimization techniques and where the vast majority of degrees of freedom for the optimization process come from big matrix multiplications".

Yeah, I definitely don't expect that to be the typical representation; I expect neither an optimization-based training process nor lots of big matrix multiplications.

Interesting and exciting. Can you reveal more about how you expect it to work?

Based on the forms in Maxent and Abstraction, I expect (possibly nested) sums of local functions to be the main feature-representation. Figuring out which local functions to sum might be done, in practice, by backing equivalent sums out of a trained neural net, but the net wouldn't be part of the representation.
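
As a rough sketch of the kind of form I have in mind (notation illustrative, following the maxent-style results referenced above; each local function depends only on a small neighborhood of variables):

```latex
% Illustrative only: a feature represented as a (possibly nested) sum of local functions.
% X = (X_1, ..., X_n) are environment variables; each f_i depends only on a small
% "neighborhood" N_i of them; the lambda_i are weights.
\[
P[X] \;\propto\; \exp\!\Big( \sum_i \lambda_i \, f_i\big(X_{N_i}\big) \Big),
\qquad
\mathrm{feature}(X) \;=\; \sum_i \lambda_i \, f_i\big(X_{N_i}\big).
\]
```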

I don't follow why we shouldn't use neural networks to find isomorphic structures. Yes, asking a liar if he's honest is silly; but 1. asking a traitor to annotate their thought transcripts still makes it harder for them to plan a coup if we can check some of their work; 2. if current networks can annotate current networks, we might fix the latter's structure and scale its components up in a manner less likely to shift paradigms.

A useful heuristic: for alignment purposes, most of the value of an interpretability technique comes not from being able to find things, but from being able to guarantee that we didn't miss anything. The ability to check only some of the work has very little impact on the difficulty of alignment - in particular, it means we cannot apply any optimization pressure at all to that interpretability method (including optimization pressure like "the humans try new designs until they find one which doesn't raise any problems visible to the interpretability tools"). The main channel through which partial interpretability would be useful is if it leads to a method for more comprehensive interpretability.

Plus, structures are either isomorphic or they're not. If they are, humans examining them should be able to check the work of the annotator and figure out why it said they were isomorphic in the first place. This seems to me like the AI would be a way of focusing human attention more efficiently on the interesting places in the neural net, rather than doing the work for us (which obviously wouldn't aid our own ability to interpret at all)!