How Do Selection Theorems Relate To Interpretability?

As another "why not just" which I'm sure there's a reason for:

in the original circuits thread, they made a number of parameterized families of synthetic images which certain nodes in the network responded strongly to in a way that varied smoothly with the orientation parameter, and where these nodes detected e.g. boundaries between high-frequency and low-frequency regions at different orientations.

If given another such network of generally the same kind of architecture, if you gave that network the same images, if it also had analogous nodes, I'd expect those nodes to have much more similar responses to those images than any other nodes in the network. I would expect that cosine similarity of the "how strongly does this node respond to this image" would be able to pick out the node(s) in question fairly well? Perhaps I'm wrong about that.

And, of course, this idea seems only directly applicable to feed-forward convolution networks that take an image as the input, and so, not so applicable when trying to like, understand how an agent works, probably.

(well, maybe it would work in things that aren't just a convolutions-and-pooling-and-dilation-etc , but seems like it would be hard to make the analogous synthetic inputs which exemplify the sort of thing that the node responds to, for inputs other than images. Especially if the inputs are from a particularly discrete space, like sentences or something. )

But, this makes me a bit unclear about why the "NP-HARD" lights start blinking.

Of course, "find isomorphic structure", sure.
But, if we have a set of situations which exemplify when a given node does and does not fire (rather, when it activates more and when it activates less) in one network, searching another network for a node that does/doesn't activate in those same situations, hardly seems NP-hard. Just check all the nodes for whether they do or don't light up. And then, if you also have similar characterizations for what causes activation in the nodes that came before the given node in the first network, apply the same process with those on the nodes in the second network that come before the nodes that matched the closest.

I suppose if you want to give a overall score for each combination of "this sub-network of nodes in the new network corresponds to this other network of nodes in the old-and-understood network", and find the sub-network that gives the best score, then, sure, there could be exponentially many sub-networks to consider. But, if each well-understood node in the old network generally has basically only one plausibly corresponding node in the new network, then this seems like it might not really be an issue in practice?

But, I don't have any real experience with this kind of thing, and I could be totally off.

[-]johnswentworth3y50

You've correctly identified most of the problems already. One missing piece: it's not necessarily node-activations which are the right thing to look at. Even in existing work, there's other ways interpretable information is embedded, like e.g. directions in activation space of a bunch of neurons, or rank-one updates to matrices.

[-]tailcalled3y50

Doesn't this strategy necessarily depend on having a way to quantify the structure of the environment? That is, to run the theorems backwards from circuit to environmental structure, you need some representation of the environmental structure.

And quantifying the structure of the environment seems to me to be the difficult problem that neural network models solve in the first place. So it seems like this strategy wouldn't be sufficient without an alternative to neural network models?

[-]johnswentworth3y40

Indeed. One would need to have some whole theory of what kinds of relevant structures are present in the environment, and how to represent them.

And that's why I started with abstraction.

[-]Joe Kwon3y40

Hi John. One could run useful empirical experiments right now, before fleshing out all these structures and how to represent them, if you can assume that a proxy for human representations (crude: conceptnet, less crude: similarity judgments on visual features and classes collected by humans) is a good enough proxy for "relevant structures" (or at least that these representations more faithfully capture the natural abstractions than the best machines in vision tasks for example, where human performance is the benchmark performance), right?

I had a similar idea about ontology mismatch identification via checking for isomorphic structures, and also realized I had no idea how to realize that idea. Through some discussions with Stephen Casper and Ilia Sucholutsky, we kind of pivoted the above idea into the regime of interpretability/adversarial robustness where we are hunting for interesting properties given that we can identify the biggest ways that humans and machines are representing things differently (and that humans, for now, are doing it "better"/more efficiently/more like the natural abstraction structures that exist).

I think am working in the same building this summer (caught a split-second glance at you yesterday); I would love a chance to discuss how selection theorems might relate to an interpretability/adversarial robustness project I have been thinking about.

[-]tailcalled3y20

Those structures would likely also be represented with neural nets, though, right? So in practice it seems like it would end up quite similar to looking for isomorphic structures between neural networks, except you specifically want to design a highly interpretable kind of neural network and then look for isomorphisms between this interpretable neural network and other neural networks.

[-]johnswentworth3y20

They would not necessarily be represented with neural nets, unless you're using "neural nets" to refer to circuits in general.

[-]tailcalled3y20

I think by "neural nets" I mean "circuits that get optimized through GD-like optimization techniques and where the vast majority of degrees of freedom for the optimization process come from big matrix multiplications".

[-]johnswentworth3y20

Yeah, I definitely don't expect that to be the typical representation; I expect neither an optimization-based training process nor lots of big matrix multiplications.

[-]tailcalled3y20

Interesting and exciting. Can you reveal more about how you expect it to work?

[-]johnswentworth3y20

Based on the forms in Maxent and Abstraction, I expect (possibly nested) sums of local functions to be the main feature-representation. Figuring out which local functions to sum might be done, in practice, by backing equivalent sums out of a trained neural net, but the net wouldn't be part of the representation.

[-]Gurkenglas3y40

I don't follow why we shouldn't use neural networks to find isomorphic structures. Yes, asking a liar if he's honest is silly; but 1. asking a traitor to annotate their thought transcripts still makes it harder for them to plan a coup if we can check some of their work; 2. if current networks can annotate current networks, we might fix the latter's structure and scale its components up in a manner less likely to shift paradigms.

[-]johnswentworth3y60

A useful heuristic: for alignment purposes, most of the value of an interpretability technique comes not from being able to find things, but being able to guarantee that we didn't miss anything. The ability to check only some of the work has very little impact on the difficulty of alignment - in particular, it means we cannot apply any optimization pressure at all to that interpretability method (including optimization pressure like "the humans try new designs until they find one which doesn't raise any problems visible to the interpretability tools"). The main channel through which partial interpretability would be useful is if it leads to a method for more comprehensive interpretability.

[-]MSRayne3y10

Plus, structures are either isomorphic or they're not. If they are, humans examining them should be able to check the work of the annotator and figure out why it said they were isomorphic in the first place. This seems to me like the AI would be a way of focusing human attention more efficiently on the places that are interesting in the neural net, rather than doing the work for us, which obviously doesn't aid our own ability to interpret at all!

LESSWRONG
LW

LESSWRONG
LW

60

How Do Selection Theorems Relate To Interpretability?

60

Ω 33

60

Ω 33

How would Selection Theorems allow reuse?

Other Connections

Why Not Just…

The Gap