Regardless, here's something people might find amusing - researchers found that a simple VGG-like 3D CNN model can look at electron microscope images of neural tissue and do a task that humans don't know how to do. The network distinguishes neurons that specialize in certain neurotransmitters. From the abstract to this preprint:

"The network successfully discriminates between six types of neurotransmitters (GABA, glutamate, acetylcholine, serotonin, dopamine, and octopamine) with an average accuracy of 87% for individual synapses and 94% for entire neurons, assuming each neuron expresses only one neurotransmitter. This result is surprising as there are often no obvious cues in the EM images that human observers can use to predict neurotransmitter identity."

They are developing explainability techniques to try to figure out how the CNN does this classification (see the figures in this preprint). In addition to the custom methods they've developed, I know they have also used more bog-standard activation maximization techniques (personal communication with Jan Funke in January year). Jan told me he's read Chris Olah et al.'s publications in Distill. They think the network may be cueing in on subtle differences in the size/shape of vesicles.
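On the gap between 87% per synapse and 94% per neuron in the abstract: if every synapse of a neuron is assumed to carry the same transmitter, per-synapse predictions can be pooled, and pooling tends to wash out isolated errors. A minimal sketch, assuming a plain majority vote (the abstract does not say which aggregation rule was used; the function name and the sample labels are made up for illustration):

```python
from collections import Counter

def neuron_prediction(synapse_predictions):
    """Pool per-synapse labels into one label for the whole neuron.

    Assumption: plain majority vote; ties go to the label seen first.
    """
    if not synapse_predictions:
        raise ValueError("need at least one synapse prediction")
    return Counter(synapse_predictions).most_common(1)[0][0]

# Hypothetical per-synapse outputs for one neuron, two of them wrong.
synapses = ["GABA", "GABA", "glutamate", "GABA", "serotonin", "GABA"]
print(neuron_prediction(synapses))
```

A confidence-weighted vote over the classifier's per-class probabilities would be the obvious alternative if those are available.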


Gurkenglas 4y Ω130

Someone needs to check if we can use ML to guess activations in one set of neurons from activations in another set of neurons. The losses would give straightforward estimates of such statistical quantities as mutual information. Generating inputs that have the same activations in a set of neurons illustrates what the set of neurons does. I might do this myself if nobody else does.
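A toy version of the idea, to make the "loss gives you mutual information" step concrete. This is only a sketch under stated assumptions: the predictor is a least-squares line rather than a neural net, the two "neurons" are synthetic Gaussian data, and the bound I(A;B) >= 0.5 * log(Var(B) / MSE) is only valid when the predicted activation is roughly Gaussian. The function name `fit_and_score` is mine.

```python
import math
import random

def fit_and_score(xs, ys):
    """Fit a least-squares line predicting ys from xs.

    Returns (fraction of variance explained,
             Gaussian lower bound on mutual information in nats).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    # Mean squared error of the fitted predictor (the "loss").
    mse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n
    explained = 1.0 - mse / var_y
    mi_bound = 0.5 * math.log(var_y / mse)
    return explained, mi_bound

# Synthetic stand-ins for two neurons: b is a plus independent unit noise.
rng = random.Random(0)
neuron_a = [rng.gauss(0, 1) for _ in range(20000)]
neuron_b = [a + rng.gauss(0, 1) for a in neuron_a]
explained, mi_bound = fit_and_score(neuron_a, neuron_b)
print(f"variance explained: {explained:.2f}, MI lower bound: {mi_bound:.2f} nats")
```

A better predictor can only lower the loss, so the bound tightens as the activation-predicting model gets bigger.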


paulfchristiano 4y Ω220

I'm not clear on what you'd do with the results of that exercise. Suppose that on a certain distribution of texts you can explain 40% of the variance in half of layer 7 by using the other half of layer 7 (and the % gradually increases as you make the activation-predicting model bigger; perhaps you guess it's approaching 55% in the limit). What's the upshot of models being that predictable rather than more or less, or the use of the actual predictor that you learned?

Given an input x, generating other inputs that "look the same as x" to part of the model but not other parts seems like it reveals something about what that part of the model does. As a component of interpretability research that seems pretty similar to doing feature visualization or selecting input examples that activate a given neuron, and I'd guess it would fit in the same way into the project of doing interpretability.
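For concreteness, here is one way "inputs that look the same as x to part of the model" could be generated: gradient descent on the input until a chosen subset of units matches its activations on x. This is a sketch, not anyone's actual method; the one-layer tanh "model", its weights `W`, and the function names are all hypothetical.

```python
import math

def layer(weights, x):
    """Activations of a single tanh layer."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def match_subset(weights, x_ref, subset, x_init, lr=0.1, steps=2000):
    """Gradient descent on the input so units in `subset` match their values on x_ref."""
    target = layer(weights, x_ref)
    x = list(x_init)
    for _ in range(steps):
        h = layer(weights, x)
        grad = [0.0] * len(x)
        for i in subset:
            # Gradient of (h_i - target_i)^2 with respect to x, through tanh.
            coeff = 2.0 * (h[i] - target[i]) * (1.0 - h[i] ** 2)
            for j, w in enumerate(weights[i]):
                grad[j] += coeff * w
        x = [xj - lr * g for xj, g in zip(x, grad)]
    return x

# Hypothetical 3-input, 3-unit layer; units 0 and 1 are the part being probed.
W = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.5],
     [0.0, 0.0, 1.0]]
x_ref = [0.5, -0.3, 0.8]
x_new = match_subset(W, x_ref, subset=[0, 1], x_init=[0.0, 0.0, 0.0])
h_ref, h_new = layer(W, x_ref), layer(W, x_new)
print("probed units:", h_ref[:2], h_new[:2])
print("other unit:  ", h_ref[2], h_new[2])
```

The probed units end up agreeing while the remaining unit is free to differ, which is the contrast you would inspect. On a real model the starting points would presumably be drawn from the data distribution rather than set to zero.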

I'd mostly be excited about people developing these techniques as part of a focused project to understand what models are thinking. I'm not really sure what to make of them in isolation.


Gurkenglas 4y 20

"I'm not really sure what to make of them in isolation."

I score such techniques by how surprised I am at how well they fit together, as with all good math. In this case my evidence is this: my current approach is to thoroughly analyze the likes of mutual information for modularity only in the neighborhood of one input, since that is tractable with mere linear algebra, but an activation-predicting model is even less extra theory (since we were already working with neural nets) and just happens to produce, via its cross-entropy loss, the same KL divergences I'm already trying to measure.
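To illustrate what "tractable with mere linear algebra" can look like in the smallest case: linearize the network around one input, perturb the input with small isotropic Gaussian noise, and the two units become jointly Gaussian with correlation equal to the cosine between their input gradients. That noise model is my assumption for the sketch, not necessarily the setup described above, and `toy_net` is a made-up three-unit function.

```python
import math

def jacobian_row(f, x, unit, eps=1e-5):
    """Central-difference gradient of one output unit with respect to the input."""
    row = []
    for k in range(len(x)):
        up = list(x)
        up[k] += eps
        down = list(x)
        down[k] -= eps
        row.append((f(up)[unit] - f(down)[unit]) / (2 * eps))
    return row

def local_mutual_information(f, x, unit_a, unit_b):
    """MI in nats between two units under small isotropic Gaussian input noise.

    Assumes both units have a nonzero gradient at x.
    """
    ga = jacobian_row(f, x, unit_a)
    gb = jacobian_row(f, x, unit_b)
    dot = sum(p * q for p, q in zip(ga, gb))
    rho_sq = dot * dot / (sum(p * p for p in ga) * sum(q * q for q in gb))
    if rho_sq >= 1.0:
        return math.inf  # the two units are locally redundant
    return -0.5 * math.log(1.0 - rho_sq)

def toy_net(x):
    """Hypothetical network: two independent units and one that mixes both inputs."""
    return [math.tanh(x[0]), math.tanh(x[1]), math.tanh(x[0] + x[1])]

print(local_mutual_information(toy_net, [0.0, 0.0], 0, 1))
print(local_mutual_information(toy_net, [0.0, 0.0], 0, 2))
```

For sets of units rather than single units, the same quantity would come from log-determinants of blocks of the Jacobian's Gram matrix.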

IIRC you study problem decomposition. Would your results say I'll need the same magic natural language tools that would assemble descriptions for every hierarchy node from descriptions of its children in order to construct the hierarchy in the first place? Do they say anything about how to continuously go between hierarchies as the model trains? Have you tried describing how well a hierarchy decomposes a problem by the extent to which "a: TA -> A" which maps a list of subsolutions to a solution satisfies the square [diagram not shown] on that hierarchy?
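A small illustration, assuming the square in question is the usual associativity law for an algebra of the list monad: combining the children's solutions and then combining those results should equal combining everything after flattening. That reading is my assumption; the helper `square_defect` and the numeric "solutions" are invented for the example.

```python
def square_defect(a, nested):
    """|a(map a) - a(flatten)| on one nested list; zero when the square commutes."""
    via_children = a([a(part) for part in nested])
    via_flatten = a([item for part in nested for item in part])
    return abs(via_children - via_flatten)

def mean(values):
    return sum(values) / len(values)

nested = [[1, 2], [3]]
print(square_defect(sum, nested))   # sum commutes with flattening
print(square_defect(mean, nested))  # mean does not when parts differ in size
```

Averaging such a defect over the nodes of a hierarchy would give one number for "the extent to which" the square holds.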


Gurkenglas 4y* 20

If you can find two halves with little mutual information, you can understand one before having understood the other. I suspect that interpreting a model should be decomposed by hierarchically clustering neurons using such measurements. Since the measurement is differentiable, you can train a network for modularity to make this work better.
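A sketch of the clustering step, assuming pairwise mutual information estimates between neurons are already in hand. The linkage rule (greedy merging by highest average pairwise MI) is one choice among many, and the 4x4 matrix is made-up data with two obvious modules.

```python
def cluster_by_mi(mi):
    """Greedy agglomerative clustering on a symmetric MI matrix.

    Repeatedly merges the two clusters with the highest average pairwise MI.
    Returns the merges in order as (cluster_a, cluster_b, score).
    """
    clusters = [(i,) for i in range(len(mi))]
    merges = []

    def avg_mi(a, b):
        return sum(mi[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(avg_mi(a, b), a, b)
                 for idx, a in enumerate(clusters)
                 for b in clusters[idx + 1:]]
        score, a, b = max(pairs)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, score))
    return merges

# Hypothetical MI estimates (nats): neurons 0,1 and 2,3 form two modules.
mi = [[0.00, 0.90, 0.10, 0.05],
      [0.90, 0.00, 0.08, 0.10],
      [0.10, 0.08, 0.00, 0.80],
      [0.05, 0.10, 0.80, 0.00]]
merges = cluster_by_mi(mi)
for a, b, score in merges:
    print(a, b, round(score, 4))
```

The last merge has the lowest score, so the top split of the hierarchy is the pair of groups that share the least information, i.e. the two halves you could try to understand separately.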

It sure is similar to feature visualization! I prefer it because it doesn't go out of distribution and doesn't feel like it implicitly assumes that the model implements a linear function.

I agree that interpretability is the purpose and the cure.


Comments on OpenPhil's Interpretability RFP