Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.

Circuits Thread (Various people) (summarized by Rohin): We’ve previously (AN #111) summarized the three main claims of the Circuits thread. Since then, several additional articles in the series have been published, so now seems like a good time to take another look at the work in this space.

The three main claims of the Circuits agenda, introduced in Zoom In, are:

1. Neural network features - the activation values of hidden layers - are understandable.

2. Circuits - the weights connecting these features - are also understandable.

3. Universality - when training different models on different tasks, you will get analogous features.

To support the first claim, there are seven different techniques that we can use to understand neural network features that all seem to work quite well in practice:

1. Feature visualization produces an input that maximally activates a particular neuron, thus showing what that neuron is “looking for”.

2. Dataset examples are the inputs in the training dataset that maximally activate a particular neuron, which provide additional evidence about what the neuron is “looking for”.

3. Synthetic examples created independently of the model or training set can be used to check a particular hypothesis for what the neuron activates on.

4. Tuning involves perturbing an image and seeing how the neuron activations change. For example, we might rotate images and see how that impacts a curve detector.

5. By looking at the circuit used to create a neuron, we can read off the algorithm that implements the feature, and make sure that it makes intuitive sense.

6. We can look to see how the feature is used in future circuits.

7. We can handwrite circuits that implement the feature after we understand how it works, in order to check our understanding and make sure it performs as well as the learned circuit.

Note that since techniques 5-7 are based on understanding circuits, their empirical success also supports claim 2.
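As a rough illustration of technique 4 (tuning), here is a minimal sketch that rotates a stimulus and records how a neuron's activation changes. The "neuron" is a hand-written line template, not a neuron from any real network, and the names `line_image`, `make_toy_neuron`, and `tuning_curve` are made up for this example.

```python
import math

SIZE = 15  # side length of the toy image, in pixels


def line_image(angle_deg, size=SIZE):
    """Render a soft-edged line through the image centre at the given orientation."""
    centre = (size - 1) / 2
    theta = math.radians(angle_deg)
    # Unit normal to the line; brightness falls off with distance from the line.
    nx, ny = -math.sin(theta), math.cos(theta)
    return [
        [math.exp(-(((x - centre) * nx + (y - centre) * ny) ** 2) / 2.0) for x in range(size)]
        for y in range(size)
    ]


def make_toy_neuron(preferred_angle_deg):
    """Hypothetical stand-in for a learned neuron: a zero-mean line template plus ReLU."""
    template = line_image(preferred_angle_deg)
    mean = sum(map(sum, template)) / (SIZE * SIZE)
    weights = [[w - mean for w in row] for row in template]

    def neuron(image):
        pre_activation = sum(
            w * p for w_row, p_row in zip(weights, image) for w, p in zip(w_row, p_row)
        )
        return max(0.0, pre_activation)

    return neuron


def tuning_curve(neuron, angles):
    """Perturb the stimulus (here: rotate it) and record the activation at each step."""
    return {angle: neuron(line_image(angle)) for angle in angles}


neuron = make_toy_neuron(preferred_angle_deg=0)
curve = tuning_curve(neuron, range(0, 180, 15))
for angle, activation in curve.items():
    print(f"{angle:3d} deg  {activation:6.2f}  " + "#" * int(activation))
```

The printed bars peak at the template's own orientation and fall away as the stimulus rotates, which is the general shape one would look for in a tuning curve for an orientation-selective feature.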

The third claim, that the features learned are universal, is the most speculative. There isn’t direct support for it, but anecdotally it does seem to be the case that many vision networks do in fact learn the same features, especially in the earlier layers that learn lower-level features.

In order to analyze circuits, we need to develop interpretability tools that work with them. Just as activations were the primary object of interest in understanding the behavior of a model on a given input, weights are the primary object of interest in circuits. Visualizing Weights explains how we can adapt all of the techniques from Building Blocks to this setting. Just as in Building Blocks, a key idea is to “name” each neuron with its feature visualization: this makes each neuron meaningful to humans, in the same way that informative variable names make the variable more meaningful to a software engineer.

Once we have “named” our neurons, our core operation is to visualize the matrix of weights connecting one neuron in layer L to another in layer L+1. By looking at this visualization, we can see the “algorithm” (or at least part of the algorithm) for producing the layer L+1 neuron given the layer L neuron. The other visualizations are variations on this core operation: for example, we might instead visualize weights for multiple input neurons at once, or for a group of neurons together, or for neurons that are separated by more than one layer.
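To make the core operation concrete, here is a small sketch of pulling out the weights between one "named" input neuron and one "named" output neuron and rendering them. The neuron names, the 3x3 kernel size, and all of the numbers are invented for illustration; a real analysis would read the weights from a trained model and draw them as images rather than text.

```python
# Hypothetical 3x3 conv weights, indexed as weights[output_neuron][input_neuron].
# Both the neuron names and the numbers are invented for illustration.
weights = {
    "curve_facing_up": {
        "line_horizontal": [
            [-0.6, -0.7, -0.6],
            [0.1, 0.9, 0.1],
            [0.0, 0.1, 0.0],
        ],
        "line_vertical": [
            [0.0, -0.5, 0.0],
            [0.8, -0.5, 0.8],
            [0.0, -0.5, 0.0],
        ],
    }
}


def render_connection(weights, output_neuron, input_neuron, threshold=0.25):
    """Show where the input neuron excites (+), inhibits (-), or is ignored (.)."""
    kernel = weights[output_neuron][input_neuron]
    rows = []
    for row in kernel:
        rows.append(
            "".join("+" if w > threshold else "-" if w < -threshold else "." for w in row)
        )
    return "\n".join(rows)


for input_neuron in weights["curve_facing_up"]:
    print(f"{input_neuron} -> curve_facing_up")
    print(render_connection(weights, "curve_facing_up", input_neuron))
```

Because each neuron carries a human-readable name, the rendered grid can be read as a statement like "this input excites the output here and inhibits it there", which is the sense in which the weights expose part of the algorithm.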

We then apply our machinery to curve detectors, a collection of 10 neurons that detect curves. We first use techniques 1-4 (which don’t rely on circuits) to understand what the neurons are doing in Curve Detectors. This is probably my favorite post of the entire thread, because it spends ~7000 words and many, many visualizations digging into just 10 neurons, and it is still consistently interesting. It gives you a great sense of just how complex the behavior learned by neural networks can be. Unfortunately, it’s hard to summarize (both because there’s a lot of content and because the visualizations are so core to it); I recommend reading it directly.

Curve Circuits delves into the analysis of how the curve detectors actually work. The key highlight of this post is that one author set values for weights without looking at the original network, and recreated a working curve detection algorithm (albeit not one that was robust to colors and textures, as in the original neural network). This is a powerful validation that at least that author really does “understand” the curve detectors. It’s not a small network either -- it takes around 50,000 parameters. Nonetheless, it can be described relatively concisely:

Gabor filters turn into proto-lines which build lines and early curves. Finally, lines and early curves are composed into curves. In each case, each shape family (eg conv2d2 line) has positive weight across the tangent of the shape family it builds (eg. 3a early curve). Each shape family implements the rotational equivariance motif, containing multiple rotated copies of approximately the same neuron.

(Yes, that’s a lot of jargon; it isn’t really meant to be understood just from that paragraph. The post goes into much more detail.)

So why is it amenable to such a simple explanation? Equivariance helps a lot in reducing the amount we need to explain. Equivariance occurs when there is a kind of symmetry across the learned neurons, such that there’s a single motif copied across several neurons, except transformed each time. A typical example of equivariance is rotational equivariance, where the feature is simply rotated some amount. Curve detectors show rotational equivariance: we have 10 different detectors that are all implemented in roughly the same way, except that the orientation is changed slightly. Many of the underlying neurons also show this sort of equivariance. In this case, we only need to understand one of the circuits and the others follow naturally. This reduces the amount to understand by about 50x.
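The following toy sketch shows why an equivariant family is cheap to understand: every detector is the same motif rotated, so the responses to a rotated input are just the original responses moved to a different family member. It uses exact 90-degree rotations of a made-up 3x3 kernel, which is a simplification; the curve detectors described above differ by smaller orientation steps and are only approximately rotated copies.

```python
def rotate90(grid):
    """Rotate a square grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def respond(kernel, patch):
    """Activation of one detector on one patch: a plain dot product."""
    return sum(k * p for k_row, p_row in zip(kernel, patch) for k, p in zip(k_row, p_row))


# One hand-written motif (hypothetical): a detector for a corner-like shape.
base_kernel = [
    [1, 1, 0],
    [1, 0, -1],
    [0, -1, -1],
]

# The whole family is the same motif rotated by 0, 90, 180 and 270 degrees.
family = [base_kernel]
for _ in range(3):
    family.append(rotate90(family[-1]))

patch = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

# Rotating the input by k quarter-turns moves the strongest response to detector k.
all_responses = []
rotated_patch = patch
for k in range(4):
    responses = [respond(kernel, rotated_patch) for kernel in family]
    all_responses.append(responses)
    print(k, responses)
    rotated_patch = rotate90(rotated_patch)
```

Each printed row is the previous row shifted by one position, so once the base detector is understood, the behavior of the other three follows without further analysis.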

This isn’t the only type of equivariance: you can also have e.g. neuron 1 that detects A on the left and B on the right, and neuron 2 that detects A on the right and B on the left, which is a reflection equivariance. If you trained a multilayer perceptron (MLP) on images, it would probably learn translational equivariance, where you’d see multiple neurons computing the same function but for different spatial locations of the image. (CNNs build translational equivariance into the architecture because it is so common.)
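As a small check of what "building translational equivariance into the architecture" means, the sketch below applies one shared kernel at every position of a 1-D signal and confirms that shifting the input simply shifts the output. It assumes wrap-around (circular) padding so that the equality is exact; the signal and kernel values are arbitrary.

```python
def shift(signal, offset):
    """Circularly shift a 1-D signal to the right by `offset` positions."""
    offset %= len(signal)
    return signal[-offset:] + signal[:-offset] if offset else list(signal)


def conv1d_circular(signal, kernel):
    """Cross-correlate with wrap-around padding, so every position uses the same weights."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


signal = [0, 0, 1, 3, 1, 0, 0, 0]
edge_kernel = [-1, 0, 1]  # toy shared weights

shifted_then_convolved = conv1d_circular(shift(signal, 2), edge_kernel)
convolved_then_shifted = shift(conv1d_circular(signal, edge_kernel), 2)
print(shifted_then_convolved)
print(convolved_then_shifted)
```

The two printed lists are identical. An MLP has no such weight sharing, so it would have to learn a separate copy of the same weights for each location to get the same behavior.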

The authors go into detail by showing many circuits that involve some sort of equivariance. Note that equivariance is a property of features (neurons) rather than of the circuits that connect them; nonetheless, it is reflected in the structure of the weights that make up the circuit. These circuits can be of three types: invariant-equivariant circuits (building a family of equivariant features starting from invariant features), equivariant-invariant circuits, and equivariant-equivariant circuits.

High-Low Frequency Detectors applies all of these tools to understand a family of features that detect areas with high frequencies on one side, and low frequencies on the other. It’s mostly interesting as another validation of the tools, especially because these detectors were found through interpretability techniques and weren’t hypothesized before (whereas we could have and maybe did predict a priori that an image classifier would learn to detect curves).

Rohin's opinion: I really like the Circuits line of research; it has made pretty incredible progress in the last few years. I continue to be amazed at the power of “naming” neurons with their feature visualizations; it feels a bit shocking how helpful this is. (Though like most good ideas, in hindsight it’s blatantly obvious.) The fact that it is possible to understand 50,000 parameters well enough to reimplement them from scratch seems like a good validation of this sort of methodology. Note though that we are still far from the billions of parameters in large language models.

One thing that consistently impresses me is how incredibly thorough the authors are. I often find myself pretty convinced by the first three distinct sources of evidence they show me, and then they just continue on and show me another three distinct sources of evidence. This sort of care and caution seems particularly appropriate when attempting to make claims about what neural networks are and aren’t doing, given how hard it has historically seemed to be to make reasonable claims of that sort.

I’m pretty curious what equivariance in natural language looks like. Is it that equivariance is common in nearly all neural nets trained on “realistic” tasks, or is equivariance a special feature of vision?

Multimodal Neurons in Artificial Neural Networks (Gabriel Goh et al) (summarized by Rohin): CLIP is a large model that was trained to learn a separate embedding for images and text, such that the embedding for an image is maximally similar to the embedding for its caption. This paper uses feature visualization and dataset examples to analyze the vision side of the model, and shows that there are many multimodal neurons. For example, there is a Spiderman neuron that responds not just to pictures of Spiderman, but also to sketches of Spiderman, and even the word “spider” (written in the image, not the caption). The neurons are quite sophisticated, activating not just on instances of the concept, but also things that are related to the concept. For example, the Spiderman neuron is also activated by images of other heroes and villains from the Spiderman movies and comics. There are lots of other neurons that I won’t go into, such as neurons for famous people, regions, facial emotions, religions, holidays, abstract concepts, numbers, and text.

Unsurprisingly, many of these neurons encode stereotypes that we might consider problematic: for example, there is an immigration neuron that responds to Latin America, and a terrorism neuron that responds to the Middle East.

The concepts learned by CLIP also have some notion of hierarchy and abstraction. In particular, the authors find that when they train a sparse linear classifier on top of the CLIP features, the resulting classifier has a “hierarchy” that very approximately matches the hierarchy used to organize the ImageNet classes in the first place -- despite the fact that CLIP was never trained on ImageNet at all. (I’m not sure how approximate this match is.)

As mentioned before, the neurons can respond to text in the image, and in a few cases they can even respond to text in different languages. For example, a “positivity” neuron responds to images of English “Thank You”, French “Merci”, German “Danke”, and Spanish “Gracias”. The fact that the model is so responsive to text in images means that it is actually very easy to influence its behavior. If we take an apple (originally correctly classified as a Granny Smith) and tape on a piece of paper with the word “iPod” written on it, it will now be classified as an iPod with near certainty. This constitutes a new and very easy to execute “typographic” adversarial attack.

However, not everything that CLIP is capable of can be explained with our current interpretability techniques. For example, CLIP is often able to tell whether an image is from San Francisco (and sometimes even what region within San Francisco), but the authors were not able to find a San Francisco neuron, nor did it look like there was a computation like “California + city”.

Read more: Distill paper

Rohin's opinion: The “typographic adversarial attack” is interesting as a phenomenon that happens, but I’m not that happy about the phrasing -- it suggests that CLIP is dumb and making an elementary mistake. It’s worth noting here that what’s happening is that CLIP is being asked to look at an image of a Granny Smith apple with a piece of paper saying “iPod” on it, and then to complete the caption “an image of ???” (or some other similar zero-shot prompt). It is quite possible that CLIP “knows” that the image contains a Granny Smith apple with a piece of paper saying “iPod”, but when asked to complete the caption with a single class from the ImageNet classes, it ends up choosing “iPod” instead of “Granny Smith”. I’d caution against saying things like “CLIP thinks it is looking at an iPod”; this seems like too strong a claim given the evidence that we have right now.
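To illustrate the forced choice described here, the sketch below does zero-shot classification the way it is usually described for CLIP-style models: compare the image embedding with the embedding of each candidate prompt and take a softmax over the scaled cosine similarities. Everything numeric is an assumption: the 3-dimensional vectors are made up rather than produced by CLIP, the prompt strings are hypothetical, and `scale=100.0` is just a plausible temperature-like constant.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def zero_shot(image_embedding, prompt_embeddings, scale=100.0):
    """Softmax over scaled cosine similarities; the output must be one of the given prompts."""
    logits = {p: scale * cosine(image_embedding, e) for p, e in prompt_embeddings.items()}
    top = max(logits.values())
    exps = {p: math.exp(value - top) for p, value in logits.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}


# Made-up embeddings: axis 0 ~ "apple-like", axis 1 ~ "iPod text", axis 2 ~ unrelated.
image = [0.6, 0.7, 0.0]  # toy image containing both an apple and the word "iPod"
single_class_prompts = {
    "Granny Smith": [1.0, 0.0, 0.0],
    "iPod": [0.0, 1.0, 0.0],
    "library": [0.0, 0.0, 1.0],
}
richer_prompts = dict(single_class_prompts)
richer_prompts["apple with an iPod label"] = [0.7, 0.7, 0.1]

for prompts in (single_class_prompts, richer_prompts):
    probabilities = zero_shot(image, prompts)
    print(max(probabilities, key=probabilities.get))
```

In this toy setup the embedding encodes both the apple and the text, yet the single-class prompt set can only return one of them. Whether CLIP's real embeddings behave this way is an empirical question that the sketch does not answer.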

Pixels still beat text: Attacking the OpenAI CLIP model with text patches and adversarial pixel perturbations (Stanislav Fort) (summarized by Rohin): Typographic adversarial examples demonstrate that CLIP can be significantly affected by text in an image. How powerfully does text affect CLIP, and how does it compare to more traditional attack vectors like imperceptible pixel changes? This blog post seeks to find out, through some simple tests on CIFAR-10.

First, to see how much text can affect CLIP’s performance, we add a handwritten label to each of the test images that spells out the class (so a picture of a deer would have a handwritten sticker with the word “deer” overlaid on it). This boosts CLIP’s zero-shot performance on CIFAR-10 from 87.37% to literally 100% (not a single mistake), showing that text really is quite powerful in affecting CLIP’s behavior.

You might think that since text can boost performance so powerfully, CLIP would at least be more robust against pixel-level attacks when the sticker is present. However, this does not seem to be true: even when there is a sticker with the true class, a pixel-level attack works quite well (and is still imperceptible).

This suggests that while the text is powerful, pixel-level changes are more powerful still. To test this, we can try adding another, new sticker (again with the true label) to the attacked image. It turns out that this does successfully switch the label back to the original correct label. In general, you can keep iterating the text sticker attack and the pixel-change attack, and the attacks keep working, with CLIP’s classification being determined by whichever attack was performed most recently.
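For readers unfamiliar with pixel-level attacks, here is a minimal sketch of why a tiny per-pixel change can flip a classifier's label. It takes a single signed-gradient step, in the style of the fast gradient sign method, against a random two-class linear classifier. This is a stand-in chosen for simplicity: it is not CLIP, it is not necessarily the attack used in the blog post, and it skips clipping pixels back to a valid range.

```python
import random


def logits(weights, x):
    return [sum(w * p for w, p in zip(row, x)) for row in weights]


def predict(weights, x):
    scores = logits(weights, x)
    return scores.index(max(scores))


def sign(v):
    return (v > 0) - (v < 0)


rng = random.Random(0)
NUM_PIXELS = 1024
# Toy two-class linear classifier and input; both are random stand-ins, not CLIP.
weights = [[rng.uniform(-1, 1) for _ in range(NUM_PIXELS)] for _ in range(2)]
x = [rng.uniform(0, 1) for _ in range(NUM_PIXELS)]

original = predict(weights, x)
other = 1 - original

# For a linear model, the gradient of (other logit - original logit) with respect
# to the input is just the difference of the two weight rows.
gradient = [wo - wt for wo, wt in zip(weights[other], weights[original])]

# Smallest uniform step that closes the margin, plus 1% so the label actually flips.
scores = logits(weights, x)
margin = scores[original] - scores[other]
epsilon = 1.01 * margin / sum(abs(g) for g in gradient)

x_adv = [p + epsilon * sign(g) for p, g in zip(x, gradient)]
print(f"epsilon = {epsilon:.4f}, label {original} -> {predict(weights, x_adv)}")
```

Every pixel moves by at most `epsilon`, but all of the moves push in the direction that hurts the original class, so their effect adds up across the whole image. That accumulation is the usual intuition for why such perturbations can stay imperceptible.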

You might think that the model's ability to read text is fairly brittle, and that's what's being changed by pixel-level attacks, hence adding a fresh piece of text would switch it back. Unfortunately, it doesn't seem like anything quite that simple is going on. The author conducts several experiments where only the sticker can be adversarially perturbed, or everything but the sticker can be adversarially perturbed, or where the copy-pasted sticker is one that was previously adversarially perturbed, but the results don't seem to tell a clean story.

Rohin's opinion: This is quite an interesting phenomenon, and I'm pretty curious to understand what's going on here. Maybe that's an interesting new challenge for people interested in Circuits-style interpretability? My pretty uneducated guess is that it seems difficult enough to actually stress our techniques, but not so difficult that we can't make any progress.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.


It is quite possible that CLIP “knows” that the image contains a Granny Smith apple with a piece of paper saying “iPod”, but when asked to complete the caption with a single class from the ImageNet classes, it ends up choosing “iPod” instead of “Granny Smith”. I’d caution against saying things like “CLIP thinks it is looking at an iPod”; this seems like too strong a claim given the evidence that we have right now.

Yes, it's already been solved. These are 'attacks' only in the most generous interpretation possible (since it does know the difference), and the fact that CLIP can read text in images to, arguably, correctly note the semantic similarity in embeddings, is to its considerable credit. As the CLIP authors note, some queries benefit from ensembling and from more context than a single-word class name (such as prefixing "A photograph of a "), and class names can be highly ambiguous: in ImageNet, the class name "crane" could refer to the bird or construction equipment; and the Oxford-IIIT Pet dataset labels one class "boxer".

Ah excellent, thanks for the links. I'll send the Twitter thread in the next newsletter with the following summary:

Last week I speculated that CLIP might "know" that a textual adversarial example is a "picture of an apple with a piece of paper saying an iPod on it" and the zero-shot classification prompt is preventing it from demonstrating this knowledge. Gwern Branwen commented to link me to this Twitter thread as well as this YouTube video, which show that better prompt engineering significantly reduces these textual adversarial examples, demonstrating that CLIP does "know" that it's looking at an apple with a piece of paper on it.

Related: Interpretability vs Neuroscience: Six major advantages which make artificial neural networks much easier to study than biological ones. Probably not a major surprise to readers here.


I really like this article.  Just a bit of brainstorming, something that has almost certainly been tried:

   a.  You can feed in all your examples of a specific class into these image detectors and get a list of which neurons have non-negligible activations

   b.  You can then break a trained and effective [multi-class image detector] into separate [single class detectors] from just the neurons that 'matter'

   c.  Some of these detectors may have 'evolved' a better way to detect their class than others.  You can then do a variety of mixing and matching by architecture transfer from a 'good' detector to a weak detector.  

Essentially my thought is that while it may be exhausting in human effort to really 'understand' how a given subsystem works, you should be able to get closer to true global maxima in performance by taking the best examples of a repeating element that gets rediscovered over and over, and basically plugging it in everywhere this element will help.

It seems like there are many different instances of 'duplicated function' in networks, and like duplicated code, some of the code is going to be better than others.

Moreover, these methods would be general-purpose.