Zoom In: An Introduction to Circuits

Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan

doi:10.23915/distill.00024.001

This essay introduces some new terminology, and also uses some terminology which isn’t common. To help, we provide the following glossary:

Circuit - A subgraph of a neural network. Nodes correspond to neurons or directions (linear combinations of neurons). Two nodes have an edge between them if they are in adjacent layers. The edges have weights which are the weights between those neurons (or n₁Wn₂^T if the nodes are linear combinations). For convolutional layers, the weights are 2D matrices representing the weights for different relative positions of the layers.

Circuit Motif - A recurring, abstract pattern found in circuits, such as equivariance or unioning over cases. Inspired by the use of circuit motifs in systems biology .

Client Neuron or Client Feature - A neuron in a later layer which relies on a particular earlier neuron. For example, a circle detector is a client of curve detectors.

Direction - A linear combination of neurons in a layer. Equivalently, a vector in the representation of a layer. A direction can be an individual neuron (which is a basis direction of the vector space). For intuition about directions as an object, see Building Blocks (in particular, the section titled “What Does the Network See?“) and Activation Atlases.

Downstream / Upstream - In a later layer / In an earlier layer.

Equivariance - For equivariance in the context of circuits (eg. equivariant features, equivariant circuits), see the circuit article on equivarinace. For the more general idea of equivariance in mathematics, see the wikipedia equivariant map article.

Family - A set of features found in one layer which detect small variations of the same thing. For example, curve detectors exist in a family detecting curves in different orientations.

Feature - A scalar function of the input. In this essay, neural network features are directions, and often simply individual neurons. We claim such features in neural networks are typically meaningful features which can be rigorously studied (Claim 1).

Meaningful Feature - A feature that genuinely responds to an articulable property of the input, such as the presence of a curve or a floppy ear. Meaningful features may still be noisy or imperfect.

Polysemantic Feature - A feature that responds to multiple unrelated latent variables, such as the cat/car neuron . This can be seen a special case of a “multifaceted features” which responds to multiple different cases, but include both “real” multi-faceted features such as the pose-invariant dog head or polysemantic neurons. Contrast with pure.

Pure Feature - A feature which responds to only a single latent variable. Contrast with polysemantic.

Universal Feature - A feature which reliably forms across different models and tasks.

Representation - The vector space formed by the activations of all neurons in a layer, with vectors of the form (activation of neuron 1, activation of neuron 2, …). A representation can be thought of as the collection of all features that exist in a layer. For intuition about representations in vision models, see Activation Atlases.

Author Contributions

Writing: The text of this essay was primarily written by Christopher Olah, drawing extensively on the research and thinking of the entire Clarity team. Nick Cammarata was deeply involved in developing the framing and revising the final text.

Research: This essay articulates themes that developed as a result of several people’s research into how neural networks implement features. Chris began initial attempts to understand the mechanistic implementations of neurons in terms of their weights in 2018, and developed several tools that enabled this line of work. This work was extended by Gabriel Goh, who discovered the first of what we now call motifs (using negative weights for specialization), in addition to describing the mechanisms behind several neurons. At this point, Nick Cammarata took up this line of research to characterize much larger and deeper circuits, greatly expand the number of neurons we understood mechanistically, and performed detailed, rigorous characterizations of curve detectors. Nick also introduced the connection to Systems biology. Ludwig Schubert performed detailed analysis of high-low frequency detectors. Chris gave research advice and mentorship throughout.

Infrastructure: Michael Petrov, Shan Carter, Ludwig and Nick built a variety of infrastructural tools which made our research possible.

Historical Note

The ideas in this introductory essay were previously presented as a keynote talk by Chris Olah at VISxAI 2019. It was also informally presented at MILA, the Vector Institute, the Redwood Center for Neuroscience, and a private workshop.

Acknowledgments

All our work understanding InceptionV1 is indebted to Alex Mordvintsev, whose early explorations of vision models paved pathways we still follow. We’re deeply grateful to Nick Barry, and Sophia Sanborn for their deep engagement on potential connections between our work and neuroscience, and to Tom McGrath who pointed out the similarities between Kuhn’s “pre-paradigmatic science” and the state interpretability as a field to us. The careful comments and criticism of Brice Menard were also invaluable in sharpening this essay.

In addition to Nick and Sophia’s deep engagement, we’re more generally appreciative of the neuroscience community’s engagement with us, especially in sharing hard-won lessons about methodological weaknesses in our work. In particular, we appreciate Brian Wandell pushing us in 2019 on not using tuning curves and the importance of families of neurons, which we think has made our work much stronger. We’re also very grateful for the comments and support of Mareike Grotheer, Natalia Bilenko, Bruno Olshausen, Michael Eickenberg, Charles Frye, Philip Sabes, Paul Merolla, James Redd, Thong-Wei Koh, and Ivan Alvarez. We think we have a lot to learn from the neuroscience community and are excited to continue doing so.

One of the privileges of working on circuits has been the open collaboration and feedback in the Distill slack’s #circuits channel. We’ve especially appreciated the detailed feedback we received from Stefan Sietzen, Shahab Bakhtiari, and Flora Liu (Stefan has additionally run with many of these ideas, and we’re excited to see his work in future articles in this thread!).

We benefitted greatly from the comments of many people on meta-science and framing questions around this essay, but especially appreciated the comments of Arvind Satyanarayan, Miles Brundage, Amanda Askell, Aaron Courville, and Martin Wattenberg. We’re grateful to Taco Cohen, Tess Smidt, and Sara Sabour for their extremely helpful comments on equivariance. We’re grateful to Nikita Obidin, Nick Barry, and Chelsea Voss for helpful conversation and references about systems biology and circuit motifs. (Nikita and Nick initially introduced Nick Cammarata to circuit motifs.) Finally, we’re grateful for the institutional support of OpenAI, and for the support and comments of all our colleagues and friends across institutions, including Dario Amodei, Daniela Amodei, Jonathan Uesato, Laura Ball, Katarina Slama, Alethea Power, Jacob Hilton, Jacob Steinhardt, Tom Brown, Preetum Nakkiran, Ilya Sutskever, Ryan Lowe, Erin McCloskey, Eli Chen, Fred Hohman, Jason Yosinski, Pallavi Koppol, Reiichrio Nakano, Sam McCandlish, Daniel Dewey, Anna Goldie, Jochen Görtler, Hendrik Strobelt, Ravi Chunduru, Tom White, Roger Grosse, David Duvenaud, Daniel Burkhardt, Janelle Tam, Jeff Clune, Christian Szegedy, Alec Radford, Alex Ray, Evan Hubinger, Scott Gray, Augustus Odena, Mikhial Pavlov, Daniel Filan, Jascha Sohl-Dickstein and Kris Sankaran.

[hooke1666micrographia] Micrographia: or Some Physiological Descriptions of Minute Bodies Made by Magnifying Glasses. With Observations and Inquiries Thereupon [link]
Hooke, R., 1666. The Royal Society. DOI: 10.5962/bhl.title.904

[karpathy2015visualizing] Visualizing and understanding recurrent networks [PDF]
Karpathy, A., Johnson, J. and Fei-Fei, L., 2015. arXiv preprint arXiv:1506.02078.

[erhan2009visualizing] Visualizing higher-layer features of a deep network [PDF]
Erhan, D., Bengio, Y., Courville, A. and Vincent, P., 2009. University of Montreal, Vol 1341, pp. 3.

[olah2017feature] Feature Visualization [link]
Olah, C., Mordvintsev, A. and Schubert, L., 2017. Distill. DOI: 10.23915/distill.00007

[simonyan2013deep] Deep inside convolutional networks: Visualising image classification models and saliency maps [PDF]
Simonyan, K., Vedaldi, A. and Zisserman, A., 2013. arXiv preprint arXiv:1312.6034.

[nguyen2015deep] Deep neural networks are easily fooled: High confidence predictions for unrecognizable images [PDF]
Nguyen, A., Yosinski, J. and Clune, J., 2015. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427--436. DOI: 10.1109/cvpr.2015.7298640

[mordvintsev2015inceptionism] Inceptionism: Going deeper into neural networks [HTML]
Mordvintsev, A., Olah, C. and Tyka, M., 2015. Google Research Blog.

[nguyen2016plug] Plug & play generative networks: Conditional iterative generation of images in latent space [PDF]
Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A. and Yosinski, J., 2016. arXiv preprint arXiv:1612.00005.

[zeiler2014visualizing] Visualizing and understanding convolutional networks [PDF]
Zeiler, M.D. and Fergus, R., 2014. European conference on computer vision, pp. 818--833.

[fong2017interpretable] Interpretable Explanations of Black Boxes by Meaningful Perturbation [PDF]
Fong, R. and Vedaldi, A., 2017. arXiv preprint arXiv:1704.03296.

[kindermans2017patternnet] PatternNet and PatternLRP--Improving the interpretability of neural networks [PDF]
Kindermans, P., Schutt, K.T., Alber, M., Muller, K. and Dahne, S., 2017. arXiv preprint arXiv:1705.05598. DOI: 10.1007/978-3-319-10590-1_53

[reif2019visualizing] Visualizing and Measuring the Geometry of BERT [PDF]
Reif, E., Yuan, A., Wattenberg, M., Viegas, F.B., Coenen, A., Pearce, A. and Kim, B., 2019. Advances in Neural Information Processing Systems, pp. 8592--8600.

[carter2019activation] Activation atlas [link]
Carter, S., Armstrong, Z., Schubert, L., Johnson, I. and Olah, C., 2019. Distill, Vol 4(3), pp. e15. DOI: 10.23915/distill.00015

[hohman2019summit] Summit: Scaling Deep Learning Interpretability by Visualizing Activation and Attribution Summarizations [PDF]
Hohman, F., Park, H., Robinson, C. and Chau, D.H.P., 2019. IEEE Transactions on Visualization and Computer Graphics, Vol 26(1), pp. 1096--1106. IEEE.

[mikolov2013distributed] Distributed representations of words and phrases and their compositionality [PDF]
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J., 2013. Advances in neural information processing systems, pp. 3111--3119.

[radford2017learning] Learning to generate reviews and discovering sentiment [PDF]
Radford, A., Jozefowicz, R. and Sutskever, I., 2017. arXiv preprint arXiv:1704.01444.

[zhou2014object] Object detectors emerge in deep scene cnns [PDF]
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A., 2014. arXiv preprint arXiv:1412.6856.

[netdissect2017] Network Dissection: Quantifying Interpretability of Deep Visual Representations [PDF]
Bau, D., Zhou, B., Khosla, A., Oliva, A. and Torralba, A., 2017. Computer Vision and Pattern Recognition.

[donnelly2019interpretability] On Interpretability and Feature Representations: An Analysis of the Sentiment Neuron
Donnelly, J. and Roegiest, A., 2019. European Conference on Information Retrieval, pp. 795--802.

[jo2017measuring] Measuring the tendency of CNNs to Learn Surface Statistical Regularities [PDF]
Jo, J. and Bengio, Y., 2017. arXiv preprint arXiv:1711.11561.

[geirhos2018imagenet] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness [PDF]
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A. and Brendel, W., 2018. arXiv preprint arXiv:1811.12231.

[brendel2019approximating] Approximating cnns with bag-of-local-features models works surprisingly well on imagenet [PDF]
Brendel, W. and Bethge, M., 2019. arXiv preprint arXiv:1904.00760.

[ilyas2019adversarial] Adversarial examples are not bugs, they are features [PDF]
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B. and Madry, A., 2019. Advances in Neural Information Processing Systems, pp. 125--136.

[morcos2018importance] On the importance of single directions for generalization [PDF]
Morcos, A.S., Barrett, D.G., Rabinowitz, N.C. and Botvinick, M., 2018. arXiv preprint arXiv:1803.06959.

[lecun2015deep] Deep learning [PDF]
LeCun, Y., Bengio, Y. and Hinton, G., 2015. nature, Vol 521(7553), pp. 436--444. Nature Publishing Group.

[szegedy2015going] Going deeper with convolutions [PDF]
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A. and others,, 2015. DOI: 10.1109/cvpr.2015.7298594

[hubel1962receptive] Receptive fields, binocular interaction and functional architecture in the cat's visual cortex
Hubel, D.H. and Wiesel, T.N., 1962. The Journal of physiology, Vol 160(1), pp. 106--154. Wiley Online Library.

[carter2017using] Using Artificial Intelligence to Augment Human Intelligence [link]
Carter, S. and Nielsen, M., 2017. Distill. DOI: 10.23915/distill.00009

[nguyen2016multifaceted] Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks [PDF]
Nguyen, A., Yosinski, J. and Clune, J., 2016. arXiv preprint arXiv:1602.03616.

[alon2019introduction] An introduction to systems biology: design principles of biological circuits
Alon, U., 2019. CRC press. DOI: 10.1201/9781420011432

[li2015convergent] Convergent learning: Do different neural networks learn the same representations? [PDF]
Li, Y., Yosinski, J., Clune, J., Lipson, H. and Hopcroft, J.E., 2015. FE@ NIPS, pp. 196--212.

[raghu2017svcca] SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability [PDF]
Raghu, M., Gilmer, J., Yosinski, J. and Sohl-Dickstein, J., 2017. Advances in Neural Information Processing Systems 30, pp. 6078--6087. Curran Associates, Inc.

[kornblith2019similarity] Similarity of neural network representations revisited [PDF]
Kornblith, S., Norouzi, M., Lee, H. and Hinton, G., 2019. arXiv preprint arXiv:1905.00414.

[krizhevsky2012alexnet] ImageNet Classification with Deep Convolutional Neural Networks [PDF]
Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Advances in Neural Information Processing Systems 25, pp. 1097--1105. Curran Associates, Inc.

[simonyan2014vggnet] Very Deep Convolutional Networks for Large-Scale Image Recognition [PDF]
Simonyan, K. and Zisserman, A., 2014. CoRR, Vol abs/1409.1556.

[kaiming2015resnet] Deep Residual Learning for Image Recognition [PDF]
He, K., Zhang, X., Ren, S. and Sun, J., 2015. CoRR, Vol abs/1512.03385.

[yamins2014performance] Performance-optimized hierarchical models predict neural responses in higher visual cortex
Yamins, D.L., Hong, H., Cadieu, C.F., Solomon, E.A., Seibert, D. and DiCarlo, J.J., 2014. Proceedings of the National Academy of Sciences, Vol 111(23), pp. 8619--8624. National Acad Sciences.

[gucclu2015deep] Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream
Gu{\c{c}}lu, U. and van Gerven, M.A., 2015. Journal of Neuroscience, Vol 35(27), pp. 10005--10014. Soc Neuroscience.

[eickenberg2017seeing] Seeing it all: Convolutional network layers map the function of the human visual system
Eickenberg, M., Gramfort, A., Varoquaux, G. and Thirion, B., 2017. NeuroImage, Vol 152, pp. 184--194. Elsevier.

[jiang2019discrete] Discrete neural clusters encode orientation, curvature and corners in macaque V4 [link]
Jiang, R., Li, M. and Tang, S., 2019. bioRxiv. Cold Spring Harbor Laboratory. DOI: 10.1101/808907

[pasupathy2001shape] Shape representation in area V4: position-specific tuning for boundary conformation
Pasupathy, A. and Connor, C.E., 2001. Journal of neurophysiology, Vol 86(5), pp. 2505--2519. American Physiological Society Bethesda, MD.

[kuhn1962structure] The structure of scientific revolutions
Kuhn, T.S., 1962. University of Chicago press. DOI: 10.7208/chicago/9780226458106.001.0001

[goodfellow2019interview] Ian Goodfellow: Generative Adversarial Networks [link]
Goodfellow, I. and Fridman, L., 2019. Artificial Intelligence Podcast.

[olah2018building] The Building Blocks of Interpretability [link]
Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K. and Mordvintsev, A., 2018. Distill. DOI: 10.23915/distill.00010

Zoom In: An Introduction to Circuits

Authors

Affiliations

Published

DOI

Three Speculative Claims

Schwann’s Claims about Cells

Three Speculative Claims about Neural Networks

Claim 1: Features

Example 1: Curve Detectors

Argument 1: Feature Visualization

Argument 2: Dataset Examples

Argument 3: Synthetic Examples

Argument 4: Joint Tuning

Argument 5: Feature implementation (circuit-based argument)

Argument 6: Feature use (circuit-based argument)

Argument 7: Handwritten Circuits (circuit-based argument)

Example 2: High-Low Frequency Detectors

Example 3: Pose-Invariant Dog Head Detector

Polysemantic Neurons

Claim 2: Circuits

Circuit 1: Curve Detectors

Circuit 2: Oriented Dog Head Detection

Circuit 3: Cars in Superposition

Circuit Motifs

Claim 3: Universality

Curve detectors

AlexNet

InceptionV1

VGG19

ResNetV2-50

High-Low Frequency detectors

AlexNet

InceptionV1

VGG19

ResNetV2-50

Interpretability as a Natural Science

Closing Thoughts

Glossary

Author Contributions

Historical Note

Acknowledgments

References

Updates and Corrections

Reuse

Citation