AI interpretability researchers want to understand how models work. One popular approach is to try to figure out which features of an input a model detects and uses to generate outputs. For instance, researchers interested in understanding how an image classifier distinguishes animals from inanimate objects might try to uncover the properties of the image (such as fur, scales and feathers) that the model “looks for” when faced with that task. Researchers might also try to localise where in the internal workings of the model this information is encoded and processed (is fur detected at earlier layers of a neural network than limbs?). Answering these sorts of questions is one way of peeking inside the “black box” of an AI system.

The approach just described involves applying a representational lens to AI models – the models are thought of as representing features of inputs, and these representations play some role in explaining how the model performs a task (and, when it fails, why it fails). But what the hell is a representation, anyway? 

As a philosopher who spends a lot of time thinking about representation (mainly in the context of biological minds and brains), I have a hunch that the philosophical literature on the topic contains a few nuggets of wisdom that may be useful (or at the very least interesting) to those working on interpretability research.

Drawing on philosophy of mind and cognitive science, I’ll share a few “tools” (concepts, distinctions and ways of thinking about the issues) that may help to clarify research questions in AI interpretability. Along the way I’ll suggest some relevant literature, for those interested in digging a bit deeper. 

More broadly, this is an advertisement for the value that philosophy can add to AI safety and interpretability research, beyond the more obviously relevant sub-disciplines of moral philosophy and metaethics. 

In this first post, I’ll introduce tool number one: a handy distinction between representational content and representational vehicles.


AI interpretability research does not always explicitly use the term “representation”. Research into the properties of inputs that models detect and respond to is sometimes described instead as the search for “features”. However, the idea of a feature can be a little confusing because the term is used and defined in apparently contradictory ways. Here, I’ll draw attention to what I see as the main conceptual knot. I’ll then introduce a distinction from philosophy which may help to clear up the confusion.

In their seminal paper on circuits, Olah et al. (2020) talk about features as if they were internal to the model, such as some element of the model’s activations or parameters (all emphasis in the quotes in this section is mine):

  • “neural networks consist of meaningful, understandable features”
  • “Features are connected by weights”
  • “Early layers contain features like edge or curve detectors, while later layers have features like floppy ear detectors or wheel detectors”

On this way of talking, features are taken to be something “under the hood” of a neural network. But occasionally Olah and colleagues talk about features as if they were things external to the model – properties in the world or in the input, that a model detects, tracks or responds to:

  • “it develops a large number of neurons dedicated to recognizing dog related features, including heads”
  • “it’s looking for the eyes and whiskers of a cat, for furry legs, and for shiny fronts of cars — not some subtle shared feature.”

By contrast, in another important paper in the interpretability literature, Elhage et al. (2022) go the other way, talking about features mainly in terms of environmental properties – the things which are “represented”, “encoded” or “detected” by a model:

  • “in an ‘ideal’ ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout”
  • “neurons are sometimes ‘monosemantic’ responding to a single feature, and sometimes ‘polysemantic’ responding to many unrelated features”

However, they occasionally slip into talking about features as the model-internal mechanism which does the representing/encoding/detecting:

  • “curve detectors appear to reliably occur across sufficiently sophisticated vision models, and so are a feature”
  • They also talk about features as being:
    • “multidimensional manifolds”
    • “directions [in a neural network’s activation space]”
    • “neurons in sufficiently large models” 

Clearly the two ways of thinking about features are intimately related, but they are pulling us in two contradictory directions. Are features under the hood, or in the world? Here I want to suggest that the tension arises when we collapse or conflate two different aspects of representations. Philosophers hate contradictions, so they have developed a distinction for teasing apart these two ideas: this is the distinction between representational vehicles and representational contents.

In the context of biological organisms, representing something involves a relation between two things. On the one hand there are the contents of representation: the categories or properties of inputs – ways the environment can be – that the organism selectively tracks and uses to produce some goal-directed behaviour. Examples of contents that might be represented include spatial relations between objects in the organism's environment, the presence of a predator, or whether an object is edible or not.

On the other end of the representational relation there are the vehicles of representation: the neural properties (or structures or events) which do the representing or “carry” particular contents. Candidates for representational vehicles in the brain include the firing rate of particular neurons, patterns of activity over a whole population of neurons in a certain brain area, and various features of neural dynamics (how activity unfolds over time).

The vehicle–content distinction has its home in thinking about representations in biological organisms, and has also been applied to public representations, e.g. in distinguishing the lines on a map (representational vehicles) from the territory they represent (the content). But here I want to suggest that AI interpretability research can also fruitfully adopt this distinction. We can talk about the contents of representations in an image classifier as including things like dog heads, curves, or having fur. The vehicles of representation in this case are not in a biological brain, but will be found in the “brain” of the model – they are the aspects of a neural network that do the representing or “carry” particular contents. Candidates include activations of certain units and regions or directions in activation space.

Representational vehicle: The thing internal to a neural network that is responsible for encoding, detecting or representing something.

Representational content: The thing (object, property, category, relation) external to the model that is represented by a representational vehicle.
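One way to keep the two definitions apart is to give each its own type, so that a "feature" can never silently mean both. The sketch below is purely illustrative: the class names, the layer name `early_conv` and the `describe` helper are hypothetical, not taken from any interpretability library or from the papers quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Content:
    """Something external to the model: an object, property, category or relation."""
    description: str


@dataclass(frozen=True)
class Vehicle:
    """Something internal to the model that does the representing."""
    layer: str  # hypothetical layer name
    kind: str   # e.g. "unit" or "direction in activation space"


@dataclass(frozen=True)
class Representation:
    """A vehicle paired with the content it carries."""
    vehicle: Vehicle
    content: Content


# "Curve detector" names a vehicle; "curve" names a content.
curve_detector = Representation(
    vehicle=Vehicle(layer="early_conv", kind="unit"),
    content=Content(description="a curve in the input image"),
)


def describe(rep: Representation) -> str:
    """Spell out which internal thing carries which external thing."""
    return f"{rep.vehicle.kind} in {rep.vehicle.layer} carries: {rep.content.description}"
```

On this way of writing things down, asking "is a feature under the hood or in the world?" becomes a type error: the vehicle is under the hood, the content is in the world, and a representation is the pairing of the two.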




Separating contents from vehicles (rather than bundling them all up under the heading of a “feature”) is useful because it lets us pose two distinct research questions:

One is the question of what a model represents (is this vision model able to represent dogs, or is it just representing fluffy-looking things?). I suggest that we think of this as the search for the contents of a model’s representations.

The second is the question of how a model represents those contents, i.e. what parts or aspects of the network are responsible for encoding them (are features represented by individual neurons/units or by non-basis directions? which layer of the neural network contains the curve detectors?). I suggest that we think of this as the search for the vehicles of representations.

Thus, the vehicle–content distinction not only helps us to avoid awkward contradictions in the way we talk about AI systems. It also allows us to more clearly see and pose these research questions – questions that target different aspects of AI interpretability and that may require different methods to answer.
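The vehicle question can be made concrete with a toy sketch. Everything here is an assumption made for illustration: the eight two-unit "activations" are invented, the content label ("contains fur") is hypothetical, and the midpoint-threshold score is a stand-in for whatever probing method a real study would use. The content is held fixed, and three candidate vehicles are compared by how well reading off each one recovers that content.

```python
import math

# Hypothetical toy data: activations of a 2-unit layer, each labelled by
# whether the input has the content of interest (say, "contains fur").
ACTIVATIONS = [
    ((2.0, -1.0), True), ((-1.0, 2.0), True), ((1.0, 1.0), True), ((3.0, -1.0), True),
    ((-2.0, 1.0), False), ((1.0, -2.0), False), ((-1.0, -1.0), False), ((-3.0, 1.0), False),
]


def project(activation, direction):
    """Read out an activation along a candidate direction."""
    return sum(a * d for a, d in zip(activation, direction))


def separation_accuracy(data, direction):
    """Fraction of examples whose label is recovered by thresholding the
    projection at the midpoint between the two class means.
    Assumes labelled-True examples project higher on average."""
    pos = [project(a, direction) for a, label in data if label]
    neg = [project(a, direction) for a, label in data if not label]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    correct = sum((project(a, direction) > threshold) == label for a, label in data)
    return correct / len(data)


# Candidate vehicles: each single unit, and one non-basis direction.
CANDIDATE_VEHICLES = {
    "unit 0": (1.0, 0.0),
    "unit 1": (0.0, 1.0),
    "non-basis direction": (1 / math.sqrt(2), 1 / math.sqrt(2)),
}

scores = {
    name: separation_accuracy(ACTIVATIONS, direction)
    for name, direction in CANDIDATE_VEHICLES.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

On this invented data neither unit alone tracks the content reliably (0.75 and 0.50), while the diagonal direction tracks it perfectly (1.00). The content is the same throughout; only the candidate vehicle changes, which is why the two questions can call for different methods.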

Further reading: 

Shea, N. (2007). Content and its vehicles in connectionist systems. Mind & Language, 22(3), 246-269.

Bechtel, W. (2007). “Representations and Mental Mechanisms” in W. Bechtel, Mental Mechanisms: Philosophical Perspectives on Cognitive Neuroscience (1st ed.). Psychology Press.

1 comment

Good post, but also there might be enough inertia behind using the word "feature" in different contexts that it's hard to stop. Honestly the in-the-world vs in-the-model distinction may be the less confusing of the two common distinctions, because in both cases a feature is a part of a decomposition of the whole into parts that can be composed with each other. The more subtle one to keep straight is the distinction between features as things found by their local statistical properties vs. features as things found by their impact on the entire computation.