Zoom Out: Distributions in Semantic Spaces

by TristanTrim
6th Aug 2025
5 min read
4 comments, sorted by top scoring
StefanHex · 1mo

Thanks for posting this! Your description of transformations between layers, squashing & folding etc., reminds me of some old-school ML explanations about "how do multi-layer perceptrons work" (this is not meant as a bad thing, but as a potential direction to look into!). I can't think of references right now.

It also reminds me of Victor Veitch's group's work, e.g. Park et al., though pay special attention to the refutation(?) of this particular paper.

Finally, I can imagine connecting what you say to my own research agenda around "activation plateaus" / "stable regions". I'm in the process of producing a better write-up to explain my ideas, but essentially I have the impression that NNs map discrete regions of activation space to specific activations later on in a model (squashing?), and wonder whether we can make use of these regions.

TristanTrim · 26d

Hey! Thanks for the links. I'll look into them.

Your description of "old-school ML explanations" makes me think of this Chris Olah article. It, along with the work of Mingwei Li (grand tour, umap tour) and a bunch of time spent trying to reason about the math and geometry of NNs, is what I base my current POV on.

"map discrete regions of activation space to specific activations later on"

If I understand correctly, this corresponds with one of my key claims: that "position", not "direction", is fundamental to semantics in activation spaces. If "direction" is relevant, it is possible for it to be local and to distort over distance, more like a vector field than a single vector that applies to the whole space.

In this and the following section of this video, I give some more description of the idea, if you are interested.

I'm planning to do self-study from next week, after I finish my final exam, until November. One of the things I want to do is a deep dive on the transformer architecture, attempting to extend my understanding of these concepts, as they apply to vanilla and conv nets, to transformer-based nets.

I look forward to seeing your future work!

Brendan Long · 24d

I've read something similar to this for explaining simple neural networks (wish I could remember where), but does this still work with transformers?

TristanTrim · 24d

You aren't thinking of this Chris Olah article, are you? If not, I'd love to hear if you ever remember what it was.

As for applying this theory to transformers: that is a very good question! I wish I had a good answer, but that is very much one of my next goals. I want to become more familiar with the transformer architecture, getting some actual hands-on MI experience (I've only worked with vanilla and conv nets), and analyzing their structure from a math / geometry / topology perspective.

I will say that it seems to me that all "special" architectures I've looked at so far can be viewed mathematically as a very big vanilla network with specific rules for the weights. For example, LSTMs can be unrolled to make very deep networks with the rule that every one of the "unrolled" networks shares its weights with each of the others. For conv layers, it is as if we have taken an FC layer, set every weight connected to a far-away pixel to 0, and then set each output pixel to share the kernel weights with all of the others.
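A minimal sketch of that conv-as-constrained-FC view (my own illustration in PyTorch, using a 1-D convolution for brevity; none of the names come from the thread):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
length, ksize = 8, 3
kernel = torch.randn(ksize)   # the shared kernel weights
x = torch.randn(length)

# Convolution view (no padding, stride 1).
conv_out = F.conv1d(x.view(1, 1, -1), kernel.view(1, 1, -1)).flatten()

# Fully-connected view: the equivalent weight matrix is mostly zeros
# (weights to "far away" pixels), and every row reuses the same kernel.
out_len = length - ksize + 1
W = torch.zeros(out_len, length)
for i in range(out_len):
    W[i, i:i + ksize] = kernel

fc_out = W @ x
print(torch.allclose(conv_out, fc_out))  # True: same map, different bookkeeping
```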

I suspect something like this is true for transformers as well, but I haven't yet studied the architecture well enough to say so with confidence, much less to draw useful implications about transformers from the concept. Still, I'm optimistic about it, and plan to pursue the idea over the coming months.


(This article is edited and expanded from a comment I made to someone in the newly starting BAIF Slack community. Thanks for the inspiration 🙏)

Introduction

In this article I present an alternative paradigm for Mechanistic Interpretability (MI). This paradigm may turn out to be better or worse than, or to naturally combine with, the standard paradigm I often see implicitly extended from Chris Olah's "Zoom-In".

I've talked about this concept before, in various places. Someday I may collect them and try to present a strong case including a survey of paradigms in MI literature. For now, here is a relatively short introduction to the concept assuming some familiarity with ML and MI.

As far as I know, Chris Olah originally introduced the concepts of "features" and "circuits" in "Zoom-In" as a suggested direction for exploration, not as a certainty. It worked very well for thinking about things like "circle" and "texture" detectors, which I think are a natural, but incorrect, way of understanding what is going on.

New Mechanistic Interpretability Paradigm?

I have been developing an alternative paradigm that I'm not currently sure anyone else is talking about.

It is now common to think of the collective inputs or outputs of network layers as vectors rather than as individual signals. The concept I am uncertain anyone is focusing on is that each such vector is a point in a semantic space in which distributions live.

Input Space

For example, in a cat-dog-labeling net, the input space is images, and there are two distributions living in this space. The cat-distribution is all possible images that are of cats. We can make some claims about that distribution, such as the idea that it is continuous and connected. The same is true of the dog-distribution, but additionally, the dog-distribution may be connected to the cat-distribution in several places, containing the set of images that are ambiguous: maybe a cat, maybe a dog. There is also, implicitly, a distribution of images that are neither dogs nor cats, but this can be ignored in simple examples.
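A hypothetical probe of the "continuous and connected" intuition, sketched in Python; `model`, `cat_a`, and `cat_b` are placeholders of my own, not anything from the post:

```python
import numpy as np

def linear_path(cat_a: np.ndarray, cat_b: np.ndarray, steps: int = 10):
    """Yield points on the straight line between two images in input space."""
    for t in np.linspace(0.0, 1.0, steps):
        yield (1.0 - t) * cat_a + t * cat_b

# for img in linear_path(cat_a, cat_b):
#     print(model(img))
# If the intermediate images stay "cat-like", this particular straight line
# stays inside the cat-distribution; a dip only shows that *this* line leaves
# the region, not that the region is disconnected.
```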

Output Space

The output space has very different semantics. It is meant to label images as either a cat or a dog, so it may be a 1-dim space where being near (1,) corresponds to "highly cat-like", being near (0,) corresponds to "highly dog-like", and anything in between could be "ambiguous" or "neither cat nor dog". If there is also training for "not cat or dog", the space might be 2-dim, with the point (1,0) meaning cat-like and (0,1) meaning dog-like. Then "neither" would be (0,0) and "ambiguous" would be (0.5, 0.5). This kind of semantic space seems somehow intuitive to me. If you gave me a set of pictures to post on a cork-board, I feel like I could do a fair approximation of this.
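A minimal sketch of reading off that 2-dim output space (my own illustration; the target points are the ones named above):

```python
import numpy as np

TARGETS = {
    "cat":     np.array([1.0, 0.0]),
    "dog":     np.array([0.0, 1.0]),
    "neither": np.array([0.0, 0.0]),
}

def describe(output: np.ndarray) -> str:
    """Interpret a point in the 2-dim output space by its nearest target."""
    name, _ = min(TARGETS.items(), key=lambda kv: np.linalg.norm(output - kv[1]))
    return name

print(describe(np.array([0.9, 0.1])))    # -> "cat"
print(describe(np.array([0.05, 0.02])))  # -> "neither"
# A point like (0.5, 0.5) is equidistant from the cat and dog targets,
# which is the "ambiguous" region described above.
```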

Details of Semantics

It seems noteworthy that the actual input and output semantic spaces, as understood by the network, may differ from the preceding semantic descriptions; they are instead based on the training dataset, the possible distributions it could imply, and the dynamics of the training. For example, if the training puts no constraint on mapping inputs to locations like (10,0) or (-1,0) in the output, then what would the semantics of those locations be? Would (10,0) correspond to something like "10 times as cat-like"? My intuition is that there would instead be a messy distribution extending out in the (1,0) direction, and that distribution would be determined by the shape of the input distribution, the network architecture, and the training dynamics. In other words, it would be a result of whatever was the easiest way to separate the parts of the input distribution that need to be separated. The same applies to the (-1,0) direction. I do not expect this to have semantics meaning "the opposite of cat-like". There may be something similar, especially if unsupervised rather than label-based methods are used, but it still has to do with the semantic distribution, not semantic directions.
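One way to check this empirically, rather than assume it, is to look at where actual inputs land in the output space. A rough sketch, where `model` and `dataloader` are placeholders of my own:

```python
import torch

@torch.no_grad()
def collect_outputs(model, dataloader):
    """Gather the raw 2-dim outputs for every input in a dataset."""
    return torch.cat([model(x) for x, _ in dataloader])

# outputs = collect_outputs(model, dataloader)
# print(outputs.min(dim=0).values, outputs.max(dim=0).values)
# If training never rewards points like (10, 0) or (-1, 0), the empirical
# spread of outputs shows which directions the distribution actually
# extends in, and how messily.
```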

Latent Semantic Spaces

Each of the latent spaces of the network could be understood as some step between the semantics of the two spaces. That might be an oversimplification. For example, there may be movement into a seemingly unrelated semantic space either to "untangle" distributions, or for some other reason. Even if it's not an oversimplification, there is a lot to be understood about what it means to step from the semantics of image-space to label-space.

Semantic Mappings

With this paradigm, the network is not understood in terms of neuron connections or circuits at all, but instead as "semantic mappings". Each layer of the network is an affine transformation (rotating, sliding, scaling, and skewing) that prepares the space for the activation, which "folds", "bends", "collapses", or "squashes" the parts of the space that have been moved into a negative orthant (the generalization of a quadrant to n dimensions). This also squashes any distributions that existed in that space: any input sampled from the distribution of possible inputs gets mapped to the corresponding location in the new, squashed distribution. The result is a new, slightly transformed semantic space as the output of the layer.
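A minimal numerical sketch of one such mapping (my own illustration, using ReLU as the activation):

```python
import numpy as np

rng = np.random.default_rng(0)

# A cloud of points standing in for a distribution in the layer's input space.
points = rng.normal(size=(1000, 2))

# Affine part: rotate / scale / skew / slide the whole space.
W = np.array([[1.5, 0.4],
              [-0.3, 0.8]])
b = np.array([0.2, -0.1])
moved = points @ W.T + b

# Activation part: anything pushed into a negative orthant gets "folded"
# onto the orthant boundary, squashing the distribution along with it.
squashed = np.maximum(moved, 0.0)

# Fraction of the cloud that got collapsed onto a boundary:
print((squashed != moved).any(axis=1).mean())
```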

I wish to be clear: by "folds", "bends", "collapses", or "squashes", I mean to list words which give intuition into a geometric understanding of the activation function. Most accurately, what the layer is doing is applying the activation function, whatever that may be, but thinking in that way inspires thoughts of neurons firing and circuits. I wish instead to inspire thoughts of untangling semantic distributions, or rather, of morphing one semantic space into a different semantic space.

The goal given to the network during its training is to find the semantic mappings that transform the input semantic space (defined by the input dataset) into the output semantic space (defined by the labels of the dataset). What I think we have found, empirically, with neural networks, is that they are up to this task. They can, through a sequence of squashings, transform semantic spaces into very different-looking semantic spaces.
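A toy illustration of that claim (my own, not from the post): train a small ReLU net to morph a 2-D input distribution into the 2-dim label space from the earlier example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(512, 2)
is_cat = x[:, 0] * x[:, 1] > 0                               # an arbitrary toy "cat" region
y = torch.stack([is_cat.float(), (~is_cat).float()], dim=1)  # (1,0)=cat, (0,1)=dog

net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(),   # affine + squash
                    nn.Linear(32, 32), nn.ReLU(),  # affine + squash
                    nn.Linear(32, 2))              # final affine into label space
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(300):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(loss.item())  # a sequence of squashings has (approximately) mapped the
                    # input distribution onto the target output semantics
```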

Then, to me, the question MI sets out to solve is "what are the semantics of the input space, output space, and intermediate latent spaces, and what transformation is each layer making to step from one semantic space to another?"


From within this paradigm, the answers to these questions are:

  • "What is a concept?" -- A boundary, or set of boundaries, around specific locations in semantic space. (I'm quite happy with this definition. It feels like it could be empirically accurate.)
  • "What is a feature" -- An aspect of a latent semantic space that helps us understand the transformation of one semantic space into another. (this is fuzzy)
  • "What is meaning" -- This is the semantics of the semantic spaces we are interested in. The existence of the distribution of cat pictures. The parts of that distribution corresponding to cats looking left vs right. The parts of that distribution that make a cat black vs orange. The distribution is there and all of our ideas of meaning exist within it.

If you finished reading this, thanks!

Let me know what you think, and if you know of any work that seems related, please send me a link.