Semantic Topological Spaces

TristanTrim

[ Edit 1, Correction: Originally I incorrectly used the term "subspace" while meaning "quotient topology". Thanks to AprilSR for pointing out the original version of Claim 2 was false with the original wording. ]

[ Edit 2, Correction: I had used the term "monotonic" instead of "strictly monotonic". Thanks to silentbob for pointing out the error. ]

This post continues from the concepts in Zoom Out: Distributions in Semantic Spaces. I will be considering semantic spaces (input, output, and latent) from an informal topological perspective.

Topology and geometry

An extremely terse explanation of topology is that it is the math focused on what it means for a space to be continuous, abstracted from familiar geometric properties. You may be familiar with the example that the surface of a doughnut is homeomorphic to the surface of a coffee mug. What this means is that any image you could put on the surface of a doughnut could be put on the surface of a mug without changing which points of the image connect to which others. The image will be warped, parts of it getting stretched, scaled, or squished, but those are all geometric properties, not topological properties.

Applying this idea to semantic spaces gives the hypothetical idea that, for a neural network, the input space may be homeomorphic to the output space.

For example, looking at the cat-dog-labeling net again, the input is the space of possible images, and within that space is the continuous distribution of images of cats and/or dogs. The distribution is continuous because some images may contain both cats and dogs, while other images may contain animals that look ambiguous, maybe a cat, maybe a dog. It is possible that this same distribution from the net's input space is also found in the net's output space.

Geometrically the distribution would be different, since the geometry of the input space maps dimensions to rgb pixels while the output space maps the dimensions to whether the image is of a cat or a dog. But these are geometric properties; topologically the distribution could be the same, meaning that for any path you could move along in the cat-dog distribution, you can move along that exact same path in the image space (input) and the label space (output). Furthermore, any other space that looks at cat-dog space from any other geometric perspective also contains that same path.

The distribution is the same in every space, but the geometry in which it is embedded allows us to see different aspects of that distribution.

Each layer does nothing or throws something away

But... that isn't quite true, because it is possible for a network to perform geometric transformations that lose some of the information that existed in the input distribution. I identify two possible ways to lose information:

Projecting into lower dimensional spaces

The first way of losing information is projecting into a lower dimensional space. Imagine squishing a sphere in 3d space into a circle in 2d space so that the two sides of the sphere meet. It is now impossible to recover which side of the sphere a point within the circle came from.

To justify this in our cat distribution example, suppose I fix a cat in a specific pose at a specific distance from a camera and view the cat from every possible angle. (This is a thought experiment, please do not try this with a real cat.) The space of possible angles defines a 4d hypersphere. Normalizing so the cat is always upright gives us a 3d sphere, which is easier to think about.

Now, just as before, this sphere from image space may be projected down to a circle, line, or point in the labelling space. It may be that some angles make the cat look more or less dog like, so it still spans some of the labelling space, rather than being projected to a single point representing confidence in the "cat" label^[1].

But regardless of the exact details of how information is lost while projecting, it is no longer possible to recover which exact image, in image space, corresponds to a given position in labelling space. Instead, each point in labelling space corresponds to a region in image space.

In a neural network, projection to a lower dimensional space occurs whenever a layer has more inputs than outputs^[2].

Folding space on itself

The other way to lose information is to fold space on itself. This is what is happening in all examples of projecting into a lower dimensional space, but it is worth examining this this more general case because of how it relates to activation functions.

If an activation function is strictly monotonic^[3], as in leaky ReLU, then the space will not fold on itself, and so information is not lost. The input and output of the function will be homeomorphic. If, on the other hand, an activation function is not strictly monotonic, as in ReLU, then some unrelated parts of the input space will get folded together into the same parts of the output space^[4].

Interpolated geometric transformation

Hopefully this is making sense and you are understanding how a semantic space could be the same topological space in the input and output of a neural network, even though it is very different geometrically.

I'll now state a hypothesis for the purpose of personal exploration:

Claim 1: The series of latent spaces between a network's input and output space are geometric transformations, of roughly equal distance, along a continuous path between the geometry of the input space and the geometry of the output space.

This might be the sort of thing that someone has proven mathematically. If so, I'd like to find and understand their proof. Alternatively, this may be a novel direction of thought. If you are interested, please reach out or ask questions.

Quotient of the Input Space

I can make a weaker claim after introducing a bit more topology terminology. When I spoke about projecting and folding topologies earlier, the way that the resulting set can be thought of as still being a topology is with the idea of a quotient topology. This is essentially the idea of the topological information of the original space being maintained except where it gets mapped into the same locations where it gets "glued" together^[5].

A weaker claim I can make based on the topological examination of semantic space is:

Claim 2: The output space of any neural network is the quotient space of the input space with respect to the function implemented by the network.

I find it difficult to imagine that this claim isn't true, but as before, I'm not aware of a proof. Regardless, I think the implications for thinking about networks as mapping semantic topologies from one geometric space to another are quite interesting.

What's it mean for interpretability?

One implication I noticed about claim 1 is that it suggests how the difficulty of interpreting latent spaces should relate to the difficulty of interpreting input and output spaces.

I suspect lower dimensional spaces are usually easier to interpret than higher dimensional spaces, but let's set that aside for now.

Image space is easy to interpret as images, but difficult to interpret in terms of labels, so I imagine layers closer to an image space would be easier to interpret in some visual way. On the other hand, layers closer to label space should be harder to interpret visually, but easier to interpret in terms of labels.

This leaves aside the question of "what is a space that is halfway between being an image and being a label?" I think that is an interesting question to explore, but implies that layers halfway between modalities will likely be unfamiliar, and therefore hardest to interpret.

This idea of semantically similar spaces implies a possible failure mode for inquiring about latent spaces: Trying to analyze them with the incorrect modality. For example, one might want to apply labels to features in spaces with visual semantic geometry without realizing this is as difficult a problem as applying labels to features in the image space itself. To say that another way, if you do not expect to be able to find meaningful feature directions in the high dimensional space of rgb image pixels, you should not necessarily expect to find meaningful feature directions in the latent space from the activations of the first layer of a network processing that image space^[6].

Even though claim 1 and claim 2 imply some interpretability work may be more difficult than expected, the claims also imply hope that there should be a familiar (topological) structure inside every semantic space. I think Mingwei Li's Toward Comparing DNNs with UMAP Tour, provides a very compelling exploration of this idea. If you haven't already viewed the UMAP Tour, I highly recommend it.

^{^}
Other reasons the distribution may span some amount of labelling space include the way the net is trained, or the architecture of the net, rather than any property of the idealized cat-dog distribution.
^{^}
Note, just because a sphere is 3d doesn't mean it can't be projected down to 2d and reconstructed into 3d. After all, world maps exist, but this is just reconstructing the surface of the sphere embedded in 3d, it isn't possible to reconstruct the entirety of the 3d space once projected to 2d, or more generally, to reconstruct (n)d once projected to (n-1)d.
^{^}
More generally, all that is required is that an activation function be injective. It need not even be continuous, as with the heaviside step function, but I don't think discontinuous activation functions are broadly used or effective, so it is likely better to focus on continuous ones.
^{^}
Specifically all orthants with any negative components will be folded into the vector subspace between themselves and the orthant with only positive components. (Note: The vector subspace is a subspace wrt the usual topology on real vector spaces. The topology under discussion getting folded would be a quotient space, not a subspace.)⁠
^{^}
For a slightly more formal explanation, see [this comment](https://www.lesswrong.com/posts/QG3xpjRBNDnLCS6LP?commentId=ed8DhjaMMBdtqbrYb). For a quite formal explanation, pick up a topology textbook ;-)
^{^}
What you should expect to see are things more like "colours" as I discuss in N Dimensional Interactive Scatter Plot (ndisp)

Do you have formal definitions of what exactly you mean by the input space, or what you mean by the output space? What are the underlying sets, and what topology are you equipping them with? Wouldn't the output space just be the interval , and the input space $[0, 1]^{N}$ ?

I do not have a formal definition, but it's the sort of thing I'm interested in.

In future posts I'd like to explore how I'm sorta talking about the distribution that exists in the actual data structures while gesturing at the idea of an idealized semantic space representing the natural phenomena being described. The natural phenomena and idealized semantic space are what I'm interested in with the actual data structures being a way to learn about that ideal space and with the motivation that understanding of the ideal space could be applied inside the domain of neural nets and machine learning and potentially applied in broader scientific/engineering domains directly.

Trying to formalize what I'm talking about would be a big part of that exploration.

I did describe this stuff in more detail in Zoom Out: Distributions in Semantic Spaces so you might want to read and comment there, but I'll try to answer your questions a bit here.

By the "input space" and "output space" I am fuzzily both referring to the space of possible values that the data structure of the networks input and output can take, and also referring to the space of configurations of phenomena generating that data. I might call these the "digital space" and "ideal space" respectively.

So in the case of visual/image space, the digital space would be the set of possible rgb values, while the ideal space would be the space of possible configurations of cameras capturing images (or other ways of generating images). Although there are many more images in the digital space than look like anything other than static to people, the ideal space is actually a much larger space than is distinguishable by the resulting data structure, because, for example, I can aim a camera at a monitor and display any pattern of static on that monitor, or aim a different camera at a different monitor and generate all the same images of static. The same data resulting from different phenomena.

So you could think of the underlying sets being:

For digital space the underlying set is a nice clean finite digital set with cardinality two to the power of however many bits there are in your data structure.
For ideal space the underlying set is the incomprehensibly large set of possible phenomena.

I also have two topologies in mind. The topology I'm more interested in I might call the "semantic topology" which would have as open sets any semantically related objects. But I'm also thinking of the semantic topology as being approximated by sufficiently high dimensional spaces with the usual topology, although the semantic topology is probably coarser than the usual topology. But that is all very ungrounded speculation.

Wouldn't the output space just be the interval [0,1]

That depends on the network architecture and training. I think it's more natural to have [0,1]^2 with one dimension mapped to "likelihood of cat" and the other to "likelihood of dog", rather than have some "cat-not-cat" classifier which might be predisposed to think dogs are even more not a cat than nothing at all. But you could train such a network and in that case, yes, the output space would be the interval [0,1].

But another consideration is whether the semantics you're actually interested in span the entire input space. It's very likely they do not, in which case it's likely they also don't span the output [0,1], but maybe [0.003, 0.998] or (0.1, pi/4) or some other arbitrary bound. This is quite certain in the case logits which get normalized by a softmax, in which case it would surprise me if the semantic distribution spanned from -infinity to infinity on any dimension.

and the input space [0,1]^N

My answer is essentially the same as the above with the exception that the digital space might be quite explicitly the entire [0,1]^N even if most of it is in an open set of the semantic topology linked by the semantic of "it's a picture of static noise".

I also note [0,1]^N has infinite resolution of colour variability between white and black. This is not true for actual pixels which have a large but finite set of possible values.

I think even formally defining what you want the underlying set of ideal space to be would be a good post.

I personally find the informal ideas you discuss in between the topology slop ( sorry :) ) to be far more interesting.

The topology I'm more interested in I might call the "semantic topology" which would have as open sets any semantically related objects.

It sounds like you want to suppose the existence of a "semantic distance", which satisfies all the usual metric space axioms, and then use the metric space topology. And you want this "semantic distance" to somehow correspond to whether humans consider two concepts to be semantically similar.

An issue if you use the euclidean topology on the output space [0,1]^2, and a "semantic topology" on the input space, is that your network won't be continuous by default anymore. The inverse image of an open set would be open in the euclidean topology, but not necessarily open in the "semantic topology". You could define the topology on the output space so that by definition the network is continuous (quotient topology) but then topology really is getting you nothing.

I am interested in semantic distance, but before that I am interested in semantic continuity. I think the idealized topology wouldn't have a metric, but the geometric spaces in which that topology is embedded gives it semantic distance, implicitly giving it a metric.

For example, in image space slight changes in lighting would give small distances, but translating or rotating an image would move it a very great distance away. So the visual space is great for humans to look at, but the semantic metric describes things about pixel similarity that we usually don't care about outside of computer algorithms.

The labelling space would have a much more useful metric. Assuming a 1d logit, distance would correspond to how much something does or does not seem like a cat. With 2d or more logits the situation would become more complicated, but again, distance represents motion towards or away from confidence of whether we're looking at a cat, a dog, or something else.

But in both cases, the metric is a choice that tells you something about certain kinds of semantics. I'm not confident there would exist a universal metric for semantic distance.

You could define the topology on the output space so that by definition the network is continuous (quotient topology) but then topology really is getting you nothing.

I'd actually be more inclined to do this. I agree it immediately gets you nothing, but it becomes more interesting when you start asking questions like "what are the open sets" and "what do the open sets look like in the latent spaces".

Bringing back the cat identifier net, if I look at the set of high cat confidence, will the preimage be the set of all images that are definitely cats? I think that's a common intuition, but could we prove it? Would there be a way to systematically explore diverse sections of that preimage to verify that they are indeed all definitely cats?

The fact that it's starting from a trivial assertion doesn't make it a bad place to start exploring imo.

I think that kinda direction might be what you're getting at mentioning "informal ideas I discuss in between the topology slop". So it's true, I might stop thinking in terms of topology eventually, but for now I think it's helping guide my thinking. I want to try to move towards thinking in terms of manifolds, and I think noticing the idea of semantic connectivity, ie, a semantic topological space, without requiring the idea of semantic distance is worthwhile.

I think that might be one of the ideas I'm trying to zero in on: The distributions in the data are always the same and what networks do is change from embedding that distribution in one geometry to embedding it in a different geometry which has different (more useful?) semantic properties.

Good luck with it. I do think the broad direction is pretty promising.

If, on the other hand, an activation function is non-monotonic, as in ReLU, then some unrelated parts of the input space will get folded together into the same parts of the output space

Just a small technicality, but you probably mean "strictly monotonic" instead of monotonic, because ReLU actually is monotonic, right? (Or perhaps "injective" would be even closer, although I suppose in continuous spaces that's practically the same as strict monotonicity)
Of course, your actual point here still holds.

Indeed. Thank you 🙏 I'll edit the post based on this. I think "injective" is most correct for the claim, although I don't know of any commonly used discontinuous activation functions.

You might also be interested in the second half of this comment.

Claim 2 sounds very likely false to me.

Awesome! I need to state things that are obviously true more since they might be false or at least not obvious.

I think it is false for actual neural networks since floating point doesn't perfectly approximate real numbers, so if that's your intuition I very much agree, but it doesn't seem likely to matter much in practice.

The claim may also fail to hold for more exotic architectures, such as transformers. Not sure. I should maybe have specified vanilla nets, but I'm not sure which architectures it would and wouldn't apply to.

Is your intuition that it is false is coming from another direction from those I mentioned? I'm interested to know more about your perspective.

My intuition is like... you get a topological circle by gluing the two ends of an interval together, but no subspace of the interval is homeomorphic to a circle. I'm not entirely sure that this sort of issue meaningfully impacts neural networks, but I don't immediately see any reason why it wouldn't?

You are of course correct!

I just opened up a topology textbook and found that I was using the word "subspace" while thinking about quotient topologies induced by a surjective function. (I wonder if there is a shorthand word for that like there is for induced subspace topologies? I think I'll just say a "quotient space".)

I'm getting the impression you are more familiar with this than me, but in case you want help recalling, or for the sake of other readers:

A subspace of topology X is a set S containing a subset of the elements of X that is "clipped", so it does not containing the topological information of elements of X not found in S. Or more formally, the set of open sets of S is the set of each open set in X intersected with S.
A quotient topology Y from X wrt function f, where f is a surjective function from X to Y, is the topology I had been thinking of where f can be thought of as squishing, folding, or projecting points of a larger set into a smaller set. Formally, the open sets of Y is each set U for which the inverse image of U is an open set in X.

Thanks for catching that! I'm thinking I should change the article with some kind of note of correction.

( I'm not sure how embarrassed I should be about making this mistake. I think if I was a professor this would be quite embarassing. It's less embarrassing as a recent BSc graduate who has only struggled through one course on topology, but is nevertheless very interested in it. Next time I'll try to notice I should reference my textbook while writing the article. I think I got confused because I was thinking about both vector subspaces, which are topological subspaces of the larger vector space, but a different topology getting mapped into that vector subspace would be a quotient space not a subspace. )

I think maybe part of the confusion is that, when you're working with vector spaces in particular, subspaces and quotient spaces are the same thing.

Yeah, that's probably part of it, although technically they are only the same with the quotient function being the very natural function of throwing away whatever component is not in the vector subspace to project straight down into that subspace, but this is not the only possible choice of function and so not the only possible space to get as a result.

I think all monotonic functions would give a homeomorphic space, but functions with discontinuities would not and I'm not sure about functions that are surjective but not injective. And functions that are not surjective fail the criteria for generating a quotient space.

Edit: I think maybe functions with discontinuities do still give a continuous space so long as they are surjective, which is required It would just break the vector properties between the two spaces, but that's not required for a topological space. This is inspiring me to want to study topology more : )

Oh, I meant in the category of (topological) vector spaces, which requires the quotient maps to be linear.

Oh yeah, that makes sense. I wouldn't want to make that assumption though, since activation functions are explicitly non-linear, otherwise the multiple layers can be multiplied together and a multi-layer perceptron would just be an indirect way of doing a single linear map.