Their results are for document embeddings (which are often derived from LLMs), not internal activation spaces in LLMs. But I suspect that if we tested their method on the internal activation spaces of different LLMs, at least ones of similar sizes and architectures, we might find similar results. Someone really should test this and publish a paper on it: it should be pretty easy to replicate what they did and plug in activations from various LLMs.
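For anyone tempted to run that test, here is a minimal sketch (mine, not the paper's code) of how you might collect mid-layer activation "embeddings" from two different LLMs over unpaired text, which could then be fed to a vec2vec-style translator; the model names, mid-layer choice, and mean pooling are illustrative assumptions:

```python
# Hypothetical sketch: collect mid-layer activation "embeddings" from two
# different LLMs over *unpaired* corpora, to play the role of the two
# embedding spaces a vec2vec-style translator would learn to map.
import torch
from transformers import AutoModel, AutoTokenizer

def midlayer_embeddings(model_name: str, texts: list[str]) -> torch.Tensor:
    """Mean-pooled hidden states from the middle layer of `model_name`."""
    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token          # GPT-style tokenizers lack a pad token
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    with torch.no_grad():
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        hidden = model(**batch).hidden_states  # tuple: embedding layer + each block
        mid = hidden[len(hidden) // 2]         # pick the midpoint layer
        mask = batch["attention_mask"].unsqueeze(-1)
        return (mid * mask).sum(1) / mask.sum(1)  # mean-pool, ignoring padding

# Two architecturally different models, each fed its own unpaired text.
emb_a = midlayer_embeddings("gpt2", ["The cat sat on the mat.", "Stock prices fell sharply."])
emb_b = midlayer_embeddings("distilgpt2", ["A dog slept on the rug.", "Markets rallied today."])
```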
If that turns out to be true to a significant extent, this seems like it should be quite useful for:
a) understanding why jailbreaks often transfer fairly well between models
b) supporting ideas around natural representations
c) letting you do various forms of interpretability in one model and then searching for similar circuits/embeddings/SAE features in other models
d) extending techniques like the logit lens
e) comparing and translating between LLMs' internal embedding spaces and the latent space inherent in human language (their result clearly demonstrates that there is a latent space inherent in human language). This is a significant chunk of the entire interpretability problem: it lets us see inside the black box, so that's a pretty key capability.
f) if you have a translation between two models (say, of their activation vectors at their midpoint layers), then by comparing roundtripping from model A through model B and back against roundtripping from model A through the shared latent space and back, you can identify which concepts model A understands that model B doesn't (see the sketch just below this list). Similarly in the other direction. That seems like a very useful ability.
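A minimal sketch of the roundtrip comparison in (f); `a_to_latent`, `latent_to_a`, `a_to_b`, and `b_to_a` are hypothetical stand-ins for the encoder/decoder pieces of a trained vec2vec-style translator, not functions from the paper's released code:

```python
# Hypothetical sketch: flag embeddings of model A that survive a roundtrip
# through the shared latent space but degrade badly when routed through model B,
# suggesting concepts A represents that B does not.
import numpy as np

def roundtrip_error(x: np.ndarray, forward, backward) -> np.ndarray:
    """Per-example cosine distance between x and backward(forward(x))."""
    x_hat = backward(forward(x))
    cos = np.sum(x * x_hat, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1) + 1e-8
    )
    return 1.0 - cos

def concepts_a_knows_that_b_lacks(emb_a, a_to_latent, latent_to_a, a_to_b, b_to_a,
                                  margin: float = 0.1) -> np.ndarray:
    """Indices where routing through model B loses notably more information
    than routing through the shared latent space alone."""
    err_via_latent = roundtrip_error(emb_a, a_to_latent, latent_to_a)
    err_via_b = roundtrip_error(emb_a, a_to_b, b_to_a)
    return np.where(err_via_b - err_via_latent > margin)[0]
```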
Of course, their approach requires zero information about which embeddings for model A correspond to or are similar to which embeddings for model B: their translation model learns all of that from patterns in the data, and rather well, according to their results. However, it shouldn't be hard to supplement their approach: you often do have partial information about such correspondences, and the translator could be made to use that in addition to the structure inherent in the data.
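As one simple way to use such partial information (my suggestion, not anything from the paper): if you know even a small set of matched pairs across the two spaces, an orthogonal Procrustes fit on just those pairs gives a supervised rotation that could initialize or regularize the unsupervised translator:

```python
# Standard alignment trick, sketched here as a possible supplement.
# (Assumes both spaces have the same dimensionality; otherwise project first.)
import numpy as np
from scipy.linalg import orthogonal_procrustes

def fit_partial_alignment(emb_a_known: np.ndarray, emb_b_known: np.ndarray) -> np.ndarray:
    """Orthogonal R minimizing ||A @ R - B||_F over the known matched pairs."""
    R, _ = orthogonal_procrustes(emb_a_known, emb_b_known)
    return R

# Usage: rotate all of model A's embeddings toward model B's space using, say,
# 100 known pairs, then let the unsupervised objective refine the map.
# emb_a_aligned = emb_a_all @ fit_partial_alignment(emb_a_known, emb_b_known)
```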
Yes, I do think this should be a big deal, and even more so for monitoring (than for understanding model internals). It should also have been at least somewhat predictable, based on theoretical results like those in I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? and in All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling.
The problem here is that sequence embeddings likely carry lots of side-channels conveying non-semantic information (say, the frequencies of tokens in the sequence), and you can get a long way on that sort of information alone.
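A toy illustration of that worry: a pure bag-of-token-frequencies representation carries no compositional semantics, yet it already suffices to match documents across corpora, so high matching accuracy alone doesn't prove the translation is semantic. (The example sentences are made up; CountVectorizer just stands in for whatever token statistics might leak into a sequence embedding.)

```python
# Token frequencies alone can match reshuffled documents to their counterparts.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs_a = ["The treaty was signed after long negotiations.",
          "The cat chased the laser dot around the room."]
docs_b = ["After long negotiations the treaty was signed.",   # same tokens, reordered
          "Around the room the cat chased the laser dot."]

vec = CountVectorizer().fit(docs_a + docs_b)
freq_a = vec.transform(docs_a).toarray().astype(float)
freq_b = vec.transform(docs_b).toarray().astype(float)

# Each document in A is matched to its counterpart in B purely from token counts.
print(cosine_similarity(freq_a, freq_b).argmax(axis=1))  # -> [0 1]
```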
What would be really interesting is to train embedding models in different languages and check whether you can translate highly metaphorical sentences that share nothing but their meaning, or to train embedding models on different representations of the same mathematics (for example, the matrix-mechanics vs. wave-mechanics formulations of quantum mechanics) and see whether they recognize equivalent theorems.
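A hedged sketch of how the first evaluation might look, assuming an unsupervised translation between a monolingual English and a monolingual German embedding space has already been learned (`translate_en_to_de` is a placeholder for it, and the idiom pairs are merely illustrative):

```python
# Evaluate whether translated embeddings retrieve the semantically matching
# sentence, on pairs that share meaning but not surface form.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

metaphor_pairs = [
    ("He kicked the bucket.", "Er hat den Löffel abgegeben."),          # lit. "handed in the spoon"
    ("She spilled the beans.", "Sie hat die Katze aus dem Sack gelassen."),  # lit. "let the cat out of the sack"
]

def retrieval_accuracy(emb_en: np.ndarray, emb_de: np.ndarray, translate_en_to_de) -> float:
    """Fraction of English sentences whose translated embedding is closest
    to the correct German counterpart."""
    sims = cosine_similarity(translate_en_to_de(emb_en), emb_de)
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(emb_en))))
```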
Rishi Jha, Collin Zhang, Vitaly Shmatikov and John X. Morris published a new paper last week called Harnessing the Universal Geometry of Embeddings.
Abstract of the paper:
We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets.
The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.
They focus on the security implications of their research, but I am trying to understand: do these findings have major implications for interpretability research?
It seems like discovering a sort of universal structure shared among all LLMs would help a lot with understanding the internals of these models. But I may be misunderstanding the nature of the patterns they are translating between and putting into correspondence.