Their results are for document embeddings (which are often derived from LLMs), not internal activation spaces in LLMs. But I suspect that if we tested their method on the internal activation spaces of different LLMs, at least ones of similar sizes and architectures, we might find similar results. Someone really should test this and publish a paper on it: it should be pretty easy to replicate what they did and plug in activations from various LLMs.
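For anyone tempted to run that test, here is a minimal sketch (mine, not the paper's code) of how you might collect mid-layer activation "embeddings" from two different LLMs over unpaired text, which could then be fed to a vec2vec-style translator; the model names, mid-layer choice, and mean pooling are illustrative assumptions:

```python
# Hypothetical sketch: collect mid-layer activation "embeddings" from two
# different LLMs over *unpaired* corpora, to play the role of the two
# embedding spaces a vec2vec-style translator would learn to map.
import torch
from transformers import AutoModel, AutoTokenizer

def midlayer_embeddings(model_name: str, texts: list[str]) -> torch.Tensor:
    """Mean-pooled hidden states from the middle layer of `model_name`."""
    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token          # GPT-style tokenizers lack a pad token
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    with torch.no_grad():
        batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
        hidden = model(**batch).hidden_states  # tuple: embedding layer + each block
        mid = hidden[len(hidden) // 2]         # pick the midpoint layer
        mask = batch["attention_mask"].unsqueeze(-1)
        return (mid * mask).sum(1) / mask.sum(1)  # mean-pool, ignoring padding

# Two architecturally different models, each fed its own unpaired text.
emb_a = midlayer_embeddings("gpt2", ["The cat sat on the mat.", "Stock prices fell sharply."])
emb_b = midlayer_embeddings("distilgpt2", ["A dog slept on the rug.", "Markets rallied today."])
```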
If that turns out to be true to a significant extent, this seems like it should be quite useful for:
a) understanding why jailbreaks often transfer fairly well between models
b) supporting ideas around natural representations
c) letting you do various forms of interpretability in one model and then searching for similar circuits/embeddings/SAE features in other models
d) extending techniques like the logit lens
e) comparing and translating between LLMs' internal embedding spaces and the latent space inherent in human language (their result clearly demonstrates that there is a latent space inherent in human language). This is a significant chunk of the entire interpretability problem: it lets us see inside the black box, so that's a pretty key capability.
f) if you have a translation between two models (say, of their activation vectors at their midpoint layers), then by comparing roundtripping from model A through model B and back against roundtripping from model A through the shared latent space and back, you can identify which concepts model A understands that model B doesn't (see the sketch just below this list). Similarly in the other direction. That seems like a very useful ability.
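A minimal sketch of the roundtrip comparison in (f); `a_to_latent`, `latent_to_a`, `a_to_b`, and `b_to_a` are hypothetical stand-ins for the encoder/decoder pieces of a trained vec2vec-style translator, not functions from the paper's released code:

```python
# Hypothetical sketch: flag embeddings of model A that survive a roundtrip
# through the shared latent space but degrade badly when routed through model B,
# suggesting concepts A represents that B does not.
import numpy as np

def roundtrip_error(x: np.ndarray, forward, backward) -> np.ndarray:
    """Per-example cosine distance between x and backward(forward(x))."""
    x_hat = backward(forward(x))
    cos = np.sum(x * x_hat, axis=1) / (
        np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1) + 1e-8
    )
    return 1.0 - cos

def concepts_a_knows_that_b_lacks(emb_a, a_to_latent, latent_to_a, a_to_b, b_to_a,
                                  margin: float = 0.1) -> np.ndarray:
    """Indices where routing through model B loses notably more information
    than routing through the shared latent space alone."""
    err_via_latent = roundtrip_error(emb_a, a_to_latent, latent_to_a)
    err_via_b = roundtrip_error(emb_a, a_to_b, b_to_a)
    return np.where(err_via_b - err_via_latent > margin)[0]
```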
Of course, their approach requires zero information about which embeddings for model A correspond to or are similar to which embeddings for model B: their translation model learns all of that from patterns in the data, and rather well, according to their results. However, it shouldn't be hard to supplement their approach: you often do have partial information about such correspondences, and the translator could be made to use that in addition to the structure inherent in the data.
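As one simple way to use such partial information (my suggestion, not anything from the paper): if you know even a small set of matched pairs across the two spaces, an orthogonal Procrustes fit on just those pairs gives a supervised rotation that could initialize or regularize the unsupervised translator:

```python
# Standard alignment trick, sketched here as a possible supplement.
# (Assumes both spaces have the same dimensionality; otherwise project first.)
import numpy as np
from scipy.linalg import orthogonal_procrustes

def fit_partial_alignment(emb_a_known: np.ndarray, emb_b_known: np.ndarray) -> np.ndarray:
    """Orthogonal R minimizing ||A @ R - B||_F over the known matched pairs."""
    R, _ = orthogonal_procrustes(emb_a_known, emb_b_known)
    return R

# Usage: rotate all of model A's embeddings toward model B's space using, say,
# 100 known pairs, then let the unsupervised objective refine the map.
# emb_a_aligned = emb_a_all @ fit_partial_alignment(emb_a_known, emb_b_known)
```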
Yes, I do think this should be a big deal, and even more so for monitoring (than for understanding model internals). It should also have been at least somewhat predictable, based on theoretical results like those in I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? and in All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling.
The problem here is that sequence embeddings likely carry lots of side-channels conveying non-semantic information (say, the frequencies of tokens in the sequence), and you can get a long way on that sort of information alone.
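A toy illustration of that worry: a pure bag-of-token-frequencies representation carries no compositional semantics, yet it already suffices to match documents across corpora, so high matching accuracy alone doesn't prove the translation is semantic. (The example sentences are made up; CountVectorizer just stands in for whatever token statistics might leak into a sequence embedding.)

```python
# Token frequencies alone can match reshuffled documents to their counterparts.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs_a = ["The treaty was signed after long negotiations.",
          "The cat chased the laser dot around the room."]
docs_b = ["After long negotiations the treaty was signed.",   # same tokens, reordered
          "Around the room the cat chased the laser dot."]

vec = CountVectorizer().fit(docs_a + docs_b)
freq_a = vec.transform(docs_a).toarray().astype(float)
freq_b = vec.transform(docs_b).toarray().astype(float)

# Each document in A is matched to its counterpart in B purely from token counts.
print(cosine_similarity(freq_a, freq_b).argmax(axis=1))  # -> [0 1]
```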
What would be really interesting is to train embedding models in different languages and check whether you can translate highly metaphorical sentences that share nothing but their meaning, or to train embedding models on different representations of the same mathematics (for example, the matrix-mechanics vs. wave-mechanics formulations of quantum mechanics) and see whether they recognize equivalent theorems.
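A hedged sketch of how the first evaluation might look, assuming an unsupervised translation between a monolingual English and a monolingual German embedding space has already been learned (`translate_en_to_de` is a placeholder for it, and the idiom pairs are merely illustrative):

```python
# Evaluate whether translated embeddings retrieve the semantically matching
# sentence, on pairs that share meaning but not surface form.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

metaphor_pairs = [
    ("He kicked the bucket.", "Er hat den Löffel abgegeben."),          # lit. "handed in the spoon"
    ("She spilled the beans.", "Sie hat die Katze aus dem Sack gelassen."),  # lit. "let the cat out of the sack"
]

def retrieval_accuracy(emb_en: np.ndarray, emb_de: np.ndarray, translate_en_to_de) -> float:
    """Fraction of English sentences whose translated embedding is closest
    to the correct German counterpart."""
    sims = cosine_similarity(translate_en_to_de(emb_en), emb_de)
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(emb_en))))
```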
Rishi Jha, Collin Zhang, Vitaly Shmatikov and John X. Morris published a new paper last week called Harnessing the Universal Geometry of Embeddings.
Abstract of the paper:
We introduce the first method for translating text embeddings from one vector space to another without any paired data, encoders, or predefined sets of matches. Our unsupervised approach translates any embedding to and from a universal latent representation (i.e., a universal semantic structure conjectured by the Platonic Representation Hypothesis). Our translations achieve high cosine similarity across model pairs with different architectures, parameter counts, and training datasets.
The ability to translate unknown embeddings into a different space while preserving their geometry has serious implications for the security of vector databases. An adversary with access only to embedding vectors can extract sensitive information about the underlying documents, sufficient for classification and attribute inference.
They focus on the security implications of their research, but I am trying to understand: do these findings have major implications for interpretability research?
It seems like discovering a sort of universal structure shared among all LLMs would help a lot with understanding the internals of these models. But I may be misunderstanding the nature of the patterns they are translating between and putting into correspondence.