As part of the 5th iteration of the ARENA AI safety research program we undertook a capstone project. For this project we ran a range of experiments on an encoder-decoder model, focusing on how information is stored in the bottleneck embedding and how modifying the input text or the embedding impacts the decoded text. Some of the findings are shown in this blog post and in the companion post by @Samuel Nellessen here. We helped each other throughout, but most of his work is documented in his post, and this post documents my findings. This work was also supported by @NickyP and was in part motivated by this.
Although this type of model was created for language translation rather than complex reasoning, studying it is interesting for a few reasons:
Meta's SONAR model is a multilingual text/speech encoder and text decoder trained for translation and speech-to-text. In this study we look only at text encoding and decoding and ignore the model's multi-modal capabilities.
The model is composed of:
Both the encoder and the decoder inputs begin with a language token, which the model uses to store the information correctly in the embedding and to translate the meaning of the input text into the target language.
When encoding some text we begin the sequence with a language token to tell the encoder which language is being input, and when decoding we do the same to tell the decoder which language to generate the output in. In this way we can use the same model to encode from, and decode to, any of the many supported languages (see here).
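For concreteness, here is a minimal sketch of that round trip, assuming the interface of the `sonar` Python package (the `TextToEmbeddingModelPipeline` / `EmbeddingToTextModelPipeline` classes and FLORES-style language codes from its README); the example sentence is just an illustration.

```python
# Minimal SONAR round trip: encode text into a single fixed-size embedding,
# then decode that embedding back into English or Spanish.
from sonar.inference_pipelines.text import (
    TextToEmbeddingModelPipeline,
    EmbeddingToTextModelPipeline,
)

encoder = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder"
)
decoder = EmbeddingToTextModelPipeline(
    decoder="text_sonar_basic_decoder", tokenizer="text_sonar_basic_encoder"
)

# source_lang tells the encoder which language the input is in...
emb = encoder.predict(["The dog chased the cat."], source_lang="eng_Latn")

# ...and target_lang tells the decoder which language to generate.
print(decoder.predict(emb, target_lang="eng_Latn"))  # back to English
print(decoder.predict(emb, target_lang="spa_Latn"))  # translated to Spanish
```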
An open question was how the choice of input language impacts the embedding: is the embedding language-agnostic? We might expect one of the following:
For this experiment we take about 300 English sentences and their Spanish translations, and embed each sample twice: once from the English text and once from the Spanish. Using these pairs we can then run a number of quick experiments. Throughout, an "English embedding" means an embedding generated from English input text, regardless of whether that information is actually stored in the embedding (and similarly for "Spanish embedding"). Note that decoding an English embedding back into English seems to always reproduce the input text word for word (and similarly for Spanish embeddings decoded into Spanish).
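The exact scripts aren't reproduced here, but a quick experiment in this spirit might look like the sketch below: embed the paired sentences from both languages, check how close the pairs sit, and see whether a simple linear probe can recover the source language. Here `en_sentences` / `es_sentences` are placeholders for the ~300 translation pairs, and `encoder` is the pipeline from the sketch above.

```python
# Illustrative source-language check (not the exact script we ran).
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# en_sentences / es_sentences: aligned translation pairs (placeholders).
en_embs = encoder.predict(en_sentences, source_lang="eng_Latn")  # (N, 1024)
es_embs = encoder.predict(es_sentences, source_lang="spa_Latn")  # (N, 1024)

# Paired cosine similarity: high values mean translations land close together.
pair_sim = torch.nn.functional.cosine_similarity(en_embs, es_embs, dim=-1)
print(pair_sim.mean())

# If a linear probe separates the two sets well, source-language information
# must still be present somewhere in the embedding.
X = torch.cat([en_embs, es_embs]).cpu().numpy()
y = [0] * len(en_embs) + [1] * len(es_embs)
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())
```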
This was a quick initial experiment that could do with deeper analysis. Some limitations and open questions:
It appears that source-language information is encoded in the bottleneck embedding, although why is unclear. It could be an artefact of the encoder with no downstream effect, or it may be useful to the decoder in situations that our examples didn't cover. The main takeaway is that the source language does impact the embedding, and that this information could be used by the decoder if needed.
Can we learn something about how positional information (token position in the input sequence) is encoded? Encoder-decoder models work well with semantically meaningful sequences, but they can also embed and decode other kinds of sequences, so they must have some way of representing a sequence that differs from how they embed meaningful language. How do they do this? Can we manipulate the embedding to control the decoder's output sequence?
To begin with, we visualise how positional information is stored. If we take a sequence like "dog _ _ _ _ _", we can encode it to get an embedding. Note that "dog" is a single token, as is the filler token "_"; the spaces in the string are just for clarity and are not present in the input sequence. We can then shift the "dog" token along repeatedly to get the input sequences "_ dog _ _ _ _", "_ _ dog _ _ _", etc., and look at how the embeddings change as we shift. Plotting the embeddings for about 20 positions with PCA, we see that the embedding appears to trace out a circular pattern, or at least an arc. The distance between subsequent embeddings gets smaller as the position of the token increases, which makes sense: for very long sequences you could imagine that the exact position matters less.
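A sketch of that sweep (the filler token, sequence length, and use of scikit-learn PCA are all illustrative choices; `encoder` is the pipeline from the first code block):

```python
# Slide "dog" through a filler sequence, embed each variant, and look at the
# geometry of the resulting embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

SEQ_LEN, FILLER = 20, "_"
sequences = [
    " ".join("dog" if i == pos else FILLER for i in range(SEQ_LEN))
    for pos in range(SEQ_LEN)
]  # "dog _ _ ...", "_ dog _ ...", "_ _ dog ...", ...

embs = encoder.predict(sequences, source_lang="eng_Latn").cpu().numpy()

# 2D PCA projection: consecutive points trace out an arc.
points = PCA(n_components=2).fit_transform(embs)
plt.scatter(points[:, 0], points[:, 1], c=range(SEQ_LEN))
plt.colorbar(label="position of 'dog'")
plt.show()

# Step sizes between consecutive positions shrink as the position grows.
print(np.linalg.norm(np.diff(embs, axis=0), axis=1))
```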
Note that decoding works perfectly for these kinds of sequences. Testing on very long sequences (say, above 30 tokens) does begin to show some degradation, where the decoded token position is wrong or we get only filler tokens.
We see a nice pattern here; can we extract a general vector that we can use to tweak an embedding and control the decoded sequence? We take the embedding for "_ _ dog _ _ _" and subtract from it the embedding for "_ dog _ _ _ _", hoping that the resulting vector, when added to an embedding, shifts the token in the decoded sequence one position to the right. We indeed find that adding this vector to the embedding for "_ cat _ _ _ _" and decoding gives "_ _ cat _ _ _".
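A sketch of that arithmetic, reusing the `encoder`/`decoder` pipelines from the first code block:

```python
# Difference between embeddings of the same token at adjacent positions,
# used as a "shift right by one" steering vector.
shift = (
    encoder.predict(["_ _ dog _ _ _"], source_lang="eng_Latn")
    - encoder.predict(["_ dog _ _ _ _"], source_lang="eng_Latn")
)

cat_emb = encoder.predict(["_ cat _ _ _ _"], source_lang="eng_Latn")
# Reported result: this decodes to "_ _ cat _ _ _".
print(decoder.predict(cat_emb + shift, target_lang="eng_Latn"))
```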
These vectors are of limited use, however, because they only work if the filler tokens match, so they don't generalise to arbitrary sequences. In fact, if we change the filler token, the pattern above changes completely. Shifting the token always traces out some arc-like path, but the path differs for each filler. This means there isn't an obvious general way to control token position for arbitrary sequences.
We can in principle see a pattern to how positional information is encoded in the embedding, and it appears that earlier positions are more important than later ones. Although the vectors for shifting token positions work, they do so only in very specific situations and are probably not that useful. It is interesting that the embeddings aren't completely arbitrary: they can in principle be inspected. The fact that a shift vector derived from one token works for other tokens wasn't known a priori. There is definitely more room here for experimentation.
We want to investigate how the information about particular input tokens is stored. How cleanly is the information embedded? Is it possible to manipulate the embedding to gain fine-grained control over the decoded sequence?
Here we look at changing a particular token in a sequence to another by manipulating the embedding. Can we manipulate the embedding of "dog a bit dog b" so that it decodes to "dog a bit cat b"?
We first try calculating a swap vector by subtracting the embedding for "_ _ _ dog _" from the embedding for "_ _ _ cat _". If we add this vector to the embedding for "dog a bit dog b", it decodes to "cat a bit cat b". This is interesting: by default we get a general "dog"-to-"cat" direction. Further experiments showed that all instances of "dog" become "cat" regardless of the input sequence; the positional information is ignored.
If instead we subtract the embedding for "dog _ _ dog _" from the embedding for "dog _ _ cat _", then the resulting vector changes "dog" to "cat" only for the "dog" in the fourth position. For example, applying it to "dog is happy dog now" gives "dog is happy cat now": the "dog" in the first position is left untouched. This tells us we have captured more than just the meaning difference between "dog" and "cat"; we have also captured some positional information, i.e. more evidence that positional information is stored in the embedding (as we saw in the previous section).
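A sketch of the two swap vectors side by side (again reusing the pipelines from the first code block); the comments show the decodings reported above:

```python
def embed(text: str):
    # `encoder` is the TextToEmbeddingModelPipeline from the first sketch.
    return encoder.predict([text], source_lang="eng_Latn")

# Computed in a filler-only context: acts like a global "dog -> cat" direction.
global_swap = embed("_ _ _ cat _") - embed("_ _ _ dog _")

# Computed with a second "dog" held fixed in the first position: only rewrites
# the "dog" in the fourth position.
positional_swap = embed("dog _ _ cat _") - embed("dog _ _ dog _")

base = embed("dog a bit dog b")
print(decoder.predict(base + global_swap, target_lang="eng_Latn"))      # "cat a bit cat b"
print(decoder.predict(base + positional_swap, target_lang="eng_Latn"))  # "dog a bit cat b"
```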
Interestingly, this only works for grammatically correct sentences. If we try the same approach to change "dog a bit dog b" into "dog a apple dog b" (i.e. subtracting the embedding for "_ _ bit _ _" from "_ _ apple _ _" and adding the result to the embedding of "dog a bit dog b"), it fails: the embedding decodes to "dog a likes dog b". This implies that even though we can manipulate the embedding to change a token, and even a token at a given position, we are at the mercy of what the decoder does. It seems to have a preference for generating useful, valid text and doesn't want to put a noun where a verb should be.
We can:
For experiment-specific conclusions, see the conclusion sections of the individual investigations above.
Quick summary:
An overall result is that the bottleneck embedding appears to behave somewhat like the residual stream in LLMs, in the sense that there seem to be orthogonal directions that carry semantic meaning. Because of this, and because of their simplicity, studying these models could be useful for better understanding some LLM dynamics.
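One way to make the orthogonality intuition concrete: steering vectors built for unrelated edits, such as the shift-by-one and dog-to-cat vectors from the sketches above, can simply be compared with cosine similarity; roughly independent directions would sit near zero. This is a quick check, not something we verified systematically.

```python
import torch.nn.functional as F

# `shift` and `global_swap` are the steering vectors from the earlier sketches.
# A cosine similarity near zero would suggest the two edits use roughly
# orthogonal directions in the embedding space.
print(F.cosine_similarity(shift, global_swap, dim=-1))
```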
We did a wide range of investigations to probe for interesting behaviour and surface initial findings. Given the limited time, we couldn't dig deeper or rigorously prove our findings. Take my conclusions here with a grain of salt: I believe them based on my results, but they could just be rationalisations. This could definitely do with more work, and I would be happy if someone expanded on it. For anyone interested, there is a utility class for working with the SONAR model and some scripts that you can steal from to get started on your own experiments.